Page MenuHomeFreeBSD

Add initial process/thread NUMA policy configuration and management utility
AbandonedPublic

Authored by jhb on May 15 2015, 10:27 PM.

Details

Summary
  • add vm_domain policy and iterator constructs;
  • add a default system policy, defaulting to first-touch-round-robin;
  • add a vm policy for threads and processes, defaulting to NONE;
  • add syscalls to fetch and set numa policies per thread and process;
  • add numactl, a simple utility to fetch, set and execute processes.
  • add man pages for numa, numactl, numa_getaffinity, numa_setaffinity.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
stas added inline comments.May 15 2015, 11:10 PM
usr.bin/numactl/numactl.c
185

style(9) recommends variable to be sorted by size, then by name.

rpaulo added inline comments.May 16 2015, 12:34 AM
lib/libc/sys/numa_getaffinity.2
52

"or affect" ?

56

"are documented"

129

"e.g."

share/man/man4/numa.4
70

"or to change"

110

e.g.

sys/dev/sound/midi/sequencer.c
101

You're not going to change all the other functions in this file like (I think) we agreed?

sys/kern/kern_numa.c
168

Do we really want to expose seq to userland? Might be better to reset it before the copyout().

sys/vm/vm_domain.c
72

Unnecessary empty line :)

73

I assume we are aalways protected by the thread lock, right?

164

Why 10?

sys/vm/vm_phys.c
1388

extra space

usr.bin/numactl/numactl.1
81

.Fl -get?

101

"and"

wblock added a subscriber: wblock.May 16 2015, 10:14 PM
wblock added inline comments.
lib/libc/sys/numa_getaffinity.2
54

"The" is not needed, just "Valid values..."

59

"which set of the object" is odd, but maybe that is correct. "we are referring to" is... well.

How about

These arguments specify which object and set are used.

77

s/the following/these/

90

The way this is said implies that possibly invalid values can still be used. Is it more correct to just say

.Dv domain
must be set to -1.

97

"domain domain" seems redundant. Just

...from another domain if
.Dv domain

should be adequate.

102

Again, "domain domain" is redundant.

128

"may" implies permission, use "might" if something might or might not happen.

129

Worse yet, it should be followed by a comma because there is a pause: "e.g.,"

Better to just use "for example". Here, it can be a separate sentence:

Note that the VM might assign some pages from other domains.
For example, when an existing page allocation is covered by a superpage

151

"may" implies permission. How about

.Va errno
can contain these error codes:

178

Is "on the running system" needed?

wblock added inline comments.May 16 2015, 10:47 PM
share/man/man4/numa.4
43

Why "non-equal" rather than "unequal"?

57

It is confusing to mention two different uses. Simplify to just

.Cd MAXMEMDOM option is set to a value greater than 1 in a kernel configuration file.

61

s/controllable/controlled/
s/via/with/

69

Either "command" or "tool" is enough.
s/is available/is used/

72

General use is "I/O". Also, the sentence is kind of redundant with "devices, devices". How about:

Systems with non-uniform access to I/O devices may mark those devices with the local VM domain identifier.

74

s/may/can/

79

is controlled and exposes information with these

85

Try to avoid the "this". The name of the variable has already been stated, just say what it is or does:

.It va vm.ndomains
The number of VM domains which have been detected.

88

As above:

The default VM domain allocation policy.
Defaults to "first-touch-rr".

(Are the quotes part of the value? Possibly this should use one of the Enclosure and Quoting Macros from mdoc(7). Maybe .Ql

97

A table indicating...

103

The map...

108

Argh. That's an em-dash, but it's not. Better to make a full stop:

implementation is VM-focused.
The hardware

110

It's not clear what it means here. Normally, exempli gratia means "for example". So does this mean that one way hardware domains are detected is from ACPI, or should it be "i.e." to say (that is, from ACPI). Or would "(for instance, from ACPI)" capture the meaning?

Best to avoid these Latin abbreviations. They are academic shorthand, frequently misused, and not as useful as when monks were copying Latin manuscripts with ink pens.

112

s/example/example,/

118

Is that notation used elsewhere in man pages? It make some assumptions about the user that might not be true. The alternative is longer, but not ambiguous:

.Pa sys/vm/vm_domain.c
and
.Pa sys/vm/vm_domain.h .

122

"their" is probably not needed in this sentence.

127

Use a full break:

policy from processes.
Instead, if no thread policy is set, the system

147

"This" is unclear. Probably the policy. Can it be integrated with the previous sentence?

as an unconfigurable first-touch allocation policy with a fail-over to round-robin allocation.

151

Again, maybe merge it with the previous sentence:

to implement a still-unconfigurable round-robin allocation policy.

adrian marked 5 inline comments as done.May 26 2015, 10:24 PM
adrian added inline comments.
sys/vm/vm_domain.c
73

Yes, we assume that doing things to td is fine, as we're the only one doing it.

sys/vm/vm_phys.c
185

Hi,

The point behind the mutex is to enforce write-ordering and to be consistent. I don't like skipping mutexes where "it's safe"; it means that ordering isn't guaranteed.

210

The first pass is to get the string value to write out. The second pass is to parse whatever userland set. It's not repeated.

226

Fixed, thanks!

314

Fixed

319

Fixed!

800

Fixed!

1388

Fixed, thanks!

adrian marked 8 inline comments as done.May 26 2015, 10:29 PM
adrian added inline comments.
sys/dev/sound/midi/sequencer.c
101

Eventually. I'll likely do it in upstream and just pull it back into this work.

sys/vm/vm_domain.c
164

It's just an arbitrary short retry limit.

adrian marked 31 inline comments as done.May 27 2015, 10:36 PM
adrian marked an inline comment as done.May 27 2015, 10:37 PM
adrian marked 4 inline comments as done.May 27 2015, 11:42 PM
adrian updated this revision to Diff 5798.May 30 2015, 6:56 AM
adrian edited edge metadata.

Address a variety of comments from people.

kib requested changes to this revision.Jun 1 2015, 8:00 AM
kib added a reviewer: kib.
kib added a subscriber: kib.

I do not agree with this patch at all.

First, it is not supported by any real functionality behind. I object against committing 'interfaces' which cannot be validated and which user-visible ABI/API will be changed for sure. The m = NULL change in the vm_page_alloc() is from the same bucket.

Second, adding new syscalls for this is waste. Look at the procctl() which was designed exactly for the purpose of controlling runtime aspects of the processes.

The syscall is designed explicitely to made it impossible to guarantee ABI-stability. Exposing such fine kernel details as seq_t in the layout of the user-visible structure makes seq_t part of the ABI contract. Really ?

The per-process policy in your code is not real process-wide setting. It is only a non-deterministic fall-back when thread never set its own policy. If the thread ever set a policy, anybody setting process policy does not actually set per-process policy, only fallback. I mentioned the non-determinism due to the way the localcopy is coded (AKA 10 iterations).

Last, the patch is dirty. It contains spurious changes like empty lines added, generated files changes, midi/sequencer.c appearance in the patch is ridiculous etc etc.

Who is Michael Fischbein ?

sys/compat/freebsd32/freebsd32_sysent.c
621

Why generated bits appear in the review ?

sys/vm/vm_phys.c
185

What ordering ? I am curious what do you mean.

Aligned integer load is atomic, this is the fundamental assumption about the supported architecture and compiler by our code.

This revision now requires changes to proceed.Jun 1 2015, 8:00 AM
adrian added a comment.Jun 1 2015, 3:01 PM

Hi,

No idea who that person is.

The changes to the midi code is because of naming clashes with sys/seq.h, which is now included as part of the definition of struct proc / struct thread.

I'm happy to go and change the API to remove exposing seq_t. That's not a hard change.

The integer load is atomic, but there are two integer loads, and there may be a bitmap at some point (eg a domainset for things like "round-robin, from these domains only.") If you don't do that, you have a small window where the domain id (which may be -1) and the policy (which may not be NONE, or RR) aren't consistent. Using seq_t and copying like this means that we can extend the policy definition later and still have it all work correctly.

I can change the policy copy to not use a loop, but I put the loop in there to ensure that we always make progress. what do you suggest? Always loop, or return an error?

The per-process code does actually work. I've used it for a variety of things, including bhyve instances. The implementation ordering is mentioned in the manpage: thread policy, then process policy, then system policy. When I end up getting to UMA, vm-object, contig malloc and such, then any explicit policy provided there will trump the thread policy.

Generated bits show up in the review because they're part of files that are checked in and changed.

kib added a comment.Jun 1 2015, 4:28 PM
In D2559#51085, @adrian wrote:

Hi,
No idea who that person is.

Yet
+ * This code is derived from software contributed to Berkeley by
+ * Michael Fischbein.

The changes to the midi code is because of naming clashes with sys/seq.h, which is now included as part of the definition of struct proc / struct thread.

I understand that. My point is that the namespace pollution from the patch is enormous and unacceptable.

I'm happy to go and change the API to remove exposing seq_t. That's not a hard change.

Yet it is there.

And you did not answered about procctl.

The integer load is atomic, but there are two integer loads, and there may be a bitmap at some point (eg a domainset for things like "round-robin, from these domains only.") If you don't do that, you have a small window where the domain id (which may be -1) and the policy (which may not be NONE, or RR) aren't consistent. Using seq_t and copying like this means that we can extend the policy definition later and still have it all work correctly.

It is a premature design for the not yet implemented backbone facility.

I can change the policy copy to not use a loop, but I put the loop in there to ensure that we always make progress. what do you suggest? Always loop, or return an error?

At this stage, this is over-designed. And whatever it be, IMO seq_t is simply wrong primitive (pattern) to use.
If anything, the ucred handling should be the pattern to follow.

The per-process code does actually work. I've used it for a variety of things, including bhyve instances. The implementation ordering is mentioned in the manpage: thread policy, then process policy, then system policy. When I end up getting to UMA, vm-object, contig malloc and such, then any explicit policy provided there will trump the thread policy.

Did you read the explanation of the situation I asked about ? Setting per-process policy does not work for a thread which
already set thread policy.

Generated bits show up in the review because they're part of files that are checked in and changed.

Which is wrong. I have to hand-edit to remove third of the patch to get it readable. Not to mention that generated changes must not be committed together with the meaningful part.

adrian added a comment.Jun 1 2015, 8:50 PM

Oops, must've pulled the example header from something silly. lemme go fix that.

adrian added a comment.Jun 1 2015, 9:09 PM

Hi, I've fixed the attribution in the current code. I didn't know about procctl; I'm happy to use the procctl API if we're happy to extend it to include threads.

op added a subscriber: op.Jun 1 2015, 10:12 PM
rpaulo edited edge metadata.Jun 2 2015, 4:05 AM
In D2559#51210, @adrian wrote:

Hi, I've fixed the attribution in the current code. I didn't know about procctl; I'm happy to use the procctl API if we're happy to extend it to include threads.

My opinion, for whatever it is worth, is that a bunch of new syscalls wouldn't be a waste. procctl() was designed to be like ioctl() which has it's own problems. procctl() also seems to focus on "processes", not threads which is the whole point of the NUMA syscalls.

Kib,

I'm happy to use procctl(), but right now it doesn't support controlling thread stuff in the idtype_t. We'd have to go and do that.

adrian marked 2 inline comments as done.Jun 23 2015, 2:29 AM

Ok, I've updated numactl in the git branch

sys/compat/freebsd32/freebsd32_sysent.c
1

I'll make sure generated bits don't show up in the next review.

sys/vm/vm_phys.c
1

There'll be non-x86 NUMA implementations that we support soon.

The ordering is between the various parameters inside the struct. Any view on said struct should be consistent.

usr.bin/numactl/numactl.c
1

Fixed!

kib added a comment.Jun 23 2015, 10:55 AM
In D2559#56061, @adrian wrote:

Kib,
I'm happy to use procctl(), but right now it doesn't support controlling thread stuff in the idtype_t. We'd have to go and do that.

P_LWPID indicates that id is tid.

adrian marked 2 inline comments as done.Jun 23 2015, 1:57 PM

Is that how it's defined in FreeBSD?

In solaris LWPID is defined as a per-process identifier, not a global identifier.

kib added a comment.Jun 23 2015, 2:01 PM
In D2559#56149, @adrian wrote:

Is that how it's defined in FreeBSD?
In solaris LWPID is defined as a per-process identifier, not a global identifier.

In FreeBSD, there is no process-private thread ids at all.

In D2559#56151, @kib wrote:
In D2559#56149, @adrian wrote:

Is that how it's defined in FreeBSD?
In solaris LWPID is defined as a per-process identifier, not a global identifier.

In FreeBSD, there is no process-private thread ids at all.

Ok. I'm worried that reusing LWPID as meaning our global thread id may upset some ported solaris code. Is there anywhere else in our tree where we've used P_LWPID to refer to tid?

kib added a comment.Jun 23 2015, 2:31 PM
In D2559#56155, @adrian wrote:

Ok. I'm worried that reusing LWPID as meaning our global thread id may upset some ported solaris code. Is there anywhere else in our tree where we've used P_LWPID to refer to tid?

There is no use of P_LWPID in the tree at all, as it is easy to check with grep. The only code which touches the symbol, except header files, are kdump and truss, for symbolic display of the waitid(2) argument.

jhb commandeered this revision.Nov 15 2018, 9:28 PM
jhb edited reviewers, added: adrian; removed: jhb.
jhb abandoned this revision.Nov 15 2018, 9:30 PM

This was committed in the noted commit.