Add initial process/thread NUMA policy configuration and management utility
AbandonedPublic
Actions

Authored by jhb on May 15 2015, 10:27 PM.

Details

Reviewers

kib
alc
stas
rpaulo
brooks
adrian

Group Reviewers

manpages

Commits

rS285387: Add an initial NUMA affinity/policy configuration for threads and processes.

Summary

add vm_domain policy and iterator constructs;
add a default system policy, defaulting to first-touch-round-robin;
add a vm policy for threads and processes, defaulting to NONE;
add syscalls to fetch and set numa policies per thread and process;
add numactl, a simple utility to fetch, set and execute processes.
add man pages for numa, numactl, numa_getaffinity, numa_setaffinity.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

stas added inline comments.May 15 2015, 11:10 PM

usr.bin/numactl/numactl.c
185	style(9) recommends variable to be sorted by size, then by name.

rpaulo added inline comments.May 16 2015, 12:34 AM

lib/libc/sys/numa_getaffinity.2
52	"or affect" ?
56	"are documented"
129	"e.g."
share/man/man4/numa.4
70	"or to change"
110	e.g.
sys/dev/sound/midi/sequencer.c
101	You're not going to change all the other functions in this file like (I think) we agreed?
sys/kern/kern_numa.c
168	Do we really want to expose seq to userland? Might be better to reset it before the copyout().
sys/vm/vm_domain.c
72	Unnecessary empty line :)
73	I assume we are aalways protected by the thread lock, right?
164	Why 10?
sys/vm/vm_phys.c
1388	extra space
usr.bin/numactl/numactl.1
81	.Fl -get?
101	"and"

wblock added a subscriber: wblock.May 16 2015, 10:14 PM

wblock added inline comments.

lib/libc/sys/numa_getaffinity.2
54	"The" is not needed, just "Valid values..."
59	"which set of the object" is odd, but maybe that is correct. "we are referring to" is... well. How about These arguments specify which object and set are used.
77	s/the following/these/
90	The way this is said implies that possibly invalid values can still be used. Is it more correct to just say .Dv domain must be set to -1.
97	"domain domain" seems redundant. Just ...from another domain if .Dv domain should be adequate.
102	Again, "domain domain" is redundant.
128	"may" implies permission, use "might" if something might or might not happen.
129	Worse yet, it should be followed by a comma because there is a pause: "e.g.," Better to just use "for example". Here, it can be a separate sentence: Note that the VM might assign some pages from other domains. For example, when an existing page allocation is covered by a superpage
151	"may" implies permission. How about .Va errno can contain these error codes:
178	Is "on the running system" needed?

wblock added inline comments.May 16 2015, 10:47 PM

share/man/man4/numa.4
43	Why "non-equal" rather than "unequal"?
57	It is confusing to mention two different uses. Simplify to just .Cd MAXMEMDOM option is set to a value greater than 1 in a kernel configuration file.
61	s/controllable/controlled/ s/via/with/
69	Either "command" or "tool" is enough. s/is available/is used/
72	General use is "I/O". Also, the sentence is kind of redundant with "devices, devices". How about: Systems with non-uniform access to I/O devices may mark those devices with the local VM domain identifier.
74	s/may/can/
79	is controlled and exposes information with these
85	Try to avoid the "this". The name of the variable has already been stated, just say what it is or does: .It va vm.ndomains The number of VM domains which have been detected.
88	As above: The default VM domain allocation policy. Defaults to "first-touch-rr". (Are the quotes part of the value? Possibly this should use one of the Enclosure and Quoting Macros from mdoc(7). Maybe .Ql
97	A table indicating...
103	The map...
108	Argh. That's an em-dash, but it's not. Better to make a full stop: implementation is VM-focused. The hardware
110	It's not clear what it means here. Normally, exempli gratia means "for example". So does this mean that one way hardware domains are detected is from ACPI, or should it be "i.e." to say (that is, from ACPI). Or would "(for instance, from ACPI)" capture the meaning? Best to avoid these Latin abbreviations. They are academic shorthand, frequently misused, and not as useful as when monks were copying Latin manuscripts with ink pens.
112	s/example/example,/
118	Is that notation used elsewhere in man pages? It make some assumptions about the user that might not be true. The alternative is longer, but not ambiguous: .Pa sys/vm/vm_domain.c and .Pa sys/vm/vm_domain.h .
122	"their" is probably not needed in this sentence.
127	Use a full break: policy from processes. Instead, if no thread policy is set, the system
147	"This" is unclear. Probably the policy. Can it be integrated with the previous sentence? as an unconfigurable first-touch allocation policy with a fail-over to round-robin allocation.
151	Again, maybe merge it with the previous sentence: to implement a still-unconfigurable round-robin allocation policy.

adrian marked 5 inline comments as done.May 26 2015, 10:24 PM

adrian added inline comments.

sys/vm/vm_domain.c
73	Yes, we assume that doing things to td is fine, as we're the only one doing it.
sys/vm/vm_phys.c
185	Hi, The point behind the mutex is to enforce write-ordering and to be consistent. I don't like skipping mutexes where "it's safe"; it means that ordering isn't guaranteed.
210	The first pass is to get the string value to write out. The second pass is to parse whatever userland set. It's not repeated.
226	Fixed, thanks!
314	Fixed
319	Fixed!
800	Fixed!
1388	Fixed, thanks!

adrian marked 8 inline comments as done.May 26 2015, 10:29 PM

adrian added inline comments.

sys/dev/sound/midi/sequencer.c
101	Eventually. I'll likely do it in upstream and just pull it back into this work.
sys/vm/vm_domain.c
164	It's just an arbitrary short retry limit.

Address a variety of comments from people.

I do not agree with this patch at all.

First, it is not supported by any real functionality behind. I object against committing 'interfaces' which cannot be validated and which user-visible ABI/API will be changed for sure. The m = NULL change in the vm_page_alloc() is from the same bucket.

Second, adding new syscalls for this is waste. Look at the procctl() which was designed exactly for the purpose of controlling runtime aspects of the processes.

The syscall is designed explicitely to made it impossible to guarantee ABI-stability. Exposing such fine kernel details as seq_t in the layout of the user-visible structure makes seq_t part of the ABI contract. Really ?

The per-process policy in your code is not real process-wide setting. It is only a non-deterministic fall-back when thread never set its own policy. If the thread ever set a policy, anybody setting process policy does not actually set per-process policy, only fallback. I mentioned the non-determinism due to the way the localcopy is coded (AKA 10 iterations).

Last, the patch is dirty. It contains spurious changes like empty lines added, generated files changes, midi/sequencer.c appearance in the patch is ridiculous etc etc.

Who is Michael Fischbein ?

sys/compat/freebsd32/freebsd32_sysent.c
621	Why generated bits appear in the review ?
sys/vm/vm_phys.c
185	What ordering ? I am curious what do you mean. Aligned integer load is atomic, this is the fundamental assumption about the supported architecture and compiler by our code.

This revision now requires changes to proceed.Jun 1 2015, 8:00 AM

Hi,

No idea who that person is.

The changes to the midi code is because of naming clashes with sys/seq.h, which is now included as part of the definition of struct proc / struct thread.

I'm happy to go and change the API to remove exposing seq_t. That's not a hard change.

The integer load is atomic, but there are two integer loads, and there may be a bitmap at some point (eg a domainset for things like "round-robin, from these domains only.") If you don't do that, you have a small window where the domain id (which may be -1) and the policy (which may not be NONE, or RR) aren't consistent. Using seq_t and copying like this means that we can extend the policy definition later and still have it all work correctly.

I can change the policy copy to not use a loop, but I put the loop in there to ensure that we always make progress. what do you suggest? Always loop, or return an error?

The per-process code does actually work. I've used it for a variety of things, including bhyve instances. The implementation ordering is mentioned in the manpage: thread policy, then process policy, then system policy. When I end up getting to UMA, vm-object, contig malloc and such, then any explicit policy provided there will trump the thread policy.

Generated bits show up in the review because they're part of files that are checked in and changed.

In D2559#51085, @adrian wrote:

Hi,

No idea who that person is.

Yet
+ * This code is derived from software contributed to Berkeley by
+ * Michael Fischbein.

The changes to the midi code is because of naming clashes with sys/seq.h, which is now included as part of the definition of struct proc / struct thread.

I understand that. My point is that the namespace pollution from the patch is enormous and unacceptable.

I'm happy to go and change the API to remove exposing seq_t. That's not a hard change.

Yet it is there.

And you did not answered about procctl.

The integer load is atomic, but there are two integer loads, and there may be a bitmap at some point (eg a domainset for things like "round-robin, from these domains only.") If you don't do that, you have a small window where the domain id (which may be -1) and the policy (which may not be NONE, or RR) aren't consistent. Using seq_t and copying like this means that we can extend the policy definition later and still have it all work correctly.

It is a premature design for the not yet implemented backbone facility.

I can change the policy copy to not use a loop, but I put the loop in there to ensure that we always make progress. what do you suggest? Always loop, or return an error?

At this stage, this is over-designed. And whatever it be, IMO seq_t is simply wrong primitive (pattern) to use.
If anything, the ucred handling should be the pattern to follow.

The per-process code does actually work. I've used it for a variety of things, including bhyve instances. The implementation ordering is mentioned in the manpage: thread policy, then process policy, then system policy. When I end up getting to UMA, vm-object, contig malloc and such, then any explicit policy provided there will trump the thread policy.

Did you read the explanation of the situation I asked about ? Setting per-process policy does not work for a thread which
already set thread policy.

Generated bits show up in the review because they're part of files that are checked in and changed.

Which is wrong. I have to hand-edit to remove third of the patch to get it readable. Not to mention that generated changes must not be committed together with the meaningful part.

Oops, must've pulled the example header from something silly. lemme go fix that.

Hi, I've fixed the attribution in the current code. I didn't know about procctl; I'm happy to use the procctl API if we're happy to extend it to include threads.

op added a subscriber: op.Jun 1 2015, 10:12 PM

In D2559#51210, @adrian wrote:

Hi, I've fixed the attribution in the current code. I didn't know about procctl; I'm happy to use the procctl API if we're happy to extend it to include threads.

My opinion, for whatever it is worth, is that a bunch of new syscalls wouldn't be a waste. procctl() was designed to be like ioctl() which has it's own problems. procctl() also seems to focus on "processes", not threads which is the whole point of the NUMA syscalls.

rhenjau_gmail.com added a subscriber: rhenjau_gmail.com.Jun 10 2015, 7:21 AM

Kib,

I'm happy to use procctl(), but right now it doesn't support controlling thread stuff in the idtype_t. We'd have to go and do that.

Ok, I've updated numactl in the git branch

sys/compat/freebsd32/freebsd32_sysent.c
1	I'll make sure generated bits don't show up in the next review.
sys/vm/vm_phys.c
1	There'll be non-x86 NUMA implementations that we support soon. The ordering is between the various parameters inside the struct. Any view on said struct should be consistent.
usr.bin/numactl/numactl.c
1	Fixed!

In D2559#56061, @adrian wrote:

Kib,

I'm happy to use procctl(), but right now it doesn't support controlling thread stuff in the idtype_t. We'd have to go and do that.

P_LWPID indicates that id is tid.

Is that how it's defined in FreeBSD?

In solaris LWPID is defined as a per-process identifier, not a global identifier.

In D2559#56149, @adrian wrote:

Is that how it's defined in FreeBSD?

In solaris LWPID is defined as a per-process identifier, not a global identifier.

In FreeBSD, there is no process-private thread ids at all.

In D2559#56151, @kib wrote:

In D2559#56149, @adrian wrote:

Is that how it's defined in FreeBSD?

In solaris LWPID is defined as a per-process identifier, not a global identifier.

In FreeBSD, there is no process-private thread ids at all.

Ok. I'm worried that reusing LWPID as meaning our global thread id may upset some ported solaris code. Is there anywhere else in our tree where we've used P_LWPID to refer to tid?

In D2559#56155, @adrian wrote:

Ok. I'm worried that reusing LWPID as meaning our global thread id may upset some ported solaris code. Is there anywhere else in our tree where we've used P_LWPID to refer to tid?

There is no use of P_LWPID in the tree at all, as it is easy to check with grep. The only code which touches the symbol, except header files, are kdump and truss, for symbolic display of the waitid(2) argument.