- add vm_domain policy and iterator constructs;
- add a default system policy, defaulting to first-touch-round-robin;
- add a vm policy for threads and processes, defaulting to NONE;
- add syscalls to fetch and set numa policies per thread and process;
- add numactl, a simple utility to fetch, set and execute processes.
- add man pages for numa, numactl, numa_getaffinity, numa_setaffinity.
Details
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed - Unit
No Test Coverage
Event Timeline
usr.bin/numactl/numactl.c | ||
---|---|---|
185 | style(9) recommends variable to be sorted by size, then by name. |
lib/libc/sys/numa_getaffinity.2 | ||
---|---|---|
52 | "or affect" ? | |
56 | "are documented" | |
129 | "e.g." | |
share/man/man4/numa.4 | ||
70 | "or to change" | |
110 | e.g. | |
sys/dev/sound/midi/sequencer.c | ||
101 | You're not going to change all the other functions in this file like (I think) we agreed? | |
sys/kern/kern_numa.c | ||
168 | Do we really want to expose seq to userland? Might be better to reset it before the copyout(). | |
sys/vm/vm_domain.c | ||
72 | Unnecessary empty line :) | |
73 | I assume we are aalways protected by the thread lock, right? | |
164 | Why 10? | |
sys/vm/vm_phys.c | ||
1388 | extra space | |
usr.bin/numactl/numactl.1 | ||
81 | .Fl -get? | |
101 | "and" |
lib/libc/sys/numa_getaffinity.2 | ||
---|---|---|
54 | "The" is not needed, just "Valid values..." | |
59 | "which set of the object" is odd, but maybe that is correct. "we are referring to" is... well. How about These arguments specify which object and set are used. | |
77 | s/the following/these/ | |
90 | The way this is said implies that possibly invalid values can still be used. Is it more correct to just say .Dv domain | |
97 | "domain domain" seems redundant. Just ...from another domain if should be adequate. | |
102 | Again, "domain domain" is redundant. | |
128 | "may" implies permission, use "might" if something might or might not happen. | |
129 | Worse yet, it should be followed by a comma because there is a pause: "e.g.," Better to just use "for example". Here, it can be a separate sentence: Note that the VM might assign some pages from other domains. | |
151 | "may" implies permission. How about .Va errno | |
178 | Is "on the running system" needed? |
share/man/man4/numa.4 | ||
---|---|---|
43 | Why "non-equal" rather than "unequal"? | |
57 | It is confusing to mention two different uses. Simplify to just .Cd MAXMEMDOM option is set to a value greater than 1 in a kernel configuration file. | |
61 | s/controllable/controlled/ | |
69 | Either "command" or "tool" is enough. | |
72 | General use is "I/O". Also, the sentence is kind of redundant with "devices, devices". How about: Systems with non-uniform access to I/O devices may mark those devices with the local VM domain identifier. | |
74 | s/may/can/ | |
79 | is controlled and exposes information with these | |
85 | Try to avoid the "this". The name of the variable has already been stated, just say what it is or does: .It va vm.ndomains | |
88 | As above: The default VM domain allocation policy. (Are the quotes part of the value? Possibly this should use one of the Enclosure and Quoting Macros from mdoc(7). Maybe .Ql | |
97 | A table indicating... | |
103 | The map... | |
108 | Argh. That's an em-dash, but it's not. Better to make a full stop: implementation is VM-focused. | |
110 | It's not clear what it means here. Normally, exempli gratia means "for example". So does this mean that one way hardware domains are detected is from ACPI, or should it be "i.e." to say (that is, from ACPI). Or would "(for instance, from ACPI)" capture the meaning? Best to avoid these Latin abbreviations. They are academic shorthand, frequently misused, and not as useful as when monks were copying Latin manuscripts with ink pens. | |
112 | s/example/example,/ | |
118 | Is that notation used elsewhere in man pages? It make some assumptions about the user that might not be true. The alternative is longer, but not ambiguous: .Pa sys/vm/vm_domain.c | |
122 | "their" is probably not needed in this sentence. | |
127 | Use a full break: policy from processes. | |
147 | "This" is unclear. Probably the policy. Can it be integrated with the previous sentence? as an unconfigurable first-touch allocation policy with a fail-over to round-robin allocation. | |
151 | Again, maybe merge it with the previous sentence: to implement a still-unconfigurable round-robin allocation policy. |
sys/vm/vm_domain.c | ||
---|---|---|
73 | Yes, we assume that doing things to td is fine, as we're the only one doing it. | |
sys/vm/vm_phys.c | ||
185 | Hi, The point behind the mutex is to enforce write-ordering and to be consistent. I don't like skipping mutexes where "it's safe"; it means that ordering isn't guaranteed. | |
210 | The first pass is to get the string value to write out. The second pass is to parse whatever userland set. It's not repeated. | |
226 | Fixed, thanks! | |
314 | Fixed | |
319 | Fixed! | |
800 | Fixed! | |
1388 | Fixed, thanks! |
I do not agree with this patch at all.
First, it is not supported by any real functionality behind. I object against committing 'interfaces' which cannot be validated and which user-visible ABI/API will be changed for sure. The m = NULL change in the vm_page_alloc() is from the same bucket.
Second, adding new syscalls for this is waste. Look at the procctl() which was designed exactly for the purpose of controlling runtime aspects of the processes.
The syscall is designed explicitely to made it impossible to guarantee ABI-stability. Exposing such fine kernel details as seq_t in the layout of the user-visible structure makes seq_t part of the ABI contract. Really ?
The per-process policy in your code is not real process-wide setting. It is only a non-deterministic fall-back when thread never set its own policy. If the thread ever set a policy, anybody setting process policy does not actually set per-process policy, only fallback. I mentioned the non-determinism due to the way the localcopy is coded (AKA 10 iterations).
Last, the patch is dirty. It contains spurious changes like empty lines added, generated files changes, midi/sequencer.c appearance in the patch is ridiculous etc etc.
Who is Michael Fischbein ?
sys/compat/freebsd32/freebsd32_sysent.c | ||
---|---|---|
621 | Why generated bits appear in the review ? | |
sys/vm/vm_phys.c | ||
185 | What ordering ? I am curious what do you mean. Aligned integer load is atomic, this is the fundamental assumption about the supported architecture and compiler by our code. |
Hi,
No idea who that person is.
The changes to the midi code is because of naming clashes with sys/seq.h, which is now included as part of the definition of struct proc / struct thread.
I'm happy to go and change the API to remove exposing seq_t. That's not a hard change.
The integer load is atomic, but there are two integer loads, and there may be a bitmap at some point (eg a domainset for things like "round-robin, from these domains only.") If you don't do that, you have a small window where the domain id (which may be -1) and the policy (which may not be NONE, or RR) aren't consistent. Using seq_t and copying like this means that we can extend the policy definition later and still have it all work correctly.
I can change the policy copy to not use a loop, but I put the loop in there to ensure that we always make progress. what do you suggest? Always loop, or return an error?
The per-process code does actually work. I've used it for a variety of things, including bhyve instances. The implementation ordering is mentioned in the manpage: thread policy, then process policy, then system policy. When I end up getting to UMA, vm-object, contig malloc and such, then any explicit policy provided there will trump the thread policy.
Generated bits show up in the review because they're part of files that are checked in and changed.
Yet
+ * This code is derived from software contributed to Berkeley by
+ * Michael Fischbein.
The changes to the midi code is because of naming clashes with sys/seq.h, which is now included as part of the definition of struct proc / struct thread.
I understand that. My point is that the namespace pollution from the patch is enormous and unacceptable.
I'm happy to go and change the API to remove exposing seq_t. That's not a hard change.
Yet it is there.
And you did not answered about procctl.
The integer load is atomic, but there are two integer loads, and there may be a bitmap at some point (eg a domainset for things like "round-robin, from these domains only.") If you don't do that, you have a small window where the domain id (which may be -1) and the policy (which may not be NONE, or RR) aren't consistent. Using seq_t and copying like this means that we can extend the policy definition later and still have it all work correctly.
It is a premature design for the not yet implemented backbone facility.
I can change the policy copy to not use a loop, but I put the loop in there to ensure that we always make progress. what do you suggest? Always loop, or return an error?
At this stage, this is over-designed. And whatever it be, IMO seq_t is simply wrong primitive (pattern) to use.
If anything, the ucred handling should be the pattern to follow.
The per-process code does actually work. I've used it for a variety of things, including bhyve instances. The implementation ordering is mentioned in the manpage: thread policy, then process policy, then system policy. When I end up getting to UMA, vm-object, contig malloc and such, then any explicit policy provided there will trump the thread policy.
Did you read the explanation of the situation I asked about ? Setting per-process policy does not work for a thread which
already set thread policy.
Generated bits show up in the review because they're part of files that are checked in and changed.
Which is wrong. I have to hand-edit to remove third of the patch to get it readable. Not to mention that generated changes must not be committed together with the meaningful part.
Hi, I've fixed the attribution in the current code. I didn't know about procctl; I'm happy to use the procctl API if we're happy to extend it to include threads.
My opinion, for whatever it is worth, is that a bunch of new syscalls wouldn't be a waste. procctl() was designed to be like ioctl() which has it's own problems. procctl() also seems to focus on "processes", not threads which is the whole point of the NUMA syscalls.
Kib,
I'm happy to use procctl(), but right now it doesn't support controlling thread stuff in the idtype_t. We'd have to go and do that.
Ok, I've updated numactl in the git branch
sys/compat/freebsd32/freebsd32_sysent.c | ||
---|---|---|
1 | I'll make sure generated bits don't show up in the next review. | |
sys/vm/vm_phys.c | ||
1 | There'll be non-x86 NUMA implementations that we support soon. The ordering is between the various parameters inside the struct. Any view on said struct should be consistent. | |
usr.bin/numactl/numactl.c | ||
1 | Fixed! |
Is that how it's defined in FreeBSD?
In solaris LWPID is defined as a per-process identifier, not a global identifier.
Ok. I'm worried that reusing LWPID as meaning our global thread id may upset some ported solaris code. Is there anywhere else in our tree where we've used P_LWPID to refer to tid?
Ok. I'm worried that reusing LWPID as meaning our global thread id may upset some ported solaris code. Is there anywhere else in our tree where we've used P_LWPID to refer to tid?
There is no use of P_LWPID in the tree at all, as it is easy to check with grep. The only code which touches the symbol, except header files, are kdump and truss, for symbolic display of the waitid(2) argument.