Implement thread domain policy via cpuset

Authored by jeff on Dec 7 2017, 4:22 AM.



This patch replaces all numa policy and iterators with a new version that is integrated with cpuset. I'm still testing so I'm mostly looking for early design feedback. The intent of integrating with cpuset is to provide administrators with the ability to partition resources as well as allowing programs to control their own behavior. This created some complexity in cpuset due to the different synchronization behaviors of allocation vs thread switching. Iterators were rewritten so flag handling for nowait/waitok happens in a centralized place. I have implemented only the two policies that libnuma supports for now but this is trivially expanded upon.

Domainset synchronization is accomplished by making the structures immutable. This means potentially all threads and objects in the system may have read-only references to the same domainset. This allows me to put a popcnt and max in the structure in order to optimize searches. It also means we don't have to worry about torn writes or use any other synchronization like seq(9) which was used before. The domainset just holds a bitmap of allowed domains and an integer policy value. There is a domainset_ref which holds an iterator integer in either a thread or object which is used so iteration is coherent from call to call.

The iterators are written to attempt to streamline the first allocation since it is assumed that it will succeed. The iterators also revisit every domain before blocking on the first domain. So if there was a preference in the policy we will use the preferred domain. This partially addresses kib's concerns from my other two reviews. Other behaviors should be more straightforward to implement now that the flag handling is centralized. It also shrunk the iterating functions considerably.

The policy precedence is object first. So this will allow us to specify a kernel_object policy which overrides the calling thread's policy except where the calling thread is explicitly coded in the kernel to request a domain. This precedence is also consistent with what libnuma provides.

The cpuset integration was somewhat cumbersome because cpusets are not immutable. Masks can be modified on the fly with locks held. So there is more resource preallocation and tracking for domains. Integrating domains with cpusets makes things like jail work on specific domains out of the box. I have some more work to do to finish this off. For example, domain lookup should be hashed based, not a linked list. I also haven't put in any code to validate domain policies.

If you don't like names etc now is the time to speak up. I should be ready to post a final patch within a week.

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.
jeff created this revision.Dec 7 2017, 4:22 AM
kib added a comment.Dec 7 2017, 5:04 PM

[I only looked at the syscall interface, not read the implementation, plan to do this later].

616 ↗(On Diff #36330)

Please remove generated files from the review and commit. Commit them as a followup after regen.

43 ↗(On Diff #36330)

Instead, replace the entry in syscalls.master with UNIMPL.

1029 ↗(On Diff #36330)

I would greatly prefer for the get/setdomain syscalls to take a pointer to the structure like this:

struct domainspec {
  domainset_t *ds_mask;
  void *ds_pad0[6];
  uint64_t ds_masklen;
  int ds_policy;
  int ds_pad1[7];

The structure and padding allow for the future ABI extension without renumbering the syscall.
I used uint64_t for ds_masklen to avoid size_t and ABI translation for compat32. It is arguable that uint32_t/uint can be used there instead.

landonf added a subscriber: landonf.Dec 9 2017, 1:51 AM
kib added a comment.EditedDec 9 2017, 11:52 AM

I admit that I like the new iterators, they are quite clean and understandable.

For non-NUMA configs, quite a lot of setup is currently done for iterators to prepare for a step which never occurs. Might be, for such machines, a special policy should be created which would be short-circuited early and which always returns domain 0 ?

143 ↗(On Diff #36330)

Could you keep the req value intact, and manipulate the flags in di_flags ? IMO the code become easier to follow and perhaps to debug, because original request parameters are visible.

jeff updated this revision to Diff 36644.Dec 15 2017, 11:37 PM

Fix a few bugs. Address the XXX comments. cpuset utility implementation and testing.

From what I can test with cpuset(1) this is functional and ready for libnuma integration.

markj accepted this revision.Dec 17 2017, 6:00 PM

All of my comments are nits. I'm doing some ad-hoc testing of the patch now, I'll let you know if I find any problems.

185 ↗(On Diff #36644)

Style: parens around return value, ditto above.

513 ↗(On Diff #36644)

Parens around return value.

1303 ↗(On Diff #36644)

Looks like domainset0 can be static.

1860 ↗(On Diff #36644)

Wrong indentation.

1866 ↗(On Diff #36644)


1952 ↗(On Diff #36644)

Unreachable return statement.

1990 ↗(On Diff #36644)

The indentation here looks wrong.

45 ↗(On Diff #36644)

You can delete the man pages for these syscalls too.

41 ↗(On Diff #36644)

DOMAINSETSETBUFSIZ appears to be unused.

68 ↗(On Diff #36644)


157 ↗(On Diff #36644)

Perhaps KASSERT(di->di_n > 0)?

46 ↗(On Diff #36644)

Perhaps provide stub implementations of these functions for the MAXMEMDOM == 1 case, to avoid unnecessary function calls in the page allocator?

296 ↗(On Diff #36644)

This is missing a man page update. -d isn't documented either.

This revision is now accepted and ready to land.Dec 17 2017, 6:00 PM
jeff added a reviewer: jhb.Dec 18 2017, 9:59 PM
jeff marked an inline comment as done.
Closed by commit rS327896: Add files for r327895 (authored by jeff, committed by ). · Explain WhyJan 12 2018, 10:58 PM
This revision was automatically updated to reflect the committed changes.