make ucred thread private
Needs ReviewPublic
Actions

Authored by kmacy on Apr 30 2018, 7:31 AM.

Details

Reviewers

Summary

the refcount on ucred becomes a system bottleneck when many threads or process share one - this zone allocates creds and tries to duplicate them on thread COW

it yields roughly a 45% speedup and brk1_processes -t 96 and mmap1_processes -t 96

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 16350

Event Timeline

kmacy created this revision.Apr 30 2018, 7:31 AM

Herald added a subscriber: imp. · View Herald TranscriptApr 30 2018, 7:31 AM

Harbormaster completed remote builds in B16350: Diff 41988.Apr 30 2018, 7:31 AM

I completely disagree with this.

The way to go is with per-cpu based reference counting which reverts back to single-word atomics as needed. This paired with a separate count of "legitimate" refs would take care of the problem altogether without duplicating everything.

That is, something like fork adds to "legitimate" refs and exit/setuid/whatever removes. Once "legitimate" refs drop to 0 no new object can get the creds and this is where the per-cpu scheme gets converted to regular atomic scheme which checks if the object should be freed.

This can be used to take care of other things, like jails or resources limits.

It is quite unclear if the vm code has any good reason to deal with creds to begin with.

In D15233#321171, @mjg wrote:

The way to go is with per-cpu based reference counting which reverts back to single-word atomics as needed. This paired with a separate count of "legitimate" refs would take care of the problem altogether without duplicating everything.

I agree that would be better although we need to look at how this would be implemented and see if we have a reasonable solution. I assume it would likely need to be some variant of counter(9). Presumably when the per-cpu var transitioned from 0 to 1 you would grab an atomic on the global? And then release on 1->0?

This can be used to take care of other things, like jails or resources limits.

It is quite unclear if the vm code has any good reason to deal with creds to begin with.

While the use seems somewhat sloppy it needs access to the uidinfo which otherwise has its own ref count. You wouldn't have to parallelize the uidinfo ref if it is just held via the cred ref. We still need to make that accounting per-cpu. Something like a counter(9) with a not-completely-coherent limit.

kbowling added a subscriber: kbowling.Apr 30 2018, 9:37 PM

In D15233#321179, @jeff wrote:

In D15233#321171, @mjg wrote:

The way to go is with per-cpu based reference counting which reverts back to single-word atomics as needed. This paired with a separate count of "legitimate" refs would take care of the problem altogether without duplicating everything.

I agree that would be better although we need to look at how this would be implemented and see if we have a reasonable solution. I assume it would likely need to be some variant of counter(9). Presumably when the per-cpu var transitioned from 0 to 1 you would grab an atomic on the global? And then release on 1->0?

It would be an invariant that the pcpu ref is > 1 in 'normal' circumstances. After say someone does setuid or whatever else getting signifying the creds are going to die at some point you decrement the "legitimate" ref and per-cpu refs by 1. If legit refs are 0, the counter transitions to atomics. This can be hacked further.

This can be used to take care of other things, like jails or resources limits.

It is quite unclear if the vm code has any good reason to deal with creds to begin with.

While the use seems somewhat sloppy it needs access to the uidinfo which otherwise has its own ref count. You wouldn't have to parallelize the uidinfo ref if it is just held via the cred ref. We still need to make that accounting per-cpu. Something like a counter(9) with a not-completely-coherent limit.

I mean is there any good reason to do this per-uid swap accounting to begin with? By default overcommit flags are 0, which in particular means the limit is not enforced whatsoever. I think it would be acceptable for the time being to flip overcommit to be a boot-time tunable and only play around with accounting if it got enabled.

The general point here is that in the normal case this is just a pessimization and fixing it requires quite some care, all while more pressing issues are here and 12.0 releng process is behind the corner.

In D15233#321371, @mjg wrote:

I mean is there any good reason to do this per-uid swap accounting to begin with? By default overcommit flags are 0, which in particular means the limit is not enforced whatsoever. I think it would be acceptable for the time being to flip overcommit to be a boot-time tunable and only play around with accounting if it got enabled.

The general point here is that in the normal case this is just a pessimization and fixing it requires quite some care, all while more pressing issues are here and 12.0 releng process is behind the corner.

It is not uncommon to disable overcommit via rc.local or sysctl.conf on many server workloads. If you search the web you can find this recommendation all over. OOM killer is not acceptable in many environments. The right thing to do is fix the accounting by building better primitives. If it doesn't make it into 12 it's not the end of the world.

I meant per-uid accounting specifically. I have no issue with counting swap usage in general, which also happens to not induce the problem.

How about this hack: *don't* crfree creds stored in vm_map_entry objs, leave it to the obj destructor. When allocating, compare curthread->td_ucred with already referenced ucred. Most of the time it should be the same in which case there is nothing to do. Otherwise unref the found ones and ref the new ones. While clearly should "fix" the microbenchmark, it should also provide a net win for something like buildworld/builkernel and most likely pkg building (which does uses different creds per jail). Hit ratio can be easily measured with little dtrace, should be solid. Hard to say about a "real" multi-user system. I think it can be tested out without actaully changing anything - just leave the pointer in and use dtrace to see if the freshly allocated entry has a matching cred pointer.

This happens to postpone the need for anything fancy and should be faster single threaded even compared to per-cpu refs.

As for managing the swap count itself, there are general ideas how to approach this kind of thing and I think it's a separate flamewar. In particular it is possible to be per-cpu friendly while not being sloppy about reaching limits whatsoever.