Diff Detail
- Repository: rG FreeBSD src repository
- Lint: Lint Skipped
- Unit: Tests Skipped
- Build Status: Buildable 55472, Build 52361: arc lint + arc unit
Event Timeline
What's the motivation here? If you are running into scalability problems, it has to be allproc and proctree locks (amongst others).
General thing which needs to be done with pid management is increasing the valid range -- on Linux the max pid is 4194304. There is a potential problem in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside, increasing the bitmaps to cover that range is a no-go, but temporarily one could quadruple them.
If there is a scalability problem, the new big range can be partitioned so that CPUs hand out ids without conflicts.
Yes, scalability is the main motivation. We have a machine that slows down considerably when we reach >10,000 processes (64-core system). allproc and proctree are the main problems. This change is just low-hanging fruit where we can easily eliminate a global lock.
> General thing which needs to be done with pid management is increasing the valid range -- on Linux the max pid is 4194304. There is a potential problem in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside, increasing the bitmaps to cover that range is a no-go, but temporarily one could quadruple them.
Increasing the valid range won't help much with scalability, even less so with lockfree PID lookup. I have tried that. This resulted in https://github.com/freebsd/freebsd-src/commit/1b8d70b2eb713f5c75605419bb2505ce18fa304a and https://reviews.freebsd.org/D43496.
We could still increase the limit by an order of magnitude, which would require 0.5MB for the bitmaps, but until we break up the allproc and proctree locks I don't see us running with more than 100,000 processes.
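(For scale, this presumably counts the four proc id bitmaps -- pid, process group, session and reaper: a tenfold range of roughly 10^6 ids comes to 4 × 10^6 bits / 8 ≈ 500 KB, i.e. the 0.5MB figure.)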
> If there is a scalability problem, the new big range can be partitioned so that CPUs hand out ids without conflicts.
Going from the proposed lockfree scheme to waitfree would be nice, but it will be much more involved than this patch, as we would still need a fallback to other ranges. I think having PID allocation lockfree is good enough to the point that this piece of code is no longer a concern.
I am not convinced it is of benefit (see below).
>> General thing which needs to be done with pid management is increasing the valid range -- on Linux the max pid is 4194304. There is a potential problem in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside, increasing the bitmaps to cover that range is a no-go, but temporarily one could quadruple them.
> Increasing the valid range won't help much with scalability, even less so with lockfree PID lookup. I have tried that. This resulted in https://github.com/freebsd/freebsd-src/commit/1b8d70b2eb713f5c75605419bb2505ce18fa304a and https://reviews.freebsd.org/D43496.
> We could still increase the limit by an order of magnitude, which would require 0.5MB for the bitmaps, but until we break up the allproc and proctree locks I don't see us running with more than 100,000 processes.
Increasing the range has to be done anyway and is unrelated to the performance concerns. For example, kyua routinely exhausts the current range and sometimes fails because of it.
I noted that, with the increase done, scalability can be achieved very easily by partitioning the new extended range per CPU (or at some other granularity).
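A minimal sketch of that partitioning idea, not code from this patch: the PID_MAX_EXT and PID_RESERVED constants and the pid_slice() helper are hypothetical, and a real allocator would still need a fallback path once a CPU's slice fills up.

```c
/*
 * Illustrative only: with a much larger pid limit, each CPU allocates from
 * its own slice of the pid space, so ids handed out by different CPUs can
 * never conflict and no global lock is needed on the fast path.
 */
#include <sys/param.h>
#include <sys/pcpu.h>		/* PCPU_GET */
#include <sys/smp.h>		/* mp_ncpus */

#define	PID_MAX_EXT	4194304		/* e.g. a Linux-style upper bound */
#define	PID_RESERVED	100		/* keep low pids for the usual special cases */

static void
pid_slice(int cpu, pid_t *lo, pid_t *hi)
{
	pid_t span;

	span = (PID_MAX_EXT - PID_RESERVED) / mp_ncpus;
	*lo = PID_RESERVED + cpu * span;
	*hi = *lo + span - 1;
}

/*
 * An allocator would then scan only [lo, hi] for the current CPU
 * (cpu = PCPU_GET(cpuid)) and fall back to other slices when exhausted.
 */
```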
Anyhow, I don't think the idea is /wrong/, I'm just not convinced it is worthwhile. For example, when running 104-way poudriere (very fork-heavy) the procid lock is close to the noise category; allproc and proctree are not. I would be less apprehensive if the actual bottlenecks got sorted out first and this was demonstrated to speed things up, which on the *current* kernel it won't, modulo an accident. But then again, I'm not going to stand in the way (modulo bugs).
sys/kern/kern_fork.c
- Line 284: Doing this here does not work as is and leads to a lot of complication. This routine is the only place which truly reserves a new id, so what you want to do is leave this spot alone and try the atomic *after* you have concluded nothing uses it. Then you don't need to backpedal with proc_id_clear -- if you set it just fine, then you are the only user no matter what. (See the sketch after this list.)
- Line 300: This RFHIGHPID *probably* can be eliminated.
- Line 301: The acq fence does not do anything here.
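A rough sketch of the ordering being suggested, under stated assumptions: the flat u_long bitmap and the *_in_use()/next_candidate() helpers are hypothetical simplifications, and whether the earlier idmap checks can go stale is exactly what the reply further down debates.

```c
/*
 * Sketch only, not the patch: check the other idmaps first and reserve the
 * pid bit last with an atomic test-and-set.  If the test-and-set succeeds,
 * this thread is the sole owner of the id and no proc_id_clear() rollback
 * is ever needed; if it fails, someone else took the pid and we retry.
 */
#include <sys/param.h>
#include <sys/proc.h>		/* PID_MAX */
#include <machine/atomic.h>	/* atomic_testandset_long */

#define	ID_BITS		(sizeof(u_long) * NBBY)

static volatile u_long pidmap_sketch[howmany(PID_MAX + 1, ID_BITS)];

/* Hypothetical stand-ins for the pgrp/session/reaper map checks. */
static bool grpid_in_use(pid_t id);
static bool sessid_in_use(pid_t id);
static bool reapid_in_use(pid_t id);
static pid_t next_candidate(pid_t id);

static pid_t
findpid_sketch(pid_t trypid)
{
	for (;;) {
		if (grpid_in_use(trypid) || sessid_in_use(trypid) ||
		    reapid_in_use(trypid)) {
			trypid = next_candidate(trypid);
			continue;
		}
		if (atomic_testandset_long(&pidmap_sketch[trypid / ID_BITS],
		    trypid % ID_BITS) == 0)
			return (trypid);	/* bit was clear: we own it */
		trypid = next_candidate(trypid);	/* lost the race, retry */
	}
}
```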
sys/kern/kern_proc.c
- Line 347: This changes the semantics of the original routine without a good reason that I can see: it explicitly sets a bit only if it was not there already, intentionally tolerating that it was there, for example for the reaper case. You should instead add a new routine for fork_findpid (proc_id_try_set?); see the sketch below.
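One possible shape for that proc_id_try_set, purely illustrative and reusing the flat-bitmap layout from the sketch above; the existing proc_id_* routines are keyed by a type argument rather than taking a map pointer.

```c
/*
 * Hypothetical proc_id_try_set(): reserve the id only if it was free and
 * tell the caller whether it won, leaving the existing set routine's
 * semantics untouched for the other call sites.
 */
static bool
proc_id_try_set(volatile u_long *map, pid_t id)
{

	return (atomic_testandset_long(&map[id / ID_BITS],
	    id % ID_BITS) == 0);
}
```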
sys/kern/kern_fork.c
- Line 284: I think there is a possible race in what you are proposing, which I was trying to address by reserving the PID first:
  1. thread0: select PID X, check all idmaps.
  2. thread1: select PID X, check all idmaps.
  3. thread0: reserve PID X in the pidmap.
  4. thread1: reserve PID X, which is free now, but X is still reserved in the idgroup map by process X+1.
sys/kern/kern_proc.c
- Line 347: After the change the function behaves exactly the same -- it unconditionally sets the bit, just now it will return the previous value, which the reaper can simply ignore. I can add a new function, but it felt like duplicating the code. (A sketch of this follows below.)
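A guess at what that described behaviour amounts to, again using the simplified layout from the sketches above; the actual patch may differ.

```c
/*
 * Guess at the described change: the set routine still sets the bit
 * unconditionally but now reports the previous value, so fork_findpid()
 * can detect a lost race while the reaper path simply discards the result.
 */
static bool
proc_id_set_sketch(volatile u_long *map, pid_t id)
{

	/* Returns true if the bit was already set before this call. */
	return (atomic_testandset_long(&map[id / ID_BITS],
	    id % ID_BITS) != 0);
}

static void
reap_caller_sketch(volatile u_long *map, pid_t id)
{

	(void)proc_id_set_sketch(map, id);	/* previous value irrelevant here */
}
```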
I don't really like the change in principle, at least not without some benchmarks which indicate this really does help under heavy load, without hurting tail latency much due to excessive retries. The code is more complex and it's harder to reason about the likelihood of PID reuse or starvation.
sys/kern/kern_fork.c
- Line 276: No need for the else after a goto retry.
- Line 301: Why _acq? Why is it ok to ignore failures? This should be explained by a comment.