Page MenuHomeFreeBSD

kern: Implement lockless method for obtaining new PID.
Needs ReviewPublic

Authored by pjd on Jan 18 2024, 4:35 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sep 17 2024, 12:08 PM
Unknown Object (File)
Sep 16 2024, 6:33 PM
Unknown Object (File)
Sep 14 2024, 12:29 PM
Unknown Object (File)
Sep 11 2024, 4:31 PM
Unknown Object (File)
Sep 9 2024, 4:16 AM
Unknown Object (File)
Sep 9 2024, 4:16 AM
Unknown Object (File)
Sep 8 2024, 3:49 PM
Unknown Object (File)
Sep 8 2024, 3:50 AM
Subscribers

Details

Reviewers
kib
markj
mjg

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 55472
Build 52361: arc lint + arc unit

Event Timeline

pjd requested review of this revision.Jan 18 2024, 4:35 AM

What's the motivation here? If you are running into scalability problems, it has to be allproc and proctree locks (amongst others).

General thing which needs to be done with pid management is increasing valid range -- on linux the max pid is 4194304. There is a potential problem doing it in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside increasing bitmaps for that range is a no-go, but temporarily one could quadruple them.

If there is a scalability problem, the new big range can be partitioned so that CPUs hand out ids without conflicts.

In D43488#991394, @mjg wrote:

What's the motivation here? If you are running into scalability problems, it has to be allproc and proctree locks (amongst others).

Yes, scalability is the main motivation. We have a machine that slows down considerably when we reach >10,000 process (64 core system). allproc and proctree are the main problems. This change is just a low hanging fruit where we can easily eliminate a global lock.

General thing which needs to be done with pid management is increasing valid range -- on linux the max pid is 4194304. There is a potential problem doing it in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside increasing bitmaps for that range is a no-go, but temporarily one could quadruple them.

Increasing the valid range won't help much with scalability, even less so with lockfree PID lookup. I have tried that. This resulted in https://github.com/freebsd/freebsd-src/commit/1b8d70b2eb713f5c75605419bb2505ce18fa304a and https://reviews.freebsd.org/D43496.
We could still increase the limit by an order of magnitude which would require 0.5MB for the bitmaps, but until we break up allproc and proctree locks I don't see us running with more than 100,000 processes.

If there is a scalability problem, the new big range can be partitioned so that CPUs hand out ids without conflicts.

Going from proposed lockfree to waitfree would be nice, but will be much more involved than this patch, as we would still need fallback to other ranges. I think having PID allocation lockfree is good enough to the point that this piece of code in no longer a concern.

In D43488#991564, @pjd wrote:
In D43488#991394, @mjg wrote:

What's the motivation here? If you are running into scalability problems, it has to be allproc and proctree locks (amongst others).

Yes, scalability is the main motivation. We have a machine that slows down considerably when we reach >10,000 process (64 core system). allproc and proctree are the main problems. This change is just a low hanging fruit where we can easily eliminate a global lock.

I am not convinced it is of benefit (see below).

General thing which needs to be done with pid management is increasing valid range -- on linux the max pid is 4194304. There is a potential problem doing it in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside increasing bitmaps for that range is a no-go, but temporarily one could quadruple them.

Increasing the valid range won't help much with scalability, even less so with lockfree PID lookup. I have tried that. This resulted in https://github.com/freebsd/freebsd-src/commit/1b8d70b2eb713f5c75605419bb2505ce18fa304a and https://reviews.freebsd.org/D43496.
We could still increase the limit by an order of magnitude which would require 0.5MB for the bitmaps, but until we break up allproc and proctree locks I don't see us running with more than 100,000 processes.

Increasing the range has to be done as is unrelated to performance concerns. For example kyua routinely exhausts the current range and sometimes fails because of it.

I noted with the increase done scalability can be achieved very easily by partitioning the new extended range per-cpu (or some other granularity).

Anyhow, I don't think the idea is /wrong/, I'm just not convinced it is worthwhile. For example when running 104-way poudriere (very fork heavy) the procid lock is close to the noise category. allproc and proctree are not. I would be less apprehensive if actual bottlenecks got sorted out first and this was demonstrated to speed things up, which on *current* kernel it wont modulo an accident. But then again, I'm not going to stand in the way (modulo bugs).

sys/kern/kern_fork.c
284

doing this here does not work as is and leads to a lot of complication

this routine is the only place which truly reserves a new id, so what you want to do is leave this spot alone and tru the atomic *after* you concluded nothing uses it. then you don't need to backpedal with proc_id_clear -- if you set it just fine, then you are the only user no matter what.

300

this RFHIGHPID *probably* can be eliminated

301

acq fence does not do anything here

sys/kern/kern_proc.c
347

this changes semantics of the original routine without a good reason that i can see

it explicitly sets a bit only if it was not there already, intentionally tolerating if it was there

for example for the reaper case

you should instead add a new routine for fork_findpid (proc_id_try_set?)

sys/kern/kern_fork.c
284

doing this here does not work as is and leads to a lot of complication

this routine is the only place which truly reserves a new id, so what you want to do is leave this spot alone and tru the atomic *after* you concluded nothing uses it. then you don't need to backpedal with proc_id_clear -- if you set it just fine, then you are the only user no matter what.

I think there is a possible race in what you are proposing that I was trying to address by reserving PID first:

thread0: select PID X, check all idmaps.

thread1: select PID X, check all idmaps.

thread0: reserve the PID in pidmap
thread0: fork and use PID X+1, but keep pgrp of X
thread0: exit

thread1: reserve PID X, which is free now, but X is still reserved in idgroup map by process X+1.

sys/kern/kern_proc.c
347

this changes semantics of the original routine without a good reason that i can see

it explicitly sets a bit only if it was not there already, intentionally tolerating if it was there

for example for the reaper case

you should instead add a new routine for fork_findpid (proc_id_try_set?)

After the change the function behaves exactly the same - unconditionally sets the bit, just now it will return the previous value, which the reaper can simply ignore. I can add a new function, but it felt like duplicating the code.

I don't really like the change in principle, at least not without some benchmarks which indicate this really does help under heavy load, without hurting tail latency much due to excessive retries. The code is more complex and it's harder to reason about the likelihood of PID reuse or starvation.

sys/kern/kern_fork.c
276

No need for the else after a goto retry.

301

Why _acq?

Why is it ok to ignore failures? This should be explained by a comment.