kern: Implement lockless method for obtaining new PID.
Needs ReviewPublic
Actions

Authored by pjd on Jan 18 2024, 4:35 AM.

Details

Reviewers

kib
markj
mjg

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 55472
Build 52361: arc lint + arc unit

Event Timeline

pjd created this revision.Jan 18 2024, 4:35 AM

Herald added a subscriber: imp. · View Herald TranscriptJan 18 2024, 4:35 AM

pjd requested review of this revision.Jan 18 2024, 4:35 AM

Harbormaster completed remote builds in B55472: Diff 132906.Jan 18 2024, 4:35 AM

pjd added a parent revision: D43487: bitset(9): Implement BIT_ISSET_ATOMIC(), BIT_FFC() and BIT_FFC_AT()..Jan 18 2024, 4:36 AM

pjd mentioned this in D43487: bitset(9): Implement BIT_ISSET_ATOMIC(), BIT_FFC() and BIT_FFC_AT()..Jan 18 2024, 4:37 AM

pjd added reviewers: kib, markj, mjg.Jan 18 2024, 4:46 AM

What's the motivation here? If you are running into scalability problems, it has to be allproc and proctree locks (amongst others).

General thing which needs to be done with pid management is increasing valid range -- on linux the max pid is 4194304. There is a potential problem doing it in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside increasing bitmaps for that range is a no-go, but temporarily one could quadruple them.

If there is a scalability problem, the new big range can be partitioned so that CPUs hand out ids without conflicts.

In D43488#991394, @mjg wrote:

What's the motivation here? If you are running into scalability problems, it has to be allproc and proctree locks (amongst others).

Yes, scalability is the main motivation. We have a machine that slows down considerably when we reach >10,000 process (64 core system). allproc and proctree are the main problems. This change is just a low hanging fruit where we can easily eliminate a global lock.

General thing which needs to be done with pid management is increasing valid range -- on linux the max pid is 4194304. There is a potential problem doing it in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside increasing bitmaps for that range is a no-go, but temporarily one could quadruple them.

Increasing the valid range won't help much with scalability, even less so with lockfree PID lookup. I have tried that. This resulted in https://github.com/freebsd/freebsd-src/commit/1b8d70b2eb713f5c75605419bb2505ce18fa304a and https://reviews.freebsd.org/D43496.
We could still increase the limit by an order of magnitude which would require 0.5MB for the bitmaps, but until we break up allproc and proctree locks I don't see us running with more than 100,000 processes.

If there is a scalability problem, the new big range can be partitioned so that CPUs hand out ids without conflicts.

Going from proposed lockfree to waitfree would be nice, but will be much more involved than this patch, as we would still need fallback to other ranges. I think having PID allocation lockfree is good enough to the point that this piece of code in no longer a concern.

In D43488#991564, @pjd wrote:

In D43488#991394, @mjg wrote:

What's the motivation here? If you are running into scalability problems, it has to be allproc and proctree locks (amongst others).

Yes, scalability is the main motivation. We have a machine that slows down considerably when we reach >10,000 process (64 core system). allproc and proctree are the main problems. This change is just a low hanging fruit where we can easily eliminate a global lock.

I am not convinced it is of benefit (see below).

General thing which needs to be done with pid management is increasing valid range -- on linux the max pid is 4194304. There is a potential problem doing it in that TIDs were defined as max pid + 1, which may or may not come back to bite (see pget). That aside increasing bitmaps for that range is a no-go, but temporarily one could quadruple them.

Increasing the valid range won't help much with scalability, even less so with lockfree PID lookup. I have tried that. This resulted in https://github.com/freebsd/freebsd-src/commit/1b8d70b2eb713f5c75605419bb2505ce18fa304a and https://reviews.freebsd.org/D43496.
We could still increase the limit by an order of magnitude which would require 0.5MB for the bitmaps, but until we break up allproc and proctree locks I don't see us running with more than 100,000 processes.

Increasing the range has to be done as is unrelated to performance concerns. For example kyua routinely exhausts the current range and sometimes fails because of it.

I noted with the increase done scalability can be achieved very easily by partitioning the new extended range per-cpu (or some other granularity).

Anyhow, I don't think the idea is /wrong/, I'm just not convinced it is worthwhile. For example when running 104-way poudriere (very fork heavy) the procid lock is close to the noise category. allproc and proctree are not. I would be less apprehensive if actual bottlenecks got sorted out first and this was demonstrated to speed things up, which on *current* kernel it wont modulo an accident. But then again, I'm not going to stand in the way (modulo bugs).

sys/kern/kern_fork.c
284	doing this here does not work as is and leads to a lot of complication this routine is the only place which truly reserves a new id, so what you want to do is leave this spot alone and tru the atomic after you concluded nothing uses it. then you don't need to backpedal with proc_id_clear -- if you set it just fine, then you are the only user no matter what.
300	this RFHIGHPID probably can be eliminated
301	acq fence does not do anything here
sys/kern/kern_proc.c
347	this changes semantics of the original routine without a good reason that i can see it explicitly sets a bit only if it was not there already, intentionally tolerating if it was there for example for the reaper case you should instead add a new routine for fork_findpid (proc_id_try_set?)

pjd added inline comments.Jan 21 2024, 9:16 PM

sys/kern/kern_fork.c
284	doing this here does not work as is and leads to a lot of complication this routine is the only place which truly reserves a new id, so what you want to do is leave this spot alone and tru the atomic after you concluded nothing uses it. then you don't need to backpedal with proc_id_clear -- if you set it just fine, then you are the only user no matter what. I think there is a possible race in what you are proposing that I was trying to address by reserving PID first: thread0: select PID X, check all idmaps. thread1: select PID X, check all idmaps. thread0: reserve the PID in pidmap thread0: fork and use PID X+1, but keep pgrp of X thread0: exit thread1: reserve PID X, which is free now, but X is still reserved in idgroup map by process X+1.
sys/kern/kern_proc.c
347	this changes semantics of the original routine without a good reason that i can see it explicitly sets a bit only if it was not there already, intentionally tolerating if it was there for example for the reaper case you should instead add a new routine for fork_findpid (proc_id_try_set?) After the change the function behaves exactly the same - unconditionally sets the bit, just now it will return the previous value, which the reaper can simply ignore. I can add a new function, but it felt like duplicating the code.

I don't really like the change in principle, at least not without some benchmarks which indicate this really does help under heavy load, without hurting tail latency much due to excessive retries. The code is more complex and it's harder to reason about the likelihood of PID reuse or starvation.