
Reimplement rangelocks
ClosedPublic

Authored by kib on Sep 8 2023, 6:51 PM.

Details

Summary

Using the algorithms from https://doi.org/10.1145/3342195.3387533.
As a performance optimization, until a conflict is detected, a trivial sleepable sx-like lock is used (not a full-fledged sx, to avoid bloating the vnode size).

I have not yet measured the scalability systematically, but for cheat mode the curve is almost the same as with the vnode interlock, and real rangelock mode seems to scale instead of giving a flat line.

Tested by: pho

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

There are a very large number of changes, so older changes are hidden.
kib marked 2 inline comments as done. Sep 15 2023, 4:14 PM

Another batch of changes addressing Mark's notes.

Did you test this on arm64 or any other weakly ordered arch?

Am I right that if a thread successively locks small ranges, starting from the end of the file and proceeding to the beginning, we will not clean up marked entries on the rangelock queue?

I wonder if some periodic mechanism, e.g., VFS_SYNC, should try to clean freed entries from vnode rangelocks.

sys/kern/kern_rangelock.c
128

If there is no other synchronization for the corresponding memory accesses, then I expect the barriers really are needed.

190

trylock has the meaning "don't attempt to acquire sleepqueue spinlocks," but why? It should be harmless to do so. Consider that if both a reader and a writer simultaneously trylock a range, with the current implementation it is possible for both of them to fail.

195

I don't quite understand this path: we already marked the newly inserted entry, how is it guaranteed that we will not return RL_LOCK_SUCCESS later?

197

These three lines could be replaced by cur = rl_e_unmark(rl_q_load(prev));. There is another instance below.

205
210

This condition is always false on the first iteration, right?

233

Why do you check whether cur is marked here? This is already checked once in the cur != NULL case above. Does it actually help to repeat the check before locking and checking again for a third time?

kib marked 7 inline comments as done. Oct 3 2023, 7:11 PM

Did you test this on arm64 or any other weakly ordered arch?

No

Am I right that if a thread successively locks small ranges, starting from the end of the file and proceeding to the beginning, we will not clean up marked entries on the rangelock queue?

I wonder if some periodic mechanism, e.g., VFS_SYNC, should try to clean freed entries from vnode rangelocks.

If all of them are writes then no, indeed, since rl_w_validate() goes from the head to the inserted lock.

I think a less drastic measure can be taken, e.g., rl_w_validate() could continue the loop after cur == e becomes true, doing only cleanup, and only sometimes. But I have not tried this; let's finish with the current review first.

sys/kern/kern_rangelock.c
128

Rangelocks do not synchronize access to memory. They synchronize file data manipulations, and there is enough synchronization elsewhere to ensure the visibility of file data cached in memory.

190

I think the weaker semantics are fine there, but I fixed the issue you pointed out.

195

Yes, I need to unlock e only once we are committed to returning an error.

197

rl_e_unmark() asserts that the entry is marked. I added an _unchecked() version, OK.
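
For readers following along, here is a minimal sketch of the two helper shapes being discussed, assuming the mark lives in the low bit of the next pointer. The RL_Q_MARKED name and the bodies are illustrative, not the definitions in kern_rangelock.c.

```c
#include <sys/param.h>
#include <sys/systm.h>		/* MPASS */

/* Illustrative only; the real helpers live in sys/kern/kern_rangelock.c. */
#define	RL_Q_MARKED	((uintptr_t)0x1)	/* assumed mark bit */

struct rl_q_entry;

static inline struct rl_q_entry *
rl_e_unmark_unchecked(const struct rl_q_entry *e)
{
	/* Strip the mark bit whether or not it is set. */
	return ((struct rl_q_entry *)((uintptr_t)e & ~RL_Q_MARKED));
}

static inline struct rl_q_entry *
rl_e_unmark(const struct rl_q_entry *e)
{
	/* The checked variant insists that the caller saw a marked entry. */
	MPASS(((uintptr_t)e & RL_Q_MARKED) != 0);
	return (rl_e_unmark_unchecked(e));
}
```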

210

You mean that the head (not the first element in the list) is never marked. Yes, this is true.

kib marked 5 inline comments as done.

Another batch of changes.

olce added inline comments.
sys/kern/kern_rangelock.c
66–68

I think I see why you put this one after RL_CHEAT_CHEATING, although this doesn't match the numerical order.

I would add a comment about RL_CHEAT_CHEATING, something like "Tag indicating that 'head' is not a pointer but contains flags."

And probably one also for RL_CHEAT_MASK, something like: "Mask of bit flags in 'head' when 'RL_CHEAT_CHEATING' is set."

I think I'm ok with the diff now. My main complaint is that there is no bound on the number of garbage queue entries, see below.

In D41787#959376, @kib wrote:

Am I right that if a thread successively locks small ranges, starting from the end of the file and proceeding to the beginning, we will not clean up marked entries on the rangelock queue?

I wonder if some periodic mechanism, e.g., VFS_SYNC, should try to clean freed entries from vnode rangelocks.

If all of them are writes then no, indeed, since rl_w_validate() goes from the head to the inserted lock.

I think a less drastic measure can be taken, e.g., rl_w_validate() could continue the loop after cur == e becomes true, doing only cleanup, and only sometimes. But I have not tried this; let's finish with the current review first.

Hmm, why can't rangelock_unlock() clean up its own entry? Because it doesn't know which entry belongs to it (rl_q_owner is INVARIANTS-only)? Or is it because the design is trying to improve scalability by reducing the number of threads that might be performing queue operations? Or something else?

sys/kern/kern_rangelock.c
128

What mechanism provides this synchronization for, e.g., shm objects?

140

Assert that all entries are marked?

194

Can the broadcast be conditional on lock->sleepers? If lock->sleepers is false, then we scan the sleepqueue hash chain needlessly, I think.

298

This test can move into the if (cur != NULL) block.

sys/kern/kern_rangelock.c
140

Never mind, I forgot (again) that rl_e_unmark() asserts that the entry is marked.

sys/kern/kern_rangelock.c
296

You might as well load cur earlier, in the check for an empty queue.

I may be missing something obvious, but I don't quite understand the need for rl_q_free. Isn't the purpose of UMA's SMR zones exactly to handle deferred freeing? In other words, why isn't it possible to replace the pieces of code inserting unlocked entries into rl_q_free with straight calls to uma_zfree_smr()? Is it because rl_insert(), and thus these pieces, are called in an SMR section?

It is too bad that the list has to be traversed from the start in rl_w_validate(). Using a doubly-linked list would allow starting from e, but it has several drawbacks as well (concurrency would be harder, more memory consumption). Does anyone have an idea of how many rangelocks there typically are on a single vnode, on average and at worst?

sys/kern/kern_rangelock.c
75–77

I don't understand how the use of the RL_CHEAT_* flags mandates any non-natural alignment. Could you please explain why?

100

Why use PRI_USER instead of 0 here? In other words, why boost awakened user threads?

128

I don't think there is any need for acquire/release semantics for cheat locks, since AFAIU their whole state is contained in the head field (and also in the sleepqueue when draining, but this part is protected by the sleepqueue chain lock), so there are no synchronization needs beyond those already provided by atomic reads, fcmpset operations, and mutex lock/unlock.

193–205

(Micro-)optimization: Acquiring the sleepqueue (chain) lock could be done only when there are waiters to wake up. This amounts to pushing sleepq_lock() down into the if (x == 0) then-block, adding a pairing sleepq_release() call there, and removing all the other sleepq_release() calls.

268–269

How can rl_q_next on a newly created entry that has just been put on the queue be marked? I don't think this is possible.

269–270

Same as above at start of rl_r_validate(), but for a different reason (head of list can't be marked).

293

I doubt there is a benefit in doing the raw comparison before the CAS.

kib marked 7 inline comments as done. Oct 4 2023, 5:08 PM

I think I'm ok with the diff now. My main complaint is that there is no bound on the number of garbage queue entries, see below.

Hmm, why can't rangelock_unlock() clean up its own entry? Because it doesn't know which entry belongs to it (rl_q_owner is INVARIANTS-only)? Or is it because the design is trying to improve scalability by reducing the number of threads that might be performing queue operations? Or something else?

To clean an entry, we need a stable pointer to the previous entry in the lock list. We do know which entry we clear, but to find the previous one, in order to CAS-remove the unlocked entry, we need to walk the list from the start.
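
A minimal sketch of why that is, with hypothetical names and the marking protocol elided: in a singly-linked lock-free list, the only way to detach an entry is to CAS the link of the entry in front of it, and that predecessor can only be rediscovered by walking from the head.

```c
#include <sys/param.h>
#include <sys/systm.h>		/* atomic(9) */

/* Illustrative only; the real code also has to honor marked pointers. */
struct rl_q_entry {
	struct rl_q_entry *rl_q_next;
	/* range bounds, flags, ... */
};

static bool
rl_try_unlink(struct rl_q_entry **headp, struct rl_q_entry *e)
{
	struct rl_q_entry *prev, *cur;
	volatile uintptr_t *linkp;

	/* There is no back pointer, so the walk must start at the head. */
	for (prev = NULL, cur = *headp; cur != NULL && cur != e;
	    prev = cur, cur = cur->rl_q_next)
		;
	if (cur == NULL)
		return (false);		/* already removed by someone else */

	/* Detach e by swinging its predecessor's link over it. */
	linkp = prev == NULL ? (volatile uintptr_t *)headp :
	    (volatile uintptr_t *)&prev->rl_q_next;
	return (atomic_cmpset_ptr(linkp, (uintptr_t)e,
	    (uintptr_t)e->rl_q_next) != 0);
}
```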

In D41787#959693, @olce.freebsd_certner.fr wrote:

I may be missing something obvious, but I don't quite understand the need for rl_q_free. Isn't the purpose of UMA's SMR zones exactly to handle deferred freeing? In other words, why isn't it possible to replace the pieces of code inserting unlocked entries into rl_q_free with straight calls to uma_zfree_smr()? Is it because rl_insert(), and thus these pieces, are called in an SMR section?

Because the lock list is traversed unlocked. You cannot call uma_zfree_smr() from an SMR section anyway.
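
A rough sketch of the resulting pattern, not the actual rangelock code; the rl_smr handle, the rl_entry_zone zone, and the rl_q_free link are assumed names. The point is that entries retired while inside the SMR section are only handed to uma_zfree_smr() after smr_exit().

```c
#include <sys/param.h>
#include <sys/smr.h>
#include <vm/uma.h>

/* Illustrative only: batching deferred frees collected under smr_enter(). */
struct rl_q_entry {
	struct rl_q_entry *rl_q_next;
	struct rl_q_entry *rl_q_free;	/* local free-list link (assumed) */
};

extern smr_t rl_smr;			/* assumed SMR handle */
extern uma_zone_t rl_entry_zone;	/* assumed SMR-enabled UMA zone */

static void
rl_walk_and_gc(struct rl_q_entry **headp)
{
	struct rl_q_entry *freelist, *e, *next;

	freelist = NULL;
	smr_enter(rl_smr);
	e = *headp;	/* start of the (elided) lock-free walk */
	/*
	 * ... traversal; entries CAS-removed from the list are chained
	 * onto 'freelist' via their rl_q_free link ...
	 */
	smr_exit(rl_smr);

	/* Only now may the entries be returned to UMA. */
	for (e = freelist; e != NULL; e = next) {
		next = e->rl_q_free;
		uma_zfree_smr(rl_entry_zone, e);
	}
}
```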

It is too bad that the list has to be traversed from the start in rl_w_validate(). Using a doubly-linked list would allow starting from e, but it has several drawbacks as well (concurrency would be harder, more memory consumption). Does anyone have an idea of how many rangelocks there typically are on a single vnode, on average and at worst?

How would you implement a lock-free doubly-linked list? There are some designs, but they are so hard that it is not practical.

There is exactly one rangelock on a vnode for its entire lifetime. The number of locks taken in parallel is completely determined by userspace. I think it is huge in practice for consumers like postgres, any other database, or varnish, but typically at most one for 'trivial' consumers like the usual unix text processing tools.

sys/kern/kern_rangelock.c
75–77

On 32-bit arches, pointers are only word-aligned (4 bytes). Cheat locks require the 3 lower bits to be usable.

In fact, RL_CHEAT_CHEATING was strategically placed in the lowest bit, which cannot be non-zero for a real pointer, but still.
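
To make the bit discussion concrete, here is an illustrative encoding of 'head' in cheat mode together with a read-lock attempt; the flag values, field names, and function are assumptions modeled on the description in this review, not the committed definitions.

```c
#include <sys/param.h>
#include <sys/systm.h>		/* atomic(9) */

/* Illustrative cheat-mode encoding; the real values are in kern_rangelock.c. */
#define	RL_CHEAT_CHEATING	0x1	/* head holds flags, not a pointer */
#define	RL_CHEAT_WLOCKED	0x2	/* write-locked (assumed)	   */
#define	RL_CHEAT_DRAINING	0x4	/* converting to a real rangelock  */
#define	RL_CHEAT_MASK		0x7	/* flag bits, valid when CHEATING  */
#define	RL_CHEAT_READER		0x8	/* one reader reference (assumed)  */

struct rangelock {
	uintptr_t	head;		/* flag word or rl_q_entry pointer */
	bool		sleepers;
};

static bool
rangelock_cheat_tryrlock(struct rangelock *lock)
{
	uintptr_t v, x;

	v = atomic_load_ptr(&lock->head);
	for (;;) {
		/* Fall back once a conflict has forced the real rangelock. */
		if ((v & RL_CHEAT_CHEATING) == 0 ||
		    (v & (RL_CHEAT_WLOCKED | RL_CHEAT_DRAINING)) != 0)
			return (false);
		x = v + RL_CHEAT_READER;	/* bump the reader count */
		if (atomic_fcmpset_ptr(&lock->head, &v, x) != 0)
			return (true);
	}
}
```

In such a scheme only RL_CHEAT_CHEATING has to be distinguishable from a genuine pointer value, which is the crux of the alignment discussion above.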

128

Page busying, the same as for pgread() on tmpfs/ufs.

193–205

If we are draining, there are waiters.

293

It avoids the atomic for the case of a non-empty list. It also pre-populates the cache line.
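
The idiom in question, as a minimal sketch (the function name and calling convention are made up): a plain load filters out the common non-empty case before paying for an atomic RMW, and touches the cache line that the CAS would need anyway.

```c
#include <sys/param.h>
#include <sys/systm.h>		/* atomic(9) */

/* Illustrative only. */
struct rl_q_entry;

static bool
rl_q_try_install_first(uintptr_t *headp, struct rl_q_entry *e)
{
	uintptr_t v;

	v = atomic_load_ptr(headp);	/* plain read; also warms the line */
	if (v != 0)
		return (false);		/* list non-empty: no atomic paid */
	return (atomic_fcmpset_ptr(headp, &v, (uintptr_t)e) != 0);
}
```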

kib marked 4 inline comments as done. Oct 4 2023, 6:18 PM
kib added inline comments.
sys/kern/kern_rangelock.c
128

Also note that for the non-cheating lock, the acquire and release atomics for lock and unlock currently occur on different memory locations. For acquire, it is the next pointer of the previous lock entry, and for release it is the next pointer of the current lock entry. So this effectively depends on our acq/rel implementations being seq-cst (which they are at least on x86 and arm, on arm due to store-release total ordering).

kib marked 7 inline comments as done.Oct 4 2023, 7:10 PM
kib added inline comments.
sys/kern/kern_rangelock.c
66–68

Isn't it all explained in the herald comment?

100

Because the thread already paid a penalty by not getting access to the resource immediately. So when it is given CPU time after the sleep, it is fair to increase its priority, both to compensate for the competition and to facilitate use of the resource so that it is freed quicker. The latter is achieved by not competing in scheduling with threads that have not yet experienced blocking.

kib marked 2 inline comments as done.

Do not forget kick_proc0().
Only wakeup if there are waiters.
Minor review tweaks.

sys/kern/kern_rangelock.c
277

What happens if two threads attempt to unmark the same node? Presumably one of them will trip the assertion in rl_e_unmark().

327

Same here, I think two threads might both try to unmark the same node.

kib marked 2 inline comments as done. Oct 4 2023, 7:26 PM
kib added inline comments.
sys/kern/kern_rangelock.c
277

Isn't next a local var? If another thread unmarked the memory (or rather, removed the entry from the list), the CAS below fails and we retry the iteration.

markj added inline comments.
sys/kern/kern_rangelock.c
277

Oops, yes.

This revision is now accepted and ready to land. Oct 4 2023, 7:30 PM
In D41787#959772, @kib wrote:

Because the lock list is traversed unlocked. You cannot call uma_zfree_smr() from smr section anyway.

Then that sounds like a useful new feature to add to uma_zfree_smr(), if technically possible. Something for later.

How would you implement lock-free double-linked list? There are some designs, but they are so hard that it is not practical.

Yes, that would be my fear. But browsing the list from the start can easily be wasteful with lots of ranges. I'll think about it.

The number of locks taken in parallel is completely determined by userspace. I think it is huge in practice for consumers like postgres, any other database, or varnish, but typically at most one for 'trivial' consumers like the usual unix text processing tools.

Yes, this is what my question was really about, thanks. If there are lots of ranges, the previous point could become a problem.

sys/kern/kern_rangelock.c
42–45

I think the text is slightly misleading (the last part describes the "cheating" mode, not the other one). Here are suggestions to amend it. (You'll have to refill the paragraph.)

46–48

Suggestions to improve clarity.

66–68

Indeed. I just missed it initially. Error between chair and keyboard.

75–77

Cheat locks require the 3 lower bits to be usable.

They need several more, depending on the number of readers (logarithmically).

In fact, RL_CHEAT_CHEATING was strategically placed in the lowest bit, which cannot be non-zero for a real pointer, but still.

Exactly. This requires only 2-byte alignment. And since RL_CHEAT_CHEATING gates the use of all other flags, 4- or 8-byte alignment doesn't matter.

Could you please simply use UMA_ALIGNOF() instead (or whatever is the proper idiom for natural alignment)?

100

This is already handled by ULE, although probably not to the same extent, and generally should probably be left for the scheduler to figure out on its own. But I understand this is still existing practice, and changing that is out-of-scope for this review.

193–205

Sorry, the first sentence expressed my original idea incorrectly. There are waiters indeed, but we are not going to signal them unless we are the last waiter. So I think the rest of the point is valid: there is no need to take the lock if we are not the last waiter (i.e., we are not going to signal), and the fcmpset doesn't need a lock.
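
A sketch of the suggested shape, reusing the illustrative RL_CHEAT_* encoding and struct rangelock from the earlier sketch; the wait channel and the exact drain protocol are assumptions. Only the path that will actually signal takes the sleepqueue chain lock.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/sleepqueue.h>

/* Illustrative only: read-unlock of a cheat lock, draining-aware. */
static void
rangelock_cheat_runlock(struct rangelock *lock)
{
	uintptr_t v, x;

	v = atomic_load_ptr(&lock->head);
	for (;;) {
		x = v - RL_CHEAT_READER;	/* drop our reader reference */
		if ((v & RL_CHEAT_DRAINING) == 0 ||
		    (x & ~RL_CHEAT_MASK) != 0) {
			/*
			 * Not the last reader of a draining lock: nothing to
			 * signal, so no sleepqueue chain lock is needed.
			 */
			if (atomic_fcmpset_ptr(&lock->head, &v, x) != 0)
				return;
			continue;
		}
		/* Last reader while draining: we will signal, lock first. */
		sleepq_lock(&lock->head);
		if (atomic_fcmpset_ptr(&lock->head, &v, x) != 0) {
			sleepq_broadcast(&lock->head, SLEEPQ_SLEEP, 0, 0);
			sleepq_release(&lock->head);
			return;
		}
		sleepq_release(&lock->head);
	}
}
```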

293

I thought the "compare and compare and swap" idiom would have been long gone by 2023, with CPUs performing the relevant optimizations, but maybe not after all.

kib marked 9 inline comments as done. Oct 4 2023, 11:15 PM
kib added inline comments.
sys/kern/kern_rangelock.c
100

Specifically, this is not handled by the scheduler, and I do not see how it could be handled, unless we pre-assign a priority to each wait channel.

kib marked an inline comment as done.

When unlocking a read-locked cheat lock that is draining, only lock the sleepqueue on the last wakeup.
Editorial changes in comments.

This revision now requires review to proceed. Oct 4 2023, 11:18 PM

Thanks!

sys/kern/kern_rangelock.c
100

It is "handled", indirectly, in the sense that a (user) thread waiting for a resource will have its sleep time increase, which will then influence the final priority ULE assigns to it at wakeup (normally, increase it). This takes care of the first point ("compensate for the competition") in a different way, imperfectly for sure, but still in a much better manner than explicitly setting an absolute priority, which is also a globally unfair action. That an action is needed for the second point rather reflects that we are missing a mechanism to ensure fairness for received rangelock requests across draining. I don't think it necessarily needs to be added now (or perhaps at all), time and tests will tell.

This revision is now accepted and ready to land. Oct 5 2023, 7:21 AM
sys/kern/kern_rangelock.c
100

Note that this priority parameter is effectively ignored by ULE by default. See the handling of static_boost in sched_sleep().

kib marked an inline comment as done. Oct 5 2023, 11:56 PM
kib added inline comments.
sys/kern/kern_rangelock.c
100

I do not see it being ignored. static_boost is PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE, and if the current thread priority is larger than that, the boost is applied.

So this is in fact strange, because interactive threads are put after non-interactive ones (as classified by ULE) after off-CPU events, and all PVM/PVFS... priorities are higher compared with timeshare.

sys/kern/kern_rangelock.c
100

But note that sched_sleep() does sched_prio(td, static_boost) in this case, not sched_prio(td, prio). So this merely ensures that an awoken timesharing thread will get some preference over other timesharing threads.

sys/kern/kern_rangelock.c
100

Ah indeed. I had already analyzed it a while ago but didn't connect the dots. Thanks.

A sleeping thread's priority during sleep is normally used when a sleepqueue chooses a thread to wake up (unless SLEEPQ_UNFAIR is passed) via sleepq_signal(). Here, sleepq_broadcast() is used and will wake threads in the order in which they were put to sleep, regardless of their priorities. Then, it's up to the scheduler to choose CPUs and possibly preempt existing threads there (and possibly preempt threads it has just started as it wakes up more sleeping ones, if they are numerous). Low-priority threads may be started before high-priority ones, but probably not by a large margin. This is probably fair enough (efficiency could be improved).

This gives me the opportunity to correct part of what I said above:

That an action is needed for the second point rather reflects that we are missing a mechanism to ensure fairness for received rangelock requests across draining.

In fact, ensuring fairness is not only a concern across draining, but in general when serving range requests. And there doesn't seem to be an obvious problem here, given the above. The only "problem" is if some resource (the range lock here) is held for too long by a low-priority thread, but there, given we are talking about long-term sleep ("sleepable" in FreeBSD parlance), I don't think borrowing priorities is a good answer. The only answers that make sense to me for now would be to penalize such a thread at the *next* resource acquisition, and/or, more drastically, to forcibly break the lock and send a signal to the offending thread (with default action, e.g., to kill the process).

kib marked an inline comment as done.

Rebase after swapout removal.

This revision now requires review to proceed. Aug 1 2024, 12:47 AM