
lockmgr: introduce lock_sleepgen
Needs Review · Public

Authored by rlibby on Jun 17 2024, 9:15 PM.
Details

Summary

The idea is that a locker can safely sleep without an interlock if it
registers as a holder by incrementing an atomic counter associated with
the lock and snapshots the lock's sleep generation ("sleepgen"). After
taking the sleepq lock but before going to sleep, we check that the
sleepgen is still current. If not, we skip the sleep and return ENOLCK.


This is a work in progress, but I wanted to share in order to see if
there are objections to the basic idea, and get feedback before I start
polishing. The point of this is again to reduce contention on the
bufobj interlock when there is a lot of concurrent activity against a
file or vnode.

I intend to put something like the following into a comment in
kern_lock.c, and also to add entries to the lock.9 man page.


The basic idea is to provide an alternative to an interlock for safely
sleeping on a lock without a wakeup race and when the identity of the
lock could change. This can avoid lock contention on an interlock.
Applied to bufs, this means we can sleep on buf locks without the bufobj
interlock.

In short, a new locker avoids sleeping if the sleepgen changes
after it first loads it and checks conditions. An invalidator bumps the
sleepgen and wakes up any sleepers.

The new locker first registers itself as a "holder" of the sleepgen
lock by atomically incrementing the lock's holder count. AFTER that, it
loads the lock's current sleepgen. AFTER that, it checks the identity
conditions for sleeping on the lock (e.g. for buf, it checks that the
b_bufobj and b_lblkno are as expected).

At this point, the new locker still holds no lock, and doesn't even know
that the conditions that it just checked were satisfied under the
sleepgen that it loaded (the sleepgen could already have been bumped).

The new locker then proceeds into lockmgr, passing along the sleepgen
that it loaded. If it gets to the point where it would need to sleep to
acquire the lock, then AFTER acquiring the sleepq lock it checks whether
the sleepgen is still current. If so, it can safely sleep. Otherwise,
it errors out with ENOLCK. In any case, after returning from lockmgr,
it atomically decrements the lock's holder count.
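
To make the ordering concrete, here is a minimal sketch of the
new-locker side, condensing the steps above. The struct layout and the
lockmgr_sleepgen() entry point are illustrative assumptions, not
necessarily the API in this diff:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/buf.h>
    #include <sys/bufobj.h>
    #include <sys/lock.h>
    #include <sys/lockmgr.h>
    #include <machine/atomic.h>

    /* Illustrative sleepgen-capable lock; the field names are assumed. */
    struct sg_lock {
            struct lock     sg_lk;          /* underlying lockmgr lock */
            volatile u_int  sg_holders;     /* registered new lockers */
            volatile u_int  sg_sleepgen;    /* bumped by invalidators */
    };

    static int
    sg_lock_acquire(struct sg_lock *sg, struct buf *bp, struct bufobj *bo,
        daddr_t lblkno)
    {
            u_int sleepgen;
            int error;

            /*
             * 1. Register as a holder.  This must be ordered before the
             * sleepgen load; the exact fences are elided in this sketch.
             */
            atomic_add_int(&sg->sg_holders, 1);

            /* 2. Load the lock's current sleepgen. */
            sleepgen = atomic_load_int(&sg->sg_sleepgen);

            /* 3. Check the identity conditions for sleeping on the lock. */
            if (bp->b_bufobj != bo || bp->b_lblkno != lblkno) {
                    atomic_subtract_int(&sg->sg_holders, 1);
                    return (ENOLCK);
            }

            /*
             * 4. Enter lockmgr, passing the loaded sleepgen.  If lockmgr
             * must sleep, it re-checks the sleepgen under the sleepq lock
             * and fails with ENOLCK if the generation has moved on.
             * (lockmgr_sleepgen() is a hypothetical name.)
             */
            error = lockmgr_sleepgen(&sg->sg_lk, LK_EXCLUSIVE, sleepgen);

            /* 5. Deregister in all cases. */
            atomic_subtract_int(&sg->sg_holders, 1);
            return (error);
    }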

The other actor is a thread which might invalidate the identity
conditions. (In the buf example, this would be brelvp().) AFTER the
conditions are invalidated, the invalidator checks if there are any
holders. If not, it does nothing further. Otherwise it increments the
sleepgen, acquires the sleepq lock, and wakes anyone sleeping on the
lock.
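
In the same illustrative terms, the invalidator side might look like
the following, with lockmgr_wakeall() being the wakeup primitive
mentioned below, assumed to wake all sleepers under the sleepq lock:

    static void
    sg_lock_invalidate(struct sg_lock *sg)
    {
            /*
             * Precondition: the identity conditions have already been
             * invalidated (e.g. brelvp() has cleared b_bufobj); that
             * must be ordered before the holder-count load.
             */
            if (atomic_load_int(&sg->sg_holders) == 0)
                    return;         /* no registered lockers */

            /* Bump the generation so registered lockers refuse to sleep... */
            atomic_add_int(&sg->sg_sleepgen, 1);

            /* ...and wake anyone who went to sleep before the bump. */
            lockmgr_wakeall(&sg->sg_lk);
    }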

If the conditions were to be invalidated after a new locker registers
as a holder, the invalidator will see the hold count, bump the sleepgen,
and issue a wakeup. If the invalidation completes before the new locker
loads the sleepgen, the sleepgen it loads is current. If it completes
after the new locker loads its sleepgen, then the new locker will not
sleep. If the invalidator completes after the new locker acquires the
sleepq lock and goes to sleep, then it will be woken.

The invalidator and new locker need to cooperate on the conditions that
validate the sleep.

Some effort has been made to reduce the extra work needed for the
non-contending cases. However, the sleepgen lock will still be somewhat
more expensive, especially for the new locker, than the non-sleepgen
lock, due to extra atomic operations. It probably makes most sense for
the new locker to first optimistically attempt an LK_NOWAIT lock, and
then fall back to the SLEEPGEN lock.
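
For instance, a caller following that recommendation might look like
this, reusing the hypothetical sg_lock_acquire() helper sketched above:

    /* Fast path: a plain try-lock avoids the holder-count atomics. */
    if (lockmgr(&sg->sg_lk, LK_EXCLUSIVE | LK_NOWAIT, NULL) == 0)
            return (0);

    /* Slow path: register as a holder and use the sleepgen protocol. */
    return (sg_lock_acquire(sg, bp, bo, lblkno));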


I have also thought of a significantly less complex approach that also
avoids the bufobj interlock. That would be to use an alternative global
rwlock as an interlock, which could be one or several sharded locks
(e.g. by bufdomain). Then the only lockmgr support needed would be the
lockmgr_wakeall.

The advantage of the alternative would be simplicity, and needing one
less int in buf, but it would be a less complete and less general
solution.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

A variant idea with a smaller API, but with a callback invoked from under the sleepq lock instead. No more sleepgen, and the caller passes a pointer to the holder count to avoid defining a new lock type.
https://github.com/rlibby/freebsd/commits/lockmgr-condwait/

[I don't have a lot of stake in this code and my thoughts here aren't very nuanced.]

> I have also thought of a significantly less complex approach that also avoids the bufobj interlock.

Given that we're unlikely to see new consumers of lockmgr going forward, I wonder whether sleepgen might someday be useful for avoiding the vnode interlock in some cases? If not, then this seems like a lot of machinery to deal with one lock, though the diff isn't too big.

> I have also thought of a significantly less complex approach that also
> avoids the bufobj interlock. That would be to use an alternative global
> rwlock as an interlock, which could be one or several sharded locks
> (e.g. by bufdomain). Then the only lockmgr support needed would be the
> lockmgr_wakeall.

This solution suggests that you're seeing bufobj lock contention that's partially due to some false sharing, i.e., the bufobj lock is being used for multiple purposes. Out of curiosity, where else do you see bufobj lock contention?

Overall, both approaches seem reasonable to me. It took me some time to digest the idea behind lock_sleepgen, but it's not very complex and seems general enough to be included in the lockmgr interface.

> [I don't have a lot of stake in this code and my thoughts here aren't very nuanced.]

>> I have also thought of a significantly less complex approach that also avoids the bufobj interlock.

> Given that we're unlikely to see new consumers of lockmgr going forward, I wonder whether sleepgen might someday be useful for avoiding the vnode interlock in some cases? If not, then this seems like a lot of machinery to deal with one lock, though the diff isn't too big.

I haven't looked at vnode locks vs the vnode interlock, but I can try looking around to see if there might be applications.

I agree it would be better to avoid the complexity. I am trying to whittle it down. Did you happen to look at what I have pushed here?
https://github.com/rlibby/freebsd/commits/lockmgr-condwait/

I have it down to not needing the sleepgen or the holder count -- and so no special lock type or extra storage at all -- but the catch is that I have to plumb a callback and arg that gets called under the sleepq lock, and we may rarely set lock waiter bits and then not sleep (hopefully no worse than a normal timeout?).
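
Roughly, the shape on that branch is the following; the
lockmgr_condwait() name and signature are my paraphrase of the work in
progress, not a final API (kernel includes as in the earlier sketch):

    /* Argument bundle for the predicate; illustrative only. */
    struct bufwait_arg {
            struct buf      *bp;
            struct bufobj   *bo;            /* expected b_bufobj */
            daddr_t          lblkno;        /* expected b_lblkno */
    };

    /*
     * Called under the sleepq lock; returning false aborts the sleep
     * and makes lockmgr fail with ENOLCK.
     */
    static bool
    bufwait_valid(void *arg)
    {
            struct bufwait_arg *a = arg;

            return (a->bp->b_bufobj == a->bo && a->bp->b_lblkno == a->lblkno);
    }

    static int
    bufwait_lock(struct buf *bp, struct bufobj *bo, daddr_t lblkno)
    {
            struct bufwait_arg a = { .bp = bp, .bo = bo, .lblkno = lblkno };

            /* The callback and arg are plumbed down to the sleepq code. */
            return (lockmgr_condwait(&bp->b_lock, LK_EXCLUSIVE,
                bufwait_valid, &a));
    }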

>> I have also thought of a significantly less complex approach that also
>> avoids the bufobj interlock. That would be to use an alternative global
>> rwlock as an interlock, which could be one or several sharded locks
>> (e.g. by bufdomain). Then the only lockmgr support needed would be the
>> lockmgr_wakeall.

> This solution suggests that you're seeing bufobj lock contention that's partially due to some false sharing, i.e., the bufobj lock is being used for multiple purposes.

Yes, that solution definitely has false sharing. The idea would be that if it's "rare enough", it doesn't matter. In that solution, you only take the global or sharded rwlock in read mode if you first fail the try-lock, meaning you actually have to wait on a buf lock; and you only take it in write mode if someone is in such a wait for your particular buf when you brelvp. This was rare in my single-file testing, but may not be in general.
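
A hedged sketch of that alternative; the shard count and mapping are
invented, and lockmgr's rwlock-interlock variant (lockmgr_rw() with
LK_INTERLOCK) is assumed usable with the shard as the interlock:

    /* Sharded pseudo-interlocks, e.g. one per bufdomain. */
    static struct rwlock buf_wait_locks[BUF_WAIT_SHARDS];

    /* Waiter side, reached only after LK_NOWAIT on b_lock failed. */
    static int
    buf_lock_sharded(struct buf *bp, struct bufobj *bo, daddr_t lblkno)
    {
            struct rwlock *rw = &buf_wait_locks[BUF_WAIT_SHARD(bp)];

            rw_rlock(rw);
            if (bp->b_bufobj != bo || bp->b_lblkno != lblkno) {
                    rw_runlock(rw);
                    return (ENOLCK);
            }
            /* LK_INTERLOCK drops rw once we hold or are queued on b_lock. */
            return (lockmgr_rw(&bp->b_lock, LK_EXCLUSIVE | LK_INTERLOCK, rw));
    }

    /* Invalidator side, from brelvp(), only when a waiter may exist. */
    static void
    buf_wakeall_sharded(struct buf *bp)
    {
            struct rwlock *rw = &buf_wait_locks[BUF_WAIT_SHARD(bp)];

            rw_wlock(rw);
            lockmgr_wakeall(&bp->b_lock);   /* the one new lockmgr primitive */
            rw_wunlock(rw);
    }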

> Out of curiosity, where else do you see bufobj lock contention?

The major issues that I have been able to reduce but not really resolve are from these actors:

  • getblk - needs bufobj lock to insert new bufs. If I can get this patch or something like it in, then most of the other cases will have been removed.
  • bufspacedaemon / buf_recycle / brelse / brelvp - needs bufobj lock to disassociate bufs.
  • reassignbuf also needs bufobj lock, but hasn't been a focus of my testing (I've been focusing on single-file read throughput).
  • Once bufobj is reduced, the vm object lock starts to show up.

Potentially objectionable hacks I have not shared yet:

  • Make bufspacedaemon batch bufobj locks when possible (like page daemon), which can reduce useless spinning due to contention, but doesn't reduce work done under the lock.
  • TAILQ_REMOVE_PREFETCH() - prefetch tailq links in brelvp before taking the bufobj lock, surprisingly effective (a guessed sketch follows this list).
  • rw_lock_contended() - reduce the penalty to the wlock holder at rw_wlock sites that may be heavily contended (bgetvp, brelvp, vm_page_grab_pages_unlocked, page daemon).
  • vm_page_grab_pages_prealloc() - on miss from vm_page_grab_pages_unlocked, preallocate all the pages we may need before taking the vm object lock, then just insert them.
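
For example, the prefetch hack could be as small as the following guess
at the macro (not the actual patch):

    #include <sys/queue.h>

    /*
     * Warm the cache lines that TAILQ_REMOVE() will touch, before the
     * bufobj lock is taken, so the removal under the lock does not
     * stall on memory.  The second __builtin_prefetch() argument
     * requests a prefetch for write.
     */
    #define TAILQ_REMOVE_PREFETCH(elm, field) do {                  \
            __builtin_prefetch(TAILQ_NEXT((elm), field), 1);        \
            __builtin_prefetch((elm)->field.tqe_prev, 1);           \
    } while (0)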

It seems fundamentally bufobj has trouble scaling with concurrency. Maybe vm object too, but bufobj is in my way first.

> Overall, both approaches seem reasonable to me. It took me some time to digest the idea behind lock_sleepgen, but it's not very complex and seems general enough to be included in the lockmgr interface.

Thank you for looking.

> Given that we're unlikely to see new consumers of lockmgr going forward, I wonder whether sleepgen might someday be useful for avoiding the vnode interlock in some cases? If not, then this seems like a lot of machinery to deal with one lock, though the diff isn't too big.

The vnode interlock has not been needed for anything lockmgr-related for several years now. Any case of holding it while calling into VOP_LOCK is a holdover which needs to be cleaned up.

Sorry to be slow to comment. This change looks functionally correct. The main issue for me is to understand what sort of workload it helps. Do you have an example benchmark whose performance is improved with this change?