This splits the buffer cache from a single managed entity into multiple 'buf domains'. A bufobj is associated with a buf domain. Each domain manages its own buf space and buf count limits, has its own wait channels, and runs its own bufspace daemon. Each domain additionally has an optional per-cpu clean queue to alleviate pressure on the clean queue lock. A single global bufspace is no longer supportable on large machines; it becomes a point of significant contention, whether for the atomic bufspace variable or for the locks synchronizing sleeps and wakeups.
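A rough sketch of the per-domain state described above; the field names here are hypothetical and do not match the actual definition in the kernel sources:

```c
/*
 * Hypothetical sketch of per-domain buffer cache state.  Field names are
 * illustrative only; the real structure lives in the kernel sources.
 */
struct bufqueue;			/* per-queue lock + list of bufs */
struct bufdomain_sketch {
	long		 bd_bufspace;	 /* buf space charged to this domain */
	long		 bd_maxbufspace; /* per-domain buf space limit */
	int		 bd_freebuffers; /* per-domain buf count accounting */
	void		*bd_wait;	 /* domain-private wait channel */
	struct bufqueue	*bd_cleanq;	 /* shared clean queue */
	struct bufqueue	*bd_pcpu_cleanq; /* optional per-CPU clean queues */
	/* each domain also runs its own bufspace daemon */
};
```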
The KVA space is actually still managed by a single vmem that uses per-cpu quantum caches. The buf headers themselves are still managed in per-cpu uma caches. All global counters have been switched from atomics to counter(9).
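The two KPIs mentioned above look roughly like this in use; the arena name, sizes, and counter are made up for illustration and are not the buffer cache's actual initialization:

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/counter.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

static vmem_t *example_arena;
static counter_u64_t example_bufs_created;

static void
example_init(vmem_addr_t base, vmem_size_t size)
{
	/*
	 * A nonzero qcache_max (second-to-last argument) gives the arena
	 * per-CPU quantum caches, so most allocations up to that size
	 * avoid taking the arena lock.
	 */
	example_arena = vmem_create("example kva", base, size,
	    PAGE_SIZE, 16 * PAGE_SIZE, M_WAITOK);

	/* counter(9) keeps per-CPU counters; updates need no atomics. */
	example_bufs_created = counter_u64_alloc(M_WAITOK);
}

static void
example_account_one(void)
{
	counter_u64_add(example_bufs_created, 1);
}
```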
There is a second level of lock avoidance with B_REUSE. This favors a second-chance algorithm in buf_recycle() rather than directly adjusting queue position on every access to a buf. It means frequently read bufs may linger slightly longer and buf_recycle() may have to visit more bufs before it completes successfully. However, for things like indirect blocks, which are read very frequently, this was a substantial reduction in lock contention.
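A self-contained model of the second-chance idea: readers only set a flag, and the recycle scan requeues flagged bufs instead of evicting them. Names here are illustrative; this is not the kernel's buf_recycle().

```c
#include <sys/queue.h>

#define	BM_REUSE	0x01		/* models B_REUSE */

struct buf_model {
	int			flags;
	TAILQ_ENTRY(buf_model)	link;
};
TAILQ_HEAD(bufq_model, buf_model);

/* On access: only set a flag, so no queue lock is taken. */
static void
buf_access(struct buf_model *bp)
{
	bp->flags |= BM_REUSE;
}

/* On recycle: recently used bufs get a second chance at the tail. */
static struct buf_model *
buf_recycle_model(struct bufq_model *q)
{
	struct buf_model *bp;

	while ((bp = TAILQ_FIRST(q)) != NULL) {
		if (bp->flags & BM_REUSE) {
			bp->flags &= ~BM_REUSE;
			TAILQ_REMOVE(q, bp, link);
			TAILQ_INSERT_TAIL(q, bp, link);
			continue;
		}
		TAILQ_REMOVE(q, bp, link);
		return (bp);		/* victim to reclaim */
	}
	return (NULL);
}
```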
I don't really like bufqueue() and bufqueue_acquire(). They work, but they are not elegant. I would welcome suggestions for other mechanisms. The rest is fairly typical divide-and-conquer locking work.
I changed bufspace_reserve() to use only an atomic add. I believe there are no races in which it misses wakeups, but I could be mistaken.
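A minimal model of the atomic-add reservation pattern, using C11 atomics rather than the kernel's atomic(9); the names and the exact limit/back-out handling are assumptions, not the actual bufspace_reserve() logic. The open question in the text is whether this add-only scheme can race with sleepers and miss a wakeup.

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic long space_used;
static long space_limit = 1L << 20;

/*
 * Reserve 'size' bytes with a single atomic add.  If the add pushes us
 * over the limit, back the reservation out and fail; the caller would
 * then wake the bufspace daemon and sleep.
 */
static bool
space_reserve(long size)
{
	long prev;

	prev = atomic_fetch_add(&space_used, size);
	if (prev + size > space_limit) {
		atomic_fetch_sub(&space_used, size);
		return (false);
	}
	return (true);
}
```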