Page MenuHomeFreeBSD

make unix socket locking finer grained
ClosedPublic

Authored by mmacy on May 14 2018, 4:49 PM.

Details

Summary

Currently sends on unix sockets contend heavily on read locking the list lock. unix1_processes in will-it-scale peaks at 6 processes and then declines.

With this change I get a substantial improvement in number of operations per second with 96 processes:

x before
+ after
    N           Min           Max        Median           Avg        Stddev
x  11       1688420       1696389       1693578     1692766.3     2971.1702
+  10      63417955      71030114      70662504      69576423     2374684.6
Difference at 95.0% confidence
        6.78837e+07 +/- 1.49463e+06
        4010.22% +/- 88.4246%
        (Student's t, pooled s = 1.63437e+06)

"Small" iron changes (1, 2, and 4 processes):

x before1
+ after1.2
+------------------------------------------------------------------------+
|                                                                  +     |
|                                                           x      +     |
|                                                           x      +     |
|                                                           x      +     |
|                                                           x     ++     |
|                                                          xx     ++     |
|x                                                       x xx     ++     |
|                                  |__________________A_____M_____AM____||
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       1131648       1197750     1197138.5     1190369.3     20651.839
+  10       1203840       1205056       1204919     1204827.9     353.27404
Difference at 95.0% confidence
        14458.6 +/- 13723
        1.21463% +/- 1.16683%
        (Student's t, pooled s = 14605.2)

x before2
+ after2.2
+------------------------------------------------------------------------+
|                                                                       +|
|                                                                       +|
|                                                                       +|
|                                                                       +|
|                                                                       +|
|                                                                       +|
|           x                                                           +|
|           x                                                           +|
|         x xx                                                          +|
|x        xxxx                                                          +|
|      |___AM_|                                                         A|
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       1972843       2045866     2038186.5     2030443.8     21367.694
+  10       2400853       2402196     2401043.5     2401172.7     385.40024
Difference at 95.0% confidence
        370729 +/- 14198.9
        18.2585% +/- 0.826943%
        (Student's t, pooled s = 15111.7)

x before4
+ after4.2
    N           Min           Max        Median           Avg        Stddev
x  10       3986994       3991728     3990137.5     3989985.2     1300.0164
+  10       4799990       4806664     4806116.5       4805194     1990.6625
Difference at 95.0% confidence
        815209 +/- 1579.64
        20.4314% +/- 0.0421713%
        (Student's t, pooled s = 1681.19)

Updated locking protocol:

* Locking and synchronization:
 *
 * Three types of locks exist in the local domain socket implementation: a
 * a global linkage rwlock, the mtxpool lock, and per-unpcb mutexes.
 * The linkage lock protects the socket count, global generation number,
 * and stream/datagram global lists.
 *
 * The mtxpool lock protects the vnode from being modified while referenced.
 * Lock ordering requires that it be acquired before any unpcb locks.
 *
 * The unpcb lock (unp_mtx) protects all fields in the unpcb. Of particular
 * note is that this includes the unp_conn field. So long as the unpcb lock
 * is held the reference to the unpcb pointed to by unp_conn is valid. If we
 * require that the unpcb pointed to by unp_conn remain live in cases where
 * we need to drop the unp_mtx as when we need to acquire the lock for a
 * second unpcb the caller must first acquire an additional reference on the
 * second unpcb and then revalidate any state (typically check that unp_conn
 * is non-NULL) upon requiring the initial unpcb lock. The lock ordering
 * between unpcbs is the conventional ascending address order. Two helper
 * routines exist for this:
 *
 *   - unp_pcb_lock2(unp, unp2) - which just acquires the two locks in the
 *     safe ordering.
 *
 *   - unp_pcb_owned_lock2(unp, unp2, freed) - the lock for unp is held
 *     when called. If unp is unlocked and unp2 is subsequently freed
 *     freed will be set to 1.
 *
 * The helper routines for references are:
 *
 *   - unp_pcb_hold(unp): Can be called any time we currently hold a valid
 *     reference to unp.
 *
 *    - unp_pcb_rele(unp): The caller must hold the unp lock. If we are
 *      releasing the last reference, detach must have been called thus
 *      unp->unp_socket be NULL.
 *
 * UNIX domain sockets each have an unpcb hung off of their so_pcb pointer,
 * allocated in pru_attach() and freed in pru_detach().  The validity of that
 * pointer is an invariant, so no lock is required to dereference the so_pcb
 * pointer if a valid socket reference is held by the caller.  In practice,
 * this is always true during operations performed on a socket.  Each unpcb
 * has a back-pointer to its socket, unp_socket, which will be stable under
 * the same circumstances.
 *
 * This pointer may only be safely dereferenced as long as a valid reference
 * to the unpcb is held.  Typically, this reference will be from the socket,
 * or from another unpcb when the referring unpcb's lock is held (in order
 * that the reference not be invalidated during use).  For example, to follow
 * unp->unp_conn->unp_socket, you need to hold a lock on unp_conn to guarantee
 * that detach is not run clearing unp_socket.
Test Plan

stress by pho

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mmacy created this revision.May 14 2018, 4:49 PM
mmacy added a subscriber: mjg.May 14 2018, 4:53 PM
dvl added a subscriber: dvl.May 14 2018, 5:15 PM
emaste added a subscriber: emaste.May 14 2018, 5:31 PM
mmacy added a reviewer: mjg.May 14 2018, 8:22 PM
mmacy removed a subscriber: mjg.May 14 2018, 8:29 PM
markj added a comment.May 15 2018, 1:39 AM

This is hard to review because there is no description of the updated locking protocol in the code or in the review description.

sys/kern/uipc_usrreq.c
192 ↗(On Diff #42521)

This comment needs to be updated.

260 ↗(On Diff #42521)

Style: tab after #define.

There should not be a trailing semicolon.

310 ↗(On Diff #42521)

Style: missing blank line after the opening brace.

311 ↗(On Diff #42521)

The asserts are redundant, as we acquire both locks in every path through the function and the locks aren't recursive.

668 ↗(On Diff #42521)

This function looks like a predicate, but it may actually drop the unp pcb lock in order to avoid an LOR. Don't you therefore need to recheck for a null unp->unp_conn? I don't think the pool mutex provides enough synchronization.

710 ↗(On Diff #42521)

Style: this should come before the int declarations.

1408 ↗(On Diff #42521)

Style: the vplock declaration should come earlier.

1421 ↗(On Diff #42521)

This assert is redundant, we'll crash immediately on the next line if unp is null.

1519 ↗(On Diff #42521)

Again, it makes me nervous that this may actually unlock unp. Did you analyze all the call sites to ensure that the resulting races are harmless?

1750 ↗(On Diff #42521)

These destructions could be done in a _fini routine for the zone.

1809 ↗(On Diff #42521)

Style: extra blank lines.

1824 ↗(On Diff #42521)

Style: mixed spaces and tabs.

mmacy updated this revision to Diff 42560.May 15 2018, 2:30 AM
mmacy edited the summary of this revision. (Show Details)

All references to a unpcb (connection pointer or element in the ref list) bump the reference count. A unpcb can never be freed while another unp->unp_conn is pointing to it.

Passes at least some pho's stress tests now.

mmacy added a comment.May 15 2018, 2:40 AM

This is hard to review because there is no description of the updated locking protocol in the code or in the review description.

I thought I did that in the summary. We can discuss further once pho signs off on it.

mmacy updated this revision to Diff 42567.May 15 2018, 7:56 AM
mmacy edited the summary of this revision. (Show Details)

revise locking protocol to use an atomic refcount rather than the connection mutex

seanc added a comment.May 15 2018, 5:03 PM

I want to point out that the lack of code formatting tool like clang-format resulted in @markj spending some of his precious cycles reviewing not the substance of the patch, but instead commenting on style nits (correct according to style(9) - thank you @markj).

Using brain power for code formatting is the modern-day equivalent to creating fire by rubbing two sticks together instead of using a match/lighter. My comment isn't directed at anyone in particular, but is an observation from watching and participating in reviews in FreeBSD (vs other repos and orgs in my daily patch review queue where the review process is nearly 100% focused on the substance of the patch and not various nits that are caught by linting tools and formatters). I apologize if this sounds flippant, that is not my intent. I'm excited to see changes like these land but wish the friction of style(9) were obviated through modern tooling. This isn't a new comment to most on this review but I wanted to vent a little feedback.

imp added a comment.May 15 2018, 5:12 PM

I want to point out that the lack of code formatting tool like clang-format resulted in @markj spending some of his precious cycles reviewing not the substance of the patch, but instead commenting on style nits (correct according to style(9) - thank you @markj).
Using brain power for code formatting is the modern-day equivalent to creating fire by rubbing two sticks together instead of using a match/lighter. My comment isn't directed at anyone in particular, but is an observation from watching and participating in reviews in FreeBSD (vs other repos and orgs in my daily patch review queue where the review process is nearly 100% focused on the substance of the patch and not various nits that are caught by linting tools and formatters). I apologize if this sounds flippant, that is not my intent. I'm excited to see changes like these land but wish the friction of style(9) were obviated through modern tooling. This isn't a new comment to most on this review but I wanted to vent a little feedback.

To be fair, though, the friction from style(9) comments is 1/100th what it was years ago and just try to get something upstreamed to linux without getting one of those comments...

I want to point out that the lack of code formatting tool like clang-format

Unfortunately there are not existing tools that can produce style(9)-compliant output. clang-format and indent can get maybe 80% of the way there, but not yet close enough to build automation around it. I would like to extend clang-format to do so.

markj added a comment.May 15 2018, 5:58 PM

I want to point out that the lack of code formatting tool like clang-format resulted in @markj spending some of his precious cycles reviewing not the substance of the patch, but instead commenting on style nits (correct according to style(9) - thank you @markj).

Since we're venting... ;)

Most of the cycles that went into looking at this were actually spent trying to figure out what the high-level changes were, because the review description offers no useful details. I too often hear the feedback that FreeBSD reviews focus on style rather than substance, and as a frequent reviewer I'll say that a lot of committers (me not excepted) are bad at socializing their proposed large-scale changes. A not-uncommon pattern is to create a 500LOC+ phab review with a couple of sentences describing the change, and add a few random committers who recently touched the modified files. If you want a good review from me, I need to believe that my time isn't being wasted. I want some indication that the developer thought things through, is confident in the correctness, suitability and long-term maintainability of their patch, and has done some basic testing. I also want to believe that the developer is willing to substantially rework the patch in response to feedback from reviewers, if necessary. With all due respect, I don't really get those impressions from this review: the lack of a high-level description gives me the impression that the proposed locking protocol has not been fully thought out, and the fact that the change is still being revised indicates that it's not ready for review in the first place. (And now I'll get an email every time a revised version is uploaded, but won't know if I should spend time looking at the new rev.) There is no analysis of the effect of this change on the single-threaded case. The revision I looked last night at had some suspicious-looking races, and I can't immediately tell if my comments to that effect were addressed in the new revisions; I guess I'll just have to figure that out myself at some point. It feels like my time isn't important, and that frustrates me a lot more than anything to do with style(9).

Regarding style(9), I know it well enough that I don't really need to think about it. I get that not everyone is that way, or wants to be that way. I point out style bugs because I spend a lot of time reading code and tend to infer things about it from how self-consistent it is, but I don't usually block a review on style bugs. It would take me < 2 minutes to fix all the style issues I pointed out here, much less than the total amount of time that probably went into this, so it shouldn't be a big deal IMHO. Sure, a tool to correct style bugs would be nice, but it will never be perfect.

mmacy added a comment.May 15 2018, 6:06 PM

I want to point out that the lack of code formatting tool like clang-format resulted in @markj spending some of his precious cycles reviewing not the substance of the patch, but instead commenting on style nits (correct according to style(9) - thank you @markj).

Since we're venting... ;)
Most of the cycles that went into looking at this were actually spent trying to figure out what the high-level changes were, because the review description offers no useful details. I too often hear the feedback that FreeBSD reviews focus on style rather than substance, and as a frequent reviewer I'll say that a lot of committers (me not excepted) are bad at socializing their proposed large-scale changes. A not-uncommon pattern is to create a 500LOC+ phab review with a couple of sentences describing the change, and add a few random committers who recently touched the modified files. If you want a good review from me, I need to believe that my time isn't being wasted. I want some indication that the developer thought things through, is confident in the correctness, suitability and long-term maintainability of their patch, and has done some basic testing. I also want to believe that the developer is willing to substantially rework the patch in response to feedback from reviewers, if necessary. With all due respect, I don't really get those impressions from this review: the lack of a high-level description gives me the impression that the proposed locking protocol has not been fully thought out, and the fact that the change is still being revised indicates that it's not ready for review in the first place. (And now I'll get an email every time a revised version is uploaded, but won't know if I should spend time looking at the new rev.) There is no analysis of the effect of this change on the single-threaded case. The revision I looked last night at had some suspicious-looking races, and I can't immediately tell if my comments to that effect were addressed in the new revisions; I guess I'll just have to figure that out myself at some point. It feels like my time isn't important, and that frustrates me a lot more than anything to do with style(9).

Bear in mind that you're venting about the review itself where no criticism was made of you and not the style(9) issues. The two fundamental changes in the locking protocol were (initially) simply changing which locks protected a couple of fields (in principle). In practice the global list lock hid all manner of sins in the usage of the pcb lock. I admit that this was not fully baked and I do apologize. It now withstands pho's tests and is fairly solid. My experience with locking protocol documentation in the kernel is that they're 90% garbage. We don't make clear what is being protected in a given piece of code with a lock. Recently I've repeatedly had to wade through *many* call sites to see what a lock was in fact protecting. Maybe you feel that the VM is better, but in general in the kernel it's mostly ad hoc in practice. Perhaps we can do better in UNIX sockets.

mmacy updated this revision to Diff 42597.May 15 2018, 6:41 PM
mmacy edited the summary of this revision. (Show Details)
  • assure that detach is called on shutdown
  • close possible race between send and detach
  • update locking comment

@markj will this suffice?

markj added a comment.May 15 2018, 7:04 PM

Bear in mind that you're venting about the review itself where no criticism was made of you and not the style(9) issues.

I really intended to use the review as an example of a more general problem. In this case it's hard to express my frustration without being specific, and I apologize for picking on you. FWIW, I've wanted to write something similar many times in the past, but hadn't because I didn't want to come off as overly harsh.

I would be unhappy with this review even if the diff itself was perfect in every way. To me it speaks to an (unintentional) indifference towards the reviewers and the review process itself. I think it's a cultural problem and is partly exacerbated by phabricator itself since it's now so easy to create a review and cc a number of committers - I can do it from the shell with a single command. Some discipline is required in order to create good, useful reviews that invite constructive feedback (which may include style nits) instead of pure nitpicking.

The two fundamental changes in the locking protocol were (initially) simply changing which locks protected a couple of fields (in principle). In practice the global list lock hid all manner of sins in the usage of the pcb lock. I admit that this was not fully baked and I do apologize. It now withstands pho's tests and is fairly solid.
My experience with locking protocol documentation in the kernel is that they're 90% garbage. We don't make clear what is being protected in a given piece of code with a lock. Recently I've repeatedly had to wade through *many* call sites to see what a lock was in fact protecting. Maybe you feel that the VM is better, but in general in the kernel it's mostly ad hoc in practice. Perhaps we can do better in UNIX sockets.

To be fair, there is a difference between bad documentation for a correct locking protocol, and bad (or absent) documentation for an incorrect locking protocol. We certainly have lots of examples of both. I can't easily be useful as a reviewer unless I have a mental model of how the locking is supposed to work, and you can help me get there by writing prose. Then I can verify my mental model against both your documentation (which might just be a single paragraph explaining how things have changed relative to the current implementation) _and_ the code, which I think is the most effective way of finding bugs in this sort of change. pho's tests are a great way to gain confidence in a diff, but they don't prove correctness. If they did, PR 227285 wouldn't exist. :)

I do agree that the status quo wrt locking is not good, and I do not claim that the VM's locking protocols are well-documented. I would like for us to raise the bar.

mmacy edited the summary of this revision. (Show Details)May 15 2018, 8:02 PM
mmacy edited the summary of this revision. (Show Details)May 15 2018, 10:48 PM
markj added inline comments.May 15 2018, 11:38 PM
sys/kern/uipc_usrreq.c
363 ↗(On Diff #42597)

break;? I think you might get warnings about empty statements otherwise.

371 ↗(On Diff #42597)

Since this is a macro, you can set unp2 = NULL if the PCB was freed to help catch bugs.

694 ↗(On Diff #42597)

Shouldn't we acquire the pcb lock before reading unp_vnode?

1083 ↗(On Diff #42597)

Why isn't it necessary to re-verify that unp2 == unp->unp_conn here?

1530 ↗(On Diff #42597)

This comment is bogus now.

mmacy added inline comments.May 15 2018, 11:50 PM
sys/kern/uipc_usrreq.c
694 ↗(On Diff #42597)

We can't because of lock ordering. However, I can revalidate the vnode pointer afterwards.

1083 ↗(On Diff #42597)

The lock2 is actually an error there. I've changed the code to hold both unpcb locks by the time I reach there.

mmacy updated this revision to Diff 42606.May 16 2018, 12:29 AM
  • mark refcount volatile
  • sort unpcb by access frequency
  • incorporate feedback
mmacy added inline comments.May 16 2018, 12:38 AM
sys/kern/uipc_usrreq.c
1519 ↗(On Diff #42521)

unp2 in this case is newly created. It shouldn't be visible yet AFAICT

1824 ↗(On Diff #42521)

It shows up as only tabs in my emacs.

mmacy updated this revision to Diff 42607.May 16 2018, 12:39 AM

missed style "fixes"

This revision was not accepted when it landed; it landed in state Needs Review.May 17 2018, 5:59 PM
This revision was automatically updated to reflect the committed changes.