
use fget_unlocked in pollscan, defer fdrop to sel/pollrescan and seltdclear
Needs Review (Public)

Authored by mmacy on Jun 14 2018, 5:46 AM.
Details

Reviewers
jeff
mjg
Summary
  • Eliminate the FILEDESC_SLOCK that prevents scaling of threaded poll operations
  • Retain the file pointer until rescan

Avoiding the second cap_rights check makes the single-threaded case marginally faster. poll1_threads now scales in the same way that poll1_processes does.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 17254

Event Timeline

this avoidably pessimizes the common case of single-threaded execution by adding an atomic op pair for each fd. the code can check that both the process is single-threaded and the fd table is not shared, in which case there is no need to grab a ref on files. this will end up being a minor pessimization for the multithreaded (and presumably rare) case while being a win for the single-threaded one.

In D15799#334039, @mjg wrote:

this avoidably pessimizes the common case of single-threaded execution by adding an atomic op pair for each fd. the code can check that both the process is single-threaded and the fd table is not shared, in which case there is no need to grab a ref on files. this will end up being a minor pessimization for the multithreaded (and presumably rare) case while being a win for the single-threaded one.

Easy enough change to make.

sys/kern/sys_generic.c
1453

removing unconditional selfdfree is highly suspicious. it probably should be done no matter what with a separate call to drop the file pointer.

does the patch survive buildworld or an actual poll user? (e.g. postgres vs pgbench)

sys/kern/sys_generic.c
1453

You should probably read the surrounding code. selfdfree won't necessarily be called if the scan loop is aborted due to an error. That's what seltdclear is for.

sys/kern/sys_generic.c
1453

i did. the point is you postponed it past fo_poll which looks like a bug.

In D15799#334039, @mjg wrote:

this avoidably pessimizes the common case of single-threaded execution by adding an atomic op pair for each fd. the code can check that both the process is single-threaded and the fd table is not shared, in which case there is no need to grab a ref on files. this will end up being a minor pessimization for the multithreaded (and presumably rare) case while being a win for the single-threaded one.

on my machine it takes less than 40 clock cycles, or 11ns, to do an atomic_add/atomic_fetchadd pair on a line that is in cache. I would really prefer that we did not obfuscate the code with fragile exceptions for a tiny bit of performance. There are far more profitable ways to improve our single-threaded perf in poll.

sys/kern/sys_generic.c
1453

seltdfree should be correctly unwinding all of this state. It may be faster to do it directly so we have fewer list traversals though.

In D15799#334056, @jeff wrote:
In D15799#334039, @mjg wrote:

this avoidably pessimizes the common case of single-threaded execution by adding an atomic op pair for each fd. the code can check that both the process is single-threaded and the fd table is not shared, in which case there is no need to grab a ref on files. this will end up being a minor pessimization for the multithreaded (and presumably rare) case while being a win for the single-threaded one.

on my machine it takes less than 40 clock cycles, or 11ns, to do an atomic_add/atomic_fetchadd pair on a line that is in cache. I would really prefer that we did not obfuscate the code with fragile exceptions for a tiny bit of performance. There are far more profitable ways to improve our single-threaded perf in poll.

This patch converts a per-call lock/unlock pair into a ref/unref pair for each passed fd, so it does matter. More importantly, the vast majority of poll users are single-threaded, so this patch as presented is pessimal for real uses. I don't see how the proposal obfuscates the code in any significant way.

I definitely agree there are plenty of wins to get in this code regardless of the above.

In D15799#334058, @mjg wrote:
In D15799#334056, @jeff wrote:
In D15799#334039, @mjg wrote:

this avoidably pessimizes the common case of single-threaded execution by adding an atomic op pair for each fd. the code can check that both the process is single-threaded and the fd table is not shared, in which case there is no need to grab a ref on files. this will end up being a minor pessimization for the multithreaded (and presumably rare) case while being a win for the single-threaded one.

on my machine it takes less than 40 clock cycles, or 11ns, to do an atomic_add/atomic_fetchadd pair on a line that is in cache. I would really prefer that we did not obfuscate the code with fragile exceptions for a tiny bit of performance. There are far more profitable ways to improve our single-threaded perf in poll.

This patch converts a per-call lock/unlock pair into a ref/unref pair for each passed fd, so it does matter. More importantly, the vast majority of poll users are single-threaded, so this patch as presented is pessimal for real uses. I don't see how the proposal obfuscates the code in any significant way.

I definitely agree there are plenty of wins to get in this code regardless of the above.

I understand but you can't guarantee the thread is the only thing which is accessing these file descriptors. Off the top of my head, the unix domain socket gc thread does fdrops() on a task queue. It _may_ be possible to start to work around these things but it becomes incredibly hard to reason about. And you'd have to audit everything else in the kernel that uses a file * to understand whether it imposes restrictions on things you can do single threaded.

sys/kern/sys_generic.c
1453

The func hiding under fo_poll can find there is no event to return - possible with multiple threads waiting on the same fd. In this case it will re-register the watcher.

the original code makes sure to always call selfdfree before yet another call to fo_poll. the patch results in calling selfdfree *after* fo_poll, which sounds like either trouble if fo_poll re-registers the event OR an interface bug as selfdfree immediately undoes fo_poll's work.

note the entire mechanism (original) is just slow and has to be reworked to perform.

as an aside, this is what select already does anyway. Poll was still using the big slock but select was using the lockless fd support.

sys/kern/sys_generic.c
1453

Each thread has its own 'seltd' and 'selfd' structures. So you don't have to worry about effects from other threads calling select.

selrecord() skips registering the callback if SELTD_RESCAN is set.

A change to the selwakeup() interface that recorded which events were triggered would eliminate the rescan process entirely. It will make cases where you have to wait for events to fire faster.

For this case, the event always fires, because regular files are always writable. So most of our opportunities likely lie with making the fo_poll() path cheaper.

In D15799#334059, @jeff wrote:

I understand but you can't guarantee the thread is the only thing which is accessing these file descriptors. Off the top of my head, the unix domain socket gc thread does fdrops() on a task queue. It _may_ be possible to start to work around these things but it becomes incredibly hard to reason about. And you'd have to audit everything else in the kernel that uses a file * to understand whether it imposes restrictions on things you can do single threaded.

None of this is of any concern.

If the process is single threaded and the file descriptor table is not shared, it is the only entity which can modify its own fd table.

So in particular if it has a file installed, it holds a reference to keep it alive. Also nothing but curthread can drop it.

Let's say the same file object is being inspected by the unix gc thread - it is of no significance for this process. Let's say it fdrops. Does not matter, the process at hand still has its own ref.

The optimisation of not refing/unrefing files in single-threaded processes is implemented in Linux for all syscalls translating fd -> file.

The only caveat here is that you have to remember whether you grabbed the reference or not, since after you got one the other thread/whatever can disappear and you may transition to being single-threaded.

So the idiom is of this sort:
fp = fd2fp(fd, &need_fdrop);
....
fdrop_cond(fp, need_fdrop);

I would say pretty much no obfuscation in the caller, and possibly beneficial to apply globally, not only here.

sys/kern/sys_generic.c
1453

Each thread has its own 'seltd' and 'selfd' structures. So you don't have to worry about effects from other threads calling select.

That's not what I was worried about.

selrecord() skips registering the callback if SELTD_RESCAN is set.

My bad, so there is indeed no bug I was worried about. I forgot about the extra rescan "from scratch" later.

A change to the selwakeup() interface that recorded which events were triggered would eliminate the rescan process entirely. It will make cases where you have to wait for events to fire faster.

This adds a potential problem: in the current code by the time you call poll again the event may have been taken care of, in which case you end up going back to sleep. If you just store the info about the event and not reevaluate the state, you will end up sending threads back to userspace when the previous code would keep them in kernel. This is not a correctness bug since the recheck is inherently racy, but it may be a pessimization.

For this case, the event always fires, because regular files are always writable. So most of our opportunities likely lie with making the fo_poll() path cheaper.

I would argue the benchmark is just bad. The codebase was opening and closing files already and someone just reused that code to get fds for poll. It would be better to modify it to poll tcp sockets.

As for polling vnodes, normally people are not expected to do it as the vnodes are always ready. The code itself is very significantly pessimized in two major ways:

#ifdef MAC
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
        AUDIT_ARG_VNODE1(vp);
        error = mac_vnode_check_poll(active_cred, fp->f_cred, vp);
        VOP_UNLOCK(vp, 0);
        if (!error)
#endif

a mac or audit thingy in here is almost never needed, yet the vnode gets locked (in an exclusive manner at that!)

error = VOP_POLL(vp, events, fp->f_cred, td);

The VOP_* interface contains an extreme branchfest.

I don't think this is worth fighting for this particular benchmark, since as noted earlier it should not be polling files to begin with.

In D15799#334063, @mjg wrote:
In D15799#334059, @jeff wrote:

I understand but you can't guarantee the thread is the only thing which is accessing these file descriptors. Off the top of my head, the unix domain socket gc thread does fdrops() on a task queue. It _may_ be possible to start to work around these things but it becomes incredibly hard to reason about. And you'd have to audit everything else in the kernel that uses a file * to understand whether it imposes restrictions on things you can do single threaded.

None of this is of any concern.

If the process is single threaded and the file descriptor table is not shared, it is the only entity which can modify its own fd table.

So in particular if it has a file installed, it holds a reference to keep it alive. Also nothing but curthread can drop it.

Let's say the same file object is being inspected by the unix gc thread - it is of no significance for this process. Let's say it fdrops. Does not matter, the process at hand still has its own ref.

The optimisation of not refing/unrefing files in single-threaded processes is implemented in Linux for all syscalls translating fd -> file.

The only caveat here is that you have to remember whether you grabbed the reference or not, since after you got one the other thread/whatever can disappear and you may transition to being single-threaded.

So the idiom is of this sort:
fp = fd2fp(fd, &need_fdrop);
....
fdrop_cond(fp, need_fdrop);

I would say pretty much no obfuscation in the caller, and possibly beneficial to apply globally, not only here.

Matt found that most of the poll cost scales per-fd, so the atomic cost remains around 5% at higher fd counts. As I have said elsewhere, 5% is not worth polluting the code.

Today the code tolerates threads other than the owning process modifying the fd table. There are interfaces in the kernel which can do so. It may be that this approach is safe with everything in the tree. I don't know. Is it safe with all of the kernel modules in ports? What about third party code?

The change eliminates a scaling bottleneck and makes select and poll more similar as well as reducing the number of atomics in select. There are reasonably written applications which use select on a small set of fds in multiple threads. I think this is a positive step.

sys/kern/sys_generic.c
1453

Yes, you would miss cleared events as well as newly set events on the same descriptor. I think the question is: do we want a faster select/poll that may open up more races and require more select/poll calls, or do we want a slower, more aggressive implementation?

I agree that it is largely the vn_poll path that should be improved. But the selrecord/selwakeup paradigm in the kernel also made the select/poll implementations more expensive than they had to be. Probably, if we want to improve by 50% to match Linux, we would have to revisit all of this. Since we have cases where we are still at 1/10th of Linux perf or worse, I'm inclined to come back later.

sys/kern/sys_generic.c
1453

The implementation rechecking the work does not have to be visibly slower. It so happens the current code around it is just slow and the handlers themselves are expensive (e.g. they always lock stuff). I would expect performance loss in real-world programs stemming from changing this behaviour.

While vn_poll and related parts are a total dog, I don't argue for fixing them per se (or at least not vn_poll itself). I advocate for changing the benchmark to make more sense, namely poll tcp sockets.

In D15799#334362, @jeff wrote:
In D15799#334063, @mjg wrote:
In D15799#334059, @jeff wrote:

I understand but you can't guarantee the thread is the only thing which is accessing these file descriptors. Off the top of my head, the unix domain socket gc thread does fdrops() on a task queue. It _may_ be possible to start to work around these things but it becomes incredibly hard to reason about. And you'd have to audit everything else in the kernel that uses a file * to understand whether it imposes restrictions on things you can do single threaded.

None of this is of any concern.

If the process is single threaded and the file descriptor table is not shared, it is the only entity which can modify its own fd table.

So in particular if it has a file installed, it holds a reference to keep it alive. Also nothing but curthread can drop it.

Let's say the same file object is being inspected by the unix gc thread - it is of no significance for this process. Let's say it fdrops. Does not matter, the process at hand still has its own ref.

The optimisation of not refing/unrefing files in single-threaded processes is implemented in Linux for all syscalls translating fd -> file.

The only caveat here is that you have to remember whether you grabbed the reference or not, since after you got one the other thread/whatever can disappear and you may transition to being single-threaded.

So the idiom is of this sort:
fp = fd2fp(fd, &need_fdrop);
....
fdrop_cond(fp, need_fdrop);

I would say pretty much no obfuscation in the caller, and possibly beneficial to apply globally, not only here.

Matt found that most of the poll cost scales per-fd, so the atomic cost remains around 5% at higher fd counts. As I have said elsewhere, 5% is not worth polluting the code.

The 5% figure stems from testing patched and unpatched FreeBSD. I don't find it to be a legitimate assessment due to the fixable, enormous single-threaded cost -- with the issues removed, the atomics will be more visible. In particular, one of the tests passed 128 file descriptors; ref and unref gives 256 atomics which otherwise won't be present. I suggested the benchmark would be much more realistic if it polled tcp sockets instead of arguably useless (to poll) regular files. The impact of the atomics on the benchmark as it is is expected to be enormous, but it is unclear where it will fall for a case doing actual work. Thus I decided to measure myself.

Reworking for tcp sockets requires a lot of churn, so I went for a hopefully good enough approximation using unix sockets instead - the key is that there is work to be done and it exercises similar code in the Linux kernel. The benchmark does socketpair() 128 times, writes something to one end of each pair and then polls all the other ends. The code is here (warning: total hack): http://dpaste.com/1RAYXXN

The test box is: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz with a 4.16 upstream kernel, meltdown protection disabled

I ran:
while true; do ./pollsockstream2_processes -t 1 -s 10 | tee -a processes; ./pollsockstream2_threads -t 1 -s 10 | tee -a threads; done

The _processes case uses the optimisation. In the _threads case, there are 2 threads already (one performs the test, the other one reports the result once a second) thus it ends up not using the optimisation. I find this to be a good enough approximation of what would happen just without it.

x proc_avg
+ thread_avg
+----------------------------------------------------------------------+
|                                       +                              |
|                                       +                     x        |
|                                       +                     x        |
|                                       +                     x        |
|                                       +                     x        |
|                                       +                    xx        |
|                                      ++                    xx        |
|                                      ++                    xx        |
|                                      ++                    xx        |
|                                      ++                    xx        |
|                                      ++                    xx        |
|                                      ++                    xx        |
|                                      +++                   xx        |
|                                      +++                   xx        |
|                                      +++                   xx        |
|                                      +++                   xx x      |
|                                     ++++                   xx x      |
|                                     ++++                   xx x      |
|                                     ++++                   xx x      |
|                                     ++++                   xx x      |
|                                     ++++                   xx xx     |
|                                     ++++                  xxxxxx     |
|                                    ++++++                 xxxxxx     |
|                                    ++++++                 xxxxxx     |
|                                    ++++++                xxxxxxx     |
|                                    ++++++                xxxxxxx     |
|                                    ++++++                xxxxxxx     |
|                                    ++++++                xxxxxxx     |
|                             +     ++++++++     x        xxxxxxxxx    |
|    x          +           + ++    ++++++++     x        xxxxxxxxx    |
|+   x   + +x* +* +  ++x  x + +++ ++*++++++* x   x   xx xxxxxxxxxxxx   |
|                             |______A_M____|  |___________A__M_______||
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 134        113317        145648      142781.5     141223.61     5874.4558
+ 134        111440        133192        131247     130069.25     3726.3139
Difference at 95.0% confidence
	-11154.4 +/- 1177.88
	-7.89837% +/- 0.834056%
	(Student's t, pooled s = 4919.08)

While the results are a little bit unstable, you can see a 7.9% performance loss from doing the atomics. It's not terrible, but not that good either. And this is the unfriendly case. Fun side note: I did a brief benchmark with the current test polling files and the difference is drastic, as expected: 458512 vs 265248.

Regardless, I think the above demonstrates there is a real-world loss from doing said atomics.

Today the code tolerates threads other than the owning process modifying the fd table. There are interfaces in the kernel which can do so. It may be that this approach is safe with everything in the tree. I don't know. Is it safe with all of the kernel modules in ports? What about third party code?

There are no interfaces in the kernel which do this. There are cases like mountcheckdirs which can modify root/cwd/jail vnodes. There is also code inspecting the table for the purpose of exporting the content. There is no base code modifying the *fd* table for anyone but curthread.

If there is a 3rd party module modifying the table (what for?) without stopping the affected process, it is broken. fd lookup is lockless and the process can use existing fds for arbitrary syscalls. With the fd lock you only block closing and opening; you don't stop the process from doing $whatever with what happens to resolve to a given fd.

The change eliminates a scaling bottleneck and makes select and poll more similar as well as reducing the number of atomics in select. There are reasonably written applications which use select on a small set of fds in multiple threads. I think this is a positive step.

I proposed the change myself, with a note that it will pessimize the most common type of consumer -- single-threaded with many fds. I noted this can be fixed by avoiding refing/unrefing if the process is single-threaded and the table is not shared. I did not have any hard data to see how much of an impact this is; now I do (see above), and I think it is better to do it than to put the patch in as it is.

However, I'm not a maintainer of this code and I'm not going to insist.

kqueue and select both use fget_unlocked. If you want to propose files without references for single threaded programs you are free to do so. You should raise it on arch@ as there is no real owner in this area. This patch further reduces differences between select and poll and reduces the number of atomics used in select which I would argue is the more frequently used of the pair.