
Restructure swapout tests after vm map locking was removed.
ClosedPublic

Authored by kib on Dec 30 2017, 9:08 AM.

Details

Summary

Consolidate the regions covered by the process lock. Combine similar condition tests into one; e.g., all process flags can be tested with one logical operation.

Remove labels and goto by explicitly tracking state.

Update comments.

NB. While looking at the split of the conditions between swapout_procs() and swapout(), I think that the procedure is racy. For instance, a thread which was checked for sleep time in swapout_procs() might become runnable (because the thread lock was dropped) and then be removed from the runqueue again by the time we reach the thread in swapout(). Then we would still consider the process eligible, which it is not.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib created this revision.Dec 30 2017, 9:08 AM
markj added inline comments.Dec 30 2017, 2:31 PM
sys/vm/vm_swapout.c
751 ↗(On Diff #37227)

P_SYSTEM is duplicated.

757 ↗(On Diff #37227)

Do we still need to hold the process?

kib marked an inline comment as done.Dec 30 2017, 3:42 PM
kib added inline comments.
sys/vm/vm_swapout.c
757 ↗(On Diff #37227)

I am not sure.

swapout() drops the process lock, but it does that after all threads are marked as swapped out, which prevents them from running (see the comment above the vm_thread_swapout() loop). Most likely, the process lock is dropped to avoid creating a lock order between the object and process locks.

So exit1() cannot be called while the lock is dropped, and parallel swap-ins/outs are blocked by the process flags. OTOH, the hold around swapout() was the historic behavior, and it might be needed again in the future.

If both you and Alan consider dropping the process hold a good idea, I will add it to the patch.

kib updated this revision to Diff 37234.Dec 30 2017, 3:43 PM

Remove duplicated P_SYSTEM from the flags set.

markj added inline comments.Dec 30 2017, 4:23 PM
sys/vm/vm_swapout.c
757 ↗(On Diff #37227)

I'm confused - we PRELE() the process before calling swapout().

kib added inline comments.Dec 30 2017, 4:42 PM
sys/vm/vm_swapout.c
757 ↗(On Diff #37227)

Ah, yes, I am confused as well.

Now I see another reason for the hold. Look at the end of the loop: we need to ensure that p is still alive when we iterate over p_list. I think this is a clear reason for the hold.

alc added inline comments.Dec 30 2017, 6:57 PM
sys/vm/vm_swapout.c
741–746 ↗(On Diff #37234)

I would strive for consistency between the order in which conditions are described in the comment and tested in the code.

756–757 ↗(On Diff #37234)

I think that we should add a comment describing why the hold is performed.

alc added inline comments.Dec 30 2017, 7:05 PM
sys/vm/vm_swapout.c
757 ↗(On Diff #37227)

OTOH, the hold around swapout() was the historic behavior, and it might be needed again in the future.

Are you saying that the PRELE() before swapout() is a recent change?

If we could maintain the hold across the swapout() then we wouldn't need to restart the FOREACH_PROC_IN_SYSTEM() iterator every time that we pick a process to swap out. I don't see the virtue in restarting the iterator.
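The shape alc is suggesting might look roughly like this (a kernel-pseudocode sketch only, not a drop-in patch; the exact hold primitive and lock handling are assumptions):

```c
/*
 * Sketch: holding the process across swapout() keeps p on the
 * allproc list while allproc_lock is dropped, so the
 * FOREACH_PROC_IN_SYSTEM() iteration can simply continue instead
 * of restarting from the head after each swapped-out process.
 */
sx_slock(&allproc_lock);
FOREACH_PROC_IN_SYSTEM(p) {
	PROC_LOCK(p);
	if (!eligible(p)) {
		PROC_UNLOCK(p);
		continue;
	}
	_PHOLD_LITE(p);		/* keep p alive and on the list */
	sx_sunlock(&allproc_lock);

	if (swapout(p) == 0)
		didswap = true;
	PROC_UNLOCK(p);

	sx_slock(&allproc_lock);
	PRELE(p);		/* safe: the hold kept p on p_list */
}
sx_sunlock(&allproc_lock);
```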

kib updated this revision to Diff 37245.Dec 30 2017, 7:36 PM

Take advantage of the process hold to avoid loop restart.
Update comments.

alc added a comment.Dec 31 2017, 7:14 PM

Ugh. I claim that the variable "minslptime" is pointless and always has been, ever since its introduction in r83366. Essentially, when the FOREACH_THREAD loop completes, we can assert that the sleep time for every thread is above whichever threshold is being applied. We don't need to maintain and check a minimum sleep time for the collection of threads.

Index: vm/vm_swapout.c
===================================================================
--- vm/vm_swapout.c     (revision 327361)
+++ vm/vm_swapout.c     (working copy)
@@ -729,10 +729,9 @@ swapout_procs(int action)
 {
        struct proc *p;
        struct thread *td;
-       int minslptime, slptime;
+       int slptime;
        bool didswap;
 
-       minslptime = 100000;
        didswap = false;
 retry:
        sx_slock(&allproc_lock);
@@ -831,8 +830,6 @@ retry:
                                        goto nextproc;
                                }
 
-                               if (minslptime > slptime)
-                                       minslptime = slptime;
                                thread_unlock(td);
                        }
 
@@ -841,15 +838,11 @@ retry:
                         * or if this process is idle and the system is
                         * configured to swap proactively, swap it out.
                         */
-                       if ((action & VM_SWAP_NORMAL) != 0 ||
-                           ((action & VM_SWAP_IDLE) != 0 &&
-                           minslptime > swap_idle_threshold2)) {
-                               _PRELE(p);
-                               if (swapout(p) == 0)
-                                       didswap = true;
-                               PROC_UNLOCK(p);
-                               goto retry;
-                       }
+                       _PRELE(p);
+                       if (swapout(p) == 0)
+                               didswap = true;
+                       PROC_UNLOCK(p);
+                       goto retry;
                }
 nextproc:
                PROC_UNLOCK(p);

minslptime was originally reinitialized for each process, before the FOREACH_THREAD loop, and, in r327354, we moved its initialization outside the FOREACH_PROC loop. This change in initialization had no material effect on the operation of this code.

kib added a comment.Dec 31 2017, 7:49 PM
In D13693#286896, @alc wrote:

Ugh. I claim that the variable "minslptime" is pointless and always has been, ever since its introduction in r83366.

I agree. I incorporated the variable removal into my patch, but I suppose that you are going to commit your change first?

kib updated this revision to Diff 37332.Dec 31 2017, 10:22 PM

Re-merge.

alc added a comment.Jan 1 2018, 5:54 PM

I've been looking at the history of changes to this code. Once upon a time, allproc_lock was held until either a process was swapped out, in which case, we did a "goto retry;" after releasing allproc_lock, or we completed the FOREACH_PROC loop. Then, a year and a half ago, we introduced the PHOLD and early allproc_lock release to address a lock ordering issue with vnodes. Elsewhere in this file, that change makes sense. Here, I don't see the need, especially now that we are no longer acquiring the map lock. Or, am I missing something?

P.S. Happy new year!

kib added a comment.Jan 1 2018, 7:15 PM
In D13693#287095, @alc wrote:

I've been looking at the history of changes to this code. Once upon a time, allproc_lock was held until either a process was swapped out, in which case, we did a "goto retry;" after releasing allproc_lock, or we completed the FOREACH_PROC loop. Then, a year and a half ago, we introduced the PHOLD and early allproc_lock release to address a lock ordering issue with vnodes. Elsewhere in this file, that change makes sense. Here, I don't see the need, especially now that we are no longer acquiring the map lock. Or, am I missing something?

Are you proposing to keep allproc_lock over the whole duration of the FOREACH_PROC() loop?

If so, I do not completely agree. I mean that we would monopolize the global lock for a long time, making, e.g., a parallel fork() impossible, especially if we really swap out the stack of some process. I do agree with the note that there is no sense in dropping the lock if we are not going to call swapout() on the process unconditionally on each iteration.

Or maybe we should check should_yield() and, if so, hold the process, drop the locks, and yield? This would be fairer globally.
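The should_yield() alternative kib floats would look roughly like this (a kernel-pseudocode sketch only, not part of the patch; the yield priority and hold primitive are assumptions):

```c
/*
 * Sketch: instead of dropping allproc_lock on every iteration,
 * drop it (and yield) only when this thread has run long enough
 * that other consumers of the lock are likely waiting.
 */
FOREACH_PROC_IN_SYSTEM(p) {
	if (should_yield()) {
		PROC_LOCK(p);
		_PHOLD_LITE(p);		/* keep our place in the list */
		PROC_UNLOCK(p);
		sx_sunlock(&allproc_lock);
		kern_yield(PRI_USER);
		sx_slock(&allproc_lock);
		PRELE(p);
	}
	/* ... examine and possibly swap out p ... */
}
```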

P.S. Happy new year!

Happy New Year to you !

alc added a comment.Jan 1 2018, 7:38 PM
In D13693#287118, @kib wrote:
In D13693#287095, @alc wrote:

I've been looking at the history of changes to this code. Once upon a time, allproc_lock was held until either a process was swapped out, in which case, we did a "goto retry;" after releasing allproc_lock, or we completed the FOREACH_PROC loop. Then, a year and a half ago, we introduced the PHOLD and early allproc_lock release to address a lock ordering issue with vnodes. Elsewhere in this file, that change makes sense. Here, I don't see the need, especially now that we are no longer acquiring the map lock. Or, am I missing something?

Are you proposing to keep allproc_lock over the whole duration of the FOREACH_PROC() loop?

To be clear, I am saying that the original reason for releasing the lock doesn't apply to this code anymore. If we are going to release it, we should be able to articulate a new reason for doing so.

kib added a comment.Jan 1 2018, 7:56 PM
In D13693#287127, @alc wrote:

To be clear, I am saying that the original reason for releasing the lock doesn't apply to this code anymore. If we are going to release it, we should be able to articulate a new reason for doing so.

As I said above, not holding the lock for the whole scan makes it possible for the numerous other consumers of the lock to make progress. We are reducing contention on allproc_lock by dropping it, either on each iteration or on should_yield().

markj accepted this revision.Jan 1 2018, 9:24 PM
In D13693#287136, @kib wrote:
In D13693#287127, @alc wrote:

To be clear, I am saying that the original reason for releasing the lock doesn't apply to this code anymore. If we are going to release it, we should be able to articulate a new reason for doing so.

As I said above, not holding the lock for the whole scan makes it possible for the numerous other consumers of the lock to make progress. We are reducing contention on allproc_lock by dropping it, either on each iteration or on should_yield().

Releasing it on each iteration seems a bit excessive, especially since users can trigger such scans anyway by calling sysctl_kern_proc(), and vm_daemon() performs another allproc scan after calling swapout_procs(). It does seem desirable to release the lock around swapout() calls though.

Happy new year!

This revision is now accepted and ready to land.Jan 1 2018, 9:24 PM
kib added a comment.Jan 1 2018, 10:02 PM
In D13693#287136, @kib wrote:
In D13693#287127, @alc wrote:

To be clear, I am saying that the original reason for releasing the lock doesn't apply to this code anymore. If we are going to release it, we should be able to articulate a new reason for doing so.

As I said above, not holding the lock for the whole scan makes it possible for the numerous other consumers of the lock to make progress. We are reducing contention on allproc_lock by dropping it, either on each iteration or on should_yield().

Releasing it on each iteration seems a bit excessive, especially since users can trigger such scans anyway by calling sysctl_kern_proc(), and vm_daemon() performs another allproc scan after calling swapout_procs(). It does seem desirable to release the lock around swapout() calls though.

Initially I also thought that it is bad to hold allproc during swapout(), but really we only unmap the kstacks and unwire the stack pages. So I tend to think that the only viable options are should_yield() or leaving it as is in the patch.

Happy new year!

Happy New Year !

alc added a comment.Jan 1 2018, 10:12 PM

I observe that we test for the P_SYSTEM flag without the benefit of the process lock in vmtotal(). Is that correct?

Also, why do we have both P_KPROC and P_SYSTEM? With one exception, create_init() in init_main.c, they are set simultaneously.

alc added a comment.Jan 1 2018, 10:28 PM
In D13693#287216, @kib wrote:
In D13693#287136, @kib wrote:
In D13693#287127, @alc wrote:

To be clear, I am saying that the original reason for releasing the lock doesn't apply to this code anymore. If we are going to release it, we should be able to articulate a new reason for doing so.

As I said above, not holding the lock for the whole scan makes it possible for the numerous other consumers of the lock to make progress. We are reducing contention on allproc_lock by dropping it, either on each iteration or on should_yield().

Releasing it on each iteration seems a bit excessive, especially since users can trigger such scans anyway by calling sysctl_kern_proc(), and vm_daemon() performs another allproc scan after calling swapout_procs(). It does seem desirable to release the lock around swapout() calls though.

Initially I also thought that it is bad to hold allproc during swapout(), but really we only unmap the kstacks and unwire the stack pages.

Exactly. I see nothing that makes this loop body of FOREACH_PROC particularly more expensive than some of the others. In essence, I'm looking for a reason why we should be concerned about lock contention here, but not in the other cases.

... So I tend to think that the only viable options are should_yield() or leaving it as is in the patch.

kib added a comment.Jan 1 2018, 10:32 PM
In D13693#287217, @alc wrote:

I observe that we test for the P_SYSTEM flag without the benefit of the process lock in vmtotal(). Is that correct?

I think it does not matter much because we currently only set P_SYSTEM at creation, and vmtotal() rechecks p_state. If we see PRS_NORMAL there, then we would perhaps see P_SYSTEM. But even if we don't, it would only skew the statistics slightly.

Also, why do we have both P_KPROC and P_SYSTEM? With one exception, create_init() in init_main.c, they are set simultaneously.

I always thought that the intent was for P_KPROC to mark real kernel processes, which do not have userspace. In particular, such processes do not react to signals because they never return to the absent userspace and do not go through ast(). P_SYSTEM marks processes that should receive more delicate treatment. This is vague, but good enough to see that we should test for P_SYSTEM for swapout or OOM selection, and that init is marked P_SYSTEM but not P_KPROC.

It could be that some time ago there were more nuances, which have now been reduced to only init not being a kproc while still having P_SYSTEM.

kib added a comment.Jan 1 2018, 10:42 PM
In D13693#287219, @alc wrote:

Exactly. I see nothing that makes this loop body of FOREACH_PROC particularly more expensive than some of the others. In essence, I'm looking for a reason why we should be concerned about lock contention here, but not in the other cases.

I think that with bigger and bigger systems and gradually bigger average workloads, where the current hundreds of processes become thousands, we want to avoid holding the global lock for the whole FOREACH_PROC (especially with FOREACH_THREAD_IN_PROC() nested three times). We should not regress the code, which already does not hold the lock for the whole loop.

alc added inline comments.Jan 2 2018, 5:18 AM
sys/vm/vm_swapout.c
761–766 ↗(On Diff #37332)

I would suggest moving this comment outside the loop, just before initializing doswap.

772 ↗(On Diff #37332)

For this function to be called, at least one of VM_SWAP_NORMAL and VM_SWAP_IDLE must be set. If (action & VM_SWAP_NORMAL) == 0 is true, then (action & VM_SWAP_IDLE) == 0 must be false. So, checking (action & VM_SWAP_IDLE) == 0 is pointless.

783–785 ↗(On Diff #37332)

This comment essentially repeats the comment from inside the loop.

791 ↗(On Diff #37332)

"The hold on ... The next iteration ...

kib updated this revision to Diff 37406.Jan 2 2018, 11:05 AM

Edit comments.
Explicitly list flags values assumption and use it to simplify thread test expression.

This revision now requires review to proceed.Jan 2 2018, 11:05 AM
markj added a comment.Jan 2 2018, 3:41 PM
In D13693#287221, @kib wrote:
In D13693#287219, @alc wrote:

Exactly. I see nothing that makes this loop body of FOREACH_PROC particularly more expensive than some of the others. In essence, I'm looking for a reason why we should be concerned about lock contention here, but not in the other cases.

I think that with bigger and bigger systems and gradually bigger average workloads, where the current hundreds of processes become thousands, we want to avoid holding the global lock for the whole FOREACH_PROC (especially with FOREACH_THREAD_IN_PROC() nested three times). We should not regress the code, which already does not hold the lock for the whole loop.

I'm a bit skeptical that this would represent a true regression, seeing as how swapout_procs() will only be called at most once per second by default, and then only under extreme conditions where lock contention isn't a large concern. OTOH, the single-threaded performance of swapout_procs() is not very important, so I think it would be sufficient to add /* Avoid blocking writers. */ or so to indicate that the unlock is not strictly necessary.

kib added a comment.Jan 2 2018, 4:04 PM
In D13693#287221, @kib wrote:
In D13693#287219, @alc wrote:

Exactly. I see nothing that makes this loop body of FOREACH_PROC particularly more expensive than some of the others. In essence, I'm looking for a reason why we should be concerned about lock contention here, but not in the other cases.

I think that with bigger and bigger systems and gradually bigger average workloads, where the current hundreds of processes become thousands, we want to avoid holding the global lock for the whole FOREACH_PROC (especially with FOREACH_THREAD_IN_PROC() nested three times). We should not regress the code, which already does not hold the lock for the whole loop.

I'm a bit skeptical that this would represent a true regression, seeing as how swapout_procs() will only be called at most once per second by default, and then only under extreme conditions where lock contention isn't a large concern. OTOH, the single-threaded performance of swapout_procs() is not very important, so I think it would be sufficient to add /* Avoid blocking writers. */ or so to indicate that the unlock is not strictly necessary.

My reaction is based on the similar case from Peter Jeremy, see r327468, where a very large object scan caused a livelock.

It is quite possible to run a process with 100K threads, and if swapout_procs() is tricked into scanning such a large process, the system hangs for tens or even hundreds of seconds.

alc added a comment.Jan 2 2018, 5:21 PM
In D13693#287221, @kib wrote:
In D13693#287219, @alc wrote:

Exactly. I see nothing that makes this loop body of FOREACH_PROC particularly more expensive than some of the others. In essence, I'm looking for a reason why we should be concerned about lock contention here, but not in the other cases.

I think that with bigger and bigger systems and gradually bigger average workloads, where the current hundreds of processes become thousands, we want to avoid holding the global lock for the whole FOREACH_PROC (especially with FOREACH_THREAD_IN_PROC() nested three times). We should not regress the code, which already does not hold the lock for the whole loop.

I'm a bit skeptical that this would represent a true regression, seeing as how swapout_procs() will only be called at most once per second by default, and then only under extreme conditions where lock contention isn't a large concern. OTOH, the single-threaded performance of swapout_procs() is not very important, so I think it would be sufficient to add /* Avoid blocking writers. */ or so to indicate that the unlock is not strictly necessary.

I'll write a comment and post it here a little later.

sys/vm/vm_swapout.c
782 ↗(On Diff #37406)

Delete this blank line? The next two lines after the blank fit with the prior block comment.

kib marked an inline comment as done.Jan 2 2018, 5:29 PM
alc added inline comments.Jan 2 2018, 8:56 PM
sys/vm/vm_swapout.c
758–759 ↗(On Diff #37406)

"Further consideration of this process for swap out requires iterating over its threads. We release allproc_lock here so that process creation and destruction are not blocked while we iterate.

To later reacquire allproc_lock and resume iteration over the allproc list, we will first have to release the lock on this process. We place a hold on this process so that it remains in the allproc list while it is unlocked."

787–790 ↗(On Diff #37406)

I think that this comment is now subsumed by the earlier one.

kib updated this revision to Diff 37427.Jan 2 2018, 9:05 PM

Apply comments changes.

alc accepted this revision.Jan 2 2018, 10:30 PM
alc added inline comments.
sys/vm/vm_swapout.c
767 ↗(On Diff #37427)

In hindsight, "... on the process ..." sounds better.

769 ↗(On Diff #37427)

I think that there is a missing space here before the "*/".

This revision is now accepted and ready to land.Jan 2 2018, 10:30 PM
kib updated this revision to Diff 37432.Jan 2 2018, 10:35 PM

Comment fixes.

This revision now requires review to proceed.Jan 2 2018, 10:35 PM
kib added a subscriber: pho.Jan 2 2018, 10:36 PM

Peter, could you please test this diff? Ideally, it should demonstrate exactly the same behavior as stock HEAD. In other words, it does not fix any of the OOM scenarios.

alc added inline comments.Jan 2 2018, 10:41 PM
sys/vm/vm_swapout.c
769 ↗(On Diff #37427)

I debated whether to mention the temporary release of the process lock in swapout() as a reason for doing the hold so early. I'm happy to write another sentence if you like.

alc added a comment.Jan 2 2018, 10:58 PM

Please take a look at the comment at the head of this function. Every sentence after the first describes behavior different from what the function actually implements.

kib added a comment.Jan 3 2018, 10:45 AM
In D13693#287574, @alc wrote:

Please take a look at the comment at the head of this function. Every sentence after the first describes behavior different from what the function actually implements.

Do you mean that the comment should be fixed, or that we should make the code match the comment?

kib added inline comments.Jan 3 2018, 10:47 AM
sys/vm/vm_swapout.c
769 ↗(On Diff #37427)

I think that the point where we hold the process is not very important, as long as we do it before the first process unlock. But if you have a good wording for a note explaining it, of course it would be useful to include the text.

pho added a comment.Jan 4 2018, 11:02 AM
In D13693#287571, @kib wrote:

Peter, could you please test this diff? Ideally, it should demonstrate exactly the same behavior as stock HEAD. In other words, it does not fix any of the OOM scenarios.

I ran all of the stress2 tests and a buildworld/installworld on amd64. No problems seen.

kib added a comment.Jan 4 2018, 11:30 AM
In D13693#288058, @pho wrote:
In D13693#287571, @kib wrote:

Peter, could you please test this diff? Ideally, it should demonstrate exactly the same behavior as stock HEAD. In other words, it does not fix any of the OOM scenarios.

I ran all of the stress2 tests and a buildworld/installworld on amd64. No problems seen.

Thank you, Peter.

Alan. Mark. Should I go ahead and commit the current patch? There are two proposed comment changes which can be done later.

markj added a comment.Jan 4 2018, 3:12 PM
In D13693#288066, @kib wrote:

Alan. Mark. Should I go ahead and commit the current patch? There are two proposed comment changes which can be done later.

It's fine by me.

alc added a comment.Jan 4 2018, 4:35 PM
In D13693#288066, @kib wrote:
In D13693#288058, @pho wrote:
In D13693#287571, @kib wrote:

Peter, could you please test this diff? Ideally, it should demonstrate exactly the same behavior as stock HEAD. In other words, it does not fix any of the OOM scenarios.

I ran all of the stress2 tests and a buildworld/installworld on amd64. No problems seen.

Thank you, Peter.

Alan. Mark. Should I go ahead and commit the current patch? There are two proposed comment changes which can be done later.

Yes.

alc added a comment.Jan 4 2018, 5:43 PM
In D13693#287667, @kib wrote:
In D13693#287574, @alc wrote:

Please take a look at the comment at the head of this function. Every sentence after the first describes behavior different from what the function actually implements.

Do you mean that the comment should be fixed, or that we should make the code match the comment?

I do know that this function has changed dramatically since FreeBSD 4.x. At that point, the comment at the head of the function may have been accurate. I will investigate further and propose something.

Also, once we are no longer dropping the process lock in the midst of this function, I would propose that we add P_INEXEC to the exclusion test at the beginning of the loop body. Any process with that flag set should later fail the swap_idle_threshold1 test anyway, and this would address any hypothetical concerns about the p->p_vmspace access in swapout().

This revision was not accepted when it landed; it landed in state Needs Review.Jan 4 2018, 6:15 PM
This revision was automatically updated to reflect the committed changes.