avoid deadlock between suspend and a user space process
Needs ReviewPublic
Actions

Authored by ouyangzhaowei_huawei.com on Apr 24 2017, 1:23 PM.

Details

Reviewers

Summary

I had patched the newest Errata, 11.0-RELEASE-p9 and 10.3-RELEASE-p18.
After made the live migration back and forth(more than 10 times), I found that a freebsd-10.3 or freebsd-11.0 VM will hang. By adding some logs in xctrl_suspend, I found the suspend will block in stop_all_proc function.
I think if the xctrl_suspend thread has held xenstore request mutex before stop_all_proc function is called, a user-space process(it happens to write xenstore) will be busy-waiting for the lock.
In result that stop_all_proc could not stop the user-space process. I think it should be changed to not-busy-waiting.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

ouyangzhaowei_huawei.com created this revision.Apr 24 2017, 1:23 PM

Herald added a subscriber: imp. · View Herald TranscriptApr 24 2017, 1:23 PM

I've asked several times before requesting the EN that you test stable/10, stable/11 and HEAD to make sure they where working fine, and I was told that everything was working fine and that all issues where fixed, yet here we are, we requested a set of ENs and it's still not working properly.

I've also already told you that I don't think this is correct, it might mitigate the issue you are seeing, but I don't think it's fixing it in any way.

Why does stop_all_proc() block? in xctrl_suspend stop_all_proc is called _before_ suspending xenstore, how can it block? Is there a thread blocked in xs_talkv trying to lock the xenstore lock, and that's preventing it from suspending?

In D10477#217228, @royger wrote:

I've asked several times before requesting the EN that you test stable/10, stable/11 and HEAD to make sure they where working fine, and I was told that everything was working fine and that all issues where fixed, yet here we are, we requested a set of ENs and it's still not working properly.

I've also already told you that I don't think this is correct, it might mitigate the issue you are seeing, but I don't think it's fixing it in any way.

Why does stop_all_proc() block? in xctrl_suspend stop_all_proc is called _before_ suspending xenstore, how can it block? Is there a thread blocked in xs_talkv trying to lock the xenstore lock, and that's preventing it from suspending?

Sorry about that, when Liu yingdong has committed the first version of the patch which include xs_is_locked function, our test team has test the migration of VM for over 1000 times. After he remove xs_is_locked(), the test team has test the basic migration for over 100 times.
But when we add more xenstore access through a user-space process, this issue may occur.

We try to track the call stack of the suspending when VM block, and found that if the xctrl_suspend has got the lock of xenstore, a user-space process try to access xenstore but never get the lock, then the stop_all_proc() function couldn't suspend this user-space process and never return.

The function xs_is_locked() could prevent the xenstore access keep waiting a lock, and make it go to sleep state for a while so that the stop_all_proc function could suspend it.
We would be very grateful if you had any other suggestions, thank you.

In any case, the patch is incorrect, there's a race. Consider the following scenario:

User-space program calls write on the xenstore-device.
xs_is_locked returns false.
Suspend procedure is started, thus calling xs_lock.
User-space syscall continues and calls into xs_dev_request_and_reply.
Deadlock again.

There's a window between the xs_is_locked call and the actual xs_lock call, which makes this approach racy. I've already commented this when this patch was originally submitted.

In D10477#217696, @royger wrote:

In any case, the patch is incorrect, there's a race. Consider the following scenario:

User-space program calls write on the xenstore-device.

xs_is_locked returns false.

Suspend procedure is started, thus calling xs_lock.

User-space syscall continues and calls into xs_dev_request_and_reply.

Deadlock again.

There's a window between the xs_is_locked call and the actual xs_lock call, which makes this approach racy. I've already commented this when this patch was originally submitted.

Yes, you're right. After more test we found that the stop_all_proc function could also be block by other processes, such as "syslogd" or "sendmail".

So refer to the freeze_processes function in Linux Kernel, we thought in case of the failure of stop all process, the suspend need to be aborted and the processes which have been stopped should be resumed.

Now we are testing this new modification, is there any suggestion? Thanks.