Page MenuHomeFreeBSD

Add "reboot -r" support - unmount old root and mount a new one.
ClosedPublic

Authored by trasz on May 31 2015, 8:04 PM.

Details

Summary

This patch makes it possible, using "reboot -r", to get the kernel
to kill all running processes, unmount all the filesystems, mount
a new rootfs and start init(8).

There are some further cleanups possible, eg. the way the init
is started, but I'd prefer to do it in a later change, so this
one remains non-intrusive and MFC-able.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I do not like this. The biggest point of disagreement is that you are exiting and re-creating the init. Minor point is signalling the system processes.

Suppose that kernel has grown the capability to execute an image from e.g. posix shm file descriptor. Then init could create posix shm file with a copy of some ELF, which does kill/wait for termination/unmount/swapoff, and mount. All of this would happen in userspace, without breaking any parent/child relations for the existing init process and existing reapers. Also, kernel would not get some bloat from single-used code.

sys/kern/kern_shutdown.c
190 ↗(On Diff #5845)

You must not signal the system processes at all.

191 ↗(On Diff #5845)

Er ?

203 ↗(On Diff #5845)

What is the purpose of this ?

wblock added inline comments.
sbin/reboot/reboot.8
115 ↗(On Diff #5845)

passive->active:
s/will kill/kills/
s/unmount/umounts/
s/mount/mounts/

116 ↗(On Diff #5845)

s/start/starts/

118 ↗(On Diff #5845)

"the" is probably not needed.

Rewrite. This still needs some cleaning up, and contains a quite
serious bug related to /tmp, but it shows the idea.

Note that this is a "proof of concept", or work in progress. Feedback is still welcome, just ignore the debugging printfs :-)

sys/kern/kern_shutdown.c
192 ↗(On Diff #7066)

Why do you need Giant ?

sys/kern/kern_shutdown.c
192 ↗(On Diff #7066)

To be extra safe - vfs_unmountall() was previously always called with Giant held; calling it without might uncover some bugs.

sys/kern/kern_shutdown.c
480 ↗(On Diff #7101)

s/preserviing/preserve/

481 ↗(On Diff #7101)

Should be "currently-running".

Current man page changes look all right, but remember to update the .Dd before commit. And it never hurts to check them with igor -R and mandoc -Tlint.

Avoid messing with vfs_mountroot() and vfs_unmountall() at all,
by temporarily removing the tmpfs from the mount list.

It's now a commit candidate.

sbin/init/init.8
287 ↗(On Diff #7218)

These sorts of changes should be committed separately

sbin/reboot/reboot.c
115 ↗(On Diff #7218)

?

sys/kern/vfs_subr.c
2712 ↗(On Diff #7218)

?

trasz added inline comments.
sbin/init/init.8
287 ↗(On Diff #7218)

Yup, and already done. It's just that I'm using svn for development and I don't have a different branch for everything.

sbin/reboot/reboot.c
115 ↗(On Diff #7218)

Oops, right; it was commented out for debugging.

sys/kern/vfs_subr.c
2712 ↗(On Diff #7218)

This breaks forced unmount of devfs with active nodes (like, a device node mounted somewhere).

sbin/init/init.8
287 ↗(On Diff #7218)

Ok. Can you upload a new diff once you have a commit candidate?

sys/kern/vfs_subr.c
2712 ↗(On Diff #7218)

But it's confusing to leave an assert here within #if 0. It should be updated, or removed if now inappropriate.

A slightly modified approach is planned; @trasz will add information and a rough idea of the timeline.

There were a few fixes/changes required for reroot to work, that should to be committed separately. Most of them was already committed; the last one is https://reviews.freebsd.org/D3467.

Updated version - things that needed to be committed separatedly got dealt
with; this also rearranges the init(8) code flow to fix the bug that resulted
in plain "reboot" triggering the "reboot -r" action.

For now, lets restrict the discussion to the kernel bits only. Ideally, you would split the patch into RB_REROOT (kernel + man page update) and the rest.

lib/libc/sys/reboot.2
118 ↗(On Diff #8168)

Which file system is the new root ? I understand the answer only from reading the source.

sbin/init/Makefile
6 ↗(On Diff #8168)

This is arguably better expressed in single line, or using '\\n' for line continuation.

sbin/init/init.c
259 ↗(On Diff #8168)

Why not write this as

if (setsid() < 0 && (errno != EPERM || getsid(0) != getpid() /* XXX: just 1 ? */))
    warning(....);

?

sys/kern/kern_shutdown.c
194 ↗(On Diff #8168)

There is no gains from locking Giant there. It only confuses the readers.

357 ↗(On Diff #8168)

this is better expressed as curproc != initproc

377 ↗(On Diff #8168)

I am looking at the code for vfs_mountroot() in the tree at r287124, and I do not understand why it would not mount an extra instance of the devfs on /dev.

Also, what makes vfs_mountroot() to mount the _new_ root ? I am missing the bits which pass the new configuration to the kernel.

388 ↗(On Diff #8168)

Isn't this leaves the transient filesystem (the mp, which carries the init binary between remounts) not attached at any point in the current filesystem hierarchy ? I do not see anything which would make the filesystem accessible by path to the new namespace. In particular, I do not understand how unmount in the init(8) reroot code worked for you.

trasz added inline comments.
lib/libc/sys/reboot.2
118 ↗(On Diff #8168)

It's described in reboot(8) (it's taken from kern.root.mountfrom kenv); I think I could reword this to say "mount the root filesystem using the same mechanism as used during normal boot process".

sys/kern/kern_shutdown.c
194 ↗(On Diff #8168)

Ok, if it's safe to call vfs_unmountall() and vfs_mountroot() without it, I'll just remove it.

377 ↗(On Diff #8168)

Well, since devfs forced unmount now works, it will mount the only instance of devfs on /dev.

I'd like to add this after this patch here goes in - it works correctly without it, and I'd like to avoid changes to vfs_mountroot.c, simply because I don't have a good way to test it properly.

As for the new root - again, it's described in reboot.8 change above.

388 ↗(On Diff #8168)

Looks like sys_unmount() just iterates over the mountlist comparing f_mntonname. Should I try to reattach it (ie lookup the vnode for /mnt, set mnt_vnodecovered and v_mountedhere, etc) or just leave it as it is? The dounmount() contains code to handle mnt_vnodecovered being NULL, for some reason, so perhaps this doesn't work just by luck?

lib/libc/sys/reboot.2
118 ↗(On Diff #8168)

It costs nothing to note there that kenv mountfrom is re-used. Syscall description should be self-contained, after all user utilities use syscalls to implement the actions.

sys/kern/kern_shutdown.c
377 ↗(On Diff #8168)

Ok, I missed that devfs is unmounted. Now I dislike this even more. My opinion is that inital devfs mount must be left as is and reused for new root.

388 ↗(On Diff #8168)

I suggest to use some directory on the initial devfs mount for the safe harbor to keep the transient filesystem around. Also, you could consider adding yet another MNTK flag to mark both initial devfs and the transient filesystems as unmountable during the reroot (after reroot the flag must be cleared).

lib/libc/sys/reboot.2
118 ↗(On Diff #8168)

Agreed, we should have some reference to what the new root file system will be.

trasz edited edge metadata.

Improve documentation, and avoid naming function a "shutdown()".

trasz marked 11 inline comments as done.Sep 8 2015, 9:27 AM
trasz marked 3 inline comments as done.Sep 8 2015, 9:28 AM
sys/kern/kern_shutdown.c
375 ↗(On Diff #8581)

To be honest, I'd prefer it the way it is now. I'm a huge fan of simplicity, and I just don't see the point of complicating the whole thing by treating devfs special in this particular case. Unless it would fix some actual bug. Would it?

386 ↗(On Diff #8581)

Again, is it required for fixing an actual bug in the code, or just a matter of personal preference?

sys/kern/kern_shutdown.c
375 ↗(On Diff #8581)

This is a false simplicity. When the devfs is unmounted, all devfs vnodes are reclaimed. This could cause unneccessary driver reinitialization, drop of the cached pages (since the vnode' vm_object is destroyed), and probably more consequences which I just do not bother to enumerate.

As an example of breakage, I believe that your process would loose the console if the filesystem is unmounted.

IMO not unmounting the initial devfs mount is simpler, if you look not at your code modifications, but at the whole system operation.

386 ↗(On Diff #8581)

Consider it a personal taste based on some experience with our VFS I have. Having a filesystem not reachable by the usual means, even for the short time, is against too many pre-supposed assumptions in the VFS.

trasz edited edge metadata.

Preserve /dev, and use /dev/reboot/ as a mountpoint for the temporary filesystem.

Good point. Ok, I think it should be ok now :-)

I am reviewing the kernel bits only for now, and I do request to commit the kernel bits only. Userspace will be reviewed separately.

Again, my biggest (and the only left) complain about the patch is the removal of the mount points from the mountlist. I think I already described how this can be done without removal. Add MNTK_ flag to inform vfs_unmountall() to keep the flagged mount point from unmounting it. Then the reroot code would set the flag for the devfs mp, and ideally for the p_textvp->v_mount as well, and clear after the call.

sys/kern/kern_shutdown.c
192 ↗(On Diff #8687)

I recommend to add MAKEDEV_WAITOK explicitely.

223 ↗(On Diff #8687)

I did not noticed this and the next line before. Do they belong to the changeset ?

391 ↗(On Diff #8687)

BTW, p_textvp could be already reclaimed, for many reasons. Then v_mount is NULL.

sys/kern/vfs_mountroot.c
227 ↗(On Diff #8687)

Where is the paired unbusy() done ? In vfs_mountroot_shuffle() ?

I have troubles convincing myself that it does happen, at least that it does happen always.

trasz edited edge metadata.

Fix stuff. As requested by kib@, this only includes kernel side of things; userspace
changes will be reviewed separately.

sys/kern/kern_shutdown.c
388 ↗(On Diff #8829)

This is weird order of things. I would only cache v_mount after vn_lock succeeded.

400 ↗(On Diff #8829)

? Is there {} missed ?

sys/kern/vfs_mountroot.c
227 ↗(On Diff #8829)

In addition to the previous comment, shouldn't the error from vfs_busy() checked ? Otherwise, at least, vfs_unbusy() would negate the counter.

trasz edited edge metadata.

Fix some more.

trasz added inline comments.
sys/kern/vfs_mountroot.c
227 ↗(On Diff #8830)

Well, because we remove tmpfs from the mountlist, the situation after this point is the same for both the usual boot and reroot: there is one filesystem, and it's devfs. The rest of the mountroot code should thus behave in exactly the same way in both cases (that's the whole point of messing with the mountlist).

kib added a reviewer: kib.
kib added inline comments.
sys/kern/vfs_mountroot.c
227 ↗(On Diff #8830)

I probably can now verbalize my unhappiness with the list manipulations. Consider the difference between real root mount and reroot operations. Real root is mounted while system did not even started init binary; there is no usermode acting in parallel. On the other hand, your code executes the same code, which relies on specific state of the mount list, and on the stability of the list state, while usermode is running. Nothing prevents a malicious code from mounting something in parallel, breaking your algorithm.

I mark the patch as reviewed, since I do not see any more 'small' bugs. Move it forward if you want, but note my vigilance with the robustness of the approach.

This revision is now accepted and ready to land.Sep 18 2015, 10:40 AM
This revision was automatically updated to reflect the committed changes.