Add FIONEXTBOOT ioctl to set nextboot file (knextboot 3/5)
AbandonedPublic
Actions

Authored by cem on Jul 31 2016, 10:20 PM.

Details

Reviewers

kib
markj
jhb
bdrewery
mckusick

Summary

Implement in UFS.

UFS inodes get an IN_NEXTBOOT i_flag when they are the nextboot file.
This allows us to adjust the general mapping when such files are written
or removed. For now, just remove any mapping when the file is
manipulated (linked, removed, written, truncate, renamed).

The IN_NEXTBOOT flag is preserved in the mount structure if the vnode is
reclaimed.

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

cem updated this revision to Diff 18925.Jul 31 2016, 10:20 PM

cem retitled this revision from to Add FIONEXTBOOT ioctl to set nextboot file (knextboot 3/5).

cem updated this object.

cem edited the test plan for this revision. (Show Details)

cem added reviewers: kib, mckusick, markj.

cem set the repository for this revision to rS FreeBSD src repository - subversion.

Herald added a subscriber: imp. · View Herald TranscriptJul 31 2016, 10:20 PM

cem added a reviewer: bdrewery.Jul 31 2016, 10:21 PM

cem added a reviewer: jhb.

This patch is not against FreeBSD.

kib mentioned this in D7384: kernel nextboot: Implement ddb command (knextboot 5/5).Aug 1 2016, 8:35 AM

In D7382#153487, @kib wrote:

This patch is not against FreeBSD.

It is against FreeBSD + 1 and 2 in the series (D7379, D7381).

Move unrelated changes to previous patch.

I read through the change, and it at least makes sense to me. I can't really tell how well it would play with edge cases like a panic during a concurrent write by a process. That said, it seems to me like UFS is the wrong place to be implementing this feature: it adds a bunch of complexity to some already complex code, and it automatically excludes ZFS users. Since nextboot is a loader feature, why not implement this in the loader as well? One could instead do something like write loader(8) variables directly to partitions of a new GPT entry type, to be read directly at boot time. The usecase for this feature is somewhat narrow (probably mostly kernel developers with a specific workflow), so extending the filesystem to support it seems heavy-handed to me.

In D7382#153853, @markj wrote:

I read through the change, and it at least makes sense to me.

Cool.

I can't really tell how well it would play with edge cases like a panic during a concurrent write by a process.

We unregister the special inode whenever something manipulates it, before invoking any backing devvp IO. So the nextboot command would just be unavailable in this case.

That said, it seems to me like UFS is the wrong place to be implementing this feature: it adds a bunch of complexity to some already complex code,

Does it really? It doesn't seem especially complex to me.

and it automatically excludes ZFS users. Since nextboot is a loader feature, why not implement this in the loader as well?

This feature (nextboot(8)) is already pretty limited for ZFS users because loader cannot clear the nextboot.conf contents there. This is an incremental improvement to enhance nextboot support for systems with UFS /boot. I don't think that hurts ZFS.

One could instead do something like write loader(8) variables directly to partitions of a new GPT entry type, to be read directly at boot time.

IMO, this is a little more complicated. Documenting, requiring, or automating the creation of partitions (e.g. in the installer) is very easy to get wrong. And you would have to enhance loader to read from this magical GPT partition. It is an option, sure.

I see UFS nextboot as reaching a wider audience by default than a nextboot partition that requires manual setup. Maybe if the installer automates it, that would be better. But I don't know enough about the installer to do that work.

The usecase for this feature is somewhat narrow (probably mostly kernel developers with a specific workflow), so extending the filesystem to support it seems heavy-handed to me.

Maybe. It's a pretty small change, though.

Maybe there could be a more generic way to hook specific file(s), implemented in UFS, and then nextboot stuff could be done external to UFS. But I'm not sure what else would use the interface.

In D7382#153854, @cem wrote:

In D7382#153853, @markj wrote:

That said, it seems to me like UFS is the wrong place to be implementing this feature: it adds a bunch of complexity to some already complex code,

Does it really? It doesn't seem especially complex to me.

and it automatically excludes ZFS users. Since nextboot is a loader feature, why not implement this in the loader as well?

This feature (nextboot(8)) is already pretty limited for ZFS users because loader cannot clear the nextboot.conf contents there. This is an incremental improvement to enhance nextboot support for systems with UFS /boot. I don't think that hurts ZFS.

It doesn't hurt ZFS, but it makes this feature impossible to use on ZFS for no particularly strong reason. You could argue that the partial support of nextboot(8) on ZFS supports my stance: the fact that we rely on the filesystem in order to implement a loader feature is the reason it doesn't completely work on ZFS.

One could instead do something like write loader(8) variables directly to partitions of a new GPT entry type, to be read directly at boot time.

IMO, this is a little more complicated. Documenting, requiring, or automating the creation of partitions (e.g. in the installer) is very easy to get wrong. And you would have to enhance loader to read from this magical GPT partition. It is an option, sure.

I see UFS nextboot as reaching a wider audience by default than a nextboot partition that requires manual setup. Maybe if the installer automates it, that would be better. But I don't know enough about the installer to do that work.

DDB itself doesn't have a very wide audience. Most people who use it

don't debug the kernel on hand-installed machines, or
know how to create GPT partitions.

If you really want it after the fact, you can shave a few sectors off the swap partition. nextboot can be taught to create the partition if any free space is available.

The usecase for this feature is somewhat narrow (probably mostly kernel developers with a specific workflow), so extending the filesystem to support it seems heavy-handed to me.

Maybe. It's a pretty small change, though.

Yes, but you had to go through and change a few VOPs. And it still can't be "obviously correct" because I can't tell if something was missed.

Maybe there could be a more generic way to hook specific file(s), implemented in UFS, and then nextboot stuff could be done external to UFS. But I'm not sure what else would use the interface.

I guess that's part of my point: it seems like a narrow use-case.

In D7382#153853, @markj wrote:

One could instead do something like write loader(8) variables directly to partitions of a new GPT entry type, to be read directly at boot time.

In fact, this is quite nice idea.

With UEFI there is already a way to do this in the form of EFI's nvram variables. We are currently missing a way to set those post-boot though Warner has WIP to enable EFI runtime services which will permit this. Using a dedicated GPT boot partition as a backing store for these variables for the non-EFI case is a very interesting idea. If there is a small abstraction layer in the loader for 'fetch freebsd-specific nv vars' that the nextboot code uses, then that could have backend drivers that use EFI nvram, dedicated GPT partition, or other backing stores on other platforms. We could still retain nextboot.conf for legacy compat, but a 'ddb' nextboot command is probably far simpler to implement if it is writing to UEFI nvram or a dedicated partition. I'm a bit nervous about scribbling on filesystems from the debugger.

In D7382#153939, @jhb wrote:

With UEFI there is already a way to do this in the form of EFI's nvram variables. We are currently missing a way to set those post-boot though Warner has WIP to enable EFI runtime services which will permit this. Using a dedicated GPT boot partition as a backing store for these variables for the non-EFI case is a very interesting idea. If there is a small abstraction layer in the loader for 'fetch freebsd-specific nv vars' that the nextboot code uses, then that could have backend drivers that use EFI nvram, dedicated GPT partition, or other backing stores on other platforms. We could still retain nextboot.conf for legacy compat, but a 'ddb' nextboot command is probably far simpler to implement if it is writing to UEFI nvram or a dedicated partition. I'm a bit nervous about scribbling on filesystems from the debugger.

I have a partial implementation of nextboot using UEFI variables. The bootloader part is easy enough to do and I have it in the Netflix UEFI boot loader right now (it's basically boot1.efi that's just had this hack added). The work is stalled due to priority of other items. I'm in the midst of getting runtime services working after we call exit boot services.

A better idea, imho, would be to scribble this data at the end of the boot partition we have on all GPT disks. The down side of this idea is that we'd likely need to grow that partition which is difficult on old systems, but trivial on new ones. This would make the data be available everywhere, bypass concurrent write issues, and be a good stop-gap on UEFI systems as well. Though come to think of it, the UEFI boot partition is a FAT filesystem, so some care would be needed in ensuring we have a few extra sectors at the end to do this stuff with.

So here are a few thoughts. I'm kind of resigned to this not going in at this point, but:

I've tried to be extremely conservative as far as scribbling UFS goes. If anything changes the valid data blocks of the file, the knextboot conf is invalidated first.
knextboot conf is configured to only write the first *contiguous* data block(s) of the corresponding UFS file. It is protected by the same mechanism that prevents the kernel dumper from overwriting trailing partitions after the dump partition.
This works correctly today.
I'm happy to revert it out when Warner's EFI-variable implementation lands.

I'm cool with this going in until I can get UEFI done...

In D7382#154200, @cem wrote:

So here are a few thoughts. I'm kind of resigned to this not going in at this point, but:

I've tried to be extremely conservative as far as scribbling UFS goes. If anything changes the valid data blocks of the file, the knextboot conf is invalidated first.

knextboot conf is configured to only write the first *contiguous* data block(s) of the corresponding UFS file. It is protected by the same mechanism that prevents the kernel dumper from overwriting trailing partitions after the dump partition.

No, it is not. In fact, the code could either deadlock (if not entered under panic), or access inconsistent data (under panic). In particular, it might access random memory or write random disk blocks. Since ddb is entered at arbitrary moments, the filesystem structures, in particular, used for BMAP calculation, may be at any wrong state. Nobody codes filesystems to be interrupt-safe.

This works correctly today.

No.

In D7382#154294, @kib wrote:

knextboot conf is configured to only write the first *contiguous* data block(s) of the corresponding UFS file. It is protected by the same mechanism that prevents the kernel dumper from overwriting trailing partitions after the dump partition.

Hey Kib,

No, it is not. In fact, the code could either deadlock (if not entered under panic), or access inconsistent data (under panic).

As discussed before, it is intended to be used only under panic. Can you point out how it could possibly access inconsistent data?

In particular, it might access random memory or write random disk blocks.

I don't think so. Please explain how.

Since ddb is entered at arbitrary moments, the filesystem structures, in particular, used for BMAP calculation, may be at any wrong state.

This assumption is wrong. The BMAP calculations are done ahead of time; not in panic/ddb state.

Nobody codes filesystems to be interrupt-safe.

This works correctly today.

No.

I think you didn't read carefully enough to come to that conclusion.

In D7382#154413, @cem wrote:

In D7382#154294, @kib wrote:

knextboot conf is configured to only write the first *contiguous* data block(s) of the corresponding UFS file. It is protected by the same mechanism that prevents the kernel dumper from overwriting trailing partitions after the dump partition.

Hey Kib,

No, it is not. In fact, the code could either deadlock (if not entered under panic), or access inconsistent data (under panic).

As discussed before, it is intended to be used only under panic. Can you point out how it could possibly access inconsistent data?

In particular, it might access random memory or write random disk blocks.

I don't think so. Please explain how.

See below.

Since ddb is entered at arbitrary moments, the filesystem structures, in particular, used for BMAP calculation, may be at any wrong state.

This assumption is wrong. The BMAP calculations are done ahead of time; not in panic/ddb state.

There is no guarantee that the on-disk metadata at the time of your write matches the in-memory metadata at the time of BMAP call, since it could be not yet flushed (for async mounts), or SU might decided to revert that chunk of metadata due to unsatisfied dependencies. As result, your write, made physically through the devvp, may operate with the metadata view unsynced with the current physical metadata state on the disk.

Nobody codes filesystems to be interrupt-safe.

This works correctly today.

No.

I think you didn't read carefully enough to come to that conclusion.