random(4): Attempt to persist entropy promptly
Closed, Public

Authored by cem on Mar 28 2019, 11:14 PM.

Details

Summary

The goal of saving entropy in Fortuna is two-fold: (1) to provide early
availability of the random device (unblocking) on next boot; and (2) to
have known, high-quality entropy available for that initial seed. We know
it is high quality because it's output taken from Fortuna.

The FS&K paper makes it clear that Fortuna unblocks when enough bits have
been input that the output may be safely seeded. But they emphasize
that the quality of various entropy sources is unknown, and a saved entropy
file is essential for both availability and ensuring initial
unpredictability.

In FreeBSD we persist entropy using two mechanisms:

  1. The /etc/rc.d/random shutdown() function, which is used for ordinary shutdowns and reboots; and
  2. A cron job that runs every dozen minutes or so to persist new entropy, in case the system suffers from power loss or a crash (bypassing the ordinary shutdown path).

Filesystems are free to cache dirty data indefinitely, with arbitrary flush
policy. Fsync must be used to ensure the data is persisted, especially for
the cron job save-entropy which is oriented at power loss or crash safety.

Ordinary shutdown may not need the fsync because unmount should flush out
the dirty entropy file. But a power loss or crash can always occur during
the short window after rc.d/random shutdown runs and before the filesystem
is unmounted, so the additional fsync there seems harmless.
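
As a rough illustration of the persistence step described in the summary, the following Python sketch writes a seed file and then fsyncs both the file and its containing directory. It is not the code under review (the real change is to sh scripts that call fsync(1)); the function name `save_entropy` and its parameters are hypothetical.

```python
import os


def save_entropy(path: str, data: bytes) -> None:
    """Write data to path, then force the file and its directory to stable storage."""
    # Mode 0o600: the seed should not be readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Without this, the filesystem may keep the data dirty in cache.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Sync the directory so the entry naming the file is persisted as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would use it as `save_entropy("/some/dir/saved-entropy.1", os.urandom(4096))`; the path and the 4096-byte size here are placeholders, not values taken from the revision.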

PR: 230876

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

cem created this revision. Mar 28 2019, 11:14 PM
cem planned changes to this revision. Mar 29 2019, 2:21 AM

I also plan to do the same treatment to libexec/rc/rc.d/random.

I think a 'fsync saved-entropy.1 .' should be sufficient.

We don't really care if the renames were not persistent until the new entropy is saved, and delaying it further with additional I/O activity would not benefit us.

(Personally I don't think we even need to do the fsync, though: the intention of saved-entropy is to give /dev/random some randomness that is not otherwise available by other means, like device attach, etc. at boot time, and to that extent not having the latest and freshest saved entropy doesn't really matter that much, but a final fsync doesn't seem to hurt either, so I'm neutral on this proposal.)

libexec/save-entropy/save-entropy.sh
83 ↗(On Diff #55568)

fsync'ing the directory node would result in more writes, and I think we can tolerate it failing to sync.

94 ↗(On Diff #55568)

This can be merged with the '.' (fsync saved-entropy.1 .).
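
The single-sync variant suggested in this comment (rotate the old files, write the new one, then issue one `fsync saved-entropy.1 .`) can be sketched in Python as follows. This is only an illustration under assumptions: the helper name `rotate_and_save` is hypothetical, and the default of 8 files is taken from the figure mentioned later in this thread rather than from the script itself.

```python
import os


def rotate_and_save(directory: str, new_seed: bytes, keep: int = 8) -> None:
    """Rotate saved-entropy.N files, write a fresh .1, then sync file and directory once."""

    def name(n: int) -> str:
        return os.path.join(directory, f"saved-entropy.{n}")

    # Shift N -> N+1 starting from the oldest slot; the oldest file is overwritten.
    # No fsync is issued per rename in this variant.
    for n in range(keep - 1, 0, -1):
        if os.path.exists(name(n)):
            os.replace(name(n), name(n + 1))

    # Write the newest seed and sync its data and inode.
    fd = os.open(name(1), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(new_seed)
        f.flush()
        os.fsync(f.fileno())

    # One directory sync at the end covers the new entry and any pending renames.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether the un-synced renames are guaranteed to survive a crash in a consistent order is exactly the point debated in the rest of this thread; the sketch does not settle it.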

markm accepted this revision. Mar 29 2019, 9:31 AM

Accept with delphij's changes.

cem added a comment. Mar 29 2019, 9:41 PM

I think a 'fsync saved-entropy.1 .' should be sufficient.
We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.

FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I agree it may be likely that we are saving far more individual seed files than necessary, and could reduce the number of directory fsyncs by simply reducing the number of rotated entropy files.

and delaying it further with additional I/O activity would not benefit us.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

the intention of saved-entropy is to give /dev/random some randomness that is not otherwise available by other means, like device attach, etc. at boot time, and to that extent not having the latest and freshest saved entropy doesn't really matter that much

I think this understanding is dangerously simplistic.

FS&K is clear on the need for a saved entropy seed file to expediently seed the PRNG safely with actual trustworthy entropy on reboot. Not all systems have immediate access to sufficient entropy to quickly fulfill Fortuna's seeding thresholds (§ 9.6.6 First Boot, "The entropy accumulator can take quite a while to seed the PRNG properly") and early entropy from sources like device attach is not necessarily unpredictable (i.e., low entropy per bit). "Keep in mind that the Fortuna accumulator will seed the generator as soon as it might have enough entropy to be really random" — the emphasis is from the original paper.

Saved entropy is essential for availability and security of the random device.

(Personally I don't think we even need to do the fsync, though: ... , but a final fsync doesn't seem to hurt either, so I'm neutral on this proposal.)

Filesystems may arbitrarily delay flushing writes indefinitely — we need the fsync to guarantee the data is actually persisted.

libexec/save-entropy/save-entropy.sh
83 ↗(On Diff #55568)

If a small number of additional writes is a problem, and we can tolerate losing every rotated file (by lost dirent reference), I propose that we can save even more writes by keeping only one entropy file.

94 ↗(On Diff #55568)

Ah, I did not realize fsync(1) took a list of files! Thank you, I'll fix it.

cem marked an inline comment as done. Mar 29 2019, 10:07 PM
cem updated this revision to Diff 55609.

Use a single fsync(1) operation in each location where it makes sense.

Extend the concept to rc.d/random shutdown().

Updated the description to clarify necessity and intent.

cem edited the summary of this revision. Mar 29 2019, 10:08 PM
In D19742#423456, @cem wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.
We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.
FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

The file system shall guarantee that the data and metadata (inode) of the new seed file are written before the fsync() call returns, and the second fsync() would guarantee that any in-flight dirent writes are finished before the fsync() call on '.' returns, which is exactly what is required here.

(If a file system allows a dirent referencing an incomplete inode to be retrieved after a crash, its consistency is seriously broken; as far as I know neither UFS nor ZFS would allow it.)

I agree it may be likely that we are saving far more individual seed files than necessary, and could reduce the number of directory fsyncs by simply reducing the number of rotated entropy files.

Probably, but keeping more individual seed files doesn't really cause harm, no?

and delaying it further with additional I/O activity would not benefit us.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

In order to satisfy fsync() semantics, a modern file system will have to create a new transaction, issue the write, and sleep until an interrupt from the disk controller signals that the tag associated with the buffers was fully committed. The writes are not guaranteed to start right away: the only guarantee that the file system provides is that upon return of fsync(), the referenced object was committed.

Note that the additional writes may be costly for flash storage, where small writes might be amplified into larger read-back and re-writes.

the intention of saved-entropy is to give /dev/random some randomness that is not otherwise available by other means, like device attach, etc. at boot time, and to that extent not having the latest and freshest saved entropy doesn't really matter that much

I think this understanding is dangerously simplistic.
FS&K is clear on the need for a saved entropy seed file to expediently seed the PRNG safely with actual trustworthy entropy on reboot. Not all systems have immediate access to sufficient entropy to quickly fulfill Fortuna's seeding thresholds (§ 9.6.6 First Boot, "The entropy accumulator can take quite a while to seed the PRNG properly") and early entropy from sources like device attach is not necessarily unpredictable (i.e., low entropy per bit). "Keep in mind that the Fortuna accumulator will seed the generator as soon as it might have enough entropy to be really random" — the emphasis is from the original paper.
Saved entropy is essential for availability and security of the random device.

It's essential, but how would a seed saved ~22 minutes ago be significantly less trustworthy than one saved ~11 minutes ago?

(Personally I don't think we even need to do the fsync, though: ... , but a final fsync doesn't seem to hurt either, so I'm neutral on this proposal.)

Filesystems may arbitrarily delay flushing writes indefinitely — we need the fsync to guarantee the data is actually persisted.

No, they can not delay flushing indefinitely.

For ZFS they would be written after up to 5 seconds by default, and with UFS it's up to 32 seconds. I would expect the syncing to be more useful for the shutdown script, though.

markj added a comment. Edited Mar 29 2019, 10:47 PM
In D19742#423456, @cem wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.
We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.
FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I need to actually read that, but the writes aren't atomic even with this change. If we crash before the fsync and after beginning the write to /entropy, we could end up with a truncated output file. Maybe I'm misunderstanding the meaning of your quote.
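
For context on the truncated-file concern: a common way to avoid a partially written file in the filesystem sense is to write to a temporary file, fsync it, and rename it over the destination. The Python sketch below illustrates that general pattern only. It is not what this revision does, and the helper name `atomic_write` and the temporary-file prefix are hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace path with data so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the temporary file with mode 0600 in the same directory,
    # so the later rename stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".entropy.")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data is on stable storage before the rename
        os.replace(tmp, path)     # rename over the destination
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise

    # Persist the rename itself by syncing the directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this pattern a reader never observes a half-written destination file, at the cost of one extra directory update per save.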

I agree it may be likely that we are saving far more individual seed files than necessary, and could reduce the number of directory fsyncs by simply reducing the number of rotated entropy files.

and delaying it further with additional I/O activity would not benefit us.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

Writes to SD cards used in embedded devices can be extremely slow. We've had issues where swap I/O to loaded cards takes > 10 seconds(!) to complete.

fsyncing after every rename does feel like overkill to me. I get the argument that it's formally the right thing to do, but in practice I would expect all of the dirents to reside in a single disk block.

delphij requested changes to this revision. Mar 29 2019, 10:50 PM

The shutdown script change LGTM, but I insist that libexec/save-entropy/save-entropy.sh line 83 should be removed as explained in previous comment.

(OPTIONAL: Note that the chmod 600's were introduced to retroactively fix permissions on existing systems; it's probably worth mentioning this in the comment (something like: chmod 600 in case a pre-existing file has unsafe permissions set); they are not really needed for new systems.)

This revision now requires changes to proceed. Mar 29 2019, 10:50 PM
cem added a comment. Mar 30 2019, 1:20 AM

The file system shall guarantee that the data and metadata (inode) of the new seed file are written before the fsync() call returns, and the second fsync() would guarantee that any in-flight dirent writes are finished before the fsync() call on '.' returns, which is exactly what is required here.

Sort of. I don't think there's any formal guarantee that filesystems with dirtied dirents are crash-safe in any order or with any consistency. It's a nice property, but I'm not sure UFS has it.

(If a file system allows a dirent referencing an incomplete inode to be retrieved after a crash, its consistency is seriously broken; as far as I know neither UFS nor ZFS would allow it.)

I'm not sure how to parse that. The renamed inodes were completed 11 minutes ago; only the dirents are being dirtied and then synced. I don't think anyone is objecting to syncing the new seed's inode, data, and dirent.

Probably, but keeping more individual seed files doesn't really cause harm, no?

The harm seems to be that it leads to suggestions to skip (arguably) correct syncing of the directory after individual renames.

In order to satisfy fsync() semantics, a modern file system will have to create a new transaction, issue the write, and sleep until an interrupt from the disk controller signals that the tag associated with the buffers was fully committed.

Filesystems aren't required to be modern.

The only requirement to satisfy fsync semantics is that the write is issued to the backing disk and the interrupt completes, without error. Like I said before, this is probably measured in milliseconds; at most seconds.

The writes are not guaranteed to start right away: the only guarantee that the file system provides is that upon return of fsync(), the referenced object was committed.

Sure. I guess I don't see how that is a problem, outside of a byzantine scenario where a filesystem intentionally delays fsync'd files, but attempts to flush other dirty files after a short delay.

Note that the additional writes may be costly for flash storage, where small writes might be amplified into larger read-back and re-writes.

This applies to raw NAND, but is less true of SSDs newer than about 2010. They have FTLs which essentially implement a log structured data store inside the disk. I'm not convinced that the additional number of writes materially affects even the crappiest and smallest 0.3DWPD-rated QLC SSDs. My rough math comes to this adding about 4 MB additional write volume per day with 4kB write sector flash (common). The smallest QLC SSD I can find on Newegg is 120GB. 4MB additional IO on that drive is 0.00003 DWPD; it's not going to materially affect the lifetime of the disk.

On (UFS on) raw NAND, I don't think our current scheme makes much sense anyway.
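
The endurance estimate above can be checked with a few lines of arithmetic. In the sketch below, the 11-minute interval, the 4 kB sector size, the roughly 4 MB/day figure, and the 120 GB drive come from the comment; the count of 8 sector writes per run is an assumption made here to show one way the daily volume could arise, and both function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def daily_write_bytes(interval_min: float, writes_per_run: int, sector_bytes: int) -> float:
    """Extra bytes written per day if each run costs writes_per_run sector writes."""
    runs_per_day = MINUTES_PER_DAY / interval_min
    return runs_per_day * writes_per_run * sector_bytes


def dwpd(daily_bytes: float, capacity_bytes: float) -> float:
    """Drive writes per day: daily write volume as a fraction of drive capacity."""
    return daily_bytes / capacity_bytes


# Assumption: 8 extra sector writes per run (for example, one per rotated file).
estimated = daily_write_bytes(interval_min=11, writes_per_run=8, sector_bytes=4096)

# Figures stated in the comment: about 4 MB/day on a 120 GB drive.
stated = dwpd(daily_bytes=4e6, capacity_bytes=120e9)

if __name__ == "__main__":
    print(f"estimated extra writes: {estimated / 1e6:.2f} MB/day")
    print(f"DWPD for 4 MB/day on 120 GB: {stated:.7f}")
```

Under these assumptions the estimate lands a little above 4 MB/day, and 4 MB/day on a 120 GB drive works out to about 0.00003 DWPD, matching the figure quoted in the comment.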

It's essential, but how would a seed saved ~22 minutes ago be significantly less trustworthy than one saved ~11 minutes ago?

I agree the difference is insignificant. But the inductive property doesn't hold; the saved entropy should be regenerated regularly, rather than a single time. The ~11 minute figure roughly matches the FS&K text (which says "every 10 minutes or so"). I don't know how they arrived at that frequency.

We could safely drop the default number of entropy files from 8 to 4, or 2. I believe the current number dates to the Yarrow implementation, if not earlier. Fortuna does not need a lot.

No, [filesystems] can not delay flushing indefinitely.

Where is that requirement defined?

For ZFS they would be written after up to 5 seconds by default, and with UFS it's up to 32 seconds. I would expect the syncing to be more useful for the shutdown script, though.

Sure, the behavior of particular implementations may flush after bounded time. But the argument is about the abstract concept of a filesystem.

cem added a comment. Mar 30 2019, 1:33 AM
In D19742#423456, @cem wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.
We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.
FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I need to actually read that, but the writes aren't atomic even with this change. If we crash before the fsync and after beginning the write to /entropy, we could end up with a truncated output file. Maybe I'm misunderstanding the meaning of your quote.

I think FS&K basically misuse the word "atomic" (in the filesystem/database sense of the word) in that sentence. Their requirement is that once a seed is used as input to the RNG, it must be persistently overwritten before any other output from the RNG is allowed to be observed. (This is a defense against some classes of attack where the same seed would be reused and a predictable stream of numbers generated.)

We don't enforce this very robustly in FreeBSD. We get reasonably close once fsync is added; rc.d is single threaded (for now) and the random script, which feeds in a saved seed, immediately overwrites it with dd from /dev/random. With fsync as well, I think we get close enough. When we think about parallel service startup, we would want some separate signalling mechanism to allow the equivalent of rc.d/random to feed in an old seed, retrieve a new seed, and report back to Fortuna that other reads should be unblocked.
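
The ordering requirement described here (feed the old seed, persistently overwrite it, and only then let other consumers read) can be sketched as below. This is a hypothetical Python illustration, not how rc.d/random is implemented: `rollover_seed` and the three callables stand in for writing to the random device, reading from it, and the not-yet-existing "unblock" signal mentioned in the comment.

```python
import os
from typing import Callable


def rollover_seed(
    seed_path: str,
    feed_rng: Callable[[bytes], None],
    read_rng: Callable[[int], bytes],
    unblock: Callable[[], None],
) -> None:
    """Use a saved seed once, replacing it on disk before other RNG output is released."""
    with open(seed_path, "rb") as f:
        old_seed = f.read()

    # 1. Feed the saved seed into the generator.
    feed_rng(old_seed)

    # 2. Draw a replacement seed of the same size.
    new_seed = read_rng(len(old_seed))

    # 3. Overwrite the old seed and force it to stable storage, so that a crash
    #    after this point cannot cause the same seed to be fed in again.
    fd = os.open(seed_path, os.O_WRONLY | os.O_TRUNC)
    with os.fdopen(fd, "wb") as f:
        f.write(new_seed)
        f.flush()
        os.fsync(f.fileno())

    # 4. Only now signal that other readers may proceed (hypothetical hook).
    unblock()
```

The point of the sketch is the ordering of steps 3 and 4: the fsync has to complete before any other RNG output becomes observable.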

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

Writes to SD cards used in embedded devices can be extremely slow. We've had issues where swap I/O to loaded cards takes > 10 seconds(!) to complete.

Even 10 seconds per sync isn't a problem; this service runs every 11 minutes.

fsyncing after every rename does feel like overkill to me. I get the argument that it's formally the right thing to do, but in practice I would expect all of the dirents to reside in a single disk block.

Yeah, I agree that it may not have practical benefit. But I reject delphij's argument that it is somehow too costly or slow from an endurance or latency perspective; it's just not.

cem planned changes to this revision. Mar 30 2019, 1:34 AM

I will remove the contentious rename syncs; I think we're in agreement about the rest being beneficial.

cem updated this revision to Diff 55613. Mar 30 2019, 1:40 AM

Remove contentious fsync

markj accepted this revision. Mar 30 2019, 6:13 PM
In D19742#423521, @cem wrote:
In D19742#423456, @cem wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.
We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.
FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I need to actually read that, but the writes aren't atomic even with this change. If we crash before the fsync and after beginning the write to /entropy, we could end up with a truncated output file. Maybe I'm misunderstanding the meaning of your quote.

I think FS&K basically misuse the word "atomic" (in the filesystem/database sense of the word) in that sentence. Their requirement is that once a seed is used as input to the RNG, it must be persistently overwritten before any other output from the RNG is allowed to be observed. (This is a defense against some classes of attack where the same seed would be reused and a predictable stream of numbers generated.)
We don't enforce this very robustly in FreeBSD. We get reasonably close once fsync is added; rc.d is single threaded (for now) and the random script, which feeds in a saved seed, immediately overwrites it with dd from /dev/random. With fsync as well, I think we get close enough. When we think about parallel service startup, we would want some separate signalling mechanism to allow the equivalent of rc.d/random to feed in an old seed, retrieve a new seed, and report back to Fortuna that other reads should be unblocked.

I see, thanks.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

Writes to SD cards used in embedded devices can be extremely slow. We've had issues where swap I/O to loaded cards takes > 10 seconds(!) to complete.

Even 10 seconds per sync isn't a problem; this service runs every 11 minutes.

The thing with UFS+SU is that fsync might trigger a flush of a long chain of dependencies, each of which has to complete before the next proceeds. I'm sure it won't take 11 minutes to complete the job, but it can take a surprisingly long amount of time. If the updates involve modifying a heavily used bitmap block, they might stall other applications which want to perform updates of their own. FFS' background writes mitigate this problem for the most part, but there are cases where they can be disabled.

fsyncing after every rename does feel like overkill to me. I get the argument that it's formally the right thing to do, but in practice I would expect all of the dirents to reside in a single disk block.

Yeah, I agree that it may not have practical benefit. But I reject delphij's argument that it is somehow too costly or slow from an endurance or latency perspective; it's just not.

I think the concern is that it's the sort of thing which can trigger unexpected I/O latency spikes for other applications. I doubt it'll matter for >99% of installations, but it's hard to be certain, and the fsync-per-rename looked like overkill when we consider that this change does not completely fix the problem at hand, but instead narrows the window where it can occur. With those fsyncs removed it seems more reasonable to me.

delphij accepted this revision. Mar 31 2019, 4:44 AM

LGTM, thanks!

This revision is now accepted and ready to land. Mar 31 2019, 4:44 AM
This revision was automatically updated to reflect the committed changes.