Differential D19742

random(4): Attempt to persist entropy promptly
ClosedPublic
Authored by cem on Mar 28 2019, 11:14 PM.

Details

Reviewers

markm
delphij
markj

Commits

rS345744: random(4): Attempt to persist entropy promptly

Summary

random(4): Attempt to persist entropy promptly

The goal of saving entropy in Fortuna is two-fold: (1) to provide early
availability of the random device (unblocking) on next boot; and (2), to
have known, high-quality entropy available for that initial seed. We know
it is high quality because it's output taken from Fortuna.

The FS&K paper makes it clear that Fortuna unblocks when enough bits have
been input that the output may be safely seeded. But they emphasize
that the quality of various entropy sources is unknown, and a saved entropy
file is essential for both availability and ensuring initial
unpredictability.

In FreeBSD we persist entropy using two mechanisms:

The /etc/rc.d/random shutdown() function, which is used for ordinary shutdowns and reboots; and,

A cron job that runs every dozen minutes or so to persist new entropy, in case the system suffers from power loss or a crash (bypassing the ordinary shutdown path).

Filesystems are free to cache dirty data indefinitely, with arbitrary flush
policy. Fsync must be used to ensure the data is persisted, especially for
the cron job save-entropy which is oriented at power loss or crash safety.

Ordinary shutdown may not need the fsync because unmount should flush out
the dirty entropy file. But it is always possible power loss or crash
occurs during the short window after rc.d/random shutdown runs and before
the filesystem is unmounted, so the additional fsync there seems harmless.
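The persistence step described in the summary can be sketched roughly as follows. This is a minimal illustration in Python rather than the shell used by the actual scripts; the helper name `persist_seed` is hypothetical, and it assumes a POSIX system where a directory can be opened read-only and passed to fsync.

```python
import os


def persist_seed(path: str, seed: bytes) -> None:
    """Write `seed` to `path`, then flush the file and its directory.

    Hypothetical helper for illustration only; it is not the code
    used by save-entropy.sh or rc.d/random.
    """
    # Create the file with restrictive permissions from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(seed)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to persist data and inode

    # Sync the containing directory so the directory entry is persisted too.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Without the two fsync calls the write may sit in the cache for as long as the filesystem's flush policy allows, which is the gap the change is meant to narrow.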

PR: 230876

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

cem created this revision on Mar 28 2019, 11:14 PM.

Harbormaster completed remote builds in B23363: Diff 55568 on Mar 28 2019, 11:14 PM.

I also plan to do the same treatment to libexec/rc/rc.d/random.

I think a 'fsync saved-entropy.1 .' should be sufficient.

We don't really care if the renames were not persistent until the new entropy is saved, and delaying it further with additional I/O activity would not benefit us.

(Personally I don't think we even need to do the fsync, though: the intention of saved-entropy is to give /dev/random some randomness that is not otherwise available by other means, like device attach, etc. at boot time, and to that extent not having the latest and freshest saved entropy doesn't really matter that much, but a final fsync doesn't seem to hurt either, so I'm neutral with this proposal.)

libexec/save-entropy/save-entropy.sh
Line 83 (on Diff #55568): fsync'ing the directory node would result in more writes, and I think we can tolerate it if it fails to sync.
Line 94 (on Diff #55568): This can be merged with the '.' (fsync saved-entropy.1 .).

Accept with delphij's changes.

In D19742#423197, @delphij wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.

We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.

FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I agree it may be likely that we are saving far more individual seed files than necessary, and could reduce the number of directory fsyncs by simply reducing the number of rotated entropy files.

and delaying it further with additional I/O activity would not benefit us.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

the intention of saved-entropy is to give /dev/random some randomness that is not otherwise available by other means, like device attach, etc. at boot time, and to that extent not having the latest and freshest saved entropy doesn't really matter that much

I think this understanding is dangerously simplistic.

FS&K is clear on the need for a saved entropy seed file to expediently seed the PRNG safely with actual trustworthy entropy on reboot. Not all systems have immediate access to sufficient entropy to quickly fulfill Fortuna's seeding thresholds (§ 9.6.6 First Boot, "The entropy accumulator can take quite a while to seed the PRNG properly") and early entropy from sources like device attach is not necessarily unpredictable (i.e., low entropy per bit). "Keep in mind that the Fortuna accumulator will seed the generator as soon as it might have enough entropy to be really random" — the emphasis is from the original paper.

Saved entropy is essential for availability and security of the random device.

(Personally I don't think we even need to do the fsync, though: ... , but a final fsync doesn't seem to hurt either, so I'm neutral with this proposal.)

Filesystems may arbitrarily delay flushing writes indefinitely — we need the fsync to guarantee the data is actually persisted.

libexec/save-entropy/save-entropy.sh
Line 83 (on Diff #55568): If a small number of additional writes is a problem, and we can tolerate losing every rotated file (by lost dirent reference), I propose that we can save even more writes by keeping only one entropy file.
Line 94 (on Diff #55568): Ah, I did not realize fsync(1) took a list of files! Thank you, I'll fix it.

Use a single fsync(1) operation in each location where it makes sense.

Extend the concept to rc.d/random shutdown().

Updated the description to clarify necessity and intent.

Harbormaster completed remote builds in B23390: Diff 55609 on Mar 29 2019, 10:07 PM.

cem edited the summary of this revision on Mar 29 2019, 10:08 PM.

In D19742#423456, @cem wrote:

In D19742#423197, @delphij wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.

We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.

FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

The file system shall guarantee that the data and metadata (inode) of the new seed file are written before the fsync() call returns, and the second one would guarantee that any in-flight dirent writes are finished before the fsync() call on '.' returns, which is exactly what is required here.

(If a file system allows a dirent referencing an incomplete inode to be retrieved after a crash, its consistency is seriously broken; as far as I know neither UFS nor ZFS would allow it.)

I agree it may be likely that we are saving far more individual seed files than necessary, and could reduce the number of directory fsyncs by simply reducing the number of rotated entropy files.

Probably, but keeping more individual seed files doesn't really cause harm, no?

and delaying it further with additional I/O activity would not benefit us.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

In order to satisfy fsync() semantics, a modern file system will have to create a new transaction, issue the write, and sleep until an interrupt from the disk controller signals that the tag associated with the buffers has been fully committed. The writes are not guaranteed to start right away: the only guarantee that the file system provides is that upon return of fsync(), the referenced object has been committed.

Note that the additional writes may be costly for flash storage, where small writes might be amplified into larger read-back and re-writes.

the intention of saved-entropy is to give /dev/random some randomness that is not otherwise available by other means, like device attach, etc. at boot time, and to that extent not having the latest and freshest saved entropy doesn't really matter that much

I think this understanding is dangerously simplistic.

FS&K is clear on the need for a saved entropy seed file to expediently seed the PRNG safely with actual trustworthy entropy on reboot. Not all systems have immediate access to sufficient entropy to quickly fulfill Fortuna's seeding thresholds (§ 9.6.6 First Boot, "The entropy accumulator can take quite a while to seed the PRNG properly") and early entropy from sources like device attach is not necessarily unpredictable (i.e., low entropy per bit). "Keep in mind that the Fortuna accumulator will seed the generator as soon as it might have enough entropy to be really random" — the emphasis is from the original paper.

Saved entropy is essential for availability and security of the random device.

It's essential, but how would a seed saved ~22 minutes ago be significantly less trustworthy than one saved ~11 minutes ago?

(Personally I don't think we even need to do the fsync, though: ... , but a final fsync doesn't seem to hurt either, so I'm neutral with this proposal.)

Filesystems may arbitrarily delay flushing writes indefinitely — we need the fsync to guarantee the data is actually persisted.

No, they can not delay flushing indefinitely.

For ZFS they would be written after up to 5 seconds by default, and with UFS it's up to 32 seconds. I would expect the syncing to be more useful for the shutdown script, though.

In D19742#423456, @cem wrote:

In D19742#423197, @delphij wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.

We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.

FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I need to actually read that, but the writes aren't atomic even with this change. If we crash before the fsync and after beginning the write to /entropy, we could end up with a truncated output file. Maybe I'm misunderstanding the meaning of your quote.

I agree it may be likely that we are saving far more individual seed files than necessary, and could reduce the number of directory fsyncs by simply reducing the number of rotated entropy files.

and delaying it further with additional I/O activity would not benefit us.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

Writes to SD cards used in embedded devices can be extremely slow. We've had issues where swap I/O to loaded cards takes > 10 seconds(!) to complete.

fsyncing after every rename does feel like overkill to me. I get the argument that it's formally the right thing to do, but in practice I would expect all of the dirents to reside in a single disk block.

The shutdown script change LGTM, but I insist that libexec/save-entropy/save-entropy.sh line 83 should be removed as explained in previous comment.

(OPTIONAL: Note that the chmod 600 calls were introduced to retroactively fix permissions on existing systems; it's probably worth mentioning that in the comment (something like: chmod 600 in case a pre-existing file has unsafe permissions set); they are not really needed for new systems.)

This revision now requires changes to proceed (Mar 29 2019, 10:50 PM).

In D19742#423465, @delphij wrote:

The file system shall guarantee that the data and metadata (inode) of the new seed file are written before the fsync() call returns, and the second one would guarantee that any in-flight dirent writes are finished before the fsync() call on '.' returns, which is exactly what is required here.

Sort of. I don't think there's any formal guarantee that filesystems with dirtied dirents are crash-safe in any order or with any consistency. It's a nice property, but I'm not sure UFS has it.

(If a file system allows a dirent referencing an incomplete inode to be retrieved after a crash, its consistency is seriously broken; as far as I know neither UFS nor ZFS would allow it.)

I'm not sure how to parse that. The renamed inodes were completed 11 minutes ago; only the dirents are being dirtied and then synced. I don't think anyone is objecting to syncing the new seed's inode, data, and dirent.

Probably, but keeping more individual seed files doesn't really cause harm, no?

The harm seems to be that it leads to suggestions to skip (arguably) correct syncing of the directory after individual renames.

In order to satisfy fsync() semantics, a modern file system will have to create a new transaction, issue the write, and sleep until an interrupt from the disk controller signals that the tag associated with the buffers has been fully committed.

Filesystems aren't required to be modern.

The only requirement to satisfy fsync semantics is that the write is issued to the backing disk and the interrupt completes, without error. Like I said before, this is probably measured in milliseconds; at most seconds.

The writes are not guaranteed to start right away: the only guarantee that the file system provides is that upon return of fsync(), the referenced object has been committed.

Sure. I guess I don't see how that is a problem, outside of a byzantine scenario where a filesystem intentionally delays fsync'd files, but attempts to flush other dirty files after a short delay.

Note that the additional writes may be costly for flash storage, where small writes might be amplified into larger read-back and re-writes.

This applies to raw NAND, but is less true of SSDs newer than about 2010. They have FTLs which essentially implement a log structured data store inside the disk. I'm not convinced that the additional number of writes materially affects even the crappiest and smallest 0.3DWPD-rated QLC SSDs. My rough math comes to this adding about 4 MB additional write volume per day with 4kB write sector flash (common). The smallest QLC SSD I can find on Newegg is 120GB. 4MB additional IO on that drive is 0.00003 DWPD; it's not going to materially affect the lifetime of the disk.
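The rough endurance arithmetic above can be reproduced with a short calculation. The 11-minute interval, 4 kB sector size, and 120 GB drive come from the comment; the figure of eight sector-sized writes per run is an assumption made here (one per rotated entropy file) to arrive at a daily volume close to the quoted 4 MB.

```python
# Back-of-the-envelope estimate of the extra write volume from fsync'd renames.
INTERVAL_MINUTES = 11      # save-entropy schedule mentioned in the review
SECTOR_BYTES = 4096        # "4kB write sector flash"
WRITES_PER_RUN = 8         # ASSUMPTION: one sector write per rotated file
DRIVE_BYTES = 120e9        # smallest QLC SSD cited in the comment

runs_per_day = 24 * 60 / INTERVAL_MINUTES
extra_bytes_per_day = runs_per_day * WRITES_PER_RUN * SECTOR_BYTES
dwpd = extra_bytes_per_day / DRIVE_BYTES   # fraction of the drive written per day

print(f"runs per day:        {runs_per_day:.1f}")
print(f"extra bytes per day: {extra_bytes_per_day / 1e6:.2f} MB")
print(f"drive writes/day:    {dwpd:.2e}")
```

Under these assumptions the extra volume comes to a little over 4 MB per day, a few hundred-thousandths of a drive write per day, which is in line with the estimate in the comment and four orders of magnitude below a 0.3 DWPD rating.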

On (UFS on) raw NAND, I don't think our current scheme makes much sense anyway.

It's essential, but how would a seed saved ~22 minutes ago be significantly less trustworthy than one saved ~11 minutes ago?

I agree the difference is insignificant. But the inductive property doesn't hold; the saved entropy should be regenerated regularly, rather than a single time. The ~11 minute figure roughly matches the FS&K text (which says "every 10 minutes or so"). I don't know how they arrived at that frequency.

We could safely drop the default number of entropy files from 8 to 4, or 2. I believe the current number dates to the Yarrow implementation, if not earlier. Fortuna does not need a lot.

No, [filesystems] can not delay flushing indefinitely.

Where is that requirement defined?

For ZFS they would be written after up to 5 seconds by default, and with UFS it's up to 32 seconds. I would expect the syncing to be more useful for the shutdown script, though.

Sure, the behavior of particular implementations may flush after bounded time. But the argument is about the abstract concept of a filesystem.

In D19742#423482, @markj wrote:

In D19742#423456, @cem wrote:

In D19742#423197, @delphij wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.

We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.

FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I need to actually read that, but the writes aren't atomic even with this change. If we crash before the fsync and after beginning the write to /entropy, we could end up with a truncated output file. Maybe I'm misunderstanding the meaning of your quote.

I think FS&K basically misuse the word "atomic" (in the filesystem/database sense of the word) in that sentence. Their requirement is that once a seed is used as input to the RNG, it must be persistently overwritten before any other output from the RNG is allowed to be observed. (This is a defense against some classes of attack where the same seed would be reused and a predictable stream of numbers generated.)

We don't enforce this very robustly in FreeBSD. We get reasonably close once fsync is added; rc.d is single threaded (for now) and the random script, which feeds in a saved seed, immediately overwrites it with dd from /dev/random. With fsync as well, I think we get close enough. When we think about parallel service startup, we would want some separate signalling mechanism to allow the equivalent of rc.d/random to feed in an old seed, retrieve a new seed, and report back to Fortuna that other reads should be unblocked.
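The ordering described here (feed in the old seed, then persistently overwrite it before anything else consumes RNG output) might be sketched as below. This is only an illustration: `consume_and_replace_seed` and the `feed` callback are hypothetical names, and `os.urandom` stands in for the `dd` from /dev/random that the real rc.d script performs.

```python
import os


def consume_and_replace_seed(seed_path, feed, seed_len=4096):
    """Feed the saved seed to the RNG, then durably overwrite it.

    `feed` is a hypothetical callback that mixes bytes into the RNG;
    the real script writes the seed into the random device instead.
    """
    with open(seed_path, "rb") as f:
        old_seed = f.read()
    feed(old_seed)

    # Stand-in for reading fresh output from /dev/random.
    new_seed = os.urandom(seed_len)

    # Overwrite the used seed and force it to stable storage so the
    # same seed cannot be replayed after a crash and reboot.
    fd = os.open(seed_path, os.O_WRONLY | os.O_TRUNC)
    with os.fdopen(fd, "wb") as f:
        f.write(new_seed)
        f.flush()
        os.fsync(f.fileno())
```

As noted earlier in the thread, this is not atomic in the filesystem sense: a crash between the truncate and the fsync can still leave a short or empty file, so the fsync narrows the window rather than closing it.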

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

Writes to SD cards used in embedded devices can be extremely slow. We've had issues where swap I/O to loaded cards takes > 10 seconds(!) to complete.

Even 10 seconds per sync isn't a problem; this service runs every 11 minutes.

fsyncing after every rename does feel like overkill to me. I get the argument that it's formally the right thing to do, but in practice I would expect all of the dirents to reside in a single disk block.

Yeah, I agree that it may not have practical benefit. But I reject delphij's argument that it is somehow too costly or slow from an endurance or latency perspective; it's just not.

I will remove the contentious rename syncs; I think we're in agreement about the rest being beneficial.
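The compromise reached here, rotating without a sync per rename and then syncing the new file and the directory once, could look roughly like the following. It is a Python sketch of the idea, not a translation of save-entropy.sh; the function name `rotate_and_save` and the `keep` parameter are hypothetical, and the `saved-entropy.N` naming follows the file names mentioned in the discussion.

```python
import os


def rotate_and_save(directory, new_seed, keep=8):
    """Rotate saved-entropy.N files and persist a new saved-entropy.1.

    Renames are NOT individually synced; one fsync of the new file and
    one fsync of the directory follow at the end.
    """
    # Shift the oldest files first so nothing is overwritten prematurely.
    for n in range(keep - 1, 0, -1):
        src = os.path.join(directory, f"saved-entropy.{n}")
        if os.path.exists(src):
            os.replace(src, os.path.join(directory, f"saved-entropy.{n + 1}"))

    newest = os.path.join(directory, "saved-entropy.1")
    fd = os.open(newest, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(new_seed)
        f.flush()
        os.fsync(f.fileno())   # persist the new seed's data and inode

    # Single directory sync, comparable to `fsync saved-entropy.1 .`
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

The trade-off is the one argued over above: a crash in the middle of the rename loop may leave the older files in an unspecified state, but the newest seed is durable once the function returns.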

Remove contentious fsync

Harbormaster completed remote builds in B23393: Diff 55613 on Mar 30 2019, 1:40 AM.

In D19742#423521, @cem wrote:

In D19742#423482, @markj wrote:

In D19742#423456, @cem wrote:

In D19742#423197, @delphij wrote:

I think a 'fsync saved-entropy.1 .' should be sufficient.

We don't really care if the renames were not persistent until the new entropy is saved,

Sure, if that is the power-fail/crash behavior of un-fsynced renames. But I don't believe that model is accurate. There is no requirement that the underlying filesystem order the dirent writes in a way that matches this observable behavior; the only requirement is that it is persisted by fsync.

FS&K §9.6.2 is clear, "All updates to the seed file must be atomic" and goes into more detail in §9.6.5.

I need to actually read that, but the writes aren't atomic even with this change. If we crash before the fsync and after beginning the write to /entropy, we could end up with a truncated output file. Maybe I'm misunderstanding the meaning of your quote.

I think FS&K basically misuse the word "atomic" (in the filesystem/database sense of the word) in that sentence. Their requirement is that once a seed is used as input to the RNG, it must be persistently overwritten before any other output from the RNG is allowed to be observed. (This is a defense against some classes of attack where the same seed would be reused and a predictable stream of numbers generated.)

We don't enforce this very robustly in FreeBSD. We get reasonably close once fsync is added; rc.d is single threaded (for now) and the random script, which feeds in a saved seed, immediately overwrites it with dd from /dev/random. With fsync as well, I think we get close enough. When we think about parallel service startup, we would want some separate signalling mechanism to allow the equivalent of rc.d/random to feed in an old seed, retrieve a new seed, and report back to Fortuna that other reads should be unblocked.

I see, thanks.

I don't believe IO sync latency is a real concern for this activity. Do you anticipate a few writes will be significant on the scale of the 11 minute interval this is scheduled at? I would guess even on extremely slow media we're looking at sub-second additional latency in total. The delay is not a benefit, but it is not a cost either.

Writes to SD cards used in embedded devices can be extremely slow. We've had issues where swap I/O to loaded cards takes > 10 seconds(!) to complete.

Even 10 seconds per sync isn't a problem; this service runs every 11 minutes.

The thing with UFS+SU is that fsync might trigger a flush of a long chain of dependencies, each of which has to complete before the next proceeds. I'm sure it won't take 11 minutes to complete the job, but it can take a surprisingly long amount of time. If the updates involve modifying a heavily used bitmap block, they might stall other applications which want to perform updates of their own. FFS' background writes mitigate this problem for the most part, but there are cases where they can be disabled.

fsyncing after every rename does feel like overkill to me. I get the argument that it's formally the right thing to do, but in practice I would expect all of the dirents to reside in a single disk block.

Yeah, I agree that it may not have practical benefit. But I reject delphij's argument that it is somehow too costly or slow from an endurance or latency perspective; it's just not.

I think the concern is that it's the sort of thing which can trigger unexpected I/O latency spikes for other applications. I doubt it'll matter for >99% of installations, but it's hard to be certain, and the fsync-per-rename looked like overkill when we consider that this change does not completely fix the problem at hand, but instead narrows the window where it can occur. With those fsyncs removed it seems more reasonable to me.

LGTM, thanks!

This revision is now accepted and ready to land (Mar 31 2019, 4:44 AM).

Closed by commit rS345744: random(4): Attempt to persist entropy promptly (authored by cem) on Mar 31 2019, 4:58 AM.

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: imp. (Mar 31 2019, 4:58 AM)