Page MenuHomeFreeBSD

Make passwd and group write operations sync
ClosedPublic

Authored by garga on Jul 2 2015, 4:40 PM.

Details

Reviewers
bapt
gnn
Group Reviewers
manpages
Summary

When passwd and group files are being updated and a power failure happens it can end up corrupted. The issue happened on pfSense and there are details here:

https://redmine.pfsense.org/issues/4523

This patch changes all write operations to be sync. The idea of calling fsync() for directory after rename came from McKusick (thanks!).

We did almost 400 power cycles with this change applied and got any single failure.

Sponsored by: Netgate

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

garga updated this revision to Diff 6651.Jul 2 2015, 4:40 PM
garga retitled this revision from to Make passwd and group write operations sync.
garga updated this object.
garga edited the test plan for this revision. (Show Details)
garga added a reviewer: bapt.
garga set the repository for this revision to rS FreeBSD src repository.
loos added a subscriber: loos.Jul 2 2015, 4:48 PM
jim_netgate.com added a comment.EditedJul 2 2015, 4:52 PM

Notes: the power failure (or in many cases, even reboot) can happen up to 120 seconds after 'pw' touches master.passwd or /etc/group, or 'pwd_mkdb' recreates /etc/pwd.db and /etc/spwd.db, and still leave the system with 0 length files after a reboot.

Tested with both SU+J, (pfSense 2.2.3 and newer) and neither SU or J (pfSense 2.2.2 and earlier).

We have a test rig that has performed over 360 test cycles on each setup.

the following script is run shortly before removing power from the system:
#!/bin/sh
/usr/sbin/pw userdel -n 'admin'
/usr/sbin/pw groupadd all -g 1998
/usr/sbin/pw groupadd admins -g 1999
/usr/sbin/pw groupmod all -g 1998 -M ''
echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw usermod -q -n root -s /bin/tcsh -H 0
echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw useradd -m -k /etc/skel -o -q -u 0 -n admin -g wheel -s /bin/sh -d /root -c 'System Administrator' -H 0
/usr/sbin/pw unlock admin -q
/usr/sbin/pw groupmod all -g 1998 -M '0'

Systems are connected to a TrippLite PDUMH15NET PDU, such that a snmp 'set' can cause power to be removed from the system under program (script) control.

Paraphrasing email from Kirk McKusick when we inquired about the corruption and truncation issues we were seeing (which are reproducible with the setup as described above):

What is happening is that the files in question are being truncated then rewritten with new contents. SU ensures that after the truncation they will either show the correct new result or be zero length. Absent SU they can show up claiming the unwritten blocks which is why you see random data. Marking the filesystems sync should fix the problem as you will not have the (up to) two minute gap between the write and the data being flushed to disk.
Applications that are properly written take these steps:

  1. write new file to a temporary name.
  2. fsync newly written file.
  3. rename temporary name to file to be updated.
  4. For faster results, fsync directory that has updated file to commit the new file.

To significantly close the window, you can add the sync command at the end of your script. Or for a more targeted approach, fsync the /etc directory (assuming all the files you care about are in the /etc directory).

With these patches, we have closed the issue for 'pw(8)' and 'pwd_mkdb(8)'.

bapt accepted this revision.Jul 2 2015, 5:05 PM
bapt edited edge metadata.
This revision is now accepted and ready to land.Jul 2 2015, 5:05 PM
wblock added a subscriber: wblock.Jul 2 2015, 6:05 PM

Remember to update .Dd in pw_util.3. Please check it with mandoc -Tlint and igor -R before commit. Thanks!

lib/libutil/pw_util.3
236

s/to master/to the master/

Reads better and shorter to say "on success" and "on failure":

It returns a file descriptor to the master password file on success
and -1 on failure.

garga added inline comments.Jul 2 2015, 6:12 PM
lib/libutil/pw_util.3
236

I didn't realize there was one more reviewer assigned and already committed the original patch. May I commit the changes as you suggested? Do you want to review the patch?

gnn accepted this revision.Jul 2 2015, 6:22 PM
gnn added a reviewer: gnn.
garga added inline comments.Jul 2 2015, 6:29 PM
lib/libutil/pw_util.3
236

gnn already approved the change. Committed in r285053

garga closed this revision.Jul 3 2015, 1:09 AM

Thanks! Are there any plans to MFC this to stable/10 as well?

garga added a comment.Jul 3 2015, 11:28 AM

Thanks! Are there any plans to MFC this to stable/10 as well?

Yes, I tried hard to get it committed before stable/10 freeze started but necessary tests took a lot of time.

But since it's a bug, I don't believe re@ is going to block it. @bapt agreed to MFC it together with other pw changes.