vfs: fix rangelock range in vn_rdwr() for IO_APPEND
ClosedPublic
Actions

Authored by chs on Jan 6 2021, 8:25 PM.

Details

Reviewers

imp
kib
mjg
mckusick

Commits

rG11403bdeb415: vfs: fix rangelock range in vn_rdwr() for IO_APPEND

Summary

vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.

Test Plan

At Netflix we've been seeing a problem with our use of the kernel audit subsystem, where the praudit process silently exits after about 30 minutes. We are feeding the audit records into praudit with this command pipeline:

tail -F /var/audit/current | praudit -lps

The praudit command is exiting because it finds an record with type 0, which is invalid. Investigation with dtrace and ktrace showed that the problem happened when the kernel would quickly write two separate records into the audit log file:

0  10675         audit_record_write:entry
            kernel`audit_worker+0x362
            kernel`fork_exit+0x7e
            kernel`0xffffffff8094e38e

time 1607031741855180818 type 20

           0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
       0: 14 00 00 00 90 0b 00 17 00 00 5f c9 5b bd 00 00  .........._.[...
      10: 03 57 3c 00 00 00 02 2f 62 69 6e 2f 6b 65 6e 76  .W<..../bin/kenv
      20: 00 76 66 73 2e 72 6f 6f 74 2e 6d 6f 75 6e 74 66  .vfs.root.mountf
      30: 72 6f 6d 00 23 00 0a 2f 62 69 6e 2f 6b 65 6e 76  rom.#../bin/kenv
      40: 00 3e 00 00 01 6d 00 00 00 00 00 00 00 00 00 00  .>...m..........
      50: 00 bd 00 00 00 00 00 02 40 10 00 04 95 97 24 ff  ........@.....$.
      60: ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      70: 00 00 00 00 01 5d e0 00 00 00 00 00 00 00 00 00  .....]..........
      80: 00 00 00 27 00 00 00 00 00 13 b1 05 00 00 00 90  ...'............

0  10675         audit_record_write:entry
            kernel`audit_worker+0x362
            kernel`fork_exit+0x7e
            kernel`0xffffffff8094e38e

time 1607031741855180818 type 20

     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
 0: 14 00 00 00 a4 0b 00 17 00 00 5f c9 5b bd 00 00  .........._.[...
10: 03 57 3c 00 00 00 02 2f 72 6f 6f 74 2f 74 6f 6f  .W<..../root/too
20: 6c 73 2f 67 6d 69 72 72 6f 72 5f 73 74 61 74 75  ls/gmirror_statu
30: 73 00 70 72 69 6d 00 23 00 1b 2f 72 6f 6f 74 2f  s.prim.#../root/
40: 74 6f 6f 6c 73 2f 67 6d 69 72 72 6f 72 5f 73 74  tools/gmirror_st
50: 61 74 75 73 00 3e 00 00 01 64 00 00 00 00 00 00  atus.>...d......
60: 00 00 00 00 00 bd 00 00 00 00 00 00 c0 7e 00 01  .............~..
70: d3 78 24 ff ff ff ff 00 00 00 00 00 00 00 00 00  .x$.............
80: 00 00 00 00 00 00 00 00 01 5d df 00 00 00 00 00  .........]......
90: 00 00 00 00 00 00 00 27 00 00 00 00 00 13 b1 05  .......'........
a0: 00 00 00 a4                                      ....

However what the tail process sees is one read from the log file with the amount of data from both writes, but the data of the second write is all zeroes:

86204 tail CALL read(0x5,0x800a36600,0x10000)
86204 tail GIO fd 5 read 308 bytes

0x0000 1400 0000 900b 0017 0000 5fc9 5bbd 0000  |.........._.[...|
0x0010 0357 3c00 0000 022f 6269 6e2f 6b65 6e76  |.W<..../bin/kenv|
0x0020 0076 6673 2e72 6f6f 742e 6d6f 756e 7466  |.vfs.root.mountf|
0x0030 726f 6d00 2300 0a2f 6269 6e2f 6b65 6e76  |rom.#../bin/kenv|
0x0040 003e 0000 016d 0000 0000 0000 0000 0000  |.>...m..........|
0x0050 00bd 0000 0000 0002 4010 0004 9597 24ff  |........@.....$.|
0x0060 ffff ff00 0000 0000 0000 0000 0000 0000  |................|
0x0070 0000 0000 015d e000 0000 0000 0000 0000  |.....]..........|
0x0080 0000 0027 0000 0000 0013 b105 0000 0090  |...'............|
0x0090 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x00a0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x00b0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x00c0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x00d0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x00e0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x00f0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x0100 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x0110 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x0120 0000 0000 0000 0000 0000 0000 0000 0000  |................|
0x0130 0000 0000                                |....|

86204 tail RET read 308/0x134

The problem is that the recently-added VOP_READ_PGCACHE() fast path accesses the data in the page cache pages without taking the vnode VOP_LOCK, relying on the vnode rangelock to prevent reads from seeing the contents of pages which have been created by a write to the file but that have not yet been initialized. vn_rdwr() takes a write rangelock on the vnode for the range of the offset/length that is given, however for IO_APPEND writes the offset is not known and is passed as zero. So the rangelock taken by vn_rdwr() is for the wrong range in this case, and thus VOP_READ_PGCACHE() is able to access and return the wrong data. The fix is for vn_rdwr() to lock the entire file range for IO_APPEND, just like vn_io_fault() does for O_APPEND. With this fix, the tail process no longer reads the wrong data and the praudit process is able to continue processing records.