vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.
Details
At Netflix we've been seeing a problem with our use of the kernel audit subsystem, where the praudit process silently exits after about 30 minutes. We are feeding the audit records into praudit with this command pipeline:
tail -F /var/audit/current | praudit -lps
The praudit command is exiting because it finds an record with type 0, which is invalid. Investigation with dtrace and ktrace showed that the problem happened when the kernel would quickly write two separate records into the audit log file:
0 10675 audit_record_write:entry kernel`audit_worker+0x362 kernel`fork_exit+0x7e kernel`0xffffffff8094e38e
time 1607031741855180818 type 20
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef 0: 14 00 00 00 90 0b 00 17 00 00 5f c9 5b bd 00 00 .........._.[... 10: 03 57 3c 00 00 00 02 2f 62 69 6e 2f 6b 65 6e 76 .W<..../bin/kenv 20: 00 76 66 73 2e 72 6f 6f 74 2e 6d 6f 75 6e 74 66 .vfs.root.mountf 30: 72 6f 6d 00 23 00 0a 2f 62 69 6e 2f 6b 65 6e 76 rom.#../bin/kenv 40: 00 3e 00 00 01 6d 00 00 00 00 00 00 00 00 00 00 .>...m.......... 50: 00 bd 00 00 00 00 00 02 40 10 00 04 95 97 24 ff ........@.....$. 60: ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 70: 00 00 00 00 01 5d e0 00 00 00 00 00 00 00 00 00 .....].......... 80: 00 00 00 27 00 00 00 00 00 13 b1 05 00 00 00 90 ...'............ 0 10675 audit_record_write:entry kernel`audit_worker+0x362 kernel`fork_exit+0x7e kernel`0xffffffff8094e38e
time 1607031741855180818 type 20
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef 0: 14 00 00 00 a4 0b 00 17 00 00 5f c9 5b bd 00 00 .........._.[... 10: 03 57 3c 00 00 00 02 2f 72 6f 6f 74 2f 74 6f 6f .W<..../root/too 20: 6c 73 2f 67 6d 69 72 72 6f 72 5f 73 74 61 74 75 ls/gmirror_statu 30: 73 00 70 72 69 6d 00 23 00 1b 2f 72 6f 6f 74 2f s.prim.#../root/ 40: 74 6f 6f 6c 73 2f 67 6d 69 72 72 6f 72 5f 73 74 tools/gmirror_st 50: 61 74 75 73 00 3e 00 00 01 64 00 00 00 00 00 00 atus.>...d...... 60: 00 00 00 00 00 bd 00 00 00 00 00 00 c0 7e 00 01 .............~.. 70: d3 78 24 ff ff ff ff 00 00 00 00 00 00 00 00 00 .x$............. 80: 00 00 00 00 00 00 00 00 01 5d df 00 00 00 00 00 .........]...... 90: 00 00 00 00 00 00 00 27 00 00 00 00 00 13 b1 05 .......'........ a0: 00 00 00 a4 ....
However what the tail process sees is one read from the log file with the amount of data from both writes, but the data of the second write is all zeroes:
86204 tail CALL read(0x5,0x800a36600,0x10000)
86204 tail GIO fd 5 read 308 bytes
0x0000 1400 0000 900b 0017 0000 5fc9 5bbd 0000 |.........._.[...| 0x0010 0357 3c00 0000 022f 6269 6e2f 6b65 6e76 |.W<..../bin/kenv| 0x0020 0076 6673 2e72 6f6f 742e 6d6f 756e 7466 |.vfs.root.mountf| 0x0030 726f 6d00 2300 0a2f 6269 6e2f 6b65 6e76 |rom.#../bin/kenv| 0x0040 003e 0000 016d 0000 0000 0000 0000 0000 |.>...m..........| 0x0050 00bd 0000 0000 0002 4010 0004 9597 24ff |........@.....$.| 0x0060 ffff ff00 0000 0000 0000 0000 0000 0000 |................| 0x0070 0000 0000 015d e000 0000 0000 0000 0000 |.....]..........| 0x0080 0000 0027 0000 0000 0013 b105 0000 0090 |...'............| 0x0090 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00a0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00b0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00c0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00d0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00e0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00f0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0100 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0110 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0120 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0130 0000 0000 |....|
86204 tail RET read 308/0x134
The problem is that the recently-added VOP_READ_PGCACHE() fast path accesses the data in the page cache pages without taking the vnode VOP_LOCK, relying on the vnode rangelock to prevent reads from seeing the contents of pages which have been created by a write to the file but that have not yet been initialized. vn_rdwr() takes a write rangelock on the vnode for the range of the offset/length that is given, however for IO_APPEND writes the offset is not known and is passed as zero. So the rangelock taken by vn_rdwr() is for the wrong range in this case, and thus VOP_READ_PGCACHE() is able to access and return the wrong data. The fix is for vn_rdwr() to lock the entire file range for IO_APPEND, just like vn_io_fault() does for O_APPEND. With this fix, the tail process no longer reads the wrong data and the praudit process is able to continue processing records.
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
This change certainly fixes the problem though it write-locks far more than necessary.