Page MenuHomeFreeBSD

Detect badly behaved coredump note helpers
ClosedPublic

Authored by cem on Sep 1 2015, 11:48 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Apr 29, 7:59 PM
Unknown Object (File)
Mon, Apr 29, 3:21 PM
Unknown Object (File)
Mar 20 2024, 10:33 AM
Unknown Object (File)
Mar 3 2024, 11:24 PM
Unknown Object (File)
Jan 30 2024, 5:30 PM
Unknown Object (File)
Jan 29 2024, 9:47 AM
Unknown Object (File)
Jan 13 2024, 1:23 PM
Unknown Object (File)
Jan 11 2024, 5:44 PM

Details

Summary

Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length is tied to the length of filenames
corresponding to vnodes in the process' fd table via vn_fullpath. As
vnodes may move around during dump, this is inherently racy.

So:

  • Detect badly behaved notes in putnote() and printf a warning as a weak assertion of correct behavior. No need to punish users for long- standing kernel bugs.
  • Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to exercise the NT_PROCSTAT_FILES corruption. It simply picks random lengths to expand or truncate paths to in fo_fill_kinfo_vnode().
  • Fix note_procstat_files to self-limit in the 2nd pass. Since sometimes this will result in a short write, pad up to our advertised size.
  • Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the zero padding.

Sponsored by: EMC / Isilon Storage Division

Test Plan

Pick a process using some files.

$ sysctl 'debug.fail_point.fill_kinfo_vnode__random_path=50%return(1)'
$ kill -ABRT <some process>

The core should open in GDB or procstat -f.

Before this patch, we observed mismatched NT_PROCSTAT_FILES length, causing GDB
to error with:

<foo> is not a core dump: File format not recognized

And readelf -n to error with (e.g.):

readelf: Warning: corrupt note found at offset 23baf8 into core notes
readelf: Warning:  type: 0, namesize: 00632000, descsize: 00000000

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

cem retitled this revision from to Detect badly behaved coredump note helpers.
cem updated this object.
cem edited the test plan for this revision. (Show Details)
cem added reviewers: markj, jhb, kib.
cem added a subscriber: benno.

It seems like this problem wouldn't exist if not for the behaviour of pack_kinfo(), since otherwise the note would just be an array of fixed-size records. Is there a way that the packing could be made optional?

sys/kern/vfs_vnops.c
2384 ↗(On Diff #8412)

Have you considered generalizing this routine a bit and putting it in kern_fail.c instead?

2398 ↗(On Diff #8412)

Seems strange to use strcpy to write a single character.

In D3548#72938, @markj wrote:

It seems like this problem wouldn't exist if not for the behaviour of pack_kinfo(), since otherwise the note would just be an array of fixed-size records.

Yes. Typically paths are very short, so it's pretty wasteful to dump fixed size path records. Tools like GDB, readelf, and procstat need to work with uncompressed dumps, so it's not like gzipping cores hides all of the pain.

Is there a way that the packing could be made optional?

Sure, it could be.

sys/kern/vfs_vnops.c
2384 ↗(On Diff #8412)

Nope. I wanted to keep the failpoint as simple as possible and close to where it was used.

In fact, this is only in its own function because the variant of failpoints in FreeBSD doesn't seem to accept code that contains commas — the last argument to the macro isn't a …, I guess.

2398 ↗(On Diff #8412)

It's writing the character and trailing \0.

Sure, we could also just directly write 'A' in a loop and do one '\0' at the end, but that would require braces.

In D3548#72941, @cem wrote:
In D3548#72938, @markj wrote:

It seems like this problem wouldn't exist if not for the behaviour of pack_kinfo(), since otherwise the note would just be an array of fixed-size records.

Yes. Typically paths are very short, so it's pretty wasteful to dump fixed size path records.

kinfo_file uses PATH_MAX, which is currently 1KB. Typically processes have less than 1024 open files, so that's less than 1MB added to the core file, which can be further reduced with compression. A core of cat(1) (with 7 open files) is 4.4MB. A core of firefox (with 80 open files) is 1.1GB.

Some types of file descriptors (network sockets, anonymous pipes) aren't associated with filesystem paths. We don't need to write out PATH_MAX null bytes for them.

Tools like GDB, readelf, and procstat need to work with uncompressed dumps, so it's not like gzipping cores hides all of the pain.

What extra pain do they suffer? Tools that don't interpret the note contents just seek to the next note. libprocstat reads the entire note into a buffer and uses kf_structsize to find file descriptor boundaries.

Is there a way that the packing could be made optional?

Sure, it could be.

I'd really prefer a solution that doesn't involve having the kernel print "THIS IS A BUG". :)

But if I'm alone on this or it turns out to be difficult to do optional packing, then ok.

In D3548#72944, @markj wrote:

kinfo_file uses PATH_MAX, which is currently 1KB. Typically processes have less than 1024 open files, so that's less than 1MB added to the core file, which can be further reduced with compression. A core of cat(1) (with 7 open files) is 4.4MB. A core of firefox (with 80 open files) is 1.1GB.

Some types of file descriptors (network sockets, anonymous pipes) aren't associated with filesystem paths. We don't need to write out PATH_MAX null bytes for them.

They also put things in kf_string. If we pack them and not vnode paths, we're just adding more special cases to pack_kinfo.

(We don't need to write out PATH_MAX for most files, either. :-))

Tools like GDB, readelf, and procstat need to work with uncompressed dumps, so it's not like gzipping cores hides all of the pain.

What extra pain do they suffer?

The tools need uncompressed cores, so users need the extra disk space to store uncompressed cores in order to debug.

I'm especially thinking about our internal debugging at $DAYJOB. We commonly work with physical and virtual machines with tiny /var and /var/crash partitions.

When /var/crash is 2GB, and you have a 512MB-1GB process memory image (common RLIMIT_AS for $DAYJOB) and a bunch of open fds, those 1kB start to add up really fast. Especially if you want to debug the core on the local machine.

Tools that don't interpret the note contents just seek to the next note. libprocstat reads the entire note into a buffer and uses kf_structsize to find file descriptor boundaries.

Yeah. I don't think the CPU cost of the bloat is high. It's just the on-disk space.

I'd really prefer a solution that doesn't involve having the kernel print "THIS IS A BUG". :)

Well, we can easily remove the printfs and just pad or corrupt silently, like we used to. If this is really preferable, I can do that.

I intend it as a weak assertion. I don't think KASSERT()ing here helps users — their program already crashed, and now their kernel crashed because the core is corrupt — because it is unlikely to prevent kernel developers from introducing new regressions. But maybe if there is a printf, and someone files a bug, someone can investigate why that particular note type is inconsistent and fix it.

But hey, silently producing corrupt userspace cores should absolutely not be the common case either. We expect this printf to never trigger. If it does, we fix it.

But if I'm alone on this or it turns out to be difficult to do optional packing, then ok.

I think we need the packing at $DAYJOB, so making it optional doesn't solve the corrupt core problem for me.

In D3548#72945, @cem wrote:
In D3548#72944, @markj wrote:

kinfo_file uses PATH_MAX, which is currently 1KB. Typically processes have less than 1024 open files, so that's less than 1MB added to the core file, which can be further reduced with compression. A core of cat(1) (with 7 open files) is 4.4MB. A core of firefox (with 80 open files) is 1.1GB.

Some types of file descriptors (network sockets, anonymous pipes) aren't associated with filesystem paths. We don't need to write out PATH_MAX null bytes for them.

They also put things in kf_string. If we pack them and not vnode paths, we're just adding more special cases to pack_kinfo.

(We don't need to write out PATH_MAX for most files, either. :-))

Tools like GDB, readelf, and procstat need to work with uncompressed dumps, so it's not like gzipping cores hides all of the pain.

What extra pain do they suffer?

The tools need uncompressed cores, so users need the extra disk space to store uncompressed cores in order to debug.

I'm especially thinking about our internal debugging at $DAYJOB. We commonly work with physical and virtual machines with tiny /var and /var/crash partitions.

When /var/crash is 2GB, and you have a 512MB-1GB process memory image (common RLIMIT_AS for $DAYJOB) and a bunch of open fds, those 1kB start to add up really fast. Especially if you want to debug the core on the local machine.

I agree with your concern in principle. In practice I really don't think it makes much of a difference. Some random sampling of nodes running longevity testing failed to turn up a system with more than 10000 fds open across all processes. So if every single process on such a system dumped core, we'd end up with < 10MB of extra bloat.

Tools that don't interpret the note contents just seek to the next note. libprocstat reads the entire note into a buffer and uses kf_structsize to find file descriptor boundaries.

Yeah. I don't think the CPU cost of the bloat is high. It's just the on-disk space.

I'd really prefer a solution that doesn't involve having the kernel print "THIS IS A BUG". :)

Well, we can easily remove the printfs and just pad or corrupt silently, like we used to. If this is really preferable, I can do that.

I intend it as a weak assertion. I don't think KASSERT()ing here helps users — their program already crashed, and now their kernel crashed because the core is corrupt — because it is unlikely to prevent kernel developers from introducing new regressions. But maybe if there is a printf, and someone files a bug, someone can investigate why that particular note type is inconsistent and fix it.

Sorry, I misunderstood. I thought we would print that message every time a PROCSTAT_FILE record gets truncated (i.e. because a file was renamed and given a longer path). If the message is only printed as a result of kernel programmer error, then ok.

To be clear, the fact that PROCSTAT_FILE note records can still get (silently) truncated is the reason I'm not crazy about this change. In the CR description you claim that this feature is inherently racy, but it isn't. It can be fixed properly at the expense of some minor bloat in the core file.

But hey, silently producing corrupt userspace cores should absolutely not be the common case either. We expect this printf to never trigger. If it does, we fix it.

But if I'm alone on this or it turns out to be difficult to do optional packing, then ok.

I think we need the packing at $DAYJOB, so making it optional doesn't solve the corrupt core problem for me.

Ok.

sys/kern/imgact_elf.c
1707 ↗(On Diff #8412)

Please not add a code to lecture user, to the kernel. The pritnfs could be made into a comment.

Then, I do not see it as a significant issue when the note was shrinked. It should be padded with zeroes and considered done.

For the reversed case, AFAIU, this is limited to the file pathes, right ? If yes, writing the truncated name, or writing some placeholder, or even writing all zeroes, is fine.

In D3548#72948, @markj wrote:

I agree with your concern in principle. In practice I really don't think it makes much of a difference. Some random sampling of nodes running longevity testing failed to turn up a system with more than 10000 fds open across all processes. So if every single process on such a system dumped core, we'd end up with < 10MB of extra bloat.

Ok. The fixed packing mode will skip files at the end of the fd table. So, I've changed my mind. I'll add a knob for a no-packing mode.

Sorry, I misunderstood. I thought we would print that message every time a PROCSTAT_FILE record gets truncated (i.e. because a file was renamed and given a longer path). If the message is only printed as a result of kernel programmer error, then ok.

Correct. PROCSTAT_FILE self-truncates with kern_proc_filedesc_out(p, sb, *sizep - sizeof(structsize)).

The printf is only in the more generate putnote routine, to catch other note emitter issues.

To be clear, the fact that PROCSTAT_FILE note records can still get (silently) truncated is the reason I'm not crazy about this change. In the CR description you claim that this feature is inherently racy, but it isn't. It can be fixed properly at the expense of some minor bloat in the core file.

Right. I'll add the knob. (Though, the contents of the filenames are still inherently racy.)

sys/kern/imgact_elf.c
1707 ↗(On Diff #8412)

Please not add a code to lecture user, to the kernel. The pritnfs could be made into a comment.

The advantage of logging is that it tells us which note produced a mismatch.

Then, I do not see it as a significant issue when the note was shrinked. It should be padded with zeroes and considered done.

Ok.

For the reversed case, AFAIU, this is limited to the file pathes, right ?

I don't know — that's why I am adding this to try and catch such problems. It also should not trigger after this patch.

If yes, writing the truncated name, or writing some placeholder, or even writing all zeroes, is fine.

The patch invokes kern_proc_filedesc_out(p, sb, *sizep - sizeof(structsize));, which simply stops emitting file entries when one would overflow the given size.

cem edited edge metadata.

Address CR feedback.

Per markj, add a knob to simply turn off file-path packing. It can be
enabled/disabled mid-coredump with no result more undesirable than some fds
being truncated from the note, so I didn't feel the need to add more strict
locking.

Per kib, do not log in the underfill case -- just pad silently.

MPSAFE is not needed for SYSCTL_INT, drop it.

In D3548#73035, @cem wrote:

Address CR feedback.

Per markj, add a knob to simply turn off file-path packing. It can be
enabled/disabled mid-coredump with no result more undesirable than some fds
being truncated from the note, so I didn't feel the need to add more strict
locking.

By "optional packing" I meant that callers of kern_proc_filedesc_out() should be able to say "don't pack my kinfo_files". This could be done with a flag argument, analogous to that for kern_proc_out(). A global sysctl doesn't seem like the right way to control this: the implications of setting it are hard to explain to non-programmers, and it forces the same behaviour on all callers, so it's inflexible. A sysctl could instead be used to determine whether this hypothetical flag is set when calling kern_proc_filedesc_out() from the file note-building code.

I'll leave it up to you to decide whether to actually do this (i.e. I'll stop arguing :)). But IMO the newly-added sysctl is in the wrong place and could just be removed.

Per kib, do not log in the underfill case -- just pad silently.

I concur with markj@ that this should only be tunable for the coredump case, not the general case. A new flag to kinfo_proc_filedesc_out() should suffice to implement that?

In D3548#73041, @markj wrote:
In D3548#73035, @cem wrote:

Address CR feedback.

Per markj, add a knob to simply turn off file-path packing. It can be
enabled/disabled mid-coredump with no result more undesirable than some fds
being truncated from the note, so I didn't feel the need to add more strict
locking.

By "optional packing" I meant that callers of kern_proc_filedesc_out() should be able to say "don't pack my kinfo_files". This could be done with a flag argument, analogous to that for kern_proc_out(). A global sysctl doesn't seem like the right way to control this: the implications of setting it are hard to explain to non-programmers, and it forces the same behaviour on all callers, so it's inflexible. A sysctl could instead be used to determine whether this hypothetical flag is set when calling kern_proc_filedesc_out() from the file note-building code.

I'll leave it up to you to decide whether to actually do this (i.e. I'll stop arguing :)). But IMO the newly-added sysctl is in the wrong place and could just be removed.

Per kib, do not log in the underfill case -- just pad silently.

In D3548#73043, @jhb wrote:

I concur with markj@ that this should only be tunable for the coredump case, not the general case. A new flag to kinfo_proc_filedesc_out() should suffice to implement that?

Sure. It also has to be passed down through all of these routines, none of which exist on 10. The backport is going to be a little painful. But that's my problem. Will do.

cem edited edge metadata.

Per markj and jhb, move the sysctl to imgact_elf and only use it to tune
coredump note behavior. Pass a flag down kinfo_proc_filedesc_out() to
pack, or not pack, kinfo_files.

Document the new sysctl in core.5.

sys/kern/imgact_elf.c
1908 ↗(On Diff #8429)

Why is this volatile?

The "_enabled" seems redundant.

1909 ↗(On Diff #8429)

Any reason not to use CTLFLAG_RWTUN?

sys/sys/user.h
555 ↗(On Diff #8429)

This line should start with a tab by style(9).

I think the flag namespace should be KERN_FILEDESC_ or so for consistency with the kern_proc_out() flags above. (As opposed to FILEDESC_OUT_.)

sys/kern/imgact_elf.c
1908 ↗(On Diff #8429)

volatile because otherwise the compiler is free to optimize out reads; writes can only come from a different thread.

'_enabled' aimed to avoid name collision with pack_kinfo, back when it was named that. I can remove it.

1909 ↗(On Diff #8429)

None.

sys/sys/user.h
555 ↗(On Diff #8429)

Ok.

cem edited edge metadata.

Per markj:

  • drop _enabled
  • CTLFLAG_RW +TUN
  • style(9)
  • rename the flag to match kern_proc_out
cem marked 3 inline comments as done.Sep 2 2015, 11:35 PM
cem edited edge metadata.

Drop volatile around pack_fileinfo, per markj.

wblock added inline comments.
share/man/man5/core.5
126 ↗(On Diff #8443)

Suggestion:

However, file paths can change at any time, including during a core dump,
and this can result in truncated file descriptor data.
129 ↗(On Diff #8443)

Suggestion:

All file description information can be preserved by disabling packing.
This potentially wastes up to PATH_MAX bytes per open fd.
Packing is disabled with
cem marked an inline comment as done.
cem edited edge metadata.

Take wblock's manual page suggestions.

cem marked an inline comment as done and an inline comment as not done.Sep 3 2015, 12:51 AM
cem marked an inline comment as done.Sep 3 2015, 12:52 AM

[inline]

share/man/man5/core.5
129 ↗(On Diff #8444)

Linux makes a distinction between file descriptor and file description, so this innocuous-seeming change requires slightly more thought. (FreeBSD does not have this distinction.) I would suggest "All file descriptor path information[...]".

sys/kern/imgact_elf.c
1717 ↗(On Diff #8444)

Please start the sentence in the comment with a capital letter.

sys/kern/imgact_elf.c
1724 ↗(On Diff #8444)

I still object to the printf. There is no use for it by average users. Kernel is not the place for the manuals. For developers, it should be either KASSERT or comment.

share/man/man5/core.5
129 ↗(On Diff #8444)

It's not just the path information. If we have to truncate, we truncate whole records. I think it's ok as-is now.

sys/kern/imgact_elf.c
1717 ↗(On Diff #8444)

The function is spelled with a lowercase letter. Do we prefer to capitalize symbols or spell them as-is?

1724 ↗(On Diff #8444)

This will only print when a corrupt corefile is written.

In such a scenario, with no log message, what is a user to do?

I will change to KASSERT.

cem marked 2 inline comments as done.Sep 3 2015, 2:18 PM
cem edited edge metadata.

Per kib, printf->KASSERT.

markj edited edge metadata.
This revision is now accepted and ready to land.Sep 3 2015, 8:22 PM
This revision was automatically updated to reflect the committed changes.