Page MenuHomeFreeBSD

Add new vnode dumper to support live minidumps
ClosedPublic

Authored by mhorne on Jan 10 2022, 7:22 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, May 3, 4:56 AM
Unknown Object (File)
Fri, May 3, 4:56 AM
Unknown Object (File)
Fri, May 3, 4:56 AM
Unknown Object (File)
Fri, May 3, 4:56 AM
Unknown Object (File)
Fri, May 3, 4:55 AM
Unknown Object (File)
Fri, May 3, 4:55 AM
Unknown Object (File)
Fri, May 3, 4:55 AM
Unknown Object (File)
Fri, May 3, 4:52 AM

Details

Summary

This dumper can instantiate and write the dump's contents to a
file-backed vnode.

Unlike existing disk or network dumpers, the vnode dumper should not be
invoked during a system panic, and therefore is not added to the global
dumper_configs list. As such, it requires a different interface to
invoke the live dump. This is done with the kern.livedump sysctl, which
accepts a path to the output file as its argument. A future patch to
savecore(8) will make use of this.

A follow-up patch will add a way to specify compression for the live
dump, but encryption is considered uninteresting for the use case.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

For now, I have elected to avoid the typical path for registration of a new dumper via dumpon(8) and cdev interface. The main reason is that the live dumper has a totally different lifecycle than existing dumpers; it can be considered 'always on'.
Using a sysctl to invoke/configure the dump is nothing glamorous, but it is simple. Doing this via ioctl would require more code and the addition of a new ioctl command to 'invoke dump' on a device. It also would mean that dumpon -l would always return /dev/livedump, which is somewhat misleading.

The intended interface via savecore looks like: savecore -L [-Zz] /var/crash (allowing one to request zstd or gzip compression, respectively).

sys/kern/kern_vnodedumper.c
149–151

These semantics are copied from coredump() in kern_sig.c.

sys/kern/kern_shutdown.c
400

We don't initialize stoppcbs[] anywhere, which makes sense since nothing is being stopped, but doesn't kgdb expect to find them?

sys/kern/kern_vnodedumper.c
86

I think it would be better to use an sx lock to provide mutual exclusion.

101

I don't think it's good to enable this in capability mode: the sysctl allows one to overwrite arbitrary files. If it's done this way to make savecore work, I'd prefer to simply not enter capability mode when using this interface; in normal operation savecore reads metadata off of the dump device during boot and thus should be sandboxed.

112

Why does this need to be invoked as a callback? Can't the sysctl handler set up all of the required context before calling minidumpsys()?

151

I think these checks should live in a shared subroutine.

190

IO_NODELOCKED means that the caller holds a range lock for the region, but here no range locks are held, so nothing blocks concurrent I/O to the file. See coredump().

224

Can factor this cleanup into a standalone function?

mjg added inline comments.
sys/kern/kern_vnodedumper.c
69

why are you providing a path instead of opening the intended file in savecore and passing a fd? then you would provide a referenced vnode in di.priv.

sys/kern/kern_shutdown.c
400

AFAIR stoppcbs contains per-cpu registers state. It does not seems to be too useful to dump something there, because the snapshot is not consistent. The context of stoppcb[cpuX] should match the state of the thread executing on cpuX, and pcpu[cpuX] (which contains e.g. curthread pointer).

If all this data is not consistent, I suspect both tools that look into dump, and people trying to analyze the dump, would be very confused.

sys/kern/kern_vnodedumper.c
154

EFAULT is a strange error code for the conditions there.

Handle comments.

  • Switch to passing file descriptor via sysctl rather than a path
  • Hold the vnode and range locks for the entirety of the dump
mhorne added inline comments.
sys/kern/kern_vnodedumper.c
69

Mainly it was a result of prototyping, but passing a fd is obviously much more sane. Thanks!

101

Yeah, this was for the benefit of savecore. This should be safer now that we are passing an opened file descriptor, yes?

112

Note that we still require a callback for a pseudo-dumper like this, otherwise dump_start() will try to compute the starting offset based on mediasize and blocksize, which only really makes sense for a disk device.

190

Thanks, I think I've got a better grasp on the different file/vnode locks now, please have another look.

There is also the 'advisory lock' which I see manipulated in other places, but I haven't touched it here. If I understand it right, this is mainly for the benefit of userspace programs using F_GETKL/F_SETLK, and related interfaces. If so, I don't think there's any reason to mess with it in this case.

sys/kern/kern_vnodedumper.c
75

Why cap_fcntl_rights? Don't we (at least) want CAP_WRITE? And it must be verified that fd was opened for writing, i.e., FWRITE is set. I can't see how that happens here.

mhorne marked 4 inline comments as done.

Use cap_write_rights rather than cap_fcntl_rights.

mhorne marked an inline comment as done.

Also check FWRITE for the file descriptor.

sys/kern/kern_vnodedumper.c
96

Order is reversed, range locks come before vnode locks.

99

I do not have explicit arguments, but I think that by conventions and common sense, this lock should be taken before VFS locks.

161

Flags should be IO_RANGELOCKED | IO_NODELOCKED, since you took both. You should have got the deadlock right away, no?

Handle comments from kib; re-order lock acquisition.

mhorne added inline comments.
sys/kern/kern_vnodedumper.c
161

It did not deadlock. It appears that IO_RANGELOCKED only has an effect when IO_NODELOCKED is not set. I did not see any other callers that specify both flags, but I can do this here if it is intended to be done this way.

IMO this feature requires a new privilege like PRIV_LIVEKDUMP and corresponding priv_check(), not just sysctl permission model.

sys/kern/kern_vnodedumper.c
69

It is relatively weird to pass fd to sysctl handler, as well.

I suspect that the right way to do it is through a syscall. For instance, it might be an ioctl on /dev/mem, which would imply permissions to access both the dumping vnode and /dev/mem.

I do not insist strongly, but sysctl taking fd is rare. I intended to say 'nonexistent', but we have debug.ftry_reclaim_vnode, although it is purely debugging interface.

114

Why allowing this for a capability mode process?

123

Exactly what is unsupported? I understand that the condition was not true, but what does the condition mean? Supplied key length is wrong, or what?

163

again, so what? 'error writing livedump' or similar.

Why not uprintf()?

sys/kern/kern_vnodedumper.c
69

Indeed, I do not love doing this with a sysctl, I know it is ugly. I considered the alternatives of a new syscall or creating a new livedump cdev, but both seem to be a little overkill for what is a niche debugging feature. Perhaps this is less of a problem than I think. Adding an ioctl to /dev/mem might be a happy medium, I will think on this.

114

savecore(8) runs in capability mode. Invoking the sysctl could be done via libcasper, but then it is run by a separate process and we lose the meaning of the passed fd.

I suppose it does open a small but undesirable attack surface.

sys/kern/kern_vnodedumper.c
69

I like the idea of using an ioctl on /dev/(k)mem. That's the interface used by kgdb to inspect a live system, for instance. It also sidesteps the Capsicum issue.

Change the livedump interface to an ioctl on /dev/mem.

This changes the lifecycle of the vnode dumper. Now, it is allocated upon
first use and re-allocated when the configuration changes.

Use uprintf() for error messages. TODO: update mem(4)

sys/dev/mem/memdev.c
125

"invoke" doesn't seem like the right verb unless it's MEM_INVOKE_LIVEDUMP. Maybe just MEM_MINIDUMP or MEM_KERNELDUMP?

sys/kern/kern_vnodedumper.c
95

getvnode() bumps the refcount of the returned file. The reference needs to be explicitly released with fdrop().

  • Rework ioctl to accept compression as a parameter
  • Rename to MEM_MINIDUMP
  • Include the updates to mem(4)
  • Ensure we fdrop() the file pointer before return
  • Have livedump_start() return EBUSY when a livedump is already in progress (as opposed to queueing it)

Shouldn't the new sysctl be documented somewhere?

Shouldn't the new sysctl be documented somewhere?

The change description needs updating; we will no longer be adding a new sysctl.

pauamma_gundo.com added inline comments.
share/man/man4/mem.4
255

I'd split that in 3 .Ss:

  • MEM_EXTRACT_PADDR above this (which error it returns if the address is not valid) isn't specified)
  • MEMRANGE_GET and MEMRANGE_SET right here
  • MEM_MINIDUMP (see below)
275

This is where I guess the MEM_MINIDUMP .Ss should start.

This revision now requires changes to proceed.Mar 16 2022, 11:53 PM

This looks good to me overall. My comments just refer to localized issues.

share/man/man4/mem.4
209

A minidump is really a specific format for a kernel core dump. I'd suggest making the terminology more general and just note that the output kernel core dump is in the format of a minidump. (Someday we might want to support live full memory dumps as well.)

236

I'd maybe be a bit more specific, along the lines of "minidumps taken from a running system may have inconsistent kernel data structures and thus may not be usable, especially if the system was under load. Despite this, live minidumps can be useful for offline debugging of certain types of kernel bugs, such as deadlocks."

In other words, I think we should motivate the existence of this feature a bit.

278
sys/kern/kern_vnodedumper.c
59

We should have this function explicitly check for kmem read privileges:

error = priv_check(curthread, PRIV_KMEM_READ);
if (error)
    bail

memopen() already does this, but someone might someday add a new caller of livedump_start() which omits the check.

96

We need to call dumper_destroy() here I think.

131

This could probably be a KASSERT no?

sys/sys/memrange.h
67

I'd suggest pre-allocating an int fields field. For instance, MEM_MINIDUMP fails if another livedump is in progress, but we might want to have a mode that blocks until the ongoing livedump is finished. Or we might want a mode which tries to suspend running threads before collecting the dump, etc.. For now we can simply define no flags, require callers to set flags = 0, and verify that in livedump_start().

mhorne marked 9 inline comments as done.

Handle markj and pauamma's comments.

share/man/man4/mem.4
255

Done in D34574.

With that nit fixed, LGTM.

share/man/man4/mem.4
246

Consistent naming.

This revision is now accepted and ready to land.Mar 22 2022, 8:10 PM