⚙ D21206 libc: add fdwalk

jhibbits created this revision.Aug 10 2019, 2:53 AM

Harbormaster completed remote builds in B25773: Diff 60631.Aug 10 2019, 2:53 AM

Ultimately we want some other sysctl to get the list of filedescs, because kern.proc.filedesc performs a huge amount of the work that is discarded.

lib/libc/stdlib/fdwalk.3
58	New line for new sentence. malloc(3) etc should be typesetted with .Xr.
61	Not only new fds not handled (really they could be handled depending on the value), but also closed and thus invalid fds might be passed to callback. The man page lacks explanation of the interface for callback. In particular, what the return value means and how callback should handle errno to make the return state of fdwalk(3) parseable.
lib/libc/stdlib/fdwalk.c
51	`if (error == -1)`
53	This might be too strict in mt environment. We typically allocate more space than reported, to allow the attempt to succeed even if other thread opened a new file descriptor between query and fetch. More, sometimes we do not give up on the fetch failure but retry several times.

Address @kib's feedback. Man page will probably require several iterations
to get right. It's still a little muddy.

I noted in the source that we should add a new sysctl to give us just the
descriptors. I added a retry, with a 5-try limit (completely arbitrary).
I'm not sure, and would like some input, on what to do if the retry count is
reached. We could walk as many as we have room for, or we could exit with
error. Right now it exits with error.

Harbormaster completed remote builds in B25779: Diff 60639.Aug 10 2019, 8:10 PM

jhibbits marked 4 inline comments as done.Aug 10 2019, 9:52 PM

It's not hard to add such a sysctl. If you want I can hack it up for you. I think it should just export the fd bitmap, then iteration over it is trivial.

In D21206#461101, @mjg wrote:

It's not hard to add such a sysctl. If you want I can hack it up for you. I think it should just export the fd bitmap, then iteration over it is trivial.

Thanks. I've got one half done now. It's probably less trivial than your thought. It's a subset of the filedesc handler, iterating over and building a sbuf of fd's. If you think just returning the fdmap is better, go for it.

Add a new sysctl, kern.proc.fdmap.<proc>, to provide the bitmap. Walk the bitmap instead of the file descriptor list.

Harbormaster completed remote builds in B25822: Diff 60708.Aug 13 2019, 3:18 AM

mjg added inline comments.Aug 13 2019, 1:07 PM

lib/libc/stdlib/fdwalk.c
51	You can use 0 to signify the current process and avoid getpid altogether.
83	I don't think the retry logic is useful. Instead you can pass a "typically big enough" buffer. Should it be not big enough, sysctl can return the real size along with ENOMEM in one call. Then you retry once with the new size. This avoids the initial call to get the size in the common case and retrying like in the above is very unlikely to be of concern for real-world programs - the bitmap itself would have to grow between calls.
97	Is this testing tmp twice? I think it would be nicer to put this stuff inside the loop and 'continue' to skip.
sys/kern/kern_descrip.c
3883	I don't think we should bother with arbitrary processes and even if we do, the common case of inspecting the current one can skip some of the work including locking the process and holding the table.
3893	This is legal but not recommended - should it block due to a page fault we can go off while holding the lock and blocking others for an indeterminate amount of time. Preferably the code would do a copy with disallowing page faults and fallback to something else if necessary. I don't know if there is a handy way to do it with sysctl. It is possible to wire the target buffer but I don't think it's worth it. So I think this would be good enough for now with making a local copy before doing SYSCTL_OUT. You can have a small on stack buffer to cover the common case and malloc in the worst case.

mjg added inline comments.Aug 13 2019, 1:12 PM

sys/kern/kern_descrip.c
3893	Also note if only the current process is of interest, the above concern goes out of the window. bitmaps are only freed when the entire struct is getting freed. Then you can simply lock the struct, read off the pointer and the size, unlock the struct and SYSCTL_OUT.

mjg added inline comments.Aug 13 2019, 7:37 PM

sys/kern/kern_descrip.c

3893

roughly this (untested):

static int
sysctl_kern_proc_fdmap(SYSCTL_HANDLER_ARGS)
{
        struct filedesc *fdp;
        NDSLOTTYPE *map;
        size_t size;

        if (*(int *)arg1 != 0)
                return (EINVAL);

        fdp = curproc->p_fd;
        FILEDESC_SLOCK(fdp);
        /*
         * We can safely access the map after dropping the lock because it can
         * only get freed when the struct filedesc is getting freed.
         * See fdgrowtable.
         */
        size = NDSLOTS(fdp->fd_nfiles) * NDSLOTSIZE;
        map = fdp->fd_map;
        FILEDESC_SUNLOCK(fdp);

        return (SYSCTL_OUT(req, map, size));
}

static SYSCTL_NODE(_kern_proc, KERN_PROC_FDMAP, fdmap,
    CTLFLAG_RD|CTLFLAG_CAPRD|CTLFLAG_MPSAFE, sysctl_kern_proc_fdmap,
    "File descriptor map");

mjg added inline comments.Aug 13 2019, 7:43 PM

sys/kern/kern_descrip.c
3893	apologies for spam, scratch that - only fd tables survive, old maps do get freed so a copying it will be necessary. I stand by not doing copyout with the lock held.

Incorporate mjg's changes.

Harbormaster completed remote builds in B25859: Diff 60766.Aug 14 2019, 1:46 AM

Don't call getpid(), we can pass 0 now.

Harbormaster completed remote builds in B25860: Diff 60767.Aug 14 2019, 1:50 AM

kib added inline comments.Aug 14 2019, 9:14 AM

lib/libc/stdlib/fdwalk.3
45	and _calls_ the callback
64	This is not quite true. Also, on internal errors from malloc(3) or sysctl(3), you return -1 and errno contains some code. Later you state 'errors are passed through' but I am not sure that this can be understood without looking at the code.
lib/libc/stdlib/fdwalk.c
47	No initialization in declaration.
88	I do not think there is a need to cast through uinptr_t.

kib added inline comments.Aug 14 2019, 9:17 AM

sys/kern/kern_descrip.c
3892	Extra {}.
3901	I remember mjg already pointed out that malloc under FILEDESC_SLOCK is undesirable.
3905	Don't you need to free map ?
3909	space around '\|' binary op.

Address @kib's feedback. The man page will probably need more iterating, but I
think the code is correct. It may not be optimal, though.

Harbormaster completed remote builds in B25879: Diff 60815.Aug 15 2019, 3:44 AM

mjg added inline comments.Aug 15 2019, 3:56 AM

lib/libc/stdlib/fdwalk.c
68	this formula looks like a leftover from previous attempt. sizes are guaranteed to be power of 2 times slot size. Also note my previous remark about calling this with a typically big enough buffer first and only retrying if needed. Each time sysctl can return the expected size so there is no need for manual calculation. Note the map is very rarely resized.
sys/kern/kern_descrip.c
3887	This + fdhold + fddrop business is avoidable for curproc. It should be special cased.
3907	You can read off the size without taking the lock, do malloc and take the lock. Then verify the size fits (it will almost always will), if it does not free and retry. The table will very rarely change the size and it will always grow. I would use a small on stack buffer by default, sized just like the initial map to avoid malloc in the common case. I'm also increasingly leaning towards extending sysctl api to add SYSCTL_OUT_NOFAULT or so to fast path and avoid all this crap in the common case. General support is already there (based on pre-wired mappings), the code just needs cosmetic changes + api extension. It should come in handy in other code. I'll probably put a review later. With this in place you can avoid the double copy. This can be taken care of at a later time if fdwalk is expected to land for 12.1
3917	This needs CTLFLAG_CAPRD so that it works in capability mode.

So I had a look at glib. What's the benefit of adding this function to libc? All consumers are going to link to glib anyway. Perhaps we should get the kernel part in and then submit a patch to glib to use it.

I had an extra look and there is an extremely worrisome comment: apparently the routine is expected to by async-signal safe. This means both malloc and realloc are off the table. This can be worked around with having a statically sized buffer and requesting the bitmap in chunks which fit the buffer, i.e. pass the buffer and an offset (size of the buffer * iteration counter). See https://gitlab.gnome.org/GNOME/glib/blob/master/glib/gspawn.c#L1182

I think I sufficiently hacked this for inclusion. First the kernel part, then fdwalk itself.

Full kernel diff is here: https://people.freebsd.org/~mjg/fdmap.diff

I thought sysctl api would always return the new len, unfortunately that's only the case when the oldptr is NULL. Perhaps this can be fixed at a later date.

So:

only support curproc for simplicity
if oldptr is NULL just return the size without populating anything. the read is no more racy than it would be with the lock held.
if the we are the only user of the table don't lock or try to cache anything, just copy it out as it is
for the rest make a copy but try to fit it in a local buffer if possible

static int
sysctl_kern_proc_fdmap(SYSCTL_HANDLER_ARGS)
{
        struct filedesc *fdp;
        NDSLOTTYPE lmap[NDSLOTS(NDFILE) * 16];
        NDSLOTTYPE *map;
        size_t bsize, csize;
        int error; 
        
        CTASSERT(sizeof(lmap) <= 128);
        
        if (*(int *)arg1 != 0)
                return (EINVAL); 
        
        fdp = curproc->p_fd;
                
        if (req->oldptr == NULL)
                return (SYSCTL_OUT(req, NULL, NFD2MAPSIZE(fdp->fd_nfiles)));
        
        if (curproc->p_numthreads == 1 && fdp->fd_refcnt == 1)
                /* No need to lock anything as only we can modify the map. */
                return (SYSCTL_OUT(req, fdp->fd_map, NFD2MAPSIZE(fdp->fd_nfiles)));
        
        /* Potential other threads modifying the map. */
        map = lmap;
        bsize = sizeof(lmap);
        for (;;) {
                FILEDESC_SLOCK(fdp);
                csize = NFD2MAPSIZE(fdp->fd_nfiles);
                if (bsize >= csize)
                        break;
                FILEDESC_SUNLOCK(fdp);
                if (map != lmap)
                        free(map, M_TEMP);
                bsize = csize;
                map = malloc(bsize, M_TEMP, M_WAITOK);
        }               
        memcpy(map, fdp->fd_map, csize);
        FILEDESC_SUNLOCK(fdp);
        error = SYSCTL_OUT(req, map, csize);
        if (map != lmap)
                free(map, M_TEMP);
        return (error);
}       
                
static SYSCTL_NODE(_kern_proc, KERN_PROC_FDMAP, fdmap,
    CTLFLAG_RD|CTLFLAG_CAPRD|CTLFLAG_MPSAFE, sysctl_kern_proc_fdmap,
    "File descriptor map");

There was api incompatibility in fdwalk - it is supposed to always return 0 on its own errors.

If bothering with retrying in the first place there is no point trying to limit the number of loops. We are guaranteed that everything will fit in at some point. Currently the kernel never shrinks the bitmap, but I future-proofed the code to account for it. With this in mind:

int
fdwalk(int (*cb)(void *, int), void *cbd)
{
        int mib[4];
        size_t oldlen, newlen;
        int error, i, j, len;
        NDSLOTTYPE *buf, tmp;

        mib[0] = CTL_KERN;
        mib[1] = KERN_PROC;
        mib[2] = KERN_PROC_FDMAP;
        mib[3] = 0;

        oldlen = 0;
        for (;;) {
                error = sysctl(mib, nitems(mib), NULL, &newlen, NULL, 0);
                if (error == -1)
                        return (0);
                if (oldlen < newlen) {
                        oldlen = newlen;
                        buf = alloca(newlen);
                }
                newlen = oldlen;
                error = sysctl(mib, nitems(mib), buf, &newlen, NULL, 0);
                if (error == 0)
                        break;
                if (errno != ENOMEM)
                        return (0);
        }

        /*
         * Go through the full file list.  The fdmap is an integral multiple of
         * sizeof(NDSLOTTYPE).
         */
        len = howmany(newlen, sizeof(NDSLOTTYPE));

        for (i = 0; i < len; i++) {
                /*
                 * Iterate over each bit in the slot, short-circuting when there
                 * are no more file descriptors in use in this slot.
                 */
                for (j = 0, tmp = buf[i];
                        j < NBBY * sizeof(NDSLOTTYPE) && tmp != 0;
                        j++, tmp >>= 1) {
                        if (tmp & 1) {
                                error = cb(cbd, FD(i, j));
                                if (error != 0)
                                        return (error);
                        }
                }
        }
        return (0);
}

Note I still don't know if inclusion in libc is a good idea, I don't have a strong opinion one way or the other. But if putting this in libc we better make sure glib is using it.

In D21206#462415, @mjg wrote:

I had an extra look and there is an extremely worrisome comment: apparently the routine is expected to by async-signal safe. This means both malloc and realloc are off the table. This can be worked around with having a statically sized buffer and requesting the bitmap in chunks which fit the buffer, i.e. pass the buffer and an offset (size of the buffer * iteration counter). See https://gitlab.gnome.org/GNOME/glib/blob/master/glib/gspawn.c#L1182

I fdwalk(3) needs to be async-signal safe, it is enough to switch from malloc(3) to either mmap(2)/munmap(2) or alloca(3).

I used alloca. It may be wasteful on the stack for a very poorly behaving process but it should be fine. I don't think paying for mmap is justified.

emaste added a subscriber: emaste.Aug 15 2019, 8:32 PM

@mjg glib does use fdwalk() if available, and emulates it where needed. This work I started specifically because I saw this same code flow in multiple ports.

sys/kern/kern_descrip.c
3905	D'oh, yes.
3909	This follows style of the rest of the file (copy&paste of another node). All other SYSCTL_NODE()s have this style violation. I will, however, fix this one.

rozhuk.im-gmail.com added a subscriber: rozhuk.im-gmail.com.Sep 14 2019, 7:23 PM

rozhuk.im-gmail.com added inline comments.

lib/libc/stdlib/fdwalk.c
68	Why not sysctl() return ENOMEM if buf size is lowr than needed? I create bug report for same behavior: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240573

lantw44_gmail.com added a subscriber: lantw44_gmail.com.Sep 15 2019, 3:28 AM

In D21206#467976, @jhibbits wrote:

@mjg glib does use fdwalk() if available, and emulates it where needed. This work I started specifically because I saw this same code flow in multiple ports.

It is true that GLib uses fdwalk when it exists, but it may be changed in the future because of safety concern: https://gitlab.gnome.org/GNOME/glib/merge_requests/574#note_403912. If we guarantee our fdwalk is safe for the use case of GLib, I think the man page should explicitly mention that it is safe to use fdwalk between fork and exec. If it is async-signal safe, fdwalk should be included in the list of safe functions in sigaction(2) man page.

A concern was also raised on the permission to debug a process: https://gitlab.gnome.org/GNOME/glib/issues/1638#note_400640. Is fdwalk going to work when debugging is disabled? If the answer is no, it is possible that GLib still have to switch to its slow fallback code when fdwalk provided by the system fails.

If fdwalk lands in the tree, it will be an async-signal safe implementation.
It will always work for the current process (i.e. when it requests the data for itself), in fact my suggested kernel interface only allows this mode.

Thanks for linking the discussion upstream. Is there any particular timeline where it would be beneficial to get this sorted out?

Take over this revision after a talk with @jhibbits.

GNOME releases are made in March and September. It is usually better to make an important change at the beginning of a release cycle, so it can be tested by more people for longer.

GLib upstream has CI configured to run tests on FreeBSD regularly, but it isn't going to help testing this change because it only runs on releases. It currently runs 11.3 and 12.0 jails on 12.0 kernel.

add an async-signal-safe implementation
fix return value to always be 0

I'll have to do some testing with glib to make sure it properly picks it up etc., but it does work for my toy program and matches all expectations.

kib added inline comments.Sep 17 2019, 12:32 PM

lib/libc/stdlib/fdwalk.c
57	I believe that the formal contract for alloca(3) is that the stack space is only freed on return from the function, not on the termination of the block scope. In other words, this loop allows for unbound stack allocation if raced.

mjg added inline comments.Sep 17 2019, 12:40 PM

lib/libc/stdlib/fdwalk.c
57	It can allocate multiple buffers, but each one is twice as big as the previous and the biggest you can allocate is loosely bounded by your fd limit. For more than two buffers to be allocated the code would effectively have to explicitly dup a fd outside of the previous range and keep doing it while winning the race for more buffers to be allocated. In other words this can use more memory only if the application is racing with itself on purpose and even then there is an upper bound to it.

kib accepted this revision.Sep 17 2019, 2:18 PM

kib added inline comments.

lib/libc/stdlib/fdwalk.c
57	Add a comment note that this code relies on the kernel returning specific values for newlen.

This revision is now accepted and ready to land.Sep 17 2019, 2:18 PM

swills added a subscriber: swills.Oct 31 2019, 5:17 PM

kevans added a subscriber: kevans.Dec 2 2019, 6:54 PM

val_packett.cool added a subscriber: val_packett.cool.Apr 8 2020, 9:26 PM

lbartoletti added a subscriber: lbartoletti.Sep 14 2020, 6:58 AM

I wrote some tests based on my reading of historical documentation, and posted those for review in D26809. I note that the current release of glib20 in the ports tree does indeed have fdwalk() #if 0'd out, so we'd need to upstream a patch to make it use it for FreeBSD.

If you can confirm that glib's fdwalk also passes your tests I'll take a fresh look at this patch and commit.

In D21206#597622, @mjg wrote:

If you can confirm that glib's fdwalk also passes your tests I'll take a fresh look at this patch and commit.

As discussed, glib's implementation is internal and is suboptimal in that it doesn't preserve the properties that fdwalk has historically had (i.e. from Solaris). The FreeBSD fallback is not even close because we don't mount procfd by default, and the Linux variant is only a little more accurate -- it will open /proc/self/fd and iterate over those, but it invokes the callback as it goes which violates the basic design of the original
solaris fdwalk: make a list of all open fds, then iterate over it and invoke the callback.

I did go ahead and run my tests[0] against OpenIndiana, and the semantics do match there.

[0] https://people.freebsd.org/~kevans/fdwalk_test_portable.c

bdragon added a subscriber: bdragon.Nov 30 2020, 5:40 PM

So one thing I was worried about here is a corner case where someone ends up with a BIGNUM descriptor, with a long list of empty fds in front of it. The way forward which would be slower than the proposed patch, but would avoid any corner cases would export an array of used fds in chunks. Somewhat similar to what's happening with readdir over /proc/pid/fd, except avoiding the majority of that overhead. I may end up hacking it up this month.

Besides, perhaps fdwalk would be a better fit for libutil?

Alternatively, looking at https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gspawn.c#L1470 , there is F_NEXTFD in solaris and it should be trivial to implement.

Final edit: glib only uses fdwalk for closefrom (if not implemented) or close-on-exec setting in close_range (if not implemented). The former was available for years now and I implemented the latter few weeks back (f3f3e3c44d3b1776653bbf19eab17ce006a815d8) so this would not be used to begin with at least by glib.

I talked to solaris people, both F_PREVFD and F_NEXTFD are considered internal and don't have documented semantics. As such, they are off, unless someone wants a FreeBSD-specific variant, named differently.

I also did code search on github. fdwalk *is* used in some places, for example gdb, thus it makes to sense to implement at some point.

Just bumped into this while poking around glib20. What's preventing getting this in?

The code works but has a nasty corner case -- what if the process has only 2 fds opened, say 0 and 8364742 (or some other high number). Then you copy out out rather huge bitmap and force it to scan it.

I think least bad way out is a sysctl which accepts a range of fds the process is interested in and a buffer + size to fill it. Then you get back an array of fds which fit and can continue.

For example you would call it with range 0..INT_MAX, but only have space for a number of fds which wont fit everything you got there. Denoting highest fd you got as HIGHEST_FD, you would then redo the call with HIGHEST_FD+1..INT_MAX and so on until you get 0, meaning there are no more fds.

This should be trivial to hack up, does not not induce a syscall for every fd and does not have nasty corner cases. I can review, but can't be arsed to write.

mjg abandoned this revision.Mar 12 2024, 10:28 PM

libc: add fdwalk
AbandonedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 62200

include/stdlib.h

lib/libc/stdlib/Makefile.inc

lib/libc/stdlib/Symbol.map

lib/libc/stdlib/fdwalk.3

lib/libc/stdlib/fdwalk.c

lib/libc/sys/sigaction.2

sys/kern/kern_descrip.c

sys/sys/sysctl.h

libc: add fdwalkAbandonedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 62200

include/stdlib.h

lib/libc/stdlib/Makefile.inc

lib/libc/stdlib/Symbol.map

lib/libc/stdlib/fdwalk.3

lib/libc/stdlib/fdwalk.c

lib/libc/sys/sigaction.2

sys/kern/kern_descrip.c

sys/sys/sysctl.h

libc: add fdwalk
AbandonedPublic
Actions

Revision Contents
Changeset List