Page MenuHomeFreeBSD

libc: Add fdwalk(3), commonly used in Linux software
Needs ReviewPublic

Authored by jhibbits on Sat, Aug 10, 2:53 AM.

Details

Reviewers
kib
jhb
Summary

fdwalk(3) originated in SunOS/Solaris, and is commonly used in Linux
software, including GNOME-based software, like glib and vte. It walks
the current file descriptor list and runs an arbitrary function on each
file descriptor.

Test Plan

Would be nice to have one

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 25879
Build 24448: arc lint + arc unit

Event Timeline

jhibbits created this revision.Sat, Aug 10, 2:53 AM
kib added a comment.Sat, Aug 10, 7:55 AM

Ultimately we want some other sysctl to get the list of filedescs, because kern.proc.filedesc performs a huge amount of the work that is discarded.

lib/libc/stdlib/fdwalk.3
58

New line for new sentence. malloc(3) etc should be typesetted with .Xr.

61

Not only new fds not handled (really they could be handled depending on the value), but also closed and thus invalid fds might be passed to callback.

The man page lacks explanation of the interface for callback. In particular, what the return value means and how callback should handle errno to make the return state of fdwalk(3) parseable.

lib/libc/stdlib/fdwalk.c
51

if (error == -1)

53

This might be too strict in mt environment. We typically allocate more space than reported, to allow the attempt to succeed even if other thread opened a new file descriptor between query and fetch. More, sometimes we do not give up on the fetch failure but retry several times.

jhibbits updated this revision to Diff 60639.Sat, Aug 10, 8:10 PM

Address @kib's feedback. Man page will probably require several iterations
to get right. It's still a little muddy.

I noted in the source that we should add a new sysctl to give us just the
descriptors. I added a retry, with a 5-try limit (completely arbitrary).
I'm not sure, and would like some input, on what to do if the retry count is
reached. We could walk as many as we have room for, or we could exit with
error. Right now it exits with error.

jhibbits marked 4 inline comments as done.Sat, Aug 10, 9:52 PM
mjg added a subscriber: mjg.Sat, Aug 10, 10:26 PM

It's not hard to add such a sysctl. If you want I can hack it up for you. I think it should just export the fd bitmap, then iteration over it is trivial.

In D21206#461101, @mjg wrote:

It's not hard to add such a sysctl. If you want I can hack it up for you. I think it should just export the fd bitmap, then iteration over it is trivial.

Thanks. I've got one half done now. It's probably less trivial than your thought. It's a subset of the filedesc handler, iterating over and building a sbuf of fd's. If you think just returning the fdmap is better, go for it.

jhibbits updated this revision to Diff 60708.Tue, Aug 13, 3:18 AM

Add a new sysctl, kern.proc.fdmap.<proc>, to provide the bitmap. Walk the bitmap instead of the file descriptor list.

mjg added inline comments.Tue, Aug 13, 1:07 PM
lib/libc/stdlib/fdwalk.c
51

You can use 0 to signify the current process and avoid getpid altogether.

83

I don't think the retry logic is useful. Instead you can pass a "typically big enough" buffer. Should it be not big enough, sysctl can return the real size along with ENOMEM in one call. Then you retry once with the new size.

This avoids the initial call to get the size in the common case and retrying like in the above is very unlikely to be of concern for real-world programs - the bitmap itself would have to grow between calls.

97

Is this testing tmp twice? I think it would be nicer to put this stuff inside the loop and 'continue' to skip.

sys/kern/kern_descrip.c
3833

I don't think we should bother with arbitrary processes and even if we do, the common case of inspecting the current one can skip some of the work including locking the process and holding the table.

3843

This is legal but not recommended - should it block due to a page fault we can go off while holding the lock and blocking others for an indeterminate amount of time.

Preferably the code would do a copy with disallowing page faults and fallback to something else if necessary. I don't know if there is a handy way to do it with sysctl. It is possible to wire the target buffer but I don't think it's worth it.

So I think this would be good enough for now with making a local copy before doing SYSCTL_OUT. You can have a small on stack buffer to cover the common case and malloc in the worst case.

mjg added inline comments.Tue, Aug 13, 1:12 PM
sys/kern/kern_descrip.c
3843

Also note if only the current process is of interest, the above concern goes out of the window. bitmaps are only freed when the entire struct is getting freed.

Then you can simply lock the struct, read off the pointer and the size, unlock the struct and SYSCTL_OUT.

mjg added inline comments.Tue, Aug 13, 7:37 PM
sys/kern/kern_descrip.c
3843

roughly this (untested):

static int
sysctl_kern_proc_fdmap(SYSCTL_HANDLER_ARGS)
{
        struct filedesc *fdp;
        NDSLOTTYPE *map;
        size_t size;

        if (*(int *)arg1 != 0)
                return (EINVAL);

        fdp = curproc->p_fd;
        FILEDESC_SLOCK(fdp);
        /*
         * We can safely access the map after dropping the lock because it can
         * only get freed when the struct filedesc is getting freed.
         * See fdgrowtable.
         */
        size = NDSLOTS(fdp->fd_nfiles) * NDSLOTSIZE;
        map = fdp->fd_map;
        FILEDESC_SUNLOCK(fdp);

        return (SYSCTL_OUT(req, map, size));
}

static SYSCTL_NODE(_kern_proc, KERN_PROC_FDMAP, fdmap,
    CTLFLAG_RD|CTLFLAG_CAPRD|CTLFLAG_MPSAFE, sysctl_kern_proc_fdmap,
    "File descriptor map");
mjg added inline comments.Tue, Aug 13, 7:43 PM
sys/kern/kern_descrip.c
3843

apologies for spam, scratch that - only fd tables survive, old maps do get freed so a copying it will be necessary. I stand by not doing copyout with the lock held.

jhibbits updated this revision to Diff 60766.Wed, Aug 14, 1:46 AM

Incorporate mjg's changes.

jhibbits updated this revision to Diff 60767.Wed, Aug 14, 1:50 AM

Don't call getpid(), we can pass 0 now.

kib added inline comments.Wed, Aug 14, 9:14 AM
lib/libc/stdlib/fdwalk.3
45

and _calls_ the callback

64

This is not quite true. Also, on internal errors from malloc(3) or sysctl(3), you return -1 and errno contains some code.

Later you state 'errors are passed through' but I am not sure that this can be understood without looking at the code.

lib/libc/stdlib/fdwalk.c
47

No initialization in declaration.

88

I do not think there is a need to cast through uinptr_t.

kib added inline comments.Wed, Aug 14, 9:17 AM
sys/kern/kern_descrip.c
3842

Extra {}.

3851

I remember mjg already pointed out that malloc under FILEDESC_SLOCK is undesirable.

3855

Don't you need to free map ?

3859

space around '|' binary op.

jhibbits marked 6 inline comments as done.Thu, Aug 15, 3:44 AM
jhibbits updated this revision to Diff 60815.

Address @kib's feedback. The man page will probably need more iterating, but I
think the code is correct. It may not be optimal, though.

mjg added inline comments.Thu, Aug 15, 3:56 AM
lib/libc/stdlib/fdwalk.c
67

this formula looks like a leftover from previous attempt. sizes are guaranteed to be power of 2 times slot size. Also note my previous remark about calling this with a typically big enough buffer first and only retrying if needed. Each time sysctl can return the expected size so there is no need for manual calculation. Note the map is very rarely resized.

sys/kern/kern_descrip.c
3837

This + fdhold + fddrop business is avoidable for curproc. It should be special cased.

3857

You can read off the size without taking the lock, do malloc and take the lock. Then verify the size fits (it will almost always will), if it does not free and retry. The table will very rarely change the size and it will always grow. I would use a small on stack buffer by default, sized just like the initial map to avoid malloc in the common case.

I'm also increasingly leaning towards extending sysctl api to add SYSCTL_OUT_NOFAULT or so to fast path and avoid all this crap in the common case. General support is already there (based on pre-wired mappings), the code just needs cosmetic changes + api extension. It should come in handy in other code. I'll probably put a review later. With this in place you can avoid the double copy. This can be taken care of at a later time if fdwalk is expected to land for 12.1

3867

This needs CTLFLAG_CAPRD so that it works in capability mode.

mjg added a comment.Thu, Aug 15, 4:04 AM

So I had a look at glib. What's the benefit of adding this function to libc? All consumers are going to link to glib anyway. Perhaps we should get the kernel part in and then submit a patch to glib to use it.

mjg added a comment.Thu, Aug 15, 4:11 AM

I had an extra look and there is an extremely worrisome comment: apparently the routine is expected to by async-signal safe. This means both malloc and realloc are off the table. This can be worked around with having a statically sized buffer and requesting the bitmap in chunks which fit the buffer, i.e. pass the buffer and an offset (size of the buffer * iteration counter). See https://gitlab.gnome.org/GNOME/glib/blob/master/glib/gspawn.c#L1182

mjg added a comment.Thu, Aug 15, 7:44 AM

I think I sufficiently hacked this for inclusion. First the kernel part, then fdwalk itself.

Full kernel diff is here: https://people.freebsd.org/~mjg/fdmap.diff

I thought sysctl api would always return the new len, unfortunately that's only the case when the oldptr is NULL. Perhaps this can be fixed at a later date.

So:

  • only support curproc for simplicity
  • if oldptr is NULL just return the size without populating anything. the read is no more racy than it would be with the lock held.
  • if the we are the only user of the table don't lock or try to cache anything, just copy it out as it is
  • for the rest make a copy but try to fit it in a local buffer if possible
static int
sysctl_kern_proc_fdmap(SYSCTL_HANDLER_ARGS)
{
        struct filedesc *fdp;
        NDSLOTTYPE lmap[NDSLOTS(NDFILE) * 16];
        NDSLOTTYPE *map;
        size_t bsize, csize;
        int error; 
        
        CTASSERT(sizeof(lmap) <= 128);
        
        if (*(int *)arg1 != 0)
                return (EINVAL); 
        
        fdp = curproc->p_fd;
                
        if (req->oldptr == NULL)
                return (SYSCTL_OUT(req, NULL, NFD2MAPSIZE(fdp->fd_nfiles)));
        
        if (curproc->p_numthreads == 1 && fdp->fd_refcnt == 1)
                /* No need to lock anything as only we can modify the map. */
                return (SYSCTL_OUT(req, fdp->fd_map, NFD2MAPSIZE(fdp->fd_nfiles)));
        
        /* Potential other threads modifying the map. */
        map = lmap;
        bsize = sizeof(lmap);
        for (;;) {
                FILEDESC_SLOCK(fdp);
                csize = NFD2MAPSIZE(fdp->fd_nfiles);
                if (bsize >= csize)
                        break;
                FILEDESC_SUNLOCK(fdp);
                if (map != lmap)
                        free(map, M_TEMP);
                bsize = csize;
                map = malloc(bsize, M_TEMP, M_WAITOK);
        }               
        memcpy(map, fdp->fd_map, csize);
        FILEDESC_SUNLOCK(fdp);
        error = SYSCTL_OUT(req, map, csize);
        if (map != lmap)
                free(map, M_TEMP);
        return (error);
}       
                
static SYSCTL_NODE(_kern_proc, KERN_PROC_FDMAP, fdmap,
    CTLFLAG_RD|CTLFLAG_CAPRD|CTLFLAG_MPSAFE, sysctl_kern_proc_fdmap,
    "File descriptor map");

There was api incompatibility in fdwalk - it is supposed to always return 0 on its own errors.

If bothering with retrying in the first place there is no point trying to limit the number of loops. We are guaranteed that everything will fit in at some point. Currently the kernel never shrinks the bitmap, but I future-proofed the code to account for it. With this in mind:

int
fdwalk(int (*cb)(void *, int), void *cbd)
{
        int mib[4];
        size_t oldlen, newlen;
        int error, i, j, len;
        NDSLOTTYPE *buf, tmp;

        mib[0] = CTL_KERN;
        mib[1] = KERN_PROC;
        mib[2] = KERN_PROC_FDMAP;
        mib[3] = 0;

        oldlen = 0;
        for (;;) {
                error = sysctl(mib, nitems(mib), NULL, &newlen, NULL, 0);
                if (error == -1)
                        return (0);
                if (oldlen < newlen) {
                        oldlen = newlen;
                        buf = alloca(newlen);
                }
                newlen = oldlen;
                error = sysctl(mib, nitems(mib), buf, &newlen, NULL, 0);
                if (error == 0)
                        break;
                if (errno != ENOMEM)
                        return (0);
        }

        /*
         * Go through the full file list.  The fdmap is an integral multiple of
         * sizeof(NDSLOTTYPE).
         */
        len = howmany(newlen, sizeof(NDSLOTTYPE));

        for (i = 0; i < len; i++) {
                /*
                 * Iterate over each bit in the slot, short-circuting when there
                 * are no more file descriptors in use in this slot.
                 */
                for (j = 0, tmp = buf[i];
                        j < NBBY * sizeof(NDSLOTTYPE) && tmp != 0;
                        j++, tmp >>= 1) {
                        if (tmp & 1) {
                                error = cb(cbd, FD(i, j));
                                if (error != 0)
                                        return (error);
                        }
                }
        }
        return (0);
}

Note I still don't know if inclusion in libc is a good idea, I don't have a strong opinion one way or the other. But if putting this in libc we better make sure glib is using it.

kib added a comment.Thu, Aug 15, 11:39 AM
In D21206#462415, @mjg wrote:

I had an extra look and there is an extremely worrisome comment: apparently the routine is expected to by async-signal safe. This means both malloc and realloc are off the table. This can be worked around with having a statically sized buffer and requesting the bitmap in chunks which fit the buffer, i.e. pass the buffer and an offset (size of the buffer * iteration counter). See https://gitlab.gnome.org/GNOME/glib/blob/master/glib/gspawn.c#L1182

I fdwalk(3) needs to be async-signal safe, it is enough to switch from malloc(3) to either mmap(2)/munmap(2) or alloca(3).

mjg added a comment.Thu, Aug 15, 12:10 PM

I used alloca. It may be wasteful on the stack for a very poorly behaving process but it should be fine. I don't think paying for mmap is justified.

emaste added a subscriber: emaste.Thu, Aug 15, 8:32 PM