Page MenuHomeFreeBSD

netmap: fix knote() argument to match the mutex state
ClosedPublic

Authored by vmaffione on Jan 17 2019, 5:19 PM.

Details

Summary

The nm_os_selwakeup function needs to call knote() to wake up kqueue(9) users.
However, this function can be called from different code paths, with different lock requirements.
This patch fixes the knote() call argument to match the relavant lock state.

Reported-by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>

Test Plan

To be tested with two bhyve vm connected through a VALE switch.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

vmaffione created this revision.Jan 17 2019, 5:19 PM

I tested the patch with two bhyve vm connected through a VALE switch, there is no panic.

But I have a question. Is it correct to call the KNOTE() from .f_event (netmap_knrw()) context?

sys/dev/netmap/netmap_freebsd.c
1297

I think the comment should be updated.

vmaffione updated this revision to Diff 52967.Jan 18 2019, 10:51 AM
vmaffione marked an inline comment as done.

Cleanup comments related to kqueue.

Thanks a lot for testing! Yes, a lot of comments looked really inconsistent with the code. I did some more consistent update.
It would be great if you could run your tests again, since I've done a small code change in netmap_knrw().

Regarding your question, honestly I am not sure that knote() can be called from f_event, because I don't know the details of how kevent/kqueue works.
Let's assume it can be done, and in the meanwhile I'll ask around.

Added @markj to possibly answer the question about calling KNOTE() in f_event.

Everything works fine with the latest patch (Diff 2).

Thanks a lot!

markj added a comment.Jan 18 2019, 4:53 PM
In D18876#403167, @v.maffione_gmail.com wrote:

Thanks a lot for testing! Yes, a lot of comments looked really inconsistent with the code. I did some more consistent update.
It would be great if you could run your tests again, since I've done a small code change in netmap_knrw().
Regarding your question, honestly I am not sure that knote() can be called from f_event, because I don't know the details of how kevent/kqueue works.
Let's assume it can be done, and in the meanwhile I'll ask around.

Conceptually it's kind of strange to do this. f_event() is supposed to test whether the kevent condition holds. The caller uses the return value to decide whether to activate the knote (see knote()) or to keep an activated knote on the kqueue (see kqueue_scan()). This suggests that f_event should not have side effects (aside from updating a few fields in the knote). Moreover, if you call knote() from f_event, what prevents knote() from recursing back into f_event?

I'm not familiar with netmap so it's hard to say anything useful about this specific case. IMHO the use of mtx_owned() is pretty much always a hack since it hints that the locking protocol may not be well-defined.

Thanks for the explanation. Yes, I understand that this use of f_event is unusual, and I see your point.

However, the f_event having side effects is something that comes from a long standing netmap design decision, and it is the same reason why the netmap d_poll callback has side effects (indeed f_event may end up calling d_poll). Netmap has two ioctl commands (NIOCTXSYNC and NIOCRXSYNC) that perform a synchronous non-blocking I/O operation on the NIC (or other netmap port).
So one may think that a netmap application could use poll()/select()/kevent() to wait for the NIC to be ready (to transmit or receive) and then perform a NIOCTXSYNC/NIOCRXSYNC ioctl to carry out the actual I/O. However, this requires two system calls per I/O operation. So netmap was designed in such a way to have the poll()/select()/kevent() call also perform NIOCTXSYNC and/or NIOCRXSYNC, if needed, in the context of the same system call. This is the reason why d_poll and f_event have side effects. This behavior is visible to user applications, so we can't break it.

From what I can see, uncontrolled recursion is prevented thanks to the 'hint' argument. The netmap f_event callback calls knote() with hint != 0, so that when f_event() is called again, the function returns immediately (once it sees hint != 0).

But, why do we need to call knote() (netmap_notify) from netmap_poll()?
It seems this is not necessary.

Index: sys/dev/netmap/netmap.c
===================================================================
--- sys/dev/netmap/netmap.c     (revision 343254)
+++ sys/dev/netmap/netmap.c     (working copy)
@@ -3386,9 +3386,6 @@
                        if (found) { /* notify other listeners */
                                revents |= want_tx;
                                want_tx = 0;
-#ifndef linux
-                               kring->nm_notify(kring, 0);
-#endif /* linux */
                        }
                }
                /* if there were any packet to forward we must have handled them by now */
@@ -3447,9 +3444,6 @@
                        if (found) {
                                revents |= want_rx;
                                retry_rx = 0;
-#ifndef linux
-                               kring->nm_notify(kring, 0);
-#endif /* linux */
                        }
                }

With this patch, two bhyve vm + vale switch work as expected.

That's another interesting question.
It looks like it was added to support the case where multiple threads poll() on the same netmap port. In that case, one of the thread performs the actual TXSYNC or RXSYNC, and if some work was actually performed (e.g. ring indices were advanced) the thread will call nm_notify() to wake up all the other threads.

This mechanism was already removed on Linux, and I think we should remove it also on FreeBSD. However, I won't do that in this commit, which has a different purpose.

markj added a comment.Jan 22 2019, 3:36 PM
In D18876#403783, @v.maffione_gmail.com wrote:

Thanks for the explanation. Yes, I understand that this use of f_event is unusual, and I see your point.
However, the f_event having side effects is something that comes from a long standing netmap design decision, and it is the same reason why the netmap d_poll callback has side effects (indeed f_event may end up calling d_poll). Netmap has two ioctl commands (NIOCTXSYNC and NIOCRXSYNC) that perform a synchronous non-blocking I/O operation on the NIC (or other netmap port).
So one may think that a netmap application could use poll()/select()/kevent() to wait for the NIC to be ready (to transmit or receive) and then perform a NIOCTXSYNC/NIOCRXSYNC ioctl to carry out the actual I/O. However, this requires two system calls per I/O operation. So netmap was designed in such a way to have the poll()/select()/kevent() call also perform NIOCTXSYNC and/or NIOCRXSYNC, if needed, in the context of the same system call. This is the reason why d_poll and f_event have side effects. This behavior is visible to user applications, so we can't break it.

Ok.

From what I can see, uncontrolled recursion is prevented thanks to the 'hint' argument. The netmap f_event callback calls knote() with hint != 0, so that when f_event() is called again, the function returns immediately (once it sees hint != 0).

I see, I missed that. I don't see any particular problems with the patch then.

sys/dev/netmap/netmap_freebsd.c
1325

Indentation should be by four spaces.

vmaffione updated this revision to Diff 53106.Jan 23 2019, 11:38 AM
vmaffione marked an inline comment as done.

Fixed indentation as suggested.

vmaffione updated this revision to Diff 53107.Jan 23 2019, 11:40 AM

Uploading diff with some context (-U 100).

I need some time to test latest patch.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 23 2019, 2:21 PM
This revision was automatically updated to reflect the committed changes.

I tested r343346 and found a problem.

acquiring duplicate lock of same type: "nm_kn_lock"
 1st nm_kn_lock @ /usr/src/sys/kern/kern_event.c:2366
 2nd nm_kn_lock @ /usr/src/sys/kern/kern_event.c:2366
stack backtrace:
#0 0xffffffff80c44d33 at witness_debugger+0x73
#1 0xffffffff80c44a83 at witness_checkorder+0xac3
#2 0xffffffff80bbc1b5 at __mtx_lock_flags+0x95
#3 0xffffffff80b918eb at knote+0x3b
#4 0xffffffff807ab1ae at nm_os_selwakeup+0x9e
#5 0xffffffff807a635d at netmap_notify+0x1d
#6 0xffffffff807b9fb0 at nm_vale_flush+0xb70
#7 0xffffffff807b901d at netmap_vale_vp_txsync+0x51d
#8 0xffffffff807a5c94 at netmap_poll+0x364
#9 0xffffffff807abd9f at netmap_knread+0x1f
#10 0xffffffff80b909fd at kqueue_kevent+0x56d
#11 0xffffffff80b90417 at kern_kevent_fp+0x97
#12 0xffffffff80b9032f at kern_kevent+0x9f
#13 0xffffffff80b90020 at kern_kevent_generic+0x70
#14 0xffffffff80b8ff11 at sys_kevent+0x61
#15 0xffffffff810abe96 at amd64_syscall+0x276
#16 0xffffffff8108612d at fast_syscall_common+0x101

This situation can be reproduced if four processes are launched simultaneously:

  1. pkt-gen -i vale0:0 -f tx -n 12000000
  2. pkt-gen -i vale0:1 -f tx -n 12000000
  3. netmap_test vale0:0
  4. netmap_test vale0:1

Both netmap_test process stuck in kernel on trying to get mutex.

procstat -kk 66849
  PID    TID COMM                TDNAME              KSTACK
66849 102071 netmap_test         -                   __mtx_lock_sleep+0xe8 __mtx_lock_flags+0xee knote+0x3b nm_os_selwakeup+0x9e netmap_notify+0x1d nm_vale_flush+0xb70 netmap_vale_vp_txsync+0x51d netmap_poll+0x364 netmap_knread+0x1f kqueue_kevent+0x56d kern_kevent_fp+0x97 kern_kevent+0x9f kern_kevent_generic+0x70 sys_kevent+0x61 amd64_syscall+0x276 fast_syscall_common+0x101

Where netmap test is:

#include <stdio.h>
#include <assert.h>

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

#ifndef NETMAP_WITH_LIBS
#define NETMAP_WITH_LIBS
#endif
#include <net/netmap_user.h>

static int
kq_add(int kq, int fd)
{
        struct kevent changes[1];
        int ret;

        EV_SET(&changes[0], fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        ret = kevent(kq, changes, 1, NULL, 0, NULL);
        assert(ret != -1);

        printf("[%s] success\n", __func__);

        return (ret);
}

static void
kq_wait(int kq)
{
        struct kevent events[1];
        int ret;

        ret = kevent(kq, NULL, 0, events, 1, NULL);
        assert(ret != -1);

        //printf("[%s] success\n", __func__);
}

int main(int argc, char **argv)
{
        struct nm_desc *d;
        int kq;
        unsigned char *buf;
        struct nm_pkthdr h;

        d = nm_open(argv[1], NULL, 0, 0);
        assert(d != NULL);

        kq = kqueue();
        assert(kq != -1);

        kq_add(kq, d->fd);

        int n = 0;
        while (1) {
                kq_wait(kq);
                while ( (buf = nm_nextpkt(d, &h)) ) {
                        n++;
                        //printf("%d recv len: %d\n", n, h.len);
                }
                if (n >= 1200000)
                        printf("n = %d\n", n);
        }

        return (0);
}

Thanks for reporting. Also the code sample can be useful to add some unit tests for kqueue.

It looks like this deadlock is not related to the nm_notify() callsite in the netmap_poll() function, because in the stack trace I see nm_notify() called by the VALE code. Note that nm_notify is called on a different port (https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_vale.c#L1193). It would be useful to verify that if that code is removed (https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap.c#L3389, https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap.c#L3389), you still have the deadlock.

My hypothesis is that this is an AB BA deadlock, caused by the two netmap_test processes. Process 1 calls kevent(), which grabs the nm_kn_lock lock of port 1, and calls f_event (netmap_knrw).
Then, netmap_knrw calls netmap_poll, which ends up forwarding some packets to port 2, and then calls nm_notify() on port 2 (which needs the nm_kn_lock of port 2, when calling knote()).
If another process takes the the same steps, but first grabbing the nm_kn_lock of port 2 and then the one of port 1, you have the deadlock.
But I would need way more debugging to understand if this is the case.
In any case, it looks that the f_event calling netmap_poll() is the source of the problem... We cannot avoid it, so probably the locking scheme is broken and needs to be reworked.

Or, this is caused by mtx_locked

I could not reproduce your issue, but this patch may be useful to match the actual locks involved against the corresponding netmap ports.

diff --git a/sys/dev/netmap/netmap.c b/sys/dev/netmap/netmap.c
index 8b508737..d2d9546e 100644
--- a/sys/dev/netmap/netmap.c
+++ b/sys/dev/netmap/netmap.c
@@ -896,9 +896,9 @@ netmap_krings_create(struct netmap_adapter *na, u_int tailroom)
                        ND("ktx %s h %d c %d t %d",
                                kring->name, kring->rhead, kring->rcur, kring->rtail);
                        mtx_init(&kring->q_lock, (t == NR_TX ? "nm_txq_lock" : "nm_rxq_lock"), NULL, MTX_DEF);
-                       nm_os_selinfo_init(&kring->si);
+                       nm_os_selinfo_init(&kring->si, kring->name);
                }
-               nm_os_selinfo_init(&na->si[t]);
+               nm_os_selinfo_init(&na->si[t], na->name);
        }
 
 
diff --git a/sys/dev/netmap/netmap_freebsd.c b/sys/dev/netmap/netmap_freebsd.c
index ad26d2b4..ac9ce1fd 100644
--- a/sys/dev/netmap/netmap_freebsd.c
+++ b/sys/dev/netmap/netmap_freebsd.c
@@ -75,9 +75,10 @@
 
 /* ======================== FREEBSD-SPECIFIC ROUTINES ================== */
 
-void nm_os_selinfo_init(NM_SELINFO_T *si) {
+void nm_os_selinfo_init(NM_SELINFO_T *si, const char *name) {
        struct mtx *m = &si->m;
-       mtx_init(m, "nm_kn_lock", NULL, MTX_DEF);
+       snprintf(si->name, sizeof(si->name), "%s knlock", name);
+       mtx_init(m, si->name, NULL, MTX_DEF);
        knlist_init_mtx(&si->si.si_note, m);
 }
 
diff --git a/sys/dev/netmap/netmap_kern.h b/sys/dev/netmap/netmap_kern.h
index 68b1fd76..c7c82c92 100644
--- a/sys/dev/netmap/netmap_kern.h
+++ b/sys/dev/netmap/netmap_kern.h
@@ -134,6 +134,7 @@ struct netmap_adapter *netmap_getna(if_t ifp);
 struct nm_selinfo {
        struct selinfo si;
        struct mtx m;
+       char name[32];
 };
 
 
@@ -295,7 +296,7 @@ struct netmap_priv_d;
 struct nm_bdg_args;
 
 /* os-specific NM_SELINFO_T initialzation/destruction functions */
-void nm_os_selinfo_init(NM_SELINFO_T *);
+void nm_os_selinfo_init(NM_SELINFO_T *, const char *);
 void nm_os_selinfo_uninit(NM_SELINFO_T *);
 
 const char *nm_dump_buf(char *p, int len, int lim, char *dst);