Page MenuHomeFreeBSD

Improve the epoll for socket to be compatible with Linux
Needs ReviewPublic

Authored by vico13.chen_gmail.com on Jul 5 2021, 4:07 AM.

Details

Reviewers
trasz
dchagin
Group Reviewers
Linux Emulation
Summary

The epoll behavior for Unix socket is:

  1. If both sender and receiver are shutdown, Linux reports 'EPOLLHUP | EPOLLRDHUP | EPOLLRDNORM | EPOLLIN'
  2. If only receiver is shutdown, Linux reports 'EPOLLRDHUP | EPOLLRDNORM | EPOLLIN'.
  3. For EPOLL error, Linux reports it with other epoll events but not report epoll error only once error detected.

The current code for socket only handles 'CANTRCVMORE' (receiver shutdown) in kevent filter read and only handle 'CANTSENDMORE' in kevent filter write.

For Linux, the epoll behaviors, like pipe, socket, are quite different. And this patch creates a new mechanism to report epoll events according to each component, and this patch only fixes the socket case to align Linux epoll behavior, and other components, like pipe, can be improved based on the new mechanism.

Introduce 'pevents' in kevent to record the poll events, and handle them in 'linux_event.c' to report correct epoll events to match Linux behavior.
Introduce 'pmask' in kevent to return the EPOLL events which are requested by applications.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256952

Test Plan
  1. Create a socketpair and create a epoll.
  2. Add fd0 into epoll list.
  3. epoll wait.
  4. Write to fd1 and close fd1, and read in fd0

I have created a simple test case which is compile on Linux, how to share the case?

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

hi, how was it tested? also, could you please resend a patch with full context?

vico13.chen_gmail.com edited the summary of this revision. (Show Details)
vico13.chen_gmail.com edited the test plan for this revision. (Show Details)
  1. Introduce 'pmask' in kevent to return the EPOLL events which are requested by applications.
  2. Add a test case which is compiled on Linux.
wulf added inline comments.
sys/sys/event.h
96

This breaks both API and ABI.
API gets broken though eliminating of ext[3] array member and ABI gets broken through shifting of ext member offset.

May be it is better to add new kqueue filter flag for sockets, e.g. NOTE_POLLHUP or NOTE_EPOLLHUP which have EPOLLHUP semantic?

and return ENOTCONN in output ff_flags when both NOTE_POLLHUP is set in input fflags and both sender and receiver are shutdown

I enclosed a POC patch for NOTE_HUP support to bugzilla ticket

wulf,
This patch is applied in our project which is based on FreeBSD12, and all works well, and we don't see it breaks the API and ABI. Could you please describe more details how it breaks the API and ABI?
If there are concerns for eliminating of ext[3] array, how about just introducing pevent and pmask and keep ext[4] in 'struct kevent'?

For the EPOLL, we need a generic design to make this feature to match Linux, and a good design can help us enhance this feature in futhure, but not just handle the socket case as we have find simliar issue for pipe also.

As linux, each module handle its own epoll, and then combine with the masks to report the epoll event to applications. And It is better for FreeBSD to make the similiar design.
So this patch introduce 'pevent' for record the epoll for each module (socket, pipe, or others), and 'pmask' to record the application request, and finally, combine 'pevent' and 'pmasks' in 'kevent_to_epoll'.

We have another patch to handle timeout case.

Test cases

This test case only handle EPOLLRDHUP:

  1. Run this case on Linux, it prints two lines that events are recieved( '0x2010': EPOLLHUP & EPOLLRDHUP)

​2. Run this case on BSD, it will hang as no EPOLLHUP or EPOLLRDHUP reported.​
​3. Run this case on BSD with this patch, it prints two events:

a. First: print 'eventmask 0'. EPOLLIN reported as READ, and this patch just mask it but it still wake up the epoll_wait. (we have following patch to fix this issue)
b. Second: print 'eventmask 0x2010'. EPOLLHUP & EPOLLRDHUP are recieved for close.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256952

wulf,
all works well, and we don't see it breaks the API and ABI. Could you please describe more details how it breaks the API and ABI?

if your pre-patch code uses ext[3] to store some bits of context in it, it can not be compiled and executed on post-patch system due to array out-of-bound access. This is an API breakage.
if your binary is compiled out of code which uses ext[0] to store some bits of context and it was built on pre-patch system, it can not run on post-patch system correctly anymore as ext[0] value is clobbered by pevents member which have the same offset. This is an ABI breakage.

If there are concerns for eliminating of ext[3] array, how about just introducing pevent and pmask and keep ext[4] in 'struct kevent'?

That is possible, but most probably it would result in creating new system call for kevent().
Why do you store epoll events in separate member? kqueue(2) explicitly states that ext[0] and ext[1] fields are created for that:

ext        Extended data passed to and from kernel.  The ext[0] and
           ext[1] members use is defined by the filter.  If the filter
           does not use them, the members are copied unchanged.

You need to add new filter flag e.g. NOTE_EPOLL to indicate that your code starts using ext[0] to return event data in epoll format

For the EPOLL, we need a generic design to make this feature to match Linux, and a good design can help us enhance this feature in futhure, but not just handle the socket case as we have find simliar issue for pipe also.

Do you have a plan to improve generic epoll support in FreeBSD?

sys/compat/linux/linux_event.c
237

That wont work. AFAIK EV_DISABLE-d kevents can't produce any wakeups
So ep.events = EPOLLRDHUP in your test case will never fire.
It is possible to enable kevent and set low watermark high enough here, but that results in other issues :-(

sys/compat/linux/linux_event.c
237

That wont work. AFAIK EV_DISABLE-d kevents can't produce any wakeups
So ep.events = EPOLLRDHUP in your test case will never fire.
It is possible to enable kevent and set low watermark high enough here, but that results in other issues :-(

You are right! That is another issue. In FreeBSD 13, the EVRD only includes EPOLLIN and EPOLLRDNORM. But from the application view, we can't assume the developers only request 'EPOLLRDHUP'. Like the case I wrote, EPOLLRDHUP will never fire. If we add EPOLLRDHUP and other EPOLL event into EVRD, it results other issues, we need fix them, but not simply forbidden them to be fired. So we need a wel design to meet the Linux behavior and also make sure there are no other issues in BSD.

BTW, what issues will be introduced if add EPOLLRDHUP in EVRD?

In D31037#709587, @wulf wrote:

wulf,
all works well, and we don't see it breaks the API and ABI. Could you please describe more details how it breaks the API and ABI?

if your pre-patch code uses ext[3] to store some bits of context in it, it can not be compiled and executed on post-patch system due to array out-of-bound access. This is an API breakage.
if your binary is compiled out of code which uses ext[0] to store some bits of context and it was built on pre-patch system, it can not run on post-patch system correctly anymore as ext[0] value is clobbered by pevents member which have the same offset. This is an ABI breakage.

The orginal structure is 'unit64 ext[4]', and this patch uses ext[0] for pevent and pmask, and the structure is changed to 'unit64 ext[3]'. I know some test cases is based on 'unit64 ext[4]', and there are even compling failures if apply this patch due to the change from 'unit64 ext[4]' to 'unit64 ext[3]', I beblieve we can fix these errors. Why do I make this change? I am not sure whether it will cause other issue once I simply add two members 'pevent' and 'pmask' into the struture 'struct kevent' and keep 'unit64 ext[4]' as it will increase the size of 'struct kevent'. If no concerns to add two member to 'struct event', we can keep '__unit64 ext[4]'. Any suggestions?

If there are concerns for eliminating of ext[3] array, how about just introducing pevent and pmask and keep ext[4] in 'struct kevent'?

That is possible, but most probably it would result in creating new system call for kevent().

Is that due to increase the size of 'struct kevent' if add two more member into it?

Why do you store epoll events in separate member? kqueue(2) explicitly states that ext[0] and ext[1] fields are created for that:

ext        Extended data passed to and from kernel.  The ext[0] and
           ext[1] members use is defined by the filter.  If the filter
           does not use them, the members are copied unchanged.

You need to add new filter flag e.g. NOTE_EPOLL to indicate that your code starts using ext[0] to return event data in epoll format

It is OK to just use ext[0] for 'pevent' and 'pmask' (8 bytes). But my concern is: since ext[0] is a generic define, if other components also use ext[0], it will be conflict. Who use ext[0] must confirm no others use it. It is a potential risk.
So I make 'pevent' and 'pmask' to replace 'ext[0]'.

For the EPOLL, we need a generic design to make this feature to match Linux, and a good design can help us enhance this feature in futhure, but not just handle the socket case as we have find simliar issue for pipe also.

Do you have a plan to improve generic epoll support in FreeBSD?

Yes, this patch is the start to create a flexible framework to support generic epoll. The key idea is introduced 'pevent' and 'pmask' in 'struct kevent', and let each component to implement what behavior for epoll, and the compatible linux just do the generic handling.
Definetly, I need the communtity to give more suggestions to make epoll to match the Linux behavior.

I have another patch based on this patch to solve 2 issues:

  1. No mask. If test case only requests EPOLLRDHUP, but EPOLLIN catched, the epollwait will be waken up. It does not match the Linux behavior.
  2. No timeout.
sys/sys/event.h
96

This breaks both API and ABI.
API gets broken though eliminating of ext[3] array member and ABI gets broken through shifting of ext member offset.

May be it is better to add new kqueue filter flag for sockets, e.g. NOTE_POLLHUP or NOTE_EPOLLHUP which have EPOLLHUP semantic?

It seems filter flag doesn't have enough bits, and we can't only consider EPOLLUP, and we need consider all EPOLL events. Each component sets the EPOLL event in 'pevents' or other variable (filter flag?), and we also need a 'pmask' to record what EPOLL event the applications request.

Is that possible to fix the issues (break API and ABI) due to this change? Based on FreeBSD 12, after applying this patch, all kernel build works well, and we encounter only one compiling failure due to one test case try to access 'ext[4]', and we fix it in our local.
Another approach is keep 'unit64_t ext[4]', but add two more members 'pevent' and 'pmask' into 'struct kevent', like:
struct kevent {
uintptr_t ident; /* identifier for this event */
short filter; /* filter for event */
unsigned short flags; /* action flags for kqueue */
unsigned int fflags; /* filter flag value */
int64_t data; /* filter data value */
void *udata; /* opaque user data identifier */
unsigned int pevents; /* poll events, used to Linux epoll */
unsigned int pmask; /* poll mask, used to Linux epoll */
uint64_t ext[4]; /* extensions */

Why do you store epoll events in separate member? kqueue(2) explicitly states that ext[0] and ext[1] fields are created for that:

ext        Extended data passed to and from kernel.  The ext[0] and
           ext[1] members use is defined by the filter.  If the filter
           does not use them, the members are copied unchanged.

You need to add new filter flag e.g. NOTE_EPOLL to indicate that your code starts using ext[0] to return event data in epoll format

It is OK to just use ext[0] for 'pevent' and 'pmask' (8 bytes). But my concern is: since ext[0] is a generic define, if other components also use ext[0], it will be conflict. Who use ext[0] must confirm no others use it. It is a potential risk.

Which conflict?
If application does not specify NOTE_EPOLL in changelist filter flags, kevent() behavior remains the same.
If application specifies NOTE_EPOLL in changelist filter flags, than it is definitely awared of ext[0] changes.
Also kqueue can return NOTE_EPOLL in eventlist for those filters which support ext[0] so application can check NOTE_EPOLL filter flag in returned event and use ext[0] if it is set and fallback to traditional kqueue flags/fflags if it reset.

sys/compat/linux/linux_event.c
237

BTW, what issues will be introduced if add EPOLLRDHUP in EVRD?

Adding EPOLLRDHUP in EVRD is a bad idea. It introduces unwanted wakeups if only EPOLLRDHUP event is specified.

The better way is to replace disabled kevent with enabled kevent and set low watermark high enough to get wakeups only on EV_EOF. I.e. replace EV_SET(kevent, fd, EVFILT_READ, EV_ADD|EV_DISABLE, 0, 0, 0); line with EV_SET(kevent, fd, EVFILT_READ, EV_ADD|EV_ENABLE, NOTE_LOWAT, INT_MAX, 0);

The problem here is that EPOLL_CTL_ADD can reset NOTE_LOWAT, INT_MAX back to 0, 0 in some sitiations

In D31037#711053, @wulf wrote:

Why do you store epoll events in separate member? kqueue(2) explicitly states that ext[0] and ext[1] fields are created for that:

ext        Extended data passed to and from kernel.  The ext[0] and
           ext[1] members use is defined by the filter.  If the filter
           does not use them, the members are copied unchanged.

You need to add new filter flag e.g. NOTE_EPOLL to indicate that your code starts using ext[0] to return event data in epoll format

It is OK to just use ext[0] for 'pevent' and 'pmask' (8 bytes). But my concern is: since ext[0] is a generic define, if other components also use ext[0], it will be conflict. Who use ext[0] must confirm no others use it. It is a potential risk.

Which conflict?

Two concerns:

  1. In 'linux_epoll_ctl', we set ext[0], and it means, in furture, others can't re-use it, and they must use ext[1].
  2. In 'linux_epoll_ctl', we set ext[0], and then the kevent is registered. Is there a risk that 'ext[0]' of this kevent is overrided by other components?

If application does not specify NOTE_EPOLL in changelist filter flags, kevent() behavior remains the same.
If application specifies NOTE_EPOLL in changelist filter flags, than it is definitely awared of ext[0] changes.
Also kqueue can return NOTE_EPOLL in eventlist for those filters which support ext[0] so application can check NOTE_EPOLL filter flag in returned event and use ext[0] if it is set and fallback to traditional kqueue flags/fflags if it reset.

I grep 'NOTE_EPOLL' in kernel, but I can't find it. And I also grep the linux epoll headers, and there is no NOTE_EPOLL also. Is NOTE_EPOLL a standard epoll event?

sys/compat/linux/linux_event.c
237

BTW, what issues will be introduced if add EPOLLRDHUP in EVRD?

Adding EPOLLRDHUP in EVRD is a bad idea. It introduces unwanted wakeups if only EPOLLRDHUP event is specified.

Does it mean it will affect the performance? I am not familiar with this part, could you please explain more why it introduces unwanted wakeups? Why can't we make other epoll events' behavior as the same as EPOLLIN/EPOLLRDNORM?

As the current EVRD only includes EPOLLIN and EPOLLRDNORM, it means if application only specify EPOLLRDHUP or others, the functionality can't work. We need firstly make the functionality work as Linux, and then consider the performance. How do you think?

The better way is to replace disabled kevent with enabled kevent and set low watermark high enough to get wakeups only on EV_EOF. I.e. replace EV_SET(kevent, fd, EVFILT_READ, EV_ADD|EV_DISABLE, 0, 0, 0); line with EV_SET(kevent, fd, EVFILT_READ, EV_ADD|EV_ENABLE, NOTE_LOWAT, INT_MAX, 0);

I grep NOTE_LOWAT, and only find 2 in uipc_socket.c. What's the mechanism to set NOTE_LOWAT to avoid unwanted wakeups? Does it mean the wakeup will be delayed? If the applicaiton speicfy only EPOLLRDHUP, we should return it as soon as possible.

The problem here is that EPOLL_CTL_ADD can reset NOTE_LOWAT, INT_MAX back to 0, 0 in some sitiations

In D31037#711053, @wulf wrote:

Why do you store epoll events in separate member? kqueue(2) explicitly states that ext[0] and ext[1] fields are created for that:

ext        Extended data passed to and from kernel.  The ext[0] and
           ext[1] members use is defined by the filter.  If the filter
           does not use them, the members are copied unchanged.

You need to add new filter flag e.g. NOTE_EPOLL to indicate that your code starts using ext[0] to return event data in epoll format

It is OK to just use ext[0] for 'pevent' and 'pmask' (8 bytes). But my concern is: since ext[0] is a generic define, if other components also use ext[0], it will be conflict. Who use ext[0] must confirm no others use it. It is a potential risk.

Which conflict?

Two concerns:

  1. In 'linux_epoll_ctl', we set ext[0], and it means, in furture, others can't re-use it, and they must use ext[1].

There are no 'others' in linux_epoll_ctl. Only epoll compat code uses these kevents

  1. In 'linux_epoll_ctl', we set ext[0], and then the kevent is registered. Is there a risk that 'ext[0]' of this kevent is overrided by other components?

See previous answer.

If application does not specify NOTE_EPOLL in changelist filter flags, kevent() behavior remains the same.
If application specifies NOTE_EPOLL in changelist filter flags, than it is definitely awared of ext[0] changes.
Also kqueue can return NOTE_EPOLL in eventlist for those filters which support ext[0] so application can check NOTE_EPOLL filter flag in returned event and use ext[0] if it is set and fallback to traditional kqueue flags/fflags if it reset.

I grep 'NOTE_EPOLL' in kernel, but I can't find it. And I also grep the linux epoll headers, and there is no NOTE_EPOLL also. Is NOTE_EPOLL a standard epoll event?

No, it is not standard epoll event. It should be added to sys/event.h

sys/compat/linux/linux_event.c
237

BTW, what issues will be introduced if add EPOLLRDHUP in EVRD?

Adding EPOLLRDHUP in EVRD is a bad idea. It introduces unwanted wakeups if only EPOLLRDHUP event is specified.

Does it mean it will affect the performance? I am not familiar with this part, could you please explain more why it introduces unwanted wakeups? Why can't we make other epoll events' behavior as the same as EPOLLIN/EPOLLRDNORM?

EVFILT_READ kevent fires every time when you got some data in socket receive buffer. That is not what you expect in a case of single EPOLLRDHUP

As the current EVRD only includes EPOLLIN and EPOLLRDNORM, it means if application only specify EPOLLRDHUP or others, the functionality can't work. We need firstly make the functionality work as Linux, and then consider the performance. How do you think?

See previous answer. We need some other way for single EPOLLRDHUP. Inclusion of EPOLLRDHUP in to EVRD does not work properly.

The better way is to replace disabled kevent with enabled kevent and set low watermark high enough to get wakeups only on EV_EOF. I.e. replace EV_SET(kevent, fd, EVFILT_READ, EV_ADD|EV_DISABLE, 0, 0, 0); line with EV_SET(kevent, fd, EVFILT_READ, EV_ADD|EV_ENABLE, NOTE_LOWAT, INT_MAX, 0);

I grep NOTE_LOWAT, and only find 2 in uipc_socket.c. What's the mechanism to set NOTE_LOWAT to avoid unwanted wakeups? Does it mean the wakeup will be delayed? If the applicaiton speicfy only EPOLLRDHUP, we should return it as soon as possible.

Yes, it means that wakeup will be delayed until receive buffer is filled with INT_MAX bytes or EV_EOF condition happened. That is exactly what we want in 'only EPOLLRDHUP' case. Unfortunately, NOTE_LOWAT is supported for sockets only.