Details

Reviewers

imp
tsoome
allanjude
bz
ashfixit_gmail.com

Commits

rG3daa8e165c66: pxeboot: improve and simplify rx handling

Summary

This pushes the bulk of the rx servicing into a single loop that's only
slightly convoluted, and it addresses a problem with rx handling in the
process. If we hit a tx interrupt while we're processing, we'd
previously drop the frame on the floor completely and ultimately
timeout, increasing boot time on particularly busy hosts as we keep
having to backoff and resend.

After this patch, we don't seem to hit timeouts at all on zoo anymore
though loading a 27M kernel is still relatively slow (~1m20s).

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kevans created this revision.Aug 12 2021, 3:02 AM

Herald added a subscriber: dab. · View Herald TranscriptAug 12 2021, 3:02 AM

kevans requested review of this revision.Aug 12 2021, 3:02 AM

Harbormaster completed remote builds in B41005: Diff 93580.Aug 12 2021, 3:03 AM

I've largely based this off of what Figure 3-5 in the PXE 2.1 spec seems to claim we should do... (and observing on zoo that we are seeing tx interrupts interlaced while we're processing a single rx frame)

This is something I was suspecting to happen:) Still need to read the resulting code carefully. Which method was used to load kernel, tftp or nfs?

tsoome added inline comments.Aug 12 2021, 8:55 AM

stand/i386/libi386/pxe.c
518	This is not correct. We get Frame, with FrameLength and allocate buffer for it. The actual data size in Frame, is BufferLength and to receive full frame, we may need to iterate PXENV_UNDI_ISR_IN_GET_NEXT till we have whole frame (FrameLength). "FrameLength: This parameter contains the total length of the receive frame. A receive frame may contain more than one data buffer. If FrameLength is not the same as BufferLength, the application will have to call PXENV_UNDI_ISR several times to receive the complete frame. In the case of the multi buffered frame, all buffers (except the last one) must contain at least 512 bytes of data (this does not include the media header). In other words, the minimum buffer size is 512 bytes plus the size of the media header." That is, we do get one frame, but it may be split to at least 512B chunks.

kevans added inline comments.Aug 12 2021, 2:09 PM

stand/i386/libi386/pxe.c
518	Right, that's what we do. This is the difference between Frame and frame; we're actually reading the end of the frame (spread out over multiple buffers) and able to collect a whole new frame, but this function can only deal with a single packet at a time so we must not traverse into the next. Note that `rsize` and `size` haven't changed, the latter still tracks the complete frame size.

kevans added inline comments.Aug 12 2021, 2:14 PM

stand/i386/libi386/pxe.c
518	(I would have liked to rename the local `frame` to `framebuf` or something to help disambiguate)

I've got one confirmed instance of ~30-75min pxeboot time dropped to ~5min

This looks reasonable. Any idea of the 'chaining packets' comment / TODO item would reduce the loading time substantially or not? Or is the problem elsewhere?

I don't see anything major wrong with this, but I've not fact-checked it against the pxeboot standards, so consider this what the linux world wold call 'acked by'

I don't know about timeouts or recent other breakage (maybe fixed again).
Thanks a lot for working on this! Very much appreciated!

1m20s for an amd64 GENERIC kernel is still way out of what should be possible is my guts feeling.
One thing I keep thinking of is loading kernel via NFS vs. tftp (there used to be a LOADER_TFT_SUPPORT make.conf setting?).
I wonder if anyone looked into that lately to see if one or the other makes a difference and it may give another clue?

Discussed with tsoome: record if we have another ethernet frame pending after
handling this one, and skip the START/PROCESS dance in the next call. Primitive
testing in zoo seems to suggest that we hit the data_pending case 99% of the
time, but that may have been artificially inflated by the delay added to observe
that...

Harbormaster completed remote builds in B41192: Diff 94150.Aug 24 2021, 11:33 PM

Ok, after some thinking, I did realize, we need to restructure this code some more. The problem is, pxe_netif_receive() should return complete Frame unless there is nothing coming, but UNDI ISR can end up in different states, and in some cases, we would want to receive again. Or to put other way - the current implementation is using pxe_netif_receive() as interrupt routine for UNDI, but UNDI interrupt and pxe_netif_receive have different semantics.

So, the idea should be something like:

have main loop to read data in pxe_netif_receive() - its job is to re-call ISR receive if needed.
ISR receive should call getnext or start+in process like it is currently doing.

after PXENV_UNDI_ISR_IN_PROCESS, we must check for PXENV_UNDI_ISR_OUT_TRANSMIT (and use PXENV_UNDI_ISR_IN_GET_NEXT to get to PXENV_UNDI_ISR_OUT_DONE or PXENV_UNDI_ISR_OUT_BUSY, on which we return and pxe_netif_receive() will call us again.

in case of PXENV_UNDI_ISR_OUT_RECEIVE, we pick up Frame and use PXENV_UNDI_ISR_IN_GET_NEXT to receive all buffer fragments (rsize < size), if we still do not have PXENV_UNDI_ISR_OUT_DONE, it means we have another Frame, so we set data_pending, return to pxe_netif_receive(), it will exit loop and return the data. Next call to pxe_netif_receive() will process next frame.

This way we can pick up whole frame even when PXE is using split buffer, we do get the frame when there was still transmit going on and in case we got some other issue (busy/status code).

Restructure a little more and handle PXENV_UNDI_ISR_OUT_BUSY. We can avoid
having to free/reallocate isr if we're about to restart the operation anyways.

I've also re-worded the comment around data_pending to make it clear that we
don't really know if we would have caught another packet or not. The subsequent
PXENV_UNDI_ISR_IN_GET_NEXT call could just as easily return a TRANSMIT, but we
would need to be prepared to buffer another packet in case it doesn't if we
wanted to chase it all the way to DONE.

Harbormaster completed remote builds in B41203: Diff 94177.Aug 25 2021, 5:26 PM

Simplify the terminal condition of the loop a little more (we check Status both
before the loop and after the GET_NEXT call) and indicate that we'll often not
hit it.

Harbormaster completed remote builds in B41205: Diff 94179.Aug 25 2021, 5:29 PM

Hopefully final update for a while: on the off chance that we did only catch
tx interrupts, make sure we restart just as we did if we caught BUSY.

Harbormaster completed remote builds in B41206: Diff 94182.Aug 25 2021, 5:44 PM

Other than those few nits, it seems to be good (I hope the hw tests will also confirm;)