Page MenuHomeFreeBSD

Vmbus and storage enhancements for HyperV. FreeBSD Bug 195238. Micorsoft OSTC.
ClosedPublic

Authored by whu on Feb 25 2015, 8:51 AM.

Details

Reviewers
royger
Summary

This is the Hyper-V driver enhancements from Micosoft Open Source
Technology Center. The Hyper-V vmbus and storage drivers were re-written to
support multi-channel, vector interrupt, signal optimization, and scatter
and gather read and write. Thanks Delphij and Royger for early reviews and
other helps.

Test Plan

Tested by Microsoft OSTC team on Hyper-V (Windows Server 2012R2,
2012, 2008) with 64 bit and 32 bit FreeBSD guest OS. Also tested on physical
installation.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

whu retitled this revision from to Vmbus and storage enhancements for HyperV. FreeBSD Bug 195238. Micorsoft OSTC..
whu updated this object.
whu edited the test plan for this revision. (Show Details)
whu added a reviewer: royger.
whu added subscribers: jhb, delphij, gibbs.

Hi,

I have a few (minor) style comments in line.

By the way, what does the new 'options HYPERV' do? It seems like that the only reason is to make sure that the kernel have that apic_vector.S hook? (Is it possible to move the hook into a dedicated HyperV .S file and installed on demand via setidt? This way it would be easier to make modifications in the future when a new HyperV module is available).

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
222

All caller seems to be within this file (line 1806). It's probably a good idea to define this one as static (unless there is other compelling reasons, of course).

320

Style consistency (minor): the { }'s are unneeded here and is inconsistent with the the change that resulted in line 433.

328

We usually recommends memset(&obj, 0, sizeof(obj)) instead of sizeof(type) as it's easier to read.

Hi Xin,

Thanks for the review. I will incorporate those style changes today. Regarding to the 'HYPERV' option, very likely we will add new code (such as SRIOV support) soon and this will be more useful to be there. Adding idt vector code in apic_vector.S is aligned with other XEN's similar implementation. Also even it is compiled, the code would not be used unless the vmbus device is detected, which only happens on Hyper-V. So can we keep this part of the code as it is?

  • Incorporate review comments from Xing (Delphij@).
sys/dev/hyperv/include/hyperv.h
375

Nit: s/splited/split/

443–445

A few wording suggestions:

"Starting with win8, this field is used to specify..."

"Before win8, all incoming channel interrupts are only
delivered on CPU 0. Setting this value to 0 preserves
the earlier behavior."

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
389

Waiting for 500 ticks may not be the best as it isn't a fixed time scale (some kernels use 1000 ticks a second, some do 100, and it's actually a user-tunable parameter). I saw from KY's earlier comments below that another instance was intended to be 5 seconds. If you want to do 5 seconds use '5 * hz'. If you want to do a half-second, use 'hz / 2'. If you want something like 300 milliseconds you can use '300 * hz / 1000'. Newer versions of FreeBSD also support more fine-grained ways to specify times (e.g. you can specify timeouts in milliseconds or microseconds directly using sbintime_t, compare callout_reset() to callout_reset_sbt()). If you would like this, I can add a 'sema_timedwait_sbt()' variant that you can use to specify a more precise timeout value.

sys/dev/hyperv/vmbus/hv_channel_mgmt.c
237

Why is this lock a spin mutex? Both of the places it is used it should be fine to use a regular mutex (MTX_DEF) instead, and regular mutexes are preferred.

300

Looking at the existing hyper-v code and the changes in this patch, I think this lock should also be fine to be MTX_DEF instead of MTX_SPIN. You only really need MTX_SPIN in FreeBSD if you are part of the scheduler implementation or if you use a lock inside an interrupt filter.

I see that hyperv does use an interrupt filter, but it only uses it to schedule other interrupt handlers, it doesn't use any locks in that filter routine, so all of the mutexes in hyperv should be fine to be MTX_DEF instead of MTX_SPIN.

sys/dev/hyperv/vmbus/hv_vmbus_drv_freebsd.c
376

You don't actually need the '&' for the i386 case here, but this change is otherwise fine. (Sorry, I forgot to look at this specific part you had asked me about earlier.)

  • Incorporate review comments from jhb.

Thanks John for the review. I just updated the code with your review comments.

sys/amd64/conf/GENERIC
362

I would recommend that options HYPERV is placed together with device hyperv and a comment is added to note that they need to be added or removed together (just like the Xen case).

sys/amd64/conf/NOTES
508

Same here.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
145

It would be nice that all the structs in the file are aligned in the same way (tab or space). It can be done in a different patch however.

325

Unneeded braces.

431

space between return and (ENODEV).

458

IMHO I would remove spaces between function calls and checks for error codes.

536

Can this be folder together with the previous if?

823

Should this printf and KASSERT be turned into a panic? Or is this expected to crash on debug kernel builds but work on non-debug kernels?

971

If you are using M_WAITOK sgl_node will never be NULL, so you can remove the if (or replace it with a KASSERT if you are really paranoid.

978

Same here.

987

And here.

1652

Missing parentheses around the return value. FreeBSD doesn't use negative return values, although this is a private function so I don't think it matters much.

1718

Space between return and (0);

1848

Parenthesses.

1853

IMHO this seems to duplicate the code from the busdma API by creating a custom bounce buffer function. I would recommend that you look at using BUS_DMA(9). The Xen blkfront driver uses it, so you can probably take it as an example.

sys/dev/hyperv/vmbus/hv_channel_mgmt.c
305–308

Since memcmp doesn't return a boolean value the comparison should be == 0.

413

Same here.

  • Incoporate review comments from royger.
sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1853

Have you looked into using the BUS_DMA(9) framework in order to replace your custom bounce buffer code? I don't see it addressed in the last patch and no rationale has been provided about why you need you custom bounce buffer implementation.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1853

Actually we had evaluated to see if we can leverage BUS_DMA before this code was written. It turned out that there is no callback API to check the page_alignment of middle segments before busdam cna decide if a bounce buffer is needed for a particular segment. There is a callback, ie "bus_dma_filter_t *filter", but the parameters are not sufficient for the storvsc driver.

Also we want the driver code be eventually used in FreeBSD 9 based vendor code. To have our own implementation could give us flexibility and ease our test effort with this regard.

1853

Sorry, the comments sits in my browser for couple weeks and I forgot to click the submit button.

Actually both us and Netapp(our storage partner) had evaluated to see if we can leverage BUS_DMA before this code was written. It turned out that there is no callback API to check the page_alignment of middle segments before busdam cna decide if a bounce buffer is needed for a particular segment. There is a callback, ie "bus_dma_filter_t *filter", but the parameters are not sufficient for the storvsc driver.

Also we want the driver code be eventually used in FreeBSD 9 based vendor code. To have our own implementation could give us flexibility and ease our test effort with this regard.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1853

I would like to hear the opinions of the others in this regard, but IMHO I would rather have BUS_DMA fixed to suit your needs rather than hand rolling your own bounce buffer. Maybe someone else will also make use of this feature in other divers, in which case we can reduce code duplication.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1853

Thanks Roger. Actually the bounce buffer request was coming from our partner NetApp. They need to have the bounce buffer implemented to support its virtual appliance in Azure. Except for that, we don't see any need for this as all reads/writes were aligned in normal case. As I mentioned earlier we discussed the ways of implementing it with NetApp and decided to go with our own code due to the limitations current BUS_DMA code have.

We had thought about adding new interfaces into BUS_DMA. But since NetApp's VA is based on FreeBSD 9 and we need to move fast to accommodate our partner's request due to business reasons. The code has already been adopted by NetApp in its VA product.

This doesn't mean we will not fix this issue. Your point is well taken. I will investigate the possibility of adding interfaces in BUS_DMA. Once we feel comfortable to switch, we will have this changed and check in a new version. Since we are now way behind our schedule, for the time being can we get this in the head first? Thanks so much!

Hello,

If we indeed take the code as-is, the BUS_DMA limitations that prevented it's usage need to be clearly written down as a comment, so we know what's missing and when can this code be switched.

From your previous comment, I think the problem you were seeing with current BUS_DMA code is that there's no way to get the address of each segment in the sgl array (bus_dma_segment_t[]) aligned to the start of a page?

whu edited edge metadata.
  • Add comment in storvsc driver for the potential BUS_DMA fix and using it for bounce buffer.
sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1777

I'm not an expert on BUS_DMA code, but IIRC setting the alignment to PAGE_SIZE when calling bus_dma_tag_create should make sure all segments start at a page boundary. Xen blkfront also relies on this in order to work.

  • Change the comment in storvsc driver to better state the reason
sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

I just change the comment a little bit to better clarify why BUS-DMA's API is not suitable currently. This is the excerpt from Hovy Xu @NetApp. As I mentioned earlier, this code was specifically written fro NetApp's ONTAP product and NetApp has evaluated and tested the code in its facility. The email was sent on 10/9/2014 from Hovy:

"I just took a closer look at the APIs provided by busdma. There is no a callback API to check the page alignment of middle segments before busdma can decide if a bounce buffer is needed for a particular segment. There is a callback, ie “bus_dma_filter_t *filter”, but the parameters are not sufficient for storvsc driver.

If busdma does not work for storvsc, your scatter-gather patch is the only option."

If you take a look of storvsc_check_bounce_buffer_sgl(), it needs a list of addresses to check if bounce buffer is needed or not, while the bus_dma_file_t *filter only provides one address.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

No, the filter in bus_dma is executed against every bus_dma_segment. Again I'm not sure what's missing, because AFAICT current code should work fine in this case.

bus_dma_run_filter [1] already has a check to make sure the start address of a segment is aligned to the alignment passed when creating the bus_dma tag, so you don't even need to create a custom filter.

[1] http://fxr.watson.org/fxr/source/x86/x86/busdma_machdep.c#L96

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

Here is a more detailed explanation from the requester of this feature.

The storvsc requests to communicate with Hypervisor requires that all segments except the first one start at page boundary, and all segments but the last one end at page boundary, and the first segment can start at any address and the last segment can end at any address.

If we use BUS_DMA, both alignment and boundary parameters for bus_dma_tag_create() will be PAGE_SIZE. This will work for middle segments, but does not work for the first and last segment (the first segment does not need to start at page boundary, and the last segment does not need to end at page boundary.)

The BUS_DMA filter, typedef int bus_dma_filter_t(struct bus_dma_tag_common *tc, bus_addr_t paddr), cannot help here. It only tells us the starting address of a segment. We don’t know whether or not the address comes from middle/last segments or the first segment, which needs different handling. For example, if paddr is not page aligned, it needs bounce buffer for middle/last segments, but not needed for the first segment.

Hope this clarifies.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

OK, the first and last segments don't need to start/end at a page boundary, but if they start and end at a page boundary this is not a problem for the driver right?

IMHO I don't see the problem in using bus_dma here.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

The driver requires that the first segment do not need to start at page boundary, but must end at page boundary. It also requires the last segment must start at page boundary, but do not need to end at page boundary.

Let's take an example. If the upper layer passes down a request with following segments to the driver:

Seg1

offset = 512
size = 3584

Seg2

offset = 512
size = 4096

Seg3

offset = 512
size = 3584

For the first seg, ie Seg1, it dose not need bounce buffer, because it ends at page boundary. It wouldn't work if it were copied to a bounce buffer. Because if so there would be a hole in the bounce buffer since the 3584 bytes would be copied to the beginning of the bounce buffer page.

On the other hand, it must use the bounce buffer for the last segment (Seg3). Since it doesn't start at page boundary, we have to copy it to a bounce buffer which starts the segment at page boundary.

BUS_DMA doesn't have a way to tell if it is the first segment or last segment, which should be treated differently. That's why we need our own implementation.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

I'm not sure how you can end with such segments in the same request. IIRC a request contains a start address and a length, and this is what is passed down to the driver.

Then if the start address is not aligned according to the parameters you passed when creating the bus_dma tag it will be aligned, but there are never going to be holes in the middle, because the length of each segment is going to change so it is contiguous.

Segments don't have an offset, only a start address and a length. If you set the alignment to PAGE_SIZE, all segments are going to have a length of PAGE_SIZE and address is going to be aligned to PAGE_SIZE.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

When I talk about the offset in the example above, I meant the offset of the segment address from the page boundary. We care much about this offset of the address in this discussion.

If the addresses of first segment and second segment are not contiguous, and the first segment doesn't start at page boundary and is less than a page size, BUS_DMA will copy the first segment to a bounce buffer page without adding the following data from second segment to the rest of the page. This leaves a hole at the end of this bounce buffer page and the start of second segment. This is not allowed in Windows. And for the last page, if it starts at page boundary (offset = 0) and size < PAGE_SIZE, we don't need to copy it to bounce

The IO requests with such kind of segments come from upper layer of NetApp's ONTAP product. As I mentioned earlier, the SG handling is specifically developed for NetApp because its ONTAP product generates such requests. And in the Windows host, it has the restriction to SG IOs that there should be no holes anywhere in the request.

As I mentioned earlier and in the comments, we will see if BUS_DMA will work or even get it work later. Currently we really need to get this in HEAD. Both my client and my boss put pressure on me as why it is not in while the customer has already been using it. You understanding is greatly appreciated. Thanks so much!

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

Roger,

Hovy Xu, the NetApp engineer who had been working on this with us also looked at your comments. Since he doesn't have the account to reply, I am posting his response here. His comments start with "HXU>" below.

I'm not sure how you can end with such segments in the same request. IIRC a request contains a start address and a length, and this is what is passed down to the driver.

HXU> the reason to use offset and size instead of start address and length is to easier describe the alignment issue of each segment. I skipped the page aligned part of the start address, and only showed the offset part.

Then if the start address is not aligned according to the parameters you passed when creating the bus_dma tag it will be aligned, but there are never going to be holes in the middle, because the length of each segment is going to change so it is contiguous.

HXU> I guess you talked about the code in _bus_dmamap_addseg() as shown below. That is true only if two adjacent segments have contiguous data address (curaddr == segs[seg].ds_addr + segs[seg].ds_len) plus other conditions (e.g. max seg size and in the same page if max seg size is page size). The seg1 and seg2 I provided could come from separate pages. In this case, there will have hole between the first and second segments if using BUS_DMA.

572 /*
573 * Insert chunk into a segment, coalescing with
574 * previous segment if possible.
575 */
576 seg = *segp;
577 if (seg == -1) {
578 seg = 0;
579 segs[seg].ds_addr = curaddr;
580 segs[seg].ds_len = sgsize;
581 } else {
582 if (curaddr == segs[seg].ds_addr + segs[seg].ds_len &&
583 (segs[seg].ds_len + sgsize) <= dmat->common.maxsegsz &&
584 (dmat->common.boundary == 0 ||
585 (segs[seg].ds_addr & bmask) == (curaddr & bmask)))
586 segs[seg].ds_len += sgsize;
587 else {
588 if (++seg >= dmat->common.nsegments)
589 return (0);
590 segs[seg].ds_addr = curaddr;
591 segs[seg].ds_len = sgsize;
592 }
593 }

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

Segments will always have contiguous data addresses. If you look at the bio struct, there's only one field that contains the start virtual address that's bio_data. How come can you end up with segments that are not contiguous in virtual address space or have holes?

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

Once again, this was requested and designed for Netapp ONTAP product. ONTAP has such segments passed down from its upper layer.

We would not even need SG support in this driver if we just need to support regular freeBSD data path -- we have never seen this data path being used during our in-house FreeBSD test. Netapp has tested this path in its ONTAP product and it works as expect while dealing with non-contiguous segments.

USB has some whacky DMA requirements that sound somewhat similar (see BUS_DMA_KEEP_PG_OFFSET). However, if that does not cover this then it might require a bit of work to teach bus dma to handle this specific edge case.

sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
1782

Hovy from NetApp also asked me to post the response:

Segments will always have contiguous data addresses. If you look at the bio struct, > there's only one field that contains the start virtual address that's bio_data. How come can you end up with segments that are not contiguous in virtual address space or have holes?

HXU>
NetApp DataONTAP does not issue CAM_DATA_VADDR/PADDR or CAM_DATA_BIO requests. Instead, it generates CAM_DATA_SG or CAM_DATA_SG_PADDR requests. There is no reason two adjacent segments have to be contiguous in virtual address space. It is most likely not contiguous for each segment.

Regarding to the comment from John:

USB has some whacky DMA requirements that sound somewhat similar (see BUS_DMA_KEEP_PG_OFFSET). However, if that does not cover this then it might require a bit of work to teach bus dma to handle this specific edge case.

BUS_DMA_KEEP_PG_OFFSET will apply to every segment. Windows requires only to keep the offset for the first segment if it ends at page boundary. For other segments, especially the last one, Windows still requires to start at page boundary so there is no hole in the middle. Currently BUS_DMA doesn't have a way to tell if it is first, middle or last segment.

As I put in the comment, we will see how to fix this in BUS_DMA later after we have all the critical tasks completed.

I've crafted a patch to add information about first/last segments to bus_dma_filter, it is only implemented for x86 right now, so it breaks other arches:

https://people.freebsd.org/~royger/busdma_filter_x86.patch

Can you test if this makes it possible to use bus_dma with your use-case?

Roger.

To be clear, if PG_KEEP_OFFSET isn't sufficient, I'd be tempted to go with the private allocator for now. If we teach bus_dma about this type of restriction, I think I'd prefer it to be a new flag. I'm less sure about extending the filter callback. I'd actually like to remove the filter callback longer term. Also, changing it would be a fairly invasive API/ABI change that would be difficult to MFC.

In D1964#42644, @royger wrote:

I've crafted a patch to add information about first/last segments to bus_dma_filter, it is only implemented for x86 right now, so it breaks other arches:

https://people.freebsd.org/~royger/busdma_filter_x86.patch

Can you test if this makes it possible to use bus_dma with your use-case?

Roger.

Thanks Roger! However this one won't work. Here is the response from Netapp:

"Roger's patch does not work for StorVSC. For each SG request, bounce_bus_dmamap_load_buffer/phys and _bus_dmamap_count_pages/phys can be called multiple times (one per source segment). With his patch, the filter routine could receive multiple segments with BUS_DMAFILTER_FIRST_SEG flag for the same request. Hyoper-V StorVSC driver only wants the first source segment to have that flag.

I don't think NetApp have resource to apply and verify the patch at this time."

Also such API changes could affect the callers throughout the entire kernel, which is not what we intend to.

As you know even it works, Microsoft doesn't have the test environment to generate such SG requests in house. All the test needs to be done by Netapp since only ONTAP is creating such requrests. I have been onsite at Netapp last October to help coordinate such efforts. It doesn't seems likely that Netapp can again allocate resource to re-integrate and complete full regression test in a short period of time. Same as the engineers in Microsoft are busy working on the network improvement of our drivers. So a change like this will require both companies to plan ahead and agree upon. This may eventually turns into a business decision from management.

In D1964#42722, @jhb wrote:

To be clear, if PG_KEEP_OFFSET isn't sufficient, I'd be tempted to go with the private allocator for now. If we teach bus_dma about this type of restriction, I think I'd prefer it to be a new flag. I'm less sure about extending the filter callback. I'd actually like to remove the filter callback longer term. Also, changing it would be a fairly invasive API/ABI change that would be difficult to MFC.

Thanks John! Fully agreed. We will evaluate the possibility of adding a new flag to address this problem in bus dma once both Microsoft and Netapp agree to allocate resource to work on this. For now, I think using the private allocator is the right way to go.

The Microsoft Ignite event is coming up next week. The team here would like to see the code in head before this event so the management can make some announcement.

If there is no more objection, I would like to close up the review and check in the code. Please respond by COB Monday 4/27 GMT +0 if you have any more concerns. Thanks so much!

royger edited edge metadata.

OK, but getting rid of this custom bounce buffer should be a high priority.

This revision is now accepted and ready to land.Apr 27 2015, 5:53 PM
In D1964#43418, @royger wrote:

OK, but getting rid of this custom bounce buffer should be a high priority.

Yes. We will work with Netapp closely to develop a solution using BUS_DMA for Storvsc driver.

Many many thanks for getting this reviewed and approved on time!

Code committed to head as r282212. Thanks all for the review!