Grant-Table User-Space Device

Authored by on Jul 23 2016, 4:15 PM.



A grant-table user-space device will allow user-space applications to map and share grants (Xen way to share memory) among Xen domains.

Test Plan

Tested using a custom test

and xen qdisk backend .

Diff Detail

rS FreeBSD src repository
Lint Skipped
Unit Tests Skipped retitled this revision from to Grant-Table User-Space Device. updated this object. edited the test plan for this revision. (Show Details) added a reviewer: royger. set the repository for this revision to rS FreeBSD src repository.
pfg added a subscriber: pfg.Jul 25 2016, 2:27 PM

The Xen code may have it's own style guide but it might be good to check the style(9) manpage and perhaps the indent(1) example in /usr/share/examples.


These two lines should be moved after the license. The license should be strictly as suggested here:


Instead of this, you can just add $FreeBSD$ in the comment, after the license.

imp added inline comments.Jul 25 2016, 3:33 PM

To be clear, the lines before the copyright line are what should be moved, not the copyright line.


For .c files, this is the correct way. For .h files, you are correct.

pfg added inline comments.Jul 25 2016, 6:38 PM

Hmm.. then I guess I have screwed up more than once :(.

royger edited edge metadata.Jul 27 2016, 3:54 PM

Thanks, the code looks good, but it's a complicated device and there are no comments almost anywhere describing what's going on. Please add comments to the major functions (as described in style(9).

I would appreciate if you could add a comment at the top of the file describing how a user-space client is expected to use the device. The order of the operations and what's done in each one of them, for example if the client first has to issue an ioctl and then a mmap, and which parameters should be used, you can use a simple example in order to clarify all this.


The bracket should be on the same line as the iterator.


Same here, this should be on the same line as the if (there are a bunch of those, so I'm going to stop commenting on each one of them).


Would it be better to use a red-black tree in order to store the list of grants from a performance point of view? I expect that the list is going to be mostly accessed in order to search for granted regions rather than adding or removing them.

There's an example of a red-black tree usage in sys/vm/vm_phys.c, search for the RB_* macros :).


So you do the cleanup of expired grefs when new ones are mapped, shouldn't it be better to defer the cleanup to some background task, so that it doesn't interfere with the allocation of new grefs? Take a look at the TASKQUEUE(9) man page, it allows to schedule asynchronous task execution.


gref cannot be NULL, because you are suing M_WAITOK, so you can remove this check (or turn it into a KASSERT if you are paranoid :))


Same here, you should probably look into making gref_list_dtor a task that you can schedule in the background, because AFAICT there's no obvious reason that it should be explicitly done at this point.


You should not declare variables here, instead do it at the top of the function (didn't you get a complain from the compiler here?)


No need for the brackets here.


gref == NULL (gref is not a boolean, so you should not use !).


malloc with M_WAITOK cannot fail (there are a couple more below which I'm not going to comment on).


Why do you need to set handle to -1 and host_addr to 0, aren't you just unmapping the grants?


It seems like indentation is off here, but maybe it's just phabricator. Could you make sure you indent the lines that you have to break (because they are too long) using 4 spaces.


Don't mix declarations with code, this should be declared at the beginning of the function.


All the cases are indented one level more than required. "switch" and "case" should be at the same indent level.


You should create two separate helpers here, one for the gnt map and one for the gnt share cases, in order to help the readability of the code.


This comments (here and below) don't follow the coding style, check style(9).


If the function doesn't have any local variables an empty newline should be left at the top (here and below).


Hm, why did you add this C++ compat thing? None of the FreeBSD headers for Xen user-space devices have it AFAIK. edited edge metadata. marked an inline comment as done.

Style Corrections marked 17 inline comments as done.Aug 1 2016, 3:45 PM marked 5 inline comments as done.Aug 1 2016, 5:36 PM

I've added comments to all parts of the code, and tried to conform with the style guidelines everywhere.
I'll add instructions on how to use the device in the header file in an hour.

I haven't added TASKQUEUE or RB_TREE yet, because I thought it will be better that this version go in first, and then incrementally add these things. Otherwise it will make debugging a bit difficult, especially with Async tasks, the code is anyway too complex. If this version works perfectly, I'll add these performance improvements later on. There are other things we can (and have to) improve on too, one of which is file offset allocation, which is on my mind. Also, notify is yet to be implemented.



It might be an overkill, given that we access the list only twice or thrice for every allocated grant (mmap, maybe notify, and dealloc). Linux uses a list.


I guess it should be rare that some domain keeps an allocated grant mapped, even when access from this side has ended. This function is always called when any grant is deallocated, but just to kill phantom grants still alive, I call it here.
Linux also calls it, but I guess their reasons are different, they want to ensure max number of free grants are available before the allocation happens.


Since this is an error, mostly only free() calls will happen here.


This function might be called in case the Map Hyper-call fails partially (say half of the grants are mapped and the rest are not) (line 924). In that case, I enumerate over the successful maps, and un-map them. Since the user can retry the mmap in such a case, to redo the hyper-call, I update the internal data-structures to represent the fact that none of the grants are mapped anymore. After this, the state of the internal data structures is same as it was before the mmap had happened.


I've removed it. It's the default template for C header files on my system. It's not required here, as we just have structure definitions, and even if there were any declarations, I guess the kernel doesn't have any C++ code that would require this.

pfg added a comment.Aug 1 2016, 6:02 PM


I haven't added TASKQUEUE or RB_TREE yet, because I thought it will be better that this version go in first, and then incrementally add these things. Otherwise it will make debugging a bit difficult, especially with Async tasks, the code is anyway too complex.

The taskqueue(9) and tree(3) stuff has been tested very well as it is used in the rest of the system. There is no good reason not to use it from the start. marked 5 inline comments as done.Aug 1 2016, 6:14 PM

The taskqueue(9) and tree(3) stuff has been tested very well as it is used in the rest of the system. There is no good reason not to use it from the start.

Mhm. Noted. I'll add them now. :)

Added Async cleanup using Taskqueue and RedBlack Tree for internal data structures.
Please review :)

royger added a comment.Aug 2 2016, 4:12 PM

Just some initial thoughts, I will try to do a more in-deep review tomorrow.


With the addition of the RB tree now the structure requires two different handlers, one for the RB tree and another one for the TAILQ, which is rather unfortunate (also taking into account that there might be hundreds of those on a busy Dom0.

Could you please benchmark if the usage of the RB tree makes any difference from a performance point of view?

In order to test it you should create a NULL memdisk (all reads will return 0, writes are ignored) using:

# mdconfig -a -t null -s 10g

Attach it to a guest using the Qdisk backend and then run fio (or any other disk benchmark) against the raw block device. You should test the pre-RB and the RB versions of the gnttab device and decide if the addition of the RB tree is worth it or not.


Since this now runs in a taskqueue you should use a STAILQ list, and only hold the mutex in order to pick the first element from the queue and remove it. The unmap and cleanup should be done without holding the mutex in order to prevent contention. If the element can not be removed you will have to put it into a local queue that will be appended to the global one once processing has finished. You will have to do it that way in order to prevent iterating forever over grants that cannot be unmapped. marked 2 inline comments as done. added inline comments.

Consistently better performance using RB tree. I guess we should keep it :)

P.S. - And even better performance using (RB + Async) :) :)


Done, captain! edited the test plan for this revision. (Show Details)

Add notify function UNMAP_NOTIFY_CLEAR_BYTE. edited the test plan for this revision. (Show Details)Aug 3 2016, 7:49 PM
royger added inline comments.Aug 5 2016, 4:53 PM

Great, it looks marginally better.

FWIW, in FreeBSD (and other projects), when you give performance figures you usually have to use a table of some sort, with the relevant data, and the stddev. Without this data results could be considered useless. If you want to give this a try, there's a very good cli tool in FreeBSD called 'ministat(1)', that will give you a nice graph and will tell you whether there's a performance benefit or it's just an statistic illusion. IMHO it's not necessary in this case, but I would recommend that you give it a try if you want/can.


In order to reduce the size of this struct, you should put gref_list_next and gref_tree_next inside of a union, since AFAICT, you will never have a gref added to both the cleanup tree and a RB tree (same applies to gntdev_gmap).

And the struct definitions should all be at the top of the file. Having functions interleaved with struct definitions makes the code harder to read IMHO (I might be convinced otherwise if you think it's better the other way around).


I don't see any reason for this ioctl to exists, it's not implemented and will never be. Should be removed from the header too.

Added UNMAP_NOTIFY_SEND_EVENT code, factored out notify function, used a union to save space, and added usage instructions in the device header file. marked 3 inline comments as done.
  • Header file updated with usage instructions.
  • set_max_grants ioctl removed.
  • Refactored notify and event-channel notification code.

Noted. I'll keep it in mind from next time onwards.


I've shifted both iterator structures inside a union.

I'm a little inclined towards keeping the struct definitions in the middle because the code is divided into segments, as follows:

  1. Global Device Data-Structures
  2. Grant Allocation Data-Structures and Methods
  3. Grant Mapping Data-Structures and Methods
  4. Grant Mapping Pager
  5. Grant Device Methods
  6. Character Device Methods (ioctl/mmap...)
  7. Private Device Attachment Functions

The DS defined in (2) has no use in any of the other parts of the code, and the same is true for the DS defined in (3). Most of their use is inside their respective sections, and thus I defined them at the beginning of their respective sections.


Removed. :)

royger accepted this revision.Aug 21 2016, 3:50 PM
royger edited edge metadata.

LGTM, just one minor comment.


Shouldn't there be some check to make sure you don't write past the mapped region? (Maybe it's somewhere else, but I'm not able to find it).



This revision is now accepted and ready to land.Aug 21 2016, 3:50 PM marked an inline comment as done.Aug 21 2016, 4:18 PM added inline comments.

Yes. The check is already present.
The check is done when choosing the gmap/gref to make the notify structure into. Line 840-841 for gmap, and function call to gntdev_find_grefs(.., index, ..) at line 829 for gref.

royger added inline comments.Aug 22 2016, 9:41 AM

Now that I look again into this, isn't this a little bit too simplistic? At some point priv_user->file_offset is going to overflow, returning into 0, and then we risk overlapping with existing regions. IMHO, you should use something like a region manager (rangeset), that can keep track of start and sizes of the allocated regions, it's going to add a little bit more of logic, but at least we will be on the safe side. marked an inline comment as done.Aug 22 2016, 9:44 AM added inline comments.

Aye Aye! This was an improvement I left on for something later. (Linux uses similar simplistic logic for alloc, but perhaps something more complicated for map).
Since, we now have time, I'll look into the rangeset. :) updated this revision to Diff 19617.EditedAug 24 2016, 9:08 AM edited edge metadata.

Added rangeset for file-offset allocation. File offsets can now be reused. :)

Please note that this differential update hasn't been tested. I do not have access to the testing machine anymore.

This revision now requires review to proceed.Aug 24 2016, 9:08 AM
royger accepted this revision.Oct 31 2016, 3:35 PM
royger edited edge metadata.
This revision is now accepted and ready to land.Oct 31 2016, 3:35 PM
royger closed this revision.Oct 31 2016, 3:35 PM