
Non-transparent superpages support.
Needs ReviewPublic

Authored by kib on May 1 2020, 3:53 PM.

Details

Reviewers
markj
alc
Summary

WIP

Test program, also the API illustration: https://reviews.freebsd.org/P384

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 32231

Event Timeline

kib created this revision.May 1 2020, 3:53 PM
kib requested review of this revision.May 1 2020, 3:53 PM
markj added a comment.May 1 2020, 5:42 PM

Just some thoughts on the MI components, I have only skimmed so far:

  • If we are not going to be API compatible with Linux, I don't really like the reuse of the "hugetlb" name. This can be finalized later of course.
  • Rather than having a separate pmap_enter_hugetlb(), why not extend pmap_enter() to handle psind=2? I think we will want this eventually anyway.
  • I assumed that we would use mmap() rather than having a HUGETLB_MMAP ioctl. mmap() can specify the required alignment with the MAP_ALIGNED() flag. The hugetlb device object can implement the populate method and return a page with m->psind == 2, so the device does not have to create the mapping. What are the disadvantages to using mmap() instead of HUGETLB_MMAP?
kib retitled this revision from Nn-trasparent superpage support. to Non-trasparent superpage support..May 1 2020, 5:53 PM
kib added a comment.EditedMay 1 2020, 6:02 PM

Just some thoughts on the MI components, I have only skimmed so far:

  • If we are not going to be API compatible with Linux, I don't really like the reuse of the "hugetlb" name. This can be finalized later of course.

Sure, but please propose the name.

  • Rather than having a separate pmap_enter_hugetlb(), why not extend pmap_enter() to handle psind=2? I think we will want this eventually anyway.

pmap_enter_hugetlb() is not equivalent to pmap_enter(psind = 1). For a start, pmap_enter_hugetlb() asserts that there is no existing mapping and that the requested mapping is very special; second, I do not want to slow down the existing pmap_enter() or make its logic even more convoluted. I can see an argument that I should add a new PMAP_ENTER_HUGETLB flag and call pmap_enter(), with pmap_enter() internally jumping to pmap_enter_hugetlb().

  • I assumed that we would use mmap() rather than having a HUGETLB_MMAP ioctl. mmap() can specify the required alignment with the MAP_ALIGNED() flag. The hugetlb device object can implement the populate method and return a page with m->psind == 2, so the device does not have to create the mapping. What are the disadvantages to using mmap() instead of HUGETLB_MMAP?

One problem there is that I need to set some special map entry flags; another is that I need to avoid populating the object with any pages from any other syscalls until my population code does it. This is part of the problem in the API design, and it will be even more severe for e.g. POSIX shm, which is one of the reasons why I went with /dev/hugetlb for now, to get things going in the lower layers.

No, m->psind = 2 cannot work now; this would be a very invasive change (it is different from pmap_enter(psind = 1)).

markj added a comment.May 1 2020, 6:26 PM
In D24652#542724, @kib wrote:

Just some thoughts on the MI components, I have only skimmed so far:

  • If we are not going to be API compatible with Linux, I don't really like the reuse of the "hugetlb" name. This can be finalized later of course.

Sure, but please propose the name.

Maybe "largepage" or "largemap"?

  • Rather than having a separate pmap_enter_hugetlb(), why not extend pmap_enter() to handle psind=2? I think we will want this eventually anyway.

pmap_enter_hugetlb() is not equivalent to pmap_enter(psind = 1). For a start, pmap_enter_hugetlb() asserts that there is no existing mapping and that the requested mapping is very special; second, I do not want to slow down the existing pmap_enter() or make its logic even more convoluted. I can see an argument that I should add a new PMAP_ENTER_HUGETLB flag and call pmap_enter(), with pmap_enter() internally jumping to pmap_enter_hugetlb().

We already have PMAP_ENTER_NOREPLACE to indicate the desired semantics. I do not really see why there would be any performance impact: pmap_enter() handles the psind == 1 case in exactly one place, where it calls pmap_enter_pde() and skips the rest of the function. So we could simply extend it to handle psind > 0 and call pmap_enter_pde() or pmap_enter_pdpe() depending on the request.

  • I assumed that we would use mmap() rather than having a HUGETLB_MMAP ioctl. mmap() can specify the required alignment with the MAP_ALIGNED() flag. The hugetlb device object can implement the populate method and return a page with m->psind == 2, so the device does not have to create the mapping. What are the disadvantages to using mmap() instead of HUGETLB_MMAP?

One problem there is that I need to set some special map entry flags; another is that I need to avoid populating the object with any pages from any other syscalls until my population code does it. This is part of the problem in the API design, and it will be even more severe for e.g. POSIX shm, which is one of the reasons why I went with /dev/hugetlb for now, to get things going in the lower layers.

I agree that we should have a /dev/hugetlb to provide a configuration interface (e.g., to specify reclamation policy), but it looks like HUGETLB_MMAP is really just duplicating a subset of the mmap() interface, while lacking some centralized logic like capability rights checks. It should be possible to extend d_mmap_single() a bit to specify the required map entry flags.

No, m->psind = 2 cannot work now; this would be a very invasive change (it is different from pmap_enter(psind = 1)).

Ok, but do you agree that it is the right long-term direction? Even if the initial implementation is more specialized.

markj added a comment.May 1 2020, 6:34 PM
In D24652#542724, @kib wrote:

One problem there is that I need to set some special map entry flags; another is that I need to avoid populating the object with any pages from any other syscalls until my population code does it. This is part of the problem in the API design, and it will be even more severe for e.g. POSIX shm, which is one of the reasons why I went with /dev/hugetlb for now, to get things going in the lower layers.

I agree that we should have a /dev/hugetlb to provide a configuration interface (e.g., to specify reclamation policy), but it looks like HUGETLB_MMAP is really just duplicating a subset of the mmap() interface, while lacking some centralized logic like capability rights checks. It should be possible to extend d_mmap_single() a bit to specify the required map entry flags.

To be a bit more specific, when I tried to design this I imagined that /dev/hugetlb would perform the page allocation (and defrag, or whatever) at mmap() time, and create the mapping upon the fault using the populate interface. I think this would address the possibility of other system calls populating the object.

kib added a comment.May 1 2020, 7:11 PM
In D24652#542724, @kib wrote:

Just some thoughts on the MI components, I have only skimmed so far:

  • If we are not going to be API compatible with Linux, I don't really like the reuse of the "hugetlb" name. This can be finalized later of course.

Sure, but please propose the name.

Maybe "largepage" or "largemap"?

largemap is already taken; largepage might be. One slight advantage of 'hugetlb' is that somebody with a Linux background would find it, even though the FreeBSD feature is different. largepage might work, but the best name would indicate that it is non-transparent and non-demoting.

  • Rather than having a separate pmap_enter_hugetlb(), why not extend pmap_enter() to handle psind=2? I think we will want this eventually anyway.

pmap_enter_hugetlb() is not equivalent to pmap_enter(psind = 1). For a start, pmap_enter_hugetlb() asserts that there is no existing mapping and that the requested mapping is very special; second, I do not want to slow down the existing pmap_enter() or make its logic even more convoluted. I can see an argument that I should add a new PMAP_ENTER_HUGETLB flag and call pmap_enter(), with pmap_enter() internally jumping to pmap_enter_hugetlb().

We already have PMAP_ENTER_NOREPLACE to indicate the desired semantics. I do not really see why there would be any performance impact: pmap_enter() handles the psind == 1 case in exactly one place, where it calls pmap_enter_pde() and skips the rest of the function. So we could simply extend it to handle psind > 0 and call pmap_enter_pde() or pmap_enter_pdpe() depending on the request.

No, I do not believe that PMAP_ENTER_NOREPLACE has similar semantics; it bails out if a mapping exists. And I cannot reuse pmap_enter_pde() there either. Just for example, there must be no existing mapping at all for the new flag, which PMAP_ENTER_NOREPLACE does not indicate.

I still think that branching out for PMAP_ENTER_LARGEPAGE (or some other name) is best.

  • I assumed that we would use mmap() rather than having a HUGETLB_MMAP ioctl. mmap() can specify the required alignment with the MAP_ALIGNED() flag. The hugetlb device object can implement the populate method and return a page with m->psind == 2, so the device does not have to create the mapping. What are the disadvantages to using mmap() instead of HUGETLB_MMAP?

One problem there is that I need to set some special map entry flags; another is that I need to avoid populating the object with any pages from any other syscalls until my population code does it. This is part of the problem in the API design, and it will be even more severe for e.g. POSIX shm, which is one of the reasons why I went with /dev/hugetlb for now, to get things going in the lower layers.

I agree that we should have a /dev/hugetlb to provide a configuration interface (e.g., to specify reclamation policy), but it looks like HUGETLB_MMAP is really just duplicating a subset of the mmap() interface, while lacking some centralized logic like capability rights checks. It should be possible to extend d_mmap_single() a bit to specify the required map entry flags.

No, m->psind = 2 cannot work now; this would be a very invasive change (it is different from pmap_enter(psind = 1)).

Ok, but do you agree that it is the right long-term direction? Even if the initial implementation is more specialized.

m->psind is tightly tied to transparent promotion. I am not sure, but my intuition strongly says that transparent promotion to 1G could never work. I think it would just cause 10-20%-populated reservations to hang around, never completing.

Also, for 2M promotions we have to check 512 PTEs. For 1G, it is either 512x512 entries for 4K pages, or 512 for 2M. This is impractical. If a program does create very large memory allocations that benefit from contiguity and from PG_PS, then it typically expresses that explicitly, e.g. postgres.

In D24652#542724, @kib wrote:

One problem there is that I need to set some special map entry flags; another is that I need to avoid populating the object with any pages from any other syscalls until my population code does it. This is part of the problem in the API design, and it will be even more severe for e.g. POSIX shm, which is one of the reasons why I went with /dev/hugetlb for now, to get things going in the lower layers.

I agree that we should have a /dev/hugetlb to provide a configuration interface (e.g., to specify reclamation policy), but it looks like HUGETLB_MMAP is really just duplicating a subset of the mmap() interface, while lacking some centralized logic like capability rights checks. It should be possible to extend d_mmap_single() a bit to specify the required map entry flags.

Right, /dev/hugetlb is a dev-time artifact that does not need to survive into the final committable version. The configuration interface, even if named /dev/hugetlb, would be something different.
This device would only exist until the proper usermode API is designed, and most likely will never be committed.

To be a bit more specific, when I tried to design this I imagined that /dev/hugetlb would perform the page allocation (and defrag, or whatever) at mmap() time, and create the mapping upon the fault using the populate interface. I think this would address the possibility of other system calls populating the object.

I think part of the realistic use-case requirements is that there are no faults, not even soft ones. Also, I believe it is impossible to cleanly handle the situation where a page fault is unable to satisfy the request for contiguous memory. There is no way to react other than to send a signal, but that is probably a non-starter for consumers. Either a commitment to success on later accesses, or upfront failure at mapping creation time, is required for practical applications.

It is both userspace and some kernel drivers that would need to get the contiguous memory under the mapping, and vm_fault_quick_hold() is not adequate either.

I considered your question, which can be reformulated as: could this be plugged more naturally into the existing lazy VM mapping and faulting approach? I decided that lazy instantiation of the mapping is not suitable for the planned DPDK and OFED uses. It does not contradict other uses, e.g. a large shared region for postgres, so I do not see why we should try to overcome such problems instead of avoiding them from the beginning. HUGETLB on Linux does it similarly; they even preallocate all superpage memory at boot.

markj added a comment.May 1 2020, 8:23 PM
In D24652#542741, @kib wrote:
In D24652#542724, @kib wrote:

Just some thoughts on the MI components, I have only skimmed so far:

  • If we are not going to be API compatible with Linux, I don't really like the reuse of the "hugetlb" name. This can be finalized later of course.

Sure, but please propose the name.

Maybe "largepage" or "largemap"?

largemap is already taken; largepage might be. One slight advantage of 'hugetlb' is that somebody with a Linux background would find it, even though the FreeBSD feature is different. largepage might work, but the best name would indicate that it is non-transparent and non-demoting.

Ok. It can be left for now anyway.

  • Rather than having a separate pmap_enter_hugetlb(), why not extend pmap_enter() to handle psind=2? I think we will want this eventually anyway.

pmap_enter_hugetlb() is not equivalent to pmap_enter(psind = 1). For a start, pmap_enter_hugetlb() asserts that there is no existing mapping and that the requested mapping is very special; second, I do not want to slow down the existing pmap_enter() or make its logic even more convoluted. I can see an argument that I should add a new PMAP_ENTER_HUGETLB flag and call pmap_enter(), with pmap_enter() internally jumping to pmap_enter_hugetlb().

We already have PMAP_ENTER_NOREPLACE to indicate the desired semantics. I do not really see why there would be any performance impact: pmap_enter() handles the psind == 1 case in exactly one place, where it calls pmap_enter_pde() and skips the rest of the function. So we could simply extend it to handle psind > 0 and call pmap_enter_pde() or pmap_enter_pdpe() depending on the request.

No, I do not believe that PMAP_ENTER_NOREPLACE has similar semantics; it bails out if a mapping exists. And I cannot reuse pmap_enter_pde() there either. Just for example, there must be no existing mapping at all for the new flag, which PMAP_ENTER_NOREPLACE does not indicate.

I still think that branching out for PMAP_ENTER_LARGEPAGE (or some other name) is best.

  • I assumed that we would use mmap() rather than having a HUGETLB_MMAP ioctl. mmap() can specify the required alignment with the MAP_ALIGNED() flag. The hugetlb device object can implement the populate method and return a page with m->psind == 2, so the device does not have to create the mapping. What are the disadvantages to using mmap() instead of HUGETLB_MMAP?

One problem there is that I need to set some special map entry flags; another is that I need to avoid populating the object with any pages from any other syscalls until my population code does it. This is part of the problem in the API design, and it will be even more severe for e.g. POSIX shm, which is one of the reasons why I went with /dev/hugetlb for now, to get things going in the lower layers.

I agree that we should have a /dev/hugetlb to provide a configuration interface (e.g., to specify reclamation policy), but it looks like HUGETLB_MMAP is really just duplicating a subset of the mmap() interface, while lacking some centralized logic like capability rights checks. It should be possible to extend d_mmap_single() a bit to specify the required map entry flags.

No, m->psind = 2 cannot work now; this would be a very invasive change (it is different from pmap_enter(psind = 1)).

Ok, but do you agree that it is the right long-term direction? Even if the initial implementation is more specialized.

m->psind is tightly tied to transparent promotion. I am not sure, but my intuition strongly says that transparent promotion to 1G could never work. I think it would just cause 10-20%-populated reservations to hang around, never completing.

I really do not see why it is tightly coupled. Sure, vm_reserv sets it when a reservation is fully populated, but it is really only used by the fault handler. AFAIK there is no reason why a pager populate handler cannot set it either. Else why does vm_fault_populate() even try to handle m->psind != 0? The page fault handler does not set OBJ_COLORED when calling vm_pager_populate(). In fact I think Alan at one point mentioned that the populate interface was designed partly to accommodate an implementation of non-transparent superpages.

I agree that transparent promotion to 1GB is probably not important to userspace, at least for anon memory, but kernel_object could certainly make use of it on large systems.

Also, for 2M promotions we have to check 512 PTEs. For 1G, it is either 512x512 entries for 4K pages, or 512 for 2M. This is impractical. If a program does create very large memory allocations that benefit from contiguity and from PG_PS, then it typically expresses that explicitly, e.g. postgres.

I do not really see why checking 512 2MB mappings is impractical, but it could be optimized. Anyway, there is not always a binary distinction between transparent and non-transparent superpages. Suppose one application maps a 2MB shared object and faults on every 4KB page, resulting in a promotion to a 2MB mapping. Then m->psind == 1 for the first page in the range, and *all* processes which map the object will automatically get a 2MB mapping upon the first fault (assuming the mapping is suitably aligned). In particular, there is no promotion involved. m->psind = 1 just says, "page m begins a run of 512 physically contiguous pages, all contiguous within the same object." /dev/hugetlb can provide the same hint, and the VM will automatically use it.

In D24652#542724, @kib wrote:

One problem there is that I need to set some special map entry flags, another is that I need to avoid populating the object with any pages from any other syscalls until my population code do it. This is part of the problem in the API design, and it will be even more severe for e.g. posix shm, which is one of the reason why I went with /deb/hugetlb now to get things going in lower layers.

I agree that we should have a /dev/hugetlb to provide a configuration interface (e.g., to specify reclamation policy), but it looks like HUGETLB_MMAP is really just duplicating a subset of the mmap() interface, while lacking some centralized logic like capability rights checks. It should be possible to extend d_mmap_single() a bit to specify the required map entry flags.

Right, /dev/hugetlb is a dev-time artifact that does not need to survive into the final committable version. The configuration interface, even if named /dev/hugetlb, would be something different.
This device would only exist until the proper usermode API is designed, and most likely will never be committed.

Well, I suspect it would be useful to keep /dev/hugetlb, unless you are thinking of introducing a new file type. I just dislike HUGETLB_MMAP.

To be a bit more specific, when I tried to design this I imagined that /dev/hugetlb would perform the page allocation (and defrag, or whatever) at mmap() time, and create the mapping upon the fault using the populate interface. I think this would address the possibility of other system calls populating the object.

I think part of the realistic use-case requirements is that there are no faults, not even soft ones.

Why? The userspace allocator can trivially provide this guarantee if it is required, either by touching the mapping returned by mmap(), or with mlock().

Also, I believe it is impossible to cleanly handle the situation where a page fault is unable to satisfy the request for contiguous memory. There is no way to react other than to send a signal, but that is probably a non-starter for consumers. Either a commitment to success on later accesses, or upfront failure at mapping creation time, is required for practical applications.

I agree. The device mmap() handler should perform the contig allocation, or signal failure if it is unable. Then the fault handler does not perform any allocations except potentially allocating PTPs. In fact I think you do not need a populate method if hugetlb sets m->psind = 2 and inserts contig pages into the object at mmap() time: vm_fault_soft_fast() will see the resident page and automatically create a 1GB mapping, assuming that pmap_enter() can handle psind == 2.

It is both userspace and some kernel drivers that would need to get the contiguous memory under the mapping, and vm_fault_quick_hold() is not adequate either.

I considered your question, which can be reformulated as: could this be plugged more naturally into the existing lazy VM mapping and faulting approach? I decided that lazy instantiation of the mapping is not suitable for the planned DPDK and OFED uses. It does not contradict other uses, e.g. a large shared region for postgres, so I do not see why we should try to overcome such problems instead of avoiding them from the beginning. HUGETLB on Linux does it similarly; they even preallocate all superpage memory at boot.

I believe that with hugetlbfs reservations, even Linux only instantiates the PG_PS mapping at fault time. The contig allocation is done earlier, it can be done at boot or during mmap() depending on the policy.

kib added a comment.May 1 2020, 8:52 PM

I do not really see why checking 512 2MB mappings is impractical, but it could be optimized. Anyway, there is not always a binary distinction between transparent and non-transparent superpages. Suppose one application maps a 2MB shared object and faults on every 4KB page, resulting in a promotion to a 2MB mapping. Then m->psind == 1 for the first page in the range, and *all* processes which map the object will automatically get a 2MB mapping upon the first fault (assuming the mapping is suitably aligned). In particular, there is no promotion involved. m->psind = 1 just says, "page m begins a run of 512 physically contiguous pages, all contiguous within the same object." /dev/hugetlb can provide the same hint, and the VM will automatically use it.

I do not see how I could safely set m->psind = 1 (never mind 2) for these kinds of objects/pages. First, the code currently assumes that psind = 1 implies the existence of a reservation. Conversely, if there is no reservation, then the fault handler does not try pmap_enter(psind = 1). Second, I believe that it is currently implicitly assumed that userspace mappings with psind == 1 are managed.

I think that instantiating PDEs/PDPEs can be moved to the fault handler if you prefer it that way, but it still needs a special path both in vm_fault() and in pmap_enter(_hugetlb).

markj added a comment.May 1 2020, 9:30 PM
In D24652#542766, @kib wrote:

I do not really see why checking 512 2MB mappings is impractical, but it could be optimized. Anyway, there is not always a binary distinction between transparent and non-transparent superpages. Suppose one application maps a 2MB shared object and faults on every 4KB page, resulting in a promotion to a 2MB mapping. Then m->psind == 1 for the first page in the range, and *all* processes which map the object will automatically get a 2MB mapping upon the first fault (assuming the mapping is suitably aligned). In particular, there is no promotion involved. m->psind = 1 just says, "page m begins a run of 512 physically contiguous pages, all contiguous within the same object." /dev/hugetlb can provide the same hint, and the VM will automatically use it.

I do not see how I could safely set m->psind = 1 (never mind 2) for these kinds of objects/pages. First, the code currently assumes that psind = 1 implies the existence of a reservation. Conversely, if there is no reservation, then the fault handler does not try pmap_enter(psind = 1).

I forgot that vm_fault_soft_fast() uses vm_reserv_to_superpage(). Is that what you are referring to? I believe this is just for convenience. In principle you could use this instead:

static vm_page_t
vm_fault_page_to_superpage(vm_page_t m)
{
    vm_page_t m_super;
    vm_paddr_t pa;
    int i;

    pa = m->phys_addr;
    /* pagesizes[] has valid indices 0..MAXPAGESIZES - 1; check larger sizes first. */
    for (i = MAXPAGESIZES - 1; i > 0; i--) {
        m_super = m - atop(pa & (pagesizes[i] - 1));
        if (m_super->psind == i)
            return (m_super);
    }
    return (NULL);
}

This doesn't quite work because m - atop(pa & (pagesizes[i] - 1)) might not be an element of vm_page_array, but it could be fixed.

If you are referring to vm_fault_populate(), then I do not see why there is a dependency on reservations.

Second, I believe that it is currently implicitly assumed that userspace mappings with psind == 1 are managed.

Why?

I think that instantiating PDEs/PDPEs can be moved to the fault handler if you prefer it that way, but it still needs a special path both in vm_fault() and in pmap_enter(_hugetlb).

My belief is that m->psind is not (or, should not be) specific to vm_reserv, and that anything which manages contiguous memory reservations, like hugetlb or vm_reserv, should be able to pass it as a hint to the fault handler without any special MI handling. Up until now vm_reserv is the only subsystem which provides such reservations, so there may be some coupling, but it should be fixed.

This rather long block is a summary of the discussion that Kostik and I had in email, dropped in here so that it is part of the public record. Regular font is my commentary; italic font marks Kostik's responses to my comments.

Looking at Linux, it appears that their superpage support is done
through mmap:

BEGIN LINUX USER MANUAL

MAP_HUGETLB (since Linux 2.6.32)
Allocate the mapping using "huge pages." See the Linux kernel
source file Documentation/admin-guide/mm/hugetlbpage.rst for
further information, as well as NOTES, below.

MAP_HUGE_2MB, MAP_HUGE_1GB (since Linux 3.8)
Used in conjunction with MAP_HUGETLB to select alternative
hugetlb page sizes (respectively, 2 MB and 1 GB) on systems
that support multiple hugetlb page sizes.

More generally, the desired huge page size can be configured
by encoding the base-2 logarithm of the desired page size in
the six bits at the offset MAP_HUGE_SHIFT. (A value of zero
in this bit field provides the default huge page size; the
default huge page size can be discovered via the Hugepagesize
field exposed by /proc/meminfo.) Thus, the above two
constants are defined as:

#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)

The range of huge page sizes that are supported by the system
can be discovered by listing the subdirectories in /sys/kernel/mm/hugepages.

NOTES:

Huge page (Huge TLB) mappings
For mappings that employ huge pages, the requirements for the arguments
of mmap() and munmap() differ somewhat from the requirements for mappings
that use the native system page size.

For mmap(), offset must be a multiple of the underlying huge page
size. The system automatically aligns length to be a multiple of the
underlying huge page size.

For munmap(), addr and length must both be a multiple of the underlying huge page size.

Certain flags constants are defined only if suitable feature test
macros are defined (possibly by default): _DEFAULT_SOURCE with glibc
2.19 or later; or _BSD_SOURCE or _SVID_SOURCE in glibc 2.19 and
earlier. (Employing _GNU_SOURCE also suffices, and requiring that
macro specifically would have been more logical, since these flags
are all Linux-specific.) The relevant flags are: MAP_32BIT,
MAP_ANONYMOUS (and the synonym MAP_ANON), MAP_DENYWRITE,
MAP_EXECUTABLE, MAP_FILE, MAP_GROWSDOWN, MAP_HUGETLB, MAP_LOCKED,

MAP_NONBLOCK, MAP_NORESERVE, MAP_POPULATE, and MAP_STACK.

END LINUX USER MANUAL

We might as well use the same interface rather than inventing our own
and then having to add software to map from the Linux interface to ours.

Your interface lacks the ability to have the region backed by a file (the fd and offset parameters of mmap). In looking through it, the only functionality it adds that is not easily provided by mmap (extended with a few extra flags) is the ability to select the domain to be used. It is not clear to me that this extra bit of flexibility warrants creating a whole new and different interface.

Our mmap(2) interface currently has eight flag bits reserved for superpage details to be specified via the MAP_ALIGNED(n) flags (which are rather poorly described in the manual page). At the moment it appears that only one of the bits is used, where the MAP_ALIGNED_SUPER flag is defined as MAP_ALIGNED(1). The Linux choices could be added:

#define MAP_ALIGNED_SUPER MAP_ALIGNED(1)
#define MAP_HUGE_2MB MAP_ALIGNED(1)
#define MAP_HUGE_1GB MAP_ALIGNED(2)

Some of the bits reserved for page alignment could be repurposed
to pass in the domain to use, or at least a hint on which domain
to use. Or the domain information could be passed in after the
mmap() using madvise(2).

So bottom line is I think we can do this with mmap(2).

This interface is definitely not final. It is the fastest way for me
to get into the internal parts of the implementation.

I somewhat agree with the note about 'fd', but I do not have any intent to allow mmapping files with superpages; it is too drastic a change for both the VFS/VM and filesystem code. What I do agree with is that users probably want a way to share the hugetlb mappings between unrelated processes, so some namespace for such mappings would be required.

So does this mean that existing mmap mappings of files are never promoted to superpages? Even if the initial mapping requested to be aligned? I assume that as the pages of the file fault, they are brought in to the correct place in the superpage reservation, so they will be properly set up to be promoted.

Current transparent superpages are best-effort. If the allocator cannot claim all 512 pages for a given superpage, then promotion does not happen. There is a mechanism called reservations which should help the allocator avoid stealing a 4K page from a 2M range, basically giving a hint that the page might become part of a superpage. But a reservation cannot veto the stealing.

You can see for yourself how often this mechanism works and does not: look at the vm.pmap.pde.* sysctl values after a buildworld. You will see that it works, but not fantastically. Promotion failures are typically more frequent than successful promotions.

As I briefly noted in my previous mail, I plan to define the API using
the posix shm interface, perhaps it will start with some flag for shm_open2().

Let me explain some design issues there; in particular, why I do not plan to provide hugetlb file mappings and, less directly, why I do not want to use normal mmap(2) (though this is not final).

For superpage mappings to be possible, two things must match:

  1. The backing physical memory must be contiguous, and the contiguous blocks must start on the natural boundary. I.e., if we have a file with cached pages in vp->v_object and want to eventually map it with 2M superpages, the pages in each 2M range must have contiguous physical addresses, and each page on a 2M boundary must have the low 21 bits of its physical address zero. For 1G pages, we must have 1G contiguous runs, with each 1G page having the low 30 bits of its physical address zero.

Per above, if the mapping is initially aligned and pages are placed
properly as they are brought in, then it should work.

  2. The virtual address used for the mapping must be naturally aligned on the selected superpage boundary. This is relatively easy.

For item 1, there are multiple problems. First, trying to allocate
contig physical memory is quite time-consuming and failure-prone, especially
on a fragmented machine. For transparent superpages, Alan implemented
reservations, which are very 'non-enforcing', so to say. If the slot for
some page in a 2M run is not available, no additional measures are taken.

I concur that we should add code to forcibly move out pages that are
blocking a superpage from being formed. Obviously there needs to be a
threshold before this sort of action is taken. Also, a file that has
been previously loaded into memory (either through reading or initially
writing it) may not be well laid out. But, if it gets mapped with a
request for superpages, its existing cache pages could be flushed, so
that it can be read into an appropriate superpage reservation.

Even though there is no effort to reconstruct superpages, we do in fact
get them and they do work well. I don't know how well 1G superpages will
work under the current constructs. It may be necessary to guard them more
closely and/or make more effort to recreate them.

No, 1G cannot work in a transparent way for any practical values.
And my project is not about it.

For transparent pages, promotion and demotion are relatively costly,
because of the need to touch all 512 constituent 4k pages, e.g. to remove
or to allocate and add pv list entries, and to change ptes. I suspect that this
could steal a significant part of the not too large performance gain of using
2m TLB entries.

We configure x86 CPUs to use cached accesses to page table structures,
and a TLB miss is not too awful from a performance PoV if the page table itself
is cached. Moreover, I suspect that most of the gain from superpages comes from
the fact that we are able to use more TLB entries, not because one entry
can serve more VA. When I developed PCID support, it was non-trivial to
find a good benchmark to show its value because of the low cost of a TLB miss.

See some very recent measures of the superpages effect on database
benchmarks, e.g.
https://wiki.postgresql.org/images/7/7d/PostgreSQL_and_Huge_pages_-_PGConf.2019.pdf

IMO the conclusion is that it helps some, but the performance gains are
not high. This is all for normal applications' use of superpages, which
is not the case for the DPDK and Juniper uses of it.

Second, we do not know which files a user would want to map with superpages.
Enforcing contig allocation for each file is a non-starter, see above.
At mmap(2) time, it is too late to try to fix the allocation.

Per above, we could just flush a file from the cache if it is scattered
in physical memory and becomes requested for a superpage mapping. It
will be a high cost, but if it is going to make a significant improvement
in how the application will run, then worth that overhead.

Oh, I probably see one part of the misunderstanding.

I believe that Linux only allows the non-transparent superpages mount
option on something that is similar to our tmpfs, not for real storage
filesystems. In other words, there is no need to flush anything (for them),
but there is some administrative overhead to set up the hugetlb mount and
communicate the location of the mount to the applications.

This is why I intend to use posix shm for final API.

Another problem I see is that our VM is too lazy, which causes
interesting consequences for the API. Assume that we use normal mmap(2)
and do some minimal reasonable modifications to support the flags you
mentioned. Mmap() ends with creating a vm_map_entry_t in the vm_map for the
current address space; the page tables are not filled, and typically
there are no pages in the queue of the backing object. On fault, the page
is allocated and inserted both into the object and the page table.

For non-transparent superpages this is clearly an issue, because the existing
code would allocate some random physical 4k page that cannot satisfy the
alignment and contig requirements. So I need to instantiate everything
at mmap time, and moreover, I must prevent userspace from faulting in the
mmaped region until the page tables are fully populated.

No, the vm_fault.c code could be adapted for non-transparent superpages, but
it is such a large rewrite that it is simply not worth it IMO.

I think that the common case will be that a file will not be randomly
accessed before being mmap'ed with a request for superpages. So, flushing
existing pages is not likely to incur much cost. Another way to mitigate
this would be to look at the size of a file. If its size is less than
the size of a superpage (which is the vast majority of files), then it can
be read in randomly as now. If it is bigger than a superpage, then give it
a superpage reservation so that it is read in to appropriate physical pages.

So my API, either the current device ioctl or something that I will do
with posix shm, intends to pre-allocate phys memory _and_ fill the page tables
at request time. Since this is so different from normal mmap(2)
operation, it is really inconvenient to try to plug it into the existing
syscall. And then it allows me to add semantics not compatible with mmap.

You can document places where you do not provide mmap semantics. So, if
you are not going to allow mmap of files using superpages, you can return
ENOTSUPP when that request is made. That is going to be a lot easier for
application writers to understand than trying to figure out when they have
to use mmap versus shm.

kib updated this revision to Diff 71346.May 3 2020, 10:04 PM
kib edited the summary of this revision. (Show Details)

Use phys pager and some modification of the populate method.
Switch to shm/mmap API.

markj added a comment.May 4 2020, 1:45 PM

Just adding some notes here. Some are not really relevant to the topic or have already been discussed on IRC.

Your interface lacks the ability to have the region backed by a
file (fd, offset parameters of mmap). In looking through it, the
only functionality that it adds that is not easily provided by
mmap (extended with a few extra flags) is the ability to select
the domain to be used. It is not clear to me that this extra bit
of flexibility warrants creating a whole new and different interface.

Our mmap(2) interface currently has eight flag bits reserved for
superpage details to be specified via the MAP_ALIGNED(n) flags
(which are rather poorly described in the manual page). At the moment
it appears that only one of the bits is used where the MAP_ALIGNED_SUPER
flag is defined as MAP_ALIGNED(1). The Linux choices could be added:

#define MAP_ALIGNED_SUPER MAP_ALIGNED(1)
#define MAP_HUGE_2MB MAP_ALIGNED(1)
#define MAP_HUGE_1GB MAP_ALIGNED(2)

Some of the bits reserved for page alignment could be repurposed
to pass in the domain to use, or at least a hint on which domain
to use. Or the domain information could be passed in after the
mmap() using madvise(2).

Jeff proposed a new system call for this purpose a while ago, see D14891.

If the mmap()ing thread has its domain policy set by cpuset, then it should
automatically get memory from the correct domain. I'd prefer to see a reason
that this mechanism cannot be used before adding a new way to request a
specific domain.

This interface is definitely not final. It is the fastest way for me
to get into the internal parts of the implementation.

I somewhat agree with the note about 'fd', but I do not have any intent
to allow mmap of files with superpages; it is too drastic a change for both
VFS/VM and filesystems code.

I believe we also noted somewhere that Linux has the same restriction on
non-transparent superpage use, i.e., they cannot be used to back a regular file.
Implementing this in FreeBSD looks difficult.

So does this mean that existing mmap mappings of files are never promoted
to superpages? Even if the initial mapping requested to be aligned?
I assume that as the pages of the file fault in, they are brought into the
correct place in the superpage reservation, so they will be properly
set up to be promoted.

Current transparent superpages are best-effort. If the allocator cannot claim
all 512 pages for a given superpage, then promotion does not happen.
There is a mechanism called reservations which should help the allocator
not steal a 4k page from the 2m range, basically giving a hint that
the page might become part of a superpage. But a reservation cannot
veto the stealing.

You can see for yourself how often this mechanism works and how often it
does not: look at the vm.pmap.pde.* sysctl values after a buildworld. You
would see that it works, but not fantastically. Promotion failures are
typically more frequent than successful promotions.

Part of the issue is a conservative promotion policy. If clang faults on only 511 4KB
pages in an aligned 2MB region in .text, no promotion occurs. One trick to observe
this difference is to run

$ clang --version # force clang binary to be mapped, so its object is colored
$ dd if=$(which clang) of=/dev/null bs=1M # force reservations to become fully populated
$ make buildkernel

For me this provides a 3-4% reduction in wall clock time on an amd64 system. I measured
a significantly larger speedup on an arm64 server system some time ago.

Before:
real 2m15.761s
user 63m8.811s
sys 3m21.556s

After:
real 2m10.862s
user 60m46.336s
sys 3m16.858s

As I briefly noted in my previous mail, I plan to define the API using
the posix shm interface, perhaps it will start with some flag for shm_open2().

Let me explain some design issues there, in particular, why I do not plan
to provide hugetlb file mappings, and less directly, why I do not want to
use normal mmap(2) (but this is not final).

For superpage mappings to be possible, two things must match:

  1. The backing physical memory must be contiguous, and the contig blocks must start on the natural boundary. I.e., if we have a file with cached pages in vp->v_object and want to eventually map it with 2M superpages, the pages in each 2M range must have contig phys addresses, and each page on a 2M boundary must have the low 21 bits of its phys address zero. For 1G pages, we must have 1G contig runs, with each 1G page having the low 30 bits of its physical address zero.

Per above, if the mapping is initially aligned and pages are placed
properly as they are brought in, then it should work.

  2. The virtual address used for the mapping must be naturally aligned on the selected superpage boundary. This is relatively easy.

For item 1, there are multiple problems. First, trying to allocate
contig physical memory is quite time-consuming and failure-prone, especially
on a fragmented machine. For transparent superpages, Alan implemented
reservations, which are very 'non-enforcing', so to say. If the slot for
some page in a 2M run is not available, no additional measures are taken.

I concur that we should add code to forcibly move out pages that are
blocking a superpage from being formed. Obviously there needs to be a
threshold before this sort of action is taken. Also, a file that has
been previously loaded into memory (either through reading or initially
writing it) may not be well laid out. But, if it gets mapped with a
request for superpages, its existing cache pages could be flushed, so
that it can be read into an appropriate superpage reservation.

Even though there is no effort to reconstruct superpages, we do in fact
get them and they do work well. I don't know how well 1G superpages will
work under the current constructs. It may be necessary to guard them more
closely and/or make more effort to recreate them.

No, 1G cannot work in a transparent way for any practical values.
And my project is not about it.

For transparent pages, promotion and demotion are relatively costly,
because of the need to touch all 512 constituent 4k pages, e.g. to remove
or to allocate and add pv list entries, and to change ptes. I suspect that this
could steal a significant part of the not too large performance gain of using
2m TLB entries.

Since r321378 this cost is well-amortized for objects that stay resident in
memory for a long time, as in the clang example above.

We configure x86 CPUs to use cached accesses to page table structures,
and a TLB miss is not too awful from a performance PoV if the page table itself
is cached. Moreover, I suspect that most of the gain from superpages comes from
the fact that we are able to use more TLB entries, not because one entry
can serve more VA. When I developed PCID support, it was non-trivial to
find a good benchmark to show its value because of the low cost of a TLB miss.

With r321378 superpages can provide additional benefits beyond reduced TLB
pressure. In particular, a single soft fault on a resident page belonging to a fully populated
reservation allows immediate instantiation of a 2MB mapping. So this has the benefit
of reducing the number of page faults as well.

Of course this is not relevant for all types of applications. I want to make the
argument that the reservation system enables multiple optimizations, not just a reduction
in TLB usage.

sys/amd64/amd64/pmap.c
5817

The MI layer only calls pmap_protect() when restricting permissions. I believe this change is not sufficient to avoid soft faults after a protection change.

sys/vm/vm_fault.c
529

/* should be on its own line.

kib marked an inline comment as done.May 4 2020, 3:16 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
5817

Right, I remembered it as a thing to do during the vm_map.c work, but failed to do it. In fact, I also should disable changing the inheritance mode for largepage entries, because CoW cannot work there.

markj added inline comments.May 4 2020, 4:20 PM
sys/sys/mman.h
296

I prefer not to have a custom domain policy if possible. We already have general mechanisms for defining this policy; we should use them unless there is some specific use-case to motivate a custom interface, and in this case the policy should be expressed using a domainset. It is also kind of fragile to specify policy using a single domain ID; it is possible for a system to have empty NUMA domains, and this actually happens in practice.

kib updated this revision to Diff 71372.May 4 2020, 4:45 PM
kib marked 2 inline comments as done.

Allow pmap_enter() on valid largepage entry assuming that only protection changes.
Remove domain arg from the ioctl, assume that contig_alloc uses current thread policy (needs to be considered for 12).

kib updated this revision to Diff 71585.May 9 2020, 2:41 PM

pass over pmap adding handling of 1G superpages in user space
check pkru when entering large page
stop storing shmfd * into phys object, they have different lifetimes
added accounting of the total allocated largepages per sizes
handle requests for wire/unwire largepages
disallow writes to unpopulated largepage shmfd
filled assert messages
add support to posixshmcontrol
remove all vestiges of /dev/hugetlb

kib retitled this revision from Non-trasparent superpage support. to Non-transparent superpage support..May 9 2020, 2:42 PM
kib added a reviewer: alc.
alc added a comment.May 10 2020, 7:21 AM
In D24652#542741, @kib wrote:

m->psind is tightly tied to the transparent promotion. I am not sure; my intuition strongly suggests that transparent promotion to 1G could never work. I think it would just cause 10-20% populated reservations to hang around, never completing.

I really do not see why it is tightly coupled. Sure, vm_reserv sets it when a reservation is fully populated, but it is really only used by the fault handler. AFAIK there is no reason why a pager populate handler cannot set it either. Else why does vm_fault_populate() even try to handle m->psind != 0? The page fault handler does not set OBJ_COLORED when calling vm_pager_populate(). In fact I think Alan at one point mentioned that the populate interface was designed partly to accommodate an implementation of non-transparent superpages.

For what it's worth, Mark is correctly recalling my intentions. First, m->psind should not be "tightly coupled" with reservations. Second, I had hoped that populate would provide a pathway to non-transparent pages.

alc added a comment.May 10 2020, 7:24 AM

My belief is that m->psind is not (or, should not be) specific to vm_reserv, and that anything which manages contiguous memory reservations, like hugetlb or vm_reserv, should be able to pass it as a hint to the fault handler without any special MI handling. Up until now vm_reserv is the only subsystem which provides such reservations, so there may be some coupling, but it should be fixed.

Yes, please.

gbe added a subscriber: gbe.May 10 2020, 7:54 AM
alc added a comment.May 10 2020, 8:28 AM

We configure x86 CPUs to use cached accesses to page table structures,
and TLB miss is not too awful from performance PoV if page table itself
is cached. More, I suspect that most gain from the superpages comes from
the fact that we are able to use more TLB entries, not because one entry
can serve more VA.

For several generations of their microarchitecture, Intel's 2nd level TLB has been unified in two senses of that word: (1) it holds both 4KB and 2MB mappings and (2) it holds both data and instruction mappings. So, you don't really gain very many additional TLB entries by using 4KB and 2MB pages simultaneously. Specifically, you only gain extra entries in the 1st level data and instruction TLBs, where 4KB and 2MB entries are stored in separate structures. A result of this organization that doesn't get enough attention is the effect of competition between data and instruction mappings. In particular, suppose that you are using 2MB data mappings but 4KB instruction mappings for a relatively large program, like PostgreSQL. We've seen cases, e.g., clang, where forcing the use of 2MB instruction mappings reduced 2nd level TLB misses caused by data accesses by half.

kib added a comment.May 10 2020, 9:29 AM
In D24652#545448, @alc wrote:

My belief is that m->psind is not (or, should not be) specific to vm_reserv, and that anything which manages contiguous memory reservations, like hugetlb or vm_reserv, should be able to pass it as a hint to the fault handler without any special MI handling. Up until now vm_reserv is the only subsystem which provides such reservations, so there may be some coupling, but it should be fixed.

Yes, please.

This is in fact v3.0 (or even 3.1) of the patch. After the above discussion with Mark, and some on-line talk, I tried to use m->psind to communicate the level of the pte that needs to be installed by pmap_enter(). Unfortunately, it causes the vm_reserv subsystem to break. The most serious reason was that I cannot clear m->psind in the existing structure of pagers before vm_object_terminate() frees all of the object's pages (I cannot move vm_pager_dealloc() earlier; I tried that at mgmt cdev pager times, and it is worse than the problems it tries to solve). So there are free pages with either m->psind == 1 or, worse, m->psind == 2 left in the free pools.

Then either asserts in vm_reserv, like in vm_reserv_populate(), fire, or vm_fault_soft_fast() misbehaves in mysterious ways. This was patch v2.0. So I switched to half of what was discussed. What is left from v2.0 is the rework to allow faults on the superpage mappings, and to process faults with populate.

Note that I still cannot get away from the special flag to pmap_enter() indicating that a non-level 0 pte must be installed; this is true even for transparent superpages, which is the reason why we explicitly pass psind = 1. For instance, for transparent superpages, psind = 1 with non-consistent PKRU keys is only a soft error, which causes both the populate and fast fault cases to retract to psind = 0 ptes. But for non-transparent superpages, non-consistent PKRU means a user error and must not shred the large pte into smaller ones.

That said, I do not quite see what m->psind > 0 would give for this patch. I need special behavior at all levels:

  • for vm_map, clip must not split superpage entries, so vm_map_entry_t needs a flag to indicate this and avoid splitting
  • for vm_fault/populate, we need to know that the whole superpage entry must be installed (see above), but there are curious additional details. Consider a 1G superpage, which consists of 256k struct vm_pages. If we busy all of them in pager->populate(), and then unbusy them after pmap_enter(), this causes visibly long (many seconds, close to a minute) processing of the page fault. So I coded the interface where a vm_map_entry with the no-clip flag (mask) only requires the first page in the run to be busied.
  • for pmap_enter(), special cases are PKRU and strict avoidance of demotion for pdes.

My point is that additional information besides m->psind > 0 is needed to ensure special behavior in all layers, which might not (easily) get to the m (vm_map) or need to distinguish why psind is set (transparent vs. non-transparent). Moreover, pmap currently cannot operate by looking at m->psind > 0 either; pmap_enter() needs the psind argument.

kib added a comment.May 10 2020, 4:06 PM

One thing that I realize I failed to state sufficiently clearly in my response is that Mark's notes about the patch are about v1.0, where the syscall (actually ioctl) code ran to completion, i.e., it established the vm_map_entry, allocated the backing contig pages, and installed the pdpe/pde entries.

kib added a comment.May 10 2020, 4:09 PM
In D24652#545452, @alc wrote:

For several generations of their microarchitecture, Intel's 2nd level TLB has been unified in two senses of that word: (1) it holds both 4KB and 2MB mappings and (2) it holds both data and instruction mappings. So, you don't really gain very many additional TLB entries by using 4KB and 2MB pages simultaneously. Specifically, you only gain extra entries in the 1st level data and instruction TLBs, where 4KB and 2MB entries are stored in separate structures. A result of this organization that doesn't get enough attention is the effect of competition between data and instruction mappings. In particular, suppose that you are using 2MB data mappings but 4KB instruction mappings for a relatively large program, like PostgreSQL. We've seen cases, e.g., clang, where forcing the use of 2MB instruction mappings reduced 2nd level TLB misses caused by data accesses by half.

Do you remember the results of the PCID testing? There was no measurable reduction of the context switch latency, although hwpmc demonstrated a 10x reduction of the TLB miss rate. (On the other hand, when PTI is enabled and PCID is used to cache kernel-mode ptes, syscall entry latency does shrink by ~50%.)

alc added a comment.May 10 2020, 6:37 PM
In D24652#545459, @kib wrote:
In D24652#545448, @alc wrote:

My belief is that m->psind is not (or, should not be) specific to vm_reserv, and that anything which manages contiguous memory reservations, like hugetlb or vm_reserv, should be able to pass it as a hint to the fault handler without any special MI handling. Up until now vm_reserv is the only subsystem which provides such reservations, so there may be some coupling, but it should be fixed.

Yes, please.

This is in fact v3.0 (or even 3.1) of the patch. After the above discussion with Mark, and some on-line talk, I tried to use m->psind to communicate the level of the pte that needs to be installed by pmap_enter(). Unfortunately, it causes the vm_reserv subsystem to break. The most serious reason was that I cannot clear m->psind in the existing structure of pagers before vm_object_terminate() frees all of the object's pages (I cannot move vm_pager_dealloc() earlier; I tried that at mgmt cdev pager times, and it is worse than the problems it tries to solve). So there are free pages with either m->psind == 1 or, worse, m->psind == 2 left in the free pools.

Then either asserts in vm_reserv, like in vm_reserv_populate(), fire, or vm_fault_soft_fast() misbehaves in mysterious ways. This was patch v2.0. So I switched to half of what was discussed. What is left from v2.0 is the rework to allow faults on the superpage mappings, and to process faults with populate.

Note that I still cannot get away from the special flag to pmap_enter() indicating that a non-level 0 pte must be installed; this is true even for transparent superpages, which is the reason why we explicitly pass psind = 1. For instance, for transparent superpages, psind = 1 with non-consistent PKRU keys is only a soft error, which causes both the populate and fast fault cases to retract to psind = 0 ptes. But for non-transparent superpages, non-consistent PKRU means a user error and must not shred the large pte into smaller ones.

That said, I do not quite see what m->psind > 0 would give for this patch. I need special behavior at all levels:

  • for vm_map, clip must not split superpage entries, so vm_map_entry_t needs a flag to indicate this and avoid splitting
  • for vm_fault/populate, we need to know that the whole superpage entry must be installed (see above), but there are curious additional details. Consider a 1G superpage, which consists of 256k struct vm_pages. If we busy all of them in pager->populate(), and then unbusy them after pmap_enter(), this causes visibly long (many seconds, close to a minute) processing of the page fault. So I coded the interface where a vm_map_entry with the no-clip flag (mask) only requires the first page in the run to be busied.
  • for pmap_enter(), special cases are PKRU and strict avoidance of demotion for pdes.

My point is that additional information besides m->psind > 0 is needed to ensure special behavior in all layers, which might not (easily) get to the m (vm_map) or need to distinguish why psind is set (transparent vs. non-transparent). Moreover, pmap currently cannot operate by looking at m->psind > 0 either; pmap_enter() needs the psind argument.

If I understood everything correctly, then I don't think that I would argue with anything that you wrote. The way that I would summarize your points is to say that people must not forget the distinction between virtual-to-physical mappings and physical pages. The value of "m->psind" says something about the physical page, but almost nothing about the mapping(s) to that physical page. In other words, to have a superpage mapping, you need to have a physical superpage, but in general having a physical superpage does not imply that every mapping to it is going to be a superpage mapping. (Hence, as you point out, pmap_enter() currently takes a psind value to define the size of the mapping.) Moreover, the value of "m->psind" should never be expected to say anything about the properties of a mapping.

I would like to see any new mechanism by which we are creating physical superpages set the value of "m->psind" appropriately, so that operations on physical pages might exploit the knowledge that they are working with a physical superpage, regardless of whether that physical superpage was created by the reservation system or by some other means. That said, what makes this current patch "tricky" has little to do with physical memory. This patch is creating a new type of mapping with different properties than we've had before. For example, if I understood correctly, it is not demotable for purposes of changing any mapping attributes at a 4KB granularity. So, it makes perfect sense to me that the data structures that manage mappings (as opposed to physical memory) need to be able to express this difference. (And, the size of the physical page, i.e., "m->psind", does not describe those differences.)

alc added a comment.May 10 2020, 7:43 PM
In D24652#545553, @kib wrote:
In D24652#545452, @alc wrote:

For several generations of their microarchitecture, Intel's 2nd level TLB has been unified in two senses of that word: (1) it holds both 4KB and 2MB mappings and (2) it holds both data and instruction mappings. So, you don't really gain very many additional TLB entries by using 4KB and 2MB pages simultaneously. Specifically, you only gain extra entries in the 1st level data and instruction TLBs, where 4KB and 2MB entries are stored in separate structures. A result of this organization that doesn't get enough attention is the effect of competition between data and instruction mappings. In particular, suppose that you are using 2MB data mappings but 4KB instruction mappings for a relatively large program, like PostgreSQL. We've seen cases, e.g., clang, where forcing the use of 2MB instruction mappings reduced 2nd level TLB misses caused by data accesses by half.

Do you remember the results of the PCID testing? There was no measurable reduction of the context switch latency, although hwpmc demonstrated a 10x reduction of the TLB miss rate. (On the other hand, when PTI is enabled and PCID is used to cache kernel-mode ptes, syscall entry latency does shrink by ~50%.)

I do. However, I don't recall which benchmark saw the 10x reduction in TLB misses. Was it lmbench's lat_ctx? If so, then the reason why a 10x reduction in TLB misses has little effect on that benchmark's results is easy to explain: After every context switch to a process, that process sequentially reads *every* word in the malloc()ed region whose size is determined by the command line parameter to the benchmark. In other words, for every TLB miss that PCID might eliminate you are performing 512 sequential memory accesses (on a 64-bit machine). So, it shouldn't be surprising that the savings from the avoided TLB misses after a context switch are going to appear insignificant. I don't think that I was aware of this "feature" of lat_ctx when you were testing PCID.

kib added a comment.May 10 2020, 8:37 PM
In D24652#545603, @alc wrote:

I would like to see any new mechanism by which we are creating physical superpages set the value of "m->psind" appropriately, so that operations on physical pages might exploit the knowledge that they are working with a physical superpage, regardless of whether that physical superpage was created by the reservation system or by some other means. That said, what makes this current patch "tricky" has little to do with physical memory. This patch is creating a new type of mapping with different properties than we've had before. For example, if I understood correctly, it is not demotable for purposes of changing any mapping attributes at a 4KB granularity. So, it makes perfect sense to me that the data structures that manage mappings (as opposed to physical memory) need to be able to express this difference. (And, the size of the physical page, i.e., "m->psind", does not describe those differences.)

I am saying that in this patch, setting m->psind would be a write-only operation; this data is unused because other layers must provide some additional controls anyway. And then, if setting m->psind, I get problems with reservations, whose resolution I do not see as providing any value either to this work or to the vm architecture. I think that to resolve this problem, some notion of psind 'ownership' should be developed. For instance, as Mark proposed, m->object->flags & OBJ_COLORED might be the indication that m->psind is due to a populated reservation.

kib added a comment.May 10 2020, 8:41 PM
In D24652#545630, @alc wrote:
In D24652#545553, @kib wrote:
In D24652#545452, @alc wrote:

For several generations of their microarchitecture, Intel's 2nd level TLB has been unified in two senses of that word: (1) it holds both 4KB and 2MB mappings and (2) it holds both data and instruction mappings. So, you don't really gain very many additional TLB entries by using 4KB and 2MB pages simultaneously. Specifically, you only gain extra entries in the 1st level data and instruction TLBs, where 4KB and 2MB entries are stored in separate structures. A result of this organization that doesn't get enough attention is the effect of competition between data and instruction mappings. In particular, suppose that you are using 2MB data mappings but 4KB instruction mappings for a relatively large program, like PostgreSQL. We've seen cases, e.g., clang, where forcing the use of 2MB instruction mappings reduced 2nd level TLB misses caused by data accesses by half.

Do you remember the results of the PCID testing? There was no measurable reduction in context switch latency, although hwpmc demonstrated a 10x reduction in the TLB miss rate. (On the other hand, when PTI is enabled and PCID is used to cache kernel-mode PTEs, syscall entry latency does shrink by ~50%.)

I do. However, I don't recall which benchmark saw the 10x reduction in TLB misses. Was it lmbench's lat_ctx? If so, then the reason why a 10x reduction in TLB misses has little effect on that benchmark's results is easy to explain: After every context switch to a process, that process sequentially reads *every* word in the malloc()ed region whose size is determined by the command line parameter to the benchmark. In other words, for every TLB miss that PCID might eliminate you are performing 512 sequential memory accesses (on a 64-bit machine). So, it shouldn't be surprising that the savings from the avoided TLB misses after a context switch are going to appear insignificant. I don't think that I was aware of this "feature" of lat_ctx when you were testing PCID.

I remember that I tried different allocation sizes, including something ridiculously small. Anyway, what I tried to say is that I do not expect to see significant changes in the target software (DPDK) from the use of largepage mappings, in particular because they would be dominated by other memory traffic, similar to what you noted about lat_ctx.

On the other hand, having an official way to get a physically contiguous mapping in userspace would be immediately useful for DPDK, OFED, and some other HPC consumers, from what I was told.

alc added a comment.May 10 2020, 10:33 PM
In D24652#545634, @kib wrote:
In D24652#545630, @alc wrote:
In D24652#545553, @kib wrote:
In D24652#545452, @alc wrote:

For several generations of their microarchitecture, Intel's 2nd level TLB has been unified in two senses of that word: (1) it holds both 4KB and 2MB mappings and (2) it holds both data and instruction mappings. So, you don't really gain very many additional TLB entries by using 4KB and 2MB pages simultaneously. Specifically, you only gain extra entries in the 1st level data and instruction TLBs, where 4KB and 2MB entries are stored in separate structures. A result of this organization that doesn't get enough attention is the effect of competition between data and instruction mappings. In particular, suppose that you are using 2MB data mappings but 4KB instruction mappings for a relatively large program, like PostgreSQL. We've seen cases, e.g., clang, where forcing the use of 2MB instruction mappings reduced 2nd level TLB misses caused by data accesses by half.

Do you remember the results of the PCID testing? There was no measurable reduction in context switch latency, although hwpmc demonstrated a 10x reduction in the TLB miss rate. (On the other hand, when PTI is enabled and PCID is used to cache kernel-mode PTEs, syscall entry latency does shrink by ~50%.)

I do. However, I don't recall which benchmark saw the 10x reduction in TLB misses. Was it lmbench's lat_ctx? If so, then the reason why a 10x reduction in TLB misses has little effect on that benchmark's results is easy to explain: After every context switch to a process, that process sequentially reads *every* word in the malloc()ed region whose size is determined by the command line parameter to the benchmark. In other words, for every TLB miss that PCID might eliminate you are performing 512 sequential memory accesses (on a 64-bit machine). So, it shouldn't be surprising that the savings from the avoided TLB misses after a context switch are going to appear insignificant. I don't think that I was aware of this "feature" of lat_ctx when you were testing PCID.

I remember that I tried different allocation sizes, including something ridiculously small. Anyway, what I tried to say is that I do not expect to see significant changes in the target software (DPDK) from the use of largepage mappings, in particular because they would be dominated by other memory traffic, similar to what you noted about lat_ctx.

On the other hand, having an official way to get a physically contiguous mapping in userspace would be immediately useful for DPDK, OFED, and some other HPC consumers, from what I was told.

We have a paper to appear this summer that includes a detailed, head-to-head comparison of FreeBSD's current superpage support, Linux's transparent huge pages (THP), and the modifications to Linux proposed in two credible papers, Ingens and Hawkeye, that have appeared in the past few years. I should be able to share this paper with you in a few weeks. There were a few applications where FreeBSD's conservative promotion policy was significantly slower (i.e., a difference > 2%) than Linux's allocate-and-map a superpage on first access behavior, which is essentially what you are implementing here. More often than not, "HPC" applications are touching the bulk of their memory during initialization, and so promotion occurs quickly enough. You just pay the cost of a lot more page faults. Conversely, there were two applications, one being gcc, where FreeBSD outperformed Linux by 5% or more because THP doesn't handle incremental heap growth very well (unless the growth is by whole superpages).

alc added a comment.May 10 2020, 11:02 PM
In D24652#545633, @kib wrote:
In D24652#545603, @alc wrote:

I would like to see any new mechanism by which we are creating physical superpages set the value of "m->psind" appropriately, so that operations on physical pages might exploit the knowledge that they are working with a physical superpage, regardless of whether that physical superpage was created by the reservation system or by some other means. That said, what makes this current patch "tricky" has little to do with physical memory. This patch is creating a new type of mapping with different properties than we've had before. For example, if I understood correctly, it is not demotable for purposes of changing any mapping attributes at a 4KB granularity. So, it makes perfect sense to me that the data structures that manage mappings (as opposed to physical memory) need to be able to express this difference. (And, the size of the physical page, i.e., "m->psind", does not describe those differences.)

I am saying that in this patch, setting m->psind would be a write-only operation: the data is unused because the other layers must provide additional controls anyway. And if I do set m->psind, I get problems with reservations, whose resolution I do not see as providing any value either to this work or to the vm architecture. I think that to resolve this, some notion of psind 'ownership' should be developed. For instance, as Mark proposed, m->object->flags & OBJ_COLORED might indicate that m->psind is due to a populated reservation.

Okay, I will probably ask again about this later. I am still trying to wrap up my class for the semester, and I have a different paper that is due in roughly two weeks. So, I can't imagine that I will be able to spend more time looking at this change or the PV list locking change until the end of the month. However, in the time that I've spent today looking at the change and the sample application, nothing has really bothered me about it from an overall design standpoint.

alc added a comment.May 10 2020, 11:09 PM

On a barely related note, I wondered if you or Mark had observed whether the recently committed update to jemalloc has "fixed" the horrific anonymous memory fragmentation that occurs with applications like clang that mmap() a lot of files.

Looking at DPDK, it wants to determine the number of huge pages available in each domain before performing any allocation. I think the only way to get that information today is to read vm.phys_free, which is meant to be human-readable. We could add a set of sysctls vm.domain.<domidx>.largepage.<psind>.free to provide that information.

This is not quite enough. DPDK operates in legacy mode on FreeBSD, which means that it pre-allocates all of the large pages it can, based on the contigmem configuration. On Linux this is controlled by hugetlbfs reservations. I suspect it would be useful for posixshmcontrol to be able to reserve large pages, so an administrator can provide a fixed number of preallocated large pages to applications. On NUMA systems cpuset(1) could be used to specify the policy. What do you think?

sys/kern/uipc_shm.c
295

"it takes"

793

I am thinking about a function that attempts to reclaim from the domain(s) specified by the policy. The implementation should:

  1. attempt to allocate contig memory from each domain specified in the policy, then
  2. if not successful, restart the iterator and attempt to reclaim memory from each domain before trying the allocation again.
798

So in the default policy, if insufficient contiguous memory is available, we will just keep retrying in a loop?

markj added inline comments.May 12 2020, 10:25 PM
sys/kern/uipc_shm.c
296

Did you look at all at the "object busy" mechanism that Jeff added? It effectively allows one to busy all pages in an object.

kib marked an inline comment as done.May 13 2020, 4:19 PM
kib added inline comments.
sys/kern/uipc_shm.c
296

Well, vm_object_busy() prevents new busying, or rather, it allows new busy attempts but quickly backs them out. Existing busy pages are left as-is, though, so something that iterates over all pages in the superpage would be needed, like vm_page_ps_test().

But none of that is needed there: the single page is busied only to avoid a truncation while pmap_enter() is in progress, in some future where truncation is implemented for largepage shmfd.

798

Yes, this should be improved somehow. I think the improvement should be mostly local to the function that you propose above.

The detail that I do not like now is that the vm_wait() sleep is uninterruptible. I think that, e.g., a vm_wait_intr() should be added; then the thread_susp_check() call below can be removed. I restructured the loop to make it easier to change.

kib updated this revision to Diff 71724.May 13 2020, 4:22 PM

Restructure the truncate_largepage() loop to avoid freeing a page that was already inserted into the object, because the dropped object lock allows mmap() to succeed.
Add a -l option to 'posixshmcontrol create' to specify a largepage shmfd and its large page size. You still need to truncate afterward.

kib updated this revision to Diff 71758.May 14 2020, 11:08 AM

Make largepage allocation interruptible, since otherwise it is an easy DoS; add vm_wait_intr().
Change the default policy to do one contig_reclaim().

kib updated this revision to Diff 71793.May 14 2020, 7:26 PM

Rebase after conflicts resolution.

kib updated this revision to Diff 71838.May 15 2020, 9:38 PM

Fix MAP_FIXED | MAP_EXCL error return in non-error case.
Make the number of contig reclaims for the default alloc policy configurable.

kib updated this revision to Diff 71897.May 17 2020, 9:56 PM

Remove MAP_LARGEMAP. Its only purpose was to turn off the addr hint adjustment for non-MAP_FIXED mode. The same effect can be achieved by saving the original hint and passing it when we detect that the shmfd is for largepages.

Add memfd_create(MFD_HUGETLB) support, mostly untested. I think there are fine semantic differences between Linux and us there, but I believe the porting should be much easier anyway.

kib updated this revision to Diff 71899.May 17 2020, 9:58 PM
kib retitled this revision from Non-transparent superpage support. to Non-transparent superpages support..

Upload corrected version of the patch (fd != 0 -> fd == -1).

kib updated this revision to Diff 71901.May 18 2020, 12:20 AM

Fix flow control for calling vm_pager_populate.
Fix largepage accounting in pmap.
Fix shm largepage statistic.

kib updated this revision to Diff 74013.Thu, Jul 2, 8:46 AM

Refresh patch after rebase. Fix conflicts in vm_map.c.

kib updated this revision to Diff 74286.EditedFri, Jul 10, 6:35 PM
kib added a subscriber: kevans.

Rebase after recent uipc_shm.c changes.

kevans added inline comments.Fri, Jul 10, 7:16 PM
sys/vm/vm_mmap.c
424

No specific suggestion, but it'd be nice if we had a cleaner way to do this, though I suppose this won't really expand to many other file types. shmfd can now tell easily enough, from some fo_* operation for example, whether it is configured for this, since we recently started recording the opening shm_flags.