Page MenuHomeFreeBSD

Implement software AF and DBM management for arm64.
ClosedPublic

Authored by markj on Jul 10 2019, 8:58 PM.
Tags
None
Referenced Files
F106953890: D20907.id59642.diff
Wed, Jan 8, 12:32 AM
F106953625: D20907.id59691.diff
Wed, Jan 8, 12:27 AM
Unknown Object (File)
Tue, Jan 7, 1:03 PM
Unknown Object (File)
Sat, Dec 28, 7:01 AM
Unknown Object (File)
Thu, Dec 26, 1:13 PM
Unknown Object (File)
Wed, Dec 25, 10:53 AM
Unknown Object (File)
Sun, Dec 22, 7:16 PM
Unknown Object (File)
Sat, Dec 21, 10:21 PM

Details

Summary

Following Alan's suggestion, use a software bit to emulate the DBM bit.
This will make it easier to support hardware DBM management in the
future. I also implemented software AF handling.

The change adds ATTR_SW_DBM. If it is set, the PTE is logically
writeable, but it is write-protected until the first write.
pmap_fault() marks the PTE read/write when handling a permission fault.

pmap_protect() and pmap_remove_write() are modified to clear
ATTR_SW_DBM when removing write permissions. I implemented
pmap_clear_modify(), and updated pmap_copy() to clear access and
modification bits. pmap_ts_referenced() now clears the access flag.

The trap handler now invokes pmap_fault() for all kernel faults. The
comment about only having to worry about translation faults in the
direct map is wrong: mappings in exec_map may be promoted or demoted, so
we have to handle translation faults there too.

Test Plan

Without the patch:

# for i in $(seq 1 10); do rm -rf /usr/obj/*; time make -s -j96 buildkernel >/dev/null; done
      285.97 real     19522.85 user      1936.66 sys
      288.78 real     19520.76 user      1978.89 sys
      287.94 real     19501.91 user      1971.79 sys
      289.48 real     19706.73 user      1932.10 sys
      290.79 real     19713.99 user      1948.50 sys
      288.80 real     19685.91 user      1933.88 sys
      288.42 real     19542.20 user      1984.51 sys
      286.64 real     19622.04 user      1981.77 sys
      288.69 real     19677.83 user      1942.67 sys
      288.19 real     19562.56 user      2058.54 sys

With the patch:

# for i in $(seq 1 10); do rm -rf /usr/obj/*; time make -s -j96 buildkernel >/dev/null; done
      295.40 real     20009.13 user      2093.96 sys                                                                                          
      296.74 real     20134.21 user      2093.74 sys                   
      298.14 real     20222.30 user      2046.39 sys                   
      297.63 real     20196.40 user      2070.36 sys
      295.27 real     20312.78 user      2047.11 sys    
      293.93 real     20194.31 user      2056.06 sys               
      297.38 real     20451.89 user      2035.46 sys
      297.60 real     20138.13 user      2061.30 sys                                                                                          
      294.96 real     20157.21 user      2044.70 sys                                                                                          
      297.25 real     20164.89 user      2076.60 sys

I tested two multi-threaded applications used during the build,
ctfmerge and lld. Their runtimes don't change significantly, so
I suspect that using a read lock in pmap_fault() is unlike to provide
much benefit.

Without the patch:

# for i in $(seq 1 10); do time ld.lld --export-dynamic -T /usr/src/sys/conf/ldscript.arm64 -o kernel.full -X *.o; done
        2.45 real         3.62 user        32.91 sys
        2.39 real         3.86 user        32.02 sys
        2.38 real         3.57 user        31.68 sys
        2.43 real         3.91 user        32.20 sys
        2.39 real         3.45 user        32.00 sys
        2.34 real         3.54 user        33.38 sys
        2.36 real         3.57 user        30.85 sys
        2.42 real         3.53 user        34.30 sys
        2.34 real         3.65 user        32.65 sys
        2.41 real         3.79 user        31.73 sys
# for i in $(seq 1 10); do time ctfmerge -L VERSION -o kernel *.o; done                                                                                                                                                                     
       13.85 real        31.20 user         0.97 sys                                                                                          
       13.84 real        31.31 user         0.83 sys                                                                                          
       13.86 real        31.19 user         1.00 sys                                                                                          
       13.79 real        30.94 user         0.84 sys                                                                                          
       13.98 real        31.87 user         0.92 sys                                                                                          
       14.08 real        31.86 user         1.04 sys                                                                                          
       14.05 real        31.95 user         0.89 sys                                                                                          
       13.73 real        30.88 user         0.83 sys                                                                                          
       14.05 real        31.83 user         1.05 sys                                                                                          
       14.05 real        31.77 user         0.94 sys

With the patch:

# for i in $(seq 1 10); do time ld.lld --export-dynamic -T /usr/src/sys/conf/ldscript.arm64 -o kernel.full -X *.o; done                       
        2.48 real         3.51 user        36.54 sys                                                                                          
        2.33 real         3.41 user        33.98 sys                                                                                          
        2.30 real         3.36 user        35.73 sys                                                                                          
        2.34 real         3.57 user        33.00 sys                                                                                          
        2.33 real         3.55 user        35.41 sys                                                                                          
        2.32 real         3.52 user        34.47 sys                                                                                          
        2.31 real         3.51 user        34.28 sys                                                                                          
        2.41 real         3.91 user        35.10 sys                   
        2.32 real         3.85 user        34.24 sys                                                                                          
        2.40 real         3.35 user        33.40 sys
# for i in $(seq 1 10); do time ctfmerge -L VERSION -o kernel *.o; done
       13.83 real        31.57 user         0.72 sys                                                                                          
       13.86 real        31.55 user         0.75 sys                                                                                          
       13.96 real        31.86 user         0.75 sys                                                                                          
       13.77 real        31.21 user         0.77 sys                   
       13.65 real        30.83 user         0.69 sys                                                                                          
       13.80 real        31.36 user         0.64 sys
       13.65 real        30.76 user         0.78 sys
       13.81 real        31.31 user         0.68 sys
       13.68 real        30.88 user         0.65 sys
       13.77 real        31.21 user         0.78 sys

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 25285
Build 23949: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/arm64/arm64/pmap.c
221–224

I would consider restricting the use of ATTR_SW_DBM to managed mappings.

sys/arm64/arm64/pmap.c
326–329

I would suggest that you go ahead and commit these style changes.

sys/arm64/arm64/pmap.c
326–329

Given that you are editing these definitions, maybe put them in alphabetical order.

sys/arm64/arm64/pmap.c
4744–4745

I don't think that you should be removing "not_cleared". It exists for reasons unrelated to safe_to_clear_referenced(). Suppose that the only mapping of "m" is as part of a superpage mapping and that "m" is not the 4KB page within the superpage that is chosen by the "hash" to trigger the actual clearing of the reference bit. We still want pmap_ts_referenced() to return 1, not 0, to indicate that "m" has been referenced.

sys/arm64/arm64/pmap.c
3421

This is also a speculative mapping, so ATTR_AF should be cleared, just like pmap_enter_quick_locked().

sys/arm64/arm64/pmap.c
326–329

To be honest, I would just as well see these macros deleted, and have us use atomics directly on PTEs.

sys/arm64/arm64/pmap.c
2930–2933

This looks like tabs got replaced by spaces.

sys/arm64/arm64/pmap.c
5606–5612

Suppose that we have two concurrent calls on the same address to this function because of a missing ATTR_AF bit. Isn't the caller who loses the race to acquire the pmap lock going to receive KERN_FAILURE, and as a result call vm_fault()?

markj marked 5 inline comments as done.
  • Fix style in PTE macro definitions.
  • Fix reference bit counting in pmap_ts_referenced(). Also fix a bug where we were clearing AF in wired 4KB mappings.
  • In pmap_fault(), don't check whether AF is already set.
  • Don't set AF in mappings created by pmap_enter_2mpage().
  • Fix whitespace before a comment.
sys/arm64/arm64/pmap.c
326–329

I created them so I could find where we operate on the page table.

sys/arm64/arm64/pmap.c
221–224

I had thought about it a bit and didn't see a strong reason to go either way. I'll change it.

326–329

Hmm, I rather like them. It's been helpful to be able to quickly locate all the places where we modify a PTE. I was going to add pmap_set_bits() and pmap_clear_bits() here to replace some of the inline atomics that I added. I did that on RISC-V too.

sys/arm64/arm64/trap.c
220

I'm not sure why we require both read and write permissions for a write fault. It means that write-only mappings can't be written to.

sys/arm64/arm64/trap.c
217–220

This is a change in behaviour - we don't set VM_PROT_READ in this case anymore. The WnR bit is not valid for instruction aborts, so the old code seemed a bit dubious to me.

sys/arm64/arm64/pmap.c
4652–4653

Suppose that there are two superpage mappings and one 4KB page mapping. Further, suppose that because of the "hash", we clear the accessed bit for one of the superpage mappings but not the other. The first increments cleared and the second increments not_cleared. Then, as we try to process the 4KB page mapping, we are forced to retry. Consequently, we will reexamine the superpage mappings. For the one that had its accessed bit cleared, there is no issue. However, for the one that did not have its accessed bit clear, we should beware of "double counting". This is why not_cleared was reset on a retry. In contrast, in this new version refs is being incremented both times that we examine the superpage mapping that doesn't have its accessed bit cleared.

markj marked an inline comment as done.
  • Rebase.
  • Fix pmap_ts_referenced() reference counting, second attempt.
  • Add pmap_{clear,set}_bits().
  • Set ATTR_SW_DBM only on managed mappings.
sys/arm64/arm64/pmap.c
4652–4653

I see. I made the same mistake in the risc-v pmap.

I also plan to measure how long lld takes to link the kernel, with and without this change applied. lld is multi-threaded, so we can get a sense for whether using a read lock in pmap_fault() is beneficial.

sys/arm64/arm64/pmap.c
2832–2834

This could be committed now. It doesn't really depend on the rest of the patch.

2930–2933

This could be committed now. It doesn't really depend on the rest of the patch.

3159–3164

This doesn't appear to write-protect unmanaged mappings. I think that you should restore:

if ((prot & VM_PROT_WRITE) == 0)
        new_l3 |= ATTR_AP(ATTR_AP_RO);

here and below ...

3173–3174

... change this to

if ((prot & VM_PROT_WRITE) != 0) {
        new_l3 |= ATTR_SW_DBM;
        if ((flags & VM_PROT_WRITE) == 0)
                new_l3 |= ATTR_AP(ATTR_AP_RO);
}
3913–3914

This seems backwards. You want to set the bit, not clear it.

3961–3962

This seems backwards. You want to set the bit, not clear it.

4309

This could be committed now. It doesn't really depend on the rest of the patch.

4604–4605

Shouldn't this be:

if ((oldpte & ATTR_AP_RW_BIT) ==
4874

Shouldn't this be pmap_set_bits(l3, ATTR_AP_RW_BIT)? Or its equivalent pmap_set_bits(l3, ATTR_AP(ATTR_AP_RO))?

5610

I would suggest adding an XXX comment about this optimization. In other words, I don't think that we need this optimization in place before committing this change.

markj marked 5 inline comments as done.

Fix ATTR_AP_RW_BIT handling bugs.

markj added inline comments.
sys/arm64/arm64/pmap.c
3364

We should reverse the order of these tests.

  • Rebase.
  • Update pmap_mincore() to only call pmap_pte_dirty() on managed mappings.
sys/arm64/arm64/pmap.c
534

I would retain the first sentence, replacing the word "page" with the acronym "PTE". (The way that a "dirty bit" is implemented on this architecture is unusual enough that sentences like this may help the person reading this code for the first time.)

3421

Strictly speaking, l don't think that we should clear ATTR_AF until we know that this is a managed mapping, which is below.

3714

Strictly speaking, l don't think that we should clear ATTR_AF until we know that this is a managed mapping, which is below.

markj marked 2 inline comments as done.
  • Leave AF set in unmanaged speculatively created mappings.
  • Restore part of the comment above pmap_pte_dirty().
sys/arm64/arm64/pmap.c
3421

For consistency, should the amd64 and other pmaps set PG_A on unmanaged mappings?

I also plan to measure how long lld takes to link the kernel, with and without this change applied. lld is multi-threaded, so we can get a sense for whether using a read lock in pmap_fault() is beneficial.

See the updated testing section.

sys/arm64/arm64/pmap.c
3421

Yes.

sys/arm64/arm64/pmap.c
4688–4689

This paragraph describes code that has been (correctly) deleted, so it too can be deleted.

sys/arm64/arm64/pmap.c
4843

Just an FYI, not a request for a change ... I really don't remember why I included this validity test when I wrote the amd64 and i386 versions. On a successful demotion, this PTE has to be valid.

Remove obsolete comment from pmap_ts_referenced().

markj added inline comments.
sys/arm64/arm64/pmap.c
4688–4689

I note that it is present in the i386 pmap.

4843

I can't see a reason for it either.

I've been testing this change in combination with the just committed r349975, and I haven't seen any suspicious behavior. Moreover, in regards to reviewing this patch, I've read over it multiple times and I'm past the point of diminishing returns. So, I'm going to suggest that you commit it.

sys/arm64/arm64/pmap.c
5467

Since this line is moving, I would suggest rewriting it in a more conventional way:

managed = (tpte & ATTR_SW_MANAGED) != 0;
This revision is now accepted and ready to land.Jul 13 2019, 4:58 PM

On amd64, we've done a bit of testing here with https://github.com/GraphChi/graphchi-cpp doing pagerank, and we've seen a lot of vm object lock contention and a little pmap lock contention. In regards to the pmap lock, some of the functions showing up in the lock profile, e.g., pmap_is_prefaultable(), really only need a read lock on the pmap. In other words, I think that there may be motivation beyond accessed and dirty bit emulation to consider changing the pmap lock to a rw lock.

sys/arm64/arm64/pmap.c
4701–4706

Actually, this comment isn't describing the code that was deleted. It is about the code that remains. It should be restored, and I should revert the i386 commit. :-(

sys/arm64/arm64/pmap.c
5353

Can you please assert here that the page is dirty if it is writeable.

markj marked 4 inline comments as done.
  • Restore comment in pmap_ts_referenced().
  • Assert that writeable mappings are dirty in pmap_l2_demote_locked().
  • Style in pmap_mincore().
This revision now requires review to proceed.Jul 14 2019, 4:22 PM
sys/arm64/arm64/pmap.c
5362

Shouldn't this be: ... != (ATTR_AP_RW_BIT | ATTR_SW_DBM)? In other words, that DBM is set, but our "inverted" dirty bit is still 1?

sys/arm64/arm64/pmap.c
4701–4706

Hmm. Well, we want to leave the access flag set independent of whether or not the mapping may be demoted, since otherwise a subsequent access will trigger a soft fault or a hw page table walk depending on the platform, and we wish to avoid that for wired mappings. Indeed, we have the same check for ATTR_SW_WIRED when examining small mappings of the page. So, I would argue that the deleted comment is too specific anyway. I will restore it for the purpose of this commit, though.

5362

I'm a bit confused. On amd64, when promoting a PDE, we clear PG_RW on the PTEs if they are not dirty. In particular, we will not create a writeable clean superpage, which is why the equivalent assertion is there in pmap_demote_pde(). On arm64, pmap_promote_l2() may create a clean, writeable L2 entry, presumably because without this change all writeable entries were considered dirty. Shouldn't we also implement amd64's logic here for that assertion to be correct?

sys/arm64/arm64/pmap.c
4701–4706

On native (as opposed to bhyve EPT virtualized) amd64, we were and are still clearing PG_A for 4KB wired mappings. This comment was written before the bhyve integration, and I think it made more sense then, because we were clearly handling PG_A differently for 2MB and 4KB mappings.

Here is what I think makes this code and comment confusing. On early EPT hardware, which required software accessed and dirty bit emulation, the emulation had limitations (compared to arm64 :-) ). Specifically, I believe that we couldn't have a PTE that was dirty but not accessed, and so we have to destroy the mapping in that case. Thus, in that case, on a future access, we don't just exercise the pmap-level emulation code, but invoke vm_fault(). We don't want to invoke vm_fault() (and potentially sleep for an indeterminate period) if the mapping is wired, so we do nothing to the PTE. However, in other cases, even for wired mappings, I believe that we do clear the emulated accessed bit and let the pmap-level emulation code turn the accessed bit back on on a future access. In other words, we are not trying to prevent emulation exceptions on wired mappings. We are only trying to prevent vm_fault() calls.

arm64 doesn't have this issue found in the old EPT hardware, so we could choose to unconditionally clear the accessed bit on wired 4KB page mappings.

5362

Yes, we should.

  • Check for AF when promoting.
  • Convert clean, RW mappings to RO when promoting.
  • When demoting, assert that writeable mappings are dirty.

The changes look okay. I've booted a kernel with this latest version of the patch, and I'm testing it overnight. I'll let you know if the promotion count changes significantly.

Without the patch, a "buildworld" generates the following counts:

vm.pmap.l2.promotions: 113985
vm.pmap.l2.p_failures: 1
vm.pmap.l2.mappings: 59015
vm.pmap.l2.demotions: 11697
vm.reserv.reclaimed: 0
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER

     0,      -1, 255792K,    163

vm.reserv.fullpop: 54
vm.reserv.freed: 884846
vm.reserv.broken: 0

And, with the patch, the counts are:

vm.pmap.l2.promotions: 113896
vm.pmap.l2.p_failures: 810
vm.pmap.l2.mappings: 58187
vm.pmap.l2.demotions: 10528
vm.reserv.reclaimed: 0
vm.reserv.partpopq: 
DOMAIN    LEVEL     SIZE  NUMBER

     0,      -1, 255820K,    163

vm.reserv.fullpop: 56
vm.reserv.freed: 884929
vm.reserv.broken: 0

The changes in the counts seem reasonable. I expected an uptick in p_failures, but not the reduction in demotions. I expect the mappings count to vary a bit. It's timing dependent, because it depends on portions of clang's code and read-only data being faulted in by one process so that future clang processes prefault superpage mappings during execve().

This revision is now accepted and ready to land.Jul 15 2019, 2:52 PM