Avoid vm_free_count() for physmem available before VM ACPI walk.
ClosedPublic

Authored by dgmorris_earthlink.net on Dec 10 2020, 5:20 PM.

Details

Reviewers
vangyzen
markj
Summary

Derive physical memory available in cpu_startup() from the phys_avail
array instead of vm_free_count(), as that requires vm_ndomains() to be
set (which it isn't as that is at (SI_SUB_VM - 1) which is after CPU).

On some x64 platforms, this results in major amounts of memory not being
shown as available in the boot message, which can be confusing.
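
The patch itself is not reproduced on this page, so the following is only a minimal userland sketch of the idea in the summary. It assumes the usual phys_avail layout: start/end address pairs with an exclusive end, terminated by a pair whose end is 0. The names paddr_t and avail_bytes() are hypothetical stand-ins, not kernel identifiers.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t paddr_t;	/* hypothetical stand-in for vm_paddr_t */

/*
 * Sum the sizes of all [start, end) pairs in a phys_avail-style array.
 * Assumption: the array is terminated by a pair whose end address is 0.
 */
static uint64_t
avail_bytes(const paddr_t *avail)
{
	uint64_t total = 0;

	for (size_t i = 0; avail[i + 1] != 0; i += 2)
		total += avail[i + 1] - avail[i];
	return (total);
}
```

Because this only walks the array, it does not depend on the free page counters having been populated yet.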

Test Plan

Booted on platform showing the symptom with a modified version containing both versions of the logic to check correctness:

2020-12-10T00:41:17.099818+00:00 <0.7> /boot/kernel.amd64/kernel: real memory = 807429734400 (770025 MB)
2020-12-10T00:41:17.099824+00:00 <0.7> /boot/kernel.amd64/kernel: Physical memory chunk(s):
2020-12-10T00:41:17.099829+00:00 <0.7> /boot/kernel.amd64/kernel: 0x0000000000010000 - 0x000000000009efff, 585728 bytes (143 pages)
2020-12-10T00:41:17.099839+00:00 <0.7> /boot/kernel.amd64/kernel: 0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
2020-12-10T00:41:17.099848+00:00 <0.7> /boot/kernel.amd64/kernel: 0x0000000005b00000 - 0x000000005042cfff, 1251135488 bytes (305453 pages)
2020-12-10T00:41:17.099854+00:00 <0.7> /boot/kernel.amd64/kernel: 0x00000000505a7000 - 0x00000000682fefff, 399867904 bytes (97624 pages)
2020-12-10T00:41:17.099860+00:00 <0.7> /boot/kernel.amd64/kernel: 0x000000006ffff000 - 0x000000006fffffff, 4096 bytes (1 pages)
2020-12-10T00:41:17.099865+00:00 <0.7> /boot/kernel.amd64/kernel: 0x000000010000c000 - 0x0000005d8945cfff, 397439995904 bytes (97031249 pages)
2020-12-10T00:41:17.099872+00:00 <0.7> /boot/kernel.amd64/kernel: 0x0000006060400000 - 0x00000060605ebfff, 2015232 bytes (492 pages)
2020-12-10T00:41:17.099877+00:00 <0.7> /boot/kernel.amd64/kernel: 0x000000607000a000 - 0x0000008ddbffffff, 195085426688 bytes (47628278 pages)
2020-12-10T00:41:17.099883+00:00 <0.7> /boot/kernel.amd64/kernel: 0x0000009070000000 - 0x000000b06fffffff, 137438953472 bytes (33554432 pages)
2020-12-10T00:41:17.099888+00:00 <0.7> /boot/kernel.amd64/kernel: 0x000000b070000000 - 0x000000b86ffbdfff, 34359468032 bytes (8388542 pages)
2020-12-10T00:41:17.099894+00:00 <0.7> /boot/kernel.amd64/kernel: fix_tunable(): returning 12582912 for tunable "nbuf", mem = 790273982464
2020-12-10T00:41:17.099899+00:00 <0.7> /boot/kernel.amd64/kernel: fix_tunable(): returning 16384 for tunable "nswbuf", mem = 790273982464
2020-12-10T00:41:17.099904+00:00 <0.7> /boot/kernel.amd64/kernel: fix_tunable(): returning 12582912 for tunable "nbuf", mem = 790273982464
2020-12-10T00:41:17.099910+00:00 <0.7> /boot/kernel.amd64/kernel: avail memory = 576431333376 (549727 MB)
2020-12-10T00:41:17.099915+00:00 <0.7> /boot/kernel.amd64/kernel: NEW WAY avail memory = 765978501120 (730494 MB)

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

This revision is now accepted and ready to land. Dec 10 2020, 5:23 PM

which it isn't as that is at (SI_SUB_VM - 1) which is after CPU

I don't quite follow. SI_SUB_VM is before SI_SUB_CPU.

Hmm... I got a wire crossed somewhere it seems on that. Just ignore this for the moment and I'll get back to you, will need to do some checking.

Mark -- first, apologies for going off on the wrong track on this originally. I got the idea of the order being wrong from other triage comments and ran with that. Mea culpa.

That said, digging into this more/again thanks to you setting me straight there -- I think we have a bigger, nastier bug here. vm_page_startup() builds the segments from phys_avail, sets up pages, etc. But when it calls vm_phys_init(), that path will coalesce adjacent, contiguous segments in the same domain. That makes sense -- but it throws off the continue check back in vm_page_startup() that guards the actual vm_phys_enqueue_contig() call.

I'll attach a debug boot output that I hope will illuminate this. Fair warning: I've done several variations and this may have bugs in what's reported as START vs. END, so please excuse any such errors in advance. The important thing to note is the domain 1 ranges which do not get a "PHYS_ADD_SEG to DOMAIN" line. Those are the ones where we continued because the coalescing threw off the "entirely contained in one of the ranges or doesn't overlap any of them" logic: the coalesced segment spans the last two ranges. There's a similar issue earlier in the table, but the page count is smaller so it doesn't stand out.
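
To make the failure mode concrete, here is a small userland sketch of a "covered by a single range" test of the kind described above. It is an illustration under assumptions, not the kernel code: struct seg, paddr_t and seg_covered() are hypothetical names, and the avail array is assumed to hold start/end pairs with an exclusive end, terminated by a zero end address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t paddr_t;	/* hypothetical stand-in for vm_paddr_t */

struct seg {			/* hypothetical, simplified segment */
	paddr_t start;
	paddr_t end;		/* exclusive */
};

/*
 * Return true only if the segment lies entirely inside one avail range.
 * A segment that straddles two adjacent ranges is reported as uncovered,
 * even though every page in it is available.
 */
static bool
seg_covered(const struct seg *s, const paddr_t *avail)
{
	for (size_t i = 0; avail[i + 1] != 0; i += 2) {
		if (s->start >= avail[i] && s->end <= avail[i + 1])
			return (true);
	}
	return (false);
}
```

With two adjacent ranges such as the last two chunks in the boot log, [0x9070000000, 0xb070000000) and [0xb070000000, 0xb86ffbe000), a coalesced segment covering both fails this test, so its pages would never be handed to the allocator.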

I believe that this is more than a boot time print issue now (which is logical if we skip adding pages to the allocator). Looking at the vm_stats output on a different run and comparing the boot messaging:
2020-12-10T00:41:17.099910+00:00 <0.7> /boot/kernel.amd64/kernel: avail memory = 576431333376 (549727 MB)
2020-12-10T00:41:17.099915+00:00 <0.7> /boot/kernel.amd64/kernel: NEW WAY avail memory = 765978501120 (730494 MB)

vm.stats.vm.v_page_count: 145063257

Which, unless I really need more coffee and screwed up the math, is 566,653 MB -- way, way below the 730,494 MB we'd expect.
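
For anyone checking the arithmetic, the conversion can be sketched as below. The 4096-byte page size is an assumption (the amd64 base page size), and pages_to_mb() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096ULL	/* assumed amd64 base page size */

/* Convert a page count to whole megabytes (1 MB = 1048576 bytes). */
static uint64_t
pages_to_mb(uint64_t pages)
{
	return (pages * SKETCH_PAGE_SIZE / (1024ULL * 1024ULL));
}
```

Under that assumption, 145063257 pages is 594179100672 bytes, which truncates to 566653 MB and matches the figure quoted above.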

So -- just to run this by you, because obviously I want to get something you'll like for upstream -- assuming you agree, is there a reason we don't coalesce the phys_avail array as well? I'm thinking that would be the simplest solution as it would restore the assumption in the segment walk loop.
The other obvious possibility is to try to adjust the end of the avail range being considered if the next avail range is adjacent. I've prototyped that, and got something that mostly works (an earlier small segment in Domain 0 still wasn't added right, and I'd need to chase that down), but figured it makes much more sense to ping you at this point and see what you'd be thinking.
Thanks again.

So -- just to run this by you, because obviously I want to get something you'll like for upstream -- assuming you agree, is there a reason we don't coalesce the phys_avail array as well? I'm thinking that would be the simplest solution as it would restore the assumption in the segment walk loop.

Just to make sure I understand how this is supposed to work:

  1. getmemsize() adds a placeholder vm_phys_seg[] entry for the kernel and preloaded data (this is used later in case preloaded kernel modules are unloaded).
  2. Then we parse the SMAP or EFI system map to populate physmap. This results in a sorted, coalesced list of physical memory segments reported by the BIOS.
  3. Then we call pmap_bootstrap(), which allocates kernel page table pages and adds additional vm_phys_seg[] placeholders, also near the beginning of physical memory.
  4. Then we iterate a page at a time over physmap[] and add pages to phys_avail[]. Because physmap[] is sorted, phys_avail[] is also sorted and maximally coalesced. But, the regions added in 1) are excluded.
  5. vm_page_startup() carves chunks out of phys_avail[] for the page and reservation arrays. This might split entries.
  6. Then we populate vm_phys_segs[] with the remaining segments from phys_avail[] and coalesce entries. Some coalescing might happen because of placeholder vm_phys_seg[] entries that were added. See r338431.
  7. Finally we populate the freelists in the physical memory allocator. We do so only for vm_phys_seg[] segments for which there is a covering phys_avail[] segment. Because of the coalescing of vm_phys segments with placeholder segments, this can result in us not populating freelists for the vm_phys segments containing the kernel and bootstrapped kernel page table pages.

So I think this is a real bug, but that doesn't appear to be the main issue since it should really only affect the beginning of physical memory. Rather, we are somehow getting adjacent phys_avail[] entries that can be coalesced, but the loop in getmemsize() shouldn't permit that, from my reading. Is something splitting phys_avail[] segments by calling vm_phys_avail_split() without actually carving anything out? Does OneFS have any custom vm_phys_early_alloc() calls?

Could you please provide a dump of:

  • the physmap (machdep.smap or machdep.efi_map output),
  • the vm_phys_segs[], vm_phys_early_segs[] and phys_avail[] arrays at the beginning of vm_page_startup()

?

No custom calls that I know of (or can find on a quick double check of the code). May end up being one I missed, of course.
The domain 1 ranges I see with the problem are avail and contiguous, just split into separate ranges -- I'll get you the output you requested.

Could you please also show physmap at the beginning of vm_page_startup()? You'll need to make it a global variable, right now it's allocated on the stack.

Hopefully everything you need/asked for in this attachment.

Since I assume the latency split may be related, I booted "-v" and pulled the SRAT lines relevant to memory (you don't need all the CPU and table stuff, I assume). Left out all the 0 addr, 0 len disabled ones as well:

SRAT: Found memory domain 0 addr 0x0 len 0x80000000: enabled
SRAT: Found memory domain 0 addr 0x100000000 len 0x5f70000000: enabled
SRAT: Found memory domain 1 addr 0x6070000000 len 0x3000000000: enabled
SRAT: Found memory domain 1 addr 0x9070000000 len 0x2000000000: enabled
SRAT: Found memory domain 1 addr 0xb070000000 len 0x800000000: enabled
SRAT: Found memory domain 1 addr 0xb870000000 len 0x400000000: enabled
SRAT: Ignoring memory at addr 0xb870000000

Thanks. Could you give D27613 a try? It seems the problem is that the memory affinity table derived from the SRAT may have uncoalesced adjacent entries, so when vm_phys_early_init() is called it will slice up phys_avail[] entries which should have remained contiguous.
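
The contents of that revision are not shown here, so the following is only a sketch of what coalescing adjacent affinity entries could look like. It assumes the entries are sorted by start address and that each carries a domain id; struct affinity_sketch and coalesce_affinity() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t paddr_t;	/* hypothetical stand-in for vm_paddr_t */

struct affinity_sketch {	/* hypothetical, simplified affinity entry */
	paddr_t start;
	paddr_t end;		/* exclusive */
	int domain;
};

/*
 * Merge entries that are physically adjacent and in the same domain.
 * Assumption: the input is sorted by start address.
 * Returns the number of entries left after merging, in place.
 */
static size_t
coalesce_affinity(struct affinity_sketch *m, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return (0);
	for (size_t i = 1; i < n; i++) {
		if (m[i].domain == m[out].domain && m[i].start == m[out].end)
			m[out].end = m[i].end;	/* extend the previous entry */
		else
			m[++out] = m[i];	/* keep as a separate entry */
	}
	return (out + 1);
}
```

Applied to the five enabled SRAT ranges listed above (leaving out the ignored one), the three domain 1 entries would collapse into a single range from 0x6070000000 to 0xb870000000, while the two domain 0 entries stay separate because of the hole below 4 GB.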

Yeah, that makes sense given what we're seeing. I'll try it out and let you know.

With D24757:

SRAT: Ignoring memory at addr 0xb870000000
PHYS_AVAIL INITIAL: 0x0000000000010000 - 0x000000000009efff, 585728 bytes (143 pages)
PHYS_AVAIL INITIAL: 0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
PHYS_AVAIL INITIAL: 0x0000000004f00000 - 0x000000005042cfff, 1263718400 bytes (308525 pages)
PHYS_AVAIL INITIAL: 0x00000000505a7000 - 0x00000000682fefff, 399867904 bytes (97624 pages)
PHYS_AVAIL INITIAL: 0x000000006ffff000 - 0x000000006fffffff, 4096 bytes (1 pages)
PHYS_AVAIL INITIAL: 0x0000000100000000 - 0x000000b86ffbeffe, 787857797119 bytes (192348094 pages)

EARLY_SEGS: 0x0000000000200000 to 0x0000000004c39000, 19001 pages Dom 0.
EARLY_SEGS: 0x0000000004c42000 to 0x0000000004c8a000, 72 pages Dom 0.

PHYSMAP[0]: 0x0000000000010000 to 0x000000000009f000.
PHYSMAP[2]: 0x0000000000100000 to 0x000000005042d000.
PHYSMAP[4]: 0x00000000505a7000 to 0x00000000682ff000.
PHYSMAP[6]: 0x000000006ffff000 to 0x0000000070000000.
PHYSMAP[8]: 0x0000000100000000 to 0x000000b870000000.

PHYS_AVAIL POST EARLY STARTUP: 0x0000000000010000 - 0x000000000009efff, 585728 bytes (143 pages)
PHYS_AVAIL POST EARLY STARTUP: 0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
PHYS_AVAIL POST EARLY STARTUP: 0x0000000004f00000 - 0x000000005042cfff, 1263718400 bytes (308525 pages)
PHYS_AVAIL POST EARLY STARTUP: 0x00000000505a7000 - 0x00000000682fefff, 399867904 bytes (97624 pages)
PHYS_AVAIL POST EARLY STARTUP: 0x000000006ffff000 - 0x000000006fffffff, 4096 bytes (1 pages)
PHYS_AVAIL POST EARLY STARTUP: 0x0000000100000000 - 0x000000606fffffff, 409900941312 bytes (100073472 pages)
PHYS_AVAIL POST EARLY STARTUP: 0x0000006070000000 - 0x000000b86ffbdfff, 377956851712 bytes (92274622 pages)

SEGS: 0x0000000000200000 to 0x0000000001000000, 3584 pages Dom 0.
SEGS: 0x0000000001000000 to 0x0000000004c39000, 15417 pages Dom 0.
SEGS: 0x0000000004c42000 to 0x0000000004c8a000, 72 pages Dom 0.

PHYSMAP[0]: 0x0000000000010000 to 0x000000000009f000.
PHYSMAP[2]: 0x0000000000100000 to 0x000000005042d000.
PHYSMAP[4]: 0x00000000505a7000 to 0x00000000682ff000.
PHYSMAP[6]: 0x000000006ffff000 to 0x0000000070000000.
PHYSMAP[8]: 0x0000000100000000 to 0x000000b870000000.

PHYS_AVAIL PRE PHYS_INIT: 0x0000000000010000 - 0x000000000009efff, 585728 bytes (143 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x0000000004f00000 - 0x000000005042cfff, 1263718400 bytes (308525 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x00000000505a7000 - 0x00000000682fefff, 399867904 bytes (97624 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x000000006ffff000 - 0x000000006fffffff, 4096 bytes (1 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x000000010000b000 - 0x0000005dc7c5cfff, 398488576000 bytes (97287250 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x000000606e800000 - 0x000000606e8f1fff, 991232 bytes (242 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x000000607000a000 - 0x000000b607dfffff, 367620218880 bytes (89751030 pages)
PHYS_AVAIL PRE PHYS_INIT: 0x000000b86fe00000 - 0x000000b86ffbdfff, 1826816 bytes (446 pages)

SEGS: 0x0000000000010000 to 0x000000000009f000, 143 pages Dom 0.
SEGS: 0x0000000000100000 to 0x0000000000200000, 256 pages Dom 0.
SEGS: 0x0000000000200000 to 0x0000000001000000, 3584 pages Dom 0.
SEGS: 0x0000000001000000 to 0x0000000004c39000, 15417 pages Dom 0.
SEGS: 0x0000000004c42000 to 0x0000000004c8a000, 72 pages Dom 0.
SEGS: 0x0000000004f00000 to 0x000000005042d000, 308525 pages Dom 0.
SEGS: 0x00000000505a7000 to 0x00000000682ff000, 97624 pages Dom 0.
SEGS: 0x000000006ffff000 to 0x0000000070000000, 1 pages Dom 0.
SEGS: 0x000000010000b000 to 0x0000005dc7c5d000, 97287250 pages Dom 0.
SEGS: 0x000000606e800000 to 0x000000606e8f2000, 242 pages Dom 0.
SEGS: 0x000000607000a000 to 0x000000b607e00000, 89751030 pages Dom 1.
SEGS: 0x000000b86fe00000 to 0x000000b86ffbe000, 446 pages Dom 1.

PHYSMAP[0]: 0x0000000000010000 to 0x000000000009f000.
PHYSMAP[2]: 0x0000000000100000 to 0x000000005042d000.
PHYSMAP[4]: 0x00000000505a7000 to 0x00000000682ff000.
PHYSMAP[6]: 0x000000006ffff000 to 0x0000000070000000.
PHYSMAP[8]: 0x0000000100000000 to 0x000000b870000000.

PHYS_AVAIL AFTER PHYS_INIT: 0x0000000000010000 - 0x000000000009efff, 585728 bytes (143 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x0000000004f00000 - 0x000000005042cfff, 1263718400 bytes (308525 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x00000000505a7000 - 0x00000000682fefff, 399867904 bytes (97624 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x000000006ffff000 - 0x000000006fffffff, 4096 bytes (1 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x000000010000b000 - 0x0000005dc7c5cfff, 398488576000 bytes (97287250 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x000000606e800000 - 0x000000606e8f1fff, 991232 bytes (242 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x000000607000a000 - 0x000000b607dfffff, 367620218880 bytes (89751030 pages)
PHYS_AVAIL AFTER PHYS_INIT: 0x000000b86fe00000 - 0x000000b86ffbdfff, 1826816 bytes (446 pages)

SEGS: 0x0000000000010000 to 0x000000000009f000, 143 pages Dom 0.
SEGS: 0x0000000000100000 to 0x0000000001000000, 3840 pages Dom 0.
SEGS: 0x0000000001000000 to 0x0000000004c39000, 15417 pages Dom 0.
SEGS: 0x0000000004c42000 to 0x0000000004c8a000, 72 pages Dom 0.
SEGS: 0x0000000004f00000 to 0x000000005042d000, 308525 pages Dom 0.
SEGS: 0x00000000505a7000 to 0x00000000682ff000, 97624 pages Dom 0.
SEGS: 0x000000006ffff000 to 0x0000000070000000, 1 pages Dom 0.
SEGS: 0x000000010000b000 to 0x0000005dc7c5d000, 97287250 pages Dom 0.
SEGS: 0x000000606e800000 to 0x000000606e8f2000, 242 pages Dom 0.
SEGS: 0x000000607000a000 to 0x000000b607e00000, 89751030 pages Dom 1.
SEGS: 0x000000b86fe00000 to 0x000000b86ffbe000, 446 pages Dom 1.

PHYSMAP[0]: 0x0000000000010000 to 0x000000000009f000.
PHYSMAP[2]: 0x0000000000100000 to 0x000000005042d000.
PHYSMAP[4]: 0x00000000505a7000 to 0x00000000682ff000.
PHYSMAP[6]: 0x000000006ffff000 to 0x0000000070000000.
PHYSMAP[8]: 0x0000000100000000 to 0x000000b870000000.

VT(efifb): resolution 1024x768
CPU: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz (2394.45-MHz K8-class CPU)

Origin="GenuineIntel"  Id=0x50657  Family=0x6  Model=0x55  Stepping=7
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
AMD Features2=0x121<LAHF,ABM,Prefetch>
Structured Extended Features=0xd39ffffb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,NFPUSG,MPX,PQE,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,AVX512CD,AVX512BW,AVX512VL>
Structured Extended Features2=0x808<PKU>
Structured Extended Features3=0xbc000400<IBPB,STIBP,L1DFL,ARCH_CAP,SSBD>
XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
IA32_ARCH_CAPS=0xab<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME>
VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
TSC: P-state invariant, performance statistics

real memory = 807429734400 (770025 MB)
avail memory = 753585815552 (718675 MB)

Looks right to me, but figured I'd post it here so you could look it over as well just in case. Thanks again.

Thanks for confirming.

real memory = 807429734400 (770025 MB)
avail memory = 753585815552 (718675 MB)

That's still a pretty big difference. How big is struct vm_page?

I'd have to get the machine back to go drill down, but there are some caching allocations in there -- I don't think that's out of the norm.