md(4): add support for unmapped preloaded images
Needs ReviewPublic

Authored by kib on Jul 2 2025, 1:15 PM.

Details

Reviewers
imp
markj
Summary

Making copies directly on physical addresses eliminates the need to remap the images into KVA, and should be equally fast on DMAP arches.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib edited the summary of this revision. (Show Details)

Rebase.

This looks like an interesting optimization... I'll see about finally getting D45404 in soon.

https://reviews.freebsd.org/D51466 seems like it might be working, though my test jig is kinda wonky..

kib added reviewers: imp, markj.

Integrate with 15925062e1ba75cb4908a68655b

I still cannot test it, @imp would you please?

Hmmm... that didn't work well...

With the patch

Trying to mount root from ufs:/dev/ufs/root [rw]...
WARNING: WITNESS option enabled, expect reduced performance.
ata1: stat0=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: stat1=0x00 err=0x00 lsb=0xff msb=0xff
ata1: reset tp2 stat0=00 stat1=00 devices=0x10000
pass0 at ata1 bus 0 scbus1 target 0 lun 0
pass0: <QEMU QEMU DVD-ROM 2.5+> Removable CD-ROM SCSI device
pass0: Serial Number QM00003
pass0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes)
cd0 at ata1 bus 0 scbus1 target 0 lun 0
cd0: <QEMU QEMU DVD-ROM 2.5+> Removable CD-ROM SCSI device
cd0: Serial Number QM00003
cd0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes)
cd0: 0MB (449 2048 byte sectors)
GEOM: new disk cd0
mountroot: waiting for device /dev/ufs/root...
Mounting from ufs:/dev/ufs/root failed with error 19.
Trying to mount root from ufs:/dev/md0 []...
Attempted recovery for standard superblock: failed
Attempted extraction of recovery data from standard superblock: failed
Attempt to find boot zone recovery data.
Finding an alternate superblock failed.
Check for only non-critical errors in standard superblock
Failed, superblock has critical errors
Mounting from ufs:/dev/md0 failed with error 2; retrying for 3 more seconds
...

Without the patch

Trying to mount root from ufs:/dev/ufs/root [rw]...
WARNING: WITNESS option enabled, expect reduced performance.
ata1: stat0=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: stat1=0x00 err=0x00 lsb=0xff msb=0xff
ata1: reset tp2 stat0=00 stat1=00 devices=0x10000
atrtc0: providing initial system time
Dual Console: Serial Primary, Video Secondary
start_init: trying /sbin/init
pass0 at ata1 bus 0 scbus1 target 0 lun 0
pass0: <QEMU QEMU DVD-ROM 2.5+> Removable CD-ROM SCSI device
pass0: Serial Number QM00003
pass0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes)
cd0 at ata1 bus 0 scbus1 target 0 lun 0
cd0: <QEMU QEMU DVD-ROM 2.5+> Removable CD-ROM SCSI device
cd0: Serial Number QM00003
cd0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes)
cd0: 0MB (449 2048 byte sectors)
GEOM: new disk cd0
2025-07-23T05:01:25.965673+00:00 - init 18 - - login_getclass: unknown class 'daemon'
machdep.bootmethod: BIOS
RC COMMAND RUNNING -- SUCCESS!!!!!

My /etc/rc is just 'sysctl machdep.bootmethod\necho RC COMMAND RUNNING -- SUCCESS' in this test image.

The md disk is seen in both:

md0: Preloaded image <preload0 0x00000000bc4e9000> 61176320 bytes at 0

Hmm, that's the same for both images (with and without the patch).
So what to try next?

So the setup that bapt had boots off /dev/md0... But the boot loader loads it off one of the disks because MEMDISK appears in the boot loader as a disk. I'm not sure why the 'cd' appears there, though.

I'm booting a FreeBSD image that I build to test the boot loader (it's built with tools/boot/full-test.sh), but that tool is kinda hard for people who aren't me to set up, since it assumes too many things about my system and is heavily geared towards my linuxboot testing.

I might be able to simplify things somewhat to give you the ability to test this... I'm just using qemu + user-mode tftp + the memdisk stuff and ipxe scripts that bapt pointed me at.

I will need something manageable to debug it locally, yes. qemu should be fine, but other dependencies are problematic.

Are you sure that the vm_pages backing a preloaded image are properly initialized?

I am not sure but I tend to think that they are.

First, I remember that I added initialization of the vm pages for the kernel itself, since otherwise DMAR busdma was not able to dump. Now I am unable to find this code. But I see the code with the comment

Initialize pages not covered by phys_avail[]

in vm_page_startup(), which should cover all preloaded pages, including the kernel and the preloaded md images. For the kernel I am sure; for the md images, less so.
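
To make the phys_avail[] point concrete, here is a minimal user-space sketch, not kernel code. It assumes the usual layout of (start, end) physical-address pairs ended by a zero entry. The table values and the names demo_phys_avail and range_in_phys_avail are made up for illustration; only the address and size in the usage note come from the md0 line in the boot log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's phys_avail[]: (start, end) pairs
 * of physical addresses, terminated by an entry whose end is 0.
 * These ranges are invented for the example.
 */
static const uint64_t demo_phys_avail[] = {
	0x0000000000010000ULL, 0x000000000009f000ULL,
	0x0000000000100000ULL, 0x00000000bc000000ULL,
	0x0000000100000000ULL, 0x0000000140000000ULL,
	0, 0
};

/* Return true if [pa, pa + len) lies entirely inside one (start, end) pair. */
static bool
range_in_phys_avail(const uint64_t *avail, uint64_t pa, uint64_t len)
{
	assert(avail != NULL);
	for (size_t i = 0; avail[i + 1] != 0; i += 2) {
		if (pa >= avail[i] && pa + len <= avail[i + 1])
			return (true);
	}
	return (false);
}
```

With these made-up ranges, range_in_phys_avail(demo_phys_avail, 0xbc4e9000, 61176320) is false. That is the case in question: pages outside phys_avail[] get a usable vm_page only if the startup code initializes them separately.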

Anyway, I need some way to boot with md preloaded image from loader to see what is going on. Would be nice to be able to preload md not only from PXE, but also from the local /boot or any other fs accessible to loader.

You already can do that... it's how people do md root for detached files...

Yes, but that is a different type of 'preload'. Those trigger the call to md_preloaded(..., true) at line 2155. I need to test md_preloaded(..., false) at line 2172 (patched sources).

Then unfortunately, you need to run the qemu + ipxe stuff that was referred to earlier. pkg install ipxe (though you won't have to set it up for the network). Then you can do:

qemu-system-x86_64 -boot n -m 4g -cdrom /usr/local/share/ipxe/ipxe.iso -device virtio-net,netdev=n1 -netdev user,id=n1,tftp=$(pwd),bootfile=/freebsd.ipxe \
        -monitor telnet::4444,server,nowait \
        -nographic -serial stdio $*

where the current directory (the "pwd" above) contains freebsd.ipxe:

#!ipxe
initrd tftp://10.0.2.2/freebsd-bios.img
chain tftp://10.0.2.2/memdisk harddisk raw

memdisk is from ~imp/memdisk on freefall
and freebsd-bios.img is a bootable FreeBSD disk image into which you've injected your test kernel. A sane small image is one that has /boot and /rescue with the kernel you want to test, made with:

mkdir /tmp/tree
cp -r /boot /rescue /tmp/tree
# copy the kernel to /tmp/tree/boot/kernel
makefs -t ffs -B little -o label=root,version=2,bsize=32768,fsize=4096,density=16384 /tmp/foo.ufs /tmp/tree
mkimg -s gpt -b /boot/pmbr -p freebsd-boot:/boot/gptboot -p freebsd-ufs:/tmp/foo.ufs  -o /tmp/freebsd-bios.img

Qemu handles all the networking as 'user' so that you don't have to run ipxe servers or any other services. The qemu command puts everything on stdout, no graphics needed.

I'm about to head out for vacation, but if this isn't minimal enough, I'll see what I can do about hacking something together. It's tricky, though: this memdisk can't be in the memory map, so I would have to allocate memory and then remove it from the BIOS memory map before passing that map to the kernel.

EDIT: Fixed pkg name to ipxe. It has no dependencies. And I've never looked at the memory map hacking that memdisk does. memdisk is part of the syslinux package on Linux.

Properly advance psrc/pdst.
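
For readers who have not seen the patch, the following is a minimal user-space sketch of the kind of loop this fix is about. It is not the code from the diff. It assumes a copy between two "physical" addresses that is split so no chunk crosses a page boundary on either side, with both cursors advanced after every chunk. The names demo_phys_mem, demo_phys_to_virt and demo_phys_copy are hypothetical; demo_phys_to_virt stands in for a direct-map lookup such as PHYS_TO_DMAP().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE	4096u
#define DEMO_PAGE_MASK	(DEMO_PAGE_SIZE - 1)

/* Toy "physical memory": a physical address is an offset into this array. */
static uint8_t demo_phys_mem[16 * DEMO_PAGE_SIZE];

/* Stand-in for a direct-map translation (assumption, see lead-in). */
static uint8_t *
demo_phys_to_virt(size_t pa)
{
	assert(pa < sizeof(demo_phys_mem));
	return (&demo_phys_mem[pa]);
}

/*
 * Copy len bytes from psrc to pdst, one page-bounded chunk at a time.
 * The ranges must not overlap.
 */
static void
demo_phys_copy(size_t pdst, size_t psrc, size_t len)
{
	while (len > 0) {
		size_t src_left = DEMO_PAGE_SIZE - (psrc & DEMO_PAGE_MASK);
		size_t dst_left = DEMO_PAGE_SIZE - (pdst & DEMO_PAGE_MASK);
		size_t chunk = len;

		if (chunk > src_left)
			chunk = src_left;
		if (chunk > dst_left)
			chunk = dst_left;
		memcpy(demo_phys_to_virt(pdst), demo_phys_to_virt(psrc), chunk);

		/* Forgetting either advance re-copies the same chunk. */
		psrc += chunk;
		pdst += chunk;
		len -= chunk;
	}
}
```

If the two advance statements are dropped, the loop still terminates because len shrinks, but every iteration reads and writes the first chunk again. Single-page transfers look fine and larger ones return wrong data.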

So is this ready for me to test again? Or is more work needed?

I still have not set up the environment, so I have not tested this. I found the error by reviewing my own change.

If a re-test takes 5-10 minutes, I would greatly appreciate it. If it is an hour or more of effort, then do not bother.

Making copies directly on physical addresses eliminates the need to remap the images into KVA, and should be equally fast on DMAP arches.

Should we modify MD_PRELOAD images to support unmapped I/O as well?

In D51128#1177444, @kib wrote:

Are you sure that the vm_pages backing a preloaded image are properly initialized?

I am not sure but I tend to think that they are.

First, I remember that I added initialization of the vm pages for the kernel itself, since otherwise DMAR busdma was not able to dump. Now I am unable to find this code. But I see the code with the comment

Initialize pages not covered by phys_avail[]

in vm_page_startup(), which should cover all preloaded pages, including kernel and the preloaded md images. For kernel, I am sure, for the md images, not completely.

I think I added that to support unmapping CPU microcode images. The idea was, you could concatenate microcode images for many different CPUs and load a single file at boot time, then the kernel saves a copy of the image it needs and uses kmem_bootstrap_free() to free the pages backing the file[*]. So, if the mechanism used for preloaded MD files is the same, then it should work, but I am not too familiar with how those images are implemented.

[*] I'm pretty sure this will not work as intended on PE systems...

In D51128#1179119, @kib wrote:
In D51128#1178877, @imp wrote:
In D51128#1178793, @kib wrote:

Properly advance psrc/pdst.

So is this ready for me to test again? Or is more work needed?

I still have not set up the environment, so I have not tested this. I found the error by reviewing my own change.

If I read correctly, then the bug affected only the BIO_VLIST path, but I do not see how such BIOs are created.

Should we modify MD_PRELOAD images to support unmapped I/O as well?

We do not support unmapped I/O on any preloaded images. This patch only removes the KVA mapping for the image; it does not add support for unmapped BIOs.
That would be the next part, but I need to fix this patch first.

[*] I'm pretty sure this will not work as intended on PE systems...

What is PE?

I meant Intel PE cores. I thought P and E cores might require different microcode images, but it seems this is not the case.

I do not know what is wrong there. For an easy experiment, I added the following code to md.c:

diff --git a/sys/dev/md/md.c b/sys/dev/md/md.c
index 19983004282f..7b69dafb0a60 100644
--- a/sys/dev/md/md.c
+++ b/sys/dev/md/md.c
@@ -2175,6 +2175,19 @@ g_md_init(struct g_class *mp __unused)
 			sx_xunlock(&md_sx);
 		}
 	}
+	{
+		vm_size_t sz = 32ULL * 1024 * 1024;
+		vm_page_t m = vm_page_alloc_noobj_contig(VM_ALLOC_NORMAL,
+		    sz / PAGE_SIZE, 0, ~0ULL, 1, 0, VM_MEMATTR_DEFAULT);
+		if (m != NULL) {
+			sprintf(scratch, "mdx%#016jx", (uintmax_t)VM_PAGE_TO_PHYS(m));
+			sx_xlock(&md_sx);
+			md_preloaded(NULL, VM_PAGE_TO_PHYS(m), sz, scratch, false);
+			sx_xunlock(&md_sx);
+		} else {
+			printf("Cannot alloc phys memory for mdx\n");
+		}
+	}
 
 	status_dev = make_dev(&mdctl_cdevsw, INT_MAX, UID_ROOT, GID_WHEEL,
 	    0600, MDCTL_NAME);

This creates a fake preloaded 32M image. I then ran newfs and fsck on it, and fsck did not find any errors. Next I mounted the volume, copied some files onto it, unmounted it (to flush caches), and mounted it again. The md5 sums of the copied files were correct.
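
A rough user-space sketch of the same write-then-read-back idea follows, for anyone who wants a check that does not depend on a filesystem. It is not what was run above: it compares a generated byte pattern rather than md5 sums, and the names demo_pattern and demo_roundtrip_check are hypothetical. Closing and reopening a file does not drop kernel caches the way the unmount did, so this only shows the shape of the check.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Deterministic test pattern so the read-back can be checked byte by byte. */
static uint8_t
demo_pattern(size_t i)
{
	return ((uint8_t)(i * 31u + 7u));
}

/*
 * Write len pattern bytes to path, close the file, reopen it and compare.
 * Returns 0 if every byte matches, -1 on I/O error or mismatch.
 */
static int
demo_roundtrip_check(const char *path, size_t len)
{
	FILE *fp;
	uint8_t *buf;
	int rc = -1;

	assert(path != NULL);
	buf = malloc(len);
	if (buf == NULL)
		return (-1);
	for (size_t i = 0; i < len; i++)
		buf[i] = demo_pattern(i);

	fp = fopen(path, "wb");
	if (fp == NULL)
		goto out;
	if (fwrite(buf, 1, len, fp) != len) {
		fclose(fp);
		goto out;
	}
	if (fclose(fp) != 0)
		goto out;

	/* Clear the buffer so stale data cannot pass the comparison. */
	memset(buf, 0, len);
	fp = fopen(path, "rb");
	if (fp == NULL)
		goto out;
	if (fread(buf, 1, len, fp) != len) {
		fclose(fp);
		goto out;
	}
	fclose(fp);

	rc = 0;
	for (size_t i = 0; i < len; i++) {
		if (buf[i] != demo_pattern(i)) {
			rc = -1;
			break;
		}
	}
out:
	free(buf);
	return (rc);
}
```

Pointing path at a scratch file exercises only the helper itself. Pointing it at a file on the mounted test volume would exercise the md device, with the caching caveat above.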