
loader bcache redesign experiment
ClosedPublic

Authored by tsoome on Dec 26 2015, 2:57 PM.

Details

Summary

The block cache implementation in the loader has proven to be almost useless, and in the worst case it even slows down disk reads due to insufficient cache size and an extra memory copy. The current cache implementation also does not work with ZFS built on top of multiple disks.

The goal of this project is to explore an alternative implementation of bcache: instead of an LRU, this code uses a simple hash (to get O(1) reads from the cache), and instead of a single global cache, it creates a block cache per device.
If the whole requested range is not in the cache, read_strategy() reads the missing blocks plus a number of read-ahead blocks. To simplify read-ahead management, the read-ahead does not wrap past the end of the bcache, so in the worst case a single-block physical read is performed to fill the last block in the bcache.
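
For illustration, a rough sketch of the per-device cache and the hash lookup; the structure fields and macros below are illustrative, not necessarily the exact names used in the diff:

    struct bcachectl {
        daddr_t bc_blkno;                  /* which block occupies this slot */
    };

    struct bcache {                        /* one instance per open device */
        size_t            bcache_nblks;    /* cache size in blocks, power of two */
        struct bcachectl *bcache_ctl;      /* per-slot tag array */
        caddr_t           bcache_data;     /* bcache_nblks * block size bytes */
    };

    /* a block number maps to exactly one slot, so lookup is O(1) */
    #define BHASH(bc, blkno)   ((blkno) & ((bc)->bcache_nblks - 1))

    /* hit if the slot this block hashes to still holds this block */
    #define BCACHE_LOOKUP(bc, blkno) \
        ((bc)->bcache_ctl[BHASH((bc), (blkno))].bc_blkno == (blkno))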

The implementation is activated for the biosdisk and bioscd devices, and in tests the speed boost is quite noticeable, especially for dosfs (the dosfs update is not included in this diff), though it depends on the amount of memory allocated for the cache. To simplify testing, this code sets the loader heap quite large.

Note1: the dv_strategy() call has an extra argument for an offset; this was added to support partial block reads from dosfs. I did not include the dosfs update in this diff to keep the changes logically separate.
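
Roughly, the extended entry point looks like the following sketch (the exact parameter order in the diff may differ):

    /* in struct devsw; offset is the byte offset inside the starting
     * block, used for partial reads */
    int (*dv_strategy)(void *devdata, int rw, daddr_t blk, size_t offset,
        size_t size, char *buf, size_t *rsize);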

Note2: as of the current version, writes do not update cache blocks, as I'm not sure that is needed at all. It is not that important for the scope of the experiment, and it is easy to fix.

Note3: the BOOT2 preprocessor guard is leftover from the illumos code, where I have somewhat reworked stage2. Also, since this diff is presented as an experiment and not as a patch meant to go in as-is, if the idea turns out to be worth considering for integration, I assume some cleanup/updates will be in order anyhow.

Test Plan

For test purposes, on illumos, I have been using the following simple script:
: fs-load
  0 bcachestat
  seconds
  s" /platform/i86pc/kernel/amd64/unix" 1 load
  s" /boot_archive" s" rootfs" s" -t " 3 load
  seconds
  swap - ." total: " . ." sec" cr
  0 unload
  0 bcachestat
;

The trick is to read in a sufficiently large file to get a meaningful time difference; the result obviously also depends on the media.

The consistency of the loaded data has been verified with a feature of the illumos kernel: it can verify the SHA1 checksum of a loaded module.

Diff Detail

Repository: rS FreeBSD src repository - subversion
Lint: Skipped
Unit: Tests Skipped

Event Timeline

tsoome retitled this revision from to loader bcache redesign experiment.
tsoome updated this object.
tsoome edited the test plan for this revision. (Show Details)
tsoome set the repository for this revision to rS FreeBSD src repository - subversion.
tsoome edited edge metadata.

Reduced my loader's /boot/kernel/kernel load time from an emulated USB CDROM via IPMI over LAN (0ms latency) from 27 seconds to 7 seconds.

Nice work! There are some minor style issues (which I personally feel we can address at a later time, but I have noted them in inline comments). I think we can integrate this as-is for now and do a follow-up commit to address them; feel free to fix them yourself if you would like to make the changes before someone commits the work, though :)

Should we add you to the copyright notice in bcache.c? (Since your contribution has changed more than 1/3 of the file, I believe you deserve credit in the copyright block for that.)

sys/boot/common/bcache.c
129

129-136: basically we are rounding the number up to the next power of two. Perhaps we can just use fls() (and import it if it's not already in libstand)? There is also an algorithm here if there is no built-in function for this: https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
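
For reference, both approaches are sketched below; this is illustration only, not the code in the diff, and it assumes nblks >= 1:

    /* with fls() (1-based index of the highest set bit): */
    nblks = 1 << fls(nblks - 1);        /* 5 -> 8, 8 -> 8, 1 -> 1 */

    /* or the branch-free bithacks round-up for a 32-bit value: */
    uint32_t v = nblks;
    v--;
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
    v++;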

149

This is the same code as in bcache_free except for the bcache_units decrement. Can the code be refactored a little so the two different codepaths can reuse the same code?
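
One possible shape for that refactoring (a sketch only; the helper name matches the one that shows up later in this review, and the field names are illustrative):

    static void
    bcache_free_instance(struct bcache *bc)
    {
        if (bc != NULL) {
            if (bc->bcache_ctl)
                free(bc->bcache_ctl);
            if (bc->bcache_data)
                free(bc->bcache_data);
            free(bc);
        }
    }

    /* the allocation error path and bcache_free() can both call this;
     * only bcache_free() additionally decrements bcache_units */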

206

I'd just delete this code block.

423

I think with this amount of contribution you should add yourself to the copyright header.

reworked the cache block calculation to use fls(); refactored freeing of a bcache instance;
also, to illustrate the need for the offset argument to dv_strategy, I have included the full dosfs update here as well.

The dosfs performance issue has essentially two parts. One is FAT itself: since cluster allocation means a lot of FAT lookups, an in-core FAT cache is needed. In this diff the whole FAT is cached; most likely not the best option, but the simplest. Note that if the in-core FAT cache setup fails, the code falls back to single-block reads; even though that hurts performance a lot, loading can still continue. The resulting code in fatget() can surely be cleaned up even more.
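
A hypothetical sketch of that fallback; the field and helper names here are made up for illustration and are not the identifiers in the diff:

    static int
    fatget(DOS_FS *fs, u_int cluster, u_int *next)
    {
        if (fs->fatbuf != NULL) {
            /* whole FAT is in core: plain table lookup, no disk I/O */
            *next = fat_decode(fs, cluster);
            return (0);
        }
        /* FAT cache setup failed earlier: read the single FAT sector
         * holding this entry (slow, but loading still works) */
        return (fatget_blk(fs, cluster, next));
    }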

The second part of the dosfs performance issue is the actual disk block management. Since partial block reads are needed, the old code used a single 512B block cache, which essentially added an extra memory copy, and that extra memory move apparently hurts performance as well. Instead, the offset of the partial read is now passed to strategy and the common bcache is used, avoiding the need for dedicated buffering. A side effect is simplified dosfs code.
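
As an illustration of the difference (not the literal diff; secbuf, off and nbyte are placeholders):

    /* old: dv_strategy() without an offset; read the whole 512B sector
     * into a private buffer and copy out just the requested piece */
    err = dv_strategy(devdata, F_READ, blkno, SECSIZ, secbuf, NULL);
    memcpy(buf, secbuf + off, nbyte);

    /* new: pass the in-block offset down so bcache serves the slice
     * directly from its cached copy of the sector */
    err = dv_strategy(devdata, F_READ, blkno, off, nbyte, buf, NULL);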

While the generic bcache alone gives a quite nice boost in loading speed, the dosfs update combined with bcache shows quite drastic differences: in my tests, loading a 270MB file from a USB stick on real hardware took ~15 minutes before versus ~20 seconds after.

the last diff broke the patch context.

implemented the bcache_free_instance()

reworked to use fls(). And yes, the hash function (x & (nblks - 1)) implies that the cache size is a power of 2, so splitting the total space with a right shift gives exactly what is needed. Also, as the code needs to support multi-disk ZFS pool setups, we actually do want to allocate reasonably sized chunks of cache per device. The difficulty, obviously, is to pick a method that gets positive results from the cache without killing the loader in the process by eating up all the memory.

I agree. deleted.

added the notice about the guilty one.

The struct devdesc move to stand.h was left out of the previous diff; it is needed to access the device unit number for the FAT cache.

sys/boot/common/bcache.c
25–26

This should go with the existing copyright statement

A bit of a heads-up: I have already found the dosfs update was a bit optimistic, and ioread() will be back as its logic is a bit more specific; the problem was revealed by qemu vfat (fixed).

Found out I was also missing the bcache integration with EFI (fixed).

I also discovered there must be a missed corner case seemingly related to bcache, but I haven't been able to identify the root cause yet; it was revealed by FreeBSD ZFS.

lib/libstand/stand.h
158

typo: dependent
also git complained about trailing whitespace here

162

trailing whitespace

  1. dosfs is now again fully operational.
  2. added bcache support for efi, pc98, and userboot.
  3. updated all instances of strategy for the offset argument. While bcache itself is platform independent and support is easy to add, adding it should be tested and I don't have the necessary hardware; at least the build should now succeed on all platforms.
  4. fixed the bcache block list update after dv_strategy, so this diff should be fully functional.
  5. added experimental automatic read-ahead tuning: if reads are getting buffer misses, the RA block count is decremented to avoid inflated reads, and it is restored on cache hits (see the sketch below).
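
A sketch of the tuning idea; the constants, the miss counter and the exact conditions are illustrative, not the code in the diff:

    if (missed_blocks > 0) {
        if (bc->ra > BCACHE_MINREADAHEAD)
            bc->ra >>= 1;           /* shrink read-ahead after a miss */
    } else {
        if (bc->ra < BCACHE_READAHEAD)
            bc->ra <<= 1;           /* grow it back while requests keep hitting */
    }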

fixed the issues pointed out by emaste.

sys/boot/common/bcache.c
159

With warnings enabled from D4839.diff:

/usr/home/emaste/src/freebsd/sys/boot/efi/loader/../../common/bcache.c:158:19: error: 
      comparison of integers of different signs: 'int' and 'u_int'
      (aka 'unsigned int') [-Werror,-Wsign-compare]
    for (i = 0; i < bc->bcache_nblks; i++) {
                ~ ^ ~~~~~~~~~~~~~~~~
266
/usr/home/emaste/src/freebsd/sys/boot/efi/loader/../../common/bcache.c:257:12: error: 
      comparison of integers of different signs: 'int' and 'u_int'
      (aka 'unsigned int') [-Werror,-Wsign-compare]
    if (ra != bc->bcache_nblks) { /* do we have RA space? */
        ~~ ^  ~~~~~~~~~~~~~~~~

type issues after svn update

Testing this in QEMU (applied on top of the recent EFI/ZFS changes) I see

>> FreeBSD EFI boot block
   Loader path: /boot/loader.efi

   Initializing modules: ZFS UFS
   Probing 8 block devices.......+... done
    ZFS found no pools
    UFS found 1 partition
Failed to start image provided by UFS (9)
panic: No bootable partitions found!

Hm, it seems to me this is a result of issues in boot1.

The boot1 device read calls (for both UFS and ZFS) do not use strategy() calls but implement direct reads from the device via the block protocol, meaning that bcache is not involved in boot1 at all (confirmed with nm). Now, error 9 in this report seems to be EFI_OUT_OF_RESOURCES, which by itself is a cause for alarm.

I did a quick check of boot1's use of the EFI LoadImage/StartImage interface and spotted at least one issue to be reviewed and checked: according to the UEFI 2.5 docs, if LoadImage() was used with a provided SourceBuffer, that buffer can be freed once LoadImage() has completed, but that does not happen in the current code, which is essentially an EFI memory leak.
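
To illustrate the point, a sketch using the Boot Services calls from the spec (error handling omitted; BS stands for the boot services table pointer, image_handle for boot1's own image handle, and the buffer is assumed to come from AllocatePool, with FreePages being the counterpart for page allocations):

    EFI_STATUS status;
    EFI_HANDLE loaderhandle;

    status = BS->LoadImage(TRUE, image_handle, devpath, buffer, bufsize,
        &loaderhandle);
    if (!EFI_ERROR(status))
        BS->FreePool(buffer);       /* safe after LoadImage(); boot1 skips this */
    status = BS->StartImage(loaderhandle, NULL, NULL);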

Also note that the StartImage() caller is not removed from memory (unless boot services exit is initiated), meaning that if loading is implemented in two stages as it is now, the boot1 code stays in memory, and boot1 should be really careful to free as much memory as possible; maybe even to the point of asking whether boot1 in its current form is actually a viable solution, though of course that also depends on how much memory is available in the system.

As a minor nit, boot1 should also consider using the StartImage() ExitData for error reporting.

In any case, this issue is important enough that it needs more investigation on both the boot1 and loader.efi sides, including which options are used for EFI memory allocations, as those options affect which memory areas get used, for example.

BTW: as mentioned at the very beginning of this project, I'm reserving a fairly large chunk of memory (64MB) for the loader and using the same amount for both the BIOS and UEFI variants; the default in FreeBSD is 3MB. Eventually this should be reviewed, but currently it will affect very small setups.

Patch does not apply cleanly after r294767.

resolving conflict after r294767.

UPDATE: with the latest svn update, my UEFI VMware Fusion VM is booting OK. However, the test scenario of loading a large file into loader memory does not work, as the EFI memory allocation for loading is not that generous...

Booting non-UEFI with an image compiled with this patch yesterday is working well. I hope to benchmark it today to show how much of an improvement the bcache provides.

My new server (10.2-RELEASE/amd64, non-uefi, 6x disk ZFS raidz2 array) was taking several minutes (no exaggeration) to make it through the loader. I came across this and hacked the diff up so that it would cleanly apply to a fresh 10.2-RELEASE install. I built the newly patched loader and installed it on my drives, and it now works as quickly as I'd expect it to (a couple seconds max).

Thanks for the feedback. Note that I haven't forgotten about this code; I'm just letting people have a chance to test it. Also, while I'm quite certain the idea itself holds up well, I don't much like how the binding with the actual disk interface is done, but the few ideas I have haven't yet had a chance to mature enough ;)

update against revision 296103.

update against revision 296323.

Updated to revision 296425.

Tested again now and still experience the Failed to start image provided by UFS (9) issue.

You are probably using qemu with a fairly small RAM setup. I have been investigating this issue more, and the root cause is a bit disturbing, to be honest.

With the StartImage() interface, the UEFI firmware has (at least) two images in memory, the caller and the callee, so anything allocated by the caller (memory pools, pages, handles) also stays around. On one hand, since boot1 is quite small, it is possible to free most of its resources with little work (freeing the ZFS memory takes a bit more); you will still have the boot1 image in memory, which essentially brings us back to the question of how reasonable it is to have boot1 in its current form at all. On BIOS systems those stages replace each other; in UEFI, that's not the case.

Now for the secondary issue: different UEFI firmwares implement memory management in different ways, and leaving chunks allocated by boot1 around fragments the space available to loader.efi; until BootServices are exited, the firmware owns the memory map. So, after all this long talk: to have bcache in loader.efi its heap is enlarged (64MB is used in this patch), add the 48MB copy stage area on top, and your qemu instance apparently runs out of suitable chunks... Since our allocations are done as EfiLoaderData, the memmap command does help to list those "leftovers" from boot1.

update: I did test booting loader.efi directly with qemu (no OS, just the loader itself). In round numbers, 128MB of RAM failed to start and 256MB did start. I also found that the qemu memory allocation logic sometimes behaves weirdly; for example, AllocateAddress with a provided address can fail (not found) even though the area is free, and once a chunk of the same size has been allocated with AnyPages, AllocateAddress can then succeed just fine... probably should check for OVMF updates as well ;)
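
For reference, the two allocation modes being compared here, via Boot Services AllocatePages(); the address and page count are only examples:

    EFI_PHYSICAL_ADDRESS addr = 0x200000;   /* request this exact region */
    status = BS->AllocatePages(AllocateAddress, EfiLoaderData, npages, &addr);

    /* versus letting the firmware pick any location (addr returns the result) */
    status = BS->AllocatePages(AllocateAnyPages, EfiLoaderData, npages, &addr);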

Tried applying at r297184. There are conflicts.

I'll rebase this work on HEAD soon, but meanwhile I did test memstick.img with:

qemu-system-x86_64 -m 256M -bios qemu/OVMF-X64-r15214/OVMF.fd -hda Downloads/bcache_r296425_memstick.img

and at least down to 256MB of RAM it does find and load ufs from that memstick.

Updated to revision 297636.

I have done more testing of how big a difference this makes over high-latency IPMI (my biggest pain point).

Without this bcache patch:
average kernel load time: 12:20

bcache seems totally unused when booting from a CD

With the bcache patch:

# bcachestat
cache blocks: 32768
cache blocksz: 512
cache readahead: 51080
unit cache blocks: 4096
cached units: 0
465 ops, 0 bypass, 1137 hits, 2056 misses

Run 1: 4:56

# load /boot/kernel/kernel
# bcachestat
cache blocks: 32768
cache blocksz: 512
cache readahead: 105496
unit cache blocks: 4096
cached units: 0
12849 ops, 0 bypasses, 49790 hits, 3068 misses

Run 2: 4:37

# unload
# load /boot/kernel/kernel
# bcachestat
cache blocks: 32768
cache blocksz: 512
cache readahead: 157864
unit cache blocks: 4096
cached units: 0
25192 ops, 0 bypasses, 98316 hits, 4000 misses

Run 3: 4:28

# unload
# load /boot/kernel/kernel
# bcachestat
cache blocks: 32768
cache blocksz: 512
cache readahead: 210232
unit cache blocks: 4096
cached units: 0
37535 ops, 0 bypasses, 146842 hits, 4932 misses

I would really like to commit this.

Updated to revision 297812.

Updated to revision 297943.

allanjude added a reviewer: allanjude.

I am going to commit this now

This revision is now accepted and ready to land. Apr 18 2016, 2:48 PM
This revision was automatically updated to reflect the committed changes.

Sorry to re-review something from five years ago, but I had a question about this code...

head/sys/boot/common/bcache.c
238 ↗(On Diff #15303)

I understand where the second part comes from: if bc->ra is already the minimum allowed value, we don't want to reduce it further. But why do we avoid decreasing the read-ahead when nblk - i is too small?

I *think* the idea was to avoid reducing the read-ahead when we got most of the blocks from the cache but missed a few. That is, we only reduce the read-ahead when a significant number of blocks is missing.
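
Paraphrasing the guard being discussed (reconstructed from the comments in this thread, not quoted from the committed file; the constant name is an assumption):

    /* shrink read-ahead only when the miss was substantial and we are
     * not already at the minimum */
    if (nblk - i > BCACHE_MINREADAHEAD && bc->ra > BCACHE_MINREADAHEAD)
        bc->ra >>= 1;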