Page MenuHomeFreeBSD

EFI boot: Fix boot freeze on some systems
Needs ReviewPublic

Authored by dasebek_gmail.com on Jun 20 2021, 1:23 PM.

Details

Reviewers
kib
Summary

This patch addresses a problem that, immediately before the kernel is started, the amd64_tramp function calls the efi_copy_finish function to copy the kernel image from a temporary staging area in the memory to the actual memory address where the kernel expects to be run from. The problem is that the boot loader, including the efi_copy_finish function, may have been loaded by the UEFI firmware somewhere in the range where the kernel is being copied to. The efi_copy_finish function may thus overwrite its own instructions by the kernel image, causing the system to freeze. This is not a problem for the amd64_tramp trampoline itself, which is first copied to a safe memory location before it is executed.

My patch does the following:

  • Instead of calling the efi_copy_finish function, which may be located anywhere in the memory, the copy operation is done by the trampoline itself. This is the most important part of the patch.
  • Adds missing return value checks.
  • Before the kernel is copied from the staging area, a new efi_verify_destination_type function checks that the target memory area is safe to use. If not, it only prints a warning message. (This is useful mainly for debugging purposes when a system freezes.)

Because I changed the parameters of the trampoline function, I created a new amd64_tramp_inline function and kept the original amd64_tramp intact in order not to break any other existing code. However, the amd64_tramp function does not seem to be used anywhere else in the existing code, so it could be modified directly. The efi_copy_finish function is not needed anymore too.

[PR: 209821]
[PR: 256722]

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

@andrew Given I have seen similar on aarch64 u-boot loader.efi boots, could it be that it's an issue there as well?

In D30828#695823, @bz wrote:

@andrew Given I have seen similar on aarch64 u-boot loader.efi boots, could it be that it's an issue there as well?

It's unlikely to be the same issue as there is no trampoline on arm64. We boot the kernel directly from where it's loaded in memory.

In D30828#695823, @bz wrote:

@andrew Given I have seen similar on aarch64 u-boot loader.efi boots, could it be that it's an issue there as well?

As @andrew mentioned, the aarch64 kernel is run from the "staging area", no further copying is involved there. However, the machine-independent parts of this patch could detect problems that were previously silently ignored, for example, if the size of the staging area is too small or if the memory allocation fails.

I have a WIP that avoids copying on amd64 as well, starting kernel from the staging area. No ETA.

This patch (or some variant of it) would be still needed because new loader/kernel hand-off interface cannot be made compatible with old kernels, which assume that kernload==2M.

In D30828#695823, @bz wrote:

@andrew Given I have seen similar on aarch64 u-boot loader.efi boots, could it be that it's an issue there as well?

As @andrew mentioned, the aarch64 kernel is run from the "staging area", no further copying is involved there. However, the machine-independent parts of this patch could detect problems that were previously silently ignored, for example, if the size of the staging area is too small or if the memory allocation fails.

What if people load modules and memory disks? Would that be detected with this? Would other parts need further checks?

In D30828#695872, @bz wrote:
In D30828#695823, @bz wrote:

@andrew Given I have seen similar on aarch64 u-boot loader.efi boots, could it be that it's an issue there as well?

As @andrew mentioned, the aarch64 kernel is run from the "staging area", no further copying is involved there. However, the machine-independent parts of this patch could detect problems that were previously silently ignored, for example, if the size of the staging area is too small or if the memory allocation fails.

What if people load modules and memory disks? Would that be detected with this? Would other parts need further checks?

It comes to the same staging area.

One of the outcome of my changes should be removal of the unreasonable cap on the staging size, either immediate, or as a next step. There would be still a limit by current arbitrary decision to require it all sit under 4G and related fragmentation, but this is still better than 64M.