Page MenuHomeFreeBSD

geli: use unmapped I/O
ClosedPublic

Authored by asomers on Jul 14 2020, 8:04 PM.

Details

Summary

geli: use unmapped I/O

Use unmapped I/O for geli. Unlike most geom providers, geli needs to
manipulate data on every read or write. Previously it used mapped bios; now
it uses sf_buf(9). This change adds two additional data copies on reads
(none on writes), but still ends up being faster due to less memory mapping
activity.

On my 16-core, dual socket server using geli on top of md(4) devices, this
change increases geli IOPs by about 3x.

Note that geli still can't use unmapped I/O when data integrity verification
is enabled.

Test Plan

geli's ATF test suite, plus a bespoke benchmark program.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/geom/eli/g_eli_privacy.c
296 ↗(On Diff #74602)

It is not safe to allocate more than one sf buf with blocking. Imagine that there are N total sf bufs in the system, and you have N threads trying to blockingly allocate 2 sf bufs each. The resulting deadlock is real.

For instance, sendfile code only blocks for the first sfbuf allocation, all later allocs are non-blocking and code adopts to possible failure.

Of course this is only relevant for arches without direct map.

sys/geom/eli/g_eli_privacy.c
296 ↗(On Diff #74602)

So I need to ensure that any geli worker thread will only block on allocation for up to 1 buffer at a time. Should that be one buffer for the entire thread, or just one buffer for each bio?

sys/geom/eli/g_eli_privacy.c
296 ↗(On Diff #74602)

Nevermind, I'm going to do something else instead. But I need to build an x86 head VM first to test it. Don't expect new code until tomorrow.

Fix deadlock during out-of-sf_buf condition

If insufficient sf_bufs are available, map the bio instead.
Tested on x86 with an artificially restricted sf_buf pool.

sys/geom/eli/g_eli_privacy.c
291 ↗(On Diff #74624)

Note that this malloc(9) is not needed on DMAP arches, it only adds overhead.
Generally, malloc(9) on write path is the source of deadlocks, but probably it is too late to worry about geli.

I probably should suggest this explicitly: use unmapped path only on DMAP arches, leave current transient remapping when DMAP is not available.

sys/kern/subr_bus_dma.c
692 ↗(On Diff #74624)

We have b_bus_dmamap_load_ma() which is more efficient than roundtrip through sf_buf_kva()->pmap_kextract() from bus_dmamap_load_buffer(). I am not sure how easy is to pass bio_ma up to this point, but it is definitely better than use VA-based load_buffer() in _bus_dmamap_load_sf_buf().

It might be worth reconstructing bio_ma[] from sf_buf array with sf_buf_page() if passing bio_ma is too problematic.

sys/geom/eli/g_eli_privacy.c
291 ↗(On Diff #74624)

So you think that sf_buf would be too slow on non-DMAP arches to be worth using? Ok, that's easy. But why do you say that this malloc is not needed? Are you suggesting that crypto(9) work on an array of vm_page pointers instead or even on a struct bio ?

sys/geom/eli/g_eli_privacy.c
291 ↗(On Diff #74624)

Rather I think that non-DMAP arches are already slow, and make the efforts to speed up DMAP arches too hard.

I am saying that malloc() is not needed because you already have the bio_ma array and on DMAP arches it is same as the array of sf_bufs. Crypto(9) operating directly on array of vm_pages is the best option, I think. At least, when I worked on unmapped io and busdma, directly taking array of vm_pages was the option that required the minimum amount of plumbing comparing with any other case.

Use a vm_page_t array instead of sf_buf, only on DMAP architectures.

And make g_io_bio_copyin a private function to geli

Now I am confused. It seems that the new patch works on the page at the time, unless I missed a detail. Then, if we return to the design where you used sf bufs, you actually do not need more then one sf_buf at the time. I am not stating that it is possible to changing the patch back to sf_bufs, since it might be impossible to use sleepable allocation, in which case you would need to retract to transient mapping. But if this is not issue, then 'single sf buf at a time' might be an option.

I thought that sometimes geli might need to see more than page, I know that geli is the one of the providers where buffer size can be larger than the page size.

sys/dev/cxgbe/crypto/t4_crypto.c
277 ↗(On Diff #74707)

Look at e.g. sys/powerpc/include/vmparam.h. PMAP_HAS_DMAP can be runtime value.

In D25671#569908, @kib wrote:

Now I am confused. It seems that the new patch works on the page at the time, unless I missed a detail. Then, if we return to the design where you used sf bufs, you actually do not need more then one sf_buf at the time. I am not stating that it is possible to changing the patch back to sf_bufs, since it might be impossible to use sleepable allocation, in which case you would need to retract to transient mapping. But if this is not issue, then 'single sf buf at a time' might be an option.

Yes, I think you're missing a detail. g_eli_crypto_run creates one crypto request per sector. A request may span multiple pages if either:

  1. The user's buffer was not page-aligned, and spanned a page boundary, and/or
  2. The sector size is larger than the page size.
sys/dev/cxgbe/crypto/t4_crypto.c
277 ↗(On Diff #74707)

Darn! Ok, I can make it a runtime check. Unfortunately, that's going to add a bunch of useless code for arches like x86 where the dmap is never defined.

sys/dev/cxgbe/crypto/t4_crypto.c
277 ↗(On Diff #74707)

You can make the check conditional on a compile-time define that the particular arch needs it.

sys/dev/cxgbe/crypto/t4_crypto.c
277 ↗(On Diff #74707)

I think you want

#ifdef PMAP_HAS_DMAP
    if (PMAP_HAS_DMAP) {
...
   }
#endif
sys/dev/cxgbe/crypto/t4_crypto.c
277 ↗(On Diff #74707)

No, that won't work, because it's defined on all arches. It's just that on some it's defined to 0. At least on those arches, the compiler can eliminate some of the code as dead.

sys/dev/cxgbe/crypto/t4_crypto.c
277 ↗(On Diff #74707)
#if PMAP_HAS_DMAP
    if (PMAP_HAS_DMAP) {
...
   }
#endif

(#if instead of #ifdef)

Allow PMAP_HAS_DMAP to change at runtime

This does somewhat increase code size on arches like x86, but at least the code
is more readable without all those #ifdefs.

Conditionally compile most CRYPTO_BUF_VMPAGE code

sys/geom/eli/g_eli_privacy.c
195 ↗(On Diff #74750)

Why do you clear the flag ?

sys/opencrypto/criov.c
68 ↗(On Diff #74750)

I would not use VM_PAGE_ prefix for something not coming from vm_page.{h,c}.

154 ↗(On Diff #74750)

Again, please do not use vm_page_ prefix for something not coming from vm_page.{c,h}.

182 ↗(On Diff #74750)

() are not needed. also you can use atop() there

183 ↗(On Diff #74750)

Is this same as skip = trunc_page(skip); ?

209 ↗(On Diff #74750)

return (processed);

asomers added inline comments.
sys/geom/eli/g_eli_privacy.c
195 ↗(On Diff #74750)

Because during writes, g_eli_crypto_run mallocs a new buffer and points bp->bio_data to it. So the bio is mapped now if it wasn't before.

sys/opencrypto/criov.c
183 ↗(On Diff #74750)

Yes

sys/geom/eli/g_eli_privacy.c
195 ↗(On Diff #74750)

Either this should be stated as comment, or BIO_UNMAPPED cleared at the point where you reset bio_data, or both.

asomers marked an inline comment as done.

Respond to kib's comments, mostly style. Also, rebase.

A couple of suggestions that IMO improve layering:

  • Define a constant for use by the crypto subsystem and consumers instead of sprinkling PMAP_HAS_DMAP.
  • Avoid conditionally defining struct members and code when possible, especially in headers. This removes the need to import machine/param.h in crypto headers.
  • Use a more generic name instead of VM_PAGE or VMPAGE to refer to unmapped cryptops. UNMAPPED would be a bit clearer and is consistent with GEOM and the buffer cache. PHYSMEM might be another.
sys/opencrypto/criov.c
163 ↗(On Diff #74770)

Style: missing space after the second *. There are other instances of this below.

68 ↗(On Diff #74750)

I agree.

  • Avoid conditionally defining struct members and code when possible, especially in headers. This removes the need to import machine/param.h in crypto headers.

Ok, we've reached the bikeshedding point of the review. @markj frowns on conditional compilation, but @mjg and @kib request it. I tend to agree with Mark in this case, just for the sake of readability. So what shall we do?

  • Avoid conditionally defining struct members and code when possible, especially in headers. This removes the need to import machine/param.h in crypto headers.

Ok, we've reached the bikeshedding point of the review.

Well, maybe @jhb has some more interesting opinions on the change since he has been doing most of the recent work on opencrypto.

@markj frowns on conditional compilation, but @mjg and @kib request it. I tend to agree with Mark in this case, just for the sake of readability. So what shall we do?

To be clear, I frown on conditional compilation in headers when it's not strictly necessary. In my experience conditional definition of struct members leads to fragility: debuggers might end up with an inconsistent view of the structure layout, it's easy to forget that structures might be used by userspace, etc.

  • Avoid conditionally defining struct members and code when possible, especially in headers. This removes the need to import machine/param.h in crypto headers.

Ok, we've reached the bikeshedding point of the review.

Well, maybe @jhb has some more interesting opinions on the change since he has been doing most of the recent work on opencrypto.

@markj frowns on conditional compilation, but @mjg and @kib request it. I tend to agree with Mark in this case, just for the sake of readability. So what shall we do?

I did not asked for strict conditional compilation, I suggested to only accept unmapped bios when platform has DMAP and as consequence can very cheaply provide VA access to the unmapped page.
What I did asked for is to stop using vm_page/VM_PAGE prefixes.

To be clear, I frown on conditional compilation in headers when it's not strictly necessary. In my experience conditional definition of struct members leads to fragility: debuggers might end up with an inconsistent view of the structure layout, it's easy to forget that structures might be used by userspace, etc.

I agree.

Minimize conditional compilation, and style fix to "char*"

In general this is something I was planning on doing anyway (see my other comment). I disagree on calling it "unmapped" as it is poor name (and too bio-specific) for reasons I've given above. I think the sglist API shows we already use APIs with "vmpages" in the name outside of vm_page.{c,h} so that argument is kind of moot anyway.

sys/opencrypto/criov.c
68 ↗(On Diff #74750)

Hmm, I would have preferred calling all this VMPAGES instead of UNMAPPED. UNMAPPED could also have been backed by an sglist(9), so it is ambiguous. I think VMPAGES is a better name if it is an array of vm_page_t. This is also consistent with the sglist(9) API which uses "vmpages" and not "unmapped".

There are other places where knowing it's an array of VMPAGES is important (KTLS is one I was planning on implementing this exact feature for, as well as an earlier prototype I had a few years ago to teach /dev/crypto to wire user pages to avoid the overhead of copies when using ccr(4) from OpenSSL's engine interface for TLS).

sys/opencrypto/cryptodev.h
429 ↗(On Diff #74809)

In general style(9) wants a blank line before comments (here and many other places throughout)

sys/opencrypto/criov.c
68 ↗(On Diff #74750)

My thinking was that we're not just using VM pages, but are additionally relying on the direct map to access their contents. A name that captures this would be ideal IMO. sglist just fetches their physical addresses, so it's not quite the same.

UNMAPPED is admittedly not a good name. I'd prefer something more specific than VMPAGES but I don't have any good suggestions, so I'm ok with reverting that change back. Sorry for causing churn. I still think we should avoid adding anything to the vm_page_* namespace.

Use CRYPTO_BUF_VMPAGE instead of CRYPTO_BUF_UNMAPPED, and name static functions cvm_page_XXX

I tested the patch with safexcel(4) on arm64, no problems seen when running GELI tests.

ping. Is there anything else I need to do here?

It looks fine for me with regards to interaction with unmapped/DMAP. But I do not know geli code to say more.

sys/opencrypto/criov.c
68 ↗(On Diff #74750)

I agree not adding to the vm_page_* namespace, but this is a VMPAGES suffix. Note that for some use cases you won't even use the direct map (e.g when using ccr(4) with KTLS and the separate AAD, the actual TLS payload would be an unmapped mbuf which uses physaddrs and in the case it's file-backed the output buffer would be a vm_page_t array and the driver would just DMA from one to the other without ever using the direct map).

Note that if we bothered to have an sglist buffer type it would have similar mapping constraints. I think we just have to describe it as a limitation in the documentation (crypto_buffer.9 in this case) that this type of buffer only works with crypto cursors and the crypto_copy* API (and thus with all device drivers) if the architecture supports a direct map.

Somehow my earlier comment was stuck. There are still some style nits about leading spaces before block comments (and before the new "skip" macro in criov.c). I think the general change looks good. I think it would be good to commit the new buffer type separately from the geli changes (so two commits). I will probably test this for KTLS in the near future.

Note that I do think there are some cases where even integrity would work fine as unmapped. ccr(4) (and Intel QAT) can both offload combined ETA including verifying or storing the hash via DMA. You might consider at least making it a sysctl/tunable as to whether geli can use unmapped requests for the integrity case.

In D25671#577993, @jhb wrote:

Somehow my earlier comment was stuck. There are still some style nits about leading spaces before block comments (and before the new "skip" macro in criov.c). I think the general change looks good. I think it would be good to commit the new buffer type separately from the geli changes (so two commits). I will probably test this for KTLS in the near future.

Are you saying that you want a blank line above every block comment?

Note that I do think there are some cases where even integrity would work fine as unmapped. ccr(4) (and Intel QAT) can both offload combined ETA including verifying or storing the hash via DMA. You might consider at least making it a sysctl/tunable as to whether geli can use unmapped requests for the integrity case.

QAT? I didn't know there was a driver for that yet.

  • Rebase
  • Update man pages
  • Use "vmpage" terminology consistently instead of "unmapped"
sigsys_gmail.com added inline comments.
sys/opencrypto/criov.c
297 ↗(On Diff #75763)

Really not sure I understand this code in general, but should the whole amount left to process be subtracted on every iteration that processes a small chunk of it?

sys/opencrypto/criov.c
297 ↗(On Diff #75763)

Good catch. You're right, but testing didn't catch it because geli never tries to advance by more than 4096 bytes at a time.

Fix crypto_cursor_advance when amount > 4096

ping. Any further comments on this PR?

bcr added a subscriber: bcr.

OK from manpages.

sys/opencrypto/criov.c
605 ↗(On Diff #75809)

Does this compile on powerpc?

209 ↗(On Diff #74750)

This is still outstanding.

Fix build on powerpc and fix one style bug

I agree that the GELI and opencrypto bits should be committed separately.

This revision is now accepted and ready to land.Aug 25 2020, 3:26 PM
This revision was automatically updated to reflect the committed changes.