Page MenuHomeFreeBSD

Add ccp(4): experimental driver for AMD Crypto Co-Processor
ClosedPublic

Authored by cem on Oct 18 2017, 5:51 PM.

Details

Summary
  • Registers TRNG source for random(4)
  • Finds available queues, LSBs; allocates static objects
  • Allocates a shared MSI-X for all queues. The hardware does not have separate interrupts per queue. Working interrupt mode driver.
  • Computes SHA hashes, HMAC. Passes cryptotest.py, cryptocheck tests.
  • Does AES-CBC, CTR mode, and XTS. cryptotest.py and cryptocheck pass.
  • Support for "authenc" (AES + HMAC). (SHA1 seems to result in "unaligned" cleartext inputs from cryptocheck -- which the engine cannot handle. SHA2 seems to work fine.)
  • GCM passes for block-multiple AAD, input lengths

Largely based on ccr(4), part of cxgbe(4).

Rough performance averages on AMD Ryzen 1950X (4kB buffer):
aesni: SHA1: ~8300 Mb/s SHA256: ~8000 Mb/s
ccp: ~630 Mb/s SHA256: ~660 Mb/s SHA512: ~700 Mb/s
cryptosoft: ~1800 Mb/s SHA256: ~1800 Mb/s SHA512: ~2700 Mb/s

As you can see, performance is poor in comparison to aesni(4) and even
cryptosoft (due to high setup cost). At a larger buffer size (128kB),
throughput is a little better (but still worse than aesni(4)):

aesni: SHA1:~10400 Mb/s SHA256: ~9950 Mb/s
ccp: ~2200 Mb/s SHA256: ~2600 Mb/s SHA512: ~3800 Mb/s
cryptosoft: ~1750 Mb/s SHA256: ~1800 Mb/s SHA512: ~2700 Mb/s

AES performance has a similar story:

aesni: 4kB: ~11250 Mb/s 128kB: ~11250 Mb/s
ccp: ~350 Mb/s 128kB: ~4600 Mb/s
cryptosoft: ~1750 Mb/s 128kB: ~1700 Mb/s

This driver is EXPERIMENTAL. You should verify cryptographic results
on typical and corner case inputs from your application against a known-
good implementation.

Test Plan

Passing cryptocheck and cryptotest.py in supported modes. Unaligned inputs are rejected.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
markj added inline comments.Oct 23 2017, 5:54 PM
sys/crypto/ccp/ccp.c
195 ↗(On Diff #34215)

s/SGLs/SGEs/

238 ↗(On Diff #34215)

Since ccp_initialize_queue() is only used once, it be a bit neater to have a ccp_initialize_queues() instead and move this loop there.

813 ↗(On Diff #34215)

"error" is write-only.

832 ↗(On Diff #34215)

This function and the one below appear to be unused.

sys/crypto/ccp/ccp.h
24 ↗(On Diff #34215)

Missing $FreeBSD$ tags in the headers.

sys/crypto/ccp/ccp_hardware.c
92 ↗(On Diff #34215)

This line is misindented.

sys/crypto/ccp/ccp_hardware.h
26 ↗(On Diff #34215)

Missing include guard.

sys/crypto/ccp/ccp_lsb.c
43 ↗(On Diff #34215)

Some of these headers aren't needed: dev/pci/*, mutex.h, lock.h, rman.h, module.h, ...

sys/modules/ccp/Makefile
13 ↗(On Diff #34215)

Need to include opt_ddb.h to SRCS to be able to build it as a standalone KLD.

cem updated this revision to Diff 34268.Oct 23 2017, 9:21 PM
cem marked 8 inline comments as done.

Fix review issues from markj

sys/crypto/ccp/ccp.c
238 ↗(On Diff #34215)

Ok

832 ↗(On Diff #34215)

Yeah, I was thinking it would be nice to have a kernel API like ioat that doesn't go through opencrypto, but don't have that very concrete yet.

cem updated this revision to Diff 34382.Oct 27 2017, 2:25 AM
  • Discard now useless test code
  • Use acquire/release routines from OCF process()
  • Clear out any descriptors written for a transaction that experiences an error before submission
cem updated this revision to Diff 34734.Nov 4 2017, 12:18 AM

Fix backwards "active" queue entries calculation (only used in debug asserts for now).

cem updated this revision to Diff 34841.Nov 5 2017, 11:59 PM

Use MSI-X interrupts. It turns out we were neglecting to set up the 2nd
vector, which is where the relevant interrupts were delivered.

cem added a comment.EditedNov 17 2017, 7:32 PM

Improvements I'd like to make before commit:

  • Ensure there is sufficient room in the submission rings, or return EAGAIN/block depending on wait flags. Currently there are only KASSERTs.
  • Make debug spam printfs sysctlable and possibly CTR/KTR records.
  • Track whole transactions and clear out the whole thing if any segment errors. This is probably identical to advancing the head past the next segment with associated callback.

Reviewers, is there anything else you'd especially like to see before accepting this for commit?

cem retitled this revision from Add ccp(4): driver for AMD Crypto Co-Processor to Add ccp(4): experimental driver for AMD Crypto Co-Processor.Nov 17 2017, 7:34 PM
cem edited the summary of this revision. (Show Details)
cem edited the test plan for this revision. (Show Details)
markj added inline comments.Dec 6 2017, 4:18 PM
sys/crypto/ccp/ccp.c
234 ↗(On Diff #34841)

There's a lot of printf("XXX...") in the driver, which ought to be cleaned up at some point.

236 ↗(On Diff #34841)

Not something to be addressed here, but crypto_get_driverid() should probably just use unr(9). :/

459 ↗(On Diff #34841)

Seems like an odd initialization default, especially since we don't set cipher_mode in the AES-XTS case. We don't register support for AES-XTS, so this isn't a major problem at the moment, but it looks weird.

sys/crypto/ccp/ccp_hardware.c
459 ↗(On Diff #34841)

You could just set ec before breaking from the loop?

660 ↗(On Diff #34841)

This looks like it should be assigning to error instead. There's also an extra newline.

679 ↗(On Diff #34841)

It looks like we don't do proper cleanup in the caller if this loop aborts. e.g., we won't call pci_release_msi().

681 ↗(On Diff #34841)

"rid" may be modified by bus_alloc_resource_any(). In practice I guess it's not an issue for SYS_RES_IRQ, but the man page explicitly warns about this.

745 ↗(On Diff #34841)

Is it worth checking the value of g_ccp_ring_order first to make sure it's sane?

795 ↗(On Diff #34841)

It'd be nice to come up with a better name for "i" given its use in this cleanup block.

1029 ↗(On Diff #34841)

mflags is unused.

1043 ↗(On Diff #34841)

Should this be addressed? :)

1048 ↗(On Diff #34841)

It would be a good idea to verify this too. Perhaps also cache the paddrs, but the lookup is pretty cheap, so it may not be worth the effort.

1067 ↗(On Diff #34841)

Note that min() will truncate to 32 bits.

1234 ↗(On Diff #34841)

Write-only var, but I'm guessing that addressing the XXX might change that.

1402 ↗(On Diff #34841)

keydata and keydata_len are left uninitialized if we go through the default case.

1408 ↗(On Diff #34841)

It'd be nice to clean this up.

1444 ↗(On Diff #34841)

This doesn't seem to be used either?

1483 ↗(On Diff #34841)

Does C actually guarantee that this cast will give you the low 32 bits? It would be a bit cleaner to explicitly mask the bits you want.

sys/x86/include/bus.h
121 ↗(On Diff #34841)

This should be committed separately.

cem added inline comments.Dec 6 2017, 6:06 PM
sys/crypto/ccp/ccp.c
234 ↗(On Diff #34841)

Definitely. At one point I was keeping this stuff separate in a git stash but I guess it got merged in sometime. Some of it even leaks stuff like key information.

459 ↗(On Diff #34841)

I suppose. Mostly it just needs to be initialized to something other than CCP_AES_MODE_GCTR.

Ideally AES-XTS will be added at some point, and that will make it more clear. We could #if 0 the the XTS case of the outer switch, making it return EINVAL? (Though as you point out, it should never be invoked as we don't register support.)

sys/crypto/ccp/ccp_hardware.c
679 ↗(On Diff #34841)

Yes, that's true. Handling cleanup fully correctly if some step in initialization fails is hard.

681 ↗(On Diff #34841)

Hm. We could use a temporary copy for the call to bus_alloc_resource_any.

745 ↗(On Diff #34841)

Might as well.

1029 ↗(On Diff #34841)

At this time, yes, but the intent is to eventually pass M_WAITOK / M_NOWAIT.

1043 ↗(On Diff #34841)

Probably :(

1048 ↗(On Diff #34841)

Sure. Note that the H_vectors here are local to this driver (since they're in reverse order). Can you think of an annotation short of __align(PAGE_SIZE) to make sure they don't cross a page boundary at compile time? I suppose some clever CTASSERTs might work.

1067 ↗(On Diff #34841)

The other consideration is that crd_len is int, so we're dealing with at most 31 bit lengths.

1234 ↗(On Diff #34841)

It's currently vestigial from ccr(4) based on the following commented out line, but yeah, the XXX probably needs addressing.

1402 ↗(On Diff #34841)

The default case here is impossible to reach. This routine is invoked from ccp_process with a session mode of BLKCIPHER. That requires ccp_newsession with one of CBC, ICM, or XTS. And newsession shouldn't be called with XTS at this time as it isn't registered.

1408 ↗(On Diff #34841)

As in, actually implement XTS? :-) Yes.

1444 ↗(On Diff #34841)

Vestigial from the ccr driver.

1483 ↗(On Diff #34841)

I believe so. Not sure how else it could be interpreted. Note that vm_paddr_t is an integer rather than pointer type.

https://stackoverflow.com/questions/6752567/casting-a-large-number-type-to-a-smaller-type claims that the standard says:

If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [Note: In a two's complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation).]

markj added inline comments.Dec 6 2017, 6:34 PM
sys/crypto/ccp/ccp_hardware.c
1048 ↗(On Diff #34841)

I can't think of another annotation that might help, but I think this CTASSERT might do it:

CTASSERT(PAGE_SIZE - ((uintptr_t)&vec[0] % PAGE_SIZE) >= sizeof(vec))

I think all of the vectors would together fit in less of a page, so you could maybe put them in a single array and align that.

1402 ↗(On Diff #34841)

The compiler doesn't know that. :)

1483 ↗(On Diff #34841)

I see, thanks.

cem added inline comments.Dec 6 2017, 6:59 PM
sys/crypto/ccp/ccp.c
459 ↗(On Diff #34841)

The other reason to use ECB is that it is so clearly a bogus value. The hardware will do it, but OCF and this driver don't support it (for good reason).

sys/crypto/ccp/ccp_hardware.c
1048 ↗(On Diff #34841)

Thanks, I'll do something like that.

1402 ↗(On Diff #34841)

It doesn't produce an error. I don't recall it producing a warning, either, thought it may be. Do you think the compiler is assuming UB and generating bad code? The CBC and ICM tests pass, so I think that is unlikely.

markj added inline comments.Dec 6 2017, 7:18 PM
sys/crypto/ccp/ccp_hardware.c
1402 ↗(On Diff #34841)

At least gcc 6 emits an error for this.

cem added inline comments.Dec 6 2017, 7:26 PM
sys/crypto/ccp/ccp_hardware.c
1402 ↗(On Diff #34841)

Ah, sure. I have been building with Clang. Is keydata / keydata_len all I need to initialize to silence the GCC warning?

markj added inline comments.Dec 6 2017, 7:45 PM
sys/crypto/ccp/ccp_hardware.c
1402 ↗(On Diff #34841)

Yep, looks like it. gcc is also complaining about sgl_nsegs being write-only. With those two things fixed, ccp builds.

cem updated this revision to Diff 36370.EditedDec 8 2017, 12:24 AM
cem marked 29 inline comments as done.
cem edited the summary of this revision. (Show Details)
  • Return EAGAIN if we're out of ring space (squash to ENOMEM at top level for OCF consumers). (I decided to punt on a sleepable API.) Ring space is evaluated lazily and partial submissions are rolled back if we discover ourselves to be out of space.
  • Fix up cleanup on descriptor error; clear out and error the entire operation/transaction (through next callback).
  • Be sure to zero sensitive key data in session descriptor.
  • Fix up most of the review feedback from Mark
cem updated this revision to Diff 36371.Dec 8 2017, 12:46 AM

cryptotest.py: Add similar skipping for unsupported AES-GCM inputs

cem updated this revision to Diff 37380.Jan 1 2018, 10:48 PM

Implement AES-XTS.

cem edited the summary of this revision. (Show Details)Jan 13 2018, 5:48 PM
cem marked an inline comment as done.Jan 13 2018, 5:55 PM

Committed MAXADDR 48bit separately.

cem updated this revision to Diff 38008.Jan 15 2018, 11:24 PM
cem edited the summary of this revision. (Show Details)

Hide debugging (and especially sensitive content printing) behind special
macros. DPRINTF is debug spam and can be enabled at runtime. INSECURE_DEBUG()
prints sensitive material and is compiled out by default.

cem updated this revision to Diff 38010.Jan 15 2018, 11:27 PM

Accidentally posted two patches; remove accidentally added patch.

cem marked 4 inline comments as done.Jan 15 2018, 11:29 PM

I think the driver is more or less ready to go in now.

cem added a comment.Jan 15 2018, 11:36 PM

Put trivial random changes in independent review for secteam: https://reviews.freebsd.org/D13925

CCing (maybe :-)) interested parties for the random source portion of this driver.

cem updated this revision to Diff 38045.Jan 16 2018, 6:11 PM

Rebase past smaller changes pulled out and committed already.

If there are no objections in the next day or two, I'll go ahead and commit
this to the tree.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 18 2018, 10:01 PM
This revision was automatically updated to reflect the committed changes.