Page MenuHomeFreeBSD

Add AER register reporting support via sysctl
Needs RevisionPublic

Authored by jhibbits on Aug 29 2016, 7:20 PM.

Details

Reviewers
jhb
imp
Group Reviewers
Src Committers
Summary

This adds reporting of AER registers from PCI Express devices, exporting via records in hw.pci similar to MCA records. The goal is to be able to retrieve these error records and input them into a database or other reporting mechanism.

This adds new sysctl nodes under hw.pci:

hw.pci.pcib_error_count
hw.pci.pcib_error_records
hw.pci.pcib_probe
hw.pci.pcib_clear_error_records

Plus two tools to read and inject errors. There is also a README included.

Work done by Bo Pang <cjysqpb_AT_hotmail.com>, an intern for EMC/Isilon Storage Division . His summary follows:

Patch on pci_pci.c contains most implementation of AER on the driver.
Patch on pcib_private.h adds some entry in softc.
New file ReadAER.c reads AER records, compiles using gcc.
New file Inject_em0.c is a hard-coded script based my hardware, to inject error into device em0. Compiles using gcc, runs under root privileges.
Test Plan

Test tool included.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

jhibbits updated this revision to Diff 19798.Aug 29 2016, 7:20 PM
jhibbits retitled this revision from to Add AER register reporting support via sysctl.
jhibbits updated this object.
jhibbits edited the test plan for this revision. (Show Details)
jhibbits added reviewers: jhb, Src Committers.
jhibbits set the repository for this revision to rS FreeBSD src repository.
emaste added a subscriber: emaste.Aug 29 2016, 8:28 PM
imp added a comment.Feb 6 2017, 9:19 PM

For the commit, I'd suggest doing the injection tool (both userland and kernel) separate from the kernel changes to pcib.

imp added inline comments.Feb 6 2017, 9:44 PM
sys/dev/pci/pci_pci.c
973

How do I atomically read and clear the records?

1018–1019

I don't see where this is set.
Also, it should be per-bridge maybe?
We'd like to get counts on a per-card basis, and I don't see that happening here...

1052

I'd be tempted to rate limit this message.
Also, it dropped the oldest record, so maybe that's a better message?
And aren't we leaking the report by not freeing 'next' here?

1056

WAITOK seems unwise here.

1068

do not print this in the interrupt handler.
Just exit. That would also simplify the code.

1105

WAITOK in an interrupt handler? That strikes me as unwise. Plus we're limiting it to a certain number: seems like it would be better to preallocate.

1253

Where do we setup to get the individual device counts? this looks like it does the bridges, but it appears that devices also report this information. In experiments that I've done on a machine suffering from these issues, it is rare to see the pcibX device match the down-stream nvmeY device, for example.

imp added a comment.EditedFeb 6 2017, 9:47 PM

For example:
pciconf -bBlaec pcib5
pcib5@pci0:0:2:2: class=0x060400 card=0x083315d9 chip=0x6f068086 rev=0x01 hdr=0x01

  bus range  = 4-4
  window[1c] = type I/O Port, range 16, addr 0xf000-0xfff, disabled
  window[20] = type Memory, range 32, addr 0xfb500000-0xfb5fffff, enabled
  window[24] = type Prefetchable Memory, range 64, addr 0xfff00000-0xfffff, disabled
  cap 0d[40] = PCI Bridge card=0x083315d9
  cap 05[60] = MSI supports 2 messages, vector masks 
  cap 10[90] = PCI-Express 2 root port max data 256(256) ARI disabled
               link x4(x4) speed 2.5(8.0)
               slot 0 power limit 25000 mW surprise
  cap 01[e0] = powerspec 3  supports D0 D3  current D0
  ecap 000b[100] = Vendor 1 ID 2
  ecap 000d[110] = ACS 1
  ecap 0001[148] = AER 1 0 fatal 1 non-fatal 2 corrected
  ecap 000b[1d0] = Vendor 1 ID 3
  ecap 0019[250] = PCIe Sec 1 lane errors 0
  ecap 000b[280] = Vendor 1 ID 5
  ecap 000b[300] = Vendor 1 ID 8
PCI-e errors = Correctable Error Detected
               Non-Fatal Error Detected
   Non-fatal = Completion Timeout
   Corrected = Replay Timer Timeout
               Advisory Non-Fatal Error

but the device says:

pciconf -blaec nvme2
nvme2@pci0:4:0:0: class=0x010802 card=0x00031c58 chip=0x00031c58 rev=0x05 hdr=0x00

  bar   [10] = type Memory, range 64, base 0xfb530000, size 16384, enabled
  bar   [20] = type Memory, range 64, base 0xfb520000, size 65536, enabled
  cap 01[c0] = powerspec 3  supports D0 D3  current D0
  cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR
               link x4(x4) speed 2.5(8.0) ASPM disabled(L0s/L1)
  cap 05[c8] = MSI supports 32 messages, 64 bit, vector masks 
  cap 11[e0] = MSI-X supports 129 messages, enabled
               Table in map 0x10[0x2000], PBA in map 0x10[0x3000]
  ecap 0001[100] = AER 2 0 fatal 1 non-fatal 6 corrected
  ecap 0019[180] = PCIe Sec 1 lane errors 0
PCI-e errors = Correctable Error Detected
   Non-fatal = Unsupported Request
   Corrected = Receiver Error
               Bad TLP
               Bad DLLP
               REPLAY_NUM Rollover
               Replay Timer Timeout
               Advisory Non-Fatal Error

For topology pcib5 -> pci5 -> nvme2

imp added a comment.Feb 6 2017, 10:12 PM

After talking with John, these two reports are self-consistent. The bridge see one kind of thing going on (timeout) while the card sees something else (bad header so ignored).

imp requested changes to this revision.Feb 11 2017, 4:41 PM
imp added a reviewer: imp.

Some more comments based on first trying to use this code, and then trying to rework it.
While the code is decently written, there's a number of fundamental assumptions it makes that aren't reflective of the AER in the PCIe spec that require changes.
I've not looked at all at the userland test suite, so have no opinion on that.

So in its current form, we can't use it.

I'm reworking things to allow two types of usage. One by client devices that want to 'poll' their AER status and keep counts / records of what happened on that device. The second form is only on Root Complexes and Root Error Collection devices that collect records there (and maybe for root complexes walks the tree polling all the agents and endpoints involved to complete the record, if possible).

sys/dev/pci/pci_pci.c
1065

We should only clear the bits that we're interested in. temp, or masked temp. Not all of them. There's a race here if we use a fixed mask.

1076

This code only reads one record. There may be more than one.

1079

And this is why. We should clear things one bit at a time to support devices that keep a stack of errors rather than just the last one. We also ignore the pointer to the last error which is present even in devices that don't support multiple error reports.

1088

Same criticism.

1093–1101

All of these are only valid for Root Complex or Root Event Collector devices. They will be garbage values when not present (on my system they are all ff's).

1137

This only works for Root Complex or Event Collecting device types. Yet we do this for all devices. No interrupts will issue on the other bridges.

1147–1181

All this code now conflicts with hot plug code. The hot plug code establishes the same interrupt. There's a register to find out which interrupt MSI to use for Root Complex or Root Event Collector devices that we're ignoring here. To the extent that this code works, it may be happy accident.

This revision now requires changes to proceed.Feb 11 2017, 4:41 PM
In D7697#197154, @imp wrote:

I'm reworking things to allow two types of usage. One by client devices that want to 'poll' their AER status and keep counts / records of what happened on that device. The second form is only on Root Complexes and Root Error Collection devices that collect records there (and maybe for root complexes walks the tree polling all the agents and endpoints involved to complete the record, if possible).

Thanks Warner for all your feedback!
I finished my school so now circling back here, trying to push this 1+ years old work further.

Sounds like you were also working on this topic -- How's it going?
Do you have any suggestion on what I should do next? I am new to this community.

Thanks again,
Bo