tools/tools/pci/README
- This file was added.
PCI-E root port bridge error injection tool
-------------------------------------------
* Usage:
    Compile as C99. Run with root privileges.
    InjectAER -a: automatically try error injection on a device.
    InjectAER -l: wizard; list available devices and methods and guide the selection.
    InjectAER -h: print usage.
* Meaning of methods:
    Probing: use a configuration space request to probe a non-existent function.
    NIC_flag: set the NIC flag while BusMaster is disabled.
    See the background sections for details.
* Types of error that can possibly be injected:
    COR: correctable error
    non-Fatal: non-fatal uncorrectable error
    Fatal: fatal uncorrectable error
    See the background sections for details.
* Limitations:
    This program relies on the AER driver support patch to dev/pci/pci_pci.c.
    The Probing method requires:
        I.   A PCI-E root port bridge device (class 0x060400) that supports AER.
        II.  A PCI-E device on the secondary bus under the bridge.
        III. A device that supports AER or has not implemented Role-Based Error Reporting.
        IV.  A device that does not implement the last function (function 7 under the
             current standard).
        * Since the bridge only forwards the error, the bridge’s AER status registers will
          not record any detail. Instead, the root port status is expected to record the
          corresponding error.
    The NIC_flag method requires:
        I.   A PCI-E root port bridge device (class 0x060400) that supports AER.
        II.  A PCI-E Ethernet device (class 0x020000) on the secondary bus under the bridge.
        III. An Ethernet device that is not in use (to avoid dropping a live connection).
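The class-code requirements above can be checked with a few lines of C. This is a minimal sketch, not code from the tool: the helper names are hypothetical, and it assumes the caller has already read the 32-bit value at configuration offset 0x08, where the revision ID occupies bits 7:0 and the 24-bit class code occupies bits 31:8. Class 0x060400 only identifies a PCI-to-PCI bridge; confirming that it is a root port would additionally need the port type from the PCI-E capability, which is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Class codes quoted in the requirements above. */
#define CLASS_PCI_BRIDGE	0x060400u	/* base 0x06, sub 0x04, prog-if 0x00 */
#define CLASS_ETHERNET		0x020000u	/* base 0x02, sub 0x00, prog-if 0x00 */

/* Extract the 24-bit class code from the dword at config offset 0x08. */
static uint32_t
class_code_from_dword(uint32_t dword_at_08)
{
	return (dword_at_08 >> 8);
}

/* Hypothetical helper: true for a PCI-to-PCI bridge class code. */
static bool
is_pci_bridge(uint32_t dword_at_08)
{
	return (class_code_from_dword(dword_at_08) == CLASS_PCI_BRIDGE);
}

/* Hypothetical helper: true for an Ethernet controller class code. */
static bool
is_ethernet(uint32_t dword_at_08)
{
	return (class_code_from_dword(dword_at_08) == CLASS_ETHERNET);
}
```

The low byte (revision ID) is ignored, so the same check works for any revision of a device.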
* Runtime errors: error stage descriptions:
    0: normal abort
    ----Main----
    1: open device character file
    2: unexpected method selection
    ----scan----
    11: retrieve device list
    12: using ioctl to access device config space
    13: using ioctl to access device PCIE extended config space
    ----Inject_dev (recording initial settings)----
    21: using ioctl to access bridge config
    22: using ioctl to access device config
    23: using ioctl to access device AER cap
    ----Inject_dev----
    31: ioctl modify settings on bridge config
    32: ioctl modify settings on device config
    33: ioctl modify settings on device AER cap
    34: sysctl calling device probing
    35: after probing: ioctl clearing bits
    ----Inject_dev (restore initial settings)----
    41: using ioctl to access bridge config
    42: using ioctl to access device config
    43: using ioctl to access device AER cap
    ----Inject_if----
    51: open socket
    52: using ioctl to access device flag
    53: using ioctl to access bridge config
    ----General helpers----
    101: ioctl searching for PCIE extended config space
    102: ioctl searching for AER cap config space
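For readers who want to decode a stage number programmatically, the table above maps naturally onto a lookup array. The following is only a sketch: `stage_table`, `stage_describe` and `stage_format` are hypothetical names, not identifiers from the tool, and the output format is an assumption. The stage numbers and descriptions are taken from the list above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct stage_desc {
	int		 stage;
	const char	*text;
};

/* Stage numbers and descriptions as listed in this README. */
static const struct stage_desc stage_table[] = {
	{ 0, "normal abort" },
	{ 1, "open device character file" },
	{ 2, "unexpected method selection" },
	{ 11, "retrieve device list" },
	{ 12, "using ioctl to access device config space" },
	{ 13, "using ioctl to access device PCIE extended config space" },
	{ 21, "save: ioctl access to bridge config" },
	{ 22, "save: ioctl access to device config" },
	{ 23, "save: ioctl access to device AER cap" },
	{ 31, "ioctl modify settings on bridge config" },
	{ 32, "ioctl modify settings on device config" },
	{ 33, "ioctl modify settings on device AER cap" },
	{ 34, "sysctl calling device probing" },
	{ 35, "after probing: ioctl clearing bits" },
	{ 41, "restore: ioctl access to bridge config" },
	{ 42, "restore: ioctl access to device config" },
	{ 43, "restore: ioctl access to device AER cap" },
	{ 51, "open socket" },
	{ 52, "using ioctl to access device flag" },
	{ 53, "using ioctl to access bridge config" },
	{ 101, "ioctl searching for PCIE extended config space" },
	{ 102, "ioctl searching for AER cap config space" },
};

/* Return the description for a stage, or a fallback string. */
static const char *
stage_describe(int stage)
{
	size_t i;

	for (i = 0; i < sizeof(stage_table) / sizeof(stage_table[0]); i++)
		if (stage_table[i].stage == stage)
			return (stage_table[i].text);
	return ("unknown stage");
}

/* Format a stage number, its description and the errno text into buf. */
static int
stage_format(char *buf, size_t len, int stage, int err)
{
	return (snprintf(buf, len, "stage %d (%s): %s",
	    stage, stage_describe(stage), strerror(err)));
}
```

The errno text comes from `strerror`, so its exact wording depends on the C library.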
* Background (This program)
    The program can be run with parameters that select automatic mode or wizard mode.
    In automatic mode, the program picks the first bridge-device combination in the
    queue and performs a non-fatal error injection using the first recorded method.
    In wizard mode, the program lets the user choose which bridge-device combination
    to inject on and which method to use. If the device is capable of injecting a fatal
    error, the program also prompts the user to choose the error type.

    Before any attempt to change a device, the program first saves the current
    configuration and prints the current error status. The configuration is saved in a
    global structure “initial_config”. After the error injection completes, the program
    restores the previous configuration and clears the error status.

    If a runtime error happens, the error handler prints the stage number and the errno
    translation. If the error happens after the device configuration has been modified,
    the error handler tries to restore the configuration. If the restoration fails, it
    prints all of the previously saved configuration.
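The save, modify, restore sequence described above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's code: the register set is modelled as a plain in-memory array instead of ioctl accesses, the layout of `initial_config` is invented (only its name comes from this README), and `with_saved_config` and `demo_inject` are hypothetical helpers.

```c
#include <stdint.h>
#include <string.h>

#define CFG_WORDS	4	/* hypothetical number of saved registers */

/* Hypothetical layout; the README only names the structure. */
struct saved_config {
	uint32_t	regs[CFG_WORDS];
	int		valid;
};

static struct saved_config initial_config;

/* Record the live register values before touching anything. */
static void
config_save(const uint32_t *live)
{
	memcpy(initial_config.regs, live, sizeof(initial_config.regs));
	initial_config.valid = 1;
}

/* Write the recorded values back; fails if nothing was saved. */
static int
config_restore(uint32_t *live)
{
	if (!initial_config.valid)
		return (-1);
	memcpy(live, initial_config.regs, sizeof(initial_config.regs));
	return (0);
}

/* Run an injection step and restore the registers whether or not it failed. */
static int
with_saved_config(uint32_t *live, int (*inject)(uint32_t *))
{
	int rc;

	config_save(live);
	rc = inject(live);
	if (config_restore(live) != 0)
		return (-1);
	return (rc);
}

/* Stand-in for a real injection step: it just flips some bits. */
static int
demo_inject(uint32_t *live)
{
	live[0] |= 0x8u;
	return (0);
}
```

Restoring unconditionally after the injection step mirrors the behavior described above, where the error handler also attempts a restore when a later stage fails.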
* Background (Probing)
    Device probing is commonly used when the system initializes and scans all the
    peripheral devices attached to it. It is normally done by sending configuration-space
    read requests to the PCI bus with a specific device (slot) number and function number.
    By default, a request that reaches a non-existent device or function receives a
    “Master abort” completion status and returns a value of all ‘1’s. In addition, a
    PCI-E device generates an “Unsupported Request” error message, but under most default
    settings that message is not signaled to the root to trigger an interrupt.
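The all-ones rule described above is what lets software tell a missing function from a present one: a 16-bit vendor ID read that returns 0xFFFF means nothing answered. The sketch below applies that rule to pick a probe target. It assumes the caller has already read the vendor ID of each of the eight functions by whatever means, and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_MAX_FUNCTION	7	/* functions are numbered 0..7 */
#define PCI_VENDOR_ABSENT	0xFFFFu	/* all-ones read: nothing responded */

/* A vendor ID of all ones means the function does not exist. */
static bool
function_absent(uint16_t vendor_id)
{
	return (vendor_id == PCI_VENDOR_ABSENT);
}

/*
 * Return the function number to probe, or -1 if the last function is
 * implemented and therefore cannot serve as a non-existent target.
 */
static int
pick_probe_function(const uint16_t vendor_ids[PCI_MAX_FUNCTION + 1])
{
	if (function_absent(vendor_ids[PCI_MAX_FUNCTION]))
		return (PCI_MAX_FUNCTION);
	return (-1);
}
```

Only the last function is considered, matching the requirement listed under the Probing method that function 7 be unimplemented.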
    The goal is to clear the path so that the “Unsupported Request” error message is
    successfully forwarded to the root. After many tests on my Intel hardware, I found
    that the most effective way to achieve this is to cause an “Unsupported Request” on
    an end-point device and let the upstream PCI-E bridges forward the error messages to
    the root port, which then generates the interrupt.

    The behavior of end-point devices varies a lot across hardware. Based on the tests I
    made on my available hardware, two factors determine how a device behaves when it
    receives an invalid configuration space access request: whether it implements
    Role-Based Error Reporting (RBER) and whether it implements AER.
    I. The device implements neither RBER nor AER.
        In this case, the invalid configuration space request is treated as an
        “unsupported request” non-fatal error. With the “unsupported request” reporting
        and non-fatal error reporting bits enabled in the device control register, the
        device sends a non-fatal uncorrectable error message, “ERR_NONFATAL”, to the
        upstream bridge.
    II. The device implements AER but not RBER.
        This case is similar to the first: the error message can be sent by enabling the
        error reporting bits. In addition, I can change the severity of “unsupported
        request” in the AER uncorrectable error severity register, so the device can
        send either an “ERR_NONFATAL” or an “ERR_FATAL” message to the bridge.
    III. The device implements both RBER and AER.
        Starting from PCI-E specification ver. 1.1, PCI-E devices are required to
        implement RBER. With RBER, the device will “be smart” and change the type of
        error signaling based on the error detection agent. When it receives an invalid
        configuration space access request, a device with RBER treats the “unsupported
        request” non-fatal error as a masked “advisory non-fatal” correctable error, in
        order to avoid disturbing the probing process. With AER support, I can clear the
        mask for the “advisory non-fatal” error in the AER correctable error mask
        register to let the device send an “ERR_COR” message. I can also bypass RBER by
        changing the severity of the “unsupported request” error to “FATAL”, in which
        case the device sends an “ERR_FATAL” message.
    IV. The device implements RBER but not AER.
        Unfortunately, this is a dead end. Without AER, I cannot change the severity of
        “unsupported request” to “fatal”, nor let the device report a correctable error.
        The error signaling flow chart in the specification (p. 291 [1]) implies the
        same result.
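The four cases above reduce to a small decision table plus a handful of register bits. The following is a minimal sketch, not the tool's code: the function and `INJ_*` names are hypothetical, and the helpers operate on register values that the caller is assumed to have read already. The bit positions are the standard PCI-E ones: Role-Based Error Reporting is bit 15 of Device Capabilities, the reporting enables are bits 0 to 3 of Device Control, Unsupported Request is bit 20 of the AER uncorrectable registers, and Advisory Non-Fatal is bit 13 of the AER correctable registers.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEVCAP_RBER		(1u << 15)	/* Device Capabilities: RBER */
#define DEVCTL_COR_EN		(1u << 0)	/* Device Control enables */
#define DEVCTL_NONFATAL_EN	(1u << 1)
#define DEVCTL_FATAL_EN		(1u << 2)
#define DEVCTL_UR_EN		(1u << 3)
#define AER_UNCOR_UR		(1u << 20)	/* Unsupported Request */
#define AER_COR_ADV_NONFATAL	(1u << 13)	/* Advisory Non-Fatal */

/* Hypothetical flags for the error types a device can be made to send. */
#define INJ_COR		0x1u
#define INJ_NONFATAL	0x2u
#define INJ_FATAL	0x4u

/* Cases I to IV from the text, as a decision table. */
static unsigned
injectable_types(bool has_rber, bool has_aer)
{
	if (!has_rber && !has_aer)
		return (INJ_NONFATAL);			/* case I */
	if (!has_rber && has_aer)
		return (INJ_NONFATAL | INJ_FATAL);	/* case II */
	if (has_rber && has_aer)
		return (INJ_COR | INJ_FATAL);		/* case III */
	return (0);					/* case IV: dead end */
}

static bool
devcap_has_rber(uint32_t devcap)
{
	return ((devcap & DEVCAP_RBER) != 0);
}

/* Enable all error reporting bits in a Device Control value. */
static uint16_t
devctl_enable_reporting(uint16_t devctl)
{
	return ((uint16_t)(devctl | DEVCTL_COR_EN | DEVCTL_NONFATAL_EN |
	    DEVCTL_FATAL_EN | DEVCTL_UR_EN));
}

/* A set severity bit means the error is reported as fatal. */
static uint32_t
aer_set_ur_fatal(uint32_t uncor_severity)
{
	return (uncor_severity | AER_UNCOR_UR);
}

/* Clear the mask bit so advisory non-fatal errors send ERR_COR. */
static uint32_t
aer_unmask_advisory(uint32_t cor_mask)
{
	return (cor_mask & ~AER_COR_ADV_NONFATAL);
}
```

Keeping the helpers as pure functions on register values separates the policy (which bits to change) from the access mechanism (how the registers are read and written).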
* Some thoughts:
    All cases in this method rely on a bridge device and a child device, where the
    child sends the error message and the bridge forwards it. Technically, we should be
    able to apply the same trick to the bridge device itself, which would be better
    because we would not need to deal with the children. However, I did many tests on my
    PCI-E root port bridges and none of them showed any sign of generating an error in
    response to an invalid configuration space request. According to the flowchart
    (p. 291 [1]) there should be at least a correctable error recorded in the AER status,
    but my devices disagree.
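Checking whether a port recorded anything in its AER status, as discussed above, first means locating the AER extended capability. Below is a sketch of that walk over a 4 KB snapshot of configuration space held in memory. The snapshot buffer and all function names are assumptions made for illustration. The layout is the standard one: the extended capability list starts at offset 0x100, each header carries the capability ID in bits 15:0 and the next offset in bits 31:20, the AER capability ID is 0x0001, and the correctable error status register sits at offset 0x10 within the AER capability.

```c
#include <stdint.h>

#define PCIE_EXT_CFG_START	0x100u
#define PCIE_CFG_SIZE		0x1000u
#define EXT_CAP_ID_AER		0x0001u
#define AER_COR_STATUS		0x10u		/* offset within the AER cap */
#define AER_COR_ADV_NONFATAL	(1u << 13)

/* Read a little-endian 32-bit register from the snapshot. */
static uint32_t
cfg_read32(const uint8_t *cfg, uint32_t off)
{
	return ((uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	    ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24));
}

/* Return the offset of the extended capability, or 0 if not found. */
static uint32_t
find_ext_cap(const uint8_t *cfg, uint16_t cap_id)
{
	uint32_t off = PCIE_EXT_CFG_START;
	int guard = (PCIE_CFG_SIZE - PCIE_EXT_CFG_START) / 8;

	/* The guard bounds the walk in case the list is malformed. */
	while (off >= PCIE_EXT_CFG_START && off + 4 <= PCIE_CFG_SIZE &&
	    guard-- > 0) {
		uint32_t hdr = cfg_read32(cfg, off);

		if (hdr == 0 || hdr == 0xFFFFFFFFu)
			break;			/* empty or absent list */
		if ((hdr & 0xFFFFu) == cap_id)
			return (off);
		off = (hdr >> 20) & 0xFFCu;	/* next pointer, dword aligned */
	}
	return (0);
}

/* Non-zero if the advisory non-fatal bit is set in the AER status. */
static int
aer_advisory_logged(const uint8_t *cfg, uint32_t aer_off)
{
	uint32_t status = cfg_read32(cfg, aer_off + AER_COR_STATUS);

	return ((status & AER_COR_ADV_NONFATAL) != 0);
}
```

A next pointer of zero ends the list, which the loop condition handles because zero is below the start of extended configuration space.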
* Background (NIC_flag)
    I can disable the “Bus Master Enable” bit on the bridge and then change the flag on
    the downstream NIC device. After that, the bridge regards all I/O requests from the
    device as “unsupported request” errors. This error is generated on the bridge, so
    not only does the root port status record the error, but the AER status of the
    bridge also reflects the error detail.
    * I have not done many tests on this method. The only problem I have noticed is
      that the NIC may become unstable and lose its link after the injection.
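The bridge-side step of the NIC_flag method is a single bit flip in the PCI command register. The sketch below shows it under stated assumptions: the helper names are hypothetical, and the caller is assumed to have read the 16-bit command register (configuration offset 0x04) beforehand. Bus Master Enable is bit 2 of that register.

```c
#include <stdint.h>

#define PCI_COMMAND_OFFSET	0x04u		/* command register offset */
#define PCI_COMMAND_BUS_MASTER	(1u << 2)	/* Bus Master Enable */

/* Return the command value with Bus Master Enable cleared. */
static uint16_t
command_disable_bus_master(uint16_t cmd)
{
	return ((uint16_t)(cmd & ~PCI_COMMAND_BUS_MASTER));
}

/* Put Bus Master Enable back to its saved state, leaving other bits alone. */
static uint16_t
command_restore_bus_master(uint16_t cmd, uint16_t saved)
{
	return ((uint16_t)((cmd & ~PCI_COMMAND_BUS_MASTER) |
	    (saved & PCI_COMMAND_BUS_MASTER)));
}
```

Restoring only the one bit, instead of writing back the whole saved value, avoids undoing unrelated changes made to the command register in the meantime.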
* Reference
    [1]: PCI-Express Base Specification Revision 1.1

$FreeBSD$