tools/tools/pci/README
- This file was added.
| PCI-E root port bridge error injection tool | |||||
| ------------------------------------------------- | |||||
| * Usage: | |||||
| Compile as C99. Run with root privileges. | |||||
| InjectAER -a: automatic mode; try error injection on a device. | |||||
| InjectAER -l: wizard mode; list available devices and methods and guide the selection. | |||||
| InjectAER -h: print usage. | |||||
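
The three switches above could be mapped to a run mode as in the minimal sketch below. The enum and function names are hypothetical; the README documents only the switches themselves, not how the tool parses them.

```c
#include <string.h>

/* Hypothetical names for illustration; only -a, -l and -h come from the README. */
enum run_mode { MODE_AUTO, MODE_WIZARD, MODE_USAGE };

/* Map the first command-line argument to a run mode.
 * A missing or unrecognized switch falls back to printing usage. */
static enum run_mode
select_mode(int argc, char *const argv[])
{
    if (argc < 2)
        return MODE_USAGE;
    if (strcmp(argv[1], "-a") == 0)
        return MODE_AUTO;
    if (strcmp(argv[1], "-l") == 0)
        return MODE_WIZARD;
    return MODE_USAGE;
}
```

Falling back to usage on unknown input is an assumption of this sketch, chosen so that a typo never starts an injection.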
| * Meaning of Methods: | |||||
| Probing: use a configuration space request to probe a non-existent function. | |||||
| NIC_flag: set the NIC flag while BusMaster is disabled. | |||||
| See the Background sections for details. | |||||
| * Types of error that can be injected: | |||||
| COR: correctable error | |||||
| non-Fatal: non-fatal uncorrectable error | |||||
| Fatal: fatal uncorrectable error | |||||
| See the Background sections for details. | |||||
| * Limitation: | |||||
| This program relies on the AER driver support patch to dev/pci/pci_pci.c | |||||
| Probing method: | |||||
| I. A PCI-E root port bridge device (CLASS=0x060400) that supports AER. | |||||
| II. A PCI-E device on the secondary bus under the bridge. | |||||
| III. The device “supports AER” or “has not implemented Role-Based Error Reporting”. | |||||
| IV. The device has not implemented its last function (function 7 in the current standard). | |||||
| * Since the bridge only forwards the error, the bridge’s AER status registers will not | |||||
| record any detail. Instead, the root port status is expected to record the | |||||
| corresponding error. | |||||
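
The class code and AER requirements above can be checked from a raw copy of a function's configuration space. The following is a minimal sketch that works on a 4 KB byte snapshot rather than live hardware; the helper names are hypothetical, and how the real tool fetches the bytes (the stage list mentions ioctl) is not shown. The register layout used (class code in bits 31:8 of the dword at 0x08, extended capabilities starting at 0x100, AER capability ID 0x0001) follows the PCI-E specification.

```c
#include <stdint.h>

#define CFG_SPACE_SIZE   4096u
#define CFG_CLASS_DWORD  0x08u      /* revision ID in bits 7:0, class code in bits 31:8 */
#define CLASS_PCI_BRIDGE 0x060400u  /* PCI-to-PCI bridge, as quoted in the README */
#define CLASS_ETHERNET   0x020000u  /* Ethernet controller, as quoted in the README */
#define EXT_CAP_START    0x100u     /* first PCI-E extended capability header */
#define EXT_CAP_ID_AER   0x0001u    /* Advanced Error Reporting capability ID */

/* Read a little-endian 32-bit value from a config-space snapshot. */
static uint32_t
cfg_read32(const uint8_t *cfg, unsigned off)
{
    return (uint32_t)cfg[off] |
        ((uint32_t)cfg[off + 1] << 8) |
        ((uint32_t)cfg[off + 2] << 16) |
        ((uint32_t)cfg[off + 3] << 24);
}

/* Return the 24-bit class code (base class, sub-class, programming interface). */
static uint32_t
class_code(const uint8_t *cfg)
{
    return cfg_read32(cfg, CFG_CLASS_DWORD) >> 8;
}

/* Walk the extended capability list and return the offset of the AER
 * capability, or 0 if the function does not implement it. */
static unsigned
find_aer_cap(const uint8_t *cfg)
{
    unsigned off = EXT_CAP_START;
    unsigned guard = (CFG_SPACE_SIZE - EXT_CAP_START) / 4; /* stop on looping lists */

    while (off >= EXT_CAP_START && off <= CFG_SPACE_SIZE - 4 && guard-- > 0) {
        uint32_t hdr = cfg_read32(cfg, off);

        if (hdr == 0 || hdr == 0xFFFFFFFFu)
            break;                      /* no capability at this offset */
        if ((hdr & 0xFFFFu) == EXT_CAP_ID_AER)
            return off;
        off = (hdr >> 20) & 0xFFCu;     /* next pointer lives in bits 31:20 */
    }
    return 0;
}
```

A bridge would then qualify when `class_code(cfg) == CLASS_PCI_BRIDGE` and `find_aer_cap(cfg) != 0`.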
| NIC_flag method: | |||||
| I. A PCI-E root port bridge device (CLASS=0x060400) that supports AER. | |||||
| II. A PCI-E Ethernet device (CLASS=0x020000) on the secondary bus under the bridge. | |||||
| III. The Ethernet device should be unused (to avoid losing a connection). | |||||
| * Runtime errors: error stage descriptions: | |||||
| 0: normal abort | |||||
| ----Main------ | |||||
| 1: open device character file | |||||
| 2: Unexpected method selection | |||||
| ----scan------ | |||||
| 11: Retrieve device list | |||||
| 12: Using ioctl access device config space | |||||
| 13: Using ioctl access device PCIE extended config space | |||||
| ----Inject_dev(Recording initial settings)------ | |||||
| 21: Using ioctl to access bridge config | |||||
| 22: Using ioctl to access device config | |||||
| 23: Using ioctl to access device AER cap | |||||
| ----Inject_dev-------- | |||||
| 31: ioctl modify settings on bridge config | |||||
| 32: ioctl modify settings on device config | |||||
| 33: ioctl modify settings on device AER cap | |||||
| 34: sysctl calling device probing | |||||
| 35: After probing: ioctl clearing bits | |||||
| ----Inject_dev(Restore initial settings)------ | |||||
| 41: Using ioctl to access bridge config | |||||
| 42: Using ioctl to access device config | |||||
| 43: Using ioctl to access device AER cap | |||||
| ----Inject_if----------- | |||||
| 51: Open socket | |||||
| 52: Using ioctl to access device flag | |||||
| 53: Using ioctl to access bridge config | |||||
| ----General Helpers---- | |||||
| 101: ioctl searching for PCIE extended config space | |||||
| 102: ioctl searching for AER cap config space | |||||
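
The stage list above lends itself to a small lookup table. The sketch below is one possible shape for it, not the tool's actual error handler: the structure, the function names and the shortened description strings are hypothetical, while the stage numbers are taken from the list. The errno translation uses the standard `strerror`.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct stage_desc {
    int stage;
    const char *text;
};

/* Stage numbers follow the README; the wording is abbreviated here. */
static const struct stage_desc stage_table[] = {
    { 0, "normal abort" },
    { 1, "open device character file" },
    { 2, "unexpected method selection" },
    { 11, "retrieve device list" },
    { 12, "read device config space" },
    { 13, "read device PCIE extended config space" },
    { 21, "save bridge config" },
    { 22, "save device config" },
    { 23, "save device AER cap" },
    { 31, "modify bridge config" },
    { 32, "modify device config" },
    { 33, "modify device AER cap" },
    { 34, "sysctl device probing" },
    { 35, "clear bits after probing" },
    { 41, "restore bridge config" },
    { 42, "restore device config" },
    { 43, "restore device AER cap" },
    { 51, "open socket" },
    { 52, "access device flag" },
    { 53, "access bridge config" },
    { 101, "search for PCIE extended config space" },
    { 102, "search for AER cap" },
};

/* Return the description for a stage number, or a fallback string. */
static const char *
stage_describe(int stage)
{
    size_t i;

    for (i = 0; i < sizeof(stage_table) / sizeof(stage_table[0]); i++)
        if (stage_table[i].stage == stage)
            return stage_table[i].text;
    return "unknown stage";
}

/* Format "stage N (description): errno text" into buf. */
static void
stage_format(char *buf, size_t len, int stage, int err)
{
    snprintf(buf, len, "stage %d (%s): %s", stage, stage_describe(stage),
        strerror(err));
}
```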
| * Background (This program) | |||||
| The program can be run with parameters that select automatic mode or wizard mode. | |||||
| In automatic mode, the program picks the first bridge-device combination in the | |||||
| queue and performs a non-fatal error injection using the first recorded method. | |||||
| In wizard mode, the program lets the user choose which bridge-device combination | |||||
| to inject on and which method to use. If the device is capable of injecting a | |||||
| fatal error, the program also prompts the user to choose the error type. | |||||
| Before any attempt to change a device, the program first saves the current | |||||
| configuration and prints the current error status. The configuration is saved in a | |||||
| global structure “initial_config”. After the error injection completes, the program | |||||
| restores the previous configuration and clears the error status. | |||||
| If a runtime error happens, the error handler prints the stage number and the errno | |||||
| translation. If the error happens after the device configuration has been modified, | |||||
| the error handler tries to restore the configuration. If the restoration fails, | |||||
| it prints all of the configuration saved previously. | |||||
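
The save, modify, restore sequence described above can be sketched as follows. Only the name “initial_config” comes from the README; its fields, the `live_regs` stand-in for the hardware registers, and both function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the live registers the tool touches (hypothetical selection). */
struct live_regs {
    uint16_t bridge_command;     /* bridge Command register */
    uint16_t dev_control;        /* PCI-E Device Control register */
    uint32_t aer_uncor_severity; /* AER Uncorrectable Error Severity */
    uint32_t aer_cor_mask;       /* AER Correctable Error Mask */
};

/* Global snapshot, named after the structure mentioned in the README.
 * The field layout is an assumption of this sketch. */
static struct initial_config {
    struct live_regs regs;
    bool valid;                  /* true once a snapshot has been taken */
} initial_config;

/* Record the current settings before anything is modified. */
static void
save_config(const struct live_regs *r)
{
    initial_config.regs = *r;
    initial_config.valid = true;
}

/* Write the snapshot back. Returns false when nothing was saved, so the
 * caller can fall back to printing whatever it has. */
static bool
restore_config(struct live_regs *r)
{
    if (!initial_config.valid)
        return false;
    *r = initial_config.regs;
    return true;
}
```

Keeping a `valid` flag lets an error handler distinguish a failure before the snapshot (nothing to undo) from one after it (restore needed).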
| * Background (Probing) | |||||
| Device probing is commonly used when the system initializes and scans all the peripheral | |||||
| devices attached to it. It is normally done by sending configuration-space read | |||||
| requests to the PCI bus with a specific device (slot) number and function number. | |||||
| By default, a request that reaches a non-existent device or function receives a | |||||
| “Master abort” completion status and returns a value of all ‘1’s. In addition, a PCI-E | |||||
| device generates an “Unsupported Request” error message, but under most default | |||||
| settings that message is not signaled to the root and does not trigger an interrupt. | |||||
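
Because a read of a non-existent function returns all ‘1’s, the vendor ID field reads back as 0xFFFF, which is how absent functions are usually recognized. The sketch below takes the first configuration dword of each of the eight functions of a device and picks an unimplemented one to probe. Both function names are hypothetical, and scanning downward from function 7 is an assumption; the limitation list only says that function 7 must be unimplemented.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_FUNC_MAX 7 /* functions are numbered 0..7 */

/* A vendor ID of 0xFFFF (low 16 bits of dword 0) means nothing answered. */
static bool
function_absent(uint32_t id_dword)
{
    return (id_dword & 0xFFFFu) == 0xFFFFu;
}

/* Given dword 0 as read from each function of one device, return the
 * highest-numbered unimplemented function, or -1 if all eight exist. */
static int
pick_probe_target(const uint32_t id[PCI_FUNC_MAX + 1])
{
    int func;

    for (func = PCI_FUNC_MAX; func >= 0; func--)
        if (function_absent(id[func]))
            return func;
    return -1;
}
```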
| The goal is to clear the path so that the “Unsupported Request” error message is | |||||
| forwarded to the root. After many tests on my Intel hardware, I found that the most | |||||
| effective way is to cause an “Unsupported Request” on an end-point device | |||||
| and let the upstream PCI-E bridges forward the error message to the root port, which | |||||
| then generates the interrupt. | |||||
| The behavior of end-point devices varies a lot between hardware. Based on the tests I | |||||
| made on my available hardware, two factors determine how a device behaves when it | |||||
| receives an invalid configuration space access request: whether it implements | |||||
| Role-Based Error Reporting (RBER) and whether it implements AER. | |||||
| I. The device implements neither RBER nor AER. | |||||
| In this case, the invalid configuration space request is treated as an “unsupported | |||||
| request” non-fatal error. With the “unsupported request” reporting and non-fatal | |||||
| error reporting bits enabled in the device control register, the device sends a | |||||
| non-fatal uncorrectable error message, “ERR_NONFATAL”, to the upstream bridge. | |||||
| II. The device implements AER but not RBER. | |||||
| As in the first case, the error message can be sent by enabling the error | |||||
| reporting bits. In addition, I can change the severity of “unsupported request” in | |||||
| the AER uncorrectable error severity register, so the device can send either an | |||||
| “ERR_NONFATAL” or an “ERR_FATAL” message to the bridge. | |||||
| III. The device implements both RBER and AER. | |||||
| Starting from PCI-E specification ver. 1.1, PCI-E devices are required to implement | |||||
| RBER. With RBER, the device changes the type of error signaling based on the | |||||
| error detection agent. When it receives an invalid configuration space access | |||||
| request, a device with RBER treats the “unsupported request” non-fatal error as a | |||||
| masked “advisory non-fatal” correctable error, in order to avoid disturbing the | |||||
| probing process. With AER support, I can clear the mask for the “advisory non-fatal” | |||||
| error in the AER correctable error mask register to let the device send an “ERR_COR” | |||||
| message. I can also bypass RBER by changing the severity of the “unsupported | |||||
| request” error to “FATAL”, so that the device sends an “ERR_FATAL” message. | |||||
| IV. The device implements RBER but not AER. | |||||
| Unfortunately, this is a dead end. Without AER, I cannot change the severity of | |||||
| “unsupported request” to “fatal” or let the device report a correctable error. The | |||||
| error signaling flowchart in the specification (p. 291 [1]) implies the same result. | |||||
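
The four cases above reduce to a small decision table plus three register tweaks. The sketch below encodes them as pure functions on register values; it is an illustration under stated assumptions, not the tool's code. All names are hypothetical. The bit positions follow the PCI-E specification: Device Control bits 0 to 3 are the correctable, non-fatal, fatal and unsupported-request reporting enables, Unsupported Request is bit 20 of the AER uncorrectable registers, and Advisory Non-Fatal is bit 13 of the AER correctable registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Message types that can be provoked, as a bitmask. */
#define INJ_COR      0x1u /* ERR_COR */
#define INJ_NONFATAL 0x2u /* ERR_NONFATAL */
#define INJ_FATAL    0x4u /* ERR_FATAL */

/* PCI-E Device Control register: error reporting enables. */
#define DEVCTL_COR_EN      0x0001u
#define DEVCTL_NONFATAL_EN 0x0002u
#define DEVCTL_FATAL_EN    0x0004u
#define DEVCTL_UR_EN       0x0008u

#define AER_UNCOR_UR     (1u << 20) /* Unsupported Request (status/mask/severity) */
#define AER_COR_ADVISORY (1u << 13) /* Advisory Non-Fatal (status/mask) */

/* Cases I to IV: which messages a probe of a missing function can raise. */
static unsigned
injectable(bool has_aer, bool has_rber)
{
    if (!has_rber)
        return has_aer ? (INJ_NONFATAL | INJ_FATAL) : INJ_NONFATAL; /* II : I */
    return has_aer ? (INJ_COR | INJ_FATAL) : 0u;                    /* III : IV */
}

/* Enable all four reporting bits in Device Control. */
static uint16_t
devctl_enable_reporting(uint16_t devctl)
{
    return (uint16_t)(devctl | DEVCTL_COR_EN | DEVCTL_NONFATAL_EN |
        DEVCTL_FATAL_EN | DEVCTL_UR_EN);
}

/* Select fatal or non-fatal severity for Unsupported Request. */
static uint32_t
aer_set_ur_severity(uint32_t severity, bool fatal)
{
    return fatal ? (severity | AER_UNCOR_UR) : (severity & ~AER_UNCOR_UR);
}

/* Clear the Advisory Non-Fatal mask so that ERR_COR can be sent (case III). */
static uint32_t
aer_unmask_advisory(uint32_t cor_mask)
{
    return cor_mask & ~AER_COR_ADVISORY;
}
```

Enabling all four reporting bits at once is a simplification; a tool could enable only the bits needed for the chosen message type.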
| * Some thoughts: | |||||
| All cases in this method rely on a bridge device and a child device, where the child | |||||
| sends the error message and the bridge forwards it. Technically, we should be able | |||||
| to apply the same trick to a bridge device directly, which would be better because we | |||||
| would not need to deal with the children. However, I ran many tests on my PCI-E root port | |||||
| bridges, and none of them showed any sign of generating an error in response to an | |||||
| invalid configuration space request. According to the flowchart (p. 291 [1]) there should | |||||
| be at least a correctable error recorded in the AER status, but my devices do not do so. | |||||
| * Background (NIC_flag) | |||||
| I can disable the “Bus Master Enable” bit on the bridge and then change the flag on the | |||||
| downstream NIC device. After that, the bridge regards all I/O requests from the device as | |||||
| “unsupported request”. This error is generated on the bridge, so the root port status | |||||
| records the error and the AER status of the bridge also reflects the error detail. | |||||
| * I have not done many tests on this method. The only problem I noticed is that the | |||||
| NIC may become unstable and lose its link after the injection. | |||||
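
The bridge-side step of this method comes down to clearing one bit. In the PCI Command register (offset 0x04), Bus Master Enable is bit 2. The sketch below shows the clear and the later restore as pure functions on the register value; the names are hypothetical, and the NIC flag change itself is not shown because the README does not say which flag is used.

```c
#include <stdint.h>

#define CMD_REG_OFFSET    0x04u   /* PCI Command register offset */
#define CMD_BUS_MASTER_EN 0x0004u /* bit 2: Bus Master Enable */

/* Return the Command value with Bus Master Enable cleared. */
static uint16_t
command_disable_busmaster(uint16_t cmd)
{
    return (uint16_t)(cmd & ~CMD_BUS_MASTER_EN);
}

/* Put Bus Master Enable back to its saved state, leaving other bits as they are now. */
static uint16_t
command_restore_busmaster(uint16_t cmd, uint16_t saved)
{
    return (uint16_t)((cmd & ~CMD_BUS_MASTER_EN) | (saved & CMD_BUS_MASTER_EN));
}
```

Restoring only the one bit, instead of writing back the whole saved register, is a design choice of this sketch so that unrelated Command bits changed in the meantime are not overwritten.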
| * Reference | |||||
| [1]: PCI-Express Base Specification Revision 1.1 | |||||
| $FreeBSD$ | |||||
| No newline at end of file | |||||