268 lines
		
	
	
		
			11 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			268 lines
		
	
	
		
			11 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
   The PCI Express Advanced Error Reporting Driver Guide HOWTO
 | 
						|
		T. Long Nguyen	<tom.l.nguyen@intel.com>
 | 
						|
		Yanmin Zhang	<yanmin.zhang@intel.com>
 | 
						|
				07/29/2006
 | 
						|
 | 
						|
 | 
						|
1. Overview
 | 
						|
 | 
						|
1.1 About this guide
 | 
						|
 | 
						|
This guide describes the basics of the PCI Express Advanced Error
 | 
						|
Reporting (AER) driver and provides information on how to use it, as
 | 
						|
well as how to enable the drivers of endpoint devices to conform with
 | 
						|
PCI Express AER driver.
 | 
						|
 | 
						|
1.2 Copyright (C) Intel Corporation 2006.
 | 
						|
 | 
						|
1.3 What is the PCI Express AER Driver?
 | 
						|
 | 
						|
PCI Express error signaling can occur on the PCI Express link itself
 | 
						|
or on behalf of transactions initiated on the link. PCI Express
 | 
						|
defines two error reporting paradigms: the baseline capability and
 | 
						|
the Advanced Error Reporting capability. The baseline capability is
 | 
						|
required of all PCI Express components providing a minimum defined
 | 
						|
set of error reporting requirements. Advanced Error Reporting
 | 
						|
capability is implemented with a PCI Express advanced error reporting
 | 
						|
extended capability structure providing more robust error reporting.
 | 
						|
 | 
						|
The PCI Express AER driver provides the infrastructure to support PCI
 | 
						|
Express Advanced Error Reporting capability. The PCI Express AER
 | 
						|
driver provides three basic functions:
 | 
						|
 | 
						|
-	Gathers the comprehensive error information if errors occurred.
 | 
						|
-	Reports error to the users.
 | 
						|
-	Performs error recovery actions.
 | 
						|
 | 
						|
AER driver only attaches root ports which support PCI-Express AER
 | 
						|
capability.
 | 
						|
 | 
						|
 | 
						|
2. User Guide
 | 
						|
 | 
						|
2.1 Include the PCI Express AER Root Driver into the Linux Kernel
 | 
						|
 | 
						|
The PCI Express AER Root driver is a Root Port service driver attached
 | 
						|
to the PCI Express Port Bus driver. If a user wants to use it, the driver
 | 
						|
has to be compiled. Option CONFIG_PCIEAER supports this capability. It
 | 
						|
depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and
 | 
						|
CONFIG_PCIEAER = y.
 | 
						|
 | 
						|
2.2 Load PCI Express AER Root Driver
 | 
						|
 | 
						|
Some systems have AER support in firmware. Enabling Linux AER support at
 | 
						|
the same time the firmware handles AER may result in unpredictable
 | 
						|
behavior. Therefore, Linux does not handle AER events unless the firmware
 | 
						|
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
 | 
						|
Specification for details regarding _OSC usage.
 | 
						|
 | 
						|
2.3 AER error output
 | 
						|
 | 
						|
When a PCIe AER error is captured, an error message will be output to
 | 
						|
console. If it's a correctable error, it is output as a warning.
 | 
						|
Otherwise, it is printed as an error. So users could choose different
 | 
						|
log level to filter out correctable error messages.
 | 
						|
 | 
						|
Below shows an example:
 | 
						|
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
 | 
						|
0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
 | 
						|
0000:50:00.0:    [20] Unsupported Request    (First)
 | 
						|
0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100
 | 
						|
 | 
						|
In the example, 'Requester ID' means the ID of the device who sends
 | 
						|
the error message to root port. Pls. refer to pci express specs for
 | 
						|
other fields.
 | 
						|
 | 
						|
2.4 AER Statistics / Counters
 | 
						|
 | 
						|
When PCIe AER errors are captured, the counters / statistics are also exposed
 | 
						|
in the form of sysfs attributes which are documented at
 | 
						|
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
 | 
						|
 | 
						|
3. Developer Guide
 | 
						|
 | 
						|
To enable AER aware support requires a software driver to configure
 | 
						|
the AER capability structure within its device and to provide callbacks.
 | 
						|
 | 
						|
To support AER better, developers need understand how AER does work
 | 
						|
firstly.
 | 
						|
 | 
						|
PCI Express errors are classified into two types: correctable errors
 | 
						|
and uncorrectable errors. This classification is based on the impacts
 | 
						|
of those errors, which may result in degraded performance or function
 | 
						|
failure.
 | 
						|
 | 
						|
Correctable errors pose no impacts on the functionality of the
 | 
						|
interface. The PCI Express protocol can recover without any software
 | 
						|
intervention or any loss of data. These errors are detected and
 | 
						|
corrected by hardware. Unlike correctable errors, uncorrectable
 | 
						|
errors impact functionality of the interface. Uncorrectable errors
 | 
						|
can cause a particular transaction or a particular PCI Express link
 | 
						|
to be unreliable. Depending on those error conditions, uncorrectable
 | 
						|
errors are further classified into non-fatal errors and fatal errors.
 | 
						|
Non-fatal errors cause the particular transaction to be unreliable,
 | 
						|
but the PCI Express link itself is fully functional. Fatal errors, on
 | 
						|
the other hand, cause the link to be unreliable.
 | 
						|
 | 
						|
When AER is enabled, a PCI Express device will automatically send an
 | 
						|
error message to the PCIe root port above it when the device captures
 | 
						|
an error. The Root Port, upon receiving an error reporting message,
 | 
						|
internally processes and logs the error message in its PCI Express
 | 
						|
capability structure. Error information being logged includes storing
 | 
						|
the error reporting agent's requestor ID into the Error Source
 | 
						|
Identification Registers and setting the error bits of the Root Error
 | 
						|
Status Register accordingly. If AER error reporting is enabled in Root
 | 
						|
Error Command Register, the Root Port generates an interrupt if an
 | 
						|
error is detected.
 | 
						|
 | 
						|
Note that the errors as described above are related to the PCI Express
 | 
						|
hierarchy and links. These errors do not include any device specific
 | 
						|
errors because device specific errors will still get sent directly to
 | 
						|
the device driver.
 | 
						|
 | 
						|
3.1 Configure the AER capability structure
 | 
						|
 | 
						|
AER aware drivers of PCI Express component need change the device
 | 
						|
control registers to enable AER. They also could change AER registers,
 | 
						|
including mask and severity registers. Helper function
 | 
						|
pci_enable_pcie_error_reporting could be used to enable AER. See
 | 
						|
section 3.3.
 | 
						|
 | 
						|
3.2. Provide callbacks
 | 
						|
 | 
						|
3.2.1 callback reset_link to reset pci express link
 | 
						|
 | 
						|
This callback is used to reset the pci express physical link when a
 | 
						|
fatal error happens. The root port aer service driver provides a
 | 
						|
default reset_link function, but different upstream ports might
 | 
						|
have different specifications to reset pci express link, so all
 | 
						|
upstream ports should provide their own reset_link functions.
 | 
						|
 | 
						|
In struct pcie_port_service_driver, a new pointer, reset_link, is
 | 
						|
added.
 | 
						|
 | 
						|
pci_ers_result_t (*reset_link) (struct pci_dev *dev);
 | 
						|
 | 
						|
Section 3.2.2.2 provides more detailed info on when to call
 | 
						|
reset_link.
 | 
						|
 | 
						|
3.2.2 PCI error-recovery callbacks
 | 
						|
 | 
						|
The PCI Express AER Root driver uses error callbacks to coordinate
 | 
						|
with downstream device drivers associated with a hierarchy in question
 | 
						|
when performing error recovery actions.
 | 
						|
 | 
						|
Data struct pci_driver has a pointer, err_handler, to point to
 | 
						|
pci_error_handlers who consists of a couple of callback function
 | 
						|
pointers. AER driver follows the rules defined in
 | 
						|
pci-error-recovery.txt except pci express specific parts (e.g.
 | 
						|
reset_link). Pls. refer to pci-error-recovery.txt for detailed
 | 
						|
definitions of the callbacks.
 | 
						|
 | 
						|
Below sections specify when to call the error callback functions.
 | 
						|
 | 
						|
3.2.2.1 Correctable errors
 | 
						|
 | 
						|
Correctable errors pose no impacts on the functionality of
 | 
						|
the interface. The PCI Express protocol can recover without any
 | 
						|
software intervention or any loss of data. These errors do not
 | 
						|
require any recovery actions. The AER driver clears the device's
 | 
						|
correctable error status register accordingly and logs these errors.
 | 
						|
 | 
						|
3.2.2.2 Non-correctable (non-fatal and fatal) errors
 | 
						|
 | 
						|
If an error message indicates a non-fatal error, performing link reset
 | 
						|
at upstream is not required. The AER driver calls error_detected(dev,
 | 
						|
pci_channel_io_normal) to all drivers associated within a hierarchy in
 | 
						|
question. for example,
 | 
						|
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
 | 
						|
If Upstream port A captures an AER error, the hierarchy consists of
 | 
						|
Downstream port B and EndPoint.
 | 
						|
 | 
						|
A driver may return PCI_ERS_RESULT_CAN_RECOVER,
 | 
						|
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
 | 
						|
whether it can recover or the AER driver calls mmio_enabled as next.
 | 
						|
 | 
						|
If an error message indicates a fatal error, kernel will broadcast
 | 
						|
error_detected(dev, pci_channel_io_frozen) to all drivers within
 | 
						|
a hierarchy in question. Then, performing link reset at upstream is
 | 
						|
necessary. As different kinds of devices might use different approaches
 | 
						|
to reset link, AER port service driver is required to provide the
 | 
						|
function to reset link. Firstly, kernel looks for if the upstream
 | 
						|
component has an aer driver. If it has, kernel uses the reset_link
 | 
						|
callback of the aer driver. If the upstream component has no aer driver
 | 
						|
and the port is downstream port, we will perform a hot reset as the
 | 
						|
default by setting the Secondary Bus Reset bit of the Bridge Control
 | 
						|
register associated with the downstream port. As for upstream ports,
 | 
						|
they should provide their own aer service drivers with reset_link
 | 
						|
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
 | 
						|
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
 | 
						|
to mmio_enabled.
 | 
						|
 | 
						|
3.3 helper functions
 | 
						|
 | 
						|
3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
 | 
						|
pci_enable_pcie_error_reporting enables the device to send error
 | 
						|
messages to root port when an error is detected. Note that devices
 | 
						|
don't enable the error reporting by default, so device drivers need
 | 
						|
call this function to enable it.
 | 
						|
 | 
						|
3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
 | 
						|
pci_disable_pcie_error_reporting disables the device to send error
 | 
						|
messages to root port when an error is detected.
 | 
						|
 | 
						|
3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
 | 
						|
pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable
 | 
						|
error status register.
 | 
						|
 | 
						|
3.4 Frequent Asked Questions
 | 
						|
 | 
						|
Q: What happens if a PCI Express device driver does not provide an
 | 
						|
error recovery handler (pci_driver->err_handler is equal to NULL)?
 | 
						|
 | 
						|
A: The devices attached with the driver won't be recovered. If the
 | 
						|
error is fatal, kernel will print out warning messages. Please refer
 | 
						|
to section 3 for more information.
 | 
						|
 | 
						|
Q: What happens if an upstream port service driver does not provide
 | 
						|
callback reset_link?
 | 
						|
 | 
						|
A: Fatal error recovery will fail if the errors are reported by the
 | 
						|
upstream ports who are attached by the service driver.
 | 
						|
 | 
						|
Q: How does this infrastructure deal with driver that is not PCI
 | 
						|
Express aware?
 | 
						|
 | 
						|
A: This infrastructure calls the error callback functions of the
 | 
						|
driver when an error happens. But if the driver is not aware of
 | 
						|
PCI Express, the device might not report its own errors to root
 | 
						|
port.
 | 
						|
 | 
						|
Q: What modifications will that driver need to make it compatible
 | 
						|
with the PCI Express AER Root driver?
 | 
						|
 | 
						|
A: It could call the helper functions to enable AER in devices and
 | 
						|
cleanup uncorrectable status register. Pls. refer to section 3.3.
 | 
						|
 | 
						|
 | 
						|
4. Software error injection
 | 
						|
 | 
						|
Debugging PCIe AER error recovery code is quite difficult because it
 | 
						|
is hard to trigger real hardware errors. Software based error
 | 
						|
injection can be used to fake various kinds of PCIe errors.
 | 
						|
 | 
						|
First you should enable PCIe AER software error injection in kernel
 | 
						|
configuration, that is, following item should be in your .config.
 | 
						|
 | 
						|
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
 | 
						|
 | 
						|
After reboot with new kernel or insert the module, a device file named
 | 
						|
/dev/aer_inject should be created.
 | 
						|
 | 
						|
Then, you need a user space tool named aer-inject, which can be gotten
 | 
						|
from:
 | 
						|
    https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
 | 
						|
 | 
						|
More information about aer-inject can be found in the document comes
 | 
						|
with its source code.
 |