PCIe Bus Errors on Hetzner Servers

Here’s an issue I’ve had with a variety of Hetzner dedicated servers. I first noticed it when I was alerted that a server with a 1TB NVMe drive had reached 100% capacity. It turned out that /var/log/syslog was filling up with the errors below, written many times per second.

less /var/log/syslog


Dec 15 20:16:37 Ubuntu-2204-jammy-amd64-base kernel: [18698.029674] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec 15 20:16:37 Ubuntu-2204-jammy-amd64-base kernel: [18698.029677] nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Dec 15 20:16:37 Ubuntu-2204-jammy-amd64-base kernel: [18698.029680] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.752410] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.752420] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.752422] nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.752425] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.796264] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.796273] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.796276] nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Dec 15 20:16:42 Ubuntu-2204-jammy-amd64-base kernel: [18702.796279] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
Dec 15 20:16:45 Ubuntu-2204-jammy-amd64-base kernel: [18706.493606] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Dec 15 20:16:45 Ubuntu-2204-jammy-amd64-base kernel: [18706.493614] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec 15 20:16:45 Ubuntu-2204-jammy-amd64-base kernel: [18706.493617] nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Dec 15 20:16:45 Ubuntu-2204-jammy-amd64-base kernel: [18706.493620] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
Dec 15 20:16:46 Ubuntu-2204-jammy-amd64-base kernel: [18707.444456] pcieport 0000:00:01.3: AER: Corrected error received: 0000:02:00.0
Dec 15 20:16:46 Ubuntu-2204-jammy-amd64-base kernel: [18707.444464] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec 15 20:16:46 Ubuntu-2204-jammy-amd64-base kernel: [18707.444467] nvme 0000:02:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Dec 15 20:16:46 Ubuntu-2204-jammy-amd64-base kernel: [18707.444470] nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Dec 15 20:16:48 Ubuntu-2204-jammy-amd64-base kernel: [18709.500278] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Dec 15 20:16:48 Ubuntu-2204-jammy-amd64-base kernel: [18709.500286] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec 15 20:16:48 Ubuntu-2204-jammy-amd64-base kernel: [18709.500289] nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Dec 15 20:16:48 Ubuntu-2204-jammy-amd64-base kernel: [18709.500291] nvme 0000:01:00.0:    [ 0] RxErr                  (First)
Dec 15 20:16:50 Ubuntu-2204-jammy-amd64-base kernel: [18710.562333] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
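To put a number on how fast these messages arrive (handy for comparing before and after a hardware fix), here is a minimal sketch that counts "PCIe Bus Error" lines per second. The function name bus_errors_per_second is my own, and it assumes the traditional syslog timestamp format shown above (month, day, time as the first three fields).

```shell
# Count "PCIe Bus Error" lines per one-second timestamp.
# Reads syslog-formatted lines on stdin; assumes the first three
# whitespace-separated fields are month, day and HH:MM:SS.
bus_errors_per_second() {
    grep 'PCIe Bus Error' \
        | awk '{print $1, $2, $3}' \
        | uniq -c
}

# Example use on the server (commented out so the snippet has no side effects):
#   bus_errors_per_second < /var/log/syslog | tail -n 20
```

Each output line is a count followed by the timestamp it belongs to. Because the log is already in chronological order, uniq -c is enough and no sort is needed.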

If you open a technical support ticket with Hetzner (which requires scheduling a 30-minute window in Central European Time, UTC+1), they’ll either clean the drive connectors or swap a part. Sometimes the errors go away completely; sometimes they only drop to about one every few seconds. In the latter case, Hetzner will tell you that’s within their acceptable limits.

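You can run dmesg | grep -i aer to check that all of the errors are being reported as corrected. Corrected errors are recovered by the PCIe hardware itself, so no data is lost, but each one is still logged.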

[123710.515154] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123710.675379] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123710.749724] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123711.700107] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123712.160531] pcieport 0000:00:01.3: AER: Corrected error received: 0000:02:00.0
[123712.467374] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123712.502386] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123713.109711] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123713.933761] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123714.493832] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123714.951897] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123716.512362] pcieport 0000:00:01.3: AER: Corrected error received: 0000:02:00.0
[123716.844129] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123716.991703] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123717.622934] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123718.658627] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123719.577563] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123720.710414] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123721.017129] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
[123721.065247] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
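Scrolling through output like this gets tedious, so here is a rough sketch of two helpers that summarise it. The function names are hypothetical, and the matching relies only on the "AER: Corrected error received" wording visible in the log lines above; other kernel versions may phrase these messages differently.

```shell
# Count corrected AER reports per PCI device, busiest device first.
# Reads kernel log lines on stdin.
count_aer_by_device() {
    grep -oE 'AER: Corrected error received: [0-9a-f:.]+' \
        | awk '{print $NF}' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Print any AER line that is NOT a "Corrected" report. This can include
# harmless messages (for example AER initialisation at boot), so read the
# output rather than treating any hit as a failure.
non_corrected_aer() {
    grep 'AER' | grep -v 'AER: Corrected' || true
}

# Example use on the server (commented out so the snippet has no side effects):
#   dmesg | count_aer_by_device
#   dmesg | non_corrected_aer
```

If non_corrected_aer prints nothing beyond start-up messages, every reported error was corrected, which matches what Hetzner support treats as acceptable.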

I did find the following answer on Google.

Top Answer

I believe this may be due to PCIe Active State Power Management that is transitioning the link to a lower power state and maybe causing the device to trigger these errors. I believe the device in question is the Sunrise Point-LP PCI Express Root Port.

Try using the pcie_aspm=off boot parameter to see if this stops the messages. Note that this will increase the power consumption of your machine as it disables the power savings.

Interesting. I’ll have to look into that. Thanks!
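For anyone who wants to try the pcie_aspm=off suggestion quoted above, here is a minimal sketch of how the parameter could be added on an Ubuntu system that boots through GRUB. I have not confirmed that it stops the errors on these servers. It assumes the kernel command line lives in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub; some images set it elsewhere (for example in a file under /etc/default/grub.d/), so check before editing. The helper name add_kernel_param is hypothetical, and sed -i here is the GNU variant.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a GRUB
# defaults file, unless it is already present on that line.
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already there.
    if grep -qE "^GRUB_CMDLINE_LINUX_DEFAULT=.*${param}" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"/\1 ${param}\"/" "$file"
}

# Example use on the server, as root (commented out so the snippet has no
# side effects). Keep a backup, regenerate the GRUB config, then reboot:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub pcie_aspm=off
#   update-grub
# After the reboot, confirm the parameter is active:
#   grep -o 'pcie_aspm=[a-z]*' /proc/cmdline
```

Once the machine is back up, rerunning dmesg | grep -i aer after some uptime shows whether the message rate actually changed.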