nvidia error "GPU has fallen off the bus" #3363

Open
esplinr opened this issue Aug 22, 2024 · 18 comments

Comments

@esplinr

esplinr commented Aug 22, 2024

Distribution (run cat /etc/os-release):

NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os

Related Application and/or Package Version (run apt policy $PACKAGE NAME):

From NVIDIA Settings: NVIDIA Driver Version: 555.58.02
From apt search system76 |grep installed: system76-driver-nvidia/jammy,jammy,now 20.04.94~1723838773~22.04~8237cd8 all [installed]
From flatpak list: nvidia-555-58-02 org.freedesktop.Platform.GL32.nvidia-555-58-02 1.4 user

uname -a
Linux richard 6.9.3-76060903-generic #202405300957~1721174657~22.04~abb7c06 SMP PREEMPT_DYNAMIC Wed J x86_64 x86_64 x86_64 GNU/Linux

Issue/Bug Description:
About half the time I return to my computer after a break, it refuses to wake and the fans are running at full blast.

Both times I checked the logs from before the reboot, they ended with these lines:

Aug 21 21:03:38.740382 richard kernel: workqueue: nv_drm_handle_hotplug_event [nvidia_drm] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
Aug 21 21:04:12.444523 richard kernel: snd_hda_intel 0000:01:00.1: Unable to change power state from D0 to D3hot, device inaccessible
Aug 21 21:04:12.672363 richard kernel: NVRM: GPU at PCI:0000:01:00: GPU-58eb6437-6614-ceb3-7b75-a8316586b521
Aug 21 21:04:12.672560 richard kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Aug 21 21:04:12.672615 richard kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Aug 21 21:04:13.243482 richard kernel: NVRM: Error in service of callback 
Aug 21 21:04:34.378353 richard kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
Aug 21 21:04:34.378391 richard kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
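
For anyone comparing logs across machines, here is a minimal sketch that pulls the Xid events out of kernel log lines like the ones above. The function name `find_xid_events` and the regular expression are illustrative assumptions based only on the line format in this excerpt; they are not part of any NVIDIA tool.

```python
import re

# Assumed pattern, derived from the "NVRM: Xid (PCI:...): NN," lines in this thread.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def find_xid_events(lines):
    """Return a list of (bus_address, xid_number) for each NVRM Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("bus"), int(match.group("xid"))))
    return events


# Two lines copied from the log excerpt above; only the first is an Xid line.
sample = [
    "Aug 21 21:04:12.672560 richard kernel: NVRM: Xid (PCI:0000:01:00): 79, "
    "pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "Aug 21 21:04:12.672615 richard kernel: NVRM: GPU 0000:01:00.0: "
    "GPU has fallen off the bus.",
]
print(find_xid_events(sample))
```

The input could be the saved output of `journalctl -k -b -1` (kernel messages from the previous boot), read line by line.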

Steps to reproduce (if you know):
Leave the computer for more than 10 minutes, and it happens about 50% of the time.

I thought it was related to #3313 because it correlates with a suspend, but I've had it happen twice when the screen blanks but before the automatic suspend should have happened.

I've also had a couple of times where I jiggled the mouse and it appeared to recover correctly from suspend, but I didn't proceed to log back in and the machine hung with the fan at full blast.

Expected behavior:
The computer should wake up from a blank screen or suspend.

Other Notes:
My research suggests that previous NVIDIA drivers had a bug that showed similar behavior when the GPU entered a low-power state. My problem does seem correlated with the machine being idle and reducing power consumption.

@alspitz

alspitz commented Aug 26, 2024

Had this happen on Ubuntu 22.04 (GPU has fallen off the bus) immediately after upgrading from 550 to 555 and rebooting (mistake!).

@mdbartos

I have been having a similar issue since applying an update from Pop Shop on 8/22. Before this update everything was running fine.

Behavior

The computer will randomly freeze, usually about 10-15 minutes after booting, and the fans will start running at full blast even under no workload. System does not respond to mouse input and I have to reboot either by holding down the power button or using Alt+SysRq+b.

Output of journalctl during the crash is as follows:

Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: AER: Multiple Correctable error message received from 0000:01:00.0
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:   device [8086:a70d] error status/mask=00008001/00002000
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:    [ 0] RxErr                  (First)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:    [15] HeaderOF
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: AER: Multiple Uncorrectable (Fatal) error message received from 0000:00:01.0
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:   device [8086:a70d] error status/mask=00040000/00010000
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0:    [18] MalfTLP                (First)
Aug 25 03:44:51 balthasar kernel: pcieport 0000:00:01.0: AER:   TLP Header: 40000020 010000ff fff47880 00000000
Aug 25 03:44:51 balthasar kernel: nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Aug 25 03:44:51 balthasar kernel: snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Aug 25 03:44:51 balthasar kernel: NVRM: GPU at PCI:0000:01:00: GPU-9614e587-880d-7880-9895-2a74c029fbbe
Aug 25 03:44:51 balthasar kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Aug 25 03:44:51 balthasar kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Aug 25 03:44:51 balthasar kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                  NVRM: nvidia-bug-report.sh as root to collect this data before
                                  NVRM: the NVIDIA kernel module is unloaded.
Aug 25 03:44:52 balthasar kernel: pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
Aug 25 03:44:53 balthasar kernel: pcieport 0000:00:01.0: retraining failed
Aug 25 03:44:55 balthasar kernel: pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
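
The `error status` value on the correctable AER lines is a bitmask, and the bracketed numbers the kernel prints under it are the set bit positions. As a rough sketch, the correctable status `00008001` from the log above can be decoded as follows. The bit names follow the labels the Linux kernel prints for the PCIe AER Correctable Error Status register; the helper name `decode_correctable_status` is made up for illustration, and the table does not apply to the uncorrectable status (`00040000`).

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# labelled the way the Linux kernel prints them.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable_status(status_hex):
    """Return the names of the bits set in a correctable status, lowest bit first."""
    status = int(status_hex, 16)
    return [
        name
        for bit, name in sorted(CORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


# Status value taken from the correctable error line in the log above.
print(decode_correctable_status("00008001"))
```

For `00008001` this yields RxErr and HeaderOF, matching the `[ 0]` and `[15]` lines the kernel printed.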

System Info

cat /etc/os-release

NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os

uname -a

Linux balthasar 6.9.3-76060903-generic #202405300957~1721174657~22.04~abb7c06 SMP PREEMPT_DYNAMIC Wed J x86_64 x86_64 x86_64 GNU/Linux

NVIDIA Driver Version is 555.58.02. GPU is a 16 GB NVIDIA GeForce RTX 4070 Ti Super.

Suspected cause

The contents of the update that seem to have broken the system are as follows (lines related to some packages omitted):

Start-Date: 2024-08-22  17:36:12
Commandline: packagekit role='update-packages'
Requested-By: akagi (1000)
Upgrade: ...
pop-launcher:amd64 (1.2.3~1722960871~22.04~c994240, 1.2.3~1723669139~22.04~6a1b8b9),
...
popsicle:amd64 (1.3.3~1721773298~22.04~3a87912, 1.3.3~1724174665~22.04~a473f89),
system76-io-dkms:amd64 (1.0.3~1707324885~22.04~3dd4c32, 1.0.4~1724333961~22.04~968f68c),
pop-gtk-theme:amd64 (5.5.1~1686085983~22.04~190b5cc, 5.5.1~1723827328~22.04~25ea85d),
libwayland-cursor0:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-cursor0:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
...
system76-power:amd64 (1.2.0~1722536955~22.04~9894c79, 1.2.1~1724333998~22.04~8b9184c),
busybox-static:amd64 (1:1.30.1-7ubuntu3, 1:1.30.1-7ubuntu3.1),
libwayland-server0:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-server0:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
...
libcom-err2:amd64 (1.46.5-2ubuntu1.1, 1.46.5-2ubuntu1.2),
libcom-err2:i386 (1.46.5-2ubuntu1.1, 1.46.5-2ubuntu1.2),
...
pop-gnome-shell-theme:amd64 (5.5.1~1686085983~22.04~190b5cc, 5.5.1~1723827328~22.04~25ea85d),
...
busybox-initramfs:amd64 (1:1.30.1-7ubuntu3, 1:1.30.1-7ubuntu3.1),
...
popsicle-gtk:amd64 (1.3.3~1721773298~22.04~3a87912, 1.3.3~1724174665~22.04~a473f89),
...
system76-driver:amd64 (20.04.93~1722974544~22.04~bb3c2fe, 20.04.95~1724334075~22.04~12b4d15),
...
libwayland-egl1:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-egl1:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
...
libwayland-client0:amd64 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
libwayland-client0:i386 (1.20.0-1ubuntu0.1, 1.22.0-2pop1~1722453806~22.04~accf54c),
system76-driver-nvidia:amd64 (20.04.93~1722974544~22.04~bb3c2fe, 20.04.95~1724334075~22.04~12b4d15),
...
system76-dkms:amd64 (1.0.15~1718228158~22.04~ec10d1d, 1.0.15~1723747371~22.04~341bcde),
...
intel-microcode:amd64 (3.20240514.0ubuntu0.22.04.1, 3.20240813.0ubuntu0.22.04.2),
...
End-Date: 2024-08-22  17:37:11

Of these, I imagine the culprit is system76-driver-nvidia.

@leviport
Member

A potential workaround is to use the 550-server version until 555 is updated:

sudo apt purge ~nnvidia
sudo apt install nvidia-driver-550-server

then reboot

@mdbartos

mdbartos commented Aug 26, 2024

> A potential workaround is to use the 550-server version until 555 is updated:
>
> sudo apt purge ~nnvidia
> sudo apt install nvidia-driver-550-server
>
> then reboot

The first line doesn't work for me. Should it be *nvidia or ~nvidia?

@leviport
Member

Nope, ~nnvidia. I just tested it, and it should work for you.

@mdbartos

OK, figured it out. The command works in bash but not zsh.
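
A likely explanation, which is an assumption and not confirmed by anyone in this thread: zsh tries to expand an unquoted word starting with `~` as a user name or named directory and errors out when nothing matches, whereas bash passes an unmatched `~nnvidia` through to apt unchanged. Quoting the pattern should avoid the difference. A small sketch of building the quoted command line, where the helper name `build_purge_command` is hypothetical:

```python
import shlex


def build_purge_command(pattern="~nnvidia"):
    """Return the apt purge command line with the pattern quoted for the shell."""
    # shlex.quote wraps anything containing shell-special characters
    # (here the leading "~") in single quotes.
    return "sudo apt purge " + shlex.quote(pattern)


print(build_purge_command())
```

The printed command line has the pattern wrapped in single quotes, so the shell hands it to apt as-is.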

I ran the above commands and rebooted. On the first reboot, it booted into a console displaying an error message similar to the one shown by @esplinr above:

nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
nvidia 0000:01:00.0: probe with driver nvidia failed with error -1

The second time I rebooted it successfully loaded Pop!_OS. I will follow up if freezes persist.

@mdbartos

mdbartos commented Aug 26, 2024

It was working for a while, but unfortunately the freezes persist. I started getting them again after trying to watch a YouTube video in Chromium (which is when I first noticed the issue).

Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480828:ERROR:gl_display.cc(497)] EGL Driver message (Critical) eglInitialize: glXQueryExtensionsString returned NULL
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480837:ERROR:gl_display.cc(767)] eglInitialize OpenGLES failed with error EGL_NOT_INITIALIZED
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480846:ERROR:gl_display.cc(801)] Initialization of all EGL display types failed.
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.480855:ERROR:gl_ozone_egl.cc(26)] GLDisplayEGL::Initialize failed.
Aug 26 18:03:29 balthasar org.chromium.Chromium.desktop[7854]: [106:106:0826/180329.481662:ERROR:viz_main_impl.cc(166)] Exiting GPU process due to errors during initialization
Aug 26 18:06:29 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:11:33 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:16:05 balthasar NetworkManager[874]: <info>  [1724714165.7504] dhcp6 (wlo1): state changed new lease, address=2600:1702:3830:1c10::28
Aug 26 18:16:39 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:17:01 balthasar CRON[8867]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 26 18:17:01 balthasar CRON[8868]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 26 18:17:01 balthasar CRON[8867]: pam_unix(cron:session): session closed for user root
Aug 26 18:17:17 balthasar NetworkManager[874]: <info>  [1724714237.7465] dhcp6 (enp6s0): state changed new lease, address=2600:1702:3830:1c10::39
Aug 26 18:21:44 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:26:48 balthasar geoclue[1324]: Failed to query location: Not Found
Aug 26 18:27:21 balthasar gnome-shell[2667]: Can't update stage views actor <unnamed>[<MetaWindowGroup>:0x58fd886e4370] is on because it needs an allocation.
Aug 26 18:27:21 balthasar gnome-shell[2667]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x58fd8b0b6b40] is on because it needs an allocation.
Aug 26 18:27:21 balthasar gnome-shell[2667]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x58fd8b0bade0] is on because it needs an allocation.
Aug 26 18:28:10 balthasar systemd[2500]: app-gnome-x\x2dterminal\x2demulator-5299.scope: Consumed 5min 27.419s CPU time.
Aug 26 18:28:35 balthasar org.chromium.Chromium.desktop[7876]: [128:140:0826/182835.192379:ERROR:shared_image_manager.cc(327)] SharedImageManager::ProduceMemory: Trying to Produce a Memory representation from a>
Aug 26 18:28:37 balthasar org.chromium.Chromium.desktop[7876]: [128:140:0826/182837.692810:ERROR:shared_image_manager.cc(327)] SharedImageManager::ProduceMemory: Trying to Produce a Memory representation from a>
Aug 26 18:29:00 balthasar kernel: NVRM: GPU at PCI:0000:01:00: GPU-9614e587-880d-7880-9895-2a74c029fbbe
Aug 26 18:29:00 balthasar kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Aug 26 18:29:00 balthasar kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Aug 26 18:29:00 balthasar kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                  NVRM: nvidia-bug-report.sh as root to collect this data before
                                  NVRM: the NVIDIA kernel module is unloaded.

Oddly, nvidia-driver-555 was installed on 8/8 and the computer worked fine up until 8/22, when the additional updates were installed.

For now, I am just removing all nvidia drivers using sudo apt remove ~nnvidia and seeing if the system can stay up.

@mdbartos

Same freeze occurs even with all nvidia drivers uninstalled. Not sure how to proceed at this point. Will probably need a support ticket.

Aug 26 19:08:11 balthasar kernel: nouveau 0000:01:00.0: timeout
Aug 26 19:08:11 balthasar kernel: WARNING: CPU: 4 PID: 2541 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmtu102.c:45 tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel: Modules linked in: tls rfcomm snd_seq_dummy snd_hrtimer nvme_fabrics ccm cmac algif_hash algif_skcipher af_alg zstd intel_rapl_msr intel_rapl_common intel_uncore_frequency inte>
Aug 26 19:08:11 balthasar kernel:  ecdh_generic snd_seq_device iTCO_wdt intel_pmc_bxt cfg80211 hid_multitouch mtd system76_thelio_io(OE) ecc bfq snd_timer joydev intel_cstate input_leds mei_hdcp iTCO_vendor_sup>
Aug 26 19:08:11 balthasar kernel:  pinctrl_alderlake aesni_intel crypto_simd cryptd
Aug 26 19:08:11 balthasar kernel: CPU: 4 PID: 2541 Comm: Xorg Tainted: G        W  OE      6.9.3-76060903-generic #202405300957~1721174657~22.04~abb7c06
Aug 26 19:08:11 balthasar kernel: Hardware name: System76 Thelio Mira/Thelio Mira, BIOS FJd Z5 06/12/2024
Aug 26 19:08:11 balthasar kernel: RIP: 0010:tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel: Code: 8b 40 10 48 8b 78 10 48 8b 5f 50 48 85 db 75 03 48 8b 1f e8 bc b5 1e f1 48 89 da 48 c7 c7 62 7a 62 c1 48 89 c6 e8 fa 1c 71 f0 <0f> 0b eb 88 e8 d1 44 83 f1 90 90 90 90 90 >
Aug 26 19:08:11 balthasar kernel: RSP: 0018:ffffb8444c82f560 EFLAGS: 00010246
Aug 26 19:08:11 balthasar kernel: RAX: 0000000000000000 RBX: ffff8bacc4580500 RCX: 0000000000000000
Aug 26 19:08:11 balthasar kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Aug 26 19:08:11 balthasar kernel: RBP: ffffb8444c82f5a8 R08: 0000000000000000 R09: 0000000000000000
Aug 26 19:08:11 balthasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8bacc23e8000
Aug 26 19:08:11 balthasar kernel: R13: 0000000080000001 R14: 0000000000000000 R15: 0000000000000001
Aug 26 19:08:11 balthasar kernel: FS:  00007369e09d4a80(0000) GS:ffff8bcbfee00000(0000) knlGS:0000000000000000
Aug 26 19:08:11 balthasar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 26 19:08:11 balthasar kernel: CR2: 00007369e059d000 CR3: 00000001622d2000 CR4: 0000000000f50ef0
Aug 26 19:08:11 balthasar kernel: PKRU: 55555554
Aug 26 19:08:11 balthasar kernel: Call Trace:
Aug 26 19:08:11 balthasar kernel:  <TASK>
Aug 26 19:08:11 balthasar kernel:  ? show_regs+0x6c/0x80
Aug 26 19:08:11 balthasar kernel:  ? __warn+0x88/0x140
Aug 26 19:08:11 balthasar kernel:  ? tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? report_bug+0x182/0x1b0
Aug 26 19:08:11 balthasar kernel:  ? handle_bug+0x46/0x90
Aug 26 19:08:11 balthasar kernel:  ? exc_invalid_op+0x18/0x80
Aug 26 19:08:11 balthasar kernel:  ? asm_exc_invalid_op+0x1b/0x20
Aug 26 19:08:11 balthasar kernel:  ? tu102_vmm_flush+0x176/0x180 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_iter.constprop.0+0x3d5/0x7d0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_gp100_vmm_pgt_dma+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_gp100_vmm_pgt_dma+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_ptes_get_map+0x103/0x140 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_gp100_vmm_pgt_dma+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? nvkm_vmm_map_valid+0xcb/0x210 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_map_locked+0x228/0x3c0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? nvkm_ioctl_new+0x1cc/0x2e0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_vmm_map+0x9e/0x100 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_mem_map_dma+0x57/0x90 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_uvmm_mthd_map.isra.0+0x23b/0x3d0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_uvmm_mthd+0x9e/0x540 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_object_mthd+0x17/0x40 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_ioctl_mthd+0x5d/0xc0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_ioctl+0x132/0x2a0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvkm_client_ioctl+0xe/0x20 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvif_object_mthd+0xd8/0x220 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nvif_vmm_map+0x87/0x150 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_mem_map+0xab/0x100 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_vma_new+0x223/0x250 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_gem_object_open+0x1ce/0x1f0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  drm_gem_handle_create_tail+0xd4/0x1a0
Aug 26 19:08:11 balthasar kernel:  drm_gem_handle_create+0x35/0x50
Aug 26 19:08:11 balthasar kernel:  nouveau_gem_ioctl_new+0xdd/0x170 [nouveau]
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nouveau_gem_ioctl_new+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  drm_ioctl_kernel+0xb9/0x120
Aug 26 19:08:11 balthasar kernel:  drm_ioctl+0x301/0x5a0
Aug 26 19:08:11 balthasar kernel:  ? __pfx_nouveau_gem_ioctl_new+0x10/0x10 [nouveau]
Aug 26 19:08:11 balthasar kernel:  nouveau_drm_ioctl+0x61/0xc0 [nouveau]
Aug 26 19:08:11 balthasar kernel:  __x64_sys_ioctl+0xa0/0xf0
Aug 26 19:08:11 balthasar kernel:  x64_sys_call+0xa68/0x24b0
Aug 26 19:08:11 balthasar kernel:  do_syscall_64+0x80/0x170
Aug 26 19:08:11 balthasar kernel:  ? count_memcg_events.constprop.0+0x2a/0x50
Aug 26 19:08:11 balthasar kernel:  ? handle_mm_fault+0xaf/0x340
Aug 26 19:08:11 balthasar kernel:  ? do_user_addr_fault+0x18d/0x690
Aug 26 19:08:11 balthasar kernel:  ? irqentry_exit_to_user_mode+0x76/0x270
Aug 26 19:08:11 balthasar kernel:  ? irqentry_exit+0x43/0x50
Aug 26 19:08:11 balthasar kernel:  ? exc_page_fault+0x93/0x1b0
Aug 26 19:08:11 balthasar kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Aug 26 19:08:11 balthasar kernel: RIP: 0033:0x7369e0d1a94f
Aug 26 19:08:11 balthasar kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 >
Aug 26 19:08:11 balthasar kernel: RSP: 002b:00007fff63d87dc0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 26 19:08:11 balthasar kernel: RAX: ffffffffffffffda RBX: 00007fff63d87e70 RCX: 00007369e0d1a94f
Aug 26 19:08:11 balthasar kernel: RDX: 00007fff63d87e70 RSI: 00000000c0306480 RDI: 0000000000000012
Aug 26 19:08:11 balthasar kernel: RBP: 00000000c0306480 R08: 00006017eabeb010 R09: 00006017ec4f6fa0
Aug 26 19:08:11 balthasar kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 00006017eac2f910
Aug 26 19:08:11 balthasar kernel: R13: 0000000000000012 R14: 00007fff63d87e70 R15: 0000000000000900
Aug 26 19:08:11 balthasar kernel:  </TASK>
Aug 26 19:08:11 balthasar kernel: ---[ end trace 0000000000000000 ]---
Aug 26 19:08:11 balthasar kernel: nouveau 0000:01:00.0: timer: stalled at ffffffffffffffff

@pjreed

pjreed commented Aug 28, 2024

I just thought I'd add that I have a System76 Gazelle and have been dealing with this issue for about a year and a half now.

At some point not too long after I first got this laptop, I started to have the same issue you're describing. I opened a ticket and spent a while diagnosing it with System76 support, and eventually I sent my laptop in and they replaced the mainboard, and the whole process of sending it in and getting it back took a few weeks. After I got it back, I continued to have the same problem. I really couldn't afford to be without my work computer for a few more weeks, so I've just gotten used to having my laptop randomly lock up when I'm away from it.

I spent a while trying to debug it and found out that this issue happens specifically when the GPU wakes up from being in a low-power mode, and running a low-power process that's constantly touching the GPU (like glxgears) seems to alleviate the issue to a degree. Without it, my laptop often locks up at least once a day, sometimes more often, and on occasion even while I'm using it; if I just leave glxgears running in a corner, it will often be fine for several days at a time, sometimes over a week.

The interesting thing is that sometime recently, I realized my laptop had reached a point where it had been running over three weeks solid without freezing. I suspect that nvidia-driver-550 specifically may have done something to help, because I updated to nvidia-driver-555 a week ago and the problem suddenly resumed; now I'm getting freezes regularly again.

I just tried installing nvidia-driver-550-server, and I'm running on that now. No freezes yet, but it's only been 30 minutes, so it remains to be seen if that will work as well as nvidia-driver-550. I really wish System76 would preserve at least their last few releases on their apt server...

@mdbartos

I believe I have solved the issue on my machine.

I initially tried to run a live disk but was unable to see the boot menu because no video signal was being sent to the monitor before the login screen appeared. I then attempted to resolve the problem by adding 'nomodeset' to the kernel boot parameters. However, this made it so that video signal was never output to the monitor at all, and I thought I had bricked my computer.

After consulting the hardware manual for the Thelio Mira, I realized there was another set of dedicated HDMI/DisplayPort ports on the GPU itself. I moved my HDMI cable from the integrated graphics HDMI port to the dedicated graphics HDMI port. After this change, I was able to get video signal and see the boot menu.

Moreover, after this change, the freezing issue has not returned, and I'm not getting warnings and errors in journalctl anymore. I thus upgraded to NVIDIA driver 555 again. There were also some updated system packages I installed from System76, but I don't think they are relevant.

tl;dr: I plugged my monitor into the dedicated graphics HDMI port instead of the integrated graphics HDMI port on my Thelio Mira. This solved the lack of video output at boot time, and the freezing issue has not returned. I am running NVIDIA driver 555 again.

@mdbartos

Spoke too soon: the computer froze again, this time with no video output. The uptime was much longer this time around though.

@leviport
Member

Since you have System76 hardware, I recommend opening a support ticket: https://support.system76.com

@esplinr
Author

esplinr commented Sep 10, 2024

I'm still seeing this happen even after installing the latest updates to the NVIDIA driver from System76.

I can reliably reproduce it by locking my screen without suspending the laptop; when the machine goes into low power mode it locks up and the fans go to full blast. After the reboot, the previous boot's log messages contain the same errors about GPU falling off the bus and nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state.

The problem is manageable by setting Settings -> Power -> Screen Blank to "Never" and shortening the time to Automatic Suspend.

Next step for me is to open a support ticket with System76.

@esplinr
Author

esplinr commented Sep 16, 2024

Support confirmed that rolling back to the 550 driver is the recommended workaround until the 560 driver is released.

I see that the 560 driver is now in the repository, so hopefully that resolves the problem.

@mmstick
Member

mmstick commented Sep 16, 2024

560 has already been released

@mdbartos

Update: I ended up getting an advance replacement for my machine (Thelio Mira). I believe the issue was hardware-related, as the freezing persisted even when running a live disk. After getting a replacement, I haven't had any freezing issues.

However, I did notice that the CPU and GPU temperatures were intermittently running rather hot on the new machine (~85 C for a few seconds at a time) under the default 'Balanced' power profile when running heavy workloads. The fans also tend to speed up and slow down rather than maintain a steady level. The temperature issues and fan thrashing went away after switching to 'Power Saver'. I wonder if these intermittent high temperatures may have contributed to a hardware failure.

@Vetpetmon

Vetpetmon commented Sep 19, 2024

Having this issue since September 3, shortly after doing a fresh reinstall and updating from NVIDIA driver version 550 to 555. Persists after the update to 560. My hardware is not damaged and is not System76 hardware.

EDIT: Important details

GPU Driver version: 560.35.03
CUDA version: 12.6

Kernel: linux-image-6.9.3-76060903-generic             6.9.3-76060903.202405300957~1721174657~22.04~abb7c06         amd64        Linux kernel image for version 6.9.3 on 64 bit x86 SMP

Motherboard vendor: ASUSTeK COMPUTER INC.
Motherboard product: PRIME A320M-K

Firmware (BIOS) vendor: American Megatrends Inc.
Firmware version: 5216
Firmware date: 08/30/2019
Boot mode: uefi

Got the issue today. I will be trying pcie_aspm=off pci=nommconf in /boot/efi/loader/entries.
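
After editing the boot entry and rebooting, it can be worth confirming that the parameters actually reached the kernel. A minimal sketch, assuming the running kernel's command line is readable from `/proc/cmdline` (standard on Linux); the function names and the example command line are hypothetical:

```python
def missing_kernel_params(cmdline, wanted=("pcie_aspm=off", "pci=nommconf")):
    """Return the wanted parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line for illustration; on a real system use read_cmdline().
example = "root=UUID=example ro quiet splash pcie_aspm=off"
print(missing_kernel_params(example))
```

An empty list means both parameters are active; anything listed did not make it onto the command line.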

At first the logs look normal. While playing a Steam Proton game, I turned down the graphics settings to reduce GPU temperature from ~80 C to ~58 C. Things ran smoothly for about 5 minutes, and then everything went choppy. I could still hear audio, but my mic was not going through: according to a friend it was not choppy, "It completely died". Meanwhile my display and input slowed to ~5 FPS, and the mouse was affected too.

Logs looked like this:

Sep 19 16:18:02 bubz kernel: [68183.615790] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.615810] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.615818] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:02 bubz kernel: [68183.615826] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.637883] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.637897] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.637902] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.637907] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.726186] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.726207] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.726214] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:02 bubz kernel: [68183.726223] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.770080] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.770097] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.770104] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.770112] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.792116] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.792126] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.792129] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.792133] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.803143] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.803156] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:02 bubz kernel: [68183.803160] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:02 bubz kernel: [68183.803164] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.814216] pcieport 0000:00:03.1: AER: Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.814231] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.814236] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:02 bubz kernel: [68183.814242] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.825183] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:02 bubz kernel: [68183.825204] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.825209] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001040/00006000
Sep 19 16:18:02 bubz kernel: [68183.825215] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:02 bubz kernel: [68183.825220] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.825226] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.825231] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:02 bubz kernel: [68183.825236] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:02 bubz kernel: [68183.825243] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:02 bubz kernel: [68183.825247] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:02 bubz kernel: [68183.825252] snd_hda_intel 0000:07:00.1:    [12] Timeout               

.... Gets more extreme....

Sep 19 16:18:04 bubz kernel: [68185.875052] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:04 bubz kernel: [68185.875237] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.875244] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000011c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.875252] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.875259] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.875265] pcieport 0000:00:03.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.875270] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.875295] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.875301] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.875308] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.875450] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.875457] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.875463] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.886092] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
Sep 19 16:18:04 bubz kernel: [68185.886277] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.886280] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000010c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.886284] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.886287] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.886290] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.886352] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.886355] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.886359] nvidia 0000:07:00.0:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.886362] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.886365] nvidia 0000:07:00.0: AER:   Error of this Agent is reported first
Sep 19 16:18:04 bubz kernel: [68185.886669] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.886672] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.886675] snd_hda_intel 0000:07:00.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.886678] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.897099] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
Sep 19 16:18:04 bubz kernel: [68185.897184] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.897192] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000011c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.897201] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.897208] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.897215] pcieport 0000:00:03.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.897222] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.897243] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.897251] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.897259] nvidia 0000:07:00.0:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.897266] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.897273] nvidia 0000:07:00.0: AER:   Error of this Agent is reported first
Sep 19 16:18:04 bubz kernel: [68185.897375] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.897382] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001100/0000a000
Sep 19 16:18:04 bubz kernel: [68185.897406] snd_hda_intel 0000:07:00.1:    [ 8] Rollover              
Sep 19 16:18:04 bubz kernel: [68185.897413] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.908150] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:04 bubz kernel: [68185.908740] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.908748] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000010c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.908756] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.908762] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.908768] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.908777] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.908784] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.908790] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.908799] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.908804] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:04 bubz kernel: [68185.908811] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:04 bubz kernel: [68185.919143] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
Sep 19 16:18:04 bubz kernel: [68185.920895] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:04 bubz kernel: [68185.920900] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=000010c0/00006000
Sep 19 16:18:04 bubz kernel: [68185.920904] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:04 bubz kernel: [68185.920907] pcieport 0000:00:03.1:    [ 7] BadDLLP               
Sep 19 16:18:04 bubz kernel: [68185.920909] pcieport 0000:00:03.1:    [12] Timeout          

AER status bits 6 (BadTLP), 7 (BadDLLP), 8 (Rollover), and 12 (Timeout) show up more and more consistently. GDM has a "lol" moment about 8 seconds in:

Sep 19 16:18:14 bubz /usr/libexec/gdm-x-session[2026]: (EE) event3  - SINOWEALTH Game Mouse: client bug: event processing lagging behind by 376ms, your system is too slow

Then it sped back up after about 5000 or so error lines, but mic audio still did not go through. The speed-up was short-lived: logging calms down, but then...

Sep 19 16:18:18 bubz kernel: [68199.145805] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.145890] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:18 bubz kernel: [68199.145894] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Sep 19 16:18:18 bubz kernel: [68199.145898] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:18 bubz kernel: [68199.146058] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:18 bubz kernel: [68199.146061] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:18 bubz kernel: [68199.146065] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:18 bubz kernel: [68199.146071] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:18 bubz kernel: [68199.146074] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:18 bubz kernel: [68199.146077] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:18 bubz kernel: [68199.146096] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146106] pcieport 0000:00:03.1: AER: found no error details for 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146110] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146122] pcieport 0000:00:03.1: AER: found no error details for 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146125] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:18 bubz kernel: [68199.146177] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
Sep 19 16:18:18 bubz kernel: [68199.146180] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Sep 19 16:18:18 bubz kernel: [68199.146184] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:18 bubz kernel: [68199.146189] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:07:00.0
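
For anyone reading along: the bracketed numbers in these lines ([ 6], [ 7], [ 8], [12]) are bit positions in the AER Correctable Error Status register, and the kernel only lists bits that are set in `status` and not set in `mask`. Here is a minimal Python sketch of that decoding, assuming the short names Linux prints for correctable errors. The function name `decode_correctable` is made up for illustration.

```python
# Short names Linux prints for bits in the AER Correctable Error Status
# register. Any bit not listed here is treated as reserved.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status_mask: str) -> list[tuple[int, str]]:
    """Decode a 'status/mask' pair such as '000011c0/00006000'.

    Returns (bit, name) for every bit set in status and clear in mask.
    """
    status_hex, mask_hex = status_mask.split("/")
    reported = int(status_hex, 16) & ~int(mask_hex, 16)
    return [
        (bit, CORRECTABLE_BITS.get(bit, "Reserved"))
        for bit in range(32)
        if reported & (1 << bit)
    ]


if __name__ == "__main__":
    # Values copied from the "error status/mask=" lines in the log.
    for pair in ("00000040/00006000", "000011c0/00006000"):
        print(pair, decode_correctable(pair))
```

This only covers the correctable register. Uncorrectable errors use a different status register with its own bit names.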

EDIT 3: Oh, I dug further. I found ONE error with bit 14 (CmpltTO, a completion timeout), which was reported as Uncorrectable (Non-Fatal) and could not be recovered.

Sep 19 16:18:24 bubz kernel: [68205.828117] nvidia 0000:07:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:24 bubz kernel: [68205.828120] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00001000/0000a000
Sep 19 16:18:24 bubz kernel: [68205.828123] nvidia 0000:07:00.0:    [12] Timeout               
Sep 19 16:18:24 bubz kernel: [68205.828223] snd_hda_intel 0000:07:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:24 bubz kernel: [68205.828225] snd_hda_intel 0000:07:00.1:   device [10de:1aeb] error status/mask=00001000/0000a000
Sep 19 16:18:24 bubz kernel: [68205.828228] snd_hda_intel 0000:07:00.1:    [12] Timeout               
Sep 19 16:18:24 bubz kernel: [68205.828302] pcieport 0000:00:03.1: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:07:00.0
Sep 19 16:18:24 bubz kernel: [68205.828423] nvidia 0000:07:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
Sep 19 16:18:24 bubz kernel: [68205.828427] nvidia 0000:07:00.0:   device [10de:2184] error status/mask=00004000/00000000
Sep 19 16:18:24 bubz kernel: [68205.828430] nvidia 0000:07:00.0:    [14] CmpltTO                (First)
Sep 19 16:18:24 bubz kernel: [68205.860406] nvidia 0000:07:00.0: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860412] snd_hda_intel 0000:07:00.1: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860414] xhci_hcd 0000:07:00.2: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860416] pci 0000:07:00.3: AER: can't recover (no error_detected callback)
Sep 19 16:18:24 bubz kernel: [68205.860445] pcieport 0000:00:03.1: AER: device recovery failed
Sep 19 16:18:24 bubz kernel: [68205.860448] pcieport 0000:00:03.1: AER: Multiple Correctable error message receive

The errors pick back up, now failing to fetch details for some of them. About 16 seconds after it started, the final nail goes into the coffin, and my user session is completely dead:

Sep 19 16:18:28 bubz kernel: [68209.812231] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:28 bubz kernel: [68209.855497] NVRM: GPU at PCI:0000:07:00: GPU-73994a87-f3a9-e97c-5add-f6a9813a6033
Sep 19 16:18:28 bubz kernel: [68209.855502] NVRM: Xid (PCI:0000:07:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 19 16:18:28 bubz kernel: [68209.855518] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Sep 19 16:18:28 bubz kernel: [68209.855530] NVRM: A GPU crash dump has been created. If possible, please run
Sep 19 16:18:28 bubz kernel: [68209.855530] NVRM: nvidia-bug-report.sh as root to collect this data before
Sep 19 16:18:28 bubz kernel: [68209.855530] NVRM: the NVIDIA kernel module is unloaded.
Sep 19 16:18:28 bubz kernel: [68209.855638] pcieport 0000:00:03.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Sep 19 16:18:28 bubz kernel: [68209.855644] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00003040/00006000
Sep 19 16:18:28 bubz kernel: [68209.855648] pcieport 0000:00:03.1:    [ 6] BadTLP                
Sep 19 16:18:28 bubz kernel: [68209.855652] pcieport 0000:00:03.1:    [12] Timeout               
Sep 19 16:18:28 bubz kernel: [68209.855659] pcieport 0000:00:03.1: AER: Multiple Correctable error message received from 0000:00:00.0
Sep 19 16:18:28 bubz kernel: [68209.855679] pcieport 0000:00:03.1: AER: found no error details for 0000:00:00.0
Sep 19 16:18:28 bubz kernel: [68209.959602] xhci_hcd 0000:07:00.2: Unable to change power state from D3hot to D0, device inaccessible
Sep 19 16:18:28 bubz kernel: [68210.031632] xhci_hcd 0000:07:00.2: Unable to change power state from D3cold to D0, device inaccessible
Sep 19 16:18:28 bubz kernel: [68210.031649] xhci_hcd 0000:07:00.2: Controller not ready at resume -19
Sep 19 16:18:28 bubz kernel: [68210.031652] xhci_hcd 0000:07:00.2: PCI post-resume error -19!
Sep 19 16:18:28 bubz kernel: [68210.031656] xhci_hcd 0000:07:00.2: HC died; cleaning up
Sep 19 16:18:42 bubz gnome-shell[2161]: Window manager warning: Failed to start restart helper: Failed to execute child process “/usr/libexec/mutter-restart-helper” (No such file or directory)
Sep 19 16:18:42 bubz kernel: [68223.434374] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:6:0:0x0000000f
Sep 19 16:18:42 bubz kernel: [68223.434389] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:4:0:0x0000000f
Sep 19 16:18:47 bubz /usr/libexec/gdm-x-session[2026]: (WW) NVIDIA: Wait for channel idle timed out.

No recovery, short of forcing a shutdown with the power button.

Earlier, before updating on Sep 18, 2024, I had set kernel boot parameters and the problem did not come back. This never happened in my 2023 installation of Pop!_OS.

I've tried re-seating the GPU in the PCIe slot, thinking it was just physically loose. Having confirmed that the GPU and PCIe ports work fine on Windows, and that the GPU is securely seated in the slot, I suspect a kernel issue, specifically when the GPU goes from a high-power state to a low-power state.
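
Since the suspicion is about power state transitions, one setting that might be worth recording alongside the logs is the kernel's PCIe ASPM policy. This is only a diagnostic sketch: nothing in the logs above shows that ASPM is involved, and checking it is my assumption. The helper name `active_policy` is hypothetical. The policy file lists all policies and marks the active one with square brackets.

```python
from pathlib import Path
from typing import Optional

# Exposed by the pcie_aspm module parameter on kernels built with ASPM support.
ASPM_POLICY = Path("/sys/module/pcie_aspm/parameters/policy")


def active_policy(text: str) -> Optional[str]:
    """Return the bracketed (active) entry from the policy file contents.

    Example contents: 'default performance [powersave] powersupersave'
    """
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


if __name__ == "__main__":
    if ASPM_POLICY.exists():
        print("ASPM policy:", active_policy(ASPM_POLICY.read_text()))
    else:
        print("pcie_aspm policy file not found on this system")
```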

Implicated PCI:

These are my results from sudo lspci -v:

Bridge:

00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 27, IOMMU group 4
	Bus: primary=00, secondary=07, subordinate=07, sec-latency=0
	I/O behind bridge: 0000e000-0000efff [size=4K]
	Memory behind bridge: f5000000-f60fffff [size=17M]
	Prefetchable memory behind bridge: 00000000e0000000-00000000f20fffff [size=289M]
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Root Port (Slot+), MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [c0] Subsystem: ASUSTeK Computer Inc. Family 17h (Models 00h-0fh) PCIe GPP Bridge
	Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [270] Secondary PCI Express
	Capabilities: [2a0] Access Control Services
	Capabilities: [370] L1 PM Substates
	Kernel driver in use: pcieport

GPU:

07:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Micro-Star International Co., Ltd. [MSI] TU116 [GeForce GTX 1660]
	Flags: bus master, fast devsel, latency 0, IRQ 68, IOMMU group 13
	Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Memory at f0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

EDIT 2: 07:00.2 (USB 3.1 Host Controller) also errored out after the GPU fell off the bus. I have no idea what this component is used for, but it only shows up in the log AFTER userspace is terminated non-gracefully. It is the function on the GPU that reports the device as completely inaccessible after the power state change.

07:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
	Subsystem: Micro-Star International Co., Ltd. [MSI] TU116 USB 3.1 Host Controller
	Flags: fast devsel, IRQ 51, IOMMU group 13
	Memory at f2000000 (64-bit, prefetchable) [size=256K]
	Memory at f2040000 (64-bit, prefetchable) [size=64K]
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci

If anyone else has issues like I do, please open /var/log/syslog immediately after rebooting.
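
To make comparing reports easier, here is a rough Python sketch that counts AER error bits per device and lists any NVIDIA Xid events in a saved log. It assumes the syslog line format shown in this comment (a `kernel: [timestamp]` prefix), so it may miss lines in other formats such as plain `journalctl` output. The name `summarize` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches AER detail lines such as:
#   "... kernel: [68183.792133] pcieport 0000:00:03.1:    [ 6] BadTLP"
AER_BIT = re.compile(
    r"kernel: \[\s*[\d.]+\] (?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"\s+\[\s*(?P<bit>\d+)\] (?P<name>\w+)"
)
# Matches NVIDIA Xid lines such as:
#   "NVRM: Xid (PCI:0000:07:00): 79, pid=..."
XID = re.compile(r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-f:.]+)\): (?P<xid>\d+)")


def summarize(lines):
    """Return (Counter keyed by (device, error name), list of (device, xid))."""
    aer_counts = Counter()
    xids = []
    for line in lines:
        match = AER_BIT.search(line)
        if match:
            aer_counts[(match["bdf"], match["name"])] += 1
            continue
        match = XID.search(line)
        if match:
            xids.append((match["bdf"], int(match["xid"])))
    return aer_counts, xids


# Only runs when a log path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        counts, xid_events = summarize(log)
    for (bdf, name), count in counts.most_common():
        print(f"{bdf} {name}: {count}")
    for bdf, xid in xid_events:
        print(f"Xid {xid} on {bdf}")
```

Saved as a script and given a log path such as /var/log/syslog as its only argument, it prints one line per device and error name, followed by any Xid events.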

Vetpetmon commented Sep 20, 2024

Please push out a fix in a newer kernel version! (Or instructions for getting the August 2023 (or Q1/Q2 2024) kernel back would be appreciated. I unfortunately lost those while doing a fresh reinstall of Pop!_OS in August 2024, and that kernel version worked amazingly well with my hardware!) EDIT: linux-generic and all the other related kernel packages just showed up as upgradable. I cannot wait to see if this is fixed tomorrow morning!

I just checked my kernel version and it is a 1:1 match with the original poster's kernel version. I've had random Xid 79 errors both with and without the PCIe bridge errors, so this can be replicated.

My GPU is an MSI GTX 1660. The issue appears to stem from the pcieport kernel driver, as that's where the error logging starts, and the error codes line up with failures at the PCIe bridge. The GPU can be at 55 C and still fall off the bus, so I don't suspect thermals. Often, though, closing a demanding program (or even turning the graphics settings down from medium to low) runs the risk of the bridge getting a bad TLP, even at a GPU temp of 65 C, and then system/session stability devolves from there.

Additionally, I suspect this might be somewhat related to this kernel version, but it will need further testing to see whether GNOME crashes from suspend put the system in a state unstable enough to cause bad TLPs, even after restarting GDM: #3254 (comment)
