nvidia error "GPU has fallen off the bus" #3363
Comments
Had this happen on Ubuntu 22.04 (GPU has fallen off the bus) immediately after upgrading from 550 to 555 and rebooting (mistake!). |
I have been having a similar issue since applying an update from Pop Shop on 8/22. Before this update everything was running fine.
Behavior
The computer will randomly freeze, usually about 10-15 minutes after booting, and the fans will start running at full blast even under no workload. System does not respond to mouse input and I have to reboot either by holding down the power button or using Alt+SysRq+b. Output of journalctl during the crash is as follows:
System Info
cat /etc/os-release
uname -a
NVIDIA Driver Version is 555.58.02. GPU is a 16 GB NVIDIA GeForce RTX 4070 Ti Super.
Suspected cause
The contents of the update that seem to have broken the system are as follows (lines related to some packages omitted):
Of these, I imagine the culprit is |
A potential workaround is to use the 550-server version until 555 is updated:
then reboot |
The first line doesn't work for me, should it be *nvidia? Or should it be ~nvidia? |
Nope, |
OK, figured it out. The command works in bash but not zsh. I ran the above commands and rebooted. Upon rebooting the first time, it booted into a console displaying the error message shown by @esplinr above:
The second time I rebooted it successfully loaded Pop!_OS. I will follow up if freezes persist. |
It was working for a while but unfortunately the freezes persist. I started getting them again after trying to watch a YouTube video in Chromium (which is when I first noticed the issue).
Oddly, the nvidia-driver-555 was installed on 8/8 and the computer worked fine up until 8/22 when the additional updates were installed. For now, I am just removing all nvidia drivers using |
Same freeze occurs even with all nvidia drivers uninstalled. Not sure how to proceed at this point. Will probably need a support ticket.
I just thought I'd add that I have a System76 Gazelle and have been dealing with this issue for about a year and a half now. At some point not too long after I first got this laptop, I started to have the same issue you're describing. I opened a ticket and spent a while diagnosing it with System76 support, and eventually I sent my laptop in and they replaced the mainboard, and the whole process of sending it in and getting it back took a few weeks. After I got it back, I continued to have the same problem. I really couldn't afford to be without my work computer for a few more weeks, so I've just gotten used to having my laptop randomly lock up when I'm away from it.
I spent a while trying to debug it and found out that this issue happens specifically when the GPU wakes up from being in a low-power mode, and running a low-power process that's constantly touching the GPU (like
The interesting thing is that sometime recently, I realized my laptop had reached a point where it had been running over three weeks solid without freezing. I suspect that I just tried installing |
I believe I have solved the issue on my machine. I initially tried to run a live disk but was unable to see the boot menu because no video signal was being sent to the monitor before the login screen appeared. I then attempted to resolve the problem by adding 'nomodeset' to the kernel boot parameters. However, this meant that no video signal was ever output to the monitor, and I thought I had bricked my computer. After consulting the hardware manual for the Thelio Mira, I realized there was another set of dedicated HDMI/DisplayPort ports on the GPU itself. I moved my HDMI cable from the integrated graphics HDMI port to the dedicated graphics HDMI port. After this change, I was able to get video signal and see the boot menu. Moreover, the freezing issue has not returned, and I'm not getting warnings and errors in journalctl anymore. I thus upgraded to NVIDIA driver 555 again. There were also some updated system packages I installed from System76, but I don't think they are relevant.
tl;dr: I plugged my monitor into the dedicated graphics HDMI port instead of the integrated graphics HDMI port on my Thelio Mira. This solved the lack of video output at boot time, and the freezing issue has not returned. I am running NVIDIA driver 555 again.
Spoke too soon: the computer froze again, this time with no video output. The uptime was much longer this time around though.
Since you have System76 hardware, I recommend opening a support ticket: https://support.system76.com |
I'm still seeing this happen even after installing the latest updates to the NVIDIA driver from System76. I can reliably reproduce it by locking my screen without suspending the laptop; when the machine goes into low power mode it locks up and the fans go to full blast. After the reboot, the previous boot's log messages contain the same errors about GPU falling off the bus and
The problem is manageable by updating Settings -> Power -> Screen Blank to "Never" and shortening the time to Automatic Suspend. Next step for me is to open a support ticket with System76.
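For anyone who wants to check the previous boot's log for the same messages without scrolling through the whole journal, here is a minimal sketch. It assumes `journalctl -k -b -1` (kernel messages from the previous boot) works on your system and that the driver logs either the phrase from this issue's title or an `Xid` marker. The function names are my own and not part of any NVIDIA or systemd tool.

```python
import re
import subprocess

# Assumed patterns: the phrase from this issue's title, plus the "Xid" marker
# mentioned elsewhere in this thread. Adjust if your logs are worded differently.
GPU_ERROR_RE = re.compile(r"fallen off the bus|\bXid\b", re.IGNORECASE)


def find_gpu_bus_errors(lines):
    """Return the log lines that match GPU_ERROR_RE, in their original order."""
    return [line.rstrip("\n") for line in lines if GPU_ERROR_RE.search(line)]


def previous_boot_kernel_log():
    """Read kernel messages from the previous boot via journalctl."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

After a forced reboot, something like `print("\n".join(find_gpu_bus_errors(previous_boot_kernel_log())))` should list only the relevant lines. Reading the journal may need sudo or membership in the `systemd-journal` group, depending on your setup.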
Support confirmed that rolling back to the 550 driver is the recommended workaround until the 560 driver is released. I see that the 560 driver is now in the repository, so hopefully that resolves the problem. |
560 has already been released |
Update: I ended up getting an advance replacement for my machine (Thelio Mira). I believe the issue was hardware-related, as the freezing persisted even when running a live disk. After getting a replacement, I haven't had any freezing issues. However, I did notice that the CPU and GPU temperatures were intermittently running rather hot on the new machine (~85 C for a few seconds at a time) under the default 'Balanced' power profile when running heavy workloads. The fans also tend to speed up and slow down rather than maintain a steady level. The temperature issues and fan thrashing went away after switching to 'Power Saver'. I wonder if these intermittent high temperatures may have contributed to a hardware failure. |
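If you want to keep an eye on those temperature spikes, a rough sketch follows. It assumes `nvidia-smi` is installed and that `--query-gpu=temperature.gpu --format=csv,noheader,nounits` prints one integer per GPU per line. The 85 C limit is just the figure mentioned above, not a vendor threshold, and the function names are hypothetical.

```python
import subprocess


def parse_gpu_temps(output):
    """Parse one integer temperature (deg C) per non-empty line of output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def hot_gpus(temps, limit_c=85):
    """Return (gpu_index, temp) pairs whose temperature is at or above limit_c."""
    return [(i, t) for i, t in enumerate(temps) if t >= limit_c]


def read_gpu_temps():
    """Query current GPU temperatures from nvidia-smi (assumed to be on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_temps(result.stdout)
```

Calling `hot_gpus(read_gpu_temps())` in a loop with a short sleep would show whether the spikes line up with the fan speed changes. Note that `parse_gpu_temps` raises `ValueError` if the tool reports a non-numeric value for a GPU.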
Having this issue since September 3, shortly after doing a fresh reinstall and updating from NVIDIA driver version 550 to 555. Persists after the update to 560. My hardware is not damaged and is not System76 hardware. EDIT: Important details
Got the issue today, will be trying
First, logging is sane, but then in the middle of playing a Steam Proton game and turning down the graphical settings to reduce GPU temps from ~80 C to ~58 C, things run smoothly until about 5 minutes in, and everything goes choppy. I can hear audio, but my mic isn't going through. The mic wasn't just choppy: according to a friend, "It completely died," while my display and input slowed to ~5 FPS. The mouse was affected too. Logs looked like this:
.... Gets more extreme....
Error codes 6, 7, 8, and 12 become more and more consistent. GDM has a "lol" moment about 8 seconds in:
Then it sped back up after... about 5000 or so error lines, but mic audio still did not go through. Short-lived speed-up, logging calms down, but then...
EDIT 3: Oh, I dug further. I found ONE error code 14, which couldn't be corrected.
Picks back up, now failing to fetch details for some errors. After 16 seconds of starting, the nail is hit into the coffin, and my user session is completely dead:
No recovery, besides forcing the power button. Earlier, before updating on Sep 18, 2024, I had placed boot parameters and the problem didn't persist. Never had this happen in my 2023 installation of Pop!_OS. I've tried re-seating the GPU into the PCIe slot, thinking it's just physically loose. Having confirmed that the GPU and PCIe ports work fine on Windows, and that the GPU is securely seated in the slot, I suspect it's a kernel issue, specifically after going from a high-power state to a low-power state.
Implicated PCI:
These are my results from
Bridge:
GPU:
EDIT 2: 07:00.2 (USB 3.1 Host Controller) also errored out after the GPU fell off of the bus. No idea what this component does, but it only shows up AFTER the userspace is terminated non-gracefully. It's the part of the GPU that reports the GPU is completely inaccessible after the power state change.
If anyone else has issues like I do, please open |
I just checked my kernel version and found a 1:1 match with the initial issue poster's kernel version. I've had random Xid 79 errors with and without the PCIe bridge errors, so this can be replicated. My GPU is an MSI GTX 1660. It appears to be an issue stemming from the pcieport kernel drivers, as that's where the error logging starts, and the error codes line up with being failures at the PCIe bridge. The GPU can be at 55 C and still fall off the bus, so thermals aren't suspected. However, closing a demanding program (or even turning down the graphics settings from medium to low) often runs the risk of the bridge getting a bad TLP even at a GPU temp of 65 C, and then the system/session stability devolves from there. Additionally, I suspect this might be somewhat related to this kernel version, but will need further testing to see if GNOME crashes from suspend put the system in a state unstable enough to cause bad TLPs, even after restarting GDM: #3254 (comment) |
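To test the "error logging starts at pcieport" observation on your own logs, a small sketch like the one below could help. It only does a case-insensitive substring search for `pcieport` and `Xid` in lines you supply (for example, saved kernel log output from the boot that crashed). Both markers and the function names are assumptions on my part, so treat the result as a hint rather than a diagnosis.

```python
def first_index(lines, needle):
    """Index of the first line containing needle (case-insensitive), or None."""
    needle = needle.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            return i
    return None


def bridge_errors_precede_xid(lines):
    """True if a 'pcieport' line appears before the first 'Xid' line.

    Returns None when either marker is missing, since no ordering
    can be claimed in that case.
    """
    bridge = first_index(lines, "pcieport")
    xid = first_index(lines, "xid")
    if bridge is None or xid is None:
        return None
    return bridge < xid
```

A `True` result on several crashes would be consistent with the bridge reporting errors before the driver notices the GPU is gone. `None` matches the case above where Xid 79 shows up without any bridge errors.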
Distribution (run cat /etc/os-release):
Related Application and/or Package Version (run apt policy $PACKAGE NAME):
From NVIDIA Settings:
NVIDIA Driver Version: 555.58.02
From apt search system76 |grep installed:
system76-driver-nvidia/jammy,jammy,now 20.04.94~1723838773~22.04~8237cd8 all [installed]
From flatpak list:
nvidia-555-58-02 org.freedesktop.Platform.GL32.nvidia-555-58-02 1.4 user
Issue/Bug Description:
About half of the time I return to my computer after a break, the computer refuses to wake and the fans are going at full blast.
The only two times I checked the logs from before the reboot, they ended with these lines:
Steps to reproduce (if you know):
Leave the computer for more than 10 minutes, and it happens about 50% of the time.
I thought it was related to #3313 because it correlates with a suspend, but I've had it happen twice when the screen blanks but before the automatic suspend should have happened.
I've also had a couple of times where I jiggled the mouse and it appeared to recover correctly from suspend, but I didn't proceed to log back in and the machine hung with the fan at full blast.
Expected behavior:
The computer should wake up from a blank screen or suspend.
Other Notes:
My research suggests that previous NVIDIA drivers had a bug that showed similar behavior when the GPU entered a low-power state. My problem does seem correlated with when the machine is idle and reducing power consumption.