gpu related crashes with kernel >= 6.9.7 #309
There isn't much change between asahi-6.9.5-1 and asahi-6.9.6-1, and I don't see relevant changes. It looks like there is an issue with handling failing
I was running in the same (or similar)
asahi-6.9.7-1 contains @asahilina's GPUVM changes, so a regression caused by that is at least possible.
Hi, since I see that @mkurz runs a MacBook Pro: for information, I got this issue on an M2 Air. It is totally random but happens several times per day.
This is in drm/sched so it's less likely to be GPUVM related...
This is an impossible condition, since the job credit count is always 1 and the credit limit is 1280 or something like that. So I think there is some kind of memory corruption...
The realloc crash has some interesting strings...
This string is not from the kernel... @oliverbestmann, do you have any idea where this came from?
Also, are we sure this is reproducible with v6.9.6 in at least some cases? Because then it can't be the GPUVM stuff...
If it's reproducible with asahi-6.9.6-1, there's no obvious change which would explain why it's not in asahi-6.9.5-1 as well. Nothing in
Are these kernels built with clang/llvm by any chance? So far everyone reporting this is on something other than Fedora, and Ella specifically pointed this out on Discord:
@cyrinux, please describe which systems you use. Do you use Fedora Asahi Remix? @mkurz / @oliverbestmann, do you use LLVM or GCC to build the kernel?
I use nixos unstable with https://github.com/tpwrules/nixos-apple-silicon/ overlay. 😸
My kernel is built with GCC.
Please also report your Mesa versions, and the Rust version used for the kernel compile too. At this point I'm pretty sure this is random memory corruption, but none of us on Fedora can reproduce it so far...
I am running Arch Linux ARM with all packages up to date, thanks to @joske's pull requests: https://github.com/AsahiLinux/PKGBUILDs/pulls/joske You find the
So for me this happened when going from 6.9.6-1 to 6.9.7-1.
Btw, after upgrading llvm/clang I had to recompile Mesa.
I'm bisecting configs and running into some scary mm-related crashes that have nothing to do with the GPU. I think there is some horrible regression here that affects some kernel configs... Everyone, please post the value of these kernel configs:
For reference, on Fedora we have:
From https://github.com/joske/PKGBUILDs/blob/kernel/linux-asahi/config:
Both are the same when building 6.9.6-1 and 6.9.7-1. The only difference in config between the two kernels is: joske/PKGBUILDs@14913f3#diff-3a3fd6cbc5653e937609572c62143e181842a4a1ebdc1b55e9e2e34e6aa6c5fc
I just ran into this, also using https://github.com/tpwrules/nixos-apple-silicon/tree/6015c1e2f91896e0b7a983c2824c665af32f568a
Sorry, I really need a consistent way to reproduce this to track it down. So far I've been unable to repro it.

The crash itself makes no sense. It's memory corruption, where the drm_sched job gets clobbered with something else, and then somehow consistently after that the changes made by drm_sched directly cause a crash in the allocator, in what has to be a subsequent ioctl call, because the drm_sched stuff is the last thing the ioctl does. That it's somehow this consistent is very, very strange. I would have expected heap corruption to manifest in more varied ways after the fact. The actual lifetimes of the allocations involved are extremely simple, so I'm 99% sure this isn't a silly lifetime problem in my code (at least not as it relates to the specific structures referenced in the crashes).

I tried running the same kernel under kASAN and came up with nothing. I also tried Ella's config with kASAN, still nothing.

Best guess is there is a spurious page being freed or something like that, so memory is reused while it is still in use. I actually already ran into one of these before (fixed in 2bb1499), which would perfectly explain this kind of behavior, except for the fact that that particular one only happened on DART pagetable freeing, which only really happens when unbinding drivers (which is why we didn't notice for so long). If there is a similar bug lurking somewhere else, but it only happens sometimes, then that might explain this and the other badness.

Edit: The 52-bit VA thing is unrelated, unfortunately.
Sorry, seems like I am a bit late now, probably nothing new, but still:
Unfortunately, I just confirmed that the 52-bit problem is completely unrelated. Upstream Linux is just broken with the combination of LPA2 (52-bit support), 16K pages, and non-LPA2 hardware. Please don't build with 52-bit support. So now we're back to square one... I have no idea how to repro the GPU issue ;;
This implies that it is working fine for you on a MacBook Pro M1 with Wayland and GNOME? Running Chromium also works? What information would be helpful to you?
That's the first time I hear GNOME is involved, and also nobody mentioned Chromium until your previous post ^^;; (The OP does in fact mention the process name is chromium in the oops log, but I missed that bit...)

The more info about the setup I get the better, and if you can try more workloads (for example, WebGL tests and other browsery things) and see if you can find something that reproduces it fast, that would be very useful... Right now I'm testing Chromium on an M2 Pro Mac Mini and a bunch of maps and WebGL stuff doesn't seem to cause any issues, but this is on Fedora. If there's something about the userspace build that matters here, maybe I need to install another distro...
You are right, I only mentioned Wayland and GNOME in the issue tpwrules/nixos-apple-silicon#218; sorry for that. I just checked my previous boot logs to find everything I can. Here is a different stack trace. This one does not contain the warning about a kernel paging request:
but then a few minutes later:
Then I have one from 6.9.7:
I got this warning from Chromium in the log 3260 times:

It looks like it is not only Chromium; here I have one crash in Xwayland on 6.9.7:
Running a video conference on Zoom triggered the freeze fastest for me: it takes only a few minutes for the system to freeze.

Regarding the build: you could probably just follow the installation instructions here to get the exact same kernel build, Chromium, Wayland + GNOME (well, at least that's what Nix promises you): https://github.com/tpwrules/nixos-apple-silicon/blob/main/docs/uefi-standalone.md
I went back looking for similar crashes and found this one, possibly related, though I don't know what I was doing at the time:
I am on KDE Plasma and don't even have GNOME installed... I use Firefox, not Chromium...
While trying to reproduce this I encountered the following:
and a reboot after maybe 30 seconds. Those were the last two and only interesting lines in

Nevermind, might not be related. The kernel log survived and has the usual error 9 minutes later:
I got this error just surfing with Firefox using Plasma on Wayland and a few minutes of uptime (NixOS):
I may be able to go on a bisection adventure this weekend if nothing more turns up.
Just happened to me now on my M1 Air, although the system didn't finish writing the full crash report. Happened on first boot with the new 6.9.9 kernel, with about 3 minutes of uptime, upon loading Firefox with a bunch of tabs being restored. Interestingly, it hasn't happened again (yet) with 30 minutes of uptime. The only thing different is I didn't restore my old tabs this time.

UPDATE: Ran for several hours and then crashed again as soon as I opened VS Code (an Electron app). Interestingly, I have built a kernel without crashing, so it doesn't appear unstable from a compute perspective.
kasan hit in a kernel with
Well... that explains everything. That means this is another bug in drm_scheduler, and it has nothing to do with the GPUVM changes or our driver. It affects every GPU driver using drm_scheduler.

The bug is that it is possible for an entity to run out of jobs to run, but be about to execute a new iteration of the job work function (which would stop executing only after seeing the queue empty during an iteration). Then a new job is queued, and it's the "first" job (since the queue was empty), so

It's a really crazy race, since a whole GPU job needs to be dequeued, run to completion, and be freed, involving a huge amount of driver and firmware code, all during a few instructions in

I suspect this might be a regression introduced when the drm_scheduler was recently converted to workqueues (instead of kthreads).

@jannau I pushed a silly hack to disable that whole mechanism (we don't need it) directly to asahi-wip, can you please test/tag and push it to the bits branch? I'll try to fix this properly later ^^ (Please do not use that commit with other non-Asahi GPU drivers; they may rely on that functionality.)
I've pushed asahi-6.9.9-7 with that workaround |
Thank you. I've updated my system using @cjdell's pull request and will report back in case it crashes again. |
Works perfectly fine so far. Thanks for looking into the issue and for the quick workaround! |
Can also confirm stability with 50+ hours uptime. Love your hard work on this project. No plans on going back to macOS. 🙂 |
Also testing asahi-6.9.9-7 on ALARM and so far looks good. Thanks! |
Actually the title should be changed from `gpu related crashes with kernel >= 6.9.6` to `gpu related crashes with kernel >= 6.9.7` IMHO.
The bug actually affects all of 6.9.x and probably a few earlier versions too, it's just a coincidence that it apparently only manifested starting with 6.9.7.
I've renamed it anyway, as it was a pretty consistent coincidence.
Looks like the regression was introduced in:
I don't see any upstream users of
@robclark nouveau is using variable credits, just not
Hmm, if nouveau is using that, it makes it more complicated to revert. But that patch is fatally flawed: the whole point of a single-producer-single-consumer queue is that you have just a single producer and a single consumer. That patch violates this rule.
I suspect the correct fix is to remove the

Edit: In fact this was already proposed here, but for some reason Luben never implemented the proposed simplified
I've only looked briefly at the credit patches, but the call in
Fixes a race condition reported here: AsahiLinux/linux#309 (comment)

The whole premise of lockless access to a single-producer-single-consumer queue is that there is just a single producer and single consumer. That means we can't call drm_sched_can_queue() (which is about queueing more work to the hw, not to the spsc queue) from anywhere other than the consumer (wq).

This call in the producer is just an optimization to avoid scheduling the consuming worker if it cannot yet queue more work to the hw. It is safe to drop this optimization to avoid the race condition.

Suggested-by: Asahi Lina <[email protected]>
Fixes: a78422e ("drm/sched: implement dynamic job-flow control")
Closes: AsahiLinux/linux#309
Cc: [email protected]
Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Danilo Krummrich <[email protected]>
Tested-by: Janne Grunau <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Since updating from 6.9.5 to 6.9.6 (and 6.9.9) I get random GPU/DRM-related crashes after a few minutes of usage.
Going back to 6.9.5 brings back a stable system.