Crash when debugging with lldb on MacOS #4769

Open
UnityAlex opened this issue Jun 20, 2024 · 13 comments
Labels: documentation (Documentation related issue), os-mac-os-x (macOS aka OSX)

@UnityAlex

Description

When a native lldb debugger is attached to CoreCLR on macOS (ARM64), breakpoints in certain locations can cause the process to crash.

Reproduction Steps

Sample code:

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Hello, World!");
        Console.ReadKey();
        string foo = null;
        Console.WriteLine($"foo: {foo.Length}");
    }
}

The idea of the sample is to trigger the native exception handling for a null reference exception, which is where we set our breakpoint in lldb.

  1. Run the sample.
  2. Attach the lldb debugger to the process.
  3. Set a breakpoint on the function PAL_DispatchException: breakpoint set --name PAL_DispatchException
  4. Press a key in the console of the running process to trigger the exception.
  5. See the breakpoint hit in lldb, usually in some memmove on an access violation.
  6. Attempt to continue; a silent crash occurs. If you wait long enough, macOS will usually show a dialog with a crash report. It looks like there might be a stack overflow in the exception handling.

Expected behavior

No crash

Actual behavior

Silent crash.

Regression?

No response

Known Workarounds

No response

Configuration

.NET version 8.0.201
macOS 14.5
M1 ARM64
Does not happen on Windows. I haven't tried Linux yet.

Other information

If it helps, the first few frames of what I suspect is a stack overflow look like:

0   libcoreclr.dylib                         0x3289a5d4c CorUnix::GetCurrentPalThread() + 0 (thread.hpp:684) [inlined]
1   libcoreclr.dylib                         0x3289a5d4c CorUnix::InternalGetCurrentThread() + 0 (thread.hpp:689) [inlined]
2   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:428)
3   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
4   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
5   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
6   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
7   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
8   libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
9   libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)
10  libcoreclr.dylib                         0x3289a5d4c PAL_DispatchException + 36 (machexception.cpp:422)
11  libcoreclr.dylib                         0x3289a5a2c PAL_DispatchExceptionWrapper + 16 (dispatchexceptionwrapper.S:39)

This is followed by 500-ish more frames of the same thing.
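
A quick way to confirm that a crash report is dominated by a repeating cycle is to count how often each symbol appears. The following is a minimal sketch, assuming the frame lines have been saved to a file in the same column layout as above (index, image, address, symbol); the function name count_frames and the file name frames.txt are placeholders, and symbols containing spaces would need a more careful parser.

```shell
#!/bin/sh
# Count how many times each symbol appears in the frame lines of a macOS
# crash report. Assumes the layout shown above:
#   <index> <image> <address> <symbol> + <offset> (<file>:<line>)
count_frames() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ { count[$4]++ }
         END { for (sym in count) print count[sym], sym }' "$1" |
        sort -k1,1nr -k2
}

# Example: count_frames frames.txt
```

A trace like the one above should show PAL_DispatchException and PAL_DispatchExceptionWrapper with nearly equal, very large counts, which is what a mutual recursion between the two would look like.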

Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@MichalPetryka

Do you have the SOS plugin installed in your lldb?

@UnityAlex
Author

I don't. I can do that if it would help though.

@MichalPetryka

MichalPetryka commented Jun 20, 2024

I was wondering if the issue might be related to the presence of the plugin or the lack thereof.

@UnityAlex
Author

I am having difficulty getting this plugin working on my machine. When I install it following the instructions (https://github.com/dotnet/diagnostics/blob/main/documentation/installing-sos-instructions.md), it appears to break my lldb:

~ % lldb
zsh: killed     lldb

If I uninstall it (dotnet-sos uninstall), lldb works fine again. I see some mentions in the documentation that I might need to build the SOS plugin myself and install that. Do you know if that's still true for macOS M1 machines?

@tommcdon
Member

This issue is tracked on dotnet/runtime#99977.

@UnityAlex
Author

@tommcdon The issue you linked appears to be SOS plugin specific. Sorry for the delay; it took me a bit to find @lambdageek's workaround (#4551 (comment)) to get lldb working with the plugin. I can still reproduce the crash both with and without the plugin installed.

lambdageek reopened this Jun 20, 2024
@vvuk

vvuk commented Jun 20, 2024

Here's a full set of steps to reproduce:

  1. Set up a copy of lldb that can load dotnet-sos, as described here: libsosplugin.dylib: CoreCLR host crash on macOS Sonoma 14.4 on arm64 #4551 (comment)
  2. Create, build, and publish the test app:

mkdir Foo
cd Foo

dotnet new console

cat <<EOF > Program.cs
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Hello, World!");
        Console.ReadKey();
        string foo = null;
        Console.WriteLine($"foo: {foo.Length}");
    }
}
EOF

dotnet build
dotnet publish --sc

  3. In one window/tab: ./bin/Release/net8.0/osx-arm64/publish/Foo
  4. In another: ~/lldb -n Foo
  5. When lldb attaches, set a breakpoint: breakpoint set --name PAL_DispatchException (note: this seems to be required to hit the issue; without a breakpoint, I haven't been able to reproduce)
  6. Hit enter in the first window
  7. Observe crash in CLR runtime inside Foo in platform_memmove:
> ~/lldb -n Foo
Current symbol store settings:
-> Cache: /Users/vladimir/.dotnet/symbolcache
-> Server: https://msdl.microsoft.com/download/symbols/ Timeout: 4 RetryCount: 0
(lldb) process attach --name "Foo"
Process 13444 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000019c182db4 libsystem_kernel.dylib`read + 8
libsystem_kernel.dylib`read:
->  0x19c182db4 <+8>:  b.lo   0x19c182dd4               ; <+40>
    0x19c182db8 <+12>: pacibsp
    0x19c182dbc <+16>: stp    x29, x30, [sp, #-0x10]!
    0x19c182dc0 <+20>: mov    x29, sp
Target 0: (Foo) stopped.
Executable module set to "/Users/vladimir/tmp/Foo/bin/Release/net8.0/osx-arm64/publish/Foo".
Architecture set to: arm64-apple-macosx-.
(lldb) breakpoint set --name PAL_DispatchException
Breakpoint 1: 2 locations.
(lldb) c
Process 13444 resuming
Process 13444 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x16a8e3c08)
    frame #0: 0x000000019c1f3248 libsystem_platform.dylib`_platform_memmove + 168
libsystem_platform.dylib`:
->  0x19c1f3248 <+168>: stp    q2, q3, [x0]
    0x19c1f324c <+172>: subs   x2, x2, #0x40
    0x19c1f3250 <+176>: b.ls   0x19c1f326c               ; <+204>
    0x19c1f3254 <+180>: stp    q0, q1, [x3]
Target 0: (Foo) stopped.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x16a8e3c08)
  * frame #0: 0x000000019c1f3248 libsystem_platform.dylib`_platform_memmove + 168
    frame #1: 0x0000000105854414 libcoreclr.dylib`SEHExceptionThread(void*) + 1368
    frame #2: 0x000000019c1c2f94 libsystem_pthread.dylib`_pthread_start + 136
(lldb)

lambdageek added the os-mac-os-x macOS aka OSX label Jun 20, 2024
@tommcdon
Member

@vvuk thanks for providing the repro steps. We have a few clarifying questions:

  1. Does this issue only reproduce when following the directions on libsosplugin.dylib: CoreCLR host crash on macOS Sonoma 14.4 on arm64 #4551 (comment) (skipping step 1 in the repro steps above)?
  2. Does this issue reproduce when launching the app from lldb?
  3. Does this issue reproduce only when setting a breakpoint on PAL_DispatchException?

@vvuk

vvuk commented Jun 25, 2024

> Does this issue only reproduce when following the directions on libsosplugin.dylib: CoreCLR host crash on macOS Sonoma 14.4 on arm64 diagnostics#4551 (comment) (skipping step 1 in the repro steps above)?

I can reproduce it without loading libsosplugin at all, using non-modified lldb. It seems like just attaching causes an issue.

> Does this issue reproduce when launching the app from lldb?

It doesn't seem to (both with and without libsosplugin). But I've also heard that there are cases where it's not 100% reproducible like it seems to be with the steps above (though I suppose you can skip libsosplugin).

> Does this issue reproduce only when setting a breakpoint on PAL_DispatchException?

Without any breakpoints set, the debugger correctly stops in pthread_kill. If I try to set other breakpoints after attaching, for example on CallDescrWorkerInternal, then weird things happen. I think that CallDescrWorkerInternal is already on the stack so the breakpoint shouldn't be hit, but the process seems to hang instead of crashing.

@vvuk

vvuk commented Jun 25, 2024

This might already be understood, but it seems like there is a bad interaction between the Mach exception handler thread that CoreCLR creates and the mechanism by which lldb attaches to an existing process.

If I build a debug runtime, set NONPAL_TRACING=1, and run the little hello world program above, here's what happens on process launch:

NONPAL_TRACE: SEHInitializeMachExceptions: TASK PORT count 1
NONPAL_TRACE: SEHInitializeMachExceptions: TASK PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
NONPAL_TRACE: Enabling handlers for thread 00000103 exception mask 0000007e exception port 00001c03
NONPAL_TRACE: EnableMachExceptions: THREAD PORT count 1
NONPAL_TRACE: EnableMachExceptions: THREAD PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
... bunch of threads ...
Hello World  [the process waits for a keypress at this point]
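
For reference, the exception mask 0000007e in the trace above can be decoded bit by bit. This is a small sketch, assuming the standard Mach exception numbering from mach/exception_types.h (EXC_BAD_ACCESS = 1 through EXC_BREAKPOINT = 6, which matches the "EXC_BREAKPOINT (6)" lines later in this trace); the function name decode_exc_mask is made up for illustration.

```shell
#!/bin/sh
# Print the Mach exception types whose bits are set in a mask.
# Bit n of the mask corresponds to exception type n (EXC_MASK_x = 1 << EXC_x).
decode_exc_mask() {
    mask=$(( $1 ))
    n=1
    for name in EXC_BAD_ACCESS EXC_BAD_INSTRUCTION EXC_ARITHMETIC \
                EXC_EMULATION EXC_SOFTWARE EXC_BREAKPOINT; do
        if [ $(( mask & (1 << n) )) -ne 0 ]; then
            echo "$name"
        fi
        n=$(( n + 1 ))
    done
}

# Decode the mask reported in the trace
decode_exc_mask 0x7e
```

Under that numbering, 0x7e has bits 1 through 6 set, so the handler covers software and breakpoint exceptions in addition to the fault types.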

Then I attach lldb at this point and type finish. Note: finish, not continue. I need the debugger to actually manipulate the process, which is likely what setting the breakpoint on PAL_x was doing. The following trace logs show up after the finish:

NONPAL_TRACE: Enabling handlers for thread 00001f03 exception mask 0000007e exception port 00000c03
NONPAL_TRACE: EnableMachExceptions: THREAD PORT count 1
NONPAL_TRACE: EnableMachExceptions: THREAD PORT mask 0000007e handler: 00000000 behavior 00000000 flavor 0
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007307 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0x661d00019c1c355c sp 000000016cfb3fe0 fp 000000016cfb4070 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00000103
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00000103 port 00007307
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000730b to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00002903 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0xca7580019c1c355c sp 000000016d4e9a00 fp 000000016d4e9a90 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00002903
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00002903 port 0000730b
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000730f to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0xc23880019c1c355c sp 000000016d949a70 fp 000000016d949b00 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00003d03

Assert failure(PID 84850 [0x00014b72], Thread: 12211081 [0xba5389]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00003d03 port 0000730f
NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007313 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00001207 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0xaf6200019c1c355c sp 000000016d2fa090 fp 000000016d2fa120 pc 0x19c1c355c cpsr 60001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000

Assert failure(PID 84850 [0x00014b72], Thread: 12211223 [0xba5417]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: HijackFaultingThread thread 00001207
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00001207 port 00007313

Assert failure(PID 84850 [0x00014b72], Thread: 12211248 [0xba5430]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet


Assert failure(PID 84850 [0x00014b72], Thread: 12211088 [0xba5390]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 00007317 to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00007e03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0x19c1c355c sp 000000016dec1cd0 fp 000000016dec1d60 pc 0x19c1c355c cpsr 40001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00007e03
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00007e03 port 00007317

Assert failure(PID 84850 [0x00014b72], Thread: 12211503 [0xba552f]): fWasAttached
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4234
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000731b to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00003d03 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1
NONPAL_TRACE: ExceptionNotification subcode[1] = 19c1c355c
NONPAL_TRACE: ExceptionNotification actual  lr 0x19c1c355c sp 000000016d946c60 fp 000000016d946cf0 pc 0x19c1c355c cpsr 40001000
NONPAL_TRACE: ExceptionNotification far 0000000000000000 esr f2000000 exception 00000000
NONPAL_TRACE: HijackFaultingThread thread 00003d03
NONPAL_TRACE: ReplyToNotification KERN_SUCCESS thread 00003d03 port 0000731b

Assert failure(PID 84850 [0x00014b72], Thread: 12211248 [0xba5430]): pOldContext == NULL
    File: /opt/UnitySrc/u/runtime/src/coreclr/debug/ee/controller.cpp Line: 4169
    Image: /opt/UnitySrc/u/runtime/artifacts/bin/testhost/net8.0-osx-Debug-arm64/dotnet

NONPAL_TRACE: Received message EXCEPTION_RAISE_64 (00000965) from (remote) 0000731f to (local) 00000c03
NONPAL_TRACE: ExceptionNotification EXC_BREAKPOINT (6) thread 00000103 flavor 5
NONPAL_TRACE: ExceptionNotification subcode[0] = 1

Other times I tried this, I didn't get any of the assertion failures, just a stream of EXC_BREAKPOINT exception notifications. At this point lldb is still waiting for finish to finish; attempting to interact with the process gives me error: Command requires a process which is currently stopped. (because it's not stopped). If I hit enter in the process itself, I get another EXC_BREAKPOINT notice, followed by the proper EXC_BAD_ACCESS, which prints an Unhandled exception message.

The dotnet process doesn't exit at that point; it's hung, and lldb still thinks it's not stopped.
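
To make a long trace like this easier to read, the notifications can be tallied per exception type and thread. This is a rough sketch, assuming the trace output has been saved to a file (trace.log is a placeholder name) and that the "ExceptionNotification EXC_... (n) thread ..." line format matches the log above; the function name summarize_notifications is hypothetical.

```shell
#!/bin/sh
# Summarize "ExceptionNotification EXC_* (n) thread <id>" lines from a
# NONPAL_TRACE log: one output line per (exception, thread) pair with a count,
# most frequent first. Subcode and register lines are ignored.
summarize_notifications() {
    awk '$2 == "ExceptionNotification" && $3 ~ /^EXC_/ {
             seen[$3 " thread " $6]++
         }
         END { for (key in seen) print seen[key], key }' "$1" |
        sort -k1,1nr -k2
}

# Example: summarize_notifications trace.log
```

A steady stream of EXC_BREAKPOINT entries across several threads, with no matching stop in lldb, would be consistent with the runtime's handler receiving breakpoint exceptions that were meant for the debugger.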

@vvuk

vvuk commented Jun 25, 2024

Ah ha. If I set PAL_MachExceptionMode=2 (MachException_SuppressDebugging) then everything works as it should on attach. When lldb actually launches the process this is checked and exception handling doesn't grab EXC_MASK_BREAKPOINT | EXC_MASK_SOFTWARE. @tommcdon I guess this is why you were asking if the issue is reproducible if lldb launches the process?
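
As a sketch of how this could be applied when the app is started outside the debugger: the wrapper below sets PAL_MachExceptionMode=2 only for the launched process. The function name run_suppress_debugging is made up for illustration, and the described effect of mode 2 is taken from this thread rather than verified independently.

```shell
#!/bin/sh
# Run a command with PAL_MachExceptionMode=2 (MachException_SuppressDebugging)
# set only for that command, leaving the calling shell's environment untouched.
run_suppress_debugging() {
    env PAL_MachExceptionMode=2 "$@"
}

# Example, using the path from the repro steps above:
#   run_suppress_debugging ./bin/Release/net8.0/osx-arm64/publish/Foo
# then, in another terminal, attach as before:
#   lldb -n Foo
```

Scoping the variable to a single launch avoids changing the behavior of other .NET processes started from the same shell.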

@tommcdon
Member

tommcdon commented Jul 1, 2024

> Ah ha. If I set PAL_MachExceptionMode=2 (MachException_SuppressDebugging) then everything works as it should on attach. When lldb actually launches the process this is checked and exception handling doesn't grab EXC_MASK_BREAKPOINT | EXC_MASK_SOFTWARE. @tommcdon I guess this is why you were asking if the issue is reproducible if lldb launches the process?

Thanks for the details @vvuk! It seems we should document the PAL_MachExceptionMode=2 workaround which seems to disable PAL handling of breakpoint exceptions. I'll move this issue to the dotnet/diagnostics repo and mark this as a documentation issue.

tommcdon transferred this issue from dotnet/runtime Jul 1, 2024
tommcdon added documentation Documentation related issue and removed area-Diagnostics-coreclr labels Jul 1, 2024
tommcdon added this to the 9.0.0 milestone Jul 1, 2024
tommcdon self-assigned this Jul 1, 2024