coredump when run the example #40

Open
louishust opened this issue Nov 15, 2022 · 20 comments
Labels
bug Something isn't working

Comments

@louishust

Running the example produces a core dump:
```
OMP_PROC_BIND=TRUE numaprof ./omp-loop
C: [tid:4173805] Tool (or Pin) caused signal 11 at PC 0x7f7fb6ba1ad4
/root/loushuai/numaprof-1.1.3/build/install/bin/numaprof: line 78: 4173805 Segmentation fault      (core dumped) ${PINTOOL_PIN} -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@"
```

Running omp-loop directly works fine:
```
$ time ./omp-loop
real    0m1.878s
user    3m8.540s
sys     0m2.472s
```

numaprof: numaprof-1.1.3.tar
pin: pin-3.24-98612-g6bd5931f2-gcc-linux

Linux kernel (uname -r): 5.19.8



@svalat
Member

svalat commented Nov 15, 2022

Hi @louishust,
thanks for reporting the issue.

Case:

Are we speaking about the example from https://memtt.github.io/numaprof/example.html?

I tried it on my laptop (Ubuntu 22.04) and do not see the issue.

More architecture info

Can you give more information about your setup so I can try to reproduce it?

Your kernel is a bit more recent than mine. Which distribution are you using?

Which memory architecture, CPU version, and NUMA configuration?

If you want to use GDB

If you have time, you can patch the PREFIX/bin/numaprof script to debug the pintool run with GDB:

-${PINTOOL_PIN} -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@" &
+${PINTOOL_PIN} -appdebug -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@" &

Then, after launching, connect GDB to the GDB server started by the pintool:

# Start GDB, then issue this command at the prompt:
target remote :33030

@louishust
Author

Hi @svalat,

I am using the Basic OpenMP example from https://memtt.github.io/numaprof/example.html

CPU info:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              176
On-line CPU(s) list: 0-175
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             999.854
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000

NUMA INFO:

available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84 88 92 96 100 104 108 112 116 120 124 128 132 136 140 144 148 152 156 160 164 168 172
node 0 size: 96076 MB
node 0 free: 3149 MB
node 1 cpus: 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101 105 109 113 117 121 125 129 133 137 141 145 149 153 157 161 165 169 173
node 1 size: 96720 MB
node 1 free: 243 MB
node 2 cpus: 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66 70 74 78 82 86 90 94 98 102 106 110 114 118 122 126 130 134 138 142 146 150 154 158 162 166 170 174
node 2 size: 96757 MB
node 2 free: 237 MB
node 3 cpus: 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99 103 107 111 115 119 123 127 131 135 139 143 147 151 155 159 163 167 171 175
node 3 size: 96747 MB
node 3 free: 256 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

OS INFO:

 cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

@louishust
Author

I tried to debug it.

The numaprof session:

numaprof ./omp-loop
Application stopped until continued from debugger.
Start GDB, then issue this command at the (gdb) prompt:
  target remote :36317
C: [tid:4187720] Tool (or Pin) caused signal 11 at PC 0x7f742e9e9ad4
/root/loushuai/numaprof-1.1.3/build/install/bin/numaprof: line 80: 4187720 Segmentation fault      (core dumped) ${PINTOOL_PIN} -appdebug -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@"

and the gdb session:

$ gdb omp-loop
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-11.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from omp-loop...done.

(gdb) target remote :36317
Remote debugging using :36317
warning: remote target does not support file transfer, attempting to access files from local filesystem.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
0x00007f742fe01050 in _start () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-101.el8.x86_64
(gdb) bt
#0  0x00007f742fe01050 in _start () from /lib64/ld-linux-x86-64.so.2
#1  0x0000000000000001 in ?? ()
#2  0x00007fffe7706643 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Remote connection closed

The core debug:

$ gdb omp-loop core.4187720.omp-loop.11
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-11.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from omp-loop...done.
[New LWP 4187720]
[New LWP 4187723]
Core was generated by `./omp-loop'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f742e9e9ad4 in ?? ()
[Current thread is 1 (LWP 4187720)]
(gdb) bt
#0  0x00007f742e9e9ad4 in ?? ()
#1  0x00007f741ca3c200 in ?? ()
#2  0x00007f742d644010 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x000000000000002f in ?? ()
#5  0x0000000000000001 in ?? ()
#6  0x000000000000002f in ?? ()
#7  0x0000000000600dc0 in __do_global_dtors_aux_fini_array_entry ()
#8  0x00007f742ea78881 in ?? ()
#9  0x00007f742d860500 in ?? ()
#10 0x00007f742fe00939 in ?? ()
#11 0x00007fffe77052a0 in ?? ()
#12 0x00007f743002a000 in ?? ()
#13 0x0000000000000000 in ?? ()

@svalat
Member

svalat commented Nov 18, 2022

Hi,
I added a few messages to the logging. Can you try the master branch, run the example with the -V option, and report what it logs?

numaprof -V ./omp-loop

Does it start running the program, or does the issue happen during topology loading?
I made a quick test in a local CentOS 8 Stream VM and didn't see the issue.

@louishust
Author

NUMAPROF: Building process tracker...
NUMAPROF: Get cpu number : 176
NUMAPROF: Building TLS...
NUMAPROF: Instrumenting the binary...
NUMAPROF: Start program...
NUMAPROF: Size = 1024, ncpu = 176
NUMAPROF: Thread is binded on NUMA -1
NUMAPROF: Numa initial mapping : -1
NUMAPROF: Numa initial mem mapping : NO_BIND
NUMAPROF: register thread 305813
C: [tid:305813] Tool (or Pin) caused signal 11 at PC 0x7f3b8d34db74
/root/loushuai/numaprof/build/install/bin/numaprof: line 78: 305813 Segmentation fault      (core dumped) ${PINTOOL_PIN} -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@"

svalat added the bug label Nov 21, 2022
@svalat
Member

svalat commented Nov 22, 2022

Hum,
it started, so the crash probably arises during a memory access. I am not sure how to debug this; I need to take a look.

Can I ask you to add some printfs to the program you run (the example) to see at which stage it crashes?
Before or after the #pragma omp parallel?

Can you also build the example without OpenMP and tell me whether it works as a sequential program?

@louishust
Author

I modified the omp-loop.cpp source file as follows:

#include <cstddef>
#include <iostream>

#define SIZE (200*1024*1024/sizeof(float))
#define REPEAT 100

int main(void)
{
        std::cout << "location 1" << std::endl;
        float * buffer = new float[SIZE];

        std::cout << "location 2" << std::endl;
        for (size_t i = 0 ; i < SIZE ; i++)
                buffer[i] = i;

        std::cout << "location 3" << std::endl;
        for (int r = 0 ; r < REPEAT ; r++)
        {
                // Increment every element once per repetition.
                for (size_t i = 0 ; i < SIZE ; i++)
                        buffer[i]++;
        }

        std::cout << "location 4" << std::endl;

        delete [] buffer;

        std::cout << "location 5" << std::endl;

        return 0;
}

Then I ran it again with OpenMP:

[root@host57 loushuai]$  g++ -g -fopenmp omp-loop.cpp -o omp-loop
[root@host57 loushuai]$ ./omp-loop
location 1
location 2
location 3
location 4
location 5
[root@host57 loushuai]$ numaprof -V ./omp-loop
NUMAPROF: Building process tracker...
NUMAPROF: Get cpu number : 176
NUMAPROF: Building TLS...
NUMAPROF: Instrumenting the binary...
NUMAPROF: Start program...
NUMAPROF: Size = 1024, ncpu = 176
NUMAPROF: Thread is binded on NUMA -1
NUMAPROF: Numa initial mapping : -1
NUMAPROF: Numa initial mem mapping : NO_BIND
NUMAPROF: register thread 374947
C: [tid:374947] Tool (or Pin) caused signal 11 at PC 0x7fceca58fb74
/root/loushuai/numaprof/build/install/bin/numaprof: line 78: 374947 Segmentation fault      (core dumped) ${PINTOOL_PIN} -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@"

Then I ran it without OpenMP:

[root@host57 loushuai]$  g++ -g  omp-loop.cpp -o omp-loop
[root@host57 loushuai]$ numaprof -V ./omp-loop
NUMAPROF: Building process tracker...
NUMAPROF: Get cpu number : 176
NUMAPROF: Building TLS...
NUMAPROF: Instrumenting the binary...
NUMAPROF: Start program...
NUMAPROF: Size = 1024, ncpu = 176
NUMAPROF: Thread is binded on NUMA -1
NUMAPROF: Numa initial mapping : -1
NUMAPROF: Numa initial mem mapping : NO_BIND
NUMAPROF: register thread 375013
C: [tid:375013] Tool (or Pin) caused signal 11 at PC 0x7ff3aa1cbb74
/root/loushuai/numaprof/build/install/bin/numaprof: line 78: 375013 Segmentation fault      (core dumped) ${PINTOOL_PIN} -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@"

@svalat
Member

svalat commented Nov 23, 2022

I pushed a more verbose version to the debug/verbose branch. Can you test with it and send me the logs?

git checkout debug/verbose

If needed, you can also make an even more verbose run (maybe a bit too much) that prints each read/write by uncommenting:

-//#define NUMAPROF_TRACE_RW
+#define NUMAPROF_TRACE_RW

in src/integration/pintool/numaprof.cpp.

If you do this most verbose run, just send me the last ~10 lines; that is enough.
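
The switch above is an ordinary compile-time toggle. A minimal sketch of the pattern follows, assuming a hypothetical helper `formatAccess` and a hypothetical wrapper `traceAccess`; only the macro name and the line layout (taken from the logs posted in this thread) come from the discussion, and the real code in src/integration/pintool/numaprof.cpp may be organised differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Uncomment to enable per-access tracing (same idea as NUMAPROF_TRACE_RW).
//#define NUMAPROF_TRACE_RW

// Hypothetical helper: formats one access as
// "<ip>: <R|W> <addr>, NUMA <node> thread ID <id> [<stage>]".
std::string formatAccess(std::uintptr_t ip, bool isWrite, std::uintptr_t addr,
                         int numa, int threadId, const char *stage)
{
    assert(stage != nullptr);
    char buffer[160];
    std::snprintf(buffer, sizeof(buffer),
                  "0x%llx: %c 0x%llx, NUMA %d thread ID %d [%s]",
                  static_cast<unsigned long long>(ip), isWrite ? 'W' : 'R',
                  static_cast<unsigned long long>(addr), numa, threadId, stage);
    return buffer;
}

// Hypothetical wrapper: prints the line only when the macro is defined,
// otherwise it compiles to a no-op.
void traceAccess(std::uintptr_t ip, bool isWrite, std::uintptr_t addr,
                 int numa, int threadId, const char *stage)
{
#ifdef NUMAPROF_TRACE_RW
    std::fprintf(stderr, "%s\n",
                 formatAccess(ip, isWrite, addr, numa, threadId, stage).c_str());
#else
    (void)ip; (void)isWrite; (void)addr; (void)numa; (void)threadId; (void)stage;
#endif
}
```

Calling `traceAccess` before and after the bookkeeping of each access gives the `[before]` / `[after]` pairs; a `[before]` line with no matching `[after]` points at the access being handled when the crash happened.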

@louishust
Author

[root@host57 loushuai]$ numaprof -V ./omp-loop
NUMAPROF: Building process tracker...
NUMAPROF: Get cpu number : 176
NUMAPROF: Building TLS...
NUMAPROF: Instrumenting the binary...
NUMAPROF: Start program...
NUMAPROF: createThreadTracker[before] tid=0
NUMAPROF: Size = 1024, ncpu = 176
NUMAPROF: Thread is binded on NUMA -1
NUMAPROF: Numa initial mapping : -1
NUMAPROF: Numa initial mem mapping : NO_BIND
NUMAPROF: register thread 416008
NUMAPROF: createThreadTracker[after] tid=0
0x7f67844c7d28: W 0x7f67846f2460, NUMA -2 thread ID 0 [before]
0x7f67844c7d28: W 0x7f67846f2460, NUMA 1 thread ID 0 [after]
0x7f67844c7d2f: R 0x7f67846f2e30, NUMA 1 thread ID 0 [before]
C: [tid:416008] Tool (or Pin) caused signal 11 at PC 0x7f6783067f84
/root/loushuai/numaprof/build/install/bin/numaprof: line 78: 416008 Segmentation fault      (core dumped) ${PINTOOL_PIN} -t ${NUMAPROF_PREFIX}/lib/libnumaprof-pintool.so -- "$@"

@svalat
Member

svalat commented Nov 25, 2022

Damn,
I pushed a new commit to the debug/verbose branch to get more in-depth prints in the read/write function.

Could you run it again?

Sorry for the inconvenience. I need to find a way to get this stack trace with symbols; it would be far simpler :(.

@louishust
Author

log.log

@svalat
Member

svalat commented Nov 26, 2022

?? Did you git pull and make install the last commit I made? I don't see the new messages I added.

What is weird is that you now have more read/write messages, which suggests the bug happens somewhat randomly.

@louishust
Author

Damn, I missed the last commit you made!

20221127.log

@svalat
Member

svalat commented Dec 1, 2022

Oops, sorry, I forgot to answer at the beginning of the week.
OK, I think this is related to your NUMA topology.
I added a commit with more prints in the access matrix handling; the issue might be there.
Can you pull (branch debug/verbose), try again, and report the log?

@louishust
Author

@svalat
Member

svalat commented Dec 2, 2022

Hum, sorry, I cannot access the log file. Is it too big for GitHub? If so, you can compress it or keep only the head and tail of the file:

logfile=20221202.log
smaller=20221202-summary.log
head -n 1000 "$logfile" > "$smaller"
echo "=========== TRUNCATE ===========" >> "$smaller"
tail -n 250 "$logfile" >> "$smaller"

@louishust
Author

20221202.log

svalat pushed a commit that referenced this issue Dec 2, 2022
…de is -1 (working on #40)

Currently the -1 is multiplied by the number of nodes, which makes us go outside of the
array. We need to return strictly -1 if one of the incoming NUMA values is -1, and not multiply.
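
The failure mode described in this commit message can be sketched as follows. This is an illustration only, not numaprof's actual code: the function names `flatIndexBuggy`, `flatIndexFixed`, and `recordAccess`, and the row-major layout of the access matrix, are assumptions made for the example. The logs above report "Thread is binded on NUMA -1" (thread not bound), which is the kind of input that drives a plain multiplication to a negative index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an access matrix stored as a flat, row-major
// array: row = NUMA node of the thread, column = NUMA node of the page.

// Buggy variant: an unbound thread (node == -1) yields a negative index.
int flatIndexBuggy(int threadNode, int pageNode, int nodeCount)
{
    return threadNode * nodeCount + pageNode;
}

// Fixed variant: propagate -1 instead of multiplying it.
int flatIndexFixed(int threadNode, int pageNode, int nodeCount)
{
    assert(nodeCount > 0);
    if (threadNode == -1 || pageNode == -1)
        return -1;
    return threadNode * nodeCount + pageNode;
}

// Counts an access only when the index is valid; returns whether it was recorded.
bool recordAccess(std::vector<long> &matrix, int threadNode, int pageNode, int nodeCount)
{
    const int index = flatIndexFixed(threadNode, pageNode, nodeCount);
    if (index < 0 || static_cast<std::size_t>(index) >= matrix.size())
        return false; // unbound or unknown node: nothing to index
    matrix[static_cast<std::size_t>(index)]++;
    return true;
}
```

With 4 nodes, `flatIndexBuggy(-1, 1, 4)` evaluates to -3, so a write through that index lands before the start of the array. Such an out-of-bounds write is undefined behavior, which is consistent with the crash showing up on some machines and not on others.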
@svalat
Member

svalat commented Dec 2, 2022

Ah, yes. I don't know why, but by chance it didn't crash in previous tests I made on similar servers.
I pushed a fix to the master branch; could you test it?

@louishust
Author

louishust commented Dec 6, 2022

It works!

But I cannot view it in the browser.

[root@host57 loushuai]$ cat ~/.numaprof/htpasswd
greatdb:lemYEq3Ce/yj6
[root@host57 loushuai]$ numaprof-webview ./numaprof-omp-loop-1061885.json
INE9TUG7N6D6LWLWKKYEEI3IT5HT9ZGE T19CRYT7PSQEF7BCJCYIYIRBW69TD541
/root/.numaprof/htpasswd
 * Loading file...
 * Prepare func list...
 * Prepare asm func list...
 * Done
 * Serving Flask app 'server' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on all addresses.
   WARNING: This is a development server. Do not use it in a production deployment.
 * Running on http://172.17.130.57:8080/ (Press CTRL+C to quit)
172.16.11.251 - - [06/Dec/2022 08:50:19] "GET / HTTP/1.1" 401 -
Invalid login from greatdb
172.16.11.251 - - [06/Dec/2022 08:50:47] "GET / HTTP/1.1" 401 -

image

I cannot log in with username greatdb and password lemYEq3Ce/yj6

@svalat
Member

svalat commented Dec 6, 2022

Good news for the profiling.

For the GUI, did you use the user/password that was requested in the terminal the first time you launched the numaprof-webview command?

If needed, you can change it or add other users with:

numaprof-passwd {USER}
