Classification example - segmentation fault on some systems #136

Open
krichardsson opened this issue Nov 8, 2023 · 19 comments
Labels
bug Something isn't working

Comments

@krichardsson
Contributor

There is a discussion indicating that there are issues running the classification example.

I did a quick test and found some (other) problems:

  1. The requirements.txt file contains fixed versions which is bad from a maintainability point of view and I got a bunch of conflicts on my machine.
  2. When running python train_classifier.py I get a segmentation fault(!), not sure why.

My conclusion is that we should take a look at this example and make sure it works.

@knmcguire added the bug and triage needed labels on Feb 15, 2024
@gemenerik
Member

gemenerik commented Feb 19, 2024

> There is a discussion indicating that there are issues running the classification example.

Answered ✅

> 1. The requirements.txt file contains fixed versions which is bad from a maintainability point of view and I got a bunch of conflicts on my machine.

Conflicts are avoided by installing the requirements into a separate Python environment. From experience I know it can be really troublesome to work with deep learning repos that have loose requirements. Since this is an application rather than a library, I think fixed versions should be acceptable. But I'm curious to hear arguments for loosening them.

> 2. When running `python train_classifier.py` I get a segmentation fault(!), not sure why.

With a Python=3.10 conda env + pip installing the requirements.txt (as instructed in the classification demo docs) training works for me out of the box.
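
To see how far an existing environment has drifted from the pinned versions, a small check along these lines may help. It is only a sketch: the helper name `check_pins` is hypothetical (not part of the repository), it handles plain `name==version` lines only, and it assumes nothing about the actual contents of requirements.txt.

```python
from importlib import metadata


def check_pins(requirement_lines):
    """Compare `name==version` pins against the packages installed in this environment.

    Returns a list of (name, pinned, installed) tuples, one per mismatch;
    `installed` is None when the package is not installed at all.
    Hypothetical helper for illustration, not part of the repository.
    """
    mismatches = []
    for raw in requirement_lines:
        # Drop comments and environment markers, then surrounding whitespace.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # only exact pins are checked in this sketch
        name, pinned = (part.strip() for part in line.split("==", 1))
        if not name:
            continue
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches
```

Run from examples/ai/classification inside the activated environment, `check_pins(open("requirements.txt"))` would list every pin that the environment does not satisfy; an empty list means the installed versions match the file.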

@luigifeola

> 2. When running `python train_classifier.py` I get a segmentation fault(!), not sure why.
>
> With a Python=3.10 conda env + pip installing the requirements.txt (as instructed in the classification demo docs) training works for me out of the box.

Hi @gemenerik, I still get a segmentation fault, even after creating a conda environment from scratch. Is there anything else I can do to run the code?

@gemenerik
Member

Can you share some more details? Like what OS you are using? A terminal printout? Anything that helps me reproduce the problem.

@luigifeola

Sure, here it is.
OS: Pop!_OS 22.04
I am currently using a conda environment, and I installed all the packages listed in aideck-gap8-examples/examples/ai/classification/requirements.txt with pip install -r requirements.txt as usual.

$ conda list -n ai-classification python
# packages in environment at /home/gigi-labs/.miniconda/envs/ai-classification:
#
# Name                    Version                   Build  Channel
python                    3.10.14              h955ad1f_1  

This is the terminal output when I try to run the train_classifier.py script:

(ai-classification) gigi-labs@pop-os:/media/gigi-labs/T7/repos/bitcraze/aideck-gap8-examples/examples/ai/classification$ python train_classifier.py

2024-06-28 15:06:02.204756: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-28 15:06:02.298341: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-28 15:06:02.659914: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib
2024-06-28 15:06:02.659964: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib
2024-06-28 15:06:02.659968: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
training_data/*/*/*
Found 1375 images belonging to 2 classes.
Found 450 images belonging to 2 classes.
2024-06-28 15:06:03.066179: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-06-28 15:06:03.089418: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib
2024-06-28 15:06:03.089438: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2024-06-28 15:06:03.089635: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 separable_conv2d (Separable  (None, 162, 122, 3)      7         
 Conv2D)                                                         
                                                                 
 resizing (Resizing)         (None, 96, 96, 3)         0         
                                                                 
 mobilenetv2_0.35_96 (Functi  (None, 3, 3, 1280)       410208    
 onal)                                                           
                                                                 
 separable_conv2d_1 (Separab  (None, 1, 1, 32)         52512     
 leConv2D)                                                       
                                                                 
 dropout (Dropout)           (None, 1, 1, 32)          0         
                                                                 
 global_average_pooling2d (G  (None, 32)               0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 2)                 66        
                                                                 
=================================================================
Total params: 462,793
Trainable params: 52,585
Non-trainable params: 410,208
_________________________________________________________________
Number of trainable weights = 8
Epoch 1/20
Segmentation fault (core dumped)

Is it a problem if I store and run everything from an external SSD?
Any help is really appreciated.

@knmcguire reopened this on Jul 1, 2024
@gemenerik
Member

Oof, that is not a very informative error. Can you run any of the official tensorflow examples for this install?
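
One way to get a little more out of a bare "Segmentation fault" is Python's built-in `faulthandler` module, which dumps the Python traceback when the interpreter receives a fatal signal. The sketch below was not used in this thread; the wrapper name `run_with_faulthandler` is hypothetical. It starts the script in a child interpreter with `-X faulthandler` (setting `PYTHONFAULTHANDLER=1` has the same effect) and returns what was written to stderr.

```python
import subprocess
import sys


def run_with_faulthandler(script_args):
    """Run a Python script in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). If the child crashes, stderr should end with
    the Python-level traceback that faulthandler prints before the process dies.
    Hypothetical helper for illustration.
    """
    cmd = [sys.executable, "-X", "faulthandler", *script_args]
    # stderr is captured so it can be inspected or saved; stdout passes through.
    result = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    return result.returncode, result.stderr


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python this_file.py train_classifier.py
    code, err = run_with_faulthandler(sys.argv[1:])
    print(err, file=sys.stderr)
    print(f"exit code: {code}")
```

The traceback only shows which Python call was active at the time of the crash, not the native frame inside TensorFlow, but that is often enough to tell whether the fault happens in data loading or in the training step itself. On Linux, a negative exit code means the child was killed by a signal (for example -11 for SIGSEGV).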

@luigifeola

Some more info, nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I tried this quickstart example, and the model trained correctly (exactly as in here)

@luigifeola

luigifeola commented Jul 3, 2024

Good news: the train_classifier.py script works in a docker container that comes pre-built with tensorflow (without installing the packages in requirements.txt). The docker image is nvcr.io/nvidia/tensorflow:23.03-tf2-py3, which runs Python 3.8.10.

@gemenerik
Copy link
Member

gemenerik commented Jul 3, 2024

Good idea to try a docker container. Instead of an nvidia one, I will try to find a tensorflow/tensorflow container that works for this example.

EDIT: that will likely be tensorflow/tensorflow:2.11.0

@gemenerik
Member

If you have a chance to test it: create a file train_classifier.sh in the examples/ai/classification folder with:

#!/usr/bin/env bash
set -e

full_path=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )

cd "${full_path}"

pip install pillow scipy
python train_classifier.py

From the repository root folder, run:

docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow:2.11.0 examples/ai/classification/train_classifier.sh

@luigifeola

Thanks for your support.

However it does not work. This is the output I got:

~/aideck-gap8-examples$ docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow:2.11.0 examples/ai/classification/train_classifier.sh

Collecting pillow
  Downloading pillow-10.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
     |████████████████████████████████| 4.4 MB 3.2 MB/s 
Collecting scipy
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
     |████████████████████████████████| 34.5 MB 7.5 MB/s 
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /usr/local/lib/python3.8/dist-packages (from scipy) (1.23.4)
Installing collected packages: pillow, scipy
Successfully installed pillow-10.4.0 scipy-1.10.1
WARNING: You are using pip version 20.2.4; however, version 24.1.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
2024-07-03 13:30:47.917060: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-03 13:30:47.978827: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
training_data/*/*/*
Found 1375 images belonging to 2 classes.
Found 450 images belonging to 2 classes.
2024-07-03 13:30:48.742408: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_0.35_96_no_top.h5
2019640/2019640 [==============================] - 0s 0us/step
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 separable_conv2d (Separable  (None, 162, 122, 3)      7         
 Conv2D)                                                         
                                                                 
 resizing (Resizing)         (None, 96, 96, 3)         0         
                                                                 
 mobilenetv2_0.35_96 (Functi  (None, 3, 3, 1280)       410208    
 onal)                                                           
                                                                 
 separable_conv2d_1 (Separab  (None, 1, 1, 32)         52512     
 leConv2D)                                                       
                                                                 
 dropout (Dropout)           (None, 1, 1, 32)          0         
                                                                 
 global_average_pooling2d (G  (None, 32)               0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 2)                 66        
                                                                 
=================================================================
Total params: 462,793
Trainable params: 52,585
Non-trainable params: 410,208
_________________________________________________________________
Number of trainable weights = 8
Epoch 1/20
examples/ai/classification/train_classifier.sh: line 9:    13 Segmentation fault      (core dumped) python train_classifier.py

@gemenerik
Member

Curious. Do you have an NVIDIA GPU?

@luigifeola

Sorry for the late reply.

Yes I have an NVIDIA GPU, this is my nvidia-smi output:

Mon Jul  8 09:39:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A3000 12GB La...    On  | 00000000:01:00.0  On |                  Off |
| N/A   44C    P0              21W /  80W |    914MiB / 12288MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@gemenerik
Member

Thanks! I think for now we'll leave this issue open and consider the NVIDIA docker a workaround for NVIDIA GPU users that run into the segmentation fault.

@gemenerik changed the title from "Classification example is not working" to "Classification example - segmentation fault on some systems" on Jul 9, 2024
@gemenerik
Member

@luigifeola the above might work with the tensorflow/tensorflow:2.11.0-gpu docker image

@luigifeola

Hi @gemenerik, sorry for the super late reply. Even with docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow:2.11.0-gpu examples/ai/classification/train_classifier.sh I still got the segmentation fault:

Collecting pillow
  Downloading pillow-10.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
     |████████████████████████████████| 4.4 MB 4.0 MB/s 
Collecting scipy
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
     |████████████████████████████████| 34.5 MB 154.1 MB/s 
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /usr/local/lib/python3.8/dist-packages (from scipy) (1.23.4)
Installing collected packages: pillow, scipy
Successfully installed pillow-10.4.0 scipy-1.10.1
WARNING: You are using pip version 20.2.4; however, version 24.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
2024-07-29 17:27:50.002395: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-29 17:27:50.082995: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
training_data/*/*/*
Found 7785 images belonging to 2 classes.
Found 2601 images belonging to 2 classes.
2024-07-29 17:27:51.023124: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: UNKNOWN ERROR (34)
2024-07-29 17:27:51.023153: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2024-07-29 17:27:51.023290: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_0.35_96_no_top.h5
2019640/2019640 [==============================] - 0s 0us/step
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 separable_conv2d (Separable  (None, 162, 122, 3)      7         
 Conv2D)                                                         
                                                                 
 resizing (Resizing)         (None, 96, 96, 3)         0         
                                                                 
 mobilenetv2_0.35_96 (Functi  (None, 3, 3, 1280)       410208    
 onal)                                                           
                                                                 
 separable_conv2d_1 (Separab  (None, 1, 1, 32)         52512     
 leConv2D)                                                       
                                                                 
 dropout (Dropout)           (None, 1, 1, 32)          0         
                                                                 
 global_average_pooling2d (G  (None, 32)               0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 2)                 66        
                                                                 
=================================================================
Total params: 462,793
Trainable params: 52,585
Non-trainable params: 410,208
_________________________________________________________________
Number of trainable weights = 8
Epoch 1/20
examples/ai/classification/train_classifier.sh: line 9:    13 Segmentation fault      (core dumped) python train_classifier.py

@knmcguire
Member

Hi! Rik will be back next week so I'll notify him once he is back

@gemenerik
Member

It may be related to how TensorFlow is built, possibly involving the GPU. Works fine on a GTX 1080 system. Haven't been able to reproduce the problem and a workaround was found, so not digging deeper for now.
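
For anyone who wants to narrow this down further, one possible diagnostic is to rerun the training with the GPU hidden and with the oneDNN custom operations switched off, and see whether the crash persists. This is a sketch only and was not tried in this thread: `CUDA_VISIBLE_DEVICES=-1` is the usual way to hide CUDA devices from a process, `TF_ENABLE_ONEDNN_OPTS=0` is the switch named in the log output above, and the helper names are hypothetical.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment with GPUs hidden and oneDNN custom ops off.

    Hypothetical helper for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "-1"  # hide every CUDA device from the child process
    env["TF_ENABLE_ONEDNN_OPTS"] = "0"  # switch named in the TensorFlow log output
    return env


def run_cpu_only(script="train_classifier.py"):
    """Run the training script with the modified environment and return its exit code."""
    return subprocess.run([sys.executable, script], env=cpu_only_env()).returncode
```

In the failing runs logged earlier, TensorFlow itself reported that it was not using the GPU ("Skipping registering GPU devices..." and "no NVIDIA GPU device is present"), so toggling the oneDNN setting on its own may be the more informative of the two experiments.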

@luigifeola

luigifeola commented Sep 4, 2024

Hi @gemenerik,
My solution is now working using the tensorflow/tensorflow:2.11.0-gpu Docker image as you suggested. However, it's necessary to pass some additional arguments to the Docker container, such as: --gpus all --ipc=host --shm-size=4g --ulimit memlock=-1.

Additionally, the tensorflow Docker container recommends running in non-root mode. To follow this best practice, I created a custom image based on tensorflow/tensorflow:2.11.0-gpu, where I added a non-root user called user. I'm happy to share the custom image if needed.

The Lite model works well on my custom dataset, but when deployed it detects the background ~90% of the time. This seems to be a separate problem, and I will open a new issue to address it: #145

Thanks again for your support!
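
The additional arguments mentioned above can be combined with the earlier `docker run` invocation roughly as follows. The exact command that was used is not shown in the thread, so the ordering here is an assumption, and the function name `docker_train_command` is hypothetical. Note that `--gpus all` only works when the NVIDIA Container Toolkit is installed on the host.

```python
import os
import shlex

# Extra `docker run` options reported in the comment above.
GPU_RUN_OPTIONS = ["--gpus", "all", "--ipc=host", "--shm-size=4g", "--ulimit", "memlock=-1"]


def docker_train_command(repo_root, image="tensorflow/tensorflow:2.11.0-gpu"):
    """Build the `docker run` argv for the training script, including the GPU options.

    Hypothetical helper; the mount point and script path follow the command
    posted earlier in this thread.
    """
    return [
        "docker", "run", "-it", "--rm",
        *GPU_RUN_OPTIONS,                  # options must come before the image name
        "-v", f"{repo_root}:/tmp",         # mount the repository into the container
        "-w", "/tmp",                      # and use it as the working directory
        image,
        "examples/ai/classification/train_classifier.sh",
    ]


# Print a copy-pasteable command line for the current directory.
print(shlex.join(docker_train_command(os.getcwd())))
```

Printing the command instead of executing it keeps the sketch side-effect free; passing the same list to `subprocess.run` from the repository root would start the container.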

@gemenerik
Member

Related to this, documentation has been updated to include instructions for Docker-based training

4 participants