Data pre-processing for YOLO detection model #3

Open · cspino opened this issue Feb 13, 2024 · 1 comment

cspino commented Feb 13, 2024

Here is an update on the processing steps needed to feed our data into the YOLOv8 model.

First, here is a recap of what the model expects as input:

  • The data should be organized as follows:
dataset/
│
├── images/
│   ├── train/
│   │   ├── sub-cal056_ses-M12_STIR_0.png
│   │   ├── sub-cal056_ses-M12_STIR_1.png
│   │   └── ...
│   ├── val/
│   │   ├── sub-tor006_ses-M12_PSIR_0.png
│   │   └── ...
│   └── test/
│       ├── sub-tor007_ses-M12_PSIR_0.png
│       └── ...
│
├── labels/
│   ├── train/
│   │   ├── sub-cal056_ses-M12_STIR_0.txt
│   │   ├── sub-cal056_ses-M12_STIR_1.txt
│   │   └── ...
│   ├── val/
│   │   ├── sub-tor006_ses-M12_PSIR_0.txt
│   │   └── ...
│   └── test/
│       ├── sub-tor007_ses-M12_PSIR_0.txt
│       └── ...
│
└── data.yaml
  • To train a model, a path to the .yaml file must be given. This file should have the following content:
path: "dataset" # dataset root dir
train: "images/train"  # train images (relative to 'path')
val: "images/val"  # val images (relative to 'path')
test: "images/test"   # test images (optional)

nc: 1 # number of classes
names: ["lesion"] # classes to detect
  • Labels are stored as .txt files containing the bounding box coordinates of every object in a given image. These files must have the same filename as their corresponding image. If an image has no object, a .txt file is not necessary.

The .txt file should be formatted like this:
<class number> <x_center> <y_center> <width> <height>
where the coordinates are normalized between 0 and 1: x_center and width (in pixels) are divided by the image width, and y_center and height by the image height.

For example, here is the label file for a slice with two lesions:

0 0.512500 0.404687 0.012500 0.040625
0 0.496875 0.453125 0.012500 0.031250
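
To make the normalization concrete, here is a small helper (an illustrative sketch; the function name and the 320x320 example are hypothetical, not from the repo):

def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space bounding box to one line of a YOLO label file."""
    x_center = (x_min + x_max) / 2 / img_w   # normalize by image width
    y_center = (y_min + y_max) / 2 / img_h   # normalize by image height
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# e.g. on a hypothetical 320x320 slice, a box spanning x 162-166 and y 123-136
# gives (up to rounding) the first lesion line shown above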

What I’ve accomplished so far:

(I’ve pushed everything to branch cs/2d_lesion_detection. I included a Jupyter notebook and a few images to reproduce the tests that I’ve run so far.)

  • The function nifti_to_png takes a path to a nifti image and saves each slice as a png. For now, every slice is saved, but since we’re only interested in the slices containing the spinal cord, this is where I plan on checking the spinal cord segmentation in order to discard the slices that we don’t want.

  • The function labels_from_nifti takes a path to a nifti segmentation and, for every slice, extracts the bounding box coordinates of each lesion and creates the .txt file. (A sketch of both functions follows this list.)

  • With these functions, I created a small test dataset with just a few images and was able to train a model and then get a prediction (no lesions were predicted; the training call is sketched below). Training metrics are automatically saved to a ‘runs’ folder.
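
For reference, here is roughly what the two functions do (a minimal sketch, not the exact code on the branch: it uses nibabel/Pillow/scipy and assumes the third array axis indexes slices and that each connected component of the lesion segmentation is one lesion):

import nibabel as nib
import numpy as np
from pathlib import Path
from PIL import Image
from scipy import ndimage

def nifti_to_png(nifti_path: Path, out_dir: Path) -> None:
    """Save every slice of a nifti volume as an 8-bit png."""
    volume = nib.load(str(nifti_path)).get_fdata()
    stem = nifti_path.name.replace(".nii.gz", "")
    for i in range(volume.shape[2]):
        slice_ = volume[:, :, i]
        # rescale intensities to 0-255 for png export
        slice_ = (255 * (slice_ - slice_.min()) / (np.ptp(slice_) or 1)).astype(np.uint8)
        Image.fromarray(slice_).save(out_dir / f"{stem}_{i}.png")

def labels_from_nifti(seg_path: Path, out_dir: Path) -> None:
    """Write one YOLO .txt label file per slice that contains a lesion."""
    seg = nib.load(str(seg_path)).get_fdata()
    h, w = seg.shape[0], seg.shape[1]  # rows, cols of each slice
    stem = seg_path.name.replace(".nii.gz", "")
    for i in range(seg.shape[2]):
        labelled, n_lesions = ndimage.label(seg[:, :, i] > 0)
        if n_lesions == 0:
            continue  # no .txt file needed for slices without lesions
        lines = []
        for ys, xs in ndimage.find_objects(labelled):
            x_c = (xs.start + xs.stop) / 2 / w
            y_c = (ys.start + ys.stop) / 2 / h
            lines.append(f"0 {x_c:.6f} {y_c:.6f} "
                         f"{(xs.stop - xs.start) / w:.6f} {(ys.stop - ys.start) / h:.6f}")
        (out_dir / f"{stem}_{i}.txt").write_text("\n".join(lines) + "\n")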

Here is one of the ground truth images with the lesions labelled:
[image: ground-truth slice with the lesion bounding boxes drawn]
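
The train/predict step itself just uses the standard ultralytics API (the model size, epochs and image size below are placeholders, not necessarily what I used):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # start from pretrained nano weights
model.train(data="dataset/data.yaml", epochs=100, imgsz=640)  # metrics saved under runs/
results = model.predict("dataset/images/test/sub-tor007_ses-M12_PSIR_0.png")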

The next steps:

  • Only save slices that contain part of the spinal cord
  • We’ll need a function that can take batches of data from a BIDS database and organize it in the YOLOv8 format
    • Separate into train/val/test?
    • Create pngs, txt files and a yaml file
  • Try training on more data
cspino self-assigned this Feb 13, 2024
cspino commented Mar 5, 2024

Here are the latest changes that I've made to the pre-processing steps.

1. Train, Test, Val datasets from Canproco database

The script train_test_val_from_BIDS.py:

  • Goes through the Canproco database (path given as input)
  • Makes a list of all the scans that have a *lesion-manual.nii.gz file
  • Randomly splits that list into three datasets (train, val, test) using the specified proportions
  • Saves the final lists in a json file
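
In case it helps review, the core of the split logic looks something like this (a simplified sketch; the default proportions, seed and filename handling are placeholders, not necessarily what the script does):

import json
import random
from pathlib import Path

def split_scans(canproco_root: Path, out_json: Path, props=(0.8, 0.1, 0.1), seed=0):
    """Randomly split the scans that have a manual lesion seg into train/val/test."""
    # every scan with a *lesion-manual.nii.gz file, reduced to its scan ID
    scans = sorted(p.name.replace("_lesion-manual.nii.gz", "")
                   for p in canproco_root.rglob("*lesion-manual.nii.gz"))
    random.Random(seed).shuffle(scans)  # fixed seed keeps the split reproducible
    n_train = round(props[0] * len(scans))
    n_val = round(props[1] * len(scans))
    splits = {"train": scans[:n_train],
              "val": scans[n_train:n_train + n_val],
              "test": scans[n_train + n_val:]}
    out_json.write_text(json.dumps(splits, indent=2))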

Since it's the scans that are being split into the different datasets (and not the patients), two scans from the same patient (M0 and M12) might end up in two different datasets (i.e., M0 in the training set and M12 in the val set). I'm not sure whether that's a problem or not. But all slices from a single scan necessarily end up in the same dataset. This way, we avoid the possible validation bias of having very similar slices in both the training set and the validation/test sets.

Here is an example of what the json file looks like:

{"train": ["sub-cal056_ses-M12_STIR",
            "sub-edm011_ses-M0_PSIR",
            "sub-cal072_ses-M0_STIR"],
  "val": ["sub-cal157_ses-M12_STIR"],
  "test": ["sub-edm076_ses-M0_PSIR"]}

2. Spinal cord segmentation when missing

The script sc_seg_from_list.py takes the json list from (1), and for every scan:

  • Checks if the spinal cord segmentation file exists
  • If not, creates it using the spinal cord toolbox

The sc segmentation is used in script (3) to select only the slices that contain part of the spinal cord, which keeps the proportion of empty slices down and avoids class imbalance.
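
Concretely, the fallback is a call out to SCT (a sketch; that the script uses sct_deepseg_sc, and the contrast flag and output naming, are my assumptions here):

import subprocess
from pathlib import Path

def ensure_sc_seg(image_path: Path, seg_path: Path, contrast: str = "t2") -> None:
    """Create the spinal cord segmentation with SCT only if it is missing."""
    if seg_path.exists():
        return  # segmentation already present in the database
    subprocess.run(["sct_deepseg_sc", "-i", str(image_path),
                    "-c", contrast, "-o", str(seg_path)],
                   check=True)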

3. BIDS to YOLO format

The script pre-process.py takes the same json list from (1), then generates the YOLO-formatted dataset:

  • only slices that contain part of the spinal cord are saved as .png images
  • a .txt file with the label bounding box coordinates is generated for each slice that contains a lesion
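
The slice filter itself reduces to a check on the cord mask (an illustrative sketch; the function name is hypothetical):

import nibabel as nib
import numpy as np

def slices_with_cord(sc_seg_path: str) -> list[int]:
    """Indices of the slices where the spinal cord segmentation is non-empty."""
    seg = nib.load(sc_seg_path).get_fdata()
    return [i for i in range(seg.shape[2]) if np.any(seg[:, :, i] > 0)]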
