Dataset aggregation #1

plbenveniste · 2024-02-08T21:48:23Z

This issue describes the aggregation of the available datasets.
The datasets of interest for this project are:

Labeled datasets
- canproco : PSIR and STIR contrast
- sct-testing-large : T1, T2 and T2*
- basel-mp2rage : MP2RAGE
- Bavaria : T2w
- msseg_challenge_2021 : FLAIR, but the top of the image needs to be cropped first. Only a small part of the spinal cord is included, and I think no lesions are segmented in the spinal cord.
Unlabeled datasets:
- nih-ms-mp2rage : MP2RAGE
- umass-ms-* : T2w sag, STIR_T2w sag, T1w sag, Gad T1w sag, FMPIR_T2w sag, T2w ax, PD ax, Gad T1w ax
- karolinska : T1 and T2, still in DICOM format (related issue: 76)

plbenveniste · 2024-02-15T20:21:49Z

For the training of a first nnUNet model (called Dataset101_singleClassNnunetMsLesion), I aggregated the lesion segmentations into a JSON file. The training dictionary lists the path to each lesion-segmentation file and its corresponding image, so an image can appear more than once if it has multiple segmentation files. It was split 80% for training and 20% for testing. Images without any segmentation file were added to an inference dictionary. I obtained the following:

Canproco:
- 363 images for training
- 83 images for testing
- 341 images for inference
Basel:
- 318 images for training
- 70 images for testing
- 105 images for inference
SCT-Testing-Large:
- 1282 images for training
- 307 images for testing
- 0 images for inference
Bavaria:
- 183 images for training
- 30 images for testing
- 639 images for inference

This makes for a total of:

Total images for training: 2146
Total images for testing: 490
Total images for inference: 1085
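
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name `build_dictionaries`, the input layout (a mapping from image path to its list of segmentation paths), the file names, and the seed are all assumptions. One entry is created per (image, segmentation) pair, so an image with several segmentation files appears several times, and images with no segmentation go to the inference list.

```python
import json
import random


def build_dictionaries(seg_paths_by_image, test_fraction=0.2, seed=0):
    """Split (image, segmentation) pairs into training/testing lists.

    seg_paths_by_image: hypothetical mapping {image_path: [segmentation_path, ...]}.
    Images with an empty list are collected for inference instead.
    """
    pairs = []
    inference = []
    for image, segs in sorted(seg_paths_by_image.items()):
        if not segs:
            inference.append({"image": image})
        for seg in segs:
            # One entry per segmentation file: an image may appear several times.
            pairs.append({"image": image, "label": seg})

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(pairs)
    n_test = round(len(pairs) * test_fraction)
    return {
        "training": pairs[n_test:],
        "testing": pairs[:n_test],
        "inference": inference,
    }


# Hypothetical file names, for illustration only.
example = {
    "sub-01_T2w.nii.gz": ["sub-01_T2w_lesion-manual.nii.gz"],
    "sub-02_T2w.nii.gz": ["sub-02_T2w_lesion-a.nii.gz", "sub-02_T2w_lesion-b.nii.gz"],
    "sub-03_T2w.nii.gz": [],
    "sub-04_T2w.nii.gz": ["sub-04_T2w_lesion-manual.nii.gz"],
    "sub-05_T2w.nii.gz": ["sub-05_T2w_lesion-manual.nii.gz"],
}
dictionaries = build_dictionaries(example)
as_json = json.dumps(dictionaries, indent=2)
```

Note that a split done at the pair level can place the same image in both training and testing when it has several segmentations; splitting by image or by subject would avoid that.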

Note

We are currently manually labelling images in CanProCo, therefore the numbers above are about to change.

Furthermore, running nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity raised many warnings about orientation or dimension mismatches. Therefore, the orientation of each of the above datasets needs to be corrected.
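
To list the affected image/mask pairs before re-running the integrity check, something like the following could be used. It is only a sketch: it assumes the shape and the 4x4 affine of each image and mask have already been read (for example with nibabel), and the function name `find_mismatches` and the tolerance are illustrative, not what nnU-Net itself uses.

```python
def find_mismatches(image_shape, image_affine, mask_shape, mask_affine, tol=1e-3):
    """Return a list of human-readable problems between an image and its mask.

    Shapes are sequences of ints; affines are 4x4 nested sequences of floats.
    """
    problems = []
    if tuple(image_shape) != tuple(mask_shape):
        problems.append(
            f"shape mismatch: {tuple(image_shape)} vs {tuple(mask_shape)}"
        )
    # Compare the affines element by element, within a small tolerance.
    for row_image, row_mask in zip(image_affine, mask_affine):
        if any(abs(a - b) > tol for a, b in zip(row_image, row_mask)):
            problems.append("affine mismatch (orientation, spacing or origin differ)")
            break
    return problems


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Same grid, but the first axis is flipped in the mask header.
flipped_x = [[-1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

ok = find_mismatches((64, 64, 20), identity, (64, 64, 20), identity)
bad = find_mismatches((64, 64, 20), identity, (64, 64, 21), flipped_x)
```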

plbenveniste · 2024-02-16T19:46:00Z

The headers were corrected using dataset_correction.py on branch plb/dataset_aggregation. We first change the orientation to match the image's header, then use the SCT function sct_image to copy the image header to the lesion mask header.

Output of correction:

- Canproco : 1511 files corrected (pushed to branch plb/correct_header_2 on canproco dataset git annex): this includes SC seg, lesion seg and disc labels
- Basel : 272 files corrected (pushed to branch plb/correct_header on basel dataset git annex): this includes SC seg and lesion seg
- Sct-testing-large : 1241 files corrected (pushed to branch plb/correct_header on sct-testing-large dataset git annex): this only includes lesion segmentations
- Bavaria-quebec : 213 files corrected (pushed to branch plb/correct_header on bavaria-quebec dataset git annex): this only includes lesion segmentations

This is the issue that deals with the modified data: neuropoly/data-management#301

jcohenadad · 2024-02-28T03:45:07Z

I just remembered that we also have a lot of data from UMass (git-annex data: umass-ms-*, 3 datasets).

plbenveniste · 2024-06-12T15:30:39Z

Referencing these issues regarding the problem caused by the dataset_correction.py script: issue 301 and issue 305.

The script dataset_correction.py should no longer be used, or should at least be corrected first.

plbenveniste · 2024-06-26T15:49:18Z

I updated the code to aggregate the following labelled datasets:

basel-ms-mp2rage
bavaria-quebec-spine-ms-unstitched
canproco
nih-ms-mp2rage
sct-testing-large

The command run on kronos was:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

The output is the following:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1636
Number of images in validation set: 569
Number of images in test set: 544
Total number of images in the dataset: 2749

The total number of images in the dataset (2749) is different from the total number of derivatives (4407) because we decided to keep only those which have lesions.

The output is the following file: dataset_2024-06-26_seed42_lesionOnly.json
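
The lesion-only filtering and seeded split can be pictured with the sketch below. It is not the code of 1_create_msd_data.py: the function name `lesion_only_split`, the input layout (a mapping from mask path to its lesion voxel count), and the 60/20/20 proportions are assumptions chosen for illustration; only the seed value and the output file name come from this thread.

```python
import random


def lesion_only_split(lesion_voxels_by_mask, seed=42, fractions=(0.6, 0.2, 0.2)):
    """Keep masks with at least one lesion voxel, then split them reproducibly.

    lesion_voxels_by_mask: hypothetical mapping {mask_path: lesion_voxel_count}.
    fractions: assumed train/validation/test proportions (not from the script).
    """
    kept = sorted(path for path, n in lesion_voxels_by_mask.items() if n > 0)
    random.Random(seed).shuffle(kept)
    n_train = round(len(kept) * fractions[0])
    n_val = round(len(kept) * fractions[1])
    return {
        "train": kept[:n_train],
        "validation": kept[n_train:n_train + n_val],
        "test": kept[n_train + n_val:],
    }


# 15 hypothetical masks, every third one empty (no lesion voxels).
masks = {
    f"sub-{i:02d}_lesion.nii.gz": (0 if i % 3 == 0 else 10 * i)
    for i in range(1, 16)
}
split = lesion_only_split(masks, seed=42)
output_name = f"dataset_2024-06-26_seed{42}_lesionOnly.json"
```

Fixing the seed means that re-running the split on the same list of files gives the same train, validation and test sets.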

jcohenadad · 2024-06-26T16:23:23Z

> we decided to keep only those which have lesions.

For now, but in the future it may be desirable to develop a model that also has good specificity (i.e., a high true negative rate).
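
For reference, specificity is TN / (TN + FP). A minimal image-level sketch, with hypothetical counts, shows why a lesion-only dataset cannot measure it: without lesion-free images there are no negatives, so the ratio is undefined.

```python
import math


def specificity(true_negatives, false_positives):
    """True negative rate: TN / (TN + FP). Returns NaN when there are no negatives."""
    negatives = true_negatives + false_positives
    if negatives == 0:
        return math.nan
    return true_negatives / negatives


# Hypothetical counts: 100 lesion-free images, 90 correctly predicted as lesion-free.
with_negatives = specificity(true_negatives=90, false_positives=10)

# Lesion-only dataset: no lesion-free images, so specificity is undefined.
lesion_only = specificity(true_negatives=0, false_positives=0)
```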

plbenveniste · 2024-06-26T20:15:21Z

There was an issue in the code when gathering segmentations from nih-ms-mp2rage.
The code was run again:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

This is the output of the code:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1712
Number of images in validation set: 590
Number of images in test set: 569
Total number of images in the dataset: 2871
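
As a quick consistency check on the two runs reported in this thread, the counts can be compared directly. The numbers below are copied from the two outputs; attributing the difference to nih-ms-mp2rage is an assumption based on the fix described above, not something the script output states.

```python
first_run = {"train": 1636, "validation": 569, "test": 544}
second_run = {"train": 1712, "validation": 590, "test": 569}

# Each run's splits should add up to its reported total.
assert sum(first_run.values()) == 2749
assert sum(second_run.values()) == 2871

# Images gained per split after the fix (presumably from nih-ms-mp2rage).
added = {name: second_run[name] - first_run[name] for name in first_run}
total_added = sum(added.values())

# Share of each split in the second run, rounded to three decimals.
shares = {name: round(n / 2871, 3) for name, n in second_run.items()}
```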

plbenveniste self-assigned this Feb 8, 2024

plbenveniste mentioned this issue Feb 16, 2024

Correcting header segmentation files neuropoly/data-management#301

Closed

plbenveniste added the data label Feb 22, 2024

plbenveniste mentioned this issue Feb 22, 2024

Orientation problem labels-disc M12 subjects ivadomed/canproco#71

Closed
