Dataset aggregation #1

Open
plbenveniste opened this issue Feb 8, 2024 · 7 comments

@plbenveniste
Collaborator

plbenveniste commented Feb 8, 2024

This issue describes the aggregation of the available datasets.
The datasets of interest for this project are:

  • Labeled datasets:

    • canproco : PSIR and STIR contrast
    • sct-testing-large : T1, T2 and T2*
    • basel-mp2rage : MP2RAGE
    • Bavaria : T2w
    • msseg_challenge_2021 : FLAIR, but we first need to crop the top of the image; only a small part of the spinal cord is included, and I think no lesions are segmented in the spinal cord
  • Unlabeled datasets:

    • nih-ms-mp2rage : MP2RAGE
    • umass-ms-* : T2w sag, STIR_T2w sag, T1w sag, Gad T1w sag, FMPIR_T2w sag, T2w ax, PD ax, Gad T1w ax
    • karolinska : T1 and T2, still in DICOM format (related issue: 76)
@plbenveniste plbenveniste self-assigned this Feb 8, 2024
@plbenveniste
Collaborator Author

To train a first nnUNet model (called Dataset101_singleClassNnunetMsLesion), I aggregated the lesion segmentations into a JSON file that stores the path to each lesion-segmentation file together with its corresponding image, forming a training dictionary. An image can therefore appear more than once if it has multiple segmentation files. The dictionary was split 80% for training and 20% for testing. Images without any segmentation file were added to an inference dictionary. I obtained the following:

  • Canproco:
    • 363 images for training
    • 83 images for testing
    • 341 images for inference
  • Basel:
    • 318 images for training
    • 70 images for testing
    • 105 images for inference
  • SCT-Testing-Large:
    • 1282 images for training
    • 307 images for testing
    • 0 images for inference
  • Bavaria:
    • 183 images for training
    • 30 images for testing
    • 639 images for inference

This makes for a total of:

  • Total images for training: 2146
  • Total images for testing: 490
  • Total images for inference: 1085
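
For readers who want to reproduce the logic described above, here is a minimal sketch of the aggregation step. It is not the script that was actually used: the function name `aggregate`, the dictionary keys and the file names are hypothetical, and the seed is an arbitrary choice. It only illustrates the three rules stated above: one entry per segmentation file, an 80/20 split, and unlabeled images sent to an inference list.

```python
import json
import random


def aggregate(images_to_segs, train_fraction=0.8, seed=0):
    """Build train/test/inference lists from {image_path: [seg_path, ...]}.

    Hypothetical helper: the key names and the seed are illustrative only.
    """
    labeled_pairs = []
    inference = []
    for image, segs in sorted(images_to_segs.items()):
        if segs:
            # One entry per segmentation file, so an image with two
            # masks appears twice in the labeled list.
            labeled_pairs.extend({"image": image, "label": seg} for seg in segs)
        else:
            # No segmentation file: keep the image for inference only.
            inference.append({"image": image})

    rng = random.Random(seed)
    rng.shuffle(labeled_pairs)
    n_train = round(train_fraction * len(labeled_pairs))
    return {
        "train": labeled_pairs[:n_train],
        "test": labeled_pairs[n_train:],
        "inference": inference,
    }


if __name__ == "__main__":
    # Made-up file names, for illustration only.
    example = {
        "sub-01_T2w.nii.gz": ["sub-01_T2w_lesion-a.nii.gz", "sub-01_T2w_lesion-b.nii.gz"],
        "sub-02_T2w.nii.gz": ["sub-02_T2w_lesion-a.nii.gz"],
        "sub-03_T2w.nii.gz": [],
    }
    print(json.dumps(aggregate(example), indent=2))
```

Note that this sketch splits image/segmentation pairs, so the same image could land in both the training and testing lists. Whether the original split grouped entries by image or by subject is not stated here.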

Note

We are currently manually labelling images in CanProCo, therefore the numbers above are about to change.

Furthermore, running nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity raised many warnings about orientation or dimension mismatches. The orientation of each of the datasets above therefore needs to be corrected.
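
To make the kind of mismatch concrete, here is a small sketch of a geometry comparison between an image and its mask. This is not nnU-Net's actual integrity check, only a stand-in for the idea. The function name `header_mismatches` and the header layout (a dict with `shape` and `affine`) are assumptions for illustration; in practice these values would come from a NIfTI reader.

```python
def header_mismatches(image_header, mask_header, tol=1e-3):
    """Return a list of human-readable mismatches between two headers.

    Each header is a dict with 'shape' (tuple of ints) and 'affine'
    (4x4 nested list of floats). Both keys are hypothetical stand-ins
    for what a NIfTI reader would provide.
    """
    problems = []
    if tuple(image_header["shape"]) != tuple(mask_header["shape"]):
        problems.append(
            f"shape {tuple(image_header['shape'])} != {tuple(mask_header['shape'])}"
        )
    for r in range(4):
        for c in range(4):
            a = image_header["affine"][r][c]
            b = mask_header["affine"][r][c]
            # A sign flip or axis swap in the affine shows up as a large
            # element-wise difference, i.e. an orientation mismatch.
            if abs(a - b) > tol:
                problems.append(f"affine[{r}][{c}]: {a} != {b}")
    return problems


if __name__ == "__main__":
    image = {
        "shape": (64, 64, 20),
        "affine": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 3, 0], [0, 0, 0, 1]],
    }
    # Same grid, but the first axis is flipped in the mask header.
    mask = {
        "shape": (64, 64, 20),
        "affine": [[-1, 0, 0, 63], [0, 1, 0, 0], [0, 0, 3, 0], [0, 0, 0, 1]],
    }
    for problem in header_mismatches(image, mask):
        print(problem)
```

An empty list means the two headers agree within the tolerance, so the pair would not need any correction.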

@plbenveniste
Collaborator Author

plbenveniste commented Feb 16, 2024

The headers were corrected using dataset_correction.py on branch plb/dataset_aggregation. The script first changes the orientation to that of the image's header, then uses the SCT function sct_image to copy the image header to the lesion mask header.

Output of correction:

  • Canproco : 1511 files corrected (pushed to branch plb/correct_header_2 on canproco dataset git annex): this includes SC seg, lesion seg and disc labels.
  • Basel : 272 files corrected (pushed to branch plb/correct_header on basel dataset git annex): this includes SC seg and lesion seg
  • Sct-testing-large: 1241 files corrected (pushed to branch plb/correct_header on sct-testing-large dataset git annex) : this only includes lesion segmentations
  • Bavaria-quebec : 213 files corrected (pushed to branch plb/correct_header on bavaria-quebec dataset git annex) : this only includes lesion segmentations

This is the issue that deals with the modified data: neuropoly/data-management#301

@jcohenadad
Member

jcohenadad commented Feb 28, 2024

I just remembered that we also have a lot of data from UMass (git-annex data: umass-ms-*, 3 datasets)

@plbenveniste
Collaborator Author

Referencing these issues regarding the problem caused by the dataset_correction.py script: issue 301 and issue 305.

The dataset_correction.py script should no longer be used, or should at least be corrected first.

@plbenveniste
Collaborator Author

plbenveniste commented Jun 26, 2024

I updated the new code so that it aggregates the following labelled datasets:

  • basel-ms-mp2rage
  • bavaria-quebec-spine-ms-unstitched
  • canproco
  • nih-ms-mp2rage
  • sct-testing-large

The command run on kronos was:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

The output is the following:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1636
Number of images in validation set: 569
Number of images in test set: 544
Total number of images in the dataset: 2749

The total number of images in the dataset (2749) is different from the total number of derivatives (4407) because we decided to keep only those which have lesions.

The output is the following file: dataset_2024-06-26_seed42_lesionOnly.json
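
As a rough illustration of how a seeded, lesion-only train/validation/test split can be produced, here is a minimal sketch. It is not the code in 1_create_msd_data.py. The function name `split_lesion_only`, the `has_lesion` field and the 60/20/20 fractions are assumptions: the fractions roughly match the counts reported above (1636/569/544), but the exact ratio and splitting strategy are not stated in this thread. Only the seed value 42 comes from the output file name.

```python
import random


def split_lesion_only(entries, seed=42, fractions=(0.6, 0.2, 0.2)):
    """Keep entries that contain lesions, then split them reproducibly.

    'entries' is a list of dicts with a boolean 'has_lesion' field
    (hypothetical field name). 'fractions' is an assumed ratio.
    """
    kept = [e for e in entries if e["has_lesion"]]

    # A dedicated Random instance makes the shuffle depend only on the seed.
    rng = random.Random(seed)
    rng.shuffle(kept)

    n_train = round(fractions[0] * len(kept))
    n_val = round(fractions[1] * len(kept))
    return {
        "train": kept[:n_train],
        "validation": kept[n_train:n_train + n_val],
        "test": kept[n_train + n_val:],
    }


if __name__ == "__main__":
    # Made-up entries: 10 with lesions, 2 without.
    entries = [{"image": f"img-{i:02d}.nii.gz", "has_lesion": i < 10} for i in range(12)]
    split = split_lesion_only(entries)
    for name, subset in split.items():
        print(name, len(subset))
```

Running the function twice with the same seed and the same input order gives the same split, which is what makes a file such as a seed-tagged dataset JSON reproducible.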

@jcohenadad
Member

"we decided to keep only those which have lesions."

For now, but in the future it may be desirable to develop a model that also has good specificity (i.e., a high true negative rate).

@plbenveniste
Collaborator Author

There was an issue in the code when gathering segmentations from nih-ms-mp2rage.
The code was run again:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

This is the output of the code:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1712
Number of images in validation set: 590
Number of images in test set: 569
Total number of images in the dataset: 2871
