Training fails with IndexError when performing train/test dataset splitting #104

Open
raehik opened this issue Nov 21, 2023 · 1 comment
raehik commented Nov 21, 2023

Running a simple training command on some low-resolution training data (only 100 samples) raises an error in train_for_one_epoch, at the point where we call enumerate(dataloader). On the main branch, run:

mlflow run . --experiment-name raehik -e train --env-manager=local \
-P forcing_data_path=<path> \
-P learning_rate=0/5e-4/15/5e-5/30/5e-6 -P n_epochs=200 -P weight_decay=0.00 -P train_split=0.8 \
-P test_split=0.85 -P model_module_name=models.models1 -P model_cls_name=FullyCNN -P batchsize=4 \
-P transformation_cls_name=SoftPlusTransform -P submodel=transform3 \
-P loss_cls_name=HeteroskedasticGaussianLossV2

On main (as of the merging of #97 in early December 2023)

python src/gz21_ocean_momentum/cli/train.py \
--in-train-data-dir <path> --subdomains-file examples/cli-configs/training-subdomains-paper.yaml \
--initial-learning-rate 5.0e-4 --decay-at-epoch-milestones 15 --decay-at-epoch-milestones 30 --decay-factor 0.00 \
--train-split 0.8 --test-split 0.85 --batch-size 4 --epochs 200

This gives IndexError: index <x> is out of bounds for axis 0 with size 80. The index seems to fall between 80 and 320 (I've certainly seen low 80s and high 300s).

There are 320 samples across all training subdomains combined (4 spatial domains, 80 training samples each, i.e. 80% of each domain). We batch with a size of 4. I've tried investigating and tinkering with these settings, but I haven't managed to resolve it.
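To make the numbers above concrete, here is a minimal sketch of the index arithmetic involved when four 80-sample training subsets are concatenated. It is not the project's code: `locate` and `boundaries` are hypothetical names, and the sketch only illustrates how an index in the combined range 0..319 would need to be translated into a per-domain index below 80. If that translation were skipped somewhere, an index of 80 or above would reach an axis of size 80, which would match the error message, but this is a guess rather than a diagnosis.

```python
from bisect import bisect_right
from itertools import accumulate

# Numbers from the report: 4 spatial domains, 80 training samples each.
train_sizes = [80, 80, 80, 80]
# Running totals mark where each domain ends in the concatenated index space.
boundaries = list(accumulate(train_sizes))  # [80, 160, 240, 320]


def locate(global_idx, boundaries):
    """Map an index into the concatenated data to (domain, local index).

    Hypothetical helper for illustration only.
    """
    if not 0 <= global_idx < boundaries[-1]:
        raise IndexError(f"index {global_idx} is outside 0..{boundaries[-1] - 1}")
    # Number of boundaries <= global_idx equals the domain the index falls in.
    domain = bisect_right(boundaries, global_idx)
    start = boundaries[domain - 1] if domain > 0 else 0
    return domain, global_idx - start


for idx in (0, 79, 80, 319):
    print(idx, locate(idx, boundaries))
```

Every local index returned this way stays below 80, so a correctly translated lookup should never trigger the out-of-bounds error on a size-80 axis.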

Doing either of these prevents the issue from occurring:

  • skip the subsetting (use the whole dataset for training)
  • use a single spatial domain

The problem seems to lie somewhere in Subset_ or related code, or in the xarray dataset I generated in the data step.
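One way to narrow this down without running a full training epoch might be to walk a dataset object index by index and report the first index that fails. The sketch below assumes only that the object supports `len()` and integer indexing; `first_failing_index` and `BrokenConcat` are hypothetical names, and `BrokenConcat` is a toy stand-in that deliberately misbehaves, not a model of what Subset_ actually does.

```python
def first_failing_index(dataset):
    """Return (index, error) for the first index that raises IndexError, else None."""
    for i in range(len(dataset)):
        try:
            dataset[i]
        except IndexError as err:
            return i, err
    return None


class BrokenConcat:
    """Toy stand-in: reports the combined length of its parts but forwards
    every index, untranslated, to the first part."""

    def __init__(self, parts):
        self.parts = parts

    def __len__(self):
        return sum(len(p) for p in self.parts)

    def __getitem__(self, i):
        return self.parts[0][i]


# Four parts of 80 items each, mirroring the sizes in the report.
toy = BrokenConcat([list(range(80)) for _ in range(4)])
print(first_failing_index(toy))
print(first_failing_index(list(range(10))))
```

Running the same check against the real training dataset (before it is wrapped in a dataloader) could show whether the failure depends on shuffling or batching, or is already present in plain indexing.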

@raehik raehik changed the title IndexError on Subset_ PyTorch dataset Training fails with IndexError when performing train/test dataset splitting Dec 4, 2023
@raehik raehik added the bug Something isn't working label Dec 4, 2023
@dorchard
Collaborator

@CemGultekin1 I wonder if you came across something similar to this when looking at the gz code?
