Losses from validation set are NAN #91

Open
dhorangic opened this issue Jul 17, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@dhorangic

I have a training dataset of 384 labeled images. I have checked the dataset: since all of these are simulated images, they should not be mislabeled, and the dataset itself should not have issues such as negative x/y values for bounding boxes. I also have a validation set of 95 images, which are likewise simulated images that I generated with some other code. I load the validation and training sets with core.Dataset and then pass them to the model with the following code:

mdl.fit(dataset=trainset, val_dataset = validset, epochs=10, learning_rate=0.01, verbose=True)

I run the code in a Google Colab notebook; before running my code I have to install the detecto package.
When I run the code, the loss comes out as NaN:
Screenshot (16)
I'm not sure why this is happening; any help is appreciated.
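
The dataset checks described above (no negative x/y values, no broken boxes) can be automated. The following is a minimal sketch, assuming the annotations are Pascal VOC style XML files with `object`, `name`, and `bndbox` elements containing `xmin`, `ymin`, `xmax`, `ymax`. The function names `find_bad_boxes` and `scan_folder` are hypothetical helpers, not part of Detecto. Boxes with zero width or height are a commonly reported cause of NaN losses in detection training, although nothing in this thread confirms that they are the cause here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_bad_boxes(xml_text):
    """Return (label, reason) pairs for suspicious boxes in one annotation."""
    root = ET.fromstring(xml_text)
    problems = []
    for obj in root.iter("object"):
        label = obj.findtext("name", default="?")
        box = obj.find("bndbox")
        if box is None:
            problems.append((label, "missing bndbox"))
            continue
        try:
            # float(None) raises TypeError when a coordinate tag is missing
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            problems.append((label, "unreadable coordinate"))
            continue
        if min(xmin, ymin, xmax, ymax) < 0:
            problems.append((label, "negative coordinate"))
        if xmax <= xmin or ymax <= ymin:
            problems.append((label, "zero or negative width/height"))
    return problems


def scan_folder(folder):
    """Map each XML file name in `folder` to its list of problems, if any."""
    report = {}
    for path in sorted(Path(folder).glob("*.xml")):
        problems = find_bad_boxes(path.read_text())
        if problems:
            report[path.name] = problems
    return report
```

Running `scan_folder` on both the training and the validation folder and getting an empty dictionary back from each would rule out these particular annotation problems, though not others (for example, boxes that extend past the image border).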

@dhorangic dhorangic added the bug Something isn't working label Jul 17, 2021
@dhorangic
Author

In addition to this, I am seeing some strange behavior with predictions. Two days ago I trained the model with the exact same code, minus the validation set, and got this prediction for an unseen image:
predictions1
However, now when I train the model with the exact same code (no validation set), no predictions are generated, neither on a previously seen training set image nor on the unseen image:
Screenshot (19)
These predictions should all be the same, since no code was changed.
My notebook with the code is here

@alankbi
Owner

alankbi commented Jul 18, 2021

Could you take a look at #36 and see if any of those solutions work for you? Looking at the second image my guess is that it's the NaN issue causing the model to not train properly and therefore not produce any results.

This is definitely a weird issue. If none of those work, it could be worth trying to create a new notebook from scratch to see if that helps at all.
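
To narrow down when the loss first goes bad, it can help to scan the recorded loss values for the first non-finite entry. This is a minimal sketch using only the standard library; `first_non_finite` is a hypothetical helper, and it assumes the losses have already been collected into a plain list of floats by whatever means the training code allows.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for index, value in enumerate(losses):
        if not math.isfinite(value):
            return index
    return None
```

If the very first value is already NaN, the input data is a more likely suspect than training instability; if the loss turns NaN only after several steps, a lower learning rate may be worth trying.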

@dhorangic
Author

Hello, I looked at #36 and went through the code that generates the dataset again to make sure there were no formatting issues.
I have created a new notebook and am running the code again right now.
Here I uploaded an XML file I created. I just want to make sure the format is correct, since it seems like the other person's issue was with their dataset.
One question: the content tag in the .xml files is wrong. It has 'content', but the files actually end up in the folder 'train' or 'test'. Will this affect the results (does the tag have to be accurate)?

@dhorangic
Author

The model finished training in a brand new notebook:
Screenshot (21)
Same issue. It won't even make a prediction on images that were already in the training set; it returns empty tensors.
I will fix the .xml tag now to see if that helps.

@alankbi
Owner

alankbi commented Aug 4, 2021


The content tag shouldn't affect things at all, so that should hopefully not be the issue.

Were you able to ultimately figure this out? It seems like this issue has shown up now and then for a few people, but I haven't been able to find a clear fix that works for everyone.

@makya-stell

Error

I am getting a similar error, except that it is a division by zero error. Any suggestions on what I am doing wrong when calculating my loss?

@alankbi
Owner

alankbi commented Mar 20, 2022

It looks like your validation dataset doesn't have any images in it on line 548 - what's the output when you run len(test_dataset)?

If it's 0, then this likely indicates some kind of issue with the format of the folder containing the XML files. If so, could you share an image of what those look like?
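
The check suggested above can be sketched as follows. This is a minimal illustration, assuming the annotations are `.xml` files sitting directly in one folder; `count_xml_files` and `mean_loss` are hypothetical helpers and not part of Detecto. The point is that averaging over an empty dataset divides by zero, so it is better to fail early with a clear message.

```python
from pathlib import Path


def count_xml_files(folder):
    """Count the .xml annotation files directly inside `folder`."""
    return sum(1 for _ in Path(folder).glob("*.xml"))


def mean_loss(losses):
    """Average a list of loss values, failing loudly if the list is empty."""
    if len(losses) == 0:
        # An empty list here usually means the dataset loaded zero samples.
        raise ValueError("no losses to average: is the validation dataset empty?")
    return sum(losses) / len(losses)
```

If `count_xml_files` returns 0 for the validation folder, the folder path or layout is the first thing to check before looking at the loss computation itself.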
