Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should the cookbook default to stricter requirements when merging/concatenating data? #319

Open
dougiesquire opened this issue Jan 25, 2023 · 8 comments

Comments

@dougiesquire
Copy link
Collaborator

Motivating example here: COSIMA/cosima-recipes#229. Files with the same naming in a single experiment are on different domains depending on the output* directory. Should the cookbook check whether indexes are the same for data being merged/concatenated?

E.g. passing join="exact" to the open_mfdataset() call within the cosima-cookbook will then return an error when indexes to be aligned are not equal. This could currently be passed through kwargs, but should it be the default?

@rmholmes
Copy link

rmholmes commented Jan 25, 2023

Thanks for catching this @dougiesquire.

I'm not sure this is a cookbook issue. I think it's more an issue with the data itself. I don't think it's a good idea to have output defined on different regions using the same file name. I'd suggest that a good way to deal with this issue is to rename the ocean_daily_3d_u_%.nc files in output196-output279 to something like ocean_daily_3d_u_southern_ocean_%.nc. They can then be separated using the nc_file argument to cc.querying.getvar.

But I guess even then, it would still be useful to flag it so that the user knows they have to use nc_file.

@dougiesquire
Copy link
Collaborator Author

Thanks @rmholmes. I wasn't meaning to suggest that the issue is with the cookbook, but having join="exact" as default would've saved me a bunch of time yesterday. I.e. it could be useful for helping to find/flag issues with the data.

My guess is that most uses of the cookbook are to query/load datasets that should have consistent indexes. So having join="exact" as default could make sense - users could always override the kwarg if they want to merge inconsistent data.
But, I'm probably just not across the full range of cookbook use cases.

@dougiesquire
Copy link
Collaborator Author

But yes, for fixing the specific issue with 01deg_jra55v13_ryf9091, changing the name of the nc files sounds sensible to me. Who would be in charge of doing that?

@rmholmes
Copy link

rmholmes commented Jan 25, 2023

Sorry @dougiesquire, I didn't completely take in your comment here as I'd copied my response across from the cosima-recipes issue you'd put up. I'd support a move to the stricter requirements.

I think @AndyHoggANU ran that simulation.

@angus-g
Copy link
Collaborator

angus-g commented Jan 25, 2023

I agree that join="exact" seems like a sensible default, it's a bit crazy that xarray tries to concatenate datasets like that in the first place!

@aidanheerdegen
Copy link
Collaborator

Is there any time penalty with join="exact"? ISTR folks complaining about xarray doing time consuming checks on coordinates under some circumstances, but it may be me misremembering, or the issue may no longer be a problem.

@angus-g
Copy link
Collaborator

angus-g commented Jan 25, 2023

I think the checks by compat are more expensive than those for join (which just tries to align dimension sizes)? Probably one of those things where it's best to just benchmark it.

@dougiesquire
Copy link
Collaborator Author

dougiesquire commented Jan 25, 2023

Agreed. I wouldn't expect any difference in speed for data that can be joined.

As @angus-g mentioned, there are other kwargs that can be changed to improve performance, but they require making some assumptions about the data being loaded. I don't know whether these are justified for the COSIMA data?

EDIT: see the Note here: https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants