
OSError: [Errno 24] Too many open files #94

Open
Eisbrenner opened this issue Dec 2, 2020 · 9 comments

Eisbrenner commented Dec 2, 2020

Using Intake (intake-xarray) on a large number of files, e.g. daily data spanning one or two decades, results in a too-many-open-files error.

Loading the same set of data with xarray.open_mfdataset works just fine.

Versions:

python 3.8.2
xarray 0.16.1
intake 0.6.0
intake-xarray 0.4.0

For me, the total number of files was 9464.

The Intake catalog I have looks something like this:

metadata:
  version: 1

plugins:
  source:
    - module: intake_xarray

sources:
  daily_mean:
    driver: netcdf
    args:
      urlpath: "{{ env(HOME) }}/path/to/data*.nc"
      xarray_kwargs:
        combine: by_coords
        parallel: True

Then, using

intake.open_catalog(path_to_catalog)["daily_mean"].to_dask().chunk({"time": -1, "longitude": 10, "latitude":10})

throws an error of the form

OSError: [Errno 24] Too many open files: 'path/to/catalog.yml'

Loading the same data with

xr.open_mfdataset("~/path/to/data*.nc", combine="by_coords", parallel=True).chunk({"time": -1, "longitude": 10, "latitude":10})

works just fine.

martindurant (Member) commented Dec 2, 2020

I don't immediately know the reason, but if this is running on macOS, the open-files limit is pretty low by default. You can do something like ulimit -n 40096.
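
For a long-lived session, the same limit can also be raised from within Python via the standard library's resource module (POSIX only; the soft limit can only be raised up to the hard limit set by the OS):

import resource

# Raise the soft limit on open file descriptors as far as the hard limit allows.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(40096, hard), hard))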

scottyhq (Collaborator) commented Dec 2, 2020

@Eisbrenner - could you please try running the same code on intake-xarray master? pip install git+https://github.com/intake/intake-xarray.git@master

Eisbrenner (Author) commented Dec 2, 2020

@martindurant I've seen the workaround of increasing the ulimit; however, I think this behavior was changed directly in xarray, and the limit should not be breached anymore. I thought it would be beneficial for Intake to be aware of this too, so I'm sharing it regardless.
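
(For context: recent xarray versions keep open files in an internal least-recently-used cache and close/reopen them transparently, which is why open_mfdataset stays under the limit. A minimal sketch of adjusting that cap, assuming an xarray version that exposes the file_cache_maxsize option:)

import xarray as xr

# xarray holds at most file_cache_maxsize file handles open at once
# (default 128); the rest are closed and transparently reopened on demand.
xr.set_options(file_cache_maxsize=128)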

@scottyhq
I'll check the master branch tomorrow!

Eisbrenner (Author) commented:

With version intake-xarray 0.4.0+23.g2f4bfb3 I still get the same error. Also, I might add that my ulimit is in fact below the file count.
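
(A quick way to compare the two numbers, assuming a POSIX system; the glob below is a placeholder standing in for the catalog's urlpath:)

import glob
import resource

print(len(glob.glob("/path/to/data*.nc")))         # files matched by the glob
print(resource.getrlimit(resource.RLIMIT_NOFILE))  # current (soft, hard) limits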

martindurant (Member) commented:

Is the error happening during a compute(), or while creating the xarray object?

Eisbrenner (Author) commented:

While creating the object; the code below is all I'm doing. Maybe there is something in these few lines that I'm not aware of; I'm still trying to get my head around some of this, Dask in general for example.

1. The xarray.open_mfdataset variant:
data = (
    xr.open_mfdataset(
        "/path/to/data/metoffice_foam1_amm7_NWS_SAL_dm*.nc",
        parallel=True,
        combine="by_coords",
    )
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)
2. The intake.open_catalog variant (the catalog is as shown in the initial post above):
data = (
    intake.open_catalog("/path/to/catalog/copernicus-reanalysis.yml")["daily_mean"]
    .to_dask()
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)

The error occurs with this second command.

martindurant (Member) commented:

Can you please compare intake.open_catalog(path_to_catalog)["daily_mean"].to_dask() versus xr.open_mfdataset(...)? I assume they are not identical, if the former completes at all.

Eisbrenner (Author) commented Dec 4, 2020

data = intake.open_catalog(path_to_catalog)["daily_mean"].to_dask()
# [...]
# [...].venv/lib/python3.8/site-packages/fsspec/implementations/local.py in _open(self)
# OSError: [Errno 24] Too many open files: '[...]'
data = xr.open_mfdataset(path_to_data, parallel=True, combine="by_coords")
type(data)
# xarray.core.dataset.Dataset

I'll quickly check what the output is with a small enough set of files.

EDIT:

data = intake.open_catalog(path_to_catalog)["test_daily_mean"].to_dask()
type(data)
# xarray.core.dataset.Dataset

here "test_daily_mean" is just a subset of the files, e.g. "/path/to/data/files_199*.nc" instead of "/path/to/data/files_*.nc"

martindurant (Member) commented:

In NetCDFSource._open_dataset

        if self._can_be_local:
            url = fsspec.open_local(self.urlpath, **self.storage_options)
        else:
            # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
            url = fsspec.open(self.urlpath, **self.storage_options).open()

and local files are held open. Perhaps it would make sense to explicitly check for URLs that are already local, pass them straight to xarray, and let it do the opening of files in that case.
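
A minimal sketch of that idea (the helper name _resolve_urlpath is hypothetical; it reuses fsspec.utils.infer_storage_options and the can_be_local check the source already relies on):

import fsspec
from fsspec.utils import can_be_local, infer_storage_options

def _resolve_urlpath(urlpath, storage_options):
    # Hypothetical helper: paths that are already local are returned unchanged,
    # so xarray's own file manager decides when each file is opened and closed.
    if infer_storage_options(urlpath)["protocol"] == "file":
        return urlpath
    if can_be_local(urlpath):
        # e.g. "simplecache::s3://...": materialize local copies first
        return fsspec.open_local(urlpath, **storage_options)
    # genuinely remote: hand xarray an open file-like object
    return fsspec.open(urlpath, **storage_options).open()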
