
OSError: [Errno 24] Too many open files #94

Open
Eisbrenner opened this issue Dec 2, 2020 · 9 comments

Eisbrenner commented Dec 2, 2020

Using Intake (intake-xarray) on a large number of files, e.g. daily data spanning one or two decades, results in a too-many-open-files error.

Loading the same set of data with xarray.open_mfdataset works just fine.

Versions:

python 3.8.2
xarray 0.16.1
intake 0.6.0
intake-xarray 0.4.0

For me, the total number of files was 9464.

The Intake catalog I have looks something like this:

metadata:
  version: 1

plugins:
  source:
    - module: intake_xarray

sources:
  daily_mean:
    driver: netcdf
    args:
      urlpath: "{{ env(HOME) }}/path/to/data*.nc"
      xarray_kwargs:
        combine: by_coords
        parallel: True

Then, using

intake.open_catalog(path_to_catalog)["daily_mean"].to_dask().chunk({"time": -1, "longitude": 10, "latitude":10})

throws an error of the form

OSError: [Errno 24] Too many open files: 'path/to/catalog.yml'

Loading the same data with

xr.open_mfdataset("~/path/to/data*.nc", combine="by_coords", parallel=True).chunk({"time": -1, "longitude": 10, "latitude":10})

works just fine.

martindurant (Member) commented Dec 2, 2020

I don't immediately know the reason, but if this is running on macOS, the open-files limit is pretty low by default. You can do something like ulimit -n 40096.
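
For a long-lived session, the same limit can also be raised from within Python via the standard library's resource module (POSIX only; the soft limit can only be raised up to the hard limit set by the OS):

import resource

# Raise the soft limit on open file descriptors as far as the hard limit allows.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(40096, hard), hard))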

scottyhq (Collaborator) commented Dec 2, 2020

@Eisbrenner - could you please try running the same code on intake-xarray master? pip install git+https://github.com/intake/intake-xarray.git@master

Eisbrenner (Author) commented Dec 2, 2020

@martindurant I've seen the workaround of increasing the ulimit; however, I think this behavior was changed directly in xarray, and the limit should not be breached anymore. I thought it would be beneficial for Intake to be aware of this too, so I'm sharing it regardless.
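
(For context: recent xarray versions keep open files in an internal least-recently-used cache and close/reopen them transparently, which is why open_mfdataset stays under the limit. A minimal sketch of adjusting that cap, assuming an xarray version that exposes the file_cache_maxsize option:)

import xarray as xr

# xarray holds at most file_cache_maxsize file handles open at once
# (default 128); the rest are closed and transparently reopened on demand.
xr.set_options(file_cache_maxsize=128)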

@scottyhq
I'll check the master branch tomorrow!

Eisbrenner (Author) commented:

With version intake-xarray 0.4.0+23.g2f4bfb3 I still get the same error. Also, I might add that my ulimit is in fact below the file count.
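
(A quick way to compare the two numbers, assuming a POSIX system; the glob below is a placeholder standing in for the catalog's urlpath:)

import glob
import resource

print(len(glob.glob("/path/to/data*.nc")))         # files matched by the glob
print(resource.getrlimit(resource.RLIMIT_NOFILE))  # current (soft, hard) limits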

martindurant (Member) commented:

Is the error happening during a compute(), or while creating the xarray object?

Eisbrenner (Author) commented:

While creating the object; the code below is all I'm doing. Maybe there is something in these few lines that I'm not aware of; I'm still trying to get my head around some of this, Dask in general for example.

1. The xarray.open_mfdataset variant:
data = (
    xr.open_mfdataset(
        "/path/to/data/metoffice_foam1_amm7_NWS_SAL_dm*.nc",
        parallel=True,
        combine="by_coords",
    )
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)
2. The intake.open_catalog variant (the catalog is as shown in the initial post above):
data = (
    intake.open_catalog("/path/to/catalog/copernicus-reanalysis.yml")["daily_mean"]
    .to_dask()
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)

The error occurs with this second command.

martindurant (Member) commented:

Can you please compare intake.open_catalog(path_to_catalog)["daily_mean"].to_dask() versus xr.open_mfdataset(...)? I assume they are not identical, if the former completes at all.

Eisbrenner (Author) commented Dec 4, 2020

data = intake.open_catalog(path_to_catalog)["daily_mean"].to_dask()
# [...]
# [...].venv/lib/python3.8/site-packages/fsspec/implementations/local.py in _open(self)
# OSError: [Errno 24] Too many open files: '[...]'
data = xr.open_mfdataset(path_to_data, parallel=True, combine="by_coords")
type(data)
# xarray.core.dataset.Dataset

I'll quickly check what the output is with a small enough set of files.

EDIT:

data = intake.open_catalog(path_to_catalog)["test_daily_mean"].to_dask()
type(data)
# xarray.core.dataset.Dataset

here "test_daily_mean" is just a subset of the files, e.g. "/path/to/data/files_199*.nc" instead of "/path/to/data/files_*.nc"

martindurant (Member) commented:

In NetCDFSource._open_dataset

        if self._can_be_local:
            url = fsspec.open_local(self.urlpath, **self.storage_options)
        else:
            # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
            url = fsspec.open(self.urlpath, **self.storage_options).open()

and local files are held open. Perhaps it would make sense to explicitly check for URLs that are already local, pass them straight to xarray, and let it do the opening of files in that case.
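
A minimal sketch of that idea (the helper name _resolve_urlpath is hypothetical; it reuses fsspec.utils.infer_storage_options and the can_be_local check the source already relies on):

import fsspec
from fsspec.utils import can_be_local, infer_storage_options

def _resolve_urlpath(urlpath, storage_options):
    # Hypothetical helper: paths that are already local are returned unchanged,
    # so xarray's own file manager decides when each file is opened and closed.
    if infer_storage_options(urlpath)["protocol"] == "file":
        return urlpath
    if can_be_local(urlpath):
        # e.g. "simplecache::s3://...": materialize local copies first
        return fsspec.open_local(urlpath, **storage_options)
    # genuinely remote: hand xarray an open file-like object
    return fsspec.open(urlpath, **storage_options).open()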
