Tasks getting killed on Jasmin due to stratify being called from esmvalcore.preprocessor._regrid.extract_levels() preprocessor #3244

Closed · ledm opened this issue Jun 26, 2023 · 20 comments

@ledm (Contributor) commented Jun 26, 2023

On JASMIN, jobs are being killed when the following code runs:

import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"

cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])

This occurs with several versions of ESMValCore (2.8.0, 2.8.1, 2.9.0).

The error occurs for all four schemes and a range of level values (0.0, 0.1, 0.5):

c2 = extract_levels(cube, scheme='nearest', levels=[0.1])              # killed
c2 = extract_levels(cube, scheme='nearest', levels=[0.5])              # killed
c2 = extract_levels(cube, scheme='linear', levels=[0.5])               # killed
c2 = extract_levels(cube, scheme='nearest_extrapolate', levels=[0.5])  # killed
c2 = extract_levels(cube, scheme='linear_extrapolate', levels=[0.5])   # killed

In all cases, the error occurs here: https://github.com/ESMValGroup/ESMValCore/blob/1101d36e3f343ec823842ea7c3f4b941ee942a89/esmvalcore/preprocessor/_regrid.py#L870

    # Now perform the actual vertical interpolation.
    new_data = stratify.interpolate(levels,
                                    src_levels_broadcast,
                                    cube.core_data(),
                                    axis=z_axis,
                                    interpolation=interpolation,
                                    extrapolation=extrapolation)

Stratify (version 0.3.0) is a C/Python interface wrapper and it has caused trouble before. It is not lazy, so it may try to load 120 GB files into memory, among other issues. My previous solution to this problem was to write my own preprocessor:

ESMValGroup/ESMValCore#1039
ESMValGroup/ESMValCore#1048

That work has been abandoned, but I'm tempted to bring it back. (The deadline for this piece of work is 24th July!)

This is an extension of the discussion here: #3239

@bouweandela (Member)

stratify has been lazy since v0.3.0, and the extract_levels preprocessor is lazy in the ESMValCore development branch and in the release candidate ESMValCore v2.9.0rc1. The iris function broadcast_to_shape is now lazy (SciTools/iris#5359), but that change is not yet in a released version of iris.

You could try installing iris from source (clone the repository and run pip install .) or wait for the upcoming iris 3.6.1 release.
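For example:

git clone https://github.com/SciTools/iris.git
cd iris
pip install .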

@ledm (Contributor, Author) commented Jun 26, 2023

Okay, that's not the problem either!

Just loading the data is enough for it to get killed!

import iris
# from esmvalcore.preprocessor._regrid import extract_levels
from esmvalcore.preprocessor._volume import extract_surface

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)

print(cube)
print(cube.data[:, 0, :, :])

This also results in Killed, so it's not the fault of @bjlittle's stratify.

Just loading this data file breaks things.

@bouweandela (Member)

That's because you're trying to load all the data into memory; maybe it doesn't fit?

Try something like:

import iris

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)

print(cube)
print(cube.core_data()[:, 0, :, :])

@bouweandela (Member)

See also ESMValGroup/ESMValCore#2114

@ledm (Contributor, Author) commented Jun 27, 2023

This works, thanks! ...but it returns a dask array, which is not what I want. I just want to extract the surface layer of a cube and get a cube back (convert 4D -> 3D, or 3D -> 2D). extract_layer is unable to do that for these files either!
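A minimal sketch of one way to get that while staying lazy (assuming the depth axis is the second dimension of this 4D cube): index the cube itself rather than its data, since indexing returns a cube and does not realise anything.

import iris

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)

# Indexing a cube returns a cube (4D -> 3D here) and keeps the data lazy.
surface = cube[:, 0]
print(surface.has_lazy_data())  # True: nothing is read until .data is touched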

@ledm (Contributor, Author) commented Jun 27, 2023

Also, I should say that I've tried reordering the preprocessors and hit the same problem with regrid as well. I think that one likely also realises the data, @bouweandela.

@valeriupredoi (Contributor)

iris=3.6.1 is now available on conda-forge and gets pulled into our environment, so if you can, try regenerating the env and using it to see if that fixes your issue, @ledm 🍺

@ledm (Contributor, Author) commented Jun 27, 2023

Just to confirm my email, @valeriupredoi: updating to iris=3.6.1 does not solve this issue.

Method:

mamba install iris=3.6.1

then, in the ESMValCore directory:

pip install --editable '.[develop]'

Then, in an interactive Python session:

>>> import iris
>>> iris.__version__
'3.6.1'
>>> import esmvalcore
>>> esmvalcore.__version__
'2.9.0.dev0+gb12682d2a.d20230627'
>>> import stratify
>>> stratify.__version__
'0.3.0'

@ledm (Contributor, Author) commented Jun 27, 2023

Okay, so more investigation: watching top while running the script at the start of this issue shows a huge spike in memory usage. The file itself is only 2 GB, but I've seen up to 8 GB in top. Memory usage several times larger than the file suggests a memory issue in iris/stratify.

This is probably why reordering the preprocessors failed me earlier. I had assumed that extracting a smaller region first, then the surface layer, would mean less memory was needed (it didn't work!). If there is a memory leak, it doesn't really matter how small a region you extract, as it will leak and break anyway.

@valeriupredoi (Contributor)

@ledm here's what I found out: the script you gave me, i.e.

import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"

cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])

needs 13 GB of resident (RES) memory to run to completion; this is with:

esmvalcore                2.9.0rc1           pyh39db41b_0    conda-forge/label/esmvalcore_rc
esmvaltool                2.9.0.dev41+gda7f3dbe6          pypi_0    pypi
iris                      3.6.1              pyha770c72_0    conda-forge
python-stratify           0.3.0           py311h1f0f07a_0    conda-forge

The file in question is indeed 2 GB, but remember that's a compressed netCDF4 file, usually with a ~40% compression factor. That means extract_levels loads the entire dataset into memory roughly three times over. @bouweandela says extract_levels is not lazy, and it's very clear just how not lazy it is; why the footprint is so bad, i.e. about 3x larger in memory than the actual data size, is beyond me. Sorry I misinterpreted things by thinking the new iris would solve this; obviously not. But the question is: why is sci3 killing your job when it only needs 13 GB of memory? Unless that job was different, I see no reason why. Now, I believe stratify is lazy these days, so we can go about making extract_levels lazy, and in fact we should do that, but in the meantime, try running on a node that may not kick you out 😁
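Back-of-the-envelope for that "roughly three times" figure, taking a 40% compression factor to mean the compressed file is ~40% of the raw data size (my reading of the numbers above, not a measurement):

# 2 GB compressed at ~40% of raw size -> ~5 GB of raw data;
# a 13 GB resident footprint is then ~2.6, i.e. roughly 3 full copies.
raw_gb = 2.0 / 0.4
print(raw_gb)       # 5.0
print(13 / raw_gb)  # ~2.6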

@valeriupredoi (Contributor)

The source of this problem is vinterp (old name) or stratify.interpolate() (new name) becoming completely realized/computed/not lazy due to levels and src_levels_broadcast being <class 'numpy.ndarray'>. This is exactly @bouweandela's issue ESMValGroup/ESMValCore#2114. Just to confirm: the data in the example above is indeed <class 'dask.array.core.Array'>, so making the coords lazy should be easy.
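A minimal check of the types involved, against the same file as above (this only inspects the inputs; it does not reproduce the ESMValCore internals):

import iris
import numpy as np

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)

# The cube's data itself is lazy...
print(type(cube.core_data()))  # <class 'dask.array.core.Array'>

# ...but target levels built like this are plain numpy; passing realised
# arrays alongside lazy data is what forces the computation into memory.
levels = np.array([0.1])
print(type(levels))  # <class 'numpy.ndarray'>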

@ledm (Contributor, Author) commented Jun 27, 2023

try running on a node that may not kick you out

Lol, if only it were that easy. This gets killed for me on sci1, sci3, sci4, sci6, and the LOTUS high-mem queue!

@valeriupredoi (Contributor)

sci2 did the trick for me. We now know where the problem lies, so a fix should follow 😁

@ledm (Contributor, Author) commented Jun 27, 2023

Okay, running my original recipe (lol, not fried chicken!) on sci2 now. Don't know if this is useful information, but it's now trying to download 20 GB of data from ESGF. Not sure why it never got that far before on sci1. (sci3 isn't connected to ESGF, I don't think.)

@ledm (Contributor, Author) commented Jun 30, 2023

Okay, so I reverted to ESMValTool 2.8 and iris 3.4. I'm still running out of memory, but at least it's failing properly, with a traceback, instead of just getting killed:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 8.76 GiB for an array with shape (1176120000,) and data type float64

Calling this a big W.
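As a quick sanity check, the reported size matches the shape: 1,176,120,000 float64 values at 8 bytes each is indeed ~8.76 GiB.

print(1176120000 * 8 / 2**30)  # ~8.76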

@ledm (Contributor, Author) commented Jun 30, 2023

Correction to the above: that was on sci2. On sci3, it just got killed the normal way. No idea what's going on. Starting to think it's a JASMIN thing. Will try sci6 next.

@ledm (Contributor, Author) commented Jul 6, 2023

Continuing with this, here's a minimal testing recipe.

https://github.com/ESMValGroup/ESMValTool/blob/AscensionIslandMarineProtectedArea/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_o2_testing.yml

On JASMIN sci1 this fails for me. If I comment out either dataset, it runs fine.

The fact that it works with one dataset but fails with two makes me think that perhaps something isn't being properly closed after it finishes? Or that it's trying to run two things at once, even though I have max_parallel_tasks: 1 in my config-user file.

@bouweandela (Member) commented May 6, 2024

The issue mentioned in the top post has been solved in ESMValGroup/ESMValCore#2120, which will be available in the upcoming v2.11.0 release of ESMValCore.

I also investigated the recipe in #3244 (comment):

@bouweandela (Member)

Continuing with this, here's a minimal testing recipe.

https://github.com/ESMValGroup/ESMValTool/blob/AscensionIslandMarineProtectedArea/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_o2_testing.yml

@ledm The recipe now runs with the ESMValCore main branch (and the soon-to-be-released v2.11.0). Even though regridding is not lazy, this isn't such a problem here, as the data has already been reduced in size a lot by computing the climate statistics and extracting vertical levels before regridding.

@bouweandela (Member)

With ESMValGroup/ESMValCore#2457 merged, regridding is now automatically lazy for data with 2D lat/lon coordinates as well.
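A sketch of what that enables (the 1x1 target grid and linear scheme are illustrative choices, not taken from this thread):

import iris
from esmvalcore.preprocessor import regrid

# The UKESM ocean file from this thread has 2D (curvilinear) lat/lon coords.
fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)

regridded = regrid(cube, target_grid='1x1', scheme='linear')
print(regridded.has_lazy_data())  # True once the linked PR is in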
