Tasks getting killed on Jasmin due to stratify being called from esmvalcore.preprocessor._regrid.extract_levels() preprocessor #3244

Closed · ledm opened this issue Jun 26, 2023 · 20 comments

@ledm (Contributor) commented Jun 26, 2023

On JASMIN, jobs are being killed when the following code runs:

import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"

cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])

This occurs with several versions of ESMValCore (2.8.0, 2.8.1, 2.9.0).

The error occurs for all four schemes and a range of level values (0.0, 0.1, 0.5):

c2 = extract_levels(cube, scheme='nearest', levels=[0.1])              # killed
c2 = extract_levels(cube, scheme='nearest', levels=[0.5])              # killed
c2 = extract_levels(cube, scheme='linear', levels=[0.5])               # killed
c2 = extract_levels(cube, scheme='nearest_extrapolate', levels=[0.5])  # killed
c2 = extract_levels(cube, scheme='linear_extrapolate', levels=[0.5])   # killed

In all cases, the error occurs here: https://github.com/ESMValGroup/ESMValCore/blob/1101d36e3f343ec823842ea7c3f4b941ee942a89/esmvalcore/preprocessor/_regrid.py#L870

    # Now perform the actual vertical interpolation.
    new_data = stratify.interpolate(levels,
                                    src_levels_broadcast,
                                    cube.core_data(),
                                    axis=z_axis,
                                    interpolation=interpolation,
                                    extrapolation=extrapolation)

Stratify (version 0.3.0) is a C/Python interface wrapper and it has caused trouble before. It is not lazy, so it may try to load 120 GB files into memory, among other issues. My previous solution to this problem was to write my own preprocessor:

ESMValGroup/ESMValCore#1039
ESMValGroup/ESMValCore#1048

That work has been abandoned, but I'm tempted to bring it back. (The deadline for this piece of work is 24th July!)

This is an extension of the discussion here: #3239

@bouweandela (Member)

stratify has been lazy since v0.3.0, and the extract_levels preprocessor is lazy in the ESMValCore development branch and in the release candidate ESMValCore v2.9.0rc1. The iris function broadcast_to_shape is now lazy (SciTools/iris#5359), but that change is not yet in a released version of iris.

You could try installing iris from source (clone the repository and run pip install .) or wait for the upcoming iris 3.6.1 release.
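For example:

git clone https://github.com/SciTools/iris.git
cd iris
pip install .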

@ledm (Contributor, Author) commented Jun 26, 2023

Okay, that's not the problem either!

Just loading the data is enough for it to get killed!

import iris
# from esmvalcore.preprocessor._regrid import extract_levels
from esmvalcore.preprocessor._volume import extract_surface

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)

print(cube)
print(cube.data[:, 0, :, :])

This also results in Killed, so it's not the fault of @bjlittle's stratify.

Just loading this data file breaks things.

@bouweandela (Member)

That's because you're trying to load all the data into memory; maybe it doesn't fit?

Try something like:

import iris

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)

print(cube)
print(cube.core_data()[:, 0, :, :])

@bouweandela (Member)

See also ESMValGroup/ESMValCore#2114

@ledm (Contributor, Author) commented Jun 27, 2023

This works, thanks! ...but it returns a dask array, which is not what I want. I just want to extract the surface layer of a cube and get a cube back (convert 4D -> 3D, or 3D -> 2D). extract_layer is unable to do that for these files either!
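A minimal sketch of one way to get that while staying lazy (assuming the depth axis is the second dimension of this 4D cube): index the cube itself rather than its data, since indexing returns a cube and does not realise anything.

import iris

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)

# Indexing a cube returns a cube (4D -> 3D here) and keeps the data lazy.
surface = cube[:, 0]
print(surface.has_lazy_data())  # True: nothing is read until .data is touched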

@ledm (Contributor, Author) commented Jun 27, 2023

Also, I should say that I've tried reordering the preprocessors and hit the same problem with regrid as well. I think that one likely also realises the data, @bouweandela.

@valeriupredoi (Contributor)

iris=3.6.1 is now available on conda-forge and gets pulled into our environment, so if you can, try regenerating the env and using it to see if that fixes your issue, @ledm 🍺

@ledm (Contributor, Author) commented Jun 27, 2023

Just to confirm my email, @valeriupredoi: updating to iris=3.6.1 does not solve this issue.

Method:

mamba install iris=3.6.1

then, in the ESMValCore directory:

pip install --editable '.[develop]'

Then, in an interactive Python session:

>>> import iris
>>> iris.__version__
'3.6.1'
>>> import esmvalcore
>>> esmvalcore.__version__
'2.9.0.dev0+gb12682d2a.d20230627'
>>> import stratify
>>> stratify.__version__
'0.3.0'

@ledm (Contributor, Author) commented Jun 27, 2023

Okay, so more investigation: watching top while running the script at the start of this issue shows a huge spike in memory usage. The file itself is only 2 GB, but I've seen up to 8 GB in top. Memory usage several times larger than the file suggests a memory issue in iris/stratify.

This is probably why reordering the preprocessors failed me earlier. I had assumed that extracting a smaller region first, then the surface layer, would mean less memory was needed (it didn't work!). If there is a memory leak, it doesn't really matter how small a region you extract, as it will leak and break anyway.

@valeriupredoi (Contributor)

@ledm here's what I found out: the script you gave me, i.e.

import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"

cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])

needs 13 GB of resident (RES) memory to run to completion; this is with:

esmvalcore                2.9.0rc1           pyh39db41b_0    conda-forge/label/esmvalcore_rc
esmvaltool                2.9.0.dev41+gda7f3dbe6          pypi_0    pypi
iris                      3.6.1              pyha770c72_0    conda-forge
python-stratify           0.3.0           py311h1f0f07a_0    conda-forge

The file in question is indeed 2 GB, but remember that's a compressed netCDF4 file, usually with a ~40% compression factor. That means extract_levels loads the entire dataset into memory roughly three times over. @bouweandela says extract_levels is not lazy, and it's very clear just how not lazy it is; why the footprint is so bad, i.e. about 3x larger in memory than the actual data size, is beyond me. Sorry I misinterpreted things by thinking the new iris would solve this; obviously not. But the question is: why is sci3 killing your job when it only needs 13 GB of memory? Unless that job was different, I see no reason why. Now, I believe stratify is lazy these days, so we can go about making extract_levels lazy, and in fact we should do that, but in the meantime, try running on a node that may not kick you out 😁
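Back-of-the-envelope for that "roughly three times" figure, taking a 40% compression factor to mean the compressed file is ~40% of the raw data size (my reading of the numbers above, not a measurement):

# 2 GB compressed at ~40% of raw size -> ~5 GB of raw data;
# a 13 GB resident footprint is then ~2.6, i.e. roughly 3 full copies.
raw_gb = 2.0 / 0.4
print(raw_gb)       # 5.0
print(13 / raw_gb)  # ~2.6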

@valeriupredoi (Contributor)

The source of this problem is vinterp (old name) or stratify.interpolate() (new name) becoming completely realized/computed/not lazy due to levels and src_levels_broadcast being <class 'numpy.ndarray'>. This is exactly @bouweandela's issue ESMValGroup/ESMValCore#2114. Just to confirm: the data in the example above is indeed <class 'dask.array.core.Array'>, so making the coords lazy should be easy.
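A minimal check of the types involved, against the same file as above (this only inspects the inputs; it does not reproduce the ESMValCore internals):

import iris
import numpy as np

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)

# The cube's data itself is lazy...
print(type(cube.core_data()))  # <class 'dask.array.core.Array'>

# ...but target levels built like this are plain numpy; passing realised
# arrays alongside lazy data is what forces the computation into memory.
levels = np.array([0.1])
print(type(levels))  # <class 'numpy.ndarray'>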

@ledm (Contributor, Author) commented Jun 27, 2023

try running on a node that may not kick you out

Lol, if only it were that easy. This gets killed for me on sci1, sci3, sci4, sci6, and the LOTUS high-mem queue!

@valeriupredoi (Contributor)

sci2 did the trick for me. We now know where the problem lies, so a fix should follow 😁

@ledm (Contributor, Author) commented Jun 27, 2023

Okay, running my original recipe (lol, not fried chicken!) on sci2 now. Don't know if this is useful information, but it's now trying to download 20 GB of data from ESGF. Not sure why it never got that far before on sci1. (sci3 isn't connected to ESGF, I don't think.)

@ledm (Contributor, Author) commented Jun 30, 2023

Okay, so I reverted to ESMValTool 2.8 and iris 3.4. I'm still running out of memory, but at least it's failing properly, with a traceback, instead of just getting killed:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 8.76 GiB for an array with shape (1176120000,) and data type float64

Calling this a big W.
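As a quick sanity check, the reported size matches the shape: 1,176,120,000 float64 values at 8 bytes each is indeed ~8.76 GiB.

print(1176120000 * 8 / 2**30)  # ~8.76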

@ledm (Contributor, Author) commented Jun 30, 2023

Correction to the above: that was on sci2. On sci3, it just got killed the normal way. No idea what's going on. Starting to think it's a JASMIN thing. Will try sci6 next.

@ledm (Contributor, Author) commented Jul 6, 2023

Continuing with this, here's a minimal testing recipe.

https://github.com/ESMValGroup/ESMValTool/blob/AscensionIslandMarineProtectedArea/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_o2_testing.yml

On JASMIN sci1 this fails for me. If I comment out either dataset, it runs fine.

The fact that it works with one dataset but fails with two makes me think that perhaps something isn't being properly closed after it finishes? Or that it's trying to run two things at once, even though I have max_parallel_tasks: 1 in my config-user file.

@bouweandela (Member) commented May 6, 2024

The issue mentioned in the top post has been solved in ESMValGroup/ESMValCore#2120, which will be available in the upcoming v2.11.0 release of ESMValCore.

I also investigated the recipe in #3244 (comment):

@bouweandela (Member)

Continuing with this, here's a minimal testing recipe.

https://github.com/ESMValGroup/ESMValTool/blob/AscensionIslandMarineProtectedArea/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_o2_testing.yml

@ledm The recipe now runs with the ESMValCore main branch (and the soon-to-be-released v2.11.0). Even though regridding is not lazy, this isn't such a problem here, as the data has already been reduced in size a lot by computing the climate statistics and extracting vertical levels before regridding.

@bouweandela (Member)

With ESMValGroup/ESMValCore#2457 merged, regridding is now automatically lazy for data with 2D lat/lon coordinates as well.
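A sketch of what that enables (the 1x1 target grid and linear scheme are illustrative choices, not taken from this thread):

import iris
from esmvalcore.preprocessor import regrid

# The UKESM ocean file from this thread has 2D (curvilinear) lat/lon coords.
fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)

regridded = regrid(cube, target_grid='1x1', scheme='linear')
print(regridded.has_lazy_data())  # True once the linked PR is in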
