Issue writing preprocessed cube for CESM2-WACCM dataset #3738

Open · rswamina opened this issue Aug 16, 2024 · 35 comments

@rswamina
Contributor

rswamina commented Aug 16, 2024

ESMValTool seems to have trouble writing a preprocessed cube after extracting IPCC regions for the CESM2-WACCM dataset. I don't know if the program hangs or is simply taking too long, but the log stops at the point where the cube is being saved under the preproc directory. I am using the esmvaltool module installed on JASMIN. As far as I can tell there are no typos!

Here is the recipe:

# ESMValTool
# recipe_test_ssp245_daily_pr.yml
---
documentation:
  description: |
    This is a recipe to download data sets from ESGF nodes and extract IPCC regions.

  authors:
    - swaminathan_ranjini

  title: |
    
    Recipe to download data from ESGF nodes and extract regions.


  maintainer:
    - swaminathan_ranjini

datasets: 

  - {dataset: CESM2-WACCM, project: CMIP6, exp: historical, ensemble: r(1:3)i1p1f1, start_year: 1995, end_year: 2014, grid: gn}

  - {dataset: CESM2-WACCM, project: CMIP6, exp: ssp245, ensemble: r(1:3)i1p1f1, start_year: 2081, end_year: 2100, grid: gn}

preprocessors:
  preproc_extract_region_land_NCA:
    extract_shape:
      shapefile : IPCC-AR6-shapefiles/IPCC-WGI-reference-regions-v4.shp
      decomposed : False
      method : contains
      crop: True
      ids: 
        - 'N.Central-America'
    mask_landsea:
      mask_out : sea

  preproc_extract_region_land_SCA:
    extract_shape:
      shapefile : IPCC-AR6-shapefiles/IPCC-WGI-reference-regions-v4.shp
      decomposed : False
      method : contains
      crop: True
      ids: 
        - 'S.Central-America'
    mask_landsea:
      mask_out : sea

diagnostics:
  day_pr_NCA:
    description: calculate annual means for region
    variables:
      pr:
        preprocessor: preproc_extract_region_land_NCA
        project: CMIP6
        mip: day
    scripts: null

Here are the final lines of the debug log. After the last line shown, the program just waits and I don't know why it hangs there. This recipe works for other models:

2024-08-16 10:48:27,951 UTC [30002] DEBUG   Running preprocessor function 'save' on the data
[<iris 'Cube' of precipitation_flux / (kg m-2 s-1) (time: 7300; latitude: 20; longitude: 29)>]
loaded from original input file(s)
[LocalFile('/badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM/historical/r1i1p1f1/day/pr/gn/v20190415/pr_day_CESM2-WACCM_historical_r1i1p1f1_gn_19900101-19991231.nc'),
 LocalFile('/badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM/historical/r1i1p1f1/day/pr/gn/v20190415/pr_day_CESM2-WACCM_historical_r1i1p1f1_gn_20000101-20091231.nc'),
 LocalFile('/badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM/historical/r1i1p1f1/day/pr/gn/v20190415/pr_day_CESM2-WACCM_historical_r1i1p1f1_gn_20100101-20150101.nc'),
 LocalFile('/badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM/historical/r1i1p1f1/fx/sftlf/gn/v20190227/sftlf_fx_CESM2-WACCM_historical_r1i1p1f1_gn.nc')]
with function argument(s)
compress = False,
filename = PosixPath('/work/scratch-nopw2/ranjinis/hot-models-2/recipe_test_ssp245_daily_pr_20240816_104803/preproc/day_pr_CIM/pr/CMIP6_CESM2-WACCM_day_historical_r1i1p1f1_pr_gn_1995-2014.nc')

@rswamina rswamina added the bug label Aug 16, 2024
@rswamina
Contributor Author

@valeriupredoi - can you please look into what I might be doing wrong when you get a chance? Thanks!

@valeriupredoi
Contributor

@rswamina could you post the entire trace or the debug file, please? Not much I can gauge from that bit of output 😁

@rswamina
Contributor Author

Yes, of course. I have attached the whole main_log_debug.txt file; let me know if you cannot access it.
[Attachment: main_log_debug.txt]

@bouweandela
Member

If all preprocessor functions are lazy, the save step is where the variable data (in this case pr) is loaded from disk, the computations happen, and the result is written to the output file. Are you sure nothing is happening? How long did you wait?

The timestamps in the attached debug log do not match those in the top post.
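
To illustrate what "lazy" means here, a minimal sketch (plain iris + dask, not ESMValCore code; the array shape and file name are made up) showing that lazy cube data is only computed once it is realized, e.g. while being saved:

# A sketch of lazy evaluation with iris + dask: the expensive work attached
# to lazy cube data only runs when the data is realized, e.g. while
# iris.save() streams it to disk. A "hanging" save step can therefore
# really be the preprocessing computation running.
import dask.array as da
import iris
import iris.cube

# Cube with lazy (dask) data; nothing has been computed yet.
lazy = da.random.random((7300, 20, 29), chunks=(365, 20, 29))
cube = iris.cube.Cube(lazy, long_name="precipitation_flux", units="kg m-2 s-1")
assert cube.has_lazy_data()

# Attach some deferred work (here a trivial scaling); still lazy.
cube = cube * 86400.0
assert cube.has_lazy_data()

# Writing the file is what finally triggers the computation.
iris.save(cube, "example_lazy_save.nc")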

@rswamina
Contributor Author

I reran it to generate a fresh debug file. Please consider only the timestamps in the file. I waited around 30 minutes and then killed the job. How long should I wait?

@rswamina
Contributor Author

I should add that I tried this several times wondering if there was an issue with JASMIN. The longest I waited (not sure if it was this particular run) was 30 minutes.

@bouweandela
Member

bouweandela commented Aug 26, 2024

I tried running the recipe on my laptop with just the first dataset

  - {dataset: CESM2-WACCM, project: CMIP6, exp: historical, ensemble: r1i1p1f1, start_year: 1995, end_year: 2014, grid: gn}

It completes in about 6 seconds and uses 2 GB of RAM with ESMValCore v2.11.0, and 9 seconds / 4 GB of RAM with v2.10.0. Maybe it is something specific to JASMIN indeed. Can you access the files that are listed in the top post, e.g. with ncdump?

@schlunma
Contributor

I had a similar problem with other data, which originated from using Dask distributed together with the mask_landsea preprocessor. This will be fixed soon (ESMValGroup/ESMValCore#2515).

@bouweandela
Member

From the debug log it looks like @rswamina is using the default scheduler (so no distributed).

@rswamina
Contributor Author

I am using the default scheduler. On JASMIN, SSP245 has data only for the r1 ensemble member. It seems to hang for both the historical and ssp245 experiments. I ran ncdump on the files and can see the file content under the /badc/ archive path. This was not an issue for other models. I cannot think of what else could be the issue unless someone else can reproduce this on JASMIN. I will also add that I successfully processed historical data for this model a few months ago.

@valeriupredoi
Contributor

Am travelling back to the UK today, will have a look tomorrow or on Thu. Before I do that though - Ranjini, could you please confirm there is enough room on the disk you're writing to, and which iris and ESMValCore versions you're using? Cheers 🍺

@rswamina
Contributor Author

Thanks @valeriupredoi. I have enough space on disk. I am using the esmvaltool module installed on JASMIN for this; here are the version details:

>esmvaltool version
ERROR 1: PROJ: proj_create_from_database: Open of /apps/jasmin/community/esmvaltool/miniconda3_py311_23.11.0-2/envs/esmvaltool/share/proj failed
ESMValCore: 2.10.0
ESMValTool: 2.10.0

I am not sure how to check the iris version though.
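
(For reference, one way to check it from within the same environment, assuming the esmvaltool module's Python is active:)

# Print the iris version available in the active environment.
import iris
print(iris.__version__)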

@valeriupredoi
Contributor

Thanks Ranjini! I know how to do that 😁 Speaking of, I should install 2.11 sooner rather than later.

@rswamina
Contributor Author

Hi @valeriupredoi - can you let me know if you diagnosed the problem? If this might take time, I will exclude the data set for now. Thanks!

@valeriupredoi
Contributor

On it, will have a look today; apologies for the delay, too many emails to check after holidays 🤣

@valeriupredoi
Contributor

hi Ranjini, am currently running that recipe, a few points:

  • it's slow at computing things (20 min of CPU time and it has not output any actual preprocessed file yet; my job also gets deferred on a very regular basis - low memory but high CPU load - so the actual wall time is 2-3x that)
  • make sure you are not saving intermediary files (that will slow your run enormously and eat up disk in no time) - see the config sketch after this list
  • my guess is that what you thought was the save step taking too long is actually the compute (preprocessing) step taking a long time; if you use more than one parallel process, the stdout does not reach the screen and file in chronological order, so it may look like the save is taking forever while a different process is in fact the slow one
  • if you want, you can output to /gws/nopw/j04/esmeval/ranjiniout/ rather than your HOME or wherever you currently write to; esmeval is sure to have plenty of disk
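
As a rough sketch of the settings behind the second and fourth points, these are the kinds of entries involved in the ESMValTool user configuration; the key names below follow config-user.yml, but treat them as an assumption and double-check against the version installed on JASMIN:

# config-user.yml sketch (verify key names against your installed version)
output_dir: /gws/nopw/j04/esmeval/ranjiniout/  # a location with plenty of disk
max_parallel_tasks: 1           # a serial run keeps the log output in chronological order
save_intermediary_cubes: false  # writing per-step files slows the run and eats disk
remove_preproc_dir: false       # keep the preprocessed files, which are the output here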

@rswamina
Contributor Author

@valeriupredoi - Thanks. I am outputting to scratch, in a subdirectory created under /work/scratch-nopw2/ranjinis/. There is sufficient space there from what I see:

>df -h  /work/scratch-nopw2/
Filesystem                   Size  Used Avail Use% Mounted on
172.26.70.68:/scratch-nopw2  200T  175T   26T  88% /work/scratch-nopw2

Are writes to scratch slower, or, as you indicate, is this still a processing delay? I wonder why that is; it is not very much data.

@valeriupredoi
Contributor

My guess is it's not the data but rather that those region selections are intensive. I have left my run on overnight; we'll see how long it takes 🍺

@rswamina
Contributor Author

I am curious as to why. This worked quite quickly for @bouweandela on a different machine and the same preprocessing worked quickly again for other model data sets (same time period but perhaps different resolutions). I have successfully run this preprocessor on JASMIN before for the same data set but I can't remember if I just let it run overnight.

@valeriupredoi
Contributor

valeriupredoi commented Sep 17, 2024

so my job is still running, it hasn't output any actual preprocessed data, and it looks like it's going nowhere - this is on sci2, the load on the node is pretty decent, it's not running out of memory, and it should definitely run a little faster 😁 (note: max_parallel_tasks = null, no Dask settings whatsoever)
[Screenshot attachment: RanjiniRecipe_sci2]

@valeriupredoi
Contributor

valeriupredoi commented Sep 17, 2024

OK, starting to make sense of this: the save is indeed waiting on a fairly large number of still running tasks (see the wait statement in _task.py l.787) - about 800 of these:

{PreprocessingTask('day_pr_NCA/pr'): <multiprocessing.pool.ApplyResult object at 0x7fe229b64190>}
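
For context, a stripped-down, standalone sketch of the waiting pattern being described (this is not the actual ESMValCore _task.py code, and the task names are invented): work is submitted to a multiprocessing pool and the parent process polls the returned ApplyResult handles, so nothing is marked ready until a worker actually finishes.

# Sketch of the scheduler-style wait loop described above (not ESMValCore code).
import multiprocessing
import time


def run_task(name):
    # Stand-in for a PreprocessingTask; a hung worker would never return.
    time.sleep(1)
    return name


if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        # Mapping of task name -> ApplyResult, like the debug output above.
        running = {
            name: pool.apply_async(run_task, (name,))
            for name in ["day_pr_NCA/pr (r1)", "day_pr_NCA/pr (r2)"]
        }
        while running:
            done = [name for name, result in running.items() if result.ready()]
            for name in done:
                print(name, "finished:", running.pop(name).get())
            # If no worker ever finishes, this loop waits forever - which is
            # what a hang at the "save" stage looks like from the outside.
            time.sleep(0.1)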

@valeriupredoi
Contributor

In fact, after a fair while, the ready tasks are empty still (empty set()) - nothing actually gets done!

@rswamina
Contributor Author

So, what does this mean?

@valeriupredoi
Contributor

valeriupredoi commented Sep 17, 2024

found it! This code block trips it:

    for cube in cubes:
        logger.debug("Saving cube:\n%s\nwith %s data to %s", cube,
                     "lazy" if cube.has_lazy_data() else "realized", filename)

Unbelievably sneaky!

@valeriupredoi
Contributor

@bouweandela have a look at this - simplifying for debugging that code block above (from _io.py):

    for cube in cubes:
        if cube.has_lazy_data():
            data_state = "lazy"
        else:
            data_state = "realized"
        logger.debug("Saving cube: %s with %s data to %s", cube, data_state, filename)

the whole run hangs; commenting out the debug logger line:

    for cube in cubes:
        if cube.has_lazy_data():
            data_state = "lazy"
        else:
            data_state = "realized"
        # logger.debug("Saving cube: %s with %s data to %s", cube, data_state, filename)

the run goes through at light speed! Something is really wonky at JASMIN with the debug string I/O

@valeriupredoi
Contributor

@rswamina you can now run the recipe no problem, I've taken out the offending bit - it is the issue with logger.debug("Saving cube: %s", cube) - why that hangs is beyond me, I'll do some due diligence, but I need lunch now

@rswamina
Contributor Author

rswamina commented Sep 17, 2024

@valeriupredoi - many thanks for this!!! I will let you know if I have further issues.

@valeriupredoi
Contributor

The cube variable needs to be explicitly cast to a string with str(cube) so the debug logger doesn't hang 😮‍💨
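
For reference, a self-contained sketch of that workaround; the message text is illustrative and "testRS.nc" stands for any local NetCDF file, so the exact wording in _io.py may differ.

# Sketch of the workaround: build the cube summary up front with str()
# instead of handing the Cube object to the logging machinery.
import logging

import iris

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

cube = iris.load_cube("testRS.nc")  # any local NetCDF file with one variable
logger.debug(
    "Saving cube:\n%s\nwith %s data",
    str(cube),  # explicit cast, as described above
    "lazy" if cube.has_lazy_data() else "realized",
)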

@rswamina
Contributor Author

So why does this not happen for other data sets?

@valeriupredoi
Contributor

that is the million-dollar question! Here is model code that mimics our logging mechanism:

import logging
import logging.config
import time
from pathlib import Path

import esmvalcore
import iris
import yaml

logger = logging.getLogger(__name__)


def _purge_file_handlers(cfg: dict) -> None:
    """Drop file handlers when no output directory is given (minimal stand-in)."""
    cfg['handlers'] = {
        name: handler
        for name, handler in cfg['handlers'].items()
        if 'filename' not in handler
    }
    # Remove references to the dropped handlers from the logger sections.
    sections = list(cfg.get('loggers', {}).values())
    if 'root' in cfg:
        sections.append(cfg['root'])
    for section in sections:
        section['handlers'] = [
            name for name in section.get('handlers', [])
            if name in cfg['handlers']
        ]


def _update_stream_level(cfg: dict, level=None):
    """Update the log level for the stream handlers."""
    for handler in cfg['handlers'].values():
        if level is not None and 'stream' in handler:
            if handler['stream'] in ('ext://sys.stdout', 'ext://sys.stderr'):
                handler['level'] = level.upper()


def configure_logging(cfg_file: str = None,
                      output_dir: str = None,
                      console_log_level: str = None) -> None:
    """Set up logging from ESMValCore's config-logging.yml, mimicking ESMValTool."""
    if cfg_file is None:
        # The config-logging.yml shipped with the installed esmvalcore package.
        cfg_file = Path(esmvalcore.__file__).parent / 'config' / 'config-logging.yml'

    cfg_file = Path(cfg_file).absolute()
    with open(cfg_file) as file_handler:
        cfg = yaml.safe_load(file_handler)

    if output_dir is None:
        _purge_file_handlers(cfg)
    else:
        # Write the log files under output_dir.
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for handler in cfg['handlers'].values():
            if 'filename' in handler:
                handler['filename'] = str(Path(output_dir) / handler['filename'])

    _update_stream_level(cfg, level=console_log_level)

    logging.config.dictConfig(cfg)
    logging.Formatter.converter = time.gmtime
    logging.captureWarnings(True)


def loaddebug():
    # Configure the logger, load a test cube, and log it at debug level,
    # as the preprocessor save step does.
    configure_logging(output_dir="testdebuglog", console_log_level="debug")
    c = iris.load_cube("testRS.nc")
    logger.debug("plain debug message")
    logger.debug("cube %s", c)
    return c


if __name__ == "__main__":
    loaddebug()

maybe you can make some sense of it - I'll just open a PR and plop a str() in that logger statement in _io.py, just so we are on the safe side 🍺

@bouweandela
Member

bouweandela commented Sep 18, 2024

I suspect this issue may be related to the (un)availability of parallel writes on most Jasmin filesystems (see here). In particular, I would recommend choosing a storage location that supports parallel writes for the esmvaltool output_dir. It is likely that multiple threads will write to the log files in parallel.

Could you try using e.g. output_dir: /work/scratch-pw2/ranjinis/ (note the missing no in the path)?
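
In config-user.yml that would be a one-line change, roughly as follows (key name as in the ESMValTool user configuration; verify against your installed version):

# Point the ESMValTool output at a filesystem that supports parallel writes.
output_dir: /work/scratch-pw2/ranjinis/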

@valeriupredoi
Contributor

that would be interesting to experiment with! But I don't think that's the case here: if we don't include the cube in the debug message, everything goes through nicely, and equally, if we force-cast it to str, everything goes through nicely too. I suspect there's something wonky with the JASMIN encoding, plus some particular corner case related to that specific dataset - I don't know what it is, and I don't think we should spend too much time investigating 🍺

@valeriupredoi
Contributor

> I suspect this issue may be related to the (un)availability of parallel writes on most Jasmin filesystems (see here). In particular, I would recommend choosing a storage location that supports parallel writes for the esmvaltool output_dir. It is likely that multiple threads will write to the log files in parallel.
>
> Could you try using e.g. output_dir: /work/scratch-pw2/ranjinis/ (note the missing no in the path)?

@bouweandela I just reintroduced the original debug logger with cube and not str(cube) and tested on the pw scratch - same behaviour, i.e. it hangs ad infinitum; PW is not the issue.

@rswamina
Contributor Author

Apologies, that was a typo; I used nopw2. Update from my side: I was able to run the recipe successfully for 3 ensemble members, a single experiment, and a single region, but when I add more regions for extraction, it once again hangs (or at least takes very long) and I end up killing the job. It seems like some kind of memory issue on JASMIN, but I checked and resources are available.

@valeriupredoi
Contributor

valeriupredoi commented Sep 19, 2024

so this particular hang is due to that logger issue with the cube - it could be that similar instances of this issue are rearing their heads where I've not wrapped cube instances in str() - is it hanging at save again, or elsewhere?
