
Add industry JRC data processing #354

Merged · 12 commits · May 2, 2024

Conversation

@brynpickering (Member) commented Apr 11, 2024:

Fixes #345.

Workflow checks haven't completed on my device.

This starts the process of modularising the JRC-IDEES data processing pipeline, which I will later apply to transport and heat (and replace the existing processing scripts).

Checklist

Any checks which are not relevant to the PR can be pre-checked by the PR creator. All others should be checked by the reviewer. You can add extra checklist items here if required by the PR.

  • CHANGELOG updated
  • Minimal workflow tests pass
  • Tests added to cover contribution
  • Documentation updated
  • Configuration schema updated

@irm-codebase (Contributor) left a comment:

A couple of nice-to-haves. Nothing too important.

config/default.yaml (outdated; resolved)
@@ -0,0 +1,47 @@
"Rules regarding JRC-IDEES Data"

JRC_IDEES_SPATIAL_SCOPE = [
Contributor:

Nice-to-have:
Should we consider moving this to the configuration?
As it is, data will always be downloaded for all JRC countries even if it is unused (such as in the minimal configuration).

Member Author:

No: data from some countries is needed to fill in gaps for their neighbours, so even if a country isn't in the list of model countries, we still need to pull in and process all the data.

Member Author:

I've now combined main countries and infill countries into one list, and only unzip and process that list of countries. It's a bit of a proof of concept and might be too verbose to be worth keeping.

Contributor:

Thank you!
And agreed: if this does not work well with the rest of the workflow, I suppose it's fine to just do all countries, unless the files are huge.

@timtroendle (Member) commented Apr 11, 2024:

I started my review and am trying to finish it today, but it may take a while (because I need to do a few other things first).

@brynpickering (Member Author):

I have a few local changes to commit that limit the file unzipping (and downloading) to just the defined countries and their infill neighbours (where necessary). Will push them later this evening.

@timtroendle (Member):

Ok, I will finish the review tomorrow in that case.

@brynpickering (Member Author):

@timtroendle ready for review now.

I'm aware that get_countries_to_unzip is quite verbose. It would be less so if we could access eurocalliopelib from the main snakemake env (and so use the country code conversion utils). Anyway, I've done it as a bit of a proof of concept to show how we could filter the JRC files down to just those that are relevant to the configured countries (and their infill neighbours, where necessary). I would also be fine with reverting this to processing all countries and having the filtering happen much further down the line.
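For illustration, a minimal sketch of the kind of helper being described. INFILL_NEIGHBOURS and the example country codes are assumptions for this sketch, not the PR's actual code:

def get_countries_to_unzip(model_countries: list[str]) -> set[str]:
    # Hypothetical mapping of model countries to the neighbours whose JRC data
    # is used to infill their missing data (illustrative values only).
    INFILL_NEIGHBOURS = {"CHE": ["AUT", "DEU", "FRA", "ITA"], "NOR": ["SWE"]}
    countries = set(model_countries)
    for country in model_countries:
        countries.update(INFILL_NEIGHBOURS.get(country, []))
    return countries

# e.g. get_countries_to_unzip(["CHE"]) -> {"CHE", "AUT", "DEU", "FRA", "ITA"}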

countries = config["scope"]["spatial"]["countries"]
wildcard_constraints:
    sector = "((industry)|(transport)|(tertiary))"
output: temp(directory("build/data/jrc-idees/{sector}/unprocessed"))
Member Author:

I'm outputting to a folder rather than wildcarded individual files because it makes other references to this data lighter (heat, transport, etc. only need to reference the directory). It also makes sense to me to do all the unzipping in one go into a temporary directory: it is a very quick step, and all the files get deleted as soon as the downstream processing is complete. The rule is also much simpler when not trying to filter countries (see earlier commits).
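For reference, a minimal sketch of the pattern described here (the input zip location and unzip command are assumptions; the actual rule in the PR differs):

rule unzip_jrc_idees:
    message: "Unzip all JRC-IDEES {wildcards.sector} sector data"
    input: "data/automatic/jrc-idees-{sector}.zip"  # assumed download location
    wildcard_constraints:
        sector = "((industry)|(transport)|(tertiary))"
    output: temp(directory("build/data/jrc-idees/{sector}/unprocessed"))
    shell: "unzip -o {input} -d {output}"

Because the output is wrapped in temp(), Snakemake deletes the directory once every downstream rule consuming it has finished.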

Member Author:

This file will move/change eventually. I have a script ready to process tertiary sector data that is comparable to the industry one in this PR.

Member Author:

This file will move/change eventually. I have a script ready to process transport sector data that is comparable to the industry one in this PR.

"build/data/jrc-idees/transport/unprocessed/{country_code}.xlsx",
country_code=JRC_IDEES_SCOPE
)
data = "build/data/jrc-idees/transport/unprocessed"
Contributor:

Question: will this break Snakemake operations?
Passing just a folder might break things, since Snakemake won't detect changes to the individual files when building the DAG.
If this is due to my suggestion to make the extracted countries flexible, then it was not a good one...

Another option is to add a function in the smk file that builds an array of the filenames in this folder, and specify the rule order...
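For reference, a sketch of that alternative: an input function that enumerates the per-country files explicitly, so Snakemake tracks each file in the DAG (JRC_IDEES_SCOPE is assumed to be defined in the rule file, as in the diff above):

def jrc_transport_files(wildcards):
    # Listing files individually means a change to any one of them
    # invalidates downstream rules when the DAG is rebuilt.
    return expand(
        "build/data/jrc-idees/transport/unprocessed/{country_code}.xlsx",
        country_code=JRC_IDEES_SCOPE,
    )

# used in a rule as: input: jrc_transport_files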

Member Author:

The directory is temporary - it is deleted once all rules that rely on it are completed (in the case of JRC-IDEES data, there is only one downstream rule per unzipped directory). It's not really related to keeping all countries extracted flexible. It's more about limiting the number of times you have to reference that array of filenames from different rules.

Member Author:

This means that whenever you build the DAG again, if a rule that relies on the directory needs to re-run, the files will first be unzipped into the directory again (since they were deleted). Since unzipping is a light operation, this keeps us from having lots of unnecessary files lying around; you can get them back whenever you want by re-unzipping the downloaded data.

@irm-codebase (Contributor) left a comment:

Just a few comments in relation to SMK DAG generation.

@timtroendle (Member) left a comment:

Looks good, but I had a range of minor comments.

config/default.yaml (outdated; resolved)
@@ -16,6 +16,7 @@ techs_template_dir = f"{model_template_dir}techs/"

include: "./rules/shapes.smk"
include: "./rules/data.smk"
include: "./rules/jrc-idees.smk"
Member:

Not particularly important but I moved the previous eurostat.smk and jrc-idees.smk into data.smk, as the container for all downloading and pre-processing of all data that is not sector specific. The idea being that we don't generate too many rule files, especially not rule files that aren't feature-based. You didn't like that idea?

Member Author:

I prefer one per source, as they do become large enough rule files to be worth splitting off. It is also in line with the concept of modularising the different major data sources.

Member Author:

One could imagine a future where we split off JRC processing completely and just store the pre-built files on zenodo for convenience.

config/schema.yaml (outdated; resolved)
lib/eurocalliopelib/utils.py (outdated; resolved)
lib/eurocalliopelib/utils.py (outdated; resolved)
class TestRenameAndGroupby:
    @pytest.fixture
    def da(self, request):
        data = [1, 2, 3, 4, 5]
Member:

Maybe just using "1" as the value everywhere could make the tests easier to understand.

Member Author:

It's good to have different values so you know that you've got the correct index items selected / grouped / summed later.
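A sketch of that reasoning as a standalone fixture, assuming the single-index variant (the real fixture also parameterises a multi-index DataArray):

import pytest
import xarray as xr

@pytest.fixture
def da():
    # Distinct values make it visible which index items were selected,
    # grouped, or summed: any wrong grouping produces a wrong sum.
    return xr.DataArray(
        [1, 2, 3, 4, 5],
        coords={"dim_0": ["A", "B", "C", "D", "E"]},
        dims="dim_0",
    )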

        ],
    )
    @pytest.mark.parametrize("da", ["single_index", "multi_index"], indirect=True)
    def test_rename_groupby_keep_renamed(self, rename_dict, expected, da):
Member:

The name should be keep_non_renamed, right?

Member Author:

Yes. I've also changed this to drop_other_dim_items to try and make it easier to understand the argument.

tests/lib/test_utils.py (outdated; resolved)
[
(
{"A": "A1", "B": "B1", "C": "C1", "D": "D1", "E": "E1"},
{"A1": [1, 5], "B1": [np.nan, 4], "D1": [4, np.nan], "E1": [5, 1]},
Member:

Hm, this expected behaviour surprises me. So if dropna=True you do not sum?

Member Author:

No, it's just a convenience so that the other tests can compare against a single value, since they parameterise over single- and multi-index DataArrays. Comparing either [1, 1] or [1] to 1 succeeds with numpy array comparisons. You can't do this with NaNs, as the data is different in each array being compared (e.g. [1, np.nan]).
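The broadcasting shortcut being described, as a standalone example (values are illustrative):

import numpy as np

# A scalar expectation covers both fixture shapes via broadcasting:
assert (np.array([1]) == 1).all()     # single-index result
assert (np.array([1, 1]) == 1).all()  # multi-index result

# NaN breaks the shortcut, since NaN != NaN in element-wise comparisons:
assert not (np.array([1, np.nan]) == np.array([1, np.nan])).all()

# So NaN-containing expectations need a NaN-aware comparison per array:
np.testing.assert_array_equal([1, np.nan], [1, np.nan])  # passes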

(
{"A": "A1", "B": "B1", "C": "C1", "D": "D1", "E": "E1"},
{
"A1": [1, 5],
Member:

This expected behaviour also surprises me. Why is this not summed? And where does the 5 come from?

Member Author:

I have added a bit more explanation in the creation of the fixture.

@brynpickering (Member Author):

@timtroendle the only remaining question here is on the use of a directory output for unzipped JRC data. It's really a question around whether to filter countries or not. If we don't filter countries then we can return to the rule unzipping individual files with ease. If we filter, then it becomes reasonably cumbersome to reference the filtering helper function in multiple rule files. I could move the heat and transport JRC processing rules into jrc-idees.smk which would make referencing the helper function clearer.

@irm-codebase (Contributor) left a comment:

Suggested some small improvements to the output xarray datasets.

unit = "kt"

processed_data.columns = processed_data.columns.rename("year").astype(int)
processed_da = processed_data.stack().rename("jrc-idees-industry-twh").to_xarray()
@irm-codebase (Contributor) commented Apr 18, 2024:

The use of jrc-idees-industry-twh as the data variable name is quite confusing, since it also shows up in production, which is not in energy units. As far as I understand:

  • energy data is in twh.
  • production data tends to be in kt, but twh is still in the data variable name.

If possible/sensible, please:

  • For the energy dataset, consider unstacking the energy coordinate into two data variables (final, useful). It would fit the use of xr.Dataset a bit better, since these two variables apply to all other coordinates.
  • For production, remove the (kt) from the produced_material coordinate, since it's non-atomic usage.
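For illustration, the unstacking suggested in the first bullet might look like this (dimension, coordinate, and variable names are assumptions based on the discussion, not the PR's actual code):

import numpy as np
import xarray as xr

# Hypothetical energy DataArray with an "energy" dimension holding the
# final/useful split (shapes and values are illustrative).
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"energy": ["final", "useful"], "year": [2000, 2010, 2015]},
    dims=["energy", "year"],
)

# Unstack the "energy" coordinate into two data variables:
ds = da.to_dataset(dim="energy")  # data variables: "final", "useful"

# Keep units in attributes rather than in variable names:
for var in ds.data_vars:
    ds[var].attrs["unit"] = "twh"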

Contributor:

This proooooobably affects other outputs of this JRC module. But it would make further processing easier on our side.

country_code=df.index.name.split(":")[0],
cat_name=df.index.name.split(": ")[1],
)
.rename_axis(index="produced_material")
Contributor:

This one is also a tad confusing. It seems that it's the production method of a given material in some cases (Electric arc, Integrated steelworks both give kt of steel?).

In some cases it has units, in others it does not. Is there a way to make this tidier?

Member Author:

Not without lots of hardcoding. I would leave this to downstream applications and not this processing script, I think.

@irm-codebase mentioned this pull request on Apr 22, 2024.
@irm-codebase added the Industry (Industrial energy demand) label on Apr 23, 2024.
@irm-codebase (Contributor):

One last thing I noticed while processing data for #355: it seems this new JRC processing changes the final amounts for some energy carriers. Is this expected?

While implementing the processing for "other" industries, I ran into checksum differences between the JRC data in SCEC and EC. At first I thought it was the different energy unit (ktoe to twh), but after more testing I found that this is not the case.

Should I open a separate issue for this, or would you like to debug it within this PR?


Here are some checksum results before / after this JRC update:

Before (in SCEC), already converted to twh:

ipdb> jrc_energy_df.loc[jrc_energy_df.index.get_level_values("carrier_name") == "Diesel oil (incl. biofuels)"].sum().sum()
2848.481153523807
ipdb> 

After (using this PR in EC):

ipdb> jrc_energy.sel(carrier_name="Diesel oil (incl. biofuels)").sum()
<xarray.Dataset>
Dimensions:       ()
Coordinates:
    carrier_name  <U27 'Diesel oil (incl. biofuels)'
Data variables:
    value         float64 2.309e+03

@brynpickering (Member Author):

@timtroendle I've moved back to non-directory output. See what you think about the country filtering - do you think it's overkill?

@irm-codebase I've cleaned up the output data files. Units are now only referenced in the xarray attributes, "..-twh" doesn't get attached to either dataset name. Energy is a dataset with two variables, production is a single array.

I haven't managed to check the data inconsistency yet. I'll do that tomorrow.

@brynpickering (Member Author):

I've cleaned up the code so that it now matches the output of the old SC-EC notebook (here). There were some issues with one of the cell colours not being captured for the chemicals industry, and with assigning carrier names to subsections. Both issues are now fixed, and there are logging / data checking points added to keep an eye on this in future.

I've removed country filtering, as it actually breaks missing-data infilling for the heat sector. The heat processing could be cleaned up to handle only a subset of countries, but I'd say that's a separate PR.

@timtroendle (Member):

Separate comment here re: country filtering. I am totally fine with not filtering countries at this stage of the workflow. I agree that it adds unnecessary complexity, especially considering that these files and their processing are very lightweight.

@timtroendle (Member) left a comment:

Looks all good to me. I find the file output clean and I agree with the non-filtering of countries.

Honestly, I went rather quickly through this as I had no major comments in the first iteration anyway. If there is anything specific I should have a close look at other than the file output, the country filtering, and the minor comments above, please let me know.

@@ -144,6 +144,9 @@ properties:
  root-directory:
    type: string
    description: Path to the root directory of euro-calliope containing scripts and template folders.
  max-threads:
Member:

This parameter doesn't exist anymore and it should therefore not be in schema.yaml.

@@ -31,9 +31,8 @@

def process_jrc_heat_tertiary_sector_data(
-    paths_to_national_data: list[str], out_path: str
+    paths_to_national_data: list[Path], out_path: str
Member:

Wait! These cannot be paths, they are strings coming directly from Snakemake.

@@ -39,12 +39,11 @@

def process_jrc_transport_data(
-    paths_to_data: list[str],
-    dataset: object,
+    paths_to_data: list[Path],
Member:

Hm... is it me, or is this wrong too? Or is this a new Snakemake feature I am not aware of?

@irm-codebase (Contributor) commented May 1, 2024:

I believe that is just a Python annotation for better hinting?
(Edit: upon further inspection, you may be right here! This one comes from Snakemake, and I am not sure how they process paths beyond strings.)
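For context: in Snakemake's script integration, the injected snakemake.input is a Namedlist whose items behave as strings, so a list[Path] annotation only becomes accurate after an explicit conversion. A minimal sketch:

from pathlib import Path

# "snakemake" is injected by Snakemake's script integration, not imported.
# Its input items behave as strings; convert if the signature promises Path:
paths_to_data = [Path(p) for p in snakemake.input]  # now genuinely list[Path]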

@irm-codebase (Contributor):

> Separate comment here re: country filtering. I am totally fine with not filtering countries at this stage of the workflow. I agree that it adds unnecessary complexity, especially considering that these files and their processing are very lightweight.

Honestly... I agree. It was not a good suggestion on my part. I did not follow KISS rules on this one...
Besides, if we do end up moving to a more modular setup, it makes sense to just filter stuff out at the end.

@irm-codebase (Contributor) left a comment:

Small comment regarding default exception ignoring in one file. Rest looks good!
I'll approve this. Change that line at your discretion :)

def convert_valid_countries(country_codes: list, output: str = "alpha3") -> dict:
    """
    Convert a list of country codes / names to a list of uniform ISO coded country
    codes. If an input item isn't a valid country (e.g. "EU27") then print the code and
Contributor:

I'd be careful about skipping exceptions by default, since this is a utility function.

I'd introduce a flag to ignore exceptions, with the default being to raise them. This helps avoid issues down the line, since exception skipping is made explicit in the calling code.
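A sketch of the suggested interface. This assumes pycountry as the underlying lookup (an assumption here, chosen because it raises LookupError for unknown codes):

import pycountry

def convert_valid_countries(
    country_codes: list, output: str = "alpha3", raise_on_invalid: bool = True
) -> dict:
    """Map country codes / names to uniform ISO codes.

    Invalid entries (e.g. "EU27") raise LookupError by default; pass
    raise_on_invalid=False to skip them with a printed message instead.
    """
    mapping = {}
    for code in country_codes:
        try:
            country = pycountry.countries.lookup(code)
        except LookupError:
            if raise_on_invalid:
                raise
            print(f"Skipping invalid country code: {code}")
            continue
        mapping[code] = country.alpha_3 if output == "alpha3" else country.alpha_2
    return mapping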

@brynpickering merged commit 739d2c6 into develop on May 2, 2024. 4 checks passed.
@brynpickering deleted the add-jrc-idees-industry-processing branch on May 2, 2024, 11:52.
@irm-codebase (Contributor):

@brynpickering just to note that this fix might have introduced some extra hurdles... see #383

jnnr pushed a commit to jnnr/euro-calliope that referenced this pull request on Aug 27, 2024 (add-jrc-idees-industry-processing: Add industry JRC data processing).
jnnr pushed a commit to jnnr/euro-calliope that referenced this pull request on Sep 3, 2024 (add-jrc-idees-industry-processing: Add industry JRC data processing).
Labels: Industry (Industrial energy demand)
Linked issue: Industry: add JRC processing step to remove raw_data files
3 participants