
Add industry JRC data processing #354

Merged · 12 commits · May 2, 2024

Conversation

@brynpickering (Member) commented Apr 11, 2024:

Fixes #345.

Workflow checks haven't completed on my device.

This starts the process of modularising the JRC-IDEES data processing pipeline, which I will later apply to transport and heat (and replace the existing processing scripts).

Checklist

Any checks which are not relevant to the PR can be pre-checked by the PR creator. All others should be checked by the reviewer. You can add extra checklist items here if required by the PR.

  • CHANGELOG updated
  • Minimal workflow tests pass
  • Tests added to cover contribution
  • Documentation updated
  • Configuration schema updated

@irm-codebase (Contributor) left a comment:

A couple of nice-to-haves. Nothing too important.

config/default.yaml (outdated; resolved)
@@ -0,0 +1,47 @@
"Rules regarding JRC-IDEES Data"

JRC_IDEES_SPATIAL_SCOPE = [
Contributor:

Nice-to-have:
Should we consider moving this to the configuration?
As it is, data will always be downloaded for all JRC countries even if it is unused (such as in the minimal configuration).

Member Author:

No: data from some countries is needed to fill in gaps for their neighbours, so even if a country isn't in the list of model countries, we still need to pull in and process all the data.

Member Author:

I've now combined main countries and infill countries into one list, and only unzip and process that list of countries. It's a bit of a proof of concept and might be too verbose to be worth keeping.

Contributor:

Thank you!
And agreed: if this does not work well with the rest of the workflow, I suppose it's fine to just do all countries, unless the files are huge.

@timtroendle (Member) commented Apr 11, 2024:

I started my review and am trying to finish it today, but it may take a while (because I need to do a few other things first).

@brynpickering (Member Author):

I have a few local changes to commit that limit the file unzipping (and downloading) to just the defined countries and their infill neighbours (where necessary). Will push them later this evening.

@timtroendle (Member):

Ok, I will finish the review tomorrow in that case.

@brynpickering (Member Author):

@timtroendle ready for review now.

I'm aware that get_countries_to_unzip is quite verbose. It would be less so if we could access eurocalliopelib from the main snakemake env (and so use the country code conversion utils). Anyway, I've done it as a bit of a proof of concept to show how we could filter the JRC files down to just those that are relevant to the configured countries (and their infill neighbours, where necessary). I would also be fine with reverting this to processing all countries and having the filtering happen much further down the line.
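For illustration, a minimal sketch of the kind of helper being described. INFILL_NEIGHBOURS and the example country codes are assumptions for this sketch, not the PR's actual code:

def get_countries_to_unzip(model_countries: list[str]) -> set[str]:
    # Hypothetical mapping of model countries to the neighbours whose JRC data
    # is used to infill their missing data (illustrative values only).
    INFILL_NEIGHBOURS = {"CHE": ["AUT", "DEU", "FRA", "ITA"], "NOR": ["SWE"]}
    countries = set(model_countries)
    for country in model_countries:
        countries.update(INFILL_NEIGHBOURS.get(country, []))
    return countries

# e.g. get_countries_to_unzip(["CHE"]) -> {"CHE", "AUT", "DEU", "FRA", "ITA"}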

countries = config["scope"]["spatial"]["countries"]
wildcard_constraints:
    sector = "((industry)|(transport)|(tertiary))"
output: temp(directory("build/data/jrc-idees/{sector}/unprocessed"))
Member Author:

I'm outputting to a folder rather than wildcarded individual files because it makes other references to this data lighter (heat, transport, etc. only need to reference the directory). It also makes sense to me to do all the unzipping in one go into a temporary directory: it is a very quick step, and all the files get deleted as soon as the downstream processing is complete. The rule is also much simpler when not trying to filter countries (see earlier commits).
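For reference, a minimal sketch of the pattern described here (the input zip location and unzip command are assumptions; the actual rule in the PR differs):

rule unzip_jrc_idees:
    message: "Unzip all JRC-IDEES {wildcards.sector} sector data"
    input: "data/automatic/jrc-idees-{sector}.zip"  # assumed download location
    wildcard_constraints:
        sector = "((industry)|(transport)|(tertiary))"
    output: temp(directory("build/data/jrc-idees/{sector}/unprocessed"))
    shell: "unzip -o {input} -d {output}"

Because the output is wrapped in temp(), Snakemake deletes the directory once every downstream rule consuming it has finished.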

Member Author:

This file will move/change eventually. I have a script ready to process tertiary sector data that is comparable to the industry one in this PR.

Member Author:

This file will move/change eventually. I have a script ready to process transport sector data that is comparable to the industry one in this PR.

"build/data/jrc-idees/transport/unprocessed/{country_code}.xlsx",
country_code=JRC_IDEES_SCOPE
)
data = "build/data/jrc-idees/transport/unprocessed"
Contributor:

Question: will this break Snakemake operations?
Passing just a folder might break things, since Snakemake won't detect changes to the individual files when building the DAG.
If this is due to my suggestion to make the extracted countries flexible, then it was not a good one...

Another option is to add a function in the smk file that builds an array of the filenames in this folder, and specify the rule order...
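For reference, a sketch of that alternative: an input function that enumerates the per-country files explicitly, so Snakemake tracks each file in the DAG (JRC_IDEES_SCOPE is assumed to be defined in the rule file, as in the diff above):

def jrc_transport_files(wildcards):
    # Listing files individually means a change to any one of them
    # invalidates downstream rules when the DAG is rebuilt.
    return expand(
        "build/data/jrc-idees/transport/unprocessed/{country_code}.xlsx",
        country_code=JRC_IDEES_SCOPE,
    )

# used in a rule as: input: jrc_transport_files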

Member Author:

The directory is temporary - it is deleted once all rules that rely on it are completed (in the case of JRC-IDEES data, there is only one downstream rule per unzipped directory). It's not really related to keeping all countries extracted flexible. It's more about limiting the number of times you have to reference that array of filenames from different rules.

Member Author:

This means that whenever you build the DAG again, if a rule that relies on the directory needs to re-run, the files will first be unzipped into the directory again (since they were deleted). Since unzipping is a light operation, this keeps us from having lots of unnecessary files lying around; you can get them back whenever you want by re-unzipping the downloaded data.

@irm-codebase (Contributor) left a comment:

Just a few comments in relation to SMK DAG generation.

@timtroendle (Member) left a comment:

Looks good, but I had a range of minor comments.

config/default.yaml (outdated; resolved)
@@ -16,6 +16,7 @@ techs_template_dir = f"{model_template_dir}techs/"

include: "./rules/shapes.smk"
include: "./rules/data.smk"
include: "./rules/jrc-idees.smk"
Member:

Not particularly important but I moved the previous eurostat.smk and jrc-idees.smk into data.smk, as the container for all downloading and pre-processing of all data that is not sector specific. The idea being that we don't generate too many rule files, especially not rule files that aren't feature-based. You didn't like that idea?

Member Author:

I prefer one per source, as they do become large enough rule files to be worth splitting off. It is also in line with the concept of modularising the different major data sources.

Member Author:

One could imagine a future where we split off JRC processing completely and just store the pre-built files on zenodo for convenience.

config/schema.yaml (outdated; resolved)
lib/eurocalliopelib/utils.py (outdated; resolved)
lib/eurocalliopelib/utils.py (outdated; resolved)
class TestRenameAndGroupby:
    @pytest.fixture
    def da(self, request):
        data = [1, 2, 3, 4, 5]
Member:

Maybe just using "1" as the value everywhere could make the tests easier to understand.

Member Author:

It's good to have different values so you know that you've got the correct index items selected / grouped / summed later.
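A sketch of that reasoning as a standalone fixture, assuming the single-index variant (the real fixture also parameterises a multi-index DataArray):

import pytest
import xarray as xr

@pytest.fixture
def da():
    # Distinct values make it visible which index items were selected,
    # grouped, or summed: any wrong grouping produces a wrong sum.
    return xr.DataArray(
        [1, 2, 3, 4, 5],
        coords={"dim_0": ["A", "B", "C", "D", "E"]},
        dims="dim_0",
    )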

        ],
    )
    @pytest.mark.parametrize("da", ["single_index", "multi_index"], indirect=True)
    def test_rename_groupby_keep_renamed(self, rename_dict, expected, da):
Member:

The name should be keep_non_renamed, right?

Member Author:

Yes. I've also changed this to drop_other_dim_items to try and make it easier to understand the argument.

tests/lib/test_utils.py (outdated; resolved)
[
(
{"A": "A1", "B": "B1", "C": "C1", "D": "D1", "E": "E1"},
{"A1": [1, 5], "B1": [np.nan, 4], "D1": [4, np.nan], "E1": [5, 1]},
Member:

Hm, this expected behaviour surprises me. So if dropna=True you do not sum?

Member Author:

No, it's just a convenience so that the other tests can compare against a single value, since they parameterise over single- and multi-index DataArrays. Comparing either [1, 1] or [1] to 1 succeeds with numpy array comparisons. You can't do this with NaNs, as the data is different in each array being compared (e.g. [1, np.nan]).
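The broadcasting shortcut being described, as a standalone example (values are illustrative):

import numpy as np

# A scalar expectation covers both fixture shapes via broadcasting:
assert (np.array([1]) == 1).all()     # single-index result
assert (np.array([1, 1]) == 1).all()  # multi-index result

# NaN breaks the shortcut, since NaN != NaN in element-wise comparisons:
assert not (np.array([1, np.nan]) == np.array([1, np.nan])).all()

# So NaN-containing expectations need a NaN-aware comparison per array:
np.testing.assert_array_equal([1, np.nan], [1, np.nan])  # passes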

(
{"A": "A1", "B": "B1", "C": "C1", "D": "D1", "E": "E1"},
{
"A1": [1, 5],
Member:

This expected behaviour also surprises me. Why is this not summed? And where does the 5 come from?

Member Author:

I have added a bit more explanation in the creation of the fixture.

@brynpickering (Member Author):

@timtroendle the only remaining question here is on the use of a directory output for unzipped JRC data. It's really a question around whether to filter countries or not. If we don't filter countries then we can return to the rule unzipping individual files with ease. If we filter, then it becomes reasonably cumbersome to reference the filtering helper function in multiple rule files. I could move the heat and transport JRC processing rules into jrc-idees.smk which would make referencing the helper function clearer.

@irm-codebase (Contributor) left a comment:

Suggested some small improvements to the output xarray datasets.

unit = "kt"

processed_data.columns = processed_data.columns.rename("year").astype(int)
processed_da = processed_data.stack().rename("jrc-idees-industry-twh").to_xarray()
@irm-codebase (Contributor) commented Apr 18, 2024:

The use of jrc-idees-industry-twh as the data variable name is quite confusing, since it also shows up in production, which is not in energy units. As far as I understand:

  • energy data is in twh.
  • production data tends to be in kt, but twh is still in the data variable name.

If possible/sensible, please:

  • For the energy dataset, consider unstacking the energy coordinate into two data variables (final, useful). It would fit the use of xr.Dataset a bit better, since these two variables apply to all other coordinates.
  • For production, remove the (kt) from the produced_material coordinate, since it's non-atomic usage.
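For illustration, the unstacking suggested in the first bullet might look like this (dimension, coordinate, and variable names are assumptions based on the discussion, not the PR's actual code):

import numpy as np
import xarray as xr

# Hypothetical energy DataArray with an "energy" dimension holding the
# final/useful split (shapes and values are illustrative).
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"energy": ["final", "useful"], "year": [2000, 2010, 2015]},
    dims=["energy", "year"],
)

# Unstack the "energy" coordinate into two data variables:
ds = da.to_dataset(dim="energy")  # data variables: "final", "useful"

# Keep units in attributes rather than in variable names:
for var in ds.data_vars:
    ds[var].attrs["unit"] = "twh"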

Contributor:

This proooooobably affects other outputs of this JRC module. But it would make further processing easier on our side.

country_code=df.index.name.split(":")[0],
cat_name=df.index.name.split(": ")[1],
)
.rename_axis(index="produced_material")
Contributor:

This one is also a tad confusing. It seems that it's the production method of a given material in some cases (Electric arc, Integrated steelworks both give kt of steel?).

In some cases it has units, in others it does not. Is there a way to make this tidier?

Member Author:

Not without lots of hardcoding. I would leave this to downstream applications and not this processing script, I think.

@irm-codebase mentioned this pull request on Apr 22, 2024.
@irm-codebase added the Industry (Industrial energy demand) label on Apr 23, 2024.
@irm-codebase (Contributor):

One last thing I noticed while processing data for #355: it seems this new JRC processing changes the final amounts for some energy carriers. Is this expected?

While implementing the processing for "other" industries, I ran into checksum differences between the JRC data in SCEC and EC. At first I thought it was the different energy unit (ktoe to twh), but after more testing I found that this is not the case.

Should I open a separate issue for this, or would you like to debug it within this PR?


Here are some checksum results before / after this JRC update:

Before (in SCEC), already converted to twh:

ipdb> jrc_energy_df.loc[jrc_energy_df.index.get_level_values("carrier_name") == "Diesel oil (incl. biofuels)"].sum().sum()
2848.481153523807
ipdb> 

After (using this PR in EC):

ipdb> jrc_energy.sel(carrier_name="Diesel oil (incl. biofuels)").sum()
<xarray.Dataset>
Dimensions:       ()
Coordinates:
    carrier_name  <U27 'Diesel oil (incl. biofuels)'
Data variables:
    value         float64 2.309e+03

@brynpickering (Member Author):

@timtroendle I've moved back to non-directory output. See what you think about the country filtering - do you think it's overkill?

@irm-codebase I've cleaned up the output data files. Units are now only referenced in the xarray attributes, "..-twh" doesn't get attached to either dataset name. Energy is a dataset with two variables, production is a single array.

I haven't managed to check the data inconsistency yet. I'll do that tomorrow.

@brynpickering (Member Author):

I've cleaned up the code so that it now matches the output of the old SC-EC notebook (here). There were some issues with one of the cell colours not being captured for the chemicals industry, and with assigning carrier names to subsections. Both issues are now fixed, and there are logging / data checking points added to keep an eye on this in future.

I've removed country filtering, as it actually breaks missing-data infilling for the heat sector. The heat processing could be cleaned up to handle only a subset of countries, but I'd say that's a separate PR.

@timtroendle (Member):

Separate comment here re: country filtering. I am totally fine with not filtering countries at this stage of the workflow. I agree that it adds unnecessary complexity, especially considering that these files and their processing are very lightweight.

@timtroendle (Member) left a comment:

Looks all good to me. I find the file output clean and I agree with the non-filtering of countries.

Honestly, I went rather quickly through this as I had no major comments in the first iteration anyway. If there is anything specific I should have a close look at other than the file output, the country filtering, and the minor comments above, please let me know.

@@ -144,6 +144,9 @@ properties:
  root-directory:
    type: string
    description: Path to the root directory of euro-calliope containing scripts and template folders.
  max-threads:
Member:

This parameter doesn't exist anymore and it should therefore not be in schema.yaml.

@@ -31,9 +31,8 @@

def process_jrc_heat_tertiary_sector_data(
-    paths_to_national_data: list[str], out_path: str
+    paths_to_national_data: list[Path], out_path: str
Member:

Wait! These cannot be paths, they are strings coming directly from Snakemake.

@@ -39,12 +39,11 @@

def process_jrc_transport_data(
-    paths_to_data: list[str],
-    dataset: object,
+    paths_to_data: list[Path],
Member:

Hm... is it me, or is this wrong too? Or is this a new Snakemake feature I am not aware of?

@irm-codebase (Contributor) commented May 1, 2024:

I believe that is just a Python annotation for better hinting?
(Edit: upon further inspection, you may be right here! This one comes from Snakemake, and I am not sure how they process paths beyond strings.)
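For context: in Snakemake's script integration, the injected snakemake.input is a Namedlist whose items behave as strings, so a list[Path] annotation only becomes accurate after an explicit conversion. A minimal sketch:

from pathlib import Path

# "snakemake" is injected by Snakemake's script integration, not imported.
# Its input items behave as strings; convert if the signature promises Path:
paths_to_data = [Path(p) for p in snakemake.input]  # now genuinely list[Path]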

@irm-codebase (Contributor):

> Separate comment here re: country filtering. I am totally fine with not filtering countries at this stage of the workflow. I agree that it adds unnecessary complexity, especially considering that these files and their processing are very lightweight.

Honestly... I agree. It was not a good suggestion on my part. I did not follow KISS rules on this one...
Besides, if we do end up moving to a more modular setup, it makes sense to just filter stuff out at the end.

@irm-codebase (Contributor) left a comment:

Small comment regarding default exception ignoring in one file. Rest looks good!
I'll approve this. Change that line at your discretion :)

def convert_valid_countries(country_codes: list, output: str = "alpha3") -> dict:
    """
    Convert a list of country codes / names to a list of uniform ISO coded country
    codes. If an input item isn't a valid country (e.g. "EU27") then print the code and
Contributor:

I'd be careful about skipping exceptions by default, since this is a utility function.

I'd introduce a flag to ignore exceptions, with the default being to raise them. This helps avoid issues down the line, since exception skipping is made explicit in the calling code.
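A sketch of the suggested interface. This assumes pycountry as the underlying lookup (an assumption here, chosen because it raises LookupError for unknown codes):

import pycountry

def convert_valid_countries(
    country_codes: list, output: str = "alpha3", raise_on_invalid: bool = True
) -> dict:
    """Map country codes / names to uniform ISO codes.

    Invalid entries (e.g. "EU27") raise LookupError by default; pass
    raise_on_invalid=False to skip them with a printed message instead.
    """
    mapping = {}
    for code in country_codes:
        try:
            country = pycountry.countries.lookup(code)
        except LookupError:
            if raise_on_invalid:
                raise
            print(f"Skipping invalid country code: {code}")
            continue
        mapping[code] = country.alpha_3 if output == "alpha3" else country.alpha_2
    return mapping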

@brynpickering merged commit 739d2c6 into develop on May 2, 2024. 4 checks passed.
@brynpickering deleted the add-jrc-idees-industry-processing branch on May 2, 2024, 11:52.
@irm-codebase (Contributor):

@brynpickering just to note that this fix might have introduced some extra hurdles... see #383

jnnr pushed a commit to jnnr/euro-calliope that referenced this pull request on Aug 27, 2024 (add-jrc-idees-industry-processing: Add industry JRC data processing).
jnnr pushed a commit to jnnr/euro-calliope that referenced this pull request on Sep 3, 2024 (add-jrc-idees-industry-processing: Add industry JRC data processing).
Labels: Industry (Industrial energy demand)
Linked issue: Industry: add JRC processing step to remove raw_data files
3 participants