Add industry JRC data processing #354
Conversation
A couple of nice-to-haves. Nothing too important.
```python
@@ -0,0 +1,47 @@
"Rules regarding JRC-IDEES Data"

JRC_IDEES_SPATIAL_SCOPE = [
```
Nice-to-have:
Should we consider moving this to the configuration?
As it is, data will always be downloaded for all JRC countries even if it is unused (such as in the minimal configuration).
No, you need data from some countries to fill in neighbours, so even if they're not in the list of model countries, we need to pull in and process all the data.
I've now combined main countries and infill countries into one list, and only unzip and process that list of countries. It's a bit of a proof of concept and might be too verbose to be worth keeping.
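A minimal Python sketch of the kind of combined list being described; the infill-neighbour mapping here is a made-up stand-in, not the project's actual one:

```python
# Hypothetical sketch: combine model countries with the neighbours needed
# to infill missing JRC-IDEES data, so only that set is unzipped/processed.
INFILL_NEIGHBOURS = {  # assumed mapping, purely for illustration
    "CHE": ["AUT", "DEU"],
    "NOR": ["SWE"],
}


def countries_to_process(model_countries):
    """Return model countries plus any infill neighbours, without duplicates."""
    combined = set(model_countries)
    for country in model_countries:
        combined.update(INFILL_NEIGHBOURS.get(country, []))
    return sorted(combined)
```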
Thank you!
And agreed: if this does not work well with the rest of the workflow, I suppose it's fine to just do all countries, unless the files are huge.
I started my review and I am trying to finish it today, but it may take a while (because I need to do a few other things first).
I have a few local changes to commit that address the file unzipping (and downloading), limiting it to just the defined countries and their infill neighbours (if necessary). Will push that later this evening.
Ok, I will finish the review tomorrow in that case.
@timtroendle ready for review now. I'm aware that
rules/jrc-idees.smk (Outdated)

```python
countries = config["scope"]["spatial"]["countries"]
wildcard_constraints:
    sector = "((industry)|(transport)|(tertiary))"
output: temp(directory("build/data/jrc-idees/{sector}/unprocessed"))
```
I'm outputting to a folder rather than wildcarded individual files because it makes other references to this data lighter (heat, transport, etc. only need to reference the directory). It also seems sensible to me to do all the unzipping in one go into a temporary directory: it is a very quick step, and all the files get deleted as soon as the downstream processing is complete. The rule is much simpler when not trying to filter countries (see earlier commits).
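A rough Snakemake sketch of the directory-output pattern being described here; the rule name, input zip path, and shell command are assumptions for illustration, not taken from this PR:

```snakemake
rule unzip_jrc_idees:  # hypothetical rule name
    message: "Unzip all JRC-IDEES {wildcards.sector} workbooks in one go"
    input: "data/automatic/jrc-idees-{sector}.zip"  # assumed download location
    wildcard_constraints:
        sector = "((industry)|(transport)|(tertiary))"
    # temp(directory(...)) marks the whole folder as a temporary output:
    # downstream rules reference the directory, and it is deleted once they finish.
    output: temp(directory("build/data/jrc-idees/{sector}/unprocessed"))
    shell: "unzip {input} -d {output}"
```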
scripts/heat/jrc_idees.py (Outdated)
This file will move/change eventually. I have a script ready to process tertiary sector data that is comparable to the industry one in this PR
scripts/transport/jrc_idees.py (Outdated)
This file will move/change eventually. I have a script ready to process transport sector data that is comparable to the industry one in this PR
rules/transport.smk (Outdated)

```python
    "build/data/jrc-idees/transport/unprocessed/{country_code}.xlsx",
    country_code=JRC_IDEES_SCOPE
)
data = "build/data/jrc-idees/transport/unprocessed"
```
Question: will this break Snakemake operations? Passing just a folder might break things, since Snakemake won't detect changes in the individual files when building the DAG. If this is due to my suggestion to make the extracted countries flexible, then it was not a good one... Another option is to add a function in the smk file that builds an array of the filenames in this folder, and to specify rule order...
The directory is temporary: it is deleted once all rules that rely on it have completed (in the case of JRC-IDEES data, there is only one downstream rule per unzipped directory). It's not really related to keeping the extracted countries flexible; it's more about limiting the number of times you have to reference that array of filenames from different rules.
This means that whenever you build the DAG again, if a rule that relies on the directory needs to re-run, the files will have to be unzipped into the directory again (since they have been deleted). Since unzipping is a light operation, this keeps us from having lots of unnecessary files lying around (you can get them back whenever you want by unzipping the downloaded data).
Just a few comments in relation to SMK DAG generation.
Looks good, but I had a range of minor comments.
```python
@@ -16,6 +16,7 @@ techs_template_dir = f"{model_template_dir}techs/"

include: "./rules/shapes.smk"
include: "./rules/data.smk"
include: "./rules/jrc-idees.smk"
```
Not particularly important, but I moved the previous `eurostat.smk` and `jrc-idees.smk` into `data.smk`, as the container for all downloading and pre-processing of data that is not sector-specific. The idea being that we don't generate too many rule files, especially not rule files that aren't feature-based. You didn't like that idea?
I prefer one per source, as they do become large enough rule files to be worth splitting off. It is also in line with the concept of modularising different major data sources.
One could imagine a future where we split off JRC processing completely and just store the pre-built files on zenodo for convenience.
tests/lib/test_utils.py (Outdated)

```python
class TestRenameAndGroupby:
    @pytest.fixture
    def da(self, request):
        data = [1, 2, 3, 4, 5]
```
Maybe just using "1" as the value everywhere could make the tests easier to understand.
It's good to have different values so you know that you've got the correct index items selected / grouped / summed later.
tests/lib/test_utils.py (Outdated)

```python
    ],
)
@pytest.mark.parametrize("da", ["single_index", "multi_index"], indirect=True)
def test_rename_groupby_keep_renamed(self, rename_dict, expected, da):
```
The name should be `keep_non_renamed`, right?
Yes. I've also changed this to `drop_other_dim_items` to try and make the argument easier to understand.
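A plain-dict analogue of the rename-and-groupby behaviour under discussion may help; the real utility operates on xarray objects, so the function below is only an illustrative sketch of the semantics, not the project's implementation:

```python
# Illustrative sketch only: rename keys via rename_dict, summing values that
# map to the same new name. Items not in rename_dict are dropped or kept
# depending on drop_other_dim_items.
def rename_and_groupby(values, rename_dict, drop_other_dim_items=True):
    out = {}
    for key, value in values.items():
        if key in rename_dict:
            new_key = rename_dict[key]
        elif drop_other_dim_items:
            continue  # unmapped item is dropped
        else:
            new_key = key  # unmapped item is kept under its original name
        out[new_key] = out.get(new_key, 0) + value
    return out
```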
tests/lib/test_utils.py
Outdated
[ | ||
( | ||
{"A": "A1", "B": "B1", "C": "C1", "D": "D1", "E": "E1"}, | ||
{"A1": [1, 5], "B1": [np.nan, 4], "D1": [4, np.nan], "E1": [5, 1]}, |
Hm, this expected behaviour surprises me. So if `dropna=True` you do not sum?
No, it's just a convenience to be able to compare to a single value in the other tests, as they parameterise over single- and multi-index dataarrays. Comparing either [1, 1] or [1] to 1 succeeds with numpy array comparisons. You can't do this with NaNs, as the data is different in each array being compared (e.g. [1, np.nan]).
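The numpy comparison behaviour being relied on here can be illustrated directly (a standalone sketch, not code from the PR):

```python
import numpy as np

# Broadcasting lets a scalar expectation match arrays of either length...
assert (np.array([1, 1]) == 1).all()
assert (np.array([1]) == 1).all()

# ...but NaN never compares equal with ==, so NaN-containing expectations
# need a NaN-aware comparison such as np.testing.assert_array_equal.
assert not (np.array([1.0, np.nan]) == np.array([1.0, np.nan])).all()
np.testing.assert_array_equal(np.array([1.0, np.nan]), np.array([1.0, np.nan]))
```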
tests/lib/test_utils.py
Outdated
( | ||
{"A": "A1", "B": "B1", "C": "C1", "D": "D1", "E": "E1"}, | ||
{ | ||
"A1": [1, 5], |
This expected behaviour also surprises me. Why is this not summed? And where does the `5` come from?
I have added a bit more explanation in the creation of the fixture.
@timtroendle the only remaining question here is on the use of a directory output for unzipped JRC data. It's really a question of whether to filter countries or not. If we don't filter countries, then we can easily return to the rule unzipping individual files. If we filter, then it becomes reasonably cumbersome to reference the filtering helper function in multiple rule files. I could move the heat and transport JRC processing rules into
Suggested some small improvements to the outputted xarray datasets.
scripts/jrc-idees/industry.py (Outdated)

```python
unit = "kt"

processed_data.columns = processed_data.columns.rename("year").astype(int)
processed_da = processed_data.stack().rename("jrc-idees-industry-twh").to_xarray()
```
The use of `jrc-idees-industry-twh` as the data variable name is quite confusing, since it also shows up in production, which is not in energy units. As far as I understand:

- energy data is in `twh`.
- production data tends to be in `kt`, but twh is still in the data variable name.

If possible/sensible, please:

- For the `energy` dataset, consider unstacking the `energy` coordinate into two data variables (`final`, `useful`). It would fit the use of `xr.Dataset` a bit better, since these two variables apply to all other coordinates.
- For production, remove the (kt) from the `produced_material` coordinate, since it's non-atomic usage.
This proooooobably affects other outputs of this JRC module. But it would make further processing easier on our side.
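A rough pandas analogue of the suggested unstacking; the table, values, and column names below are invented for illustration (the real data is an xarray object):

```python
import pandas as pd

# Hypothetical long-format energy table with an "energy" level holding the
# final/useful split discussed above.
df = pd.DataFrame({
    "country": ["AT", "AT", "AT", "AT"],
    "energy": ["final", "useful", "final", "useful"],
    "year": [2000, 2000, 2001, 2001],
    "value": [10.0, 7.0, 11.0, 8.0],
})

# Unstacking the "energy" level turns its two entries into two columns,
# analogous to two data variables in an xr.Dataset.
wide = df.pivot_table(index=["country", "year"], columns="energy", values="value")
```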
```python
    country_code=df.index.name.split(":")[0],
    cat_name=df.index.name.split(": ")[1],
)
.rename_axis(index="produced_material")
```
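The index-name parsing above can be illustrated in isolation; the sample sheet name below is an assumption about the workbook format, not taken from the PR:

```python
# Assumed index-name format: "<country code>: <category name>"
index_name = "AT: Iron and steel"

# Split once on ":" for the country code, and on ": " for the category name,
# mirroring the two split calls in the snippet above.
country_code = index_name.split(":")[0]
cat_name = index_name.split(": ")[1]
```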
This one is also a tad confusing. In some cases it seems to be the production method of a given material (Electric arc and Integrated steelworks both give kt of steel?). In some cases it has units, in others it does not. Is there a way to make this tidier?
Not without lots of hardcoding. I would leave this to downstream applications and not this processing script, I think.
One last thing I noticed while processing data for #355: it seems this new JRC changes the final amounts for some energy carriers. Is this expected? While implementing the processing for "other" industries, I ran into checksum differences between the JRC data in SCEC and EC. At first I thought it was the different energy unit (…). Should I open a separate issue for this, or would you like to debug it within this PR?

Here are some checksum results before / after this JRC update.

Before (in SCEC), already converted:

```python
ipdb> jrc_energy_df.loc[jrc_energy_df.index.get_level_values("carrier_name") == "Diesel oil (incl. biofuels)"].sum().sum()
2848.481153523807
```

After (using this PR in EC):

```python
ipdb> jrc_energy.sel(carrier_name="Diesel oil (incl. biofuels)").sum()
<xarray.Dataset>
Dimensions:       ()
Coordinates:
    carrier_name  <U27 'Diesel oil (incl. biofuels)'
Data variables:
    value         float64 2.309e+03
```
@timtroendle I've moved back to non-directory output. See what you think about the country filtering - do you think it's overkill? @irm-codebase I've cleaned up the output data files. Units are now only referenced in the xarray attributes, and "..-twh" is no longer attached to either dataset name. Energy is a dataset with two variables; production is a single array. I haven't managed to check the data inconsistency yet; I'll do that tomorrow.
I've cleaned up the code so that it now matches the output of the old SC-EC notebook (here). There were some issues with one of the cell colours not being captured for the chemicals industry, and with assigning carrier names to subsections. Both issues are now fixed, and there are logging / data checking points added to keep an eye on them in future. I've removed country filtering, as it actually kills missing-data filling for the heat sector. It could be cleaned up there to handle only a subset of countries, but I'd say that's a separate PR.
Separate comment here re: country filtering. I am totally fine with not filtering countries at this stage of the workflow. I agree that it adds unnecessary complexity, especially considering that these files and their processing are very lightweight.
Looks all good to me. I find the file output clean and I agree with the non-filtering of countries.
Honestly, I went rather quickly through this as I had no major comments in the first iteration anyways. If there is anything specific I should have a close look at other than the file output, the country filtering, and the minor comments above, please let me know.
config/schema.yaml (Outdated)

```yaml
@@ -144,6 +144,9 @@ properties:
  root-directory:
    type: string
    description: Path to the root directory of euro-calliope containing scripts and template folders.
  max-threads:
```
This parameter doesn't exist anymore and it should therefore not be in schema.yaml.
scripts/jrc-idees/heat.py (Outdated)

```diff
@@ -31,9 +31,8 @@

 def process_jrc_heat_tertiary_sector_data(
-    paths_to_national_data: list[str], out_path: str
+    paths_to_national_data: list[Path], out_path: str
```
Wait! These cannot be paths, they are strings coming directly from Snakemake.
scripts/jrc-idees/transport.py (Outdated)

```diff
@@ -39,12 +39,11 @@

 def process_jrc_transport_data(
-    paths_to_data: list[str],
-    dataset: object,
+    paths_to_data: list[Path],
```
Hm... is it me, or is this wrong too? Or is this a new Snakemake feature I am not aware of?
I believe that is just a Python annotation for better hinting? (edit: upon further inspection, you may be right here! This one comes from Snakemake, and I am not sure how they process paths beyond strings.)
Honestly... I agree. It was not a good suggestion on my part. I did not follow KISS rules on this one...
Small comment regarding default exception ignoring in one file. Rest looks good!
I'll approve this. Change that line at your discretion :)
lib/eurocalliopelib/utils.py (Outdated)

```python
def convert_valid_countries(country_codes: list, output: str = "alpha3") -> dict:
    """
    Convert a list of country codes / names to a list of uniform ISO coded country
    codes. If an input item isn't a valid country (e.g. "EU27") then print the code and
```
I'd be careful about skipping exceptions by default, since this is a utility function. I'd introduce a flag to ignore exceptions, with the default being to raise them. This helps avoid issues down the line, since exception skipping is made explicit in the calling code.
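A minimal sketch of the suggested flag; the lookup table and the `errors` parameter name are assumptions for illustration (the real function presumably uses a pycountry-style lookup):

```python
# Stand-in for a real country lookup; purely illustrative.
VALID = {"DEU": "DEU", "Germany": "DEU"}


def convert_valid_countries(country_codes, output="alpha3", errors="raise"):
    """Convert country codes/names to uniform ISO codes.

    With errors="raise" (the default), invalid inputs raise; with
    errors="ignore", they are skipped with a message, making the
    lenient behaviour an explicit choice at the call site.
    """
    converted = {}
    for code in country_codes:
        if code in VALID:
            converted[code] = VALID[code]
        elif errors == "raise":
            raise LookupError(f"{code} is not a valid country")
        else:
            print(f"Skipping {code}: not a valid country")
    return converted
```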
@brynpickering just to note that this fix might have introduced some extra hurdles... see #383
…-idees-industry-processing Add industry JRC data processing
Fixes #345.
Workflow checks haven't completed on my device.
This starts the process of modularising the JRC-IDEES data processing pipeline, which I will later apply to transport and heat (and replace the existing processing scripts).
Checklist
Any checks which are not relevant to the PR can be pre-checked by the PR creator. All others should be checked by the reviewer. You can add extra checklist items here if required by the PR.