Add industry JRC data processing #354
`schema.yaml`:

```diff
@@ -144,6 +144,9 @@ properties:
   root-directory:
     type: string
     description: Path to the root directory of euro-calliope containing scripts and template folders.
+  max-threads:
+    type: integer
+    description: maximum available threads for multiprocessing, in those rules that are able to accept multiple threads.
   cluster-sync:
     type: object
     description: Configuration for the "work local, build on remote" workflow.
```

> **Review comment** on `max-threads` (marked resolved): This parameter doesn't exist anymore and it should therefore not be in `schema.yaml`.
Utility functions module:

```diff
@@ -1,6 +1,8 @@
 """Utility functions."""
 
+import pandas as pd
 import pycountry
+import xarray as xr
 
 
 def eu_country_code_to_iso3(eu_country_code):
```
```diff
@@ -16,10 +18,10 @@ def eu_country_code_to_iso3(eu_country_code):
 
 def convert_country_code(input_country, output="alpha3"):
     """
-    Converts input country code or name into either a 2- or 3-letter code.
+    Converts input country code or name into either either a 2- or 3-letter code.
 
     ISO alpha2: alpha2
-    ISO alpha2 with Eurostat codes: alpha2_eu
+    ISO alpha2 with Eurostat codes: alpha2_eurostat
     ISO alpha3: alpha3
 
     """
```
```diff
@@ -36,7 +38,7 @@ def convert_country_code(input_country, output="alpha3"):
     if output == "alpha2":
         return pycountry.countries.lookup(input_country).alpha_2
 
-    if output == "alpha2_eu":
+    if output == "alpha2_eurostat":
         result = pycountry.countries.lookup(input_country).alpha_2
         if result == "GB":
             return "UK"
```
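The Eurostat override above is small enough to illustrate with a stdlib-only sketch; the function and dict names here are hypothetical stand-ins for the `pycountry`-based logic in `convert_country_code`:

```python
# Hypothetical, stdlib-only sketch of the alpha2_eurostat branch above:
# Eurostat uses ISO alpha-2 codes, except that the United Kingdom is "UK", not "GB".
ISO_TO_EUROSTAT = {"GB": "UK"}  # only the override shown in the diff

def to_alpha2_eurostat(alpha2: str) -> str:
    """Map an ISO alpha-2 code to its Eurostat equivalent."""
    return ISO_TO_EUROSTAT.get(alpha2, alpha2)

print(to_alpha2_eurostat("GB"))  # → UK
print(to_alpha2_eurostat("FR"))  # → FR
```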
```diff
@@ -49,12 +51,101 @@ def convert_country_code(input_country, output="alpha3"):
     return pycountry.countries.lookup(input_country).alpha_3
 
 
+# conversion utils
+def convert_valid_countries(country_codes: list, output: str = "alpha3") -> dict:
+    """
+    Convert a list of country codes / names to a list of uniform ISO coded country
+    codes. If an input item isn't a valid country (e.g. "EU27") then print the code and
+    continue, instead of raising an exception
+
+    Args:
+        country_codes (list):
+            Strings defining country codes / names
+            (["France", "FRA", "FR"] will all be treated the same)
+
+    Returns:
+        dict: Mapping from input country code/name to output country code for all valid input countries
+    """
+
+    mapped_codes = {}
+    for country_code in country_codes:
+        try:
+            mapped_codes[country_code] = convert_country_code(
+                country_code, output=output
+            )
+        except LookupError:
+            print(f"Skipping country/region {country_code} in annual energy balances")
+            continue
+    return mapped_codes
```

> **Review comment** on `convert_valid_countries`: I'd be careful around skipping exceptions by default, since this is a utility function. I'd introduce a flag to ignore exceptions, with the default being raising them. This helps avoid issues down the line, since exception skipping is made explicit in the code calling this.
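A sketch of the reviewer's suggestion, with `lookup` as a toy stand-in for `convert_country_code` (names and mapping are hypothetical): invalid inputs raise by default, and skipping must be requested explicitly by the caller:

```python
def lookup(code: str) -> str:
    # Toy stand-in for convert_country_code; the real code uses pycountry.
    table = {"France": "FRA", "FRA": "FRA", "FR": "FRA"}
    if code not in table:
        raise LookupError(code)
    return table[code]

def convert_valid_countries(country_codes: list, ignore_invalid: bool = False) -> dict:
    """Map inputs to ISO codes; raise on invalid input unless ignore_invalid=True."""
    mapped = {}
    for code in country_codes:
        try:
            mapped[code] = lookup(code)
        except LookupError:
            if not ignore_invalid:
                raise  # default: surface the problem to the caller
            print(f"Skipping country/region {code}")
    return mapped

convert_valid_countries(["France", "EU27"], ignore_invalid=True)  # skips "EU27"
```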
```diff
+def rename_and_groupby(
+    da: xr.DataArray,
+    rename_dict: dict,
+    dim_name: str,
+    new_dim_name: str = None,
+    dropna: bool = False,
+    keep_non_renamed: bool = False,
+) -> xr.DataArray:
+    """
+    Take an xarray dataarray and rename the contents of a given dimension
+    as well as (optionally) rename that dimension.
+    If renaming the contents has some overlap (e.g. {'foo' : 'A', 'bar': 'A'})
+    then the returned dataarray will be grouped over the new dimension items
+    (by summing the data).
+
+    Args:
+        da (xr.DataArray):
+            Input dataarray with the dimension "dim_name".
+        rename_dict (dict):
+            Dictionary to map items in the dimension "dim_name" to new names ({"old_item_name": "new_item_name"}).
+        dim_name (str):
+            Dimension on which to rename items.
+        new_dim_name (str, optional): Defaults to None.
+            If not None, rename the dimension "dim_name" to the given string.
+        dropna (bool, optional): Defaults to False.
+            If True, drop any items in "dim_name" after renaming/grouping which have all NaN values along all other dimensions.
+        keep_non_renamed (bool, optional): Defaults to False.
+            If False, any item in "dim_name" that is not referred to in "rename_dict" will be removed from that dimension in the returned array.
+    Returns:
+        (xr.DataArray): Same as "da" but with the items in "dim_name" renamed and possibly a. grouped, b. "dim_name" itself renamed.
+    """
+    rename_series = pd.Series(rename_dict).rename_axis(index=dim_name)
+    if keep_non_renamed is True:
+        existing_dim_items = da[dim_name].to_series()
+        rename_series = rename_series.reindex(existing_dim_items).fillna(
+            existing_dim_items
+        )
+
+    if new_dim_name is None:
+        new_dim_name = f"_{dim_name}"  # placeholder that we'll revert
+        revert_dim_name = True
+    else:
+        revert_dim_name = False
+
+    rename_da = xr.DataArray(rename_series.rename(new_dim_name))
+    da = (
+        da.reindex({dim_name: rename_da[dim_name]})
+        .groupby(rename_da)
+        .sum(dim_name, skipna=True, min_count=1, keep_attrs=True)
+    )
+    if revert_dim_name:
+        da = da.rename({new_dim_name: dim_name})
+        new_dim_name = dim_name
+    if dropna:
+        da = da.dropna(new_dim_name, how="all")
+    return da
+
+
 def ktoe_to_twh(array):
     """Convert KTOE to TWH"""
     return array * 1.163e-2
 
 
+def gwh_to_tj(array):
+    """Convert GWh to TJ"""
+    return array * 3.6
+
+
 def pj_to_twh(array):
     """Convert PJ to TWh"""
     return array / 3.6
```
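The core behaviour of `rename_and_groupby` — rename dimension labels, then sum values whose new labels collide — can be mimicked without xarray. This stdlib-only sketch is illustrative only; the function name and dict-based "dimension" are hypothetical simplifications:

```python
from collections import defaultdict

def rename_and_sum(values: dict, rename: dict, keep_non_renamed: bool = False) -> dict:
    """Rename keys via `rename`; values whose new keys collide are summed."""
    out = defaultdict(float)
    for label, value in values.items():
        if label in rename:
            out[rename[label]] += value
        elif keep_non_renamed:
            out[label] += value  # keep unmapped labels as-is
        # else: unmapped labels are dropped, as in the default behaviour
    return dict(out)

data = {"foo": 1.0, "bar": 2.0, "baz": 5.0}
rename_and_sum(data, {"foo": "A", "bar": "A"})  # → {"A": 3.0}
```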
```diff
@@ -63,8 +154,3 @@ def pj_to_twh(array):
 def tj_to_twh(array):
     """Convert TJ to TWh"""
     return pj_to_twh(array) / 1000
-
-
-def gwh_to_tj(array):
-    """Convert GWh to TJ"""
-    return array * 3.6
```
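The conversion helpers are easy to sanity-check against each other (the functions are reproduced from the diff; the assertions are mine):

```python
# Helpers as defined in the diff above.
def ktoe_to_twh(array):
    """Convert KTOE to TWH"""
    return array * 1.163e-2

def gwh_to_tj(array):
    """Convert GWh to TJ"""
    return array * 3.6

def pj_to_twh(array):
    """Convert PJ to TWh"""
    return array / 3.6

def tj_to_twh(array):
    """Convert TJ to TWh"""
    return pj_to_twh(array) / 1000

# 3.6 PJ is exactly 1 TWh, and 3600 TJ is exactly 1 TWh.
assert pj_to_twh(3.6) == 1.0
assert tj_to_twh(3600) == 1.0
# 1000 GWh converted to TJ and back to TWh round-trips to 1 TWh.
assert abs(tj_to_twh(gwh_to_tj(1000)) - 1.0) < 1e-9
```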
New rules file (JRC-IDEES), +47 lines:

```diff
@@ -0,0 +1,47 @@
+"Rules regarding JRC-IDEES Data"
+
+JRC_IDEES_SPATIAL_SCOPE = [
+    "AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "EL", "ES", "FI", "FR",
+    "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL", "PL", "PT", "RO",
+    "SE", "SI", "SK", "UK"
+]
```

> **Comment** on `JRC_IDEES_SPATIAL_SCOPE`: Nice-to-have:
>
> **Reply:** No, you need data from some countries to fill in neighbours, so even if they're not in the list of model countries we need to pull in and process all data.
>
> **Reply:** I've now combined main countries and infill countries into one list and only unzip and process that list of countries. It's a bit of a proof of concept and might be too verbose to be worth keeping.
>
> **Reply:** Thank you!
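The unzip rule further down filters the configured model countries down to this scope. A stdlib-only sketch of that filter, with a toy alpha-2 mapping standing in for `pycountry.countries.lookup(...).alpha_2` (all names and values here are hypothetical):

```python
# Truncated scope list and toy country lookup, for illustration only.
JRC_IDEES_SPATIAL_SCOPE = ["AT", "BE", "FR", "UK"]
TOY_ALPHA2 = {"France": "FR", "Austria": "AT", "Norway": "NO"}

model_countries = ["France", "Austria", "Norway"]

# Same shape as the list comprehension in rule jrc_idees_unzipped:
# convert names to alpha-2 codes, keep only those inside the JRC scope.
zips = [
    f"data/automatic/jrc-idees/{code}.zip"
    for code in (TOY_ALPHA2[c] for c in model_countries)
    if code in JRC_IDEES_SPATIAL_SCOPE
]
# Norway ("NO") falls outside the JRC scope, so only the FR and AT zips remain.
print(zips)
```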
```diff
+rule download_jrc_idees_zipped:
+    message: "Download JRC IDEES zip file for {wildcards.country_code}"
+    params: url = config["data-sources"]["jrc-idees"]
+    output: protected("data/automatic/jrc-idees/{country_code}.zip")
+    conda: "../envs/shell.yaml"
+    localrule: True
+    shell: "curl -sSLo {output} '{params.url}'"
+
+
+rule jrc_idees_unzipped:
+    message: "Unzip all JRC-IDEES {wildcards.sector} sector country data"
+    input:
+        countries = [
+            f"data/automatic/jrc-idees/{country_code}.zip"
+            for country_code in [
+                pycountry.countries.lookup(country).alpha_2 for country in config["scope"]["spatial"]["countries"]
+            ]
+            if country_code in JRC_IDEES_SPATIAL_SCOPE
+        ]
+    params: sector_title_case = lambda wildcards: wildcards.sector.title()
+    wildcard_constraints:
+        sector = "((industry)|(transport)|(tertiary))"
+    output: temp(directory("build/data/jrc-idees/{sector}/unprocessed"))
+    conda: "../envs/shell.yaml"
+    shell: "unzip 'data/automatic/jrc-idees/*.zip' '*{params.sector_title_case}*' -d {output}"
```

> **Comment** on the `directory(...)` output: I'd prefer to not use directory as an output. This is discouraged by Snakemake and it makes the code harder to understand. Without running the code, I have no idea what's happening here.
>
> **Reply:** hmm, ok. Problem is that there's the
>
> **Reply:** I'm outputting to a folder rather than wildcarded individual files because it makes other references to this data lighter (heat, transport, etc. only need to reference the directory) and it seems to make sense to me to do all the unzipping in one go into a temporary directory as it is a very quick step and all the files then get deleted as soon as the downstream processing is complete. The rule is much simpler when not trying to filter countries (see earlier commits)
```diff
+rule jrc_idees_industry_processed:
+    message: "Process {wildcards.dataset} industry data from JRC-IDEES to be used in understanding current and future industry demand"
+    input:
+        unprocessed_data = "build/data/jrc-idees/industry/unprocessed"
+    output: "build/data/jrc-idees/industry/processed-{dataset}.nc"
+    wildcard_constraints:
+        dataset = "((energy)|(production))"
+    conda: "../envs/default.yaml"
+    threads: config["max-threads"]
+    script: "../scripts/jrc-idees/industry.py"
```
> **Comment:** Not particularly important but I moved the previous `eurostat.smk` and `jrc-idees.smk` into `data.smk`, as the container for all downloading and pre-processing of all data that is not sector specific. The idea being that we don't generate too many rule files, especially not rule files that aren't feature-based. You didn't like that idea?
>
> **Reply:** I prefer one per source as they do become large enough rule files to be worth splitting off. It is also in line with the concept of modularising different major data sources.
>
> **Reply:** One could imagine a future where we split off JRC processing completely and just store the pre-built files on zenodo for convenience.