This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

Reading and writing datasets with the CLI

Norman Fomferra edited this page Sep 1, 2016 · 11 revisions

This page is no longer in use. Its content has been incorporated into the following, more up-to-date pages


We still have to work out a way to let users conveniently pass in dataset references that are opened by the CLI, and also to specify a file output for any workflow or operation output.

For example

  • If we have an input argument NAME=VALUE and the type of the input NAME is xarray.Dataset, then we could accept a command argument VALUE of the form DATASOURCE-ID,DATE1,DATE2 which would evaluate something like workflow.set_input(NAME, read_dataset(DATASOURCE-ID,DATE1,DATE2)).
  • If we have an input argument NAME=VALUE and the type of the input NAME is xarray.DataArray or xarray.Variable, then we could accept a command argument VALUE of the form VARIABLE-NAME,DATASOURCE-ID,DATE1,DATE2 which would evaluate something like workflow.set_input(NAME, read_dataset(DATASOURCE-ID,DATE1,DATE2)[VARIABLE-NAME]).
  • We could also accept URL[?QUERY] for web service hosted datasets.
  • If we have an output option argument --output NAME=VALUE and the type of the output NAME is Dataset, then we could accept a command argument VALUE of the form FILE-PATH[,FORMAT-NAME] which would evaluate something like write_dataset(workflow.get_output(NAME), FILE-PATH, FORMAT-NAME).
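To make the conventions above concrete, here is a minimal sketch of how the CLI could split a VALUE string according to the declared input type. The function names and the expected forms are assumptions taken from this proposal, not an existing API; read_dataset and workflow.set_input would be wired in around these parsers.

```python
def parse_dataset_value(value):
    """Split a VALUE of the form 'DATASOURCE-ID,DATE1,DATE2' into its parts.

    The result could feed read_dataset(datasource_id, date1, date2).
    """
    parts = value.split(",")
    if len(parts) != 3:
        raise ValueError("expected DATASOURCE-ID,DATE1,DATE2, got %r" % value)
    return parts


def parse_dataarray_value(value):
    """Split a VALUE of the form 'VARIABLE-NAME,DATASOURCE-ID,DATE1,DATE2'.

    The result could feed read_dataset(...)[variable_name].
    """
    parts = value.split(",")
    if len(parts) != 4:
        raise ValueError(
            "expected VARIABLE-NAME,DATASOURCE-ID,DATE1,DATE2, got %r" % value)
    return parts
```

A URL[?QUERY] form could be detected up front (e.g. by a leading http:// or https://) and routed to a web-service reader instead.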

-- Norman

This makes sense. However, it wouldn't really solve the underlying problem. The visualization op now takes a file path in which to save the plot, so having --output of type Dataset does not cover that case. I'm not exactly sure how to get around this, but we will have multiple types of workflow outputs: *.nc files for datasets, *.txt for comma-separated values or tables, and *.png, *.jpeg, *.pdf, and so on for plots.

This could of course be solved by having a giant command line invocation, such as "wflow.json --input1 SOME_DATASET --input2 ANOTHER_DATASET --startTime XXXX-XX-XX --endTime YYYY-YY-YY --output1 /home/user/Desktop/fig1.png --output2 /home/user/Desktop/fig2.png --output3 /home/user/Desktop/fig3.png --output4 /home/user/Desktop/fig4.png --output5 /home/user/Desktop/fig5.png --output6 /home/user/Desktop/correlation_parameters.txt"

Having a single '--outputFolder /home/user/Desktop' parameter on the command line, and then somehow using it to construct the actual output names, would be preferable. This can be done by using that one folder parameter to build the actual file names in 'expression' nodes. That is necessary because the only way to provide an input to a node is to connect it to another node: either the name comes in from the command-line invocation, or it is created in another node.
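The kind of name construction such an expression node would have to do is trivial; a sketch, assuming a helper of this shape existed (the function name is hypothetical):

```python
import os


def make_output_path(out_dir, name, ext):
    """Hypothetical expression-node helper: build one concrete output
    file path from the single --outputFolder argument, an output name,
    and a format extension."""
    return os.path.join(out_dir, name + "." + ext)
```

For example, make_output_path("/home/user/Desktop", "fig1", "png") would replace the explicit --output1 /home/user/Desktop/fig1.png argument above.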

Operations could have default values, but this again does not really solve the problem, as invoking an operation that writes an output twice would result in the output being overwritten. I'm not sure; maybe a giant CLI invocation is actually the way to go.

-- Jānis

Note that it is not --input1 SOME_DATASET but input1=SOME_DATASET. I proposed --output NAME=FILE or -o NAME=FILE to explicitly provide a FILE sink for the output named NAME. NAME need not be a Dataset, of course.

What about out_dir=... -o output1=$out_dir/fig1.png -o output2=$out_dir/fig2.png to shorten things? We could also register data_writers, giving us a mapping from format name, file extension, or data type to a function write_data(data, file), which would make this more convenient for users.
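A minimal sketch of such a data_writers registry, keyed by file extension. The registry, register_writer, and write_data are assumed names from this discussion, not an existing interface; the to_netcdf and savefig calls in the example registrations are the usual xarray and matplotlib methods, used here only for illustration.

```python
# Registry mapping a lowercase file extension to a writer callable.
DATA_WRITERS = {}


def register_writer(ext, writer):
    """Register writer(data, file_path) for files ending in '.ext'."""
    DATA_WRITERS[ext.lower()] = writer


def write_data(data, file_path):
    """Dispatch to a registered writer based on the file extension."""
    ext = file_path.rsplit(".", 1)[-1].lower()
    if ext not in DATA_WRITERS:
        raise ValueError("no writer registered for *.%s files" % ext)
    DATA_WRITERS[ext](data, file_path)


# Example registrations (illustrative only):
register_writer("nc", lambda ds, path: ds.to_netcdf(path))
register_writer("png", lambda fig, path: fig.savefig(path))
```

With this in place, -o output1=$out_dir/fig1.png needs no explicit format name: the .png extension selects the writer.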

Note that you could also combine the analysis in a Python function correlation_analysis that takes two datasets and pairs of variables plus a few options as inputs, and just the output directory as output. Use case 9 is generic enough to live in its own function!

-- Norman