Master file structure for each granule type could reduce code significantly and add flexibility? #58

Open
alex-s-gardner opened this issue Aug 13, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@alex-s-gardner
Collaborator

alex-s-gardner commented Aug 13, 2023

If we were to autogenerate maps of the h5 data tree with variable types for each granule type, then we could replace a lot of the repetitive code with loops. It may also allow us to move to a single points function for all granule types.

#55, #57, and most others consist largely of find-and-replace because everything is hardcoded. If we were to autogenerate maps of the h5 data tree with variable types and save the output in the package, then a lot of code could be replaced with loops. We could autogenerate new data trees each time a new version of the package was created.

This would also allow the user to pass a list of variables to the points function if they wanted more/less than the defaults.

This would also make it easier to accommodate new granule types.

@alex-s-gardner alex-s-gardner added the enhancement New feature or request label Aug 13, 2023
@alex-s-gardner
Collaborator Author

Looking at this more, I think the undertaking is more than I had anticipated. The idea is that a lot of the code looks like this:

    z = open_dataset(group, "land_ice_segments/h_li")[start:step:stop]::Vector{Float32}
    sigma_geo_h = open_dataset(group, "land_ice_segments/sigma_geo_h")[start:step:stop]::Vector{Float32}
    h_li_sigma = open_dataset(group, "land_ice_segments/h_li_sigma")[start:step:stop]::Vector{Float32}
    t = open_dataset(group, "land_ice_segments/delta_time")[start:step:stop]::Vector{Float64}
    q = open_dataset(group, "land_ice_segments/atl06_quality_summary")[start:step:stop]::Vector{Int8}
    dem = open_dataset(group, "land_ice_segments/dem/dem_h")[start:step:stop]::Vector{Float32}
    spot_number = read_attribute(group, "atlas_spot_number")::String
    atlas_beam_type = read_attribute(group, "atlas_beam_type")::String

This could be replaced with a more composable form along the lines of:

    vars = [
        ("land_ice_segments/h_li", Vector{Float32}, :variable)
        ("land_ice_segments/h_li_sigma", Vector{Float32}, :variable)
        ...
        ("atlas_beam_type", String, :attribute)
    ]

    # **this part would take some fancy metaprogramming**
    for var in vars
        if var[3] == :variable
            open_dataset(group, var[1])[start:step:stop]::var[2]
        else
            read_attribute(group, var[1])::var[2]
        end
    end
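
One way to sidestep the metaprogramming entirely (a sketch under the assumptions above, not the package's actual code) is a function barrier that takes the expected type as an argument, so the type assertion happens in ordinary generic code:

    using HDF5: open_dataset, read_attribute

    # Assumes `group`, `start`, `step`, `stop`, and a `vars` list as above.
    # The compiler specializes read_entry on T, so no @eval is needed.
    read_entry(group, path, ::Type{T}, idx, kind) where {T} =
        kind == :variable ? open_dataset(group, path)[idx]::T :
                            read_attribute(group, path)::T

    function read_all(group, vars, idx)
        out = Dict{String,Any}()
        for (path, T, kind) in vars
            out[path] = read_entry(group, path, T, idx, kind)
        end
        return out
    end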

@alex-s-gardner
Collaborator Author

I guess what we actually need (at least I do) is a higher-level h5 function that can accept a filename or fid and a list of variables and attributes to extract, and return a structure... I guess this would be an h5 subsetter, with the ability to subset coordinate ranges. Does something like this exist? I wrote a bad one for MATLAB that's in the Climate Data Toolbox: https://www.mathworks.com/matlabcentral/fileexchange/70338-climate-data-toolbox-for-matlab
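
A hypothetical signature for such a subsetter (the h5subset name and its keywords are illustrative, not an existing API):

    using HDF5

    # `source` is a filename or an already-open HDF5.File; `variables` are
    # dataset paths (assumed 1-D here), `attributes` are attribute names on
    # the file's root group, and `idx` crops the shared dimension.
    function h5subset(source::Union{AbstractString,HDF5.File};
                      variables = String[], attributes = String[], idx = Colon())
        h5 = source isa HDF5.File ? source : h5open(source, "r")
        vals = [Symbol(basename(p)) => h5[p][idx] for p in variables]
        atts = [Symbol(a) => read_attribute(h5, a) for a in attributes]
        return (; vals..., atts...)  # a NamedTuple as the returned "structure"
    end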

@evetion
Owner

evetion commented Aug 14, 2023

Yeah, this is something we need in order to maintain this package long-term if we're going to support more products/versions.

I love the idea of pre-generating most of this data; that helps a lot with validation.

But we also need something to implement:

  • renaming
  • repeat/fill attributes to match length/common (time) dimension
  • postprocess to change type (bool)
  • postprocess to handle matrix data
  • postprocess indexing/lookups like for ATL03 or GEDI
  • handling non-shared dimensions (mostly time now, maybe there are exceptions)
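
For instance (illustrative only, nothing like this exists yet), renaming and postprocessing could be plain functions attached to each entry:

    # hypothetical (path => (new_name, postprocess)) entries
    vars = Dict(
        "land_ice_segments/atl06_quality_summary" => ("quality", x -> x .== 0),  # Int8 flag -> Bool (0 = best quality)
        "land_ice_segments/h_li" => ("height", identity),                        # rename only
    )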

@alex-s-gardner
Collaborator Author

  • repeat attributes to match length

Are you thinking of this for Tables? Here I was thinking just to return the same as points. But the first thing I would do is turn it into a DataFrame... for that I'm planning to make use of FillArrays.jl.
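
For example (a sketch of the idea, not package code), a Fill column behaves like a full-length vector without actually allocating one:

    using DataFrames, FillArrays

    df = DataFrame(h_li = rand(Float32, 1000))     # stand-in for per-segment data
    df.atlas_beam_type = Fill("strong", nrow(df))  # scalar attribute, repeated lazily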

@evetion
Owner

evetion commented Aug 14, 2023

  • repeat attributes to match length

Are you thinking of this for Tables? Here I was thinking just to return the same as points. But the first thing I would do is turn it into a DataFrame... for that I'm planning to make use of FillArrays.jl.

I just meant that we need to take care to make sure all output has the same length/common dimension. Fill is ideal for attributes indeed.

@alex-s-gardner
Collaborator Author

alex-s-gardner commented Aug 14, 2023

@evetion should we use higher-level HDF5.jl functionality and read directly into a table?

e.g.

    using HDF5, DataFrames

    vars = ["gt1l/land_segments/longitude", ...]
    var_names = ["lon", ...]

    h5 = HDF5.h5open(ATL08_fn)
    df = DataFrame()

    for i = eachindex(vars)
        if isa(h5[vars[i]], HDF5.Dataset)
            df[!, var_names[i]] = h5[vars[i]][]
        else
            # this would be replaced with a FillArray
            df[!, var_names[i]] .= h5[vars[i]][]
        end
    end

If this seems like an intelligent approach then I would add support for:

  • cropping along a dimension (see the sketch after this comment)
  • and seeing if I could figure out a way to handle non-shared dimensions by assigning a reference variable to which dimensionality needs to be matched.

The remaining items would happen in post-processing.

This would essentially be an h5_to_df function... I might even move the renaming of variables to a post-processing step.
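
Cropping along a dimension could lean on HDF5.jl's hyperslab indexing, which reads only the requested range from disk (paths and range are illustrative):

    using HDF5

    h5 = h5open(ATL08_fn, "r")
    # indexing a Dataset with a range performs a hyperslab selection,
    # so only elements 1000:2000 are read from disk
    lon = h5["gt1l/land_segments/longitude"][1000:2000]
    close(h5)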

@evetion
Owner

evetion commented Aug 14, 2023

Well, the DataFrame is a pain only because of the multiple tracks that need/can be concatenated.

In this case, the selection of vars (I would handle attributes as a separate argument altogether) is just hard for the user API alone.

How do you give a name, location, and postprocess function for multiple vars easily? A Var object seems most logical?

@alex-s-gardner
Collaborator Author

alex-s-gardner commented Aug 14, 2023

Yes, I think a Var and Att object is the way to go.

For multiple tracks we could just vcat DataFrames. In my workflows I'm finding that moving to a single value per table cell is the most efficient and intuitive way to work with the data. Using FillArrays means we can do this with minimal overhead.

If we went down this road it would allow you to concat Arrow files on disk without needing to read the data into memory.

@alex-s-gardner
Collaborator Author

Maybe an object like this?

"""
    GetDataset()
a dataset request object
"""
@kwdef struct GetDataset
    dataset_path::String # e.g. "gt1l/land_segments/terrain/h_te_mean"
    new_name::Union{String,Nothing} = nothing # e.g. "height"
    apply_function::Union{Function, Nothing} = nothing # e.g. x -> x .> 0
    dimension_ranges::Union{Vector{NamedTuple},Nothing} = nothing # e.g. [(name="latitude", min=50, max=60)]
    description::Union{String,Nothing} = nothing # e.g. "Standard land-ice segment height"
    unit::Union{String,Nothing} = nothing # e.g. "m above the WGS 84 ellipsoid"
end

One would then pass a Vector{GetDataset} to an h5_to_table function and it would return a table/DataFrame.
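
Usage might then look like this (a sketch: h5_to_table is the proposed function, not something that exists yet, and ATL08_fn is a filename as in the earlier example):

    requests = [
        GetDataset(
            dataset_path = "gt1l/land_segments/terrain/h_te_mean",
            new_name = "height",
            dimension_ranges = [(name = "latitude", min = 50, max = 60)],
        ),
        GetDataset(dataset_path = "gt1l/land_segments/longitude", new_name = "lon"),
    ]
    df = h5_to_table(ATL08_fn, requests)  # would return a DataFrame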

@evetion
Owner

evetion commented Aug 16, 2023

I'm not sure what description and unit will do here?

The other design issue I see is that we either know beforehand what all the names in an HDF5 file are (ICESat, GEDI), which makes it easy to check correct keys, dimensions, and types, and to do something about repeated groups like tracks, or we go purely generic, which makes all of that much harder. Although we could also parse an HDF5 file and generate such records from it (like generating a struct based on some JSON file, which you can do with JSON3).

In terms of an interface, I wouldn't even go with h5_to_table, but index directly on a granule, or on a new HDF5Table wrapper around File?

Anyway, feel free to implement a first prototype that could work for ICESat-2 or GEDI, and let's generalize from there?

@alex-s-gardner
Collaborator Author

I think an HDF5Table wrapper around File is the way to go... it would greatly reduce the data wrangling needed to work with unique h5 structures. HDF5Table would use FillArrays to populate lower-dimensional data. I guess one restriction is that requested datasets must share at least one dimension and are assumed to be unchanging across the others.
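
A minimal sketch of such a wrapper (illustrative only; it assumes the requested datasets are 1-D along the shared dimension and fills scalar data lazily with FillArrays):

    using HDF5, DataFrames, FillArrays

    # Hypothetical wrapper: an open file plus the dataset paths to expose as columns.
    struct HDF5Table
        file::HDF5.File
        paths::Vector{String}
    end

    function DataFrames.DataFrame(t::HDF5Table)
        n = length(t.file[first(t.paths)])  # shared dimension, taken from the first dataset
        df = DataFrame()
        for p in t.paths
            x = read(t.file, p)
            # lower-dimensional (scalar) data is filled to the shared length without allocation
            df[!, last(split(p, '/'))] = x isa AbstractVector ? x : Fill(x, n)
        end
        return df
    end

One would then call something like DataFrame(HDF5Table(h5open(fn), paths)).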

@alex-s-gardner
Collaborator Author

Any operations on the data or name changes would be handled outside of HDF5Table.

@evetion evetion mentioned this issue Mar 1, 2024
@alex-s-gardner
Collaborator Author

Here's my first attempt at H5ToTable

@evetion
Owner

evetion commented Jun 23, 2024

Here's my first attempt at H5ToTable

Sorry for taking so long to come up with an answer; it's looking great. I came up with something similar in a dead branch on SpaceLiDAR (not as polished), so it seems a good direction to go in.

However, I got stuck on integrating it as a predefined structure for each product, needing to include the size/datatype and its shared dimensions(!), as well as a function that would be applied on loading (needed for the timestamp at the very least). I might've made it too complex by making it specific to ICESat-2 and GEDI, not sure.

This is functionality I would love to see in SpaceLiDAR before we submit it to JOSS.
