Detect Data Package version with `version()` function #262

peterdesmet · 2024-08-29T16:00:36Z

To support Data Package v2 we need to be able to detect the version used by a package.

$schema is undefined
- version = 1.0
- We theoretically should look at the profile property in this case (see backward compatibility note) but frictionless-r ignores this property since it doesn't use it (it is useful for validation etc.).
$schema = https://datapackage.org/profiles/1.0/datapackage.json
- version = 1.0
- profile is ignored (since new property $schema is used)
$schema = https://datapackage.org/profiles/2.0/datapackage.json
- version = 2.0
- profile is ignored (deprecated in 2.0)
$schema = https://datapackage.org/profiles/2.1-rc.1/datapackage.json
- version = 2.1-rc.1 (theoretical example)
- profile is ignored
$schema = https://fiscal.datapackage.org/profiles/fiscal-data-package.json, https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.1/camtrap-dp-profile.json or any other value
- version = >=2.0: we can't detect the version, but it is higher or equal to 2.0 since $schema is used.
- profile is ignored

Even if we would read profile, the end result would still be version = 1.0

profile is undefined
- profile is assumed to be data-package (see https://specs.frictionlessdata.io/profiles/#introduction)
- version = 1.0 (implied by profile use)
profile = data-package
- version = 1.0 (implied by profile use)
profile= tabular-data-package, fiscal-data-package, https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.1/camtrap-dp-profile.json or any other value
- version = 1.0 (implied by profile use)

In my opinion, the best way to implement this is with a version(package) function. This allows us in the future to create a version()<- function. Alternative names:

get_version(): this limits us in the future from having a version(). I'm tempted to rename all functions that start with get_
package_version(): the version logic is the same for package, resource, dialect, schema: it's just the name of the file in the URL that is different (datapackage.json, dataresource.json). I therefore think we can make one function for all of these, rather than four functions.

I think we can generalize to a version(list) function:

Use the logic above for any incoming list (JSON). Search for $schema and get the version from the URL if it starts with https://datapackage.org, otherwise use 1.0 (if undefined $schema) or >2.0.
Setting the version is a bit harder, since - especially for a value like https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.1/camtrap-dp-profile.json if it's a package, resource, etc. I think this can be solved with an extra argument in the set function: version(level = "schema") <- 2.0 would assign https://datapackage.org/profiles/2.0/tableschema.json

The text was updated successfully, but these errors were encountered:

peterdesmet · 2024-08-29T16:07:55Z

@PietrH @sannegovaert @damianooldoni @khusmann thoughts?

@khusmann we would be able to detect what list we have (package, resource, schema) with the API change you suggested (#252), but currently we'll have to make due with dumb lists. 😄

khusmann · 2024-08-29T16:58:36Z

Hmm, I like version() but that it would alias base::version(). I wonder if it'd make sense to namespace our methods with a prefix, similar to how forcats does? forcats uses fct_*(); we could use fls_* or something? (So fls_version(), fls_schema() etc.)

Now that you bring this up, I'm realizing that the issue of supporting 2.0 is a little more complex than I was thinking at first. Do we want frictionless-r to upgrade v1 packages to v2 when they are loaded? If packages are upgraded, then all frictionless-r methods can all expect v2 schemas & features, which keeps implementation simple.

Simultaneous support of v1 and v2 seems potentially complex. We could:

Use the same API, but then functions would sometimes need to change behavior when they were operating on a v1 vs v2 descriptor, and that might be hard to manage / maintain. It also creates potential unexpected behavior when mixing versions (reading from multiple sources working at different versions)...
Namespace our API based on frictionless version, e.g. fls1_*() vs fls2_*()
Create a new package for frictionless v2, e.g. library(frictionless2)

Option 3 would give us the most separation between versions and be easiest to support, I think. library(frictionless) would give you an API that would load & write frictionless v1, whereas library(frictionless2) would give you an API that would load and write frictionless v2. And for backwards compatibility, if you loaded a v1 package with library(frictionless2), it'd just get upgraded to v2.

Anyway, these are just some initial thoughts!

peterdesmet · 2024-08-29T20:38:11Z

As far as I'm aware, there is no base::version(), so there's no conflict. On prefixes: some tidyverse use it (like stringr or forcats) and others don't (like purrr and dplyr). I think it makes the most sense if there is an obvious two-or-three-letter abbreviation, dp or fls doesn't feel obvious to me.

v2 support

I think I want the following design principle for supporting Data Package v2:

An unchanged v1 package remains v1 and is written to disk as v1
An unchanged v1 resource remains v1, even if it is part of a v2 package. I don't think this was discussed in the Data Package WG, but I believe it is supported/designed to be like this. I quite like the idea of not touching untouched resources.
An added/replaced resource is a v2 resource and as a result, its containing package becomes v2.
An unchanged v2 package/resource stays a v2 package/resource

flowchart TD
    p1["`**package (v1)**
    resource 1 (v1)`"]
    r1["resource 1 (v1)"]
    p2["`**package (v2)**
    resource 1 (v1)
    resource 2 (v2)`"]
    read_resource("read_resource()")
    add_resource("add_resource()")

    p1 --> read_resource --> r1
    p1 --> add_resource --> p2

However, doing so doesn't give users the choice of manipulating a Data Package, but having it stay a v1 (i.e. the current frictionless behaviour). Solutions:

Have them use an older version of frictionless, which is not ideal, since updates in dependencies etc. might cause that old version to break.
Have them provide a version to add_resource() (do you want this to be an v1 or v2 resource?). If they pick v2, the package becomes a v2 as well.
Have them use a frictionless2 package. It's an interesting suggestion I hadn't considered yet.
- Pros: Very clear for the user (and developers) + less complexity within functions + easier for contributors
- Cons: more maintenance (e.g. feature support like Support reading inline data with schema #59) needs to be implemented twice or copied + doesn't really convey the message that Data Package v2 is backwards compatible with v1
- Unsure: should a frictionless2 drop design principle 1 and potentially 3 and upconvert an incoming v1 (and its resources) to v2?

It is a choice we have to make soon though, because changes like #258 don't make sense for frictionless if we opt for frictionless2

khusmann · 2024-08-29T21:30:26Z

As far as I'm aware, there is no base::version(), so there's no conflict.

Ah yes, it's not base::version(), it looks like it's just a global constant, not a function: base::version. But frictionless::version() would still mask this.

I think it makes the most sense if there is an obvious two-or-three-letter abbreviation, dp or fls doesn't feel obvious to me.

Yeah, I don't like fls either -- but if we found a good abbrv it would help avoid aliasing & also be nice for autocomplete (e.g. type dp_<tab> would list all available frictionless verbs). That said, I don't feel strongly on this; just brainstorming.

I think I want the following design principle for supporting Data Package v2:

What you lay out here is similar to what I was imagining re: one api for both versions. It's do-able, but definitely introduces significant complexity, and gives me pause about long-term maintainability / edge cases.

However, doing so doesn't give users the choice of manipulating a Data Package, but having it stay a v1 (i.e. the current frictionless behaviour).
Have them use an older version of frictionless, which is not ideal, since updates in dependencies etc. might cause that old version to break.

Exactly -- I think the ability to continue to support workflows in v1-land (without using old versions of frictionless-r) is very valuable.

Cons: more maintenance

I think the unfortunate reality is that if we want to continue the support of manipulating & writing v1 packages, we're stuck with the maintenance of v1 and v2 code-paths. The question is how we manage this complexity. Making v1 and v2 separate packages would keep these concerns separate and make it so that we don't have to think "how will this impact the v1 flow?" every time we do something in v2, and vise versa. Yes, it would duplicate some code, but in this case that's actually reducing complexity by decoupling the implementations.

doesn't really convey the message that Data Package v2 is backwards compatible with v1

I think a separate package actually gives us simpler design principles that mirror the v2 backwards compatibility rules:

library(frictionless) reads and writes v1 packages.
library(frictionless2) reads v1 and v2 packages, but only writes v2 packages. (So v1 packages are instantly converted to v2 with a warning or something when they are loaded)

So we're backwards compatible in that a v1 package is always 100% usable with both library(frictionless) and library(frictionless2), but if you want to use v2 features, you should use library(frictionless2)

I think supporting packages with mixed v1 and v2 resources would open a can of worms...

It is a choice we have to make soon though, because changes like #258 don't make sense for frictionless if we opt for frictionless2

Exactly, we're at a perfect crossroads if we like this direction: we'd freeze frictionless with the API it currently has and only update with bugfixes & v1 features. That'd then give us a clean slate & separate sandbox for frictionless2, where we are free to add features (and make API changes) without making v1 unstable.

PietrH · 2024-08-30T07:51:03Z

I'm not a big fan of creating a frictionless2 package, I think this will be confusing for new users just hearing about frictionless.

About version(), what do we expect users to do with this information? Ideally the package should just handle all that stuff and users don't need to care about what version they are reading. If it's (mostly) an internal function, I see little harm in calling it get_version() to avoid a namespace collision, even if the other functions in frictionless-r stop using verbs as a default.

peterdesmet added datapackage:v2 complexity:low Likely not complex to implement labels Aug 29, 2024

peterdesmet mentioned this issue Aug 29, 2024

Rename get_schema() (and helper functions) to schema() #263

Open

4 tasks

peterdesmet mentioned this issue Aug 30, 2024

How to convert a v1 to a v2 package #264

Open

46 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect Data Package version with `version()` function #262

Detect Data Package version with `version()` function #262

peterdesmet commented Aug 29, 2024

peterdesmet commented Aug 29, 2024

khusmann commented Aug 29, 2024

peterdesmet commented Aug 29, 2024

khusmann commented Aug 29, 2024 •

edited

Loading

PietrH commented Aug 30, 2024

Detect Data Package version with version() function #262

Detect Data Package version with version() function #262

Comments

peterdesmet commented Aug 29, 2024

peterdesmet commented Aug 29, 2024

khusmann commented Aug 29, 2024

peterdesmet commented Aug 29, 2024

v2 support

khusmann commented Aug 29, 2024 • edited Loading

PietrH commented Aug 30, 2024

Detect Data Package version with `version()` function #262

Detect Data Package version with `version()` function #262

khusmann commented Aug 29, 2024 •

edited

Loading