Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the meaning of "match" between sdist and wheel #1594

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

konstin
Copy link
Contributor

@konstin konstin commented Aug 25, 2024

In the current version of the description of Dynamic in the core metadata, it is unclear whether "match" means string equivalence or semantic equivalence. Given that pypa/packaging is reordering version specifiers between sdist and wheel when used by a build backend (which to my best knowledge are all build backend written in python), we can clarify this to semantic equivalence.

image


📚 Documentation preview 📚: https://python-packaging-user-guide--1594.org.readthedocs.build/en/1594/

@konstin konstin changed the title Clarity the meaning of "match" between sdist and wheel Clarify the meaning of "match" between sdist and wheel Aug 25, 2024
In the current version of the description of `Dynamic` in the core metadata, it is unclear whether "match" means string equivalence or semantic equivalence. Given that pypa/packaging is reordering version specifiers between sdist and wheel when used by a build backend (which to my best knowledge are all build backend written in python), we can clarify this to semantic equivalence.
@sinoroc
Copy link
Contributor

sinoroc commented Aug 25, 2024

@konstin, this seems like a reasonable clarification to me but I don't know what is the process when making edits to a specification. I think maybe it should be discussed in the "Packaging" category of discuss.python.org first.

@konstin
Copy link
Contributor Author

konstin commented Aug 26, 2024

This is a clarification on what is already happening in practice (like e.g. #1550), since packaging is already reordering specifiers. I will of course open a discourse thread if it's required, but i'd like to avoid the noise if it isn't.

@brettcannon
Copy link
Member

It's technically up to @pfmoore whether we need to have a discussion or not.

@pfmoore
Copy link
Member

pfmoore commented Aug 26, 2024

For the example of specifiers I agree1 that this is the intended meaning, and the proposed change is just a textual clarification, which can be done by a PR.

For the Name field, I'm not as happy, though. I think it's correct to say that the values must be semantically equivalent (i.e., normalise to the same value) but I don't want to give the impression that this means it's OK for build backends to change the name - the Name field is used unnormalised as the "official" (display) name for a project, and it's absolutely not OK to arbitrarily normalise that name when building a wheel.

So while the PR as it stands is OK, I'd like it to add a note that makes it clear that the equivalence rule does not imply permission to alter user-supplied values in a way that could be visible to the user (with the project name given as an explicit example of this).

Footnotes

  1. as the PEP author

@konstin
Copy link
Contributor Author

konstin commented Aug 26, 2024

That's a really interesting perspective, cause the inverse is how we initially got to recognizing this as a problem: In uv, we were preserving the order of specifiers exactly because we didn't want to alter user provided values. We were then seeing differences in the lockfile (which serializes the requirements of workspace packages so we can check lockfile freshness) when this information was read directly from pyproject.toml vs. through PEP 517.

In uv, we are also currently also normalizing all package and extra names (though we currently don't have a build backend), since we found that these were often inconsistent, even for a single wheel or project. In the resolver, we couldn't determine a source for a canonical name, so we chose the normalized name instead. Effectively, our implementation is assuming the inverse: You can normalize the package name, but we were treating the specifier order as user input. For context, i have a PR up normalizing the specifier order (astral-sh/uv#6333), and my intent is to upstream this concept.

If we want to preserve certain characteristic (e.g. the spelling of the project name), should we annotate this for the applicable core metadata fields?

@pfmoore
Copy link
Member

pfmoore commented Aug 26, 2024

OK, I've thought some more about this and changed my mind. There appears to be a lot more subtlety here than I originally thought, and I'm not comfortable with simply changing "match" to "semantically equivalent". I'm happy for now to explore the issue further here, rather than going to a full community debate and possibly a PEP, but I'm not ruling that out at this point.

In uv, we are also currently also normalizing all package and extra names

First of all, what is uv doing modifying project metadata? As far as I'm aware, uv doesn't build wheels itself - as you say, it's not a build backend1. In fact, I'd go further than that - why is uv even impacted by this question? The Dynamic metadata field was introduced purely to allow tools that consume sdists to avoid doing a build step if all they need is metadata that's not marked as dynamic in the sdist. Nobody should be comparing values from the sdist and the wheel - the whole point is that you don't need to do that.

If I had to clarify "match", I'd probably say that it means "tools consuming the data can use data from either source without fear of errors, and tools that produce the data must write the data so that the previous assumption holds". But I'm not sure that's better than simply saying "match" - it's just longer.

You can normalize the package name, but we were treating the specifier order as user input.

Of course you can normalise the package name for the purposes of the resolver, but you also need to retain the unnormalised name, because that's what you have to use when displaying the package name to the user. If you're displaying normalised names to the user, and you haven't received bug reports yet, let me find the users who complained at pip for doing precisely that, and ask them to express their unhappiness to you as well 🙂

If we want to preserve certain characteristic (e.g. the spelling of the project name), should we annotate this for the applicable core metadata fields?

Personally, I think that might cause more disruption than it warrants. This is the first time I've seen anyone suggest that it's an issue, and I'd like to understand why it's causing you a problem before we start trying to "solve" it.

Footnotes

  1. I'll ignore the implications of "currently" - we can argue the correctness of any future uv build backend independently of this discussion.

@brettcannon
Copy link
Member

Given that pypa/packaging is reordering version specifiers between sdist and wheel when used by a build backend

I just realized there's a slight misunderstanding here: packaging isn't causing build back-ends to use it to normalize or re-order anything. I think what's happening is build back-ends are using packaging to validate pyproject.toml or PKG-INFO values and then they are blindly taking the str of the various objects and writing it out to METADATA. I.e. packaging itself isn't directly causing this as we actually don't have an API to take in core metadata and then write it back out; you can only consume metadata in packaging.

If we want to preserve certain characteristic (e.g. the spelling of the project name), should we annotate this for the applicable core metadata fields?

I've seen projects store the original name as the "display name" and/or the normalized name as, well, the "normalized name" if they wanted to avoid normalizing at file input time.

why is uv even impacted by this question?

I'm going to guess it's about what gets written out to uv.lock somehow. I bet they are getting metadata from sdists for the lock file and that would sometimes require building it. That would put them in a position to be comparing PKG-INFO to METADATA for any field, not just Dynamic.

@konstin
Copy link
Contributor Author

konstin commented Aug 27, 2024

Of course you can normalise the package name for the purposes of the resolver, but you also need to retain the unnormalised name, because that's what you have to use when displaying the package name to the user. If you're displaying normalised names to the user, and you haven't received bug reports yet, let me find the users who complained at pip for doing precisely that, and ask them to express their unhappiness to you as well 🙂

Please share! We haven't received any reports, but we always try to learn from other projects' experiences.

So far, we're showing the name in all sorts of messages and serializing the name into the lockfile (which needs to be user-readable, for auditing and PR reviews). Relatively early in development we switched to normalized names after seeing that the spelling of the name in the wheel filename, in {name}-{version}.dist-info, in METADATA, on pypi and in their own documentation often mismatched. If the METADATA one should be the canonical one, we'll switch to that when adding a uv build backend.

For reference, here's an excerpt from a uv.lock. albatross is a workspace member, anyio is an external dependency:

[[package]]
name = "albatross"
version = "0.1.0"
source = { editable = "." }
dependencies = [
    { name = "bird-feeder" },
    { name = "tqdm" },
]

[package.metadata]
requires-dist = [
    { name = "bird-feeder", editable = "packages/bird-feeder" },
    { name = "tqdm", specifier = ">=4,<5" },
]

[[package]]
name = "anyio"
version = "4.4.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "idna" },
    { name = "sniffio" },
]
sdist = { url = "https://files.pythonhosted.org/packages/e6/e3/c4c8d473d6780ef1853d630d581f70d655b4f8d7553c6997958c283039a2/anyio-4.4.0.tar.gz", hash = "sha256:5aadc6a1bbb7cdb0bede386cac5e2940f5e2ff3aa20277e991cf028e0585ce94", size = 163930 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/7b/a2/10639a79341f6c019dedc95bd48a4928eed9f1d1197f4c04f546fc7ae0ff/anyio-4.4.0-py3-none-any.whl", hash = "sha256:c1b2d8f46a8a812513012e1107cb0e68c17159a7a594208005a57dc776e1bdc7", size = 86780 },
]

@pfmoore
Copy link
Member

pfmoore commented Aug 27, 2024

Please share! We haven't received any reports, but we always try to learn from other projects' experiences.

I'll be honest, I'm not sure I can. We've been using the project name unnormalised for years now, and have treated cases where we've accidentally displayed normalised names as bugs for as long, so any reports would be from a long time ago. I did a search on the tracker, and couldn't find anything specific, but it's difficult to think of a good search term. So I may well be misremembering the history here.

There's a lot of tangential discussion of issues related to name normalisation on Discourse. One that I found is https://discuss.python.org/t/change-in-pypi-upload-behavior-intentional-accidental-pebkac/27707. It's mostly about normalising the project name in wheel filenames and .disti-info directory names, but it does contain a lot of comments suggesting that people aren't willing to just switch to normalised names everywhere. How much of that is because people prefer a different display name, and how much is because of the churn it would cause, I can't say.

It's possible that I'm either mistaken in how much importance I'm putting on this, or flat-out spreading FUD based on what I recall of old discussions. So feel free to ignore me on this point.

None of this is relevant to this issue, though. I'd still like to understand why the precise wording matters to you. I clarified my intended meaning of "match" above:

If I had to clarify "match", I'd probably say that it means "tools consuming the data can use data from either source without fear of errors, and tools that produce the data must write the data so that the previous assumption holds". But I'm not sure that's better than simply saying "match" - it's just longer.

and that remains my position.

If this is about building a backend, then I'm not the person to advise you as I have no experience with that. I'd suggest asking other backend developers what their view is, as no-one else seems to have hit an issue with the wording as it stands. If it turns out we do need a more precise wording, I'd like to see what's being proposed before saying whether it needs a PEP - but the change in this PR to say "semantically equivalent" is no better in practice than the current form, because we don't have a precise definition of what "semantic equivalence" means (both in terms of the complexities of specifier equivalence, and in terms of the question of whether the display form of a project name counts as "semantically significant").

@konstin
Copy link
Contributor Author

konstin commented Aug 27, 2024

My intention with this change was to clarify that the strings don't need to be identical between pyproject.toml and METADATA, which i didn't find clear in the current version, hoping it saves the next person from hitting the same problems i had.

tools consuming the data can use data from either source without fear of errors, and tools that produce the data must write the data so that the previous assumption holds

We can totally use this as clarification for what "match" means. For a tool that does this kind of transformation, it's important to know what kind of changes are safe to do without fear of errors, and which changes could lead to an error. I think we agree that let's say removing whitespace in version specifiers (like >= 3 to >=3) is safe, but is replacing ~= 2.2 with >= 2.2, == 2.*? (Explanatory examples, there's no problem path rn to discuss). Independent of the exact wording in the spec, as a consumer, an output-writer and also future producer of metadata, and especially one that can't just depend on packaging, we have to take care of these kinds of normalizations and transformations, and we're trying to align around standards and community norms to maximize interoperability.

@pfmoore
Copy link
Member

pfmoore commented Aug 27, 2024

My intention with this change was to clarify that the strings don't need to be identical between pyproject.toml and METADATA

But that's got nothing to do with the core metadata Dynamic field, which was introduced by PEP 643. What controls the process of writing METADATA from pyproject.toml is the project.dynamic field introduced in PEP 621 and documented here. The key statement here is

Static metadata is specified in the pyproject.toml file directly and cannot be specified or changed by a tool (this includes data referred to by the metadata, e.g. the contents of files referenced by the metadata).

I think this is pretty explicit that the data provided by the user must be copied unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants