Separate comparisons vs. normalizations in URL rules for diffs #22

Open
Mr0grog opened this issue Oct 22, 2019 · 0 comments
Labels
enhancement New feature or request never-stale

Mr0grog commented Oct 22, 2019

In edgi-govdata-archiving/web-monitoring-processing#401, we added customized URL comparisons for the links and images in our diffs that help resolve issues around things like archive-relative URLs and session IDs or other transient information in URLs.

Unfortunately, we mixed two concepts together there:

  1. URLs that need special comparisons — i.e. you need to know both sides of the comparison in order to do it. The Wayback comparison is like this — it only kicks in if both URLs are Wayback URLs.

  2. URLs that need to be compared according to a canonical or normalized form — i.e. you only need to know one side to do the comparison. The servlet session IDs are like this — we just want to remove the session ID from the URL, regardless of what’s in the URL we are comparing to.
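To make the distinction concrete, here is a rough sketch of what the two kinds of rule might look like as plain functions. The names (`strip_session_id`, `compare_wayback`) and the regular expressions are made up for illustration and are not the existing implementation; the point is only the difference in signatures.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only, not the project's real rules.
_JSESSIONID = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$")


def strip_session_id(url: str) -> str:
    """Normalization rule: only needs one URL, returns its canonical form."""
    return _JSESSIONID.sub("", url)


def compare_wayback(a: str, b: str) -> Optional[bool]:
    """Comparison rule: needs both URLs.

    Returns None when the rule does not apply (i.e. unless both URLs are
    Wayback URLs), otherwise whether the archived targets are the same.
    """
    match_a = _WAYBACK.match(a)
    match_b = _WAYBACK.match(b)
    if match_a is None or match_b is None:
        return None
    return match_a.group(1) == match_b.group(1)
```

A normalization is `str -> str`, so any number of them can be chained. A comparison takes both sides and has to be able to say "not applicable", which is why two of them don't compose in the same way.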

The difference is important because two rules of the first type can’t be combined, but two rules of the second type can, and the second type can also be combined with the first. E.g. you wouldn’t compare the following two URLs as the same:

https://web.archive.org/web/20180101000000/https://www.example.gov/
https://www.webarchive.org.uk/wayback/en/archive/20190525141538/https://www.example.gov/

But you would want to combine the session ID rule with the Wayback rule so these are the same:

https://web.archive.org/web/20180101000000/https://www.example.gov/;jsessionid=123
https://web.archive.org/web/20190525141538/https://www.example.gov/;jsessionid=987

I don’t think this needs to change the public API (setting the list of rules to use via the url_rules parameter), but under the hood, we should be able to treat these differently. As a bonus, normalized URLs are something we can cache to speed up comparisons a bit.
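One possible shape for the internals, as a sketch only: run every normalization first (cached per URL), then try the comparisons on the normalized pair. Everything here is hypothetical, including the names `normalize` and `urls_match`, the regular expressions, the cache size, and the choice to let the first applicable comparison decide. It is not how `url_rules` is currently wired up.

```python
import re
from functools import lru_cache
from typing import Optional

# Hypothetical patterns for illustration only, not the project's real rules.
_JSESSIONID = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$")


def strip_session_id(url: str) -> str:
    """Normalization: drop a servlet session ID, whatever the other URL is."""
    return _JSESSIONID.sub("", url)


def compare_wayback(a: str, b: str) -> Optional[bool]:
    """Comparison: only applies when both URLs are Wayback URLs."""
    match_a = _WAYBACK.match(a)
    match_b = _WAYBACK.match(b)
    if match_a is None or match_b is None:
        return None
    return match_a.group(1) == match_b.group(1)


# Assumed internal split of the rules a caller selected.
NORMALIZERS = (strip_session_id,)
COMPARATORS = (compare_wayback,)


@lru_cache(maxsize=4096)
def normalize(url: str) -> str:
    """Apply every normalization in order; cached because it depends on one URL."""
    for rule in NORMALIZERS:
        url = rule(url)
    return url


def urls_match(a: str, b: str) -> bool:
    """Normalize both sides, then let the first applicable comparison decide."""
    a, b = normalize(a), normalize(b)
    for compare in COMPARATORS:
        result = compare(a, b)
        if result is not None:
            return result
    return a == b


print(urls_match(
    "https://web.archive.org/web/20180101000000/https://www.example.gov/;jsessionid=123",
    "https://web.archive.org/web/20190525141538/https://www.example.gov/;jsessionid=987",
))  # True
print(urls_match(
    "https://web.archive.org/web/20180101000000/https://www.example.gov/",
    "https://www.webarchive.org.uk/wayback/en/archive/20190525141538/https://www.example.gov/",
))  # False
```

Because `normalize` only ever sees one URL, its results can be memoized across every comparison in a diff, while the comparisons still run per pair.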

@stale stale bot closed this as completed Apr 26, 2020
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jul 1, 2020
@Mr0grog Mr0grog reopened this Jul 1, 2020
@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@stale stale bot closed this as completed Jun 16, 2021
@Mr0grog Mr0grog reopened this Jun 18, 2021
@stale stale bot removed the stale label Jun 18, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 18, 2021
@stale stale bot added the stale label Jan 8, 2022
@stale stale bot closed this as completed Apr 16, 2022
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022
@Mr0grog Mr0grog reopened this Apr 17, 2022
@stale stale bot removed the stale label Apr 17, 2022
@Mr0grog Mr0grog added enhancement New feature or request never-stale labels Apr 17, 2022