HTML diff should tokenize on some punctuation #6

Open
Mr0grog opened this issue Nov 21, 2018 · 3 comments
Labels
enhancement New feature or request experiment Experimental changes to a diff that need lots of testing and may or may not work out well never-stale

Comments

@Mr0grog
Member

Mr0grog commented Nov 21, 2018

This FTP diffing problem made me realize we should probably be splitting tokens in the HTML diff on periods (and maybe other punctuation?), not just on whitespace:

[Image: screenshot taken 2018-11-21 at 9:04:50 AM]

(Of course we don’t really want to use this differ on FTP listings, but that’s a different matter.)

This requires some care, though — we probably want to treat the periods as tokens themselves (in case they change), unlike whitespace. We’ve also talked about this before in terms of general punctuation handling — it would be really useful not only to split this way, but to tag and count punctuation changes separately from other changes. We might not prioritize a punctuation change for analysts to look at like we do a word change, and it would be nice to call out clearly that a change was merely in punctuation.
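To make the idea concrete, here is a minimal sketch of what splitting on punctuation and counting punctuation changes separately could look like. It is not the HTML differ's actual tokenizer: `TOKEN_PATTERN`, `tokenize`, `is_punctuation`, and `count_changes` are hypothetical names, it works on plain text rather than HTML, and it uses the standard library's `difflib` instead of the real diff algorithm.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern: a token is either a run of word characters or a
# single punctuation character. Whitespace is dropped, not tokenized.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text on whitespace and punctuation, keeping punctuation as tokens."""
    return TOKEN_PATTERN.findall(text)


def is_punctuation(token):
    """True if the token is a single non-word, non-space character."""
    return re.fullmatch(r"[^\w\s]", token) is not None


def count_changes(old_text, new_text):
    """Count changed tokens, tallying punctuation separately from words."""
    old_tokens = tokenize(old_text)
    new_tokens = tokenize(new_text)
    counts = {"word": 0, "punctuation": 0}
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Every removed and every added token counts as one change.
        for token in old_tokens[i1:i2] + new_tokens[j1:j2]:
            kind = "punctuation" if is_punctuation(token) else "word"
            counts[kind] += 1
    return counts
```

With this split, `file.txt` and `file.csv` share the `file` and `.` tokens, so only the extension registers as a word change instead of the whole filename.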

There are also punctuation changes we might want to treat specially and even suppress in many cases. For example, changing ' to ′ (apostrophe to prime) is a change we’ve seen before, and not one we generally care about.
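One possible way to suppress that kind of change is to compare tokens through a normalized key, so that look-alike characters count as equal. This is only a sketch: the `EQUIVALENT_PUNCTUATION` table and both function names are hypothetical, and which characters belong in the table is an open question.

```python
# Hypothetical table of characters to treat as interchangeable with a
# plain ASCII apostrophe when deciding whether a change matters.
EQUIVALENT_PUNCTUATION = {
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u2032": "'",  # prime
}


def comparison_key(token):
    """Map each character in the token to its canonical equivalent."""
    return "".join(EQUIVALENT_PUNCTUATION.get(ch, ch) for ch in token)


def is_suppressible_change(old_token, new_token):
    """True if the tokens differ only by characters we consider equivalent."""
    return (
        old_token != new_token
        and comparison_key(old_token) == comparison_key(new_token)
    )
```

A differ could use `comparison_key` when matching tokens and still keep the original characters for display, so a suppressed change is hidden from counts without altering the rendered text.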

@Mr0grog
Member Author

Mr0grog commented Nov 21, 2018

@Frijol
Contributor

Frijol commented Mar 5, 2019

Would this be a good good-first-issue label candidate?

@Mr0grog
Member Author

Mr0grog commented Mar 5, 2019

I wish it was, but the HTML diff is an incredibly horrifying mess, and nobody should try and screw with it unless they are ready for a lot of setbacks and a lot of WTFs. That is why it is not already marked with “help wanted.”

@Mr0grog Mr0grog transferred this issue from edgi-govdata-archiving/web-monitoring-processing Oct 26, 2020
@stale stale bot added the stale label Jun 2, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot removed the stale label Jun 4, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
@stale stale bot added the stale label Jan 8, 2022
@stale stale bot closed this as completed Apr 16, 2022
@Mr0grog Mr0grog added enhancement New feature or request never-stale and removed stale labels Apr 17, 2022
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Apr 17, 2022
@Mr0grog Mr0grog reopened this Apr 17, 2022
@Mr0grog Mr0grog added the experiment Experimental changes to a diff that need lots of testing and may or may not work out well label Apr 17, 2022