Possibly wrong regex patterns? #163

Open
RadhiFadlillah opened this issue Sep 16, 2024 · 1 comment

RadhiFadlillah commented Sep 16, 2024

Hi @adbar. Thanks for this awesome library. While working on my port I noticed a strange issue with several regexes that don't give me the expected output.

First is COPYRIGHT_PATTERN, which is currently defined like this:

COPYRIGHT_PATTERN = re.compile(
    rf"(?:©|\&copy;|Copyright|\(c\))\D*(?:{YEAR_RE}-)?({YEAR_RE})\D"
)

Given the following string:

<p>&copy; Copyright 1999-2020 Asia Pacific Star. All rights reserved.</p>

I expected COPYRIGHT_PATTERN to capture 2020, but right now it captures 1999, as can be seen here. What do you think? Is that the expected behavior?
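
To reproduce this locally, here is a minimal sketch. The YEAR_RE value is an assumption: a bare alternation standing in for the library's definition, which is not quoted in this issue. If the real definition is also an ungrouped alternation, the dash inside the optional group binds only to the last alternative, which would explain the result.

```python
import re

# Assumption: stand-in for the library's YEAR_RE (not quoted in this issue).
# What matters for the behavior is that it is a bare, ungrouped alternation.
YEAR_RE = r"199[0-9]|20[0-3][0-9]"

# Pattern as reported, with the entity alternative written as \&copy;.
COPYRIGHT_PATTERN = re.compile(
    rf"(?:©|\&copy;|Copyright|\(c\))\D*(?:{YEAR_RE}-)?({YEAR_RE})\D"
)

text = "<p>&copy; Copyright 1999-2020 Asia Pacific Star. All rights reserved.</p>"

# After interpolation the optional group reads (?:199[0-9]|20[0-3][0-9]-)?,
# so "1999-" can never be consumed as a whole by that group.
match = COPYRIGHT_PATTERN.search(text)
captured = match.group(1) if match else None
print(captured)  # 1999 under the assumed YEAR_RE
```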


Next is TIMESTAMP_PATTERN, which is currently defined like this:

TIMESTAMP_PATTERN = re.compile(
    rf"({YEAR_RE}-{MONTH_RE}-{DAY_RE}).[0-9]{{2}}:[0-9]{{2}}:[0-9]{{2}}"
)

Given the following string:

1991-03-21T00:00:00
1995-07-23T00:00:00
2020-05-10T18:59:01
2021-02-15T11:29:21
2023-01-10T22:18:38

TIMESTAMP_PATTERN only captures the last three and ignores the first two, as can be seen here. I believe this is a mistake, since 1991-03-21T00:00:00 and 1995-07-23T00:00:00 are valid timestamps as well.
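
A minimal sketch that reproduces this result follows. YEAR_RE, MONTH_RE and DAY_RE are assumptions chosen for illustration, since the library's definitions are not quoted in this issue. With a bare alternation in YEAR_RE, the capturing group splits into two branches: "199x" on its own, or "20xx" followed by the month and day.

```python
import re

# Assumptions: stand-ins for the library's definitions (not quoted in this issue).
YEAR_RE = r"199[0-9]|20[0-3][0-9]"
MONTH_RE = r"[0-1][0-9]"
DAY_RE = r"[0-3][0-9]"

# Pattern as reported.
TIMESTAMP_PATTERN = re.compile(
    rf"({YEAR_RE}-{MONTH_RE}-{DAY_RE}).[0-9]{{2}}:[0-9]{{2}}:[0-9]{{2}}"
)

text = """1991-03-21T00:00:00
1995-07-23T00:00:00
2020-05-10T18:59:01
2021-02-15T11:29:21
2023-01-10T22:18:38
"""

# The group expands to (199[0-9]|20[0-3][0-9]-[0-1][0-9]-[0-3][0-9]),
# so a 199x year is never followed by the month and day inside the group.
found = TIMESTAMP_PATTERN.findall(text)
print(found)  # ['2020-05-10', '2021-02-15', '2023-01-10'] under these assumptions
```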

Thanks!

RadhiFadlillah changed the title from "Possibly wrong copyright pattern?" to "Possibly wrong regex patterns?" on Sep 16, 2024

RadhiFadlillah commented

I think COPYRIGHT_PATTERN can be fixed by moving the dash out of the optional non-capturing group, as shown here:

COPYRIGHT_PATTERN = re.compile(
-    rf"(?:©|\&copy;|Copyright|\(c\))\D*(?:{YEAR_RE}-)?({YEAR_RE})\D"
+    rf"(?:©|\&copy;|Copyright|\(c\))\D*(?:{YEAR_RE})?-?({YEAR_RE})\D"
)
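
A quick check of the proposed pattern, again as a sketch: YEAR_RE is an assumed stand-in for the library's definition, PROPOSED and last_year are hypothetical names, and the single-year input is a made-up string added to confirm that plain notices still match.

```python
import re

# Assumption: stand-in for the library's YEAR_RE (not quoted in this issue).
YEAR_RE = r"199[0-9]|20[0-3][0-9]"

# Proposed pattern from the diff above: the first year and the dash
# are each optional on their own.
PROPOSED = re.compile(
    rf"(?:©|\&copy;|Copyright|\(c\))\D*(?:{YEAR_RE})?-?({YEAR_RE})\D"
)

def last_year(text):
    """Return the captured year, or None if the pattern does not match."""
    match = PROPOSED.search(text)
    return match.group(1) if match else None

range_result = last_year(
    "<p>&copy; Copyright 1999-2020 Asia Pacific Star. All rights reserved.</p>"
)
# Hypothetical single-year input, not taken from the report.
single_result = last_year("<p>Copyright 2020 Asia Pacific Star.</p>")
print(range_result, single_result)  # 2020 2020 under the assumed YEAR_RE
```

Wrapping the first year in its own group, as in `(?:(?:{YEAR_RE})-)?`, would be another way to get the same result for the range case while keeping the dash tied to the first year.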

For TIMESTAMP_PATTERN, I think it can be fixed by wrapping YEAR_RE, MONTH_RE and DAY_RE in non-capturing groups, as shown here:

TIMESTAMP_PATTERN = re.compile(
-    rf"({YEAR_RE}-{MONTH_RE}-{DAY_RE}).[0-9]{{2}}:[0-9]{{2}}:[0-9]{{2}}"
+    rf"((?:{YEAR_RE})-(?:{MONTH_RE})-(?:{DAY_RE})).[0-9]{{2}}:[0-9]{{2}}:[0-9]{{2}}"
)
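
The same five timestamps can be run through the proposed pattern to check it. This is only a sketch: YEAR_RE, MONTH_RE and DAY_RE are assumed stand-ins for the library's definitions, and PROPOSED is a hypothetical name.

```python
import re

# Assumptions: stand-ins for the library's definitions (not quoted in this issue).
YEAR_RE = r"199[0-9]|20[0-3][0-9]"
MONTH_RE = r"[0-1][0-9]"
DAY_RE = r"[0-3][0-9]"

# Proposed pattern from the diff above: each fragment is wrapped in (?:...),
# so any alternation inside a fragment stays local to that fragment.
PROPOSED = re.compile(
    rf"((?:{YEAR_RE})-(?:{MONTH_RE})-(?:{DAY_RE})).[0-9]{{2}}:[0-9]{{2}}:[0-9]{{2}}"
)

text = """1991-03-21T00:00:00
1995-07-23T00:00:00
2020-05-10T18:59:01
2021-02-15T11:29:21
2023-01-10T22:18:38
"""

# The inner groups are non-capturing, so findall returns one date per match.
found = PROPOSED.findall(text)
print(found)
```

Under these assumptions all five dates are returned, including the two from the 1990s.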
