Add a heuristic for determining the charset #706

Open
oliverklee opened this issue Sep 5, 2024 · 3 comments

@oliverklee
Contributor

From #688 (comment):

> In essence: have some heuristic to determine the input encoding (BOM, @charset, try a few common charsets and pick the first one that doesn’t produce errors), then convert to UTF-8 and, from that point on, all the tokens of interest to us will be ASCII-only and can be parsed using regular string functions.
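
A minimal sketch of the heuristic described above, written in Python purely for illustration (it is not code from this project). The function name `decode_css`, the candidate list `_CANDIDATES`, and the final lossy fallback are assumptions made for the example, not something decided in this thread.

```python
import codecs
import re

# BOMs, longest first, so a UTF-32 LE BOM is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# An @charset rule only counts when it is the very first thing in the file.
_CHARSET_RULE = re.compile(rb'^@charset "([\x20\x21\x23-\x7e]+)";')

# Hypothetical list of "common charsets" to try, in order.
_CANDIDATES = ("utf-8", "cp1252")


def decode_css(raw: bytes) -> str:
    """Guess the encoding of `raw` and return the decoded text."""
    # 1. A byte order mark is the strongest hint.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(name, errors="replace")

    # 2. An explicit @charset rule at the start of the file.
    match = _CHARSET_RULE.match(raw)
    if match:
        name = match.group(1).decode("ascii")
        # The rule was readable as ASCII bytes, so a UTF-16 label cannot be
        # right; browsers use UTF-8 in that case.
        if name.lower().startswith("utf-16"):
            name = "utf-8"
        try:
            return raw.decode(name)
        except (LookupError, UnicodeError):
            pass  # Unknown label or wrong label: keep guessing.

    # 3. Try a few common charsets; the first without errors wins.
    for name in _CANDIDATES:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue

    # 4. Assumed last resort: lossy UTF-8 decoding.
    return raw.decode("utf-8", errors="replace")
```

Once the text is decoded, re-encoding it with `.encode("utf-8")` gives input in which every token of interest is ASCII-only, as the quote describes.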

oliverklee changed the title from "Add a heuristic for determining the charset" to "Work out a heuristic for determining the charset" on Sep 5, 2024
@oliverklee
Contributor Author

We can follow what browsers do: https://developer.mozilla.org/en-US/docs/Web/CSS/@charset

oliverklee changed the title from "Work out a heuristic for determining the charset" to "Add a heuristic for determining the charset" on Sep 5, 2024
@sabberworm
Contributor

sabberworm commented Sep 5, 2024

> We can follow what browsers do: https://developer.mozilla.org/en-US/docs/Web/CSS/@charset

Yes, good idea. Though browsers have a Content-Type header that may include a charset= specifier that we don’t have (as well as the resolved charset of the referring document). But we can definitely follow what browsers do absent charset=.

@JakeQZ
Contributor

JakeQZ commented Sep 5, 2024

> Though browsers have a Content-Type header that may include a charset= specifier that we don’t have (as well as the resolved charset of the referring document).

We can use the value provided to Settings::withDefaultCharset in its place.
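
To make the precedence concrete, here is a rough Python sketch (illustrative only, not this project's code) of the order browsers use, with a caller-supplied default standing in for the Content-Type charset. The function `pick_charset` and its `default_charset` parameter are hypothetical names. It assumes the default is only passed when the caller set it explicitly, since a default that is always present would prevent @charset from ever being consulted.

```python
import codecs
import re
from typing import Optional

_CHARSET_RULE = re.compile(rb'^@charset "([\x20\x21\x23-\x7e]+)";')


def pick_charset(raw: bytes, default_charset: Optional[str] = None) -> str:
    """Return the label of the encoding that `raw` should be decoded with."""
    # 1. A BOM overrides every other signal.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"

    # 2. The caller-supplied charset takes the place of the
    #    Content-Type charset= value (assumption: None means "not set").
    if default_charset:
        return default_charset

    # 3. An @charset rule at the very start of the file.
    match = _CHARSET_RULE.match(raw)
    if match:
        return match.group(1).decode("ascii")

    # 4. Nothing else to go on: UTF-8.
    return "utf-8"
```

The charset of a referring document, which browsers consult between steps 3 and 4, has no counterpart here and is left out.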
