Unicode normalisation across apertium tools #24

Open
flammie opened this issue Dec 31, 2020 · 15 comments
Labels
enhancement New feature or request

Comments

@flammie
Member

flammie commented Dec 31, 2020

It seems to me that a good portion of Apertium IRC traffic is people checking on Unicode character variants, like:

10:43 +spectie> .u ô
10:43  begiak> U+006F LATIN SMALL LETTER O (o)
10:43  begiak> U+0302 COMBINING CIRCUMFLEX ACCENT (◌̂)
10:43 +spectie> .u ô
10:43  begiak> U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX (ô)

I think this is something the tools should take care of somehow. I'd suggest NFC normalization for all input, perhaps with a warning in compiler-type tools. NFC is the nicest form for most FSA letter automata. If agreed, this might be a good starter task for GSoC candidates?
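To make the IRC exchange concrete, here is a minimal Python sketch (not part of any Apertium tool) showing that the two spellings of ô are different code point sequences and that NFC maps the decomposed one onto the precomposed one. The helper name `describe` is made up for illustration.

```python
import unicodedata

decomposed = "o\u0302"   # U+006F followed by U+0302
precomposed = "\u00f4"   # U+00F4

def describe(text):
    """List each code point with its Unicode name, like the IRC bot does."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(describe(decomposed))
print(describe(precomposed))

# The strings render identically but do not compare equal until normalised.
print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

A letter automaton built from one spelling will not match input typed in the other, which is why normalising both the dictionary source and the runtime input matters.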

@TinoDidriksen
Member

We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.
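As a rough illustration of what "destructive" means here, the sketch below uses Python's `unicodedata` to show that full NFC replaces U+212B and that the original code point cannot be recovered afterwards. The helper `canonical_singleton` is a hypothetical name; it flags characters whose canonical decomposition is a single other code point, which is one way to find candidates for exclusion.

```python
import unicodedata

def canonical_singleton(ch):
    """Return the single code point ch canonically maps to, or None."""
    decomp = unicodedata.decomposition(ch)
    # Compatibility mappings start with a <tag>; canonical ones do not.
    if decomp and not decomp.startswith("<") and " " not in decomp:
        return chr(int(decomp, 16))
    return None

angstrom = "\u212b"
composed = unicodedata.normalize("NFC", angstrom)
print(unicodedata.name(angstrom), "->", unicodedata.name(composed))

# Decomposing again does not bring the original code point back.
roundtrip = unicodedata.normalize("NFD", composed)
print(angstrom in roundtrip)

# ô is an ordinary composition, not a singleton, so it is not flagged.
print(canonical_singleton("\u00f4"))
```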

@TinoDidriksen TinoDidriksen added the enhancement New feature or request label Dec 31, 2020
@mr-martian
Contributor

ICU provides a way to define custom normalizations. The documentation isn't terribly helpful, but it looks to me like we just need to edit https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/norm2/nfc.txt to make a more conservative NFC and then use these instructions https://unicode-org.github.io/icu/userguide/transforms/normalization/ under this license https://www.unicode.org/license.html

@flammie
Member Author

flammie commented Dec 31, 2020

> We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.

Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse, so maybe we should generate a Google Doc with the actual letters and stuff for collaborative editing?

@TinoDidriksen
Member

From what I can see, we just don't want any of the > rules. E.g. rule 212A>004B says Kelvin sign should turn into capital K.
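A minimal sketch of that filtering step, assuming the data file uses `>` for one-way mappings, `=` for round-trip mappings, and `#` for comments, as it appears to from reading it. The sample text is an illustrative excerpt, not the real file, and `split_rules` is a hypothetical helper.

```python
SAMPLE = """\
# Illustrative excerpt only, not the real file
00C5=0041 030A
00F4=006F 0302
212A>004B
212B>00C5
"""

def split_rules(text):
    """Separate one-way '>' rules from everything else."""
    kept, one_way = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        (one_way if ">" in line else kept).append(line)
    return kept, one_way

kept, dropped = split_rules(SAMPLE)
print("kept:   ", kept)
print("dropped:", dropped)
```

Writing the kept lines back out would give a candidate input for a more conservative custom normalizer. Whether every `>` rule should really go is still something to check against the full list.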

@TinoDidriksen
Member

A quick'n'dirty shortcut would be to use a transformation that only hits grapheme clusters with combining marks. For example:
echo -n 'ôôÅÅ' | uconv -x '([:^Nonspacing Mark:] [:Nonspacing Mark:]+) > &NFC($1)' | uconv -x any-name
yields
\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{ANGSTROM SIGN}\N{LATIN CAPITAL LETTER A WITH RING ABOVE}

It turns the decomposed sequence (U+006F U+0302) into ô (U+00F4), but doesn't touch the Ångström sign (U+212B).

However, it would touch the Ångström sign if it had any combining marks after it. I posit that is so rare we don't have to worry.
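For anyone without `uconv` at hand, the same idea can be approximated in Python. This is only a sketch of the shortcut, not a drop-in equivalent of the ICU transform: it treats category `Mn` as "Nonspacing Mark" and applies NFC only to a base character that is followed by at least one such mark. The function name `nfc_marked_clusters` is made up.

```python
import unicodedata

def is_mark(ch):
    return unicodedata.category(ch) == "Mn"

def nfc_marked_clusters(text):
    """Apply NFC only to a non-mark character followed by 1+ nonspacing marks."""
    out = []
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        if not is_mark(text[i]):
            while j < n and is_mark(text[j]):
                j += 1
        chunk = text[i:j]
        # Bare characters (no marks attached) are passed through untouched.
        out.append(unicodedata.normalize("NFC", chunk) if j - i > 1 else chunk)
        i = j
    return "".join(out)

sample = "o\u0302\u00f4\u212b\u00c5"
print([unicodedata.name(c) for c in nfc_marked_clusters(sample)])

# The caveat from the comment above: a mark after U+212B triggers NFC on it.
print(nfc_marked_clusters("\u212b\u0301") == "\u212b\u0301")
```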

@mr-martian
Contributor

> Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse, so maybe we should generate a Google Doc with the actual letters and stuff for collaborative editing?

https://gist.github.com/mr-martian/80d99c2ca29a36ac483cca84bbc4ec3a

Not quite collaborative editing, but hopefully at least a bit more readable

@mr-martian
Contributor

https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c

And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

@flammie
Member Author

flammie commented Feb 12, 2021

> https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c
>
> And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

Hmm, this all looks OK to me, though I have no good knowledge of most scripts in the list. As far as I can see, it doesn't have anything more problematic than Å for the Ångström sign and K for the Kelvin sign, at least for Latin / generic?

@unhammer
Member

Should this be a step that apertium/apy runs before the pipeline, or something done within morphological analysis? (My first thought is that it seems easier and cleaner to do it before analysis.)

@mr-martian
Contributor

I would expect it to be in conjunction with format handling (either before or after, not sure which).

@xavivars
Member

Should deformatting take care of this? Or are you thinking of something in between deformatting and analysis?

@mr-martian
Contributor

Inserting a normalizer between deformatting and analysis would handle it without requiring every deformatter to be updated. It also deals with the issue (which I guess was discussed on IRC rather than here) that sooner or later someone might care about normalized vs. not and want to turn it off.

@unhammer
Member

Having it after deformatting would mean it could run on only the translated parts of the text, and not touch formatting (so that when Word2022 exports an html page with combining chars in its class names it will still look as ugly as intended)
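A rough sketch of what such a post-deformatter step could look like, assuming the deformatter wraps formatting in square-bracket superblanks and escapes literal brackets with a backslash. The function name `normalise_outside_superblanks` is hypothetical, and plain NFC stands in for whatever conservative normalization is eventually chosen.

```python
import re
import unicodedata

# group 1: a [superblank], group 2: a backslash escape, group 3: plain text
TOKEN = re.compile(r"(\[(?:\\.|[^\]\\])*\])|(\\.)|([^\[\\]+)", re.DOTALL)

def normalise_outside_superblanks(stream, form="NFC"):
    """Normalise running text but leave superblanks and escapes untouched."""
    def fix(match):
        text = match.group(3)
        if text is None:
            return match.group(0)  # formatting or escape: keep as-is
        return unicodedata.normalize(form, text)
    return TOKEN.sub(fix, stream)

stream = '[<b class="o\u0302">]o\u0302 \\[x[</b>]'
result = normalise_outside_superblanks(stream)
print(result == stream)  # False: only the text between the superblanks changed
```

Because this is a separate filter in the pipeline, turning normalization off is just a matter of leaving the step out.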

@TinoDidriksen
Member

@TinoDidriksen
Member

And here's a helper script I have for a similar task: https://gist.github.com/TinoDidriksen/aa6b8047e26fb6876b4b9f90c51988f3
