Unicode normalisation across apertium tools #24

flammie · 2020-12-31T17:27:56Z

It seems to me that a good portion of Apertium IRC traffic is people checking on Unicode character variants, like:

10:43 +spectie> .u ô
10:43  begiak> U+006F LATIN SMALL LETTER O (o)
10:43  begiak> U+0302 COMBINING CIRCUMFLEX ACCENT (◌̂)
10:43 +spectie> .u ô
10:43  begiak> U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX (ô)

I think this is something that the tools should take care of somehow. I'd suggest NFC normalization for all input, perhaps with a warning in compiler-type tools. NFC is the nicest form for most FSA letter automata. If agreed, this might be a good starter task for GSoC candidates?
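As a rough illustration of the "warning in compiler-type tools" idea, here is a minimal sketch using Python's standard `unicodedata` module. The function name `nfc_warnings` and the sample entries are hypothetical and not part of any Apertium tool; a real compiler would hook this into its own entry parsing.

```python
import unicodedata

def nfc_warnings(entries):
    """Return (index, entry, nfc_form) for every entry that is not already in NFC."""
    problems = []
    for index, entry in enumerate(entries):
        composed = unicodedata.normalize("NFC", entry)
        if composed != entry:
            problems.append((index, entry, composed))
    return problems

# Hypothetical dictionary entries: the first spells ô as o + U+0302,
# the second uses the precomposed U+00F4.
entries = ["ho\u0302tel", "h\u00f4tel"]
for index, entry, composed in nfc_warnings(entries):
    print(f"warning: entry {index} is not NFC: {entry!r} -> {composed!r}")
```

Only the first entry triggers a warning, since the second is already composed.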


TinoDidriksen · 2020-12-31T17:56:29Z

We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.

mr-martian · 2020-12-31T18:17:15Z

ICU provides a way to define custom normalizations. The documentation isn't terribly helpful, but it looks to me like we just need to edit https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/norm2/nfc.txt to make a more conservative NFC and then follow the instructions at https://unicode-org.github.io/icu/userguide/transforms/normalization/ under the license at https://www.unicode.org/license.html

flammie · 2020-12-31T19:17:46Z

> We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.

Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse; maybe we should generate a Google doc with the actual letters and so on for collaborative editing?

TinoDidriksen · 2020-12-31T19:47:02Z

From what I can see, we just don't want any of the > rules. E.g. rule 212A>004B says Kelvin sign should turn into capital K.
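To make the Kelvin example concrete, here is a small sketch with Python's `unicodedata` that flags characters whose canonical decomposition is a single other code point, which is the kind of one-way mapping shown above. The helper name `is_singleton` is made up for illustration, and this only approximates the `>` rules in ICU's nfc.txt, which may cover more than singletons.

```python
import unicodedata

def is_singleton(ch):
    """True if ch canonically decomposes to exactly one other code point."""
    decomposition = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; skip those.
    if not decomposition or decomposition.startswith("<"):
        return False
    return len(decomposition.split()) == 1

# KELVIN SIGN, ANGSTROM SIGN, and precomposed ô for comparison.
for ch in ["\u212a", "\u212b", "\u00f4"]:
    composed = unicodedata.normalize("NFC", ch)
    print(f"U+{ord(ch):04X} singleton={is_singleton(ch)} NFC=U+{ord(composed):04X}")
```

The two sign characters are reported as singletons and are rewritten by plain NFC, while U+00F4 is left alone.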

TinoDidriksen · 2020-12-31T20:21:27Z

A quick'n'dirty shortcut would be to use a transformation that only hits grapheme clusters with combining marks. For example:
echo -n 'ôôÅÅ' | uconv -x '([:^Nonspacing Mark:] [:Nonspacing Mark:]+) > &NFC($1)' | uconv -x any-name
yields
\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{ANGSTROM SIGN}\N{LATIN CAPITAL LETTER A WITH RING ABOVE}

It turns ô (U+006F U+0302) into ô (U+00F4), but doesn't touch Å.

However, it would touch Å if that had any combining marks after it. I posit that is so rare we don't have to worry.
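For anyone without `uconv` at hand, roughly the same shortcut can be sketched in Python. This is an approximation of the rule above, not a port of ICU's transliterator: it treats a "cluster" as one character outside general category Mn followed by one or more Mn characters, and the function name `compose_marked_clusters` is hypothetical.

```python
import unicodedata

def compose_marked_clusters(text):
    """Apply NFC only to a non-mark character followed by one or more Mn characters."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        # Extend the cluster over trailing nonspacing marks, but only
        # when the cluster starts with a non-mark character.
        if unicodedata.category(text[i]) != "Mn":
            while j < n and unicodedata.category(text[j]) == "Mn":
                j += 1
        cluster = text[i:j]
        if len(cluster) > 1:
            cluster = unicodedata.normalize("NFC", cluster)
        out.append(cluster)
        i = j
    return "".join(out)

# o + U+0302, precomposed ô, ANGSTROM SIGN, LATIN CAPITAL LETTER A WITH RING ABOVE
sample = "o\u0302\u00f4\u212b\u00c5"
print([f"U+{ord(c):04X}" for c in compose_marked_clusters(sample)])
```

As with the `uconv` rule, a lone U+212B passes through unchanged, but U+212B followed by a combining mark would still be rewritten.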

mr-martian · 2021-02-11T16:22:45Z

> Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse; maybe we should generate a Google doc with the actual letters and so on for collaborative editing?

https://gist.github.com/mr-martian/80d99c2ca29a36ac483cca84bbc4ec3a

Not quite collaborative editing, but hopefully at least a bit more readable

mr-martian · 2021-02-11T16:40:12Z

https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c

And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

flammie · 2021-02-12T16:38:35Z

> https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c
>
> And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

Hmm, this all looks OK to me, though I don't know most of the scripts in the list well. As far as I can see, for Latin / generic it doesn't seem to have anything more problematic than Å for the Ångström sign and K for the Kelvin sign?

unhammer · 2021-04-14T13:19:39Z

Should this be a step that apertium/apy runs before the pipeline, or something done within morph analysis? (My first thought is that it seems easier and cleaner to do it before analysis.)

mr-martian · 2021-04-14T13:22:41Z

I would expect it to be in conjunction with format handling (either before or after, not sure which).

xavivars · 2021-04-14T13:22:49Z

Should deformatting take care of this? Or are you thinking of something in between deformatting and analysis?

mr-martian · 2021-04-14T13:25:33Z

Inserting a normalizer between deformatting and analysis would handle it without requiring every deformatter to be updated. It also deals with the issue (which I guess was discussed on IRC rather than here) that sooner or later someone might care about normalized vs. not and want to turn it off.

unhammer · 2021-04-14T13:30:51Z

Having it after deformatting would mean it could run on only the translated parts of the text, and not touch formatting (so that when Word2022 exports an html page with combining chars in its class names it will still look as ugly as intended)
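A minimal sketch of such a post-deformatter step, assuming the deformatted stream wraps formatting in `[...]` superblanks and uses backslash to escape the next character. The function name `normalise_outside_blanks` is hypothetical, and the normaliser is passed in as a parameter; plain NFC is the default only to keep the sketch short, and a more conservative function would be substituted in practice.

```python
import re
import unicodedata

# Assumption: superblanks look like [...] and a backslash escapes the next character.
# Alternatives, in order: an escaped character, a whole superblank, a run of plain text.
TOKEN = re.compile(r"\\.|\[(?:\\.|[^\]\\])*\]|[^\[\\]+", re.DOTALL)

def normalise_outside_blanks(stream, normalise=lambda s: unicodedata.normalize("NFC", s)):
    """Normalise translatable text but leave superblanks and escapes untouched."""
    def fix(match):
        token = match.group(0)
        if token[0] in "[\\":
            return token  # formatting or escaped character: keep as-is
        return normalise(token)
    return TOKEN.sub(fix, stream)

# The class name inside the superblank keeps its combining mark;
# the text between the blanks is composed.
stream = '[<b class="o\u0302">]o\u0302[<\\/b>]'
print(normalise_outside_blanks(stream))
```

Passing `normalise=lambda s: s` gives a trivial way to turn the step off for anyone who needs the input left exactly as it was.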

TinoDidriksen · 2021-05-15T13:47:42Z

Relevant IRC log: https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2021-02-11.log

TinoDidriksen · 2021-05-15T14:15:56Z

And here's a helper script I have for a similar task: https://gist.github.com/TinoDidriksen/aa6b8047e26fb6876b4b9f90c51988f3

TinoDidriksen added the enhancement (New feature or request) label Dec 31, 2020

flammie mentioned this issue Mar 5, 2021: Error in analysis giellalt/lang-sme#14 (Closed)

flammie mentioned this issue Apr 14, 2021: handling both combined and non-combined characters equivalently apertium/apertium-mos#2 (Open)

mr-martian mentioned this issue May 22, 2021: ICU stuff apertium/lttoolbox#115 (Merged)
