handling both combined and non-combined characters equivalently #2

unhammer opened this issue Apr 14, 2021 · 4 comments

@unhammer
Member

unhammer commented Apr 14, 2021

$ echo "kũuni kũuni" | apertium -d . mos-morph
^kũuni/kũuni<n><sg>$ ^kũuni/*kũuni$

The first one is a single codepoint: U+0169 LATIN SMALL LETTER U WITH TILDE;
the second is two codepoints: U+0075 LATIN SMALL LETTER U followed by U+0303 COMBINING TILDE.
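
For anyone who wants to check which form a given string uses, here is a minimal Python sketch using only the standard `unicodedata` module. The variable and function names are made up for illustration. It shows that the two spellings are different codepoint sequences that are canonically equivalent under Unicode normalization.

```python
import unicodedata

# Illustrative names only; the escapes make the two forms visible in source.
precomposed = "k\u0169uni"   # one codepoint for the tilde-u
decomposed = "ku\u0303uni"   # plain u followed by a combining tilde

def codepoints(text):
    """List the codepoints of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The raw strings differ, which is why a plain dictionary lookup
# only matches one of them.
same_raw = precomposed == decomposed

# After normalization they coincide: NFC composes, NFD decomposes.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(codepoints(precomposed))
print(codepoints(decomposed))
print(same_raw, same_nfc, same_nfd)
```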

The .dix file has an entry for the single-codepoint version, so we get an analysis for only that one.

.acx doesn't help here since it's two codepoints.

Possible solutions:

  • use a pardef for every single tilde-entry in the .dix file – simple, but very ugly: <i>k</i><par n="ũ"/><i>un</i><par n="kũun/i__n"/>
  • use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how
  • change lttoolbox to treat them equivalently – big job, but everyone wins
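
The first two options both amount to accepting either spelling of each tilde character. A rough Python sketch of that idea follows. It is not lttoolbox or hfst code, and `spelling_variants` is a hypothetical helper. It expands a word into every mix of composed and fully decomposed characters, which is the set a pardef or a compile-time substitution would have to cover. It assumes each character is either fully composed or fully decomposed, so partially composed stacks of several combining marks are not handled.

```python
import itertools
import unicodedata

def spelling_variants(word):
    """Return all spellings of `word` where each precomposed character
    appears either composed (NFC) or canonically decomposed (NFD).

    Hypothetical helper for illustration only.
    """
    options = []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        # Characters without a decomposition contribute a single option.
        options.append((ch,) if decomposed == ch else (ch, decomposed))
    return {"".join(parts) for parts in itertools.product(*options)}

variants = spelling_variants("k\u0169uni")
for variant in sorted(variants):
    print([f"U+{ord(ch):04X}" for ch in variant])
```

The number of variants doubles with every precomposed character in a word, which is one reason to prefer handling this once in the tools rather than per entry.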

@fatkab @ftyers thoughts?

@fatkab
Collaborator

fatkab commented Apr 14, 2021 via email

@flammie
Member

flammie commented Apr 14, 2021

  • use some hfst-trickery to do basically the same thing on compile – slightly more complicated, but a one-time job for someone who knows how

You can use hfst-substitute to replace the pre-composed characters with an automaton containing the disjunction, but it's a lot of hacking around.

  • change lttoolbox to treat them equivalently – big job, but everyone wins

apertium/organisation#24

@ftyers
Member

ftyers commented Apr 14, 2021

The easiest thing is to use a spellrelax-type script, e.g. this one for Basaa.

@mr-martian
Contributor

As a stop-gap measure, I've added a normalizing morph mode mos-nmorph in 73cc4b4 that uses uconv -x any-nfc.
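
For comparison, the same normalization step can be sketched in Python with the standard `unicodedata` module. This is not what the commit does (it uses `uconv`). `to_nfc` is an illustrative name, and results could differ slightly from ICU if the two ship different Unicode versions. The point is that once the input is in NFC, both tokens from the original report become the single-codepoint spelling that the .dix entry uses.

```python
import unicodedata

def to_nfc(text):
    """Compose characters into NFC, as a pre-processing step before analysis."""
    return unicodedata.normalize("NFC", text)

# The input from the report: first token precomposed, second decomposed.
raw = "k\u0169uni ku\u0303uni"
normalized = to_nfc(raw)

first, second = normalized.split()
print(first == second)
```

This only helps as long as the dictionary itself is stored in NFC. Entries typed in decomposed form would need the same treatment.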
