Indexing and normalisation of Cyrillic characters #1636

Open
taygun opened this issue Sep 5, 2022 · 2 comments

taygun commented Sep 5, 2022

Describe the bug
When searching for the address ("Олега Оникієнка вулиця 77а") of this OSM place, no results are returned. The issue seems to be caused by the fact that the address is indexed with the Cyrillic "а" (U+0430) rather than the Latin "a" (U+0061). If the search query contains the Cyrillic character "а", the above address is returned.
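To make the distinction concrete, here is a minimal Python sketch showing that the two letters are different code points even though they look identical in most fonts. The variable and function names are illustrative only and are not taken from Pelias.

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, visually identical to the Latin letter in most fonts


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_a))
print(describe(cyrillic_a))

# The two house-number strings look the same but compare as different text.
print("77" + latin_a == "77" + cyrillic_a)
```

Unless some normalisation step maps one letter to the other, an exact token match between "77a" and "77а" cannot succeed.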

Steps to Reproduce

Steps to reproduce the behavior:
1. No results returned when searched with Latin Small Letter A: pelias.github.io
2. Result returned when searched with Cyrillic Small Letter A: pelias.github.io

Expected behavior
Expected the address to be returned when searching with the Latin character.

@taygun taygun added the bug label Sep 5, 2022
missinglink (Member) commented

Hmm, yes, I can confirm the issue you are seeing. It seems to affect queries to the /v1/autocomplete endpoint but not the /v1/search endpoint, which helps narrow down the scope.
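For anyone wanting to repeat the comparison, a rough sketch of building the four request URLs (two endpoints, two spellings of the house number) could look like this. The base URL is an assumption (a locally running Pelias API), and the helper name `build_url` is hypothetical. The sketch only constructs URLs and makes no network calls.

```python
from urllib.parse import urlencode

# Assumption: a Pelias API instance reachable at this address.
BASE_URL = "http://localhost:4000"


def build_url(endpoint: str, text: str) -> str:
    """Build a query URL for the given endpoint using the `text` parameter."""
    return f"{BASE_URL}{endpoint}?{urlencode({'text': text})}"


street = "Олега Оникієнка вулиця"
queries = {
    "latin": f"{street} 77a",            # house number ends in U+0061
    "cyrillic": f"{street} 77\u0430",    # house number ends in U+0430
}

for endpoint in ("/v1/autocomplete", "/v1/search"):
    for label, text in queries.items():
        print(endpoint, label, build_url(endpoint, text))
```

The percent-encoded URLs make the difference visible: the Latin letter stays as `a`, while the Cyrillic letter is encoded as `%D0%B0`.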

We use the icu-folding filter in Elasticsearch to 'fold' the Cyrillic form to the Latin form.

It seems as though we are using this filter correctly in all of the analyzers, with the exception of peliasHousenumber which has a numeric character filter, and so it doesn't apply.
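As a point of reference, a custom Elasticsearch analyzer that includes the ICU folding token filter (named `icu_folding` in settings and provided by the analysis-icu plugin) might be declared roughly as below. This is a sketch only: the analyzer name `exampleFoldingAnalyzer`, the tokenizer, and the filter order are assumptions and do not reproduce the actual Pelias schema. It also does not verify which characters the filter folds.

```python
import json

# Hypothetical analyzer definition; Pelias defines its own analyzers in its schema.
settings = {
    "analysis": {
        "analyzer": {
            "exampleFoldingAnalyzer": {
                "type": "custom",
                "tokenizer": "standard",
                # Token filters run in order; icu_folding needs the analysis-icu plugin.
                "filter": ["lowercase", "icu_folding"],
            }
        }
    }
}

# Body that could be sent when creating an index with these settings.
print(json.dumps({"settings": settings}, indent=2))
```

Comparing the filter chain of each analyzer against a declaration like this is one way to check whether the folding step is present on both the index and query side.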

I'm not really sure what's going on here; the expected behaviour is that we fold Cyrillic to ASCII for precisely this purpose.

orangejulius (Member) commented

Ah, very nice discovery @missinglink. I think we originally discovered this issue back in pelias/pelias#833 but never narrowed down the cause.

It feels like adding the icu-folding filter is relatively safe, maybe we should try that out?
