
Seeking advice regarding classification problem only present with Chinese #49

Open
nmstoker opened this issue Mar 1, 2016 · 4 comments

Comments

@nmstoker

nmstoker commented Mar 1, 2016

Hello,

I have some sample texts, originating as PDFs, and my goal is to classify their language automatically. I've extracted the text content with pdfminer, and while langid works excellently on my samples in a variety of languages, it has problems when I run it on Chinese (I have samples in both simplified and traditional): it always suggests 'en'.

Does anyone have any advice on how I should approach investigating what the problem might be?

Are there any standard example documents that I could try that would confirm there isn't something quirky with my PDF extraction?

I could be wrong, but I don't think it's necessarily a UTF-8 encoding issue as I have managed to get it working with other non-Latin texts (eg Cyrillic).

The languages that I've found to work with my samples, so far, are: en, it, de, ru. I will be checking pt, fr, pl and ja ones shortly.

There is a tiny portion of English in the header section, but that does not throw off the language detection for the other samples and I have tried focusing on pages where the body of the text is entirely Chinese and present in significantly larger quantities than in the header.

It also makes no difference if I preselect the languages (unfortunately the false suggestion of English needs to stay in the list, as there are likely to be English samples present):

langid.set_languages(['en','es','pt','fr','ru','pl','de','it','ja', 'zh'])

Even if I take English out, it merely suggests a different wrong language (e.g. German), although the confidence is fairly low either way (typically 0.16 to 0.25, whether it guesses English or German).

My setup is Windows 7 with Python 2.7 (needed because of PDFMiner, although I could try Python 3.5 if that were thought to solve the issue).
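One quick way to sanity-check the extracted text before suspecting langid itself is to count how much of it actually consists of CJK codepoints; if PDF extraction mangled the Chinese, langid never sees any. This is a stdlib-only sketch and `cjk_ratio` is an illustrative helper, not part of langid:

```python
# Sanity check: does the extracted text actually contain CJK codepoints,
# or did the PDF extraction mangle the Chinese? The sample strings below
# are placeholders, not taken from the original PDFs.

def cjk_ratio(text):
    """Fraction of characters in the main CJK Unified Ideographs block."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if u'\u4e00' <= ch <= u'\u9fff')
    return float(cjk) / len(text)

print(cjk_ratio(u"这是一个中文测试句子"))  # close to 1.0 for pure Chinese body text
print(cjk_ratio(u"Header text only"))       # 0.0 if extraction dropped the CJK
```

If the ratio comes out near zero on a page that should be almost entirely Chinese, the problem is upstream of langid.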

Many thanks,
Neil

@tripleee

tripleee commented Mar 1, 2016

Are you sure the documents are in UTF-8? Windows software often defaults to UTF-16 (if not some legacy code page).

@saffsd
Owner

saffsd commented Mar 8, 2016

This definitely sounds like an encoding issue on the document side. When we trained langid.py we tried to include a representative sample of encodings, but I think the coverage for Chinese might be pretty poor. It's possible to retrain langid.py, but this requires a bit of effort and training data. As @tripleee points out, Windows often uses UTF-16, and quite a bit of the langid.py training data is in UTF-8. The easiest thing to try might be to transcode all documents to UTF-8 (perhaps PDFMiner supports this directly? I'm not familiar) and try again. Hope that helps!
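The transcoding step could be sketched as follows, assuming the source encoding is known or guessable (a stdlib-only example; `transcode_to_utf8` and the encoding candidates are illustrative, not part of langid.py or PDFMiner):

```python
import codecs

def transcode_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Re-save a text file as UTF-8, decoding from a suspected source encoding.

    src_encoding is a guess; for Chinese documents on Windows, try
    "utf-16", "gb18030", or "big5" in turn if the first attempt raises
    UnicodeDecodeError.
    """
    with codecs.open(src_path, "r", encoding=src_encoding) as f:
        text = f.read()
    with codecs.open(dst_path, "w", encoding="utf-8") as f:
        f.write(text)
```

After transcoding, feed the UTF-8 text (decoded to unicode) to `langid.classify` and see whether the result changes.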

@bittlingmayer

For what it's worth, I see the opposite issue: bias towards Chinese

¡No! (only 24%)
‪#‎WCIT
Tʻagavorn apracʻ kenna
ՏԵՍԱՆՅՈՒԹ
#Cizre
#MustRead (only 77%)
Աֆրիկա (2nd, only 14%)

All are identified as Chinese, generally with > 98% probability.

Perhaps the Chinese training data is actually all in the Latin alphabet? Chinese should be the easiest language to keep separate, so this reeks of a fundamental bug or a preprocessing issue.

@bittlingmayer

Pardon, it looks like in most cases this is the result of invisible characters in dirty data. (But ՏԵՍԱՆՅՈՒԹ and ¡No! are clean, and the ʻ in Tʻagavorn apracʻ kenna is not so exotic.)
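Those invisible characters are typically Unicode "format" (category Cf) codepoints such as directional marks, and they can be stripped before classification. A stdlib sketch; `strip_invisible` is an illustrative helper, not part of langid:

```python
import unicodedata

def strip_invisible(text):
    """Drop Unicode 'format' (Cf) characters, e.g. the left-to-right
    embedding and mark codepoints that sneak into social-media text."""
    return u"".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# U+202A (LRE) and U+200E (LRM) prepended, as in the hashtag examples above
dirty = u"\u202a#\u200eWCIT"
print(repr(strip_invisible(dirty)))
```

Visible text such as ¡No! or ՏԵՍԱՆՅՈՒԹ passes through unchanged, so this only helps with the dirty-data cases.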
