pdfminer test broken #402

bosd · 2022-09-24T15:30:20Z

PDFminer tests are broken.

I can't get pdfminer to parse the Amazon sample the way pdftotext does.

We support different PDF parsers. Each has its own strengths and weaknesses.
Testing each parser against the same template does not lead to consistent results.

The current test cases might need some work.
One test only checks whether a string is returned; see:

invoice2data/tests/test_lib.py

Lines 78 to 87 in f6080ba

    
    def test_extract_data_pdfminer(self):
        pdf_files = get_sample_files('.pdf')
        for file in pdf_files:
            try:
                res = extract_data(file, None, pdfminer_wrapper)
                print(res)  # Check why logger.info is not working, for the time being using print
            except ImportError:
                # print("pdfminer module not installed!")
                self.assertTrue(False, "pdfminer is not installed")
                self.assertTrue(type(res) is str, "return is not a string")

This test is likely to pass.

However, comparing the actual result fails, as in the case of the amazon.pdf example.
Parsing with pdfminer results in a different text layout than parsing with pdftotext,
which causes the regexes to fail.
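To illustrate the layout problem, here is a minimal sketch. Both text samples are made up for illustration (they are not real output of either parser), and the regex is a hypothetical template-style pattern that expects the value on the same line as its label.

```python
import re

# Hypothetical extracted text for the same invoice in two layouts.
# These strings are illustrative assumptions, not real parser output.
LAYOUT_PRESERVING = "Invoice Number:   INV-1234    Date: 2022-09-24\n"
STREAM_ORDER = "Invoice Number:\nDate:\nINV-1234\n2022-09-24\n"

# Template-style regex: the value must follow the label on the same line.
INVOICE_NUMBER = re.compile(r"Invoice Number:[ \t]+(\S+)")


def find_invoice_number(text):
    """Return the captured invoice number, or None if the regex does not match."""
    match = INVOICE_NUMBER.search(text)
    return match.group(1) if match else None


print(find_invoice_number(LAYOUT_PRESERVING))
print(find_invoice_number(STREAM_ORDER))
```

The same regex finds the value in the first layout and nothing in the second, even though both strings contain the same information.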

Proposed solution:

Update the testing mechanism: create parser-specific tests.
Adapt the template file so that it can specify the preferred parser and its settings.
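A rough sketch of what parser-specific tests could look like. The two parser functions are stand-ins that return invented text, and the names `PARSER_CASES` and `TestParserSpecific` are hypothetical; a real test would call the actual input modules on the sample files.

```python
import re
import unittest


# Stand-in parsers (assumptions): each returns the kind of layout that
# parser might produce for the same hypothetical sample file.
def fake_pdftotext(path):
    return "Total:   42.00 EUR\n"


def fake_pdfminer(path):
    return "Total:\n42.00 EUR\n"


# One expectation per parser instead of a single shared regex.
PARSER_CASES = {
    "pdftotext": (fake_pdftotext, r"Total:[ \t]+(\d+\.\d{2})"),
    "pdfminer": (fake_pdfminer, r"Total:\s+(\d+\.\d{2})"),
}


class TestParserSpecific(unittest.TestCase):
    def test_total_is_extracted(self):
        for name, (parser, pattern) in PARSER_CASES.items():
            # subTest reports a failure per parser instead of stopping at the first.
            with self.subTest(parser=name):
                text = parser("sample.pdf")
                self.assertIsInstance(text, str)
                match = re.search(pattern, text)
                self.assertIsNotNone(match)
                self.assertEqual(match.group(1), "42.00")


def run_suite():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestParserSpecific)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Unlike a check that only asserts the return type, this compares the extracted value, so a layout change in one parser shows up as a failure for that parser only.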

As an example, consider these use cases:

A) Invoices in which the issuer data is embedded in an image
(VAT number, issuer name and address).
That data is actually needed to match a template.
So, to be able to match a template, that image needs to be parsed by OCR.
As far as I know, pdftotext is unable to do that,
but pdfminer.six would be able to (--all-texts).

B) Using invoice2data as a module. An invoice is parsed by default with the pdftotext parser.
The extracted text is enough to match a template. But from experience we know that, for full detection of the fields, a different parser (e.g. pdfplumber) could be used.
A key could be added to the template that triggers re-parsing the invoice with that specific parser.
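Use case B could be sketched roughly as follows. The `parser` template key is the proposed addition, not an existing option, and the parser functions, registry, and `extract_text` helper are hypothetical stand-ins that return invented text.

```python
# Hypothetical stand-in parsers; real code would call the actual input modules.
def parse_default(path):
    return "ACME Corp invoice Total:\n42.00\n"


def parse_alternative(path):
    return "ACME Corp invoice Total: 42.00\n"


PARSERS = {"pdftotext": parse_default, "pdfplumber": parse_alternative}
DEFAULT_PARSER = "pdftotext"

# "parser" is the proposed new template key (an assumption of this sketch).
TEMPLATE = {"keywords": ["ACME Corp"], "parser": "pdfplumber"}


def matches(template, text):
    """A template matches when all of its keywords occur in the text."""
    return all(keyword in text for keyword in template["keywords"])


def extract_text(path, templates):
    """Parse with the default parser, then re-parse if the matched template asks for another."""
    used = DEFAULT_PARSER
    text = PARSERS[used](path)
    for template in templates:
        if matches(template, text):
            preferred = template.get("parser", DEFAULT_PARSER)
            if preferred != used:
                # Re-parse the same file with the parser named in the template.
                used = preferred
                text = PARSERS[used](path)
            break
    return used, text
```

Templates without the key keep the current behaviour, so existing templates would not need to change.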


bosd changed the title ~~PDFminer implementation broken~~ pdfminer test broken Sep 24, 2022

bosd added this to the 0.4.0 release milestone Sep 24, 2022

bosd mentioned this issue Sep 24, 2022

Add posibility to parse multiple line definitions #378

Closed

bosd removed this from the 0.4.0 release milestone Nov 28, 2022
