pdfminer test broken #402

Open
bosd opened this issue Sep 24, 2022 · 0 comments

bosd commented Sep 24, 2022

The pdfminer tests are broken.

I can't get pdfminer to parse the amazon sample the way pdftotext does.

We support different PDF parsers. Each has its own strengths and weaknesses.
Testing every parser against the same template does not lead to consistent results.

The current test cases might need some work.
One test only checks whether a string is returned; see:

    def test_extract_data_pdfminer(self):
        pdf_files = get_sample_files('.pdf')
        for file in pdf_files:
            try:
                res = extract_data(file, None, pdfminer_wrapper)
                print(res)  # Check why logger.info is not working, for the time being using print
            except ImportError:
                # print("pdfminer module not installed!")
                self.assertTrue(False, "pdfminer is not installed")
            self.assertTrue(type(res) is str, "return is not a string")

This test is likely to pass, because it only asserts the return type.

However, comparing the actual extracted result fails,
as in the case of the amazon.pdf example.
Parsing with pdfminer produces a different text layout than parsing with pdftotext,
so the regexes in the template fail to match.
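
To illustrate the kind of failure meant here, a minimal sketch. The two strings are made up for illustration and are not the real output of either parser on amazon.pdf; the field name and regex are hypothetical as well.

```python
import re

# Hypothetical extracted text, NOT the real parser output for amazon.pdf.
# One parser keeps label and value on the same line ...
layout_preserving = "Invoice Number:   INV-001      Date: 2022-09-24\n"
# ... while another emits all labels first and the values afterwards.
stream_order = "Invoice Number:\nDate:\nINV-001\n2022-09-24\n"

# A template regex written against the first layout (hypothetical).
pattern = re.compile(r"Invoice Number:\s+(INV-\d+)")


def extract_invoice_number(text):
    """Return the captured invoice number, or None if the regex does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_invoice_number(layout_preserving))
print(extract_invoice_number(stream_order))
```

Both strings contain the same information, but only the first one satisfies the regex, so a template tuned for one parser can silently return nothing with another.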

Proposed solution:

  1. Update the testing mechanism: create parser-specific tests.
  2. Adapt the template file so that it can specify the preferred parser and its settings.
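
A rough sketch of what point 1 could look like, assuming one expectation per parser instead of one shared expectation. `fake_pdftotext`, `fake_pdfminer` and the expected snippets are hypothetical stand-ins so the example runs on its own; the real tests would call the actual parser wrappers on the sample files.

```python
import unittest


# Hypothetical stand-ins for the real parser wrappers (path -> extracted text).
def fake_pdftotext(path):
    return "Invoice Number:   INV-001\n"


def fake_pdfminer(path):
    return "Invoice Number:\nINV-001\n"


# One expected snippet per parser, rather than one expectation for all parsers.
CASES = {
    "pdftotext": (fake_pdftotext, "Invoice Number:   INV-001"),
    "pdfminer": (fake_pdfminer, "Invoice Number:\nINV-001"),
}


class ParserSpecificTests(unittest.TestCase):
    def test_parser_output_matches_its_own_expectation(self):
        for name, (parse, expected_snippet) in CASES.items():
            # subTest reports a failure per parser instead of stopping at the first one.
            with self.subTest(parser=name):
                text = parse("amazon.pdf")
                self.assertIsInstance(text, str)
                self.assertIn(expected_snippet, text)
```

Compared to the current test, this checks the content and not only the return type, and a failure names the parser that caused it.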

As an example, consider these use cases:

A) Invoices in which the issuer data (VAT number, issuer name and address) is encapsulated in an image.
That data is needed to match a template.
So to be able to match a template, that image needs to be parsed by OCR.
As far as I know, pdftotext is unable to do that,
but pdfminer.six would be capable of doing it (--all-texts).

B) Using invoice2data as a module. An invoice is parsed by default with the pdftotext parser.
The extracted text is enough to match a template. But from experience we know that a different parser, e.g. pdfplumber, may be needed for full detection of the fields.
In the template, a key could be added that triggers re-parsing the invoice with that specific parser.
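
Use case B could work roughly like the sketch below. This is not how invoice2data is implemented today: the `parser` template key, the dict-based template format and the fake parser functions are all assumptions made up for this example.

```python
import re


# Hypothetical stand-ins for parser wrappers (path -> extracted text).
def fake_pdftotext(path):
    return "ACME Corp\nTotal 10.00\n"


def fake_pdfplumber(path):
    return "ACME Corp\nTotal: 10.00 EUR\n"


PARSERS = {"pdftotext": fake_pdftotext, "pdfplumber": fake_pdfplumber}

TEMPLATES = [
    {
        "keywords": ["ACME Corp"],
        "parser": "pdfplumber",  # hypothetical key: preferred parser for this template
        "fields": {"amount": r"Total:\s+(\d+\.\d{2})\s+EUR"},
    }
]


def extract(path, templates=TEMPLATES, parsers=PARSERS, default_parser="pdftotext"):
    """Match a template on the default parser's text, then re-parse if it asks for another parser."""
    text = parsers[default_parser](path)
    for template in templates:
        if not all(keyword in text for keyword in template["keywords"]):
            continue
        preferred = template.get("parser", default_parser)
        if preferred != default_parser:
            # Re-parse the same file with the parser the template prefers.
            text = parsers[preferred](path)
        result = {}
        for name, pattern in template["fields"].items():
            match = re.search(pattern, text)
            result[name] = match.group(1) if match else None
        return result
    return None  # no template matched


print(extract("invoice.pdf"))
```

Templates without the key would keep the current behaviour, so existing templates would not need to change.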

@bosd bosd changed the title PDFminer implementation broken pdfminer test broken Sep 24, 2022
@bosd bosd added this to the 0.4.0 release milestone Sep 24, 2022
@bosd bosd removed this from the 0.4.0 release milestone Nov 28, 2022