Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pdplumber return empty string on importerror #481

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

bosd
Copy link
Collaborator

@bosd bosd commented Feb 25, 2023

Was running some tests, encountered following error when pdfplumber is not available.
This PR returns and empty value and let invoice2data fail gracefully.

Before:
invoice2data input.pdf --input-reader=pdfplumber

DEBUG:invoice2data.input.pdfplumber: Cannot import pdfplumber
Traceback (most recent call last):
  File "/home/emiel/.local/bin/invoice2data", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/main.py", line 312, in main
    res = extract_data(f.name, templates=templates, input_module=input_module)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/main.py", line 166, in extract_data
    extracted_str = input_module.to_text(invoicefile).decode("utf-8")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/input/pdfplumber.py", line 29, in to_text
    with pdfplumber.open(path, laparams={"detect_vertical": True}) as pdf:
         ^^^^^^^^^^
UnboundLocalError: cannot access local variable 'pdfplumber' where it is not associated with a value

After:

ERROR:invoice2data.input.pdfplumber: Cannot import pdfplumber
ERROR:root: Failed to extract text from testin.pdf using invoice2data.input.pdfplumber

fixes #362

@@ -19,7 +19,8 @@ def to_text(path):
try:
import pdfplumber
except ImportError:
logger.debug("Cannot import pdfplumber")
logger.error("Cannot import pdfplumber")
return "".encode("UTF-8")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning empty string suggests that invoice was parsed but was empty.
If we want to return some value then make it None please.

If you take a look at pdftotext.py however, you'll see it raises EnvironmentError if pdftotext is missing. So returning None will make pdfplumber.py somehow incompatible with the pdftotext.py.

As for decision what is better: return None or raise EnvironmentError - I have no idea or preference.

Copy link
Collaborator Author

@bosd bosd Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmilecki I agree, we need to think of an solution for this.

Returning None conflicts with:

extracted_str = input_module.to_text(invoicefile).decode("utf-8")

as Nonetype cannot be decoded
(Maybe we can remove the decode line? I assume it is a python2 leftover)

pdftotext might be a different story, as it is one of the default/main parsers.
So making everything fail when it's unavailable is not a big deal.

Is there a way to raise the error, only if the pdfplumber input module is called?
We don't want the whole lib to fail on this missing requirement.

This pr #491
is currently failing because of the missing pdfplumber.
(the test should not even run when it is unavailable but that's a different sunbject)
In that example, there should be an ImportError or EnvironmentError

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Template validator ?
2 participants