pdplumber return empty string on importerror #481

bosd · 2023-02-25T16:33:55Z

While running some tests, I encountered the following error when pdfplumber is not available.
This PR returns an empty value and lets invoice2data fail gracefully.

Before:
invoice2data input.pdf --input-reader=pdfplumber

DEBUG:invoice2data.input.pdfplumber: Cannot import pdfplumber
Traceback (most recent call last):
  File "/home/emiel/.local/bin/invoice2data", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/main.py", line 312, in main
    res = extract_data(f.name, templates=templates, input_module=input_module)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/main.py", line 166, in extract_data
    extracted_str = input_module.to_text(invoicefile).decode("utf-8")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/input/pdfplumber.py", line 29, in to_text
    with pdfplumber.open(path, laparams={"detect_vertical": True}) as pdf:
         ^^^^^^^^^^
UnboundLocalError: cannot access local variable 'pdfplumber' where it is not associated with a value

After:

ERROR:invoice2data.input.pdfplumber: Cannot import pdfplumber
ERROR:root: Failed to extract text from testin.pdf using invoice2data.input.pdfplumber

fixes #362
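
For anyone wondering why the failure shows up as an UnboundLocalError rather than an ImportError: an `import` inside a function binds a local name, so when the import fails and the `except` branch neither returns nor raises, the next use of that name hits an unbound local. A minimal sketch of that control flow, using a made-up module name (`hypothetical_missing_pdf_backend`) as a stand-in for pdfplumber so that it fails on any machine:

```python
import logging

logger = logging.getLogger(__name__)


def to_text_buggy(path):
    """Mimic the pre-fix control flow with a stand-in module name."""
    try:
        # Stand-in for `import pdfplumber`; this module does not exist.
        import hypothetical_missing_pdf_backend as backend
    except ImportError:
        logger.debug("Cannot import backend")
        # No return or raise here, so execution falls through.
    # `backend` is a local name that was never bound.
    return backend.open(path)


def demo():
    """Return the name of the exception the buggy function raises."""
    try:
        to_text_buggy("input.pdf")
    except UnboundLocalError as exc:
        return type(exc).__name__
    return "no error"
```

Calling `demo()` exercises the same path as the traceback above; the exact wording of the exception message differs between Python versions.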

rmilecki · 2023-03-11T12:55:53Z

src/invoice2data/input/pdfplumber.py

@@ -19,7 +19,8 @@ def to_text(path):
    try:
        import pdfplumber
    except ImportError:
-        logger.debug("Cannot import pdfplumber")
+        logger.error("Cannot import pdfplumber")
+        return "".encode("UTF-8")


Returning an empty string suggests that the invoice was parsed but was empty.
If we want to return some value, then make it None please.

If you take a look at pdftotext.py, however, you'll see it raises EnvironmentError if pdftotext is missing. So returning None would make pdfplumber.py inconsistent with pdftotext.py.

As for which is better, returning None or raising EnvironmentError, I have no idea or preference.
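
To make the three options concrete, here is a rough side-by-side sketch. It is not the project's code: the function names are invented for illustration, and `hypothetical_missing_pdf_backend` is a made-up module name standing in for a missing pdfplumber.

```python
import importlib

# Made-up module name so the import always fails in this sketch.
MISSING = "hypothetical_missing_pdf_backend"


def to_text_returning_empty():
    """Option 1: empty string. Looks the same as a blank invoice."""
    try:
        importlib.import_module(MISSING)
    except ImportError:
        return ""
    return "invoice text"


def to_text_returning_none():
    """Option 2: None. Distinguishable, but callers must check for it."""
    try:
        importlib.import_module(MISSING)
    except ImportError:
        return None
    return "invoice text"


def to_text_raising():
    """Option 3: raise, similar in spirit to what pdftotext.py does."""
    try:
        importlib.import_module(MISSING)
    except ImportError as exc:
        # EnvironmentError is an alias of OSError in Python 3.
        raise EnvironmentError("PDF backend is not installed") from exc
    return "invoice text"
```

With option 1 the caller cannot tell a missing dependency from an empty document, which is the concern raised above; options 2 and 3 both make the failure visible, at different costs to the caller.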

bosd · 2023-03-11

@rmilecki I agree, we need to think of a solution for this.

Returning None conflicts with:

invoice2data/src/invoice2data/main.py

Line 89 in a5bdd50

extracted_str = input_module.to_text(invoicefile).decode("utf-8")

because NoneType cannot be decoded.
(Maybe we can remove the decode call? I assume it is a Python 2 leftover.)
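
A small sketch of what the calling side could look like if `to_text()` were allowed to return None. `decode_result` is a hypothetical helper, not something that exists in main.py; it only illustrates guarding the decode step and accepting either bytes or str.

```python
def decode_result(raw):
    """Normalize the value returned by an input module (hypothetical helper)."""
    if raw is None:
        # Calling None.decode("utf-8") would raise AttributeError,
        # so fail with a clearer message instead.
        raise ValueError("input module returned no text")
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    # Already a str, nothing to decode.
    return raw
```

If every input module returned str, the bytes branch and the decode call could be dropped altogether.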

pdftotext might be a different story, as it is one of the default/main parsers.
So making everything fail when it's unavailable is not a big deal.

Is there a way to raise the error only if the pdfplumber input module is actually called?
We don't want the whole lib to fail on this missing requirement.
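
One way this could work, sketched under assumptions: keep the import inside the function, so a missing package only matters when this input module is used, and re-raise with a clearer message. `require` and the `module_name` parameter are hypothetical names added for illustration; `pdfplumber.open(...)` with `laparams` is taken from the traceback above, and `page.extract_text()` is the usual pdfplumber call for page text.

```python
import importlib


def require(module_name):
    """Import a module on demand and raise a clear ImportError if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required by this input module but is not installed"
        ) from exc


def to_text(path, module_name="pdfplumber"):
    """Extract text from a PDF; the backend is imported only when called."""
    backend = require(module_name)
    with backend.open(path, laparams={"detect_vertical": True}) as pdf:
        # extract_text() may return None for pages without text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

Importing the module that defines `to_text` never touches pdfplumber; only calling `to_text` does, so the rest of the library keeps working without it.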

PR #491
is currently failing because pdfplumber is missing.
(The test should not even run when it is unavailable, but that's a different subject.)
In that case, there should be an ImportError or EnvironmentError.
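
On the side note about tests running without the dependency: a possible sketch using only the standard library is shown here. The test class and the `is_available` helper are made up for illustration and are not the project's actual test layout; with pytest, `pytest.importorskip("pdfplumber")` would be an alternative.

```python
import importlib.util
import unittest


def is_available(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


@unittest.skipUnless(is_available("pdfplumber"), "pdfplumber is not installed")
class TestPdfplumberInput(unittest.TestCase):
    def test_backend_exposes_open(self):
        import pdfplumber

        self.assertTrue(hasattr(pdfplumber, "open"))
```

When pdfplumber is absent the whole class is reported as skipped rather than failed.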

bosd force-pushed the importerr-pdfplumber branch from 024d674 to ed9499d Compare February 25, 2023 16:38

rmilecki requested changes Mar 11, 2023

bosd force-pushed the importerr-pdfplumber branch from ed9499d to f7d42b7 Compare March 11, 2023 23:15

rmilecki mentioned this pull request Mar 12, 2023

Refactor to_text() to return string instead of bytes #493

Merged

bosd force-pushed the importerr-pdfplumber branch from f7d42b7 to 1ae80ab Compare March 12, 2023 18:58

bosd mentioned this pull request Mar 18, 2023

Template validator ? #362

Open

pdplumber raise importerror

db0ac11

bosd force-pushed the importerr-pdfplumber branch from 1ae80ab to db0ac11 Compare March 30, 2023 21:52

Merge branch 'master' into importerr-pdfplumber

80820fc
