I am getting warnings like the ones below on some PDFs. They come from pdftotext and originate in Poppler. The messages are written to stderr, and pdftotext can suppress them with the "-q" flag.
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Please add an option to suppress Poppler warnings, or at least handle stderr in the pdftotext wrapper. Currently the warnings go to stderr and can't be caught (to my understanding).
Below I added a variant where stderr output from the subprocess is sent to the logger (at debug level), so it's easy to suppress when not needed.
import logging

logger = logging.getLogger(__name__)  # module-level logger used below


def to_text(path: str, area_details: dict = None):
    """Wrapper around Poppler pdftotext.
    Parameters
    ----------
    path : str
        path of electronic invoice in PDF
    area_details : dictionary
        of the format {f: int, l: int, r: int, x: int, y: int, W: int, H: int}
        used when extracting an area of the pdf rather than the whole document
    Returns
    -------
    out : str
        returns extracted text from pdf
    Raises
    ------
    EnvironmentError:
        If pdftotext library is not found
    """
    import subprocess
    # py2 compat; distutils is removed in Python 3.12, where shutil.which is the replacement
    from distutils import spawn

    if spawn.find_executable("pdftotext"):  # shutil.which('pdftotext'):
        cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
        if area_details is not None:
            # An area was specified
            # Validate the required keys were provided
            assert 'f' in area_details, 'Area f details missing'
            assert 'l' in area_details, 'Area l details missing'
            assert 'r' in area_details, 'Area r details missing'
            assert 'x' in area_details, 'Area x details missing'
            assert 'y' in area_details, 'Area y details missing'
            assert 'W' in area_details, 'Area W details missing'
            assert 'H' in area_details, 'Area H details missing'
            # Convert all of the values to strings
            for key in area_details.keys():
                area_details[key] = str(area_details[key])
            cmd += [
                '-f', area_details['f'],
                '-l', area_details['l'],
                '-r', area_details['r'],
                '-x', area_details['x'],
                '-y', area_details['y'],
                '-W', area_details['W'],
                '-H', area_details['H'],
            ]
        cmd += [path, "-"]
        # Run the extraction, capturing stderr instead of letting it print
        out, err = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
        if err:
            # Forward each Poppler message to the logger
            for er in err.decode().splitlines():
                logger.debug(er)
        return out
    else:
        raise EnvironmentError(
            "pdftotext not installed. Can be downloaded from https://poppler.freedesktop.org/"
        )
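To make the effect of the stderr handling easier to check in isolation, here is a minimal sketch that runs without Poppler installed. It uses the Python interpreter as a stand-in for pdftotext; the helper `run_and_log`, the `CHILD` command, and the logger name `pdftotext_demo` are all hypothetical names made up for this illustration and are not part of the wrapper.

```python
import logging
import subprocess
import sys

logger = logging.getLogger("pdftotext_demo")  # hypothetical logger name


def run_and_log(cmd):
    """Run cmd, return its stdout, and forward each stderr line to the logger."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for line in proc.stderr.decode(errors="replace").splitlines():
        if line.strip():
            logger.debug(line)
    return proc.stdout


# Stand-in child process (an assumption for the demo, not pdftotext itself):
# it writes one line to stdout and one warning to stderr.
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('text'); print('Syntax Warning: demo', file=sys.stderr)",
]

if __name__ == "__main__":
    # DEBUG shows the forwarded messages; logging.WARNING would hide them.
    logging.basicConfig(level=logging.DEBUG)
    print(run_and_log(CHILD))
```

With this approach the caller decides whether the Poppler messages are visible, simply by choosing the log level, and nothing is printed directly to the terminal by the subprocess.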
I was thinking about adding a -q option for some time now (since I added some info prints actually). It may be a good idea to combine it with pdftotext. I'm planning to work on that.
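One way such a quiet option could be passed through to pdftotext is sketched below. This is only an illustration under assumptions: the function name `build_pdftotext_cmd` and the `quiet` parameter are hypothetical and do not exist in the project; only the `-q` flag itself is taken from pdftotext.

```python
def build_pdftotext_cmd(path, quiet=False):
    """Build the pdftotext argument list; quiet=True adds pdftotext's -q flag."""
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if quiet:
        cmd.append("-q")  # ask pdftotext not to print messages or errors
    cmd += [path, "-"]  # "-" sends the extracted text to stdout
    return cmd


if __name__ == "__main__":
    print(build_pdftotext_cmd("invoice.pdf", quiet=True))
```

The flag has to come before the input path, since pdftotext expects its options ahead of the file arguments.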