Annoying warning "Syntax Warning: Could not parse ligature component" and a possible solution to suppress these messages #459

Open
g-rd opened this issue Feb 1, 2023 · 1 comment

Comments

g-rd commented Feb 1, 2023

I am getting warnings like the ones below on some PDFs. They originate from poppler and are written to stderr by pdftotext, which has a "-q" flag to suppress them.

Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName

Please add an option to suppress poppler warnings, or at least handle stderr in the pdftotext wrapper. Currently the warnings go straight to stderr and, as far as I understand, cannot be caught.

Below is a version of the wrapper in which the subprocess's stderr output is sent to the logger (at debug level here), so it is easy to suppress when not needed.


def to_text(path: str, area_details: dict = None):
    """Wrapper around Poppler pdftotext.

    Parameters
    ----------
    path : str
        path of electronic invoice in PDF
    area_details : dictionary
        of the format {f: int, l: int, r: int, x: int, y: int, W: int, H: int}
        used when extracting an area of the pdf rather than the whole document

    Returns
    -------
    out : str
        returns extracted text from pdf

    Raises
    ------
    EnvironmentError:
        If pdftotext library is not found
    """
    import logging
    import shutil  # distutils is deprecated and removed in Python 3.12
    import subprocess

    logger = logging.getLogger(__name__)

    if shutil.which("pdftotext"):
        cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
        if area_details is not None:
            # An area was specified
            # Validate the required keys were provided
            assert 'f' in area_details, 'Area f details missing'
            assert 'l' in area_details, 'Area l details missing'
            assert 'r' in area_details, 'Area r details missing'
            assert 'x' in area_details, 'Area x details missing'
            assert 'y' in area_details, 'Area y details missing'
            assert 'W' in area_details, 'Area W details missing'
            assert 'H' in area_details, 'Area H details missing'
            # Convert all of the values to strings
            for key in area_details.keys():
                area_details[key] = str(area_details[key])
            cmd += [
                '-f', area_details['f'],
                '-l', area_details['l'],
                '-r', area_details['r'],
                '-x', area_details['x'],
                '-y', area_details['y'],
                '-W', area_details['W'],
                '-H', area_details['H'],
            ]
        cmd += [path, "-"]
        # Run the extraction
        out, err = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
        if err:
            # Forward each non-empty stderr line from pdftotext to the logger
            for line in err.decode(errors="replace").splitlines():
                if line:
                    logger.debug(line)
        return out
    else:
        raise EnvironmentError(
            "pdftotext not installed. Can be downloaded from https://poppler.freedesktop.org/"
        )
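
To show why routing stderr through the logger makes the messages easy to suppress, here is a minimal, self-contained sketch. The logger name "pdftotext_wrapper" and the helper log_stderr are hypothetical and used only for illustration; they are not part of the library.

```python
import logging

# Hypothetical logger name, assumed only for this illustration.
logger = logging.getLogger("pdftotext_wrapper")


def log_stderr(err: bytes) -> int:
    """Send each non-empty stderr line to the logger and return the line count."""
    lines = [ln for ln in err.decode(errors="replace").splitlines() if ln.strip()]
    for ln in lines:
        logger.debug(ln)
    return len(lines)


sample = (
    b'Syntax Warning: Could not parse ligature component "no" '
    b'of "no_break_space" in parseCharName\n'
)

# With the effective level at INFO, the debug records are dropped.
logging.basicConfig(level=logging.INFO)
log_stderr(sample)

# Lowering this one logger to DEBUG makes the poppler messages visible again.
logger.setLevel(logging.DEBUG)
log_stderr(sample)
```

With this approach the caller decides whether to see the messages by setting the logger level, instead of them always ending up on stderr.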

Collaborator

rmilecki commented Feb 3, 2023

I have been thinking about adding a -q option for some time now (ever since I added some info prints, actually). It may be a good idea to combine it with pdftotext's own -q flag. I'm planning to work on that.
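
Purely as a sketch of how such an option might be combined with pdftotext, the command could be built with pdftotext's "-q" flag added on request. The function name build_pdftotext_cmd and the quiet parameter are hypothetical, not an existing API of this project.

```python
from typing import List


def build_pdftotext_cmd(path: str, quiet: bool = False) -> List[str]:
    """Build the pdftotext command line (hypothetical helper for illustration)."""
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if quiet:
        # "-q" tells pdftotext not to print messages or errors
        cmd.append("-q")
    # "-" as the output file makes pdftotext write the text to stdout
    cmd += [path, "-"]
    return cmd


print(build_pdftotext_cmd("invoice.pdf", quiet=True))
```

A flag like this silences the warnings at the source, while capturing stderr keeps them available to anyone who wants to log them.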
