add pdfplumber input module #404

Merged
merged 2 commits into from
Oct 22, 2022
3 changes: 2 additions & 1 deletion README.md
@@ -8,7 +8,7 @@ A command line tool and Python library to support your accounting
 process.

 1. extracts text from PDF files using different techniques, like
-   `pdftotext`, `pdfminer` or OCR -- `tesseract`, `tesseract4` or
+   `pdftotext`, `pdfminer`, `pdfplumber` or OCR -- `tesseract`, `tesseract4` or
    `gvision` (Google Cloud Vision).
 2. searches for regex in the result using a YAML-based template system
 3. saves results as CSV, JSON or XML or renames PDF files to match the content.
@@ -56,6 +56,7 @@ Choose any of the following input readers:
 - pdftotext `invoice2data --input-reader pdftotext invoice.pdf`
 - tesseract `invoice2data --input-reader tesseract invoice.pdf`
 - pdfminer.six `invoice2data --input-reader pdfminer invoice.pdf`
+- pdfplumber `invoice2data --input-reader pdfplumber invoice.pdf`
 - tesseract4 `invoice2data --input-reader tesseract4 invoice.pdf`
 - gvision `invoice2data --input-reader gvision invoice.pdf` (needs `GOOGLE_APPLICATION_CREDENTIALS` env var)

2 changes: 1 addition & 1 deletion setup.cfg
@@ -41,7 +41,7 @@ install_requires =
     unidecode

 [options.extras_require]
-test = pytest; pytest-cov; flake8; pdfminer.six; tox
+test = pytest; pytest-cov; flake8; pdfminer.six; pdfplumber; tox

 [options.entry_points]
 console_scripts =
48 changes: 48 additions & 0 deletions src/invoice2data/input/pdfplumber.py
@@ -0,0 +1,48 @@
+# -*- coding: utf-8 -*-
+
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def to_text(path):
+    """Wrapper around `pdfplumber`.
+    Parameters
+    ----------
+    path : str
+        path of electronic invoice in PDF
+    Returns
+    -------
+    str : str
+        returns extracted text from pdf
+    """
+    try:
+        import pdfplumber
+    except ImportError:
+        logger.debug("Cannot import pdfplumber")
+
+    raw_text = ""
+    raw_text = raw_text.encode(encoding='UTF-8')
+    with pdfplumber.open(path, laparams={"detect_vertical": True}) as pdf:
+        pages = []
+        for pdf_page in pdf.pages:
+            pages.append(
+                pdf_page.extract_text(
+                    layout=True, use_text_flow=True, x_tolerance=6, y_tolerance=4, keep_blank_chars=True
+                )  # y_tolerance=6, dirty Fix for html table problem
+            )
+        res = {
+            "all": "\n\n".join(pages),
+            "first": pages and pages[0] or "",
+        }
+    logger.debug("Text extraction made with pdfplumber")
+
+    raw_text = res_to_raw_text(res)
+    return raw_text.encode("utf-8")
+
+
+def res_to_raw_text(res):
+    # we need to convert result to raw text:
+    raw_text_dict = res
+    raw_text = (raw_text_dict["first"] or raw_text_dict["all"])
+    return raw_text
2 changes: 2 additions & 0 deletions src/invoice2data/main.py
@@ -9,6 +9,7 @@

 from .input import pdftotext
 from .input import pdfminer_wrapper
+from .input import pdfplumber
 from .input import tesseract
 from .input import tesseract4
 from .input import gvision
@@ -27,6 +28,7 @@
     "tesseract": tesseract,
     "tesseract4": tesseract4,
     "pdfminer": pdfminer_wrapper,
+    "pdfplumber": pdfplumber,
     "gvision": gvision,
 }
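
The mapping in this hunk is presumably what lets `--input-reader pdfplumber` reach the new module. A rough sketch of that kind of lookup, using stand-in readers so it runs without any PDF library installed. The names `input_mapping`, `read_invoice`, and the stand-in objects are hypothetical; the only assumption taken from the diff is that each reader exposes `to_text(path)`.

```python
from types import SimpleNamespace

# Stand-ins for the reader modules (hypothetical); each exposes to_text(path).
pdftotext = SimpleNamespace(to_text=lambda path: b"text via pdftotext")
pdfplumber = SimpleNamespace(to_text=lambda path: b"text via pdfplumber")

# Hypothetical name for the dict shown in the hunk: reader name -> module.
input_mapping = {
    "pdftotext": pdftotext,
    "pdfplumber": pdfplumber,
}


def read_invoice(path, reader_name="pdftotext"):
    """Look up the reader by name and return its extracted text."""
    try:
        reader = input_mapping[reader_name]
    except KeyError:
        known = ", ".join(sorted(input_mapping))
        raise ValueError(
            f"unknown input reader {reader_name!r}; choose one of: {known}"
        ) from None
    return reader.to_text(path)
```

With a dict like this, registering a reader is a one-line change, which is consistent with the two-line diff to `main.py`.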

7 changes: 6 additions & 1 deletion tests/test_lib.py
@@ -19,7 +19,7 @@
 import unittest

 from invoice2data.main import extract_data
-from invoice2data.input import pdftotext, tesseract, pdfminer_wrapper
+from invoice2data.input import pdftotext, tesseract, pdfminer_wrapper, pdfplumber
 from invoice2data.output import to_csv, to_json, to_xml
 from .common import get_sample_files

@@ -86,6 +86,11 @@ def test_extract_data_pdfminer(self):
                 self.assertTrue(False, "pdfminer is not installed")
             self.assertTrue(type(res) is str, "return is not a string")

+    def test_extract_data_pdfplumber(self):
+        pdf_files = get_sample_files('.pdf')
+        for file in pdf_files:
+            extract_data(file, None, pdfplumber)
+
     def test_tesseract_for_return(self):
         png_files = get_sample_files('.png')
         for file in png_files: