Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADD] Pdfplumber support #391

Closed
wants to merge 16 commits into from
Closed

[ADD] Pdfplumber support #391

wants to merge 16 commits into from

Conversation

bosd
Copy link
Collaborator

@bosd bosd commented Sep 5, 2022

This is a highly opinionated basic implementation of pdfplumber.
I've created this in an attempt to parse invoices with an html style table as mentioned in #359

It's working, but currently I'm developing an integration with camelot.
As Camelot allows more complex table structures to be parsed and a gui to help build the templates.

Creating this PR in case someone needs it.

@bosd
Copy link
Collaborator Author

bosd commented Sep 5, 2022

Propably need to tell the tests to install pdfplumber.
@m3nu Any ideas how to do that?

@bosd bosd force-pushed the pdfplumber branch 4 times, most recently from 65f3d3c to 61d493e Compare September 19, 2022 12:10
@bosd
Copy link
Collaborator Author

bosd commented Sep 19, 2022

All ✔️ now!
But in this state it is not something we would like to merge.
Need to clean commits and dependency issue on pdfminer.
Therefore will convert this one back to draft.

@bosd bosd marked this pull request as draft September 19, 2022 12:40
@bosd
Copy link
Collaborator Author

bosd commented Sep 19, 2022

depends upon: #395

@bosd bosd force-pushed the pdfplumber branch 2 times, most recently from f5e4c18 to dd29930 Compare September 21, 2022 08:37
@bosd bosd requested a review from m3nu September 21, 2022 09:32
@bosd bosd force-pushed the pdfplumber branch 3 times, most recently from 5b5ea25 to e07fc07 Compare September 23, 2022 06:30
@bosd bosd marked this pull request as ready for review September 23, 2022 07:01
@bosd bosd requested a review from rmilecki September 24, 2022 18:02
@bosd bosd added this to the 0.4.0 release milestone Sep 24, 2022
@rmilecki
Copy link
Collaborator

Looks good except for the list of commits in your bosd:pdfplumber branch. There is some merge commit and a lot of work-in-progress commits.

In my personal taste we shouldn't merge such commits into invoice2data' master branch. Those commits don't bring much/any value, make git log a bit messy and make even break git bisect if someone ever tries that.

My personal suggestion / preference would be to squash all those commits into a single one with some clear subject.

@@ -56,6 +56,7 @@ Choose any of the following input readers:
- pdftotext `invoice2data --input-reader pdftotext invoice.pdf`
- tesseract `invoice2data --input-reader tesseract invoice.pdf`
- pdfminer.six `invoice2data --input-reader pdfminer invoice.pdf`
- pdf plumber `invoice2data --input-reader pdfplumber invoice.pdf`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking: I'd replace pdf plumber with pdfplumber for consistency.

@bosd
Copy link
Collaborator Author

bosd commented Sep 24, 2022

In my personal taste we shouldn't merge such commits into invoice2data' master branch.

@rmilecki
I totally agree. Was really struggling on this one. (working on 2 dev systems one windows and one linux machine)
Will try to fiup here. otherwise start over on new branch and pr.

@bosd bosd mentioned this pull request Sep 24, 2022
@bosd bosd closed this Sep 24, 2022
@bosd bosd deleted the pdfplumber branch November 5, 2022 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants