How would you parse this line? (bol.com) #359

Open
MrMoronIV opened this issue Oct 1, 2021 · 10 comments

@MrMoronIV

MrMoronIV commented Oct 1, 2021

Template:

Omschrijving                    Aantal     Prijs/st   Korting     Bedrag     BTW%               BTW


Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel




                                              Subtotaal ex. BTW                        \u20ac       18,60
                                              21% BTW                                  \u20ac       3,90
                                              Bedrag incl. BTW                         \u20ac       22,50


                                              Totaalbedrag                             \u20ac       22,50

The mess above is the line in question. I can grab everything except the product description using this:

\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?

I tried capturing the first line of the description using this:

(?P<description>.+)[\r\n]\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?

However, with that regex no lines are found at all.

Is it possible to capture the description and amounts at the same time? Or how would I approach this situation?
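
A minimal Python sketch of the symptom described above. It rests on two assumptions: the extracted text contains a real euro sign (shown as \u20ac in the debug output), and the parser tries the pattern against one line at a time. The second assumption is a guess from the behaviour, not something confirmed in this thread. `TEXT`, `AMOUNTS` and `WITH_DESCRIPTION` are names made up for the example.

```python
import re

# Sample item, shortened; assumes the PDF text holds a real euro sign.
TEXT = (
    "Double A printpapier - A4 - 1\n"
    "                 1    \u20ac 22,50        \u20ac 22,50     21%      \u20ac     3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
)

# The amounts pattern from the post, split over several lines for readability.
AMOUNTS = (
    r"\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})"
    r"\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})"
    r"\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})"
    r"(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?"
)
WITH_DESCRIPTION = r"(?P<description>.+)[\r\n]" + AMOUNTS


def match_whole_text():
    """Search the full text, so the [\\r\\n] in the pattern can match."""
    m = re.search(WITH_DESCRIPTION, TEXT)
    return m.groupdict() if m else None


def match_per_line():
    """Search each line on its own; no single line contains a line break."""
    return [re.search(WITH_DESCRIPTION, line) is not None
            for line in TEXT.splitlines()]


if __name__ == "__main__":
    print(match_whole_text())
    print(match_per_line())
```

If the guess is right, this would explain why adding `(?P<description>.+)[\r\n]` makes every line disappear: the pattern itself is fine on the whole text, but it can never match a single line.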

@sergiuturus

I'm facing the same issue: the product's description spans 2 rows, but there's nothing on the row between these two. How can you capture the description in this case?

@MrMoronIV
Author

I think the problem is that the parser doesn't support line breaks in the regex. It would be a great start to at least have the first line of the description.

If somebody knows a workaround or fix, it would be highly appreciated.

@bosd
Collaborator

bosd commented Jan 27, 2022

Struggling with the same issue on aliexpress invoices.
Can you share your bol.com template?

@MrMoronIV
Author

This issue has not been solved yet. The code you're asking for is in the first post; otherwise it's just a default template.

As stated earlier, it should start to work once line breaks are supported, but someone has to program that.

@bosd
Collaborator

bosd commented Jan 28, 2022

Just tested the code with the description on a regex debugger (regex101.com).
Got an error on the \u parts.
Replacing them with a . seems to work.
It catches the description partially.

(This is where I'm at on AliExpress invoices: the 80/20 rule.)

I'm running into the limitations of the debug website. I will look into this when I have access to an install, as the module handles multi-line text differently.

@bosd
Collaborator

bosd commented Jan 29, 2022

Have you tried replacing all the line breaks? I've had some luck with that on gas station invoices.
It seems to do the replacement before the text goes through the parser.

The parser spreads the actual description over multiple lines, so the output looks like:

Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel

Which makes it impossible to extract:

Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel

On my invoices, replacing the line breaks turned the text into something like:

Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel  1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90

I used this config to replace the line breaks:

options:
  currency: EUR
  languages:
    - nl
  decimal_separator: ','
  replace:
    - ['\n' ,'']
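
A rough Python sketch of what such a replacement does to the sample from the first post, assuming the `replace` pairs behave like plain string substitution applied to the whole text before any regex runs (the module's actual implementation is not shown in this thread). `apply_replace` is a hypothetical helper, and the sample spacing is shortened.

```python
# Shortened sample item; assumes the PDF text holds a real euro sign.
SAMPLE = (
    "Double A printpapier - A4 - 1\n"
    "        1    \u20ac 22,50    \u20ac 22,50    21%    \u20ac  3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
)


def apply_replace(text, pairs):
    """Apply each (old, new) pair in order as a plain string substitution."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


# Same pair as in the options block: drop every line break.
flat = apply_replace(SAMPLE, [("\n", "")])

if __name__ == "__main__":
    print(flat)
```

On this particular sample the item becomes one line, but the second description fragment lands after the amounts (and directly against the last amount, with no space), so the description is still not contiguous. Header and totals lines of a full invoice would be merged into the same line as well.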

@bosd
Collaborator

bosd commented Feb 1, 2022

Forget my previous statement about removing line breaks.
I am still learning this module as well.
The best bet is to use the lines plugin, telling it where to start, where to stop, and what the first line and follow-up lines look like.
It's still kind of hard to debug without the original invoice file.
(Did you post the extracted or the optimized string?)

Try something like:

lines:
    start: Omschrijving
    end: Subtotaal ex
    first_line:  (?P<description>\w+(?:\S|[ ]\w\w+|\n)*)[\n]?\s+\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+.u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+.u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+.u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?
    line:  '^(?P<description>.+)$'

or
line: '^(?P<description>\w+(?:\S|[ ]\w\w+|\n)*)$'

Might still need some work on the description part.

@MrMoronIV
Author

Like I said, the technical details are at the top: first the string extracted from the PDF, then my attempt at a regex. My regex for multiple lines works fine; it's just that this program apparently can't deal with such a regex. The solution is not in the template, it's in fixing the source code.

@bosd
Collaborator

bosd commented Feb 1, 2022

Sorry, but without the template and input file, I am unable to help.

Just to be clear:
Where did you get this text from? It does not look like the original human-readable PDF.

Omschrijving                    Aantal     Prijs/st   Korting     Bedrag     BTW%               BTW


Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel




                                              Subtotaal ex. BTW                        \u20ac       18,60
                                              21% BTW                                  \u20ac       3,90
                                              Bedrag incl. BTW                         \u20ac       22,50


                                              Totaalbedrag                             \u20ac       22,50

As weird as it may sound, in my experience working with this module the debug window shows different strings.
I've had similar data to the above, but when I changed the template the parser handled the PDF document differently.
It would be easier if you posted the original PDF file (albeit anonymized).
This module is quite capable of handling multi-line text, but it does require some fiddling with the options for line extraction and tables.

Example of multi-line extraction:
PDF: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.pdf
Template: https://github.com/invoice-x/invoice2data/blob/master/src/invoice2data/extract/templates/de/de.qualityhosting.yml
Output: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.json

Oddly, with your regex code I do get pattern errors.

@rmilecki
Collaborator

rmilecki commented Aug 6, 2023

There is really no easy/clean way to parse such lines. The problem is the vertical alignment of the table cells' content.

Ideally we should ask pdftotext to vertically align every table cell to the top. That isn't easy, however, as PDFs in general don't have a concept of tables, so it's hard for pdftotext to detect table cells and treat them specially.
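
One possible workaround outside the template, sketched in Python under stated assumptions: the text between the header row and "Subtotaal ex" has already been cut out, items are separated from each other by blank lines, only the amounts row starts with whitespace, and the euro sign is a real character in the text. `parse_items` and `AMOUNTS` are hypothetical names; this is post-processing of the extracted text, not a feature of the module.

```python
import re

# Amounts row as in the first post, anchored to a line that starts with spaces.
AMOUNTS = re.compile(
    r"^\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})"
    r"\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})"
    r"\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})"
    r"(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10},?\d{0,2}))?"
)


def parse_items(table_text):
    """Group lines into blank-line-separated blocks, one block per item.

    Within a block, the first line matching AMOUNTS supplies the numbers and
    every other non-empty line is treated as a description fragment.
    """
    items = []
    for block in re.split(r"\n\s*\n", table_text.strip("\n")):
        fragments, amounts = [], None
        for line in block.splitlines():
            m = AMOUNTS.match(line)
            if m and amounts is None:
                amounts = m.groupdict()
            elif line.strip():
                fragments.append(line.strip())
        if amounts:
            items.append({"description": " ".join(fragments), **amounts})
    return items


SAMPLE = (
    "Double A printpapier - A4 - 1\n"
    "          1    \u20ac 22,50      \u20ac 22,50     21%     \u20ac     3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
)

if __name__ == "__main__":
    for item in parse_items(SAMPLE):
        print(item)
```

Because the description fragments sit above and below the vertically centred amounts row, joining everything in the block except that row sidesteps the ordering problem. It breaks down if items are not separated by blank lines or if a description line happens to look like an amounts row.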
