How would you parse this line? (bol.com) #359

MrMoronIV · 2021-10-01T10:23:55Z

Template:

Omschrijving                    Aantal     Prijs/st   Korting     Bedrag     BTW%               BTW


Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel




                                              Subtotaal ex. BTW                        \u20ac       18,60
                                              21% BTW                                  \u20ac       3,90
                                              Bedrag incl. BTW                         \u20ac       22,50


                                              Totaalbedrag                             \u20ac       22,50

The mess above is the line in question. I can grab everything except the product description using this:

\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?

I tried capturing the first line of the description using this:

(?P<description>.+)[\r\n]\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+\u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?

However, with that regex no lines are matched at all anymore.

Is it possible to capture the description and amounts at the same time? Or how would I approach this situation?


sergiuturus · 2021-10-05T06:02:28Z

I'm facing the same issue: the product description is spread over two rows, but there is nothing on the row between them. How can you capture the description in this case?

MrMoronIV · 2021-10-08T04:31:39Z

I think the problem is that the parser doesn't support line breaks in the regex. It would be a great start to at least have the first line of the description.

If somebody knows a workaround or fix, it's highly appreciated.
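To illustrate the suspicion above, here is a minimal sketch. It assumes the parser splits the extracted text on newlines and applies the regex to each line separately (an assumption about the internals, not confirmed in this thread), and that the `\u20ac` in the dump is an escaped euro sign. The sample text and the shortened amounts pattern are adapted from the first post.

```python
import re

# Sample text modelled on the extracted invoice in the first post.
TEXT = (
    "Double A printpapier - A4 - 1\n"
    "                                    1    € 22,50                € 22,50        21%         €     3,90\n"
    "DOOS - 5 pakken x 500 vel\n"
)

# Shortened version of the amounts regex from the first post (tax part omitted).
AMOUNTS = (
    r"\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})"
    r"\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})"
    r"\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})"
)
WITH_DESCRIPTION = r"(?P<description>.+)[\r\n]" + AMOUNTS

# Applied to the whole text, the line break inside the pattern can match.
whole = re.search(WITH_DESCRIPTION, TEXT)

# Applied line by line, no single line contains a line break,
# so the [\r\n] part can never match.
per_line = [re.search(WITH_DESCRIPTION, line) for line in TEXT.split("\n")]

print(whole.group("description") if whole else None)
print([m is not None for m in per_line])
```

If the parser really works line by line, this would explain why the regex is fine in isolation but finds nothing inside a template.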

bosd · 2022-01-27T12:48:35Z

Struggling with the same issue on aliexpress invoices.
Can you share your bol.com template?

MrMoronIV · 2022-01-28T07:47:57Z

This issue has not been solved yet. The code you're asking for is in the first post; otherwise it's just a default template.

As stated earlier, it should start to work once line breaks are supported, but someone has to program that.

bosd · 2022-01-28T08:27:21Z

Just tested the code with the description group on regex101.com.
Got an error on the \u parts.
Replacing them with a . seems to work.
It catches the description partially.

(This is where I'm at on aliexpress invoices, 80/20 rule.)

I'm running into the limitations of the debug website. Will look into this when I have access to an install, as the module handles multi-line text differently.
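A small sketch of the `\u` point, for reference. Python's `re` module accepts `\uXXXX` escapes in string patterns, while PCRE-style engines (which I believe is what regex101 uses by default) generally reject them and expect something like `\x{20ac}` instead. The sample lines are made up for illustration and assume the extracted text contains a real euro sign.

```python
import re

line = "1    € 22,50"

# Python understands the \u20ac escape inside the pattern.
escaped = re.search(r"\u20ac\s+(\d+,\d{2})", line)

# A literal euro sign in the pattern is equivalent.
literal = re.search(r"€\s+(\d+,\d{2})", line)

# Replacing the escape with "." also matches, but "." accepts any character,
# so it would also match a line that has no euro sign at all.
wildcard = re.search(r".\s+(\d+,\d{2})", line)
too_loose = re.search(r".\s+(\d+,\d{2})", "1    x 22,50")

print(escaped.group(1), literal.group(1), wildcard.group(1))
print(too_loose is not None)
```

So the `.` replacement is a workable shortcut for testing on the debug site, at the cost of a looser match.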

bosd · 2022-01-29T19:04:02Z

Have you tried replacing all the line breaks? I've had some luck with that on gas station invoices.
It seems to do the replacement before the text goes through the parser.

The parser spreads the actual description over multiple lines, so the output looks like:

Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel

Which makes it impossible to extract:

Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel

The replacement of line breaks made it go on my invoices to something like:

Double A printpapier - A4 - 1 DOOS - 5 pakken x 500 vel  1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90

I used this configuration to replace the line breaks:

options:
  currency: EUR
  languages:
    - nl
  decimal_separator: ','
  replace:
    - ['\n' ,'']

bosd · 2022-02-01T07:20:29Z

Forget my previous statement about removing line breaks.
I am still learning this module as well.
Best bet is to use the lines plugin, telling it where to start, where to stop, and what the first line and follow-up lines look like.
It's still kind of hard to debug without the original invoice file.
(Did you post the extracted or the optimized string?)

Try something like:

lines:
    start: Omschrijving
    end: Subtotaal ex
    first_line:  (?P<description>\w+(?:\S|[ ]\w\w+|\n)*)[\n]?\s+\s+(?P<amount_nr>-?\d{1,10},?\d{0,2})\s+.u20ac\s+(?P<amount_piece>-?\d{1,10},\d{0,2})\s+.u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{0,2})(?:\s+(?P<tax_perc>-?\d{1,10}%)\s+.u20ac\s+(?P<amount_tax>-?\d{1,10}.?\d{0,2}))?
    line:  '^(?P<description>.+)$'

or
line: '^(?P<description>\w+(?:\S|[ ]\w\w+|\n)*)$'

Might still need some work on the description part.

MrMoronIV · 2022-02-01T07:38:31Z

Like I said, the technical code is at the top: the extracted string from the PDF first, my attempt at a regex second. My regex for multiple lines works fine, it's just that this program apparently can't deal with such a regex. The solution is not in the template, it's in fixing the source code.

bosd · 2022-02-01T14:36:43Z

Sorry, but without the template and input file, I am unable to help.

Just to be clear:
where did you get this code from? It does not look like the original human-readable PDF.

Omschrijving                    Aantal     Prijs/st   Korting     Bedrag     BTW%               BTW


Double A printpapier - A4 - 1
                                    1    \u20ac 22,50                \u20ac 22,50        21%         \u20ac     3,90
DOOS - 5 pakken x 500 vel




                                              Subtotaal ex. BTW                        \u20ac       18,60
                                              21% BTW                                  \u20ac       3,90
                                              Bedrag incl. BTW                         \u20ac       22,50


                                              Totaalbedrag                             \u20ac       22,50

Weird as it may sound, in my experience working with this module the debug window shows different strings.
I've had similar data as above, but when I changed the template the parser handled the PDF document differently.
It would be easier if you posted the original PDF file (albeit anonymized).
This module is quite capable of handling multi-line text, but it does require some fiddling around with the options for line extraction and tables.

Example of multi-line extraction:
PDF: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.pdf
Template: https://github.com/invoice-x/invoice2data/blob/master/src/invoice2data/extract/templates/de/de.qualityhosting.yml
Output: https://github.com/invoice-x/invoice2data/blob/master/tests/compare/QualityHosting.json

Oddly, with your regex code I do get pattern errors.

rmilecki · 2023-08-06T17:32:49Z

There is really no easy/clean way to parse such lines. The problem is vertical alignment of table cells content.

Ideally we should ask pdftotext to vertically align every table cell to the top. That isn't easy, however, as PDFs in general don't have a concept of tables. So it's hard for pdftotext to detect table cells and handle them according to some extra requests.
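One possible workaround outside the template, sketched under a strong assumption: each item's rows form a run of non-blank lines, and items are separated from each other by at least one blank line, as in the sample from the first post. The function names and the simplified amounts pattern are hypothetical and not part of invoice2data. Within each run, the row matching the amounts supplies the numbers and the remaining rows are joined into the description, regardless of whether they sit above or below the amounts.

```python
import re
from itertools import groupby

# Sample modelled on the first post; assumes "\u20ac" there is an escaped euro sign.
TEXT = """\
Omschrijving                    Aantal     Prijs/st   Korting     Bedrag     BTW%               BTW


Double A printpapier - A4 - 1
                                    1    € 22,50                € 22,50        21%         €     3,90
DOOS - 5 pakken x 500 vel



                                              Subtotaal ex. BTW                        €       18,60
"""

# Simplified amounts pattern (quantity, unit price, total); tax columns are ignored.
AMOUNTS = re.compile(
    r"^\s+(?P<amount_nr>-?\d{1,10})"
    r"\s+\u20ac\s+(?P<amount_piece>-?\d{1,10},\d{2})"
    r"\s+\u20ac\s+(?P<amount_total_incl>-?\d{1,10},\d{2})"
)


def table_lines(text, start, end):
    """Return the lines strictly between the header line and the end marker."""
    lines = text.split("\n")
    first = next(i for i, l in enumerate(lines) if start in l) + 1
    last = next(i for i, l in enumerate(lines) if i >= first and end in l)
    return lines[first:last]


def parse_items(text, start="Omschrijving", end="Subtotaal ex"):
    """Group non-blank runs of lines and merge each run into one item."""
    items = []
    region = table_lines(text, start, end)
    for has_text, group in groupby(region, key=lambda l: bool(l.strip())):
        if not has_text:
            continue  # blank separator between items
        item, description = {}, []
        for line in group:
            match = AMOUNTS.match(line)
            if match:
                item.update(match.groupdict())
            else:
                description.append(line.strip())
        if item:  # keep only runs that contain an amounts row
            item["description"] = " ".join(description)
            items.append(item)
    return items


items = parse_items(TEXT)
print(items)
```

This breaks down as soon as two items are not separated by a blank line, which is exactly the vertical alignment problem described above, so it is a heuristic rather than a fix.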

bosd mentioned this issue Jan 30, 2022

How to parse long lines? #360

Closed

bosd mentioned this issue Sep 5, 2022

[ADD] Pdfplumber support #391

Closed
