parsers: lines: support "rules" for multiple sets of regexes #407

Closed


rmilecki
Collaborator

templates: make Amazon YAML use "rules" for "lines"

This doesn't change any behavior and isn't strictly necessary. It does,
however, allow testing the "rules" implementation in the "lines" parser.
lines: support "rules": field for multiple sets of parsing regexes

Sometimes companies use more than 1 format for line-parseable data. They
may e.g. randomly add some extra columns that are used occasionally.

This commit adds "rules" field support to the "lines" parser. It allows
defining multiple sets of regexes ("start", "end", "line" & friends) for
a single field.

Usage of "rules" is optional. Backward compatibility with existing
templates is preserved.
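
A minimal sketch of how an optional "rules" field could coexist with the existing top-level keys. This is an illustration only, not the code from this pull request: `normalize_rules` and `RULE_KEYS` are hypothetical names, and only "start", "end" and "line" are handled here.

```python
# Hypothetical helper, NOT the actual invoice2data implementation.
# Only the three keys named in the description are covered.
RULE_KEYS = ("start", "end", "line")


def normalize_rules(settings):
    """Return the list of rule dicts for one "lines" field.

    If "rules" is present, each entry is copied as-is. Otherwise the
    top-level keys are wrapped into a single-rule list, so templates
    written before "rules" existed keep working unchanged.
    """
    if "rules" in settings:
        return [dict(rule) for rule in settings["rules"]]
    return [{key: settings[key] for key in RULE_KEYS if key in settings}]


# Old-style template fragment: one set of regexes at the top level.
old_style = {
    "parser": "lines",
    "start": "Barcode",
    "end": "Netto totaal",
    "line": r"(?P<barcode>\d{13})",
}

# New-style template fragment: several sets of regexes under "rules".
new_style = {
    "parser": "lines",
    "rules": [
        {"start": "Barcode", "end": "Netto totaal", "line": r"(?P<barcode>\d{13})"},
        {"start": "Barcode", "end": "Netto totaal", "line": r"(?P<line_note>---.*---)"},
    ],
}

old_rules = normalize_rules(old_style)
new_rules = normalize_rules(new_style)
```

With this shape, the rest of a parser could always iterate over a list of rules, whichever syntax the template author used.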

@rmilecki
Collaborator Author

This is an alternative implementation of the feature suggested in #377.

This should resolve:
#238
#377

Rafał Miłecki added 2 commits September 25, 2022 12:09
Sometimes companies use more than 1 format for line-parseable data. They
may e.g. randomly add some extra columns that are used occasionally.

This commit adds "rules" field support to the "lines" parser. It allows
defining multiple sets of regexes ("start", "end", "line" & friends) for
a single field.

Usage of "rules" is optional. Backward compatibility with existing
templates is preserved.

Signed-off-by: Rafał Miłecki <[email protected]>
This doesn't change any behavior and isn't strictly necessary. It does,
however, allow testing the "rules" implementation in the "lines" parser.

Signed-off-by: Rafał Miłecki <[email protected]>
@rmilecki
Collaborator Author

@bosd: I think this implementation is a bit simpler than the one suggested in #377. I hope my changes are simple to understand and review.

One advantage I see in this approach is that the lines parser stays focused on the parser-based syntax (the new syntax).

If we want to support new features in the old syntax, I believe that code should go into the lines plugin code.

@bosd
Collaborator

bosd commented Sep 25, 2022

To start off: I'm all in for cleaner and easier-to-understand code.
However, I think this PR does not achieve the same thing.

edit: added txt input for easier testing

Mekro B.V. Pagina 1
Moermanstraat 4 Invoicedate : 01-11-2021 16:34
Info: Mekro Service Center 5222 BD 's-Hertogenbosch Printdatum : 01-11-2021
16:35
Telefoon: 0900-2025000
Postbus 159, 2290 AD Wateringen
www.mekro.nl
K.v.K: 33166113 OB nr: NL001799434B01
IBAN: NL44 INGB 0702 5937 02
BIC : INGBNL2A
Invoicenumber: 0/0(057)0004/041503 (004-374170) 057/119
----------------------------------------------------------------------------------------------------
Efficient Invoice Handling Klantnummer : 057 726666 01 01 SC
Sessamestreet 46
5555 NH TommyCity
----------------------------------------------------------------------------------------------------
Stuks per Prijs per Code Prijs st/kg
Barcode                 Description          qty uom         unitprice discount
----------------------------------------------------------------------------------------------------
---FOOD ITEMS---
2231012001992 KROKETBROODJES                             2 KG           1,00 0,0%
8713009019455 Oil                                        3 L           0,50 0,0%
8713009019475 Apple                                      1 KG           50,0 0,0%
---OTHER ITEMS---
8713009019375 programmer                                 1 Hour        50,0 100,0%
0013009019475 Sticker                                    1 pce          0,0 0,0%

Aantal stuks: 7 Netto totaal: 5,50
Excl.BTW Code BTW BTW Totaal
0 1=21,00% 0,10 0.00
0 5= 9,00%           0,0 0,00
------------------------------------------------
53,85        6,05          59,90
----------------------------------------------------------------------------------------------------
To Pay 15,50
POI: 52001324 KLANTTICKET --------------------------------
Terminal: BS111850 Merchant: 9533494654 Period: 1305
Transactie: 00000055 Token: 2004130501564440011 AMERICAN EXPRESS
(A000000022010801) Kaart: 375382xxxxx1000 Kaartserienummer: 0
BETALING Datum: 01/11/2021 16:36 Autorisatiecode: 66
Visit www.americanexpress.nl Total: 5,50 EUR Contact
Leesmethode: CHIP Met PIN gevalideerd
Pin betaling 5,50
------------------------------------------------
Paid 5,50
Test with taxes; changed the "te betalen" (to pay) amount.


Repeating the functional test from #378 (comment)

I rewrote the template in the syntax of this PR:


# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
## new test here
  lines:
    parser: lines
    rules:
    - start: Barcode
      line: (?P<line_note>(---FOOD ITEMS---))
      end: Netto totaal
    - start: Barcode
      line: (?P<line_note>(---OTHER ITEMS---))
      end: Netto totaal
    - start: Barcode
      line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
      end: Netto totaal
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','

Result:

[
    {
        "issuer": "Mekro",
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "date": "2021-01-11",
        "invoice_number": "0/0(057)0004/041503",
        "iban": "NL44INGB0702593702",
        "partner_coc": "33166113",
        "partner_website": "mekro.nl",
        "lines": [
            {
                "line_note": "---FOOD ITEMS---"
            },
            {
                "line_note": "---OTHER ITEMS---"
            },
            {
                "barcode": "2231012001992",
                "name": "KROKETBROODJES",
                "qty": "2",
                "uom": "KG",
                "price_unit": "1,00",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019455",
                "name": "Oil",
                "qty": "3",
                "uom": "L",
                "price_unit": "0,50",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019475",
                "name": "Apple",
                "qty": "1",
                "uom": "KG",
                "price_unit": "50,0",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019375",
                "name": "programmer",
                "qty": "1",
                "uom": "Hour",
                "price_unit": "50,0",
                "discount": "100,0"
            },
            {
                "barcode": "0013009019475",
                "name": "Sticker",
                "qty": "1",
                "uom": "pce",
                "price_unit": "0,0",
                "discount": "0,0"
            }
        ],
        "currency": "EUR",
        "desc": "Invoice from Mekro"
    }
]

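Conclusion: the lines output is in the wrong order. Both line_note entries come first, followed by all item lines, instead of following the order in which they appear on the invoice.
@rmilecki Is it possible to achieve the same result with this code? Am I doing something wrong?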
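
One way to read the result above, sketched under the assumption that each entry in "rules" is evaluated independently over the whole block and the matches are simply concatenated. The PR's actual code is not shown in this thread, so `parse_rule` and `parse_rules` are hypothetical, and the sample text and regexes are shortened versions of the ones in the test.

```python
import re

# Shortened stand-in for the invoice text used in the test above.
SAMPLE = """\
Barcode Description qty
---FOOD ITEMS---
2231012001992 KROKETBROODJES 2
---OTHER ITEMS---
8713009019375 programmer 1
Netto totaal: 5,50
"""


def parse_rule(text, rule):
    """Collect matches of one rule between its start and end markers."""
    rows, inside = [], False
    for line in text.splitlines():
        if not inside:
            # Skip everything up to and including the start line.
            inside = re.search(rule["start"], line) is not None
            continue
        if re.search(rule["end"], line):
            break
        match = re.search(rule["line"], line)
        if match:
            rows.append(match.groupdict())
    return rows


def parse_rules(text, rules):
    """Assumed behaviour: run each rule separately, then concatenate."""
    rows = []
    for rule in rules:
        rows.extend(parse_rule(text, rule))
    return rows


RULES = [
    {"start": "Barcode", "end": "Netto totaal",
     "line": r"(?P<line_note>---\w+ ITEMS---)"},
    {"start": "Barcode", "end": "Netto totaal",
     "line": r"(?P<barcode>\d{13})\s+(?P<name>\w+)\s+(?P<qty>\d+)"},
]

rows = parse_rules(SAMPLE, RULES)
first_keys = [next(iter(row)) for row in rows]
# first_keys is grouped by rule, not by position in the document:
# ['line_note', 'line_note', 'barcode', 'barcode']
```

If the implementation works roughly like this, the grouping is a direct consequence of scanning the block once per rule, which would explain why the notes are detached from the items they introduce.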

@bosd
Collaborator

bosd commented Sep 25, 2022

For completeness, here is the desired outcome of the test:

[
    {
        "issuer": "Mekro",
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "date": "2021-01-11",
        "invoice_number": "0/0(057)0004/041503",
        "iban": "NL44INGB0702593702",
        "partner_coc": "33166113",
        "partner_website": "mekro.nl",
        "currency": "EUR",
        "lines": [
            {
                "line_note": "---FOOD ITEMS---"
            },
            {
                "barcode": "2231012001992",
                "name": "KROKETBROODJES",
                "qty": "2",
                "uom": "KG",
                "price_unit": "1,00",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019455",
                "name": "Oil",
                "qty": "3",
                "uom": "L",
                "price_unit": "0,50",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019475",
                "name": "Apple",
                "qty": "1",
                "uom": "KG",
                "price_unit": "50,0",
                "discount": "0,0"
            },
            {
                "line_note": "---OTHER ITEMS---"
            },
            {
                "barcode": "8713009019375",
                "name": "programmer",
                "qty": "1",
                "uom": "Hour",
                "price_unit": "50,0",
                "discount": "100,0"
            },
            {
                "barcode": "0013009019475",
                "name": "Sticker",
                "qty": "1",
                "uom": "pce",
                "price_unit": "0,0",
                "discount": "0,0"
            }
        ],
        "desc": "Invoice from Mekro"
    }
]

@bosd
Collaborator

bosd commented Oct 22, 2022

@rmilecki What should we do with this PR / functionality?

@rmilecki rmilecki marked this pull request as draft October 22, 2022 14:04
@rmilecki
Collaborator Author

I need to rework this: describe it better, provide a use case and a test, and probably avoid modifying the Amazon YAML, as there is no strong reason for that.

Converted into a draft for now.

In the meantime, I think we can focus on #423.

@bosd
Collaborator

bosd commented Oct 22, 2022

@rmilecki No worries, you'll have some time for this.
I just want to let you know that I really want this feature.

Since we've merged #417,
I'm adapting real invoices and a template from Coolblue, which we can add as an example.
Sadly, I have to conclude that #417 is still no real alternative for #378, as it does not allow parsing multiple blocks and multiple line definitions.
Or maybe I just don't know the correct syntax :)

@bosd bosd added this to the 0.4.0 release milestone Oct 22, 2022
@bosd bosd removed this from the 0.4.0 release milestone Nov 29, 2022
@rmilecki
Collaborator Author

rmilecki commented Feb 3, 2023

> To start off: I'm all in, for cleaner and easier to understand code.
> However, I think this PR does not achieve the same thing.

That particular case ended up being discussed in #428. It seems we can already support such invoices with the current code. There may be more than one way of handling such complex lines, depending on the expected output.

As for the changes from this pull request, I should rewrite them and add a custom test. I'll open another pull request for that when I have it ready.

@rmilecki
Collaborator Author

One more update: Mekro invoices can be parsed the way @bosd expected since #417. It can be done with something like:

  lines:
    parser: lines
    start: Barcode
    line:
      - (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
      - ---(?P<line_note>.*ITEMS)---
    end: Netto totaal

The template fragment above parses the invoice provided by @bosd into:

        "lines": [
            {
                "line_note": "FOOD ITEMS"
            },
            {
                "barcode": "2231012001992",
                "name": "KROKETBROODJES",
                "qty": "2",
                "uom": "KG",
                "price_unit": "1,00",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019455",
                "name": "Oil",
                "qty": "3",
                "uom": "L",
                "price_unit": "0,50",
                "discount": "0,0"
            },
            {
                "barcode": "8713009019475",
                "name": "Apple",
                "qty": "1",
                "uom": "KG",
                "price_unit": "50,0",
                "discount": "0,0"
            },
            {
                "line_note": "OTHER ITEMS"
            },
            {
                "barcode": "8713009019375",
                "name": "programmer",
                "qty": "1",
                "uom": "Hour",
                "price_unit": "50,0",
                "discount": "100,0"
            },
            {
                "barcode": "0013009019475",
                "name": "Sticker",
                "qty": "1",
                "uom": "pce",
                "price_unit": "0,0",
                "discount": "0,0"
            }
        ]

(which seems to match what was expected).
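
A rough sketch of why a list of "line" regexes keeps the document order, assuming the parser walks the block once and tries each regex on every line in turn. This is an illustration, not invoice2data's real code: `parse_lines` is a hypothetical name, and the sample text and item regex are shortened.

```python
import re

# Shortened stand-in for the Mekro invoice text.
SAMPLE = """\
Barcode Description qty
---FOOD ITEMS---
2231012001992 KROKETBROODJES 2
---OTHER ITEMS---
8713009019375 programmer 1
Netto totaal: 5,50
"""

LINE_PATTERNS = [
    r"(?P<barcode>\d{13})\s+(?P<name>\w+)\s+(?P<qty>\d+)",
    r"---(?P<line_note>.*ITEMS)---",
]


def parse_lines(text, start, end, patterns):
    """Single pass over the block; the first matching regex wins per line."""
    rows, inside = [], False
    for line in text.splitlines():
        if not inside:
            # Skip everything up to and including the start line.
            inside = re.search(start, line) is not None
            continue
        if re.search(end, line):
            break
        for pattern in patterns:
            match = re.search(pattern, line)
            if match:
                rows.append(match.groupdict())
                break
    return rows


rows = parse_lines(SAMPLE, "Barcode", "Netto totaal", LINE_PATTERNS)
first_keys = [next(iter(row)) for row in rows]
# first_keys follows the invoice order:
# ['line_note', 'barcode', 'line_note', 'barcode']
```

Because every input line is visited exactly once, each note stays next to the items it introduces, whatever the order of the regexes in the list.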


As for Coolblue invoices, those are trickier; it's even hard to agree on the ideal expected output. That is being discussed in #428.

@rmilecki rmilecki deleted the parser-lines-support-rules branch May 11, 2023 07:50