Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

templates: de: improve parsing QualityHosting lines #422

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rmilecki
Copy link
Collaborator

@rmilecki rmilecki commented Oct 8, 2022

A single QualityHosting invoice position spans across multiple lines.
For that reason it uses a very generic RegEx for middle lines:
line: '^\s+(?P<desc>.+)$'

That doesn't work well with multi-page invoices. It's because above
RegEx matches page footer lines. That results in footer content getting
extracted as invoice line "desc".

Improve that situation by adding "last_line" RegEx matching position
last line. That prevents parsing lines between last and first lines
(e.g. footer content).

A single QualityHosting invoice position spans across multiple lines.
For that reason it uses a very generic RegEx for middle lines:
line: '^\s+(?P<desc>.+)$'

That doesn't work well with multi-page invoices. It's because above
RegEx matches page footer lines. That results in footer content getting
extracted as invoice line "desc".

Improve that situation by adding "last_line" RegEx matching position
last line. That prevents parsing lines between last and first lines
(e.g. footer content).

Signed-off-by: Rafał Miłecki <[email protected]>
@rmilecki
Copy link
Collaborator Author

So what do we do about this pull request?

It received some minor cross-pull-request comment in the #417:

Maybe better to leave suboptimal tests and examples in this library. Just as an showcase. (Same goes for the OCR examples in this repo). It is definityle helping us to find these corner cases.

As de.qualityhosting.yml is a template for actual (real-life) invoices I really think we should fix it. By that I mean accepting this pull request.

If we need to test some corner cases - that can be done using custom templates & tests. I added support for such in the just-merged #414.

@bosd
Copy link
Collaborator

bosd commented Oct 16, 2022

So what do we do about this pull request?

I'd suggest we let this one sit here for a moment.
At least until we've sorted out the multiline parsing.

Wen't trought the re lib docs. need some time to test and verify things. Will report back in #417

@bosd
Copy link
Collaborator

bosd commented Oct 19, 2022

As in 417 mentioned.
I agree to update this template.
Possibly to include only lines with more then 30 spaces in front
^\s{30,}

For this particular invoice we could make the lastline specific, (or either leave it very generic, like before this pr)
Just the question wat kind of example we would like to include.

As per your previous suggestion:
'^\s+(?P<desc>\d\d\.\d\d\.\d\d-\d\d\.\d\d\.\d\d)$'
I would like to propose to change it to
'^\s{30,}(?P<desc>\d{2}[.]\d{2}[.]\d{2}[.]-\d{2}[.]\d{2}[.]\d{2})$'
rewrote the . to [.] as it is best practice to prevent the use of a plain . preformance wise.

(I quikly drafted this, it needs testing..)

@bosd
Copy link
Collaborator

bosd commented Oct 21, 2022

Used this version of thetemplate to check the code of #417.
Intentional it is without last_line. Because the purpose of the check was to see if it is adding a matched line to the output without the lastline key.

issuer: QualityHosting AG
fields:
  amount: Total EUR\s+(\d+,\d+)
  amount_untaxed: Total EUR\s+(\d+,\d+)
  date:
    - \s{2,}(\d+\. .+ \d{4})\s{2,}
    - Rechnungsdatum\s+(\w+ \d+, \d{4})
  invoice_number: Rechnungsnr\.\s+(\d{8})
  vat: DE 232 446 240
lines:
  start: 'Contract No. \w+'
  end: 'Total EUR'
  first_line: '\s+(?P<pos>\d+)\s+(?P<qty>\d+)\s+(?P<desc>.{,70})\s+(?P<price>\d+,\d+)'
  line: '^\s{30}(?P<desc>.{5,30})$'
  types:
      qty: float
      price: float
keywords:
- QualityHosting
options:
  currency: EUR
  decimal_separator: ","

@rmilecki Is it ok for you to use this version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants