lines: support "rules" field for multiple sets of parsing regexes #463

rmilecki · 2023-02-04T11:20:19Z

Sometimes companies use more than one format for line-parseable data.
They may unpredictably generate invoices that, for example:
1. Include extra columns that are used only occasionally
2. Rearrange the column order

Such format changes may be too invasive to support parsing with e.g.
multiple "line" regexes.

This commit adds "rules" field support to the "lines" parser. It allows
defining multiple sets of regexes ("start", "end", "line" & friends) for
a single upper field.

Usage of "rules" is optional. Backward compatibility with existing
templates is preserved.

rmilecki · 2023-02-04T11:24:21Z

This is a refactored version of #407. Rebased, with updated description and test case added.

I think the added test gives a good idea of why we may need this feature. The thing is, it's not always possible to parse lines with varying columns using multiple regexes. Look at the example from the included test case:

No  Angle [°]  Length [cm]
1   30         30
2   45         40
3   90         60
Count: 3

No  Length [cm]  Angle [°]
4   80           135
5   100          180
Count: 2

If a company changes the order of columns, we need different sets of rules for parsing such lines. That's because it's impossible to write a generic line regex that can recognize whether a number is an "Angle" or a "Length".
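To illustrate the point above, here is a minimal Python sketch. It is not the code from this PR and not invoice2data's API: the `RULES` list, the `parse_with_rules` helper, and the regexes are assumptions made for illustration. Each rule set has its own "start", "end", and "line" regex, so the same three-number row is assigned to different fields depending on which header precedes it.

```python
import re

TEXT = """\
No  Angle [°]  Length [cm]
1   30         30
2   45         40
3   90         60
Count: 3

No  Length [cm]  Angle [°]
4   80           135
5   100          180
Count: 2
"""

# Hypothetical rule sets, one per column layout (illustration only).
RULES = [
    {
        "start": r"No\s+Angle \[°\]\s+Length \[cm\]",
        "end": r"Count:",
        "line": r"(?P<no>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)",
    },
    {
        "start": r"No\s+Length \[cm\]\s+Angle \[°\]",
        "end": r"Count:",
        "line": r"(?P<no>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)",
    },
]


def parse_with_rules(text, rules):
    """Apply each rule set to the block between its start and end match."""
    rows = []
    for rule in rules:
        start = re.search(rule["start"], text)
        if start is None:
            continue  # this layout does not occur in the text
        end = re.search(rule["end"], text[start.end():])
        if end is None:
            continue  # no closing marker, so skip the rule
        block = text[start.end():start.end() + end.start()]
        for raw in block.splitlines():
            match = re.match(rule["line"], raw.strip())
            if match:
                rows.append({k: int(v) for k, v in match.groupdict().items()})
    return rows


rows = parse_with_rules(TEXT, RULES)
for row in rows:
    print(row)
```

Note that this sketch emits rows rule by rule, in the order the rules are listed, which happens to match the input order for this sample.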

Let me know what you think about this feature.

bosd · 2023-02-14T10:36:01Z

Thanks for this PR. It does what it is supposed to do. 👍 🎉

Maybe this was not the intention of this PR.
Still, I was hoping it would do more and provide a solution for the problem described here:
#407 (comment)

That is, the possibility to parse multiple sets of regexes while applying the extraction rules (first_line, last_line) and retaining the order of the input string in the output.

Technical background behind previous PR 378
In the previous PR I achieved it with nested for loops. 💀 ⚠️ (Yes, maybe there are better iteration methods.)
The main loop went through each line of optimized_str, plain and simple.

Then a nested for loop was applied to that particular line.
The nested loop went over the sets of regexes (aka rules), trying to find a match, while also checking whether the match fell between the first_line and last_line of that particular rule.
If all the conditions were met, the match was appended to the output.

(If they were not met, detailed feedback was provided, e.g. "Line: XXX matched, but is not between firstline and lastline, so ignoring".)
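The nested-loop idea described above could look roughly like the following Python sketch. It is a reconstruction from the description only, not the code from the earlier PR: the function name, the `window_start` / `window_end` keys (standing in for the per-rule bounds the comment calls first_line and last_line), and the sample rules are all assumptions. The outer loop walks the input line by line, so the output keeps the input order regardless of how the rules are ordered.

```python
import re

TEXT = """\
No  Angle [°]  Length [cm]
1   30         30
2   45         40
3   90         60
Count: 3

No  Length [cm]  Angle [°]
4   80           135
5   100          180
Count: 2
"""

# Hypothetical rules; window_start/window_end bound where a rule may match.
RULES = [
    {
        "window_start": r"No\s+Angle \[°\]\s+Length \[cm\]",
        "window_end": r"Count:",
        "line": r"(?P<no>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)",
    },
    {
        "window_start": r"No\s+Length \[cm\]\s+Angle \[°\]",
        "window_end": r"Count:",
        "line": r"(?P<no>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)",
    },
]


def parse_in_input_order(text, rules):
    """Outer loop over input lines, inner loop over rules."""
    active = [False] * len(rules)  # is rule i currently inside its window?
    rows, skipped = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        for i, rule in enumerate(rules):
            if re.search(rule["window_start"], raw):
                active[i] = True
                continue
            if active[i] and re.search(rule["window_end"], raw):
                active[i] = False
                continue
            match = re.match(rule["line"], raw.strip())
            if match is None:
                continue
            if active[i]:
                rows.append(match.groupdict())
                break  # first rule that matches inside its window wins
            # Matched, but outside this rule's window: record for feedback.
            skipped.append((lineno, i))
    return rows, skipped


rows, skipped = parse_in_input_order(TEXT, RULES)
for lineno, rule_index in skipped:
    print(f"Line {lineno} matched rule {rule_index}, "
          "but is outside its window, so ignoring")
```

The `skipped` list mirrors the detailed feedback mentioned above: it records lines that matched a rule's line regex while that rule's window was not open.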

The fundamental difference seems to be that the code in this PR follows the order in which the rules were written in the template file (an assumption on my part), instead of following the order of the optimized string from the input file.

What to do with this PR? We can merge it.
However, I still hope to get the initial problem solved.
I hope the explanation in this comment was helpful for seeing how to attack the problem
(only now implemented with better code than nesting multiple loops).

rmilecki · 2023-02-18T21:38:04Z

@bosd: I believe we can parse Mekro invoices the way you expected since #417.

I prepared this pull request to handle different cases.

Please kindly take another look at #417. I think you managed to parse Mekro invoices by using two regexes for the lines.

bosd · 2023-02-26T13:54:30Z

I will look at the Mekro example again later.

Tested this PR with a template generated to solve #428.
Works perfectly! Thanks @rmilecki


bosd

Functional Tests 👍

rmilecki requested review from m3nu and bosd February 4, 2023 11:24

bosd force-pushed the lines-rules branch from bda783f to 5ef8c10 Compare February 14, 2023 09:33

bosd force-pushed the lines-rules branch from 5ef8c10 to 3d41995 Compare February 26, 2023 13:54

bosd approved these changes Feb 26, 2023


bosd merged commit d460e47 into invoice-x:master Feb 26, 2023

bosd mentioned this pull request Mar 18, 2023

Is there a support for multiple regex for lines plugin? #238

Closed

rmilecki deleted the lines-rules branch May 11, 2023 07:50
