lines: support "rules" field for multiple sets of parsing regexes #463

Merged
merged 1 commit into from
Feb 26, 2023

Collaborator

@rmilecki rmilecki commented Feb 4, 2023

Sometimes companies use more than one format for line-parseable data.
They may randomly generate invoices with e.g.
1. Some extra columns that are used only occasionally
2. A rearranged column order

Such format changes may be too invasive to support with e.g. multiple
"line" regexes.

This commit adds "rules" field support to the "lines" parser. It allows
defining multiple sets of regexes ("start", "end", "line" & friends) for
a single upper field.

Usage of "rules" is optional. Backward compatibility with existing
templates is preserved.

@rmilecki
Collaborator Author

rmilecki commented Feb 4, 2023

This is a refactored version of #407: rebased, with an updated description and an added test case.

I think the added test gives a good idea of why we may need this feature. It is not always possible to parse lines with varying columns using multiple regexes. Look at the example from the included test case:

No  Angle [°]  Length [cm]
1   30         30
2   45         40
3   90         60
Count: 3

No  Length [cm]  Angle [°]
4   80           135
5   100          180
Count: 2

If a company changes the order of columns, we need different sets of rules to parse such lines. That is because it is impossible to write a generic line regex that recognizes whether a number is an "Angle" or a "Length".
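To make the idea concrete, here is a minimal standalone sketch of parsing the example above with two rule sets. It does not use the library's code: the `RULES` structure, the regexes, and the `parse_with_rules` helper are assumptions made for illustration, and only the key names "start", "end" and "line" come from the description.

```python
import re

SAMPLE = """\
No  Angle [°]  Length [cm]
1   30         30
2   45         40
3   90         60
Count: 3

No  Length [cm]  Angle [°]
4   80           135
5   100          180
Count: 2
"""

# Hypothetical rule sets, one per column layout. The header line tells the
# layouts apart, so each "line" regex can name its columns unambiguously.
RULES = [
    {
        "start": r"^No\s+Angle \[°\]\s+Length \[cm\]$",
        "end": r"^Count:\s+\d+$",
        "line": r"^(?P<no>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)$",
    },
    {
        "start": r"^No\s+Length \[cm\]\s+Angle \[°\]$",
        "end": r"^Count:\s+\d+$",
        "line": r"^(?P<no>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)$",
    },
]


def parse_with_rules(text, rules):
    """Apply each rule set in turn and collect the rows it matches."""
    rows = []
    for rule in rules:
        inside = False  # True while between this rule's start and end lines
        for raw in text.splitlines():
            line = raw.strip()
            if not inside:
                inside = re.search(rule["start"], line) is not None
                continue
            if re.search(rule["end"], line):
                inside = False
                continue
            match = re.search(rule["line"], line)
            if match:
                rows.append({k: int(v) for k, v in match.groupdict().items()})
    return rows


rows = parse_with_rules(SAMPLE, RULES)
for row in rows:
    print(row)
```

In this sketch a single "line" regex could not work for both blocks, because rows such as `1 30 30` and `4 80 135` have the same shape; only the block header decides which number is the angle.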

Let me know what you think about this feature.

@bosd
Collaborator

bosd commented Feb 14, 2023

Thanks for this PR. It does what it is supposed to do. 👍 🎉

Maybe this was not the intention of this PR.
Yet I was hoping it would do more and provide a solution for the problem described here:
#407 (comment)

That is, the possibility to parse multiple sets of regexes while applying the extraction rules (first_line, last_line) and retaining the order of the input string in the output.

Technical background behind the previous PR 378
In the previous PR I achieved this with nested for loops. 💀 ⚠️ (Yes, maybe there are better iteration methods.)
The main loop went through each line of optimized_str, plain and simple.

Then a nested for loop was applied to that particular line.
The nested loop went over the sets of regexes (aka rules), trying to find a match, while also checking whether the match fell between the first_line and last_line of that particular rule.
If all the conditions were met, the line was appended to the output.

(If they were not met, detailed feedback was provided: "Line: XXX matched, but is not between firstline and lastline, so ignoring".)
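The line-first iteration described here can be sketched roughly as follows. This is not the code from the earlier PR: the rule structure, the regexes, and the `parse_in_input_order` helper are hypothetical, the block boundaries are modelled with "start" and "end" regexes, and the detailed feedback messages are left out. The rule sets are deliberately listed in the opposite order to the blocks in the input, to show that the output still follows the input.

```python
import re

TEXT = """\
No  Angle [°]  Length [cm]
1   30         30
2   45         40
Count: 2

No  Length [cm]  Angle [°]
4   80           135
Count: 1
"""

# Hypothetical rule sets, listed in the reverse order of the blocks in TEXT.
RULES = [
    {   # length-first layout
        "start": r"^No\s+Length",
        "end": r"^Count:",
        "line": r"^(?P<no>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)$",
    },
    {   # angle-first layout
        "start": r"^No\s+Angle",
        "end": r"^Count:",
        "line": r"^(?P<no>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)$",
    },
]


def parse_in_input_order(text, rules):
    """Outer loop over input lines, inner loop over rule sets."""
    rows = []
    active = [False] * len(rules)  # is each rule currently inside its block?
    for raw in text.splitlines():
        line = raw.strip()
        for i, rule in enumerate(rules):
            if not active[i]:
                # A rule only starts matching rows after its start line.
                if re.search(rule["start"], line):
                    active[i] = True
                continue
            if re.search(rule["end"], line):
                active[i] = False
                continue
            match = re.search(rule["line"], line)
            if match:
                rows.append({k: int(v) for k, v in match.groupdict().items()})
                break  # first active rule that matches wins for this line
    return rows


ordered = parse_in_input_order(TEXT, RULES)
numbers = [row["no"] for row in ordered]
print(numbers)
```

Because rows are appended while walking the input, their order depends on the document rather than on how the rule sets are ordered in the template.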

The fundamental difference seems to be that the code in this PR follows the order in which the rules were written in the template file (an assumption), instead of the order of the optimized string from the input file.


What to do with this PR? We could merge it.
However, I still hope to get the initial problem solved.
I hope the explanation in this comment helps with how to attack the problem
(only this time implemented with better code than multiple nested loops).

@rmilecki
Collaborator Author

@bosd: I believe we can parse Mekro invoices the way you expected since #417.

I prepared this pull request to handle different cases.

Please kindly take another look at #417. I think you managed to parse Mekro invoices by using two regexes for the lines.

@bosd
Collaborator

bosd commented Feb 26, 2023

I will look at the Mekro example again later.

Tested this PR with a template generated to solve #428.
Works perfectly! Thanks @rmilecki

Signed-off-by: Rafał Miłecki <[email protected]>
Collaborator

@bosd bosd left a comment

Functional Tests 👍

@bosd bosd merged commit d460e47 into invoice-x:master Feb 26, 2023
@rmilecki rmilecki deleted the lines-rules branch May 11, 2023 07:50