lines: support "rules" field for multiple sets of parsing regexes #463

Merged: 1 commit, Feb 26, 2023
15 changes: 15 additions & 0 deletions TUTORIAL.md
@@ -137,6 +137,10 @@ This parser allows parsing selected invoice section as a set of lines
 sharing some pattern. Those can be e.g. invoice items (good or services)
 or VAT rates.

+Some companies may use multiple formats for their line-based data. In
+such cases multiple sets of parsing regexes can be added to the `rules`.
+Results from multiple `rules` get merged into a single array.
+
 It replaces `lines` plugin and should be preferred over it. It allows
 reusing in multiple `fields`.

@@ -149,6 +153,17 @@ Example for `fields`:
         end: \s+Total
         line: (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)

+    fields:
+      lines:
+        parser: lines
+        rules:
+          - start: Item\s+Discount\s+Price$
+            end: \s+Total
+            line: (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)
+          - start: Item\s+Price$
+            end: \s+Total
+            line: (?P<description>.+)\s+(?P<price>\d+\d+)
+
 ### Legacy regexes

 For non-text fields, the name of the field is important:
22 changes: 20 additions & 2 deletions src/invoice2data/extract/parsers/lines.py
@@ -115,10 +115,10 @@ def parse_block(template, field, settings, content):
     return lines


-def parse(template, field, _settings, content):
+def parse_by_rule(template, field, rule, content):
     # First apply default options.
     settings = DEFAULT_OPTIONS.copy()
-    settings.update(_settings)
+    settings.update(rule)

     # Validate settings
     assert "start" in settings, "Lines start regex missing"
@@ -152,6 +152,24 @@ def parse(template, field, _settings, content):
     return lines


+def parse(template, field, settings, content):
+    if "rules" in settings:
+        # One field can have multiple sets of line-parsing rules
+        rules = settings['rules']
+    else:
+        # Original syntax stored line-parsing rules in top field YAML object
+        keys = ('start', 'end', 'line', 'first_line', 'last_line', 'skip_line', 'types')
+        rules = [{k: v for k, v in settings.items() if k in keys}]
+
+    lines = []
+    for rule in rules:
+        new_lines = parse_by_rule(template, field, rule, content)
+        if new_lines is not None:
+            lines += new_lines
+
+    return lines
+
+
 def parse_current_row(match, current_row):
     # Parse the current row data
     for field, value in match.groupdict().items():
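The branch at the top of the new `parse` can be read as a normalization step: both template syntaxes end up as a list of rule dicts before any parsing happens. The following is a minimal sketch of that step in isolation. `normalize_rules`, `LEGACY_KEYS`, and the two sample field dicts are hypothetical names introduced here for illustration; they are not part of invoice2data.

```python
# Illustrative sketch only: normalize_rules is a hypothetical helper,
# not a function in invoice2data.
LEGACY_KEYS = ("start", "end", "line", "first_line", "last_line", "skip_line", "types")


def normalize_rules(settings):
    """Return a list of rule dicts for both template syntaxes."""
    if "rules" in settings:
        # New syntax: the field already carries a list of rule dicts.
        return list(settings["rules"])
    # Legacy syntax: lift the rule keys out of the field object itself.
    return [{k: v for k, v in settings.items() if k in LEGACY_KEYS}]


# Hypothetical field settings, as they might look after YAML loading.
legacy_field = {
    "parser": "lines",
    "start": r"Item\s+Price$",
    "end": r"\s+Total",
    "line": r"(?P<description>.+)\s+(?P<price>\d+)",
}
multi_field = {
    "parser": "lines",
    "rules": [
        {"start": "A", "end": "B", "line": r"(?P<x>\d+)"},
        {"start": "C", "end": "D", "line": r"(?P<y>\d+)"},
    ],
}

legacy_rules = normalize_rules(legacy_field)
multi_rules = normalize_rules(multi_field)
print(legacy_rules)
print(len(multi_rules))
```

As in the diff, rule keys placed directly on the field are not consulted once `rules` is present, and non-rule keys such as `parser` are dropped from the legacy rule.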
7 changes: 7 additions & 0 deletions tests/custom/lines-multiple-patterns.json
@@ -17,6 +17,13 @@
         { "pos": 6, "name": "Penguin" },
         { "pos": 7, "name": "Ostrich" }
     ],
+    "dimensions": [
+        { "pos": 1, "angle": 30, "length": 30 },
+        { "pos": 2, "angle": 45, "length": 40 },
+        { "pos": 3, "angle": 90, "length": 60 },
+        { "pos": 4, "length": 80, "angle": 135 },
+        { "pos": 5, "length": 100, "angle": 180 }
+    ],
     "currency": "EUR",
     "desc": "Invoice from Lines Tests"
 }
13 changes: 13 additions & 0 deletions tests/custom/lines-multiple-patterns.txt
@@ -5,6 +5,7 @@ Total: 50.00 EUR

 Lines with multiple patterns

+
 Lines start

 Group: Mammals
@@ -21,3 +22,15 @@ Subgroup: Flightless
 7. Ostrich

 Lines end
+
+
+No Angle [°] Length [cm]
+1 30 30
+2 45 40
+3 90 60
+Count: 3
+
+No Length [cm] Angle [°]
+4 80 135
+5 100 180
+Count: 2
17 changes: 17 additions & 0 deletions tests/custom/templates/lines-multiple-patterns.yml
@@ -26,6 +26,23 @@ fields:
       - ^Subgroup:\s*(?P<subgroup>.+)$
     types:
       pos: int
+  dimensions:
+    parser: lines
+    rules:
+      - start: No.*Angle.*Length
+        end: Count
+        line: ^(?P<pos>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)$
+        types:
+          pos: int
+          angle: int
+          length: int
+      - start: No.*Length.*Angle
+        end: Count
+        line: ^(?P<pos>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)$
+        types:
+          pos: int
+          angle: int
+          length: int
 options:
   currency: EUR
   date_formats:
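To see why the `dimensions` test expects five rows in one array, the two rules from the template can be run against the sample text with a much-simplified stand-in for the parser. This is a sketch under stated assumptions, not the library's implementation: `parse_by_rule_sketch` and `parse_sketch` are hypothetical names, the block search is reduced to the first `start`/`end` match, every captured group is cast to `int` in place of the `types` mapping, and the sample text uses single spaces between columns.

```python
import re

# Sample text and rules copied from the test fixtures in this PR.
SAMPLE = """\
No Angle [°] Length [cm]
1 30 30
2 45 40
3 90 60
Count: 3

No Length [cm] Angle [°]
4 80 135
5 100 180
Count: 2
"""

RULES = [
    {
        "start": r"No.*Angle.*Length",
        "end": r"Count",
        "line": r"^(?P<pos>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)$",
    },
    {
        "start": r"No.*Length.*Angle",
        "end": r"Count",
        "line": r"^(?P<pos>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)$",
    },
]


def parse_by_rule_sketch(rule, content):
    """Simplified stand-in: take the first start..end block, match each line."""
    start = re.search(rule["start"], content)
    if start is None:
        return None
    rest = content[start.end():]
    end = re.search(rule["end"], rest)
    if end is None:
        return None
    rows = []
    for text in rest[:end.start()].splitlines():
        match = re.match(rule["line"], text.strip())
        if match:
            # Named groups become the row's keys; all are integers here.
            rows.append({k: int(v) for k, v in match.groupdict().items()})
    return rows


def parse_sketch(rules, content):
    """Apply every rule and concatenate the rows, as the new parse() does."""
    lines = []
    for rule in rules:
        new_lines = parse_by_rule_sketch(rule, content)
        if new_lines is not None:
            lines += new_lines
    return lines


dimensions = parse_sketch(RULES, SAMPLE)
for row in dimensions:
    print(row)
```

Each rule only fires on the table whose header order matches its `start` regex, so the swapped `angle`/`length` columns in the second table are assigned correctly, and rows appear in rule order.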