lines: support "rules" field for multiple sets of parsing regexes
Sometimes companies use more than one format for line-parseable data. They
may randomly generate invoices with e.g.
1. Some extra columns that are used only occasionally
2. A rearranged column order

Such format changes may be too invasive to support parsing with e.g.
multiple "line" regexes.

This commit adds "rules" field support to the "lines" parser. It allows
defining multiple sets of regexes ("start", "end", "line" & friends) for
a single upper field.

Usage of "rules" is optional. Backward compatibility with existing
templates is preserved.

Signed-off-by: Rafał Miłecki <[email protected]>
Rafał Miłecki authored and bosd committed Feb 26, 2023
1 parent 7bed841 commit d460e47
Showing 5 changed files with 72 additions and 2 deletions.
15 changes: 15 additions & 0 deletions TUTORIAL.md
@@ -137,6 +137,10 @@ This parser allows parsing selected invoice section as a set of lines
 sharing some pattern. Those can be e.g. invoice items (good or services)
 or VAT rates.

+Some companies may use multiple formats for their line-based data. In
+such cases multiple sets of parsing regexes can be added to the `rules`.
+Results from multiple `rules` get merged into a single array.
+
 It replaces `lines` plugin and should be preferred over it. It allows
 reusing in multiple `fields`.

@@ -149,6 +153,17 @@ Example for `fields`:
         end: \s+Total
         line: (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)

+    fields:
+      lines:
+        parser: lines
+        rules:
+          - start: Item\s+Discount\s+Price$
+            end: \s+Total
+            line: (?P<description>.+)\s+(?P<discount>\d+.\d+)\s+(?P<price>\d+\d+)
+          - start: Item\s+Price$
+            end: \s+Total
+            line: (?P<description>.+)\s+(?P<price>\d+\d+)
+
 ### Legacy regexes

 For non-text fields, the name of the field is important:
22 changes: 20 additions & 2 deletions src/invoice2data/extract/parsers/lines.py
@@ -115,10 +115,10 @@ def parse_block(template, field, settings, content):
     return lines


-def parse(template, field, _settings, content):
+def parse_by_rule(template, field, rule, content):
     # First apply default options.
     settings = DEFAULT_OPTIONS.copy()
-    settings.update(_settings)
+    settings.update(rule)

     # Validate settings
     assert "start" in settings, "Lines start regex missing"
@@ -152,6 +152,24 @@ def parse(template, field, _settings, content):
     return lines


+def parse(template, field, settings, content):
+    if "rules" in settings:
+        # One field can have multiple sets of line-parsing rules
+        rules = settings['rules']
+    else:
+        # Original syntax stored line-parsing rules in top field YAML object
+        keys = ('start', 'end', 'line', 'first_line', 'last_line', 'skip_line', 'types')
+        rules = [{k: v for k, v in settings.items() if k in keys}]
+
+    lines = []
+    for rule in rules:
+        new_lines = parse_by_rule(template, field, rule, content)
+        if new_lines is not None:
+            lines += new_lines
+
+    return lines
+
+
 def parse_current_row(match, current_row):
     # Parse the current row data
     for field, value in match.groupdict().items():
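
The fallback branch of the new `parse()` can be exercised on its own. The following is a minimal sketch rather than the library code: `normalize_rules` is a hypothetical helper name and both sample settings dicts are made up for illustration; only the key tuple is copied from the diff above.

```python
# Keys that the original (pre-"rules") syntax kept directly on the field.
# The tuple is copied from the diff; everything else here is illustrative.
LEGACY_KEYS = ('start', 'end', 'line', 'first_line', 'last_line', 'skip_line', 'types')


def normalize_rules(settings):
    """Return a list of rule dicts for either template syntax (hypothetical helper)."""
    if "rules" in settings:
        # New syntax: the field already carries a list of rule dicts.
        return settings["rules"]
    # Old syntax: wrap the field's own regex keys into a one-element list.
    return [{k: v for k, v in settings.items() if k in LEGACY_KEYS}]


# Made-up field settings in the old and the new style.
legacy_field = {
    "parser": "lines",
    "start": r"Item\s+Price$",
    "end": r"\s+Total",
    "line": r"(?P<description>.+)\s+(?P<price>\d+)",
}
rules_field = {
    "parser": "lines",
    "rules": [
        {"start": "A", "end": "B", "line": "(?P<x>.+)"},
        {"start": "C", "end": "D", "line": "(?P<y>.+)"},
    ],
}

legacy_rules = normalize_rules(legacy_field)
new_rules = normalize_rules(rules_field)
```

In this sketch an old-style field always yields exactly one rule, and keys outside the tuple (such as `parser`) are not forwarded to it, which mirrors the dict comprehension in the diff.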
7 changes: 7 additions & 0 deletions tests/custom/lines-multiple-patterns.json
@@ -17,6 +17,13 @@
     { "pos": 6, "name": "Penguin" },
     { "pos": 7, "name": "Ostrich" }
   ],
+  "dimensions": [
+    { "pos": 1, "angle": 30, "length": 30 },
+    { "pos": 2, "angle": 45, "length": 40 },
+    { "pos": 3, "angle": 90, "length": 60 },
+    { "pos": 4, "length": 80, "angle": 135 },
+    { "pos": 5, "length": 100, "angle": 180 }
+  ],
   "currency": "EUR",
   "desc": "Invoice from Lines Tests"
 }
13 changes: 13 additions & 0 deletions tests/custom/lines-multiple-patterns.txt
@@ -5,6 +5,7 @@ Total: 50.00 EUR

 Lines with multiple patterns

+
 Lines start

 Group: Mammals
@@ -21,3 +22,15 @@ Subgroup: Flightless
 7. Ostrich

 Lines end
+
+
+No Angle [°] Length [cm]
+1 30 30
+2 45 40
+3 90 60
+Count: 3
+
+No Length [cm] Angle [°]
+4 80 135
+5 100 180
+Count: 2
17 changes: 17 additions & 0 deletions tests/custom/templates/lines-multiple-patterns.yml
@@ -26,6 +26,23 @@ fields:
       - ^Subgroup:\s*(?P<subgroup>.+)$
     types:
       pos: int
+  dimensions:
+    parser: lines
+    rules:
+      - start: No.*Angle.*Length
+        end: Count
+        line: ^(?P<pos>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)$
+        types:
+          pos: int
+          angle: int
+          length: int
+      - start: No.*Length.*Angle
+        end: Count
+        line: ^(?P<pos>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)$
+        types:
+          pos: int
+          angle: int
+          length: int
 options:
   currency: EUR
   date_formats:
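
To see why the merged result lines up with the expected `dimensions` array in `lines-multiple-patterns.json`, the two rules can be tried against the sample text using plain `re`. This is a simplified stand-in under stated assumptions: one start/end block per rule, and `int` conversion for every named group (all three `types` in the template are `int`). `parse_by_rule_sketch` is a hypothetical name and does not reproduce invoice2data's actual `parse_by_rule()`.

```python
import re

# The two tables added to tests/custom/lines-multiple-patterns.txt.
SAMPLE = """\
No Angle [°] Length [cm]
1 30 30
2 45 40
3 90 60
Count: 3

No Length [cm] Angle [°]
4 80 135
5 100 180
Count: 2
"""

# The two rules from the test template, without the "types" mapping.
RULES = [
    {"start": r"No.*Angle.*Length", "end": r"Count",
     "line": r"^(?P<pos>\d+)\s+(?P<angle>\d+)\s+(?P<length>\d+)$"},
    {"start": r"No.*Length.*Angle", "end": r"Count",
     "line": r"^(?P<pos>\d+)\s+(?P<length>\d+)\s+(?P<angle>\d+)$"},
]


def parse_by_rule_sketch(rule, content):
    """Simplified: handle only the first start..end block of a rule."""
    start = re.search(rule["start"], content)
    if start is None:
        return []
    rest = content[start.end():]
    end = re.search(rule["end"], rest)
    if end is None:
        return []
    rows = []
    for raw in rest[:end.start()].splitlines():
        match = re.match(rule["line"], raw.strip())
        if match:
            # Assumption: every captured group is an integer.
            rows.append({k: int(v) for k, v in match.groupdict().items()})
    return rows


# Results of all rules are concatenated, as in the new parse().
merged = []
for rule in RULES:
    merged += parse_by_rule_sketch(rule, SAMPLE)
```

Because `.` does not match a newline, each `start` regex can only match the header whose columns appear in that order, so each rule picks up its own table and the concatenated list holds rows 1 to 5.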