Data Sanitization after match #497

bosd · 2023-03-18T11:11:19Z

I would like to propose a data cleansing / sanitization step after matching,
as commented in: #106 (comment)

Use Case:

I would like to match a Netherlands VAT number.
Format: 'NL' + 9 digits + B + 2-digit company index – e.g. NL999999999B01
Which translates to:

  vat:
    parser: regex
    regex: (NL\d{9}B\d{2})\s

Input string from OCR'd pdf:
VAT NUMBER NL.999,999.999,B01
We get the data, but it includes . and ,
so the previously mentioned regex won't match 😞

Capturing something like that would need:

  vat:
    parser: regex
    regex: (NL.\d{3}.\d{3}.\d{3}.B\d{2})\s

or maybe use multiple capturing groups, without the . and ,
and use group: join

As writing templates is very hard, I would prefer to make it as easy as possible.
The ideal regex template for the input string is:
regex: VAT NUMBER\s+(\S+)

results in vat: ['NL.999,999.999,B01']

and then have a sanitization function to strip out the unwanted characters.
As we know the value of the VAT number should only contain letters and digits, we can remove everything else.
re.sub(r'\W+', '', vat)
results in vat: ['NL999999999B01']
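
To make the two steps concrete, here is a minimal sketch in plain Python of the loose capture followed by the clean-up. The function name `capture_and_sanitize` is made up for illustration and is not part of the library.

```python
import re


def capture_and_sanitize(text):
    """Capture whatever follows 'VAT NUMBER', then strip non-word characters."""
    # Step 1: loose capture, takes the whole token after the label.
    raw_values = re.findall(r"VAT NUMBER\s+(\S+)", text)
    # Step 2: sanitize, removing every run of non-word characters.
    return [re.sub(r"\W+", "", value) for value in raw_values]


ocr_text = "VAT NUMBER NL.999,999.999,B01\n"
print(capture_and_sanitize(ocr_text))
```

Note that `\W` treats the underscore as a word character, so `_` would survive this clean-up. A stricter pattern such as `[^A-Za-z0-9]+` could be used if that matters.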

What would be the best way to implement this in code?

fields:
  vat:
    parser: regex
    regex: (NL\d{9}B\d{2})\s
    type: str
    # 1. Make replace function available on field level
    replace: ['\W+', '']
    # 2. Make a new sanitize option
    sanitize: any_word_character

Option 1: is still not easy to include in a template, but it is very powerful and flexible.
Option 2: is easier to include in the template.
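
As a starting point for discussion, both options could probably share one code path. The sketch below is only an assumption about how that might look: `sanitize_field` and `NAMED_SANITIZERS` are hypothetical names, and the settings dict simply mirrors the YAML keys proposed above.

```python
import re

# Hypothetical registry for option 2: a friendly name mapped to (pattern, replacement).
NAMED_SANITIZERS = {
    "any_word_character": (r"\W+", ""),
}


def sanitize_field(values, settings):
    """Apply the 'replace' and/or 'sanitize' settings to every matched value."""
    rules = []
    if "replace" in settings:
        # Option 1: an explicit [pattern, replacement] pair from the template.
        pattern, replacement = settings["replace"]
        rules.append((pattern, replacement))
    if "sanitize" in settings:
        # Option 2: look up a predefined rule by name.
        rules.append(NAMED_SANITIZERS[settings["sanitize"]])

    cleaned = []
    for value in values:
        for pattern, replacement in rules:
            value = re.sub(pattern, replacement, value)
        cleaned.append(value)
    return cleaned


matched = ["NL.999,999.999,B01"]
print(sanitize_field(matched, {"replace": [r"\W+", ""]}))
print(sanitize_field(matched, {"sanitize": "any_word_character"}))
```

With this shape, option 2 is just a named shortcut for option 1, so template authors get the easy form while the flexible form stays available.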


Jopie01 · 2024-01-19T14:03:18Z

This would be a very cool feature! Please also add it to the different plugins like 'tables' and 'lines'. Because suppliers use different names for units, I want to be able to replace the units with the names I use in my system. This also means that you need a list of possible replacements. Below is part of the Farnell template with the proposed replace option added:

lines:
    start: 'Lijn Nr'
    end: BELANGRIJK
    first_line: '\d+\s+(?P<code>\d{7})\s+(?P<uom>\w+)\s+(?P<qty>\d+)\s+(?P<price_unit>\d+[.]\d{2,4})\s+(?P<netto_price>\d+[.]\d{2,4})\s+(?P<btw_percent>\d+[.]\d{2})\s+(?P<price_subtotal>\d+[.]\d{2})'
    line: '^\s{9,11}(?P<name>(\S+(?:\s\S+)*))\s+'
    last_line: '\s+(?P<name>(Tariff Code[:]\s+\d+))'
    replace:
      - uom:
          - ['PS', 'unit']  # should be regex
          - ['M', 'meter']
          -  .....
    types:
      qty: float
      price_unit: float
      price_subtotal: float
      netto_price: float
      btw_percent: float
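
A rough sketch of how such a per-field replacement list might be applied to one parsed line is shown below. Everything here is an assumption for illustration: `replace_line_fields` is a hypothetical helper, and a parsed line is modelled as a plain dict of captured groups. Patterns are treated as regexes and matched against the whole value, so that 'M' does not fire on longer unit names.

```python
import re


def replace_line_fields(line, replacements):
    """Return a copy of a parsed line with per-field replacements applied."""
    result = dict(line)
    for field, pairs in replacements.items():
        value = result.get(field)
        if value is None:
            continue  # field was not captured on this line
        for pattern, replacement in pairs:
            # The first pattern that matches the entire value wins.
            if re.fullmatch(pattern, value):
                result[field] = replacement
                break
    return result


# Mirrors the two entries listed in the template above.
uom_rules = {"uom": [["PS", "unit"], ["M", "meter"]]}
print(replace_line_fields({"code": "1234567", "uom": "PS", "qty": "3"}, uom_rules))
```

Values that match none of the listed patterns are left unchanged, so units without a mapping pass through as they were captured.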

bosd added type:feature type:enhancement labels Mar 18, 2023
