Data Sanitization after match #497

Open
bosd opened this issue Mar 18, 2023 · 1 comment

Comments

@bosd
Collaborator

bosd commented Mar 18, 2023

I would like to propose a data cleansing / sanitization step that runs after matching,
as discussed in #106 (comment).

Use Case:

I would like to match a Dutch VAT number.
Format: 'NL' + 9 digits + 'B' + 2-digit company index, e.g. NL999999999B01
Which translates to:

  vat:
    parser: regex
    regex: (NL\d{9}B\d{2})\s

Input string from OCR'd pdf:
VAT NUMBER NL.999,999.999,B01
The value is there, but it contains . and , separators,
so the regex above won't match 😞

Capturing something like that would need:

  vat:
    parser: regex
    regex: (NL.\d{3}.\d{3}.\d{3}.B\d{2})\s

Or maybe use multiple capturing groups that skip the . and ,
and combine them with group: join

Since writing templates is already hard, I would like to make it as easy as possible.
The ideal regex template for the input string is:
regex: VAT NUMBER\s+(\S+)

results in vat: ['NL.999,999.999,B01']

and then have a sanitization function strip out the unwanted characters.
Since we know a VAT number should only contain letters and digits, we can remove everything else:
re.sub(r'\W+', '', vat)
results in vat: ['NL999999999B01']
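
A minimal standalone sketch of the "loose match, then sanitize" idea described above. It uses only Python's `re` module and does not touch invoice2data itself; the variable names and the final strict-format check are assumptions added for illustration.

```python
import re

# Sample OCR output from the issue, with a trailing space.
text = "VAT NUMBER NL.999,999.999,B01 "

# Loose capture: take whatever non-space token follows the label.
match = re.search(r"VAT NUMBER\s+(\S+)", text)
raw_vat = match.group(1) if match else None

# Sanitize: drop every non-word character.
# Note that \W does not remove underscores, since "_" counts as a word character.
clean_vat = re.sub(r"\W+", "", raw_vat) if raw_vat else None

# Optional extra step (not part of the proposal): verify the strict format.
is_valid = bool(clean_vat and re.fullmatch(r"NL\d{9}B\d{2}", clean_vat))
```

Checking the cleaned value against the strict pattern afterwards keeps the template regex simple while still catching OCR output that is not a VAT number at all.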

What would be the best way to implement this in code?

fields:
  vat:
    parser: regex
    regex: (NL\d{9}B\d{2})\s
    type: str
    # 1. Make replace function available on field level
    replace: ['\W+', '']
    # 2. Make a new sanitize option
    sanitize: any_word_character

Option 1 is still not easy to include in a template, but it is very powerful and flexible.
Option 2 is easier to include in the template.
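
To make the two options concrete, here is a rough sketch of how a post-match step might apply them to a single matched string. Everything in it is hypothetical: `sanitize_value`, `SANITIZE_PRESETS` and the `digits_only` preset are illustrative names, not existing invoice2data APIs.

```python
import re

# Hypothetical presets for the proposed `sanitize` option.
# Each preset maps to a pattern describing what gets removed.
SANITIZE_PRESETS = {
    "any_word_character": r"\W+",  # keep only letters, digits and underscore
    "digits_only": r"\D+",         # keep only digits (illustrative extra preset)
}


def sanitize_value(value, settings):
    """Apply the proposed `replace` and `sanitize` field options to one string."""
    # Option 1: explicit [pattern, replacement] pair from the template.
    if "replace" in settings:
        pattern, repl = settings["replace"]
        value = re.sub(pattern, repl, value)
    # Option 2: a named preset that removes everything matching its pattern.
    if "sanitize" in settings:
        value = re.sub(SANITIZE_PRESETS[settings["sanitize"]], "", value)
    return value


via_replace = sanitize_value("NL.999,999.999,B01", {"replace": [r"\W+", ""]})
via_preset = sanitize_value("NL.999,999.999,B01", {"sanitize": "any_word_character"})
```

Both calls produce the same cleaned value here. The preset form hides the regex from the template author, while the `replace` form leaves room for arbitrary substitutions.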

@Jopie01

Jopie01 commented Jan 19, 2024

This would be a very cool feature! Please also add it to the other plugins, like 'tables' and 'lines'. Because suppliers name their units differently, I want to be able to replace each unit with the name I use in my system. That also means a field needs a list of possible replacements. Below is part of the Farnell template with the proposed replace option added:

lines:
    start: 'Lijn Nr'
    end: BELANGRIJK
    first_line: '\d+\s+(?P<code>\d{7})\s+(?P<uom>\w+)\s+(?P<qty>\d+)\s+(?P<price_unit>\d+[.]\d{2,4})\s+(?P<netto_price>\d+[.]\d{2,4})\s+(?P<btw_percent>\d+[.]\d{2})\s+(?P<price_subtotal>\d+[.]\d{2})'
    line: '^\s{9,11}(?P<name>(\S+(?:\s\S+)*))\s+'
    last_line: '\s+(?P<name>(Tariff Code[:]\s+\d+))'
    replace:
      - uom:
          - ['PS', 'unit']  # should be regex
          - ['M', 'meter']
          -  .....
    types:
      qty: float
      price_unit: float
      price_subtotal: float
      netto_price: float
      btw_percent: float
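
A possible shape for the per-field replacement list, sketched in plain Python. The function `apply_replacements` and the rule layout are assumptions for illustration only and are not part of the 'lines' plugin; the sample line dicts are made up. Patterns are treated as regexes that must match the whole field value, which is one design choice among several.

```python
import re


def apply_replacements(line, replace_rules):
    """Return a copy of `line` with each field's (pattern, replacement) rules applied."""
    result = dict(line)
    for field, rules in replace_rules.items():
        if field not in result:
            continue
        for pattern, repl in rules:
            # fullmatch so that 'M' does not rewrite an 'M' inside a longer unit code.
            if re.fullmatch(pattern, result[field]):
                result[field] = repl
                break  # first matching rule wins
    return result


# Rules mirroring the `replace` section of the template above.
rules = {"uom": [("PS", "unit"), ("M", "meter")]}

# Made-up parsed lines, shaped like the named groups in `first_line`.
lines = [
    {"code": "1234567", "uom": "PS"},
    {"code": "7654321", "uom": "M"},
    {"code": "1112223", "uom": "KG"},
]
converted = [apply_replacements(line, rules) for line in lines]
```

Units without a matching rule, such as `KG` here, pass through unchanged, so a template only needs to list the names that differ from the target system.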
