Area Plugin Support #305

Open
wants to merge 2 commits into
base: master


@kavinsharma kavinsharma commented Oct 18, 2020

An invoice2data area plugin extracts text from a region of the page defined by area coordinates, using the area-cropping options of pdftotext. It is a customization of invoice2data that lets a template define the crop area by coordinates. The coordinates are defined per template, because they can vary from PDF to PDF.
To use it, add an area section to a normal YAML template, alongside the existing plugins for fields and tables; it then applies to every PDF that matches the template.
For each multi-line field you want to extract, give the field name and the coordinates of its area (x=?, y=?, r=?, H=?, W=?).

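x = x coordinate of the top-left corner of the crop area
y = y coordinate of the top-left corner of the crop area
H = height of the crop area
W = width of the crop area
r = resolution, in DPI, at which the coordinates are measured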
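
As a rough illustration of how these five values map onto the pdftotext crop options, here is a minimal Python sketch. The function name build_pdftotext_args is hypothetical and is not taken from this PR; the -layout flag and writing to stdout via "-" are assumptions about how one might call pdftotext, not a description of what the plugin does.

```python
from typing import Dict, List


def build_pdftotext_args(pdf_path: str, area: Dict[str, int]) -> List[str]:
    """Build a pdftotext command line that crops the page to `area`.

    `area` uses the same keys as the template: x, y, W, H and r.
    (Hypothetical helper, for illustration only.)
    """
    required = ("x", "y", "W", "H", "r")
    missing = [key for key in required if key not in area]
    if missing:
        raise ValueError(f"area is missing keys: {missing}")
    return [
        "pdftotext",
        "-layout",                # assumption: keep the physical line layout
        "-r", str(area["r"]),     # resolution in DPI
        "-x", str(area["x"]),     # x of the crop area's top-left corner
        "-y", str(area["y"]),     # y of the crop area's top-left corner
        "-W", str(area["W"]),     # crop width
        "-H", str(area["H"]),     # crop height
        pdf_path,
        "-",                      # write the extracted text to stdout
    ]


if __name__ == "__main__":
    address_area = {"x": 115, "y": 124, "r": 300, "W": 412, "H": 326}
    print(" ".join(build_pdftotext_args("invoice.pdf", address_area)))
```

The resulting list could be passed to subprocess.run(..., capture_output=True) to obtain the cropped text, provided pdftotext is installed.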

Area Plugin Options:
name: the field name under which the extracted text is stored
area: a dict of coordinates (x, y, r, W, H) defining the region of the PDF to crop and extract text from
regex: optional; a regular expression applied to the cropped text for further extraction
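
The three options above could be processed roughly as in the sketch below. It is only an illustration under stated assumptions: extract_area_entries and the crop_text callback are hypothetical names, and returning every regex match via re.findall is an assumption, since the PR description does not say what the regex step returns.

```python
import re
from typing import Callable, Dict, List


def extract_area_entries(
    entries: List[dict],
    crop_text: Callable[[Dict[str, int]], str],
) -> Dict[str, object]:
    """Map each entry's name to the text found in its area.

    `crop_text` is a stand-in for whatever returns the text of a cropped
    region (for example a pdftotext call); it is injected so the logic
    can be exercised without a PDF.
    """
    output: Dict[str, object] = {}
    for entry in entries:
        text = crop_text(entry["area"])
        pattern = entry.get("regex")
        if pattern is None:
            # No regex: keep the whole cropped block, multi-line text included.
            output[entry["name"]] = text.strip()
        else:
            # Assumption: with a regex, keep every match in the cropped text.
            output[entry["name"]] = re.findall(pattern, text)
    return output


if __name__ == "__main__":
    def fake_crop(area: Dict[str, int]) -> str:
        # Pretend two different regions of the page hold different text.
        return "Dell 7\r\nRound Rock, Tx\r\n" if area["x"] == 930 else "Suite 2000"

    entries = [
        {"name": "Address", "area": {"x": 115, "y": 124, "r": 300, "W": 412, "H": 326}, "regex": r"\w+"},
        {"name": "Bill_to", "area": {"x": 930, "y": 416, "r": 300, "W": 310, "H": 294}},
    ]
    print(extract_area_entries(entries, fake_crop))
```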
Sample Invoice

Here is a sample of an invoice template including the area plugin that helps you extract multiple lines.

# -*- coding: utf-8 -*-
issuer: The XYZ Company
keywords:
- The XYZ Company
- US Supplier 123
fields:
  amount: TOTAL:\s+(\d+,\d+\.\d\d)
  date: Invoice Date:\s+(\d{1,2}\/\d{1,2}\/\d{4})
  delivery_date: Delivery Date:\s+(\d{1,2}\/\d{1,2}\/\d{4})
  invoice_number: INVOICE:\s+(\w{3}\d{1,8})
  sales_order: Sales Order:\s+(\w{2}\d+)
tables:
  - start: Line\s+Product\s+Description\s+Quantity
    end: Prices
    body: (?P<Line>^\d{2})\s+(?P<Product>\w{2}\-\w{2}\-\w+\-\w+)\s+(?P<Description>\w+\s\w+\s\w+\-\w+)\s+(?P<Quantity>\d+)
area:
  - name: "Address"
    area: {x: 115, y: 124, r: 300, W: 412, H: 326}
    regex: \w+
  - name: "Bill_to"
    area: {x: 930, y: 416, r: 300, W: 310, H: 294}

options:
  remove_whitespace: false
  currency: USD
  date_formats:
    - '%d/%m/%Y'
  languages:
    - en
  decimal_separator: '.'

Output:

{'issuer': 'Innomatiq', 'amount': 51500.0, 'date': datetime.datetime(2020, 3, 12, 0, 0), 'invoice_number': 'INV0005', 'Terms': 'Due On Receipt', 'currency': 'USD', 'DESCRIPTION': 'Laptops', 'RATE': '$1,500.00', 'QTY': '1', 'AMOUNT': '$1,500.00', 'Address': 'Invoice INV0005 \r\nInnomatiq\r\n\r\nSuite 2000\r\nPlano, TX\r\n75023\r\n\r\[email protected]', 'Bill_to': 'Dell 7\r\nRound Rock, Tx\r\n78664\r\n(800) 285-1653\r\[email protected]', 'desc': 'Invoice from Innomatiq'}

@rmilecki
Collaborator

rmilecki commented Nov 2, 2020

@kavinsharma: I think implementing this feature as a separate plugin is suboptimal and will cause maintenance problems in the long term. It adds another piece of code that handles regular expressions. Soon someone will ask for field types (integers, floats, dates). Later someone will ask for handling sums.

I think support for areas should be added to standard fields. Your code should handle extracting areas only and not care about the actual parsing. This should be easy to implement once we manage to polish and merge #308.

I'd suggest working on top of syntax like:

fields:
  foo:
    area: {x: 115,y: 124,r: 300,W: 412, H: 326}
    [parser details]
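
To make the suggested separation concrete, here is a minimal Python sketch in which the area only selects which text is searched, and the parsing step stays unaware of areas. Everything here is an assumption for illustration: parse_field, the crop_text callback, and the use of a regex key as the "parser details" are hypothetical and do not reflect the code in #308 or in this PR.

```python
import re
from typing import Callable, Dict, Optional


def parse_field(
    settings: dict,
    full_text: str,
    crop_text: Callable[[Dict[str, int]], str],
) -> Optional[str]:
    """Parse one field, optionally restricted to an area of the page."""
    # Step 1: area handling only decides which text gets searched.
    text = crop_text(settings["area"]) if "area" in settings else full_text

    # Step 2: parsing is identical with or without an area.
    match = re.search(settings["regex"], text)
    if match is None:
        return None
    # Return the first capture group if the pattern has one, else the whole match.
    return match.group(1) if match.groups() else match.group(0)


if __name__ == "__main__":
    full_text = "Invoice Date: 03/12/2020\nTOTAL: 51,500.00"

    def fake_crop(area: Dict[str, int]) -> str:
        # Stand-in for text extracted from the cropped region.
        return "Bill to: Dell"

    print(parse_field({"regex": r"TOTAL:\s+([\d,.]+)"}, full_text, fake_crop))
    print(parse_field(
        {"area": {"x": 930, "y": 416, "r": 300, "W": 310, "H": 294}, "regex": r"Bill to:\s+(\w+)"},
        full_text,
        fake_crop,
    ))
```

With this split, later additions such as field types or sums would only touch step 2 and would work for area-restricted fields without extra code.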

@kavinsharma
Author

@rmilecki thanks for the review, I totally agree with you. Let me make these changes and update the PR.

@RossK1
Contributor

RossK1 commented Mar 14, 2021

@kavinsharma, any update on reworking this as a parser so it can be merged? This is exactly what I'm looking for to extract addresses from invoices.
Thanks 😄

@erkin98

erkin98 commented Mar 16, 2021

any update?

@m3nu
Collaborator

m3nu commented Mar 16, 2021

Needs to pass tests and also a rebase.

@RossK1
Contributor

RossK1 commented Mar 16, 2021

@m3nu, do you still think this would be better as a parser, or is a plugin like lines or tables acceptable?

@m3nu
Collaborator

m3nu commented Mar 16, 2021

Makes sense as a plugin, which it already is.

@kavinsharma
Author

Hi @m3nu @RossK1,
if it makes sense as a plugin, I am fixing the conflicts and working on getting the tests to pass.

@RossK1
Contributor

RossK1 commented Mar 17, 2021

> Hi @m3nu @RossK1,
> if it makes sense as a plugin, i am fixing the conflicts and working on passing the tests

Hey @kavinsharma :) I'm currently working on this too :P
Adding it as an argument to the parser field. Will push the commit in a few minutes here. All credit goes to you for the idea and the base code!

@m3nu
Collaborator

m3nu commented Mar 17, 2021

Added some comments. I see you need to rely on existing modules. We don't really have that anywhere else. Will be better to try reusing what's already there.

@bosd
Collaborator

bosd commented Feb 1, 2022

This PR has been open for a while. What needs to be done to get this merged?

@BenjaminHoegh

Any news?

@bosd
Collaborator

bosd commented Jul 28, 2023

Area support has been added in #438.
This one can probably be closed.
