Clipping paths implementation #414

kelvin0 · 2020-04-10T14:53:37Z

Hi Everyone,

I've been using Pdfminer for the last few months, and I really think it's a very helpful codebase.

But recently I noticed that clipping paths do not seem to be implemented. I inspected:
\pdfminer\pdfinterp.py

# clip
def do_W(self):
    return

# clip-even-odd
def do_W_a(self):
    return

The effect of this is that ALL text is extracted from the PDF, even text that should not be visible (since it should be clipped).

I am not a PDF expert but I can surely help implement the following features:

Implement do_W and do_W_a
Add an option (laparams?) to force text extraction regardless of clipping paths (as occurs today)
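
A rough sketch of what the two operators could record, written as a standalone toy class rather than against pdfminer's real interpreter. In the PDF model, `W` and `W*` only mark the current path for clipping; the clip takes effect at the next path-painting operator (often `n`). Every name here (`ToyInterpreter`, `ignore_clipping`, `pending_clip_rule`, `clips`) is hypothetical, and only rectangles and the `n` operator are handled.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height), as given to `re`


class ToyInterpreter:
    """Hypothetical stand-in for an interpreter; not pdfminer's actual class."""

    def __init__(self, ignore_clipping: bool = False) -> None:
        self.ignore_clipping = ignore_clipping  # hypothetical opt-out flag
        self.curpath: List[Rect] = []  # rectangles in the current path
        self.pending_clip_rule: Optional[str] = None
        # clips applied so far; the effective clip is their intersection
        self.clips: List[Tuple[str, List[Rect]]] = []

    def do_re(self, x: float, y: float, w: float, h: float) -> None:
        # append a rectangle to the current path
        self.curpath.append((x, y, w, h))

    def do_W(self) -> None:
        # clip, nonzero winding rule: only marks the path for clipping
        self.pending_clip_rule = "nonzero"

    def do_W_a(self) -> None:
        # clip-even-odd: same, with the even-odd rule
        self.pending_clip_rule = "evenodd"

    def do_n(self) -> None:
        # end the path without painting; a pending clip takes effect here
        if self.pending_clip_rule is not None and not self.ignore_clipping:
            self.clips.append((self.pending_clip_rule, list(self.curpath)))
        self.pending_clip_rule = None
        self.curpath = []
```

With `ignore_clipping=True` the toy class never records a clip, which corresponds to the current extract-everything behaviour.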

Hope I can clarify this and be able to contribute to the project if necessary.

pietermarsman · 2020-05-09T12:54:48Z

Hi @kelvin0, are you experiencing problems due to this issue? I assume that the clipping operator is used more often to exclude parts of a drawing than to exclude parts of the text. Anyway, it would be nice to have a pdf to test this on.

If you want to start implementing this, have a look at section 4.4.3 of the pdf reference manual.

You should also adjust the PDFGraphicsState class. I think it is wise to assess the impact that adding the clipping path to PDFGraphicsState could have on all the other graphics-state aware operators.
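
To illustrate why the clipping path would live in the graphics state: `q` and `Q` save and restore the whole state, so a clip set inside a `q ... Q` pair has to disappear again on `Q`. The following is a minimal standalone sketch; `ToyGraphicsState`, its `clip` field, and `ToyStateStack` are hypothetical and do not mirror the attributes of the real PDFGraphicsState.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class ToyGraphicsState:
    linewidth: float = 0.0
    clip: Optional[BBox] = None  # hypothetical field; None means "no clipping"


class ToyStateStack:
    def __init__(self) -> None:
        self.gs = ToyGraphicsState()
        self._stack: List[ToyGraphicsState] = []

    def do_q(self) -> None:
        # save a copy so later changes do not leak into the saved state
        self._stack.append(copy.copy(self.gs))

    def do_Q(self) -> None:
        # restore the saved state; an unbalanced Q is silently ignored
        if self._stack:
            self.gs = self._stack.pop()
```

Any operator that reads the graphics state between `q` and `Q` would then see the clip, which is the impact worth assessing.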

Belval · 2020-05-15T13:52:41Z

Clipping path is indeed used to hide text in PDF documents, here is an example that could be used as a starting point: https://mva.maryland.gov/Documents/VR-181.pdf

There is hidden text slightly above the "VR-181 (03-18)". I was able to extract it properly with pdfbox, but not with pdfminer as path clipping is not supported.

pietermarsman · 2020-05-16T14:30:56Z

Feel free to create a PR. I can do reviews and merge it when ready.

I don't mind if the first implementation only focusses on adding clipping-path behaviour and ignoring additional top-level arguments for enabling/disabling the behavior. We can create another issue for that, if needed.

jstockwin · 2020-07-09T14:16:34Z

@kelvin0 Just a quick bump on this issue as we're trying to sort through them. Are you still willing to work on this? As commented above, a PR would be appreciated if you're still interested and able to.

dhdaines · 2024-07-31T13:44:58Z

Hi! I just ran into this issue as well. Using the clipping path to hide text seems fairly common in legal documents specifically (academic documents often use the more prosaic method of setting the text colour to white). You can see this pretty clearly in https://www.legisquebec.gouv.qc.ca/fr/pdf/lc/C-1.pdf: on the first page (PDF object 5) there is a bunch of hidden text. The way the formatter in question (Antenna House 6.3) renders text is somewhat annoying to follow, but it appears that it simply sets the clipping path to something arbitrary which excludes the text in question. For example, at the top of page 1, the hidden text "CADASTRE":

% flip the transformation matrix so that 0, 0 is at the top of the page, then translate to
% set the margins, or something like that
q 1 0 0 -1 0 792 cm q 1 0 0 1 72 42.51968 cm 1 0 0 1 0 85.03937 cm 0 0 0 rg
% create a rectangle of height and width 0 and intersect with the clipping path using the
% even-odd winding rule (for no apparent reason) then move to its location
q 0 47.47323 0 0 re W* n 1 0 0 1 0 47.47323 cm
% render some text (that will not get rendered) using gratuitously arbitrary cid mapping
BT /F0 11 Tf 1 0 0 -1 -0.00001 8.88061 Tm<0026> Tj 0 -11.52 Td<0024> Tj 0 -11.52
Td<0027> Tj 0 -11.52 Td<0024> Tj 0 -11.52 Td<0036> Tj 0 -11.52 Td<0037> Tj 0 -11.52
Td<0035> Tj 0 -11.52 Td<0028> Tj ET
% restore normal graphics state
Q
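
As a sanity check on the fragment above, the eight hex strings can be pulled out and mapped back to "CADASTRE". The fixed offset of 29 between these CIDs and ASCII is only an observation about this particular fragment, not something to rely on in general; a real extractor would use the font's ToUnicode map.

```python
import re

# the text object from the fragment above, joined onto one logical line
stream = (
    "BT /F0 11 Tf 1 0 0 -1 -0.00001 8.88061 Tm<0026> Tj 0 -11.52 Td<0024> Tj "
    "0 -11.52 Td<0027> Tj 0 -11.52 Td<0024> Tj 0 -11.52 Td<0036> Tj "
    "0 -11.52 Td<0037> Tj 0 -11.52 Td<0035> Tj 0 -11.52 Td<0028> Tj ET"
)

# each <xxxx> hex string shown by a Tj operator is one 2-byte CID here
cids = [int(h, 16) for h in re.findall(r"<([0-9A-Fa-f]{4})>\s*Tj", stream)]

# assumption: in this fragment each CID equals the ASCII code minus 29
text = "".join(chr(cid + 29) for cid in cids)
print(text)  # CADASTRE
```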

Implementing the winding rules seems rather complicated, though there are plenty of implementations out there that can serve as a reference.
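
For reference, here is a minimal sketch of both rules as a point-in-path test for paths made only of straight segments. It is not taken from any particular implementation; curves would need to be flattened into line segments first, points exactly on an edge are not treated specially, and the names `inside`, `subpaths` and `rule` are made up for this example.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def inside(pt: Point, subpaths: Sequence[Sequence[Point]], rule: str = "nonzero") -> bool:
    """Test pt against closed polygonal subpaths using a ray towards +x."""
    x, y = pt
    winding = 0    # signed crossings, used by the nonzero rule (W)
    crossings = 0  # unsigned crossings, used by the even-odd rule (W*)
    for poly in subpaths:
        n = len(poly)
        for i in range(n):
            x0, y0 = poly[i]
            x1, y1 = poly[(i + 1) % n]  # implicitly close the subpath
            # cross > 0: pt lies to the left of the directed edge
            cross = (x1 - x0) * (y - y0) - (x - x0) * (y1 - y0)
            if y0 <= y < y1 and cross > 0:    # edge goes up, right of pt
                winding += 1
                crossings += 1
            elif y1 <= y < y0 and cross < 0:  # edge goes down, right of pt
                winding -= 1
                crossings += 1
    if rule == "evenodd":
        return crossings % 2 == 1
    return winding != 0


# two nested squares drawn in the same direction
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]
print(inside((5, 5), [outer, inner], "nonzero"))  # True
print(inside((5, 5), [outer, inner], "evenodd"))  # False
```

The two rules only disagree where subpaths overlap or self-intersect, as in the nested-squares case; for a single simple rectangle they give the same answer.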

dhdaines · 2024-07-31T14:49:39Z

> Clipping path is indeed used to hide text in PDF documents, here is an example that could be used as a starting point: https://mva.maryland.gov/Documents/VR-181.pdf
>
> There is hidden text slightly above the "VR-181 (03-18)". I was able to extract it properly with pdfbox, but not with pdfminer as path clipping is not supported.

Thanks for the example! In this case the clipping path is a simple rectangle and all the hidden text is placed outside that rectangle.

My first idea is to make a PR that minimally supports these two examples by deriving a visible rectangle from the clipping path and intersecting it with the bbox of characters when they are added to the layout - at that point the converter (or another library like pdfplumber) can call is_empty() on them to decide if they should be shown or not.
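
The rectangle idea above can be sketched roughly as follows, assuming the clipping path has already been reduced to a single axis-aligned box in the same coordinate space as the character bboxes. `intersect` and `is_visible` are hypothetical helpers, not existing pdfminer functions.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersect(a: BBox, b: BBox) -> Optional[BBox]:
    """Return the overlap of two boxes, or None if they share no area."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)


def is_visible(char_bbox: BBox, clip: Optional[BBox]) -> bool:
    """A character counts as visible if any part of it lies inside the clip."""
    if clip is None:  # no clipping path has been set
        return True
    return intersect(char_bbox, clip) is not None
```

A zero-width, zero-height clip rectangle like the one in the Antenna House example never overlaps anything, so every character drawn under it would be reported as hidden.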

Edit: That seems like not such a great idea, actually, since objects that are clipped out are not in the layout by definition. If you want to get at them you could use the interpreter directly.

pietermarsman added the type: new feature label Jul 10, 2020

pietermarsman added the status: accepted label Aug 7, 2022

cmdlineluser mentioned this issue Jul 21, 2023

Incorrect extraction in tables with overlapping columns jsvine/pdfplumber#912

Open

dhdaines linked a pull request Jul 31, 2024 that will close this issue

Very approximate support for hiding text using clipping path #1026

Open
