Fix missing coordinates in paragraphs continuation #1076

Merged: kermitt2 merged 2 commits into master from bugfix/paragraph-coords on Jan 21, 2024


Collaborator

@lfoppiano lfoppiano commented Jan 18, 2024

When a paragraph continues after an interruption (e.g., a reference callout), the coordinates of the continuation are lost:

[image: before the fix]

This PR fixes the issue:

[image: after the fix]
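
The continuation problem described above can be pictured with a minimal sketch. This is not GROBID's implementation: the `Chunk` class and the `paragraph_coords` function are hypothetical names, and the `page,x,y,width,height` box strings joined by `;` are assumed here for illustration. Whether callout boxes themselves belong in the paragraph coordinates is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical piece of a paragraph: plain text or a reference callout."""
    text: str
    boxes: list[str]  # each box written as "page,x,y,width,height" (assumed format)
    is_callout: bool = False


def paragraph_coords(chunks: list[Chunk]) -> str:
    """Join the boxes of every chunk, including the text that follows a callout."""
    boxes: list[str] = []
    for chunk in chunks:
        # Do not stop at the first callout: the text after it is still
        # part of the same paragraph and must contribute its boxes.
        boxes.extend(chunk.boxes)
    return ";".join(boxes)


# Made-up example: a paragraph interrupted by a reference callout.
chunks = [
    Chunk("As shown previously ", ["1,50.0,100.0,200.0,10.0"]),
    Chunk("[12]", ["1,250.0,100.0,15.0,10.0"], is_callout=True),
    Chunk(", the effect persists.", ["1,50.0,112.0,180.0,10.0"]),
]
coords = paragraph_coords(chunks)
```

The point of the sketch is only that the boxes of the last chunk, the one after the callout, end up in the paragraph's coordinate string.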

This PR also makes a small change to the frontend so that paragraph coordinates are extracted when "add coordinates" is selected and "segment sentence" is not.
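
The frontend rule described above can be sketched as a small decision function. This is an illustration rather than the actual frontend code: the function names are hypothetical, and the sketch assumes the service takes the coordinate request as repeated `teiCoordinates` form fields, with `p` for paragraphs and `s` for sentences, alongside a `segmentSentences` flag. It covers only the paragraph/sentence choice, not any other elements the frontend may request.

```python
def coordinate_elements(add_coordinates: bool, segment_sentences: bool) -> list[str]:
    """Pick which text elements should carry coordinates (hypothetical helper)."""
    if not add_coordinates:
        return []
    # With sentence segmentation on, coordinates go on sentences;
    # otherwise they go on paragraphs, which is the case this PR adds.
    return ["s"] if segment_sentences else ["p"]


def build_form(add_coordinates: bool, segment_sentences: bool) -> list[tuple[str, str]]:
    """Build the form fields for a request (field names are assumptions)."""
    form: list[tuple[str, str]] = []
    if segment_sentences:
        form.append(("segmentSentences", "1"))
    for element in coordinate_elements(add_coordinates, segment_sentences):
        form.append(("teiCoordinates", element))
    return form


# "add coordinates" selected, "segment sentence" not selected
paragraph_form = build_form(add_coordinates=True, segment_sentences=False)
```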

@coveralls

Coverage Status

coverage: 39.906% (+0.01%) from 39.892%
when pulling 0d7913d on bugfix/paragraph-coords
into cbc77d5 on master.

@kermitt2
Owner

I didn't see the problem in the previous PR, sorry!

@kermitt2 kermitt2 merged commit e14ce33 into master Jan 21, 2024
9 checks passed
@lfoppiano lfoppiano deleted the bugfix/paragraph-coords branch January 22, 2024 01:37
@lfoppiano
Collaborator Author

lfoppiano commented Jan 22, 2024

Neither did I when I was developing it. The structure viewer app (https://structure-vision.streamlit.app/) is quite helpful in validating the stream order of PDF extraction.

@lfoppiano lfoppiano added this to the 0.8.1 milestone Jun 9, 2024
@Darrshan-Sankar

Darrshan-Sankar commented Sep 16, 2024

It seems a few paragraphs are still not annotated. I checked with a PDF. Are there any fixes?

@lfoppiano
Collaborator Author

Could you please share an example?

@Darrshan-Sankar

Please check with these articles:

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10510434 (article page 9)
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10042417 (article page 8)

The same happens in many more articles.

@lfoppiano
Collaborator Author

Those issues are not related to this PR. Here the problem is that part of the text is misclassified as a figure.

I've referenced your comment in a separate issue. This will likely be solved, or at least mitigated, by #963 (WIP).

@Darrshan-Sankar

@lfoppiano Thanks, looking forward to the fix.

4 participants