GROBID splits sentences, puts second half in a figure description #1160

Open
mariadelmarq opened this issue Aug 29, 2024 · 6 comments
Labels
error cases Some error/test case for future improvements

Comments

@mariadelmarq

Potential error case for the PDF file from https://link.springer.com/article/10.1007/s12144-016-9469-4. I'm not sure whether it is open access (i.e., whether it can be used for training).

The PDF looks like this:
image

Whereas GROBID appears to split the text inside this section:
image

and arbitrarily puts the second half into a figure description:
image

@lfoppiano added the "error cases" label (Some error/test case for future improvements) Sep 1, 2024
@lfoppiano
Collaborator

Hi @mariadelmarq, and thanks again for reporting the issue. This is a recurring issue in the fulltext model, and it is likely to be solved by #963.

@mariadelmarq
Author

Thanks heaps, @lfoppiano! Do you have a rough timeline for the next release? No pressure at all, of course, it's just a great package and I would like to know whether a new iteration will be out before the end of the project I'm working on, later this year. Thanks again for all your work on this!

@lfoppiano
Collaborator

Hi @mariadelmarq, we are currently working on releasing version 0.8.1 (#1123). We've been facing an issue with the JVM that requires processing a large number of PDF documents, and this is taking more time. The change I mentioned is going to come next year.

@vegarab

vegarab commented Sep 2, 2024

Hi @mariadelmarq, @lfoppiano, I've been facing similar issues this past week and was about to enquire myself.

I'm seeing very simple, plain PDFs (clean front page, pages that are typically just a subheader plus a paragraph, a clear bibliography in a standard format) being parsed incorrectly. Mostly, text disappears into figure descriptions, with sentences split in the middle.
The same happens with sentences being pulled into tables and full paragraphs being pulled into bibliography elements.

I mostly experience this with non-English PDFs, typically German.
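To collect more of these cases, a rough check over the TEI output can flag figure descriptions that look like the tail of a split sentence. This is only a sketch: the function name is made up for illustration, and the "starts with a lowercase letter" rule is an assumed heuristic, not something GROBID provides. It relies on the TEI namespace and on figure descriptions being stored in `figDesc` elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def suspicious_figure_descriptions(tei_xml):
    """Return figDesc texts that start with a lowercase letter.

    Assumed heuristic: a real caption normally starts with a capital
    letter or a label, so a lowercase start hints at a sentence whose
    second half was moved out of the body text.
    """
    root = ET.fromstring(tei_xml)
    flagged = []
    for desc in root.iter(f"{{{TEI_NS}}}figDesc"):
        # Flatten nested markup and collapse whitespace.
        text = " ".join("".join(desc.itertext()).split())
        if text and text[0].islower():
            flagged.append(text)
    return flagged
```

Anything it returns still needs a manual look against the PDF, since some genuine captions may also start in lowercase.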

> The change I mentioned is going to come next year.

Is there any way to contribute to speed up the work on this? I've found that GROBID is the best solution for full-text extraction from scholarly PDFs/documents. Or do you recommend any other way of extracting full texts that is less involved than the GROBID biblio and header extraction? I'm not looking for bibliography data or headers, just the clean paragraph-level text from the documents, without any metadata, footers, author info, etc.

Thanks
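For the paragraph-only use case above, one possible approach is to post-process the TEI XML that GROBID returns and keep only the body paragraphs. The sketch below assumes the usual TEI layout (`text/body` holding `div` and `p` elements, with figures and tables in `figure` elements); `body_paragraphs` is a hypothetical helper name. It does not repair the mis-segmentation itself: text that was wrongly moved into a figure description would be dropped along with the figure.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
P_TAG = f"{{{TEI_NS}}}p"
# Elements whose content should not count as body text (assumption).
SKIP_TAGS = {f"{{{TEI_NS}}}figure", f"{{{TEI_NS}}}note"}


def _collect(elem, out):
    """Walk the tree, skipping figures and notes, and gather <p> text."""
    for child in elem:
        if child.tag in SKIP_TAGS:
            continue
        if child.tag == P_TAG:
            # Flatten inline markup (refs, sentences) and collapse whitespace.
            text = " ".join("".join(child.itertext()).split())
            if text:
                out.append(text)
        else:
            _collect(child, out)


def body_paragraphs(tei_xml):
    """Return the body paragraphs of a TEI document, in document order."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI_NS}}}text/{{{TEI_NS}}}body")
    if body is None:
        return []
    paragraphs = []
    _collect(body, paragraphs)
    return paragraphs
```

Because only `text/body` is walked, the header and the bibliography in `back` are left out without any extra filtering.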

@lfoppiano
Collaborator

Hi @vegarab, I'm assuming you are dealing with scientific articles.
One solution would be to create additional training data for the GROBID models. This could help to improve the results. I did not see any German documents in the fulltext training data, so I think one or two could already improve the results.

Unfortunately, creating new training data can appear complicated at first. There are two steps: a) generate pre-annotated training data, and b) correct it following the guidelines. Refer to the documentation.

Since the GROBID models work in cascade, you will have to start from the segmentation model and work your way through the cascade. I explained this in another issue here.
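As a rough illustration of working through the cascade in order, the sketch below groups generated training files by model so that the segmentation files come first and the fulltext files second. It assumes that the pre-annotated files are named `<document>.training.<model>.tei.xml`; please check that against the actual output. `order_training_files` is a hypothetical helper, and only the "segmentation first, then fulltext" ordering comes from this thread.

```python
# Correction order assumed from the cascade: segmentation, then fulltext.
CASCADE_ORDER = ["segmentation", "fulltext"]
SUFFIX = ".tei.xml"
MARKER = ".training."


def order_training_files(filenames):
    """Group training file names by model, upstream models first.

    Assumes names of the form "<document>.training.<model>.tei.xml";
    anything else (e.g. raw feature files) is ignored.
    """
    groups = {}
    for name in sorted(filenames):
        if MARKER not in name or not name.endswith(SUFFIX):
            continue
        model = name.rsplit(MARKER, 1)[1][: -len(SUFFIX)]
        groups.setdefault(model, []).append(name)
    # Known cascade steps first, remaining models alphabetically after them.
    ordered = {m: groups.pop(m) for m in CASCADE_ORDER if m in groups}
    ordered.update(sorted(groups.items()))
    return ordered
```

Correcting the files group by group in this order keeps the corrections of downstream models consistent with the upstream ones.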

Unfortunately, I don't have time to work on the training data at the moment, but I can help you with the process if needed.

@lfoppiano
Collaborator

Adding additional cases here.
