GROBID splits sentences, puts second half in a figure description #1160

mariadelmarq · 2024-08-29T07:54:48Z

Potential error case; I'm not sure whether the article is open access (i.e., whether it can be used for training). The PDF file is from: https://link.springer.com/article/10.1007/s12144-016-9469-4.

The PDF looks like this:

Whereas GROBID appears to split the text inside this section:

and arbitrarily puts the second half into a figure description:

lfoppiano · 2024-09-01T19:52:17Z

Hi @mariadelmarq, and thanks again for reporting the issue. This is a recurring issue in the fulltext, and it is likely going to be solved by #963.

mariadelmarq · 2024-09-01T23:43:04Z

Thanks heaps, @lfoppiano! Do you have a rough timeline for the next release? No pressure at all, of course, it's just a great package and I would like to know whether a new iteration will be out before the end of the project I'm working on, later this year. Thanks again for all your work on this!

lfoppiano · 2024-09-02T04:04:45Z

Hi @mariadelmarq, we are currently working on releasing version 0.8.1 (#1123). We've been facing an issue with the JVM that requires processing a large number of PDF documents, and this is taking more time. As for the change I mentioned, it is going to be next year.

vegarab · 2024-09-02T09:30:52Z

Hi @mariadelmarq, @lfoppiano, I've been facing similar issues this past week and was about to enquire myself.

I'm seeing very simple, plain PDFs (clean front page, pages that are typically just a subheader plus a paragraph, a clear bibliography in a standard format) being parsed incorrectly. Mostly, text disappears into figure descriptions, with sentences split in the middle.
The same happens with sentences being pulled into tables and full paragraphs being pulled into bibliography elements.

I mostly experience this with non-English PDFs, typically German.

> As for the change I mentioned, it is going to be next year.

Is there any way to contribute to speed up the work on this? I've found that GROBID is the best solution for full-text extraction from scholarly PDFs/documents. Or do you recommend any other way of extracting fulltexts that is less involved than the GROBID biblio and header extraction? Not looking for bibliography data or headers, just the clean paragraph-level text from the documents, removing any metadata, footers, author info, etc. etc.
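To make the paragraph-level goal and the symptom concrete, here is a minimal sketch, assuming the TEI XML that GROBID returns for a full-text document is already available as a string. The function names (`body_paragraphs`, `figure_descriptions`, `likely_truncated`) are hypothetical helpers, not part of GROBID, and the punctuation check is only a rough heuristic for spotting paragraphs whose tail may have ended up in a figure description.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}


def _text(elem):
    """Return the element's text content with whitespace normalised."""
    return " ".join("".join(elem.itertext()).split())


def _walk(elem, skip_tags):
    """Yield descendants of elem, not descending into skipped tags."""
    for child in elem:
        if child.tag in skip_tags:
            continue
        yield child
        yield from _walk(child, skip_tags)


def body_paragraphs(tei_xml):
    """Paragraph texts from <body>, ignoring anything inside <figure>."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", NS)
    if body is None:
        return []
    skip = {f"{{{TEI}}}figure"}
    return [_text(e) for e in _walk(body, skip) if e.tag == f"{{{TEI}}}p"]


def figure_descriptions(tei_xml):
    """Texts of all <figDesc> elements, for manual inspection."""
    root = ET.fromstring(tei_xml)
    return [_text(fd) for fd in root.iterfind(".//tei:figure/tei:figDesc", NS)]


def likely_truncated(paragraph):
    """Heuristic (assumption): no terminal punctuation suggests a split."""
    return not paragraph.rstrip().endswith((".", "!", "?", ":", ";"))
```

Paragraphs flagged by `likely_truncated` can then be compared by hand against the figure descriptions to find cases like the one reported here. The heuristic will also flag legitimate paragraphs that end without punctuation, so it is only a starting point for review.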

Thanks

lfoppiano · 2024-09-12T12:22:04Z

Hi @vegarab, I'm assuming you are dealing with scientific articles.
One solution would be to create additional training data for the GROBID models. This could help to improve the results. I did not see any German documents in the fulltext training data, so I think even one or two could already improve the results.

Unfortunately, creating new training data can appear complicated at first. There are two steps: a) generate pre-annotated training data, and b) correct it following the guidelines. Refer to the documentation.

Since the GROBID models work in cascade, you will have to start from the segmentation model and work your way through the models after it. I explained this in another issue here.

Unfortunately, I don't have time to work on the training data at the moment, but I can help you with the process if needed.
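For step (a), a minimal sketch of how the pre-annotated files could be generated with GROBID's batch mode. It assumes a locally built GROBID with a "onejar" file and a `grobid-home` directory; the helper name `build_create_training_command` and all paths are hypothetical placeholders, and the exact jar name and any extra JVM options depend on the installed version, so check the batch documentation before running it.

```python
def build_create_training_command(jar_path, grobid_home, pdf_dir, out_dir,
                                  max_heap="4G"):
    """Build the argument list for GROBID's batch 'createTraining' run.

    All paths are placeholders supplied by the caller (assumptions):
      jar_path    - the grobid-core "onejar" file of your GROBID build
      grobid_home - the grobid-home directory
      pdf_dir     - directory containing the PDFs to pre-annotate
      out_dir     - directory that receives the generated training files
    """
    return [
        "java", f"-Xmx{max_heap}",
        "-jar", str(jar_path),
        "-gH", str(grobid_home),    # location of grobid-home
        "-dIn", str(pdf_dir),       # input directory with PDFs
        "-dOut", str(out_dir),      # output directory
        "-exe", "createTraining",   # generate pre-annotated training data
    ]


# Example (not executed here): run it with
#   subprocess.run(build_create_training_command(...), check=True)
```

The generated files then need manual correction (step b), starting with the segmentation model's files because of the cascade.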

lfoppiano · 2024-09-16T12:06:31Z

Adding additional cases here.

lfoppiano added the "error cases" label (Some error/test case for future improvements) on Sep 1, 2024
