Fix affiliation missing when using DL affiliation-address model #1166

Merged
merged 4 commits into from
Sep 18, 2024

Conversation

lfoppiano
Collaborator

@lfoppiano lfoppiano commented Sep 17, 2024

This PR proposes a fix for affiliations that are lost when they are processed with a DL model.

The issue seems to be in the method getAffiliationBlocksFromSegments(), where new \n entries are added. In general they should be added only when there is a misalignment, but one is always added at the beginning.

protected static List<String> getAffiliationBlocksFromSegments(List<List<LayoutToken>> tokenizations) {

I patched it quickly by checking that end is not zero. However, this \n does not work well with the DL models, in contrast to the CRF models, which ignore it.

I've left two tests that show the problem with both the CRF and the DL models:

The DL test is still failing, as I'm not really sure where to fix the issue.

Once this is fixed, we will need to rebuild the grobid-full image.

@lfoppiano
Collaborator Author

After a few iterations, I think I understand the principle, which is to separate blocks of affiliations wherever there is a gap between their offsets. My fix just avoids adding \n at the beginning. The \n helps to separate the blocks and, with the DL models, to process the blocks in parallel, among other things.
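
To make the principle above concrete, here is a minimal, self-contained sketch of it. This is not the GROBID code: the `Token` class is a hypothetical stand-in for `LayoutToken`, `blocksFromSegments` and its `maxGap` threshold are names and values assumed for illustration, and the real method also decorates each token before returning it. The sketch only shows how a separator is inserted on an offset gap and why the `end > 0` guard keeps it from appearing first.

```java
import java.util.ArrayList;
import java.util.List;

final class AffiliationBlockSketch {

    /** Hypothetical stand-in for a layout token: its text and character offset. */
    static final class Token {
        final String text;
        final int offset;

        Token(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    /**
     * Flattens token segments into one list, inserting "\n" between segments
     * separated by an offset gap larger than maxGap (an assumed threshold).
     */
    static List<String> blocksFromSegments(List<List<Token>> segments, int maxGap) {
        List<String> blocks = new ArrayList<>();
        int end = 0; // end offset of the last token seen; 0 means nothing seen yet

        for (List<Token> segment : segments) {
            if (segment == null || segment.isEmpty()) {
                continue;
            }
            int start = segment.get(0).offset;
            // Guard in the spirit of this PR: without "end > 0", a first segment
            // starting beyond maxGap would put "\n" at the very beginning.
            if (end > 0 && start - end > maxGap) {
                blocks.add("\n");
            }
            for (Token token : segment) {
                if (token.text.trim().isEmpty()) {
                    continue; // skip whitespace-only tokens
                }
                blocks.add(token.text);
                end = token.offset + token.text.length();
            }
        }
        return blocks;
    }
}
```

With two segments that start at offsets 100 and 200, the sketch yields the first segment's tokens, one "\n", then the second segment's tokens, with no separator in front of the first block.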

@lfoppiano
Collaborator Author

@kermitt2 I tried to fix this in a bit of a rush, at least to mitigate the issue in the Docker image. Sorry, I might need a quick review from your side.

I've pushed this fix to the branch 0.8.1-fixes (which branches from the tag 0.8.1), and I've pushed an updated Docker image, lfoppiano/grobid:0.8.1-full, which should at least mitigate this issue. It's deployed here.

@lfoppiano lfoppiano changed the title Fix affiliation missing for DL models Fix affiliation missing when using DL affiliation-address model Sep 17, 2024
@kermitt2
Owner

Hi @lfoppiano, the fix works fine, no problem. It is surprising that the starting "\n" has such an effect on the DL processing. There's nothing else to change; the segmentation then proceeds normally, including parallel processing. I changed this part last December and it seems I only tested with the CRF model :)
Unfortunately, the end-to-end benchmarks do not cover affiliations. The Docker image and the Hugging Face demo have also been updated for the grobid account.

@lfoppiano lfoppiano merged commit f501033 into master Sep 18, 2024
5 of 7 checks passed
@lfoppiano
Collaborator Author

Thanks!

@lfoppiano lfoppiano deleted the fix-affiliation-dl branch September 18, 2024 18:33