Skip to content

Commit

Permalink
Merge branch 'master' into feature/add-styles-xml
Browse files Browse the repository at this point in the history
  • Loading branch information
lfoppiano committed Dec 17, 2023
2 parents 9adb8d8 + 6bd974d commit a25de0d
Show file tree
Hide file tree
Showing 1,465 changed files with 477,979 additions and 2,073 deletions.
57 changes: 0 additions & 57 deletions .circleci/config.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/general-report.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,6 @@ assignees: ''

- What is your OS and architecture? Windows is not supported and Mac OS arm64 is not yet supported. For non-supported OS, you can use Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/)

- What is your Java version (`java --version`)? Expected versions are currently Java 8, 9 and 10.
- What is your Java version (`java --version`)?

- In case of build or run errors, please submit the error while running gradlew with ``--stacktrace`` and ``--info`` for better log traces (e.g. `./gradlew run --stacktrace --info`) or attach the log file `logs/grobid-service.log`.
32 changes: 32 additions & 0 deletions .github/workflows/ci-build-unstable.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
name: Build unstable

on: [ push ]

concurrency:
group: gradle
cancel-in-progress: true


jobs:
build:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v1
- name: Set up JDK 17
uses: actions/setup-java@v1
with:
java-version: 1.17
- name: Build with Gradle
run: ./gradlew clean assemble --info --stacktrace --no-daemon

- name: Test with Gradle Jacoco and Coveralls
run: ./gradlew test jacocoTestReport coveralls --no-daemon

- name: Coveralls GitHub Action
uses: coverallsapp/github-action@v2
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
format: jacoco
file: build/reports/jacoco/codeCoverageReport/codeCoverageReport.xml
debug: true
28 changes: 28 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,34 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.8.0] - 2023-11-19

### Added

+ Extraction of funder and funding information with a specific new model, see https://github.com/kermitt2/grobid/pull/1046 for details
+ Optional consolidation of funder with CrossRef Funder Registry
+ Identification of acknowledged entities in the acknowledgement section
+ Optional coordinates in title elements

### Changed

+ Dropwizard upgrade to 4.0
+ Minimum JDK/JVM requirement for building/running the project is now 1.11
+ Logging now with Logback, removal of Log4j2, optional logs in json format
+ General review of logs
+ Enable Github actions / Disable circleci

### Fixed

+ Set dynamic memory limit in pdfalto_server #1038
+ Logging in files when training models work now as expected
+ Various dependency upgrades
+ Fix #1051 with possible problematic PDF
+ Fix #1036 for pdfalto memory limit
+ fix readthedocs build #1040
+ fix for null equation #1030
+ Other minor fixes

## [0.7.3] – 2023-05-13

### Added
Expand Down
1 change: 1 addition & 0 deletions Dockerfile.crf
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ COPY grobid-trainer/ ./grobid-trainer/
RUN rm -rf grobid-home/pdf2xml
RUN rm -rf grobid-home/pdfalto/lin-32
RUN rm -rf grobid-home/pdfalto/mac-64
RUN rm -rf grobid-home/pdfalto/mac_arm-64
RUN rm -rf grobid-home/pdfalto/win-*
RUN rm -rf grobid-home/lib/lin-32
RUN rm -rf grobid-home/lib/win-*
Expand Down
9 changes: 5 additions & 4 deletions Dockerfile.delft
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,14 @@

## See https://grobid.readthedocs.io/en/latest/Grobid-docker/

## usage example with version 0.7.3:
## docker build -t grobid/grobid:0.7.3 --build-arg GROBID_VERSION=0.7.3 --file Dockerfile.delft .
## usage example with version 0.8.0:
## docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.delft .

## no GPU:
## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3
## docker run -t --rm --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0

## allocate all available GPUs (only Linux with proper nvidia driver installed on host machine):
## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.7.3
## docker run --rm --gpus all --init -p 8070:8070 -p 8071:8071 -v /home/lopez/grobid/grobid-home/config/grobid.properties:/opt/grobid/grobid-home/config/grobid.properties:ro grobid/grobid:0.8.0

# -------------------
# build builder image
Expand Down Expand Up @@ -41,6 +41,7 @@ COPY grobid-trainer/ ./grobid-trainer/
RUN rm -rf grobid-home/pdf2xml
RUN rm -rf grobid-home/pdfalto/lin-32
RUN rm -rf grobid-home/pdfalto/mac-64
RUN rm -rf grobid-home/pdfalto/mac_arm-64
RUN rm -rf grobid-home/pdfalto/win-*
RUN rm -rf grobid-home/lib/lin-32
RUN rm -rf grobid-home/lib/win-*
Expand Down
6 changes: 3 additions & 3 deletions Readme.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
# GROBID

[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)
[![CircleCI](https://circleci.com/gh/kermitt2/grobid.svg?style=svg)](https://circleci.com/gh/kermitt2/grobid)
[![Coverage Status](https://coveralls.io/repos/kermitt2/grobid/badge.svg)](https://coveralls.io/r/kermitt2/grobid)
[![Documentation Status](https://readthedocs.org/projects/grobid/badge/?version=latest)](https://readthedocs.org/projects/grobid/?badge=latest)
[![GitHub release](https://img.shields.io/github/release/kermitt2/grobid.svg)](https://github.com/kermitt2/grobid/releases/)
Expand All @@ -18,21 +17,22 @@ Visit the [GROBID documentation](https://grobid.readthedocs.io) for more detaile

GROBID (or Grobid, but not GroBid nor GroBiD) means **G**ene**R**ation **O**f **BI**bliographic **D**ata.

GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications. First developments started in 2008 as a hobby. In 2011 the tool has been made available in open source. Work on GROBID has been steady as a side project since the beginning and is expected to continue as such.
GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications. First developments started in 2008 as a hobby, following a suggestion by Laurent Romary (Inria, France). In 2011, the tool has been made available in open source. Work on GROBID has been steady as a side project since the beginning and is expected to continue as such, facilitated in particular to the continuous support of Inria.

The following functionalities are available:

- __Header extraction and parsing__ from article in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
- __References extraction and parsing__ from articles in PDF format, around .87 F1-score against on an independent PubMed Central set of 1943 PDF containing 90,125 references, and around .90 on a similar bioRxiv set of 2000 PDF (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).
- __Citation contexts recognition and resolution__ of the full bibliographical references of the article. The accuracy of citation contexts resolution is between .76 and .91 F1-score depending on the evaluation collection (this corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).
- __Full text extraction and structuring__ from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, data availability statements, etc.).
- __PDF coordinates__ for extracted information, allowing to create "augmented" interactive PDF based on bounding boxes of the identified structures.
- Parsing of __references in isolation__ (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model).
- __Parsing of names__ (e.g. person title, forenames, middle name, etc.), in particular author names in header, and author names in references (two distinct models).
- __Parsing of affiliation and address__ blocks.
- __Parsing of dates__, ISO normalized day, month, year.
- __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
- __Extraction and parsing of patent and non-patent references in patent__ publications.
- __Extraction of Funders and funding information__ with optional matching of extracted funders with the CrossRef Funder Registry.

In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).

Expand Down
Loading

0 comments on commit a25de0d

Please sign in to comment.