Segment long text automatically and manage sliding context #140

kermitt2 · 2022-02-08T11:28:06Z

Long texts are now segmented automatically on the server side given a target size (default 1000 characters, ~150-200 words), using, in order of priority, end-of-line markers (double, then single), then paragraph boundaries. Segmentation is balanced so that segments have similar sizes.
A DocumentContext object is created after the first segment is processed and is refreshed sequentially as each subsequent segment is processed.
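
The following is a minimal Python sketch of the idea, not the code in this PR. It assumes a balanced cut strategy (choose the number of segments first, then look for the nearest boundary around each ideal cut point) and uses only double and single newlines as boundary markers. The names `segment_text`, `find_cut`, `process_long_text` and the `process_segment` callback are hypothetical and exist only for illustration.

```python
import math

# Boundary markers in decreasing priority: double newline, then single newline.
# Assumption: the actual implementation may use other or additional markers.
BOUNDARIES = ("\n\n", "\n")


def find_cut(text, goal, window, lower):
    """Return the cut offset closest to `goal`, preferring high-priority markers."""
    lo = max(lower, goal - window)
    hi = min(len(text) - 1, goal + window)
    for marker in BOUNDARIES:
        best = None
        pos = text.find(marker, lo)
        while pos != -1 and pos + len(marker) <= hi:
            cut = pos + len(marker)  # cut right after the marker
            if best is None or abs(cut - goal) < abs(best - goal):
                best = cut
            pos = text.find(marker, pos + 1)
        if best is not None:
            return best
    return goal  # no boundary near the goal: cut at the ideal offset


def segment_text(text, target_size=1000):
    """Return (start, end) character offsets of segments of similar sizes."""
    if len(text) <= target_size:
        return [(0, len(text))]
    # Balance: fix the number of segments, then aim for equal sizes.
    n_segments = math.ceil(len(text) / target_size)
    ideal = len(text) / n_segments
    window = int(ideal / 4)  # how far a cut may drift from its ideal offset
    spans, start = [], 0
    for i in range(1, n_segments):
        cut = find_cut(text, round(i * ideal), window, lower=start + 1)
        spans.append((start, cut))
        start = cut
    spans.append((start, len(text)))
    return spans


def process_long_text(text, process_segment, target_size=1000):
    """Process segments in order, carrying a context from one to the next.

    `process_segment(segment, context)` is a hypothetical callback returning
    (entities, updated_context), with entities as (start, end, label) tuples
    whose offsets are relative to the segment.
    """
    context = None  # created by the callback on the first segment
    results = []
    for start, end in segment_text(text, target_size):
        entities, context = process_segment(text[start:end], context)
        # Shift segment-relative offsets back to full-text offsets.
        results.extend((s + start, e + start, label) for s, e, label in entities)
    return results
```

In this sketch a segment can exceed the target size by up to the drift window, since the target is treated as a goal rather than a hard limit. Shifting offsets back to full-text positions is what keeps the returned annotations aligned with the original input.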

Advantages:

faster on large texts
better results, because the models are based on paragraph-sized input
more stable results and predictions, independent of the size of the input text (solves "Impact of text length on identified entities" #131)
removes the burden of segmenting text and managing a context on the client side

Tests added for the text segmentation. The default target segment size is 1000 characters, but it is also configurable in the query. Documentation updated. The edge/pathological case of sentences longer than 1000 characters is supported.
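
As an illustration of the pathological case, here is a hedged Python sketch of one way to split a single over-long sentence that contains no line or paragraph boundary. It assumes a whitespace fallback with balanced piece sizes; the function name `split_long_sentence` is hypothetical and the PR may handle this case differently.

```python
import math


def split_long_sentence(sentence, target_size=1000):
    """Split an over-long sentence into pieces of similar sizes.

    Assumes target_size is a positive integer. Cuts fall after the last
    space before each ideal offset so that words stay intact; if there is
    no space, the cut falls exactly at the ideal offset.
    """
    if len(sentence) <= target_size:
        return [sentence]
    n_pieces = math.ceil(len(sentence) / target_size)
    ideal = len(sentence) / n_pieces
    pieces, start = [], 0
    for i in range(1, n_pieces):
        goal = round(i * ideal)
        space = sentence.rfind(" ", start, goal)
        cut = goal if space == -1 else space + 1
        pieces.append(sentence[start:cut])
        start = cut
    pieces.append(sentence[start:])
    return pieces
```

Concatenating the returned pieces gives back the original sentence, so character offsets computed on a piece can be mapped to the full text by adding the lengths of the preceding pieces.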

kermitt2 added 9 commits February 6, 2022 17:18

add server-side management of long text, with segmentation and sliding context

f228767

fix sentence offsets when segmenting long text

eff973c

update gradle dependencies

cb532c2

fix wrong query

a42b625

add configurable parameters in the query

05e5c42

add full document propagation (for long text); update doc

c8778a8

add Arabic media wiki page parser

ca7a797

segmentation of pathologically long sentences; update tests

b64dcd2

update version

f846754

kermitt2 merged commit 1a1f29c into master Feb 14, 2022
