Segment long text automatically and manage sliding context #140

Merged 9 commits on Feb 14, 2022

Conversation

kermitt2
Owner

@kermitt2 kermitt2 commented Feb 8, 2022

  • Long texts are segmented automatically on the server side given a target size (default 1000 characters, ~150-200 words), using, in order of priority, end-of-line breaks (double, then single), then paragraph boundaries. Segmentation is balanced so that segments have similar sizes.
  • A DocumentContext object is created after processing the first segment and refreshed sequentially as each following segment is processed.
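
The two points above can be sketched as follows. This is a minimal illustration in Python, not the code merged in this PR: the names `segment_text`, `find_cut`, `process_with_context` and `process_segment` are hypothetical, the search window of a quarter segment is an arbitrary choice, and the whitespace and hard-cut fallbacks are assumptions added to cover text without usable line breaks.

```python
import math
import re

# Boundary patterns in decreasing priority: double end-of-line, single
# end-of-line, then any whitespace (the last one is an assumed fallback).
PRIORITIES = [re.compile(r"\n\s*\n"), re.compile(r"\n"), re.compile(r"\s+")]


def find_cut(text, pos, lo, hi):
    """Return the boundary closest to pos within [lo, hi], by priority."""
    for pattern in PRIORITIES:
        ends = [m.end() for m in pattern.finditer(text, lo, hi)]
        if ends:
            return min(ends, key=lambda e: abs(e - pos))
    return pos  # no boundary at all: hard cut (e.g. one very long token)


def segment_text(text, target=1000):
    """Split text into segments of similar size, close to target characters."""
    if len(text) <= target:
        return [text]
    n = math.ceil(len(text) / target)  # number of segments
    ideal = len(text) / n              # balanced segment length
    window = int(ideal / 4)            # how far a cut may drift from ideal
    cuts = [0]
    for k in range(1, n):
        pos = max(round(k * ideal), cuts[-1] + 1)
        lo = max(cuts[-1] + 1, pos - window)
        hi = min(len(text) - 1, pos + window)
        cuts.append(find_cut(text, pos, lo, hi))
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def process_with_context(segments, process_segment):
    """Process segments in order, threading a context through the calls.

    process_segment is a hypothetical callable taking (segment, context)
    and returning (result, refreshed_context); context is None at first.
    """
    context = None
    results = []
    for segment in segments:
        result, context = process_segment(segment, context)
        results.append(result)
    return results
```

In this sketch the target is a soft goal rather than a hard limit: a segment can be longer than `target` when the nearest preferred boundary lies past the ideal cut position. Concatenating the returned segments gives back the original text unchanged.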

Advantages:

  • faster with large texts
  • better results, because the models are built around paragraph-sized input
  • more stable results and predictions, independent of the size of the input text (solves "Impact of text length on identified entities", #131)
  • removes the burden of segmenting text and managing a context on the client side

Tests added for the text segmentation. The default target segment size is 1000 characters, but this is also configurable in the query. Documentation updated. The edge/pathological case of sentences longer than 1000 characters is supported.

@kermitt2 kermitt2 merged commit 1a1f29c into master Feb 14, 2022