
Splitting docs into paragraphs #1596

Answered by julian-risch
kamilpz asked this question in Questions

Hi @kamilpz, I can recommend two resources as starting points for splitting documents.

  1. Our blog article on Parameter Tweaking has a section "Increasing Pipeline Speed via Document Length Optimization", which might be interesting to you: https://www.deepset.ai/blog/parameter-tweaking-get-faster-answers-from-your-haystack-pipeline
  2. We have a tutorial on preprocessing, where the parameter is set to split_by="word", but you can easily change it to split_by="passage": https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial8_Preprocessing.ipynb

While the file converter converts your docx file into text in string format, the preprocessor handles s…
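
To make the idea of passage-level splitting concrete, here is a minimal plain-Python sketch. It is not Haystack's implementation: it assumes that a "passage" is a block of text separated from the next one by at least one blank line, and the names split_into_passages, split_length, and converted_text are hypothetical, chosen only for this illustration. In a real pipeline the preprocessor from the tutorial above would do this step once split_by="passage" is set.

```python
import re
from typing import List


def split_into_passages(text: str, split_length: int = 1) -> List[str]:
    """Split text on blank lines and group the passages into chunks.

    Hypothetical helper for illustration only. Assumes a passage
    boundary is one or more blank lines in the converted text.
    """
    if split_length < 1:
        raise ValueError("split_length must be at least 1")

    # Normalize Windows line endings so the blank-line pattern matches.
    normalized = text.replace("\r\n", "\n")

    # Split on blank lines and drop empty or whitespace-only pieces.
    passages = [p.strip() for p in re.split(r"\n\s*\n", normalized) if p.strip()]

    # Group `split_length` consecutive passages into one output chunk.
    return [
        "\n\n".join(passages[i:i + split_length])
        for i in range(0, len(passages), split_length)
    ]


# Stand-in for the string a file converter would produce from a docx file.
converted_text = (
    "First paragraph, line one.\nStill the first paragraph.\n\n"
    "Second paragraph.\n\n\n"
    "Third paragraph."
)

docs = split_into_passages(converted_text)
for number, passage in enumerate(docs, start=1):
    print(number, repr(passage))
```

Raising split_length groups several consecutive paragraphs into one chunk, which is one way to trade off document length against the number of documents the pipeline has to process.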

Answer selected by kamilpz