Creating a text dataset for LLM pre-training

Overview

In this example, we will create a dataset for LLM pre-training by taking a 100K-sample subset of the wikitext-103-raw-v1 dataset, tokenizing it, and saving it as a Lance dataset. This can be done for as many or as few data samples as you wish, with little memory consumption!

The wikitext dataset is a collection of over 100 million tokens extracted from the set of verified good and featured articles on Wikipedia.
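A minimal sketch of this flow, assuming the Hugging Face `datasets` and `transformers` libraries plus the `pylance` package; the gpt2 tokenizer, the output path, and the single `input_ids` column are illustrative assumptions rather than the walkthrough's exact code:

```python
import lance
import pyarrow as pa
from datasets import load_dataset
from transformers import AutoTokenizer

NUM_SAMPLES = 100_000  # assumed subset size

# Stream the source dataset so samples are pulled one at a time
dataset = load_dataset(
    "wikitext", "wikitext-103-raw-v1", split="train", streaming=True
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer choice

schema = pa.schema([pa.field("input_ids", pa.list_(pa.int64()))])

def token_batches():
    """Yield one tokenized sample per Arrow RecordBatch."""
    for i, sample in enumerate(dataset):
        if i >= NUM_SAMPLES:
            break
        ids = tokenizer(sample["text"])["input_ids"]
        yield pa.RecordBatch.from_arrays(
            [pa.array([ids], type=pa.list_(pa.int64()))], schema=schema
        )

# Lance consumes the reader lazily, so the full tokenized corpus
# is never held in memory at once
reader = pa.RecordBatchReader.from_batches(schema, token_batches())
lance.write_dataset(reader, "wikitext_100k.lance")  # assumed output path
```

Streaming the source dataset and writing through a `RecordBatchReader` is what keeps memory usage low: only the sample currently being tokenized needs to be materialized.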

Code and Blog

Below are the links to both the Google Colab walkthrough and the blog post.

Open In Colab (walkthrough) · Ghost (blog)