Creating a text dataset for LLM pre-training

Overview

In this example, we will create a dataset for LLM pre-training by taking a 100K-sample subset of the wikitext-103-raw-v1 dataset, tokenizing it, and saving it as a Lance dataset. This can be done for as many or as few data samples as you wish, with little memory consumption!

The wikitext dataset is a collection of over 100 million tokens extracted from the set of verified good and featured articles on Wikipedia.
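A minimal sketch of this flow, assuming the Hugging Face `datasets` and `transformers` libraries plus the `pylance` package; the gpt2 tokenizer, the output path, and the single `input_ids` column are illustrative assumptions rather than the walkthrough's exact code:

```python
import lance
import pyarrow as pa
from datasets import load_dataset
from transformers import AutoTokenizer

NUM_SAMPLES = 100_000  # assumed subset size

# Stream the source dataset so samples are pulled one at a time
dataset = load_dataset(
    "wikitext", "wikitext-103-raw-v1", split="train", streaming=True
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer choice

schema = pa.schema([pa.field("input_ids", pa.list_(pa.int64()))])

def token_batches():
    """Yield one tokenized sample per Arrow RecordBatch."""
    for i, sample in enumerate(dataset):
        if i >= NUM_SAMPLES:
            break
        ids = tokenizer(sample["text"])["input_ids"]
        yield pa.RecordBatch.from_arrays(
            [pa.array([ids], type=pa.list_(pa.int64()))], schema=schema
        )

# Lance consumes the reader lazily, so the full tokenized corpus
# is never held in memory at once
reader = pa.RecordBatchReader.from_batches(schema, token_batches())
lance.write_dataset(reader, "wikitext_100k.lance")  # assumed output path
```

Streaming the source dataset and writing through a `RecordBatchReader` is what keeps memory usage low: only the sample currently being tokenized needs to be materialized.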

Code and Blog

Below are the links to both the Google Colab walkthrough and the blog post.

Open In Colab (walkthrough) · Ghost (blog)