LLM Training Datasets

Explore the datasets used to train large language models and understand their impact on model capabilities.

The Pile

Created by EleutherAI, 2020

An 825 GiB diverse, open-source language modeling dataset consisting of 22 smaller, high-quality datasets.

Size

825 GiB

License

Mixed (primarily MIT)

Used By

GPT-Neo, Pythia, OPT

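For quick inspection, The Pile can be streamed with the Hugging Face `datasets` library rather than downloaded in full. A minimal sketch, assuming the `monology/pile-uncopyrighted` mirror (hosting of the original `EleutherAI/pile` has moved over time) and the `text`/`meta` record fields:

```python
from datasets import load_dataset

# Stream records instead of downloading the full 825 GiB archive.
# The dataset id is an assumption -- check the Hugging Face Hub for a
# currently hosted mirror of The Pile.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

for example in pile.take(3):
    # Pile records carry raw text plus metadata naming the source subset.
    print(example["meta"], example["text"][:80])
```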

RedPajama

Created by Together et al., 2023

A large-scale dataset of 1.2 trillion tokens designed to replicate LLaMA's training data.

Size

1.2 trillion tokens

License

Apache 2.0

Used By

RedPajama-INCITE, MPT, OpenLLaMA

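A similar streaming sketch works here. The `togethercomputer/RedPajama-Data-1T` hub id and the `arxiv` slice name are taken from the public release and should be verified against the Hub:

```python
from datasets import load_dataset

# RedPajama is split into seven source slices (common_crawl, c4, github,
# book, arxiv, wikipedia, stackexchange); load one of them as a config.
redpajama = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,  # avoid materializing the 1.2T-token corpus locally
)

for example in redpajama.take(2):
    print(example["text"][:100])
```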

C4

Created by Google, 2019

The Colossal Clean Crawled Corpus (C4) is a cleaned version of Common Crawl's web crawl data.

Size

750 GB

License

ODC-By

Used By

T5, Flan-T5, PaLM

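C4 is hosted on the Hugging Face Hub by AllenAI and streams the same way. A sketch, assuming the `allenai/c4` id and its `en` config:

```python
from datasets import load_dataset

# The "en" config is the standard ~750 GB English split.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for page in c4.take(3):
    # C4 records keep the source URL and crawl timestamp alongside the text.
    print(page["url"], page["text"][:80])
```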

The Stack

Created by BigCode (Hugging Face and ServiceNow), 2022

A 3 TB dataset of permissively licensed source code from GitHub, covering 358 programming languages.

Size

3 TB

License

Permissive (varies per file)

Used By

StarCoder, SantaCoder, StarCoder2

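The Stack is partitioned by language directory on the Hub. A hedged sketch, assuming the `bigcode/the-stack` id, the `data/python` partition path, and a `content` field holding the source text:

```python
from datasets import load_dataset

# Access requires accepting the dataset's terms on the Hugging Face Hub
# and being logged in (e.g. via `huggingface-cli login`).
stack = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # one of the 358 language partitions
    split="train",
    streaming=True,
)

for file in stack.take(2):
    print(file["content"][:120])
```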

Books3

Created by EleutherAI, 2020

A dataset of approximately 196,640 books, primarily English fiction and non-fiction.

Size

100 GB

License

Research only

Used By

GPT-NeoX, LLaMA, Pythia

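Books3 shipped as one of The Pile's 22 subsets, so in practice it was read by filtering Pile records on the subset name. A hedged sketch, assuming the original `EleutherAI/pile` id and the `pile_set_name` metadata field; note that public mirrors of Books3 have largely been withdrawn following copyright complaints:

```python
from datasets import load_dataset

# Stream The Pile and keep only records tagged as Books3.
# The dataset id and the "pile_set_name" field match the original Pile
# release, but hosted mirrors change over time.
pile = load_dataset("EleutherAI/pile", split="train", streaming=True)
books3 = pile.filter(lambda ex: ex["meta"].get("pile_set_name") == "Books3")

for book in books3.take(1):
    print(book["text"][:200])
```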

LAION-5B

Created by LAION, 2022

A dataset of 5.85 billion CLIP-filtered image-text pairs for multimodal learning.

Size

240 TB

License

CC BY 4.0 (metadata only; images retain their original rights)

Used By

Stable Diffusion, OpenCLIP, OpenFlamingo

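LAION distributes URL-and-caption metadata rather than the images themselves. A sketch that streams the English 2B subset's metadata; the `laion/laion2B-en` id and the uppercase `URL`/`TEXT` column names are assumptions from the public release, and actual images are typically fetched with a tool such as img2dataset:

```python
from datasets import load_dataset

# Stream image-text pair metadata; no image bytes are downloaded here.
laion = load_dataset("laion/laion2B-en", split="train", streaming=True)

for pair in laion.take(3):
    # Guard against empty captions in the raw metadata.
    print(pair["URL"], "->", (pair["TEXT"] or "")[:60])
```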

Want to contribute dataset information?

Help us make LLMDB more comprehensive by contributing dataset information, corrections, or improvements.