LLM Training Datasets
Explore the datasets used to train large language models and understand their impact on model capabilities.
Books3
A dataset of approximately 196,640 books, primarily English fiction and non-fiction.
Size: 100 GB
License: Research only
Used by: GPT-NeoX, LLaMA, Pythia
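Books3 has been the target of takedown requests, so no canonical download location is given here. As a minimal sketch, assuming a local JSON Lines export with one book per line and a "text" field (the path and schema below are illustrative, not an official distribution format), the corpus can be inspected with the standard library:

```python
import json

# Hypothetical local export of Books3 as JSON Lines, one book per line
# with a "text" field -- the path and schema are assumptions, not an
# official distribution format.
path = "books3.jsonl"

n_books = 0
n_chars = 0
with open(path, encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        n_books += 1
        n_chars += len(record["text"])

print(f"{n_books} books, ~{n_chars / 1e9:.1f}B characters")
```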
C4
The Colossal Clean Crawled Corpus (C4) is a heavily filtered and deduplicated English snapshot of Common Crawl's web crawl data.
Size: 750 GB
License: ODC-By
Used by: T5, Flan-T5, Switch Transformer
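C4 is published on the Hugging Face Hub (the allenai/c4 repository at the time of writing), and at roughly 750 GB it is usually streamed rather than downloaded outright. A minimal sketch, assuming that Hub id and the "en" config are still current:

```python
# pip install datasets
from datasets import load_dataset

# Stream the English config of C4 from the Hugging Face Hub
# ("allenai/c4" is the current Hub id; adjust if it has moved).
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["url"])
    print(example["text"][:200], "...")
    if i == 2:  # peek at the first few documents only
        break
```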
LAION-5B
A dataset of 5.85 billion CLIP-filtered image-text pairs for multimodal learning.
Size: 240 TB
License: CC BY 4.0 (metadata; images remain under their original rights)
Used by: Stable Diffusion, OpenCLIP, OpenFlamingo
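LAION-5B is distributed as metadata only: Parquet shards of image URL, caption, and CLIP-similarity rows, with the images themselves typically fetched afterwards using the img2dataset tool. A minimal sketch that inspects one hypothetical local metadata shard (the filename and column names should be checked against the shard you actually have):

```python
# pip install pandas pyarrow
import pandas as pd

# One hypothetical local metadata shard; LAION-5B ships as Parquet
# files of (URL, caption, CLIP score) rows, and the images themselves
# are usually downloaded afterwards with img2dataset.
shard = pd.read_parquet("laion5b-shard-0000.parquet")

# Column names follow the published LAION schema; verify against the
# shard you actually have.
pairs = shard[["URL", "TEXT"]].head(5)
print(pairs.to_string(index=False))
```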
The Pile
An 825 GiB diverse, open-source language modeling dataset consisting of 22 smaller, high-quality datasets.
Size: 825 GiB
License: Mixed (primarily MIT)
Used by: GPT-Neo, Pythia, OPT
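The Pile is distributed as zstandard-compressed JSON Lines shards, one document per line, with a meta field tagging which of the 22 components each document came from. A minimal sketch that streams one shard (the filename follows the original 00–29 shard naming, but verify against your copy):

```python
# pip install zstandard
import io
import json

import zstandard

# One shard of The Pile: zstandard-compressed JSON Lines, one document
# per line. The filename is illustrative -- shards are named
# 00.jsonl.zst through 29.jsonl.zst in the original distribution.
with open("00.jsonl.zst", "rb") as fh:
    reader = zstandard.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        doc = json.loads(line)
        # "pile_set_name" tags which of the 22 component datasets
        # the document came from (e.g. "Books3", "Wikipedia (en)").
        print(doc["meta"]["pile_set_name"], len(doc["text"]))
        break
```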
RedPajama
A large-scale dataset of 1.2 trillion tokens designed to replicate LLaMA's training data.
Size: 1.2 trillion tokens
License: Apache 2.0
Used by: RedPajama-INCITE, MPT, OpenLLaMA
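RedPajama is published on the Hugging Face Hub split into seven source-specific configurations mirroring the LLaMA recipe (common_crawl, c4, github, arxiv, book, wikipedia, stackexchange). A minimal sketch, assuming the togethercomputer/RedPajama-Data-1T repository id and the config names on the current dataset card:

```python
# pip install datasets
from datasets import load_dataset

# Stream one of the seven source-specific configs; names here follow
# the Hub card for "togethercomputer/RedPajama-Data-1T" and may change.
arxiv = load_dataset(
    "togethercomputer/RedPajama-Data-1T", "arxiv",
    split="train", streaming=True, trust_remote_code=True,
)

first = next(iter(arxiv))
print(first["text"][:300])
```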
The Stack
A 3 TB dataset of permissively licensed source code from GitHub, covering 358 programming languages.
Size: 3 TB
License: Mixed permissive (MIT, Apache 2.0, BSD, etc.)
Used by: SantaCoder, StarCoder, StarCoder2
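The Stack is gated on the Hugging Face Hub: you must accept the terms on the bigcode/the-stack dataset page and authenticate before downloading. Files are organized per language under data/<language>. A minimal sketch streaming the Python subset (the repository id, layout, and field names follow the v1 dataset card and may differ in later versions):

```python
# pip install datasets huggingface_hub
# The Stack is gated: accept the terms on the "bigcode/the-stack" Hub
# page and run `huggingface-cli login` first.
from datasets import load_dataset

# Files are organized per language under data/<language>; "python"
# here is one of the 358 language directories.
stack_py = load_dataset(
    "bigcode/the-stack", data_dir="data/python",
    split="train", streaming=True,
)

example = next(iter(stack_py))
print(example["lang"], example["max_stars_repo_name"])
print(example["content"][:200])
```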
Want to contribute dataset information?
Help us make LLMDB more comprehensive by contributing dataset information, corrections, or improvements.