Explore the datasets used to train large language models and understand their impact on model capabilities.
The Pile
An 825 GiB diverse, open-source language modeling dataset consisting of 22 smaller, high-quality datasets.
Size: 825 GiB
License: Mixed (primarily MIT)
Used By: GPT-Neo, Pythia, OPT
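To peek at The Pile without downloading all 825 GiB, you can stream it with the Hugging Face datasets library. A minimal sketch: the repository id below is an assumption, since the original hosting has moved over time, so point it at whichever mirror you actually use.

```python
from datasets import load_dataset

# Stream The Pile instead of downloading all 825 GiB up front. The repository
# id is an assumption -- substitute whichever mirror you actually use.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

# Each record carries the raw text plus the name of the component dataset it came from.
for i, example in enumerate(pile):
    print(example["meta"]["pile_set_name"], "->", example["text"][:80])
    if i == 4:
        break
```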
RedPajama
A large-scale dataset of 1.2 trillion tokens designed to replicate the LLaMA training data recipe.
Size: 1.2 trillion tokens
License: Apache 2.0
Used By: RedPajama-INCITE, MPT, OpenLLaMA
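RedPajama is split into per-source slices, so you can stream just one of them rather than materializing all 1.2 trillion tokens. A minimal sketch, assuming the Together AI repository id and the "arxiv" config name; the dataset uses a loading script, so recent `datasets` versions may require `trust_remote_code=True`.

```python
from datasets import load_dataset

# Stream one RedPajama slice; the id and "arxiv" config name are assumptions
# based on the Together AI release.
redpajama = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,
    trust_remote_code=True,
)

for i, example in enumerate(redpajama):
    print(example["text"][:80])
    if i == 2:
        break
```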
C4
The Colossal Clean Crawled Corpus is a cleaned version of Common Crawl's web crawl data.
Size: 750 GB
License: ODC-By
Used By: T5, Flan-T5, LLaMA
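C4 is mirrored on the Hugging Face Hub, so taking a quick look at it is straightforward. A minimal sketch, assuming the allenai/c4 mirror and its "en" config.

```python
from datasets import load_dataset

# Stream the English split of C4; adjust the id if you host your own copy.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Each record is a single cleaned web document with its source URL and crawl timestamp.
for i, doc in enumerate(c4):
    print(doc["url"], "->", doc["text"][:60])
    if i == 2:
        break
```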
The Stack
A 3 TB dataset of permissively licensed source code from GitHub, covering 358 programming languages.
Size: 3 TB
License: Multiple permissive licenses (retained per file)
Used By: SantaCoder, StarCoder, StableCode
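The Stack is gated and organised by language, so the usual pattern is to stream a single language subdirectory. A minimal sketch, assuming the bigcode/the-stack id and its data/&lt;language&gt; layout; accept the terms on the dataset page and authenticate before running it.

```python
from datasets import load_dataset

# The Stack is gated: accept the terms on the dataset page and authenticate
# (e.g. `huggingface-cli login`) first. The id and directory layout are assumptions.
stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

# Each record holds one source file; the "content" field is the raw code.
for i, src in enumerate(stack_python):
    print(src["content"][:60].replace("\n", " "))
    if i == 2:
        break
```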
Books3
A dataset of approximately 196,640 books, primarily English fiction and non-fiction.
Size: 100 GB
License: Research only
Used By: GPT-NeoX, LLaMA, Pythia
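Books3 is no longer publicly distributed, so there is no canonical hub id to load. The sketch below assumes you already hold a local copy extracted to plain-text files; the directory path is hypothetical.

```python
from pathlib import Path

# Assumes a local, already-extracted copy of Books3 as plain-text files;
# the path below is hypothetical.
books_dir = Path("path/to/books3")

# Count words per book to get a feel for the corpus composition.
for i, book in enumerate(sorted(books_dir.rglob("*.txt"))):
    text = book.read_text(encoding="utf-8", errors="ignore")
    print(book.name, f"{len(text.split()):,} words")
    if i == 4:
        break
```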
LAION-5B
A dataset of 5.85 billion CLIP-filtered image-text pairs for multimodal learning.
Size: 240 TB
License: CC-BY 4.0 (metadata); images retain their original licenses
Used By: Stable Diffusion, OpenCLIP
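LAION-5B ships as metadata (image URL, caption, CLIP similarity score) rather than the 240 TB of images themselves, so browsing captions is cheap. A minimal sketch, assuming the laion/laion2B-en English subset id and its upper-case column names; the images are usually fetched separately with a tool such as img2dataset.

```python
from datasets import load_dataset

# Stream LAION metadata only; the subset id and column names are assumptions.
laion = load_dataset("laion/laion2B-en", split="train", streaming=True)

for i, pair in enumerate(laion):
    print(pair["TEXT"][:60], "->", pair["URL"][:60])
    if i == 2:
        break
```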
Help us make LLMDB more comprehensive by contributing dataset information, corrections, or improvements.