LLM Training Datasets
Explore the datasets used to train large language models and understand their impact on model capabilities.
Anthropic HH-RLHF
preference
The Anthropic HH-RLHF (Helpful and Harmless) dataset is a pivotal collection of human preference data designed to align Large Language Models with human values. It consists of approximately 161,000 examples of human-AI conversations where a human annotator chooses the better of two model responses based on helpfulness and harmlessness criteria. This dataset was instrumental in democratizing Reinforcement Learning from Human Feedback (RLHF) research, allowing the open-source community to train reward models and align models like Llama and Alpaca variants without generating their own expensive human feedback data. It also includes 'red teaming' data where humans attempt to elicit harmful responses.
Size
161,000 conversation pairs
License
MIT License
Used By
Alpaca (various fine-tunes), Llama 2 (community fine-tunes), OpenAssistant models...
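As a rough illustration, the sketch below loads the preference pairs with the Hugging Face datasets library and prints one chosen/rejected pair. The Anthropic/hh-rlhf Hub path and the chosen/rejected field names reflect the public release, but treat this as a sketch rather than an official loader.

```python
# Sketch: inspect HH-RLHF preference pairs with the Hugging Face `datasets` library.
# Assumes the dataset is hosted at "Anthropic/hh-rlhf" and that each record holds
# a "chosen" and a "rejected" conversation transcript.
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

example = ds[0]
print(example["chosen"][:300])    # the response the annotator preferred
print(example["rejected"][:300])  # the response the annotator rejected

# A reward model is typically trained to score `chosen` above `rejected`
# for the same conversation prefix.
```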
Colossal Clean Crawled Corpus (C4)
text
C4 is a massive, cleaned version of Common Crawl's web crawl corpus, originally created by Google to train the T5 (Text-to-Text Transfer Transformer) model. The dataset was designed to produce a high-quality, large-scale English text corpus by applying extensive filtering to raw web data. Filters included removing duplicate lines, stripping pages containing 'bad words', removing code, and retaining only lines ending in terminal punctuation. While Google did not release the dataset itself, it released the scripts to recreate it. The Allen Institute for AI (AI2) subsequently released a widely used open reproduction hosted on Hugging Face. C4 has become a standard baseline dataset for pre-training English language models.
Size
305 GB (en), 2.3 TB (en.noclean), 9.7 TB (multilingual)
License
ODC-BY
Used By
T5, mT5, LaMDA...
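A minimal sketch of sampling AI2's C4 reproduction via streaming, so nothing close to the full 305 GB needs to be downloaded; the allenai/c4 Hub path and the en configuration are taken from the public mirror and should be verified against the current dataset card.

```python
# Sketch: stream a few documents from the AI2 reproduction of C4.
# Assumes the Hub path "allenai/c4" with the "en" configuration;
# streaming avoids downloading the full ~305 GB split.
from datasets import load_dataset
from itertools import islice

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for doc in islice(c4, 3):
    print(doc["url"])
    print(doc["text"][:200], "...\n")
```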
Dolma
text
Dolma (Data to feed OLMo's Appetite) is a 3-trillion token open corpus created by the Allen Institute for AI (AI2) specifically for pre-training the OLMo (Open Language Model) family. Unlike purely web-scraped datasets, Dolma is a curated mixture of diverse sources designed to balance representativeness and quality. It includes web content from Common Crawl, academic publications from peS2o, code from The Stack, books from Project Gutenberg, and encyclopedic data from Wikipedia. A key innovation of Dolma is its release under the 'ImpACT' license, a medium-risk artifact license designed to mitigate ethical risks while maintaining openness, requiring users to provide contact information and intended use cases.
Size
3 trillion tokens
License
AI2 ImpACT License (Medium Risk)
Used By
OLMo (Open Language Model), OLMo-7B, OLMo-1B
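If access has been granted under the dataset's license terms, something like the following could stream a few documents. The allenai/dolma Hub path, the default configuration, and the text/source field names are assumptions here, not guaranteed details of the release.

```python
# Sketch: stream documents from Dolma, assuming access to the "allenai/dolma"
# Hub repository has been granted under its license terms and the user is
# authenticated (e.g. via `huggingface-cli login`). The default configuration
# and the "text"/"source" field names are assumed; a specific version name
# may be required in practice.
from datasets import load_dataset
from itertools import islice

dolma = load_dataset("allenai/dolma", split="train", streaming=True)

for doc in islice(dolma, 3):
    print(doc.get("source", "unknown source"), "->", doc["text"][:150])
```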
FineWeb
text
FineWeb is a massive, high-quality English pre-training dataset consisting of approximately 15 trillion tokens. It was developed by Hugging Face to serve as a superior open-source alternative to previous web-scale datasets like RefinedWeb and C4. Derived from 96 snapshots of Common Crawl (spanning from 2013 to 2024), FineWeb utilizes an advanced processing pipeline that includes aggressive deduplication, heuristic filtering, and PII (Personally Identifiable Information) removal. The dataset was designed to push the performance of open LLMs, with ablation studies showing that models trained on FineWeb outperform those trained on The Pile, C4, and Dolma on aggregate benchmarks. It also includes a specialized subset, 'FineWeb-Edu', which filters for high educational value content using synthetic annotations.
Size
15 trillion tokens (44 TB disk space)
License
Open Data Commons Attribution License (ODC-By) v1.0
Used By
Hugging Face ablation models, Open-source community models (post-2024)
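A small sketch of peeking at FineWeb through streaming; the HuggingFaceFW/fineweb Hub path and the sample-10BT configuration (a small sampled subset) are assumptions to double-check against the dataset card, since the full corpus is far too large to download casually.

```python
# Sketch: sample FineWeb via streaming. Assumes the Hub path "HuggingFaceFW/fineweb"
# and the small "sample-10BT" configuration; the full dataset is ~15T tokens, so
# streaming or a sampled subset is the practical way to inspect it.
from datasets import load_dataset
from itertools import islice

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

for doc in islice(fw, 3):
    print(doc["text"][:200], "...\n")
```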
The Flan Collection
instruction
The Flan Collection (2022) is a comprehensive compilation of datasets, templates, and methods for instruction tuning, significantly expanding upon the original Flan 2021 dataset. It aggregates tasks from Flan 2021, P3 (Public Pool of Prompts), Super-Natural Instructions, and additional reasoning, dialog, and program synthesis datasets. The collection formats these into a mix of zero-shot, few-shot, and Chain-of-Thought (CoT) templates. It was designed to train the Flan-T5 and Flan-PaLM models, demonstrating that training on a diverse set of instructions improves zero-shot performance on unseen tasks. It is a cornerstone dataset for research into instruction following and generalization.
Size
1,836 tasks (millions of examples)
License
Apache 2.0 (Code), Component datasets vary
Used By
Flan-T5, Flan-PaLM, OpenOrca
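To make the template idea concrete, the sketch below renders one arithmetic question as a zero-shot prompt and as a chain-of-thought prompt. The template strings are illustrative stand-ins, not the exact Flan templates.

```python
# Sketch: how an instruction-tuning collection renders the same underlying task
# with different templates. The example record and template strings are
# illustrative, not the exact Flan templates.
example = {
    "question": "If a train travels 60 km in 45 minutes, what is its average speed in km/h?",
    "answer": "80 km/h",
    "rationale": "45 minutes is 0.75 hours, and 60 / 0.75 = 80.",
}

zero_shot = f"Answer the following question.\n\nQuestion: {example['question']}\nAnswer:"

chain_of_thought = (
    "Answer the following question. Think step by step before giving the final answer.\n\n"
    f"Question: {example['question']}\nLet's think step by step."
)

# Targets differ per template: the plain answer for zero-shot, and the
# rationale followed by the answer for the chain-of-thought variant.
targets = {
    "zero_shot": example["answer"],
    "chain_of_thought": f"{example['rationale']} The answer is {example['answer']}.",
}

print(zero_shot, "\n->", targets["zero_shot"])
print(chain_of_thought, "\n->", targets["chain_of_thought"])
```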
LAION-5B
multimodal
LAION-5B is a massive multimodal dataset consisting of 5.85 billion image-text pairs, making it one of the largest open datasets of its kind. Created by the Large-scale Artificial Intelligence Open Network (LAION), it was designed to democratize research into large-scale multi-modal models like CLIP and Stable Diffusion. The dataset was constructed by parsing Common Crawl data, filtering for image tags with alt-text, and using CLIP embeddings to ensure high similarity between images and their text descriptions. It includes multilingual subsets and has been instrumental in the training of open-source generative image models. Note: Due to the discovery of CSAM in the original links, a cleaned version 'Re-LAION-5B' was released in 2024.
Size
5.85 billion image-text pairs (metadata size ~200GB+)
License
Creative Commons CC-BY 4.0 (Metadata)
Used By
Stable Diffusion, OpenCLIP, Imagen (Google - partially)
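The CLIP-similarity filter can be illustrated with a short sketch: embed an image and its alt-text, compute cosine similarity, and keep the pair only if it clears a threshold. The openai/clip-vit-base-patch32 checkpoint and the 0.28 cutoff are stand-ins here; LAION's exact filtering setup should be checked against the paper.

```python
# Sketch: the CLIP-similarity check LAION used to keep image/alt-text pairs,
# here with the openai/clip-vit-base-patch32 checkpoint from `transformers`.
# The 0.28 cosine-similarity cutoff is the commonly cited threshold for the
# English subset and should be treated as an approximation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")            # any local image stands in for a crawled image
alt_text = "a brown dog catching a frisbee"  # stands in for the HTML alt-text

inputs = processor(text=[alt_text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

similarity = torch.cosine_similarity(out.image_embeds, out.text_embeds).item()
keep = similarity >= 0.28
print(f"cosine similarity = {similarity:.3f}, keep pair: {keep}")
```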
RedPajama
text
RedPajama is a project by Together AI to create leading open-source models and datasets, starting with a reproduction of the LLaMA training dataset. The project has released two major versions: RedPajama-Data-1T (V1) and RedPajama-Data-V2. V1 is a 1.2 trillion token dataset designed to replicate the LLaMA-1 training data distribution, consisting of seven domains: CommonCrawl, C4, GitHub, ArXiv, Books, Wikipedia, and StackExchange. RedPajama-Data-V2 is a significantly larger, web-only dataset containing over 100 trillion raw tokens (30 trillion deduplicated) derived from 84 CommonCrawl snapshots. V2 distinguishes itself by preserving raw data while providing over 40 pre-computed quality annotations (e.g., perplexity, toxicity, length), allowing researchers to filter and curate their own high-quality subsets rather than relying on a fixed pre-filtered corpus.
Size
30 trillion tokens (V2, deduplicated), 1.2 trillion tokens (V1)
License
ODC-BY (CommonCrawl data); Various permissive licenses for other subsets in V1
Used By
RedPajama-INCITE, Snowflake Arctic, Salesforce XGen...
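The "filter it yourself" workflow that V2 encourages might look like the sketch below, which keeps only documents whose pre-computed quality signals pass simple thresholds. The record layout and signal names are illustrative, not the exact V2 schema.

```python
# Sketch: filtering a RedPajama-Data-V2 style shard by pre-computed quality
# signals. The file path, record layout, and signal names ("perplexity",
# "word_count", "raw_content") are illustrative assumptions, not the exact
# V2 schema; the real annotations should be read from the dataset docs.
import json

def passes_quality_filters(record: dict) -> bool:
    signals = record["quality_signals"]
    return signals["perplexity"] < 300.0 and signals["word_count"] >= 50

kept = []
with open("redpajama_v2_shard.jsonl") as f:   # hypothetical local shard
    for line in f:
        record = json.loads(line)
        if passes_quality_filters(record):
            kept.append(record["raw_content"])

print(f"kept {len(kept)} documents after quality filtering")
```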
Stanford Alpaca Data
instruction
The Stanford Alpaca dataset is a small but highly influential instruction-tuning dataset consisting of 52,000 instruction-following examples. It was created by Stanford's Center for Research on Foundation Models (CRFM) to fine-tune the LLaMA-7B model into an instruction-following assistant (Alpaca). The dataset was generated using OpenAI's text-davinci-003 model via the 'Self-Instruct' method, where a strong model generates instructions and responses based on a small seed set of human-written tasks. This dataset demonstrated that a small, high-quality instruction dataset could significantly improve the capability of base LLMs to follow user commands, sparking the open-source instruction-tuning revolution.
Size
22.7 MB
License
CC BY-NC 4.0
Used By
Alpaca-7B, Alpaca-LoRA, Koala...
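A short sketch of loading the data and rendering one record with the prompt template popularized by the Alpaca release; the tatsu-lab/alpaca Hub mirror and its instruction/input/output fields are assumptions to verify against whichever copy of the data is used.

```python
# Sketch: render one Alpaca record with the prompt template popularized by
# the Alpaca release. Assumes the "tatsu-lab/alpaca" Hub mirror with
# "instruction", "input", and "output" fields.
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")
ex = ds[0]

if ex["input"]:
    prompt = (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{ex['instruction']}\n\n### Input:\n{ex['input']}\n\n### Response:\n"
    )
else:
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{ex['instruction']}\n\n### Response:\n"
    )

print(prompt + ex["output"])
```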
The Pile
text
The Pile is an 825 GiB diverse, open-source text corpus created by EleutherAI specifically for training large language models. It was designed to improve upon Common Crawl-based datasets by incorporating a richer mix of high-quality, diverse text sources. The Pile consists of 22 distinct subsets, including academic papers (ArXiv, PubMed Central), code (GitHub), internet discussions (Pile-CC, OpenWebText2, StackExchange), books (Books3, Project Gutenberg), and legal texts (FreeLaw). This diversity aims to improve cross-domain knowledge and downstream generalization capabilities of models trained on it. It served as the primary training data for the GPT-Neo and GPT-J model families.
Size
825 GiB
License
Various (MIT for code; dataset is a collection of mixed licenses including Public Domain, CC-BY, and others)
Used By
GPT-Neo, GPT-J, GPT-NeoX-20B...
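A quick sketch of tallying subset membership from one shard of the original JSONL distribution; it assumes the release's per-line layout with a meta.pile_set_name field and uses a hypothetical local file path.

```python
# Sketch: tally The Pile's subsets from one shard of the original JSONL
# distribution. Assumes each line looks like
# {"text": ..., "meta": {"pile_set_name": ...}}, which is the layout of the
# original release files; the shard path below is hypothetical.
import json
from collections import Counter

counts = Counter()
with open("pile_shard_00.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        counts[doc["meta"]["pile_set_name"]] += 1

for subset, n in counts.most_common():
    print(f"{subset:20s} {n}")
```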
The Stack
code
The Stack is the largest open dataset of permissively licensed source code, created by the BigCode project (a collaboration between Hugging Face and ServiceNow). It was designed to address the legal and ethical issues of training code LLMs on copyrighted code. The dataset is derived from GitHub and the Software Heritage archive, containing roughly 6 terabytes of code files in V1 and 67 terabytes in V2, spanning hundreds of programming languages. Crucially, it only includes code with permissive licenses (e.g., MIT, Apache 2.0, BSD) and provides an opt-out mechanism ('Am I in The Stack?') that lets developers have their code removed. The Stack V2, released in 2024, significantly expanded the scope and scale, serving as the training data for StarCoder2.
Size
67.5 TB (V2 source), ~900B tokens (V2 train)
License
Various Permissive Licenses (MIT, Apache 2.0, etc.)
Used By
StarCoder, StarCoder2, SantaCoder...
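Streaming a single language slice keeps experimentation manageable. The sketch below assumes the bigcode/the-stack (V1) Hub repository with access accepted under its terms, the documented data/python layout, and a content field holding the file text; V2 is distributed differently, as file identifiers resolving to Software Heritage contents.

```python
# Sketch: stream a single language slice of The Stack (V1). Assumes the
# "bigcode/the-stack" Hub repository (access accepted under its terms) and the
# per-language layout under data_dir="data/python"; the "content" field is
# assumed to hold the source file text.
from datasets import load_dataset
from itertools import islice

stack_py = load_dataset(
    "bigcode/the-stack", data_dir="data/python", split="train", streaming=True
)

for f in islice(stack_py, 3):
    print(f["content"][:200], "...\n")
```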
Want to contribute dataset information?
Help us make LLMDB more comprehensive by contributing dataset information, corrections, or improvements.