LLM Training Datasets

Explore the datasets used to train large language models and understand their impact on model capabilities.

Anthropic HH-RLHF

preference
Created by Anthropic, 2022

The Anthropic HH-RLHF (Helpful and Harmless) dataset is a pivotal collection of human preference data designed to align Large Language Models with human values. It consists of approximately 161,000 examples of human-AI conversations where a human annotator chooses the better of two model responses based on helpfulness and harmlessness criteria. This dataset was instrumental in democratizing Reinforcement Learning from Human Feedback (RLHF) research, allowing the open-source community to train reward models and align models like Llama and Alpaca variants without generating their own expensive human feedback data. It also includes 'red teaming' data where humans attempt to elicit harmful responses.
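To see the preference format concretely, the minimal sketch below loads the dataset from the Hugging Face Hub and prints one chosen/rejected pair. The repository id Anthropic/hh-rlhf and the 'chosen'/'rejected' field names are assumptions about the commonly used Hub release, not details stated in this entry.

```python
# Minimal sketch: inspect one HH-RLHF preference pair.
# Assumes the `datasets` library, the Hub id "Anthropic/hh-rlhf",
# and "chosen"/"rejected" fields in each record.
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

example = ds[0]
print("CHOSEN:\n", example["chosen"][:500])      # preferred conversation transcript
print("REJECTED:\n", example["rejected"][:500])  # dispreferred alternative
```

A reward model is then typically trained to score the chosen transcript above the rejected one for the same prompt.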

Size

161,000 conversation pairs

License

MIT License

Used By

Alpaca (various fine-tunes), Llama 2 (community fine-tunes), OpenAssistant models...

English
dialogue, safety, general-knowledge
RLHF, reward-modeling +2
Models trained with this dataset: 4

Colossal Clean Crawled Corpus (C4)

text
Created by Google (original concept/scripts) and the Allen Institute for AI (hosted reproduction), 2019

C4 is a massive, cleaned version of Common Crawl's web crawl corpus, originally created by Google to train the T5 (Text-to-Text Transfer Transformer) model. The dataset was designed to produce a high-quality, large-scale English text corpus by applying extensive filtering to raw web data. Filters included removing duplicate lines, dropping pages containing 'bad words', removing code-like content, and retaining only lines ending in terminal punctuation. While Google did not release the dataset itself, it released the scripts to recreate it. The Allen Institute for AI (AI2) subsequently released a widely used open reproduction hosted on Hugging Face. C4 has become a standard baseline dataset for pre-training English language models.
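To make these cleaning heuristics concrete, here is a toy, much-simplified filter in the spirit of the rules described above. It is illustrative only and omits large parts of the real C4 pipeline (the 'bad words' blocklist, language identification, and span-level deduplication).

```python
# Toy sketch of C4-style line filtering (illustrative, not the real pipeline).
TERMINAL_PUNCT = ('.', '!', '?', '"')

def clean_page(text: str, seen_lines: set) -> str | None:
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCT):            # keep only lines ending in terminal punctuation
            continue
        if len(line.split()) < 3:                        # drop very short lines
            continue
        if "{" in line or "javascript" in line.lower():  # crude code/boilerplate heuristic
            continue
        if line in seen_lines:                           # drop lines already seen on other pages
            continue
        seen_lines.add(line)
        kept.append(line)
    return "\n".join(kept) if kept else None             # drop pages with nothing left

seen: set = set()
page = "This is a kept sentence.\nnav menu\nfunction() { return 1; }\nAnother kept sentence!"
print(clean_page(page, seen))
```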

Size

305 GB (en), 2.3 TB (en.noclean), 9.7 TB (multilingual)

License

ODC-BY

Used By

T5, mT5, LaMDA...

en, multilingual (mC4)
web
pretraining, masked-language-modeling
Models trained with this dataset: 6

Dolma

text
Created by Allen Institute for AI (AI2), 2023

Dolma (Data to feed OLMo's Appetite) is a 3-trillion token open corpus created by the Allen Institute for AI (AI2) specifically for pre-training the OLMo (Open Language Model) family. Unlike purely web-scraped datasets, Dolma is a curated mixture of diverse sources designed to balance representativeness and quality. It includes web content from Common Crawl, academic publications from peS2o, code from The Stack, books from Project Gutenberg, and encyclopedic data from Wikipedia. A key innovation of Dolma is its release under the 'ImpACT' license, a medium-risk artifact license designed to mitigate ethical risks while maintaining openness, requiring users to provide contact information and intended use cases.
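Since the corpus is far too large to download for a quick look, streaming a few records is the practical way to inspect the source mixture. The Hub id allenai/dolma and the per-record 'source' and 'text' fields below are assumptions about the hosted release, and access may first require accepting the license terms on the Hub.

```python
# Minimal streaming sketch for Dolma (Hub id and field names are assumptions).
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)

for i, doc in enumerate(dolma):
    # Print which subset each document came from, plus a text snippet.
    print(doc.get("source", "?"), "|", doc["text"][:120].replace("\n", " "))
    if i == 4:
        break
```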

Size

3 trillion tokens

License

AI2 ImpACT License (Medium Risk)

Used By

OLMo (Open Language Model), OLMo-7B, OLMo-1B

English
web, scientific, code +2
pretraining, language-modeling
Models trained with this dataset: 3

FineWeb

text
Created by Hugging Face (HuggingFaceFW), 2024

FineWeb is a massive, high-quality English pre-training dataset consisting of approximately 15 trillion tokens. It was developed by Hugging Face to serve as a superior open-source alternative to previous web-scale datasets like RefinedWeb and C4. Derived from 96 snapshots of Common Crawl (spanning from 2013 to 2024), FineWeb utilizes an advanced processing pipeline that includes aggressive deduplication, heuristic filtering, and PII (Personally Identifiable Information) removal. The dataset was designed to push the performance of open LLMs, with ablation studies showing that models trained on FineWeb outperform those trained on The Pile, C4, and Dolma on aggregate benchmarks. It also includes a specialized subset, 'FineWeb-Edu', which filters for high educational value content using synthetic annotations.
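At this scale the usual way to explore the data is to stream one of the small sample configurations rather than download 44 TB. The Hub id HuggingFaceFW/fineweb and the 'sample-10BT' config name below are assumptions about how the release is organized.

```python
# Minimal sketch: stream a small FineWeb sample instead of downloading the full corpus.
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for i, doc in enumerate(fineweb):
    print(doc["text"][:200].replace("\n", " "), "...")
    if i == 2:
        break
```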

Size

15 trillion tokens (44 TB disk space)

License

Open Data Commons Attribution License (ODC-By) v1.0

Used By

Hugging Face ablation models, Open-source community models (post-2024)

English
web
pretraining, language-modeling
Models trained with this dataset: 2

The Flan Collection

instruction
Created by Google Research, 2023

The Flan Collection (also known as Flan 2022) is a comprehensive compilation of datasets, templates, and methods for instruction tuning, significantly expanding upon the original Flan 2021 dataset. It aggregates tasks from Flan 2021, P3 (Public Pool of Prompts), Super-Natural Instructions, and additional reasoning, dialogue, and program-synthesis datasets. The collection formats these into a mix of zero-shot, few-shot, and Chain-of-Thought (CoT) templates. It was designed to train the Flan-T5 and Flan-PaLM models, demonstrating that training on a diverse set of instructions improves zero-shot performance on unseen tasks. It is a cornerstone dataset for research into instruction following and generalization.
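The zero-shot versus Chain-of-Thought templating idea is easy to illustrate with a toy formatter; the template strings below are hypothetical stand-ins, not the actual Flan templates.

```python
# Toy illustration of mixing zero-shot and Chain-of-Thought prompts, Flan-style.
# These templates are hypothetical stand-ins, not the real Flan templates.
ZERO_SHOT = "Q: {question}\nA:"
CHAIN_OF_THOUGHT = (
    "Q: {question}\n"
    "Let's think step by step, then state the final answer.\nA:"
)

def format_example(question: str, use_cot: bool) -> str:
    template = CHAIN_OF_THOUGHT if use_cot else ZERO_SHOT
    return template.format(question=question)

print(format_example("What is 17 * 4?", use_cot=False))
print(format_example("What is 17 * 4?", use_cot=True))
```

Mixing many such templates over many tasks is what gives instruction-tuned models their ability to generalize to unseen instructions.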

Size

1,836 tasks (Millions of examples)

License

Apache 2.0 (code); component datasets vary

Used By

Flan-T5, Flan-PaLM, OpenOrca

English, Multilingual (via subsets)
general-knowledge, reasoning, dialogue +2
instruction-tuning, fine-tuning +1
Models trained with this dataset: 3

LAION-5B

multimodal
Created by LAION (Large-scale Artificial Intelligence Open Network), 2022

LAION-5B is a massive multimodal dataset consisting of 5.85 billion image-text pairs, making it one of the largest open datasets of its kind. Created by the Large-scale Artificial Intelligence Open Network (LAION), it was designed to democratize research into large-scale multimodal models like CLIP and Stable Diffusion. The dataset was constructed by parsing Common Crawl data, filtering for image tags with alt-text, and using CLIP embeddings to ensure high similarity between images and their text descriptions. It includes multilingual subsets and has been instrumental in the training of open-source generative image models. Note: after child sexual abuse material (CSAM) was discovered among the original links, a cleaned version, Re-LAION-5B, was released in 2024.
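The CLIP-similarity step can be sketched with an off-the-shelf CLIP model: a pair is kept only if its image and alt-text embeddings are similar enough. The small openai/clip-vit-base-patch32 checkpoint and the ~0.28 threshold used below are assumptions for illustration; LAION used larger CLIP variants and its own cutoffs.

```python
# Sketch of CLIP-based image/alt-text similarity filtering, as described above.
# The model checkpoint and the 0.28 threshold are assumed values for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, alt_text: str, threshold: float = 0.28) -> bool:
    inputs = processor(text=[alt_text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img @ txt.T).item()   # cosine similarity between image and alt-text
    return similarity >= threshold

# Example: keep_pair(Image.open("cat.jpg"), "a photo of a cat")
```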

Size

5.85 billion image-text pairs (metadata size ~200GB+)

License

Creative Commons CC-BY 4.0 (Metadata)

Used By

Stable Diffusion, OpenCLIP, Imagen (Google - partially)

English, Multilingual (100+ languages)
web, image, text
text-to-image-generation, image-captioning +1
Models trained with this dataset: 3

RedPajama

text
Created by Together AI, in collaboration with Ontocord.ai, ETH Zurich, Stanford CRFM, and Hazy Research, 2023

RedPajama is a project by Together AI to create leading open-source models and datasets, starting with a reproduction of the LLaMA training dataset. The project has released two major versions: RedPajama-Data-1T (V1) and RedPajama-Data-V2. V1 is a 1.2 trillion token dataset designed to replicate the LLaMA-1 training data distribution, consisting of seven domains: CommonCrawl, C4, GitHub, ArXiv, Books, Wikipedia, and StackExchange. RedPajama-Data-V2 is a significantly larger, web-only dataset containing over 100 trillion raw tokens (30 trillion deduplicated) derived from 84 CommonCrawl snapshots. V2 distinguishes itself by preserving raw data while providing over 40 pre-computed quality annotations (e.g., perplexity, toxicity, length), allowing researchers to filter and curate their own high-quality subsets rather than relying on a fixed pre-filtered corpus.
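V2's intended workflow is to filter the raw documents yourself using the shipped quality signals. The sketch below shows that pattern on a generic record; the signal names (ccnet_perplexity, rps_doc_word_count, rps_doc_frac_unique_words) and thresholds are assumptions for illustration, not the official recipe.

```python
# Illustrative RedPajama-V2-style curation: keep a document only if its
# pre-computed quality signals pass simple thresholds (names/values assumed).

def passes_quality(doc: dict) -> bool:
    signals = doc["quality_signals"]
    if signals.get("ccnet_perplexity", float("inf")) > 500:   # too "surprising" to a language model
        return False
    if signals.get("rps_doc_word_count", 0) < 50:             # too short
        return False
    if signals.get("rps_doc_frac_unique_words", 1.0) < 0.1:   # highly repetitive
        return False
    return True

docs = [
    {"raw_content": "A long, coherent article ...",
     "quality_signals": {"ccnet_perplexity": 120, "rps_doc_word_count": 800,
                         "rps_doc_frac_unique_words": 0.45}},
    {"raw_content": "buy buy buy buy",
     "quality_signals": {"ccnet_perplexity": 2500, "rps_doc_word_count": 4,
                         "rps_doc_frac_unique_words": 0.25}},
]
curated = [d for d in docs if passes_quality(d)]   # keeps only the first document
```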

Size

30 trillion tokens (V2 deduplicated), 1.2 trillion tokens (V1)

License

ODC-BY (CommonCrawl data); Various permissive licenses for other subsets in V1

Used By

RedPajama-INCITE, Snowflake Arctic, Salesforce XGen...

en, fr, de +2
web, code, scientific +3
pretraining
Models trained with this dataset: 5

Stanford Alpaca Data

instruction
Created by Stanford Center for Research on Foundation Models (CRFM), 2023

The Stanford Alpaca dataset is a small but highly influential instruction-tuning dataset consisting of 52,000 instruction-following examples. It was created by Stanford's Center for Research on Foundation Models (CRFM) to fine-tune the LLaMA-7B model into an instruction-following assistant (Alpaca). The dataset was generated using OpenAI's text-davinci-003 model via the 'Self-Instruct' method, where a strong model generates instructions and responses based on a small seed set of human-written tasks. This dataset demonstrated that a small, high-quality instruction dataset could significantly improve the capability of base LLMs to follow user commands, sparking the open-source instruction-tuning revolution.
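Each record pairs an instruction, an optional input, and an output, and fine-tuning code typically renders them into a fixed prompt template. The Hub id tatsu-lab/alpaca and the template wording below follow the widely circulated Alpaca format but should be treated as assumptions.

```python
# Sketch: load the Alpaca data and render one record into an instruction prompt.
# The Hub id and prompt template are assumptions based on the common Alpaca release.
from datasets import load_dataset

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

ds = load_dataset("tatsu-lab/alpaca", split="train")
ex = ds[0]
template = PROMPT_WITH_INPUT if ex["input"] else PROMPT_NO_INPUT
print(template.format(**ex) + ex["output"])   # prompt followed by the target response
```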

Size

22.7 MB

License

CC BY-NC 4.0

Used By

Alpaca-7B, Alpaca-LoRA, Koala...

en
general-knowledge, reasoning, creative-writing
fine-tuning, instruction-following
Models trained with this dataset: 4

The Pile

text
Created by EleutherAI, 2020

The Pile is an 825 GiB diverse, open-source text corpus created by EleutherAI specifically for training large language models. It was designed to improve upon Common Crawl-based datasets by incorporating a richer mix of high-quality, diverse text sources. The Pile consists of 22 distinct subsets, including academic papers (ArXiv, PubMed Central), code (GitHub), internet discussions (Pile-CC, OpenWebText2, StackExchange), books (Books3, Project Gutenberg), and legal texts (FreeLaw). This diversity aims to improve cross-domain knowledge and downstream generalization capabilities of models trained on it. It served as the primary training data for the GPT-Neo and GPT-J model families.
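Because each document carries its subset name in a metadata field, the mixture is easy to eyeball by streaming a hosted copy. The mirror id monology/pile-uncopyrighted (which omits the copyright-contested subsets) and the meta['pile_set_name'] field are assumptions about the hosted copy and its schema.

```python
# Sketch: tally a few thousand streamed Pile documents by subset name.
# Mirror id and metadata field name are assumptions about the hosted copy.
from collections import Counter
from datasets import load_dataset

pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

counts = Counter()
for i, doc in enumerate(pile):
    counts[doc["meta"]["pile_set_name"]] += 1
    if i >= 5_000:
        break

for subset, n in counts.most_common():
    print(f"{subset:25s} {n}")
```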

Size

825 GiB

License

Various (MIT for code; dataset is a collection of mixed licenses including Public Domain, CC-BY, and others)

Used By

GPT-Neo, GPT-J, GPT-NeoX-20B...

en
web, books, scientific +5
pretraining, language-modeling
Models trained with this dataset: 6

The Stack

code
Created by BigCode (Hugging Face & ServiceNow), 2022

The Stack is the largest open dataset of permissively licensed source code, created by the BigCode project (a collaboration between Hugging Face and ServiceNow). It was designed to address the legal and ethical issues of training code LLMs on copyrighted code. The dataset is derived from the Software Heritage archive and GitHub, containing roughly 6 terabytes of code files in V1 and about 67 terabytes in V2, across hundreds of programming languages. Crucially, it only includes code with permissive licenses (e.g., MIT, Apache 2.0, BSD) and provides a mechanism, 'Am I in The Stack?', for developers to opt out and have their code removed. The Stack V2, released in 2024, significantly expanded the scope and scale, serving as the training data for StarCoder2.
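Since the corpus is organized by programming language, a common pattern is to stream a single language directory. The id bigcode/the-stack-dedup, the data_dir="data/python" layout, and the 'content'/'lang' field names below are assumptions about the Hub release; access is gated and requires accepting the dataset terms first.

```python
# Sketch: stream the Python portion of The Stack (ids, layout, and field names assumed).
from datasets import load_dataset

stack_py = load_dataset("bigcode/the-stack-dedup", data_dir="data/python",
                        split="train", streaming=True)

for i, src_file in enumerate(stack_py):
    print(src_file.get("lang"), "-", len(src_file.get("content", "")), "characters")
    if i == 4:
        break
```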

Size

67.5 TB (V2 source), ~900B tokens (V2 train)

License

Various Permissive Licenses (MIT, Apache 2.0, etc.)

Used By

StarCoder, StarCoder2, SantaCoder...

python, java, javascript +5
code, software-development
pretraining, code-generation
Models trained with this dataset: 4

Want to contribute dataset information?

Help us make LLMDB more comprehensive by contributing dataset information, corrections, or improvements.