LOFT (128k)

Tags: long-context · Verified

The Long-Context Frontiers (LOFT) benchmark evaluates long-context language models on real-world tasks requiring context up to 128k tokens, including retrieval, retrieval-augmented generation (RAG), SQL-like reasoning, and many-shot in-context learning across text, visual, and audio modalities.

Published: 2024
Score Range: 0-100
Top Score: 83.3

LOFT (128k) Leaderboard

| Rank | Model | Provider | Score | Parameters | Released | Type |
|------|-------|----------|-------|------------|----------|------|
| 1 | Grok 3 | xAI | 83.3 | Unknown (multi-trillion estimated) | 2025-02-19 | Multimodal |
| 2 | Grok 3 Mini | xAI | 83.1 | Unknown | 2025-02-19 | Multimodal |

About LOFT (128k)

Description

## Overview

LOFT (Long-Context Frontiers) is a comprehensive benchmark designed to evaluate long-context language models (LCLMs) on paradigm-shifting tasks that leverage their ability to process millions of tokens. Unlike traditional benchmarks that rely on synthetic tasks, LOFT focuses on real-world applications where LCLMs can potentially replace specialized systems and complex pipelines.

## Key Features

- **Dynamic Scaling**: Supports context lengths from 32k to 1M+ tokens with automatic scaling capabilities
- **Multi-Modal**: Covers text, visual, and audio modalities across 35 datasets
- **Real-World Tasks**: Six core task categories that test practical applications
- **Corpus-in-Context (CiC) Prompting**: Novel prompting approach for long-context evaluation

## Task Categories

### 1. Text Retrieval

Evaluates LCLMs' ability to retrieve relevant information from large text corpora without specialized dual-encoder models. Includes datasets from the BEIR benchmark covering argument retrieval (ArguAna), fact checking (FEVER), question answering (FIQA, MS MARCO, NQ), duplicate detection (Quora), citation prediction (SciFact), and multi-hop reasoning (HotPotQA, MuSiQue, QAMPARI, QUEST).

### 2. Visual Retrieval

Tests text-to-image retrieval (Flickr30k, MS COCO), text-to-video retrieval (MSR-VTT), and image-text retrieval (OVEN) capabilities, comparing against specialized models such as CLIP.

### 3. Audio Retrieval

Evaluates multilingual audio retrieval across five languages (English, Hindi, Chinese, Spanish, French) using the FLEURS dataset, testing against dual-encoder models.

### 4. Retrieval-Augmented Generation (RAG)

Assesses LCLMs' ability to reason over entire corpora directly, eliminating traditional RAG pipeline complexity. Uses subsets of the retrieval datasets with phrase-level answer annotations.

### 5. SQL-like Reasoning

Tests compositional reasoning over entire databases represented as text, using the Spider (single-turn) and SParC (multi-turn) datasets. Evaluates natural language database querying without formal SQL conversion.

### 6. Many-Shot In-Context Learning

Explores scaling from traditional few-shot prompting to hundreds or thousands of examples using the Big-Bench Hard (BBH) and LongICLBench (LIB) datasets for classification tasks.

## Evaluation Methodology

**Corpus-in-Context (CiC) Prompting** is introduced as a novel approach that:

- Provides task-specific instructions
- Formats the entire corpus with unique identifiers
- Includes grounded few-shot examples with chain-of-thought reasoning
- Maintains consistent query formatting

(A minimal prompt-construction sketch appears below, after the key findings.)

## Key Findings

### Performance Highlights

- **Text Retrieval**: Gemini 1.5 Pro achieves 77% average performance, matching the specialized Gecko retriever at 128k context
- **Visual Retrieval**: Gemini 1.5 Pro (83% average) outperforms CLIP (71% average) across all visual benchmarks
- **Audio Retrieval**: Perfect 100% performance on the FLEURS datasets, matching or exceeding PaLM 2 DE
- **RAG**: LCLMs excel at multi-hop reasoning but struggle with multi-target tasks requiring comprehensive passage ranking
- **SQL**: A significant gap remains (38% vs. 65% for specialized models), indicating room for improvement in compositional reasoning
- **Many-Shot ICL**: Performance varies by task complexity, with knowledge-intensive tasks benefiting more from additional examples

### Positional Analysis

Reveals attention degradation in later corpus sections, but strategic placement of few-shot examples can mitigate this "lost in the middle" effect.
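
The CiC prompting approach is described above only at a high level. As an illustration, the following minimal Python sketch shows how such a prompt might be assembled: the corpus is serialized with unique identifiers, grounded few-shot examples with chain-of-thought reasoning are appended, and the query comes last. The helper name `build_cic_prompt`, the identifier scheme, and the exact wording are assumptions for illustration, not LOFT's official format.

```python
# Hypothetical sketch of Corpus-in-Context (CiC) prompt assembly.
# Identifiers, instruction wording, and layout are illustrative assumptions;
# LOFT's exact prompt format is defined in the original paper and its code.

def build_cic_prompt(instruction, corpus, few_shot, query):
    """Assemble one long-context prompt: task instruction, the whole corpus
    with unique IDs, grounded few-shot examples, then the test query."""
    lines = [instruction, "", "Corpus:"]
    for doc_id, text in corpus:
        lines.append(f"[{doc_id}] {text}")
    lines.append("")
    for q, reasoning, answer_id in few_shot:
        lines.append(f"Query: {q}")
        lines.append(f"Reasoning: {reasoning}")  # chain-of-thought grounded in the corpus
        lines.append(f"Answer: [{answer_id}]")
        lines.append("")
    lines.append(f"Query: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    corpus = [("doc_001", "The Eiffel Tower is located in Paris."),
              ("doc_002", "Mount Fuji is the highest peak in Japan.")]
    few_shot = [("Which document mentions Japan?",
                 "doc_002 talks about Mount Fuji in Japan.",
                 "doc_002")]
    print(build_cic_prompt(
        "Return the ID of the passage that answers the query.",
        corpus, few_shot, "Which document mentions a tower in France?"))
```

Because the entire corpus sits inside the prompt, the same serialized context can be reused across queries, which is also what makes positional effects (such as the attention degradation noted above) measurable.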
## Technical Details

- **Context Lengths**: 32k, 128k, and 1M token versions
- **Evaluation Scale**: Up to 100 test queries per dataset, with 5 few-shot and 10 development queries
- **Metrics**: Task-specific (Recall@1 for retrieval, exact match for RAG, accuracy for SQL and ICL)
- **Models Evaluated**: Gemini 1.5 Pro, GPT-4o, Claude 3 Opus

## Significance

LOFT demonstrates that LCLMs can rival specialized systems across multiple domains without task-specific training, while revealing areas for improvement in long-context reasoning. The benchmark's scalable design ensures continued relevance as context windows expand toward billions of tokens.
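
The metrics listed above are standard. As a rough illustration of how they are typically computed, the Python sketch below implements Recall@1, exact match, and accuracy; the normalization and function names are assumptions, not LOFT's official scoring code, which may apply additional answer processing.

```python
# Rough illustration of the task-specific metrics mentioned above.
# Normalization choices are assumptions; LOFT's official evaluation
# may post-process answers differently.

def recall_at_1(predicted_ids, gold_id_sets):
    """Retrieval: fraction of queries whose top-1 predicted ID is a gold ID."""
    hits = sum(1 for pred, gold in zip(predicted_ids, gold_id_sets) if pred in gold)
    return 100.0 * hits / len(gold_id_sets)

def exact_match(predictions, reference_sets):
    """RAG: fraction of predictions matching a reference after simple normalization."""
    norm = lambda s: s.strip().lower()
    hits = sum(1 for p, refs in zip(predictions, reference_sets)
               if norm(p) in {norm(r) for r in refs})
    return 100.0 * hits / len(reference_sets)

def accuracy(predictions, labels):
    """SQL / many-shot ICL: fraction of exactly correct predictions."""
    hits = sum(1 for p, l in zip(predictions, labels) if p == l)
    return 100.0 * hits / len(labels)


if __name__ == "__main__":
    print(recall_at_1(["doc_002"], [{"doc_002"}]))                        # 100.0
    print(exact_match(["Paris "], [["paris"]]))                           # 100.0
    print(accuracy(["positive", "negative"], ["positive", "positive"]))   # 50.0
```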

Methodology

LOFT (128k) evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
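
This page does not spell out how the single 0-100 benchmark score is aggregated from the per-dataset metrics. One common convention, shown below purely as an assumption rather than LOFT's documented procedure, is an unweighted mean of per-dataset scores; consult the original paper for the exact methodology.

```python
# Assumed aggregation: unweighted mean of per-dataset scores on a 0-100 scale.
# This is a common convention, not necessarily LOFT's documented procedure.

def aggregate_score(per_dataset_scores: dict) -> float:
    """Average per-dataset scores (each on a 0-100 scale) into one number."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)


if __name__ == "__main__":
    scores = {"NQ": 81.0, "HotPotQA": 75.5, "Spider": 38.0, "BBH": 79.0}
    print(round(aggregate_score(scores), 1))  # 68.4
```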

Publication

This benchmark was published in 2024. Read the full paper for complete details.