LOFT (128k)

Tags: long-context · Verified

Long-Context Frontiers benchmark evaluating long-context language models on real-world tasks requiring context up to 128k tokens, including retrieval, RAG, SQL-like reasoning, and many-shot in-context learning across text, visual, and audio modalities.

Published: 2024
Score Range: 0-100
Top Score: 83.3

LOFT (128k) Leaderboard

| Rank | Model | Provider | Score | Parameters | Released | Type |
|------|-------|----------|-------|------------|----------|------|
| 1 | Grok 3 | xAI | 83.3 | Unknown (multi-trillion estimated) | 2025-02-19 | Multimodal |
| 2 | Grok 3 Mini | xAI | 83.1 | Unknown | 2025-02-19 | Multimodal |

About LOFT (128k)

Description

## Overview

LOFT (Long-Context Frontiers) is a comprehensive benchmark designed to evaluate long-context language models (LCLMs) on paradigm-shifting tasks that leverage their ability to process millions of tokens. Unlike traditional benchmarks that rely on synthetic tasks, LOFT focuses on real-world applications where LCLMs can potentially replace specialized systems and complex pipelines.

## Key Features

- **Dynamic Scaling**: Supports context lengths from 32k to 1M+ tokens with automatic scaling capabilities
- **Multi-Modal**: Covers text, visual, and audio modalities across 35 datasets
- **Real-World Tasks**: Six core task categories that test practical applications
- **Corpus-in-Context (CiC) Prompting**: Novel prompting approach for long-context evaluation

## Task Categories

### 1. Text Retrieval

Evaluates LCLMs' ability to retrieve relevant information from large text corpora without specialized dual-encoder models. Includes datasets from the BEIR benchmark covering argument retrieval (ArguAna), fact checking (FEVER), question answering (FIQA, MS MARCO, NQ), duplicate detection (Quora), citation prediction (SciFact), and multi-hop reasoning (HotPotQA, MuSiQue, QAMPARI, QUEST).

### 2. Visual Retrieval

Tests text-to-image retrieval (Flickr30k, MS COCO), text-to-video retrieval (MSR-VTT), and image-text retrieval (OVEN) capabilities, comparing against specialized models such as CLIP.

### 3. Audio Retrieval

Evaluates multilingual audio retrieval across five languages (English, Hindi, Chinese, Spanish, French) using the FLEURS dataset, testing against dual-encoder models.

### 4. Retrieval-Augmented Generation (RAG)

Assesses LCLMs' ability to reason over entire corpora directly, eliminating traditional RAG pipeline complexity. Uses subsets of the retrieval datasets with phrase-level answer annotations.

### 5. SQL-like Reasoning

Tests compositional reasoning over entire databases represented as text, using the Spider (single-turn) and SParC (multi-turn) datasets. Evaluates natural-language database querying without formal SQL conversion.

### 6. Many-Shot In-Context Learning

Explores scaling from traditional few-shot prompting to hundreds or thousands of examples using the Big-Bench Hard (BBH) and LongICLBench (LIB) classification datasets.

## Evaluation Methodology

**Corpus-in-Context (CiC) Prompting** is introduced as a novel approach that:

- Provides task-specific instructions
- Formats the entire corpus with unique identifiers
- Includes grounded few-shot examples with chain-of-thought reasoning
- Maintains consistent query formatting

A minimal sketch of this prompt structure appears after the Key Findings section below.

## Key Findings

### Performance Highlights

- **Text Retrieval**: Gemini 1.5 Pro achieves 77% average performance, matching the specialized Gecko retriever at 128k context
- **Visual Retrieval**: Gemini 1.5 Pro (83% average) outperforms CLIP (71% average) across all visual benchmarks
- **Audio Retrieval**: Perfect 100% performance on the FLEURS datasets, matching or exceeding PaLM 2 DE
- **RAG**: LCLMs excel at multi-hop reasoning but struggle with multi-target tasks requiring comprehensive passage ranking
- **SQL**: A significant gap remains (38% vs. 65% for specialized models), indicating room for improvement in compositional reasoning
- **Many-Shot ICL**: Performance varies by task complexity, with knowledge-intensive tasks benefiting more from additional examples

### Positional Analysis

Attention degrades over later sections of the corpus, but strategic placement of few-shot examples can mitigate this "lost in the middle" effect.
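To make the CiC format concrete, here is a minimal Python sketch of how such a prompt could be assembled from an instruction, a corpus with unique identifiers, grounded few-shot examples with chain-of-thought, and a query. The section markers, identifier scheme, and example fields are illustrative assumptions, not the exact templates used in the LOFT paper.

```python
# Hypothetical sketch of a Corpus-in-Context (CiC) prompt builder.
# The exact instructions, identifier scheme, and example format used in
# LOFT are assumptions here; see the paper for the real prompt templates.

def build_cic_prompt(instruction, corpus, few_shot_examples, query):
    """Assemble one long-context prompt: task instruction, the entire
    corpus with unique identifiers, grounded few-shot examples with
    chain-of-thought reasoning, and finally the query."""
    lines = [instruction, "", "== Corpus =="]
    for doc_id, text in corpus:
        lines.append(f"[{doc_id}] {text}")

    lines.append("")
    lines.append("== Examples ==")
    for ex in few_shot_examples:
        lines.append(f"Query: {ex['query']}")
        lines.append(f"Reasoning: {ex['reasoning']}")  # chain-of-thought step
        lines.append(f"Answer: {ex['answer']}")        # grounded in corpus IDs
        lines.append("")

    lines.append("== Task ==")
    lines.append(f"Query: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


# Toy usage with a two-document corpus and one grounded example.
corpus = [("doc-001", "The Eiffel Tower is in Paris."),
          ("doc-002", "Mount Fuji is the tallest mountain in Japan.")]
examples = [{"query": "Where is the Eiffel Tower?",
             "reasoning": "doc-001 states that the Eiffel Tower is in Paris.",
             "answer": "doc-001"}]
print(build_cic_prompt("Find the document that answers the query.",
                       corpus, examples,
                       "Which mountain is the tallest in Japan?"))
```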
## Technical Details

- **Context Lengths**: 32k, 128k, and 1M token versions
- **Evaluation Scale**: Up to 100 test queries per dataset, with 5 few-shot and 10 development queries
- **Metrics**: Task-specific (Recall@1 for retrieval, exact match for RAG, accuracy for SQL/ICL), illustrated in the sketch below
- **Models Evaluated**: Gemini 1.5 Pro, GPT-4o, Claude 3 Opus

## Significance

LOFT demonstrates that LCLMs can rival specialized systems across multiple domains without task-specific training, while revealing areas for improvement in long-context reasoning. The benchmark's scalable design ensures continued relevance as context windows expand to billions of tokens.
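The task-specific metrics listed under Technical Details can be illustrated with a short sketch. The answer normalization below is an assumption made for illustration; LOFT's official scoring may normalize and aggregate answers differently.

```python
# Minimal sketches of the task-specific metrics named above (Recall@1 for
# retrieval, exact match for RAG). The normalization here is an assumption.

def recall_at_1(predicted_ids, gold_ids):
    """Percentage of queries whose top-1 predicted document is a gold document."""
    hits = sum(1 for pred, gold in zip(predicted_ids, gold_ids) if pred in gold)
    return 100.0 * hits / len(gold_ids)

def exact_match(predictions, references):
    """Percentage of predictions matching any reference after simple normalization."""
    def norm(s):
        return " ".join(s.lower().strip().split())
    hits = sum(1 for pred, refs in zip(predictions, references)
               if any(norm(pred) == norm(ref) for ref in refs))
    return 100.0 * hits / len(references)

# Example: two retrieval queries and two RAG answers (made-up data).
print(recall_at_1(["doc-001", "doc-007"], [{"doc-001"}, {"doc-002"}]))   # 50.0
print(exact_match(["Paris", "Mt. Fuji"], [["Paris"], ["Mount Fuji"]]))   # 50.0
```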

Methodology

LOFT (128k) evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
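As an illustration only: if the leaderboard score were an unweighted mean of per-dataset scores on the 0-100 scale (the actual aggregation is not specified here), it could be computed as in the following sketch. The dataset names and scores are made up.

```python
# Hypothetical aggregation of per-dataset scores into one 0-100 leaderboard
# number, assuming an unweighted mean; the real LOFT aggregation may weight
# tasks or datasets differently.
def aggregate_score(per_dataset_scores):
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)

print(round(aggregate_score({"NQ": 85.0, "Spider": 60.0, "BBH": 92.0}), 1))  # 79.0
```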

Publication

This benchmark was published in 2024. See the technical paper for full details.