BrowseComp
A benchmark for measuring browsing agents' ability to navigate the web and find hard-to-find, deeply entangled information. It comprises 1,266 challenging questions that require persistent web browsing and creative search strategies.
About BrowseComp
Description
BrowseComp (Browsing Competition) is a benchmark designed to evaluate AI agents' ability to browse the internet and retrieve hard-to-find, deeply entangled information. The benchmark comprises 1,266 challenging questions that require agents to persistently navigate multiple websites, reason about factuality, and search creatively to find answers that would be difficult for a human to locate within 10 minutes.

The benchmark was created by human trainers who designed questions to be extremely challenging: not solvable by existing models such as GPT-4o (with or without browsing) or early versions of OpenAI Deep Research. Questions cover diverse topics, including TV shows & movies (16.2%), science & technology (13.7%), art (10.0%), history (9.9%), sports (9.7%), music (9.2%), video games (5.6%), geography (5.5%), and politics (4.7%).

The benchmark follows an "inverted" design approach: trainers start from a known fact and write a question whose answer is easy to verify but hard to find. For example: identifying a scientific paper from EMNLP 2018-2023 in which the first author did their undergraduate studies at Dartmouth and the fourth author did theirs at UPenn.

Human performance on BrowseComp underscores its difficulty: trainers solved only 29.2% of problems within 2 hours and gave up on the remaining 70.8% after the time limit. BrowseComp therefore measures core browsing capabilities, including persistence, creative search strategies, and factual reasoning, making it an important benchmark for evaluating web-browsing AI agents.
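As a rough illustration, evaluation on BrowseComp reduces to asking an agent each question and recording its final short answer alongside the easy-to-verify reference answer. The Python sketch below is a hypothetical harness, not the official one; the record fields ("problem", "answer") and the browsing_agent callable are assumptions made purely for illustration.

from typing import Callable, Dict, List

def run_benchmark(records: List[Dict[str, str]],
                  browsing_agent: Callable[[str], str]) -> List[Dict[str, str]]:
    # Ask the agent each hard-to-find question and keep its final answer
    # next to the short, easy-to-verify reference answer.
    results = []
    for record in records:
        predicted = browsing_agent(record["problem"])  # agent browses the web, returns a short answer
        results.append({
            "problem": record["problem"],
            "reference": record["answer"],
            "predicted": predicted,
        })
    return results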
Methodology
BrowseComp reports a single accuracy score on a scale of 0 to 100: the percentage of questions for which the model's final answer matches the short reference answer, as judged by an automated grader. Higher scores indicate better performance. For the exact grading prompt and methodology, please refer to the original paper.
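A minimal sketch of this scoring step, assuming results like those collected in the loop above and a model-based grader that returns a binary correct/incorrect judgment; grade_with_llm is a hypothetical helper used here for illustration, not part of any released API.

def accuracy(results, grade_with_llm) -> float:
    # Percentage of questions (0-100) whose predicted answer the grader
    # judges to match the reference answer.
    correct = sum(
        1 for r in results
        if grade_with_llm(question=r["problem"],
                          predicted=r["predicted"],
                          reference=r["reference"])
    )
    return 100.0 * correct / len(results)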
Publication
This benchmark was published by OpenAI in 2025. Read the full paper, "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents," for the complete methodology and results.