TaoBench

Do Automated Theorem Prover LLMs Generalize Beyond MathLib?

1 University of California, Los Angeles 2 University of California, Santa Barbara
* Co-first authors
† Senior authors are listed in alphabetical order
Teaser figure

Overview

In this work, we evaluate the robustness of current ATP systems when applied to a novel definitional framework, specifically examining the performance gap between standard library problems and bespoke mathematical constructions.

We introduce TaoBench, an undergraduate-level benchmark derived from Terence Tao's Analysis I, whose Lean formalization builds core concepts of analysis from scratch rather than relying on standard Mathlib definitions, in some cases mixing from-scratch and Mathlib constructions. For fair evaluation, we build an agentic pipeline that automatically extracts a compilable, self-contained local environment for each problem. To isolate the effect of definitional frameworks, we additionally translate every problem into a mathematically equivalent Mathlib formulation, yielding paired TaoBenchMathlib statements for direct comparison.

While state-of-the-art ATP models perform capably within the MathLib framework, performance drops by an average of roughly 26% on the definitionally equivalent Tao formulation. This indicates that the main bottleneck is limited generalization across definitional frameworks rather than task difficulty.

Introduction

Automated theorem proving (ATP) benchmarks largely consist of problems formalized in Mathlib, so current ATP training and evaluation are heavily biased toward Mathlib's definitional framework. Frontier mathematics, however, is often exploratory and prototype-heavy, relying on bespoke constructions that deviate from standard libraries. We therefore evaluate the robustness of current ATP systems under a novel definitional framework, examining the performance gap between standard-library problems and bespoke mathematical constructions.

TaoBench, our undergraduate-level benchmark derived from Terence Tao's Analysis I, formalizes analysis by constructing core mathematical concepts from scratch rather than relying on standard Mathlib definitions, in some cases mixing from-scratch and Mathlib constructions. For fair evaluation, an agentic pipeline automatically extracts a compilable, self-contained local environment for each problem, and every problem is additionally translated into a mathematically equivalent Mathlib formulation, yielding paired TaoBenchMathlib statements for direct comparison.

While state-of-the-art ATP models perform capably within the Mathlib framework, performance drops by an average of roughly 26% on the definitionally equivalent Tao formulations, indicating that the main bottleneck is limited generalization across definitional frameworks rather than task difficulty. TaoBench thus highlights a gap between benchmark performance and real-world applicability, and provides a concrete foundation for developing and testing provers better aligned with research mathematics.

TaoBench Examples

Below are examples from TaoBench and TaoBenchMathlib. TaoBench is grounded in 150 exercises from Terence Tao's Lean formalization of Analysis I, which develops core concepts of analysis from first principles rather than relying on standard Mathlib definitions and, in some cases, combines from-scratch constructions with Mathlib components. As a paired control, we construct a mathematically equivalent Mathlib formulation of each Tao-style problem, which we call TaoBenchMathlib. This pairing isolates the effect of the definitional framework from the intrinsic difficulty of each problem, enabling a direct evaluation of its impact on model performance.
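To make the contrast concrete, the following is a minimal illustrative sketch of the same fact stated in the two frameworks. The names (`MyNat`, `MyNat.add_zero`) are hypothetical stand-ins for Tao-style chapter-local constructions, not an actual benchmark item:

```lean
-- Illustrative sketch only: `MyNat` and its lemma names are hypothetical
-- stand-ins for a Tao-style chapter-local construction.

-- Tao-style: the naturals and their addition are built from scratch,
-- so a prover must reason from these in-context definitions.
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

def MyNat.add : MyNat → MyNat → MyNat
  | n, .zero   => n
  | n, .succ m => .succ (n.add m)

theorem MyNat.add_zero (n : MyNat) : n.add .zero = n := rfl

-- Paired Mathlib-style formulation: the same fact over the standard
-- naturals, where `Nat.add_zero` is already a library lemma.
example (n : Nat) : n + 0 = n := Nat.add_zero n
```

A prover that has only ever seen the standard `Nat` lemma names must, in the Tao-style setting, unfold and reason about `MyNat.add` directly rather than pattern-matching against memorized library facts.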

TaoBench Data Construction

Constructing a benchmark that preserves exact textbook formalizations while producing self-contained, succinct, and compilable Lean snippets is challenging for current LLMs. Naïvely importing entire chapters or sections of the textbook leads to excessive context length, compilation failures, and unfaithful reconstructions of the original problem and its intent. Even when prompted to optimize for compilation, models may silently alter the intended meaning by introducing shortcuts or trivializing local definitions. These issues motivate an agentic pipeline that grounds the model in authoritative reference information, retrieves only the necessary context, and supports iterative compilation and repair.

Our approach focuses on recovering the minimal context strictly required for faithful compilation of each exercise. Using the JiXia static Lean analysis tool, extended to support additional declaration types, we recursively retrieve the referenced theorems, lemmas, definitions, and notation associated with each benchmark problem. Because JiXia alone does not capture all surface syntax and local notation, we further equip the agent with file lookup to recover source-level syntactic details, as well as a Lean compilation tool that enables repeated compile-and-debug cycles using compiler feedback. The resulting pipeline produces TaoBench, a benchmark of textbook-syntax exercises paired with minimal, fully compilable Lean contexts, enabling controlled evaluation in a realistic out-of-distribution formal setting without confounds from missing context or compilation errors.
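The dependency-closure step can be sketched as follows. The `deps` map and declaration names are hypothetical stand-ins for what a JiXia-style static analysis might report; this is not the actual tool's output format:

```python
# Hypothetical sketch of minimal-context extraction: given a map from
# declaration names to the names they reference, collect the transitive
# closure of declarations an exercise needs. All names are illustrative.
from collections import deque

def minimal_context(deps: dict[str, set[str]], roots: set[str]) -> set[str]:
    """Return every declaration reachable from `roots` via `deps`."""
    needed, queue = set(roots), deque(roots)
    while queue:
        decl = queue.popleft()
        for ref in deps.get(decl, set()):
            if ref not in needed:
                needed.add(ref)
                queue.append(ref)
    return needed

deps = {
    "ex_2_2_1": {"MyNat.add", "MyNat"},
    "MyNat.add": {"MyNat"},
    "MyNat": set(),
    "unused_lemma": {"MyNat"},  # not reachable, so not included
}
print(sorted(minimal_context(deps, {"ex_2_2_1"})))
# → ['MyNat', 'MyNat.add', 'ex_2_2_1']
```

Declarations not reachable from the exercise (here `unused_lemma`) are excluded, which is what keeps each extracted environment minimal.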

Agentic framework for automatically extracting a compilable, self-contained local environment from a formalized textbook. The construction of references and dependencies for each exercise is performed using the JiXia tool. An agentic harness, equipped with a file-lookup tool and a Lean verifier, then iteratively builds a compilable, self-contained local environment.

TaoBenchMathlib Translation Pipeline

To isolate the effect of definitional frameworks from the underlying mathematical difficulty, we construct TaoBenchMathlib, a MathLib-based counterpart to TaoBench. The goal is to translate Tao-style statements, which rely on local in-context definitions, into statements that are both valid in MathLib and mathematically faithful to the original textbook exercises. Early direct translation attempts with GPT-5.1 proved unreliable: only around 60% of outputs were syntactically valid Lean, and even fewer preserved the original meaning. The main source of error was the model's difficulty in matching Tao's local definitions with MathLib's canonical definitions and conventions.

To address this, we developed a three-stage translation pipeline. First, GPT-5.1 with web search enabled rewrites each Tao-style statement into a MathLib-compatible form under fixed formatting constraints, and compiler feedback is used in an iterative loop until the statement compiles successfully. Second, because compilation alone does not ensure semantic correctness, we perform an equivalence-checking step: using JiXia, we extract the target proof states of both the Tao and MathLib formulations, and GPT-5.1 assesses whether they are mathematically equivalent. Expert annotators with backgrounds in Analysis I and Lean then manually verified both context retrieval and translation quality, helping refine the retrieval pipeline and correcting 33 subtle issues in the MathLib formalizations.
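The stage-one rewrite-and-compile loop can be sketched as below; `rewrite_with_llm` and `lean_compile` are hypothetical callables standing in for the model call and the Lean verifier, not real APIs from our pipeline:

```python
# Minimal sketch of an iterative compile-and-repair loop: the model
# proposes a Mathlib-compatible rewrite, the Lean compiler checks it,
# and compiler feedback is fed back into the next attempt.
def translate(statement, rewrite_with_llm, lean_compile, max_rounds=5):
    """Return a compiling rewrite of `statement`, or None on failure."""
    feedback = None
    for _ in range(max_rounds):
        candidate = rewrite_with_llm(statement, feedback)
        ok, feedback = lean_compile(candidate)
        if ok:
            return candidate
    return None
```

Compilation success alone does not guarantee faithfulness, which is why the pipeline adds a separate equivalence-checking stage and expert verification on top of this loop.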

Translation pipeline for converting exercises from Tao's definitional framework to MathLib's. The rewriting and equivalence checking stages ensure the compilability and mathematical equivalence of TaoBenchMathlib, followed by expert verification for additional assurance.

Dataset Overview

TaoBench comprises ten chapters (Chapters 2-11) and sixty-five subsections that follow a standard progression from foundational material (natural numbers, set theory, and number systems) to core real analysis topics, including limits, series, continuity, differentiation, and the Riemann integral.

Experimental Results

Main Results

The table below reports pass@128 results on TaoBench and TaoBenchMathlib. Although the two versions of the benchmark consist of mathematically equivalent problems, all evaluated provers perform substantially worse on the Tao formulations than on the Mathlib formulations. This consistent degradation highlights the difficulty current provers face under a shift in definitional frameworks.

Model Performance on TaoBench and TaoBenchMathlib. Results are reported as pass rates on 150 problems.

In-depth Study

Context Length

We observe that the performance delta between TaoBench and TaoBenchMathlib increases with context length, measured as the number n of local definitions in context. At n = 0, the average delta across models is near zero, confirming that Tao problems without local definitions are not inherently harder than their Mathlib counterparts. When the number of in-context definitions rises to 5 ≤ n < 10, however, the average delta exceeds 40 percentage points, and for n ≥ 10 it approaches or exceeds 50 percentage points for all models. This pattern suggests that the performance gap is driven by an inability to effectively integrate and reason over unfamiliar definitions introduced in context.

Pass rates on TaoBench and TaoBenchMathlib, broken down by the number of local definitions in context.

Counts of data samples across ranges of different numbers of local definitions.

Frontier Models & Context Dependency

Frontier Models: Frontier models underperform both Goedel-Prover-V2-32B and several 7B-parameter ATP models on the Mathlib problem formulations, indicating that specialized ATP models excel in the definitional framework on which they are trained. While frontier models also degrade on the Tao formulations, they achieve performance comparable to, or even exceeding, Goedel-Prover-V2-32B. These results indicate that the relative advantage of frontier models on Tao formulations is driven by a superior capability to leverage in-context examples rather than by stronger proving capabilities overall.

Context Dependency: Across all models, removing contextual information leads to a consistent but relatively modest drop in performance compared to the full-context Tao setting. A manual review of the problems that models solved without context shows that the associated set of references either contained no additional statements (i.e., context length of zero) or referenced a small set of statements that were similar to, or already part of, the standard Mathlib definitional framework (e.g., the rational numbers). Taken with the context-length analysis, these results indicate that current formal theorem provers do not reliably benefit from in-context definitional information unless it conforms closely to their training distribution. The presence of explicit local definitions is therefore not inherently helpful and can be actively harmful, highlighting a limitation in models' ability to perform in-context definitional grounding.

Model Performance by Chapter

The figure below breaks down pass rates by chapter and shows that the overall Tao-Mathlib gap is not uniform: it is concentrated in a few chapters where the representation of core objects diverges most from Mathlib. A large and consistent gap appears in Set Theory (Ch. 3), where all five models exhibit a substantial Mathlib advantage of roughly 42-58 percentage points. Two pronounced failure modes occur in The Natural Numbers (Ch. 2) and Series (Ch. 7): for all five models, Tao accuracy falls to 0%, while the Mathlib translation remains solvable, suggesting that the bottleneck is not the underlying analysis content but rather sensitivity to the new definitional framework.

BibTeX

@ PLACEHOLDER