Benchmarking Automated Theorem Proving with Large Language Models

Vanessa Lama, Catherine Ma, Tirthankar Ghosal


Abstract
Theorem proving presents a significant challenge for large language models (LLMs) because formal proofs must be rigorously checked by proof assistants such as Lean, leaving no margin for error or hallucination. While existing LLM-based theorem provers attempt to operate autonomously, they often struggle with novel and complex theorems where human insight is essential. Lean Copilot is a recent framework that integrates LLM inference into the Lean proof assistant environment. In this work, we benchmark the performance of several LLMs, both general-purpose and math-specific, for theorem proving using the Lean Copilot framework. Our initial investigation suggests that a general-purpose large model such as LLaMa-70B still has an edge over smaller math-specific models on this task. We provide useful insights into the performance of the different LLMs we chose for the task.
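For readers unfamiliar with Lean Copilot, the sketch below illustrates how its LLM-backed tactics are invoked from inside a Lean 4 proof. The tactic names (suggest_tactics, search_proof) follow the public Lean Copilot documentation; the toy theorem is our own illustration and does not appear in the paper.

import LeanCopilot

-- `suggest_tactics` queries the configured LLM for candidate next
-- tactics at the current goal and lists them in the infoview,
-- leaving the choice to the human user.
example (a b : Nat) : a + b = b + a := by
  suggest_tactics

-- `search_proof` instead runs an LLM-guided proof search that
-- tries to close the goal automatically.
example (a b : Nat) : a + b = b + a := by
  search_proof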
Anthology ID:
2024.nlp4science-1.18
Volume:
Proceedings of the 1st Workshop on NLP for Science (NLP4Science)
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Lotem Peled-Cohen, Nitay Calderon, Shir Lissak, Roi Reichart
Venue:
NLP4Science
Publisher:
Association for Computational Linguistics
Pages:
208–218
URL:
https://aclanthology.org/2024.nlp4science-1.18
Cite (ACL):
Vanessa Lama, Catherine Ma, and Tirthankar Ghosal. 2024. Benchmarking Automated Theorem Proving with Large Language Models. In Proceedings of the 1st Workshop on NLP for Science (NLP4Science), pages 208–218, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
Benchmarking Automated Theorem Proving with Large Language Models (Lama et al., NLP4Science 2024)
PDF:
https://aclanthology.org/2024.nlp4science-1.18.pdf