CypherSmith: Transforming Text-to-Cypher Generation for LLMs with Synthetic Data

Zeyu Zhang; Kexuan Sun; Zheng Tang; Jens-S. Vöckler; Thien Huu Nguyen; Thuy Vu

Abstract

Knowledge Graph (KG) retrieval is a promising augmentation to address knowledge gaps and hallucinations in LLMs. As KGs in practice are stored in graph databases (e.g., Wikidata, Freebase), accurate retrieval requires translating natural language questions into structured queries (query generation). A key challenge of query generation is Text-to-Cypher, which generates Cypher queries for property graphs (e.g., Neo4j), a paradigm increasingly adopted in industry for its scalable architectures and expressive schemas. However, compared to other query generation tasks such as Text-to-SQL or Text-to-SPARQL, Text-to-Cypher remains underexplored due to scarce public KGs and datasets. Existing datasets are small, domain-limited, and lack diversity, constraining LLM progress. To address this, we introduce CypherSmith, an instruction-tuning dataset over 12× larger than prior public Text-to-Cypher datasets, spanning diverse domains to better support LLM fine-tuning. Our key distinction lies in fully leveraging open-source LLMs for large-scale synthetic data generation and introducing a novel likelihood-based filtering technique to ensure high-quality Text-to-Cypher data. Extensive experiments demonstrate the effectiveness of CypherSmith, achieving state-of-the-art LLM performance.

Anthology ID: 2026.acl-long.1601
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 34665–34682
URL: https://aclanthology.org/2026.acl-long.1601/
Cite (ACL): Zeyu Zhang, Kexuan Sun, Zheng Tang, Jens-S. Vöckler, Thien Huu Nguyen, and Thuy Vu. 2026. CypherSmith: Transforming Text-to-Cypher Generation for LLMs with Synthetic Data. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34665–34682, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): CypherSmith: Transforming Text-to-Cypher Generation for LLMs with Synthetic Data (Zhang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1601.pdf
Checklist: 2026.acl-long.1601.checklist.pdf
