GraphQL offers a flexible alternative to REST APIs, allowing precise data retrieval across multiple sources in a single query. However, generating complex GraphQL queries remains a significant challenge. Large Language Models (LLMs), while powerful, often produce suboptimal queries due to limited exposure to GraphQL schemas and their structural intricacies. Custom prompt engineering with in-context examples is a common way to guide LLMs, but existing methods, such as random example selection, often yield unsatisfactory results. While semantic similarity-based selection is effective in other domains, it falls short for GraphQL, where understanding schema-specific nuances is crucial for accurate query formulation. To address this, we propose a Schema and NL-Aware In-context Learning (SNAIL) framework that integrates structural and semantic information from GraphQL schemas with natural language inputs, enabling schema-aware in-context learning. Unlike existing methods, our approach captures the complexities of GraphQL schemas to improve query generation accuracy. We validate the framework on a publicly available complex GraphQL test dataset, demonstrating notable gains, with certain query classes improving by up to 20% for some LLMs. As GraphQL adoption grows, with Gartner predicting that over 60% of enterprises will use it in production by 2027, this work addresses a critical need, paving the way for more efficient and reliable GraphQL query generation in enterprise applications.
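The abstract above does not spell out how SNAIL scores candidate examples, so the following is only a minimal sketch of what schema-and-NL-aware example selection could look like. Token-set Jaccard overlap stands in for real structural and semantic similarity measures, and the blending weight `alpha`, the `Example` record, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Example:
    schema: str    # GraphQL SDL of the example's data store
    question: str  # natural language query
    query: str     # gold GraphQL operation

def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two token sets (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def schema_tokens(sdl: str) -> set:
    """Crude structural signature: bare type and field tokens in the SDL."""
    return {t.strip("{}():!,[]") for t in sdl.split()} - {""}

def select_examples(schema: str, question: str, pool: list[Example],
                    k: int = 3, alpha: float = 0.5) -> list[Example]:
    """Rank few-shot candidates by a weighted mix of schema-structure
    similarity and natural-language similarity; alpha blends the two."""
    s_toks = schema_tokens(schema)
    q_toks = set(question.lower().split())
    def score(ex: Example) -> float:
        return (alpha * jaccard(s_toks, schema_tokens(ex.schema))
                + (1 - alpha) * jaccard(q_toks, set(ex.question.lower().split())))
    return sorted(pool, key=score, reverse=True)[:k]
```

In such a setup, the selected examples would be concatenated into the prompt ahead of the target schema and question; the paper's actual similarity measures may differ substantially from this token-overlap stand-in.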
Function calling using Large Language Models (LLMs) is an active research area that aims to empower LLMs to execute APIs to perform real-world tasks. However, sequential function calling with interdependent functions is still under-explored. To this end, we introduce GraphQLRestBench, a dataset of natural language utterances paired with function call sequences representing real-world REST API calls, with variable mapping between functions. To represent the response structure of the functions in the LLM prompt, we use the GraphQL schema of the REST APIs. We also introduce a custom evaluation framework for the dataset, consisting of four specially designed metrics. We evaluate various open-source LLMs on the dataset using few-shot Chain-of-Thought and ReAct prompting to establish a reasonable baseline.
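As a concrete illustration of the data shape described above, a GraphQLRestBench-style record might pair an utterance with a call sequence in which later calls consume fields of earlier responses. The serialization, function names, and the `$calls[...]` reference syntax below are all invented for illustration; the abstract does not specify the actual format.

```python
# Hypothetical record: the abstract describes utterances paired with REST
# call sequences and variable mapping between functions, not an exact format.
record = {
    "utterance": "Get the email address of the owner of the 'nl2gql' repository.",
    "calls": [
        {"function": "getRepository", "args": {"name": "nl2gql"}},
        # Variable mapping: the second call consumes a field of the first
        # call's response (the reference syntax here is made up).
        {"function": "getUser", "args": {"id": "$calls[0].response.owner.id"}},
    ],
}
```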
GraphQL is a powerful query language for APIs that allows clients to fetch precise data efficiently and flexibly, querying multiple resources with a single request. However, crafting complex GraphQL query operations can be challenging. Large Language Models (LLMs) offer an alternative by generating GraphQL queries from natural language, but they struggle due to limited exposure to publicly available GraphQL schemas, often producing invalid or suboptimal queries. Furthermore, no benchmark test suite is available to reliably evaluate the performance of contemporary LLMs. To address this, we present a large-scale, cross-domain Text-to-GraphQL query operation dataset. The dataset includes 10,940 training triples spanning 185 cross-source data stores and 957 test triples over 14 data stores. Each triple consists of a GraphQL schema, a GraphQL query operation, and a corresponding natural language query. The dataset was created predominantly by hand, with natural language paraphrasing, and carefully validated, requiring approximately 1,200 person-hours. In our evaluation, we tested 10 state-of-the-art LLMs on the test dataset. The best-performing model achieved an accuracy of only around 50% with one in-context few-shot example, underscoring the necessity of custom fine-tuning. To support further research and benchmarking, we are releasing the training and test datasets under the MIT License. The dataset is available at https://github.com/stepzen-dev/NL2GQL.
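To make the triple structure concrete, here is an illustrative (schema, query operation, natural language) triple in the shape the abstract describes; the toy schema, field names, and phrasing are invented and do not come from the NL2GQL dataset.

```python
# Illustrative triple; the schema, operation, and question are toy examples.
triple = {
    "schema": """
        type Author { id: ID! name: String! }
        type Book { id: ID! title: String! author: Author }
        type Query { books(limit: Int): [Book!]! }
    """,
    "query": "{ books(limit: 5) { title author { name } } }",
    "nl": "Show the titles and author names of five books.",
}
```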