@inproceedings{cheng-etal-2025-star,
    title = "{STAR}: Self-Automated Back-Querying for Production Data Generation",
    author = "Cheng, Kellen Tan and
      Gentile, Anna Lisa and
      DeLuca, Chad and
      Ren, Guang-Jie",
    editor = "Inui, Kentaro and
      Sakti, Sakriani and
      Wang, Haofen and
      Wong, Derek F. and
      Bhattacharyya, Pushpak and
      Banerjee, Biplab and
      Ekbal, Asif and
      Chakraborty, Tanmoy and
      Singh, Dhirendra Pratap",
    booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.ijcnlp-long.9/",
    pages = "133--148",
    ISBN = "979-8-89176-298-5",
abstract = "The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors has many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs before deployment. In this work, we propose STAR, a simple yet intuitive solution to generate production-like labeled data for LLMs' guardrails development. STAR is based on two key ideas: (i) using self-automated back-querying to synthetically generate data, paired with (ii) a sparse human-in-the-loop clustering technique to label the data. The aim of self-automated back-querying is to construct a parallel corpus roughly representative of the original dataset and resembling real LLM output. We then infuse existing datasets with our synthetically generated examples to produce robust training data for our detectors. We test our technique on one of the most difficult and nuanced detectors: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.48{\%}, despite having 400x less parameters."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="cheng-etal-2025-star">
    <titleInfo>
      <title>STAR: Self-Automated Back-Querying for Production Data Generation</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Kellen</namePart>
      <namePart type="given">Tan</namePart>
      <namePart type="family">Cheng</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Anna</namePart>
      <namePart type="given">Lisa</namePart>
      <namePart type="family">Gentile</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Chad</namePart>
      <namePart type="family">DeLuca</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Guang-Jie</namePart>
      <namePart type="family">Ren</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-12</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Kentaro</namePart>
        <namePart type="family">Inui</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Sakriani</namePart>
        <namePart type="family">Sakti</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Haofen</namePart>
        <namePart type="family">Wang</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Derek</namePart>
        <namePart type="given">F</namePart>
        <namePart type="family">Wong</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Pushpak</namePart>
        <namePart type="family">Bhattacharyya</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Biplab</namePart>
        <namePart type="family">Banerjee</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Asif</namePart>
        <namePart type="family">Ekbal</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Tanmoy</namePart>
        <namePart type="family">Chakraborty</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Dhirendra</namePart>
        <namePart type="given">Pratap</namePart>
        <namePart type="family">Singh</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>The Asian Federation of Natural Language Processing and The Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Mumbai, India</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-298-5</identifier>
    </relatedItem>
    <abstract>The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant number of risks associated with their usage. Guardrails technologies aim to mitigate these risks by filtering LLMs’ input/output text through various detectors. However, developing and maintaining robust detectors presents many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs before deployment. In this work, we propose STAR, a simple yet intuitive solution to generate production-like labeled data for LLMs’ guardrails development. STAR is based on two key ideas: (i) using self-automated back-querying to synthetically generate data, paired with (ii) a sparse human-in-the-loop clustering technique to label the data. The aim of self-automated back-querying is to construct a parallel corpus roughly representative of the original dataset and resembling real LLM output. We then infuse existing datasets with our synthetically generated examples to produce robust training data for our detectors. We test our technique on one of the most difficult and nuanced detectors: the identification of health advice in LLM output, and demonstrate improvement over other solutions. Our detector outperforms GPT-4o by up to 3.48%, despite having 400x fewer parameters.</abstract>
<identifier type="citekey">cheng-etal-2025-star</identifier>
<location>
<url>https://aclanthology.org/2025.ijcnlp-long.9/</url>
</location>
<part>
<date>2025-12</date>
<extent unit="page">
<start>133</start>
<end>148</end>
</extent>
</part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T STAR: Self-Automated Back-Querying for Production Data Generation
%A Cheng, Kellen Tan
%A Gentile, Anna Lisa
%A DeLuca, Chad
%A Ren, Guang-Jie
%Y Inui, Kentaro
%Y Sakti, Sakriani
%Y Wang, Haofen
%Y Wong, Derek F.
%Y Bhattacharyya, Pushpak
%Y Banerjee, Biplab
%Y Ekbal, Asif
%Y Chakraborty, Tanmoy
%Y Singh, Dhirendra Pratap
%S Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
%D 2025
%8 December
%I The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
%C Mumbai, India
%@ 979-8-89176-298-5
%F cheng-etal-2025-star
%X The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant number of risks associated with their usage. Guardrails technologies aim to mitigate these risks by filtering LLMs’ input/output text through various detectors. However, developing and maintaining robust detectors presents many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs before deployment. In this work, we propose STAR, a simple yet intuitive solution to generate production-like labeled data for LLMs’ guardrails development. STAR is based on two key ideas: (i) using self-automated back-querying to synthetically generate data, paired with (ii) a sparse human-in-the-loop clustering technique to label the data. The aim of self-automated back-querying is to construct a parallel corpus roughly representative of the original dataset and resembling real LLM output. We then infuse existing datasets with our synthetically generated examples to produce robust training data for our detectors. We test our technique on one of the most difficult and nuanced detectors: the identification of health advice in LLM output, and demonstrate improvement over other solutions. Our detector outperforms GPT-4o by up to 3.48%, despite having 400x fewer parameters.
%U https://aclanthology.org/2025.ijcnlp-long.9/
%P 133-148

Markdown (Informal)
[STAR: Self-Automated Back-Querying for Production Data Generation](https://aclanthology.org/2025.ijcnlp-long.9/) (Cheng et al., IJCNLP-AACL 2025)

ACL
Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, and Guang-Jie Ren. 2025. STAR: Self-Automated Back-Querying for Production Data Generation. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 133–148, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.