Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal (Editors)


Anthology ID: 2025.coling-industry
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Venue: COLING
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2025.coling-industry/
PDF: https://aclanthology.org/2025.coling-industry.pdf

STAND-Guard: A Small Task-Adaptive Content Moderation Model
Minjia Wang | Pingping Lin | Siqi Cai | Shengnan An | Shengjie Ma | Zeqi Lin | Congrui Huang | Bixiong Xu

Content moderation, the process of reviewing and monitoring the safety of generated content, is important for the development of welcoming online platforms and responsible large language models. Content moderation comprises various tasks, each with unique requirements tailored to specific scenarios. It is therefore crucial to develop a model that can be easily and accurately adapted to novel or customized content moderation tasks without extensive model tuning. This paper presents STAND-Guard, a Small Task-Adaptive coNtent moDeration model. The basic motivation is that by performing instruction tuning on various content moderation tasks, we can unleash the power of small language models (SLMs) on unseen (out-of-distribution) content moderation tasks. We also carefully study the effects of training tasks and model size on the efficacy of the cross-task fine-tuning mechanism. Experiments demonstrate that STAND-Guard is comparable to GPT-3.5-Turbo across over 40 public datasets, as well as proprietary datasets derived from real-world business scenarios. Remarkably, STAND-Guard achieved nearly equivalent results to GPT-4-Turbo on unseen English binary classification tasks.

Query-LIFE: Query-aware Language Image Fusion Embedding for E-Commerce Relevance
Hai Zhu | Yuankai Guo | Ronggang Dou | Kai Liu

The relevance module plays a fundamental role in e-commerce search, as it is responsible for selecting relevant products from thousands of items based on user queries, thereby enhancing user experience and efficiency. The traditional method calculates the relevance score based on product titles and user queries, but the information in the title alone may be insufficient to describe the product completely. A more general method is to further leverage product image information. In recent years, vision-language pre-training models have achieved impressive results in many scenarios; they leverage contrastive learning to map both textual and visual features into a joint embedding space. In e-commerce, a common practice is to further fine-tune such a pre-trained model using e-commerce data. However, the performance is sub-optimal because vision-language pre-training models lack alignment specifically designed for queries. In this paper, we propose Query-aware Language Image Fusion Embedding (Query-LIFE) to address these challenges. Query-LIFE utilizes query-based multimodal fusion to effectively incorporate the image and title based on the product type. Additionally, it employs query-aware modal alignment to enhance the accuracy of the comprehensive representation of products. Furthermore, we design GenFilt, which utilizes the generation capability of large models to filter out false negative samples and further improve the overall performance of the contrastive learning task in the model. Experiments have demonstrated that Query-LIFE outperforms existing baselines. We have conducted ablation studies and human evaluations to validate the effectiveness of each module within Query-LIFE. Moreover, Query-LIFE has been deployed on Miravia Search, improving both relevance and conversion efficiency.

Improving Tool Retrieval by Leveraging Large Language Models for Query Generation
Mohammad Kachuee | Sarthak Ahuja | Vaibhav Kumar | Puyang Xu | Xiaohu Liu

Tool use by Large Language Models (LLMs) is a promising avenue to extend their reach beyond language or conversational settings. The number of tools can scale to thousands as they enable accessing sensory information, fetching updated factual knowledge, or taking actions in the real world. In such settings, in-context learning by providing a short list of relevant tools in the prompt is a viable approach. To retrieve relevant tools, various approaches have been suggested, ranging from simple frequency-based matching to dense embedding-based semantic retrieval. However, such approaches lack the contextual and common-sense understanding required to retrieve the right tools for complex user requests. Rather than increasing the complexity of the retrieval component itself, we propose leveraging LLM understanding to generate a retrieval query. The generated query is then embedded and used to find the most relevant tools via a nearest-neighbor search. We investigate three approaches for query generation: zero-shot prompting, supervised fine-tuning on tool descriptions, and alignment learning by iteratively optimizing a reward metric measuring retrieval performance. By conducting extensive experiments on a dataset covering complex and multi-tool scenarios, we show that leveraging LLMs for query generation improves retrieval in both in-domain (seen tools) and out-of-domain (unseen tools) settings.
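
A minimal Python sketch of the pipeline described above, with generate_query and embed as hypothetical stand-ins for the LLM call and the embedding model (not the authors' implementation):

import numpy as np

def retrieve_tools(user_request, generate_query, embed, tool_vecs, tool_names, k=5):
    # generate_query wraps an LLM call that rewrites the user request
    # into a short retrieval query; embed maps text to a dense vector.
    query_vec = embed(generate_query(user_request))
    # Cosine-similarity nearest-neighbor search over pre-computed
    # tool-description embeddings (one row per tool).
    sims = tool_vecs @ query_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec))
    return [tool_names[i] for i in np.argsort(-sims)[:k]]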

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
Rafael Teixeira de Lima | Shubham Gupta | Cesar Berrospi Ramis | Lokesh Mishra | Michele Dolfi | Peter Staar | Panagiotis Vagenas

Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system’s use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to suboptimal system design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.

RED-CT: A Systems Design Methodology for Using LLM-labeled Data to Train and Deploy Edge Linguistic Classifiers
David Farr | Nico Manzonelli | Iain Cruickshank | Jevin West

Large language models (LLMs) have enhanced our ability to rapidly analyze and classify unstructured natural language data. However, concerns regarding cost, network limitations, and security constraints have posed challenges for their integration into industry processes. In this study, we adopt a systems design approach to employing LLMs as imperfect data annotators for downstream supervised learning tasks, introducing system intervention measures aimed at improving classification performance. Our methodology outperforms LLM-generated labels in six of eight tests and base classifiers in all tests, demonstrating an effective strategy for incorporating LLMs into the design and deployment of specialized, supervised learning models present in many industry use cases.

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking
Harsha Vardhan Khurdula | Basem Rizk | Indus Khaitan

Current benchmarks for evaluating Vision Language Models (VLMs) often fall short in thoroughly assessing these models’ abilities to understand and process complex visual and textual content. They typically focus on simple tasks that do not require deep reasoning or the integration of multiple data modalities to solve an original problem. To address this gap, we introduce the PARROT-360V Benchmark, a novel and comprehensive benchmark featuring 2487 challenging visual puzzles designed to test VLMs on complex visual reasoning tasks. We evaluated leading models—GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro—using PARROT-360V to assess their capabilities in combining visual clues with language skills to solve tasks in a manner akin to human problem-solving. Our findings reveal a notable performance gap: state-of-the-art models scored between 28% and 56% on our benchmark, significantly lower than their performance on popular benchmarks. This underscores the limitations of current VLMs in handling complex, multi-step reasoning tasks and highlights the need for more robust evaluation frameworks to advance the field.

PDC & DM-SFT: A Road for LLM SQL Bug-Fix Enhancing
Yiwen Duan | Yonghong Yu | Xiaoming Zhao | Yichang Wu | Wenbo Liu

Code Large Language Models (Code LLMs), such as Code Llama and DeepSeek-Coder, have demonstrated exceptional performance in code generation tasks. However, most existing models focus on generating correct code and often struggle with bug repair. We introduce a suite of methods to enhance LLMs’ SQL bug-fixing abilities. The methods mainly consist of two parts: Progressive Dataset Construction (PDC) from scratch and Dynamic Mask Supervised Fine-tuning (DM-SFT). PDC proposes two data expansion methods, from breadth-first and depth-first perspectives respectively. DM-SFT introduces an efficient bug-fixing supervised learning approach, which effectively reduces the total number of training steps and mitigates the “disorientation” in SQL code bug-fixing training. In our evaluation, the code LLMs trained with these two methods exceed all current best-performing models of much larger size.

Multilingual Continual Learning using Attention Distillation
Sanjay Agrawal | Deep Nayak | Vivek Varadarajan Sembium

Query-product relevance classification is crucial for e-commerce stores like Amazon, ensuring accurate search results that match customer intent. Using a unified multilingual model across multiple languages/marketplaces tends to yield superior outcomes but also presents challenges, especially in maintaining performance across all languages when the model is updated or expanded to include a new one. To tackle this, we examine a multilingual continual learning (CL) framework focused on relevance classification tasks and address the issue of catastrophic forgetting. We propose a novel continual learning approach called attention distillation, which sequentially adds adapters for each new language and incorporates a fusion layer above language-specific adapters. This fusion layer distills attention scores from the previously trained fusion layer, focusing on the older adapters. Additionally, translating a portion of the new language data into older ones supports backward knowledge transfer. Our method reduces trainable parameters by 80%, enhancing computational efficiency and enabling frequent updates, while achieving a 1-3% ROC-AUC improvement over single marketplace baselines and outperforming SOTA CL methods on proprietary and external datasets.
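
A minimal sketch of what such an attention-distillation term could look like, assuming the fusion layer exposes attention weights over the stacked adapters and the slot indices of the older adapters are known (both assumptions; the paper's exact formulation may differ):

import torch
import torch.nn.functional as F

def attention_distillation_loss(new_attn, old_attn, old_adapter_slots):
    # Restrict both attention distributions to the slots of the older,
    # language-specific adapters, then pull the new fusion layer's
    # attention toward the previously trained one with a KL term so
    # behavior on earlier languages is preserved.
    new_old = new_attn[..., old_adapter_slots]
    prev_old = old_attn[..., old_adapter_slots]
    return F.kl_div(torch.log_softmax(new_old, dim=-1),
                    torch.softmax(prev_old, dim=-1),
                    reduction="batchmean")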

FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding
Amit Agarwal | Srikant Panda | Kulbhushan Pachauri

In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision-specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with fewer than 90M parameters, making it well-suited to complex real-world Information Extraction (IE) applications where computational resources are limited. We demonstrate FS-DAG’s capability through extensive experiments on information extraction tasks, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance.

OKG: On-the-Fly Keyword Generation in Sponsored Search Advertising
Zhao Wang | Briti Gangopadhyay | Mengjie Zhao | Shingo Takamatsu

Current keyword decision-making in sponsored search advertising relies on large static datasets, limiting automatic keyword setup and failing to adapt to real-time KPI metrics and product updates essential for effective advertising. In this paper, we propose On-the-fly Keyword Generation (OKG), an LLM agent-based method that dynamically monitors KPI changes and adapts keyword generation in real-time, realizing the strategy recommended by advertising platforms. Additionally, we introduce the first publicly accessible dataset containing real keyword data with its KPIs across diverse domains, providing a valuable resource for future research. Experimental results and ablation studies demonstrate the effectiveness of OKG, showing significant improvements across various metrics and emphasizing the importance of each component. We believe OKG not only pioneers the use of LLM agents in this research field but also offers practical value for thousands of advertisers to automate keyword generation in real-world applications.

Best Practices for Distilling Large Language Models into BERT for Web Search Ranking
Dezhi Ye | Junwei Hu | Jiabin Fan | Bowen Tian | Jie Liu | Haijin Liang | Jin Ma

Recent studies have highlighted the significant potential of Large Language Models (LLMs) as zero-shot relevance rankers. These methods predominantly utilize prompt learning to assess the relevance between queries and documents by generating a ranked list of potential documents. Despite their promise, the substantial costs associated with LLMs pose a significant challenge for their direct implementation in commercial search systems. To overcome this barrier and fully exploit the capabilities of LLMs for text ranking, we explore techniques to transfer the ranking expertise of LLMs to a more compact model similar to BERT, using a ranking loss to enable the deployment of less resource-intensive models. Specifically, we enhance the training of LLMs through Continued Pre-Training, taking the query as input and the clicked title and summary as output. We then proceed with supervised fine-tuning of the LLM using a rank loss, assigning the final token as a representative of the entire sentence. Given the inherent characteristics of autoregressive language models, only the final token </s> can encapsulate all preceding tokens. Additionally, we introduce a hybrid point-wise and margin MSE loss to transfer the ranking knowledge from LLMs to smaller models like BERT. This method creates a viable solution for environments with strict resource constraints. Both offline and online evaluations have confirmed the efficacy of our approach, and our model has been successfully integrated into a commercial web search engine as of February 2024.
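
A rough sketch of the hybrid point-wise plus margin-MSE objective named above, with the 0.5 weighting and tensor names as assumptions rather than the paper's exact recipe:

import torch.nn.functional as F

def hybrid_distillation_loss(student_pos, student_neg,
                             teacher_pos, teacher_neg, alpha=0.5):
    # Point-wise term: pull each student score toward the LLM
    # teacher's score for the same (query, document) pair.
    pointwise = (F.mse_loss(student_pos, teacher_pos)
                 + F.mse_loss(student_neg, teacher_neg))
    # Margin term: match the score *gap* between a positive and a
    # negative document, which is what preserves the ranking order.
    margin = F.mse_loss(student_pos - student_neg,
                        teacher_pos - teacher_neg)
    return alpha * pointwise + (1 - alpha) * margin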

Rationale-Guided Distillation for E-Commerce Relevance Classification: Bridging Large Language Models and Lightweight Cross-Encoders
Sanjay Agrawal | Faizan Ahemad | Vivek Varadarajan Sembium

Accurately classifying the relevance of Query-Product pairs is critical in online retail stores such as Amazon, as displaying irrelevant products can harm user experience and reduce engagement. While Large Language Models (LLMs) excel at this task due to their broad knowledge and strong reasoning abilities, their high computational demands constrain their practical deployment in real-world applications. In this paper, we propose a novel distillation approach for e-commerce relevance classification that uses “rationales” generated by LLMs to guide smaller cross-encoder models. These rationales capture key decision-making insights from LLMs, enhancing training efficiency and enabling distillation into smaller cross-encoder models deployable in production without requiring the LLM. Our method achieves average ROC-AUC improvements of 1.4% on 9 multilingual e-commerce datasets, 2.4% on 3 ESCI datasets, and 6% on GLUE datasets over vanilla cross-encoders. Our 110M-parameter BERT model matches 7B-parameter LLMs in performance (< 1% ROC-AUC difference) while being 50 times faster per sample.

Automated Clinical Data Extraction with Knowledge Conditioned LLMs
Diya Li | Asim Kadav | Aijing Gao | Rui Li | Richard Bourgon

The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.

Can Large Language Models Serve as Effective Classifiers for Hierarchical Multi-Label Classification of Scientific Documents at Industrial Scale?
Seyed Amin Tabatabaei | Sarah Fancher | Michael Parsons | Arian Askari

We address the task of hierarchical multi-label classification (HMC) of scientific documents at an industrial scale, where hundreds of thousands of documents must be classified across thousands of dynamic labels. The rapid growth of scientific publications necessitates scalable and efficient methods for classification, further complicated by the evolving nature of taxonomies, where new categories are introduced, existing ones are merged, and outdated ones are deprecated. Traditional machine learning approaches, which require costly retraining with each taxonomy update, become impractical due to the high overhead of labelled data collection and model adaptation. Large Language Models (LLMs) have demonstrated great potential in complex tasks such as multi-label classification. However, applying them to large and dynamic taxonomies presents unique challenges, as the vast number of labels can exceed LLMs’ input limits. In this paper, we present novel methods that combine the strengths of LLMs with dense retrieval techniques to overcome these challenges. Our approach avoids frequent retraining by leveraging zero-shot and few-shot learning for real-time label assignment. We evaluate the effectiveness of our methods on SSRN, a large repository of preprints spanning multiple disciplines, and demonstrate significant improvements in both classification accuracy and cost-efficiency. By developing a tailored evaluation framework for dynamic taxonomies and publicly releasing our code, this research provides critical insights into applying LLMs at an industrial scale to document classification, where the number of classes corresponds to the number of nodes in a large taxonomy, contributing significantly to both data science and natural language processing.

EDAR: A pipeline for Emotion and Dialogue Act Recognition
Elie Dina | Rania Ayachi Kibech | Miguel Couceiro

Individuals facing financial difficulties often make decisions driven by emotions rather than rational analysis. EDAR, a pipeline for Emotion and Dialogue Act Recognition, is designed specifically for the debt collection process in France. By integrating EDAR into decision-making systems, debt collection outcomes could be improved. The pipeline employs Machine Learning and Deep Learning models, demonstrating that smaller models with fewer parameters can achieve high performance, offering an efficient alternative to large language models.

No Size Fits All: The Perils and Pitfalls of Leveraging LLMs Vary with Company Size
Ashok Urlana | Charaka Vinayak Kumar | Bala Mallikarjunarao Garlapati | Ajeet Kumar Singh | Rahul Mishra

Large language models (LLMs) are playing a pivotal role in deploying strategic use cases across a range of organizations, from large pan-continental companies to emerging startups. The issues and challenges involved in the successful utilization of LLMs can vary significantly depending on the size of the organization. It is important to study and discuss these pertinent issues of LLM adaptation with a focus on the scale of the industrial concerns and brainstorm possible solutions and prospective directions. Such a study has not been prominently featured in the current research literature. In this study, we adopt a threefold strategy: first, we conduct a case study with industry practitioners to formulate the key research questions; second, we examine existing industrial publications to address these questions; and finally, we provide a practical guide for industries to utilize LLMs more efficiently. We release the GitHub repository with the most recent papers in the field.

Predicting Fine-tuned Performance on Larger Datasets Before Creating Them
Toshiki Kuramoto | Jun Suzuki

This paper proposes a method to estimate the performance of pretrained models fine-tuned with a larger dataset from the result with a smaller dataset. Specifically, we demonstrate that when a pretrained model is fine-tuned, its classification performance increases at the same overall rate, regardless of the original dataset size, as the number of epochs increases. Subsequently, we verify that an approximate formula based on this trend can be used to predict the performance when the model is trained with ten times or more training data, even when the initial training dataset is limited. Our results show that this approach can help resource-limited companies develop machine-learning models.

A Recipe For Building a Compliant Real Estate Chatbot
Navid Madani | Anusha Bagalkotkar | Supriya Anand | Gabriel Arnson | Rohini K. Srihari | Kenneth Joseph

In recent years, there has been significant effort to align large language models with human preferences. This work focuses on developing a chatbot specialized in the real estate domain, with an emphasis on incorporating compliant behavior to ensure it can be used without perpetuating discriminatory practices like steering and redlining, which have historically plagued the real estate industry in the United States. Building on prior work, we present a method for generating a synthetic general instruction-following dataset, along with safety data. Through extensive evaluations and benchmarks, we fine-tuned a Llama-3-8B-Instruct model and demonstrated that we can enhance its performance significantly to match large closed-source models like GPT-4o, while making it safer and more compliant. We open-source the model, data, and code to support further development and research in the community.

Geo-Spatially Informed Models for Geocoding Unstructured Addresses
Uddeshya Singh | Devanapalli Ravi Shankar | Gowtham Bellala | Vikas Goel

Geocoding customer addresses and determining precise locations is a crucial component for any e-commerce company. Shipment delivery costs make up a significant portion of overall expenses, and having exact customer locations not only improves operational efficiency but also reduces costs and enhances the customer experience. While state-of-the-art geocoding systems are well-suited for developed countries with structured city layouts and high-quality reference corpora, they are less effective in developing countries like India, where addresses are highly unstructured and reliable reference data is scarce. Recent research has focused on creating geocoding systems tailored for developing nations such as India. In this work, we propose a method to geocode addresses in such environments. We explored various approaches to incorporate geo-spatial relationships using an LLM backbone, which provided insights into how the model learns these relationships both explicitly and implicitly. Our proposed approach outperforms the current state-of-the-art system by 20% in drift accuracy within 100 meters, and the state-of-the-art commercial system by 54%. This has the potential to reduce incorrect delivery hub assignments by 8%, which leads to significant customer experience improvements and business savings.

Resource-Efficient Anonymization of Textual Data via Knowledge Distillation from Large Language Models
Tobias Deußer | Max Hahnbück | Tobias Uelwer | Cong Zhao | Christian Bauckhage | Rafet Sifa

Protecting personal and sensitive information in textual data is increasingly crucial, especially when leveraging large language models (LLMs) that may pose privacy risks due to their API-based access. We introduce a novel approach and pipeline for anonymizing text across arbitrary domains without the need for manually labeled data or extensive computational resources. Our method employs knowledge distillation from LLMs into smaller encoder-only models via named entity recognition (NER) coupled with regular expressions to create a lightweight model capable of effective anonymization while preserving the semantic and contextual integrity of the data. This reduces computational overhead, enabling deployment on less powerful servers or even personal computing devices. Our findings suggest that knowledge distillation offers a scalable, resource-efficient pathway for anonymization, balancing privacy preservation with model performance and computational efficiency.
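
A minimal sketch of this kind of pipeline, assuming the distilled NER model is wrapped in a Hugging Face token-classification pipeline built with aggregation_strategy="simple", plus a regular expression for one rigid pattern type; the placeholder tags and email regex are illustrative:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def anonymize(text, ner_pipeline):
    # ner_pipeline yields dicts with entity_group, start, and end.
    # Replace spans right-to-left so earlier offsets remain valid.
    for ent in sorted(ner_pipeline(text), key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + "[" + ent["entity_group"] + "]" + text[ent["end"]:]
    # Regular expressions cover rigid patterns the NER model may miss.
    return EMAIL.sub("[EMAIL]", text)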

Fine-Tuning Medium-Scale LLMs for Joint Intent Classification and Slot Filling: A Data-Efficient and Cost-Effective Solution for SMEs
Maia Aguirre | Ariane Méndez | Arantza del Pozo | Maria Ines Torres | Manuel Torralbo

Dialogue Systems (DS) are increasingly in demand for automating tasks through natural language interactions. However, the core techniques for user comprehension in DS depend heavily on large amounts of labeled data, limiting their applicability in the data-scarce environments common to many companies. This paper identifies best practices for data-efficient development and cost-effective deployment of DS in real-world application scenarios. We evaluate whether fine-tuning a medium-sized Large Language Model (LLM) for joint Intent Classification (IC) and Slot Filling (SF), with moderate hardware resource requirements still affordable by SMEs, can achieve competitive performance using less data than current state-of-the-art models. Experiments on the Spanish and English portions of the MASSIVE corpus demonstrate that the Llama-3-8B-Instruct model fine-tuned with only 10% of the data outperforms the JointBERT architecture and GPT-4o in a zero-shot prompting setup in monolingual settings. In cross-lingual scenarios, Llama-3-8B-Instruct drastically outperforms multilingual JointBERT, demonstrating vastly superior performance when fine-tuned on one language and evaluated on the other.

Enhancing Large Language Models for Scientific Multimodal Summarization with Multimodal Output
Zusheng Tan | Xinyi Zhong | Jing-Yu Ji | Wei Jiang | Billy Chiu

The increasing integration of multimedia such as videos and graphical abstracts in scientific publications necessitates advanced summarization techniques. This paper introduces Uni-SciSum, a framework for Scientific Multimodal Summarization with Multimodal Output (SMSMO), addressing the challenges of fusing heterogeneous data sources (e.g., text, images, video, audio) and outputting multimodal summary within a unified architecture. Uni-SciSum leverages the power of large language models (LLMs) and extends its capability to cross-modal understanding through BridgeNet, a query-based transformer that fuses diverse modalities into a fixed-length embedding. A two-stage training process, involving modal-to-modal pre-training and cross-modal instruction tuning, aligns different modalities with summaries and optimizes for multimodal summary generation. Experiments on two new SMSMO datasets show Uni-SciSum outperforms uni- and multi-modality methods, advancing LLM applications in the increasingly multimodal realm of scientific communication.

“Stupid robot, I want to speak to a human!” User Frustration Detection in Task-Oriented Dialog Systems
Mireia Hernandez Caralt | Ivan Sekulic | Filip Carevic | Nghia Khau | Diana Nicoleta Popa | Bruna Guedes | Victor Guimaraes | Zeyu Yang | Andre Manso | Meghana Reddy | Paolo Rosso | Roland Mathis

Detecting user frustration in modern-day task-oriented dialog (TOD) systems is imperative for maintaining overall user satisfaction, engagement, and retention. However, most recent research is focused on sentiment and emotion detection in academic settings, thus failing to fully capture the implications of real-world user data. To mitigate this gap, in this work, we focus on user frustration in a deployed TOD system, assessing the feasibility of out-of-the-box solutions for user frustration detection. Specifically, we compare the performance of our deployed keyword-based approach, open-source approaches to sentiment analysis, dialog breakdown detection methods, and emerging in-context learning LLM-based detection. Our analysis highlights the limitations of open-source methods for real-world frustration detection, while demonstrating the superior performance of the LLM-based approach, achieving a 16% relative improvement in F1 score on an internal benchmark. Finally, we analyze the advantages and limitations of our methods and provide insight into the user frustration detection task for industry practitioners.

LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models
Harsh Saini | Md Tahmid Rahman Laskar | Cheng Chen | Elham Mohammadi | David Rossouw

Large Language Models (LLMs) have demonstrated an impressive capability to solve a wide range of tasks in recent years. This has inspired researchers and practitioners in the real-world industrial domain to build useful products by leveraging LLMs. However, extensive evaluations of LLMs, in terms of accuracy, memory management, and inference latency, while ensuring the reproducibility of the results, are crucial before deploying LLM-based solutions for real-world usage. In addition, when evaluating LLMs on internal customer data, an on-premise evaluation system is necessary to protect customer privacy, rather than sending customer data to third-party APIs for evaluation. In this paper, we demonstrate how we build an on-premise system for LLM evaluation to address the challenges of evaluating LLMs in real-world industrial settings. We demonstrate the complexities of consolidating various datasets, models, and inference-related artifacts in complex LLM inference pipelines. For this purpose, we also present a case study in a real-world industrial setting. This demonstration of LLM evaluation tool development should help researchers and practitioners build on-premise systems for LLM evaluation that ensure privacy, reliability, robustness, and reproducibility.

Enhancing Future Link Prediction in Quantum Computing Semantic Networks through LLM-Initiated Node Features
Gilchan Park | Paul Baity | Byung-Jun Yoon | Adolfy Hoisie

Quantum computing is rapidly evolving in both physics and computer science, offering the potential to solve complex problems and accelerate computational processes. The development of quantum chips necessitates understanding the correlations among diverse experimental conditions. Semantic networks built on scientific literature, representing meaningful relationships between concepts, have been used across various domains to identify knowledge gaps and novel concept combinations. Neural network-based approaches have shown promise in link prediction within these networks. This study proposes initializing node features using LLMs to enhance node representations for link prediction tasks in graph neural networks. LLMs can provide rich descriptions, reducing the need for manual feature creation and lowering costs. Our method, evaluated using various link prediction models on a quantum computing semantic network, demonstrated efficacy compared to traditional node embedding techniques.

Page Stream Segmentation with LLMs: Challenges and Applications in Insurance Document Automation
Hunter Heidenreich | Ratish Dalvi | Nikhil Verma | Yosheb Getachew

Page Stream Segmentation (PSS) is critical for automating document processing in industries like insurance, where unstructured document collections are common. This paper explores the use of large language models (LLMs) for PSS, applying parameter-efficient fine-tuning to real-world insurance data. Our experiments show that LLMs outperform baseline models in page- and stream-level segmentation accuracy. However, stream-level calibration remains challenging, especially for high-stakes applications. We evaluate post-hoc calibration and Monte Carlo dropout, finding limited improvement. Future work will integrate active learning to enhance model calibration and support deployment in practical settings.

Graph-Augmented Open-Domain Multi-Document Summarization
Xiaoping Shen | Yekun Chai

In the open-domain multi-document summarization (ODMDS) task, retrieving relevant documents from large repositories and generating coherent summaries are crucial. However, existing methods often treat retrieval and summarization as separate tasks, neglecting the relationships among documents. To address these limitations, we propose an integrated retrieval-summarization framework that captures global document relationships through graph-based clustering, guiding the re-ranking of retrieved documents. This cluster-level thematic information is then used to guide large language models (LLMs) in refining the retrieved documents and generating more accurate, coherent summaries. Experimental results on the ODSUM benchmark demonstrate that our method significantly improves retrieval accuracy and produces summaries that surpass those derived from the oracle documents. These findings highlight the potential of our framework to improve both retrieval and summarization tasks in ODMDS.

Improve Speech Translation Through Text Rewrite
Jing Wu | Shushu Wang | Kai Fan | Wei Luo | Minpeng Liao | Zhongqiang Huang

Despite recent progress in Speech Translation (ST) research, the challenges posed by inherent speech phenomena that distinguish transcribed speech from written text are not well addressed. The informal and erroneous nature of spontaneous speech is inadequately represented in the typical parallel text available for building translation models. We propose to address these issues through a text rewrite approach that aims to transform transcribed speech into a cleaner style more in line with the expectations of translation models built from written text. Moreover, the advantages of the rewrite model can be effectively distilled into a standalone translation model. Experiments on several benchmarks, using both publicly available and in-house translation models, demonstrate that adding a rewrite model to a traditional ST pipeline is a cost-effective way to address a variety of speech irregularities and improve speech translation quality for multiple language directions and domains.

CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding
Johannes Kirmayr | Lukas Stappen | Phillip Schneider | Florian Matthes | Elisabeth Andre

In today’s assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system’s suitability for industrial applications.

XTR meets ColBERTv2: Adding ColBERTv2 Optimizations to XTR
Riyaz Ahmad Bhat | Jaydeep Sen

XTR (Lee et al., 2023) introduced an efficient multi-vector retrieval method that addresses the limitations of the ColBERT model (Khattab and Zaharia, 2020) by simplifying retrieval into a single stage through a modified learning objective. While XTR eliminates the need for multistage retrieval, it does not incorporate the efficiency optimizations from ColBERTv2 (Santhanam et al., 2022), which improve indexing and retrieval speed. In this work, we enhance XTR by integrating ColBERTv2’s optimizations, showing that the combined approach preserves the strengths of both models. This results in a more efficient and scalable solution for multi-vector retrieval, while maintaining XTR’s streamlined retrieval process.

sDPO: Don’t Use Your Data All at Once
Dahyun Kim | Yungi Kim | Wonho Song | Hyeonwoo Kim | Yunsu Kim | Sanghoon Kim | Chanjun Park

As large language models (LLMs) continue to advance, aligning them with human preferences has become a critical objective. In this paper, we introduce stepwise DPO (sDPO), an innovative extension of the recently popularized Direct Preference Optimization (DPO) technique for alignment tuning. sDPO systematically partitions the available preference datasets and applies them incrementally, rather than utilizing the entire dataset simultaneously. This stepwise manner enables the integration of progressively more aligned reference models within the DPO training framework. Our empirical results demonstrate that sDPO not only enhances the alignment precision of reference models but also significantly improves the overall performance of the final model, surpassing other prominent LLMs with larger parameter counts.
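
A minimal sketch of the stepwise scheme as described, where dpo_train stands in for one round of ordinary DPO training and the chunking criterion is left to the caller:

def stepwise_dpo(base_model, preference_chunks, dpo_train):
    # preference_chunks: the preference dataset pre-partitioned into
    # an ordered list of subsets, used one after another rather than
    # training on the full dataset at once.
    policy = reference = base_model
    for chunk in preference_chunks:
        # dpo_train(policy, reference, data) runs one DPO round.
        policy = dpo_train(policy, reference, chunk)
        # The next round's reference is the newly aligned policy,
        # giving a progressively more aligned reference model.
        reference = policy
    return policy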

Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI
Yuya Asano | Sabit Hassan | Paras Sharma | Anthony B. Sicilia | Katherine Atwell | Diane Litman | Malihe Alikhani

General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and exhibit linguistic flexibility such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) n-best ASR hypotheses by their lexical and semantic similarity with context and (2) context by phonetic correspondence with ASR hypotheses. Evaluated in home improvement and cooking domains with real-world users, our method improves recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and false positive rate. Users rated .8-1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
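
A toy sketch of the hypothesis-ranking direction described above (the reverse context ranking and the phonetic component are omitted; embed is a hypothetical text-embedding function and the 0.5 weight is an assumption):

from difflib import SequenceMatcher
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_hypotheses(hypotheses, context_terms, embed, w=0.5):
    # Score each n-best ASR hypothesis by a mix of its best lexical
    # match (character overlap) and best semantic match (embedding
    # cosine) against terms from the dialogue state.
    def score(hyp):
        lexical = max(SequenceMatcher(None, hyp, t).ratio() for t in context_terms)
        semantic = max(cosine(embed(hyp), embed(t)) for t in context_terms)
        return w * lexical + (1 - w) * semantic
    return sorted(hypotheses, key=score, reverse=True)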

Federated Retrieval Augmented Generation for Multi-Product Question Answering
Parshin Shojaee | Sai Sree Harsha | Dan Luo | Akash Maharaj | Tong Yu | Yunyao Li

Recent advancements in Large Language Models and Retrieval-Augmented Generation have boosted interest in domain-specific question-answering for enterprise products. However, AI Assistants often face challenges in multi-product QA settings, requiring accurate responses across diverse domains. Existing multi-domain RAG-QA approaches either query all domains indiscriminately, increasing computational costs and LLM hallucinations, or rely on rigid resource selection, which can limit search results. We introduce MKP-QA, a novel multi-product knowledge-augmented QA framework with probabilistic federated search across domains and relevant knowledge. This method enhances multi-domain search quality by aggregating query-domain and query-passage probabilistic relevance. To address the lack of suitable benchmarks for multi-product QAs, we also present new datasets focused on three Adobe products: Adobe Experience Platform, Target, and Customer Journey Analytics. Our experiments show that MKP-QA significantly boosts multi-product RAG-QA performance in terms of both retrieval accuracy and response quality.

Luna: A Lightweight Evaluation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost
Masha Belyi | Robert Friel | Shuai Shao | Atindriyo Sanyal

Retrieval-Augmented Generation (RAG) systems have become pivotal in enhancing the capabilities of language models by incorporating external knowledge retrieval mechanisms. However, a significant challenge in deploying these systems in industry applications is the detection and mitigation of hallucinations: instances where the model generates information that is not grounded in the retrieved context. Addressing this issue is crucial for ensuring the reliability and accuracy of responses generated by large language models (LLMs) in industry settings. Current hallucination detection techniques fail to deliver accuracy, low latency, and low cost simultaneously. We introduce Luna: a DeBERTa-large encoder, fine-tuned for hallucination detection in RAG settings. We demonstrate that Luna outperforms GPT-3.5 and commercial evaluation frameworks on the hallucination detection task, with 97% and 91% reductions in cost and latency, respectively. Luna’s generalization capacity across multiple industry verticals and out-of-domain data makes it a strong candidate for guardrailing industry LLM applications.
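
For illustration only, a grounding check with an encoder classifier might look like the following sketch; the base checkpoint and the two-label head are assumptions, since Luna's fine-tuned weights are not part of this listing:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Base checkpoint only; a real detector would be fine-tuned on
# (context, response) pairs labeled as grounded vs. hallucinated.
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)

def grounded_probability(context: str, response: str) -> float:
    # Encode the retrieved context and the generated response as a
    # sentence pair and read off the probability of the "grounded" label.
    inputs = tok(context, response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()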

Seeing Beyond: Enhancing Visual Question Answering with Multi-Modal Retrieval
Boqi Chen | Anuj Khare | Gaurav Kumar | Arjun Akula | Pradyumna Narayana

Multi-modal Large language models (MLLMs) have made significant strides in complex content understanding and reasoning. However, they still suffer from model hallucination and lack of specific knowledge when facing challenging questions. To address these limitations, retrieval augmented generation (RAG) has emerged as an effective solution. While incorporating knowledge has led to improvements, it also highlights the need for a more robust knowledge selection strategy. For multi-modal tasks, such as visual question answering (VQA), integrating all modalities is crucial in providing comprehensive information for accurate answers. Therefore, we propose to construct an encoder model for extracting joint embedding from all modalities, enabling alignment between the corresponding query and knowledge through contrastive learning. To further improve performance, we introduce an additional MLLM re-selection step, which selects the best matching knowledge from the top-k retrieved results of our alignment model. We evaluated our method, SeBe-VQA, on the Encyclopedic VQA dataset. Our knowledge retrieval results demonstrate the benefit of our multi-modal framework. By incorporating the retrieved knowledge along with the question, we achieve a significant performance improvement compared with the previous method and scenarios without knowledge provision.

AutoProteinEngine: A Large Language Model Driven Agent Framework for Multimodal AutoML in Protein Engineering
Yungeng Liu | Zan Chen | Yuguang Wang | Yiqing Shen

Protein engineering is important for various biomedical applications, but traditional approaches are often inefficient and resource-intensive. While deep learning (DL) models have shown promise, their implementation remains challenging for biologists without specialized computational expertise. To address this gap, we propose AutoProteinEngine (AutoPE), an innovative agent framework that leverages large language models (LLMs) for multimodal automated machine learning (AutoML) in protein engineering. AutoPE introduces a conversational interface that allows biologists without DL backgrounds to interact with DL models using natural language, lowering the entry barrier for protein engineering tasks. Our AutoPE uniquely integrates LLMs with AutoML to handle both protein sequence and graph modalities, automate hyperparameter optimization, and facilitate data retrieval from protein databases. We evaluated AutoPE through two real-world protein engineering tasks, demonstrating substantial improvements in model performance compared to traditional zero-shot and manual fine-tuning approaches. By bridging the gap between DL and biologists’ domain expertise, AutoPE empowers researchers to leverage advanced computational tools without extensive programming knowledge.

Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud
Chengyu Wang | Yuanhao Yue | Jun Huang | Peng Wang

Specializing LLMs in various domain-specific tasks has emerged as a critical step towards achieving high performance. However, the construction and annotation of datasets in specific domains are always very costly. Apart from using superior and expensive closed-source LLM APIs to construct datasets, some open-source models have become strong enough to handle dataset construction in many scenarios. Thus, we present a family of data augmentation models designed to significantly improve the efficiency for model fine-tuning. These models, trained based on sufficiently small LLMs, support key functionalities with low inference costs: instruction expansion, instruction refinement, and instruction-response pair expansion. To fulfill this goal, we first construct an automatic data collection system with seed datasets generated from both public repositories and our in-house datasets. This system leverages powerful LLMs to expand, refine and re-write the instructions and responses, incorporating quality assessment techniques. Following this, we introduce the training process of our models, which effectively distills task-solving and text synthesis abilities from teacher LLMs. Finally, we demonstrate how we integrate these functionalities into a machine learning platform to support low-cost LLM fine-tuning from both dataset preparation and training perspectives for users. Experiments and an application study prove the effectiveness of our approach.

Where do LLMs Encode the Knowledge to Assess the Ambiguity?
Hancheol Park | Geonmin Kim

Recently, large language models (LLMs) have shown remarkable performance across various natural language processing tasks, thanks to their vast amount of knowledge. Nevertheless, they often generate unreliable responses. A common example is providing a single biased answer to an ambiguous question that could have multiple correct answers. To address this issue, in this study, we discuss methods to detect such ambiguous samples. More specifically, we propose a classifier that uses a representation from an intermediate layer of the LLM as input. This is based on observations from previous research that representations of ambiguous samples in intermediate layers are closer to those of relevant label samples in the embedding space, but not necessarily in higher layers. The experimental results demonstrate that using representations from intermediate layers detects ambiguous input prompts more effectively than using representations from the final layer. Furthermore, in this study, we propose a method to train such classifiers without ambiguity labels, as most datasets lack labels regarding the ambiguity of samples, and evaluate its effectiveness.
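
A sketch of the feature-extraction step this implies, assuming a Hugging Face causal LM; the layer index and mean pooling are illustrative choices, not the paper's:

import torch

def intermediate_representation(model, tokenizer, prompt, layer=-12):
    # Run the LLM once and keep all hidden states; take an
    # intermediate layer rather than the final one, then mean-pool
    # the token states into a single vector that can be fed to a
    # small ambiguity classifier.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)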

On the effective transfer of knowledge from English to Hindi Wikipedia
Paramita Das | Amartya Roy | Ritabrata Chakraborty | Animesh Mukherjee

Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from readily available external resources (such as English books) and adapts it to align with Wikipedia’s distinctive style, including its neutral point of view (NPOV) policy, using the in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles by 65% and 62% according to automatic and human judgment-based evaluations, respectively.

BackMATH: Towards Backward Reasoning for Solving Math Problems Step by Step
Shaowei Zhang | Deyi Xiong

Large language models (LLMs) have achieved impressive results in reasoning, particularly in multi-step reasoning tasks. However, when faced with more complex mathematical problems, the performance of LLMs drops significantly. To address this issue, in this paper, we propose a backward reasoning dataset, BackMATH-Data. The dataset comprises approximately 14K backward reasoning problems and 100K reasoning steps. It follows a result-oriented approach, constructing backward reasoning problems by swapping the reasoning results with specific solving conditions in the original problems. Additionally, we introduce the Backward-reasoning Process-supervision Reward Model (BackPRM) and BackMATH-LLM. BackPRM supervises the quality of the generated backward reasoning problems, while BackMATH-LLM is designed for mathematical reasoning. BackMATH-LLM is fine-tuned and enhanced through reinforcement learning by supervising the quality of backward reasoning problems and by providing feedback on reasoning steps, thereby improving the mathematical reasoning capabilities of LLMs. Extensive experiments demonstrate that our model achieves an accuracy of 68.1% on the GSM8K dataset and 21.9% on the MATH dataset, exceeding the SOTA by 1.6% and 2.1% respectively.

Deploying Multi-task Online Server with Large Language Model
Yincen Qu | Hengyue Liu | Kun Wang | Xiangying Dai | Xiaoou Lu | Hui Zhou | Chao Ma

In the industry, numerous tasks are deployed online. Traditional approaches often tackle each task separately by its own network, which leads to excessive costs for developing and scaling models, especially in the context of large language models. Although multi-task methods can save costs through parameter sharing, they often struggle to outperform single-task methods in real-world applications. To tackle these challenges, we present a three-stage multi-task learning framework for large language models. It involves task filtering, followed by fine-tuning on high-resource tasks, and finally fine-tuning on all tasks. We conducted comprehensive experiments in single-task and multi-task settings. Our approach, exemplified on different benchmarks, demonstrates that it is able to achieve performance comparable to the single-task method while reducing up to 90.9% of its overhead.

LLM-Friendly Knowledge Representation for Customer Support
Hanchen Su | Wei Luo | Yashar Mehdad | Wei Han | Elaine Liu | Wayne Zhang | Mia Zhao | Joy Zhang

We propose a practical approach by integrating Large Language Models (LLMs) with a framework designed to navigate the complexities of Airbnb customer support operations. In this paper, our methodology employs a novel reformatting technique, the Intent, Context, and Action (ICA) format, which transforms policies and workflows into a structure more comprehensible to LLMs. Additionally, we develop a synthetic data generation strategy to create training data with minimal human intervention, enabling cost-effective fine-tuning of our model. Our internal experiments (not applied to Airbnb products) demonstrate that our approach of restructuring workflows and fine-tuning LLMs with synthetic data significantly enhances their performance, setting a new benchmark for their application in customer support. Our solution is not only cost-effective but also improves customer support, as evidenced by both accuracy and manual processing time evaluation metrics.

Leveraging Multilingual Models for Robust Grammatical Error Correction Across Low-Resource Languages
Divesh Ramesh Kubal | Apurva Shrikant Nagvenkar

Grammatical Error Correction (GEC) is a crucial task in Natural Language Processing (NLP) aimed at improving the quality of user-generated content, particularly for non-native speakers. This paper introduces a novel end-to-end architecture utilizing the M2M100 multilingual transformer model to build a unified GEC system, with a focus on low-resource languages. A synthetic data generation pipeline is proposed, tailored to address language-specific error categories. The system has been implemented for the Spanish language, showing promising results based on evaluations conducted by linguists with expertise in Spanish. Additionally, we present a user analysis that tracks user interactions, revealing an acceptance rate of 88.2%, as reflected by the actions performed by users.

A Simple yet Efficient Prompt Compression Method for Text Classification Data Annotation Using LLM
Yiran Xie | Debin Xiao | Ping Wang | Shuming Liu

Effectively balancing accuracy and cost is a critical challenge when using large language models (LLMs) for corpus annotation. This paper introduces a novel compression method based on keyword extraction (PCKE) that effectively reduces the number of prompt tokens in text classification annotation tasks, with minimal to no loss in accuracy. Our approach begins with an LLM that generates both category labels and relevant keywords from a small unannotated dataset. These outputs are used to train a BERT-based multi-task model capable of simultaneous classification and keyword extraction. For larger unannotated corpora, this model extracts keywords, which are then used in place of full texts for LLM annotation. The significant reduction in prompt tokens results in substantial cost savings. Furthermore, the use of a few well-chosen keywords ensures that classification accuracy is maintained. Extensive experiments validate that our method not only achieves a superior compression rate but also maintains high accuracy, outperforming existing general-purpose compression techniques. Our approach offers a practical and cost-efficient solution for large-scale text classification annotation using LLMs, particularly applicable in industrial settings.
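
To illustrate the annotation step, a sketch assuming a trained keyword extractor and an LLM labeling function (both hypothetical interfaces here; the prompt wording is illustrative):

def annotate_corpus(texts, extract_keywords, llm_label, n_keywords=5):
    # Instead of sending each full document to the LLM, send only a
    # handful of extracted keywords, shrinking the prompt while keeping
    # the class-discriminative signal.
    labels = []
    for text in texts:
        keywords = extract_keywords(text)[:n_keywords]
        prompt = ("Assign a category to the document described by these "
                  "keywords: " + ", ".join(keywords))
        labels.append(llm_label(prompt))
    return labels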

AMAN: Agent for Mentoring and Assisting Newbies in MMORPG
Jeehyun Lee | Seung-Moo Yang | Won Ik Cho

In online games with diverse content and frequent updates, newcomers first learn gameplay mechanics through community intelligence but soon face challenges that require real-time guidance from senior gamers. To provide easy access to such support, we introduce AMAN, an Agent for Mentoring and Assisting Newbies in MMORPGs (Massively Multiplayer Online Role-Playing Games) - a companion chatbot designed to engage novice gamers. Our model functions as a human-like chat buddy that interacts with users in a friendly manner while providing substantive informational depth. To this end, we propose a multi-stage learning approach that incorporates continual pre-training on a sequence of online resources and instruction tuning on curated dialogues. To align with gamers’ specific needs, we first analyze user-oriented topics from online communities for a widely played MMORPG and construct a domain-specific dataset. Furthermore, we develop multi-turn dialogue data to foster dynamic conversations with users. Evaluation of a model trained on a publicly available language model demonstrates the practical applicability of conversational assistants in helping novice gamers in online games.

pdf bib
KARRIEREWEGE: A large scale Career Path Prediction Dataset
Elena Senger | Yuri Campbell | Rob van der Goot | Barbara Plank

Accurate career path prediction can support many stakeholders, such as job seekers, recruiters, HR, and project managers. However, publicly available data and tools for career path prediction are scarce. In this work, we introduce Karrierewege, a comprehensive, publicly available dataset containing over 500k career paths, significantly surpassing the size of previously available datasets. We link the dataset to the ESCO taxonomy to offer a valuable resource for predicting career trajectories. To tackle the problem of free-text inputs typically found in resumes, we enhance the dataset by synthesizing job titles and descriptions, resulting in Karrierewege+. This allows for accurate predictions from unstructured data, closely aligning with practical application challenges. We benchmark existing state-of-the-art (SOTA) models on our dataset and a previous benchmark, and observe increased performance and robustness when synthesizing data for the free-text use cases.

pdf bib
Transforming Code Understanding: Clustering-Based Retrieval for Improved Summarization in Domain-Specific Languages
Baban Gain | Dibyanayan Bandyopadhyay | Samrat Mukherjee | Aryan Sahoo | Saswati Dana | Palanivel Kodeswaran | Sayandeep Sen | Asif Ekbal | Dinesh Garg

A domain-specific extension of the C language known as extended Berkeley Packet Filter (eBPF) has gained widespread acceptance in the cloud community for various tasks, including observability, security, and network acceleration. Due to its recency and complexity, there is an overwhelming need for natural language summaries of existing eBPF code (particularly open-source code) for practitioners and developers, which would go a long way toward easing the understanding and development of new code. However, as eBPF is a niche Domain-Specific Language (DSL), training data is scarce. In this paper, we investigate the effectiveness of LLMs for summarizing low-resource DSLs, in the context of eBPF code. Specifically, we propose a clustering-based technique to retrieve in-context examples that are semantically closer to the test example, and propose a very simple yet powerful prompt design that yields superior-quality code summary generation. Experimental results show that our proposed retrieval approach for prompt generation improves eBPF code summarization accuracy by up to 12.9 BLEU points over other prompting techniques. The code is available at https://github.com/babangain/ebpf_summ.
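The retrieval step can be sketched roughly as follows, with TF-IDF standing in for the actual encoder: cluster the training pool, then take the test snippet's most similar neighbors from its own cluster as in-context examples.

```python
# Sketch of clustering-based in-context example retrieval for code
# summarization. TF-IDF is a stand-in for whatever code encoder is used;
# the snippets are toy eBPF-style strings.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_pool = [
    'SEC("xdp") int drop_all(struct xdp_md *ctx) { return XDP_DROP; }',
    'SEC("kprobe/sys_clone") int trace_clone(void *ctx) { return 0; }',
    'SEC("xdp") int pass_all(struct xdp_md *ctx) { return XDP_PASS; }',
    'SEC("tracepoint/syscalls/sys_enter_open") int on_open(void *ctx) { return 0; }',
]
vec = TfidfVectorizer()
X = vec.fit_transform(train_pool)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def retrieve_examples(test_code: str, k: int = 2) -> list[str]:
    q = vec.transform([test_code])
    cluster = km.predict(q)[0]
    idx = [i for i, c in enumerate(km.labels_) if c == cluster]
    sims = cosine_similarity(q, X[idx]).ravel()
    return [train_pool[idx[i]] for i in sims.argsort()[::-1][:k]]

# Retrieved snippets (paired with their reference summaries) are placed in
# the prompt as in-context examples before the test snippet.
print(retrieve_examples('SEC("xdp") int count_pkts(struct xdp_md *ctx) { return XDP_PASS; }'))
```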

pdf bib
Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator
Frederic Thomas Kirstein | Terry Lima Ruas | Bela Gipp

The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaptation of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation as the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA’s components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high point-biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework’s flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.
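For readers unfamiliar with the metric, point-biserial correlation relates a binary variable (here, whether an error is detected) to a continuous one (a human quality score). A quick worked example on toy judgments:

```python
# Worked example of the point-biserial correlation used in the abstract.
# The judgments below are toy data, not results from the paper.
from scipy.stats import pointbiserialr

detected = [1, 1, 0, 0, 1, 0, 1, 0]                        # binary error detection
human_quality = [2.0, 1.5, 4.5, 4.0, 2.5, 5.0, 1.0, 4.5]   # continuous human ratings

r, p = pointbiserialr(detected, human_quality)
print(f"point-biserial r = {r:.2f} (p = {p:.3f})")  # strongly negative here:
# summaries flagged as erroneous receive lower human quality scores.
```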

pdf bib
Learning to Rewrite Negation Queries in Product Search
Mengtian Guo | Mutasem Al-Darabsah | Choon Hui Teo | Jonathan May | Tarun Agarwal | Rahul Bhagat

In product search, negation is frequently used to articulate unwanted product features or components. Modern search engines often struggle to comprehend negations, resulting in suboptimal user experiences. While various methods have been proposed to tackle negations in search, none of them take the vocabulary gap between query keywords and product text into consideration. In this work, we introduce a query rewriting approach to enhance the performance of product search engines when dealing with queries with negations. First, we introduce a data generation workflow that leverages large language models (LLMs) to extract query rewrites from product text. Subsequently, we train a Seq2Seq model to generate query rewrites for unseen queries. Our experiments demonstrate that query rewriting yields a 3.17% precision@30 improvement for queries with negations. The promising results pave the way for further research on enhancing the search performance of queries with negations.
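At serving time the approach reduces to a single Seq2Seq call. A minimal inference sketch, assuming a T5-style rewriter fine-tuned on the LLM-extracted (query, rewrite) pairs; the checkpoint below is the public base model, not a released rewriter:

```python
# Inference sketch for a negation-aware query rewriter. "t5-small" is a
# public stand-in; a model fine-tuned on (query, rewrite) pairs would be
# loaded here, so the base model's output is illustrative only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

def rewrite(query: str) -> str:
    inputs = tokenizer(f"rewrite negation query: {query}", return_tensors="pt")
    out = model.generate(**inputs, max_length=32, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# e.g. "shampoo without sulfates" -> "sulfate-free shampoo", bridging the
# vocabulary gap between the query and the product text.
print(rewrite("shampoo without sulfates"))
```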

pdf bib
LAW: Legal Agentic Workflows for Custody and Fund Services Contracts
William Watson | Nicole Cho | Nishan Srishankar | Zhen Zeng | Lucas Cecchi | Daniel Scott | Suchetha Siddagangappa | Rachneet Kaur | Tucker Balch | Manuela Veloso

Legal contracts in the custody and fund services domain govern critical aspects such as key provider responsibilities, fee schedules, and indemnification rights. However, it is challenging for an off-the-shelf Large Language Model (LLM) to ingest these contracts due to the lengthy unstructured streams of text, limited LLM context windows, and complex legal jargon. To address these challenges, we introduce LAW (Legal Agentic Workflows for Custody and Fund Services Contracts). LAW features a modular design that responds to user queries by orchestrating a suite of domain-specific tools and text agents. Our experiments demonstrate that LAW, by integrating multiple specialized agents and tools, significantly outperforms the baseline. LAW excels particularly in complex tasks such as calculating a contract’s termination date, surpassing the baseline by 92.9 percentage points. Furthermore, LAW offers a cost-effective alternative to traditional fine-tuned legal LLMs by leveraging reusable, domain-specific tools.

pdf bib
UR2N: Unified Retriever and ReraNker
Riyaz Ahmad Bhat | Jaydeep Sen | Rudra Murthy | Vignesh P

The two-stage retrieval paradigm has gained popularity, where a neural model serves as a re-ranker atop a non-neural first-stage retriever. We argue that this approach, involving two disparate models without interaction, represents a suboptimal choice. To address this, we propose a unified encoder-decoder architecture with a novel training regimen that enables the encoder representation to be used for retrieval and the decoder for re-ranking within a single unified model, facilitating end-to-end retrieval. We incorporate XTR-style retrieval on top of the trained MonoT5 reranker, specifically concentrating on addressing practical constraints to create a lightweight model. Results on the BEIR benchmark demonstrate the effectiveness of our unified architecture, featuring a highly optimized index and parameters. It outperforms ColBERT and XTR, and even serves as a superior re-ranker compared to the MonoT5 reranker. The performance gains of our proposed system in reranking become increasingly evident as model capacity grows, particularly when compared to rerankers operating over traditional first-stage retrievers like BM25. This is encouraging, as it suggests that we can integrate more advanced retrievers to further enhance final reranking performance. In contrast, BM25’s static nature limits its potential for such improvements.

pdf bib
An Automatic Method to Estimate Correctness of RAG
Chi Zhang | Vivek V. Datla | Aditya Shrivastava | Alfy Samuel | Zhiqi Huang | Anoop Kumar | Daben Liu

In sectors where data quality is critical, such as finance and healthcare, it is crucial to have confidence not only in the outputs generated by retrieval-augmented generation (RAG) models but also in the process the model follows while arriving at the output. Existing methods, such as hallucination detection and input-output entailment measurements, fail to capture the model’s internal state during answer generation. This paper introduces a novel approach to predicting the correctness of the generated answer by modeling the model’s uncertainty under quantified perturbations of the input. Extensive experiments across multiple large language models (LLMs) demonstrate that our approach quantifies RAG robustness by aligning predictions with ground truth with an average mean squared error (MSE) of 0.002, while offering flexibility for diverse qualitative metrics.
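The core idea can be sketched as follows; `perturb` and `generate` are hypothetical stand-ins for the paper's quantified perturbation scheme and the deployed RAG pipeline:

```python
# Sketch: quantify uncertainty by perturbing the RAG input and measuring
# how much the answers move. Both helper functions below are hypothetical
# stand-ins, not the paper's actual implementation.
import itertools
from difflib import SequenceMatcher

def perturb(question: str, strength: float) -> str:
    # Stand-in for quantified perturbation: prepend qualifiers of growing
    # length as `strength` increases.
    qualifiers = ["", "Briefly: ", "In your own words: ", "Given all context: "]
    return qualifiers[min(int(strength * len(qualifiers)), len(qualifiers) - 1)] + question

def generate(question: str) -> str:
    # Replace with a call to the RAG pipeline under evaluation; this canned
    # answer only keeps the sketch runnable.
    return "The filing deadline is April 15."

def consistency_score(question: str, strengths=(0.0, 0.3, 0.6, 0.9)) -> float:
    answers = [generate(perturb(question, s)) for s in strengths]
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in itertools.combinations(answers, 2)]
    return sum(sims) / len(sims)  # low consistency -> high uncertainty

# A regressor fit on (consistency, correctness) pairs would then predict
# whether a new answer is likely correct.
print(consistency_score("When is the filing deadline?"))
```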

pdf bib
DaCoM: Strategies to Construct Domain-specific Low-resource Language Machine Translation Dataset
Junghoon Kang | Keunjoo Tak | Joungsu Choi | Myunghyun Kim | Junyoung Jang | Youjin Kang

Translation of low-resource languages in industrial domains is essential for improving market productivity and ensuring that foreign workers have better access to information. However, existing translators struggle with domain-specific terms, and there is a lack of expert annotators for dataset creation. In this work, we propose DaCoM, a methodology for collecting low-resource language pairs from industrial domains to address these challenges. DaCoM is a hybrid translation framework, consisting of a large language model and neural machine translation, that enables effective data collection. Evaluation verifies that existing models perform inadequately on DaCoM-created datasets, with differences of up to 53.7 BLEURT points depending on domain inclusion. DaCoM is expected to address the lack of datasets for domain-specific low-resource languages by being easily pluggable into future state-of-the-art models while maintaining an industrial domain-agnostic approach.

pdf bib
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Ahmed Masry | Megh Thakkar | Aayush Bajaj | Aaryaman Kartha | Enamul Hoque | Shafiq Joty

Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries. We release the code, model checkpoints, dataset, and demos at https://github.com/vis-nlp/ChartGemma.

pdf bib
LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks
Akshara Prabhakar | Yuanzhi Li | Karthik Narasimhan | Sham Kakade | Eran Malach | Samy Jelassi

Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition—testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA. This setup is favorable when it is difficult to obtain training data for the target task and when the task can be decomposed into multiple skills. First, we identify practically occurring use-cases that can be studied under the realm of skill composition, e.g., solving hard math-word problems with code, or creating a bot to answer questions on proprietary manuals or about domain-specialized corpora. Our main contribution is to show that concatenation of LoRAs (CAT), which optimally weights LoRAs that were individually trained on different skills, outperforms existing model- and data-merging techniques; for instance, on math-word problems, CAT beats these methods by an average of 43% and 12%, respectively. Thus, this paper advocates model merging as an efficient way to solve compositional tasks and underscores CAT as a simple, compute-friendly and effective procedure.
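CAT's weighted concatenation has a simple algebraic reading: stacking the scaled low-rank factors reproduces the weighted sum of the individual LoRA updates. A minimal sketch, with the per-skill weights fixed by hand (the paper optimizes them):

```python
# Sketch of LoRA concatenation (CAT): merging k LoRAs by stacking their
# low-rank factors so that B_cat @ A_cat == sum_i w_i * B_i @ A_i.
import torch

d_out, d_in, r = 8, 6, 2
torch.manual_seed(0)
loras = [(torch.randn(d_out, r), torch.randn(r, d_in)) for _ in range(2)]  # (B_i, A_i)
w = [0.7, 0.3]  # per-skill weights; optimized in the paper, fixed here

B_cat = torch.cat([wi * Bi for wi, (Bi, _) in zip(w, loras)], dim=1)  # d_out x (k*r)
A_cat = torch.cat([Ai for (_, Ai) in loras], dim=0)                   # (k*r) x d_in

delta_cat = B_cat @ A_cat
delta_sum = sum(wi * Bi @ Ai for wi, (Bi, Ai) in zip(w, loras))
assert torch.allclose(delta_cat, delta_sum)  # concatenation == weighted sum
```

The merged adapter thus remains low-rank (rank at most k·r), so it can be served like any single LoRA.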

pdf bib
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura | Mayank Mishra | Simone Tedeschi | Yekun Chai | Jason T. Stillerman | Felix Friedrich | Prateek Yadav | Tanmay Laud | Vu Minh Chien | Terry Yue Zhuo | Diganta Misra | Ben Bogin | Xuan-Son Vu | Marzena Karpinska | Arnav Varma Dantuluri | Wojciech Kusa | Tommaso Furlanello | Rio Yokota | Niklas Muennighoff | Suhas Pai | Tosin Adewumi | Veronika Laippala | Xiaozhe Yao | Adalberto Barbosa Junior | Aleksandr Drozd | Jordan Clive | Kshitij Gupta | Liangyu Chen | Qi Sun | Ken Tsui | Nour Moustafa-Fahmy | Nicolo Monti | Tai Dang | Ziyang Luo | Tien-Tung Bui | Roberto Navigli | Virendra Mehta | Matthew Blumberg | Victor May | Hiep Nguyen | Sampo Pyysalo

Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

pdf bib
UCTG: A Unified Controllable Text Generation Framework for Query Auto-Completion
Zhipeng Li | Shuang Zheng | Jiaping Xiao | Xianneng Li | Lei Wang

In the field of natural language generation (NLG), controllable text generation (CTG) is critical, particularly in query auto-completion (QAC), where the need for personalization and diversity is paramount. However, adapting to various control objectives and constraints is inherently challenging, and existing CTG approaches have consequently met with mixed success. This paper presents UCTG, a unified controllable text generation framework, which introduces a novel prompt learning method for CTG. Specifically, this framework seamlessly integrates a control module, a prompt module, and a generation module. The control module leverages a fine-tuned model to distill user preference features and behavioral patterns from historical data, incorporating human feedback into the model’s loss functions. These features are then transformed by the prompt module into vectors that guide the generation module. As such, text generation can be flexibly controlled without modifying the task settings. By employing this unified approach, UCTG significantly improves query accuracy and coherence in tasks with different objectives and constraints, as validated by extensive experiments on the Meituan and AOL real-world datasets. UCTG not only improves text generation control in QAC but also provides a new framework for flexible NLG applications.

pdf bib
Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
Aaron Zheng | Mansi Rana | Andreas Stolcke

With the recent proliferation of large language models (LLMs), enterprises have been able to rapidly develop proof-of-concepts and prototypes. As a result, there is a growing need to implement robust guardrails that monitor, quantify, and control an LLM’s behavior, ensuring that its use is reliable, safe, accurate, and aligned with users’ expectations. Previous approaches for filtering out inappropriate user prompts or system outputs, such as LlamaGuard and OpenAI’s MOD API, have achieved significant success by fine-tuning existing LLMs. However, using fine-tuned LLMs as guardrails introduces increased latency and higher maintenance costs, which may not be practical or scalable for cost-efficient deployments. We take a different approach, focusing on fine-tuning a lightweight architecture: Sentence-BERT. This method reduces the model size from LlamaGuard’s 7 billion parameters to approximately 67 million, while maintaining comparable performance on the AEGIS safety benchmark.
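A minimal sketch of the architecture's shape, pairing an off-the-shelf Sentence-BERT encoder with a linear head on toy data; the paper fine-tunes the encoder itself and evaluates on the AEGIS benchmark:

```python
# Sketch of a lightweight guardrail: Sentence-BERT embeddings + linear head.
# The two toy examples and their labels are illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small public stand-in encoder

texts = ["How do I bake sourdough bread?", "Explain how to hotwire a car."]
labels = [0, 1]  # 0 = safe, 1 = unsafe (toy labels)

clf = LogisticRegression().fit(encoder.encode(texts), labels)

def is_unsafe(prompt: str) -> bool:
    return bool(clf.predict(encoder.encode([prompt]))[0])

print(is_unsafe("Share a recipe for banana bread."))
```

Because the encoder runs in tens of milliseconds on CPU, this design trades the capacity of an LLM guardrail for latency and maintenance cost.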

pdf bib
Zero-shot Slot Filling in the Age of LLMs for Dialogue Systems
Mansi Rana | Kadri Hacioglu | Sindhuja Gopalan | Maragathamani Boothalingam

Zero-shot slot filling is a well-established subtask of Natural Language Understanding (NLU). However, most existing methods primarily focus on single-turn text data, overlooking the unique complexities of conversational dialogue. Conversational data is highly dynamic, often involving abrupt topic shifts, interruptions, and implicit references that make it difficult to directly apply zero-shot slot filling techniques, even with the remarkable capabilities of large language models (LLMs). This paper addresses these challenges by proposing strategies for automatic data annotation with slot induction and black-box knowledge distillation (KD) from a teacher LLM to a smaller model, outperforming vanilla LLMs on internal datasets with a 26% absolute increase in F1 score. Additionally, we introduce an efficient system architecture for call center product settings that surpasses off-the-shelf extractive models by 34% in relative F1 score, enabling near real-time inference on dialogue streams with higher accuracy while preserving low latency.

pdf bib
LionGuard: A Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo | Shaun Khoo

As large language models (LLMs) become increasingly prevalent in a wide variety of applications, concerns about the safety of their outputs have become more significant. Most efforts at safety-tuning or moderation today take on a predominantly Western-centric view of safety, especially for toxic, hateful, or violent speech. In this paper, we describe LionGuard, a Singapore-contextualized moderation classifier that can serve as guardrails against unsafe LLM usage. When assessed on Singlish data, LionGuard outperforms existing widely-used moderation APIs, which are not fine-tuned for the Singapore context, by at least 14% (binary) and up to 51% (multi-label). Our work highlights the benefits of localization for moderation classifiers and presents a practical and scalable approach for low-resource languages, particularly English-based creoles.

pdf bib
REVerSum: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives
Sayantan Adak | Pauras Mangesh Meher | Paramita Das | Animesh Mukherjee

Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on less-known entities often lags behind that of the well-known ones. This study proposes a novel approach to enhancing Wikipedia’s B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique – REVerSum – we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVerSum-generated content outperforms the best-performing baseline by 17% in terms of integrability into the original Wikipedia article and by 28.5% in terms of informativeness.

pdf bib
RE-FIN: Retrieval-based Enrichment for Financial data
Lorenzo Malandri | Fabio Mercorio | Mario Mezzanzanica | Filippo Pallucchini

Enriching sentences with knowledge from qualitative sources benefits various NLP tasks and enhances the use of labeled data in model training. This is crucial for Financial Sentiment Analysis (FSA), where texts are often brief and contain implied information. We introduce RE-FIN (Retrieval-based Enrichment for FINancial data), an automated system designed to retrieve information from a knowledge base to enrich financial sentences, making them more knowledge-dense and explicit. RE-FIN generates propositions from the knowledge base and employs Retrieval-Augmented Generation (RAG) to augment the original text with relevant information. A large language model (LLM) rewrites the original sentence, incorporating this data. Since the LLM does not create new content, the risk of hallucinations is significantly reduced. The LLM generates multiple new sentences using different relevant information from the knowledge base; we developed an algorithm to select the one that best preserves the meaning of the original sentence while avoiding excessive syntactic similarity. Results show that enhanced sentences present lower perplexity than the original ones and improve performance on FSA.
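The selection step can be sketched as a trade-off between semantic closeness and surface overlap. TF-IDF cosine and token-level Jaccard below are illustrative stand-ins for the paper's actual scoring functions:

```python
# Sketch of candidate selection: keep the rewrite that stays close in
# meaning while avoiding excessive surface similarity to the original.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def select(original: str, candidates: list[str], alpha: float = 0.5) -> str:
    vec = TfidfVectorizer().fit([original] + candidates)
    o = vec.transform([original])
    def score(c: str) -> float:
        semantic = cosine_similarity(o, vec.transform([c]))[0, 0]
        return semantic - alpha * jaccard(original, c)  # penalize near-copies
    return max(candidates, key=score)

print(select(
    "Q3 revenue beat expectations.",
    ["Q3 revenue beat expectations again.",
     "Third-quarter sales exceeded analyst forecasts, lifting the outlook."],
))
```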

pdf bib
SCV: Light and Effective Multi-Vector Retrieval with Sequence Compressive Vectors
Cheoneum Park | Seohyeong Jeong | Minsang Kim | KyungTae Lim | Yong-Hun Lee

Recent advances in language models (LMs) have driven progress in information retrieval (IR), effectively extracting semantically relevant information. However, LMs face challenges in balancing computational costs with deeper query-document interactions. To tackle this, we present two mechanisms: 1) a light and effective multi-vector retrieval with sequence compression vectors, dubbed SCV, and 2) a coarse-to-fine vector search. The strengths of SCV stem from its application of span compressive vectors for scoring. By employing a non-linear operation to examine every token in the document, we abstract these into a span-level representation. These vectors effectively reduce the document’s dimensional representation, enabling the model to engage comprehensively with tokens across the entire collection of documents, rather than the subset retrieved by Approximate Nearest Neighbor search. Therefore, our framework performs a coarse single-vector search during the inference stage and conducts a fine-grained multi-vector search end-to-end. This approach effectively reduces the cost required for search. We empirically show that SCV achieves the fastest latency compared to other state-of-the-art models and obtains competitive performance on both in-domain and out-of-domain benchmark datasets.
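A shape-level sketch of the scoring pipeline: token vectors are compressed into span vectors (mean pooling stands in for the learned non-linear compressor), one pooled vector drives the coarse search, and MaxSim over span vectors rescores the survivors:

```python
# Shape-level sketch of span compression with coarse-to-fine search.
# Mean pooling and random vectors are stand-ins; the paper uses a learned
# non-linear span compressor over real encoder outputs.
import numpy as np

def span_compress(token_vecs: np.ndarray, span: int = 4) -> np.ndarray:
    n, d = token_vecs.shape
    pads = (-n) % span
    padded = np.vstack([token_vecs, np.zeros((pads, d))])
    return padded.reshape(-1, span, d).mean(axis=1)  # one vector per span

def coarse_score(q_vecs, doc_spans):
    return float(q_vecs.mean(0) @ doc_spans.mean(0))  # single-vector dot product

def fine_score(q_vecs, doc_spans):
    return float((q_vecs @ doc_spans.T).max(axis=1).sum())  # MaxSim over spans

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16))                                  # query token vectors
docs = [span_compress(rng.normal(size=(32, 16))) for _ in range(100)]
top = sorted(range(len(docs)), key=lambda i: coarse_score(q, docs[i]), reverse=True)[:10]
best = max(top, key=lambda i: fine_score(q, docs[i]))
print(best)
```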

pdf bib
Efficient Vocabulary Reduction for Small Language Models
Yuta Nozaki | Dai Nakashima | Ryo Sato | Naoki Asaba

The increasing size of large language models (LLMs) poses significant challenges due to their high computational costs and energy consumption, making their deployment in industrial settings difficult. Small language models (SLMs) have been introduced to mitigate these challenges by reducing model size while preserving performance. However, the embedding layer, which occupies a significant portion of the model, remains a bottleneck in model compression efforts. In this paper, we validate vocabulary reduction as a solution to compress the embedding layer and reduce model size without significant loss of performance. We conduct a series of experiments to investigate how vocabulary reduction affects GPU memory footprint, inference speed, and task performance. Our results show that while performance generally declines with vocabulary reduction, fine-tuning can recover much of the lost performance. Moreover, in some tasks, such as truthfulness and summarization, the vocabulary-reduced models outperform the baseline. Finally, we demonstrate that vocabulary reduction can be effectively applied in domain adaptation, particularly in the medical domain, and in multilingual adaptation, improving task efficiency and cross-lingual robustness.
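The mechanics are straightforward to sketch: keep the most frequent token ids and slice the embedding matrix accordingly. A real system must also remap the tokenizer and any tied LM head, which this sketch omits; the kept-id ranking below is a stand-in for corpus frequency statistics:

```python
# Sketch of vocabulary reduction on the embedding layer only.
import torch

vocab_size, hidden = 50_000, 768
embedding = torch.nn.Embedding(vocab_size, hidden)

# Stand-in for frequency statistics: assume the first 20k ids are the most
# frequent tokens in the target corpus.
keep = torch.arange(20_000)
reduced = torch.nn.Embedding(len(keep), hidden)
reduced.weight.data.copy_(embedding.weight.data[keep])

saved = (vocab_size - len(keep)) * hidden * 4 / 2**20  # fp32 bytes -> MiB
print(f"Embedding parameters saved: ~{saved:.0f} MiB")  # ~88 MiB here
```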

pdf bib
Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting
Zeyuan Chen | Haiyan Wu | Kaixin Wu | Wei Chen | Mingjie Zhong | Jia Xu | Zhongyi Liu | Wei Zhang

This paper studies the relevance modeling problem by integrating world knowledge stored in the parameters of LLMs with specialized domain knowledge represented by user behavior data to achieve promising performance. We propose the novel framework ProRBP, which innovatively develops a user-driven behavior neighbor retrieval module to learn domain-specific knowledge in a timely manner and introduces a progressive prompting and aggregation module that accounts for diverse aspects of relevance and prediction stability. We explore an industrial implementation to deploy LLMs to handle the full-scale search traffic of Alipay with acceptable cost and latency. Comprehensive experiments on real-world industry data and online A/B testing validate the superiority of our proposal and the effectiveness of its main modules.

pdf bib
LLM ContextBridge: A Hybrid Approach for Intent and Dialogue Understanding in IVSR
Changwoo Chun | Daniel Rim | Juhee Park

In-vehicle speech recognition (IVSR) systems are crucial components of modern automotive interfaces, enabling hands-free control and enhancing user safety. However, traditional IVSR systems often struggle with interpreting user intent accurately due to limitations in contextual understanding and ambiguity resolution, leading to user frustration. This paper introduces LLM ContextBridge, a novel hybrid architecture that integrates Pretrained Language Model-based intent classification with Large Language Models to enhance both command recognition and dialogue management. LLM ContextBridge serves as a seamless bridge between traditional natural language understanding techniques and LLMs, combining the precise intent recognition of conventional NLU with the contextual handling and ambiguity resolution capabilities of LLMs. This approach significantly improves recognition accuracy and user experience, particularly in complex, multi-turn dialogues. Experimental results show notable improvements in task success rates and user satisfaction, demonstrating that LLM ContextBridge can make IVSR systems more intuitive, responsive, and context-aware.

pdf bib
Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders
Saeed Abbasi | Aijun An | Heidar Davoudi | Ron Di Carlantonio | Gary Farmaner

We introduce a novel Transformer-based method for document segmentation, tailored for practical, real-world applications. This method utilizes overlapping text sequences with a unique position-aware weighting mechanism to enhance segmentation accuracy. Through comprehensive experiments on both public and proprietary datasets, we demonstrate significant improvements, establishing new state-of-the-art standards by achieving up to a 10% increase in segmentation F1 score compared to existing methods. Additionally, we explore the application of our segmentation method in downstream retrieval-augmented question answering tasks, where it improves the quality of generated responses by 5% while achieving up to four times greater efficiency. These results underscore our model’s potential as a robust and scalable solution for real-world text segmentation challenges.
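The weighting idea can be sketched as follows; the triangular profile, which down-weights predictions near window edges, is an illustrative assumption rather than the paper's exact scheme:

```python
# Sketch of position-aware merging of overlapping window predictions:
# each sentence's boundary score is a weighted average over all windows
# containing it, trusting window-center predictions more than edge ones.
import numpy as np

def window_weights(width: int) -> np.ndarray:
    center = (width - 1) / 2
    return 1.0 - np.abs(np.arange(width) - center) / (center + 1)  # triangular

def merge_scores(n_sents: int, width: int, stride: int, score_window) -> np.ndarray:
    acc = np.zeros(n_sents)
    norm = np.zeros(n_sents)
    for start in range(0, max(n_sents - width + 1, 1), stride):
        idx = np.arange(start, min(start + width, n_sents))
        w = window_weights(len(idx))
        acc[idx] += w * score_window(idx)  # per-sentence boundary scores
        norm[idx] += w
    return acc / np.maximum(norm, 1e-9)

# `score_window` would run the Transformer encoder on the window's
# sentences and return one boundary probability per sentence; a random
# stand-in keeps the sketch runnable.
fake = lambda idx: np.random.default_rng(0).random(len(idx))
print(merge_scores(n_sents=12, width=6, stride=3, score_window=fake))
```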

pdf bib
RecStream: Graph-aware Stream Management for Concurrent Recommendation Model Online Serving
Shuxi Guo | Qi Qi | Haifeng Sun | Jianxin Liao | Jingyu Wang

Recommendation Models (RMs) are crucial for predicting user preferences and enhancing personalized experiences on large-scale platforms. As the application of recommendation models grows, optimizing their online serving performance has become a significant challenge. However, current serving systems perform poorly under highly concurrent scenarios. To address this, we introduce RecStream, a system designed to optimize stream configurations based on model characteristics for handling high concurrency requests. We employ a hybrid Graph Neural Network architecture to determine the best configurations for various RMs. Experimental results demonstrate that RecStream achieves significant performance improvements, reducing latency by up to 74%.

pdf bib
Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models
Wenting Tan | Dongxiao Chen | Jieting Xue | Zihao Wang | Taijie Chen

Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks. Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities. However, these approaches overlook the prior knowledge of specific concepts, theorems, and tricks that is crucial to tackling most arithmetic reasoning problems successfully. To address this issue, we propose a novel and effective Teaching-Inspired Integrated Prompting Framework, which emulates the instructional process of a teacher guiding students. This method equips LLMs with essential concepts, relevant theorems, and similar problems with analogous solution approaches, facilitating the enhancement of reasoning abilities. Additionally, we introduce two new Chinese datasets, MathMC and MathToF, both with detailed explanations and answers. Experiments conducted on nine benchmarks demonstrate that our approach improves the reasoning accuracy of LLMs. With GPT-4 and our framework, we achieve new state-of-the-art performance on four math benchmarks (AddSub, SVAMP, Math23K and AQuA) with accuracies of 98.2% (+3.3%), 93.9% (+0.2%), 94.3% (+7.2%) and 81.1% (+1.2%).
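The prompting recipe amounts to assembling background knowledge and a similar worked problem ahead of the target question. A sketch with a hypothetical template (retrieving the right concepts and similar problems is the paper's actual contribution):

```python
# Sketch of a teaching-inspired prompt assembly. The template and the toy
# work-rate example are illustrative assumptions, not the paper's exact format.
def build_prompt(question: str, concepts: list[str], similar: tuple[str, str]) -> str:
    sim_q, sim_sol = similar
    return (
        "Background knowledge:\n- " + "\n- ".join(concepts) + "\n\n"
        f"Similar problem:\n{sim_q}\nSolution approach:\n{sim_sol}\n\n"
        f"Now solve step by step:\n{question}"
    )

prompt = build_prompt(
    "A tank fills in 6 h with pipe A and 3 h with pipe B. How long with both?",
    concepts=["Combined work rate equals the sum of the individual rates."],
    similar=("Two painters finish a wall in 4 h and 12 h alone; together?",
             "Add rates: 1/4 + 1/12 = 1/3, so 3 hours."),
)
print(prompt)  # sent to the LLM in place of the bare question
```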