pdf
bib
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Weizhu Chen
|
Yi Yang
|
Mohammad Kachuee
|
Xue-Yong Fu
pdf
bib
abs
Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard
Chanjun Park
|
Hyeonwoo Kim
This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which has relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.
pdf
bib
abs
RTSM: Knowledge Distillation with Diverse Signals for Efficient Real-Time Semantic Matching in E-Commerce
Sanjay Agrawal
|
Vivek Sembium
Semantic matching plays a pivotal role in e-commerce by facilitating better product discovery and driving sales within online stores. Transformer models have proven exceptionally effective in mapping queries to an embedding space, positioning semantically related entities (queries or products) in close proximity. Despite their effectiveness, the high computational demands of large transformer models pose challenges for their deployment in real-time scenarios. This paper presents RTSM, an advanced knowledge distillation framework designed for Real-Time Semantic Matching. Our approach develops accurate, low-latency student models by leveraging both soft labels from a teacher model and ground truth generated from pairwise query-product and query-query signals. These signals are sourced from direct audits, synthetic examples created by LLMs, user interaction data, and taxonomy-based datasets, with custom loss functions enhancing learning efficiency. Experimental evaluations on internal and external e-commerce datasets demonstrate a 2-2.5% increase in ROC-AUC compared to directly trained student models, outperforming both the teacher model and state-of-the-art knowledge distillation benchmarks.
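To make the dual-signal training idea in this abstract concrete, here is a minimal sketch of a distillation objective that blends a teacher's soft labels with ground-truth labels. The function name, `temperature`, and `alpha` are our assumptions for illustration; the paper's custom loss functions may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-label distillation term with a hard-label supervised term.

    student_logits, teacher_logits: (batch, num_classes) raw scores
    labels: (batch,) ground-truth class indices from pairwise signals
    alpha: weight on the distillation term (assumed hyperparameter)
    """
    # Soft targets from the teacher, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence scaled by T^2, the standard correction from Hinton et al.
    kd_term = F.kl_div(log_student, soft_targets,
                       reduction="batchmean") * temperature ** 2
    # Supervised term on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term

s, t = torch.randn(4, 2), torch.randn(4, 2)
print(distillation_loss(s, t, torch.tensor([0, 1, 1, 0])))
```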
pdf
bib
abs
WorkTeam: Constructing Workflows from Natural Language with Multi-Agents
Hanchao Liu
|
Rongjun Li
|
Weimin Xiong
|
Ziyu Zhou
|
Wei Peng
Workflows play a crucial role in enhancing enterprise efficiency by orchestrating complex processes with multiple tools or components. However, hand-crafted workflow construction requires expert knowledge, presenting significant technical barriers. Recent advancements in Large Language Models (LLMs) have improved the generation of workflows from natural language instructions (aka NL2Workflow), yet existing single LLM agent-based methods face performance degradation on complex tasks due to the need for specialized knowledge and the strain of task-switching. To tackle these challenges, we propose WorkTeam, a multi-agent NL2Workflow framework comprising a supervisor, orchestrator, and filler agent, each with distinct roles that collaboratively enhance the conversion process. As there are currently no publicly available NL2Workflow benchmarks, we also introduce the HW-NL2Workflow dataset, which includes 3,695 real-world business samples for training and evaluation. Experimental results show that our approach significantly increases the success rate of workflow construction, providing a novel and effective solution for enterprise NL2Workflow services.
pdf
bib
abs
How LLMs React to Industrial Spatio-Temporal Data? Assessing Hallucination with a Novel Traffic Incident Benchmark Dataset
Qiang Li
|
Mingkun Tan
|
Xun Zhao
|
Dan Zhang
|
Daoan Zhang
|
Shengzhao Lei
|
Anderson S. Chu
|
Lujun Li
|
Porawit Kamnoedboon
Large language models (LLMs) hold revolutionary potential to digitize and enhance the Health & Public Services (H&PS) industry. Despite their advanced linguistic abilities, concerns about accuracy, stability, and traceability still persist, especially in high-stakes areas such as transportation systems. Moreover, the predominance of English in LLM development raises questions about how these models perform in non-English contexts. This study, which originated from a real-world industrial GenAI application, introduces a novel cross-lingual benchmark dataset comprising 99,869 real traffic incident records from Vienna (2013-2023) to assess the robustness of state-of-the-art LLMs (≥ 9 models) in the spatial and temporal domains for traffic incident classification. We then explored three hypotheses (sentence indexing, date-to-text conversion, and German-to-English translation) and incorporated Retrieval Augmented Generation (RAG) to further examine LLM hallucinations in both the spatial and temporal domains. Our experiments reveal significant performance disparities in the spatio-temporal domain and demonstrate what types of hallucinations RAG can mitigate and how it achieves this. We also provide open access to our H&PS traffic incident dataset, with the project demo and code available at https://sites.google.com/view/llmhallucination/home
pdf
bib
abs
Text2Sql: Pure Fine-Tuning and Pure Knowledge Distillation
Gao yu Zhu
|
Wei Shao
|
Xichou Zhu
|
Lei Yu
|
Jiafeng Guo
|
Xueqi Cheng
Text2Sql is a task that converts natural language questions into SQL queries. In previous research on LLM fine-tuning, researchers typically input both the entire database schema and the natural language question into the model. This approach has two issues: 1) the model's context is limited when dealing with a large number of database tables; 2) the question is often related to only a few tables, leading to excessive irrelevant information that distracts the model. To address these issues, we employed a pure fine-tuning strategy to reduce redundancy. The model fine-tuned with pure prompts, using prompts that are only 53% of the baseline length, outperforms the baseline (fine-tuned with all tables in the prompt) by 8.2% and 8.6% in Test-suite accuracy (TS) and exact-set-match accuracy (EM), respectively, on the Spider dev set. Under the most refined Spider dev set of prompts, the model achieves TS and EM scores of 73.5% and 75.4%, respectively, approaching state-of-the-art (SOTA) levels. To leverage the capabilities of the model with pure prompts, we applied a pure knowledge distillation strategy to transfer its abilities. The distilled student model achieved a 1.9% improvement in TS, while the teacher model's prompt length was only 23% of that of the student model.
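As a rough illustration of the "pure prompt" idea described above, the sketch below builds a Text2Sql prompt that includes only the tables judged relevant to the question instead of the full schema. The function name, prompt format, and toy schema are our assumptions; the relevance decision would come from a separate schema-linking step.

```python
def build_pure_prompt(question: str, schema: dict[str, list[str]],
                      relevant_tables: list[str]) -> str:
    """Compose a Text2Sql prompt containing only relevant tables.

    `relevant_tables` would be produced by a schema-linking step; here it
    is simply given. All names are illustrative, not the paper's code.
    """
    lines = []
    for table in relevant_tables:
        cols = ", ".join(schema[table])
        lines.append(f"CREATE TABLE {table} ({cols});")
    return "\n".join(lines) + f"\n-- Question: {question}\n-- SQL:"

schema = {
    "singer": ["singer_id", "name", "country"],
    "concert": ["concert_id", "venue", "year"],
    "ticket": ["ticket_id", "concert_id", "price"],
}
print(build_pure_prompt("How many singers are from France?", schema, ["singer"]))
```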
pdf
bib
abs
MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering
Vinay Kumar Verma
|
Shreyas Sunil Kulkarni
|
Happy Mittal
|
Deepak Gupta
Question Answering (QA) and Visual Question Answering (VQA) are well-studied problems in the language and vision domain. One challenging scenario involves multiple sources of information, each of a different modality, where the answer to the question may exist in one or more sources. This scenario contains richer information but is highly complex to handle. In this work, we formulate a novel question-answer generation (QAG) framework in an environment containing multi-source, multimodal information. The answer may belong to any or all sources; therefore, selecting the most prominent answer source or an optimal combination of all sources for a given question is challenging. To address this issue, we propose a question-guided attention mechanism that learns attention across multiple sources and decodes this information for robust and unbiased answer generation. To learn attention within each source, we introduce an explicit alignment between questions and various information sources, which facilitates identifying the most pertinent parts of the source information relative to the question. Scalability in handling diverse questions poses a challenge. We address this by extending our model to a sparse mixture-of-experts (sparse-MoE) framework, enabling it to handle thousands of question types. Experiments on T5 and Flan-T5 using three datasets demonstrate the model’s efficacy, supported by ablation studies.
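The following is a simplified, single-vector sketch of question-guided attention over multiple sources, the mechanism this abstract describes at a high level. The dot-product formulation and names are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def question_guided_fusion(question_emb, source_embs):
    """Question-guided attention over multiple information sources.

    question_emb: (dim,) embedding of the question.
    source_embs: (num_sources, dim) one embedding per modality/source.
    Returns a fused context vector weighted by question-source affinity.
    """
    scores = source_embs @ question_emb / question_emb.size(0) ** 0.5
    weights = F.softmax(scores, dim=0)   # attention over the sources
    return weights @ source_embs         # (dim,) fused representation

q, srcs = torch.randn(32), torch.randn(3, 32)
print(question_guided_fusion(q, srcs).shape)
```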
pdf
bib
abs
Finding-Centric Structuring of Japanese Radiology Reports and Analysis of Performance Gaps for Multiple Facilities
Yuki Tagawa
|
Yohei Momoki
|
Norihisa Nakano
|
Ryota Ozaki
|
Motoki Taniguchi
|
Masatoshi Hori
|
Noriyuki Tomiyama
This study addresses two key challenges in structuring radiology reports: the lack of a practical structuring schema and datasets to evaluate model generalizability. To address these challenges, we propose a “Finding-Centric Structuring,” which organizes reports around individual findings, facilitating secondary use. We also construct JRadFCS, a large-scale dataset with annotated named entities (NEs) and relations, comprising 8,428 Japanese Computed Tomography (CT) reports from seven facilities, providing a comprehensive resource for evaluating model generalizability. Our experiments reveal performance gaps when applying models trained on single-facility reports to those from other facilities. We further analyze factors contributing to these gaps and demonstrate that augmenting the training set based on these performance-correlated factors can efficiently enhance model generalizability.
pdf
bib
abs
Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings
Xuanqing Liu
|
Luyang Kong
|
Wei Niu
|
Afshin Khashei
|
Belinda Zeng
|
Steve Johnson
|
Jon Jay
|
Davor Golac
|
Matt Pope
Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning. However, analyzing live dialogues in real-time necessitates low-latency processing systems, making it impractical to deploy models with billions of parameters due to latency constraints. As a result, practitioners often prefer smaller models with millions of parameters, trained on high-quality, human-annotated datasets. Yet, curating such datasets is both time-consuming and costly. Consequently, there is a growing need to combine the scalability of LLM-generated labels with the precision of human annotations, enabling fine-tuned smaller models to achieve both higher speed and accuracy comparable to larger models. In this paper, we introduce a simple yet effective framework to address this challenge. Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more. To mitigate the impact of labeling errors from LLMs – the primary source of inaccuracies in student models – we propose a noise-reduced preference learning loss. Experimental results demonstrate that our method significantly improves accuracy across utterance-level dialogue tasks, including sentiment detection (over 2%), dialogue act classification (over 1.5%), etc.
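As one way to picture the noise-reduced preference loss mentioned above, here is a pairwise logistic loss that down-weights pairs whose LLM labels are less trustworthy. The confidence weighting is our assumption about the general shape of such a loss, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def noise_aware_pairwise_loss(score_pos, score_neg, confidence):
    """Pairwise preference loss over intra-dialogue utterance pairs.

    score_pos / score_neg: model scores for the utterances the LLM labeled
    positive / negative within the same dialogue.
    confidence: per-pair weight in [0, 1] reflecting trust in the LLM label
    (e.g., agreement across prompts); the down-weighting is an assumption.
    """
    margin = score_pos - score_neg
    per_pair = F.softplus(-margin)        # equals -log sigmoid(margin)
    return (confidence * per_pair).mean()

pos, neg, conf = torch.randn(8), torch.randn(8), torch.rand(8)
print(noise_aware_pairwise_loss(pos, neg, conf))
```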
pdf
bib
abs
Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation
Yi-Chang Chen
|
Po-Chun Hsu
|
Chan-Jan Hsu
|
Da-shan Shiu
Large language models (LLMs) have significantly advanced autonomous agents, particularly in zero-shot tool usage, also known as function calling. This research delves into enhancing the function-calling capabilities of LLMs by exploring different approaches, including prompt formats for integrating function descriptions, blending function-calling and instruction-following data, introducing a novel Decision Token for conditional prompts, leveraging chain-of-thought reasoning, and overcoming multilingual challenges with a translation pipeline. Our key findings and contributions are as follows: (1) Instruction-following data improves both function-calling accuracy and relevance detection. (2) The use of the newly proposed Decision Token, combined with synthetic non-function-call data, enhances relevance detection. (3) A tailored translation pipeline effectively overcomes multilingual limitations, demonstrating significant improvements in Traditional Chinese. These insights highlight the potential for improved function-calling capabilities and multilingual applications in LLMs.
pdf
bib
abs
Exploring Straightforward Methods for Automatic Conversational Red-Teaming
George Kour
|
Naama Zwerdling
|
Marcel Zalmanovici
|
Ateret Anaby Tavor
|
Ora Nova Fandina
|
Eitan Farchi
Large language models (LLMs) are increasingly used in business dialogue systems, but they also pose security and ethical risks. Multi-turn conversations, in which context influences the model's behavior, can be exploited to generate undesired responses. In this paper, we investigate the use of off-the-shelf LLMs in conversational red-teaming settings, where an attacker LLM attempts to elicit undesired outputs from a target LLM. Our experiments address critical questions and offer valuable insights regarding the effectiveness of using LLMs as automated red-teamers, shedding light on key strategies and usage approaches that significantly impact their performance. Our findings demonstrate that off-the-shelf models can serve as effective red-teamers, capable of adapting their attack strategies based on prior attempts. Allowing these models to freely steer conversations and conceal their malicious intent further increases attack success. However, their effectiveness decreases as the alignment of the target model improves.
pdf
bib
abs
A Diverse and Effective Retrieval-Based Debt Collection System with Expert Knowledge
Jiaming Luo
|
Weiyi Luo
|
Guoqing Sun
|
Mengchen Zhu
|
Haifeng Tang
|
Kenny Q. Zhu
|
Mengyue Wu
Designing effective debt collection systems is crucial for improving operational efficiency and reducing costs in the financial industry. However, the challenges of maintaining script diversity, contextual relevance, and coherence make this task particularly difficult. This paper presents a debt collection system based on real debtor-collector data from a major commercial bank. We construct a script library from real-world debt collection conversations and propose a two-stage retrieval-based response system for contextual relevance. Experimental results show that our system improves script diversity, enhances response relevance, and achieves practical deployment efficiency through knowledge distillation. This work offers a scalable and automated solution, providing valuable insights for advancing debt collection practices in real-world applications.
pdf
bib
abs
Search Query Embeddings via User-behavior-driven Contrastive Learning
Sosuke Nishikawa
|
Jun Hirako
|
Nobuhiro Kaji
|
Koki Watanabe
|
Hiroki Asano
|
Souta Yamashiro
|
Shumpei Sano
Universal query embeddings that accurately capture the semantic meaning of search queries are crucial for supporting a range of query understanding (QU) tasks within enterprises. However, current embedding approaches often struggle to effectively represent queries due to the shortness of search queries and their tendency for surface-level variations. We propose a user-behavior-driven contrastive learning approach which directly aligns embeddings according to user intent. This approach uses intent-aligned query pairs as positive examples, derived from two types of real-world user interactions: (1) clickthrough data, in which queries leading to clicks on the same URLs are assumed to share the same intent, and (2) session data, in which queries within the same user session are considered to share intent. By incorporating these query pairs into a robust contrastive learning framework, we can construct query embedding models that align with user intent while minimizing reliance on surface-level lexical similarities. Evaluations on real-world QU tasks demonstrated that these models substantially outperformed state-of-the-art text embedding models such as mE5 and SimCSE. Our models have been deployed in our search engine to support QU technologies.
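To make the contrastive setup above concrete, here is a minimal in-batch InfoNCE sketch over intent-aligned query pairs, where other queries in the batch serve as negatives. The temperature value and function names are our assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(anchor_emb, positive_emb, temperature=0.05):
    """Contrastive loss over intent-aligned query pairs.

    anchor_emb / positive_emb: (batch, dim) embeddings of query pairs that
    share intent (e.g., clicked the same URL, or co-occurred in a session).
    Other queries in the batch act as in-batch negatives.
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature           # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)  # diagonal entries are positives

a, p = torch.randn(16, 128), torch.randn(16, 128)
print(in_batch_infonce(a, p))
```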
pdf
bib
abs
QSpell 250K: A Large-Scale, Practical Dataset for Chinese Search Query Spell Correction
Dezhi Ye
|
Haomei Jia
|
Junwei Hu
|
Tian Bowen
|
Jie Liu
|
Haijin Liang
|
Jin Ma
|
Wenmin Wang
Chinese Search Query Spell Correction is a task designed to autonomously identify and correct typographical errors in search engine queries. Despite the availability of comprehensive datasets like Microsoft Speller and Webis, their monolingual nature and limited scope pose significant challenges in evaluating modern pre-trained language models such as BERT and GPT. To address this, we introduce QSpell 250K, a large-scale benchmark specifically developed for Chinese Query Spelling Correction. QSpell 250K offers several advantages: 1) It contains over 250K samples, which is ten times more than previous datasets. 2) It covers a broad range of topics, from formal entities to everyday colloquialisms and idiomatic expressions. 3) It includes both Chinese and English, addressing the complexities of code-switching. Each query undergoes three rounds of high-fidelity annotation to ensure accuracy. Our extensive testing across three popular models demonstrates that QSpell 250K effectively evaluates the efficacy of representative spelling correctors. We believe that QSpell 250K will significantly advance spelling correction methodologies. The accompanying data and code will be made publicly available.
pdf
bib
abs
CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models
Yifan Zhang
|
Xue Yang
Automating planning with LLMs presents transformative opportunities for traditional industries, yet remains underexplored. In commercial construction, the complexity of automated scheduling often requires manual intervention to ensure precision. We propose CONSTRUCTA, a novel framework leveraging LLMs to optimize construction schedules in complex projects like semiconductor fabrication. CONSTRUCTA addresses key challenges by: (1) integrating construction-specific knowledge through static RAG; (2) employing context-sampling techniques inspired by architectural expertise to provide relevant input; and (3) deploying Construction DPO to align schedules with expert preferences using RLHF. Experiments on proprietary data demonstrate performance improvements of +42.3% in missing value prediction, +79.1% in dependency analysis, and +28.9% in automated planning compared to baseline methods, showcasing its potential to revolutionize construction workflows and inspire domain-specific LLM advancements.
pdf
bib
abs
Challenges and Remedies of Domain-Specific Classifiers as LLM Guardrails: Self-Harm as a Case Study
Bing Zhang
|
Guang-Jie Ren
Context: Despite the impressive capabilities of Large Language Models (LLMs), they pose significant risks in many domains and therefore require guardrails throughout the lifecycle. Problem: Many such guardrails are trained as classifiers with domain-specific human text datasets obtained from sources such as social media, and they achieve reasonable performance against closed-domain benchmarks. When deployed in the real world, however, the guardrails have to deal with machine text in an open domain, and their performance deteriorates drastically, rendering them almost unusable due to a high level of false refusal. Solution: In this paper, using a self-harm detector as an example, we demonstrate the specific challenges facing guardrail deployment due to the data drift between training and production environments. More specifically, we formed two hypotheses about the potential causes, i.e., closed vs. open domain and human vs. LLM-generated text, and conducted five experiments to explore various potential remedies, including their respective advantages and disadvantages. Evaluation: While focusing on one example, our experience and knowledge of LLM guardrails give us great confidence that our work contributes to a more thorough understanding of guardrail deployment and can be generalized as a methodology to build more robust domain-specific guardrails in real-world applications.
pdf
bib
abs
Mitigating Bias in Item Retrieval for Enhancing Exam Assembly in Vocational Education Services
Alonso Palomino
|
Andreas Fischer
|
David Buschhüter
|
Roland Roller
|
Niels Pinkwart
|
Benjamin Paassen
In education, high-quality exams must cover broad specifications across diverse difficulty levels during the assembly and calibration of test items to effectively measure examinees' competence. However, balancing the trade-off of selecting relevant test items while fulfilling exam specifications without bias is challenging, particularly when manual item selection and exam assembly rely on a pre-validated item base. To address this limitation, we propose a new mixed-integer programming re-ranking approach to improve relevance, while mitigating bias on an industry-grade exam assembly platform. We evaluate our approach by comparing it against nine bias mitigation re-ranking methods in 225 experiments on a real-world benchmark data set from vocational education services. Experimental results demonstrate a 17% relevance improvement with a 9% bias reduction when integrating sequential optimization techniques with improved contextual relevance augmentation and scoring using a large language model. Our approach bridges information retrieval and exam assembly, enhancing the human-in-the-loop exam assembly process while promoting unbiased exam design.
pdf
bib
abs
Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance
Somnath Banerjee
|
Avik Halder
|
Rajarshi Mandal
|
Sayan Layek
|
Ian Soboroff
|
Rima Hazra
|
Animesh Mukherjee
Pretrained language models (PLMs) have revolutionized NLP but amplify linguistic inequities in multilingual applications. While prior studies focused on transformer architectures such as BERT, we evaluate large language models (LLMs) including Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama. Through rigorous testing across eight languages spanning high-resource (English, German, French, Italian, Spanish) and low-resource (Hindi, Tamil, Kannada) settings, we reveal systemic failures in preserving multilingual reliability and adaptability. Using paradigms like "each language for itself" (ELFI) and "each language for others" (ELFO), we highlight the inability of current LLMs to bridge linguistic divides. Even model merging fails to mitigate these gaps, exposing fundamental limitations. These findings emphasize the critical need for reimagining AI architectures to deliver true linguistic inclusivity and equitable performance across diverse languages.
pdf
bib
abs
Towards Reliable and Practical Phishing Detection
Hyowon Cho
|
Minjoon Seo
As the prevalence of phishing attacks continues to rise, there is an increasing demand for more robust detection technologies. With recent advances in AI, we discuss how to construct a reliable and practical phishing detection system using language models. For this system, we introduce the first large-scale Korean dataset for phishing detection, encompassing six types of phishing attacks. We consider multiple factors for building a real-time detection system for edge devices, such as model size, Speech-To-Text quality, split length, training technique, and multi-task learning. We evaluate the model's ability in two respects: in-domain detection performance and unseen attack detection performance, which is referred to as zero-day performance. Additionally, we demonstrate the importance of accurate comparison groups and evaluation datasets, showing that voice phishing detection performs reasonably well while smishing detection remains challenging. Both the dataset and the trained model will be available upon request.
pdf
bib
abs
Zero-Shot ATC Coding with Large Language Models for Clinical Assessments
Zijian Chen
|
John-Michael Gamble
|
Micaela Jantzi
|
John P. Hirdes
|
Jimmy Lin
Manual assignment of Anatomical Therapeutic Chemical (ATC) codes to prescription records is a significant bottleneck in healthcare research and operations at Ontario Health and InterRAI Canada, requiring extensive expert time and effort. To automate this process while maintaining data privacy, we develop a practical approach using locally deployable large language models (LLMs). Inspired by recent advances in automatic International Classification of Diseases (ICD) coding, our method frames ATC coding as a hierarchical information extraction task, guiding LLMs through the ATC ontology level by level. We evaluate our approach using GPT-4o as an accuracy ceiling and focus development on open-source Llama models suitable for privacy-sensitive deployment. Testing across Health Canada drug product data, the RABBITS benchmark, and real clinical notes from Ontario Health, our method achieves 78% exact match accuracy with GPT-4o and 60% with Llama 3.1 70B. We investigate knowledge grounding through drug definitions, finding modest improvements in accuracy. Further, we show that fine-tuned Llama 3.1 8B matches zero-shot Llama 3.1 70B accuracy, suggesting that effective ATC coding is feasible with smaller models. Our results demonstrate the feasibility of automatic ATC coding in privacy-sensitive healthcare environments, providing a foundation for future deployments.
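The level-by-level traversal described above can be pictured with the following sketch, which descends the ATC ontology one level at a time and asks an LLM to pick a child code at each step. The `ask_llm` callable, prompt wording, and toy ontology are our assumptions, not the paper's implementation.

```python
def assign_atc_code(record: str, ontology: dict, ask_llm) -> str:
    """Frame ATC coding as hierarchical extraction: descend the ontology
    level by level, letting the LLM pick one child code at each step.
    `ontology` maps a code prefix to {child_code: description}; `ask_llm`
    stands in for any locally deployed model. Illustrative only.
    """
    code = ""                                   # start at the ATC root
    while code in ontology:                     # stop once we reach a leaf
        options = "\n".join(f"{c}: {d}" for c, d in ontology[code].items())
        prompt = (f"Prescription record: {record}\n"
                  f"Choose the single best ATC code from:\n{options}\n"
                  f"Answer with the code only.")
        code = ask_llm(prompt).strip()
    return code

# Toy run with a stub "LLM" that always picks the first listed option.
ontology = {"": {"N": "Nervous system"}, "N": {"N02": "Analgesics"}}
first_option = lambda p: p.split("from:\n")[1].split(":")[0]
print(assign_atc_code("ibuprofen 200 mg tablets", ontology, first_option))
```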
pdf
bib
abs
Navigating the Path of Writing: Outline-guided Text Generation with Large Language Models
Yukyung Lee
|
Soonwon Ka
|
Bokyung Son
|
Pilsung Kang
|
Jaewook Kang
Large Language Models (LLMs) have impacted the writing process, enhancing productivity by collaborating with humans in content creation platforms. However, generating high-quality, user-aligned text to satisfy real-world content creation needs remains challenging. We propose WritingPath, a framework that uses explicit outlines to guide LLMs in generating goal-oriented, high-quality text. Our approach draws inspiration from structured writing planning and reasoning paths, focusing on reflecting user intentions throughout the writing process. To validate our approach in real-world scenarios, we construct a diverse dataset from unstructured blog posts to benchmark writing performance and introduce a comprehensive evaluation framework assessing the quality of outlines and generated texts. Our evaluations with various LLMs demonstrate that the WritingPath approach significantly enhances text quality according to evaluations by both LLMs and professional writers.
pdf
bib
abs
TaeBench: Improving Quality of Toxic Adversarial Examples
Jennifer Zhu
|
Dmitriy Bespalov
|
Liwen You
|
Ninad Kulkarni
|
Yanjun Qi
Toxicity text detectors can be vulnerable to adversarial examples - small perturbations to input text that fool the systems into wrong detection. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. Successful TAE should fool a target toxicity model into making benign predictions, be grammatically reasonable, appear natural like human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples from a total of 940k raw TAE attack generations. We then utilize the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (of size 264k). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that adversarial training with TaeBench achieves significant improvements in the robustness of two toxicity detectors.
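The four quality requirements listed above translate directly into a filtering gate, sketched below with stand-in annotator callables. All names are assumptions; the paper's pipeline combines model-based annotation with human verification.

```python
def keep_tae(perturbed_text: str, target_model, checks: dict) -> bool:
    """Quality gate mirroring the four TAE requirements listed above.
    `target_model(text) -> label` is the toxicity detector under attack;
    `checks` bundles stand-in annotators. All names are assumptions.
    """
    return (target_model(perturbed_text) == "benign"    # fools the detector
            and checks["grammatical"](perturbed_text)   # well-formed text
            and checks["natural"](perturbed_text)       # reads human-like
            and checks["toxic"](perturbed_text))        # still semantically toxic

# Toy run with trivial stand-ins.
checks = {"grammatical": lambda t: True, "natural": lambda t: True,
          "toxic": lambda t: True}
print(keep_tae("some perturbed text", lambda t: "benign", checks))
```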
pdf
bib
abs
Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs
Hyeonwoo Kim
|
Dahyun Kim
|
Jihoo Kim
|
Sukyung Lee
|
Yungi Kim
|
Chanjun Park
The Open Ko-LLM Leaderboard has been instrumental in benchmarking Korean Large Language Models (LLMs), yet it has certain limitations. Notably, the disconnect between quantitative improvements on the overly academic leaderboard benchmarks and the qualitative impact of the models should be addressed. Furthermore, the benchmark suite is largely composed of translated versions of their English counterparts, which may not fully capture the intricacies of the Korean language. To address these issues, we propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Additionally, four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language. Through these refinements, Open Ko-LLM Leaderboard2 seeks to provide a more meaningful evaluation for advancing Korean LLMs.
pdf
bib
abs
CuriousLLM: Elevating Multi-Document Question Answering with LLM-Enhanced Knowledge Graph Reasoning
Zukang Yang
|
Zixuan Zhu
|
Jennifer Zhu
Large Language Models (LLMs) have achieved significant success in open-domain question answering. However, they continue to face challenges such as hallucinations and knowledge cutoffs. These issues can be mitigated through in-context learning by providing LLMs with relevant context before generating answers. Recent literature proposes Knowledge Graph Prompting (KGP) which integrates knowledge graphs with an LLM-based traversal agent to substantially enhance document retrieval quality. However, KGP requires costly fine-tuning with large datasets and remains prone to hallucination. In this paper, we propose CuriousLLM, an enhancement that integrates a curiosity-driven reasoning mechanism into an LLM agent. This mechanism enables the agent to generate relevant follow-up questions, thereby guiding the information retrieval process more efficiently. Central to our approach is the development of the new Follow-upQA dataset, which includes questions and supporting evidence as input, with follow-up questions serving as ground truths. These follow-up questions either inquire about what is still missing to fully answer the user's query or use special tokens to signify that the retrieved evidence is sufficient. Our experiments show that CuriousLLM significantly boosts LLM performance in multi-document question answering (MD-QA), circumventing the substantial computational costs and latency from the original KGP framework.
pdf
bib
abs
CharacterGPT: A Persona Reconstruction Framework for Role-Playing Agents
Jeiyoon Park
|
Chanjun Park
|
Heuiseok Lim
The recent introduction of the Assistants API highlights its potential for large language models (LLMs) in role-playing agents (RPA). However, maintaining consistent character personas remains a significant challenge due to variability in information extraction, which frequently omits critical elements such as backstory or interpersonal relationships. To address this limitation, we introduce CharacterGPT, a framework designed to dynamically reconstruct character personas through Character Persona Training (CPT). This approach incrementally updates personas by extracting traits from chapter-wise novel summaries, reflecting the progression of the narrative. Our framework is evaluated through Big Five personality evaluations and creative tasks, in which characters generate original narratives, demonstrating the efficacy of CharacterGPT in preserving persona consistency. The code and results are available at https://github.com/Jeiyoon/charactergpt
pdf
bib
abs
Efficient Continual Pre-training of LLMs for Low-resource Languages
Arijit Nag
|
Soumen Chakrabarti
|
Animesh Mukherjee
|
Niloy Ganguly
Open-source large language models (Os-LLMs) propel the democratization of natural language research by giving the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, Os-LLMs offer poorer performance on low-resource languages (LRLs) than high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus. We show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and extent of resource availability. For evaluation, we use IndicGenBench, a generation task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary size and offer insights across language families.
pdf
bib
abs
DSRAG: A Double-Stream Retrieval-Augmented Generation Framework for Countless Intent Detection
Pei Guo
|
Enjie Liu
|
Ruichao Zhong
|
Mochi Gao
|
Yunzhi Tan
|
Bo Hu
|
Zang Li
Current intent detection work experiments with only a small number of intent categories. However, in real-world scenarios of data analysis dialogue systems, intents are composed of combinations of numerous metrics and dimensions, resulting in countless intents and posing challenges for the language model. The retrieval-augmented generation (RAG) method efficiently retrieves key intents, but a single retrieval route sometimes fails to recall target intents and causes incorrect results. To alleviate these challenges, we introduce the DSRAG framework, which combines query-to-query (Q2Q) and query-to-metadata (Q2M) double-stream RAG approaches. Specifically, we build a repository of query statements for Q2Q using query templates with the key intents. When a user's query arrives, it is rapidly matched against repository statements; once a relevant query is retrieved, the result can be returned immediately. In contrast, Q2M retrieves the relevant intents from the metadata and utilizes large language models to choose the answer. Experimental results show that DSRAG achieves significant improvements compared with merely using prompt engineering and a single retrieval route.
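The double-stream routing described above might look roughly like the following sketch, where the Q2Q fast path is tried first and Q2M plus an LLM serves as the fallback. The helper signatures and the similarity threshold are assumptions, not the paper's interfaces.

```python
def dsrag_answer(query: str, q2q_index, q2m_retriever, llm,
                 threshold: float = 0.9) -> str:
    """Double-stream retrieval sketch: try the query-to-query (Q2Q) route
    first, and fall back to query-to-metadata (Q2M) plus an LLM otherwise.
    The helper objects and the threshold value are assumptions.
    """
    match, score = q2q_index.most_similar(query)  # fast path over templates
    if score >= threshold:
        return match.cached_answer                # known intent: return directly
    intents = q2m_retriever(query)                # slower path over metadata
    return llm(f"Question: {query}\nCandidate intents: {intents}\nAnswer:")
```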
pdf
bib
abs
Octopus: On-device language model for function calling of software APIs
Wei Chen
|
Zhiyuan Li
|
Mingyuan Ma
Large Language Models (LLMs) are pivotal for advanced text processing and generation. This study presents a framework to train a series of on-device LLMs optimized for invoking software APIs. Using a curated dataset of 30,000 API function calls from software documentation, we fine-tune LLMs with 2B, 3B, and 7B parameters to enhance their proficiency in API interactions. Our approach improves the understanding of API structures and syntax, leading to significantly better accuracy in API function calls. We also propose a conditional masking technique to enforce correct output formats, reducing errors while maintaining inference speed, specifically tailored for API tasks. The fine-tuned model, Octopus, outperforms GPT-4 in API calling tasks, showcasing advancements in automated software development and API integration. The model checkpoints are publicly available.
pdf
bib
abs
MoFE: Mixture of Frozen Experts Architecture
Jean Seo
|
Jaeyoon Kim
|
Hyopil Shin
We propose the Mixture of Frozen Experts (MoFE) architecture, which integrates Parameter-efficient Fine-tuning (PEFT) and the Mixture of Experts (MoE) architecture to enhance both training efficiency and model scalability. By freezing the Feed Forward Network (FFN) layers within the MoE framework, MoFE significantly reduces the number of trainable parameters, improving training efficiency while still allowing for effective knowledge transfer from the expert models. This facilitates the creation of models proficient in multiple domains. We conduct experiments to evaluate the trade-offs between performance and efficiency, compare MoFE with other PEFT methodologies, assess the impact of domain expertise in the constituent models, and determine the optimal training strategy. The results show that, although there may be some trade-offs in performance, the efficiency gains are substantial, making MoFE a reasonable solution for real-world, resource-constrained environments.
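The core move in MoFE, freezing the FFN expert blocks so only the remaining parameters train, can be sketched in a few lines of PyTorch. The `"ffn"` name filter and the tiny stand-in module are our assumptions; a real MoE implementation would use its own module names.

```python
import torch.nn as nn

def freeze_expert_ffns(model: nn.Module) -> None:
    """Freeze every feed-forward (expert) block so that only the remaining
    parameters (attention, router, embeddings, ...) receive gradients.
    Assumes expert FFN parameter names contain "ffn"; adjust the filter
    to the module naming of the MoE implementation actually in use.
    """
    for name, param in model.named_parameters():
        if "ffn" in name:
            param.requires_grad = False
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable}/{total}")

class TinyBlock(nn.Module):          # minimal stand-in transformer block
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)
        self.ffn_expert = nn.Linear(8, 8)

freeze_expert_ffns(TinyBlock())      # prints 72/144: the FFN half is frozen
```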
pdf
bib
abs
FinLLM-B: When Large Language Models Meet Financial Breakout Trading
Kang Zhang
|
Osamu Yoshie
|
Lichao Sun
|
Weiran Huang
Trading range breakout is a key method in the technical analysis of financial trading, widely employed by traders in financial markets such as stocks, futures, and foreign exchange. However, distinguishing between true and false breakouts and providing the correct rationale pose significant challenges to investors. Traditional quantitative methods require large amounts of data and cannot directly present the reasoning process to users, making them less than ideal in this field. Recently, large language models have achieved success in various downstream applications, but their effectiveness in the domain of financial breakout detection has been subpar, because breakout detection requires unique data and specific domain knowledge. To address these issues, we create the first financial breakout dataset and introduce FinLLM-B, the premier large language model for financial breakout detection, which enhances the effectiveness of breakout trading strategies. Furthermore, we develop a novel framework for large language models, namely the multi-stage structure, which effectively reduces mistakes in downstream applications. Experimental results indicate that, compared to GPT-3.5, FinLLM-B improves the average accuracy of answers and rationales by 49.97%, with the multi-stage structure contributing 9.72% to the improvement. Additionally, it outperforms ChatGPT-4 by 42.38%.
pdf
bib
abs
QueryShield: A Platform to Mitigate Enterprise Data Leakage in Queries to External LLMs
Nitin Ramrakhiyani
|
Delton Myalil
|
Sachin Pawar
|
Manoj Apte
|
Rajan M A
|
Divyesh Saglani
|
Imtiyazuddin Shaik
Unrestricted access to external Large Language Model (LLM) based services like ChatGPT and Gemini can lead to potential data leakages, especially for large enterprises providing products and services to clients that require legal confidentiality guarantees. However, a blanket restriction on such services is not ideal, as these LLMs boost employee productivity. Our goal is to build a solution that enables enterprise employees to query such external LLMs without leaking confidential internal and client information. In this paper, we propose QueryShield - a platform that enterprises can use to interact with external LLMs without leaking data through queries. It detects if a query leaks data and rephrases it to minimize data leakage while limiting the impact on its semantics. We construct a dataset of 1,500 queries and manually annotate them with sensitivity labels and low-sensitivity rephrased versions. We fine-tune a set of lightweight model candidates using this dataset and evaluate them using multiple metrics, including one we propose specifically for this problem.
pdf
bib
abs
SwissADT: An Audio Description Translation System for Swiss Languages
Lukas Fischer
|
Yingqiang Gao
|
Alexa Lintner
|
Annette Rios
|
Sarah Ebling
Audio description (AD) is a crucial accessibility service provided to blind persons and persons with visual impairment, designed to convey visual information in acoustic form. Despite recent advancements in multilingual machine translation research, the lack of well-crafted and time-synchronized AD data impedes the development of audio description translation (ADT) systems that address the needs of multilingual countries such as Switzerland. Furthermore, most ADT systems rely on text alone, and it is unclear whether incorporating visual information from video clips improves the quality of ADT outputs. In this work, we introduce SwissADT, an emerging ADT system for three main Swiss languages and English, designed for future use by our industry partners. By collecting well-crafted AD data augmented with video clips in German, French, Italian, and English, and leveraging the power of Large Language Models (LLMs), we aim to enhance information accessibility for diverse language populations in Switzerland by automatically translating AD scripts to the desired Swiss language. Our extensive experimental ADT results, composed of both automatic and human evaluations of ADT quality, demonstrate the promising capability of SwissADT for the ADT task. We believe that combining human expertise with the generation power of LLMs can further enhance the performance of ADT systems, ultimately benefiting a larger multilingual target population.
pdf
bib
abs
Chinese Morph Resolution in E-commerce Live Streaming Scenarios
Jiahao Zhu
|
Jipeng Qiang
|
Ran Bai
|
Chenyu Liu
|
Xiaoye Ouyang
E-commerce live streaming in China, particularly on platforms like Douyin, has become a major sales channel, but hosts often use morphs to evade scrutiny and engage in false advertising. This study introduces the Live Auditory Morph Resolution (LiveAMR) task to detect such violations. Unlike previous morph research focused on text-based evasion in social media and underground industries, LiveAMR targets pronunciation-based evasion in health and medical live streams. We constructed the first LiveAMR dataset with 86,790 samples and developed a method to transform the task into a text-to-text generation problem. By leveraging large language models (LLMs) to generate additional training data, we improved performance and demonstrated that morph resolution significantly enhances live streaming regulation.
pdf
bib
abs
MonoTODia: Translating Monologue Requests to Task-Oriented Dialogues
Sebastian Steindl
|
Ulrich Schäfer
|
Bernd Ludwig
Data scarcity is one of the main problems when it comes to real-world applications of transformer-based models. This is especially evident for task-oriented dialogue (TOD) systems, which require specialized datasets that are usually not readily available. This can hinder companies from adding TOD systems to their services. This study therefore investigates a novel approach to sourcing annotated dialogues from existing German monologue material. Focusing on a real-world example, we investigate whether these monologues can be transformed into dialogue formats suitable for training TOD systems. We demonstrate the approach with the concrete example of a company specializing in travel bookings via e-mail. We fine-tune state-of-the-art Large Language Models for the task of rewriting e-mails as dialogues and annotating them. To ensure the quality and validity of the generated data, we employ crowd workers to evaluate the dialogues across multiple criteria and to provide gold-standard annotations for the test dataset. We further evaluate the usefulness of the dialogues for training TOD systems. Our evaluation shows that the dialogues and annotations are of high quality and can serve as a valuable starting point for training TOD systems. Finally, we make the annotated dataset publicly available to foster future research.
pdf
bib
abs
MedEthicEval: Evaluating Large Language Models Based on Chinese Medical Ethics
Haoan Jin
|
Jiacheng Shi
|
Hanhui Xu
|
Kenny Q. Zhu
|
Mengyue Wu
Large language models (LLMs) demonstrate significant potential in advancing medical applications, yet their capabilities in addressing medical ethics challenges remain underexplored. This paper introduces MedEthicEval, a novel benchmark designed to systematically evaluate LLMs in the domain of medical ethics. Our framework encompasses two key components: knowledge, assessing the models’ grasp of medical ethics principles, and application, focusing on their ability to apply these principles across diverse scenarios. To support this benchmark, we consulted with medical ethics researchers and developed three datasets addressing distinct ethical challenges: blatant violations of medical ethics, priority dilemmas with clear inclinations, and equilibrium dilemmas without obvious resolutions. MedEthicEval serves as a critical tool for understanding LLMs’ ethical reasoning in healthcare, paving the way for their responsible and effective use in medical contexts.
pdf
bib
abs
Predicting ICU Length of Stay for Patients using Latent Categorization of Health Conditions
Tirthankar Dasgupta
|
Manjira Sinha
|
Sudeshna Jana
Predicting the duration of a patient’s stay in an Intensive Care Unit (ICU) is a critical challenge for healthcare administrators, as it impacts resource allocation, staffing, and patient care strategies. Traditional approaches often rely on structured clinical data, but recent developments in language models offer significant potential to utilize unstructured text data such as nursing notes, discharge summaries, and clinical reports for ICU length-of-stay (LoS) predictions. In this study, we introduce a method for analyzing nursing notes to predict the remaining ICU stay duration of patients. Our approach leverages a joint model of latent note categorization, which identifies key health-related patterns and disease severity factors from unstructured text data. This latent categorization enables the model to derive high-level insights that influence patient care planning. We evaluate our model on the widely used MIMIC-III dataset, and our preliminary findings show that it significantly outperforms existing baselines, suggesting promising industrial applications for resource optimization and operational efficiency in healthcare settings.
pdf
bib
abs
RevieWeaver: Weaving Together Review Insights by Leveraging LLMs and Semantic Similarity
Jiban Adhikary
|
Mohammad Alqudah
|
Arun Palghat Udayashankar
With the rise of online retail, customer reviews have become a critical factor in shaping purchasing decisions. The sheer volume of customer reviews being generated continuously presents a challenge for consumers who must sift through an overwhelming amount of feedback. To address this issue, we introduce RevieWeaver, a novel framework that extracts key product features and provides concise review summaries. Our innovative approach not only scales efficiently to 30 million reviews but also ensures reproducibility and controllability. Moreover, it delivers unbiased and reliable assessments of products that accurately reflect the input reviews.
pdf
bib
abs
MedCodER: A Generative AI Assistant for Medical Coding
Krishanu Das Baksi
|
Elijah Soba
|
John J Higgins
|
Ravi Saini
|
Jaden Wood
|
Jane Cook
|
Jack I Scott
|
Nirmala Pudota
|
Tim Weninger
|
Edward Bowen
|
Sanmitra Bhattacharya
Medical coding standardizes clinical data but is both time-consuming and error-prone. Traditional Natural Language Processing (NLP) methods struggle with automating coding due to the large label space, lengthy text inputs, and the absence of supporting evidence annotations that justify code selection. Recent advancements in Generative Artificial Intelligence (AI) offer promising solutions to these challenges. In this work, we introduce MedCodER, an emerging Generative AI framework for automatic medical coding that leverages extraction, retrieval, and re-ranking techniques as core components. MedCodER achieves a micro-F1 score of 0.62 on International Classification of Diseases (ICD) code prediction, significantly outperforming state-of-the-art methods. Additionally, we present a new dataset containing medical records annotated with disease diagnoses, ICD codes, and supporting evidence texts (https://doi.org/10.5281/zenodo.13308316). Ablation tests confirm that MedCodER’s performance depends on the integration of each of its aforementioned components, as performance declines when these components are evaluated in isolation.
pdf
bib
abs
Visual Zero-Shot E-Commerce Product Attribute Value Extraction
Jiaying Gong
|
Ming Cheng
|
Hongda Shen
|
Pierre-Yves Vandenbussche
|
Janet Jenq
|
Hoda Eldardiry
Existing zero-shot product attribute value (aspect) extraction approaches in e-Commerce industry rely on uni-modal or multi-modal models, where the sellers are asked to provide detailed textual inputs (product descriptions) for the products. However, manually providing (typing) the product descriptions is time-consuming and frustrating for the sellers. Thus, we propose a cross-modal zero-shot attribute value generation framework (ViOC-AG) based on CLIP, which only requires product images as the inputs. ViOC-AG follows a text-only training process, where a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During the zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected with the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM correct the decoded outputs for out-of-domain attribute values. Experiments show that ViOC-AG significantly outperforms other fine-tuned vision-language models for zero-shot attribute value extraction.
pdf
bib
abs
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
Grigor Nalbandyan
|
Rima Shahbazyan
|
Evelina Bakhturina
Typical evaluations of Large Language Models (LLMs) report a single metric per dataset, often representing the model’s best-case performance under carefully selected settings. Unfortunately, this approach overlooks model robustness and reliability in real-world applications. For instance, simple paraphrasing of prompts on the MMLU-Pro dataset causes accuracy fluctuations of up to 10%, while reordering answer choices in the AGIEval dataset results in accuracy differences of up to 6.1%. While some studies discuss issues with LLM robustness, there is no unified or centralized framework for evaluating the robustness of language models. To address this gap and consolidate existing research on model robustness, we present SCORE (Systematic COnsistency and Robustness Evaluation), a comprehensive framework for non-adversarial evaluation of LLMs. The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency. We will make the code publicly available to facilitate further development and research.
pdf
bib
abs
Evaluating Large Language Models with Enterprise Benchmarks
Bing Zhang
|
Mikio Takeuchi
|
Ryo Kawahara
|
Shubhi Asthana
|
Md. Maruf Hossain
|
Guang-Jie Ren
|
Kate Soule
|
Yifan Mai
|
Yada Zhu
The advancement of large language models (LLMs) has made the rigorous and systematic evaluation of complex tasks increasingly challenging, especially in enterprise applications. Therefore, LLMs need to be benchmarked with enterprise datasets for a variety of NLP tasks. This work explores benchmarking strategies focused on LLM evaluation, with a specific emphasis on both English and Japanese. The proposed evaluation framework encompasses 25 publicly available domain-specific English benchmarks from diverse enterprise domains such as financial services, legal, climate, and cyber security, as well as 2 public Japanese finance benchmarks. The diverse performance of 8 models across different enterprise tasks highlights the importance of selecting the right model based on the specific requirements of each task. Code and prompts are available on GitHub.
pdf
bib
abs
Can Post-Training Quantization Benefit from an Additional QLoRA Integration?
Xiliang Zhu
|
Elena Khasanova
|
Cheng Chen
Large language models (LLMs) have transformed natural language processing but pose significant challenges for real-world deployment. These models necessitate considerable computing resources, which can be costly and frequently unavailable. Model compression techniques such as quantization are often leveraged to alleviate resource demand, but they may have a negative impact on the generation quality. In this study, we explore the integration of 4-bit Post-training Quantization (PTQ) with QLoRA to address these issues. We demonstrate through extensive experiments that this integration outperforms standard PTQ, and in some cases even 16-bit full-parameter fine-tuning on LLMs, validated across proprietary and public datasets with different quantization algorithms. The results demonstrate the efficacy of PTQ-QLoRA integration, offering a viable solution for deploying powerful LLMs in resource-constrained environments without compromising on performance.
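The PTQ-plus-adapter pattern this abstract studies can be sketched with Hugging Face transformers and peft, as below. The model id and LoRA hyperparameters are placeholders, and the paper's exact quantization algorithms and training setup may differ from this minimal recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit post-training quantization config (NF4), as in the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters on top of the frozen 4-bit weights.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```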
pdf
bib
abs
From Generating Answers to Building Explanations: Integrating Multi-Round RAG and Causal Modeling for Scientific QA
Victor Barres
|
Clifton James McFate
|
Aditya Kalyanpur
|
Kailash Karthik Saravanakumar
|
Lori Moon
|
Natnael Seifu
|
Abraham Bautista-Castillo
Application of LLMs for complex causal question answering can be stymied by their opacity and propensity for hallucination. Although recent approaches such as Retrieval Augmented Generation and Chain of Thought prompting have improved reliability, we argue current approaches are insufficient and further fail to satisfy key criteria humans use to select and evaluate causal explanations. Inspired by findings from the social sciences, we present an implemented causal QA approach that combines iterative RAG with guidance from a formal model of causation. Our causal model is backed by the Cogent reasoning engine, allowing users to interactively perform counterfactual analysis and refine their answer. Our approach has been integrated into a deployed Collaborative Research Assistant (Cora) and we present a pilot evaluation in the life sciences domain.
pdf
bib
abs
TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Aman Goel
|
Xian Wu
|
Zhe Wang
|
Dmitriy Bespalov
|
Yanjun Qi
Jailbreaking large-language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves ≥ 95% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o & GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps in improving model defenses to prompt attacks.
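A generic mutation-based fuzzing loop of the kind this abstract builds on is sketched below; it is not TurboFuzzLLM's algorithm, and the `mutate`, `judge`, and `target_llm` callables are caller-supplied stand-ins.

```python
import random

def fuzz_templates(seed_templates, questions, target_llm, judge, mutate,
                   budget=1000, keep_rate=0.5):
    """Generic mutation-based fuzzing loop (not the paper's exact method).

    Templates contain a "{question}" placeholder; target_llm(prompt) -> str
    queries the model under test, judge(response) -> bool flags a harmful
    response, and mutate(template) -> str produces a template variant.
    """
    pool = list(seed_templates)
    effective = []
    for _ in range(budget):
        candidate = mutate(random.choice(pool))    # pick and mutate a template
        hits = sum(judge(target_llm(candidate.format(question=q)))
                   for q in questions)
        if hits / len(questions) >= keep_rate:     # keep templates that work
            pool.append(candidate)                 # feed back into the pool
            effective.append(candidate)
    return effective
```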
pdf
bib
abs
Does Self-Attention Need Separate Weights in Transformers?
Md Kowsher
|
Nusrat Jahan Prottasha
|
Chun-Nam Yu
|
Ozlem Garibay
|
Niloofar Yousefi
Self-attention has revolutionized natural language processing by capturing long-range dependencies and improving context understanding. However, it comes with high computational costs and struggles with sequential data’s inherent directionality. This paper investigates and presents a simplified approach called “shared weight self-attention,” where a single weight matrix is used for Keys, Queries, and Values instead of separate matrices for each. This approach cuts training parameters by more than half and significantly reduces training time. Our method not only improves efficiency but also achieves strong performance on tasks from the GLUE benchmark, even outperforming the standard BERT baseline in handling noisy and out-of-domain data. Experimental results show a 66.53% reduction in parameter size within the attention block and competitive accuracy improvements of 3.55% and 0.89% over symmetric and pairwise attention-based BERT models, respectively.
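The shared-weight idea above reduces to replacing the three Q/K/V projection matrices with one, as in this minimal single-head sketch; the paper's exact variant (multi-head layout, normalization, biases) may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedWeightSelfAttention(nn.Module):
    """Self-attention where one projection matrix is shared by the queries,
    keys, and values, instead of three separate matrices. Single-head
    sketch for illustration only.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.shared = nn.Linear(dim, dim, bias=False)  # replaces W_q, W_k, W_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.shared(x)                # one projection serves as Q, K, and V
        scores = h @ h.transpose(-2, -1) / (x.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ h

x = torch.randn(2, 16, 64)                # (batch, seq_len, dim)
print(SharedWeightSelfAttention(64)(x).shape)
```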
pdf
bib
abs
SuperRAG: Beyond RAG with Layout-Aware Graph Modeling
Chening Yang
|
Duy-Khanh Vu
|
Minh-Tien Nguyen
|
Xuan-Quang Nguyen
|
Linh Nguyen
|
Hung Le
This paper introduces layout-aware graph modeling for multimodal RAG. Different from traditional RAG methods that only deal with flat text chunks, the proposed method takes into account the relationship of multimodalities by using a graph structure. To do that, a graph modeling structure is defined based on document layout parsing. The structure of an input document is retained with the connection of text chunks, tables, and figures. This representation allows the method to handle complex questions that require information from multimodalities. To confirm the efficiency of the graph modeling, a flexible RAG pipeline is developed using robust components. Experimental results on four benchmark test sets confirm the contribution of the layout-aware modeling for performance improvement of the RAG pipeline.
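One way to picture the layout-aware graph above is a small networkx example with chunk, table, and figure nodes connected by layout relations; the node and edge labels are our assumptions, not SuperRAG's schema.

```python
import networkx as nx

# Layout-aware document graph: nodes are text chunks, tables, and figures;
# edges record section membership and reading order.
g = nx.DiGraph()
g.add_node("sec1", kind="text", content="2. Method ...")
g.add_node("tab1", kind="table", content="| model | F1 | ...")
g.add_node("fig1", kind="figure", content="Architecture diagram caption")

g.add_edge("sec1", "tab1", relation="contains")   # table belongs to the section
g.add_edge("sec1", "fig1", relation="contains")
g.add_edge("tab1", "fig1", relation="next")       # layout reading order

# At query time, retrieval can start from a matched chunk and expand along
# edges to pull in the connected table or figure for multimodal questions.
print(list(g.successors("sec1")))
```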
pdf
bib
abs
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Hitesh Laxmichand Patel
|
Amit Agarwal
|
Arion Das
|
Bhargava Kumar
|
Srikant Panda
|
Priyaranjan Pattnayak
|
Taki Hasan Rafi
|
Tejaswini Kumar
|
Dong-Kyu Chae
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
pdf
bib
abs
Natural Language Processing for Human Resources: A Survey
Naoki Otani
|
Nikita Bhutani
|
Estevam Hruschka
Advances in Natural Language Processing (NLP) have the potential to transform HR processes, from recruitment to employee management. While recent breakthroughs in NLP have generated significant interest in its industrial applications, a comprehensive overview of how NLP can be applied across HR activities is still lacking. This paper identifies opportunities for researchers and practitioners to harness NLP’s transformative potential in this domain. We analyze key fundamental tasks such as information extraction and text classification, and their roles in downstream applications like recommendation and language generation, while also discussing ethical concerns. Additionally, we identify gaps in current research and encourage future work to explore holistic approaches for achieving broader objectives in this field.
pdf
bib
abs
Implementing Retrieval Augmented Generation Technique on Unstructured and Structured Data Sources in a Call Center of a Large Financial Institution
Syed Shariyar Murtaza
|
Yifan Nie
|
Elias Avan
|
Utkarsh Soni
|
Wanyu Liao
|
Adam Carnegie
|
Cyril John Mathias
|
Junlin Jiang
|
Eugene Wen
The retrieval-augmented generation (RAG) technique enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG is further augmented by function calls to query databases. This paper presents an industrial case study that implements RAG in a large financial institution’s call center. The study showcases experiences and architecture for a scalable RAG deployment. It also introduces enhancements to RAG for retrieving facts from structured data sources using data embeddings, achieving low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open-source and closed-source models for answer generation in an industrial context.
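The split between unstructured retrieval and function calls over structured data can be sketched as below; the toy embedding, the cosine ranker, and the accounts schema are stand-ins for the institution's production components, not a description of them.

```python
import sqlite3

def embed(text: str) -> list[float]:
    # Toy embedding (character histogram); production would use a real encoder.
    v = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - 97] += 1.0
    return v

def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def retrieve_unstructured(question: str, docs: list[str], k: int = 2):
    # RAG path for unstructured sources: rank documents by embedding similarity.
    qv = embed(question)
    return sorted(docs, key=lambda d: cosine(qv, embed(d)), reverse=True)[:k]

def lookup_balance(conn: sqlite3.Connection, account_id: str):
    # Function-call path for structured sources: query the database directly.
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('A-1001', 250.0)")
print(retrieve_unstructured("how do I reset my card pin",
                            ["card pin reset steps", "mortgage rates"]))
print(lookup_balance(conn, "A-1001"))
```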
pdf
bib
abs
Granite Guardian: Comprehensive LLM Safeguarding
Inkit Padhi
|
Manish Nagireddy
|
Giandomenico Cornacchia
|
Subhajit Chaudhury
|
Tejaswini Pedapati
|
Pierre Dognin
|
Keerthiram Murugesan
|
Erik Miehling
|
Martín Santillán Cooper
|
Kieran Fraser
|
Giulio Zizzo
|
Muhammad Zaid Hameed
|
Mark Purcell
|
Michael Desmond
|
Qian Pan
|
Inge Vejsbjerg
|
Elizabeth M. Daly
|
Michael Hind
|
Werner Geyer
|
Ambrish Rawat
|
Kush R. Varshney
|
Prasanna Sattigeri
The deployment of language models in real-world applications exposes users to various risks, including hallucinations and harmful or unethical content. These challenges highlight the urgent need for robust safeguards to ensure safe and responsible AI. To address this, we introduce Granite Guardian, a suite of advanced models designed to detect and mitigate risks associated with prompts and responses, enabling seamless integration with any large language model (LLM). Unlike existing open-source solutions, our Granite Guardian models provide comprehensive coverage across a wide range of risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related issues such as context relevance, groundedness, and answer accuracy in retrieval-augmented generation (RAG) scenarios. Trained on a unique dataset combining diverse human annotations and synthetic data, Granite Guardian excels in identifying risks often overlooked by traditional detection systems, particularly jailbreak attempts and RAG-specific challenges. https://github.com/ibm-granite/granite-guardian
pdf
bib
abs
Breaking Down Power Barriers in On-Device Streaming ASR: Insights and Solutions
Yang Li
|
Yuan Shangguan
|
Yuhao Wang
|
Liangzhen Lai
|
Ernie Chang
|
Changsheng Zhao
|
Yangyang Shi
|
Vikas Chandra
Power consumption plays a crucial role in on-device streaming speech recognition, significantly influencing the user experience. This study explores how the configuration of weight parameters in speech recognition models affects their overall energy efficiency. We found that the influence of these parameters on power consumption varies depending on factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. Our approach, which adjusts model components based on their specific energy sensitivities, achieves up to 47% lower energy usage while preserving comparable model accuracy and improving real-time performance compared to leading methods.
pdf
bib
abs
Break-Ideate-Generate (BrIdGe): Moving beyond Translations for Localization using LLMs
Swapnil Gupta
|
Lucas Pereira Carlini
|
Prateek Sircar
|
Deepak Gupta
Language localization is the adaptation of written content to different linguistic and cultural contexts. The ability to localize written content is crucial for global businesses to provide a consistent and reliable customer experience across diverse markets. Traditional methods have approached localization as an application of machine translation (MT), but localization requires more than linguistic conversion: content needs to align with the target audience’s cultural norms, linguistic nuances, and technical requirements. This difference is especially prominent for long-form text, where multiple facts are expressed through creative choices of language. We propose a novel prompting approach for Large Language Models (LLMs), called Break-Ideate-Generate (BrIdGe), for language localization. BrIdGe ‘breaks’ the source content into granular facts, ‘ideates’ an action plan for content creation in the target language by organizing the granular facts, and finally executes the plan to ‘generate’ localized content. This approach emulates the cognitive processes humans employ in writing, which begin with identifying important points, followed by brainstorming on how to structure and organize the output. We evaluated the BrIdGe methodology from multiple perspectives, including the impact of the BrIdGe prompt on different LLMs and performance comparisons with traditional MT models and direct translation through LLMs on public benchmark and proprietary e-commerce datasets. Through human and LLM-based automated evaluations across content in multiple languages, we demonstrate the effectiveness of BrIdGe in generating fluent localized content while preserving factual consistency between source and target languages.
pdf
bib
abs
Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting
Emmanuel Aboah Boateng
|
Cassiano O Becker
|
Nabiha Asghar
|
Kabir Walia
|
Ashwin Srinivasan
|
Ehi Nosakhare
|
Soundararajan Srinivasan
|
Victor Dibia
Hand-crafting high-quality prompts to optimize the performance of language models is a complicated and labor-intensive process. Furthermore, when migrating to newer, smaller, or weaker models (possibly due to latency or cost gains), prompts need to be updated to re-optimize the task performance. We propose Concept Distillation (CD), an automatic prompt optimization technique for enhancing weaker models on complex tasks. CD involves: (1) collecting mistakes made by weak models with a base prompt (initialization), (2) using a strong model to generate reasons for these mistakes and create rules/concepts for weak models (induction), and (3) filtering these rules based on validation set performance and integrating them into the base prompt (deduction/verification). We evaluated CD on NL2Code and mathematical reasoning tasks, observing significant performance boosts for small and weaker language models. Notably, Mistral-7B’s accuracy on Multi-Arith increased by 20%, and Phi-3-mini-3.8B’s accuracy on HumanEval rose by 34%. Compared to other automated methods, CD offers an effective, cost-efficient strategy for improving weak models’ performance on complex tasks and enables seamless workload migration across different language models without compromising performance.
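A compact sketch of the three CD phases, with both models stubbed out; the rule format and the greedy validation filter are assumptions for illustration, not the paper's exact procedure.

```python
def weak_model(prompt, x):
    # Stub: returns the weak model's prediction for input x under the prompt.
    return "?"

def strong_model_explain(mistakes):
    # Stub: a strong model induces candidate rules/concepts from the mistakes.
    return ["Show intermediate arithmetic before stating the final answer."]

def accuracy(prompt, dataset):
    return sum(weak_model(prompt, x) == y for x, y in dataset) / len(dataset)

def concept_distillation(base_prompt, train, valid):
    # (1) Initialization: collect weak-model mistakes under the base prompt.
    mistakes = [(x, y) for x, y in train if weak_model(base_prompt, x) != y]
    # (2) Induction: the strong model turns mistakes into candidate rules.
    candidates = strong_model_explain(mistakes)
    # (3) Deduction/verification: keep only rules that help on validation.
    prompt, best = base_prompt, accuracy(base_prompt, valid)
    for rule in candidates:
        trial = prompt + "\nRule: " + rule
        score = accuracy(trial, valid)
        if score > best:
            prompt, best = trial, score
    return prompt
```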
pdf
bib
abs
Towards Reliable Agents: Benchmarking Customized LLM-Based Retrieval-Augmented Generation Frameworks with Deployment Validation
Kevin Shukang Wang
|
Karel Joshua Harjono
|
Ramon Lawrence
The emergence of Large Language Models has created new opportunities for building agent applications across various domains. To address the lack of targeted open benchmarks for agent frameworks, we designed a benchmark that features domain-specific, small knowledge bases, and includes a diverse set of questions categorized by type, such as simple, multi-hop, aggregation, and reasoning questions. We evaluated OpenAI’s Assistants API versus a RAG assistant built with Langchain and deployed a RAG system based on benchmark insights as a course assistant over a two-year span in a computer science course. Our findings reveal how domain-specific retrieval impacts response accuracy and highlight key challenges in real-world deployment. Notably, in smaller agentic systems with constrained knowledge bases, the primary challenge shifts from retrieval accuracy to data availability in the knowledge bases. We present insights from both benchmark evaluation and real-world usage data to guide the development of more reliable and effective agentic applications.
pdf
bib
abs
Query Variant Detection Using Retriever as Environment
Minji Seo
|
Youngwon Lee
|
Seung-won Hwang
|
Seoho Song
|
Hee-Cheol Seo
|
Young-In Song
This paper addresses the challenge of detecting query variants—pairs of queries with identical intents. One application in commercial search engines is reformulating a user query with its variant online. While measuring pairwise query similarity has been an established standard, it often falls short of capturing semantic equivalence when word forms or order differ. We propose leveraging retrieval as environment feedback (EF), based on the premise that desirable retrieval outcomes from equivalent queries should be interchangeable. Experimental results on both proprietary and public datasets demonstrate the efficacy of the proposed method, both with and without LLM calls.
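The environment-feedback premise can be sketched in a few lines, approximating interchangeability of retrieval outcomes by Jaccard overlap of top-k result IDs; the threshold and the search function are illustrative assumptions.

```python
def environment_feedback(search, q1, q2, k=10, threshold=0.6):
    """search(query, k) is assumed to return ranked document IDs."""
    r1 = set(search(q1, k))
    r2 = set(search(q2, k))
    # Interchangeable retrieval outcomes -> treat the queries as variants.
    jaccard = len(r1 & r2) / len(r1 | r2) if (r1 | r2) else 0.0
    return jaccard >= threshold

toy_index = {"cheap flights": [1, 2, 3], "low cost flights": [1, 2, 4]}
search = lambda q, k: toy_index.get(q, [])[:k]
print(environment_feedback(search, "cheap flights", "low cost flights"))
```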
pdf
bib
abs
Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education
Hayate Iso
|
Pouya Pezeshkpour
|
Nikita Bhutani
|
Estevam Hruschka
Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes, streamlining recruitment processes, and reducing operational costs. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context. It evaluates how factors such as gender, race, and educational background influence model decisions, providing critical insights into the fairness and reliability of LLMs in HR applications. Our findings indicate that while recent models have reduced biases related to explicit attributes like gender and race, implicit biases concerning educational background remain significant. These results highlight the need for ongoing evaluation and the development of advanced bias mitigation strategies to ensure equitable hiring practices when using LLMs in industry settings.
pdf
bib
abs
Goal-Driven Data Story, Narrations and Explanations
Aniya Aggarwal
|
Ankush Gupta
|
Shivangi Bithel
|
Arvind Agarwal
In this paper, we propose a system designed to process and interpret vague, open-ended, and multi-line complex natural language queries, transforming them into coherent, actionable data stories. Our system’s modular architecture comprises five components—Question Generation, Answer Generation, NLG/Chart Generation, Chart2Text, and Story Representation—each utilizing LLMs to transform data into human-readable narratives and visualizations. Unlike existing tools, our system uniquely addresses the ambiguity of vague, multi-line queries, setting a new benchmark in data storytelling by tackling complexities no existing system comprehensively handles. Our system is also cost-effective, using open-source models without extra training, and emphasizes transparency by showcasing end-to-end processing and intermediate outputs. This enhances explainability, builds user trust, and clarifies the data story generation process.
pdf
bib
abs
VIT-Pro: Visual Instruction Tuning for Product Images
Vishnu Prabhakaran
|
Purav Aggarwal
|
Vishruit Kulshreshtha
|
Arunita Das
|
Sahini Venkata Sitaram Sruti
|
Anoop Saladi
General vision-language models (VLMs) trained on web data struggle to understand and converse about real-world e-commerce product images. We propose a cost-efficient approach for collecting training data to train a generative VLM for e-commerce product images. The key idea is to leverage large-scale, loosely-coupled image-text pairs from e-commerce stores, use a pretrained LLM to generate multimodal instruction-following data, and fine-tune a general vision-language model using LoRA. Our instruction-finetuned model, VIT-Pro, can understand and respond to queries about product images, covering diverse concepts and tasks. VIT-Pro outperforms several general-purpose VLMs on multiple vision tasks in the e-commerce domain.
pdf
bib
abs
AutoKB: Automated Creation of Structured Knowledge Bases for Domain-Specific Support
Rishav Sahay
|
Arihant Jain
|
Purav Aggarwal
|
Anoop Saladi
Effective customer support requires domain-specific solutions tailored to users’ issues. However, LLMs like ChatGPT, while excelling in open-domain tasks, often face challenges such as hallucinations, lack of domain compliance, and imprecise solutions when applied to specialized contexts. RAG-based systems, designed to combine domain context from unstructured knowledge bases (KBs) with LLMs, often struggle with noisy retrievals, further limiting their effectiveness in addressing user issues. Consequently, a sanitized KB is essential to ensure solution accuracy, precision, and domain compliance. To address this, we propose AutoKB, an automated pipeline for building a domain-specific KB with a hierarchical tree structure that maps user issues to precise and domain-compliant solutions. This structure facilitates granular issue resolution by improving real-time retrieval of user-specific solutions. Experiments in troubleshooting and medical domains demonstrate that our approach significantly enhances solution correctness, preciseness, and domain compliance, outperforming LLMs and unstructured KB baselines. Moreover, AutoKB is 75 times more cost-effective than manual methods.
pdf
bib
abs
Medical Spoken Named Entity Recognition
Khai Le-Duc
|
David Thulke
|
Hung-Phong Tran
|
Long Vo-Dang
|
Khai-Nguyen Nguyen
|
Truong-Son Hy
|
Ralf Schlüter
Spoken Named Entity Recognition (NER) aims to extract named entities from speech and categorise them into types like person, location, organization, etc. In this work, we present *VietMed-NER* - the first spoken NER dataset in the medical domain. To our knowledge, our Vietnamese real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Furthermore, we present baseline results using various state-of-the-art pre-trained models, both encoder-only and sequence-to-sequence, and conduct quantitative and qualitative error analysis. We found that pre-trained multilingual models generally outperform monolingual models on both reference text and ASR output, and that encoders outperform sequence-to-sequence models on NER tasks. By translating the transcripts, the dataset can also be utilised for text NER in the medical domain in languages other than Vietnamese. All code, data and models are publicly available.
pdf
bib
abs
PLEX: Adaptive Parameter-Efficient Fine-Tuning for Code LLMs using Lottery-Tickets
Jaeseong Lee
|
Hojae Han
|
Jongyoon Kim
|
Seung-won Hwang
|
Naun Kang
|
KyungJun An
|
Sungho Jang
Fine-tuning large language models (LLMs) for code generation is challenging due to computational costs and the underrepresentation of some programming languages (PLs) in pre-training. We propose PLEX, a lottery-ticket based parameter-efficient fine-tuning (PEFT) method that adapts LLMs to both well-supported and underrepresented PLs. During lottery ticket selection, PLEX employs a dual strategy: for well-represented PLs, it leverages the LLM’s full parametric knowledge by selecting from full layers, while for underrepresented PLs, it narrows the selection scope to dense layers, prioritizing the most influential parameters. Additionally, PLEX-E, a low-rank extension of PLEX, further reduces computational costs by limiting the scope of fine-tuning. On MultiPL-E benchmarks, PLEX achieves state-of-the-art performance among PEFT methods, while PLEX-E maintains competitive results with reduced computational overhead. Both variants demonstrate effective adaptation across diverse programming languages, particularly for those underrepresented in pre-training.
pdf
bib
abs
Evaluating the Performance of RAG Methods for Conversational AI in the Airport Domain
Yuyang Li
|
Pjm Kerbusch
|
Rhr Pruim
|
Tobias Käfer
Airports in the top 20 in terms of annual passengers are highly dynamic environments with thousands of flights daily, and they aim to increase the degree of automation. To contribute to this, we implemented a Conversational AI system that enables staff in an airport to communicate with flight information systems. This system not only answers standard airport queries but also resolves airport terminology, jargon, abbreviations, and dynamic questions involving reasoning. In this paper, we built three different Retrieval-Augmented Generation (RAG) methods: traditional RAG, SQL RAG, and Knowledge Graph-based RAG (Graph RAG). Experiments showed that traditional RAG achieved 84.84% accuracy using BM25 + GPT-4 but occasionally produced hallucinations, which is risky for airport safety. In contrast, SQL RAG and Graph RAG achieved 80.85% and 91.49% accuracy respectively, with significantly fewer hallucinations. Moreover, Graph RAG was especially effective for questions that involved reasoning. Based on our observations, we thus recommend SQL RAG and Graph RAG for airport environments, due to fewer hallucinations and the ability to handle dynamic questions.
pdf
bib
abs
LLM Safety for Children
Prasanjit Rath
|
Hari Shrawgi
|
Parag Agrawal
|
Sandipan Dandapat
This paper analyzes the safety of Large Language Models (LLMs) in interactions with children below the age of 18. Despite the transformative applications of LLMs in various aspects of children’s lives, such as education and therapy, there remains a significant gap in understanding and mitigating potential content harms specific to this demographic. The study acknowledges the diverse nature of children, often overlooked by standard safety evaluations, and proposes a comprehensive approach to evaluating LLM safety specifically for children. We list potential risks that children may encounter when using LLM-powered applications. Additionally, we develop Child User Models that reflect the varied personalities and interests of children, informed by literature in child care and psychology. These user models aim to bridge the existing gap in child safety literature across various fields. We utilize the Child User Models to evaluate the safety of six state-of-the-art LLMs. Our observations reveal significant safety gaps in LLMs, particularly in categories harmful to children but not adults.
pdf
bib
abs
RxLens: Multi-Agent LLM-powered Scan and Order for Pharmacy
Akshay Jagatap
|
Srujana Merugu
|
Prakash Mandayam Comar
Automated construction of a shopping cart from medical prescriptions is a vital prerequisite for scaling up online pharmaceutical services in emerging markets, due to the high prevalence of paper prescriptions that are challenging for customers to interpret. We present RxLens, a multi-step, end-to-end Large Language Model (LLM)-based deployed solution for automated pharmacy cart construction comprising multiple steps: redaction of Personal Identifiable Information (PII), Optical Character Recognition (OCR), medication extraction, matching against the catalog, and bounding box detection for lineage. Our multi-step design leverages the synergy between retrieval and LLM-based generation to mitigate the vocabulary gaps in LLMs and fuzzy matching errors during retrieval. Empirical evaluation demonstrates that RxLens can yield up to 19% - 40% and 11% - 26% increase in Recall@3 relative to SOTA methods such as Medical Comprehend and vanilla retrieval augmentation of LLMs on handwritten and printed prescriptions, respectively. We also explore LLM-based auto-evaluation as an alternative to costly manual annotations and observe a 76% - 100% match relative to human judgements on various tasks.
pdf
bib
abs
Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs
Cong Duy Vu Hoang
|
Gioacchino Tangari
|
Clemence Lanfranchi
|
Dalu Guo
|
Paul Cayet
|
Steve Siu
|
Don Dharmasiri
|
Yuan-Fang Li
|
Long Duong
|
Damien Hilloulin
|
Rhicheek Patra
|
Sungpack Hong
|
Hassan Chafi
The growing adoption of large language models (LLMs) in business applications has amplified interest in Natural Language to SQL (NL2SQL) solutions, in which there is competing demand for high performance and efficiency. Domain- and customer-specific requirements further complicate the problem. To address this conundrum, we introduce Distill-C, a distilled customization framework tailored for NL2SQL tasks. Distill-C utilizes large teacher LLMs to produce high-quality synthetic data through a robust and scalable pipeline. Finetuning smaller and open-source LLMs on this synthesized data enables them to rival or outperform teacher models an order of magnitude larger. Evaluated on multiple challenging benchmarks, Distill-C achieves an average improvement of 36% in execution accuracy compared to the base models from three distinct LLM families. Additionally, on three internal customer benchmarks, Distill-C demonstrates a 22.6% performance improvement over the base models. Our results demonstrate that Distill-C is an effective, high-performing and generalizable approach for deploying lightweight yet powerful NL2SQL models, delivering exceptional accuracies while maintaining low computational cost.
pdf
bib
abs
eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables
Luis Antonio Gutierrez Guanilo
|
Mir Tafseer Nayeem
|
Cristian Jose Lopez Del Alamo
|
Davood Rafiei
Large Language Models (LLMs) have demonstrated exceptional versatility across diverse domains, yet their application in e-commerce remains underexplored due to a lack of domain-specific datasets. To address this gap, we introduce eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, including detailed product attributes and user-specific queries. Leveraging eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to produce high-quality, attribute-specific product reviews from structured tabular data. Fine-tuned models were rigorously evaluated using standard Table2Text metrics, alongside correctness, faithfulness, and fluency assessments. Our results demonstrate substantial improvements in generating contextually accurate reviews, underscoring both the transformative potential of tailored datasets and fine-tuning methodologies in optimizing e-commerce workflows and the essential role of domain-specific data in adapting LLMs to industry-specific challenges.
pdf
bib
RAD-Bench: Evaluating Large Language Models’ Capabilities in Retrieval Augmented Dialogues
Tzu-Lin Kuo
|
FengTing Liao
|
Mu-Wei Hsieh
|
Fu-Chieh Chang
|
Po-Chun Hsu
|
Da-shan Shiu
pdf
bib
abs
Conflict and Overlap Classification in Construction Standards Using a Large Language Model
Seong-Jin Park
|
Youn-Gyu Jin
|
Hyun-Young Moon
|
Choi Bong-Hyuck
|
Lee Seung Hwan
|
Ohjoon Kwon
|
Kang-Min Kim
Construction standards across different countries provide technical guidelines to ensure the quality and safety of buildings and facilities, with periodic revisions to accommodate advances in construction technology. However, these standards often contain overlapping or conflicting content owing to their broad scope and interdependence, complicating the revision process and creating public inconvenience. Although current expert-driven manual approaches aim to mitigate these issues, they are time-consuming, costly, and error-prone. To address these challenges, we propose conflict and overlap classification in construction standards using a large language model (COSLLM), a framework that leverages a construction domain-adapted large language model for the semantic comparison of sentences in construction standards. COSLLM utilizes a two-step reasoning process that adaptively employs chain-of-thought reasoning for the in-depth analysis of sentences suspected of overlaps or conflicts, ensuring computational and temporal efficiency while maintaining high classification accuracy. The framework achieved an accuracy of 97.9% and a macro F1-score of 0.907 in classifying real-world sentence pairs derived from Korean construction standards as overlapping, conflicting, or neutral. Furthermore, we develop and deploy a real-time web-based system powered by COSLLM to facilitate the efficient establishment and revision of construction standards.
pdf
bib
abs
Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text
Ala Jararweh
|
Oladimeji Macaulay
|
David Arredondo
|
Yue Hu
|
Luis E Tafoya
|
Kushal Virupakshappa
|
Avinash Sahu
Proteins play critical roles in biological systems, yet 99.7% of over 227 million known protein sequences remain uncharacterized due to the limitations of experimental methods. To assist experimentalists in narrowing down hypotheses and accelerating protein characterization, we present Protein2Text, a multimodal large language model that interprets protein sequences and generates informative text to address open-ended questions about protein functions and attributes. By integrating a resampling mechanism within an adapted LLaVA framework, our model effectively maps protein sequences into a language-compatible space, enhancing its capability to handle diverse and complex queries. Trained on a newly curated dataset derived from PubMed articles and rigorously evaluated using four comprehensive benchmarks—including in-domain and cross-domain evaluations—Protein2Text outperforms several existing models in open-ended question-answering tasks. Our work also highlights the limitations of current evaluation metrics applied to template-based approaches, which may lead to misleading results, emphasizing the need for unbiased assessment methods. Our model weights, evaluation datasets, and evaluation scripts are publicly available at https://github.com/alaaj27/Protein2Text.git.
pdf
bib
abs
Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia
Fajri Koto
While knowledge evaluation in large language models has predominantly focused on academic subjects like math and physics, these assessments often fail to capture the practical demands of real-world professions. In this paper, we introduce IndoCareer, a dataset comprising 8,834 multiple-choice questions designed to evaluate performance on vocational and professional certification exams across various fields. With a focus on Indonesia, IndoCareer provides rich local contexts, spanning six key sectors: (1) healthcare, (2) insurance and finance, (3) creative and design, (4) tourism and hospitality, (5) education and training, and (6) law. Our comprehensive evaluation of 27 large language models shows that these models struggle particularly in fields with strong local contexts, such as insurance and finance. Additionally, across the dataset as a whole, shuffling answer options generally keeps evaluation results consistent across models, but it introduces instability specifically in the insurance and finance sectors.
pdf
bib
abs
CodeGenWrangler: Data Wrangling task automation using Code-Generating Models
Ashlesha Akella
|
Abhijit Manatkar
|
Krishnasuri Narayanam
|
Sameep Mehta
Assuring the data quality of tabular datasets is essential for the efficiency of diverse downstream tasks on tables (like summarization and fact-checking). Data-wrangling tasks effectively address the challenges associated with structured data processing to improve the quality of tabular data. Traditional statistical methods handle numeric data efficiently but often fail to understand the semantic context of the textual data in tables. Deep learning approaches are resource-intensive, requiring task- and dataset-specific training. Addressing these shortcomings, we present an automated system that leverages LLMs to generate executable code for data-wrangling tasks like missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-independent and memory-dependent tasks.
pdf
bib
abs
Dialogue Language Model with Large-Scale Persona Data Engineering
Mengze Hong
|
Chen Jason Zhang
|
Chaotao Chen
|
Rongzhong Lian
|
Di Jiang
Maintaining persona consistency is paramount in the application of open-domain dialogue systems, as exemplified by models like ChatGPT. Despite significant advancements, the limited scale and diversity of current persona dialogue datasets remain challenges to achieving robust persona-consistent dialogue models. In this study, drawing inspiration from the success of large-scale pre-training, we introduce PPDS, an open-domain persona dialogue system that employs extensive generative pre-training on a persona dialogue dataset to enhance persona consistency. Specifically, we present a persona extraction model designed to autonomously and precisely generate vast persona dialogue datasets. Additionally, we unveil a pioneering persona augmentation technique to address the invalid persona bias inherent in the constructed dataset. Both quantitative and human evaluations consistently highlight the superior response quality and persona consistency of our proposed model, underscoring its effectiveness.
pdf
bib
abs
Developing a Reliable, Fast, General-Purpose Hallucination Detection and Mitigation Service
Song Wang
|
Xun Wang
|
Jie Mei
|
Yujia Xie
|
Si-Qing Chen
|
Wayne Xiong
Hallucination, a phenomenon where large language models (LLMs) produce output that is factually incorrect or unrelated to the input, is a major challenge for LLM applications that require accuracy and dependability. In this paper, we introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within LLMs. Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD), and an intricate decision tree-based process to reliably detect a wide range of hallucinations in LLM responses. Furthermore, we have crafted a rewriting mechanism that maintains an optimal mix of precision, response time, and cost-effectiveness. We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics, which are crucial for real-world deployment of these technologies. Our extensive evaluation, utilizing offline data and live production traffic, confirms the efficacy of our proposed framework and service.
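Schematically, the layered detection can be composed as below, with the NER, NLI, and span-based detectors passed in as callables; all three stubs and the simple decision rule are stand-ins for the production models and the decision tree, shown only to illustrate the composition.

```python
def detect_hallucination(response, source, ner, nli_entail, span_flags):
    # (1) NER check: every entity in the response must appear in the source.
    if any(entity not in source for entity in ner(response)):
        return True
    # (2) NLI check: the source should entail the response.
    if nli_entail(source, response) < 0.5:
        return True
    # (3) Span-based detection: any flagged unsupported span is a hallucination.
    return bool(span_flags(response, source))

# Naive stand-ins so the function runs end to end.
naive_ner = lambda text: [w for w in text.split() if w.istitle()]
naive_nli = lambda premise, hypothesis: 1.0 if hypothesis in premise else 0.3
no_spans = lambda response, source: []
print(detect_hallucination("Paris is in France.", "Paris is in France.",
                           naive_ner, naive_nli, no_spans))  # False
```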
pdf
bib
abs
Improved Near-Duplicate Detection for Aggregated and Paywalled News-Feeds
Siddharth Tumre
|
Sangameshwar Patil
|
Alok Kumar
News aggregators play a key role in the rapidly evolving digital landscape by providing comprehensive and timely news stories aggregated from diverse sources into one feed. As these articles are sourced from different outlets, they often end up covering the same underlying event but differ in phrasing or formatting, or are supplemented with additional details. It is crucial for news aggregators to identify these near-duplicates, improving content quality and user engagement by steering away from redundant information. The problem of near-duplicate news detection has become harder with the increasing use of paywalls by news websites, resulting in restricted access to content. It is now common to get only the headline and a short snippet from an article. Previous works have concentrated on full-length versions of documents such as webpages. There is very little work that focuses on this variation of the near-duplicate detection problem, in which only a headline and a small text blurb are available for each news article. We propose the Near-Duplicate Detection Using Metadata Augmented Communities (NDD-MAC) approach, which combines embeddings from a pretrained language model (PLM) with latent metadata of a news article, followed by community detection to identify clusters of near-duplicates. We show the efficacy of the proposed approach using two different real-world datasets. By integrating metadata with community detection, NDD-MAC is able to detect nuanced similarities and differences in news snippets and offers an industrial-scale solution for near-duplicate detection in scenarios with restricted content availability.
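An illustrative sketch of this kind of pipeline: pairs of snippets are scored by blending embedding similarity with metadata overlap, and clusters emerge from the resulting graph. The blend weight and the use of union-find connected components (in place of a full community detection algorithm) are simplifying assumptions.

```python
from itertools import combinations

def pair_score(emb_sim, meta_a, meta_b, alpha=0.7):
    # Blend PLM-embedding similarity with latent-metadata overlap.
    meta_overlap = len(meta_a & meta_b) / max(len(meta_a | meta_b), 1)
    return alpha * emb_sim + (1 - alpha) * meta_overlap

def near_duplicate_clusters(articles, emb_sim_fn, threshold=0.8):
    """articles: dicts with 'id', 'headline', 'meta'; emb_sim_fn: headline pair -> [0,1]."""
    parent = {a["id"]: a["id"] for a in articles}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    for a, b in combinations(articles, 2):
        score = pair_score(emb_sim_fn(a["headline"], b["headline"]),
                           set(a["meta"]), set(b["meta"]))
        if score >= threshold:
            parent[find(a["id"])] = find(b["id"])   # merge communities
    clusters = {}
    for a in articles:
        clusters.setdefault(find(a["id"]), []).append(a["id"])
    return list(clusters.values())
```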
pdf
bib
abs
Pisets: A Robust Speech Recognition System for Lectures and Interviews
Ivan Bondarenko
|
Daniil Grebenkin
|
Oleg Sedukhin
|
Mikhail Klementev
|
Derunets Roman
|
Lyudmila Budneva
This work presents “Pisets”, a speech-to-text system for scientists and journalists, based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian-language speech corpora significantly enhanced the system’s effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcription of long audio recordings across various acoustic conditions, compared to WhisperX and the standard Whisper model. The source code of the “Pisets” system is publicly available on GitHub: https://github.com/bond005/pisets.
pdf
bib
abs
CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search
Kaixin Wu
|
Yixin Ji
|
Zeyuan Chen
|
Qiang Wang
|
Cunxiang Wang
|
Hong Liu
|
Baijun Ji
|
Xu Jia
|
Zhongyi Liu
|
Jinjie Gu
|
Yuan Zhou
|
Linjian Mo
Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage in the supply of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) employing both queries and multi-field items for joint pre-training to enhance domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results on offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines.
pdf
bib
abs
Schema and Natural Language Aware In-Context Learning for Improved GraphQL Query Generation
Nitin Gupta
|
Manish Kesarwani
|
Sambit Ghosh
|
Sameep Mehta
|
Carlos Eberhardt
|
Dan Debrunner
GraphQL offers a flexible alternative to REST APIs, allowing precise data retrieval across multiple sources in a single query. However, generating complex GraphQL queries remains a significant challenge. Large Language Models (LLMs), while powerful, often produce suboptimal queries due to limited exposure to GraphQL schemas and their structural intricacies. Custom prompt engineering with in-context examples is a common approach to guide LLMs, but existing methods, like randomly selecting examples, often yield unsatisfactory results. While semantic similarity-based selection is effective in other domains, it falls short for GraphQL, where understanding schema-specific nuances is crucial for accurate query formulation. To address this, we propose a Schema and NL-Aware In-context Learning (SNAIL) framework that integrates both structural and semantic information from GraphQL schemas with natural language inputs, enabling schema-aware in-context learning. Unlike existing methods, our approach captures the complexities of GraphQL schemas to improve query generation accuracy. We validate this framework on a publicly available complex GraphQL test dataset, demonstrating notable performance improvements, with specific query classes showing up to a 20% performance improvement for certain LLMs. As GraphQL adoption grows, with Gartner predicting over 60% of enterprises will use it in production by 2027, this work addresses a critical need, paving the way for more efficient and reliable GraphQL query generation in enterprise applications.
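Schema- and NL-aware example selection might be sketched as below, ranking candidate (NL, GraphQL) pairs by a blend of natural-language similarity and schema-structure similarity; both similarity functions and the blend weight are placeholders rather than the SNAIL components.

```python
def schema_similarity(types_a: set, types_b: set) -> float:
    # Jaccard overlap of the GraphQL schema types each query touches.
    union = types_a | types_b
    return len(types_a & types_b) / len(union) if union else 0.0

def select_examples(query_nl, query_schema_types, candidates, nl_sim,
                    k=3, beta=0.5):
    """candidates: dicts with 'nl', 'schema_types', 'graphql';
    nl_sim: natural-language similarity function in [0, 1]."""
    scored = [
        (beta * nl_sim(query_nl, c["nl"])
         + (1 - beta) * schema_similarity(query_schema_types,
                                          c["schema_types"]), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # The top-k pairs become the in-context examples in the prompt.
    return [c for _, c in scored[:k]]
```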
pdf
bib
abs
Chatbot Arena Estimate: towards a generalized performance benchmark for LLM capabilities
Lucas Spangher
|
Tianle Li
|
William F. Arnold
|
Nick Masiewicki
|
Xerxes Dotiwalla
|
Rama Kumar Pasumarthi
|
Peter Grabowski
|
Eugene Ie
|
Daniel Gruhl
In industrial LLM development, evaluating large language models (LLMs) is critical for tasks like benchmarking internal models and detecting regressions during fine-tuning, but existing benchmark aggregation methods, such as Elo-based systems, can be resource-intensive, public-facing, and time-consuming. Here, we describe Chatbot Arena Estimate (CAE), a practical framework for aggregating performance across diverse benchmarks. The framework, developed and widely adopted within our organization, addresses the need for quick, accurate, and cost-efficient evaluations of LLMs. CAE generates two primary metrics: a “Goodness” score (answer accuracy) and a “Fastness” score (cost or queries per second, QPS). These metrics allow for model ranking both overall and within specific subdomains, enabling informed decisions during model iteration and deployment. We demonstrate CAE’s effectiveness by comparing it with existing benchmarks, including the full Chatbot Arena and the MMLU leaderboard. Notably, our approach achieves higher Pearson correlation with Chatbot Arena Elo scores than MMLU’s correlation with Chatbot Arena Elo scores, validating its reliability for real-world LLM evaluation.
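A simplified sketch of aggregating heterogeneous benchmarks into the two CAE metrics, assuming per-benchmark min-max normalization for the “Goodness” score; the deployed framework's exact aggregation may differ.

```python
def cae_scores(results, qps):
    """results: {benchmark: {model: accuracy}}; qps: {model: queries/sec}."""
    models = {m for bench in results.values() for m in bench}
    normalized = {m: [] for m in models}
    for bench in results.values():
        lo, hi = min(bench.values()), max(bench.values())
        for m, acc in bench.items():
            # Min-max normalize within each benchmark so scales are comparable.
            normalized[m].append((acc - lo) / (hi - lo) if hi > lo else 1.0)
    return {
        m: {"goodness": sum(v) / len(v), "fastness": qps.get(m, 0.0)}
        for m, v in normalized.items()
    }

results = {
    "mmlu":  {"model_a": 0.71, "model_b": 0.65},
    "gsm8k": {"model_a": 0.55, "model_b": 0.62},
}
print(cae_scores(results, {"model_a": 40.0, "model_b": 95.0}))
```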
pdf
bib
abs
Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
Arvind Krishna Sridhar
|
Yinyi Guo
|
Erik Visser
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. Recently, AQA has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models (LLMs) through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we perform a further fine-tuning of an existing baseline using curriculum learning strategy to specialize in temporal reasoning without compromising performance on fine-tuned tasks. We demonstrate the performance of our model using state-of-the-art LALMs on public audio benchmark datasets. Third, we implement our AQA model on-device locally and investigate its CPU inference for edge applications.
pdf
bib
abs
HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications
Rishi Kalra
|
Zekun Wu
|
Ayesha Gulley
|
Airlie Hilliard
|
Xin Guan
|
Adriano Koshiyama
|
Philip Colin Treleaven
Large Language Models (LLMs) face limitations in AI legal and policy applications due to outdated knowledge, hallucinations, and poor reasoning in complex contexts. Retrieval-Augmented Generation (RAG) systems address these issues by incorporating external knowledge, but suffer from retrieval errors, ineffective context integration, and high operational costs. This paper presents the Hybrid Parameter-Adaptive RAG (HyPA-RAG) system, designed for the AI legal domain, with NYC Local Law 144 (LL144) as the test case. HyPA-RAG integrates a query complexity classifier for adaptive parameter tuning, a hybrid retrieval approach combining dense, sparse, and knowledge graph methods, and a comprehensive evaluation framework with tailored question types and metrics. Testing on LL144 demonstrates that HyPA-RAG enhances retrieval accuracy, response fidelity, and contextual precision, offering a robust and adaptable solution for high-stakes legal and policy applications.
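The adaptive-parameter idea can be sketched as a classifier that maps query complexity to retrieval settings; the heuristic classifier and the parameter table below are illustrative assumptions, not the paper's tuned values.

```python
# Hypothetical parameter table: retrieval settings per complexity class.
PARAMS = {
    "simple":   {"dense_k": 3,  "sparse_k": 3,  "use_graph": False},
    "moderate": {"dense_k": 8,  "sparse_k": 8,  "use_graph": False},
    "complex":  {"dense_k": 15, "sparse_k": 10, "use_graph": True},
}

def classify_complexity(query: str) -> str:
    # Stand-in heuristic; the paper trains a dedicated complexity classifier.
    hops = sum(query.lower().count(w) for w in (" and ", "compare", "versus"))
    if len(query.split()) < 8 and hops == 0:
        return "simple"
    return "complex" if hops >= 1 else "moderate"

def retrieve(query, dense_search, sparse_search, graph_search):
    # Hybrid retrieval: dense + sparse always, knowledge graph when warranted.
    p = PARAMS[classify_complexity(query)]
    results = (dense_search(query, p["dense_k"])
               + sparse_search(query, p["sparse_k"]))
    if p["use_graph"]:
        results += graph_search(query)
    return results
```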
pdf
bib
abs
An Efficient Context-Dependent Memory Framework for LLM-Centric Agents
Pengyu Gao
|
Jinming Zhao
|
Xinyue Chen
|
Long Yilin
In human cognitive memory psychology, the context-dependent effect helps retrieve key memory cues essential for recalling relevant knowledge in problem-solving. Inspired by this, we introduce the context-dependent memory framework (CDMem), an efficient architecture mimicking human memory processes through multistage encoding, context-aware storage, and retrieval strategies for LLM-centric agents. We propose multistage memory encoding strategies for acquiring high-quality multilevel knowledge: expert encoding compresses raw trajectories from a domain-expert perspective, short-term encoding consolidates experiences from current tasks, and long-term encoding reflects insights from past tasks. For memory storage and retrieval, we design a graph-structured, context-dependent indexing mechanism that allows agents to efficiently and accurately recall the most relevant multilevel knowledge tailored to the current task and environmental context. Furthermore, the proposed CDMem framework is an online learning architecture, enabling agents to efficiently learn and update memory while adapting to novel environments and tasks in real-world applications. We conducted extensive experiments on two interactive decision-making benchmarks in the navigation and manipulation domains, ALFWorld and ScienceWorld. Using GPT-4o-mini, our method surpasses state-of-the-art online LLM-centric approaches, achieving success rates of 85.8% and 56.0%, respectively. We hope this work will serve as a valuable reference for the academic and industrial communities in advancing agent-based applications.
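A toy sketch of context-dependent memory indexing in this spirit: entries carry an encoding level (expert, short-term, long-term) and context tags, and recall ranks entries by context overlap with the current task. The tagging scheme and overlap scoring are assumptions for illustration, not the paper's graph-structured index.

```python
class ContextDependentMemory:
    def __init__(self):
        self.entries = []   # (level, context_tags, knowledge)

    def store(self, level, context_tags, knowledge):
        # level: "expert" | "short_term" | "long_term" encodings.
        self.entries.append((level, frozenset(context_tags), knowledge))

    def recall(self, current_context, k=3):
        # Rank entries by Jaccard overlap between stored and current context.
        ctx = set(current_context)
        scored = [
            (len(ctx & tags) / max(len(ctx | tags), 1), level, knowledge)
            for level, tags, knowledge in self.entries
        ]
        scored.sort(reverse=True)
        return [(level, knowledge) for _, level, knowledge in scored[:k]]

mem = ContextDependentMemory()
mem.store("expert", {"alfworld", "kitchen"}, "Heat items using the microwave.")
mem.store("long_term", {"navigation"}, "Re-observe after each failed action.")
print(mem.recall({"alfworld", "kitchen", "heat"}))
```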