2024
pdf
bib
abs
Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization
Md Tahmid Rahman Laskar
|
Elena Khasanova
|
Xue-Yong Fu
|
Cheng Chen
|
Shashi Bhushan Tn
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
This work focuses on the task of query-based meeting summarization in which the summary of a context (meeting transcript) is generated in response to a specific query. When using Large Language Models (LLMs) for this task, a new call to the LLM inference endpoint/API is required for each new query even if the context stays the same. However, repeated calls to the LLM inference endpoints would significantly increase the costs of using them in production, making LLMs impractical for many real-world use cases. To address this problem, in this paper, we investigate whether combining the queries for the same input context in a single prompt to minimize repeated calls can be successfully used in meeting summarization. In this regard, we conduct extensive experiments by comparing the performance of various popular LLMs: GPT-4, Gemini, Claude-3, LLaMA2, Mistral, Phi-3, and Qwen-2 in single-query and multi-query settings. We observe that the capability to reliably generate the response in the expected format is usually limited to closedsource LLMs, with most open-source LLMs lagging behind (except Mistral). We conclude that multi-query prompting could be useful to optimize the inference costs by significantly reducing calls to the inference endpoints/APIs for the task of meeting summarization.
2023
pdf
bib
abs
Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective
Md Tahmid Rahman Laskar
|
Xue-Yong Fu
|
Cheng Chen
|
Shashi Bhushan TN
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
This paper studies how to effectively build meeting summarization systems for real-world usage using large language models (LLMs). For this purpose, we conduct an extensive evaluation and comparison of various closed-source and open-source LLMs, namely, GPT-4, GPT-3.5, PaLM-2, and LLaMA-2. Our findings reveal that most closed-source LLMs are generally better in terms of performance. However, much smaller open-source models like LLaMA-2 (7B and 13B) could still achieve performance comparable to the large closed-source models even in zero-shot scenarios. Considering the privacy concerns of closed-source models for only being accessible via API, alongside the high cost associated with using fine-tuned versions of the closed-source models, the opensource models that can achieve competitive performance are more advantageous for industrial use. Balancing performance with associated costs and privacy concerns, the LLaMA-2-7B model looks more promising for industrial usage. In sum, this paper offers practical insights on using LLMs for real-world business meeting summarization, shedding light on the trade-offs between performance and cost.
pdf
bib
abs
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs
Xue-Yong Fu
|
Md Tahmid Rahman Laskar
|
Cheng Chen
|
Shashi Bhushan Tn
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
In recent years, large language models (LLMs) have drawn significant attention due to their impressive emergent capabilities that were not observed in earlier language models. One emerging area where LLMs have been widely used in recent times is the utilization of LLMs as the evaluator of the texts generated by various generative models. In this paper, we also explore the possibility of whether LLMs are reliable in assessing the factual consistency of summaries generated by text generation models. We first propose a new approach to evaluate the factuality score using LLMs by utilizing the same LLM to perform all steps in the question-answering-based factuality scoring pipeline. Subsequently, we study the performance of various LLMs to directly score the factuality. Our evaluation is conducted in traditional benchmarks by comparing their correlation with human annotations. Contrary to expectations, our findings revealed that none of the factuality metrics showed any significant correlations (e.g., coefficient scores greater than 0.3) to human evaluations of factuality for GPT-4, PaLM-2, and Claude-2, with the only exception being GPT-3.5 in two subcategories of factuality. Nonetheless, our findings are consistent across almost all factual error types, suggesting a fundamental limitation in the ability of current LLMs to assess factuality.
2022
pdf
bib
abs
Improving Named Entity Recognition in Telephone Conversations via Effective Active Learning with Human in the Loop
Md Tahmid Rahman Laskar
|
Cheng Chen
|
Xue-yong Fu
|
Shashi Bhushan Tn
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)
Telephone transcription data can be very noisy due to speech recognition errors, disfluencies, etc. Not only that annotating such data is very challenging for the annotators, but also such data may have lots of annotation errors even after the annotation job is completed, resulting in a very poor model performance. In this paper, we present an active learning framework that leverages human in the loop learning to identify data samples from the annotated dataset for re-annotation that are more likely to contain annotation errors. In this way, we largely reduce the need for data re-annotation for the whole dataset. We conduct extensive experiments with our proposed approach for Named Entity Recognition and observe that by re-annotating only about 6% training instances out of the whole dataset, the F1 score for a certain entity type can be significantly improved by about 25%.
pdf
bib
abs
Entity-level Sentiment Analysis in Contact Center Telephone Conversations
Xue-yong Fu
|
Cheng Chen
|
Md Tahmid Rahman Laskar
|
Shayna Gardiner
|
Pooja Hiranandani
|
Shashi Bhushan Tn
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Entity-level sentiment analysis predicts the sentiment about entities mentioned in a given text. It is very useful in a business context to understand user emotions towards certain entities, such as products or companies. In this paper, we demonstrate how we developed an entity-level sentiment analysis system that analyzes English telephone conversation transcripts in contact centers to provide business insight. We present two approaches, one entirely based on the transformer-based DistilBERT model, and another that uses a neural network supplemented with some heuristic rules.
pdf
bib
abs
BLINK with Elasticsearch for Efficient Entity Linking in Business Conversations
Md Tahmid Rahman Laskar
|
Cheng Chen
|
Aliaksandr Martsinovich
|
Jonathan Johnston
|
Xue-Yong Fu
|
Shashi Bhushan Tn
|
Simon Corston-Oliver
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track
An Entity Linking system aligns the textual mentions of entities in a text to their corresponding entries in a knowledge base. However, deploying a neural entity linking system for efficient real-time inference in production environments is a challenging task. In this work, we present a neural entity linking system that connects the product and organization type entities in business conversations to their corresponding Wikipedia and Wikidata entries. The proposed system leverages Elasticsearch to ensure inference efficiency when deployed in a resource limited cloud machine, and obtains significant improvements in terms of inference speed and memory consumption while retaining high accuracy.
pdf
bib
abs
An Effective, Performant Named Entity Recognition System for Noisy Business Telephone Conversation Transcripts
Xue-Yong Fu
|
Cheng Chen
|
Md Tahmid Rahman Laskar
|
Shashi Bhushan Tn
|
Simon Corston-Oliver
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
We present a simple yet effective method to train a named entity recognition (NER) model that operates on business telephone conversation transcripts that contain noise due to the nature of spoken conversation and artifacts of automatic speech recognition. We first fine-tune LUKE, a state-of-the-art Named Entity Recognition (NER) model, on a limited amount of transcripts, then use it as the teacher model to teach a smaller DistilBERT-based student model using a large amount of weakly labeled data and a small amount of human-annotated data. The model achieves high accuracy while also satisfying the practical constraints for inclusion in a commercial telephony product: realtime performance when deployed on cost-effective CPUs rather than GPUs. In this paper, we introduce the fine-tune-then-distill method for entity recognition on real world noisy data to deploy our NER model in a limited budget production environment. By generating pseudo-labels using a large teacher model pre-trained on typed text while fine-tuned on noisy speech text to train a smaller student model, we make the student model 75x times faster while reserving 99.09% of its accuracy. These findings demonstrate that our proposed approach is very effective in limited budget scenarios to alleviate the need of human labeling of a large amount of noisy data.