2023
Reducing cohort bias in natural language understanding systems with targeted self-training scheme
Dieu-thu Le | Gabriela Hernandez | Bei Chen | Melanie Bradford
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Bias in machine learning models can be an issue when the models are trained on particular types of data that do not generalize well, causing underperformance for certain groups of users. In this work, we focus on reducing the bias related to new customers in a digital voice assistant system. Natural language understanding models often perform worse on requests coming from new users than on those from experienced users. To mitigate this problem, we propose a framework that consists of two phases: (1) a fixing phase with four active learning strategies used to identify important samples coming from new users, and (2) a self-training phase where a teacher model trained in the first phase is used to annotate semi-supervised samples to expand the training data with relevant cohort utterances. We explain practical strategies that identify representative cohort-based samples through density clustering and employ implicit customer feedback to improve new customers’ experience. We demonstrate the effectiveness of our approach in a real-world, large-scale voice assistant system for two languages, German and French, through both offline experiments and A/B testing.
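The self-training phase described in this abstract can be sketched roughly as follows. Everything here is illustrative: the toy keyword "teacher" and all names (`teacher_predict`, `self_train_expand`, the confidence threshold) are invented, not the paper's implementation.

```python
# Sketch of a self-training expansion step: a teacher model pseudo-labels
# unlabeled utterances from the target cohort (new users), and only
# high-confidence predictions are added to the training data.

def teacher_predict(utterance):
    """Toy teacher: returns (label, confidence) for an utterance."""
    if "play" in utterance:
        return "PlayMusic", 0.9
    return "Other", 0.4

def self_train_expand(train_data, unlabeled, threshold=0.8):
    """Add confidently pseudo-labeled cohort utterances to the training set."""
    expanded = list(train_data)
    for utt in unlabeled:
        label, conf = teacher_predict(utt)
        if conf >= threshold:  # keep only high-confidence pseudo-labels
            expanded.append((utt, label))
    return expanded

data = self_train_expand([("turn on the light", "SmartHome")],
                         ["play some jazz", "what is this"])
```

In the paper the teacher is the NLU model trained during the fixing phase; the threshold-based filter above stands in for whatever confidence criterion the real system applies.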
2022
Unsupervised training data re-weighting for natural language understanding with local distribution approximation
Jose Garrido Ramas | Dieu-thu Le | Bei Chen | Manoj Kumar | Kay Rottmann
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
One of the major challenges of training Natural Language Understanding (NLU) production models lies in the discrepancy between the distributions of the offline training data and the online live data, due to, e.g., biased sampling schemes, cyclic seasonality shifts, annotated training data coming from a variety of different sources, and a changing pool of users. Consequently, the model trained on the offline data is biased. We often observe this problem in task-oriented conversational systems, where the topics of interest and the characteristics of users change over time. In this paper we propose an unsupervised approach to mitigate the offline training data sampling bias in multiple NLU tasks. We show that a local distribution approximation in the pre-trained embedding space enables the estimation of importance weights for training samples, guiding re-sampling for effective bias mitigation. We illustrate our novel approach using multiple NLU datasets and show improvements obtained without additional annotation, making this a general approach for mitigating the effects of sampling bias.
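The core idea, a density-ratio importance weight estimated locally in embedding space, can be sketched as below. This is not the authors' code: the radius-based density estimate and all function names are assumptions used purely to make the mechanism concrete.

```python
import math

# Illustrative density-ratio importance weighting: for each offline
# training point, compare how densely live (online) samples vs. offline
# samples fall within a radius around it in embedding space, and use the
# ratio as a re-sampling weight.

def count_within(point, points, radius):
    """Number of points within `radius` of `point` (Euclidean)."""
    return sum(1 for p in points if math.dist(point, p) <= radius)

def importance_weights(train_emb, live_emb, radius=1.0):
    weights = []
    for x in train_emb:
        p_live = count_within(x, live_emb, radius) / len(live_emb)
        p_train = count_within(x, train_emb, radius) / len(train_emb)
        weights.append(p_live / p_train if p_train > 0 else 0.0)
    return weights

# A training point surrounded by live traffic gets upweighted;
# one far from any live sample gets weight zero.
weights = importance_weights([(0.0, 0.0), (10.0, 10.0)],
                             [(0.0, 0.5), (0.2, 0.0)])
```

A production system would estimate the densities with k-nearest neighbors or kernel methods over pre-trained sentence embeddings rather than a fixed radius, but the weight has the same shape: local live density over local training density.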
Semi-supervised Adversarial Text Generation based on Seq2Seq models
Hieu Le | Dieu-thu Le | Verena Weber | Chris Church | Kay Rottmann | Melanie Bradford | Peter Chin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
To improve deep learning models’ robustness, adversarial training has been frequently used in computer vision with satisfying results. However, adversarial perturbations on text have turned out to be more challenging due to the discrete nature of text. The generated adversarial text might not sound natural or may not preserve semantics, which is key for real-world applications where text classification is based on semantic meaning. In this paper, we describe a new way of generating adversarial samples by using pseudo-labeled in-domain text data to train a seq2seq model for adversarial generation, combined with paraphrase detection. We showcase the benefit of our approach for a real-world Natural Language Understanding (NLU) task, which maps a user’s request to an intent. Furthermore, we experiment with gradient-based training for the NLU task and try using token importance scores to guide the adversarial text generation. We show that our approach can generate realistic and relevant adversarial samples compared to other state-of-the-art adversarial training methods. Applying adversarial training using these generated samples helps the NLU model to recover up to 70% of these types of errors and makes the model more robust, especially in the tail of the distribution in a large-scale real-world application.
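The token-importance scoring mentioned in the abstract can be illustrated with a leave-one-out proxy: rank each token by how much the classifier's confidence drops when the token is removed. The paper uses gradient-based scores over a real NLU model; the keyword "classifier" and names below are invented stand-ins.

```python
# Leave-one-out token importance for a toy intent classifier: tokens whose
# removal hurts the intent score most are the best targets for adversarial
# perturbation.

def toy_intent_score(tokens):
    """Toy classifier confidence for an imaginary 'PlayMusic' intent."""
    keywords = {"play": 0.6, "music": 0.3}
    return sum(keywords.get(t, 0.0) for t in tokens)

def token_importance(tokens):
    """Score each token by the confidence drop when it is removed."""
    base = toy_intent_score(tokens)
    return {t: base - toy_intent_score([x for x in tokens if x != t])
            for t in tokens}

scores = token_importance(["please", "play", "music"])
```

Under this proxy, "play" scores highest and "please" near zero, so an adversarial generator guided by these scores would preferentially rewrite the content-bearing tokens.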
2021
Combining semantic search and twin product classification for recognition of purchasable items in voice shopping
Dieu-Thu Le | Verena Weber | Melanie Bradford
Proceedings of the 4th Workshop on e-Commerce and NLP
The accuracy of an online shopping system via voice commands is particularly important and may have a great impact on customer trust. This paper focuses on the problem of detecting if an utterance contains actual and purchasable products, thus referring to a shopping-related intent in a typical Spoken Language Understanding architecture consisting of an intent classifier and a slot detector. Searching through billions of products to check if a detected slot is a purchasable item is prohibitively expensive. To overcome this problem, we present a framework that (1) uses a retrieval module that returns the most relevant products with respect to the detected slot, and (2) combines it with a twin network that decides if the detected slot is indeed a purchasable item or not. Through various experiments, we show that this architecture outperforms a typical slot detector approach, with a gain of +81% in accuracy and +41% in F1 score.
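The two-step check described above can be sketched minimally: retrieve the most similar catalog products for a detected slot, then decide "purchasable" from the best match. A simple similarity threshold stands in for the paper's learned twin network, and the embeddings, catalog, and threshold below are all assumed for illustration.

```python
import math

# Step 1 (retrieval): rank catalog products by cosine similarity to the
# detected slot's embedding. Step 2 (decision): a threshold on the best
# similarity, standing in for the twin-network classifier.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_purchasable(slot_vec, catalog, k=3, threshold=0.8):
    # retrieval module: top-k most similar product embeddings
    top = sorted((cosine(slot_vec, v) for v in catalog.values()),
                 reverse=True)[:k]
    # twin-network stand-in: accept if the best match is close enough
    return top[0] >= threshold

catalog = {"batteries": (1.0, 0.0), "paper towels": (0.9, 0.1)}
```

The point of the retrieval step is exactly the cost argument in the abstract: the decision model only ever sees the top-k candidates, never the full catalog of billions of products.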
2018
Joint learning of frequency and word embeddings for multilingual readability assessment
Dieu-Thu Le | Cam-Tu Nguyen | Xiaoliang Wang
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications
This paper describes two models that employ word frequency embeddings to deal with the problem of readability assessment in multiple languages. The task is to determine the difficulty level of a given document, i.e., how hard it is for a reader to fully comprehend the text. The proposed models show how frequency information can be integrated to improve readability assessment. Experimental results on both English and Chinese datasets show that the proposed models improve results notably compared to those using only traditional word embeddings.
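One simple way to realize "joint learning of frequency and word embeddings" is to bucket each word's corpus frequency into bands, give each band its own embedding, and concatenate it with the word embedding before classification. The band scheme and vectors below are invented for illustration and are not the paper's exact construction.

```python
# Sketch: map a word's raw corpus count to a frequency band, look up a
# band embedding, and concatenate it with the word embedding.

FREQ_BANDS = 4  # e.g. very common ... rare

def frequency_band(count, max_count):
    """Bucket a raw corpus count into one of FREQ_BANDS bands."""
    ratio = count / max_count
    return min(int(ratio * FREQ_BANDS), FREQ_BANDS - 1)

def joint_embedding(word_vec, freq_vecs, count, max_count):
    """Concatenate the word embedding with its frequency-band embedding."""
    return word_vec + freq_vecs[frequency_band(count, max_count)]

freq_vecs = [[0.0, 1.0], [0.3, 0.7], [0.7, 0.3], [1.0, 0.0]]
vec = joint_embedding([0.2, 0.5, 0.1], freq_vecs, count=50, max_count=100)
```

In a trained model the band embeddings would be learned jointly with the classifier; the concatenation is what lets the readability model see that a word is rare even when its word embedding alone does not encode that.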
Dave the debater: a retrieval-based and generative argumentative dialogue agent
Dieu Thu Le | Cam-Tu Nguyen | Kim Anh Nguyen
Proceedings of the 5th Workshop on Argument Mining
In this paper, we explore the problem of developing an argumentative dialogue agent that can discuss controversial topics with human users. We describe two systems that use retrieval-based and generative models to produce argumentative responses to users. The experiments show promising results even though the systems have been trained on a small dataset.
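The retrieval-based half of such an agent can be sketched as matching the user's turn against a bank of argument/counter-argument pairs and returning the counter-argument of the best match. The word-overlap scorer and the tiny argument bank below are invented for illustration; a real system would use learned similarity over a much larger corpus.

```python
# Toy retrieval-based debater: score stored argument/counter-argument
# pairs by word overlap with the user's turn and reply with the
# counter-argument of the best-scoring pair.

ARGUMENT_BANK = [
    ("school uniforms limit expression", "uniforms reduce peer pressure"),
    ("homework improves learning", "homework causes stress and burnout"),
]

def retrieve_response(user_turn):
    user_words = set(user_turn.lower().split())
    def overlap(pair):
        # number of shared words between the user turn and the stored claim
        return len(user_words & set(pair[0].split()))
    best = max(ARGUMENT_BANK, key=overlap)
    return best[1]

reply = retrieve_response("I think homework improves learning a lot")
```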
2016
Towards a text analysis system for political debates
Dieu-Thu Le | Ngoc Thang Vu | Andre Blessing
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Construction and Analysis of a Large Vietnamese Text Corpus
Dieu-Thu Le | Uwe Quasthoff
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents a new Vietnamese text corpus containing around 4.05 billion words. It is a collection of Wikipedia texts, newspaper articles and random web texts. The paper describes the process of collecting, cleaning and creating the corpus. Processing Vietnamese text poses several challenges: for example, unlike many Latin-script languages, Vietnamese does not use blanks to separate words, so common tokenizers that treat blanks as word boundaries do not work. A short review of different approaches to Vietnamese tokenization is presented, together with how the corpus has been processed and created. After that, some statistical analysis of the data is reported, including the number of syllables, average word length, sentence length and topic analysis. The corpus is integrated into a framework that allows searching and browsing. Using this web interface, users can find out how many times a particular word appears in the corpus, sample sentences where it occurs, and its left and right neighbors.
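The segmentation problem described above, blanks separate syllables rather than words, is often handled by greedy longest matching against a lexicon. The sketch below uses a tiny invented lexicon; the paper reviews several approaches, and this is only one of them, shown to make the problem concrete.

```python
# Toy longest-matching tokenizer for Vietnamese: syllables are already
# blank-separated, but words may span several syllables, so we greedily
# try the longest candidate in the lexicon first.

LEXICON = {"sinh viên", "đại học", "sinh", "viên", "đại", "học"}
MAX_SYLLABLES = 3  # longest word we attempt to match

def tokenize(text):
    syllables = text.split()
    words, i = [], 0
    while i < len(syllables):
        # try the longest candidate first, fall back to a single syllable
        for n in range(min(MAX_SYLLABLES, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if cand in LEXICON or n == 1:
                words.append(cand)
                i += n
                break
    return words

tokens = tokenize("sinh viên đại học")  # "university student" split into 2 words
```

Greedy longest matching is known to fail on genuinely ambiguous sequences, which is part of why the paper surveys alternative segmentation methods.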
2014
TUHOI: Trento Universal Human Object Interaction Dataset
Dieu-Thu Le | Jasper Uijlings | Raffaella Bernardi
Proceedings of the Third Workshop on Vision and Language
2013
Exploiting Language Models for Visual Recognition
Dieu-Thu Le | Jasper Uijlings | Raffaella Bernardi
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
2012
Query classification using topic models and support vector machine
Dieu-Thu Le | Raffaella Bernardi
Proceedings of ACL 2012 Student Research Workshop
2011
Query classification via Topic Models for an art image archive
Dieu-Thu Le | Raffaella Bernardi | Ed Vald
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage