2022
Doc2Bot: Accessing Heterogeneous Documents via Conversational Bots
Haomin Fu | Yeqin Zhang | Haiyang Yu | Jian Sun | Fei Huang | Luo Si | Yongbin Li | Cam Tu Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2022
This paper introduces Doc2Bot, a novel dataset for building machines that help users seek information via conversations. This is of particular interest for companies and organizations that own a large number of manuals or instruction books. Despite its potential, the nature of our task poses several challenges: (1) documents contain various structures that hinder the ability of machines to comprehend them, and (2) user information needs are often underspecified. Compared to prior datasets that either focus on a single structural type or overlook the role of questioning to uncover user needs, the Doc2Bot dataset is developed to target such challenges systematically. Our dataset contains over 100,000 turns based on Chinese documents from five domains, larger than any prior document-grounded dialog dataset for information seeking. We propose three tasks in Doc2Bot: (1) dialog state tracking to track user intentions, (2) dialog policy learning to plan system actions and contents, and (3) response generation, which produces responses based on the outputs of the dialog policy. Baseline methods based on the latest deep learning models are presented, indicating that our proposed tasks are challenging and worthy of further research.
2018
Joint learning of frequency and word embeddings for multilingual readability assessment
Dieu-Thu Le | Cam-Tu Nguyen | Xiaoliang Wang
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications
This paper describes two models that employ word frequency embeddings to address readability assessment in multiple languages. The task is to determine the difficulty level of a given document, i.e., how hard it is for a reader to fully comprehend the text. The proposed models show how frequency information can be integrated to improve readability assessment. Experimental results on both English and Chinese datasets show that the proposed models improve the results notably compared to models using only traditional word embeddings.
Dave the debater: a retrieval-based and generative argumentative dialogue agent
Dieu Thu Le | Cam-Tu Nguyen | Kim Anh Nguyen
Proceedings of the 5th Workshop on Argument Mining
In this paper, we explore the problem of developing an argumentative dialogue agent that can discuss controversial topics with human users. We describe two systems that use retrieval-based and generative models to produce argumentative responses to users. The experiments show promising results, even though the systems were trained on a small dataset.
2008
Word Segmentation of Vietnamese Texts: a Comparison of Approaches
Quang Thắng Đinh | Hồng Phương Lê | Thị Minh Huyền Nguyễn | Cẩm Tú Nguyễn | Mathias Rossignol | Xuân Lương Vũ
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We present in this paper a comparison of three segmentation systems for the Vietnamese language. The majority of Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. Identifying word boundaries in a text is therefore not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that the task can be handled relatively well by automatic means, although a solution still needs to be found for out-of-vocabulary words.
2006
Vietnamese Word Segmentation with CRFs and SVMs: An Investigation
Cam-Tu Nguyen | Trung-Kien Nguyen | Xuan-Hieu Phan | Le-Minh Nguyen | Quang-Thuy Ha
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation