2024
DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models
Songbo Hu | Xiaobin Wang | Moy Yuan | Anna Korhonen | Ivan Vulić
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)
We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems, which facilitates systematic evaluations and comparisons between ToD systems based on fine-tuning of Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of Large Language Models (LLMs). In addition to automatic evaluation, this toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both the local utterance level and the global dialogue level, and (ii) a microservice-based backend, improving efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses. However, we also identify significant challenges of LLMs in adherence to task-specific instructions and in generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems, and that it will lower the currently still high entry barriers in the field.
Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems
Songbo Hu | Ivan Vulić | Fangyu Liu | Anna Korhonen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
End-to-end task-oriented dialogue systems are prone to falling into the so-called ‘likelihood trap’, resulting in generated responses which are dull, repetitive, and often inconsistent with the dialogue history. Comparing ranked lists of multiple generated responses against the ‘gold response’ reveals a wide diversity in quality, with many good responses placed lower in the ranked list. The main challenge addressed in this work is how to reach beyond greedily generated system responses, that is, how to obtain and select high-quality responses from the list of overgenerated responses at inference, without access to the gold response. To this end, we propose a simple yet effective reranking method to select high-quality items from the lists of initially overgenerated responses. The idea is to use any sequence-level scoring function to divide the semantic space of responses into high-scoring and low-scoring partitions. At training, the high-scoring partition comprises all generated responses whose similarity to the gold response is higher than the similarity of the greedy response to the gold response. At inference, the aim is to estimate the probability that each overgenerated response belongs to the high-scoring partition. We evaluate the proposed method on the standard MultiWOZ dataset and the BiTOD dataset, complemented by human evaluation.
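The partitioning criterion described in the abstract is simple enough to sketch. Below is a minimal, hypothetical Python illustration, not the authors' actual implementation: any sequence-level similarity function can play the role of the scorer, and all function and variable names (`partition_responses`, `rerank`, `similarity`, `p_high`) are our own assumptions.

```python
from typing import Callable, List, Tuple

def partition_responses(
    overgenerated: List[str],
    greedy_response: str,
    gold_response: str,
    similarity: Callable[[str, str], float],
) -> Tuple[List[str], List[str]]:
    """Training-time split of overgenerated responses.

    A response is 'high-scoring' if it is closer to the gold response
    than the greedily decoded response is -- the criterion stated in
    the abstract.
    """
    threshold = similarity(greedy_response, gold_response)
    high = [r for r in overgenerated if similarity(r, gold_response) > threshold]
    low = [r for r in overgenerated if similarity(r, gold_response) <= threshold]
    return high, low

def rerank(candidates: List[str], p_high: Callable[[str], float]) -> str:
    """Inference-time selection: no gold response is available, so a
    trained scorer estimates each candidate's probability of belonging
    to the high-scoring partition; return the most likely member."""
    return max(candidates, key=p_high)
```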
2023
A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems
Songbo Hu | Han Zhou | Moy Yuan | Milan Gritta | Guchun Zhang | Ignacio Iacobacci | Anna Korhonen | Ivan Vulić
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Achieving robust language technologies that can perform well across the world’s many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist across languages in multilingual task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that performance disparities depend on a number of factors: the nature of the ToD task at hand, the underlying pretrained language model, the target language, and the amount of annotated ToD data. We empirically demonstrate the existence of adaptation and intrinsic biases in current ToD systems: e.g., ToD systems trained for Arabic or Turkish using annotated ToD data fully parallel to English ToD data still exhibit diminished ToD task performance. Beyond providing a series of insights into the performance disparities of ToD systems in different languages, our analyses offer practical tips on how to approach ToD data collection and system development for new languages.
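The exact equivalence measures are defined in the paper itself; as a purely illustrative stand-in, one could express relative equivalence as the fraction of reference-language performance retained in a target language, and absolute disparity as the raw score gap. The sketch below, including its function names and example numbers, is our assumption, not the paper's definitions.

```python
def relative_equivalence(score_target: float, score_reference: float) -> float:
    """Illustrative relative measure: the fraction of reference-language
    (e.g., English) performance retained in the target language.
    A value of 1.0 would indicate full parity."""
    return score_target / score_reference

def absolute_gap(score_target: float, score_reference: float) -> float:
    """Illustrative absolute disparity: the raw performance difference."""
    return score_reference - score_target

# Hypothetical example: a system scoring 62.0 in English and 48.5 in Arabic.
print(relative_equivalence(48.5, 62.0))  # ~0.78 -> 78% of English performance
print(absolute_gap(48.5, 62.0))          # 13.5 points
```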
Can Pretrained Language Models (Yet) Reason Deductively?
Zhangdie Yuan | Songbo Hu | Ivan Vulić | Anna Korhonen | Zaiqiao Meng
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Acquiring factual knowledge with Pretrained Language Models (PLMs) has attracted increasing attention, showing promising performance in many knowledge-intensive tasks. Their good performance has led the community to believe that the models do possess a modicum of reasoning competence rather than merely memorising the knowledge. In this paper, we conduct a comprehensive evaluation of the learnable deductive (also known as explicit) reasoning capability of PLMs. Through a series of controlled experiments, we present two main findings: 1) PLMs inadequately generalise learned logic rules and perform inconsistently under simple adversarial surface-form edits; 2) while deductive reasoning fine-tuning does improve PLMs’ performance on reasoning over unseen knowledge facts, it results in catastrophic forgetting of the previously learnt knowledge. Our main results suggest that PLMs cannot yet perform reliable deductive reasoning, demonstrating the importance of controlled examinations and probing of PLMs’ deductive reasoning abilities; we reach beyond (misleading) task performance, revealing that PLMs are still far from robust reasoning capabilities, even on simple deductive tasks.
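To make the notion of an adversarial surface-form edit concrete, here is a minimal consistency probe of our own construction (not the paper's benchmark), using the Hugging Face `transformers` fill-mask pipeline: a model that truly applies the deductive rule should complete both paraphrases identically.

```python
from transformers import pipeline

# A masked-LM consistency probe (our sketch, not the paper's setup):
# the same deductive inference, expressed with a trivial surface-form
# edit, should yield the same prediction if the model reasons rather
# than pattern-matches.
fill = pipeline("fill-mask", model="bert-base-uncased")

original = "All birds can fly. A sparrow is a bird. Therefore, a sparrow can [MASK]."
edited = "Every bird can fly. A sparrow is a bird. Hence, a sparrow can [MASK]."

for prompt in (original, edited):
    top = fill(prompt)[0]
    print(f"{top['token_str']!r}  (p={top['score']:.3f})")
# Divergent top predictions across the two paraphrases signal the
# brittleness to surface-form edits that the paper reports.
```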
Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems
Songbo Hu | Han Zhou | Mete Hergul | Milan Gritta | Guchun Zhang | Ignacio Iacobacci | Ivan Vulić | Anna Korhonen
Transactions of the Association for Computational Linguistics, Volume 11
Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to address the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages, enabling training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom-up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.
2021
Domain-independent User Simulation with Transformers for Task-oriented Dialogue Systems
Hsien-chin Lin | Nurul Lubis | Songbo Hu | Carel van Niekerk | Christian Geishauser | Michael Heck | Shutong Feng | Milica Gasic
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Dialogue policy optimisation via reinforcement learning requires a large number of training interactions, which makes learning with real users time-consuming and expensive. Many set-ups therefore rely on a user simulator instead of humans. These user simulators have their own problems. While hand-coded, rule-based user simulators have been shown to be sufficient in small, simple domains, for complex domains the number of rules quickly becomes intractable. State-of-the-art data-driven user simulators, on the other hand, are still domain-dependent, meaning that adaptation to each new domain requires redesigning and retraining. In this work, we propose a domain-independent transformer-based user simulator (TUS). The structure of TUS is not tied to a specific domain, enabling domain generalisation and the learning of cross-domain user behaviour from data. We compare TUS with the state of the art using automatic as well as human evaluations. TUS can compete with rule-based user simulators on pre-defined domains and is able to generalise to unseen domains in a zero-shot fashion.
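As a loose architectural sketch of the domain-independence idea (our assumption, not the published TUS model): if every slot in the dialogue state is encoded as a fixed-size, ontology-independent feature vector, then no layer's dimensionality depends on the number of domains or slots, and unseen domains can be handled zero-shot. All layer sizes and names below are illustrative.

```python
import torch
import torch.nn as nn

class SlotLevelUserSimulator(nn.Module):
    """Minimal sketch: one fixed-size feature row per slot, a transformer
    encoder over the slot sequence, and a per-slot head predicting the
    user's next action for that slot."""

    def __init__(self, feature_dim: int = 32, d_model: int = 64, n_actions: int = 6):
        super().__init__()
        self.project = nn.Linear(feature_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, slot_features: torch.Tensor) -> torch.Tensor:
        # slot_features: (batch, n_slots, feature_dim) -- one row per slot,
        # regardless of which domain the slot belongs to.
        h = self.encoder(self.project(slot_features))
        return self.action_head(h)  # per-slot action logits

# Usage: 8 slots from any mix of domains, 32 features each.
sim = SlotLevelUserSimulator()
logits = sim(torch.randn(1, 8, 32))
print(logits.shape)  # torch.Size([1, 8, 6])
```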