Data Augmentation and Learned Layer Aggregation for Improved Multilingual Language Understanding in Dialogue

Evgeniia Razumovskaia, Ivan Vulić, Anna Korhonen


Abstract
Scaling dialogue systems to a multitude of domains, tasks and languages relies on costly and time-consuming data annotation for different domain-task-language configurations. Annotation effort might be substantially reduced by methods that generalise well in zero- and few-shot scenarios, and that also effectively leverage external unannotated data sources (e.g., Web-scale corpora). We propose two methods to this end, offering improved dialogue natural language understanding (NLU) across multiple languages: 1) Multi-SentAugment, and 2) LayerAgg. Multi-SentAugment is a self-training method which augments the available (typically few-shot) training data with similar, automatically labelled in-domain sentences from large monolingual Web-scale corpora. LayerAgg learns to select and combine useful semantic information scattered across different layers of a Transformer model (e.g., mBERT); it is especially suited for zero-shot scenarios, as semantically richer representations should strengthen the model’s cross-lingual capabilities. Applying the two methods to state-of-the-art NLU models yields consistent improvements across two standard multilingual NLU datasets covering 16 diverse languages. The gains are observed in zero-shot, few-shot, and even full-data scenarios. The results also suggest that the two methods achieve a synergistic effect: the best overall performance in few-shot setups is attained when the methods are used together.
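The idea of combining information from several Transformer layers can be illustrated with a small sketch. The code below is not the paper's LayerAgg implementation; it assumes one simple scheme, a softmax-weighted sum of per-layer vectors, and the names `aggregate_layers` and `layer_scores` are hypothetical. In a real model the scores would be learned parameters trained together with the NLU task, and the vectors would come from an encoder such as mBERT rather than from toy lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_vectors, layer_scores):
    """Combine one token's per-layer vectors into a single vector.

    layer_vectors: list of L vectors (lists of floats of equal length),
                   e.g. the token's hidden state at each Transformer layer.
    layer_scores:  list of L scalars; assumed here to stand in for learned
                   parameters (hypothetical, not taken from the paper).
    """
    if len(layer_vectors) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    # Weighted sum over layers, computed independently for each dimension.
    return [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


# Toy example: 3 "layers" with 2-dimensional hidden states.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = aggregate_layers(layers, [0.0, 0.0, 0.0])  # equal weight per layer
peaked = aggregate_layers(layers, [0.0, 0.0, 10.0])  # weight mostly on last layer
```

With equal scores the result is the plain average of the layers; raising one score shifts the output toward that layer, which is the sense in which such a scheme "selects" among layers.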
Anthology ID:
2022.findings-acl.160
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2017–2033
URL:
https://aclanthology.org/2022.findings-acl.160
DOI:
10.18653/v1/2022.findings-acl.160
Cite (ACL):
Evgeniia Razumovskaia, Ivan Vulić, and Anna Korhonen. 2022. Data Augmentation and Learned Layer Aggregation for Improved Multilingual Language Understanding in Dialogue. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2017–2033, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Data Augmentation and Learned Layer Aggregation for Improved Multilingual Language Understanding in Dialogue (Razumovskaia et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-acl.160.pdf
Video:
https://aclanthology.org/2022.findings-acl.160.mp4
Data:
CC100, xSID