Macedon: Minimizing Representation Coding Rate Reduction for Cross-Lingual Natural Language Understanding

Haoyu Wang, Yaqing Wang, Huaxiu Yao, Jing Gao


Abstract
Cross-lingual natural language understanding (NLU) is one of the fundamental tasks of NLP. The goal is to learn a model that generalizes well to both high-resource and low-resource language data. Recent pre-trained multilingual language models, e.g., multilingual BERT and XLM, have shown impressive performance on cross-lingual NLU tasks. However, these promising results require sufficient training data, a condition that is difficult to satisfy for low-resource languages; when data in those languages is limited, the accuracy of existing models drops. In light of this challenge, we investigate the important problem of how to train a cross-lingual model with abundant high-resource language data and limited low-resource language data. Existing methods typically learn language-agnostic representations via adversarial training and mutual information estimation, but they may suffer when data is very limited (e.g., for low-resource languages) because it is challenging to estimate the data distribution accurately. To tackle this issue, we propose a conceptually innovative approach that removes language-associated information by minimizing representation coding rate reduction (Macedon). Specifically, Macedon avoids spending extra codes on language-related information, where code length is measured by the rate-distortion function. To validate the effectiveness of Macedon, we conduct extensive experiments on three tasks: paraphrase identification, natural language inference, and query advertisement matching. The experimental results show that the proposed Macedon outperforms state-of-the-art cross-lingual NLU approaches.
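The abstract's central quantities, the coding rate and its reduction, can be made concrete with a short sketch. The code below is a minimal illustration assuming the standard coding-rate estimate from the rate reduction literature, not the paper's released implementation; the names coding_rate, coding_rate_reduction, eps, and lang_labels are ours.

    import torch

    def coding_rate(Z, eps=0.5):
        # R(Z, eps) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z):
        # an estimate of the number of bits needed to encode the n x d
        # representations Z up to distortion eps (the rate-distortion
        # function referred to in the abstract).
        n, d = Z.shape
        I = torch.eye(d, device=Z.device, dtype=Z.dtype)
        return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z.T @ Z)

    def coding_rate_reduction(Z, lang_labels, eps=0.5):
        # Delta R = R(Z) - sum_j (n_j / n) * R(Z_j), where Z_j holds the
        # representations of language j. Delta R measures the extra coding
        # cost attributable to language-specific structure; driving it
        # toward zero encourages language-agnostic representations.
        n = Z.shape[0]
        reduction = coding_rate(Z, eps)
        for lang in lang_labels.unique():
            mask = lang_labels == lang
            reduction = reduction - (mask.sum() / n) * coding_rate(Z[mask], eps)
        return reduction

In training, a penalty of this form would typically be added to the task objective as a weighted term, e.g. loss = task_loss + lam * coding_rate_reduction(Z, lang_labels), so that the encoder pays no extra coding cost for language identity.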
Anthology ID:
2023.findings-emnlp.829
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12426–12436
URL:
https://aclanthology.org/2023.findings-emnlp.829
DOI:
10.18653/v1/2023.findings-emnlp.829
Cite (ACL):
Haoyu Wang, Yaqing Wang, Huaxiu Yao, and Jing Gao. 2023. Macedon: Minimizing Representation Coding Rate Reduction for Cross-Lingual Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12426–12436, Singapore. Association for Computational Linguistics.
Cite (Informal):
Macedon: Minimizing Representation Coding Rate Reduction for Cross-Lingual Natural Language Understanding (Wang et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.829.pdf