2024
Creating and Evaluating a Multilingual Corpus of UN General Assembly Debates
Hannah Bechara | Krishnamoorthy Manohara | Slava Jankin
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
This paper presents a multilingual aligned corpus of political debates from United Nations (UN) General Assembly sessions held between 1978 and 2021. The corpus covers five of the six official UN languages (Arabic, Chinese, English, French, Russian, and Spanish). We describe the preprocessing steps applied to the corpus. We align sentences by representing the meaning of each sentence numerically with word vectors and then computing the Euclidean distance between these representations. To validate the alignment method, we conducted an evaluation study with crowd-sourced human annotators on Scale AI, an online data-labelling platform. The final dataset consists of around 300,000 aligned sentences for the En-Es, En-Fr, En-Zh and En-Ru language pairs and is publicly available for download.
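
As a rough illustration of the alignment idea described in the abstract, the following Python sketch averages word vectors into a sentence vector and pairs each source sentence with the target sentence at the smallest Euclidean distance. It is a minimal sketch under stated assumptions, not the authors' pipeline: the toy `word_vectors` table, the assumption that both languages share one embedding space, the mean-pooling step, the greedy nearest-neighbour matching, and the optional `max_dist` threshold are all hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical toy embeddings; assumes both languages share one vector space.
word_vectors = {
    "peace": np.array([1.0, 0.0]),
    "paz": np.array([0.9, 0.1]),
    "war": np.array([0.0, 1.0]),
    "guerra": np.array([0.1, 0.9]),
}
DIM = 2


def sentence_vector(tokens, vectors, dim):
    """Mean of the known word vectors (an assumed pooling choice)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def align(src_sents, tgt_sents, vectors, dim, max_dist=None):
    """Greedily pair each source sentence with its nearest target sentence."""
    src = np.array([sentence_vector(s, vectors, dim) for s in src_sents])
    tgt = np.array([sentence_vector(t, vectors, dim) for t in tgt_sents])
    # Pairwise Euclidean distances, shape (len(src_sents), len(tgt_sents)).
    dists = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    pairs = []
    for i, j in enumerate(dists.argmin(axis=1)):
        d = float(dists[i, j])
        # Optionally drop matches whose distance exceeds the threshold.
        if max_dist is None or d <= max_dist:
            pairs.append((i, int(j), d))
    return pairs


english = [["peace"], ["war"]]
spanish = [["guerra"], ["paz"]]
pairs = align(english, spanish, word_vectors, DIM)
for i, j, d in pairs:
    print(english[i], "<->", spanish[j], f"distance={d:.3f}")
```

In this toy setup, "peace" is matched with "paz" and "war" with "guerra". Greedy matching can assign several source sentences to the same target sentence, so a real pipeline would likely need a one-to-one constraint or a distance threshold.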