Creating and Evaluating a Multilingual Corpus of UN General Assembly Debates

Hannah Bechara, Krishnamoorthy Manohara, Slava Jankin


Abstract
This paper presents a multilingual aligned corpus of political debates from the United Nations (UN) General Assembly sessions between 1978 and 2021, which covers five of the six official UN languages (Arabic, Chinese, English, French, Russian, and Spanish). We explain the preprocessing steps we applied to the corpus. We align sentences by using word vectors to represent the meaning of each sentence numerically and then calculating the Euclidean distance between these representations. To validate our alignment methods, we conducted an evaluation study with crowd-sourced human annotators using Scale AI, an online platform for data labelling. The final dataset consists of around 300,000 aligned sentences for En-Es, En-Fr, En-Zh and En-Ru. It is publicly available for download.
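The alignment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the embedding model, the pooling method, or any distance threshold. The sketch assumes that word vectors for both languages live in a shared space, that a sentence vector is the mean of its word vectors, and that each source sentence is paired greedily with its nearest target sentence. The vector values and the names `TOY_VECTORS`, `sentence_vector`, and `align` are hypothetical.

```python
import math

# Hypothetical toy word vectors; a shared cross-lingual space is assumed.
TOY_VECTORS = {
    "peace": [1.0, 0.0, 0.0],
    "paix": [0.9, 0.1, 0.0],
    "war": [0.0, 1.0, 0.0],
    "guerre": [0.1, 0.9, 0.0],
    "climate": [0.0, 0.0, 1.0],
    "climat": [0.0, 0.1, 0.9],
}


def sentence_vector(sentence, vectors, dim=3):
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def align(source_sentences, target_sentences, vectors):
    """Pair each source sentence with the target sentence at the smallest Euclidean distance."""
    target_vectors = [sentence_vector(t, vectors) for t in target_sentences]
    pairs = []
    for i, sentence in enumerate(source_sentences):
        source_vector = sentence_vector(sentence, vectors)
        distances = [math.dist(source_vector, t) for t in target_vectors]
        j = min(range(len(distances)), key=distances.__getitem__)
        pairs.append((i, j, distances[j]))
    return pairs


english = ["we want peace", "no more war", "the climate"]
french = ["le climat", "nous voulons la paix", "plus de guerre"]
pairs = align(english, french, TOY_VECTORS)
for src_index, tgt_index, distance in pairs:
    print(english[src_index], "<->", french[tgt_index], round(distance, 3))
```

In practice, a distance cutoff would probably be needed to discard sentences that have no counterpart in the other language. The abstract does not say whether the authors apply one.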
Anthology ID:
2024.eamt-1.52
Volume:
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
Month:
June
Year:
2024
Address:
Sheffield, UK
Editors:
Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Rachel Bawden, Víctor M Sánchez-Cartagena, Patrick Cadwell, Ekaterina Lapshinova-Koltunski, Vera Cabarrão, Konstantinos Chatzitheodorou, Mary Nurminen, Diptesh Kanojia, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation (EAMT)
Pages:
623–627
URL:
https://aclanthology.org/2024.eamt-1.52
Cite (ACL):
Hannah Bechara, Krishnamoorthy Manohara, and Slava Jankin. 2024. Creating and Evaluating a Multilingual Corpus of UN General Assembly Debates. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), pages 623–627, Sheffield, UK. European Association for Machine Translation (EAMT).
Cite (Informal):
Creating and Evaluating a Multilingual Corpus of UN General Assembly Debates (Bechara et al., EAMT 2024)
PDF:
https://aclanthology.org/2024.eamt-1.52.pdf