Slava Jankin
2024
Creating and Evaluating a Multilingual Corpus of UN General Assembly Debates
Hannah Bechara | Krishnamoorthy Manohara | Slava Jankin
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
This paper presents a multilingual aligned corpus of political debates from the United Nations (UN) General Assembly sessions between 1978 and 2021, which covers five of the six official UN languages (Arabic, Chinese, English, French, Russian, and Spanish). We explain the preprocessing steps we applied to the corpus. We align the sentences by using word vectors to numerically represent the meaning of each sentence and then calculating the Euclidean distance between them. To validate our alignment methods, we conducted an evaluation study with crowd-sourced human annotators using Scale AI, an online platform for data labelling. The final dataset consists of around 300,000 aligned sentences for En-Es, En-Fr, En-Zh and En-Ru. It is publicly available for download.
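The alignment idea in this abstract (represent each sentence as a vector, then compare sentences by Euclidean distance) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy three-dimensional `WORD_VECTORS` table, the averaging of word vectors into a sentence vector, and the nearest-neighbour matching rule are hypothetical and are not taken from the paper, which does not specify these details here.

```python
import numpy as np

# Hypothetical toy cross-lingual word vectors (illustrative values only).
WORD_VECTORS = {
    "peace": np.array([1.0, 0.0, 0.0]),
    "paz": np.array([0.9, 0.1, 0.0]),
    "war": np.array([0.0, 1.0, 0.0]),
    "guerra": np.array([0.1, 0.9, 0.0]),
    "climate": np.array([0.0, 0.0, 1.0]),
    "clima": np.array([0.0, 0.1, 0.9]),
}
DIM = 3


def sentence_vector(sentence):
    """Average the vectors of known words (assumed pooling strategy)."""
    vecs = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return np.zeros(DIM)
    return np.mean(vecs, axis=0)


def align(source_sentences, target_sentences):
    """For each source sentence, return (source_idx, target_idx, distance)
    for the target sentence with the smallest Euclidean distance."""
    src = np.stack([sentence_vector(s) for s in source_sentences])
    tgt = np.stack([sentence_vector(t) for t in target_sentences])
    # Pairwise Euclidean distances, shape (len(src), len(tgt)).
    dists = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    best = dists.argmin(axis=1)
    return [(i, int(j), float(dists[i, j])) for i, j in enumerate(best)]


english = ["peace", "war", "climate"]
spanish = ["clima", "paz", "guerra"]
pairs = align(english, spanish)
```

In this toy setting each English sentence is paired with its closest Spanish sentence; a real pipeline would likely also need a distance threshold to reject poor matches, which is again an assumption rather than something stated in the abstract.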
PolitiCause: An Annotation Scheme and Corpus for Causality in Political Texts
Paulina Garcia Corral | Hannah Bechara | Ran Zhang | Slava Jankin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper, we present PolitiCAUSE, a new corpus of political texts annotated for causality. We provide a detailed and robust annotation scheme for annotating two types of information: (1) whether a sentence contains a causal relation or not, and (2) the spans of text that correspond to the cause and effect components of the causal relation. We also provide statistics and analysis of the corpus, and outline the difficulties and limitations of the task. Finally, we test two transformer-based classification models on our dataset as a form of evaluation. The models achieve moderate performance on the dataset, with an MCC score of 0.62. Our results show that PolitiCAUSE is a valuable resource for studying causality in texts, especially in the domain of political discourse, and that there is still room for improvement in developing more accurate and robust methods for this problem.
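For readers unfamiliar with the metric reported above, the Matthews correlation coefficient (MCC) for binary classification is computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a plain-Python illustration; the confusion-matrix counts are hypothetical and are not results from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts for a causal / non-causal sentence classifier.
score = mcc(tp=40, tn=45, fp=5, fn=10)
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which is why it is often preferred over accuracy when the classes are imbalanced.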