SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification

Cristian Onose; Dumitru-Clementin Cercel; Stefan Trausan-Matu

doi:10.18653/v1/W19-1418

Abstract

This paper describes our models for the Moldavian vs. Romanian Cross-Dialect Topic Identification (MRC) evaluation campaign, part of the VarDial 2019 workshop. We address the three MRC subtasks: binary classification between the Moldavian (MD) and Romanian (RO) dialects, and two cross-dialect multi-class classification tasks over six news topics, MD to RO and RO to MD. We propose several deep learning models based on Long Short-Term Memory (LSTM) cells, Bidirectional Gated Recurrent Units (BiGRU), and Hierarchical Attention Networks (HAN). We also employ three word embedding models to represent the text as low-dimensional vectors. Our official submission includes two runs of the BiGRU and HAN models for each of the three subtasks. The best submitted model obtained the following macro-averaged F1 scores: 0.708 for subtask 1, 0.481 for subtask 2, and 0.480 for subtask 3. Due to a read error caused by the quoting behaviour when parsing the test file, our final submissions contained fewer items than expected, and more than 50% of the submission files were corrupted. We therefore also present the results obtained with the corrected labels, for which the HAN model achieves the following scores: 0.930 for subtask 1, 0.590 for subtask 2, and 0.687 for subtask 3.
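The abstract does not say which tool or settings produced the read error, so the following is only a sketch of one common way such a problem arises, not a reconstruction of the authors' pipeline. It assumes a tab-separated file of `id<TAB>text` lines (the sample sentences and the helper `count_rows` are hypothetical) and uses Python's standard `csv` module. A field that begins with a double quote is read as a quoted field, so the parser swallows the following lines until it meets another quote, and the file yields fewer rows than it has lines. Disabling quote handling restores one row per line.

```python
import csv
import io

# Hypothetical sample: four tab-separated lines, two of which have a text
# field that starts with a double quote (common in news text).
RAW = (
    '1\t"Prima stire incepe cu ghilimele\n'
    '2\tA doua stire\n'
    '3\t"A treia" stire\n'
    '4\tA patra stire\n'
)


def count_rows(raw: str, quoting: int) -> int:
    """Parse `raw` as tab-separated text and return the number of rows read."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=quoting)
    return sum(1 for _ in reader)


# Default quote handling: the quote opening line 1 is only closed on line 3,
# so lines 1 to 3 collapse into a single row.
rows_default = count_rows(RAW, csv.QUOTE_MINIMAL)

# Quote characters treated as ordinary text: one row per input line.
rows_literal = count_rows(RAW, csv.QUOTE_NONE)

print(rows_default, rows_literal)
```

Comparing the parsed row count against the number of lines in the file is a cheap sanity check that would flag this kind of mismatch before predictions are written out.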

Anthology ID: W19-1418
Volume: Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Month: June
Year: 2019
Address: Ann Arbor, Michigan
Editors: Marcos Zampieri, Preslav Nakov, Shervin Malmasi, Nikola Ljubešić, Jörg Tiedemann, Ahmed Ali
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 172–177
URL: https://aclanthology.org/W19-1418/
DOI: 10.18653/v1/W19-1418
Cite (ACL): Cristian Onose, Dumitru-Clementin Cercel, and Stefan Trausan-Matu. 2019. SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 172–177, Ann Arbor, Michigan. Association for Computational Linguistics.
Cite (Informal): SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification (Onose et al., VarDial 2019)
PDF: https://aclanthology.org/W19-1418.pdf
