Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus

Daniela Trotta; Raffaele Guarasci; Elisa Leonardelli; Sara Tonelli

doi:10.18653/v1/2021.findings-emnlp.250

Abstract

The development of automated approaches to linguistic acceptability has been greatly fostered by the availability of the English CoLA corpus, which has also been included in the widely used GLUE benchmark. However, this kind of research for languages other than English, as well as the analysis of cross-lingual approaches, has been hindered by the lack of resources with a comparable size in other languages. We have therefore developed the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments, which has been created following the same approach and the same steps as the English one. In this paper we describe the corpus creation, we detail its content, and we present the first experiments on this new resource. We compare in-domain and out-of-domain classification, and perform a specific evaluation of nine linguistic phenomena. We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformer-based approaches can benefit from using sentences in two languages during fine-tuning.

Anthology ID: 2021.findings-emnlp.250
Volume: Findings of the Association for Computational Linguistics: EMNLP 2021
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: Findings
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 2929–2940
URL: https://aclanthology.org/2021.findings-emnlp.250/
DOI: 10.18653/v1/2021.findings-emnlp.250
Cite (ACL): Daniela Trotta, Raffaele Guarasci, Elisa Leonardelli, and Sara Tonelli. 2021. Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2929–2940, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus (Trotta et al., Findings 2021)
PDF: https://aclanthology.org/2021.findings-emnlp.250.pdf
Video: https://aclanthology.org/2021.findings-emnlp.250.mp4
Code: dhfbk/itacola-dataset
Data: ItaCoLA, CoLA, GLUE
