Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora

Taja Kuzman; Peter Rupnik; Nikola Ljubešić

doi:10.18653/v1/2023.vardial-1.9

Abstract

Collecting texts from the web enables rapid creation of monolingual and parallel corpora of unprecedented size. However, unlike with manually collected corpora, authors and end users do not know which texts make up the web collections. In this work, we examine the content of seven European parallel web corpora, collected from national top-level domains, by analysing the distribution of English varieties and genres in them. We develop and provide a lexicon-based British-American variety classifier, which we use to identify the English variety. In addition, we apply a Transformer-based genre classifier to the corpora to analyse genre distribution and the interplay between genres and English varieties. The results reveal significant differences among the seven corpora in genre distribution and in preference for English varieties.

Anthology ID: 2023.vardial-1.9
Volume: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 91–103
URL: https://aclanthology.org/2023.vardial-1.9/
DOI: 10.18653/v1/2023.vardial-1.9
Cite (ACL): Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2023. Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 91–103, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora (Kuzman et al., VarDial 2023)
PDF: https://aclanthology.org/2023.vardial-1.9.pdf
Video: https://aclanthology.org/2023.vardial-1.9.mp4
