David P. Corina
2020
Resources in Underrepresented Languages: Building a Representative Romanian Corpus
Ludmila Midrigan-Ciochina | Victoria Boyd | Lucila Sanchez-Ortega | Diana Malancea-Malac | Doina Midrigan | David P. Corina
Proceedings of the Twelfth Language Resources and Evaluation Conference
Efforts in linguistics to develop theories that explain language-dependent effects on language processing are greatly facilitated by the availability of reliable resources representing different languages. This project presents a detailed description of the process of creating a large and representative Romanian corpus that can serve as a reliable language resource for linguistic studies. Romanian is a relatively under-resourced language with unique structural and typological characteristics. The decisions that guided the construction of the corpus, including the type of corpus, its size, and its component resource files, are discussed. Issues related to data collection, data organization and storage, and the characteristics of the data included in the corpus are also described. Currently, the corpus has approximately 5,500,000 tokens originating from written text and 100,000 tokens of spoken language. It includes language samples that represent a wide variety of registers (16 registers of written language and 5 registers of spoken language), as well as different authors and speakers.