Carole Werner
2024
Managing Fine-grained Metadata for Text Bases in Extremely Low Resource Languages: The Cases of Two Regional Languages of France
Marianne Vergez-Couret
|
Delphine Bernhard
|
Michael Nauge
|
Myriam Bras
|
Pablo Ruiz Fabo
|
Carole Werner
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Metadata are key components of language resources and facilitate their exploitation and re-use. Their creation is a labour intensive process and requires a modeling step, which identifies resource-specific information as well as standards and controlled vocabularies that can be reused. In this article, we focus on metadata for documenting text bases for regional languages of France characterised by several levels of variation (space, time, usage, social status), based on a survey of existing metadata schema. Moreover, we implement our metadata model as a database structure for the Heurist data management system, which combines both the ease of use of spreadsheets and the ability to model complex relationships between entities of relational databases. The Heurist template is made freely available and was used to describe metadata for text bases in Alsatian and Poitevin-Santongeais. We also propose tools to automatically generate XML metadata headers files from the database.
The ParCoLab Parallel Corpus and Its Extension to Four Regional Languages of France
Dejan Stosic
|
Saša Marjanović
|
Delphine Bernhard
|
Xavier Bach
|
Myriam Bras
|
Laurent Kevers
|
Stella Retali-Medori
|
Marianne Vergez-Couret
|
Carole Werner
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Parallel corpora are still scarce for most of the world’s language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its onset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.
Search
Co-authors
- Marianne Vergez-Couret 2
- Delphine Bernhard 2
- Myriam Bras 2
- Michael Nauge 1
- Pablo Ruiz Fabo 1
- show all...