Nikolaos Kokkas


2023

ASR pipeline for low-resourced languages: A case study on Pomak
Chara Tsoukala | Kosmas Kritsis | Ioannis Douros | Athanasios Katsamanis | Nikolaos Kokkas | Vasileios Arampatzakis | Vasileios Sevetlidis | Stella Markantonatou | George Pavlidis
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

Automatic Speech Recognition (ASR) models can aid field linguists by facilitating the creation of text corpora from oral material. Training ASR systems for low-resource languages can be a challenging task not only due to lack of resources but also due to the work required for the preparation of a training dataset. We present a pipeline for data processing and ASR model training for low-resourced languages, based on the language family. As a case study, we collected recordings of Pomak, an endangered South East Slavic language variety spoken in Greece. Using the proposed pipeline, we trained the first Pomak ASR model.

2022

Morphologically annotated corpora of Pomak
Ritván Jusúf Karahóǧa | Panagiotis G. Krimpas | Vivian Stamou | Vasileios Arampatzakis | Dimitrios Karamatskos | Vasileios Sevetlidis | Nikolaos Constantinides | Nikolaos Kokkas | George Pavlidis | Stella Markantonatou
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

The project XXXX is developing a platform to enable researchers of living languages to easily create and make available state-of-the-art spoken and textual annotated resources. As a case study, we use Greek and Pomak, the latter being an endangered oral Slavic language of the Balkans (including Thrace/Greece). The linguistic documentation of Pomak is ongoing work by an interdisciplinary team in close cooperation with the Pomak community of Greece. We describe our experience in the development of a Latin-based orthography and morphologically annotated text corpora of Pomak with state-of-the-art NLP technology. These resources will be made openly available on the XXXX site, and the gold annotated corpora of Pomak will be made available on the Universal Dependencies treebank repository.