Workshop on NLP Applications to Field Linguistics (2024)


Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (Field Matters 2024)
Oleg Serikov | Ekaterina Voloshina | Anna Postnikova | Saliha Muradoglu | Eric Le Ferrand | Elena Klyachko | Ekaterina Vylomova | Tatiana Shavrina | Francis Tyers

The Parallel Corpus of Russian and Ruska Romani Languages
Kirill Koncha | Abina Kukanova | Kazakova Tatiana | Gloria Rozovskaya

The paper presents a parallel corpus for the Ruska Romani dialect and the Russian language. Ruska Romani is the dialect of Romani spoken by the Ruska Roma, the largest subgroup of Romani people in Russia. The corpus contains translations of Russian literature into the Ruska Romani dialect. Corpus creation involved manually aligning a small part of the translations with the original works, fine-tuning a language model on the aligned pairs, and using the fine-tuned model to align the remaining data. Ruska Romani sentences were annotated with a morphological analyzer, with rules crafted for proper nouns and borrowings. The corpus is available in JSON and Russian National Corpus XML formats. It includes 88,742 Russian tokens and 84,635 Ruska Romani tokens, 74,291 of which were grammatically annotated. The corpus can be used for linguistic research, including comparative and diachronic studies, bilingual dictionary creation, stylometry research, and NLP/MT tool development for Ruska Romani.
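For context, the semi-automatic alignment step described above can be illustrated with a generic embedding-based approach: encode source and target sentences with a multilingual sentence encoder and pair each Russian sentence with its most similar Ruska Romani candidate. This is only a sketch, not the authors' pipeline; the LaBSE checkpoint and the placeholder sentences are assumptions.

    # Illustrative sentence-alignment sketch (assumed encoder; placeholder data).
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed encoder choice

    russian_sents = ["<Russian sentence 1>", "<Russian sentence 2>"]          # placeholders
    romani_sents = ["<Ruska Romani sentence A>", "<Ruska Romani sentence B>"]  # placeholders

    emb_ru = model.encode(russian_sents, convert_to_tensor=True)
    emb_rr = model.encode(romani_sents, convert_to_tensor=True)

    # Cosine-similarity matrix over all source/target pairs; a greedy argmax per
    # Russian sentence gives candidate alignments to be thresholded or hand-checked.
    sim = util.cos_sim(emb_ru, emb_rr)
    for i, row in enumerate(sim):
        j = int(row.argmax())
        print(russian_sents[i], "<->", romani_sents[j], f"(score={float(row[j]):.2f})")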

ManWav: The First Manchu ASR Model
Jean Seo | Minha Kang | SungJoo Byun | Sangah Lee

This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high resource and extremely low resource languages, with a particular focus on Manchu, a severely endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. The results of the first Manchu ASR is promising, especially when trained with our augmented data. Wav2Vec2-XLSR-53 fine-tuned with augmented data demonstrates a 0.02 drop in CER and 0.13 drop in WER compared to the same base model fine-tuned with original data.
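Character and word error rates of the kind reported above are typically computed by scoring ASR hypotheses against gold transcriptions. A minimal sketch of that evaluation step, assuming the jiwer library and placeholder transcripts (not the paper's evaluation code):

    # Rough sketch of CER/WER scoring for an ASR system (placeholder transcripts).
    import jiwer

    references = ["<gold Manchu transcription 1>", "<gold Manchu transcription 2>"]
    hypotheses = ["<ASR output 1>", "<ASR output 2>"]

    wer = jiwer.wer(references, hypotheses)   # word error rate over the whole set
    cer = jiwer.cer(references, hypotheses)   # character error rate
    print(f"WER={wer:.3f}  CER={cer:.3f}")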

User-Centered Design of Digital Tools for Sociolinguistic Studies in Under-Resourced Languages
Jonas Adler | Carsten Scholle | Daniel Buschek | Nicolo’ Brandizzi | Muhadj Adnan

Investigating language variation is a core aspect of sociolinguistics, especially through the use of linguistic corpora. Collecting and analyzing spoken language in text-based corpora can be time-consuming and error-prone, especially for under-resourced languages with limited software assistance. This paper explores the language variation research process using a User-Centered Design (UCD) approach from the field of Human-Computer Interaction (HCI), offering guidelines for the development of digital tools for sociolinguists. We interviewed four researchers, observed their workflows and software usage, and analyzed the data using Grounded Theory. This revealed key challenges in manual tasks, software assistance, and data management. Based on these insights, we identified a set of requirements that future tools should meet to be valuable for researchers in this domain. The paper concludes by proposing design concepts with sketches and prototypes based on the identified requirements. These concepts aim to guide the implementation of a fully functional, open-source tool. This work presents an interdisciplinary approach between sociolinguistics and HCI by emphasizing the practical aspects of research that are often overlooked.

Documenting Endangered Languages with LangDoc: A Wordlist-Based System and A Case Study on Moklen
Piyapath Spencer

Language documentation, especially for languages lacking standardised writing systems, is a laborious and time-consuming process. This paper introduces LangDoc, a comprehensive system designed to address these challenges and improve the efficiency and accuracy of language documentation projects. LangDoc offers several features, including tools for managing, recording, and reviewing the collected data. It operates both online and offline, which is crucial for fieldwork in remote locations. The paper also presents a comparative analysis demonstrating LangDoc’s efficiency compared to other methods. A case study of the Moklen language documentation project demonstrates how these features address the specific challenges of working with endangered languages and remote communities. Future development areas include integration with NLP tools for advanced linguistic analysis, emphasising the system’s potential to support the preservation of language diversity.

Leveraging Deep Learning to Shed Light on Tones of an Endangered Language: A Case Study of Moklen
Sireemas Maspong | Francesco Burroni | Teerawee Sukanchanon | Warunsiri Pornpottanamas | Pittayawat Pittayaporn

Moklen, a tonal Austronesian language spoken in Thailand, exhibits two tones with unbalanced distributions. We employed machine learning techniques for time-series classification to investigate its acoustic properties. Our analysis reveals that a synergy between pitch and vowel quality is crucial for tone distinction, as the model trained with these features achieved the highest accuracy.
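The abstract describes time-series classification over pitch and vowel-quality features, with the two-tone distributional imbalance in mind. A minimal sketch of that kind of setup, assuming scikit-learn, synthetic placeholder trajectories, and a random-forest classifier (the paper's actual model and feature extraction may differ):

    # Illustrative tone-classification sketch: per-syllable F0 and formant
    # trajectories flattened into one feature vector per item (placeholder data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_items, n_frames = 200, 20                      # placeholder dataset size
    f0 = rng.normal(200, 30, (n_items, n_frames))    # placeholder pitch contours (Hz)
    f1 = rng.normal(500, 80, (n_items, n_frames))    # placeholder first-formant tracks
    f2 = rng.normal(1500, 200, (n_items, n_frames))  # placeholder second-formant tracks
    X = np.hstack([f0, f1, f2])                      # pitch + vowel-quality cues together
    y = rng.integers(0, 2, n_items)                  # two tone categories

    # class_weight="balanced" compensates for the unbalanced tone distribution.
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
    print("mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())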

A Comparative Analysis of Speaker Diarization Models: Creating a Dataset for German Dialectal Speech
Lea Fischbach

Speaker diarization is a critical task in the field of computer science, aiming to assign timestamps and speaker labels to audio segments. The aim of the tests in this publication is to find a pretrained speaker diarization pipeline capable of distinguishing dialectal speakers from each other and from the explorer (interviewer). To achieve this, three pipelines, namely Pyannote, CLEAVER, and NeMo, are tested and compared across various segmentation and parameterization strategies. The study considers multiple scenarios, such as the impact of threshold values, overlap handling, and minimum duration parameters on classification accuracy. Additionally, this study aims to create a dataset for German dialect identification (DID) based on the findings from this research.
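For context, a minimal sketch of running one of the compared pipelines (pyannote) on a two-speaker dialect recording. The checkpoint name, access token, and file name are assumptions, and the paper's actual parameterization (thresholds, overlap handling, minimum durations) is not reproduced here.

    # Sketch of off-the-shelf diarization with pyannote (assumed checkpoint/paths).
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",  # assumed pretrained diarization pipeline
        use_auth_token="HF_TOKEN",           # Hugging Face access token required
    )
    diarization = pipeline("interview.wav", num_speakers=2)  # e.g. speaker + explorer

    # Timestamps and speaker labels for each detected turn.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")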

Noise Be Gone: Does Speech Enhancement Distort Linguistic Nuances?
Iñigo Parra

This study evaluates the impact of speech enhancement (SE) techniques on linguistic research, focusing on their ability to maintain essential acoustic characteristics in enhanced audio without introducing significant artifacts. Through a sociophonetic analysis of Peninsular and Peruvian Spanish speakers, using both original and enhanced recordings, we demonstrate that SE effectively preserves critical speech nuances such as voicing and vowel quality. This supports the use of SE in improving the quality of speech samples. This study marks an initial effort to assess SE’s reliability in language studies and proposes a methodology for enhancing low-quality audio corpora of under-resourced languages.
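The comparison described here rests on measuring the same acoustic cues (voicing/pitch and vowel formants) in original and enhanced recordings. A rough sketch of such a before/after measurement, assuming the parselmouth Praat wrapper and placeholder file names and time points (not the paper's analysis scripts):

    # Sketch: pitch and formant measurements at a vowel midpoint, before and after
    # speech enhancement (placeholder file names and time point).
    import parselmouth
    from parselmouth.praat import call

    def midpoint_measures(path, t):
        snd = parselmouth.Sound(path)
        pitch = snd.to_pitch()
        formants = snd.to_formant_burg()
        f0 = call(pitch, "Get value at time", t, "Hertz", "linear")        # voicing/pitch cue
        f1 = call(formants, "Get value at time", 1, t, "hertz", "linear")  # vowel height cue
        f2 = call(formants, "Get value at time", 2, t, "hertz", "linear")  # vowel frontness cue
        return f0, f1, f2

    t_mid = 0.45  # placeholder vowel midpoint (seconds)
    print("original:", midpoint_measures("original.wav", t_mid))
    print("enhanced:", midpoint_measures("enhanced.wav", t_mid))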

Comparing Kaldi-Based Pipeline Elpis and Whisper for Čakavian Transcription
Austin Jones | Shulin Zhang | John Hale | Margaret Renwick | Zvjezdana Vrzic | Keith Langston

Automatic speech recognition (ASR) has the potential to accelerate the documentation of endangered languages, but the dearth of resources poses a major obstacle. Čakavian, an endangered variety spoken primarily in Croatia, is a case in point, lacking transcription tools that could aid documentation efforts. We compare training a new ASR model on a limited dataset using the Kaldi-based ASR pipeline Elpis to using the same dataset to adapt the transformer-based pretrained multilingual model Whisper, to determine which is more practical in the documentation context. Results show that Whisper outperformed Elpis, achieving the lowest average Word Error Rate (WER) of 57.3% and median WER of 35.48%. While Elpis offers a less computationally expensive model and friendlier user experience, Whisper appears better at adapting to our collected Čakavian data.
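As a point of reference, the Whisper side of such a comparison can be sketched as transcribing a recording with a pretrained checkpoint and scoring it against a gold transcript; the model size, file names, language code, and the jiwer metric below are assumptions, not the authors' adaptation recipe.

    # Sketch: pretrained Whisper transcription scored with WER (placeholder files).
    import whisper
    import jiwer

    model = whisper.load_model("small")
    result = model.transcribe("cakavian_clip.wav", language="hr")  # closest supported code
    hypothesis = result["text"]

    reference = "<gold Čakavian transcription>"
    print("WER:", jiwer.wer(reference, hypothesis))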

Zero-shot Cross-lingual POS Tagging for Filipino
Jimson Layacan | Isaiah Edri W. Flores | Katrina Tan | Ma. Regina E. Estuar | Jann Montalan | Marlene M. De Leon

Supervised learning approaches in NLP, exemplified by POS tagging, rely heavily on the presence of large amounts of annotated data. However, acquiring such data often requires significant resources and incurs high costs. In this work, we explore zero-shot cross-lingual transfer learning to address data scarcity in Filipino POS tagging, focusing in particular on optimizing source language selection. Our zero-shot approach demonstrates superior performance compared to previous studies, with the top-performing fine-tuned PLMs achieving F1 scores as high as 79.10%. The analysis reveals moderate correlations between cross-lingual transfer performance and specific linguistic distances (featural, inventory, and syntactic), suggesting that source languages closer to Filipino along these dimensions provide better results. We identify tokenizer optimization as a key challenge, as PLM tokenization sometimes fails to align with meaningful representations, thus hindering POS tagging performance.
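The zero-shot setup described above amounts to fine-tuning a multilingual encoder for POS tagging on a source-language treebank and applying it directly to Filipino text with no Filipino training data. A minimal sketch of the inference step, assuming the Hugging Face transformers pipeline; the checkpoint name is a hypothetical placeholder, not a model from the paper.

    # Sketch: applying a source-language-fine-tuned multilingual POS tagger to Filipino.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model="<multilingual-PLM-finetuned-on-source-language-POS>",  # placeholder checkpoint
        aggregation_strategy="simple",  # merge subword pieces back into word-level spans
    )

    # "The children are playing in the park."
    for item in tagger("Naglalaro ang mga bata sa parke."):
        print(item["word"], item["entity_group"], round(item["score"], 2))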