Jordan Lachler

2025

Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages
Jordan Lachler | Godfred Agyapong | Antti Arppe | Sarah Moeller | Aditi Chaudhary | Shruti Rijhwani | Daisy Rosenblum
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

2023

pdf bib abs

Strengthening Relationships Between Indigenous Communities, Documentary Linguists, and Computational Linguists in the Era of NLP-Assisted Language Revitalization
Darren Flavelle | Jordan Lachler
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)

As the global crisis of language endangerment deepens, Indigenous communities have continued to seek new means of preserving, promoting and passing on their languages to future generations. For many communities, modern language technology holds the promise of accelerating that process. However, the cultural and disciplinary divides between documentary linguists, computational linguists and Indigenous communities have posed an on-going challenge for the development and deployment of NLP applications that can support the documentation and revitalization of Indigenous languages. In this paper, we discuss the main barriers to collaboration that these groups have encountered, as well as some notable initiatives in recent years to bring the groups closer together. We follow this with specific recommendations to build upon those efforts, calling for increased opportunities for awareness-building and skills-training in computational linguistics, tailored to the specific needs of both documentary linguists and Indigenous community members. We see this as an essential step as we move forward into an era of NLP-assisted language revitalization.

We are presenting our work on the creation of the first optical character recognition (OCR) model for Northern Haida, also known as Masset or Xaad Kil, a nearly extinct First Nations language spoken in the Haida Gwaii archipelago in British Columbia, Canada. We are addressing the challenges of training an OCR model for a language with an extensive, non-standard Latin character set as follows: (1) We have compared various training approaches and present the results of practical analyses to maximize recognition accuracy and minimize manual labor. An approach using just one or two pages of Source Images directly performed better than the Image Generation approach, and better than models based on three or more pages. Analyses also suggest that a character’s frequency is directly correlated with its recognition accuracy. (2) We present an overview of current OCR accuracy analysis tools available. (3) We have ported the once de-facto standardized OCR accuracy tools to be able to cope with Unicode input. Our work adds to a growing body of research on OCR for particularly challenging character sets, and contributes to creating the largest electronic corpus for this severely endangered language.

2014

pdf bib

Venues

Fix author

Jordan Lachler

2025

2023

2022

2021

2019

2018

2017

2016

2014

Co-authors

Venues