Christopher M. Homan
Also published as: Christopher M Homan
2026
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Mamadou K. Keita | Sebastien Diarra | Christopher M Homan | Seydou Diallo
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Mamadou K. Keita | Sebastien Diarra | Christopher M Homan | Seydou Diallo
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: **ZarmaInstruct-50k**, **BambaraInstruct-50k**, and **FulfuldeInstruct-50k**.
How Many Ratings per Item are Necessary for Reliable Significance Testing?
Christopher M Homan | Flip Korn | Deepak Pandita | Chris Welty
Findings of the Association for Computational Linguistics: EACL 2026
Christopher M Homan | Flip Korn | Deepak Pandita | Chris Welty
Findings of the Association for Computational Linguistics: EACL 2026
A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, “gold standard” data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI—along with strong evidence that humans are unreliable judges—estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.
Grammatical Error Correction for Low-Resource Languages: The Case of Zarma
Mamadou K. Keita | Marcos Zampieri | Christopher M Homan | Adwoa Asantewaa Bremang | Dennis Asamoah Owusu | Huy Le
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Mamadou K. Keita | Marcos Zampieri | Christopher M Homan | Adwoa Asantewaa Bremang | Dennis Asamoah Owusu | Huy Le
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Grammatical error correction (GEC) aims to improve text quality and readability. Previous work on the task focused primarily on high-resource languages, while low-resource languages lack robust tools. To address this shortcoming, we present a study on GEC for Zarma, a language spoken by over five million people in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models (LLMs). We evaluated GEC models using a dataset of more than 250,000 examples, including synthetic and human-annotated data. Our results showed that the MT-based approach using M2M100 outperforms others, with a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations (AE) and an average score of 3.0 out of 5.0 in manual evaluation (ME) from native speakers for grammar and logical corrections. The rule-based method was effective for spelling errors but failed on complex context-level errors. LLMs—Gemma 2b and MT5-small—showed moderate performance. Our work supports use of MT models to enhance GEC in low-resource settings, and we validated these results with Bambara, another West African language.
2025
Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech
Jonathan Pofcher | Christopher M Homan | Randall Sell | Ashiqur R. KhudaBukhsh
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Jonathan Pofcher | Christopher M Homan | Randall Sell | Ashiqur R. KhudaBukhsh
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper makes three contributions. First, via a substantial corpus of 1,419,047 comments posted on 3,161 YouTube news videos of major US cable news outlets, we analyze how users engage with LGBTQ+ news content. Our analyses focus both on positive and negative content. In particular, we construct a hope speech classifier that detects positive (hope speech), negative, neutral, and irrelevant content. Second, in consultation with a public health expert specializing on LGBTQ+ health, we conduct an annotation study with a balanced and diverse political representation and release a dataset of 3,750 instances with crowd-sourced labels and detailed annotator demographic information. Finally, beyond providing a vital resource for the LGBTQ+ community, our annotation study and subsequent in-the-wild assessments reveal (1) strong association between rater political beliefs and how they rate content relevant to a marginalized community, (2) models trained on individual political beliefs exhibit considerable in-the-wild disagreement, and (3) zero-shot large language models (LLMs) align more with liberal raters.
GAIfE: Using GenAI to Improve Literacy in Low-resourced Settings
Allahsera Auguste Tapo | Nouhoum Coulibaly | Seydou Diallo | Sebastien Diarra | Christopher M Homan | Mamadou K. Keita | Michael Leventhal
Findings of the Association for Computational Linguistics: NAACL 2025
Allahsera Auguste Tapo | Nouhoum Coulibaly | Seydou Diallo | Sebastien Diarra | Christopher M Homan | Mamadou K. Keita | Michael Leventhal
Findings of the Association for Computational Linguistics: NAACL 2025
Illiteracy is a predictor of many negative social and personal outcomes. Illiteracy rates are particularly high in countries with underresourced languages, where few books exist that are suitable for children to learn to read from. We present GAIfE (Generative AI for Education), a toolchain and workflow developed through empirical methods, that demonstrates how existing tools can be adapted to address low literacy for an underresourced language. We used GAIfE (a play on the Bambara word for “book”) to construct materials for developing children’s reading competence in Bambara, the vehicular language of Mali. Our approach to the generation and post-generation editing of content skewed by the Global-North-centric bias of available LLMs, enabled us to rapidly multiply the content in Bambara available online by 10 times while maintaining high standards of attractiveness of the material to maintain high engagement, accurate representation of the Malian culture and physical and social environment and language quality. Using our materials, pilot reading programs achieved a 67% reduction in the number of children unable to read Bambara. Our approach demonstrated the power of bias-aware application of generative AI to the problem domain as well as the potential impact the application of this technology could have on reducing illiteracy and improving learning outcomes through native language education.
Bayelemabaga: Creating Resources for Bambara NLP
Allahsera Auguste Tapo | Kevin Assogba | Christopher M Homan | M. Mustafa Rafique | Marcos Zampieri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Allahsera Auguste Tapo | Kevin Assogba | Christopher M Homan | M. Mustafa Rafique | Marcos Zampieri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Data curation for under-resource languages enables the development of more accurate and culturally sensitive natural language processing models. However, the scarcity of well-structured multilingual datasets remains a challenge for advancing machine translation in these languages, especially for African languages. This paper focuses on creating high-quality parallel corpora that capture linguistic diversity to address this gap. We introduce Bayelemabaga, the most extensive curated multilingual dataset for machine translation in the Bambara language, the vehicular language of Mali. The dataset consists of 47K Bambara-French parallel sentences curated from 231 data sources, including short stories, formal documents, and religious literature, combining modern, historical, and indigenous languages. We present our data curation process and analyze its impact on neural machine translation by fine-tuning seven commonly used transformer-based language models, i.e., MBART, MT5, M2M-100, NLLB-200, Mistral-7B, Open-Llama-7B, and Meta-Llama3-8B on Bayelemabaga. Our evaluation on four Bambara-French language pair datasets (three existing datasets and the test set of Bayelemabaga) show up to +4.5, +11.4, and +0.27 in gains, respectively, on BLEU, CHRF++, and AfriCOMET evaluation metrics. We also conducted machine and human evaluations of translations from studied models to compare the machine translation quality of encoder-decoder and decoder-only models. Our results indicate that encoder-decoder models remain the best, highlighting the importance of additional datasets to train decoder-only models.
Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
Shanilka Haturusinghe | Tharindu Cyril Weerasooriya | Marcos Zampieri | Christopher M. Homan | S.R. Liyanage
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Shanilka Haturusinghe | Tharindu Cyril Weerasooriya | Marcos Zampieri | Christopher M. Homan | S.R. Liyanage
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: “Subasa-XLM-R”, which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction. Two variants of “Subasa-Llama” and “Subasa-Mistral”, are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84) surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo
Mandira Sawkar | Samay U. Shetty | Deepak Pandita | Tharindu Cyril Weerasooriya | Christopher M. Homan
Proceedings of the The 4th Workshop on Perspectivist Approaches to NLP
Mandira Sawkar | Samay U. Shetty | Deepak Pandita | Tharindu Cyril Weerasooriya | Christopher M. Homan
Proceedings of the The 4th Workshop on Perspectivist Approaches to NLP
The Learning With Disagreements (LeWiDi) 2025 shared task aims to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, which focuses on modeling individual annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend DisCo by introducing annotator metadata embeddings, enhancing input representations, and multi-objective training losses to capture disagreement patterns better. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth calibration and error analyses that reveal when and why disagreement-aware modeling improves. Our findings show that disagreement can be better captured by conditioning on annotator demographics and by optimizing directly for distributional metrics, yielding consistent improvements across datasets.
2024
Rater Cohesion and Quality from a Vicarious Perspective
Deepak Pandita | Tharindu Cyril Weerasooriya | Sujan Dutta | Sarah K. Luger | Tharindu Ranasinghe | Ashiqur R. KhudaBukhsh | Marcos Zampieri | Christopher M. Homan
Findings of the Association for Computational Linguistics: EMNLP 2024
Deepak Pandita | Tharindu Cyril Weerasooriya | Sujan Dutta | Sarah K. Luger | Tharindu Ranasinghe | Ashiqur R. KhudaBukhsh | Marcos Zampieri | Christopher M. Homan
Findings of the Association for Computational Linguistics: EMNLP 2024
Human feedback is essential for building human-centered AI systems across domains where disagreement is prevalent, such as AI safety, content moderation, or sentiment analysis. Many disagreements, particularly in politically charged settings, arise because raters have opposing values or beliefs. Vicarious annotation is a method for breaking down disagreement by asking raters how they think others would annotate the data. In this paper, we explore the use of vicarious annotation with analytical methods for moderating rater disagreement. We employ rater cohesion metrics to study the potential influence of political affiliations and demographic backgrounds on raters’ perceptions of offense. Additionally, we utilize CrowdTruth’s rater quality metrics, which consider the demographics of the raters, to score the raters and their annotations. We study how the rater quality metrics influence the in-group and cross-group rater cohesion across the personal and vicarious levels.
Diversity-Aware Annotation for Conversational AI Safety
Alicia Parrish | Vinodkumar Prabhakaran | Lora Aroyo | Mark Díaz | Christopher M. Homan | Greg Serapio-García | Alex S. Taylor | Ding Wang
Proceedings of Safety4ConvAI: The Third Workshop on Safety for Conversational AI @ LREC-COLING 2024
Alicia Parrish | Vinodkumar Prabhakaran | Lora Aroyo | Mark Díaz | Christopher M. Homan | Greg Serapio-García | Alex S. Taylor | Ding Wang
Proceedings of Safety4ConvAI: The Third Workshop on Safety for Conversational AI @ LREC-COLING 2024
How people interpret content is deeply influenced by their socio-cultural backgrounds and lived experiences. This is especially crucial when evaluating AI systems for safety, where accounting for such diversity in interpretations and potential impacts on human users will make them both more successful and inclusive. While recent work has demonstrated the importance of diversity in human ratings that underlie AI pipelines, effective and efficient ways to incorporate diverse perspectives in human data annotation pipelines is still largely elusive. In this paper, we discuss the primary challenges faced in incorporating diversity into model evaluations, and propose a practical diversity-aware annotation approach. Using an existing dataset with highly parallel safety annotations, we take as a test case a policy that prioritizes recall of safety issues, and demonstrate that our diversity-aware approach can efficiently obtain a higher recall of safety issues flagged by minoritized rater groups without hurting overall precision.
2023
Findings from the Bambara - French Machine Translation Competition (BFMT 2023)
Ninoh Agostinho Da Silva | Tunde Oluwaseyi Ajayi | Alexander Antonov | Panga Azazia Kamate | Moussa Coulibaly | Mason Del Rio | Yacouba Diarra | Sebastian Diarra | Chris Emezue | Joel Hamilcaro | Christopher M. Homan | Alexander Most | Joseph Mwatukange | Peter Ohue | Michael Pham | Abdoulaye Sako | Sokhar Samb | Yaya Sy | Tharindu Cyril Weerasooriya | Yacine Zahidi | Sarah Luger
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Ninoh Agostinho Da Silva | Tunde Oluwaseyi Ajayi | Alexander Antonov | Panga Azazia Kamate | Moussa Coulibaly | Mason Del Rio | Yacouba Diarra | Sebastian Diarra | Chris Emezue | Joel Hamilcaro | Christopher M. Homan | Alexander Most | Joseph Mwatukange | Peter Ohue | Michael Pham | Abdoulaye Sako | Sokhar Samb | Yaya Sy | Tharindu Cyril Weerasooriya | Yacine Zahidi | Sarah Luger
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Orange Silicon Valley hosted a low-resource machine translation (MT) competition with monetary prizes. The goals of the competition were to raise awareness of the challenges in the low-resource MT domain, improve MT algorithms and data strategies, and support MT expertise development in the regions where people speak Bambara and other low-resource languages. The participants built Bambara to French and French to Bambara machine translation systems using data provided by the organizers and additional data resources shared amongst the competitors. This paper details each team’s different approaches and motivation for ongoing work in Bambara and the broader low-resource machine translation domain.
Search
Fix author
Co-authors
- Tharindu Cyril Weerasooriya 4
- Marcos Zampieri 4
- Mamadou K. Keita 3
- Deepak Pandita 3
- Seydou Diallo 2
- Sebastien Diarra 2
- Ashiqur R. KhudaBukhsh 2
- Allahsera Auguste Tapo 2
- Ninoh Agostinho Da Silva 1
- Tunde Oluwaseyi Ajayi 1
- Alexander Antonov 1
- Lora Aroyo 1
- Dennis Asamoah Owusu 1
- Kevin Assogba 1
- Adwoa Asantewaa Bremang 1
- Moussa Coulibaly 1
- Nouhoum Coulibaly 1
- Mason Del Rio 1
- Yacouba Diarra 1
- Sebastian Diarra 1
- Sujan Dutta 1
- Mark Díaz 1
- Chris Chinenye Emezue 1
- Joel Hamilcaro 1
- Shanilka Haturusinghe 1
- Panga Azazia Kamaté 1
- Flip Korn 1
- Huy Le 1
- Michael Leventhal 1
- S.R. Liyanage 1
- Sarah Luger 1
- Sarah K. Luger 1
- Alexander Most 1
- Joseph Mwatukange 1
- Peter Ohue 1
- Alicia Parrish 1
- Michael Pham 1
- Jonathan Pofcher 1
- Vinodkumar Prabhakaran 1
- M. Mustafa Rafique 1
- Tharindu Ranasinghe 1
- Abdoulaye Sako 1
- Sokhar Samb 1
- Mandira Sawkar 1
- Randall Sell 1
- Greg Serapio-García 1
- Samay U. Shetty 1
- Yaya Sy 1
- Alex S. Taylor 1
- Ding Wang 1
- Chris Welty 1
- Yacine Zahidi 1