Steven R. Wilson

Also published as: Steven R Wilson

2025

Cross Domain Classification of Education Talk Turns
Achyutarama R. Ganti | Steven R. Wilson | Geoffrey Louie Wing-Yue
Proceedings of the 31st International Conference on Computational Linguistics

The study of classroom discourse is essential for enhancing child development and educational outcomes in academic settings. Prior research has focused on the annotation of conversational talk-turns within the classroom, offering a statistical analysis of the various types of discourse prevalent in these environments. In this work, we explore the generalizability and transferability of text classifiers trained to predict these discourse codes across educational domains. We examine two distinct English-language classroom datasets from two domains: literacy and math. Our results show that models exhibit high accuracy and generalizability when the training and test datasets originate from the same or similar domains. In situations where limited training data is available in new domains, few-shot and zero-shot models exhibit more resiliency and are not as affected as their supervised counterparts. We also observe that accompanying each talk turn with dialog-level context improves the accuracy of the generative models. We conclude by offering suggestions on how to enhance the generalization of these methods to novel domains, proposing directions for future studies to investigate new methods for boosting model adaptability across domains.

TounsiBench: Benchmarking Large Language Models for Tunisian Arabic
Souha Ben Hassine | Asma Arrak | Marouene Addhoum | Steven R Wilson
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In this work, we introduce the first benchmark for evaluating the capabilities of large language models (LLMs) in understanding and generating responses in Tunisian Arabic. To achieve this, we construct a dataset of Tunisian Arabic instructions and prompt ten widely used LLMs that claim to support Arabic. We then assess the LLM responses through both human and LLM-based evaluations across four criteria: quality, correctness, relevance, and dialectal adherence. We analyze the agreement and correlation between these judgments, select GPT-4o as our automated judge model based on its high correlation with human ratings, and generate a final leaderboard using this model. Our error analysis reveals that most LLMs struggle with recognizing and properly responding in Tunisian Arabic. To facilitate further research, we release our dataset, along with gold-standard human-written responses for all 744 instructions, and our evaluation framework, allowing others to benchmark their own models.

Representing and Clustering Errors in Offensive Language Detection
Jood Otey | Laura Biester | Steven R Wilson
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Content moderation is essential in preventing the spread of harmful content on the Internet. However, there are instances where moderation fails and it is important to understand when and why that happens. Workflows that aim to uncover a system’s weakness typically use clustering of the data points’ embeddings to group errors together. In this paper, we evaluate the K-Means clustering of four text representations for the task of offensive language detection in English and Levantine Arabic. We find Sentence-BERT (SBERT) embeddings give the most human-interpretable clustering for English errors and the grouping is mainly based on the targeted group in the text. Meanwhile, SBERT embeddings of Large Language Model (LLM)-generated linguistic features give the most interpretable clustering for Arabic errors.

2022

A Comparative Study on Word Embeddings and Social NLP Tasks
Fatma Elsafoury | Steven R. Wilson | Naeem Ramzan
Proceedings of the Tenth International Workshop on Natural Language Processing for Social Media

In recent years, gray social media platforms, those with a loose moderation policy on cyberbullying, have been attracting more users. Recently, data collected from these types of platforms have been used to pre-train word embeddings (social-media-based), yet these word embeddings have not been investigated for social-NLP-related tasks. In this paper, we carry out a comparative study between social-media-based and non-social-media-based word embeddings on two social NLP tasks: detecting cyberbullying and measuring social bias. Our results show that using social-media-based word embeddings as input features, rather than non-social-media-based embeddings, leads to better cyberbullying detection performance. We also show that some word embeddings are more useful than others for categorizing offensive words. However, we do not find strong evidence that certain word embeddings will necessarily work best when identifying certain categories of cyberbullying within our datasets. Finally, we show that even though most of the state-of-the-art bias metrics ranked social-media-based word embeddings as the most socially biased, these results remain inconclusive and further research is required.
