Jou-An Chi

2024

Extending the BabyLM Initiative: Promoting Diversity in Datasets and Metrics through High-Quality Linguistic Corpora
Laurent Prévot | Sheng-Fu Wang | Jou-An Chi | Shu-Kai Hsieh
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

BabyLM paves the way for a range of experiments aimed at better understanding language models (LMs) and the differences and similarities between human and artificial language learning. However, the current framework is limited to the English language and a narrow but significant range of evaluation metrics, primarily focused on syntax, semantics, and pragmatics. In this paper, we propose some steps towards extending the framework to other languages, specifically Mandarin Chinese and French, leveraging existing linguistic resources for these languages. Additionally, we advocate for greater exploration of genre variations within subcorpora for training LMs, as well as for the adoption of additional evaluation metrics with different underlying principles. Our proposal consists of using high-quality spontaneous speech corpora as a source for extracting production-related variables, which the models are then fine-tuned to predict. We hypothesize that these production-related features offer insights into the language processing mechanisms underlying the data and that cognitively sensitive models should outperform others in predicting these features. Specifically, we propose focusing on the prediction of phenomena such as speech reductions, prosodic prominences, sequences co-occurring with listeners’ backchannels, and disfluencies. To illustrate our approach, we present an example involving the prediction of speech reductions in spontaneous speech in two different languages (French and English), using models trained on 10 million tokens from different data source mixtures. Although the results are preliminary, they suggest that this task can characterize models for predicting human language processing.

2023

Evaluating Interfaced LLM Bias
Kai-Ching Yeh | Jou-An Chi | Da-Chen Lian | Shu-Kai Hsieh
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)
