Miyu Oba

2025

How to Make the Most of LLMs’ Grammatical Knowledge for Acceptability Judgments
Yusuke Ide | Yuto Nishida | Justin Vasselli | Miyu Oba | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is more acceptable. Conventional approaches compare sentence probabilities directly, but large language models (LLMs) provide nuanced evaluation methods using prompts and templates. We therefore investigate how to derive the most accurate acceptability judgments from LLMs to comprehensively evaluate their grammatical knowledge. Through extensive experiments in both English and Chinese, we compare nine judgment methods and demonstrate that two of them, in-template LP (a probability readout method) and Yes/No probability computing (a prompting-based method), achieve higher accuracy than the conventional approach. Our analysis reveals that the top two methods excel in different linguistic phenomena, suggesting they access different aspects of the LLMs’ grammatical knowledge. We find that ensembling the two methods achieves even higher accuracy. Consequently, we recommend these techniques, either individually or ensembled, as more effective alternatives to conventional approaches for assessing grammatical knowledge in LLMs.

pdf bib abs

BQA: Body Language Question Answering Dataset for Video Large Language Models
Shintaro Ozaki | Kazuki Hayashi | Miyu Oba | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding.Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as human unconscious actions can easily cause the model to misinterpret their intent.To address this, we propose a dataset, BQA, a body language question answering dataset, to validate whether the model can correctly interpret emotions from short clips of body language comprising 26 emotion labels of videos of body language.We evaluated various VideoLLMs on the BQA with and without Multimodal Chain of Thought (CoT) and revealed that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs made largely biased answers depending on the age group and ethnicity of the individuals. We also found consistent error patterns in VideoLLMs.

2024

pdf bib abs

BabyLM Challenge: Exploring the effect of variation sets on language model training efficiency
Akari Haga | Akiyo Fukatsu | Miyu Oba | Arianna Bisazza | Yohei Oseki
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

While current large language models have achieved a remarkable success, their data efficiency remains a challenge to overcome. Recently it has been suggested that child-directed speech (CDS) can improve training data efficiency of modern language models based on Transformer neural networks. However, it is not yet understood which specific properties of CDS are effective for training these models. In the context of the BabyLM Challenge, we focus on Variation Sets (VSs), sets of consecutive utterances expressing a similar intent with slightly different words and structures, which are ubiquitous in CDS. To assess the impact of VSs on training data efficiency, we augment CDS data with different proportions of artificial VSs and use these datasets to train an auto-regressive model, GPT-2. We find that the best proportion of VSs depends on the evaluation benchmark: BLiMP and GLUE scores benefit from the presence of VSs, but EWOK scores do not. Additionally, the results vary depending on multiple factors such as the number of epochs and the order of utterance presentation. Taken together, these findings suggest that VSs can have a beneficial influence on language models, while leaving room for further investigation.

pdf bib abs

The imitation of the children’s language acquisition process has been explored to make language models (LMs) more efficient.In particular, errors caused by children’s regularization (so-called overregularization, e.g., using wroted for the past tense of write) have been widely studied to reveal the mechanisms of language acquisition. Existing research has analyzed regularization in language acquisition only by modeling word inflection directly, which is unnatural in light of human language acquisition. In this paper, we hypothesize that language models that imitate the errors children make during language acquisition have a learning process more similar to humans. To verify this hypothesis, we analyzed the learning curve and error preferences of verb inflections in small-scale LMs using acceptability judgments. We analyze the differences in results by model architecture, data, and tokenization. Our model shows child-like U-shaped learning curves clearly for certain verbs, but the preferences for types of overgeneralization did not fully match the observations in children.

pdf bib abs

What kinds of and how much data is necessary for language models to induce grammatical knowledge to judge sentence acceptability? Recent language models still have much room for improvement in their data efficiency compared to humans. This paper investigates whether language models efficiently use indirect data (indirect evidence), from which they infer sentence acceptability. In contrast, humans use indirect evidence efficiently, which is considered one of the inductive biases contributing to efficient language acquisition. To explore this question, we introduce the Wug InDirect Evidence Test (WIDET), a dataset consisting of training instances inserted into the pre-training data and evaluation instances. We inject synthetic instances with newly coined wug words into pretraining data and explore the model’s behavior on evaluation data that assesses grammatical acceptability regarding those words. We prepare the injected instances by varying their levels of indirectness and quantity. Our experiments surprisingly show that language models do not induce grammatical knowledge even after repeated exposure to instances with the same structure but differing only in lexical items from evaluation instances in certain language phenomena. Our findings suggest a potential direction for future research: developing models that use latent indirect evidence to induce grammatical knowledge.

2023

pdf bib

BabyLM Challenge: Curriculum learning based on sentence complexity approximating language acquisition
Miyu Oba | Akari Haga | Akiyo Fukatsu | Yohei Oseki
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

pdf bib abs

Second Language Acquisition of Neural Language Models
Miyu Oba | Tatsuki Kuribayashi | Hiroki Ouchi | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2023

With the success of neural language models (LMs), their language acquisition has gained much attention. This work sheds light on the second language (L2) acquisition of LMs, while previous work has typically explored their first language (L1) acquisition. Specifically, we trained bilingual LMs with a scenario similar to human L2 acquisition and analyzed their cross-lingual transfer from linguistic perspectives. Our exploratory experiments demonstrated that the L1 pretraining accelerated their linguistic generalization in L2, and language transfer configurations (e.g., the L1 choice, and presence of parallel texts) substantially affected their generalizations. These clarify their (non-)human-like L2 acquisition in particular aspects.