Haruki Sakajo


2026

This paper presents an evaluation framework for probing large language models’ linguistic knowledge of Indigenous languages of the Americas using zero- and few-shot prompting. The framework consists of three tasks: (1) language identification, (2) cloze completion of Spanish sentences supported by Indigenous-language translations, and (3) grammatical feature classification. We evaluate models from five major families (GPT, Gemini, DeepSeek, Qwen, and LLaMA) on 13 Indigenous languages, including Bribri, Guarani, and Nahuatl. The results show substantial variation across both languages and model families. While a small number of model-language combinations demonstrate consistently stronger performance across tasks, many others perform near chance, highlighting persistent gaps in current models’ abilities on Indigenous languages.
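As an illustration of the prompting setup, the sketch below shows how the language-identification task (task 1) might be posed zero-shot. The prompt wording, label subset, and client are assumptions for illustration, not the paper's exact protocol.

```python
# Hypothetical zero-shot language-identification probe; the actual
# prompts, label set, and models used in the paper may differ.
from openai import OpenAI

LANGUAGES = ["Bribri", "Guarani", "Nahuatl"]  # subset of the 13 languages

client = OpenAI()

def identify_language(sentence: str, model: str = "gpt-4o") -> str:
    """Ask the model to name the language of a sentence in one prompt."""
    prompt = (
        "Which of the following Indigenous languages of the Americas is this "
        f"sentence written in? Options: {', '.join(LANGUAGES)}.\n"
        f"Sentence: {sentence}\n"
        "Answer with the language name only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content.strip()
```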

2025

Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
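The fallback idea can be sketched as follows. This is a minimal illustration only: `tokenize_without` is a hypothetical helper standing in for actually editing the BPE merge table, mean pooling is an assumed aggregation choice, and the shortest-first ordering is an assumption about how "progressively" proceeds.

```python
# Minimal sketch of fallback-based embedding estimation for vocabulary
# transfer. `tokenize_without(sw)` is a hypothetical helper that re-tokenizes
# subword `sw` after removing it from the BPE vocabulary, returning the
# shorter fallback subwords it decomposes into.
import numpy as np

def estimate_embeddings(target_subwords, tokenize_without, embeddings):
    """Estimate embeddings for new target-language subwords.

    Shorter subwords are processed first (an assumed ordering) so that
    the fallback pieces of longer subwords already have embeddings.
    """
    for sw in sorted(target_subwords, key=len):
        fallback = tokenize_without(sw)  # e.g. "kawsay" -> ["kaw", "say"]
        vectors = [embeddings[p] for p in fallback if p in embeddings]
        if vectors:
            # estimate the removed subword from its fallback pieces
            embeddings[sw] = np.mean(vectors, axis=0)
    return embeddings
```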
This paper describes the NAIST submission to the AmericasNLP 2025 shared task on the creation of educational materials for Indigenous languages. We implement three systems to tackle the unique challenges of each language. The first system, used for Maya and Guarani, employs a straightforward GPT-4o few-shot prompting technique, enhanced by synthetically generated examples to ensure coverage of all grammatical variations encountered. The second system, used for Bribri, integrates dictionary-based alignment and linguistic rules to systematically manage linguistic and lexical transformations. Finally, we develop a specialized rule-based system for Nahuatl that reduces sentences to their base form, simplifying the generation of correct morphological variants.
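A minimal sketch of how the first system's few-shot prompt might be assembled; the prompt wording and the (sentence, change, rewritten) triple format are assumptions for illustration, not the exact NAIST prompts or data schema.

```python
def build_few_shot_prompt(examples, sentence, change):
    """Assemble a GPT-4o few-shot prompt from (sentence, change, rewritten)
    triples, mixing real shared-task examples with synthetic ones that
    cover otherwise-unseen grammatical variations (hypothetical format)."""
    lines = ["Rewrite the sentence to apply the requested grammatical change."]
    for src, tag, tgt in examples:
        lines.append(f"Sentence: {src}\nChange: {tag}\nRewritten: {tgt}")
    # the query instance is left open for the model to complete
    lines.append(f"Sentence: {sentence}\nChange: {change}\nRewritten:")
    return "\n\n".join(lines)
```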
Large-scale Vision-Language Models (LVLMs) integrate linguistic and visual information, demonstrating advanced task-solving capabilities. Because these models are built on Large Language Models, they inherit strong capabilities for language tasks. However, the impact of additional visual information on model responses remains insufficiently understood. In this study, we focus on the priming effect, a psychological phenomenon, to investigate how visual information influences language task processing. We present intentionally designed images alongside two types of language tasks with different characteristics and analyze the resulting changes in model responses. Our experimental results show that model responses shift in the direction intended by the image, suggesting that LVLMs do not simply ignore visual information but actively incorporate it into language processing. Furthermore, the similarity between this behavior and priming effects observed in human cognition suggests that LVLMs may share certain aspects of human cognitive mechanisms.
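A sketch of the with-image versus without-image comparison, assuming the OpenAI vision-capable chat API; the actual models, tasks, and priming images in the study are not reproduced here.

```python
# Hypothetical sketch of the priming comparison: ask the same question
# with and without a designed image and compare the two responses.
from openai import OpenAI

client = OpenAI()

def ask(question: str, image_url: str | None = None, model: str = "gpt-4o"):
    content = [{"type": "text", "text": question}]
    if image_url is not None:  # attach the priming image when given
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content

baseline = ask("Name a fruit.")  # text-only control condition
primed = ask("Name a fruit.", "https://example.com/banana.png")  # primed
```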
Vowels are primarily characterized by tongue position. Humans have discovered these features of vowel articulation through their own experience and through explicit objective observation, such as MRI imaging. This knowledge lets us explain and understand the relationship between tongue positions and vowels, and it helps language learners acquire pronunciation. Since language models (LMs) are trained on large amounts of data spanning linguistic and medical fields, our preliminary studies indicate that an LM is able to explain the pronunciation mechanisms of vowels. However, it is unclear whether multi-modal LMs, such as vision LMs, align textual information with visual information. One question arises: do LMs associate real tongue positions with vowel articulation? In this study, we created video and image datasets from an existing real-time MRI dataset and investigated whether LMs can understand vowel articulation based on tongue positions using vision-based information. Our findings suggest that LMs exhibit potential for understanding vowels and tongue positions when reference examples are provided, but struggle without them. Our code for dataset building is available on GitHub.
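For the dataset-building step, a minimal sketch of sampling image frames from real-time MRI video with OpenCV; the released GitHub pipeline may differ in sampling rate and preprocessing.

```python
# Hypothetical sketch of turning real-time MRI video into an image dataset
# by sampling frames at a fixed interval.
import cv2

def extract_frames(video_path: str, every_n: int = 10) -> list:
    """Sample every n-th frame from an MRI video as an image array."""
    frames, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```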
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as mobile phone use while driving. Thus, models must be able to process synchronized inputs from both driver-facing and road-facing cameras. In this study, we investigate the capabilities of LVLMs by constructing a dataset, developing models, and evaluating their performance on it. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.
Continuous instruction following closely mirrors real-world tasks by requiring models to solve sequences of interdependent steps, yet existing multi-step instruction datasets suffer from three key limitations: (1) lack of logical coherence across turns, (2) narrow topical breadth and depth, and (3) reliance on rigid templates or heavy manual effort. We introduce LoCt-Pipeline, a novel pipeline that leverages modern LLMs' reasoning capabilities to assemble rich, topic-related single-instruction data into multi-turn dialogues, producing chains that maintain logical coherence, progressively deepen in content, and span diverse domains without fixed templates or extensive human annotation. We employ this pipeline to construct LoCt-Instruct, a dataset for assessing models' problem-solving abilities. The generated chains serve as a testbed for benchmarking a variety of models, including reasoning-oriented architectures, instruction-tuned variants, and state-of-the-art closed-source LLMs, on their capacity to follow and correctly respond to each step. Our results reveal a substantial performance gap between current LLMs and human solvers. These findings highlight the need for more robust continuous instruction following. We publicly release the dataset and end-to-end pipeline.
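A conceptual sketch of the assembly step, with a generic `llm_call` standing in for the actual model, prompts, and released pipeline code, all of which are assumptions here.

```python
# Conceptual sketch of the LoCt-Pipeline idea: an LLM links topically
# related single instructions into a logically coherent multi-turn chain.
def assemble_dialogue(instructions, llm_call, n_turns=4):
    """Chain related single instructions into one multi-turn dialogue."""
    dialogue = []
    for instruction in instructions[:n_turns]:
        # Ask the LLM to rephrase the next instruction so it builds on
        # the previous turns and deepens the topic (hypothetical prompt).
        context = "\n".join(turn["content"] for turn in dialogue)
        follow_up = llm_call(
            f"Previous turns:\n{context}\n\n"
            f"Rewrite this instruction as a coherent follow-up: {instruction}"
        )
        dialogue.append({"role": "user", "content": follow_up})
        dialogue.append({"role": "assistant", "content": llm_call(follow_up)})
    return dialogue
```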