Taro Watanabe - ACL Anthology

Taro Watanabe

2026

Toward Automatic Delegation Extraction in Japanese Law
Tsuyoshi Fujita | Yuya Sawada | Yusuke Sakai | Taro Watanabe
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

The legal systems have a hierarchical structure, and a higher-level law often authorizes a lower-level law to implement detailed provisions, which is called delegation. When interpreting legal texts with delegation, readers must repeatedly consult the lower-level laws that stipulate the detailed provisions, imposing a substantial workload. Therefore, it is necessary to develop a system that enables readers to instantly refer to relevant laws in delegation. However, manually annotating delegation is difficult because it requires extensive legal expertise, careful reading of numerous legal texts, and continuous adaptation to newly enacted laws. In this study, we focus on Japanese law and develop a two-stage pipeline system for automatic delegation annotation. First, we extract keywords that indicate delegation using a named entity recognition approach. Second, we identify the delegated provision corresponding to each keyword as an entity disambiguation task. In our experiments, the proposed system demonstrates sufficient performance to assist manual annotation in practice.

entity-linkings: A Unified Library for Entity Linking
Yuya Sawada | Tsuyoshi Fujita | Yusuke Sakai | Taro Watanabe
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Entity linking (EL) aims to disambiguate named entities in text by mapping them to the appropriate entities in a knowledge base. However, it is difficult to use some EL methods, as they sometimes have issues in reproducibility due to limited maintenance or the lack of official resources.To address this, we introduce , a unified library for using and developing entity linking systems through a unified interface. Our library flexibly integrates various candidate retrievers and re-ranking models, making it easy to compare and use any entity linking methods within a unified framework. In addition, it is designed with a strong emphasis on API usability, making it highly extensible, and it supports both command-line tools and APIs. Our code is available on GitHub and is also distributed via PyPI under the MIT-license. The video is available on YouTube.

Diagnosing Vision Language Models’ Perception by Leveraging Human Methods for Color Vision Deficiencies
Kazuki Hayashi | Shintaro Ozaki | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large-scale Vision-Language Models (LVLMs) are being deployed in real-world settings that require visual inference. As capabilities improve, applications in navigation, education, and accessibility are becoming practical. These settings require accommodation of perceptual variation rather than assuming a uniform visual experience. Color perception illustrates this requirement: it is central to visual understanding yet varies across individuals due to Color Vision Deficiencies, an aspect largely ignored in multimodal AI.In this work, we examine whether LVLMs can account for variation in color perception using the Ishihara Test. We evaluate model behavior through generation, confidence, and internal representation, using Ishihara plates as controlled stimuli that expose perceptual differences. Although models possess factual knowledge about color vision deficiencies and can describe the test, they fail to reproduce the perceptual outcomes experienced by affected individuals and instead default to normative color perception. These results indicate that current systems lack mechanisms for representing alternative perceptual experiences, raising concerns for accessibility and inclusive deployment in multimodal settings.

Beyond Sampling: Self-Sorting for Long-Context Ranking
Juseon Do | Sungwoo Han | Jingun Kwon | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Findings of the Association for Computational Linguistics: EACL 2026

Ranking is a fundamental component in a wide range of AI applications. However, large language models (LLMs) remain unstable on long-context ranking. Sliding-window processing is costly and listwise prompting over full candidates still yields inconsistent orders. We show that sampling alone, even with selection-based methods, cannot stabilize ranking because LLM consistency decomposes into within-list order and cross-list preference, in which a single stochastic process cannot align. To address this, we introduce Self-Sorting (SS), which generates m candidate lists and performs n selection-time re-rankings over those lists. SS fuses explicit within-list positions with implicit cross-list preferences to score entities and return a top-k set. Experimental results on five widely used ranking benchmarks show significant improvements in nDCG@1,5,10, highlighting the critical role of implicit consistency.

“Yuki Gets Sushi, David Gets Steak?”: Uncovering Gender and Racial Biases in LLM-Based Meal Recommendations
Xuefeng Wei | Xuan Zhou | Yusuke Sakai | Taro Watanabe
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Group bias in Large Language Models (LLMs) is a well-documented issue, its impact in high-stakes domains such as personalized nutritional advice remains under explored. This study introduces the USChainMains dataset to systematically evaluate LLMs, prompting them with names associated with specific racial and gender groups and rigorously quantifying the healthfulness of the generated meal recommendations against established dietary standards. The findings demonstrate that LLMs systematically recommend meals with significantly higher levels of adverse nutrients for names associated with Black, Hispanic, or male individuals, thereby reflecting and potentially reinforcing detrimental dietary stereotypes. Furthermore, our analysis of two common mitigation strategies reveals their limitations. While model scaling improves overall recommendation healthfulness, it is insufficient to eliminate the healthfulness gap between demographic groups. Notably, while augmented reasoning was effective in mitigating gender bias, it did not mitigate racial disparities. This work underscores the necessity of developing more nuanced, group-aware debiasing techniques to ensure AI-driven systems advance, rather than hinder, health equity.

Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas
Justin Vasselli | Arturo Mp | Frederikus Hudi | Haruki Sakajo | Taro Watanabe
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper presents an evaluation framework for probing large language models’ linguistic knowledge of Indigenous languages of the Americas using zero- and few-shot prompting. The framework consists of three tasks: (1) language identification, (2) cloze completion of Spanish sentences supported by Indigenous-language translations, and (3) grammatical feature classification. We evaluate models from five major families (GPT, Gemini, DeepSeek, Qwen, and LLaMA) on 13 Indigenous languages, including Bribri, Guarani, and Nahuatl. The results show substantial variation across both languages and model families. While a small number of model-language combinations demonstrate consistently stronger performance across tasks, many others perform near chance, highlighting persistent gaps in current models’ abilities on Indigenous languages.

Cosine Similarity as Logits?: A Scalable Knowledge Probe Using Embedding Vectors from Generative Language Models
Tomoyuki Jinno | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, the use of pretrained language models (PLMs) as soft knowledge bases has gained growing interest, sparking the development of knowledge probes to evaluate their factual knowledge retrieval capabilities. However, existing knowledge probes for generative PLMs that support multi-token entities exhibit quadratic time complexity 𝒪(n²), where n corresponds to the number of candidate entities, limiting the size of knowledge graphs used for probing. To address this, we propose DEcoder Embedding-based Relational (DEER) probe, utilizing embedding vectors extracted from generative PLMs. DEER probe achieves effective time complexity of linear order 𝒪(n), supports rank-based evaluation metrics including Hit@k, handles multi-token entity names and enables probing whilst disambiguating homographic tail-entity names. We empirically show that DEER-probe correlates with existing knowledge probes, validating its probing capability, and we demonstrate the practical benefits of its improved scalability.

Completely Modular Fine-tuning for Dynamic Language Adaptation
Zhe Cao | Yusuke Oda | Qianying Liu | Akiko Aizawa | Taro Watanabe
Findings of the Association for Computational Linguistics: EACL 2026

Multilingual Fine-tuning of Large Language Models (LLMs) has achieved great advancements in machine translation. However, existing research focuses only on the traditional fine-tuning setting with a fixed set of languages, lacking dynamic adaptability to new ones. Introducing new languages requires retraining and often causes catastrophic forgetting. In this study, we propose a completely modular fine-tuning pipeline that enables dynamic language adaptation for LLMs. Instead of directly fine-tuning on all languages, our approach first trains English-centric input and output LoRA adapters for each language separately, and then merges the corresponding adapters for arbitrary-direction translation without any additional training. Experiments on 12 translation directions of four low-resource and less-supported languages show that modular fine-tuning achieves up to 86% performance of traditional multi-parallel full-parameter fine-tuning, while training only 0.1% parameters and relying solely on English-centric data without any catastrophic forgetting. Furthermore, we perform a comprehensive analysis about the merging ratio, when to merge, and the rationale for using English as a bridge language via Bayesian Optimization and logit lens.

Towards Singable Lyrics Translation Using Large Language Models
Liu Hanze | Yusuke Sakai | Taro Watanabe
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Lyrics translation must account for rhythm, rhyme, and singability in the translated lyrics. In this study, we focus on singability and investigate effective prompting methods for translating singable lyrics, including verification-guided and multi-round prompting, applied to large language models. First, we curate a multilingual lyrics translation dataset covering a total of six language directions across Chinese, Japanese, and English. Next, we evaluate seven prompting strategies, with instruction complexity increasing incrementally. The results show that multi-prompt strategies improve singability-related aspects, such as rhythmic alignment and phonological naturalness, compared to naive translation. Furthermore, human evaluations using songs created from translated lyrics suggest that moderately complex prompting strategies improve singable naturalness, while more complex strategies contribute to greater stability in perceived quality.

2025

Graph-Structured Trajectory Extraction from Travelogues
Aitaro Yamamoto | Hiroyuki Otomo | Hiroki Ouchi | Shohei Higashiyama | Hiroki Teranishi | Hiroyuki Shindo | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Human traveling trajectories play a central role in characterizing each travelogue, and automatic trajectory extraction from travelogues is highly desired for tourism services, such as travel planning and recommendation. This work addresses the extraction of human traveling trajectories from travelogues. Previous work treated each trajectory as a sequence of visited locations, although locations with different granularity levels, e.g., “Kyoto City” and “Kyoto Station,” should not be lined up in a sequence. In this work, we propose to represent the trajectory as a graph that can capture the hierarchy as well as the visiting order, and construct a benchmark dataset for the trajectory extraction. The experiments using this dataset show that even naive baseline systems can accurately predict visited locations and the visiting order between them, while it is more challenging to predict the hierarchical relations.

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata | Frederikus Hudi | Patrick Amadeus Irawan | David Anugraha | Rifki Afina Putri | Wang Yutong | Adam Nohejl | Ubaidillah Ariq Prathama | Nedjma Ousidhoum | Afifa Amriani | Anar Rzayev | Anirban Das | Ashmari Pramodya | Aulia Adila | Bryan Wilie | Candy Olivia Mawalim | Cheng Ching Lam | Daud Abolade | Emmanuele Chersoni | Enrico Santus | Fariz Ikhwantri | Garry Kuwanto | Hanyang Zhao | Haryo Akbarianto Wibowo | Holy Lovenia | Jan Christian Blaise Cruz | Jan Wira Gotama Putra | Junho Myung | Lucky Susanto | Maria Angelica Riera Machin | Marina Zhukova | Michael Anugraha | Muhammad Farid Adilazuarda | Natasha Christabelle Santosa | Peerat Limkonchotiwat | Raj Dabre | Rio Alexander Audino | Samuel Cahyawijaya | Shi-Xiong Zhang | Stephanie Yulia Salim | Yi Zhou | Yinxuan Gui | David Ifeoluwa Adelani | En-Shiun Annie Lee | Shogo Okada | Ayu Purwarianti | Alham Fikri Aji | Taro Watanabe | Derry Tanti Wijaya | Alice Oh | Chong-Wah Ngo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.

Speaker Identification and Dataset Construction Using LLMs: A Case Study on Japanese Narratives
Seiji Gobara | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the The 7th Workshop on Narrative Understanding

Speaker identification in narrative analysis is a challenging task due to complex dialogues, diverse utterance patterns, and ambiguous character references. Cosly and time-intensive manual annotation limits the scalability of high-quality dataset creation.This study demonstrates a cost-efficient approach of constructing speaker identification datasets by combining small-scale manual annotation with LLM-based labeling. A subset of data is manually annotated and is used to guide LLM predictions with a few-shot approach followed by refinement through minimal human corrections. Our results show that LLMs achieve approximately 90% accuracy on challenging narratives, such as the “Three Kingdoms” dataset, underscoring the importance of targeted human corrections. This approach proves effective for constructing scalable and cost-efficient datasets for Japanese and complex narratives.

Investigating Omission as a Latency Reduction Strategy in Simultaneous Speech Translation
Mana Makinae | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Simultaneous speech translation (SiST) requires balancing translation quality and latency. While most SiST systems follow machine translation assumptions that prioritize full semantic accuracy to the source, human interpreters often omit less critical content to catch up with the speaker. This study investigates whether omission can be used to reduce latency while preserving meaning in SiST.We construct a dataset that includes omission using large language models (LLMs) and propose a Target-Duration Latency (TDL), target-based latency metric that measures the output length considering the start and end timing of translation. Our analysis shows that LLMs can omit less important words while retaining the core meaning. Furthermore, experimental results show that although standard metrics overlook the benefit of the model trained with proposed omission-involving dataset, alternative evaluation methods capture it, as omission leads to shorter outputs with acceptable quality.

Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models
Wataru Hashimoto | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2025

Decoding strategies manipulate the probability distribution underlying the output of a language model and can therefore affect both generation quality and its uncertainty. In this study, we investigate the impact of decoding strategies on uncertainty estimation in Large Language Models (LLMs). Our experiments show that Contrastive Search, which mitigates repetition, yields better uncertainty estimates on average across a range of preference-aligned LLMs. In contrast, the benefits of these strategies sometimes diverge when the model is only post-trained with supervised fine-tuning, i.e. without explicit alignment.

Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
Zhi Qu | Yiran Wang | Jiannan Mao | Chenchen Ding | Hideki Tanaka | Masao Utiyama | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The multilingual neural machine translation (MNMT) aims for arbitrary translations across multiple languages.Although MNMT-specific models trained on parallel data offer low costs in training and deployment, their performance consistently lags behind that of large language models (LLMs).In this work, we introduce registering, a novel method that enables a small MNMT-specific model to compete with LLMs.Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens.By modifying the attention mask, the target token generation only pays attention to the activation of registers, representing the source tokens in the target language space.Experiments on EC-40, a large-scale benchmark, show that our method advances the state-of-the-art of MNMT.We further pre-train two models, namely MITRE (multilingual translation with registers), by 9.3 billion sentence pairs across 24 languages collected from public corpora.One of them, MITRE-913M, outperforms NLLB-3.3B, achieves comparable performance with commercial LLMs, and shows strong adaptability in fine-tuning.Finally, we open-source our models to facilitate further research and development in MNMT: https://github.com/zhiqu22/mitre.

Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?
Takumi Goto | Yusuke Sakai | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

One of the goals of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems such that it matches human preferences. However, current automatic evaluations are based on procedures that diverge from human evaluation. Specifically, human evaluation derives rankings by aggregating sentence-level relative evaluation results, e.g., pairwise comparisons, using a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores to obtain corpus-level scores, which are then sorted to determine rankings. In this study, we propose an aggregation method for existing automatic evaluation metrics which aligns with human evaluation methods to bridge this gap. We conducted experiments using various metrics, including edit-based metrics, -gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for the most of metrics on the SEEDA benchmark.We also found that even BERT-based metrics sometimes outperform the metrics of GPT-4.

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation
Zhi Qu | Yiran Wang | Chenchen Ding | Hideki Tanaka | Masao Utiyama | Taro Watanabe
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Existing multilingual neural machine translation (MNMT) approaches mainly focus on improving models with the encoder-decoder architecture to translate multiple languages. However, decoder-only architecture has been explored less in MNMT due to its underperformance when trained on parallel data solely. In this work, we attribute the issue of the decoder-only architecture to its lack of language transfer capability. Specifically, the decoder-only architecture is insufficient in encoding source tokens with the target language features. We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage to implicitly boost the transfer capability across languages. Additionally, we impose contrastive learning on translation instructions, resulting in improved performance in zero-shot translation. We conduct experiments on TED-19 and OPUS-100 datasets, considering both training from scratch and fine-tuning scenarios.results show that, compared to the encoder-decoder architecture, our methods not only perform competitively in supervised translations but also achieve improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in zero-shot translations. We release our codes at https://github.com/zhiqu22/PhasedDecoder.

Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain
Shintaro Ozaki | Yuta Kato | Siyuan Feng | Masayo Tomita | Kazuki Hayashi | Wataru Hashimoto | Ryoma Obara | Masafumi Oyamada | Katsuhiko Hayashi | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 24th Workshop on Biomedical Language Processing

Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy for queries. This approach is widely applied in several fields by taking its advantage of injecting the most up-to-date information, and researchers are focusing on understanding and improving this aspect to unlock the full potential of RAG in such high-stakes applications.However, despite the potential of RAG to address these needs, the mechanisms behind the confidence levels of its outputs remain underexplored.Our study focuses on the impact of RAG, specifically examining whether RAG increases the confidence of LLM outputs in the medical domain.We conduct this analysis across various configurations and models.We evaluate confidence by treating the model’s predicted probability as its output and calculating several evaluation metrics which include calibration error method, entropy, best probability, and accuracy.Experimental results across multiple datasets confirmed that certain models possess the capability to judge for themselves whether an inserted document relates to the correct answer. These results suggest that evaluating models based on their output probabilities determine whether they function as generators in the RAG framework.Our approach allows to evaluate whether the models handle retrieved documents.

Diversity Explains Inference Scaling Laws: Through a Case Study of Minimum Bayes Risk Decoding
Hidetaka Kamigaito | Hiroyuki Deguchi | Yusuke Sakai | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Inference methods play an important role in eliciting the performance of large language models (LLMs). Currently, LLMs use inference methods utilizing generated multiple samples, which can be derived from Minimum Bayes Risk (MBR) Decoding. Previous studies have conducted empirical analyses to clarify the improvements in generation performance achieved by MBR decoding and have reported various observations. However, the theoretical underpinnings of these findings remain uncertain. To address this, we offer a new theoretical interpretation of MBR decoding from the perspective of bias–diversity decomposition. In this interpretation, the error in the quality estimation of hypotheses by MBR decoding is decomposed into two main factors: bias, which considers the closeness between the utility function and human evaluation, and diversity, which represents the variability in the quality estimation of the utility function. The theoretical analysis reveals the difficulty of simultaneously improving bias and diversity, confirming the validity of enhancing MBR decoding performance by increasing diversity. Furthermore, we reveal that diversity can explain one aspect of inference scaling laws that describe performance improvement by increasing sample size. Moreover, experiments across multiple NLP tasks yielded results consistent with these theoretical characteristics. Our code is available at https://github.com/naist-nlp/mbr-bias-diversity.

JaCorpTrack: Corporate History Event Extraction for Tracking Organizational Changes
Yuya Sawada | Hiroki Ouchi | Yuichiro Yasui | Hiroki Teranishi | Yuji Matsumoto | Taro Watanabe | Masayuki Ishii
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Corporate history in corporate annual reports includes events related to organizational changes, which can provide useful cues for a comprehensive understanding of corporate actions.However, extracting organizational changes requires identifying differences in companies before and after an event, raising concerns about whether existing information extraction systems can accurately capture the relations.This work introduces JaCorpTrack, a novel event extraction task designed to identify events related to organizational changes.JaCorpTrack defines five event types related to organizational changes and is designed to identify the company names before and after each event, as well as the corresponding date.Experimental results indicate that large language models (LLMs) exhibit notable disparities in performance across event types.Our analysis reveals that these systems face challenges in identifying company names before and after events, and in interpreting event types expressed under ambiguous terminology.We will publicly release our dataset and experimental code at https://github.com/naist-nlp/JaCorpTrack

Multilingual Dialogue Generation and Localization with Dialogue Act Scripting
Justin Vasselli | Eunike Andriani Kardinata | Yusuke Sakai | Taro Watanabe
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.

Visual Priming Effect on Large-scale Vision Language Models
Daiki Yoshida | Haruki Sakajo | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Large-scale Vision-Language Models (LVLMs) integrate linguistic and visual information, demonstrating advanced task-solving capabilities. These models are originally derived from Large Language Models, leading to strong capabilities for language tasks. However, the impact of additional visual information on model responses remains insufficiently understood. In this study, we focus on the priming effect, a psychological phenomenon, to investigate how visual information influences language task processing. We present additional intentionally designed images alongside two types of language tasks with different characteristics and analyze changes in the model’s responses. Our experimental results show that model responses shift in the direction intended by the image, suggesting that LVLMs do not simply ignore visual information but actively incorporate it into language processing. Furthermore, the similarity between this behavior and priming effects observed in human cognition suggests that LVLMs may share certain aspects of human cognitive mechanisms.

SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala
Ashmari Pramodya | Nirasha Nelki | Heshan Shalinda | Chamila Liyanage | Yusuke Sakai | Randil Pushpananda | Ruvan Weerasinghe | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

CoAM: Corpus of All-Type Multiword Expressions
Yusuke Ide | Joshua Tanner | Adam Nohejl | Jacob Hoffman | Justin Vasselli | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multiword expressions (MWEs) refer to idiomatic sequences of multiple words.MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size.To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality consisting of human annotation, human review, and automated consistency checking.Additionally, for the first time in a dataset of MWE identification, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis.Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form.Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved the state-of-the-art performance on the DiMSUM dataset.Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.

BannerBench: Benchmarking Vision Language Models for Multi-Ad Selection with Human Preferences
Hiroto Otake | Peinan Zhang | Yusuke Sakai | Masato Mita | Hiroki Ouchi | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2025

Web banner advertisements, which are placed on websites to guide users to a targeted landing page (LP), are still often selected manually because human preferences are important in selecting which ads to deliver. To automate this process, we propose a new benchmark, BannerBench, to evaluate the human preference-driven banner selection process using vision-language models (VLMs). This benchmark assesses the degree of alignment with human preferences in two tasks: a ranking task and a best-choice task, both using sets of five images derived from a single LP. Our experiments show that VLMs are moderately correlated with human preferences on the ranking task. In the best-choice task, most VLMs perform close to chance level across various prompting strategies. These findings suggest that although VLMs have a basic understanding of human preferences, most of them struggle to pinpoint a single suitable option from many candidates.

Improving Explainability of Sentence-level Metrics via Edit-level Attribution for Grammatical Error Correction
Takumi Goto | Justin Vasselli | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Various evaluation metrics have been proposed for Grammatical Error Correction (GEC), but many, particularly reference-free metrics, lack explainability. This lack of explainability hinders researchers from analyzing the strengths and weaknesses of GEC models and limits the ability to provide detailed feedback for users. To address this issue, we propose attributing sentence-level scores to individual edits, providing insight into how specific corrections contribute to the overall performance. For the attribution method, we use Shapley values, from cooperative game theory, to compute the contribution of each edit. Experiments with existing sentence-level metrics demonstrate high consistency across different edit granularities and show approximately 70% alignment with human evaluations. In addition, we analyze biases in the metrics based on the attribution results, revealing trends such as the tendency to ignore orthographic edits.

BQA: Body Language Question Answering Dataset for Video Large Language Models
Shintaro Ozaki | Kazuki Hayashi | Miyu Oba | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding.Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as human unconscious actions can easily cause the model to misinterpret their intent.To address this, we propose a dataset, BQA, a body language question answering dataset, to validate whether the model can correctly interpret emotions from short clips of body language comprising 26 emotion labels of videos of body language.We evaluated various VideoLLMs on the BQA with and without Multimodal Chain of Thought (CoT) and revealed that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs made largely biased answers depending on the age group and ethnicity of the individuals. We also found consistent error patterns in VideoLLMs.

Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries
Haruki Sakajo | Yusuke Ide | Justin Vasselli | Yusuke Sakai | Yingtao Tian | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2025

Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages.Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources.In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists.Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords.The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer.The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.

Tonguescape: Exploring Language Models Understanding of Vowel Articulation
Haruki Sakajo | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Vowels are primarily characterized by tongue position. Humans have discovered these features of vowel articulation through their own experience and explicit objective observation such as using MRI. With this knowledge and our experience, we can explain and understand the relationship between tongue positions and vowels, and this knowledge is helpful for language learners to learn pronunciation. Since language models (LMs) are trained on a large amount of data that includes linguistic and medical fields, our preliminary studies indicate that an LM is able to explain the pronunciation mechanisms of vowels. However, it is unclear whether multi-modal LMs, such as vision LMs, align textual information with visual information. One question arises: do LMs associate real tongue positions with vowel articulation? In this study, we created video and image datasets from the existing real-time MRI dataset and investigated whether LMs can understand vowel articulation based on tongue positions using vision-based information. Our findings suggest that LMs exhibit potential for understanding vowels and tongue positions when reference examples are provided while they have difficulties without them. Our code for dataset building is available on GitHub.

IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator
Yusuke Sakai | Takumi Goto | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2025

We propose IMPARA-GED, a novel reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities. We focus on the quality estimator of IMPARA, an existing automatic GEC evaluation method, and construct that of IMPARA-GED using a pre-trained language model with enhanced GED capabilities. Experimental results on SEEDA, a meta-evaluation dataset for automatic GEC evaluation methods, demonstrate that IMPARA-GED achieves the highest correlation with human sentence-level evaluations.

Long-Tail Crisis in Nearest Neighbor Language Models
Yuto Nishida | Makoto Morishita | Hiroyuki Deguchi | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: NAACL 2025

The k-nearest-neighbor language model (kNN-LM), one of the retrieval-augmented language models, improves the perplexity for given text by directly accessing a large datastore built from any text data during inference.A widely held hypothesis for the success of kNN-LM is that its explicit memory, i.e., the datastore, enhances predictions for long-tail phenomena.However, prior works have primarily shown its ability to retrieve long-tail contexts, leaving the model’s performance remain underexplored in estimating the probabilities of long-tail target tokens during inference.In this paper, we investigate the behavior of kNN-LM on low-frequency tokens, examining prediction probability, retrieval accuracy, and token distribution in the datastore.Our experimental results reveal that kNN-LM does not improve prediction performance for low-frequency tokens but mainly benefits high-frequency tokens regardless of long-tail contexts in the datastore.

AdTEC: A Unified Benchmark for Evaluating Text Quality in Search Engine Advertising
Peinan Zhang | Yusuke Sakai | Masato Mita | Hiroki Ouchi | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

As the fluency of ad texts automatically generated by natural language generation technologies continues to improve, there is an increasing demand to assess the quality of these creatives in real-world setting.We propose **AdTEC**, the first public benchmark to evaluate ad texts from multiple perspectives within practical advertising operations.Our contributions are as follows: (i) Defining five tasks for evaluating the quality of ad texts, as well as constructing a Japanese dataset based on the practical operational experiences of advertising agencies, which are typically maintained in-house. (ii) Validating the performance of existing pre-trained language models (PLMs) and human evaluators on this dataset. (iii) Analyzing the characteristics and providing challenges of the benchmark.Our results show that while PLMs have a practical level of performance in several tasks, humans continue to outperform them in certain domains, indicating that there remains significant potential for further improvement in this area.

Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?
Adam Nohejl | Frederikus Hudi | Eunike Andriani Kardinata | Shintaro Ozaki | Maria Angelica Riera Machin | Hongyu Sun | Justin Vasselli | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and lexical complexity. In addition to being strongly correlated with two psycholinguistic variables, a simple linear regression on the new frequencies achieves a new high score on a lexical complexity prediction task in English and Japanese, surpassing both models trained on film subtitle frequencies and the LLM GPT-4. We publicly release our code, the frequency lists, fastText word embeddings, and statistical language models.

Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
Haruki Sakajo | Hiroshi Takato | Hiroshi Tsutsui | Komei Soda | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)

Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as the use of mobiles while driving. Thus, the ability to process synchronized inputs is necessary from both driver-facing and road-facing cameras. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.

Leveraging Dictionaries and Grammar Rules for the Creation of Educational Materials for Indigenous Languages
Justin Vasselli | Haruki Sakajo | Arturo Martínez Peguero | Frederikus Hudi | Taro Watanabe
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This paper describes the NAIST submission to the AmericasNLP 2025 shared task on the creation of educational materials for Indigenous languages. We implement three systems to tackle the unique challenges of each language. The first system, used for Maya and Guarani, employs a straightforward GPT-4o few-shot prompting technique, enhanced by synthetically generated examples to ensure coverage of all grammatical variations encountered. The second system, used for Bribri, integrates dictionary-based alignment and linguistic rules to systematically manage linguisticand lexical transformations. Finally, we developed a specialized rule-based system for Nahuatl that systematically reduces sentences to their base form, simplifying the generation of correct morphology variants.

IRR: Image Review Ranking Framework for Evaluating Vision-Language Models
Kazuki Hayashi | Kazuma Onishi | Toma Suzuki | Yusuke Ide | Seiji Gobara | Shigeki Saito | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models excel at generating factual content, their ability to generate and evaluate texts reflecting perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, highlighting the need for further advancements. These findings highlight the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.

HLU: Human Vs LLM Generated Text Detection Dataset for Urdu at Multiple Granularities
Iqra Ali | Jesse Atuhurra | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

The rise of large language models (LLMs) generating human-like text has raised concerns about misuse, especially in low-resource languages like Urdu. To address this gap, we introduce the HLU dataset, which consists of three datasets: Document, Paragraph, and Sentence level. The document-level dataset contains 1,014 instances of human-written and LLM-generated articles across 13 domains, while the paragraph and sentence-level datasets each contain 667 instances. We conducted both human and automatic evaluations. In the human evaluation, the average accuracy at the document level was 35%, while at the paragraph and sentence levels, accuracies were 75.68% and 88.45%, respectively. For automatic evaluation, we finetuned the XLMRoBERTa model for both monolingual and multilingual settings achieving consistent results in both. Additionally, we assessed the performance of GPT4 and Claude3Opus using zero-shot prompting. Our experiments and evaluations indicate that distinguishing between human and machine-generated text is challenging for both humans and LLMs, marking a significant step in addressing this issue in Urdu.

Translating Movie Subtitles by Large Language Models using Movie-meta Information
Ashmari Pramodya | Yusuke Sakai | Justin Vasselli | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Large language models (LLMs) have advanced natural language processing by understanding, generating, and manipulating texts.Although recent studies have shown that prompt engineering can reduce computational effort and potentially improve translation quality, prompt designs specific to different domains remain challenging. Besides, movie subtitle translation is particularly challenging and understudied, as it involves handling colloquial language, preserving cultural nuances, and requires contextual information such as the movie’s theme and storyline to ensure accurate meaning. This study aims to fill this gap by focusing on the translation of movie subtitles through the use of prompting strategies that incorporate the movie’s meta-information, e.g., movie title, summary, and genre. We build a multilingual dataset which aligns the OpenSubtitles dataset with their corresponding Wikipedia articles and investigate different prompts and their effect on translation performance. Our experiments with GPT-3.5, GPT-4o, and LLaMA-3 models have shown that the presence of meta-information improves translation accuracy. These findings further emphasize the importance of designing appropriate prompts and highlight the potential of LLMs to enhance subtitle translation quality.

Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
Justin Vasselli | Adam Nohejl | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval—a prompt-based method leveraging LLMs—across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of nuanced evaluation frameworks to address real-world dialogue challenges.

Efficient Nearest Neighbor based Uncertainty Estimation for Natural Language Processing Tasks
Wataru Hashimoto | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: NAACL 2025

Trustworthiness in model predictions is crucial for safety-critical applications in the real world. However, deep neural networks often suffer from the issues of uncertainty estimation, such as miscalibration. In this study, we propose k-Nearest Neighbor Uncertainty Estimation (kNN-UE), which is a new uncertainty estimation method that uses not only the distances from the neighbors, but also the ratio of labels in the neighbors. Experiments on sentiment analysis, natural language inference, and named entity recognition show that our proposed method outperforms the baselines and recent density-based methods in several calibration and uncertainty metrics. Moreover, our analyses indicate that approximate nearest neighbor search techniques reduce the inference overhead without significantly degrading the uncertainty estimation performance when they are appropriately combined.

How to Make the Most of LLMs’ Grammatical Knowledge for Acceptability Judgments
Yusuke Ide | Yuto Nishida | Justin Vasselli | Miyu Oba | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is more acceptable. Conventional approaches compare sentence probabilities directly, but large language models (LLMs) provide nuanced evaluation methods using prompts and templates. We therefore investigate how to derive the most accurate acceptability judgments from LLMs to comprehensively evaluate their grammatical knowledge. Through extensive experiments in both English and Chinese, we compare nine judgment methods and demonstrate that two of them, in-template LP (a probability readout method) and Yes/No probability computing (a prompting-based method), achieve higher accuracy than the conventional approach. Our analysis reveals that the top two methods excel in different linguistic phenomena, suggesting they access different aspects of the LLMs’ grammatical knowledge. We find that ensembling the two methods achieves even higher accuracy. Consequently, we recommend these techniques, either individually or ensembled, as more effective alternatives to conventional approaches for assessing grammatical knowledge in LLMs.

Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.

A Text Embedding Model with Contrastive Example Mining for Point-of-Interest Geocoding
Hibiki Nakatani | Hiroki Teranishi | Shohei Higashiyama | Yuya Sawada | Hiroki Ouchi | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

Geocoding is a fundamental technique that links location mentions to their geographic positions, which is important for understanding texts in terms of where the described events occurred. Unlike most geocoding studies that targeted coarse-grained locations, we focus on geocoding at a fine-grained point-of-interest (POI) level. To address the challenge of finding appropriate geo-database entries from among many candidates with similar POI names, we develop a text embedding-based geocoding model and investigate (1) entry encoding representations and (2) hard negative mining approaches suitable for enhancing the model’s disambiguation ability. Our experiments show that the second factor significantly impact the geocoding accuracy of the model.

Knowledge Editing Induces Underconfidence in Language Models
Ryo Hasegawa | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)

As language models continue to scale, the demand for knowledge editing, a retraining-free knowledge update method, has increased. However, since knowledge editing directly alters token prediction probabilities acquired during pretraining, the probabilities may diverge from the empirical distribution. In this study, we analyze the impact of knowledge editing to compare the alignment between token prediction probabilities and task accuracy by calculating confidence calibration before and after knowledge editing. Our results reveal that, for tasks requiring semantic understanding, the range of increase in token prediction probabilities tends to be smaller than that of accuracy improvement, suggesting that knowledge editing methods lead to less confidence in prediction.

Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
Shintaro Ozaki | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Findings of the Association for Computational Linguistics: NAACL 2025

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow. However, pre-training of Vision Encoder and the integrated training of LLMs with Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can completely handle their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation have cultural differences and biases, remaining issues for use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset that takes into account nuances and country-specific phrases was then used to evaluate the generation explanation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.

gec-metrics: A Unified Library for Grammatical Error Correction Evaluation
Takumi Goto | Yusuke Sakai | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to developing GEC evaluation metrics. Our code is released under the MIT license¹ and is also distributed as an installable package². The video is available at YouTube³.¹GitHub: https://github.com/gotutiyan/gec-metrics²PyPi: https://pypi.org/project/gec-metrics/³Video: https://youtu.be/cor6dkN6EfI

Reliability Crisis of Reference-free Metrics for Grammatical Error Correction
Takumi Goto | Yusuke Sakai | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2025

Reference-free evaluation metrics for grammatical error correction (GEC) have achieved high correlation with human judgments.However, these metrics are not designed to evaluate adversarial systems that aim to obtain unjustifiably high scores. The existence of such systems undermines the reliability of automatic evaluation, as it can mislead users in selecting appropriate GEC systems. In this study, we propose adversarial attack strategies for four reference-free metrics: SOME, Scribendi, IMPARA, and LLM-based metrics, and demonstrate that our adversarial systems outperform the current state-of-the-art. These findings highlight the need for more robust evaluation methods.

Languages Transferred Within the Encoder: On Representation Transfer in Zero-Shot Multilingual Translation
Zhi Qu | Chenchen Ding | Taro Watanabe
Proceedings of Machine Translation Summit XX: Volume 1

Understanding representation transfer in multilingual neural machine translation (MNMT) can reveal the reason for the zero-shot translation deficiency. In this work, we systematically analyze the representational issue of MNMT models. We first introduce the identity pair, translating a sentence to itself, to address the lack of the base measure in multilingual investigations, as the identity pair can reflect the representation of a language within the model. Then, we demonstrate that the encoder transfers the source language to the representational subspace of the target language instead of the language-agnostic state. Thus, the zero-shot translation deficiency arises because the representation of a translation is entangled with other languages and not transferred to the target language effectively. Based on our findings, we propose two methods: 1) low-rank language-specific embedding at the encoder, and 2) language-specific contrastive learning of the representation at the decoder. The experimental results on Europarl-15, TED-19, and OPUS-100 datasets show that our methods substantially enhance the performance of zero-shot translations without sacrifices in supervised directions by improving language transfer capacity, thereby providing practical evidence to support our conclusions. Codes are available at https://github.com/zhiqu22/ZeroTrans.

Agreement-Constrained Probabilistic Minimum Bayes Risk Decoding
Koki Natsumi | Hiroyuki Deguchi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge distilled model to guide the completion of the score matrix. Our AC-PMBR decoding improved approximation errors of matrix completion by up to 3 times and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT’23 En↔De translation tasks.

Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates
Toma Suzuki | Yusuke Sakai | Justin Vasselli | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)

Large language models (LLMs) achieve high performance through instruction-tuning, which involves learning various tasks using instruction templates. However, these templates often contain task-specific expressions, which are words that frequently appear in certain contexts but do not always convey the actual meaning of that context, even if they seem closely related to the target task. Biases inherent in such instruction templates may be learned by LLMs during training, potentially degrading performance when the models encounter superficial expressions. In this study, we propose a method that incorporates additional instructions to FLAN templates, without altering the base instruction to produce “superfluous instructions”. This allows us to investigate the vulnerabilities of LLMs caused by overfitting to task-specific expressions embedded in instruction templates. The experimental results revealed that the inclusion of superficial words strongly related to each task in the instruction text can alter the output, regardless of the intended meaning.

Reliability of Distribution Predictions by LLMs: Insights from Counterintuitive Pseudo-Distributions
Toma Suzuki | Ayuki Katayama | Seiji Gobara | Ryo Tsujimoto | Hibiki Nakatani | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

The proportion of responses to a question and its options, known as the response distribution, enables detailed analysis of human society. Recent studies highlight the use of Large Language Models (LLMs) for predicting response distributions as a cost-effective survey method. However, the reliability of these predictions remains unclear. LLMs often generate answers by blindly following instructions rather than applying rational reasoning based on pretraining-acquired knowledge. This study investigates whether LLMs can rationally estimate distributions when presented with explanations of “artificially generated distributions” that are against commonsense. Specifically, we assess whether LLMs recognize counterintuitive explanations and adjust their predictions or simply follow these inconsistent explanations. Results indicate that smaller or less human-optimized LLMs tend to follow explanations uncritically, while larger or more optimized models are better at resisting counterintuitive explanations by leveraging their pretraining-acquired knowledge. These findings shed light on factors influencing distribution prediction performance in LLMs and are crucial for developing reliable distribution predictions using language models.

LoCt-Instruct: An Automatic Pipeline for Constructing Datasets of Logical Continuous Instructions
Hongyu Sun | Yusuke Sakai | Haruki Sakajo | Shintaro Ozaki | Kazuki Hayashi | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Continuous instruction following closely mirrors real-world tasks by requiring models to solve sequences of interdependent steps, yet existing multi-step instruction datasets suffer from three key limitations: (1) lack of logical coherence across turns, (2) narrow topical breadth and depth, and (3) reliance on rigid templates or heavy manual effort. We introduce LoCt-Pipeline, a novel pipeline that leverages modern LLMs’ reasoning capabilities to assemble rich, topic-related single-instruction data into multi-turn dialogues, producing chains that are logically coherent, progressively deepen in content, and span diverse domains without fixed templates or extensive human annotation. We employed this pipeline to construct LoCt-Instruct for assessing models’ problem-solving abilities. The generated chains serve as a testbed for benchmarking a variety of models, including reasoning-oriented architectures, instruction-tuned variants, and state-of-the-art closed-source LLMs on their capacity to follow and correctly respond to each step. Our results reveal a substantial performance gap between current LLMs and human solvers. These findings highlight the need for more robust continuous instruction following. We publicly release the dataset and end-to-end pipeline.

2024

Context-Aware Machine Translation with Source Coreference Explanation
Huy Hien Vu | Hidetaka Kamigaito | Taro Watanabe
Transactions of the Association for Computational Linguistics, Volume 12

Despite significant improvements in enhancing the quality of translation, context-aware machine translation (MT) models underperform in many cases. One of the main reasons is that they fail to utilize the correct features from context when the context is too long or their models are overly complex. This can lead to the explain-away effect, wherein the models only consider features easier to explain predictions, resulting in inaccurate translations. To address this issue, we propose a model that explains the decisions made for translation by predicting coreference features in the input. We construct a model for input coreference by exploiting contextual features from both the input and translation output representations on top of an existing MT model. We evaluate and analyze our method in the WMT document-level translation task of English-German dataset, the English-Russian dataset, and the multilingual TED talk dataset, demonstrating an improvement of over 1.0 BLEU score when compared with other context-aware models.

Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates
Yusuke Sakai | Adam Nohejl | Jiangnan Hang | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.

mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans
Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2024

It is very challenging to curate a dataset for language-specific knowledge and common sense in order to evaluate natural language understanding capabilities of language models. Due to the limitation in the availability of annotators, most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. Therefore, we propose Multilingual CommonsenseQA (mCSQA) based on the construction process of CSQA but leveraging language models for a more efficient construction, e.g., by asking LM to generate questions/answers, refine answers and verify QAs followed by reduced human efforts for verification. Constructed dataset is a benchmark for cross-lingual language-transfer capabilities of multilingual LMs, and experimental results showed high language-transfer capabilities for questions that LMs could easily solve, but lower transfer capabilities for questions requiring deep knowledge or commonsense. This highlights the necessity of language-specific datasets for evaluation and training. Finally, our method demonstrated that multilingual LMs could create QA including language-specific knowledge, significantly reducing the dataset creation cost compared to manual creation. The datasets are available at https://huggingface.co/datasets/yusuke1997/mCSQA.

Synthetic Context with LLM for Entity Linking from Scientific Tables
Yuji Oshima | Hiroyuki Shindo | Hiroki Teranishi | Hiroki Ouchi | Taro Watanabe
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

Tables in scientific papers contain crucial information, such as experimental results.Entity Linking (EL) is a promising technology that analyses tables and associates them with a knowledge base.EL for table cells requires identifying the referent concept of each cell while understanding the context relevant to each cell in the paper. However, extracting the relevant context from the paper is challenging because the relevant parts are scattered in the main text and captions.This study defines a rule-based method for extracting broad context from the main text, including table captions and sentences that mention the table.Furthermore, we propose synthetic context as a more refined context generated by large language models (LLMs).In a synthetic context, contexts from the entire paper are refined by summarizing, injecting supplemental knowledge, and clarifying the referent concept.We observe this approach improves accuracy for EL by more than 10 points on the S2abEL dataset, and our qualitative analysis suggests potential future works.

Alignment-Based Decoding Policy for Low-Latency and Anticipation-Free Neural Japanese Input Method Editors
Armin Sarhangzadeh | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2024

Japanese input method editors (IMEs) are essential tools for inputting Japanese text using a limited set of characters such as the kana syllabary. However, despite their importance, the potential of newer attention-based encoder-decoder neural networks, such as Transformer, has not yet been fully explored for IMEs due to their high computational cost and low-quality intermediate output in simultaneous settings, leading to high latencies. In this work, we propose a simple decoding policy to enable the use of attention-based encoder-decoder networks for simultaneous kana-kanji conversion in the context of Japanese IMEs inspired by simultaneous machine translation (SimulMT). We demonstrate that simply decoding by explicitly considering the word boundaries achieves a fairly strong quality-latency trade-off, as it can be seen as equivalent to performing decoding on aligned prefixes and thus achieving an incremental anticipation-free conversion. We further show how such a policy can be applied in practice to achieve high-quality conversions with minimal computational overhead. Our experiments show that our approach can achieve a noticeably better quality-latency trade-off compared to the baselines, while also being a more practical approach due to its ability to directly handle streaming input. Our code is available at https://anonymous.4open.science/r/transformer_ime-D327.

Are Data Augmentation Methods in Named Entity Recognition Applicable for Uncertainty Estimation?
Wataru Hashimoto | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This work investigates the impact of data augmentation on confidence calibration and uncertainty estimation in Named Entity Recognition (NER) tasks. For the future advance of NER in safety-critical fields like healthcare and finance, it is essential to achieve accurate predictions with calibrated confidence when applying Deep Neural Networks (DNNs), including Pre-trained Language Models (PLMs), as a real-world application. However, DNNs are prone to miscalibration, which limits their applicability. Moreover, existing methods for calibration and uncertainty estimation are computational expensive. Our investigation in NER found that data augmentation improves calibration and uncertainty in cross-genre and cross-lingual setting, especially in-domain setting. Furthermore, we showed that the calibration for NER tends to be more effective when the perplexity of the sentences generated by data augmentation is lower, and that increasing the size of the augmentation further improves calibration and uncertainty.

Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation
Frederikus Hudi | Zhi Qu | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multilingual neural machine translation aims to encapsulate multiple languages into a single model. However, it requires an enormous dataset, leaving the low-resource language (LRL) underdeveloped. As LRLs may benefit from shared knowledge of multilingual representation, we aspire to find effective ways to integrate unseen languages in a pre-trained model. Nevertheless, the intricacy of shared representation among languages hinders its full utilisation. To resolve this problem, we employed target language prediction and a central language-aware layer to improve representation in integrating LRLs. Focusing on improving LRLs in the linguistically diverse country of Indonesia, we evaluated five languages using a parallel corpus of 1,000 instances each, with experimental results measured by BLEU showing zero-shot improvement of 7.4 from the baseline score of 7.1 to a score of 15.5 at best. Further analysis showed that the gains in performance are attributed more to the disentanglement of multilingual representation in the encoder with the shift of the target language-specific representation in the decoder.

Centroid-Based Efficient Minimum Bayes Risk Decoding
Hiroyuki Deguchi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe | Hideki Tanaka | Masao Utiyama
Findings of the Association for Computational Linguistics: ACL 2024

Minimum Bayes risk (MBR) decoding achieved state-of-the-art translation performance by using COMET, a neural metric that has a high correlation with human evaluation.However, MBR decoding requires quadratic time since it computes the expected score between a translation hypothesis and all reference translations.We propose centroid-based MBR (CBMBR) decoding to improve the speed of MBR decoding.Our method clusters the reference translations in the feature space, and then calculates the score using the centroids of each cluster.The experimental results show that our CBMBR not only improved the decoding speed of the expected score calculation 5.7 times, but also outperformed vanilla MBR decoding in translation quality by up to 0.5 COMET in the WMT’22 En↔Ja, En↔De, En↔Zh, and WMT’23 En↔Ja translation tasks.

Applying Linguistic Expertise to LLMs for Educational Material Development in Indigenous Languages
Justin Vasselli | Arturo Martínez Peguero | Junehwan Sung | Taro Watanabe
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

This paper presents our approach to the AmericasNLP 2024 Shared Task 2 as the JAJ (/dʒæz/) team. The task aimed at creating educational materials for indigenous languages, and we focused on Maya and Bribri. Given the unique linguistic features and challenges of these languages, and the limited size of the training datasets, we developed a hybrid methodology combining rule-based NLP methods with prompt-based techniques. This approach leverages the meta-linguistic capabilities of large language models, enabling us to blend broad, language-agnostic processing with customized solutions. Our approach lays a foundational framework that can be expanded to other indigenous languages languages in future work.

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
Huayang Li | Siheng Li | Deng Cai | Longyue Wang | Lemao Liu | Taro Watanabe | Yujiu Yang | Shuming Shi
Findings of the Association for Computational Linguistics: ACL 2024

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering LLMs with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. Extensive quantitative and qualitative experiments demonstrate that MIM trained on TextBind achieves remarkable generation capability in multimodal conversations compared to recent baselines.

Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
Yusuke Sakai | Mana Makinae | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In Simultaneous Machine Translation (SiMT), training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency. However, constructing such a corpus is challenging due to high costs, and limitations in annotator capabilities, and as a result, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation (ST) corpora into interpretation-style corpora, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models using the LLM-SI-Corpus reduces latency while achieving better quality compared to models fine-tuned with other corpora in both speech-to-text and text-to-text settings. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.

Japanese Rule-based Grapheme-to-phoneme Conversion System and Multilingual Named Entity Dataset with International Phonetic Alphabet
Yuhi Matogawa | Yusuke Sakai | Taro Watanabe | Chihiro Taguchi
Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology

In Japanese, loanwords are primarily written in Katakana, a syllabic writing system, based on their pronunciation. However, the transliterated loanwords often exhibit spelling variations, such as the word “Hepburn” being written as “ヘボン (hebon)”, “ヘプバーン (hepubaan)”, “ヘップバーン (heppubaan)”. These orthographical variants pose a bottleneck in multilingual Named Entity Recognition (NER), because named entities (NEs) do not have one-to-one matches. In this study, we introduce a rule-based grapheme-to-phoneme (G2P) system for Japanese based on literature in linguistics and a large-scale multilingual NE dataset with annotations of the International Phonetic Alphabet (IPA), focusing on IPA to address the Katakana spelling variations in loanwords. These rules and dataset are expected to be beneficial for tasks such as NE aggregation, G2P system, construction of cross-lingual language models, and entity linking. We hope our work advances research on Japanese NER with multilingual loanwords by solving the spelling ambiguities.

Detector–Corrector: Edit-Based Automatic Post Editing for Human Post Editing
Hiroyuki Deguchi | Masaaki Nagata | Taro Watanabe
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

Post-editing is crucial in the real world because neural machine translation (NMT) sometimes makes errors.Automatic post-editing (APE) attempts to correct the outputs of an MT model for better translation quality.However, many APE models are based on sequence generation, and thus their decisions are harder to interpret for actual users.In this paper, we propose “detector–corrector”, an edit-based post-editing model, which breaks the editing process into two steps, error detection and error correction.The detector model tags each MT output token whether it should be corrected and/or reordered while the corrector model generates corrected words for the spans identified as errors by the detector.Experiments on the WMT’20 English–German and English–Chinese APE tasks showed that our detector–corrector improved the translation edit rate (TER) compared to the previous edit-based model and a black-box sequence-to-sequence APE model, in addition, our model is more explainable because it is based on edit operations.

Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model
Mana Makinae | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Simultaneous Speech Translation (SiST) begins translating before the entire source input is received, making it crucial to balance quality and latency. In real interpreting situations, interpreters manage this simultaneity by breaking sentences into smaller segments and translating them while maintaining the source order as much as possible. SiST could benefit from this approach to balance quality and latency. However, current corpora used for simultaneous tasks often involve significant word reordering in translation, which is not ideal given that interpreters faithfully follow source syntax as much as possible. Inspired by conference interpreting by humans utilizing the salami technique, we introduce the Simul-MuST-C, a dataset created by leveraging the Large Language Model (LLM), specifically GPT-4o, which aligns the target text as closely as possible to the source text by using minimal chunks that contain enough information to be interpreted. Experiments on three language pairs show that the effectiveness of segmented-base monotonicity in training data varies with the grammatical distance between the source and the target, with grammatically distant language pairs benefiting the most in achieving quality while minimizing latency.

Difficult for Whom? A Study of Japanese Lexical Complexity
Adam Nohejl | Akio Hayakawa | Yusuke Ide | Taro Watanabe
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

The tasks of lexical complexity prediction (LCP) and complex word identification (CWI) commonly presuppose that difficult-to-understand words are shared by the target population. Meanwhile, personalization methods have also been proposed to adapt models to individual needs. We verify that a recent Japanese LCP dataset is representative of its target population by partially replicating the annotation. By another reannotation we show that native Chinese speakers perceive the complexity differently due to Sino-Japanese vocabulary. To explore the possibilities of personalization, we compare competitive baselines trained on the group mean ratings and individual ratings in terms of performance for an individual. We show that the model trained on a group mean performs similarly to an individual model in the CWI task, while achieving good LCP performance for an individual is difficult. We also experiment with adapting a finetuned BERT model, which results only in marginal improvements across all settings.

Constructing Indonesian-English Travelogue Dataset
Eunike Andriani Kardinata | Hiroki Ouchi | Taro Watanabe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Research in low-resource language is often hampered due to the under-representation of how the language is being used in reality. This is particularly true for Indonesian language because there is a limited variety of textual datasets, and majority were acquired from official sources with formal writing style. All the more for the task of geoparsing, which could be implemented for navigation and travel planning applications, such datasets are rare, even in the high-resource languages, such as English. Being aware of the need for a new resource in both languages for this specific task, we constructed a new dataset comprising both Indonesian and English from personal travelogue articles. Our dataset consists of 88 articles, exactly half of them written in each language. We covered both named and nominal expressions of four entity types related to travel: location, facility, transportation, and line. We also conducted experiments by training classifiers to recognise named entities and their nominal expressions. The results of our experiments showed a promising future use of our dataset as we obtained F1-score above 0.9 for both languages.

Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
Zhiyu Guo | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Do LLMs Implicitly Determine the Suitable Text Difficulty for Users?
Seiji Gobara | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
Eri Onami | Shuhei Kurita | Taiki Miyanishi | Taro Watanabe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Document question answering is a task of question answering on given documents such as reports, slides, pamphlets, and websites, and it is a truly demanding task as paper and electronic forms of documents are so common in our society. This is known as a quite challenging task because it requires not only text understanding but also understanding of figures and tables, and hence visual question answering (VQA) methods are often examined in addition to textual approaches. We introduce Japanese Document Question Answering (JDocQA), a large-scale document-based QA dataset, essentially requiring both visual and textual information to answer questions, which comprises 5,504 documents in PDF format and annotated 11,600 question-and-answer instances in Japanese. Each QA instance includes references to the document pages and bounding boxes for the answer clues. We incorporate multiple categories of questions and unanswerable questions from the document for realistic question-answering applications. We empirically evaluate the effectiveness of our dataset with text-based large language models (LLMs) and multimodal models. Incorporating unanswerable questions in finetuning may contribute to harnessing the so-called hallucination generation.

Towards Artwork Explanation in Large-scale Vision Language Models
Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large-scale Vision-Language Models (LVLMs) output text from images and instructions, demonstrating advanced capabilities in text generation and comprehension. However, it has not been clarified to what extent LVLMs understand the knowledge necessary for explaining images, the complex relationships between various pieces of knowledge, and how they integrate these understandings into their explanations. To address this issue, we propose a new task: the artwork explanation generation task, along with its evaluation dataset and metric for quantitatively assessing the understanding and utilization of knowledge about artworks. This task is apt for image description based on the premise that LVLMs are expected to have pre-existing knowledge of artworks, which are often subjects of wide recognition and documented information.It consists of two parts: generating explanations from both images and titles of artworks, and generating explanations using only images, thus evaluating the LVLMs’ language-based and vision-based knowledge.Alongside, we release a training dataset for LVLMs to learn explanations that incorporate knowledge about artworks.Our findings indicate that LVLMs not only struggle with integrating language and visual information but also exhibit a more pronounced limitation in acquiring knowledge from images alone. The datasets ExpArt=Explain Artworks are available at https://huggingface.co/datasets/naist-nlp/ExpArt

Unified Interpretation of Smoothing Methods for Negative Sampling Loss Functions in Knowledge Graph Embedding
Xincan Feng | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)

Knowledge Graphs (KGs) are fundamental resources in knowledge-intensive tasks in NLP. Due to the limitation of manually creating KGs, KG Completion (KGC) has an important role in automatically completing KGs by scoring their links with KG Embedding (KGE). To handle many entities in training, KGE relies on Negative Sampling (NS) loss that can reduce the computational cost by sampling. Since the appearance frequencies for each link are at most one in KGs, sparsity is an essential and inevitable problem. The NS loss is no exception. As a solution, the NS loss in KGE relies on smoothing methods like Self-Adversarial Negative Sampling (SANS) and subsampling. However, it is uncertain what kind of smoothing method is suitable for this purpose due to the lack of theoretical understanding. This paper provides theoretical interpretations of the smoothing methods for the NS loss in KGE and induces a new NS loss, Triplet Adaptive Negative Sampling (TANS), that can cover the characteristics of the conventional smoothing methods. Experimental results of TransE, DistMult, ComplEx, RotatE, HAKE, and HousE on FB15k-237, WN18RR, and YAGO3-10 datasets and their sparser subsets show the soundness of our interpretation and performance improvement by our TANS.

Generating Diverse Translation with Perturbed kNN-MT
Yuto Nishida | Makoto Morishita | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Generating multiple translation candidates would enable users to choose the one that satisfies their needs.Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem—the model underestimates a prediction that is largely different from the training data, even if that prediction is likely.This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT).Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem.Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation’s magnitude.

Can Language Models Induce Grammatical Knowledge from Indirect Evidence?
Miyu Oba | Yohei Oseki | Akiyo Fukatsu | Akari Haga | Hiroki Ouchi | Taro Watanabe | Saku Sugawara
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

What kinds of and how much data is necessary for language models to induce grammatical knowledge to judge sentence acceptability? Recent language models still have much room for improvement in their data efficiency compared to humans. This paper investigates whether language models efficiently use indirect data (indirect evidence), from which they infer sentence acceptability. In contrast, humans use indirect evidence efficiently, which is considered one of the inductive biases contributing to efficient language acquisition. To explore this question, we introduce the Wug InDirect Evidence Test (WIDET), a dataset consisting of training instances inserted into the pre-training data and evaluation instances. We inject synthetic instances with newly coined wug words into pretraining data and explore the model’s behavior on evaluation data that assesses grammatical acceptability regarding those words. We prepare the injected instances by varying their levels of indirectness and quantity. Our experiments surprisingly show that language models do not induce grammatical knowledge even after repeated exposure to instances with the same structure but differing only in lexical items from evaluation instances in certain language phenomena. Our findings suggest a potential direction for future research: developing models that use latent indirect evidence to induce grammatical knowledge.

Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation
Zhe Cao | Zhi Qu | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Multilingual neural machine translation models support fine-tuning hundreds of languages simultaneously. However, fine-tuning on full parameters solely is inefficient potentially leading to negative interactions among languages. In this work, we demonstrate that the fine-tuning for a language occurs in its intrinsic language-specific subspace with a tiny fraction of entire parameters. Thus, we propose language-specific LoRA to isolate intrinsic language-specific subspaces. Furthermore, we propose architecture learning techniques and introduce a gradual pruning schedule during fine-tuning to exhaustively explore the optimal setting and the minimal intrinsic subspaces for each language, resulting in a lightweight yet effective fine-tuning procedure. The experimental results on a 12-language subset and a 30-language subset of FLORES-101 show that our methods not only outperform full-parameter fine-tuning up to 2.25 spBLEU scores but also reduce trainable parameters to 0.4% for high and medium-resource languages and 1.6% for low-resource ones.

Monolingual Paraphrase Detection Corpus for Low Resource Pashto Language at Sentence Level
Iqra Ali | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Paraphrase detection is a task to identify if two sentences are semantically similar or not. It plays an important role in maintaining the integrity of written work such as plagiarism detection and text reuse detection. Formerly, researchers focused on developing large corpora for English. However, no research has been conducted on sentence-level paraphrase detection in low-resource Pashto language. To bridge this gap, we introduce the first fully manually annotated Pashto sentential paraphrase detection corpus collected from authentic cases in journalism covering 10 different domains, including Sports, Health, Environment, and more. Our proposed corpus contains 6,727 sentences, encompassing 3,687 paraphrased and 3,040 non-paraphrased. Experimental findings reveal that our proposed corpus is sufficient to train XLM-RoBERTa to accurately detect paraphrased sentence pairs in Pashto with an F1 score of 84%. To compare our corpus with those in other languages, we also applied our fine-tuned model to the Indonesian and English paraphrase datasets in a zero-shot manner, achieving F1 scores of 82% and 78%, respectively. This result indicates that the quality of our corpus is not less than commonly used datasets. It‘s a pioneering contribution to the field. We will publicize a subset of 1,800 instances from our corpus, free from any licensing issues.

Arukikata Travelogue Dataset with Geographic Entity Mention, Coreference, and Link Annotation
Shohei Higashiyama | Hiroki Ouchi | Hiroki Teranishi | Hiroyuki Otomo | Yusuke Ide | Aitaro Yamamoto | Hiroyuki Shindo | Yuki Matsuda | Shoko Wakamiya | Naoya Inoue | Ikuya Yamada | Taro Watanabe
Findings of the Association for Computational Linguistics: EACL 2024

Geoparsing is a fundamental technique for analyzing geo-entity information in text, which is useful for geographic applications, e.g., tourist spot recommendation. We focus on document-level geoparsing that considers geographic relatedness among geo-entity mentions and present a Japanese travelogue dataset designed for training and evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.

Evaluating Language Models in Location Referring Expression Extraction from Early Modern and Contemporary Japanese Texts
Ayuki Katayama | Yusuke Sakai | Shohei Higashiyama | Hiroki Ouchi | Ayano Takeuchi | Ryo Bando | Yuta Hashimoto | Toshinobu Ogiso | Taro Watanabe
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Automatic extraction of geographic information, including Location Referring Expressions (LREs), can aid humanities research in analyzing large collections of historical texts. In this study, to investigate how accurate pretrained Transformer language models (LMs) can extract LREs from historical texts, we evaluate two representative types of LMs, namely, masked language model and causal language model, using early modern and contemporary Japanese datasets. Our experimental results demonstrated the potential of contemporary LMs for historical texts, but also suggest the need for further model enhancement, such as pretraining on historical texts.

Cross-lingual Contextualized Phrase Retrieval
Huayang Li | Deng Cai | Zhi Qu | Qu Cui | Hidetaka Kamigaito | Lemao Liu | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2024

Phrase-level dense retrieval has shown many appealing characteristics in downstream NLP tasks by leveraging the fine-grained information that phrases offer. In our work, we propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval, which aims to augment cross-lingual applications by addressing polysemy using context information. However, the lack of specific training data and models are the primary challenges to achieve our goal. As a result, we extract pairs of cross-lingual phrases using word alignment information automatically induced from parallel sentences. Subsequently, we train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning, which encourages the hidden representations of phrases with similar contexts and semantics to align closely. Comprehensive experiments on both the cross-lingual phrase retrieval task and a downstream task, i.e, machine translation, demonstrate the effectiveness of CCPR. On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher. When utilizing CCPR to augment the large-language-model-based translator, it achieves average gains of 0.7 and 1.5 in BERTScore for translations from X=>En and vice versa, respectively, on WMT16 dataset. We will release our code and data.

Does Pre-trained Language Model Actually Infer Unseen Links in Knowledge Graph Completion?
Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Knowledge graphs (KGs) consist of links that describe relationships between entities. Due to the difficulty of manually enumerating all relationships between entities, automatically completing them is essential for KGs. Knowledge Graph Completion (KGC) is a task that infers unseen relationships between entities in a KG. Traditional embedding-based KGC methods (e.g. RESCAL, TransE, DistMult, ComplEx, RotatE, HAKE, HousE, etc.) infer missing links using only the knowledge from training data. In contrast, the recent Pre-trained Language Model (PLM)-based KGC utilizes knowledge obtained during pre-training, which means it can estimate missing links between entities by reusing memorized knowledge from pre-training without inference. This part is problematic because building KGC models aims to infer unseen links between entities. However, conventional evaluations in KGC do not consider inference and memorization abilities separately. Thus, a PLM-based KGC method, which achieves high performance in current KGC evaluations, may be ineffective in practical applications. To address this issue, we analyze whether PLM-based KGC methods make inferences or merely access memorized knowledge. For this purpose, we propose a method for constructing synthetic datasets specified in this analysis and conclude that PLMs acquire the inference abilities required for KGC through pre-training, even though the performance improvements mostly come from textual information of entities and relations.

mbrs: A Library for Minimum Bayes Risk Decoding
Hiroyuki Deguchi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Modeling Overregularization in Children with Small Language Models
Akari Haga | Saku Sugawara | Akiyo Fukatsu | Miyu Oba | Hiroki Ouchi | Taro Watanabe | Yohei Oseki
Findings of the Association for Computational Linguistics: ACL 2024

The imitation of the children’s language acquisition process has been explored to make language models (LMs) more efficient.In particular, errors caused by children’s regularization (so-called overregularization, e.g., using wroted for the past tense of write) have been widely studied to reveal the mechanisms of language acquisition. Existing research has analyzed regularization in language acquisition only by modeling word inflection directly, which is unnatural in light of human language acquisition. In this paper, we hypothesize that language models that imitate the errors children make during language acquisition have a learning process more similar to humans. To verify this hypothesis, we analyzed the learning curve and error preferences of verb inflections in small-scale LMs using acceptability judgments. We analyze the differences in results by model architecture, data, and tokenization. Our model shows child-like U-shaped learning curves clearly for certain verbs, but the preferences for types of overgeneralization did not fully match the observations in children.

2023

24-bit Languages
Yiran Wang | Taro Watanabe | Masao Utiyama | Yuji Matsumoto
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

A Closer Look at k-Nearest Neighbors Grammatical Error Correction
Justin Vasselli | Taro Watanabe
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

In various natural language processing tasks, such as named entity recognition and machine translation, example-based approaches have been used to improve performance by leveraging existing knowledge. However, the effectiveness of this approach for Grammatical Error Correction (GEC) is unclear. In this work, we explore how an example-based approach affects the accuracy and interpretability of the output of GEC systems and the trade-offs involved. The approach we investigate has shown great promise in machine translation by using the $k$-nearest translation examples to improve the results of a pretrained Transformer model. We find that using this technique increases precision by reducing the number of false positives, but recall suffers as the model becomes more conservative overall. Increasing the number of example sentences in the datastore does lead to better performing systems, but with diminishing returns and a high decoding cost. Synthetic data can be used as examples, but the effectiveness varies depending on the base model. Finally, we find that finetuning on a set of data may be more effective than using that data during decoding as examples.

Model-based Subsampling for Knowledge Graph Completion
Xincan Feng | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

NAISTeacher: A Prompt and Rerank Approach to Generating Teacher Utterances in Educational Dialogues
Justin Vasselli | Christopher Vasselli | Adam Nohejl | Taro Watanabe
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper presents our approach to the BEA 2023 shared task of generating teacher responses in educational dialogues, using the Teacher-Student Chatroom Corpus. Our system prompts GPT-3.5-turbo to generate initial suggestions, which are then subjected to reranking. We explore multiple strategies for candidate generation, including prompting for multiple candidates and employing iterative few-shot prompts with negative examples. We aggregate all candidate responses and rerank them based on DialogRPT scores. To handle consecutive turns in the dialogue data, we divide the task of generating teacher utterances into two components: teacher replies to the student and teacher continuations of previously sent messages. Through our proposed methodology, our system achieved the top score on both automated metrics and human evaluation, surpassing the reference human teachers on the latter.

NAIST-NICT WMT’23 General MT Task Submission
Hiroyuki Deguchi | Kenji Imamura | Yuto Nishida | Yusuke Sakai | Justin Vasselli | Taro Watanabe
Proceedings of the Eighth Conference on Machine Translation

In this paper, we describe our NAIST-NICT submission to the WMT’23 English ↔ Japanese general machine translation task. Our system generates diverse translation candidates and reranks them using a two-stage reranking system to find the best translation. First, we generated 50 candidates each from 18 translation methods using a variety of techniques to increase the diversity of the translation candidates. We trained seven models per language direction using various combinations of hyperparameters. From these models we used various decoding algorithms, ensembling the models, and using kNN-MT (Khandelwal et al., 2021). We processed the 900 translation candidates through a two-stage reranking system to find the most promising candidate. In the first step, we compared 50 candidates from each translation method using DrNMT (Lee et al., 2021) and returned the candidate with the best score. We ranked the final 18 candidates using COMET-MBR (Fernandes et al., 2022) and returned the best score as the system output. We found that generating diverse translation candidates improved translation quality using the well-designed reranker model.

Japanese Lexical Complexity for Non-Native Readers: A New Dataset
Yusuke Ide | Masato Mita | Adam Nohejl | Hiroki Ouchi | Taro Watanabe
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Lexical complexity prediction (LCP) is the task of predicting the complexity of words in a text on a continuous scale. It plays a vital role in simplifying or annotating complex words to assist readers. To study lexical complexity in Japanese, we construct the first Japanese LCP dataset. Our dataset provides separate complexity scores for Chinese/Korean annotators and others to address the readers’ L1-specific needs. In the baseline experiment, we demonstrate the effectiveness of a BERT-based system for Japanese LCP.

Second Language Acquisition of Neural Language Models
Miyu Oba | Tatsuki Kuribayashi | Hiroki Ouchi | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2023

With the success of neural language models (LMs), their language acquisition has gained much attention. This work sheds light on the second language (L2) acquisition of LMs, while previous work has typically explored their first language (L1) acquisition. Specifically, we trained bilingual LMs with a scenario similar to human L2 acquisition and analyzed their cross-lingual transfer from linguistic perspectives. Our exploratory experiments demonstrated that the L1 pretraining accelerated their linguistic generalization in L2, and language transfer configurations (e.g., the L1 choice, and presence of parallel texts) substantially affected their generalizations. These clarify their (non-)human-like L2 acquisition in particular aspects.

Subset Retrieval Nearest Neighbor Machine Translation
Hiroyuki Deguchi | Taro Watanabe | Yusuke Matsui | Masao Utiyama | Hideki Tanaka | Eiichiro Sumita
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

k-nearest-neighbor machine translation (kNN-MT) (Khandelwal et al., 2021) boosts the translation performance of trained neural machine translation (NMT) models by incorporating example-search into the decoding algorithm. However, decoding is seriously time-consuming, i.e., roughly 100 to 1,000 times slower than standard NMT, because neighbor tokens are retrieved from all target tokens of parallel data in each timestep. In this paper, we propose “Subset kNN-MT”, which improves the decoding speed of kNN-MT by two methods: (1) retrieving neighbor target tokens from a subset that is the set of neighbor sentences of the input sentence, not from all sentences, and (2) efficient distance computation technique that is suitable for subset neighbor search using a look-up table. Our proposed method achieved a speed-up of up to 132.2 times and an improvement in BLEU score of up to 1.6 compared with kNN-MT in the WMT’19 De-En translation task and the domain adaptation tasks in De-En and En-Ja.

Findings of the Word-Level AutoCompletion Shared Task in WMT 2023
Lemao Liu | Francisco Casacuberta | George Foster | Guoping Huang | Philipp Koehn | Geza Kovacs | Shuming Shi | Taro Watanabe | Chengqing Zong
Proceedings of the Eighth Conference on Machine Translation

This paper presents the overview of the second Word-Level autocompletion (WLAC) shared task for computer-aided translation, which aims to automatically complete a target word given a translation context including a human typed character sequence. We largely adhere to the settings of the previous round of the shared task, but with two main differences: 1) The typed character sequence is obtained from the typing process of human translators to demonstrate system performance under real-world scenarios when preparing some type of testing examples; 2) We conduct a thorough analysis on the results of the submitted systems from three perspectives. From the experimental results, we observe that translation tasks are helpful to improve the performance of WLAC models. Additionally, our further analysis shows that the semantic error accounts for a significant portion of all errors, and thus it would be promising to take this type of errors into account in future.

Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we propose a table and image generation task to verify how the knowledge about entities acquired from natural language is retained in Vision & Language (V & L) models. This task consists of two parts: the first is to generate a table containing knowledge about an entity and its related image, and the second is to generate an image from an entity with a caption and a table containing related knowledge of the entity. In both tasks, the model must know the entities used to perform the generation properly. We created the Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000 infoboxes in English Wikipedia articles to perform the proposed tasks. We evaluated the performance on the tasks with respect to the above research question using the V & L model OFA, which has achieved state-of-the-art results in multiple tasks. Experimental results show that OFA forgets part of its entity knowledge by pre-training as a complement to improve the performance of image related tasks.

2022

Visualizing the Relationship Between Encoded Linguistic Information and Task Performance
Jiannan Xiang | Huayang Li | Defu Lian | Guoping Huang | Taro Watanabe | Lemao Liu
Findings of the Association for Computational Linguistics: ACL 2022

Probing is popular to analyze whether linguistic information can be captured by a well-trained deep neural model, but it is hard to answer how the change of the encoded linguistic information will affect task performance. To this end, we study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality. Its key idea is to obtain a set of models which are Pareto-optimal in terms of both objectives. From this viewpoint, we propose a method to optimize the Pareto-optimal models by formalizing it as a multi-objective optimization problem. We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performances. Experimental results demonstrate that the proposed method is better than a baseline method. Our empirical findings suggest that some syntactic information is helpful for NLP tasks whereas encoding more syntactic information does not necessarily lead to better performance, because the model architecture is also an important factor.

Sharing Parameter by Conjugation for Knowledge Graph Embeddings in Complex Space
Xincan Feng | Zhi Qu | Yuchang Cheng | Taro Watanabe | Nobuhiro Yugami
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

A Knowledge Graph (KG) is the directed graphical representation of entities and relations in the real world. KG can be applied in diverse Natural Language Processing (NLP) tasks where knowledge is required. The need to scale up and complete KG automatically yields Knowledge Graph Embedding (KGE), a shallow machine learning model that is suffering from memory and training time consumption issues. To mitigate the computational load, we propose a parameter-sharing method, i.e., using conjugate parameters for complex numbers employed in KGE models. Our method improves memory efficiency by 2x in relation embedding while achieving comparable performance to the state-of-the-art non-conjugate models, with faster, or at least comparable, training time. We demonstrated the generalizability of our method on two best-performing KGE models 5^★E (CITATION) and ComplEx (CITATION) on five benchmark datasets.

Universal Dependencies Treebank for Tatar: Incorporating Intra-Word Code-Switching Information
Chihiro Taguchi | Sei Iwata | Taro Watanabe
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

This paper introduces a new Universal Dependencies treebank for the Tatar language named NMCTT. A significant feature of the corpus is that it includes code-switching (CS) information at a morpheme level, given the fact that Tatar texts contain intra-word CS between Tatar and Russian. We first outline NMCTT with a focus on differences from other treebanks of Turkic languages. Then, to evaluate the merit of the CS annotation, this study concisely reports the results of a language identification task implemented with Conditional Random Fields that considers POS tag information, which is readily available in treebanks in the CoNLL-U format. Experimenting on NMCTT and the Turkish-German CS treebank (SAGT), we demonstrate that the proposed annotation scheme introduced in NMCTT can improve the performance of the subword-level language identification. This annotation scheme for CS is not only universally applicable to languages with CS, but also shows a possibility to employ morphosyntactic information for CS-related downstream tasks.

Improving Discriminative Learning for Zero-Shot Relation Extraction
Van-Hien Tran | Hiroki Ouchi | Taro Watanabe | Yuji Matsumoto
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge

Zero-shot relation extraction (ZSRE) aims to predict target relations that cannot be observed during training. While most previous studies have focused on fully supervised relation extraction and achieved considerably high performance, less effort has been made towards ZSRE. This study proposes a new model incorporating discriminative embedding learning for both sentences and semantic relations. In addition, a self-adaptive comparator network is used to judge whether the relationship between a sentence and a relation is consistent. Experimental results on two benchmark datasets showed that the proposed method significantly outperforms the state-of-the-art methods.

JADES: New Text Simplification Dataset in Japanese Targeted at Non-Native Speakers
Akio Hayakawa | Tomoyuki Kajiwara | Hiroki Ouchi | Taro Watanabe
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

The user-dependency of Text Simplification makes its evaluation obscure. A targeted evaluation dataset clarifies the purpose of simplification, though its specification is hard to define. We built JADES (JApanese Dataset for the Evaluation of Simplification), a text simplification dataset targeted at non-native Japanese speakers, according to public vocabulary and grammar profiles. JADES comprises 3,907 complex-simple sentence pairs annotated by an expert. Analysis of JADES shows that wide and multiple rewriting operations were applied through simplification. Furthermore, we analyzed outputs on JADES from several benchmark systems and automatic and manual scores of them. Results of these analyses highlight differences between English and Japanese in operations and evaluations.

NAIST-NICT-TIT WMT22 General MT Task Submission
Hiroyuki Deguchi | Kenji Imamura | Masahiro Kaneko | Yuto Nishida | Yusuke Sakai | Justin Vasselli | Huy Hien Vu | Taro Watanabe
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this paper, we describe our NAIST-NICT-TIT submission to the WMT22 general machine translation task. We participated in this task for the English ↔ Japanese language pair. Our system is characterized as an ensemble of Transformer big models, k-nearest-neighbor machine translation (kNN-MT) (Khandelwal et al., 2021), and reranking.In our translation system, we construct the datastore for kNN-MT from back-translated monolingual data and integrate kNN-MT into the ensemble model. We designed a reranking system to select a translation from the n-best translation candidates generated by the translation system. We also use a context-aware model to improve the document-level consistency of the translation.

Adapting to Non-Centered Languages for Zero-shot Multilingual Translation
Zhi Qu | Taro Watanabe
Proceedings of the 29th International Conference on Computational Linguistics

Multilingual neural machine translation can translate unseen language pairs during training, i.e. zero-shot translation. However, the zero-shot translation is always unstable. Although prior works attributed the instability to the domination of central language, e.g. English, we supplement this viewpoint with the strict dependence of non-centered languages. In this work, we propose a simple, lightweight yet effective language-specific modeling method by adapting to non-centered languages and combining the shared information and the language-specific information to counteract the instability of zero-shot translation. Experiments with Transformer on IWSLT17, Europarl, TED talks, and OPUS-100 datasets show that our method not only performs better than strong baselines in centered data conditions but also can easily fit non-centered data conditions. By further investigating the layer attribution, we show that our proposed method can disentangle the coupled representation in the correct direction.

Residual Learning of Neural Text Generation with n-gram Language Model
Huayang Li | Deng Cai | Jin Xu | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2022

N-gram language models (LM) has been largely superseded by neural LMs as the latter exhibits better performance. However, we find that n-gram models can achieve satisfactory performance on a large proportion of testing cases, indicating they have already captured abundant knowledge of the language with relatively low computational cost. With this observation, we propose to learn a neural LM that fits the residual between an n-gram LM and the real-data distribution. The combination of n-gram LMs and neural LMs not only allows the neural part to focus on deeper understanding of the language, but also provides a flexible way to customize a LM by switching the underlying n-gram model without changing the neural model. Experimental results on three typical language tasks (i.e., language modeling, machine translation, and summarization) demonstrate that our approach attains additional performance gains over popular standalone neural models consistently. We also show that our approach allows for effective domain adaptation by simply switching to a domain-specific n-gram model, without any extra training.

Law Retrieval with Supervised Contrastive Learning Using the Hierarchical Structure of Law
Jungmin Choi | Ukyo Honda | Taro Watanabe | Hiroki Ouchi | Kentaro Inui
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

What Works and Doesn’t Work, A Deep Decoder for Neural Machine Translation
Zuchao Li | Yiran Wang | Masao Utiyama | Eiichiro Sumita | Hai Zhao | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2022

Deep learning has demonstrated performance advantages in a wide range of natural language processing tasks, including neural machine translation (NMT). Transformer NMT models are typically strengthened by deeper encoder layers, but deepening their decoder layers usually results in failure. In this paper, we first identify the cause of the failure of the deep decoder in the Transformer model. Inspired by this discovery, we then propose approaches to improving it, with respect to model structure and model training, to make the deep decoder practical in NMT. Specifically, with respect to model structure, we propose a cross-attention drop mechanism to allow the decoder layers to perform their own different roles, to reduce the difficulty of deep-decoder learning. For model training, we propose a collapse reducing training approach to improve the stability and effectiveness of deep-decoder training. We experimentally evaluated our proposed Transformer NMT model structure modification and novel training methods on several popular machine translation benchmarks. The results showed that deepening the NMT model by increasing the number of decoder layers successfully prevented the deepened decoder from degrading to an unconditional language model. In contrast to prior work on deepening an NMT model on the encoder, our method can deepen the model on both the encoder and decoder at the same time, resulting in a deeper model and improved performance.

Findings of the Word-Level AutoCompletion Shared Task in WMT 2022
Francisco Casacuberta | George Foster | Guoping Huang | Philipp Koehn | Geza Kovacs | Lemao Liu | Shuming Shi | Taro Watanabe | Chengqing Zong
Proceedings of the Seventh Conference on Machine Translation (WMT)

Recent years have witnessed rapid advancements in machine translation, but the state-of-the-art machine translation system still can not satisfy the high requirements in some rigorous translation scenarios. Computer-aided translation (CAT) provides a promising solution to yield a high-quality translation with a guarantee. Unfortunately, due to the lack of popular benchmarks, the research on CAT is not well developed compared with machine translation. In this year, we hold a new shared task called Word-level AutoCompletion (WLAC) for CAT in WMT. Specifically, we introduce some resources to train a WLAC model, and particularly we collect data from CAT systems as a part of test data for this shared task. In addition, we employ both automatic and human evaluations to measure the performance of the submitted systems, and our final evaluation results reveal some findings for the WLAC task.

2021

Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning
Ukyo Honda | Yoshitaka Ushiku | Atsushi Hashimoto | Taro Watanabe | Yuji Matsumoto
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs, but only with images and sentences drawn from different sources and object labels detected from the images. In previous work, pseudo-captions, i.e., sentences that contain the detected object labels, were assigned to a given image. The focus of the previous work was on the alignment of input images and pseudo-captions at the sentence level. However, pseudo-captions contain many words that are irrelevant to a given image. In this work, we investigate the effect of removing mismatched words from image-sentence alignment to determine how they make this task difficult. We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions: the detected object labels. The experimental results show that our proposed method outperforms the previous methods without introducing complex sentence-level learning objectives. Combined with the sentence-level alignment method of previous work, our method further improves its performance. These results confirm the importance of careful alignment in word-level details.

Neural Machine Translation with Synchronous Latent Phrase Structure
Shintaro Harada | Taro Watanabe
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

It is reported that grammatical information is useful for machine translation (MT) task. However, the annotation of grammatical information requires the highly human resources. Furthermore, it is not trivial to adapt grammatical information to MT since grammatical annotation usually adapts tokenization standards which might not be suitable to capture the relation of two languages, and the use of sub-word tokenization, e.g., Byte-Pair-Encoding, to alleviate out-of-vocabulary problem might not be compatible with those annotations. In this work, we propose two methods to explicitly incorporate grammatical information without supervising annotation; first, latent phrase structure is induced in an unsupervised fashion from a multi-head attention mechanism; second, the induced phrase structures in encoder and decoder are synchronized so that they are compatible with each other using constraints during training. We demonstrate that our approach produces better performance and explainability in two tasks, translation and alignment tasks without extra resources. Although we could not obtain the high quality phrase structure in constituency parsing when evaluated monolingually, we find that the induced phrase structures enhance the explainability of translation through the synchronization constraint.

User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization
Shohei Higashiyama | Masao Utiyama | Taro Watanabe | Eiichiro Sumita
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.

Zero Pronouns Identification based on Span prediction
Sei Iwata | Taro Watanabe | Masaaki Nagata
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

The presence of zero-pronoun (ZP) greatly affects the downstream tasks of NLP in pro-drop languages such as Japanese and Chinese. To tackle the problem, the previous works identified ZPs as sequence labeling on the word sequence or the linearlized tree nodes of the input. We propose a novel approach to ZP identification by casting it as a query-based argument span prediction task. Given a predicate as a query, our model predicts the omission with ZP. In the experiments, our model surpassed the sequence labeling baseline.

Nested Named Entity Recognition via Explicitly Excluding the Influence of the Best Path
Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto | Taro Watanabe
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper presents a novel method for nested named entity recognition. As a layered method, our method extends the prior second-best path recognition method by explicitly excluding the influence of the best path. Our method maintains a set of hidden states at each time step and selectively leverages them to build a different potential function for recognition at each level. In addition, we demonstrate that recognizing innermost entities first results in better performance than the conventional outermost entities first scheme. We provide extensive experimental results on ACE2004, ACE2005, and GENIA datasets to show the effectiveness and efficiency of our proposed method.

A Text Editing Approach to Joint Japanese Word Segmentation, POS Tagging, and Lexical Normalization
Shohei Higashiyama | Masao Utiyama | Taro Watanabe | Eiichiro Sumita
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Lexical normalization, in addition to word segmentation and part-of-speech tagging, is a fundamental task for Japanese user-generated text processing. In this paper, we propose a text editing model to solve the three task jointly and methods of pseudo-labeled data generation to overcome the problem of data deficiency. Our experiments showed that the proposed model achieved better normalization performance when trained on more diverse pseudo-labeled data.

Structured Refinement for Sequential Labeling
Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Transliteration for Low-Resource Code-Switching Texts: Building an Automatic Cyrillic-to-Latin Converter for Tatar
Chihiro Taguchi | Yusuke Sakai | Taro Watanabe
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

We introduce a Cyrillic-to-Latin transliterator for the Tatar language based on subword-level language identification. The transliteration is a challenging task due to the following two reasons. First, because modern Tatar texts often contain intra-word code-switching to Russian, a different transliteration set of rules needs to be applied to each morpheme depending on the language, which necessitates morpheme-level language identification. Second, the fact that Tatar is a low-resource language, with most of the texts in Cyrillic, makes it difficult to prepare a sufficient dataset. Given this situation, we proposed a transliteration method based on subword-level language identification. We trained a language classifier with monolingual Tatar and Russian texts, and applied different transliteration rules in accord with the identified language. The results demonstrate that our proposed method outscores other Tatar transliteration tools, and imply that it correctly transcribes Russian loanwords to some extent.

Dependency Patterns of Complex Sentences and Semantic Disambiguation for Abstract Meaning Representation Parsing
Yuki Yamamoto | Yuji Matsumoto | Taro Watanabe
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Abstract Meaning Representation (AMR) is a sentence-level meaning representation based on predicate argument structure. One of the challenges we find in AMR parsing is to capture the structure of complex sentences which expresses the relation between predicates. Knowing the core part of the sentence structure in advance may be beneficial in such a task. In this paper, we present a list of dependency patterns for English complex sentence constructions designed for AMR parsing. With a dedicated pattern matcher, all occurrences of complex sentence constructions are retrieved from an input sentence. While some of the subordinators have semantic ambiguities, we deal with this problem through training classification models on data derived from AMR and Wikipedia corpus, establishing a new baseline for future works. The developed complex sentence patterns and the corresponding AMR descriptions will be made public.

2020

Coordination Boundary Identification without Labeled Data for Compound Terms Disambiguation
Yuya Sawada | Takashi Wada | Takayoshi Shibahara | Hiroki Teranishi | Shuhei Kondo | Hiroyuki Shindo | Taro Watanabe | Yuji Matsumoto
Proceedings of the 28th International Conference on Computational Linguistics

We propose a simple method for nominal coordination boundary identification. As the main strength of our method, it can identify the coordination boundaries without training on labeled data, and can be applied even if coordination structure annotations are not available. Our system employs pre-trained word embeddings to measure the similarities of words and detects the span of coordination, assuming that conjuncts share syntactic and semantic similarities. We demonstrate that our method yields good results in identifying coordinated noun phrases in the GENIA corpus and is comparable to a recent supervised method for the case when the coordinator conjoins simple noun phrases.

2018

Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection
Wei Wang | Taro Watanabe | Macduff Hughes | Tetsuji Nakagawa | Ciprian Chelba
Proceedings of the Third Conference on Machine Translation: Research Papers

Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and tries to reduce the negative impact of data noise on MT training, in particular, neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT to train on data with severe noise.

2017

Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Greg Kondrak | Taro Watanabe
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Greg Kondrak | Taro Watanabe
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2016

Phrase-based Machine Translation using Multiple Preordering Candidates
Yusuke Oda | Taku Kudo | Tetsuji Nakagawa | Taro Watanabe
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose a new decoding method for phrase-based statistical machine translation which directly uses multiple preordering candidates as a graph structure. Compared with previous phrase-based decoding methods, our method is based on a simple left-to-right dynamic programming in which no decoding-time reordering is performed. As a result, its runtime is very fast and implementing the algorithm becomes easy. Our system does not depend on specific preordering methods as long as they output multiple preordering candidates, and it is trivial to employ existing preordering methods into our system. In our experiments for translating diverse 11 languages into English, the proposed method outperforms conventional phrase-based decoder in terms of translation qualities under comparable or faster decoding time.

Optimization for Statistical Machine Translation: A Survey
Graham Neubig | Taro Watanabe
Computational Linguistics, Volume 42, Issue 1 - March 2016

2015

Transition-based Neural Constituent Parsing
Taro Watanabe | Eiichiro Sumita
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Leave-one-out Word Alignment without Garbage Collector Effects
Xiaolin Wang | Masao Utiyama | Andrew Finch | Taro Watanabe | Eiichiro Sumita
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Hierarchical Back-off Modeling of Hiero Grammar based on Non-parametric Bayesian Model
Hidetaka Kamigaito | Taro Watanabe | Hiroya Takamura | Manabu Okumura | Eiichiro Sumita
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Unsupervised Word Alignment Using Frequency Constraint in Posterior Regularized EM
Hidetaka Kamigaito | Taro Watanabe | Hiroya Takamura | Manabu Okumura
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recurrent Neural Network-based Tuple Sequence Model for Machine Translation
Youzheng Wu | Taro Watanabe | Chiori Hori
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Syntax-Augmented Machine Translation using Syntax-Label Clustering
Hideya Mino | Taro Watanabe | Eiichiro Sumita
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recurrent Neural Networks for Word Alignment Model
Akihiro Tamura | Taro Watanabe | Eiichiro Sumita
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The NICT translation system for IWSLT 2014
Xiaolin Wang | Andrew Finch | Masao Utiyama | Taro Watanabe | Eiichiro Sumita
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes NICT’s participation in the IWSLT 2014 evaluation campaign for the TED Chinese-English translation shared-task. Our approach used a combination of phrase-based and hierarchical statistical machine translation (SMT) systems. Our focus was in several areas, specifically system combination, word alignment, and various language modeling techniques including the use of neural network joint models. Our experiments on the test set from the 2013 shared task, showed that an improvement in BLEU score can be gained in translation performance through all of these techniques, with the largest improvements coming from using large data sizes to train the language model.

2013

Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
Akihiro Tamura | Taro Watanabe | Eiichiro Sumita | Hiroya Takamura | Manabu Okumura
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Additive Neural Networks for Statistical Machine Translation
Lemao Liu | Taro Watanabe | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Tuning SMT with a Large Number of Features via Online Feature Grouping
Lemao Liu | Tiejun Zhao | Taro Watanabe | Eiichiro Sumita
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Hierarchical Phrase Table Combination for Machine Translation
Conghui Zhu | Taro Watanabe | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Expected Error Minimization with Ultraconservative Update for SMT
Lemao Liu | Tiejun Zhao | Taro Watanabe | Hailong Cao | Conghui Zhu
Proceedings of COLING 2012: Posters

Locally Training the Log-Linear Model for SMT
Lemao Liu | Hailong Cao | Taro Watanabe | Tiejun Zhao | Mo Yu | Conghui Zhu
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Machine Translation without Words through Substring Alignment
Graham Neubig | Taro Watanabe | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Head-driven Transition-based Parsing with Top-down Prediction
Katsuhiko Hayashi | Taro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
Akihiro Tamura | Taro Watanabe | Eiichiro Sumita
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Workshop on Monolingual Machine Translation
Tsuyoshi Okita | Artem Sokolov | Taro Watanabe
Workshop on Monolingual Machine Translation

Optimized Online Rank Learning for Machine Translation
Taro Watanabe
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Inducing a Discriminative Parser to Optimize Machine Translation Reordering
Graham Neubig | Taro Watanabe | Shinsuke Mori
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Third-order Variational Reranking on Packed-Shared Dependency Forests
Katsuhiko Hayashi | Taro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Machine Translation System Combination by Confusion Forest
Taro Watanabe | Eiichiro Sumita
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

An Unsupervised Model for Joint Phrase Alignment and Extraction
Graham Neubig | Taro Watanabe | Eiichiro Sumita | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

The NICT translation system for IWSLT 2010
Chooi-Ling Goh | Taro Watanabe | Michael Paul | Andrew Finch | Eiichiro Sumita
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes NICT’s participation in the IWSLT 2010 evaluation campaign for the DIALOG translation (Chinese-English) and the BTEC (French-English) translation shared-tasks. For the DIALOG translation, the main challenge to this task is applying context information during translation. Context information can be used to decide on word choice and also to replace missing information during translation. We applied discriminative reranking using contextual information as additional features. In order to provide more choices for re-ranking, we generated n-best lists from multiple phrase-based statistical machine translation systems that varied in the type of Chinese word segmentation schemes used. We also built a model that merged the phrase tables generated by the different segmentation schemes. Furthermore, we used a lattice-based system combination model to combine the output from different systems. A combination of all of these systems was used to produce the n-best lists for re-ranking. For the BTEC task, a general approach that used latticebased system combination of two systems, a standard phrasebased system and a hierarchical phrase-based system, was taken. We also tried to process some unknown words by replacing them with the same words but different inflections that are known to the system.

2009

A Succinct N-gram Language Model
Taro Watanabe | Hajime Tsukada | Hideki Isozaki
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Structural support vector machines for log-linear approach in statistical machine translation
Katsuhiko Hayashi | Taro Watanabe | Hajime Tsukada | Hideki Isozaki
Proceedings of the 6th International Workshop on Spoken Language Translation: Papers

Minimum error rate training (MERT) is a widely used learning method for statistical machine translation. In this paper, we present a SVM-based training method to enhance generalization ability. We extend MERT optimization by maximizing the margin between the reference and incorrect translations under the L2-norm prior to avoid overfitting problem. Translation accuracy obtained by our proposed methods is more stable in various conditions than that obtained by MERT. Our experimental results on the French-English WMT08 shared task show that degrade of our proposed methods is smaller than that of MERT in case of small training data or out-of-domain test data.

2008

NTT statistical machine translation system for IWSLT 2008.
Katsuhito Sudoh | Taro Watanabe | Jun Suzuki | Hajime Tsukada | Hideki Isozaki
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

The NTT Statistical Machine Translation System consists of two primary components: a statistical machine translation decoder and a reranker. The decoder generates k-best translation canditates using a hierarchical phrase-based translation based on synchronous context-free grammar. The decoder employs a linear feature combination among several real-valued scores on translation and language models. The reranker reorders the k-best translation candidates using Ranking SVMs with a large number of sparse features. This paper describes the two components and presents the results for the evaluation campaign of IWSLT 2008.

2007

Larger feature set approach for machine translation in IWSLT 2007
Taro Watanabe | Jun Suzuki | Katsuhito Sudoh | Hajime Tsukada | Hideki Isozaki
Proceedings of the Fourth International Workshop on Spoken Language Translation

The NTT Statistical Machine Translation System employs a large number of feature functions. First, k-best translation candidates are generated by an efficient decoding method of hierarchical phrase-based translation. Second, the k-best translations are reranked. In both steps, sparse binary features — of the order of millions — are integrated during the search. This paper gives the details of the two steps and shows the results for the Evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2007.

Online Large-Margin Training for Statistical Machine Translation
Taro Watanabe | Jun Suzuki | Hajime Tsukada | Hideki Isozaki
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

NTT statistical machine translation for IWSLT 2006
Taro Watanabe | Jun Suzuki | Hajime Tsukada | Hideki Isozaki
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

NTT System Description for the WMT2006 Shared Task
Taro Watanabe | Hajime Tsukada | Hideki Isozaki
Proceedings on the Workshop on Statistical Machine Translation

Left-to-Right Target Generation for Hierarchical Phrase-Based Translation
Taro Watanabe | Hajime Tsukada | Hideki Isozaki
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

The NTT Statistical Machine Translation System for IWSLT2005
Hajime Tsukada | Taro Watanabe | Jun Suzuki | Hideto Kazawa | Hideki Isozaki
Proceedings of the Second International Workshop on Spoken Language Translation

Empirical Study of Utilizing Morph-Syntactic Information in SMT
Young-Sook Hwang | Taro Watanabe | Yutaka Sasaki
Second International Joint Conference on Natural Language Processing: Full Papers

2004

A Unified Approach in Speech-to-Speech Translation: Integrating Features of Speech recognition and Machine Translation
Ruiqiang Zhang | Genichiro Kikui | Hirofumi Yamamoto | Frank Soong | Taro Watanabe | Wai Kit Lo
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

EBMT, SMT, hybrid and more: ATR spoken language translation system
Eiichiro Sumita | Yasuhiro Akiba | Takao Doi | Andrew Finch | Kenji Imamura | Hideo Okuma | Michael Paul | Mitsuo Shimohata | Taro Watanabe
Proceedings of the First International Workshop on Spoken Language Translation: Evaluation Campaign

Reordering Constraints for Phrase-Based Statistical Machine Translation
Richard Zens | Hermann Ney | Taro Watanabe | Eiichiro Sumita
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Example-based Machine Translation Based on Syntactic Transfer with Statistical Models
Kenji Imamura | Hideo Okuma | Taro Watanabe | Eiichiro Sumita
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

Chunk-Based Statistical Translation
Taro Watanabe | Eiichiro Sumita | Hiroshi G. Okuno
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

A corpus-centered approach to spoken language translation
Eiichiro Sumita | Yasuhiro Akiba | Takao Doi | Andrew Finch | Kenji Imamura | Michael Paul | Mitsuo Shimohata | Taro Watanabe
10th Conference of the European Chapter of the Association for Computational Linguistics

Example-based decoding for statistical machine translation
Taro Watanabe | Eiichiro Sumita
Proceedings of Machine Translation Summit IX: Papers

This paper presents a decoder for statistical machine translation that can take advantage of the example-based machine translation framework. The decoder presented here is based on the greedy approach to the decoding problem, but the search is initiated from a similar translation extracted from a bilingual corpus. The experiments on multilingual translations showed that the proposed method was far superior to a word-by-word generation beam search algorithm.

2002

Using Language and Translation Models to Select the Best among Outputs from Multiple MT Systems
Yasuhiro Akiba | Taro Watanabe | Eiichiro Sumita
COLING 2002: The 19th International Conference on Computational Linguistics

Bidirectional Decoding for Statistical Machine Translation
Taro Watanabe | Eiichiro Sumita
COLING 2002: The 19th International Conference on Computational Linguistics

Language Model Adaptation with Additional Text Generated by Machine Translation
Hideharu Nakajima | Hirofumi Yamamoto | Taro Watanabe
COLING 2002: The 19th International Conference on Computational Linguistics

Statistical machine translation based on hierarchical phrase alignment
Taro Watanabe | Kenji Imamura | Eiichiro Sumita
Proceedings of the 9th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

Statistical Machine Translation on Paraphrased Corpora
Taro Watanabe | Mitsuo Shimohata | Eiichiro Sumita
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

Evaluation of a Practical Interlingua for Task-Oriented Dialogue
Lori Levin | Donna Gates | Alon Lavie | Fabio Pianesi | Dorcas Wallace | Taro Watanabe
NAACL-ANLP 2000 Workshop: Applied Interlinguas: Practical Applications of Interlingual Approaches to NLP

Lessons Learned from a Task-based Evaluation of Speech-to-Speech Machine Translation
Lori Levin | Boris Bartlog | Ariadna Font Llitjos | Donna Gates | Alon Lavie | Dorcas Wallace | Taro Watanabe | Monika Woszczyna
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Co-authors

Katsuhiko Hayashi 14

Kazuki Hayashi 10

Yuji Matsumoto 10

Masao Utiyama 10

Hiroyuki Deguchi 9

Hideki Isozaki 9

Hajime Tsukada 9

Haruki Sakajo 7

Shohei Higashiyama 6

Kenji Imamura 6

Shintaro Ozaki 6

Hiroyuki Shindo 6

Hiroki Teranishi 6

Frederikus Hudi 5

Tiejun Zhao (赵铁军) 5

Wataru Hashimoto 4

Graham Neubig 4

Hideki Tanaka 4

Yasuhiro Akiba 3

Chenchen Ding 3

Guoping Huang 3

Eunike Andriani Kardinata 3

Shinsuke Mori 3

Manabu Okumura 3

Ashmari Pramodya 3

Mitsuo Shimohata 3

Chihiro Taguchi 3

Hiroya Takamura 3

Akihiro Tamura 3

Masayuki Asahara 2

Francisco Casacuberta 2

George Foster 2

Tsuyoshi Fujita 2

Akiyo Fukatsu 2

Akio Hayakawa 2

Ayuki Katayama 2

Tatsuya Kawahara 2

Philipp Koehn 2

Arturo Martínez Peguero 2

Makoto Morishita 2

Masaaki Nagata 2

Tetsuji Nakagawa 2

Hibiki Nakatani 2

Hiroyuki Otomo 2

Maria Angelica Riera Machin 2

Katsuhito Sudoh 2

Saku Sugawara 2

Dorcas Wallace 2

Aitaro Yamamoto 2

Hirofumi Yamamoto 2

Chengqing Zong 2

David Ifeoluwa Adelani 1

Muhammad Farid Adilazuarda 1

Alham Fikri Aji 1

Afifa Amriani 1

David Anugraha 1

Michael Anugraha 1

Jesse Atuhurra 1

Rio Alexander Audino 1

Boris Bartlog 1

Samuel Cahyawijaya 1

Ciprian Chelba 1

Yuchang Cheng 1

Emmanuele Chersoni 1

Jan Christian Blaise Cruz 1

Ariadna Font Llitjós 1

Chooi-Ling Goh 1

Jiangnan Hang 1

Shintaro Harada 1

Atsushi Hashimoto 1

Yuta Hashimoto 1

Jacob Hoffman 1

Macduff Hughes 1

Young-Sook Hwang 1

Fariz Ikhwantri 1

Patrick Amadeus Irawan 1

Masayuki Ishii 1

Tomoyuki Jinno 1

Tomoyuki Kajiwara 1

Masahiro Kaneko 1

Hideto Kazawa 1

Genichiro Kikui 1

Tatsuki Kuribayashi 1

Shuhei Kurita 1

Garry Kuwanto 1

Cheng Ching Lam 1

En-Shiun Annie Lee 1

Peerat Limkonchotiwat 1

Chamila Liyanage 1

Yuhi Matogawa 1

Yusuke Matsui 1

Candy Olivia Mawalim 1

Taiki Miyanishi 1

Hideharu Nakajima 1

Nirasha Nelki 1

Chong-Wah Ngo 1

Toshinobu Ogiso 1

Tsuyoshi Okita 1

Hiroshi G. Okuno 1

Kazuma Onishi 1

Nedjma Ousidhoum 1

Masafumi Oyamada 1

Fabio Pianesi 1

Ubaidillah Ariq Prathama 1

Ayu Purwarianti 1

Randil Pushpananda 1

Jan Wira Gotama Putra 1

Rifki Afina Putri 1

Shigeki Saito 1

Stephanie Yulia Salim 1

Natasha Christabelle Santosa 1

Enrico Santus 1

Armin Sarhangzadeh 1

Yutaka Sasaki 1

Heshan Shalinda 1

Takayoshi Shibahara 1

Artem Sokolov 1

Frank K. Soong 1

Junehwan Sung 1

Lucky Susanto 1

Hiroshi Takato 1

Ayano Takeuchi 1

Joshua Tanner 1

Masayo Tomita 1

Van-Hien Tran 1

Ryo Tsujimoto 1

Hiroshi Tsutsui 1

Yoshitaka Ushiku 1

Christopher Vasselli 1

Shoko Wakamiya 1

Ruvan Weerasinghe 1

Haryo Akbarianto Wibowo 1

Derry Tanti Wijaya 1

Genta Indra Winata 1

Monika Woszczyna 1

Jiannan Xiang 1

Yuki Yamamoto 1

Yuichiro Yasui 1

Daiki Yoshida 1

Nobuhiro Yugami 1

Ruiqiang Zhang 1

Shi-Xiong Zhang 1

Marina Zhukova 1

Venues