Yutaka Matsuo - ACL Anthology

Yutaka Matsuo

2026

Position Encoding with Random Float Sampling Enhances Length Generalization of Transformers
Atsushi Shimizu | Shohei Taniguchi | Yutaka Matsuo
Findings of the Association for Computational Linguistics: EACL 2026

Length generalization is the ability of language models to maintain performance on inputs longer than those seen during pretraining. In this work, we introduce a simple yet powerful position encoding (PE) strategy, Random Float Sampling (RFS), that generalizes well to lengths unseen during pretraining or fine-tuning. In particular, instead of selecting position indices from a predefined discrete set, RFS uses randomly sampled continuous values, thereby avoiding out-of-distribution (OOD) issues on unseen lengths by exposing the model to diverse indices during training. Since assigning indices to tokens is a common and fundamental procedure in widely used PEs, the advantage of RFS can easily be incorporated into, for instance, the absolute sinusoidal encoding, RoPE, and ALiBi. Experiments corroborate its effectiveness by showing that RFS results in superior performance in length generalization tasks as well as zero-shot commonsense reasoning benchmarks.

Investigating the Multilingual Calibration Effects of Language Model Instruction Tuning
Jerry Huang | Peng Lu | Qiuhao Zeng | Yusuke Iwasawa | Yutaka Matsuo | Sarath Chandar | Edison Marrese-Taylor | Irene Li
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential in maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how the data scarcity can potentially lead to different calibration effects and how commonly used techniques can apply in these settings. Our analysis on two multilingual benchmarks, over 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration, highlighting a critical shortcoming of standard SFT for multilingual languages. Furthermore, we observe that the use of label smoothing to be a reasonable method alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.

Revealing Redundant Syntax in Large Language Models through Multi-Hop Dependency Paths
Masaki Sashida | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Findings of the Association for Computational Linguistics: EACL 2026

Prior work on attention–syntax alignment has largely focused on single-hop Universal Dependency edges (DPs). In this paper, we treat short multi-hop dependency paths (MDPs) (e.g., “obl+case”) as first-class units and analyze them alongside DPs. Across three pretrained autoregressive LMs (GPT-2 XL, Llama 3 8B, Qwen3-8B) and one encoder baseline (BERT-large), we extract 2–3 hop MDPs from UD-parsed English and quantify head–relation alignment with an Unlabeled Attachment Score (UAS)–style metric modified for causal masking in decoder-only models. Rank visualizations reveal both overlap and specialization: we observe heads that align with both DPs and MDPs, as well as heads that appear specialized for one route. To test functional relevance, we first identify heads by UAS and then apply an undifferentiated (uniform) attention ablation to those heads; we evaluate the impact on BLiMP and LAMBADA. Ablating the top 10% of all heads shows that MDP-selected heads induce larger drops than DP-selected heads and that the union (“Mix”) of DP- and MDP-selected heads yields the largest drops. For GPT-2 XL, the observed drops are (BLiMP: 𝛥DP = 1.35 pp, 𝛥MDP = 4.81 pp, 𝛥Mix = 7.11 pp; LAMBADA: 𝛥DP = 4.70 pp, 𝛥MDP = 25.17 pp, 𝛥Mix = 32.99 pp), all exceeding size-matched random controls. These results indicate that models can route information consistent with syntactic dependencies via both DP and MDP pathways, with MDPs playing a distinct and measurable role in some settings under our interventions.

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
Qi Cao | Andrew Gambardella | Takeshi Kojima | Yutaka Matsuo | Yusuke Iwasawa
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, their limited truthfulness and tendency toward overconfidence constrain their reliability in factual tasks. Uncertainty quantification offers a promising approach to identifying potentially unreliable outputs from LLMs. Yet most existing methods rely on repeated sampling or auxiliary models, which substantially increase computational overhead. To address these limitations, we propose an efficient uncertainty quantification method that leverages semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters based on embedding clustering and prefix matching, and compute a cluster-based score at each decoding step to represent uncertainty. Our approach requires only a single generation and does not depend on any auxiliary models. Experiments on multiple datasets and models demonstrate that our method achieves performance comparable to existing baselines while substantially reducing computational overhead.

Infinity-MoE: Generalizing Mixture of Experts to Infinite Experts
Shota Takashiro | Takeshi Kojima | Shohei Taniguchi | Yusuke Iwasawa | Yutaka Matsuo
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

The Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. In conventional MoE, each expert is treated as entirely independent, and experts are combined in a discrete space. As a result, when the number of experts increases, it becomes difficult to train each expert effectively. To stabilize training while increasing the number of experts, we propose ∞-MoE that selects a portion of the parameters of large FFNs based on continuous values sampled for each token. By considering experts in a continuous space, this approach allows for an infinite number of experts while maintaining computational efficiency. Experiments show that a GPT-2 Small-based ∞-MoE model, with 129M active and 186M total parameters, achieves comparable performance to a dense GPT-2 Medium with 350M parameters. Adjusting the number of sampled experts at inference time allows for a flexible trade-off between accuracy and speed, with an improvement of up to 2.5% in accuracy over conventional MoE.

2025

Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-lingual reasoning abilities. This dual limitation makes it challenging to assess LLMs’ performance in the multilingual setting comprehensively. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-lingual comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, particularly for African languages. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.

Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Andrew Gambardella | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the “first person psych predicate restriction” grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab’s uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3’s perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.

When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following
Keno Harada | Yudai Yamazaki | Masachika Taniguchi | Edison Marrese-Taylor | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Findings of the Association for Computational Linguistics: EMNLP 2025

As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where multiple instructions following is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, given the fact that evaluating all the possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.

Dynamic Injection of Entity Knowledge into Dense Retrievers
Ikuya Yamada | Ryokan Ri | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Findings of the Association for Computational Linguistics: EMNLP 2025

Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets demonstrate that KPR consistently improves retrieval accuracy, with particularly large gains on the EntityQuestions dataset. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Models and code are released at https://github.com/knowledgeable-embedding/knowledgeable-embedding.

Lost in the Distance: Large Language Models Struggle to Capture Long-Distance Relational Knowledge
Meiyun Wang | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Findings of the Association for Computational Linguistics: NAACL 2025

Large language models (LLMs) have demonstrated impressive capabilities in handling long contexts, but challenges remain in capturing relational knowledge spread far apart within text. Connecting long-distance knowledge is important for solving tasks as the context length increases: imagine reading a lengthy detective novel where seemingly trivial information introduced early on often becomes essential during the climactic reveal of the culprit. In this study, we expose the ”Lost in the Distance” phenomenon, where LLM performance of capturing the relational knowledge degrades significantly when the relational knowledge is separated by noise, i.e., unrelated sentences to solve a task. Specifically, we design an experiment in which we insert artificial noise between two related elements and observe model performance as the distance between them increases. Our findings show that while LLMs can handle edge noise with little impact, their ability to reason about distant relationships declines sharply as the intervening noise grows. These findings are consistent in both forward-looking prediction and backward-looking prediction settings. We validate this across various models (GPT-4, Gemini-1.5-pro, GPT-4o-mini, Gemini-1.5-flash, Claude-3.5-Sonnet) and tasks (causal reasoning and knowledge extraction). These results reveal a significant limitation in how LLMs process relational knowledge over long contexts. We release our code and data to support further research.

An in-depth human study of the mathematical reasoning abilities in Large Language Models
Carolina Dias-Alexiou | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)

We study the generalization capabilities of large language models (LLM) through the lens of mathematical reasoning, asking if these models can recognize that two structures are the same even when they do not share the same nomenclature. We propose a human study to evaluate if LLMs reproduce proofs that they have most likely seen during training, but when the symbols do not match the ones seen. To test this in a controlled scenario, we look at proofs in propositional calculus, foundational for other logic systems, semantically complete and widely discussed online. We replace the implication operator (→) with an unrelated, arbitrary symbol (♠) and ask experts to evaluate how the output of a selection of LLMs changes in terms of compliance, correctness, extensiveness and coherence. Our results show that nearly all our tested models produce lower quality proofs in this test, in particular open-weights models, suggesting the abilities of these LLMs to reason in this context have important limitations.

Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning
Shota Takashiro | Takeshi Kojima | Andrew Gambardella | Qi Cao | Yusuke Iwasawa | Yutaka Matsuo
Findings of the Association for Computational Linguistics: ACL 2025

As large language models (LLMs) are applied across diverse domains, the ability to selectively unlearn specific information is becoming increasingly essential. For instance, LLMs are expected to selectively provide confidential information to authorized internal users, such as employees or trusted partners, while withholding it from external users, including the general public and unauthorized entities.Therefore, we propose a novel method termed ìn-context knowledge unlearning”, which enables the model to selectively forget information in test-time based on the query context.Our method fine-tunes pre-trained LLMs to enable prompt unlearning of target knowledge within the context, while preserving unrelated information. Experiments on TOFU, AGE and RWKU datasets using Llama2-7B/13B and Mistral-7B models demonstrate that our method achieves up to 95% forget accuracy while retaining 80% of unrelated knowledge, significantly outperforming baselines in both in-domain and out-of-domain scenarios.Further investigation of the model’s internal behavior revealed that while fine-tuned LLMs generate correct predictions in the middle layers and preserve them up to the final layer. However, the decision to forget is made only at the last layer, i.e. LLMs pretend to forget”.Our findings offer valuable insight into the improvement of the robustness of the unlearning mechanisms in LLMs, laying a foundation for future research in the field.

Crypto-LLM: Two-Stage Language Model Pre-training with Ciphered and Natural Language Data
Yohei Kobashi | Fumiya Uchiyama | Takeshi Kojima | Andrew Gambardella | Qi Cao | Yusuke Iwasawa | Yutaka Matsuo
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

As the adoption of large language models (LLMs) continues to grow, the risk of sensitive data leakage from their training datasets has become a critical concern. This study proposes a novel method for encrypting training data using a polyalphabetic substitution cipher. This approach prevents the model from learning sensitive information while allowing it to capture abstract linguistic patterns. We pre-trained a Llama 3 model (551M parameters) using approximately 7.5 billion tokens of encrypted data and subsequently conducted continual pre-training with another 2.5 billion tokens of plaintext data. The effectiveness of the model was evaluated by comparing its downstream task performance with a model trained solely on plaintext data. In addition, we evaluated the risk of sensitive data leakage through name reconstruction, true-prefix and data extraction attacks. These results demonstrate the potential of our approach to balance data security with model performance.

Language Models can Categorize System Inputs for Performance Analysis
Dominic Sobhani | Ruiqi Zhong | Edison Marrese-Taylor | Keisuke Sakaguchi | Yutaka Matsuo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Language model systems are used to process diverse categories of input requests, ranging from improving creative writing to solving programming challenges. It would be useful to know which categories they are good at. However, existing evaluations compare model performance on pre-defined categories, failing to reflect a system’s performance on finer-grained or novel ones. We propose to automatically search for finer-grained categories based on inputs where a system performs well or poorly, and describe them in natural language. To search for these categories, we propose a large number of candidate category descriptions, e.g. “Communication Improvement”, find the subset of inputs that match the category descriptions, and calculate the performance on these categories; then we sort these categories based on their performance, thereby highlighting those that score high or low. As one application, we apply our method to compare LLaMA 3-70B and Claude 3 Opus, which have similar Elo-ratings on Chatbot Arena; our method finds the former is weaker at making text more professional and humorous while better at providing psychological insights, depicting a more nuanced picture of model performance.

ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA
Zhao Xinjie | Fan Gao | Xingyu Song | Yingjian Chen | Rui Yang | Yanran Fu | Yuyang Wang | Yusuke Iwasawa | Yutaka Matsuo | Irene Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Multi-hop question answering (QA) remains challenging, as solutions must reliably integrate and reconcile evidence from multiple sources without succumbing to error propagation. While large language models (LLMs) have achieved substantial improvements via chain-of-thought (CoT) prompting and retrieval-augmented generation, these methods typically adopt a forward-only workflow—early mistakes persist throughout inference, and contradictions discovered later cannot systematically trigger re-evaluation. To address this limitation, we present ReAgent, a reversible multi-agent reasoning framework. Specifically, ReAgent enables agents to backtrack to earlier valid states when conflicts arise, thereby isolating and rectifying flawed assumptions before they undermine subsequent reasoning. Our approach combines explicit local and global rollback protocols with modular role specialization, resulting in a flexible and error-tolerant pipeline. Empirical evaluation on three multi-hop QA benchmarks demonstrates consistent performance gains of approximately 6% over forward-only baselines, in addition to enhanced interpretability. These findings highlight the value of non-monotonic, backtracking-driven inference in complex QA scenarios and point to broader implications for multi-agent collaboration in knowledge-intensive tasks.

Slender-Mamba: Fully Quantized Mamba in 1.58 Bits From Head to Toe
Zhenxuan Yu | Takeshi Kojima | Yutaka Matsuo | Yusuke Iwasawa
Proceedings of the 31st International Conference on Computational Linguistics

Large language models (LLMs) have achieved significant performance improvements in natural language processing (NLP) domain. However, these models often require large computational resources for training and inference. Recently, Mamba, a language model architecture based on State-Space Models (SSMs), has achieved comparable performance to Transformer models while significantly reducing costs by compressing context windows during inference. We focused on the potential of the lightweight Mamba architecture by applying BitNet quantization method to the model architecture. In addition, while prior BitNet methods generally quantized only linear layers in the main body, we extensively quantized the embedding and projection layers considering their significant proportion of model parameters. In our experiments, we applied ternary quantization to the Mamba-2 (170M) architecture and pre-trained the model with 150 B tokens from scratch. Our method achieves approximately 90.0% reduction in the bits used by all parameters, achieving a significant improvement compared with a 48.4% reduction by the conventional BitNet quantization method. In addition, our method experienced minimal performance degradation in both the pre-training perplexity and downstream tasks. These findings demonstrate the potential of incorporating lightweight language models into edge devices, which will become more demanding in the future.

Continual Pre-training on Character-level Noisy Texts Makes Decoder-based Language Models Robust Few-shot Learners
Takeshi Kojima | Yutaka Matsuo | Yusuke Iwasawa
Transactions of the Association for Computational Linguistics, Volume 13

Recent decoder-based pre-trained language models (PLMs) generally use subword tokenizers. However, adding character-level perturbations drastically changes the delimitation of texts by the tokenizers, leading to the vulnerability of PLMs. This study proposes a method of continual pre-training to convert decoder-based PLMs with subword tokenizers into perturbation-robust few-shot in-context learners. Our method continually trains decoder-based PLMs to predict the next tokens conditioning on artificially created character-level noisy texts. Since decoder-based language models are auto-regressive, we skip noised words from the target optimization. In addition, to maintain the same word prediction performance under noisy text as clean text, our method employs word distribution matching between the original PLMs and training models. We conducted experiments on various subword-based PLMs, including GPT2, Pythia, Mistral, Gemma2, and Llama3, ranging from 1B to 8B parameters. The results demonstrate that our method consistently improves the performance of few-shot in-context learning on downstream tasks which contain actual typos or misspellings as well as artificial noise.1

2024

On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons
Takeshi Kojima | Itsuki Okimura | Yusuke Iwasawa | Hitomi Yanaka | Yutaka Matsuo
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Current decoder-based pre-trained language models (PLMs) successfully demonstrate multilingual capabilities. However, it is unclear how these models handle multilingualism.We analyze the neuron-level internal behavior of multilingual decoder-based PLMs, Specifically examining the existence of neurons that fire “uniquely for each language” within decoder-only multilingual PLMs.We analyze six languages: English, German, French, Spanish, Chinese, and Japanese, and show that language-specific neurons are unique, with a slight overlap (< 5%) between languages. These neurons are mainly distributed in the models’ first and last few layers. This trend remains consistent across languages and models.Additionally, we tamper with less than 1% of the total neurons in each model during inference and demonstrate that tampering with a few language-specific neurons drastically changes the probability of target language occurrence in text generation.

KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques
Rui Yang | Haoran Liu | Edison Marrese-Taylor | Qingcheng Zeng | Yuhe Ke | Wanxin Li | Lechao Cheng | Qingyu Chen | James Caverlee | Yutaka Matsuo | Irene Li
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

Large Language Models (LLMs) have significantly advanced healthcare innovation on generation capabilities. However, their application in real clinical settings is challenging due to potential deviations from medical facts and inherent biases. In this work, we develop an augmented LLM framework, KG-Rank, which leverages a medical knowledge graph (KG) with ranking and re-ranking techniques, aiming to improve free-text question-answering (QA) in the medical domain. Specifically, upon receiving a question, we initially retrieve triplets from a medical KG to gather factual information. Subsequently, we innovatively apply ranking methods to refine the ordering of these triplets, aiming to yield more precise answers. To the best of our knowledge, KG-Rank is the first application of ranking models combined with KG in medical QA specifically for generating long answers. Evaluation of four selected medical QA datasets shows that KG-Rank achieves an improvement of over 18% in the ROUGE-L score. Moreover, we extend KG-Rank to open domains, where it realizes a 14% improvement in ROUGE-L, showing the effectiveness and potential of KG-Rank.

Improving Low-Resource Machine Translation for Formosan Languages Using Bilingual Lexical Resources
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo
Findings of the Association for Computational Linguistics: ACL 2024

This paper investigates how machine translation for low-resource languages can be improved by incorporating information from bilingual lexicons during the training process for mainly translation between Mandarin and Formosan languages, which are all moribund or critically endangered, and we also show that our techniques work for translation between Spanish and Nahuatl, a language pair consisting of languages from completely different language families. About 70% of the approximately 7,000 languages of the world have data in the form of lexicons, a valuable resource for improving low-resource language translation. We collect a dataset of parallel data and bilingual lexicons between Mandarin and 16 different Formosan languages and examine mainly three different approaches: (1) simply using lexical data as additional parallel data, (2) generating pseudo-parallel sentence data to use during training by replacing words in the original parallel sentence data using the lexicon, and (3) a combination of (1) and (2). All three approaches give us gains in both Bleu scores and chrF scores, and we found that (3) provided the most gains, followed by (1) and then (2), which we observed for both translation between Mandarin and the Formosan languages and Spanish-Nahuatl. With technique (3), we saw an average increase of 5.55 in Bleu scores and 10.33 in chrF scores.

Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks
Andrew Gambardella | Yusuke Iwasawa | Yutaka Matsuo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The ability (and inability) of large language models (LLMs) to perform arithmetic tasks has been the subject of much theoretical and practical debate. We show that LLMs are frequently able to correctly and confidently predict the first digit of n-digit by m-digit multiplication tasks without using chain of thought reasoning, despite these tasks require compounding operations to solve. Simultaneously, LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication, a task equivalent to 1-digit by 1-digit multiplication which can be easily learned or memorized. We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits, which on average increases the confidence of the correct last digit on 5-digit by 5-digit multiplication tasks using Llama 2-13B by over 230% (0.13→0.43) and Mistral-7B by 150% (0.22→0.55).

Media Bias Detection Across Families of Language Models
Iffat Maab | Edison Marrese-Taylor | Sebastian Padó | Yutaka Matsuo
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Bias in reporting can influence the public’s opinion on relevant societal issues. Examples include informational bias (selective presentation of content) and lexical bias (specific framing of content through linguistic choices). The recognition of media bias is arguably an area where NLP can contribute to the “social good”. Traditional NLP models have shown good performance in classifying media bias, but require careful model design and extensive tuning. In this paper, we ask how well prompting of large language models can recognize media bias. Through an extensive empirical study including a wide selection of pre-trained models, we find that prompt-based techniques can deliver comparable performance to traditional models with greatly reduced effort and that, similar to traditional models, the availability of context substantially improves results. We further show that larger models can leverage different kinds of context simultaneously, obtaining further performance improvements.

Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?
Fumiya Uchiyama | Takeshi Kojima | Andrew Gambardella | Qi Cao | Yusuke Iwasawa | Yutaka Matsuo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks.Prior research indicates that LLMs pre-trained with programming language data exhibit high mathematical and reasoning abilities; however, this causal relationship has not been rigorously tested. Our research aims to verify which programming languages and features during pre-training affect logical inference performance. Specifically, we pre-trained decoder-based language models from scratch using datasets from ten programming languages (e.g., Python, C, Java) and three natural language datasets (Wikipedia, Fineweb, C4) under identical conditions. Thereafter, we evaluated the trained models in a few-shot in-context learning setting on logical reasoning tasks: FLD and bAbi, which do not require commonsense or world knowledge. The results demonstrate that nearly all models trained with programming languages consistently outperform those trained with natural languages, indicating that programming languages contain factors that elicit logic inference performance. In addition, we found that models trained with programming languages exhibit a better ability to follow instructions compared to those trained with natural languages. Further analysis reveals that the depth of Abstract Syntax Trees representing parsed results of programs also affects logical reasoning performance. These findings will offer insights into the essential elements of pre-training for acquiring the foundational abilities of LLMs.

2023

An Effective Approach for Informational and Lexical Bias Detection
Iffat Maab | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the Sixth Fact Extraction and VERification Workshop (FEVER)

In this paper we present a thorough investigation of automatic bias recognition on BASIL, a dataset of political news which has been annotated with different kinds of biases. We begin by unveiling several inconsistencies in prior work using this dataset, showing that most approaches focus only on certain task formulations while ignoring others, and also failing to report important evaluation details. We provide a comprehensive categorization of these approaches, as well as a more uniform and clear set of evaluation metrics. We argue about the importance of the missing formulations and also propose the novel task of simultaneously detecting different kinds of biases in news. In our work, we tackle bias on six different BASIL classification tasks in a unified manner. Eventually, we introduce a simple yet effective approach based on data augmentation and preprocessing which is generic and works very well across models and task formulations, allowing us to obtain state-of-the-art results. We also perform ablation studies on some tasks to quantify the strength of data augmentation and preprocessing, and find that they correlate positively on all bias tasks.

Towards Better Evaluation for Formality-Controlled English-Japanese Machine Translation
Edison Marrese-Taylor | Pin Chen Wang | Yutaka Matsuo
Proceedings of the Eighth Conference on Machine Translation

In this paper we propose a novel approach to automatically classify the level of formality in Japanese text, using three categories (formal, polite, and informal). We introduce a new dataset that combine manually-annotated sentences from existing resources, and formal sentences scrapped from the website of the House of Representatives and the House of Councilors of Japan. Based on our data, we propose a Transformer-based classification model for Japanese, which obtains state-of-the-art results in benchmark datasets. We further propose to utilize our classifier to study the effectiveness of prompting techniques for controlling the formality level of machine translation (MT) using Large Language Models (LLM). Our experimental setting includes a large selection of such models and is based on an En->Ja parallel corpus specifically designed to test formality control in MT. Our results validate the robustness and effectiveness of our proposed approach and while also providing empirical evidence suggesting that prompting LLMs is a viable approach to control the formality level of En->Ja MT using LLMs.

Target-Aware Contextual Political Bias Detection in News
Iffat Maab | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text
Qi Cao | Takeshi Kojima | Yutaka Matsuo | Yusuke Iwasawa
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

While Large Language Models (LLMs) have achieved remarkable performance in many tasks, much about their inner workings remains unclear. In this study, we present novel experimental insights into the resilience of LLMs, particularly GPT-4, when subjected to extensive character-level permutations. To investigate this, we first propose the Scrambled Bench, a suite designed to measure the capacity of LLMs to handle scrambled input, in terms of both recovering scrambled sentences and answering questions given scrambled context. The experimental results indicate that multiple advanced LLMs demonstrate the capability akin to typoglycemia, a phenomenon where humans can understand the meaning of words even when the letters within those words are scrambled, as long as the first and last letters remain in place. More surprisingly, we found that only GPT-4 nearly flawlessly processes inputs with unnatural errors, a task that poses significant challenges for other LLMs and often even for humans. Specifically, GPT-4 can almost perfectly reconstruct the original sentences from scrambled ones, decreasing the edit distance by 95%, even when all letters within each word are entirely scrambled. It is counter-intuitive that LLMs can exhibit such resilience despite severe disruption to input tokenization caused by scrambled text.

Evaluating Large Language Models’ Understanding of Financial Terminology via Definition Modeling
James Jhirad | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

2022

On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing
Itsuki Okimura | Machel Reid | Makoto Kawano | Yutaka Matsuo
Proceedings of the Third Workshop on Insights from Negative Results in NLP

With in the broader scope of machine learning, data augmentation is a common strategy to improve generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in the NLP has been been comparably rather limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods are unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task. Furthermore, we find a glaring lack of consistently performant data augmentations. This all alludes to the difficulty of data augmentations for NLP tasks and we are inclined to believe that static data augmentations are not broadly applicable given these properties.

A Parallel Corpus and Dictionary for Amis-Mandarin Translation
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

Amis is an endangered language indigenous to Taiwan with limited data available for computational processing. We thus present an Amis-Mandarin dataset containing a parallel corpus of 5,751 Amis and Mandarin sentences and a dictionary of 7,800 Amis words and phrases with their definitions in Mandarin. Using our dataset, we also established a baseline for machine translation between Amis and Mandarin in both directions. Our dataset can be found at https://github.com/francisdzheng/amis-mandarin.

Improving Jejueo-Korean Translation With Cross-Lingual Pretraining Using Japanese and Korean
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 9th Workshop on Asian Translation

Jejueo is a critically endangered language spoken on Jeju Island and is closely related to but mutually unintelligible with Korean. Parallel data between Jejueo and Korean is scarce, and translation between the two languages requires more attention, as current neural machine translation systems typically rely on large amounts of parallel training data. While low-resource machine translation has been shown to benefit from using additional monolingual data during the pretraining process, not as much research has been done on how to select languages other than the source and target languages for use during pretraining. We show that using large amounts of Korean and Japanese data during the pretraining process improves translation by 2.16 BLEU points for translation in the Jejueo → Korean direction and 1.34 BLEU points for translation in the Korean → Jejueo direction compared to the baseline.

2021

Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers
Machel Reid | Edison Marrese-Taylor | Yutaka Matsuo
Findings of the Association for Computational Linguistics: EMNLP 2021

Transformers have shown improved performance when compared to previous architectures for sequence processing such as RNNs. Despite their sizeable performance gains, as recently suggested, the model is computationally expensive to train and with a high parameter budget. In light of this, we explore parameter-sharing methods in Transformers with a specific focus on generative models. We perform an analysis of different parameter sharing/reduction methods and develop the Subformer. Our model combines sandwich-style parameter sharing, which overcomes naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.

Making Use of Latent Space in Language GANs for Generating Diverse Text without Pre-training
Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Generating diverse texts is an important factor for unsupervised text generation. One approach is to produce the diversity of texts conditioned by the sampled latent code. Although several generative adversarial networks (GANs) have been proposed thus far, these models still suffer from mode-collapsing if the models are not pre-trained. In this paper, we propose a GAN model that aims to improve the approach to generating diverse texts conditioned by the latent space. The generator of our model uses Gumbel-Softmax distribution for the word sampling process. To ensure that the text is generated conditioned upon the sampled latent code, reconstruction loss is introduced in our objective function. The discriminator of our model iteratively inspects incomplete partial texts and learns to distinguish whether they are real or fake by using the standard GAN objective function. Experimental results using the COCO Image Captions dataset show that, although our model is not pre-trained, the performance of our model is quite competitive with the existing baseline models, which requires pre-training.

AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
Machel Reid | Junjie Hu | Graham Neubig | Yutaka Matsuo
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized reproducible benchmarks for many African languages, many of which are used by millions of speakers but have less digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.

Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining
Francis Zheng | Machel Reid | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses an mBART implementation of fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline.

2020

A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Edison Marrese-Taylor | Cristian Rodriguez | Jorge Balazs | Stephen Gould | Yutaka Matsuo
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

Despite the recent advances in opinion mining for written reviews, few works have tackled the problem on other sources of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that is able to determine the aspects of the item under review that are being discussed and the sentiment orientation towards them. Our approach works at the sentence level without the need for time annotations and uses features derived from the audio, video and language transcriptions of its contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently provides increased performance over text-only baselines, providing evidence these extra modalities are key in better understanding video reviews.

Learning to Describe Editing Activities in Collaborative Environments: A Case Study on GitHub and Wikipedia
Edison Marrese-Taylor | Pablo Loyola | Jorge A. Balazs | Yutaka Matsuo
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

VCDM: Leveraging Variational Bi-encoding and Deep Contextualized Word Representations for Improved Definition Modeling
Machel Reid | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we tackle the task of definition modeling, where the goal is to learn to generate definitions of words and phrases. Existing approaches for this task are discriminative, combining distributional and lexical semantics in an implicit rather than direct way. To tackle this issue we propose a generative model for the task, introducing a continuous latent variable to explicitly model the underlying relationship between a phrase used within a context and its definition. We rely on variational inference for estimation and leverage contextualized word embeddings for improved performance. Our approach is evaluated on four existing challenging benchmarks with the addition of two new datasets, “Cambridge” and the first non-English corpus “Robert”, which we release to complement our empirical study. Our Variational Contextual Definition Modeler (VCDM) achieves state-of-the-art performance in terms of automatic and human evaluation metrics, demonstrating the effectiveness of our approach.

2019

An Edit-centric Approach for Wikipedia Article Quality Assessment
Edison Marrese-Taylor | Pablo Loyola | Yutaka Matsuo
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We propose an edit-centric approach to assess Wikipedia article quality as a complementary alternative to current full document-based techniques. Our model consists of a main classifier equipped with an auxiliary generative module which, for a given edit, jointly provides an estimation of its quality and generates a description in natural language. We performed an empirical study to assess the feasibility of the proposed model and its cost-effectiveness in terms of data and quality requirements.

Gating Mechanisms for Combining Character and Word-level Word Representations: an Empirical Study
Jorge Balazs | Yutaka Matsuo
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

In this paper we study how different ways of combining character and word-level representations affect the quality of both final word and sentence representations. We provide strong empirical evidence that modeling characters improves the learned representations at the word and sentence levels, and that doing so is particularly useful when representing less frequent words. We further show that a feature-wise sigmoid gating mechanism is a robust method for creating representations that encode semantic similarity, as it performed reasonably well in several word similarity datasets. Finally, our findings suggest that properly capturing semantic similarity at the word level does not consistently yield improved performance in downstream sentence-level tasks.

2018

IIIDYT at IEST 2018: Implicit Emotion Classification With Deep Contextualized Word Representations
Jorge Balazs | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper we describe our system designed for the WASSA 2018 Implicit Emotion Shared Task (IEST), which obtained 2nd place out of 30 teams with a test macro F1 score of 0.710. The system is composed of a single pre-trained ELMo layer for encoding words, a Bidirectional Long-Short Memory Network BiLSTM for enriching word representations with context, a max-pooling operation for creating sentence representations from them, and a Dense Layer for projecting the sentence representations into label space. Our official submission was obtained by ensembling 6 of these models initialized with different random seeds. The code for replicating this paper is available at https://github.com/jabalazs/implicit_emotion.

Learning to Automatically Generate Fill-In-The-Blank Quizzes
Edison Marrese-Taylor | Ai Nakajima | Yutaka Matsuo | Ono Yuichi
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

In this paper we formalize the problem automatic fill-in-the-blank question generation using two standard NLP machine learning schemes, proposing concrete deep learning models for each. We present an empirical study based on data obtained from a language learning platform showing that both of our proposed settings offer promising results.

Deep contextualized word representations for detecting sarcasm and irony
Suzana Ilić | Edison Marrese-Taylor | Jorge Balazs | Yutaka Matsuo
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Predicting context-dependent and non-literal utterances like sarcastic and ironic expressions still remains a challenging task in NLP, as it goes beyond linguistic patterns, encompassing common sense and shared knowledge as crucial components. To capture complex morpho-syntactic features that can usually serve as indicators for irony or sarcasm across dynamic contexts, we propose a model that uses character-level vector representations of words, based on ELMo. We test our model on 7 different datasets derived from 3 different data sources, providing state-of-the-art performance in 6 of them, and otherwise offering competitive results.

Content Aware Source Code Change Description Generation
Pablo Loyola | Edison Marrese-Taylor | Jorge Balazs | Yutaka Matsuo | Fumiko Satoh
Proceedings of the 11th International Conference on Natural Language Generation

We propose to study the generation of descriptions from source code changes by integrating the messages included on code commits and the intra-code documentation inside the source in the form of docstrings. Our hypothesis is that although both types of descriptions are not directly aligned in semantic terms —one explaining a change and the other the actual functionality of the code being modified— there could be certain common ground that is useful for the generation. To this end, we propose an architecture that uses the source code-docstring relationship to guide the description generation. We discuss the results of the approach comparing against a baseline based on a sequence-to-sequence model, using standard automatic natural language generation metrics as well as with a human study, thus offering a comprehensive view of the feasibility of the approach.

IIIDYT at SemEval-2018 Task 3: Irony detection in English tweets
Edison Marrese-Taylor | Suzana Ilic | Jorge Balazs | Helmut Prendinger | Yutaka Matsuo
Proceedings of the 12th International Workshop on Semantic Evaluation

In this paper we introduce our system for the task of Irony detection in English tweets, a part of SemEval 2018. We propose representation learning approach that relies on a multi-layered bidirectional LSTM, without using external features that provide additional semantic information. Although our model is able to outperform the baseline in the validation set, our results show limited generalization power over the test set. Given the limited size of the dataset, we think the usage of more pre-training schemes would greatly improve the obtained results.

2017

Replication issues in syntax-based aspect extraction for opinion mining
Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics

Reproducing experiments is an important instrument to validate previous work and build upon existing approaches. It has been tackled numerous times in different areas of science. In this paper, we introduce an empirical replicability study of three well-known algorithms for syntactic centric aspect-based opinion mining. We show that reproducing results continues to be a difficult endeavor, mainly due to the lack of details regarding preprocessing and parameter setting, as well as due to the absence of available implementations that clarify these details. We consider these are important threats to validity of the research on the field, specifically when compared to other problems in NLP where public datasets and code availability are critical validity components. We conclude by encouraging code-based research, which we think has a key role in helping researchers to understand the meaning of the state-of-the-art better and to generate continuous advances.

EmoAtt at EmoInt-2017: Inner attention sentence embedding for Emotion Intensity
Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper we describe a deep learning system that has been designed and built for the WASSA 2017 Emotion Intensity Shared Task. We introduce a representation learning approach based on inner attention on top of an RNN. Results show that our model offers good capabilities and is able to successfully identify emotion-bearing words to predict intensity without leveraging on lexicons, obtaining the 13t place among 22 shared task competitors.

Extractive Summarization Using Multi-Task Learning with Document Classification
Masaru Isonuma | Toru Fujino | Junichiro Mori | Yutaka Matsuo | Ichiro Sakata
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The need for automatic document summarization that can be used for practical applications is increasing rapidly. In this paper, we propose a general framework for summarization that extracts sentences from a document using externally related information. Our work is aimed at single document summarization using small amounts of reference summaries. In particular, we address document summarization in the framework of multi-task learning using curriculum learning for sentence extraction and document classification. The proposed framework enables us to obtain better feature representations to extract sentences from documents. We evaluate our proposed summarization method on two datasets: financial report and news corpus. Experimental results demonstrate that our summarizers achieve performance that is comparable to state-of-the-art systems.

A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes
Pablo Loyola | Edison Marrese-Taylor | Yutaka Matsuo
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We propose a model to automatically describe changes introduced in the source code of a program using natural language. Our method receives as input a set of code commits, which contains both the modifications and message introduced by an user. These two modalities are used to train an encoder-decoder architecture. We evaluated our approach on twelve real world open source projects from four different programming languages. Quantitative and qualitative results showed that the proposed approach can generate feasible and semantically sound descriptions not only in standard in-project settings, but also in a cross-project setting.

Mining fine-grained opinions on closed captions of YouTube videos with an attention-RNN
Edison Marrese-Taylor | Jorge Balazs | Yutaka Matsuo
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Video reviews are the natural evolution of written product reviews. In this paper we target this phenomenon and introduce the first dataset created from closed captions of YouTube product review videos as well as a new attention-RNN model for aspect extraction and joint aspect extraction and sentiment classification. Our model provides state-of-the-art performance on aspect extraction without requiring the usage of hand-crafted features on the SemEval ABSA corpus, while it outperforms the baseline on the joint task. In our dataset, the attention-RNN model outperforms the baseline for both tasks, but we observe important performance drops for all models in comparison to SemEval. These results, as well as further experiments on domain adaptation for aspect extraction, suggest that differences between speech and written text, which have been discussed extensively in the literature, also extend to the domain of product reviews, where they are relevant for fine-grained opinion mining.

Refining Raw Sentence Representations for Textual Entailment Recognition via Attention
Jorge Balazs | Edison Marrese-Taylor | Pablo Loyola | Yutaka Matsuo
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP

In this paper we present the model used by the team Rivercorners for the 2017 RepEval shared task. First, our model separately encodes a pair of sentences into variable-length representations by using a bidirectional LSTM. Later, it creates fixed-length raw representations by means of simple aggregation functions, which are then refined using an attention mechanism. Finally it combines the refined representations of both sentences into a single vector to be used for classification. With this model we obtained test accuracies of 72.057% and 72.055% in the matched and mismatched evaluation tracks respectively, outperforming the LSTM baseline, and obtaining performances similar to a model that relies on shared information between sentences (ESIM). When using an ensemble both accuracies increased to 72.247% and 72.827% respectively.

2015

Understanding Rating Behaviour and Predicting Ratings by Identifying Representative Users
Rahul Kamath | Masanao Ochi | Yutaka Matsuo
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

2009

A Relational Model of Semantic Similarity between Words using Automatically Extracted Lexical Pattern Clusters from the Web
Danushka Bollegala | Yutaka Matsuo | Mitsuru Ishizuka
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web
Yulan Yan | Naoaki Okazaki | Yutaka Matsuo | Zhenglu Yang | Mitsuru Ishizuka
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

A Co-occurrence Graph-based Approach for Personal Name Alias Extraction from Anchor Texts
Danushka Bollegala | Yutaka Matsuo | Mitsuru Ishizuka
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

2007

An Integrated Approach to Measuring Semantic Similarity between Words Using Information Available on the Web
Danushka Bollegala | Yutaka Matsuo | Mitsuru Ishizuka
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

Subtree Mining for Relation Extraction from Wikipedia
Dat P.T. Nguyen | Yutaka Matsuo | Mitsuru Ishizuka
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

2006

Graph-based Word Clustering using a Web Search Engine
Yutaka Matsuo | Takeshi Sakaki | Kôki Uchiyama | Mitsuru Ishizuka
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Extracting Key Phrases to Disambiguate Personal Name Queries in Web Search
Danushka Bollegala | Yutaka Matsuo | Mitsuru Ishizuka
Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval?

2004

Improving Chronological Sentence Ordering by Precedence Relation
Naoaki Okazaki | Yutaka Matsuo | Mitsuru Ishizuka
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Co-authors

Andrew Gambardella 6

Danushka Bollegala 4

Francis Zheng 4

Naoaki Okazaki 2

Itsuki Okimura 2

Shota Takashiro 2

Shohei Taniguchi 2

Fumiya Uchiyama 2

Qingcheng Zeng 2

Jorge A. Balazs 1

James Caverlee 1

Sarath Chandar 1

Yingjian Chen 1

Carolina Dias-Alexiou 1

Stephen Gould 1

Masaru Isonuma 1

Felix Juefei-Xu 1

Makoto Kawano 1

Yohei Kobashi 1

Junichiro Mori 1

Graham Neubig 1

Dat P.T. Nguyen 1

Sebastian Padó 1

Helmut Prendinger 1

Cristian Rodriguez 1

Keisuke Sakaguchi 1

Takeshi Sakaki 1

Ichiro Sakata 1

Masaki Sashida 1

Atsushi Shimizu 1

Dominic Sobhani 1

Masachika Taniguchi 1

Douglas Teodoro 1

Kôki Uchiyama 1

Pin Chen Wang 1

Yudai Yamazaki 1

Hitomi Yanaka 1

Venues