Kentaro Inui

2026

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Vera Demberg | Kentaro Inui | Lluís Marquez
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib abs

Sycophancy Hides Linearly in the Attention Heads
Rifo Ahmad Genadi | Munachiso Samuel Nwadike | Nurdaulet Mukhituly | Tatsuya Hiraoka | Hilal AlQuabeh | Kentaro Inui
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We find that correct-to-incorrect sycophancy signals are most linearly accessible within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified “truthful” directions reveals limited overlap, suggesting that factual accuracy, and deference resistance, arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations. Code will be released upon publication.

pdf bib

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Vera Demberg | Kentaro Inui | Lluís Marquez
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib

Findings of the Association for Computational Linguistics: EACL 2026
Vera Demberg | Kentaro Inui | Lluís Marquez
Findings of the Association for Computational Linguistics: EACL 2026

pdf bib abs

This study investigates the internal information flow of large language models (LLMs) while performing chain-of-thought (CoT) style reasoning.Specifically, with a particular interest in the faithfulness of the CoT explanation to LLMs’ final answer, we explore (i) when the LLMs’ answer is (pre)determined, especially before the CoT begins or after, and (ii) how strongly the information from CoT specifically has a causal effect on the final answer.Our experiments with controlled arithmetic tasks reveal a systematic internal reasoning mechanism of LLMs.They have not derived an answer at the moment when input was fed into the model.Instead, they compute (sub-)answers while generating the reasoning chain on the fly.Therefore, the generated reasoning chains can be regarded as faithful reflections of the model’s internal computation.

2025

pdf bib abs

We introduce the concept of the self-referencing causal cycle (abbreviated ReCall )—a mechanism that enables large language models (LLMs) to bypass the limitations of unidirectional causality, which underlies a phenomenon known as the reversal curse. When an LLM is prompted with sequential data, it often fails to recall preceding context. For example, when we ask an LLM to recall the line preceding “O say does that star-spangled banner yet wave” in the U.S. National Anthem, it often fails to correctly return “Gave proof through the night that our flag was still there”—this is due to the reversal curse. It occurs because language models such as ChatGPT and Llama generate text based on preceding tokens, requiring facts to be learned and reproduced in a consistent token order. While the reversal curse is often viewed as a limitation, we offer evidence of an alternative view: it is not always an obstacle in practice. We find that ReCall is driven by what we designate as cycle tokens—sequences that connect different parts of the training data, enabling recall of preceding tokens from succeeding ones. Through rigorous probabilistic formalization and controlled experiments, we demonstrate how the cycles they induce influence a model’s ability to reproduce information. To facilitate reproducibility, we provide our code and experimental details at https://anonymous.4open.science/r/remember-B0B8/.

pdf bib abs

Understanding the Side Effects of Rank-One Knowledge Editing
Ryosuke Takahashi | Go Kamoda | Benjamin Heinzerling | Keisuke Sakaguchi | Kentaro Inui
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

This study conducts a detailed analysis of the side effects of rank-one knowledge editing using language models with controlled knowledge. The analysis focuses on each element of knowledge triples (subject, relation, object) and examines two aspects: “knowledge that causes large side effects when edited” and “knowledge that is affected by the side effects.” Our findings suggest that editing knowledge with subjects that have relationships with numerous objects or are robustly embedded within the LM may trigger extensive side effects. Furthermore, we demonstrate that the similarity between relation vectors, the density of object vectors, and the distortion of knowledge representations are closely related to how susceptible knowledge is to editing influences. The findings of this research provide new insights into the mechanisms of side effects in LM knowledge editing and indicate specific directions for developing more effective and reliable knowledge editing methods.

pdf bib abs

MQM-Chat: Multidimensional Quality Metrics for Chat Translation
Yunmeng Li | Jun Suzuki | Makoto Morishita | Kaori Abe | Kentaro Inui
Proceedings of the 31st International Conference on Computational Linguistics

The complexities of chats, such as the stylized contents specific to source segments and dialogue consistency, pose significant challenges for machine translation. Recognizing the need for a precise evaluation metric to address the issues associated with chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat), which encompasses seven error types, including three specifically designed for chat translations: ambiguity and disambiguation, buzzword or loanword issues, and dialogue inconsistency. In this study, human annotations were applied to the translations of chat data generated by five translation models. Based on the error distribution of MQM-Chat and the performance of relabeling errors into chat-specific types, we concluded that MQM-Chat effectively classified the errors while highlighting chat-specific issues explicitly. The results demonstrate that MQM-Chat can qualify both the lexical accuracy and semantical accuracy of translation models in chat translation tasks.

pdf bib abs

Counter-arguments (CAs) are a good means to improve the critical-thinking skills of learners, especially given that one has to thoroughly consider the logic of initial arguments (IA) when composing their CA. Although several tasks have been created for identifying the logical structure of CAs, no prior work has focused on capturing multiple interpretations of logical structures due to their complexity. In this work, we create CALSA+, a dataset consisting of 134 CAs annotated with 13 logical predicate questions. CALSA+ contains 1,742 instances annotated by 3 expert annotators (5,226 total annotations) with good agreement (Krippendorff 𝛼=0.46). Using CALSA+, we train a model with Reinforcement Learning with Verifiable Rewards (RLVR) to identify multiple logical interpretations and show that models trained with RLVR can perform on par with much bigger proprietary models. Our work is the first to attempt to annotate all the interpretations of logical structure on top of CAs. We publicly release our dataset to facilitate research in CA logical structure identification.

pdf bib abs

SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Amirbek Djanibekov | Nurdaulet Mukhituly | Kentaro Inui | Hanan Aldarmaki | Nils Lukas
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise to speech. We analyze adversarial attacks under white-box access and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses used to intervene during inference by modifying the SLM’s activations that improve robustness up to 99% with (i) negligible impact on utility and (ii) without any re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs.

pdf bib abs

Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference
Go Kamoda | Benjamin Heinzerling | Tatsuro Inaba | Keito Kudo | Keisuke Sakaguchi | Kentaro Inui
Findings of the Association for Computational Linguistics: NAACL 2025

According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model’s “inner vocabulary”.Prior analysis of this *detokenization* stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior.Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps.Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2.Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects.By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.

pdf bib abs

On Entity Identification in Language Models
Masaki Sakata | Benjamin Heinzerling | Sho Yokoi | Takumi Ito | Kentaro Inui
Findings of the Association for Computational Linguistics: ACL 2025

We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions.We first formulate two problems of entity mentions — ambiguity and variability — and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated.Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9.Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers.Additionally, we clarify how the characteristics of entity representations influence word prediction performance.These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.

pdf bib abs

Rectifying Belief Space via Unlearning to Harness LLMs’ Reasoning
Ayana Niwa | Masahiro Kaneko | Kentaro Inui
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) exhibit sophisticated reasoning yet still generate incorrect answers. We attribute these errors to **Spurious Beliefs**, defined as propositions the model internally considers as true despite being factually false. To reduce reasoning errors, we propose a belief space rectification framework. Our method first identifies the beliefs invoked during inference via an explanation‐based approach with Forward‐Backward Beam Search (FBBS). We subsequently apply unlearning via gradient ascent to suppress spurious beliefs and enhance true ones, thereby effectively rectifying the model’s belief space. Experiments on three QA datasets and three LLMs show that our method significantly reduces erroneous reasoning and improves generalization.

pdf bib abs

Spelling-out is not Straightforward: LLMs’ Capability of Tokenization from Token to Characters
Tatsuya Hiraoka | Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct “breakthrough” in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.

pdf bib abs

This study explores how bilingual language models develop complex internal representations.We employ sparse autoencoders to analyze internal representations of bilingual language models with a focus on the effects of training steps, layers, and model sizes.Our analysis shows that language models first learn languages separately, and then gradually form bilingual alignments, particularly in the mid layers. We also found that this bilingual tendency is stronger in larger models.Building on these findings, we demonstrate the critical role of bilingual representations in model performance by employing a novel method that integrates decomposed representations from a fully trained model into a mid-training model.Our results provide insights into how language models acquire bilingual capabilities.

pdf bib abs

LLMs Can Compensate for Deficiencies in Visual Representations
Sho Takishita | Jay Gala | Abdelrahman Mohamed | Kentaro Inui | Yova Kementchedjhieva
Findings of the Association for Computational Linguistics: EMNLP 2025

Many vision-language models (VLMs) that prove very effective at a range of multimodal task, build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.

pdf bib

Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Kentaro Inui | Sakriani Sakti | Haofen Wang | Derek F. Wong | Pushpak Bhattacharyya | Biplab Banerjee | Asif Ekbal | Tanmoy Chakraborty | Dhirendra Pratap Singh
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

pdf bib abs

Grape at GenAI Detection Task 1: Leveraging Compact Models and Linguistic Features for Robust Machine-Generated Text Detection
Nhi Hoai Doan | Kentaro Inui
Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect)

In this project, we aim to address two subtasks of Task 1: Binary Multilingual Machine-Generated Text (MGT) Detection (Human vs. Machine) as part of the COLING 2025 Workshop on MGT Detection (Wang et al., 2025) using different approaches. The first method involves separately fine-tuning small language models tailored to the specific subtask. The second approach builds on this methodology by incorporating linguistic, syntactic, and semantic features, leveraging ensemble learning to integrate these features with model predictions for more robust classification. By evaluating and comparing these approaches, we aim to identify the most effective techniques for detecting machine-generated content across languages, providing insights into improving automated verification tools amidst the rapid growth of LLM-generated text in digital spaces.

pdf bib

ASR Models for Traditional Emirati Arabic: Challenges, Adaptations, and Performance Evaluation
Maha Alblooki | Kentaro Inui | Shady Shehata
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)

pdf bib

pdf bib abs

Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning
Nhi Hoai Doan | Tatsuya Hiraoka | Kentaro Inui
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

pdf bib abs

FOCUS: A Benchmark for Targeted Socratic Question Generation via Source-Span Grounding
Surawat Pothong | Machi Shimmei | Naoya Inoue | Paul Reisert | Ana Brassard | Wenzhi Wang | Shoichi Naito | Jungmin Choi | Kentaro Inui
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

We present FOCUS, a benchmark and task setting for Socratic question generation that delivers more informative and targeted feedback to learners. Unlike prior datasets, which rely on broad typologies and lack grounding in the source text, FOCUS introduces a new formulation: each Socratic question is paired with a fine-grained, 11-type typology and an explicit source span from the argument it targets. This design supports clearer, more actionable feedback and facilitates interpretable model evaluation. FOCUS includes 440 annotated instances with moderate partial-match agreement, establishing it as a reliable benchmark. Baseline experiments with representative state-of-the-art models reveal, through detailed error analysis, that even strong models struggle with span selection and context-sensitive categories. An extension study on the LogicClimate dataset further confirms the generalizability of the task and annotation framework. FOCUS sets a new standard for pedagogically grounded and informative Socratic question generation.

pdf bib

pdf bib abs

Can Language Models Handle a Non-Gregorian Calendar? The Case of the Japanese wareki
Mutsumi Sasaki | Go Kamoda | Ryosuke Takahashi | Kosuke Sato | Kentaro Inui | Keisuke Sakaguchi | Benjamin Heinzerling
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Temporal reasoning and knowledge are essential capabilities for language models (LMs).While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar.However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time.If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far.Here, we present a systematic evaluation of how well language models handle one such non-Gregorian system: the Japanese *wareki*.We create datasets that require temporal knowledge and reasoning in using *wareki* dates. Evaluating open and closed LMs, we find that some models can perform calendar conversions, but GPT-4o, Deepseek V3, and even Japanese-centric models struggle with Japanese calendar arithmetic and knowledge involving *wareki* dates.Error analysis suggests corpus frequency of Japanese calendar expressions and a Gregorian bias in the model’s knowledge as possible explanations.Our results show the importance of developing LMs that are better equipped for culture-specific tasks such as calendar understanding.

pdf bib abs

Repetition Neurons: How Do Language Models Produce Repetitions?
Tatsuya Hiraoka | Kentaro Inui
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

This paper introduces repetition neurons, which can be regarded as “skill neurons” responsible for the repetition problem in text generation tasks. These neurons are progressively activated more strongly as repetition continues, indicating that they perceive repetition as a task to copy the previous context repeatedly, similar to in-context learning. We identify these repetition neurons by comparing activation values before and after the onset of repetition in texts generated by recent pre-trained language models. We analyze the repetition neurons in three English and one Japanese pre-trained language models and observe similar patterns across them.

pdf bib abs

The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces
Ahmed Oumar El-Shangiti | Tatsuya Hiraoka | Hilal AlQuabeh | Benjamin Heinzerling | Kentaro Inui
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

This paper investigates whether large language models (LLMs) utilize numerical attributes encoded in a low-dimensional subspace of theembedding space when answering questions involving numeric comparisons, e.g., Was Cristiano born before Messi? We first identified,using partial least squares regression, these subspaces, which effectively encode the numerical attributes associated with the entities in comparison prompts. Further, we demonstrate causality, by intervening in these subspaces to manipulate hidden states, thereby altering the LLM’s comparison outcomes. Experiments conducted on three different LLMs showed that our results hold across different numerical attributes, indicating that LLMs utilize the linearly encoded information for numerical reasoning.

pdf bib abs

Large Language Models Are Human-Like Internally
Tatsuki Kuribayashi | Yohei Oseki | Souhaib Ben Taieb | Kentaro Inui | Timothy Baldwin
Transactions of the Association for Computational Linguistics, Volume 13

Recent cognitive modeling studies have reported that larger language models (LMs) exhibit a poorer fit to human reading behavior (Oh and Schuler, 2023b; Shain et al., 2024; Kuribayashi et al., 2024), leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: Earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.1

pdf bib

Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025)
Michael Zock | Kentaro Inui | Zheng Yuan
Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025)

2024

pdf bib abs

Monotonic Representation of Numeric Attributes in Language Models
Benjamin Heinzerling | Kentaro Inui
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Language models (LMs) can express factual knowledge involving numeric properties such as Karl Popper was born in 1902. However, how this information is encoded in the model’s internal representations is not understood well. Here, we introduce a method for finding and editing representations of numeric properties such as an entity’s birth year. We find directions that encode numeric properties monotonically, in an interpretable fashion. When editing representations along these directions, LM output changes accordingly. For example, by patching activations along a “birthyear” direction we can make the LM express an increasingly late birthyear. Property-encoding directions exist across several numeric properties in all models under consideration, suggesting the possibility that monotonic representation of numeric properties consistently emerges during LM pretraining.Code: https://github.com/bheinzerling/numeric-property-reprA long version of this short paper is available at: https://arxiv.org/abs/2403.10381

pdf bib abs

Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond
Masato Mita | Keisuke Sakaguchi | Masato Hagiwara | Tomoya Mizumoto | Jun Suzuki | Kentaro Inui
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Natural language processing (NLP) technology has rapidly improved automated grammatical error correction (GEC) tasks, and the GEC community has begun to explore document-level revision. However, there are two major obstacles to going beyond automated sentence-level GEC to NLP-based document-level revision support: (1) there are few public corpora with document-level revisions annotated by professional editors, and (2) it is infeasible to obtain all possible references and evaluate revision quality using such references because there are infinite revision possibilities. To address these challenges, this paper proposes a new document revision corpus, Text Revision of ACL papers (TETRA), in which multiple professional editors have revised academic papers sampled from the ACL anthology. This corpus enables us to focus on document-level and paragraph-level edits, such as edits related to coherence and consistency. Additionally, as a case study using the TETRA corpus, we investigate reference-less and interpretable methods for meta-evaluation to detect quality improvements according to document revisions. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle.

pdf bib abs

Japanese-English Sentence Translation Exercises Dataset for Automatic Grading
Naoki Miura | Hiroaki Funayama | Seiya Kikuchi | Yuichiroh Matsubayashi | Yuya Iwase | Kentaro Inui
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), that have been used in the early stage of L2 language learning.We formalize the task as grading student responses for each rubric criterion pre-specified by the educators.We then create a dataset for STE between Japanese and English including 21 questions, along with a total of 3,498 student responses (167 on average).The answer responses were collected from students and crowd workers.Using this dataset, we demonstrate the performance of baselines including a finetuned BERT model and GPT-3.5 with few-shot learning. Experimental results showed that the baseline model with fine-tuned BERT was able to classify correct responses with approximately 90% in F₁, but only less than 80% for incorrect responses. Furthermore, GPT-3.5 with few-shot learning shows a poorer result than the BERT model, indicating that our newly proposed task presents a challenging issue, even for the state-of-the-art large language model.

pdf bib abs

Explicit multi-step reasoning, such as chain-of-thought, is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process.Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, their reliance on heuristics decreases as LMs progress closer to the final answer. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with rational ones in solving tasks involving multi-step reasoning.

pdf bib abs

Representational Analysis of Binding in Language Models
Qin Dai | Benjamin Heinzerling | Kentaro Inui
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Entity tracking is essential for complex reasoning. To perform in-context entity tracking, language models (LMs) must bind an entity to its attribute (e.g., bind a container to its content) to recall attribute for a given entity. For example, given a context mentioning “The coffee is in Box Z, the stone is in Box M, the map is in Box H”, to infer “Box Z contains the coffee” later, LMs must bind “Box Z” to “coffee”. To explain the binding behaviour of LMs, existing research introduces a Binding ID mechanism and states that LMs use a abstract concept called Binding ID (BI) to internally mark entity-attribute pairs. However, they have not directly captured the BI information from entity activations. In this work, we provide a novel view of the Binding ID mechanism by localizing the BI information. Specifically, we discover that there exists a low-rank subspace in the hidden state (or activation) of LMs, that primarily encodes BIs. To identify this subspace, we take principle component analysis as our first attempt and it is empirically proven to be effective. Moreover, we also discover that when editing representations along directions in the subspace, LMs tend to bind a given entity to other attributes accordingly. For example, by patching activations along the BI encoding direction we can make the LM to infer “Box Z contains the stone” and “Box Z contains the map”.

pdf bib abs

Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy’s implicit logic. Using our templates, we conduct an annotation study on top of 400 fallacious arguments taken from LOGIC dataset and achieve a high agreement score (Krippendorf’s 𝛼 of 0.54) and reasonable coverage 83%. Finally, we conduct an experiment for detecting the structure of fallacies and discover that state-of-the-art language models struggle with detecting fallacy templates (0.47 accuracy). To facilitate research on fallacies, we make our dataset and guidelines publicly available.

pdf bib abs

A Large Collection of Model-generated Contradictory Responses for Consistency-aware Dialogue Systems
Shiki Sato | Reina Akama | Jun Suzuki | Kentaro Inui
Findings of the Association for Computational Linguistics: ACL 2024

Mitigating the generation of contradictory responses poses a substantial challenge in dialogue response generation. The quality and quantity of available contradictory response data play a vital role in suppressing these contradictions, offering two significant benefits. First, having access to large contradiction data enables a comprehensive examination of their characteristics. Second, data-driven methods to mitigate contradictions may be enhanced with large-scale contradiction data for training. Nevertheless, no attempt has been made to build an extensive collection of model-generated contradictory responses. In this paper, we build a large dataset of response generation models’ contradictions for the first time. Then, we acquire valuable insights into the characteristics of model-generated contradictions through an extensive analysis of the collected responses. Lastly, we also demonstrate how this dataset substantially enhances the performance of data-driven contradiction suppression methods.

pdf bib

pdf bib abs

To Drop or Not to Drop? Predicting Argument Ellipsis Judgments: A Case Study in Japanese
Yukiko Ishizuki | Tatsuki Kuribayashi | Yuichiroh Matsubayashi | Ryohei Sasano | Kentaro Inui
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Speakers sometimes omit certain arguments of a predicate in a sentence; such omission is especially frequent in pro-drop languages. This study addresses a question about ellipsis—what can explain the native speakers’ ellipsis decisions?—motivated by the interest in human discourse processing and writing assistance for this choice. To this end, we first collect large-scale human annotations of whether and why a particular argument should be omitted across over 2,000 data points in the balanced corpus of Japanese, a prototypical pro-drop language. The data indicate that native speakers overall share common criteria for such judgments and further clarify their quantitative characteristics, e.g., the distribution of related linguistic factors in the balanced corpus. Furthermore, the performance of the language model–based argument ellipsis judgment model is examined, and the gap between the systems’ prediction and human judgments in specific linguistic aspects is revealed. We hope our fundamental resource encourages further studies on natural human ellipsis judgment.

pdf bib abs

J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema
Kosuke Matsuzaki | Masaya Taniguchi | Kentaro Inui | Keisuke Sakaguchi
Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology

We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language’s agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] + suru (do-PRS)). Morphologically, this inflection pattern is same as the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We will release J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.

2023

pdf bib abs

Can LMs Store and Retrieve 1-to-N Relational Knowledge?
Haruki Nagasawa | Benjamin Heinzerling | Kazuma Kokuta | Kentaro Inui
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

It has been suggested that pretrained language models can be viewed as knowledge bases. One of the prerequisites for using language models as knowledge bases is how accurately they can store and retrieve world knowledge. It is already revealed that language models can store much 1-to-1 relational knowledge, such as ”country and its capital,” with high memorization accuracy. On the other hand, world knowledge includes not only 1-to-1 but also 1-to-N relational knowledge, such as ”parent and children.”However, it is not clear how accurately language models can handle 1-to-N relational knowledge. To investigate language models’ abilities toward 1-to-N relational knowledge, we start by designing the problem settings. Specifically, we organize the character of 1-to-N relational knowledge and define two essential skills: (i) memorizing multiple objects individually and (ii) retrieving multiple stored objects without excesses or deficiencies at once. We inspect LMs’ ability to handle 1-to-N relational knowledge on the controlled synthesized data. As a result, we report that it is possible to memorize multiple objects with high accuracy, but generalizing the retrieval ability (expressly, enumeration) is challenging.

pdf bib abs

The use of argumentation in education has shown improvement in students’ critical thinking skills, and computational models for argumentation have been developed to further assist this process. Although these models are useful for evaluating the quality of an argument, they often cannot explain why a particular argument score was predicted, i.e., why the argument is good or bad, which makes it difficult to provide constructive feedback to users, e.g., students, so that they can strengthen their critical thinking skills. In this survey, we explore current NLP feedback systems by categorizing each into four important dimensions of feedback (Richness, Visualization, Interactivity and Personalization). We discuss limitations for each dimension and provide suggestions to enhance the power of feedback and explanations to ultimately improve user critical thinking skills.

pdf bib abs

Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.

pdf bib abs

Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models.

pdf bib abs

Prompting for explanations improves Adversarial NLI. Is this true? {Yes} it is {true} because {it weakens superficial cues}
Pride Kavumba | Ana Brassard | Benjamin Heinzerling | Kentaro Inui
Findings of the Association for Computational Linguistics: EACL 2023

Explanation prompts ask language models to not only assign a particular label to a giveninput, such as true, entailment, or contradiction in the case of natural language inference but also to generate a free-text explanation that supports this label. For example: “This is label because explanation.” While this type of prompt was originally introduced with the aim of improving model interpretability, we showhere that explanation prompts also improve robustness to adversarial perturbations in naturallanguage inference benchmarks. Compared to prompting for labels only, explanation prompting consistently yields stronger performance on adversarial benchmarks, outperforming the state of the art on Adversarial Natural Language Inference, Counterfactually-Augmented Natural Language Inference, and SNLI-Hard datasets. We argue that the increase in robustness is due to the fact that prompting for explanations weakens superficial cues. Specifically, single tokens that are highly predictive of the correct answer in the label-only setting become uninformative when the model also has to generate explanations.

pdf bib abs

Transformer Language Models Handle Word Frequency in Prediction Head
Goro Kobayashi | Tatsuki Kuribayashi | Sho Yokoi | Kentaro Inui
Findings of the Association for Computational Linguistics: ACL 2023

Prediction head is a crucial component of Transformer language models. Despite its direct impact on prediction, this component has often been overlooked in analyzing Transformers.In this study, we investigate the inner workings of the prediction head, specifically focusing on bias parameters. Our experiments with BERT and GPT-2 models reveal that the biases in their word prediction heads play a significant role in the models’ ability to reflect word frequency in a corpus, aligning with the logit adjustment method commonly used in long-tailed learning. We also quantify the effect of controlling the biases in practical auto-regressive text generation scenarios;under a particular setting, more diverse text can be generated without compromising text quality.

pdf bib abs

Test-time Augmentation for Factual Probing
Go Kamoda | Benjamin Heinzerling | Keisuke Sakaguchi | Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2023

Factual probing is a method that uses prompts to test if a language model “knows” certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.

pdf bib abs

Contrastive Learning-based Sentence Encoders Implicitly Weight Informative Words
Hiroto Kurita | Goro Kobayashi | Sho Yokoi | Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2023

The performance of sentence encoders can be significantly improved through the simple practice of fine-tuning using contrastive loss. A natural question arises: what characteristics do models acquire during contrastive learning? This paper theoretically and experimentally shows that contrastive-based sentence encoders implicitly weight words based on information-theoretic quantities; that is, more informative words receive greater weight, while others receive less. The theory states that, in the lower bound of the optimal value of the contrastive learning objective, the norm of word embedding reflects the information gain associated with the distribution of surrounding words. We also conduct comprehensive experiments using various models, multiple datasets, two methods to measure the implicit weighting of models (Integrated Gradients and SHAP), and two information-theoretic quantities (information gain and self-information). The results provide empirical evidence that contrastive fine-tuning emphasizes informative words.

pdf bib abs

Investigating the Effectiveness of Multiple Expert Models Collaboration
Ikumi Ito | Takumi Ito | Jun Suzuki | Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2023

This paper aims to investigate the effectiveness of several machine translation (MT) models and aggregation methods in a multi-domain setting under fair conditions and explore a direction for tackling multi-domain MT. We mainly compare the performance of the single model approach by jointly training all domains and the multi-expert models approach with a particular aggregation strategy. We conduct experiments on multiple domain datasets and demonstrate that a combination of smaller domain expert models can outperform a larger model trained for all domain data.

pdf bib

An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication
Yunmeng Li | Jun Suzuki | Makoto Morishita | Kaori Abe | Kentaro Inui
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib abs

This paper describes our system submitted to SemEval-2023 Task 5: Clickbait Spoiling. We work on spoiler generation of the subtask 2 and develop a system which comprises two parts: 1) simple seq2seq spoiler generation and 2) post-hoc model ensembling. Using this simple method, we address the challenge of generating multipart spoiler. In the test set, our submitted system outperformed the baseline by a large margin (approximately 10 points above on the BLEU score) for mixed types of spoilers. We also found that our system successfully handled the challenge of the multipart spoiler, confirming the effectiveness of our approach.

2022

pdf bib abs

Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model
Sosuke Kobayashi | Shun Kiyono | Jun Suzuki | Kentaro Inui
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has been not very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrated that winning-ticket subnetworks produced more diverse predictions than dense networks and their ensemble outperformed the standard ensemble in some tasks when accurate lottery tickets are found on the tasks.

pdf bib abs

Prior studies addressing target-oriented conversational tasks lack a crucial notion that has been intensively studied in the context of goal-oriented artificial intelligence agents, namely, planning. In this study, we propose the task of Target-Guided Open-Domain Conversation Planning (TGCP) task to evaluate whether neural conversational agents have goal-oriented conversation planning abilities. Using the TGCP task, we investigate the conversation planning abilities of existing retrieval models and recent strong generative models. The experimental results reveal the challenges facing current technology.

pdf bib abs

Topicalization in Language Models: A Case Study on Japanese
Riki Fujihara | Tatsuki Kuribayashi | Kaori Abe | Ryoko Tokuhisa | Kentaro Inui
Proceedings of the 29th International Conference on Computational Linguistics

Humans use different wordings depending on the context to facilitate efficient communication. For example, instead of completely new information, information related to the preceding context is typically placed at the sentence-initial position. In this study, we analyze whether neural language models (LMs) can capture such discourse-level preferences in text generation. Specifically, we focus on a particular aspect of discourse, namely the topic-comment structure. To analyze the linguistic knowledge of LMs separately, we chose the Japanese language, a topic-prominent language, for designing probing tasks, and we created human topicalization judgment data by crowdsourcing. Our experimental results suggest that LMs have different generalizations from humans; LMs exhibited less context-dependent behaviors toward topicalization judgment. These results highlight the need for the additional inductive biases to guide LMs to achieve successful discourse-level generalization.

pdf bib abs

Iterative Span Selection: Self-Emergence of Resolving Orders in Semantic Role Labeling
Shuhei Kurita | Hiroki Ouchi | Kentaro Inui | Satoshi Sekine
Proceedings of the 29th International Conference on Computational Linguistics

Semantic Role Labeling (SRL) is the task of labeling semantic arguments for marked semantic predicates. Semantic arguments and their predicates are related in various distinct manners, of which certain semantic arguments are a necessity while others serve as an auxiliary to their predicates. To consider such roles and relations of the arguments in the labeling order, we introduce iterative argument identification (IAI), which combines global decoding and iterative identification for the semantic arguments. In experiments, we first realize that the model with random argument labeling orders outperforms other heuristic orders such as the conventional left-to-right labeling order. Combined with simple reinforcement learning, the proposed model spontaneously learns the optimized labeling orders that are different from existing heuristic orders. The proposed model with the IAI algorithm achieves competitive or outperforming results from the existing models in the standard benchmark datasets of span-based SRL: CoNLL-2005 and CoNLL-2012.

pdf bib abs

Cross-stitching Text and Knowledge Graph Encoders for Distantly Supervised Relation Extraction
Qin Dai | Benjamin Heinzerling | Kentaro Inui
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Bi-encoder architectures for distantly-supervised relation extraction are designed to make use of the complementary information found in text and knowledge graphs (KG).However, current architectures suffer from two drawbacks. They either do not allow any sharing between the text encoder and the KG encoder at all, or, in case of models with KG-to-text attention, only share information in one direction. Here, we introduce cross-stitch bi-encoders, which allow full interaction between the text encoder and the KG encoder via a cross-stitch mechanism. The cross-stitch mechanism allows sharing and updating representations between the two encoders at any layer, with the amount of sharing being dynamically controlled via cross-attention-based gates. Experimental results on two relation extraction benchmarks from two different domains show that enabling full interaction between the two encoders yields strong improvements.

pdf bib abs

Context Limitations Make Neural Language Models More Human-Like
Tatsuki Kuribayashi | Yohei Oseki | Ana Brassard | Kentaro Inui
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Language models (LMs) have been used in cognitive modeling as well as engineering studies—they compute information-theoretic complexity metrics that simulate humans’ cognitive load during reading.This study highlights a limitation of modern neural LMs as the model of choice for this purpose: there is a discrepancy between their context access capacities and that of humans.Our results showed that constraining the LMs’ context access improved their simulation of human reading behavior.We also showed that LM-human gaps in context access were associated with specific syntactic constructions; incorporating syntactic biases into LMs’ context access might enhance their cognitive plausibility.

pdf bib

Why is sentence similarity benchmark not predictive of application-oriented task performance?
Kaori Abe | Sho Yokoi | Tomoyuki Kajiwara | Kentaro Inui
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems

pdf bib

pdf bib abs

LPAttack: A Feasible Annotation Scheme for Capturing Logic Pattern of Attacks in Arguments
Farjana Sultana Mim | Naoya Inoue | Shoichi Naito | Keshav Singh | Kentaro Inui
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In argumentative discourse, persuasion is often achieved by refuting or attacking others’ arguments. Attacking an argument is not always straightforward and often consists of complex rhetorical moves in which arguers may agree with a logic of an argument while attacking another logic. Furthermore, an arguer may neither deny nor agree with any logics of an argument, instead ignore them and attack the main stance of the argument by providing new logics and presupposing that the new logics have more value or importance than the logics presented in the attacked argument. However, there are no studies in computational argumentation that capture such complex rhetorical moves in attacks or the presuppositions or value judgments in them. To address this gap, we introduce LPAttack, a novel annotation scheme that captures the common modes and complex rhetorical moves in attacks along with the implicit presuppositions and value judgments. Our annotation study shows moderate inter-annotator agreement, indicating that human annotation for the proposed scheme is feasible. We publicly release our annotated corpus and the annotation guidelines.

pdf bib abs

COPA-SSE: Semi-structured Explanations for Commonsense Reasoning
Ana Brassard | Benjamin Heinzerling | Pride Kavumba | Kentaro Inui
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present Semi-Structured Explanations for COPA (COPA-SSE), a new crowdsourced dataset of 9,747 semi-structured, English common sense explanations for Choice of Plausible Alternatives (COPA) questions. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts. This semi-structured format strikes a balance between the high quality but low coverage of structured data and the lower quality but high coverage of free-form crowdsourcing. Each explanation also includes a set of human-given quality ratings. With their familiar format, the explanations are geared towards commonsense reasoners operating on knowledge graphs and serve as a starting point for ongoing work on improving such systems. The dataset is available at https://github.com/a-brassard/copa-sse.

pdf bib abs

IRAC: A Domain-Specific Annotated Corpus of Implicit Reasoning in Arguments
Keshav Singh | Naoya Inoue | Farjana Sultana Mim | Shoichi Naito | Kentaro Inui
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The task of implicit reasoning generation aims to help machines understand arguments by inferring plausible reasonings (usually implicit) between argumentative texts. While this task is easy for humans, machines still struggle to make such inferences and deduce the underlying reasoning. To solve this problem, we hypothesize that as human reasoning is guided by innate collection of domain-specific knowledge, it might be beneficial to create such a domain-specific corpus for machines. As a starting point, we create the first domain-specific resource of implicit reasonings annotated for a wide range of arguments, which can be leveraged to empower machines with better implicit reasoning generation ability. We carefully design an annotation framework to collect them on a large scale through crowdsourcing and show the feasibility of creating a such a corpus at a reasonable cost and high-quality. Our experiments indicate that models trained with domain-specific implicit reasonings significantly outperform domain-general models in both automatic and human evaluations. To facilitate further research towards implicit reasoning generation in arguments, we present an in-depth analysis of our corpus and crowdsourcing methodology, and release our materials (i.e., crowdsourcing guidelines and domain-specific resource of implicit reasonings).

pdf bib abs

Providing feedback on the argumentation of the learner is essential for developing critical thinking skills, however, it requires a lot of time and effort. To mitigate the overload on teachers, we aim to automate a process of providing feedback, especially giving diagnostic comments which point out the weaknesses inherent in the argumentation. It is recommended to give specific diagnostic comments so that learners can recognize the diagnosis without misinterpretation. However, it is not obvious how the task of providing specific diagnostic comments should be formulated. We present a formulation of the task as template selection and slot filling to make an automatic evaluation easier and the behavior of the model more tractable. The key to the formulation is the possibility of creating a template set that is sufficient for practical use. In this paper, we define three criteria that a template set should satisfy: expressiveness, informativeness, and uniqueness, and verify the feasibility of creating a template set that satisfies these criteria as a first trial. We will show that it is feasible through an annotation study that converts diagnostic comments given in a text to a template format. The corpus used in the annotation study is publicly available.

pdf bib abs

Tracing and Manipulating intermediate values in Neural Math Problem Solvers
Yuta Matsumoto | Benjamin Heinzerling | Masashi Yoshikawa | Kentaro Inui
Proceedings of the 1st Workshop on Mathematical Natural Language Processing (MathNLP)

How language models process complex input that requires multiple steps of inference is not well understood. Previous research has shown that information about intermediate values of these inputs can be extracted from the activations of the models, but it is unclear where that information is encoded and whether that information is indeed used during inference. We introduce a method for analyzing how a Transformer model processes these inputs by focusing on simple arithmetic problems and their intermediate values. To trace where information about intermediate values is encoded, we measure the correlation between intermediate values and the activations of the model using principal component analysis (PCA). Then, we perform a causal intervention by manipulating model weights. This intervention shows that the weights identified via tracing are not merely correlated with intermediate values, but causally related to model predictions. Our findings show that the model has a locality to certain intermediate values, and this is useful for enhancing the interpretability of the models.

pdf bib

Law Retrieval with Supervised Contrastive Learning Using the Hierarchical Structure of Law
Jungmin Choi | Ukyo Honda | Taro Watanabe | Hiroki Ouchi | Kentaro Inui
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

pdf bib abs

N-best Response-based Analysis of Contradiction-awareness in Neural Response Generation Models
Shiki Sato | Reina Akama | Hiroki Ouchi | Ryoko Tokuhisa | Jun Suzuki | Kentaro Inui
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Avoiding the generation of responses that contradict the preceding context is a significant challenge in dialogue response generation. One feasible method is post-processing, such as filtering out contradicting responses from a resulting n-best response list. In this scenario, the quality of the n-best list considerably affects the occurrence of contradictions because the final response is chosen from this n-best list. This study quantitatively analyzes the contextual contradiction-awareness of neural response generation models using the consistency of the n-best lists. Particularly, we used polar questions as stimulus inputs for concise and quantitative analyses. Our tests illustrate the contradiction-awareness of recent neural response generation models and methodologies, followed by a discussion of their properties and limitations.

2021

pdf bib abs

Lower Perplexity is Not Always Human-Like
Tatsuki Kuribayashi | Yohei Oseki | Takumi Ito | Ryo Yoshida | Masayuki Asahara | Kentaro Inui
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universal within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization —the lower perplexity a language model has, the more human-like the language model is— in Japanese with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.

pdf bib abs

Exploring Methodologies for Collecting High-Quality Implicit Reasoning in Arguments
Keshav Singh | Farjana Sultana Mim | Naoya Inoue | Shoichi Naito | Kentaro Inui
Proceedings of the 8th Workshop on Argument Mining

Annotation of implicit reasoning (i.e., warrant) in arguments is a critical resource to train models in gaining deeper understanding and correct interpretation of arguments. However, warrants are usually annotated in unstructured form, having no restriction on their lexical structure which sometimes makes it difficult to interpret how warrants relate to any of the information given in claim and premise. Moreover, assessing and determining better warrants from the large variety of reasoning patterns of unstructured warrants becomes a formidable task. Therefore, in order to annotate warrants in a more interpretative and restrictive way, we propose two methodologies to annotate warrants in a semi-structured form. To the best of our knowledge, we are the first to show how such semi-structured warrants can be annotated on a large scale via crowdsourcing. We demonstrate through extensive quality evaluation that our methodologies enable collecting better quality warrants in comparison to unstructured annotations. To further facilitate research towards the task of explicating warrants in arguments, we release our materials publicly (i.e., crowdsourcing guidelines and collected warrants).

pdf bib abs

Exploring Transitivity in Neural NLI Models through Veridicality
Hitomi Yanaka | Koji Mineshima | Kentaro Inui
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Despite the recent success of deep neural networks in natural language processing, the extent to which they can demonstrate human-like generalization capacities for natural language understanding remains unclear. We explore this issue in the domain of natural language inference (NLI), focusing on the transitivity of inference relations, a fundamental property for systematically drawing inferences. A model capturing transitivity can compose basic inference patterns and draw new inferences. We introduce an analysis method using synthetic and naturalistic NLI datasets involving clause-embedding verbs to evaluate whether models can perform transitivity inferences composed of veridical inferences and arbitrary inference types. We find that current NLI models do not perform consistently well on transitivity inference tasks, suggesting that they lack the generalization capacity for drawing composite inferences from provided training examples. The data and code for our analysis are publicly available at https://github.com/verypluming/transitivity.

pdf bib abs

Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries
Benjamin Heinzerling | Kentaro Inui
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Pretrained language models have been suggested as a possible alternative or complement to structured knowledge bases. However, this emerging LM-as-KB paradigm has so far only been considered in a very limited setting, which only allows handling 21k entities whose name is found in common LM vocabularies. Furthermore, a major benefit of this paradigm, i.e., querying the KB using natural language paraphrases, is underexplored. Here we formulate two basic requirements for treating LMs as KBs: (i) the ability to store a large number facts involving a large number of entities and (ii) the ability to query stored facts. We explore three entity representations that allow LMs to handle millions of entities and present a detailed case study on paraphrased querying of facts stored in LMs, thereby providing a proof-of-concept that language models can indeed serve as knowledge bases.

pdf bib abs

Two Training Strategies for Improving Relation Extraction over Universal Graph
Qin Dai | Naoya Inoue | Ryo Takahashi | Kentaro Inui
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper explores how the Distantly Supervised Relation Extraction (DS-RE) can benefit from the use of a Universal Graph (UG), the combination of a Knowledge Graph (KG) and a large-scale text collection. A straightforward extension of a current state-of-the-art neural model for DS-RE with a UG may lead to degradation in performance. We first report that this degradation is associated with the difficulty in learning a UG and then propose two training strategies: (1) Path Type Adaptive Pretraining, which sequentially trains the model with different types of UG paths so as to prevent the reliance on a single type of UG path; and (2) Complexity Ranking Guided Attention mechanism, which restricts the attention span according to the complexity of a UG path so as to force the model to extract features not only from simple UG paths but also from complex ones. Experimental results on both biomedical and NYT10 datasets prove the robustness of our methods and achieve a new state-of-the-art result on the NYT10 dataset. The code and datasets used in this paper are available at https://github.com/baodaiqin/UGDSRE.

pdf bib abs

SHAPE: Shifted Absolute Position Embedding for Transformers
Shun Kiyono | Sosuke Kobayashi | Jun Suzuki | Kentaro Inui
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from a lack of generalization to test data with unseen lengths or high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, which is a key property of recent successful position representations, by randomly shifting absolute positions during training. We demonstrate that SHAPE is empirically comparable to its counterpart while being simpler and faster.

pdf bib abs

Pseudo Zero Pronoun Resolution Improves Zero Anaphora Resolution
Ryuto Konno | Shun Kiyono | Yuichiroh Matsubayashi | Hiroki Ouchi | Kentaro Inui
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Masked language models (MLMs) have contributed to drastic performance improvements with regard to zero anaphora resolution (ZAR). To further improve this approach, in this study, we made two proposals. The first is a new pretraining task that trains MLMs on anaphoric relations with explicit supervision, and the second proposal is a new finetuning method that remedies a notorious issue, the pretrain-finetune discrepancy. Our experiments on Japanese ZAR demonstrated that our two proposals boost the state-of-the-art performance, and our detailed analysis provides new insights on the remaining challenges.

pdf bib abs

This paper explores a variant of automatic headline generation methods, where a generated headline is required to include a given phrase such as a company or a product name. Previous methods using Transformer-based models generate a headline including a given phrase by providing the encoder with additional information corresponding to the given phrase. However, these methods cannot always include the phrase in the generated headline. Inspired by previous RNN-based methods generating token sequences in backward and forward directions from the given phrase, we propose a simple Transformer-based method that guarantees to include the given phrase in the high-quality generated headline. We also consider a new headline generation strategy that takes advantage of the controllable generation order of Transformer. Our experiments with the Japanese News Corpus demonstrate that our methods, which are guaranteed to include the phrase in the generated headline, achieve ROUGE scores comparable to previous Transformer-based methods. We also show that our generation strategy performs better than previous strategies.

pdf bib abs

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
Goro Kobayashi | Tatsuki Kuribayashi | Sho Yokoi | Kentaro Inui
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only composed of the multi-head attention; other components can also contribute to Transformers’ progressive performance. In this study, we extended the scope of the analysis of Transformers from solely the attention patterns to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect the performance. The codes of our experiments are publicly available.

pdf bib abs

Summarize-then-Answer: Generating Concise Explanations for Multi-hop Reading Comprehension
Naoya Inoue | Harsh Trivedi | Steven Sinha | Niranjan Balasubramanian | Kentaro Inui
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

How can we generate concise explanations for multi-hop Reading Comprehension (RC)? The current strategies of identifying supporting sentences can be seen as an extractive question-focused summarization of the input text. However, these extractive explanations are not necessarily concise i.e. not minimally sufficient for answering a question. Instead, we advocate for an abstractive approach, where we propose to generate a question-focused, abstractive summary of input paragraphs and then feed it to an RC system. Given a limited amount of human-annotated abstractive explanations, we train the abstractive explainer in a semi-supervised manner, where we start from the supervised model and then train it further through trial and error maximizing a conciseness-promoted reward function. Our experiments demonstrate that the proposed abstractive explainer can generate more compact explanations than an extractive explainer with limited supervision (only 2k instances) while maintaining sufficiency.

pdf bib abs

Exploring Methods for Generating Feedback Comments for Writing Learning
Kazuaki Hanawa | Ryo Nagata | Kentaro Inui
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The task of generating explanatory notes for language learners is known as feedback comment generation. Although various generation techniques are available, little is known about which methods are appropriate for this task. Nagata (2019) demonstrates the effectiveness of neural-retrieval-based methods in generating feedback comments for preposition use. Retrieval-based methods have limitations in that they can only output feedback comments existing in a given training data. Furthermore, feedback comments can be made on other grammatical and writing items than preposition use, which is still unaddressed. To shed light on these points, we investigate a wider range of methods for generating many feedback comments in this study. Our close analysis of the type of task leads us to investigate three different architectures for comment generation: (i) a neural-retrieval-based method as a baseline, (ii) a pointer-generator-based generation method as a neural seq2seq method, (iii) a retrieve-and-edit method, a hybrid of (i) and (ii). Intuitively, the pointer-generator should outperform neural-retrieval, and retrieve-and-edit should perform best. However, in our experiments, this expectation is completely overturned. We closely analyze the results to reveal the major causes of these counter-intuitive results and report on our findings from the experiments.

pdf bib

SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics
Hitomi Yanaka | Koji Mineshima | Kentaro Inui
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib abs

Learning to Learn to be Right for the Right Reasons
Pride Kavumba | Benjamin Heinzerling | Ana Brassard | Kentaro Inui
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Improving model generalization on held-out data is one of the core objectives in common- sense reasoning. Recent work has shown that models trained on the dataset with superficial cues tend to perform well on the easy test set with superficial cues but perform poorly on the hard test set without superficial cues. Previous approaches have resorted to manual methods of encouraging models not to overfit to superficial cues. While some of the methods have improved performance on hard instances, they also lead to degraded performance on easy in- stances. Here, we propose to explicitly learn a model that does well on both the easy test set with superficial cues and the hard test set without superficial cues. Using a meta-learning objective, we learn such a model that improves performance on both the easy test set and the hard test set. By evaluating our models on Choice of Plausible Alternatives (COPA) and Commonsense Explanation, we show that our proposed method leads to improved performance on both the easy test set and the hard test set upon which we observe up to 16.5 percentage points improvement over the baseline.

pdf bib abs

Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and have the reasonable plausibility of instance-based explanations.

2020

pdf bib abs

Language Models as an Alternative Evaluator of Word Order Hypotheses: A Case Study in Japanese
Tatsuki Kuribayashi | Takumi Ito | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We examine a methodology using neural language models (LMs) for analyzing the word order of language. This LM-based method has the potential to overcome the difficulties existing methods face, such as the propagation of preprocessor errors in count-based methods. In this study, we explore whether the LM-based method is valid for analyzing the word order. As a case study, this study focuses on Japanese due to its complex and flexible word order. To validate the LM-based method, we test (i) parallels between LMs and human word order preference, and (ii) consistency of the results obtained using the LM-based method with previous linguistic studies. Through our experiments, we tentatively conclude that LMs display sufficient word order knowledge for usage as an analysis tool. Finally, using the LM-based method, we demonstrate the relationship between the canonical word order and topicalization, which had yet to be analyzed by large-scale experiments.

pdf bib abs

Evaluating Dialogue Generation Systems via Response Selection
Shiki Sato | Reina Akama | Hiroki Ouchi | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method to construct response selection test sets with well-chosen false candidates. Specifically, we propose to construct test sets filtering out some types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with the test set developed by our method correlates more strongly with human evaluation, compared with widely used automatic evaluation metrics such as BLEU.

pdf bib abs

Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction
Masahiro Kaneko | Masato Mita | Shun Kiyono | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper investigates how to effectively incorporate a pre-trained masked language model (MLM), such as BERT, into an encoder-decoder (EncDec) model for grammatical error correction (GEC). The answer to this question is not as straightforward as one might expect because the previous common methods for incorporating a MLM into an EncDec model have potential drawbacks when applied to GEC. For example, the distribution of the inputs to a GEC model can be considerably different (erroneous, clumsy, etc.) from that of the corpora used for pre-training MLMs; however, this issue is not addressed in the previous methods. Our experiments show that our proposed method, where we first fine-tune a MLM with a given GEC corpus and then use the output of the fine-tuned MLM as additional features in the GEC model, maximizes the benefit of the MLM. The best-performing model achieves state-of-the-art performances on the BEA-2019 and CoNLL-2014 benchmarks. Our code is publicly available at: https://github.com/kanekomasahiro/bert-gec.

pdf bib abs

Do Neural Models Learn Systematicity of Monotonicity Inference in Natural Language?
Hitomi Yanaka | Koji Mineshima | Daisuke Bekki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Despite the success of language models using neural networks, it remains unclear to what extent neural models have the generalization ability to perform inferences. In this paper, we introduce a method for evaluating whether neural models can learn systematicity of monotonicity inference in natural language, namely, the regularity for performing arbitrary inferences with generalization on composition. We consider four aspects of monotonicity inferences and test whether the models can systematically interpret lexical and logical phenomena on different training/test splits. A series of experiments show that three neural models systematically draw inferences on unseen combinations of lexical and logical phenomena when the syntactic structures of the sentences are similar between the training and test sets. However, the performance of the models significantly decreases when the structures are slightly changed in the test set while retaining all vocabularies and constituents already appearing in the training set. This indicates that the generalization ability of neural models is limited to cases where the syntactic structures are nearly the same as those in the training set.

pdf bib abs

Interpretable rationales for model predictions play a critical role in practical applications. In this study, we develop models possessing interpretable inference process for structured prediction. Specifically, we present a method of instance-based learning that learns similarities between spans. At inference time, each span is assigned a class label based on its similar spans in the training set, where it is easy to understand how much each training instance contributes to the predictions. Through empirical analysis on named entity recognition, we demonstrate that our method enables to build models that have high interpretability without sacrificing performance.

pdf bib abs

R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason
Naoya Inoue | Pontus Stenetorp | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent studies have revealed that reading comprehension (RC) systems learn to exploit annotation artifacts and other biases in current datasets. This prevents the community from reliably measuring the progress of RC systems. To address this issue, we introduce R4C, a new task for evaluating RC systems’ internal reasoning. R4C requires giving not only answers but also derivations: explanations that justify predicted answers. We present a reliable, crowdsourced framework for scalably annotating RC datasets with derivations. We create and publicly release the R4C dataset, the first, quality-assured dataset consisting of 4.6k questions, each of which is annotated with 3 reference derivations (i.e. 13.8k derivations). Experiments show that our automatic evaluation metrics using multiple reference derivations are reliable, and that R4C assesses different skills from an existing benchmark.

pdf bib abs

Embeddings of Label Components for Sequence Labeling: A Case Study of Fine-grained Named Entity Recognition
Takuma Kato | Kaori Abe | Hiroki Ouchi | Shumpei Miyawaki | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In general, the labels used in sequence labeling consist of different types of elements. For example, IOB-format entity labels, such as B-Person and I-Person, can be decomposed into span (B and I) and type information (Person). However, while most sequence labeling models do not consider such label components, the shared components across labels, such as Person, can be beneficial for label prediction. In this work, we propose to integrate label component information as embeddings into models. Through experiments on English and Japanese fine-grained named entity recognition, we demonstrate that the proposed method improves performance, especially for instances with low-frequency labels.

pdf bib abs

Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation
Hiroaki Funayama | Shota Sasaki | Yuichiroh Matsubayashi | Tomoya Mizumoto | Jun Suzuki | Masato Mita | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Many recent Short Answer Scoring (SAS) systems have employed Quadratic Weighted Kappa (QWK) as the evaluation measure of their systems. However, we hypothesize that QWK is unsatisfactory for the evaluation of the SAS systems when we consider measuring their effectiveness in actual usage. We introduce a new task formulation of SAS that matches the actual usage. In our formulation, the SAS systems should extract as many scoring predictions that are not critical scoring errors (CSEs). We conduct the experiments in our new task formulation and demonstrate that a typical SAS system can predict scores with zero CSE for approximately 50% of test data at maximum by filtering out low-reliablility predictions on the basis of a certain confidence estimation. This result directly indicates the possibility of reducing half the scoring cost of human raters, which is more preferable for the evaluation of SAS systems.

pdf bib abs

Events in a narrative differ in salience: some are more important to the story than others. Estimating event salience is useful for tasks such as story generation, and as a tool for text analysis in narratology and folkloristics. To compute event salience without any annotations, we adopt Barthes’ definition of event salience and propose several unsupervised methods that require only a pre-trained language model. Evaluating the proposed methods on folktales with event salience annotation, we show that the proposed methods outperform baseline methods and find fine-tuning a language model on narrative texts is a key factor in improving the proposed methods.

pdf bib abs

An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution
Ryuto Konno | Yuichiroh Matsubayashi | Shun Kiyono | Hiroki Ouchi | Ryo Takahashi | Kentaro Inui
Proceedings of the 28th International Conference on Computational Linguistics

One critical issue of zero anaphora resolution (ZAR) is the scarcity of labeled data. This study explores how effectively this problem can be alleviated by data augmentation. We adopt a state-of-the-art data augmentation method, called the contextual data augmentation (CDA), that generates labeled training instances using a pretrained language model. The CDA has been reported to work well for several other natural language processing tasks, including text classification and machine translation. This study addresses two underexplored issues on CDA, that is, how to reduce the computational cost of data augmentation and how to ensure the quality of the generated data. We also propose two methods to adapt CDA to ZAR: [MASK]-based augmentation and linguistically-controlled masking. Consequently, the experimental results on Japanese ZAR show that our methods contribute to both the accuracy gainand the computation cost reduction. Our closer analysis reveals that the proposed method can improve the quality of the augmented training data when compared to the conventional CDA.

pdf bib abs

Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising directions is to develop a model that correctly handles these expressions. Though its importance has been recognized, it is still not clear as to what creates the great gap in performance between the translation of clean input and that of UGC. To answer the question, we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation. Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.

pdf bib abs

Filtering Noisy Dialogue Corpora by Connectivity and Content Relatedness
Reina Akama | Sho Yokoi | Jun Suzuki | Kentaro Inui
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large-scale dialogue datasets have recently become available for training neural dialogue agents. However, these datasets have been reported to contain a non-negligible number of unacceptable utterance pairs. In this paper, we propose a method for scoring the quality of utterance pairs in terms of their connectivity and relatedness. The proposed scoring method is designed based on findings widely shared in the dialogue and linguistics research communities. We demonstrate that it has a relatively good correlation with the human judgment of dialogue quality. Furthermore, the method is applied to filter out potentially unacceptable utterance pairs from a large-scale noisy dialogue corpus to ensure its quality. We experimentally confirm that training data filtered by the proposed method improves the quality of neural dialogue agents in response generation.

pdf bib abs

Word Rotator’s Distance
Sho Yokoi | Ryo Takahashi | Reina Akama | Jun Suzuki | Kentaro Inui
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

One key principle for assessing textual similarity is measuring the degree of semantic overlap between texts by considering the word alignment. Such alignment-based approaches are both intuitive and interpretable; however, they are empirically inferior to the simple cosine similarity between general-purpose sentence vectors. We focus on the fact that the norm of word vectors is a good proxy for word importance, and the angle of them is a good proxy for word similarity. However, alignment-based approaches do not distinguish the norm and direction, whereas sentence-vector approaches automatically use the norm as the word importance. Accordingly, we propose decoupling word vectors into their norm and direction then computing the alignment-based similarity with the help of earth mover’s distance (optimal transport), which we refer to as word rotator’s distance. Furthermore, we demonstrate how to grow the norm and direction of word vectors (vector converter); this is a new systematic approach derived from the sentence-vector estimation methods, which can significantly improve the performance of the proposed method. On several STS benchmarks, the proposed methods outperform not only alignment-based approaches but also strong baselines. The source code is avaliable at https://github.com/eumesy/wrd

pdf bib abs

Attention is Not Only a Weight: Analyzing Transformers with Vector Norms
Goro Kobayashi | Tatsuki Kuribayashi | Sho Yokoi | Kentaro Inui
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Attention is a key component of Transformers, which have recently achieved considerable success in natural language processing. Hence, attention is being extensively studied to investigate various linguistic capabilities of Transformers, focusing on analyzing the parallels between attention weights and specific linguistic phenomena. This paper shows that attention weights alone are only one of the two factors that determine the output of attention and proposes a norm-based analysis that incorporates the second factor, the norm of the transformed input vectors. The findings of our norm-based analyses of BERT and a Transformer-based neural machine translation system include the following: (i) contrary to previous studies, BERT pays poor attention to special tokens, and (ii) reasonable word alignment can be extracted from attention mechanisms of Transformer. These findings provide insights into the inner workings of Transformers.

pdf bib abs

Langsmith: An Interactive Academic Text Revision System
Takumi Ito | Tatsuki Kuribayashi | Masatoshi Hidaka | Jun Suzuki | Kentaro Inui
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Despite the current diversity and inclusion initiatives in the academic community, researchers with a non-native command of English still face significant obstacles when writing papers in English. This paper presents the Langsmith editor, which assists inexperienced, non-native researchers to write English papers, especially in the natural language processing (NLP) field. Our system can suggest fluent, academic-style sentences to writers based on their rough, incomplete phrases or sentences. The system also encourages interaction between human writers and the computerized revision system. The experimental results demonstrated that Langsmith helps non-native English-speaker students write papers in English. The system is available at https://emnlp-demo.editor.langsmith.co.jp/.

pdf bib abs

A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction
Masato Mita | Shun Kiyono | Masahiro Kaneko | Jun Suzuki | Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2020

Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of “noise” where errors were inappropriately edited or left uncorrected. To address this, we designed a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models, and outperformed strong denoising baseline methods. We further applied task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. We then analyzed the effect of the proposed denoising method, and found that our approach leads to improved coverage of corrections and facilitated fluency edits which are reflected in higher recall and overall performance.

pdf bib abs

Seeing the World through Text: Evaluating Image Descriptions for Commonsense Reasoning in Machine Reading Comprehension
Diana Galvan-Sosa | Jun Suzuki | Kyosuke Nishida | Koji Matsuda | Kentaro Inui
Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Despite recent achievements in natural language understanding, reasoning over commonsense knowledge still represents a big challenge to AI systems. As the name suggests, common sense is related to perception and as such, humans derive it from experience rather than from literary education. Recent works in the NLP and the computer vision field have made the effort of making such knowledge explicit using written language and visual inputs, respectively. Our premise is that the latter source fits better with the characteristics of commonsense acquisition. In this work, we explore to what extent the descriptions of real-world scenes are sufficient to learn common sense about different daily situations, drawing upon visual information to answer script knowledge questions.

pdf bib abs

Creating Corpora for Research in Feedback Comment Generation
Ryo Nagata | Kentaro Inui | Shin’ichiro Ishikawa
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we report on datasets that we created for research in feedback comment generation — a task of automatically generating feedback comments such as a hint or an explanatory note for writing learning. There has been almost no such corpus open to the public and accordingly there has been a very limited amount of work on this task. In this paper, we first discuss the principle and guidelines for feedback comment annotation. Then, we describe two corpora that we have manually annotated with feedback comments (approximately 50,000 general comments and 6,700 on preposition use). A part of the annotation results is now available on the web, which will facilitate research in feedback comment generation

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently by putting articles into different categories.

pdf bib abs

Efficient Estimation of Influence of a Training Instance
Sosuke Kobayashi | Sho Yokoi | Jun Suzuki | Kentaro Inui
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model’s prediction would be changed if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset for improving generalization.

2019

pdf bib

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Kentaro Inui | Jing Jiang | Vincent Ng | Xiaojun Wan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

pdf bib abs

Select and Attend: Towards Controllable Content Selection in Text Generation
Xiaoyu Shen | Jun Suzuki | Kentaro Inui | Hui Su | Dietrich Klakow | Satoshi Sekine
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many text generation tasks naturally contain two steps: content selection and surface realization. Current neural encoder-decoder models conflate both steps into a black-box architecture. As a result, the content to be described in the text cannot be explicitly controlled. This paper tackles this problem by decoupling content selection from the decoder. The decoupled content selection is human interpretable, whose value can be manually manipulated to control the content of generated text. The model can be trained end-to-end without human annotations by maximizing a lower bound of the marginal likelihood. We further propose an effective way to trade-off between performance and controllability with a single adjustable hyperparameter. In both data-to-text and headline generation tasks, our model achieves promising results, paving the way for controllable content selection in text generation.

pdf bib abs

An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction
Shun Kiyono | Jun Suzuki | Masato Mita | Tomoya Mizumoto | Kentaro Inui
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The incorporation of pseudo data in the training of grammatical error correction models has been one of the main factors in improving the performance of such models. However, consensus is lacking on experimental configurations, namely, choosing how the pseudo data should be generated or used. In this study, these choices are investigated through extensive experiments, and state-of-the-art performance is achieved on the CoNLL-2014 test set (F0.5=65.0) and the official test set of the BEA-2019 shared task (F0.5=70.2) without making any modifications to the model architecture.

pdf bib abs

Transductive Learning of Neural Language Models for Syntactic and Semantic Analysis
Hiroki Ouchi | Jun Suzuki | Kentaro Inui
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In transductive learning, an unlabeled test set is used for model training. Although this setting deviates from the common assumption of a completely unseen test set, it is applicable in many real-world scenarios, wherein the texts to be processed are known in advance. However, despite its practical advantages, transductive learning is underexplored in natural language processing. Here we conduct an empirical study of transductive learning for neural models and demonstrate its utility in syntactic and semantic tasks. Specifically, we fine-tune language models (LMs) on an unlabeled test set to obtain test-set-specific word representations. Through extensive experiments, we demonstrate that despite its simplicity, transductive LM fine-tuning consistently improves state-of-the-art neural models in in-domain and out-of-domain settings.

pdf bib abs

TEASPN: Framework and Protocol for Integrated Writing Assistance Environments
Masato Hagiwara | Takumi Ito | Tatsuki Kuribayashi | Jun Suzuki | Kentaro Inui
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

Language technologies play a key role in assisting people with their writing. Although there has been steady progress in e.g., grammatical error correction (GEC), human writers are yet to benefit from this progress due to the high development cost of integrating with writing software. We propose TEASPN, a protocol and an open-source framework for achieving integrated writing assistance environments. The protocol standardizes the way writing software communicates with servers that implement such technologies, allowing developers and researchers to integrate the latest developments in natural language processing (NLP) with low cost. As a result, users can enjoy the integrated experience in their favorite writing software. The results from experiments with human participants show that users use a wide range of technologies and rate their writing experience favorably, allowing them to write more fluent text.

pdf bib abs

Pretrained language models, such as BERT and RoBERTa, have shown large improvements in the commonsense reasoning benchmark COPA. However, recent work found that many improvements in benchmarks of natural language understanding are not due to models learning the task, but due to their increasing ability to exploit superficial cues, such as tokens that occur more often in the correct answer than the wrong one. Are BERT’s and RoBERTa’s good performance on COPA also caused by this? We find superficial cues in COPA, as well as evidence that BERT exploits these cues. To remedy this problem, we introduce Balanced COPA, an extension of COPA that does not suffer from easy-to-exploit single token cues. We analyze BERT’s and RoBERTa’s performance on original and Balanced COPA, finding that BERT relies on superficial cues when they are present, but still achieves comparable performance once they are made ineffective, suggesting that BERT learns the task to a certain degree when forced to. In contrast, RoBERTa does not appear to rely on superficial cues.

pdf bib abs

Inject Rubrics into Short Answer Grading System
Tianqi Wang | Naoya Inoue | Hiroki Ouchi | Tomoya Mizumoto | Kentaro Inui
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

Short Answer Grading (SAG) is a task of scoring students’ answers in examinations. Most existing SAG systems predict scores based only on the answers, including the model used as base line in this paper, which gives the-state-of-the-art performance. But they ignore important evaluation criteria such as rubrics, which play a crucial role for evaluating answers in real-world situations. In this paper, we present a method to inject information from rubrics into SAG systems. We implement our approach on top of word-level attention mechanism to introduce the rubric information, in order to locate information in each answer that are highly related to the score. Our experimental results demonstrate that injecting rubric information effectively contributes to the performance improvement and that our proposed model outperforms the state-of-the-art SAG model on the widely used ASAP-SAS dataset under low-resource settings.

pdf bib abs

Improving Evidence Detection by Leveraging Warrants
Keshav Singh | Paul Reisert | Naoya Inoue | Pride Kavumba | Kentaro Inui
Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)

Recognizing the implicit link between a claim and a piece of evidence (i.e. warrant) is the key to improving the performance of evidence detection. In this work, we explore the effectiveness of automatically extracted warrants for evidence detection. Given a claim and candidate evidence, our proposed method extracts multiple warrants via similarity search from an existing, structured corpus of arguments. We then attentively aggregate the extracted warrants, considering the consistency between the given argument and the acquired warrants. Although a qualitative analysis on the warrants shows that the extraction method needs to be improved, our results indicate that our method can still improve the performance of evidence detection.

pdf bib abs

Cross-Corpora Evaluation and Analysis of Grammatical Error Correction Models — Is Single-Corpus Evaluation Enough?
Masato Mita | Tomoya Mizumoto | Masahiro Kaneko | Ryo Nagata | Kentaro Inui
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This study explores the necessity of performing cross-corpora evaluation for grammatical error correction (GEC) models. GEC models have been previously evaluated based on a single commonly applied corpus: the CoNLL-2014 benchmark. However, the evaluation remains incomplete because the task difficulty varies depending on the test corpus and conditions such as the proficiency levels of the writers and essay topics. To overcome this limitation, we evaluate the performance of several GEC models, including NMT-based (LSTM, CNN, and transformer) and an SMT-based model, against various learner corpora (CoNLL-2013, CoNLL-2014, FCE, JFLEG, ICNALE, and KJ). Evaluation results reveal that the models’ rankings considerably vary depending on the corpus, indicating that single-corpus evaluation is insufficient for GEC models.

pdf bib abs

Subword-based Compact Reconstruction of Word Embeddings
Shota Sasaki | Jun Suzuki | Kentaro Inui
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The idea of subword-based word embeddings has been proposed in the literature, mainly for solving the out-of-vocabulary (OOV) word problem observed in standard word-based word embeddings. In this paper, we propose a method of reconstructing pre-trained word embeddings using subword information that can effectively represent a large number of subword embeddings in a considerably small fixed space. The key techniques of our method are twofold: memory-shared embeddings and a variant of the key-value-query self-attention mechanism. Our experiments show that our reconstructed subword-based embeddings can successfully imitate well-trained word embeddings in a small fixed space while preventing quality degradation across several linguistic benchmark datasets, and can simultaneously predict effective embeddings of OOV words. We also demonstrate the effectiveness of our reconstruction method when we apply them to downstream tasks.

pdf bib abs

For several natural language processing (NLP) tasks, span representation design is attracting considerable attention as a promising new technique; a common basis for an effective design has been established. With such basis, exploring task-dependent extensions for argumentation structure parsing (ASP) becomes an interesting research direction. This study investigates (i) span representation originally developed for other NLP tasks and (ii) a simple task-dependent extension for ASP. Our extensive experiments and analysis show that these representations yield high performance for ASP and provide some challenging types of instances to be parsed.

pdf bib abs

Unsupervised Learning of Discourse-Aware Text Representation for Essay Scoring
Farjana Sultana Mim | Naoya Inoue | Paul Reisert | Hiroki Ouchi | Kentaro Inui
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Existing document embedding approaches mainly focus on capturing sequences of words in documents. However, some document classification and regression tasks such as essay scoring need to consider discourse structure of documents. Although some prior approaches consider this issue and utilize discourse structure of text for document classification, these approaches are dependent on computationally expensive parsers. In this paper, we propose an unsupervised approach to capture discourse structure in terms of coherence and cohesion for document embedding that does not require any expensive parser or annotation. Extrinsic evaluation results show that the document representation obtained from our approach improves the performance of essay Organization scoring and Argument Strength scoring.

pdf bib abs

Large crowdsourced datasets are widely used for training and evaluating neural models on natural language inference (NLI). Despite these efforts, neural models have a hard time capturing logical inferences, including those licensed by phrase replacements, so-called monotonicity reasoning. Since no large dataset has been developed for monotonicity reasoning, it is still unclear whether the main obstacle is the size of datasets or the model architectures themselves. To investigate this issue, we introduce a new dataset, called HELP, for handling entailments with lexical and logical phenomena. We add it to training data for the state-of-the-art neural models and evaluate them on test sets for monotonicity phenomena. The results showed that our data augmentation improved the overall accuracy. We also find that the improvement is better on monotonicity inferences with lexical replacements than on downward inferences with disjunction and modification. This suggests that some types of inferences can be improved by our data augmentation while others are immune to it.

pdf bib abs

The Sally Smedley Hyperpartisan News Detector at SemEval-2019 Task 4
Kazuaki Hanawa | Shota Sasaki | Hiroki Ouchi | Jun Suzuki | Kentaro Inui
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system submitted to the formal run of SemEval-2019 Task 4: Hyperpartisan news detection. Our system is based on a linear classifier using several features, i.e., 1) embedding features based on the pre-trained BERT embeddings, 2) article length features, and 3) embedding features of informative phrases extracted from by-publisher dataset. Our system achieved 80.9% accuracy on the test set for the formal run and got the 3rd place out of 42 teams.

pdf bib abs

Distantly Supervised Biomedical Knowledge Acquisition via Knowledge Graph Based Attention
Qin Dai | Naoya Inoue | Paul Reisert | Ryo Takahashi | Kentaro Inui
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

The increased demand for structured scientific knowledge has attracted considerable attention in extracting scientific relation from the ever growing scientific publications. Distant supervision is widely applied approach to automatically generate large amounts of labelled data with low manual annotation cost. However, distant supervision inevitably accompanies the wrong labelling problem, which will negatively affect the performance of Relation Extraction (RE). To address this issue, (Han et al., 2018) proposes a novel framework for jointly training both RE model and Knowledge Graph Completion (KGC) model to extract structured knowledge from non-scientific dataset. In this work, we firstly investigate the feasibility of this framework on scientific dataset, specifically on biomedical dataset. Secondly, to achieve better performance on the biomedical dataset, we extend the framework with other competitive KGC models. Moreover, we proposed a new end-to-end KGC model to extend the framework. Experimental results not only show the feasibility of the framework on the biomedical dataset, but also indicate the effectiveness of our extensions, because our extended model achieves significant and consistent improvements on distant supervised RE as compared with baselines.

pdf bib abs

Annotating with Pros and Cons of Technologies in Computer Science Papers
Hono Shirai | Naoya Inoue | Jun Suzuki | Kentaro Inui
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

This paper explores a task for extracting a technological expression and its pros/cons from computer science papers. We report ongoing efforts on an annotated corpus of pros/cons and an analysis of the nature of the automatic extraction task. Specifically, we show how to adapt the targeted sentiment analysis task for pros/cons extraction in computer science papers and conduct an annotation study. In order to identify the challenges of the automatic extraction task, we construct a strong baseline model and conduct an error analysis. The experiments show that pros/cons can be consistently annotated by several annotators, and that the task is challenging due to domain-specific knowledge. The annotated dataset is made publicly available for research purposes.

pdf bib abs

This paper provides an analytical assessment of student short answer responses with a view to potential benefits in pedagogical contexts. We first propose and formalize two novel analytical assessment tasks: analytic score prediction and justification identification, and then provide the first dataset created for analytic short answer scoring research. Subsequently, we present a neural baseline model and report our extensive empirical results to demonstrate how our dataset can be used to explore new and intriguing technical challenges in short answer scoring. The dataset is publicly available for research purposes.

pdf bib abs

Monotonicity reasoning is one of the important reasoning skills for any intelligent natural language inference (NLI) model in that it requires the ability to capture the interaction between lexical and syntactic structures. Since no test set has been developed for monotonicity reasoning with wide coverage, it is still unclear whether neural models can perform monotonicity reasoning in a proper way. To investigate this issue, we introduce the Monotonicity Entailment Dataset (MED). Performance by state-of-the-art NLI models on the new test set is substantially worse, under 55%, especially on downward reasoning. In addition, analysis using a monotonicity-driven data augmentation method showed that these models might be limited in their generalization ability in upward and downward reasoning.

pdf bib abs

The writing process consists of several stages such as drafting, revising, editing, and proofreading. Studies on writing assistance, such as grammatical error correction (GEC), have mainly focused on sentence editing and proofreading, where surface-level issues such as typographical errors, spelling errors, or grammatical errors should be corrected. We broaden this focus to include the earlier revising stage, where sentences require adjustment to the information included or major rewriting and propose Sentence-level Revision (SentRev) as a new writing assistance task. Well-performing systems in this task can help inexperienced authors by producing fluent, complete sentences given their rough, incomplete drafts. We build a new freely available crowdsourced evaluation dataset consisting of incomplete sentences authored by non-native writers paired with their final versions extracted from published academic papers for developing and evaluating SentRev models. We also establish baseline performance on SentRev using our newly built evaluation dataset.

pdf bib abs

Browsing news articles on multiple devices is now possible. The lengths of news article headlines have precise upper bounds, dictated by the size of the display of the relevant device or interface. Therefore, controlling the length of headlines is essential when applying the task of headline generation to news production. However, because there is no corpus of headlines of multiple lengths for a given article, previous research on controlling output length in headline generation has not discussed whether the system outputs could be adequately evaluated without multiple references of different lengths. In this paper, we introduce two corpora, which are Japanese News Corpus (JNC) and JApanese MUlti-Length Headline Corpus (JAMUL), to confirm the validity of previous evaluation settings. The JNC provides common supervision data for headline generation. The JAMUL is a large-scale evaluation dataset for headlines of three different lengths composed by professional editors. We report new findings on these corpora; for example, although the longest length reference summary can appropriately evaluate the existing methods controlling output length, this evaluation setting has several problems.

2018

pdf bib abs

Distance-Free Modeling of Multi-Predicate Interactions in End-to-End Japanese Predicate-Argument Structure Analysis
Yuichiroh Matsubayashi | Kentaro Inui
Proceedings of the 27th International Conference on Computational Linguistics

Capturing interactions among multiple predicate-argument structures (PASs) is a crucial issue in the task of analyzing PAS in Japanese. In this paper, we propose new Japanese PAS analysis models that integrate the label prediction information of arguments in multiple PASs by extending the input and last layers of a standard deep bidirectional recurrent neural network (bi-RNN) model. In these models, using the mechanisms of pooling and attention, we aim to directly capture the potential interactions among multiple PASs, without being disturbed by the word order and distance. Our experiments show that the proposed models improve the prediction accuracy specifically for cases where the predicate and argument are in an indirect dependency relation and achieve a new state of the art in the overall F₁ on a standard benchmark corpus.

pdf bib abs

Predicting Stances from Social Media Posts using Factorization Machines
Akira Sasaki | Kazuaki Hanawa | Naoaki Okazaki | Kentaro Inui
Proceedings of the 27th International Conference on Computational Linguistics

Social media provide platforms to express, discuss, and shape opinions about events and issues in the real world. An important step to analyze the discussions on social media and to assist in healthy decision-making is stance detection. This paper presents an approach to detect the stance of a user toward a topic based on their stances toward other topics and the social media posts of the user. We apply factorization machines, a widely used method in item recommendation, to model user preferences toward topics from the social media data. The experimental results demonstrate that users’ posts are useful to model topic preferences and therefore predict stances of silent users.

pdf bib abs

Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions
Sho Yokoi | Sosuke Kobayashi | Kenji Fukumizu | Jun Suzuki | Kentaro Inui
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose a new kernel-based co-occurrence measure that can be applied to sparse linguistic expressions (e.g., sentences) with a very short learning time, as an alternative to pointwise mutual information (PMI). As well as deriving PMI from mutual information, we derive this new measure from the Hilbert–Schmidt independence criterion (HSIC); thus, we call the new measure the pointwise HSIC (PHSIC). PHSIC can be interpreted as a smoothed variant of PMI that allows various similarity metrics (e.g., sentence embeddings) to be plugged in as kernels. Moreover, PHSIC can be estimated by simple and fast (linear in the size of the data) matrix calculations regardless of whether we use linear or nonlinear kernels. Empirically, in a dialogue response selection task, PHSIC is learned thousands of times faster than an RNN-based PMI while outperforming PMI in accuracy. In addition, we also demonstrate that PHSIC is beneficial as a criterion of a data selection task for machine translation owing to its ability to give high (low) scores to a consistent (inconsistent) pair with other pairs.

pdf bib abs

What Makes Reading Comprehension Questions Easier?
Saku Sugawara | Kentaro Inui | Satoshi Sekine | Akiko Aizawa
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

A challenge in creating a dataset for machine reading comprehension (MRC) is to collect questions that require a sophisticated understanding of language to answer beyond using superficial cues. In this work, we investigate what makes questions easier across recent 12 MRC datasets with three question styles (answer extraction, description, and multiple choice). We propose to employ simple heuristics to split each dataset into easy and hard subsets and examine the performance of two baseline models for each of the subsets. We then manually annotate questions sampled from each subset with both validity and requisite reasoning skills to investigate which skills explain the difference between easy and hard questions. From this study, we observed that (i) the baseline performances for the hard subsets remarkably degrade compared to those of entire datasets, (ii) hard questions require knowledge inference and multiple-sentence reasoning in comparison with easy questions, and (iii) multiple-choice questions tend to require a broader range of reasoning skills than answer extraction and description questions. These results suggest that one might overestimate recent advances in MRC.

pdf bib abs

A Melody-Conditioned Lyrics Language Model
Kento Watanabe | Yuichiroh Matsubayashi | Satoru Fukayama | Masataka Goto | Kentaro Inui | Tomoyasu Nakano
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

This paper presents a novel, data-driven language model that produces entire lyrics for a given input melody. Previously proposed models for lyrics generation suffer from the inability of capturing the relationship between lyrics and melody partly due to the unavailability of lyrics-melody aligned data. In this study, we first propose a new practical method for creating a large collection of lyrics-melody aligned data and then create a collection of 1,000 lyrics-melody pairs augmented with precise syllable-note alignments and word/sentence/paragraph boundaries. We then provide a quantitative analysis of the correlation between word/sentence/paragraph boundaries in lyrics and melodies. We then propose an RNN-based lyrics language model conditioned on a featurized melody. Experimental results show that the proposed model generates fluent lyrics while maintaining the compatibility between boundaries of lyrics and melody structures.

pdf bib abs

Cross-Lingual Learning-to-Rank with Shared Representations
Shota Sasaki | Shuo Sun | Shigehiko Schamoni | Kevin Duh | Kentaro Inui
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Cross-lingual information retrieval (CLIR) is a document retrieval task where the documents are written in a language different from that of the user’s query. This is a challenging problem for data-driven approaches due to the general lack of labeled training data. We introduce a large-scale dataset derived from Wikipedia to support CLIR research in 25 languages. Further, we present a simple yet effective neural learning-to-rank model that shares representations across languages and reduces the data requirement. This model can exploit training data in, for example, Japanese-English CLIR to improve the results of Swahili-English CLIR.

pdf bib abs

Interpretable and Compositional Relation Learning by Joint Training with an Autoencoder
Ryo Takahashi | Ran Tian | Kentaro Inui
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Embedding models for entities and relations are extremely useful for recovering missing facts in a knowledge base. Intuitively, a relation can be modeled by a matrix mapping entity vectors. However, relations reside on low dimension sub-manifolds in the parameter space of arbitrary matrices – for one reason, composition of two relations M1, M2 may match a third M3 (e.g. composition of relations currency_of_country and country_of_film usually matches currency_of_film_budget), which imposes compositional constraints to be satisfied by the parameters (i.e. M1*M2=M3). In this paper we investigate a dimension reduction technique by training relations jointly with an autoencoder, which is expected to better capture compositional constraints. We achieve state-of-the-art on Knowledge Base Completion tasks with strongly improved Mean Rank, and show that joint training with an autoencoder leads to interpretable sparse codings of relations, helps discovering compositional constraints and benefits from compositional training. Our source code is released at github.com/tianran/glimvec.

pdf bib abs

Unsupervised Learning of Style-sensitive Word Vectors
Reina Akama | Kento Watanabe | Sho Yokoi | Sosuke Kobayashi | Kentaro Inui
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper presents the first study aimed at capturing stylistic similarity between words in an unsupervised manner. We propose extending the continuous bag of words (CBOW) embedding model (Mikolov et al., 2013b) to learn style-sensitive word vectors using a wider context window under the assumption that the style of all the words in an utterance is consistent. In addition, we introduce a novel task to predict lexical stylistic similarity and to create a benchmark dataset for this task. Our experiment with this dataset supports our assumption and demonstrates that the proposed extensions contribute to the acquisition of style-sensitive word embeddings.

pdf bib abs

Feasible Annotation Scheme for Capturing Policy Argument Reasoning using Argument Templates
Paul Reisert | Naoya Inoue | Tatsuki Kuribayashi | Kentaro Inui
Proceedings of the 5th Workshop on Argument Mining

Most of the existing works on argument mining cast the problem of argumentative structure identification as classification tasks (e.g. attack-support relations, stance, explicit premise/claim). This paper goes a step further by addressing the task of automatically identifying reasoning patterns of arguments using predefined templates, which is called argument template (AT) instantiation. The contributions of this work are three-fold. First, we develop a simple, yet expressive set of easily annotatable ATs that can represent a majority of writer’s reasoning for texts with diverse policy topics while maintaining the computational feasibility of the task. Second, we create a small, but highly reliable annotated corpus of instantiated ATs on top of reliably annotated support and attack relations and conduct an annotation study. Third, we formulate the task of AT instantiation as structured prediction constrained by a feasible set of templates. Our evaluation demonstrates that we can annotate ATs with a reasonably high inter-annotator agreement, and the use of template-constrained inference is useful for instantiating ATs with only partial reasoning comprehension clues.

pdf bib abs

Unsupervised Token-wise Alignment to Improve Interpretation of Encoder-Decoder Models
Shun Kiyono | Sho Takase | Jun Suzuki | Naoaki Okazaki | Kentaro Inui | Masaaki Nagata
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Developing a method for understanding the inner workings of black-box neural methods is an important research endeavor. Conventionally, many studies have used an attention matrix to interpret how Encoder-Decoder-based models translate a given source sentence to the corresponding target sentence. However, recent studies have empirically revealed that an attention matrix is not optimal for token-wise translation analyses. We propose a method that explicitly models the token-wise alignment between the source and target sequences to provide a better analysis. Experiments show that our method can acquire token-wise alignments that are superior to those of an attention mechanism.

pdf bib abs

Investigating the Challenges of Temporal Relation Extraction from Clinical Text
Diana Galvan | Naoaki Okazaki | Koji Matsuda | Kentaro Inui
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

Temporal reasoning remains as an unsolved task for Natural Language Processing (NLP), particularly demonstrated in the clinical domain. The complexity of temporal representation in language is evident as results of the 2016 Clinical TempEval challenge indicate: the current state-of-the-art systems perform well in solving mention-identification tasks of event and time expressions but poorly in temporal relation extraction, showing a gap of around 0.25 point below human performance. We explore to adapt the tree-based LSTM-RNN model proposed by Miwa and Bansal (2016) to temporal relation extraction from clinical text, obtaining a five point improvement over the best 2016 Clinical TempEval system and two points over the state-of-the-art. We deliver a deep analysis of the results and discuss the next step towards human-like temporal reasoning.

pdf bib

Multi-dialect Neural Machine Translation and Dialectometry
Kaori Abe | Yuichiroh Matsubayashi | Naoaki Okazaki | Kentaro Inui
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf bib

Improving Scientific Relation Classification with Task Specific Supersense
Qin Dai | Naoya Inoue | Paul Reisert | Kentaro Inui
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf bib

pdf bib

2017

pdf bib abs

Neural Architectures for Fine-grained Entity Type Classification
Sonse Shimaoka | Pontus Stenetorp | Kentaro Inui | Sebastian Riedel
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this work, we investigate several neural network architectures for fine-grained entity type classification and make three key contributions. Despite being a natural comparison and addition, previous work on attentive neural architectures have not considered hand-crafted features and we combine these with learnt features and establish that they complement each other. Additionally, through quantitative analysis we establish that the attention mechanism learns to attend over syntactic heads and the phrase containing the mention, both of which are known to be strong hand-crafted features for our task. We introduce parameter sharing between labels through a hierarchical encoding method, that in low-dimensional projections show clear clusters for each type hierarchy. Lastly, despite using the same evaluation dataset, the literature frequently compare models trained using different data. We demonstrate that the choice of training data has a drastic impact on performance, which decreases by as much as 9.85% loose micro F1 score for a previously proposed method. Despite this discrepancy, our best model achieves state-of-the-art results with 75.36% loose micro F1 score on the well-established Figer (GOLD) dataset and we report the best results for models trained using publicly available data for the OntoNotes dataset with 64.93% loose micro F1 score.

pdf bib abs

A Neural Language Model for Dynamically Representing the Meanings of Unknown Words and Entities in a Discourse
Sosuke Kobayashi | Naoaki Okazaki | Kentaro Inui
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This study addresses the problem of identifying the meaning of unknown words or entities in a discourse with respect to the word embedding approaches used in neural language models. We proposed a method for on-the-fly construction and exploitation of word embeddings in both the input and output layers of a neural model by tracking contexts. This extends the dynamic entity representation used in Kobayashi et al. (2016) and incorporates a copy mechanism proposed independently by Gu et al. (2016) and Gulcehre et al. (2016). In addition, we construct a new task and dataset called Anonymized Language Modeling for evaluating the ability to capture word meanings while reading. Experiments conducted using our novel dataset show that the proposed variant of RNN language model outperformed the baseline model. Furthermore, the experiments also demonstrate that dynamic updates of an output layer help a model predict reappearing entities, whereas those of an input layer are effective to predict words following reappearing entities.

pdf bib abs

Revisiting the Design Issues of Local Models for Japanese Predicate-Argument Structure Analysis
Yuichiroh Matsubayashi | Kentaro Inui
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The research trend in Japanese predicate-argument structure (PAS) analysis is shifting from pointwise prediction models with local features to global models designed to search for globally optimal solutions. However, the existing global models tend to employ only relatively simple local features; therefore, the overall performance gains are rather limited. The importance of designing a local model is demonstrated in this study by showing that the performance of a sophisticated local model can be considerably improved with recent feature embedding methods and a feature combination learning based on a neural network, outperforming the state-of-the-art global models in F1 on a common benchmark dataset.

pdf bib abs

Reference-based Metrics can be Replaced with Reference-less Metrics in Evaluating Grammatical Error Correction Systems
Hiroki Asano | Tomoya Mizumoto | Kentaro Inui
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In grammatical error correction (GEC), automatically evaluating system outputs requires gold-standard references, which must be created manually and thus tend to be both expensive and limited in coverage. To address this problem, a reference-less approach has recently emerged; however, previous reference-less metrics that only consider the criterion of grammaticality, have not worked as well as reference-based metrics. This study explores the potential of extending a prior grammaticality-based method to establish a reference-less evaluation method for GEC systems. Further, we empirically show that a reference-less metric that combines fluency and meaning preservation with grammaticality provides a better estimate of manual scores than that of commonly used reference-based metrics. To our knowledge, this is the first study that provides empirical evidence that a reference-less metric can replace reference-based metrics in evaluating GEC systems.

pdf bib abs

Generating Stylistically Consistent Dialog Responses with Transfer Learning
Reina Akama | Kazuaki Inada | Naoya Inoue | Sosuke Kobayashi | Kentaro Inui
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We propose a novel, data-driven, and stylistically consistent dialog response generation system. To create a user-friendly system, it is crucial to make generated responses not only appropriate but also stylistically consistent. For leaning both the properties effectively, our proposed framework has two training stages inspired by transfer learning. First, we train the model to generate appropriate responses, and then we ensure that the responses have a specific style. Experimental results demonstrate that the proposed method produces stylistically consistent responses while maintaining the appropriateness of the responses learned in a general domain.

pdf bib abs

Proofread Sentence Generation as Multi-Task Learning with Editing Operation Prediction
Yuta Hitomi | Hideaki Tamori | Naoaki Okazaki | Kentaro Inui
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

This paper explores the idea of robot editors, automated proofreaders that enable journalists to improve the quality of their articles. We propose a novel neural model of multi-task learning that both generates proofread sentences and predicts the editing operations required to rewrite the source sentences and create the proofread ones. The model is trained using logs of the revisions made professional editors revising draft newspaper articles written by journalists. Experiments demonstrate the effectiveness of our multi-task learning approach and the potential value of using revision logs for this task.

pdf bib abs

Other Topics You May Also Agree or Disagree: Modeling Inter-Topic Preferences using Tweets and Matrix Factorization
Akira Sasaki | Kazuaki Hanawa | Naoaki Okazaki | Kentaro Inui
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We presents in this paper our approach for modeling inter-topic preferences of Twitter users: for example, “those who agree with the Trans-Pacific Partnership (TPP) also agree with free trade”. This kind of knowledge is useful not only for stance detection across multiple topics but also for various real-world applications including public opinion survey, electoral prediction, electoral campaigns, and online debates. In order to extract users’ preferences on Twitter, we design linguistic patterns in which people agree and disagree about specific topics (e.g., “A is completely wrong”). By applying these linguistic patterns to a collection of tweets, we extract statements agreeing and disagreeing with various topics. Inspired by previous work on item recommendation, we formalize the task of modeling inter-topic preferences as matrix factorization: representing users’ preference as a user-topic matrix and mapping both users and topics onto a latent feature space that abstracts the preferences. Our experimental results demonstrate both that our presented approach is useful in predicting missing preferences of users and that the latent vector representations of topics successfully encode inter-topic preferences.

pdf bib abs

Analyzing the Revision Logs of a Japanese Newspaper for Article Quality Assessment
Hideaki Tamori | Yuta Hitomi | Naoaki Okazaki | Kentaro Inui
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

We address the issue of the quality of journalism and analyze daily article revision logs from a Japanese newspaper company. The revision logs contain data that can help reveal the requirements of quality journalism such as the types and number of edit operations and aspects commonly focused in revision. This study also discusses potential applications such as quality assessment and automatic article revision as our future research directions.

pdf bib

Handling Multiword Expressions in Causality Estimation
Shota Sasaki | Sho Takase | Naoya Inoue | Naoaki Okazaki | Kentaro Inui
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

pdf bib

A Crowdsourcing Approach for Annotating Causal Relation Instances in Wikipedia
Kazuaki Hanawa | Akira Sasaki | Naoaki Okazaki | Kentaro Inui
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

2016

pdf bib abs

This study proposes a computational model of the discourse segments in lyrics to understand and to model the structure of lyrics. To test our hypothesis that discourse segmentations in lyrics strongly correlate with repeated patterns, we conduct the first large-scale corpus study on discourse segments in lyrics. Next, we propose the task to automatically identify segment boundaries in lyrics and train a logistic regression model for the task with the repeated pattern and textual features. The results of our empirical experiments illustrate the significance of capturing repeated patterns in predicting the boundaries of discourse segments in lyrics.

pdf bib abs

Modeling Context-sensitive Selectional Preference with Distributed Representations
Naoya Inoue | Yuichiroh Matsubayashi | Masayuki Ono | Naoaki Okazaki | Kentaro Inui
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper proposes a novel problem setting of selectional preference (SP) between a predicate and its arguments, called as context-sensitive SP (CSP). CSP models the narrative consistency between the predicate and preceding contexts of its arguments, in addition to the conventional SP based on semantic types. Furthermore, we present a novel CSP model that extends the neural SP model (Van de Cruys, 2014) to incorporate contextual information into the distributed representations of arguments. Experimental results demonstrate that the proposed CSP model successfully learns CSP and outperforms the conventional SP model in coreference cluster ranking.

pdf bib abs

Question-Answering with Logic Specific to Video Games
Corentin Dumont | Ran Tian | Kentaro Inui
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a corpus and a knowledge database aiming at developing Question-Answering in a new context, the open world of a video game. We chose a popular game called ‘Minecraft’, and created a QA corpus with a knowledge database related to this game and the ontology of a meaning representation that will be used to structure this database. We are interested in the logic rules specific to the game, which may not exist in the real world. The ultimate goal of this research is to build a QA system that can answer natural language questions from players by using inference on these game-specific logic rules. The QA corpus is partially composed of online quiz questions and partially composed of manually written variations of the most relevant ones. The knowledge database is extracted from several wiki-like websites about Minecraft. It is composed of unstructured data, such as text, that will be structured using the meaning representation we defined, and already structured data such as infoboxes. A preliminary examination of the data shows that players are asking creative questions about the game, and that the QA corpus can be used for clustering verbs and linking them to predefined actions in the game.

pdf bib

Dynamic Entity Representation with Max-pooling Improves Machine Reading
Sosuke Kobayashi | Ran Tian | Naoaki Okazaki | Kentaro Inui
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib

Learning Semantically and Additively Compositional Distributional Representations
Ran Tian | Naoaki Okazaki | Kentaro Inui
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib

Composing Distributed Representations of Relational Patterns
Sho Takase | Naoaki Okazaki | Kentaro Inui
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib

Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes
Davaajav Jargalsaikhan | Naoaki Okazaki | Koji Matsuda | Kentaro Inui
Proceedings of the ACL 2016 Student Research Workshop

pdf bib

Tohoku at SemEval-2016 Task 6: Feature-based Model versus Convolutional Neural Network for Stance Detection
Yuki Igarashi | Hiroya Komatsu | Sosuke Kobayashi | Naoaki Okazaki | Kentaro Inui
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib

An Attentive Neural Architecture for Fine-grained Entity Type Classification
Sonse Shimaoka | Pontus Stenetorp | Kentaro Inui | Sebastian Riedel
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib

pdf bib

Toward the automatic extraction of knowledge of usable goods
Mei Uemura | Naho Orita | Naoaki Okazaki | Kentaro Inui
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf bib

Neural Joint Learning for Classifying Wikipedia Articles into Fine-grained Named Entity Types
Masatoshi Suzuki | Koji Matsuda | Satoshi Sekine | Naoaki Okazaki | Kentaro Inui
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

A Latent Discriminative Model for Compositional Entailment Relation Recognition using Natural Logic
Yotaro Watanabe | Junta Mizuno | Eric Nichols | Naoaki Okazaki | Kentaro Inui
Proceedings of COLING 2012

pdf bib

Set Expansion using Sibling Relations between Semantic Categories
Sho Takase | Naoaki Okazaki | Kentaro Inui
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2011

pdf bib

2010

pdf bib

Identifying Contradictory and Contrastive Relations between Statements to Outline Web Information on a Given Topic
Daisuke Kawahara | Kentaro Inui | Sadao Kurohashi
Coling 2010: Posters

pdf bib abs

Annotating Event Mentions in Text with Modality, Focus, and Source Information
Suguru Matsuyoshi | Megumi Eguchi | Chitose Sao | Koji Murakami | Kentaro Inui | Yuji Matsumoto
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Many natural language processing tasks, including information extraction, question answering and recognizing textual entailment, require analysis of the polarity, focus of polarity, tense, aspect, mood and source of the event mentions in a text in addition to its predicate-argument structure analysis. We refer to modality, polarity and other associated information as extended modality. In this paper, we propose a new annotation scheme for representing the extended modality of event mentions in a sentence. Our extended modality consists of the following seven components: Source, Time, Conditional, Primary modality type, Actuality, Evaluation and Focus. We reviewed the literature about extended modality in Linguistics and Natural Language Processing (NLP) and defined appropriate labels of each component. In the proposed annotation scheme, information of extended modality of an event mention is summarized at the core predicate of the event mention for immediate use in NLP applications. We also report on the current progress of our manual annotation of a Japanese corpus of about 50,000 event mentions, showing a reasonably high ratio of inter-annotator agreement.

pdf bib

Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables
Tetsuji Nakagawa | Kentaro Inui | Sadao Kurohashi
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib

A Thesaurus of Predicate-Argument Structure for Japanese Verbs to Deal with Granularity of Verb Meanings
Koichi Takeuchi | Kentaro Inui | Nao Takeuchi | Atsushi Fujita
Proceedings of the Eighth Workshop on Asian Language Resouces

pdf bib

2009

pdf bib

Capturing Salience with a Trainable Cache Model for Zero-anaphora Resolution
Ryu Iida | Kentaro Inui | Yuji Matsumoto
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib

pdf bib

pdf bib

Interpolated PLSI for Learning Plausible Verb Arguments
Hiram Calvo | Kentaro Inui | Yuji Matsumoto
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

2008

pdf bib

Two-Phased Event Relation Acquisition: Coupling the Relation-Oriented and Argument-Oriented Approaches
Shuya Abe | Kentaro Inui | Yuji Matsumoto
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib

Emotion Classification Using Massive Examples Extracted from the Web
Ryoko Tokuhisa | Kentaro Inui | Yuji Matsumoto
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib

Acquiring Event Relation Knowledge by Learning Cooccurrence Patterns and Fertilizing Cooccurrence Samples with Verbal Nouns
Shuya Abe | Kentaro Inui | Yuji Matsumoto
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

2007

pdf bib

Extracting Aspect-Evaluation and Aspect-Of Relations in Opinion Mining
Nozomi Kobayashi | Kentaro Inui | Yuji Matsumoto
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf bib

pdf bib

Annotating a Japanese Text Corpus with Predicate-Argument and Coreference Relations
Ryu Iida | Mamoru Komachi | Kentaro Inui | Yuji Matsumoto
Proceedings of the Linguistic Annotation Workshop

2006

pdf bib abs

Augmenting a Semantic Verb Lexicon with a Large Scale Collection of Example Sentences
Kentaro Inui | Toru Hirano | Ryu Iida | Atsushi Fujita | Yuji Matsumoto
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

One of the crucial issues in semantic parsing is how to reduce costs of collecting a sufficiently large amount of labeled data. This paper presents a new approach to cost-saving annotation of example sentences with predicate-argument structure information, taking Japanese as a target language. In this scheme, a large collection of unlabeled examples are first clustered and selectively sampled, and for each sampled cluster, only one representative example is given a label by a human annotator. The advantages of this approach are empirically supported by the results of our preliminary experiments, where we use an existing similarity function and naive sampling strategy.

pdf bib

Exploiting Syntactic Patterns as Clues in Zero-Anaphora Resolution
Ryu Iida | Kentaro Inui | Yuji Matsumoto
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics