Gaku Morio - ACL Anthology

Gaku Morio

2026

Disentangling the Effects of Unlearning in Measuring Parametric Faithfulness of Chain-of-Thought
Ryo Mitsuhashi | Gaku Morio | Ayana Niwa | Masahiro Kaneko | Kentaro Inui | Terufumi Morishita | Yuta Koreeda | Yasuhiro Sogawa
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Chain-of-Thought (CoT) in large language models (LLMs) has been widely debated in terms of whether it faithfully reflects an internal reasoning process of models. Parametric faithfulness is a recently proposed metric that uses unlearning to assess whether a model encodes parametric beliefs corresponding to a reasoning chain. This paper refines this metric by accounting for the unintended artifacts of unlearning. We introduce control tasks that unlearn irrelevant knowledge and word-shuffled content and show that these control tasks yield substantial parametric faithfulness values, suggesting the non-negligible effect of unlearning. We also found that control tasks help explain the significant variations in parametric faithfulness observed across different model sizes and CoT lengths. We conclude that the effects of unlearning need to be considered when measuring parametric faithfulness.

2025

Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)
Kalyan Dutia | Peter Henderson | Markus Leippold | Christopher Manning | Gaku Morio | Veruska Muccione | Jingwei Ni | Tobias Schimanski | Dominik Stammbach | Alok Singh | Alba (Ruiran) Su | Saeid A. Vaghefi
Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)

2024

JFLD: A Japanese Benchmark for Deductive Reasoning Based on Formal Logic
Terufumi Morishita | Atsuki Yamaguchi | Gaku Morio | Hikaru Tomonari | Osamu Imaichi | Yasuhiro Sogawa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) have proficiently solved a broad range of tasks with their rich knowledge but often struggle with logical reasoning. To foster the research on logical reasoning, many benchmarks have been proposed so far. However, most of these benchmarks are limited to English, hindering the evaluation of LLMs specialized for each language. To address this, we propose **JFLD** (**J**apanese **F**ormal **L**ogic **D**eduction), a deductive reasoning benchmark for Japanese. JFLD assesses whether LLMs can generate logical steps to (dis-)prove a given hypothesis based on a given set of facts. Its key features are assessing pure logical reasoning abilities isolated from knowledge and assessing various reasoning rules. We evaluate various Japanese LLMs and see that they are still poor at logical reasoning, thus highlighting a substantial need for future research.

Predicting Narratives of Climate Obstruction in Social Media Advertising
Harri Rowlands | Gaku Morio | Dylan Tanner | Christopher Manning
Findings of the Association for Computational Linguistics: ACL 2024

Social media advertising offers a platform for fossil fuel value chain companies and their agents to reinforce their narratives, often emphasizing economic, labor market, and energy security benefits to promote oil and gas policy and products. Whether such narratives can be detected automatically and the extent to which the cost of human annotation can be reduced is our research question. We introduce a task of classifying narratives into seven categories, based on existing definitions and data. Experiments showed that RoBERTa-large outperforms other methods, while GPT-4 Turbo can serve as a viable annotator for the task, thereby reducing human annotation costs. Our findings and insights provide guidance to automate climate-related ad analysis and lead to more scalable ad scrutiny.

2023

How does the task complexity of masked pretraining objectives affect downstream performance?
Atsuki Yamaguchi | Hiroaki Ozaki | Terufumi Morishita | Gaku Morio | Yasuhiro Sogawa
Findings of the Association for Computational Linguistics: ACL 2023

Masked language modeling (MLM) is a widely used self-supervised pretraining objective, where a model needs to predict an original token that is replaced with a mask given contexts. Although simpler and computationally efficient pretraining objectives, e.g., predicting the first character of a masked token, have recently shown comparable results to MLM, no objectives with a masking scheme actually outperform it in downstream tasks. Motivated by the assumption that their lack of complexity plays a vital role in the degradation, we validate whether more complex masked objectives can achieve better results and investigate how much complexity they should have to perform comparably to MLM. Our results using GLUE, SQuAD, and Universal Dependencies benchmarks demonstrate that more complicated objectives tend to show better downstream results with at least half of the MLM complexity needed to perform comparably to MLM. Finally, we discuss how we should pretrain a model using a masked objective from the task complexity perspective.

2022

End-to-end Argument Mining with Cross-corpora Multi-task Learning
Gaku Morio | Hiroaki Ozaki | Terufumi Morishita | Kohsuke Yanai
Transactions of the Association for Computational Linguistics, Volume 10

Mining an argument structure from text is an important step for tasks such as argument search and summarization. While studies on argument(ation) mining have proposed promising neural network models, they usually suffer from a shortage of training data. To address this issue, we expand the training data with various auxiliary argument mining corpora and propose an end-to-end cross-corpus training method called Multi-Task Argument Mining (MT-AM). To evaluate our approach, we conducted experiments for the main argument mining tasks on several well-established argument mining corpora. The results demonstrate that MT-AM generally outperformed the models trained on a single corpus. Also, the smaller the target corpus was, the better the MT-AM performed. Our extensive analyses suggest that the improvement of MT-AM depends on several factors of transferability among auxiliary and target corpora.

Hitachi at SemEval-2022 Task 10: Comparing Graph- and Seq2Seq-based Models Highlights Difficulty in Structured Sentiment Analysis
Gaku Morio | Hiroaki Ozaki | Atsuki Yamaguchi | Yasuhiro Sogawa
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes our participation in SemEval-2022 Task 10, a structured sentiment analysis. In this task, we have to parse opinions considering both structure- and context-dependent subjective aspects, which is different from typical dependency parsing. Some of the major parser types have recently been used for semantic and syntactic parsing, while it is still unknown which type can capture structured sentiments well due to their subjective aspects. To this end, we compared two different types of state-of-the-art parser, namely graph-based and seq2seq-based. Our in-depth analyses suggest that, even though graph-based parser generally outperforms the seq2seq-based one, with strong pre-trained language models both parsers can essentially output acceptable and reasonable predictions. The analyses highlight that the difficulty derived from subjective aspects in structured sentiment analysis remains an essential challenge.

Hitachi at SemEval-2022 Task 2: On the Effectiveness of Span-based Classification Approaches for Multilingual Idiomaticity Detection
Atsuki Yamaguchi | Gaku Morio | Hiroaki Ozaki | Yasuhiro Sogawa
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In this paper, we describe our system for SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding. The task aims at detecting idiomaticity in an input sequence (Subtask A) and modeling representation of sentences that contain potential idiomatic multiword expressions (MWEs) (Subtask B) in three languages. We focus on the zero-shot setting of Subtask A and propose two span-based idiomaticity classification methods: MWE span-based classification and idiomatic MWE span prediction-based classification. We use several cross-lingual pre-trained language models (InfoXLM, XLM-R, and others) as our backbone network. Our best-performing system, fine-tuned with the span-based idiomaticity classification, ranked fifth in the zero-shot setting of Subtask A and exhibited a macro F1 score of 0.7466.

2021

Project-then-Transfer: Effective Two-stage Cross-lingual Transfer for Semantic Dependency Parsing
Hiroaki Ozaki | Gaku Morio | Terufumi Morishita | Toshinori Miyoshi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper describes the first report on cross-lingual transfer for semantic dependency parsing. We present the insight that there are two different kinds of cross-linguality, namely surface level and semantic level, and try to capture both kinds of cross-linguality by combining annotation projection and model transfer of pre-trained language models. Our experiments showed that the performance of our graph-based semantic dependency parser almost achieved the approximated upper bound.

2020

Hitachi at SemEval-2020 Task 3: Exploring the Representation Spaces of Transformers for Human Sense Word Similarity
Terufumi Morishita | Gaku Morio | Hiroaki Ozaki | Toshinori Miyoshi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present our system for SemEval-2020 task 3, Predicting the (Graded) Effect of Context in Word Similarity. Due to the unsupervised nature of the task, we concentrated on inquiring about the similarity measures induced by different layers of different pre-trained Transformer-based language models, which can be good approximations of the human sense of word similarity. Interestingly, our experiments reveal a language-independent characteristic: the middle to upper layers of Transformer-based language models can induce good approximate similarity measures. Finally, our system was ranked 1st on the Slovenian part of Subtask1 and 2nd on the Croatian part of both Subtask1 and Subtask2.

Hitachi at SemEval-2020 Task 11: An Empirical Study of Pre-Trained Transformer Family for Propaganda Detection
Gaku Morio | Terufumi Morishita | Hiroaki Ozaki | Toshinori Miyoshi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we show our system for SemEval-2020 task 11, where we tackle propaganda span identification (SI) and technique classification (TC). We investigate heterogeneous pre-trained language models (PLMs) such as BERT, GPT-2, XLNet, XLM, RoBERTa, and XLM-RoBERTa for SI and TC fine-tuning, respectively. In large-scale experiments, we found that each of the language models has a characteristic property, and using an ensemble model with them is promising. Finally, the ensemble model was ranked 1st amongst 35 teams for SI and 3rd amongst 31 teams for TC.

Hitachi at SemEval-2020 Task 10: Emphasis Distribution Fusion on Fine-Tuned Language Models
Gaku Morio | Terufumi Morishita | Hiroaki Ozaki | Toshinori Miyoshi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper shows our system for SemEval-2020 task 10, Emphasis Selection for Written Text in Visual Media. Our strategy is two-fold. First, we propose fine-tuning many pre-trained language models, predicting an emphasis probability distribution over tokens. Then, we propose stacking a trainable distribution fusion DistFuse system to fuse the predictions of the fine-tuned models. Experimental results show that DistFuse is comparable or better when compared with a naive average ensemble. As a result, we were ranked 2nd amongst 31 teams.

Hitachi at SemEval-2020 Task 8: Simple but Effective Modality Ensemble for Meme Emotion Recognition
Terufumi Morishita | Gaku Morio | Shota Horiguchi | Hiroaki Ozaki | Toshinori Miyoshi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Users of social networking services often share their emotions via multi-modal content, usually images paired with text embedded in them. SemEval-2020 task 8, Memotion Analysis, aims at automatically recognizing these emotions of so-called internet memes. In this paper, we propose a simple but effective Modality Ensemble that incorporates visual and textual deep-learning models, which are independently trained, rather than providing a single multi-modal joint network. To this end, we first fine-tune four pre-trained visual models (i.e., Inception-ResNet, PolyNet, SENet, and PNASNet) and four textual models (i.e., BERT, GPT-2, Transformer-XL, and XLNet). Then, we fuse their predictions with ensemble methods to effectively capture cross-modal correlations. The experiments performed on dev-set show that both visual and textual features aided each other, especially in subtask-C, and consequently, our system ranked 2nd on subtask-C.

Hitachi at SemEval-2020 Task 7: Stacking at Scale with Heterogeneous Language Models for Humor Recognition
Terufumi Morishita | Gaku Morio | Hiroaki Ozaki | Toshinori Miyoshi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the winning system for SemEval-2020 task 7: Assessing Humor in Edited News Headlines. Our strategy is Stacking at Scale (SaS) with heterogeneous pre-trained language models (PLMs) such as BERT and GPT-2. SaS first performs fine-tuning on numbers of PLMs with various hyperparameters and then applies a powerful stacking ensemble on top of the fine-tuned PLMs. Our experimental results show that SaS outperforms a naive average ensemble, leveraging weaker PLMs as well as high-performing PLMs. Interestingly, the results show that SaS captured non-funny semantics. Consequently, the system was ranked 1st in all subtasks by significant margins compared with other systems.

Corpus for Modeling User Interactions in Online Persuasive Discussions
Ryo Egawa | Gaku Morio | Katsuhide Fujita
Proceedings of the Twelfth Language Resources and Evaluation Conference

Persuasions are common in online arguments such as discussion forums. To analyze persuasive strategies, it is important to understand how individuals construct posts and comments based on the semantics of the argumentative components. In addition to understanding how we construct arguments, understanding how a user post interacts with other posts (i.e., argumentative inter-post relation) still remains a challenge. Therefore, in this study, we developed a novel annotation scheme and corpus that capture both user-generated inner-post arguments and inter-post relations between users in ChangeMyView, a persuasive forum. Our corpus consists of arguments with 4612 elementary units (EUs) (i.e., propositions), 2713 EU-to-EU argumentative relations, and 605 inter-post argumentative relations in 115 threads. We analyzed the annotated corpus to identify the characteristics of online persuasive arguments, and the results revealed persuasive documents have more claims than non-persuasive ones and different interaction patterns among persuasive and non-persuasive documents. Our corpus can be used as a resource for analyzing persuasiveness and training an argument mining system to identify and extract argument structures. The annotated corpus and annotation guidelines have been made publicly available.

Hitachi at MRP 2020: Text-to-Graph-Notation Transducer
Hiroaki Ozaki | Gaku Morio | Yuta Koreeda | Terufumi Morishita | Toshinori Miyoshi
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

This paper presents our proposed parser for the shared task on Meaning Representation Parsing (MRP 2020) at CoNLL, where participant systems were required to parse five types of graphs in different languages. We propose to unify these tasks as a text-to-graph-notation transduction in which we convert an input text into a graph notation. To this end, we designed a novel Plain Graph Notation (PGN) that handles various graphs universally. Then, our parser predicts a PGN-based sequence by leveraging Transformers and biaffine attentions. Notably, our parser can handle any PGN-formatted graphs with fewer framework-specific modifications. As a result, ensemble versions of the parser tied for 1st place in both cross-framework and cross-lingual tracks.

Towards Better Non-Tree Argument Mining: Proposition-Level Biaffine Parsing with Task-Specific Parameterization
Gaku Morio | Hiroaki Ozaki | Terufumi Morishita | Yuta Koreeda | Kohsuke Yanai
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

State-of-the-art argument mining studies have advanced the techniques for predicting argument structures. However, the technology for capturing non-tree-structured arguments is still in its infancy. In this paper, we focus on non-tree argument mining with a neural model. We jointly predict proposition types and edges between propositions. Our proposed model incorporates (i) task-specific parameterization (TSP) that effectively encodes a sequence of propositions and (ii) a proposition-level biaffine attention (PLBA) that can predict a non-tree argument consisting of edges. Experimental results show that both TSP and PLBA boost edge prediction performance compared to baselines.

2019

Annotating and Analyzing Semantic Role of Elementary Units and Relations in Online Persuasive Arguments
Ryo Egawa | Gaku Morio | Katsuhide Fujita
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

For analyzing online persuasions, one of the important goals is to semantically understand how people construct comments to persuade others. However, analyzing the semantic role of arguments for online persuasion has been less emphasized. Therefore, in this study, we propose a novel annotation scheme that captures the semantic role of arguments in a popular online persuasion forum, so-called ChangeMyView. Through this study, we have made the following contributions: (i) proposing a scheme that includes five types of elementary units (EUs) and two types of relations; (ii) annotating ChangeMyView, which results in 4612 EUs and 2713 relations in 345 posts; and (iii) analyzing the semantic role of persuasive arguments. Our analyses captured certain characteristic phenomena for online persuasion.

Hitachi at MRP 2019: Unified Encoder-to-Biaffine Network for Cross-Framework Meaning Representation Parsing
Yuta Koreeda | Gaku Morio | Terufumi Morishita | Hiroaki Ozaki | Kohsuke Yanai
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

This paper describes the proposed system of the Hitachi team for the Cross-Framework Meaning Representation Parsing (MRP 2019) shared task. In this shared task, the participating systems were asked to predict nodes, edges and their attributes for five frameworks, each with different order of “abstraction” from input tokens. We proposed a unified encoder-to-biaffine network for all five frameworks, which effectively incorporates a shared encoder to extract rich input features, decoder networks to generate anchorless nodes in UCCA and AMR, and biaffine networks to predict edges. Our system was ranked fifth with the macro-averaged MRP F1 score of 0.7604, and outperformed the baseline unified transition-based MRP. Furthermore, post-evaluation experiments showed that we can boost the performance of the proposed system by incorporating multi-task learning, whereas the baseline could not. These imply efficacy of incorporating the biaffine network to the shared architecture for MRP and that learning heterogeneous meaning representations at once can boost the system performance.

Revealing and Predicting Online Persuasion Strategy with Elementary Units
Gaku Morio | Ryo Egawa | Katsuhide Fujita
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In online arguments, identifying how users construct their arguments to persuade others is important in order to understand a persuasive strategy directly. However, existing research lacks empirical investigations on highly semantic aspects of elementary units (EUs), such as propositions for a persuasive online argument. Therefore, this paper focuses on a pilot study, revealing a persuasion strategy using EUs. Our contributions are as follows: (1) annotating five types of EUs in a persuasive forum, the so-called ChangeMyView, (2) revealing both intuitive and non-intuitive strategic insights for the persuasion by analyzing 4612 annotated EUs, and (3) proposing baseline neural models that identify the EU boundary and type. Our observations imply that EUs definitively characterize online persuasion strategies.

2018

End-to-End Argument Mining for Discussion Threads Based on Parallel Constrained Pointer Architecture
Gaku Morio | Katsuhide Fujita
Proceedings of the 5th Workshop on Argument Mining

Argument Mining (AM) is a relatively recent discipline, which concentrates on extracting claims or premises from discourses, and inferring their structures. However, many existing works do not consider micro-level AM studies on discussion threads sufficiently. In this paper, we tackle AM for discussion threads. Our main contributions are as follows: (1) A novel combination scheme focusing on micro-level inner- and inter-post schemes for a discussion thread. (2) Annotation of large-scale civic discussion threads with the scheme. (3) Parallel constrained pointer architecture (PCPA), a novel end-to-end technique to discriminate sentence types, inner-post relations, and inter-post interactions simultaneously. The experimental results demonstrate that our proposed model shows better accuracy in terms of relation extraction, in comparison to existing state-of-the-art models.

Venues