Ryohei Sasano - ACL Anthology

Ryohei Sasano

2026

How Do Language Models Acquire Character-Level Information?
Soma Sato | Ryohei Sasano
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models (LMs) have been reported to implicitly encode character-level information, despite not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those independent of tokenization. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.

2025

Semantic Frame Induction from a Real-World Corpus
Shogo Tsujimoto | Kosuke Yamada | Ryohei Sasano
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Recent studies on semantic frame induction have demonstrated that the emergence of pre-trained language models (PLMs) has led to more accurate results.However, most existing studies evaluate the performance using frame resources such as FrameNet, which may not accurately reflect real-world language usage.In this study, we conduct semantic frame induction using the Colossal Clean Crawled Corpus (C4) and assess the applicability of existing frame induction methods to real-world data.Our experimental results demonstrate that existing frame induction methods are effective on real-world data and that frames corresponding to novel concepts can be induced.

Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings
Hayato Tsukagoshi | Ryohei Sasano
Findings of the Association for Computational Linguistics: ACL 2025

Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs of embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which are considered to have very high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy compared with those for retrieval and STS.

FrameEOL: Semantic Frame Induction using Causal Language Models
Chihiro Yano | Kosuke Yamada | Hayato Tsukagoshi | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: EMNLP 2025

Semantic frame induction is the task of clustering frame-evoking words according to the semantic frames they evoke. In recent years, leveraging embeddings of frame-evoking words that are obtained using masked language models (MLMs) such as BERT has led to high-performance semantic frame induction. Although causal language models (CLMs) such as the GPT and Llama series succeed in a wide range of language comprehension tasks and can engage in dialogue as if they understood frames, they have not yet been applied to semantic frame induction. We propose a new method for semantic frame induction based on CLMs. Specifically, we introduce FrameEOL, a prompt-based method for obtaining Frame Embeddings that outputs One frame-name as a Label representing the given situation. To obtain embeddings more suitable for frame induction, we leverage in-context learning (ICL) and deep metric learning (DML). Frame induction is then performed by clustering the resulting embeddings. Experimental results on the English and Japanese FrameNet datasets demonstrate that the proposed methods outperform existing frame induction methods. In particular, for Japanese, which lacks extensive frame resources, the CLM-based method using only 5 ICL examples achieved comparable performance to the MLM-based method fine-tuned with DML.

Optimizing the Arrangement of Citations in Related Work Section
Masashi Oshika | Ryohei Sasano
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

In related work section of a scientific paper, authors collect relevant citations and structure them into coherent paragraphs that follow a logical order. Previous studies have addressed citation recommendation and related work section generation in settings where both the citations and their order are provided in advance. However, they have not adequately addressed the optimal ordering of these citations, which is a critical step for achieving fully automated related work section generation. In this study, we propose a new task, citation arrangement, which focuses on determining the optimal order of cited papers to enable fully automated related work section generation. Our approach decomposes citation arrangement into three tasks: citation clustering, paragraph ordering, and citation ordering within a paragraph. For each task, we propose a method that uses a large language model (LLM) in combination with a graph-based technique to comprehensively consider the context of each paper and the relationships among all cited papers. The experimental results show that our method is more effective than methods that generate outputs for each task using only an LLM.

2024

Sentence Representations via Gaussian Embedding
Shohei Yoda | Hayato Tsukagoshi | Ryohei Sasano | Koichi Takeda
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Recent progress in sentence embedding, which represents a sentence’s meaning as a point in a vector space, has achieved high performance on several tasks such as the semantic textual similarity (STS) task.However, a sentence representation cannot adequately express the diverse information that sentences contain: for example, such representations cannot naturally handle asymmetric relationships between sentences.This paper proposes GaussCSE, a Gaussian-distribution-based contrastive learning framework for sentence embedding that can handle asymmetric inter-sentential relations, as well as a similarity measure for identifying entailment relations.Our experiments show that GaussCSE achieves performance comparable to that of previous methods on natural language inference (NLI) tasks, and that it can estimate the direction of entailment relations, which is difficult with point representations.

To Drop or Not to Drop? Predicting Argument Ellipsis Judgments: A Case Study in Japanese
Yukiko Ishizuki | Tatsuki Kuribayashi | Yuichiroh Matsubayashi | Ryohei Sasano | Kentaro Inui
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Speakers sometimes omit certain arguments of a predicate in a sentence; such omission is especially frequent in pro-drop languages. This study addresses a question about ellipsis—what can explain the native speakers’ ellipsis decisions?—motivated by the interest in human discourse processing and writing assistance for this choice. To this end, we first collect large-scale human annotations of whether and why a particular argument should be omitted across over 2,000 data points in the balanced corpus of Japanese, a prototypical pro-drop language. The data indicate that native speakers overall share common criteria for such judgments and further clarify their quantitative characteristics, e.g., the distribution of related linguistic factors in the balanced corpus. Furthermore, the performance of the language model–based argument ellipsis judgment model is examined, and the gap between the systems’ prediction and human judgments in specific linguistic aspects is revealed. We hope our fundamental resource encourages further studies on natural human ellipsis judgment.

Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples
Soma Sato | Hayato Tsukagoshi | Ryohei Sasano | Koichi Takeda
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL requires a manually annotated natural language inference (NLI) dataset for fine-tuning.We aim to improve sentence embeddings without using large manually annotated datasets by automatically generating an NLI dataset with an LLM and using it for fine-tuning of PromptEOL. To achieve this, we explore methods of data generation suitable for sentence embedding learning in this study. Specifically, we will focus on automatic dataset generation through few-shot learning and explore the appropriate methods to leverage few-shot examples. Experimental results on the STS tasks demonstrate that our approach outperforms existing models in settings without large manually annotated datasets.

Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs
Masashi Oshika | Makoto Morishita | Tsutomu Hirao | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: ACL 2024

In recent years, neural machine translation (NMT) has become widely used in everyday life. However, the current NMT lacks a mechanism to adjust the difficulty level of translations to match the user’s language level. Additionally, due to the bias in the training data for NMT, translations of simple source sentences are often produced with complex words. In particular, this could pose a problem for children, who may not be able to understand the meaning of the translations correctly. In this study, we propose a method that replaces high Age of Acquisitions (AoA) words in translations with simpler words to match the translations to the user’s level. We achieve this by using large language models (LLMs), providing a triple of a source sentence, a translation, and a target word to be replaced. We create a benchmark dataset using back-translation on Simple English Wikipedia. The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words and, moreover, can iteratively replace most of the high-AoA words while still maintaining high BLEU and COMET scores.

Definition Generation for Automatically Induced Semantic Frame
Yi Han | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: ACL 2024

In a semantic frame resource such as FrameNet, the definition sentence of a frame is essential for humans to understand the meaning of the frame intuitively. Recently, several attempts have been made to induce semantic frames from large corpora, but the cost of creating the definition sentences for such frames is significant. In this paper, we address a new task of generating frame definitions from a set of frame-evoking words. Specifically, given a cluster of frame-evoking words and associated exemplars induced as the same semantic frame, we utilize a large language model to generate frame definitions. We demonstrate that incorporating frame element reasoning as chain-of-thought can enhance the inclusion of correct frame elements in the generated definitions.

Verifying Claims About Metaphors with Large-Scale Automatic Metaphor Identification
Kotaro Aono | Ryohei Sasano | Koichi Takeda
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

There are several linguistic claims about situations where words are more likely to be used as metaphors.However, few studies have sought to verify such claims with large corpora.This study entails a large-scale, corpus-based analysis of certain existing claims about verb metaphors, by applying metaphor detection to sentences extracted from Common Crawl and using the statistics obtained from the results.The verification results indicate that the direct objects of verbs used as metaphors tend to have lower degrees of concreteness, imageability, and familiarity, and that metaphors are more likely to be used in emotional and subjective sentences.

WikiSplit++: Easy Data Refinement for Split and Rephrase
Hayato Tsukagoshi | Tsutomu Hirao | Makoto Morishita | Katsuki Chousa | Ryohei Sasano | Koichi Takeda
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.

2023

Semantic Frame Induction with Deep Metric Learning
Kosuke Yamada | Ryohei Sasano | Koichi Takeda
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Recent studies have demonstrated the usefulness of contextualized word embeddings in unsupervised semantic frame induction. However, they have also revealed that generic contextualized embeddings are not always consistent with human intuitions about semantic frames, which causes unsatisfactory performance for frame induction based on contextualized embeddings. In this paper, we address supervised semantic frame induction, which assumes the existence of frame-annotated data for a subset of predicates in a corpus and aims to build a frame induction model that leverages the annotated data. We propose a model that uses deep metric learning to fine-tune a contextualized embedding model, and we apply the fine-tuned contextualized embeddings to perform semantic frame induction. Our experiments on FrameNet show that fine-tuning with deep metric learning considerably improves the clustering evaluation scores, namely, the B-cubed F-score and Purity F-score, by about 8 points or more. We also demonstrate that our approach is effective even when the number of training instances is small.

Acquiring Frame Element Knowledge with Deep Metric Learning for Semantic Frame Induction
Kosuke Yamada | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: ACL 2023

The semantic frame induction tasks are defined as a clustering of words into the frames that they evoke, and a clustering of their arguments according to the frame element roles that they should fill. In this paper, we address the latter task of argument clustering, which aims to acquire frame element knowledge, and propose a method that applies deep metric learning. In this method, a pre-trained language model is fine-tuned to be suitable for distinguishing frame element roles through the use of frame-annotated data, and argument clustering is performed with embeddings obtained from the fine-tuned model. Experimental results on FrameNet demonstrate that our method achieves substantially better performance than existing methods.

Transformer-based Live Update Generation for Soccer Matches from Microblog Posts
Masashi Oshika | Kosuke Yamada | Ryohei Sasano | Koichi Takeda
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

It has been known to be difficult to generate adequate sports updates from a sequence of vast amounts of diverse live tweets, although the live sports viewing experience with tweets is gaining the popularity. In this paper, we focus on soccer matches and work on building a system to generate live updates for soccer matches from tweets so that users can instantly grasp a match’s progress and enjoy the excitement of the match from raw tweets. Our proposed system is based on a large pre-trained language model and incorporates a mechanism to control the number of updates and a mechanism to reduce the redundancy of duplicate and similar updates.

Building a Buzzer-quiz Answering System
Naoya Sugiura | Kosuke Yamada | Ryohei Sasano | Koichi Takeda | Katsuhiko Toyama
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

A buzzer quiz is a genre of quiz in which multiple players simultaneously listen to a quiz being read aloud and respond it by buzzing in as soon as they can predict the answer. Because incorrect answers often result in penalties, a buzzer-quiz answering system must not only predict the answer from only part of a question but also estimate the predicted answer’s accuracy. In this paper, we introduce two types of buzzer-quiz answering systems: (1) a system that directly generates an answer from part of a question by using an autoregressive language model; and (2) a system that first reconstructs the entire question by using an autoregressive language model and then determines the answer according to the reconstructed question. We then propose a method to estimate the accuracy of the answers for each system by using the internal scores of each model.

Realistic Citation Count Prediction Task for Newly Published Papers
Jun Hirako | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: EACL 2023

Citation count prediction is the task of predicting the future citation counts of academic papers, which is particularly useful for estimating the future impacts of an ever-growing number of academic papers. Although there have been many studies on citation count prediction, they are not applicable to predicting the citation counts of newly published papers, because they assume the availability of future citation counts for papers that have not had enough time pass since publication. In this paper, we first identify problems in the settings of existing studies and introduce a realistic citation count prediction task that strictly uses information available at the time of a target paper’s publication. For realistic citation count prediction, we then propose two methods to leverage the citation counts of papers shortly after publication. Through experiments using papers collected from arXiv and bioRxiv, we demonstrate that our methods considerably improve the performance of citation count prediction for newly published papers in a realistic setting.

2022

Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
Hongkuan Zhang | Saku Sugawara | Akiko Aizawa | Lei Zhou | Ryohei Sasano | Koichi Takeda
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Image captioning models require the high-level generalization ability to describe the contents of various images in words. Most existing approaches treat the image–caption pairs equally in their training without considering the differences in their learning difficulties. Several image captioning approaches introduce curriculum learning methods that present training data with increasing levels of difficulty. However, their difficulty measurements are either based on domain-specific features or prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision–language model. Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance and competitive convergence speed to baselines without requiring heuristics or incurring additional training costs. Moreover, the higher model performance on difficult examples and unseen data also demonstrates the generalization ability.

Cross-lingual Linking of Automatically Constructed Frames and FrameNet
Ryohei Sasano
Proceedings of the Thirteenth Language Resources and Evaluation Conference

A semantic frame is a conceptual structure describing an event, relation, or object along with its participants. Several semantic frame resources have been manually elaborated, and there has been much interest in the possibility of applying semantic frames designed for a particular language to other languages, which has led to the development of cross-lingual frame knowledge. However, manually developing such cross-lingual lexical resources is labor-intensive. To support the development of such resources, this paper presents an attempt at automatic cross-lingual linking of automatically constructed frames and manually crafted frames. Specifically, we link automatically constructed example-based Japanese frames to English FrameNet by using cross-lingual word embeddings and a two-stage model that first extracts candidate FrameNet frames for each Japanese frame by taking only the frame-evoking words into account, then finds the best alignment of frames by also taking frame elements into account. Experiments using frame-annotated sentences in Japanese FrameNet indicate that our approach will facilitate the manual development of cross-lingual frame resources.

Comparison and Combination of Sentence Embeddings Derived from Different Supervision Signals
Hayato Tsukagoshi | Ryohei Sasano | Koichi Takeda
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

There have been many successful applications of sentence embedding methods. However, it has not been well understood what properties are captured in the resulting sentence embeddings depending on the supervision signals. In this paper, we focus on two types of sentence embedding methods with similar architectures and tasks: one fine-tunes pre-trained language models on the natural language inference task, and the other fine-tunes pre-trained language models on word prediction task from its definition sentence, and investigate their properties. Specifically, we compare their performances on semantic textual similarity (STS) tasks using STS datasets partitioned from two perspectives: 1) sentence source and 2) superficial similarity of the sentence pairs, and compare their performances on the downstream and probing tasks. Furthermore, we attempt to combine the two methods and demonstrate that combining the two methods yields substantially better performance than the respective methods on unsupervised STS tasks and downstream tasks.

Automating Interlingual Homograph Recognition with Parallel Sentences
Yi Han | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Interlingual homographs are words that spell the same but possess different meanings across languages. Recognizing interlingual homographs from form-identical words generally needs linguistic knowledge and massive annotation work. In this paper, we propose an automatic interlingual homograph recognition method based on the cross-lingual word embedding similarity and co-occurrence of form-identical words in parallel sentences. We conduct experiments with various off-the-shelf language models coordinating with cross-lingual alignment operations and co-occurrence metrics on the Chinese-Japanese and English-Dutch language pairs. Experimental results demonstrate that our proposed method is able to make accurate and consistent predictions across languages.

Leveraging Three Types of Embeddings from Masked Language Models in Idiom Token Classification
Ryosuke Takahashi | Ryohei Sasano | Koichi Takeda
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

Many linguistic expressions have idiomatic and literal interpretations, and the automatic distinction of these two interpretations has been studied for decades. Recent research has shown that contextualized word embeddings derived from masked language models (MLMs) can give promising results for idiom token classification. This indicates that contextualized word embedding alone contains information about whether the word is being used in a literal sense or not. However, we believe that more types of information can be derived from MLMs and that leveraging such information can improve idiom token classification. In this paper, we leverage three types of embeddings from MLMs; uncontextualized token embeddings and masked token embeddings in addition to the standard contextualized word embeddings and show that the newly added embeddings significantly improve idiom token classification for both English and Japanese datasets.

2021

Verb Sense Clustering using Contextualized Word Representations for Semantic Frame Induction
Kosuke Yamada | Ryohei Sasano | Koichi Takeda
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Semantic Frame Induction using Masked Word Embeddings and Two-Step Clustering
Kosuke Yamada | Ryohei Sasano | Koichi Takeda
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Recent studies on semantic frame induction show that relatively high performance has been achieved by using clustering-based methods with contextualized word embeddings. However, there are two potential drawbacks to these methods: one is that they focus too much on the superficial information of the frame-evoking verb and the other is that they tend to divide the instances of the same verb into too many different frame clusters. To overcome these drawbacks, we propose a semantic frame induction method using masked word embeddings and two-step clustering. Through experiments on the English FrameNet data, we demonstrate that using the masked word embeddings is effective for avoiding too much reliance on the surface information of frame-evoking verbs and that two-step clustering can improve the number of resulting frame clusters for the instances of the same verb.

Self-Guided Curriculum Learning for Neural Machine Translation
Lei Zhou | Liang Ding | Kevin Duh | Shinji Watanabe | Ryohei Sasano | Koichi Takeda
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

In supervised learning, a well-trained model should be able to recover ground truth accurately, i.e. the predicted labels are expected to resemble the ground truth labels as much as possible. Inspired by this, we formulate a difficulty criterion based on the recovery degrees of training examples. Motivated by the intuition that after skimming through the training corpus, the neural machine translation (NMT) model “knows” how to schedule a suitable curriculum according to learning difficulty, we propose a self-guided curriculum learning strategy that encourages the NMT model to learn from easy to hard on the basis of recovery degrees. Specifically, we adopt sentence-level BLEU score as the proxy of recovery degree. Experimental results on translation benchmarks including WMT14 English-German and WMT17 Chinese-English demonstrate that our proposed method considerably improves the recovery degree, thus consistently improving the translation performance.

Transformer-based Lexically Constrained Headline Generation
Kosuke Yamada | Yuta Hitomi | Hideaki Tamori | Ryohei Sasano | Naoaki Okazaki | Kentaro Inui | Koichi Takeda
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper explores a variant of automatic headline generation methods, where a generated headline is required to include a given phrase such as a company or a product name. Previous methods using Transformer-based models generate a headline including a given phrase by providing the encoder with additional information corresponding to the given phrase. However, these methods cannot always include the phrase in the generated headline. Inspired by previous RNN-based methods generating token sequences in backward and forward directions from the given phrase, we propose a simple Transformer-based method that guarantees to include the given phrase in the high-quality generated headline. We also consider a new headline generation strategy that takes advantage of the controllable generation order of Transformer. Our experiments with the Japanese News Corpus demonstrate that our methods, which are guaranteed to include the phrase in the generated headline, achieve ROUGE scores comparable to previous Transformer-based methods. We also show that our generation strategy performs better than previous strategies.

DefSent: Sentence Embeddings using Definition Sentences
Hayato Tsukagoshi | Ryohei Sasano | Koichi Takeda
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Sentence embedding methods using natural language inference (NLI) datasets have been successfully applied to various tasks. However, these methods are only available for limited languages due to relying heavily on the large NLI datasets. In this paper, we propose DefSent, a sentence embedding method that uses definition sentences from a word dictionary, which performs comparably on unsupervised semantics textual similarity (STS) tasks and slightly better on SentEval tasks than conventional methods. Since dictionaries are available for many languages, DefSent is more broadly applicable than methods using NLI datasets without constructing additional datasets. We demonstrate that DefSent performs comparably on unsupervised semantics textual similarity (STS) tasks and slightly better on SentEval tasks to the methods using large NLI datasets. Our code is publicly available at https://github.com/hpprc/defsent.

2020

Development of a Medical Incident Report Corpus with Intention and Factuality Annotation
Hongkuan Zhang | Ryohei Sasano | Koichi Takeda | Zoie Shui-Yee Wong
Proceedings of the Twelfth Language Resources and Evaluation Conference

Medical incident reports (MIRs) are documents that record what happened in a medical incident. A typical MIR consists of two sections: a structured categorical part and an unstructured text part. Most texts in MIRs describe what medication was intended to be given and what was actually given, because what happened in an incident is largely due to discrepancies between intended and actual medications. Recognizing the intention of clinicians and the factuality of medication is essential to understand the causes of medical incidents and avoid similar incidents in the future. Therefore, we are developing an MIR corpus with annotation of intention and factuality as well as of medication entities and their relations. In this paper, we present our annotation scheme with respect to the definition of medication entities that we take into account, the method to annotate the relations between entities, and the details of the intention and factuality annotation. We then report the annotated corpus consisting of 349 Japanese medical incident reports.

Sequential Span Classification with Neural Semi-Markov CRFs for Biomedical Abstracts
Kosuke Yamada | Tsutomu Hirao | Ryohei Sasano | Koichi Takeda | Masaaki Nagata
Findings of the Association for Computational Linguistics: EMNLP 2020

Dividing biomedical abstracts into several segments with rhetorical roles is essential for supporting researchers’ information access in the biomedical domain. Conventional methods have regarded the task as a sequence labeling task based on sequential sentence classification, i.e., they assign a rhetorical label to each sentence by considering the context in the abstract. However, these methods have a critical problem: they are prone to mislabel longer continuous sentences with the same rhetorical label. To tackle the problem, we propose sequential span classification that assigns a rhetorical label, not to a single sentence but to a span that consists of continuous sentences. Accordingly, we introduce Neural Semi-Markov Conditional Random Fields to assign the labels to such spans by considering all possible spans of various lengths. Experimental results obtained from PubMed 20k RCT and NICTA-PIBOSO datasets demonstrate that our proposed method achieved the best micro sentence-F1 score as well as the best micro span-F1 score.

Investigating Word-Class Distributions in Word Vector Spaces
Ryohei Sasano | Anna Korhonen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper presents an investigation on the distribution of word vectors belonging to a certain word class in a pre-trained word vector space. To this end, we made several assumptions about the distribution, modeled the distribution accordingly, and validated each assumption by comparing the goodness of each model. Specifically, we considered two types of word classes – the semantic class of direct objects of a verb and the semantic class in a thesaurus – and tried to build models that properly estimate how likely it is that a word in the vector space is a member of a given word class. Our results on selectional preference and WordNet datasets show that the centroid-based model will fail to achieve good enough performance, the geometry of the distribution and the existence of subgroups will have limited impact, and also the negative instances need to be considered for adequate modeling of the distribution. We further investigated the relationship between the scores calculated by each model and the degree of membership and found that discriminative learning-based models are best in finding the boundaries of a class, while models based on the offset between positive and negative instances perform best in determining the degree of membership.

2019

Incorporating Textual Information on User Behavior for Personality Prediction
Kosuke Yamada | Ryohei Sasano | Koichi Takeda
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Several recent studies have shown that textual information of user posts and user behaviors such as liking and sharing the specific posts are useful for predicting the personality of social media users. However, less attention has been paid to the textual information derived from the user behaviors. In this paper, we investigate the effect of textual information on user behaviors for personality prediction. Our experiments on the personality prediction of Twitter users show that the textual information of user behaviors is more useful than the co-occurrence information of the user behaviors. They also show that taking user behaviors into account is crucial for predicting the personality of users who do not post frequently.

2018

An Empirical Study on Fine-Grained Named Entity Recognition
Khai Mai | Thai-Hoang Pham | Minh Trung Nguyen | Tuan Duc Nguyen | Danushka Bollegala | Ryohei Sasano | Satoshi Sekine
Proceedings of the 27th International Conference on Computational Linguistics

Named entity recognition (NER) has attracted a substantial amount of research. Recently, several neural network-based models have been proposed and achieved high performance. However, there is little research on fine-grained NER (FG-NER), in which hundreds of named entity categories must be recognized, especially for non-English languages. It is still an open question whether there is a model that is robust across various settings or the proper model varies depending on the language, the number of named entity categories, and the size of training datasets. This paper first presents an empirical comparison of FG-NER models for English and Japanese and demonstrates that LSTM+CNN+CRF (Ma and Hovy, 2016), one of the state-of-the-art methods for English NER, also works well for English FG-NER but does not work well for Japanese, a language that has a large number of character types. To tackle this problem, we propose a method to improve the neural network-based Japanese FG-NER performance by removing the CNN layer and utilizing dictionary and category embeddings. Experiment results show that the proposed method improves Japanese FG-NER F-score from 66.76% to 75.18%.

2017

Distinguishing Japanese Non-standard Usages from Standard Ones
Tatsuya Aoki | Ryohei Sasano | Hiroya Takamura | Manabu Okumura
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We focus on non-standard usages of common words on social media. In the context of social media, words sometimes have other usages that are totally different from their original. In this study, we attempt to distinguish non-standard usages on social media from standard ones in an unsupervised manner. Our basic idea is that non-standardness can be measured by the inconsistency between the expected meaning of the target word and the given context. For this purpose, we use context embeddings derived from word embeddings. Our experimental results show that the model leveraging the context embedding outperforms other methods and provide us with findings, for example, on how to construct context embeddings and which corpus to use.

Extended Named Entity Recognition API and Its Applications in Language Education
Tuan Duc Nguyen | Khai Mai | Thai-Hoang Pham | Minh Trung Nguyen | Truc-Vien T. Nguyen | Takashi Eguchi | Ryohei Sasano | Satoshi Sekine
Proceedings of ACL 2017, System Demonstrations

2016

Controlling Output Length in Neural Encoder-Decoders
Yuta Kikuchi | Graham Neubig | Ryohei Sasano | Hiroya Takamura | Manabu Okumura
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

A Corpus-Based Analysis of Canonical Word Order of Japanese Double Object Constructions
Ryohei Sasano | Manabu Okumura
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Context-Dependent Automatic Response Generation Using Statistical Machine Translation Techniques
Andrew Shin | Ryohei Sasano | Hiroya Takamura | Manabu Okumura
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2013

Automatic Knowledge Acquisition for Case Alternation between the Passive and Active Voices in Japanese
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi | Manabu Okumura
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Subtree Extractive Summarization via Submodular Maximization
Hajime Morita | Ryohei Sasano | Hiroya Takamura | Manabu Okumura
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A Simple Approach to Unknown Word Processing in Japanese Morphological Analysis
Ryohei Sasano | Sadao Kurohashi | Manabu Okumura
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Generating “A for Alpha” When There Are Thousands of Characters
Hiroaki Kawasaki | Ryohei Sasano | Hiroya Takamura | Manabu Okumura
Proceedings of COLING 2012

2011

A Discriminative Approach to Japanese Zero Anaphora Resolution with Large-scale Lexicalized Case Frames
Ryohei Sasano | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

2009

The Effect of Corpus Size on Case Frame Acquisition for Discourse Analysis
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

A Probabilistic Model for Associative Anaphora Resolution
Ryohei Sasano | Sadao Kurohashi
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

A Fully-Lexicalized Probabilistic Model for Japanese Zero Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Japanese Named Entity Recognition Using Structural Natural Language Processing
Ryohei Sasano | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

2004

Automatic Construction of Nominal Case Frames and its Application to Indirect Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Toward Text Understanding: Integrating Relevance-tagged Corpus and Automatically Constructed Case Frames
Daisuke Kawahara | Ryohei Sasano | Sadao Kurohashi
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper proposes a wide-range anaphora resolution system toward text understanding. This system resolves zero, direct and indirect anaphors in Japanese texts by integrating two sorts of linguistic resources: a hand-annotated corpus with various relations and automatically constructed case frames. The corpus has relevance tags which consist of predicate-argument relations, relations between nouns and coreferences, and is utilized for learning parameters of the system and testing it. The case frames are indispensable knowledge both for detecting zero/indirect anaphors and estimating appropriate antecedents. Our preliminary experiments showed promising results.

Venues