Shinsuke Mori

2024

pdf bib abs
Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
Keyaki Ohno | Hirotaka Kameko | Keisuke Shirai | Taichi Nishimura | Shinsuke Mori
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from the coverage of location expressions on general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, location expressions of the article title and those hyperlinks to other articles are assigned with coordinates. By utilizing hyperlinks, we can accurately assign location expressions with coordinates even with ambiguous location expressions in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.

2023

pdf bib abs
Towards Flow Graph Prediction of Open-Domain Procedural Texts
Keisuke Shirai | Hirotaka Kameko | Shinsuke Mori
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

Machine comprehension of procedural texts is essential for reasoning about the steps and automating the procedures. However, this requires identifying entities within a text and resolving the relationships between the entities. Previous work focused on the cooking domain and proposed a framework to convert a recipe text into a flow graph (FG) representation. In this work, we propose a framework based on the recipe FG for flow graph prediction of open-domain procedural texts. To investigate flow graph prediction performance in non-cooking domains, we introduce the wikiHow-FG corpus from articles on wikiHow, a website of how-to instruction articles. In experiments, we consider using the existing recipe corpus and performing domain adaptation from the cooking to the target domain. Experimental results show that the domain adaptation models achieve higher performance than those trained only on the cooking or target domain data.

2022

pdf bib abs
Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows
Keisuke Shirai | Atsushi Hashimoto | Taichi Nishimura | Hirotaka Kameko | Shuhei Kurita | Yoshitaka Ushiku | Shinsuke Mori
Proceedings of the 29th International Conference on Computational Linguistics

We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn a cooking action result for each object in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph. We developed a web interface to reduce human annotation costs. The dataset allows us to try various applications, including multimodal information retrieval.

2020

pdf bib abs
A Contract Corpus for Recognizing Rights and Obligations
Ruka Funaki | Yusuke Nagata | Kohei Suenaga | Shinsuke Mori
Proceedings of the Twelfth Language Resources and Evaluation Conference

A contract is a legal document executed by two or more parties. It is important for these parties to precisely understand their rights and obligations that are described in the contract. However, understanding the content of a contract is sometimes difficult and costly, particularly if the contract is long and complicated. Therefore, a language-processing system that can present information concerning rights and obligations found within a given contract document would help a contracting party to make better decisions. As a step toward the development of such a language-processing system, in this paper, we describe the annotated corpus of contract documents that we built. Our corpus is annotated so that a language-processing system can recognize a party’s rights and obligations. The annotated information includes the parties involved in the contract, the rights and obligations of the parties, the conditions and the exceptions under which these rights and obligations to take effect. The corpus was built based on 46 English contracts and 25 Japanese contracts drafted by lawyers. We explain how we annotated the corpus and the statistics of the corpus. We also report the results of the experiments for recognizing rights and obligations.

In this paper, we provide a dataset that gives visual grounding annotations to recipe flow graphs. A recipe flow graph is a representation of the cooking workflow, which is designed with the aim of understanding the workflow from natural language processing. Such a workflow will increase its value when grounded to real-world activities, and visual grounding is a way to do so. Visual grounding is provided as bounding boxes to image sequences of recipes, and each bounding box is linked to an element of the workflow. Because the workflows are also linked to the text, this annotation gives visual grounding with workflow’s contextual information between procedural text and visual observation in an indirect manner. We subsidiarily annotated two types of event attributes with each bounding box: “doing-the-action,” or “done-the-action”. As a result of the annotation, we got 2,300 bounding boxes in 272 flow graph recipes. Various experiments showed that the proposed dataset enables us to estimate contextual information described in recipe flow graphs from an image sequence.

pdf bib abs
Annotating Event Appearance for Japanese Chess Commentary Corpus
Hirotaka Kameko | Shinsuke Mori
Proceedings of the Twelfth Language Resources and Evaluation Conference

In recent years, there has been a surge of interest in natural language processing related to the real world, such as symbol grounding, language generation, and non-linguistic data search by natural language queries. Researchers usually collect pairs of text and non-text data for research. However, the text and non-text data are not always a “true” pair. We focused on the shogi (Japanese chess) commentaries, which are accompanied by game states as a well-defined “real world”. For analyzing and processing texts accurately, considering only the given states is insufficient, and we must consider the relationship between texts and the real world. In this paper, we propose “Event Appearance” labels that show the relationship between events mentioned in texts and those happening in the real world. Our event appearance label set consists of temporal relation, appearance probability, and evidence of the event. Statistics of the annotated corpus and the experimental result show that there exists temporal relation which skillful annotators realize in common. However, it is hard to predict the relationship only by considering the given states.

pdf bib abs
English Recipe Flow Graph Corpus
Yoko Yamakata | Shinsuke Mori | John Carroll
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present an annotated corpus of English cooking recipe procedures, and describe and evaluate computational methods for learning these annotations. The corpus consists of 300 recipes written by members of the public, which we have annotated with domain-specific linguistic and semantic structure. Each recipe is annotated with (1) ‘recipe named entities’ (r-NEs) specific to the recipe domain, and (2) a flow graph representing in detail the sequencing of steps, and interactions between cooking tools, food ingredients and the products of intermediate steps. For these two kinds of annotations, inter-annotator agreement ranges from 82.3 to 90.5 F1, indicating that our annotation scheme is appropriate and consistent. We experiment with producing these annotations automatically. For r-NE tagging we train a deep neural network NER tool; to compute flow graphs we train a dependency-style parsing procedure which we apply to the entire sequence of r-NEs in a recipe. In evaluations, our systems achieve 71.1 to 87.5 F1, demonstrating that our annotation scheme is learnable.

2019

pdf bib abs
Procedural Text Generation from a Photo Sequence
Taichi Nishimura | Atsushi Hashimoto | Shinsuke Mori
Proceedings of the 12th International Conference on Natural Language Generation

Multimedia procedural texts, such as instructions and manuals with pictures, support people to share how-to knowledge. In this paper, we propose a method for generating a procedural text given a photo sequence allowing users to obtain a multimedia procedural text. We propose a single embedding space both for image and text enabling to interconnect them and to select appropriate words to describe a photo. We implemented our method and tested it on cooking instructions, i.e., recipes. Various experimental results showed that our method outperforms standard baselines.

2018

pdf bib
Annotating Modality Expressions and Event Factuality for a Japanese Chess Commentary Corpus
Suguru Matsuyoshi | Hirotaka Kameko | Yugo Murawaki | Shinsuke Mori
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
Procedural Text Generation from an Execution Video
Atsushi Ushiku | Hayato Hashimoto | Atsushi Hashimoto | Shinsuke Mori
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In recent years, there has been a surge of interest in automatically describing images or videos in a natural language. These descriptions are useful for image/video search, etc. In this paper, we focus on procedure execution videos, in which a human makes or repairs something and propose a method for generating procedural texts from them. Since video/text pairs available are limited in size, the direct application of end-to-end deep learning is not feasible. Thus we propose to train Faster R-CNN network for object recognition and LSTM for text generation and combine them at run time. We took pairs of recipe and cooking video, generated a recipe from a video, and compared it with the original recipe. The experimental results showed that our method can produce a recipe as accurate as the state-of-the-art scene descriptions.

pdf bib
Japanese all-words WSD system using the Kyoto Text Analysis ToolKit
Hiroyuki Shinnou | Kanako Komiya | Minoru Sasaki | Shinsuke Mori
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

2016

pdf bib abs
Language Resource Addition Strategies for Raw Text Parsing
Atsushi Ushiku | Tetsuro Sasada | Shinsuke Mori
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We focus on the improvement of accuracy of raw text parsing, from the viewpoint of language resource addition. In Japanese, the raw text parsing is divided into three steps: word segmentation, part-of-speech tagging, and dependency parsing. We investigate the contribution of language resource addition in each of three steps to the improvement in accuracy for two domain corpora. The experimental results show that this improvement depends on the target domain. For example, when we handle well-written texts of limited vocabulary, white paper, an effective language resource is a word-POS pair sequence corpus for the parsing accuracy. So we conclude that it is important to check out the characteristics of the target domain and to choose a suitable language resource addition strategy for the parsing accuracy improvement.

pdf bib abs
Wikification for Scriptio Continua
Yugo Murawaki | Shinsuke Mori
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The fact that Japanese employs scriptio continua, or a writing system without spaces, complicates the first step of an NLP pipeline. Word segmentation is widely used in Japanese language processing, and lexical knowledge is crucial for reliable identification of words in text. Although external lexical resources like Wikipedia are potentially useful, segmentation mismatch prevents them from being straightforwardly incorporated into the word segmentation task. If we intentionally violate segmentation standards with the direct incorporation, quantitative evaluation will be no longer feasible. To address this problem, we propose to define a separate task that directly links given texts to an external resource, that is, wikification in the case of Wikipedia. By doing so, we can circumvent segmentation mismatch that may not necessarily be important for downstream applications. As the first step to realize the idea, we design the task of Japanese wikification and construct wikification corpora. We annotated subsets of the Balanced Corpus of Contemporary Written Japanese plus Twitter short messages. We also implement a simple wikifier and investigate its performance on these corpora.

In recent years there has been a surge of interest in the natural language prosessing related to the real world, such as symbol grounding, language generation, and nonlinguistic data search by natural language queries. In order to concentrate on language ambiguities, we propose to use a well-defined “real world,” that is game states. We built a corpus consisting of pairs of sentences and a game state. The game we focus on is shogi (Japanese chess). We collected 742,286 commentary sentences in Japanese. They are spontaneously generated contrary to natural language annotations in many image datasets provided by human workers on Amazon Mechanical Turk. We defined domain specific named entities and we segmented 2,508 sentences into words manually and annotated each word with a named entity tag. We describe a detailed definition of named entities and show some statistics of our game commentary corpus. We also show the results of the experiments of word segmentation and named entity recognition. The accuracies are as high as those on general domain texts indicating that we are ready to tackle various new problems related to the real world.

We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon UniDic. Porting is done by mapping the part-of-speech tagset in UniDic to the universal part-of-speech tagset, and converting a constituent-based treebank to a typed dependency tree. The conversion is not straightforward, and we discuss the problems that arose in the conversion and the current solutions. A treebank consisting of 10,000 sentences was built by converting the existent resources and currently released to the public.

pdf bib abs
Parallel Speech Corpora of Japanese Dialects
Koichiro Yoshino | Naoki Hirayama | Shinsuke Mori | Fumihiko Takahashi | Katsutoshi Itoyama | Hiroshi G. Okuno
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Binary file summaries/549.html matches

pdf bib
Domain Specific Named Entity Recognition Referring to the Real World by Deep Neural Networks
Suzushi Tomori | Takashi Ninomiya | Shinsuke Mori
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

pdf bib
Keyboard Logs as Natural Annotations for Word Segmentation
Fumihiko Takahasi | Shinsuke Mori
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Can Symbol Grounding Improve Low-Level NLP? Word Segmentation as a Case Study
Hirotaka Kameko | Shinsuke Mori | Yoshimasa Tsuruoka
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Combining Active Learning and Partial Annotation for Domain Adaptation of a Japanese Dependency Parser
Daniel Flannery | Shinsuke Mori
Proceedings of the 14th International Conference on Parsing Technologies

pdf bib
A Framework for Procedural Text Understanding
Hirokuni Maeta | Tetsuro Sasada | Shinsuke Mori
Proceedings of the 14th International Conference on Parsing Technologies

2014

pdf bib abs
Japanese-to-English patent translation system based on domain-adapted word segmentation and post-ordering
Katsuhito Sudoh | Masaaki Nagata | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

This paper presents a Japanese-to-English statistical machine translation system specialized for patent translation. Patents are practically useful technical documents, but their translation needs different efforts from general-purpose translation. There are two important problems in the Japanese-to-English patent translation: long distance reordering and lexical translation of many domain-specific terms. We integrated novel lexical translation of domain-specific terms with a syntax-based post-ordering framework that divides the machine translation problem into lexical translation and reordering explicitly for efficient syntax-based translation. The proposed lexical translation consists of a domain-adapted word segmentation and an unknown word transliteration. Experimental results show our system achieves better translation accuracy in BLEU and TER compared to the baseline methods.

pdf bib abs
A Japanese Word Dependency Corpus
Shinsuke Mori | Hideki Ogura | Tetsuro Sasada
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present a corpus annotated with dependency relationships in Japanese. It contains about 30 thousand sentences in various domains. Six domains in Balanced Corpus of Contemporary Written Japanese have part-of-speech and pronunciation annotation as well. Dictionary example sentences have pronunciation annotation and cover basic vocabulary in Japanese with English sentence equivalent. Economic newspaper articles also have pronunciation annotation and the topics are similar to those of Penn Treebank. Invention disclosures do not have other annotation, but it has a clear application, machine translation. The unit of our corpus is word like other languages contrary to existing Japanese corpora whose unit is phrase called bunsetsu. Each sentence is manually segmented into words. We first present the specification of our corpus. Then we give a detailed explanation about our standard of word dependency. We also report some preliminary results of an MST-based dependency parser on our corpus.

pdf bib abs
Language Resource Addition: Dictionary or Corpus?
Shinsuke Mori | Graham Neubig
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we investigate the relative effect of two strategies of language resource additions to the word segmentation problem and part-of-speech tagging problem in Japanese. The first strategy is adding entries to the dictionary and the second is adding annotated sentences to the training corpus. The experimental results showed that the annotated sentence addition to the training corpus is better than the entries addition to the dictionary. And the annotated sentence addition is efficient especially when we add new words with contexts of three real occurrences as partially annotated sentences. According to this knowledge, we executed annotation on the invention disclosure texts and observed word segmentation accuracy.

pdf bib abs
Flow Graph Corpus from Recipe Texts
Shinsuke Mori | Hirokuni Maeta | Yoko Yamakata | Tetsuro Sasada
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present our attempt at annotating procedural texts with a flow graph as a representation of understanding. The domain we focus on is cooking recipe. The flow graphs are directed acyclic graphs with a special root node corresponding to the final dish. The vertex labels are recipe named entities, such as foods, tools, cooking actions, etc. The arc labels denote relationships among them. We converted 266 Japanese recipe texts into flow graphs manually. 200 recipes are randomly selected from a web site and 66 are of the same dish. We detail the annotation framework and report some statistics on our corpus. The most typical usage of our corpus may be automatic conversion from texts to flow graphs which can be seen as an entire understanding of procedural texts. With our corpus, one can also try word segmentation, named entity recognition, predicate-argument structure analysis, and coreference resolution.

In order to utilize the corpus-based techniques that have proven effective in natural language processing in recent years, costly and time-consuming manual creation of linguistic resources is often necessary. Traditionally these resources are created on the document or sentence-level. In this paper, we examine the benefit of annotating only particular words with high information content, as opposed to the entire sentence or document. Using the task of Japanese pronunciation estimation as an example, we devise a machine learning method that can be trained on data annotated word-by-word. This is done by dividing the estimation process into two steps (word segmentation and word-based pronunciation estimation), and introducing a point-wise estimator that is able to make each decision independent of the other decisions made for a particular sentence. In an evaluation, the proposed strategy is shown to provide greater increases in accuracy using a smaller number of annotated words than traditional sentence-based annotation techniques.

2008

pdf bib
Training Conditional Random Fields Using Incomplete Annotations
Yuta Tsuboi | Hisashi Kashima | Shinsuke Mori | Hiroki Oda | Yuji Matsumoto
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2006

pdf bib
Phoneme-to-Text Transcription System with an Infinite Vocabulary
Shinsuke Mori | Daisuke Takuma | Gakuto Kurata
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2002

pdf bib
A Stochastic Parser Based on an SLM with Arboreal Context Trees
Shinsuke Mori
COLING 2002: The 19th International Conference on Computational Linguistics

2000

pdf bib
A Stochastic Parser Based on a Structural Word Prediction Model
Shinsuke Mori | Masafumi Nishimura | Nobuyasu Itoh | Shiho Ogino | Hideo Watanabe
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

1998

pdf bib
A Stochastic Language Model using Dependency and Its Improvement by Word Clustering
Shinsuke Mori | Makoto Nagao
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

pdf bib
A Stochastic Language Model using Dependency and its Improvement by Word Clustering
Shinsuke Mori | Makoto Nagao
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

1996

pdf bib
Word Extraction from Corpora and Its Part-of-Speech Estimation Using Distributional Analysis
Shinsuke Mori | Makoto Nagao
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics

1995

pdf bib abs
Parsing Without Grammar
Shinsuke Mori | Makoto Nagao
Proceedings of the Fourth International Workshop on Parsing Technologies

We describe and evaluate experimentally a method to parse a tagged corpus without grammar modeling a natural language on context-free language. This method is based on the following three hypotheses. 1) Part-of-speech sequences on the right-hand side of a rewriting rule are less constrained as to what part-of-speech precedes and follows them than non-constituent sequences. 2) Part-of-speech sequences directly derived from the same non-terminal symbol have similar environments. 3) The most suitable set of rewriting rules makes the greatest reduction of the corpus size. Based on these hypotheses, the system finds a set of constituent-like part-of-speech sequences and replaces them with a new symbol. The repetition of these processes brings us a set of rewriting rules, a grammar, and the bracketed corpus.