Naoaki Okazaki

Also published as: Naoki Okazaki


2021

pdf bib
Predicting Antonyms in Context using BERT
Ayana Niwa | Keisuke Nishiguchi | Naoaki Okazaki
Proceedings of the 14th International Conference on Natural Language Generation

We address the task of antonym prediction in a context, which is a fill-in-the-blanks problem. This task setting is unique and practical because it requires contrastiveness to the other word and naturalness as a text in filling a blank. We propose methods for fine-tuning pre-trained masked language models (BERT) for context-aware antonym prediction. The experimental results demonstrate that these methods have positive impacts on the prediction of antonyms within a context. Moreover, human evaluation reveals that more than 85% of predictions using the proposed method are acceptable as antonyms.

pdf bib
Joint Optimization of Tokenization and Downstream Model
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Image Caption Generation for News Articles
Zhishen Yang | Naoaki Okazaki
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we address the task of news-image captioning, which generates a description of an image given the image and its article body as input. This task is more challenging than the conventional image captioning, because it requires a joint understanding of image and text. We present a Transformer model that integrates text and image modalities and attends to textual features from visual features in generating a caption. Experiments based on automatic evaluation metrics and human evaluation show that an article text provides primary information to reproduce news-image captions written by journalists. The results also demonstrate that the proposed model outperforms the state-of-the-art model. In addition, we also confirm that visual features contribute to improving the quality of news-image captions.

pdf bib
Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization
Sangwhan Moon | Naoaki Okazaki
Proceedings of the 12th Language Resources and Evaluation Conference

In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model.

pdf bib
Evaluation Dataset for Zero Pronoun in Japanese to English Translation
Sho Shimazu | Sho Takase | Toshiaki Nakazawa | Naoaki Okazaki
Proceedings of the 12th Language Resources and Evaluation Conference

In natural language, we often omit some words that are easily understandable from the context. In particular, pronouns of subject, object, and possessive cases are often omitted in Japanese; these are known as zero pronouns. In translation from Japanese to other languages, we need to find a correct antecedent for each zero pronoun to generate a correct and coherent translation. However, it is difficult for conventional automatic evaluation metrics (e.g., BLEU) to focus on the success of zero pronoun resolution. Therefore, we present a hand-crafted dataset to evaluate whether translation models can resolve the zero pronoun problems in Japanese to English translations. We manually and statistically validate that our dataset can effectively evaluate the correctness of the antecedents selected in translations. Through the translation experiments using our dataset, we reveal shortcomings of an existing context-aware neural machine translation model.

pdf bib
Optimizing Word Segmentation for Downstream Task
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki
Findings of the Association for Computational Linguistics: EMNLP 2020

In traditional NLP, we tokenize a given sentence as a preprocessing, and thus the tokenization is unrelated to a target downstream task. To address this issue, we propose a novel method to explore a tokenization which is appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such appropriate tokenization based on the downstream task loss. OpTok can be used for any downstream task which uses a vector representation of a sentence such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis and textual entailment. In addition, we introduce OpTok into BERT, the state-of-the-art contextualized embeddings and report a positive effect.

pdf bib
Improving Truthfulness of Headline Generation
Kazuki Matsumaru | Sho Takase | Naoaki Okazaki
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Most studies on abstractive summarization report ROUGE scores between system and reference summaries. However, we have a concern about the truthfulness of generated summaries: whether all facts of a generated summary are mentioned in the source text. This paper explores improving the truthfulness in headline generation on two popular datasets. Analyzing headlines generated by the state-of-the-art encoder-decoder model, we show that the model sometimes generates untruthful headlines. We conjecture that one of the reasons lies in untruthful supervision data used for training the model. In order to quantify the truthfulness of article-headline pairs, we consider the textual entailment of whether an article entails its headline. After confirming quite a few untruthful instances in the datasets, this study hypothesizes that removing untruthful instances from the supervision data may remedy the problem of the untruthful behaviors of the model. Building a binary classifier that predicts an entailment relation between an article and its headline, we filter out untruthful instances from the supervision data. Experimental results demonstrate that the headline generation model trained on filtered supervision data shows no clear difference in ROUGE scores but remarkable improvements in automatic and manual evaluations of the generated headlines.

pdf bib
Enhancing Machine Translation with Dependency-Aware Self-Attention
Emanuele Bugliarello | Naoaki Okazaki
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Most neural machine translation models only rely on pairs of parallel sentences, assuming syntactic information is automatically learned by an attention mechanism. In this work, we investigate different approaches to incorporate syntactic knowledge in the Transformer model and also propose a novel, parameter-free, dependency-aware self-attention mechanism that improves its translation quality, especially for long sentences and in low-resource scenarios. We show the efficacy of each approach on WMT English-German and English-Turkish, and WAT English-Japanese translation tasks.

pdf bib
It’s Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information
Emanuele Bugliarello | Sabrina J. Mielke | Antonios Anastasopoulos | Ryan Cotterell | Naoaki Okazaki
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The performance of neural machine translation systems is commonly evaluated in terms of BLEU. However, due to its reliance on target language properties and generation, the BLEU metric does not allow an assessment of which translation directions are more difficult to model. In this paper, we propose cross-mutual information (XMI): an asymmetric information-theoretic metric of machine translation difficulty that exploits the probabilistic nature of most neural machine translation models. XMI allows us to better evaluate the difficulty of translating text into the target language while controlling for the difficulty of the target-side generation component independent of the translation task. We then present the first systematic and controlled study of cross-lingual translation difficulties using modern neural translation systems. Code for replicating our experiments is available online at https://github.com/e-bug/nmt-difficulty.

pdf bib
PatchBERT: Just-in-Time, Out-of-Vocabulary Patching
Sangwhan Moon | Naoaki Okazaki
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large scale pre-trained language models have shown groundbreaking performance improvements for transfer learning in the domain of natural language processing. In our paper, we study a pre-trained multilingual BERT model and analyze the OOV rate on downstream tasks, how it introduces information loss, and as a side-effect, obstructs the potential of the underlying model. We then propose multiple approaches for mitigation and demonstrate that it improves performance with the same parameter count when combined with fine-tuning.

pdf bib
SWAGex at SemEval-2020 Task 4: Commonsense Explanation as Next Event Prediction
Wiem Ben Rim | Naoaki Okazaki
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We describe the system submitted by the SWAGex team to the SemEval-2020 Commonsense Validation and Explanation Task. We use multiple methods on the pre-trained language model BERT (Devlin et al., 2018) for tasks that require the system to recognize sentences against commonsense and justify the reasoning behind this decision. Our best performing model is BERT trained on SWAG and fine-tuned for the task. We investigate the ability to transfer commonsense knowledge from SWAG to SemEval-2020 by training a model for the Explanation task with Next Event Prediction data

pdf bib
TextLearner at SemEval-2020 Task 10: A Contextualized Ranking System in Solving Emphasis Selection in Text
Zhishen Yang | Lars Wolfsteller | Naoaki Okazaki
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the emphasis selection system of the team TextLearner for SemEval 2020 Task 10: Emphasis Selection For Written Text in Visual Media. The system aims to learn the emphasis selection distribution using contextual representations extracted from pre-trained language models and a two-staged ranking model. The experimental results demonstrate the strong contextual representation power of the recent advanced transformer-based language model RoBERTa, which can be exploited using a simple but effective architecture on top.

pdf bib
You May Like This Hotel Because ...: Identifying Evidence for Explainable Recommendations
Shin Kanouchi | Masato Neishi | Yuta Hayashibe | Hiroki Ouchi | Naoaki Okazaki
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Explainable recommendation is a good way to improve user satisfaction. However, explainable recommendation in dialogue is challenging since it has to handle natural language as both input and output. To tackle the challenge, this paper proposes a novel and practical task to explain evidences in recommending hotels given vague requests expressed freely in natural language. We decompose the process into two subtasks on hotel reviews: Evidence Identification and Evidence Explanation. The former predicts whether or not a sentence contains evidence that expresses why a given request is satisfied. The latter generates a recommendation sentence given a request and an evidence sentence. In order to address these subtasks, we build an Evidence-based Explanation dataset, which is the largest dataset for explaining evidences in recommending hotels for vague requests. The experimental results demonstrate that the BERT model can find evidence sentences with respect to various vague requests and that the LSTM-based model can generate recommendation sentences.

2019

pdf bib
Positional Encoding to Control Output Sequence Length
Sho Takase | Naoaki Okazaki
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Neural encoder-decoder models have been successful in natural language generation tasks. However, real applications of abstractive summarization must consider an additional constraint that a generated summary should not exceed a desired length. In this paper, we propose a simple but effective extension of a sinusoidal positional encoding (Vaswani et al., 2017) so that a neural encoder-decoder model preserves the length constraint. Unlike previous studies that learn length embeddings, the proposed method can generate a text of any length even if the target length is unseen in training data. The experimental results show that the proposed method is able not only to control generation length but also improve ROUGE scores.

pdf bib
Neural Question Generation using Interrogative Phrases
Yuichi Sasazawa | Sho Takase | Naoaki Okazaki
Proceedings of the 12th International Conference on Natural Language Generation

Question Generation (QG) is the task of generating questions from a given passage. One of the key requirements of QG is to generate a question such that it results in a target answer. Previous works used a target answer to obtain a desired question. However, we also want to specify how to ask questions and improve the quality of generated questions. In this study, we explore the use of interrogative phrases as additional sources to control QG. By providing interrogative phrases, we expect that QG can generate a more reliable sequence of words subsequent to an interrogative phrase. We present a baseline sequence-to-sequence model with the attention, copy, and coverage mechanisms, and show that the simple baseline achieves state-of-the-art performance. The experiments demonstrate that interrogative phrases contribute to improving the performance of QG. In addition, we report the superiority of using interrogative phrases in human evaluation. Finally, we show that a question answering system can provide target answers more correctly when the questions are generated with interrogative phrases.

pdf bib
A Large-Scale Multi-Length Headline Corpus for Analyzing Length-Constrained Headline Generation Model Evaluation
Yuta Hitomi | Yuya Taguchi | Hideaki Tamori | Ko Kikuta | Jiro Nishitoba | Naoaki Okazaki | Kentaro Inui | Manabu Okumura
Proceedings of the 12th International Conference on Natural Language Generation

Browsing news articles on multiple devices is now possible. The lengths of news article headlines have precise upper bounds, dictated by the size of the display of the relevant device or interface. Therefore, controlling the length of headlines is essential when applying the task of headline generation to news production. However, because there is no corpus of headlines of multiple lengths for a given article, previous research on controlling output length in headline generation has not discussed whether the system outputs could be adequately evaluated without multiple references of different lengths. In this paper, we introduce two corpora, which are Japanese News Corpus (JNC) and JApanese MUlti-Length Headline Corpus (JAMUL), to confirm the validity of previous evaluation settings. The JNC provides common supervision data for headline generation. The JAMUL is a large-scale evaluation dataset for headlines of three different lengths composed by professional editors. We report new findings on these corpora; for example, although the longest length reference summary can appropriately evaluate the existing methods controlling output length, this evaluation setting has several problems.

pdf bib
TokyoTech_NLP at SemEval-2019 Task 3: Emotion-related Symbols in Emotion Detection
Zhishen Yang | Sam Vijlbrief | Naoaki Okazaki
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper presents our contextual emotion detection system in approaching the SemEval2019 shared task 3: EmoContext: Contextual Emotion Detection in Text. This system cooperates with an emotion detection neural network method (Poria et al., 2017), emoji2vec (Eisner et al., 2016) embedding, word2vec embedding (Mikolov et al., 2013), and our proposed emoticon and emoji preprocessing method. The experimental results demonstrate the usefulness of our emoticon and emoji prepossessing method, and representations of emoticons and emoji contribute model’s emotion detection.

pdf bib
Learning to Select, Track, and Generate for Data-to-Text
Hayate Iso | Yui Uehara | Tatsuya Ishigaki | Hiroshi Noji | Eiji Aramaki | Ichiro Kobayashi | Yusuke Miyao | Naoaki Okazaki | Hiroya Takamura
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose a data-to-text generation model with two modules, one for tracking and the other for text generation. Our tracking module selects and keeps track of salient information and memorizes which record has been mentioned. Our generation module generates a summary conditioned on the state of tracking module. Our proposed model is considered to simulate the human-like writing process that gradually selects the information by determining the intermediate variables while writing the summary. In addition, we also explore the effectiveness of the writer information for generations. Experimental results show that our proposed model outperforms existing models in all evaluation metrics even without writer information. Incorporating writer information further improves the performance, contributing to content planning and surface realization.

2018

pdf bib
Predicting Stances from Social Media Posts using Factorization Machines
Akira Sasaki | Kazuaki Hanawa | Naoaki Okazaki | Kentaro Inui
Proceedings of the 27th International Conference on Computational Linguistics

Social media provide platforms to express, discuss, and shape opinions about events and issues in the real world. An important step to analyze the discussions on social media and to assist in healthy decision-making is stance detection. This paper presents an approach to detect the stance of a user toward a topic based on their stances toward other topics and the social media posts of the user. We apply factorization machines, a widely used method in item recommendation, to model user preferences toward topics from the social media data. The experimental results demonstrate that users’ posts are useful to model topic preferences and therefore predict stances of silent users.

pdf bib
Multi-dialect Neural Machine Translation and Dialectometry
Kaori Abe | Yuichiroh Matsubayashi | Naoaki Okazaki | Kentaro Inui
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf bib
Reducing Odd Generation from Neural Headline Generation
Shun Kiyono | Sho Takase | Jun Suzuki | Naoaki Okazaki | Kentaro Inui | Masaaki Nagata
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf bib
Incorporating Semantic Attention in Video Description Generation
Natsuda Laokulrat | Naoaki Okazaki | Hideki Nakayama
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Unsupervised Token-wise Alignment to Improve Interpretation of Encoder-Decoder Models
Shun Kiyono | Sho Takase | Jun Suzuki | Naoaki Okazaki | Kentaro Inui | Masaaki Nagata
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Developing a method for understanding the inner workings of black-box neural methods is an important research endeavor. Conventionally, many studies have used an attention matrix to interpret how Encoder-Decoder-based models translate a given source sentence to the corresponding target sentence. However, recent studies have empirically revealed that an attention matrix is not optimal for token-wise translation analyses. We propose a method that explicitly models the token-wise alignment between the source and target sequences to provide a better analysis. Experiments show that our method can acquire token-wise alignments that are superior to those of an attention mechanism.

pdf bib
Investigating the Challenges of Temporal Relation Extraction from Clinical Text
Diana Galvan | Naoaki Okazaki | Koji Matsuda | Kentaro Inui
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

Temporal reasoning remains as an unsolved task for Natural Language Processing (NLP), particularly demonstrated in the clinical domain. The complexity of temporal representation in language is evident as results of the 2016 Clinical TempEval challenge indicate: the current state-of-the-art systems perform well in solving mention-identification tasks of event and time expressions but poorly in temporal relation extraction, showing a gap of around 0.25 point below human performance. We explore to adapt the tree-based LSTM-RNN model proposed by Miwa and Bansal (2016) to temporal relation extraction from clinical text, obtaining a five point improvement over the best 2016 Clinical TempEval system and two points over the state-of-the-art. We deliver a deep analysis of the results and discuss the next step towards human-like temporal reasoning.

2017

pdf bib
Analyzing the Revision Logs of a Japanese Newspaper for Article Quality Assessment
Hideaki Tamori | Yuta Hitomi | Naoaki Okazaki | Kentaro Inui
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

We address the issue of the quality of journalism and analyze daily article revision logs from a Japanese newspaper company. The revision logs contain data that can help reveal the requirements of quality journalism such as the types and number of edit operations and aspects commonly focused in revision. This study also discusses potential applications such as quality assessment and automatic article revision as our future research directions.

pdf bib
Handling Multiword Expressions in Causality Estimation
Shota Sasaki | Sho Takase | Naoya Inoue | Naoaki Okazaki | Kentaro Inui
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

pdf bib
Other Topics You May Also Agree or Disagree: Modeling Inter-Topic Preferences using Tweets and Matrix Factorization
Akira Sasaki | Kazuaki Hanawa | Naoaki Okazaki | Kentaro Inui
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We presents in this paper our approach for modeling inter-topic preferences of Twitter users: for example, “those who agree with the Trans-Pacific Partnership (TPP) also agree with free trade”. This kind of knowledge is useful not only for stance detection across multiple topics but also for various real-world applications including public opinion survey, electoral prediction, electoral campaigns, and online debates. In order to extract users’ preferences on Twitter, we design linguistic patterns in which people agree and disagree about specific topics (e.g., “A is completely wrong”). By applying these linguistic patterns to a collection of tweets, we extract statements agreeing and disagreeing with various topics. Inspired by previous work on item recommendation, we formalize the task of modeling inter-topic preferences as matrix factorization: representing users’ preference as a user-topic matrix and mapping both users and topics onto a latent feature space that abstracts the preferences. Our experimental results demonstrate both that our presented approach is useful in predicting missing preferences of users and that the latent vector representations of topics successfully encode inter-topic preferences.

pdf bib
A Neural Language Model for Dynamically Representing the Meanings of Unknown Words and Entities in a Discourse
Sosuke Kobayashi | Naoaki Okazaki | Kentaro Inui
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This study addresses the problem of identifying the meaning of unknown words or entities in a discourse with respect to the word embedding approaches used in neural language models. We proposed a method for on-the-fly construction and exploitation of word embeddings in both the input and output layers of a neural model by tracking contexts. This extends the dynamic entity representation used in Kobayashi et al. (2016) and incorporates a copy mechanism proposed independently by Gu et al. (2016) and Gulcehre et al. (2016). In addition, we construct a new task and dataset called Anonymized Language Modeling for evaluating the ability to capture word meanings while reading. Experiments conducted using our novel dataset show that the proposed variant of RNN language model outperformed the baseline model. Furthermore, the experiments also demonstrate that dynamic updates of an output layer help a model predict reappearing entities, whereas those of an input layer are effective to predict words following reappearing entities.

pdf bib
Proofread Sentence Generation as Multi-Task Learning with Editing Operation Prediction
Yuta Hitomi | Hideaki Tamori | Naoaki Okazaki | Kentaro Inui
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

This paper explores the idea of robot editors, automated proofreaders that enable journalists to improve the quality of their articles. We propose a novel neural model of multi-task learning that both generates proofread sentences and predicts the editing operations required to rewrite the source sentences and create the proofread ones. The model is trained using logs of the revisions made professional editors revising draft newspaper articles written by journalists. Experiments demonstrate the effectiveness of our multi-task learning approach and the potential value of using revision logs for this task.

pdf bib
A Crowdsourcing Approach for Annotating Causal Relation Instances in Wikipedia
Kazuaki Hanawa | Akira Sasaki | Naoaki Okazaki | Kentaro Inui
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

2016

pdf bib
Neural Headline Generation on Abstract Meaning Representation
Sho Takase | Jun Suzuki | Naoaki Okazaki | Tsutomu Hirao | Masaaki Nagata
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Learning Semantically and Additively Compositional Distributional Representations
Ran Tian | Naoaki Okazaki | Kentaro Inui
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Composing Distributed Representations of Relational Patterns
Sho Takase | Naoaki Okazaki | Kentaro Inui
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes
Davaajav Jargalsaikhan | Naoaki Okazaki | Koji Matsuda | Kentaro Inui
Proceedings of the ACL 2016 Student Research Workshop

pdf bib
Recognizing Open-Vocabulary Relations between Objects in Images
Masayasu Muraoka | Sumit Maharjan | Masaki Saito | Kota Yamaguchi | Naoaki Okazaki | Takayuki Okatani | Kentaro Inui
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf bib
Toward the automatic extraction of knowledge of usable goods
Mei Uemura | Naho Orita | Naoaki Okazaki | Kentaro Inui
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf bib
Neural Joint Learning for Classifying Wikipedia Articles into Fine-grained Named Entity Types
Masatoshi Suzuki | Koji Matsuda | Satoshi Sekine | Naoaki Okazaki | Kentaro Inui
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

pdf bib
Dynamic Entity Representation with Max-pooling Improves Machine Reading
Sosuke Kobayashi | Ran Tian | Naoaki Okazaki | Kentaro Inui
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Generating Video Description using Sequence-to-sequence Model with Temporal Attention
Natsuda Laokulrat | Sang Phan | Noriki Nishida | Raphael Shu | Yo Ehara | Naoaki Okazaki | Yusuke Miyao | Hideki Nakayama
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Automatic video description generation has recently been getting attention after rapid advancement in image caption generation. Automatically generating description for a video is more challenging than for an image due to its temporal dynamics of frames. Most of the work relied on Recurrent Neural Network (RNN) and recently attentional mechanisms have also been applied to make the model learn to focus on some frames of the video while generating each word in a describing sentence. In this paper, we focus on a sequence-to-sequence approach with temporal attention mechanism. We analyze and compare the results from different attention model configuration. By applying the temporal attention mechanism to the system, we can achieve a METEOR score of 0.310 on Microsoft Video Description dataset, which outperformed the state-of-the-art system so far.

pdf bib
Modeling Discourse Segments in Lyrics Using Repeated Patterns
Kento Watanabe | Yuichiroh Matsubayashi | Naho Orita | Naoaki Okazaki | Kentaro Inui | Satoru Fukayama | Tomoyasu Nakano | Jordan Smith | Masataka Goto
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This study proposes a computational model of the discourse segments in lyrics to understand and to model the structure of lyrics. To test our hypothesis that discourse segmentations in lyrics strongly correlate with repeated patterns, we conduct the first large-scale corpus study on discourse segments in lyrics. Next, we propose the task to automatically identify segment boundaries in lyrics and train a logistic regression model for the task with the repeated pattern and textual features. The results of our empirical experiments illustrate the significance of capturing repeated patterns in predicting the boundaries of discourse segments in lyrics.

pdf bib
Modeling Context-sensitive Selectional Preference with Distributed Representations
Naoya Inoue | Yuichiroh Matsubayashi | Masayuki Ono | Naoaki Okazaki | Kentaro Inui
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper proposes a novel problem setting of selectional preference (SP) between a predicate and its arguments, called as context-sensitive SP (CSP). CSP models the narrative consistency between the predicate and preceding contexts of its arguments, in addition to the conventional SP based on semantic types. Furthermore, we present a novel CSP model that extends the neural SP model (Van de Cruys, 2014) to incorporate contextual information into the distributed representations of arguments. Experimental results demonstrate that the proposed CSP model successfully learns CSP and outperforms the conventional SP model in coreference cluster ranking.

pdf bib
Tohoku at SemEval-2016 Task 6: Feature-based Model versus Convolutional Neural Network for Stance Detection
Yuki Igarashi | Hiroya Komatsu | Sosuke Kobayashi | Naoaki Okazaki | Kentaro Inui
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Fast and Large-scale Unsupervised Relation Extraction
Sho Takase | Naoaki Okazaki | Kentaro Inui
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
Reducing Lexical Features in Parsing by Word Embeddings
Hiroya Komatsu | Ran Tian | Naoaki Okazaki | Kentaro Inui
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
A Computational Approach for Generating Toulmin Model Argumentation
Paul Reisert | Naoya Inoue | Naoaki Okazaki | Kentaro Inui
Proceedings of the 2nd Workshop on Argumentation Mining

pdf bib
Annotating Geographical Entities on Microblog Text
Koji Matsuda | Akira Sasaki | Naoaki Okazaki | Kentaro Inui
Proceedings of The 9th Linguistic Annotation Workshop

pdf bib
Who caught a cold ? - Identifying the subject of a symptom
Shin Kanouchi | Mamoru Komachi | Naoaki Okazaki | Eiji Aramaki | Hiroshi Ishikawa
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Disease Event Detection based on Deep Modality Analysis
Yoshiaki Kitagawa | Mamoru Komachi | Eiji Aramaki | Naoaki Okazaki | Hiroshi Ishikawa
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop

2014

pdf bib
Finding The Best Model Among Representative Compositional Models
Masayasu Muraoka | Sonse Shimaoka | Kazeto Yamamoto | Yotaro Watanabe | Naoaki Okazaki | Kentaro Inui
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

pdf bib
A Corpus Study for Identifying Evidence on Microblogs
Paul Reisert | Junta Mizuno | Miwa Kanno | Naoaki Okazaki | Kentaro Inui
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

2013

pdf bib
Extracting and Aggregating False Information from Microblogs
Naoaki Okazaki | Keita Nabeshima | Kento Watanabe | Junta Mizuno | Kentaro Inui
Proceedings of the Workshop on Language Processing and Crisis Information 2013

pdf bib
Is a 204 cm Man Tall or Small ? Acquisition of Numerical Common Sense from the Web
Katsuma Narisawa | Yotaro Watanabe | Junta Mizuno | Naoaki Okazaki | Kentaro Inui
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Detecting Chronic Critics Based on Sentiment Polarity and User’s Behavior in Social Media
Sho Takase | Akiko Murakami | Miki Enoki | Naoaki Okazaki | Kentaro Inui
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

2012

pdf bib
A Latent Discriminative Model for Compositional Entailment Relation Recognition using Natural Logic
Yotaro Watanabe | Junta Mizuno | Eric Nichols | Naoaki Okazaki | Kentaro Inui
Proceedings of COLING 2012

pdf bib
Acquiring and Generalizing Causal Inference Rules from Deverbal Noun Constructions
Shohei Tanaka | Naoaki Okazaki | Mitsuru Ishizuka
Proceedings of COLING 2012: Posters

pdf bib
Set Expansion using Sibling Relations between Semantic Categories
Sho Takase | Naoaki Okazaki | Kentaro Inui
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2011

pdf bib
Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition
Yu Usami | Han-Cheol Cho | Naoaki Okazaki | Jun’ichi Tsujii
Proceedings of BioNLP 2011 Workshop

2010

pdf bib
Simple and Efficient Algorithm for Approximate Dictionary Matching
Naoaki Okazaki | Jun’ichi Tsujii
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Learning Web Query Patterns for Imitating Wikipedia Articles
Shohei Tanaka | Naoaki Okazaki | Mitsuru Ishizuka
Coling 2010: Posters

2009

pdf bib
A Comparative Study on Generalization of Semantic Roles in FrameNet
Yuichiroh Matsubayashi | Naoaki Okazaki | Jun’ichi Tsujii
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Robust Approach to Abbreviating Terms: A Discriminative Latent Variable Model with Global Information
Xu Sun | Naoaki Okazaki | Jun’ichi Tsujii
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web
Yulan Yan | Naoaki Okazaki | Yutaka Matsuo | Zhenglu Yang | Mitsuru Ishizuka
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
The UOT system
Xianchao Wu | Takuya Matsuzaki | Naoaki Okazaki | Yusuke Miyao | Jun’ichi Tsujii
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

We present the UOT Machine Translation System that was used in the IWSLT-09 evaluation campaign. This year, we participated in the BTEC track for Chinese-to-English translation. Our system is based on a string-to-tree framework. To integrate deep syntactic information, we propose the use of parse trees and semantic dependencies on English sentences described respectively by Head-driven Phrase Structure Grammar and Predicate-Argument Structures. We report the results of our system on both the development and test sets.

pdf bib
Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages
Xianchao Wu | Naoaki Okazaki | Jun’ichi Tsujii
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

pdf bib
A Discriminative Alignment Model for Abbreviation Recognition
Naoaki Okazaki | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Building a Bilingual Lexicon Using Phrase-based Statistical Machine Translation via a Pivot Language
Takashi Tsunakawa | Naoaki Okazaki | Jun’ichi Tsujii
Coling 2008: Companion volume: Posters

pdf bib
A Discriminative Candidate Generator for String Transformations
Naoaki Okazaki | Yoshimasa Tsuruoka | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
Identifying Sections in Scientific Abstracts using Conditional Random Fields
Kenji Hirohata | Naoaki Okazaki | Sophia Ananiadou | Mitsuru Ishizuka
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
A Discriminative Approach to Japanese Abbreviation Extraction
Naoaki Okazaki | Mitsuru Ishizuka | Jun’ichi Tsujii
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib
Building Bilingual Lexicons using Lexical Translation Probabilities via Pivot Languages
Takashi Tsunakawa | Naoaki Okazaki | Jun’ichi Tsujii
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper proposes a method of increasing the size of a bilingual lexicon obtained from two other bilingual lexicons via a pivot language. When we apply this approach, there are two main challenges, “ambiguity” and “mismatch” of terms; we target the latter problem by improving the utilization ratio of the bilingual lexicons. Given two bilingual lexicons between language pairs Lf-Lp and Lp-Le, we compute lexical translation probabilities of word pairs by using a statistical word-alignment model, and term decomposition/composition techniques. We compare three approaches to generate the bilingual lexicon: “exact merging”, “word-based merging”, and our proposed “alignment-based merging”. In our method, we combine lexical translation probabilities and a simple language model for estimating the probabilities of translation pairs. The experimental results show that our method could drastically improve the number of translation terms compared to the two methods mentioned above. Additionally, we evaluated and discussed the quality of the translation outputs.

pdf bib
Connecting Text Mining and Pathways using the PathText Resource
Rune Sætre | Brian Kemper | Kanae Oda | Naoaki Okazaki | Yukiko Matsuoka | Norihiro Kikuchi | Hiroaki Kitano | Yoshimasa Tsuruoka | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Many systems have been developed in the past few years to assist researchers in the discovery of knowledge published as English text, for example in the PubMed database. At the same time, higher level collective knowledge is often published using a graphical notation representing all the entities in a pathway and their interactions. We believe that these pathway visualizations could serve as an effective user interface for knowledge discovery if they can be linked to the text in publications. Since the graphical elements in a Pathway are of a very different nature than their corresponding descriptions in English text, we developed a prototype system called PathText. The goal of PathText is to serve as a bridge between these two different representations. In this paper, we first describe the overall architecture and the interfaces of the PathText system, and then provide some details about the core Text Mining components.

pdf bib
Improving English-to-Chinese Translation for Technical Terms using Morphological Information
Xianchao Wu | Naoaki Okazaki | Takashi Tsunakawa | Jun’ichi Tsujii
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

The continuous emergence of new technical terms and the difficulty of keeping up with neologism in parallel corpora deteriorate the performance of statistical machine translation (SMT) systems. This paper explores the use of morphological information to improve English-to-Chinese translation for technical terms. To reduce the morpheme-level translation ambiguity, we group the morphemes into morpheme phrases and propose the use of domain information for translation candidate selection. In order to find correspondences of morpheme phrases between the source and target languages, we propose an algorithm to mine morpheme phrase translation pairs from a bilingual lexicon. We also build a cascaded translation model that dynamically shifts translation units from phrase level to word and morpheme phrase levels. The experimental results show the significant improvements over the current phrase-based SMT systems.

2006

pdf bib
A Bottom-Up Approach to Sentence Ordering for Multi-Document Summarization
Danushka Bollegala | Naoaki Okazaki | Mitsuru Ishizuka
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
A Term Recognition Approach to Acronym Recognition
Naoaki Okazaki | Sophia Ananiadou
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
Clustering acronyms in biomedical text for disambiguation
Naoaki Okazaki | Sophia Ananiadou
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Given the increasing number of neologisms in biomedicine (names of genes, diseases, molecules, etc.), the rate of acronyms used in literature also increases. Existing acronym dictionaries cannot keep up with the rate of new creations. Thus, discovering and disambiguating acronyms and their expanded forms are essential aspects of text mining and terminology management. We present a method for clustering long forms identified by an acronym recognition method. Applying the acronym recognition method to MEDLINE abstracts, we obtained a list of short/long forms. The recognized short/long forms were classified by abiologist to construct an evaluation set for clustering sets of similar long forms. We observed five types of term variation in the evaluation set and defined four similarity measures to gathers the similar longforms (i.e., orthographic, morphological, syntactic, lexico semantic variants, nested abbreviations). The complete-link clustering with the four similarity measures achieved 87.5% precision and 84.9% recall on the evaluation set.

pdf bib
Towards a terminological resource for biomedical text mining
Goran Nenadic | Naoki Okazaki | Sophia Ananiadou
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

One of the main challenges in biomedical text mining is the identification of terminology, which is a key factor for accessing and integrating the information stored in literature. Manual creation of biomedical terminologies cannot keep pace with the data that becomes available. Still, many of them have been used in attempts to recognise terms in literature, but their suitability for text mining has been questioned as substantial re-engineering is needed to tailor the resources for automatic processing. Several approaches have been suggested to automatically integrate and map between resources, but the problems of extensive variability of lexical representations and ambiguity have been revealed. In this paper we present a methodology to automatically maintain a biomedical terminological database, which contains automatically extracted terms, their mutual relationships, features and possible annotations that can be useful in text processing. In addition to TermDB, a database used for terminology management and storage, we present the following modules that are used to populate the database: TerMine (recognition, extraction and normalisation of terms from literature), AcroTerMine (extraction and clustering of acronyms and their long forms), AnnoTerm (annotation and classification of terms), and ClusTerm (extraction of term associations and clustering of terms).

2005

pdf bib
A Machine Learning Approach to Sentence Ordering for Multidocument Summarization and Its Evaluation
Danushka Bollegala | Naoaki Okazaki | Mitsuru Ishizuka
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf bib
Improving Chronological Sentence Ordering by Precedence Relation
Naoaki Okazaki | Yutaka Matsuo | Mitsuru Ishizuka
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Search
Co-authors