Sadao Kurohashi

2025

We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models (LLM_Voice), designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM_Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR-LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM_Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training. Our code and data will be open source to encourage future studies.

In this study, we investigate whether non-English-centric large language models, ‘think’ in their specialized language. Specifically, we analyze how intermediate layer representations, when projected into the vocabulary space, favor certain languages during generation—termed as latent languages. We categorize non-English-centric models into two groups: CPMs, which are English-centric models with continued pre-training on its specialized language, and BLMs, which are pre-trained on a balanced mix of multiple languages from scratch. Our findings reveal that while English-centric models rely exclusively on English as their latent language, non-English-centric models activate multiple latent languages, dynamically selecting the most similar one based on both the source and target languages. This also influences responses to culture difference questions, reducing English-centric biases in non-English models. This study deepens our understanding of language representation in non-English-centric LLMs, shedding light on the intricate dynamics of multilingual processing at the representational level.

pdf bib abs
7 Points to Tsinghua but 10 Points to ? Assessing Large Language Models in Agentic Multilingual National Bias
Qianying Liu | Katrina Qiyao Wang | Fei Cheng | Sadao Kurohashi
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLM’s applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation.We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal significant biases in both the scores and the reasoning structure of non-English languages. We also draw future implications for improving multilingual alignment in AI systems.

2024

pdf bib abs
SubMerge: Merging Equivalent Subword Tokenizations for Subword Regularized Models in Neural Machine Translation
Haiyue Song | Francois Meyer | Raj Dabre | Hideki Tanaka | Chenhui Chu | Sadao Kurohashi
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

Subword regularized models leverage multiple subword tokenizations of one target sentence during training. However, selecting one tokenization during inference leads to the underutilization of knowledge learned about multiple tokenizations.We propose the SubMerge algorithm to rescue the ignored Subword tokenizations through merging equivalent ones during inference.SubMerge is a nested search algorithm where the outer beam search treats the word as the minimal unit, and the inner beam search provides a list of word candidates and their probabilities, merging equivalent subword tokenizations. SubMerge estimates the probability of the next word more precisely, providing better guidance during inference.Experimental results on six low-resource to high-resource machine translation datasets show that SubMerge utilizes a greater proportion of a model’s probability weight during decoding (lower word perplexities for hypotheses). It also improves BLEU and chrF++ scores for many translation directions, most reliably for low-resource scenarios. We investigate the effect of different beam sizes, training set sizes, dropout rates, and whether it is effective on non-regularized models.

pdf bib abs
Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise: A Case Study on Chinese Legal Domain
Zhen Wan | Yating Zhang | Yexiang Wang | Fei Cheng | Sadao Kurohashi
Findings of the Association for Computational Linguistics: ACL 2024

While large language models (LLMs) like GPT-4 have recently demonstrated astonishing zero-shot capabilities in general domain tasks, they often generate content with hallucinations in specific domains such as Chinese law, hindering their application in these areas. This is typically due to the absence of training data that encompasses such a specific domain, preventing GPT-4 from acquiring in-domain knowledge. A pressing challenge is that it’s not plausible to continue training LLMs of the GPT-4’s scale on in-domain data.This paper introduces a simple yet effective domain adaptation framework for GPT-4 by reformulating generation as an adapt-retrieve-revise process. The initial step is to adapt an affordable 7B LLM to the Chinese legal domain by continuing learning in-domain data. When solving an in-domain task, we leverage the adapted LLM to generate a draft answer given a task query. Then, the draft answer will be used to retrieve supporting evidence candidates from an external in-domain knowledge base. Finally, the draft answer and retrieved evidence are concatenated into a whole prompt to let GPT-4 assess the evidence and revise the draft answer to generate the final answer. Our proposal combines the advantages of the efficiency of adapting a smaller 7B model with the evidence-assessing capability of GPT-4 and effectively prevents GPT-4 from generating hallucinatory content. In the zero-shot setting of four Chinese legal tasks, our method improves the average score by +33.6 points, compared to GPT-4 direct generation. When compared to two stronger retrieval-based baselines, our method outperforms them by +17.0 and +23.5.

Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems.

pdf bib abs
A Comprehensive Analysis of Memorization in Large Language Models
Hirokazu Kiyomaru | Issa Sugiura | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 17th International Natural Language Generation Conference

This paper presents a comprehensive study that investigates memorization in large language models (LLMs) from multiple perspectives. Experiments are conducted with the Pythia and LLM-jp model suites, both of which offer LLMs with over 10B parameters and full access to their pre-training corpora. Our findings include: (1) memorization is more likely to occur with larger model sizes, longer prompt lengths, and frequent texts, which aligns with findings in previous studies; (2) memorization is less likely to occur for texts not trained during the latter stages of training, even if they frequently appear in the training corpus; (3) the standard methodology for judging memorization can yield false positives, and texts that are infrequent yet flagged as memorized typically result from causes other than true memorization.

pdf bib abs
Abstractive Multi-Video Captioning: Benchmark Dataset Construction and Extensive Evaluation
Rikito Takahashi | Hirokazu Kiyomaru | Chenhui Chu | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces a new task, abstractive multi-video captioning, which focuses on abstracting multiple videos with natural language. Unlike conventional video captioning tasks generating a specific caption for a video, our task generates an abstract caption of the shared content in a video group containing multiple videos. To address our task, models must learn to understand each video in detail and have strong abstraction abilities to find commonalities among videos. We construct a benchmark dataset for abstractive multi-video captioning named AbstrActs. AbstrActs contains 13.5k video groups and corresponding abstract captions. As abstractive multi-video captioning models, we explore two approaches: end-to-end and cascade. For evaluation, we proposed a new metric, CocoA, which can evaluate the model performance based on the abstractness of the generated captions. In experiments, we report the impact of the way of combining multiple video features, the overall model architecture, and the number of input videos.

pdf bib abs
An Empirical Study of Synthetic Data Generation for Implicit Discourse Relation Recognition
Kazumasa Omura | Fei Cheng | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Implicit Discourse Relation Recognition (IDRR), which is the task of recognizing the semantic relation between given text spans that do not contain overt clues, is a long-standing and challenging problem. In particular, the paucity of training data for some error-prone discourse relations makes the problem even more challenging. To address this issue, we propose a method of generating synthetic data for IDRR using a large language model. The proposed method is summarized as two folds: extraction of confusing discourse relation pairs based on false negative rate and synthesis of data focused on the confusion. The key points of our proposed method are utilizing a confusion matrix and adopting two-stage prompting to obtain effective synthetic data. According to the proposed method, we generated synthetic data several times larger than training examples for some error-prone discourse relations and incorporated it into training. As a result of experiments, we achieved state-of-the-art macro-F1 performance thanks to the synthetic data without sacrificing micro-F1 performance and demonstrated its positive effects especially on recognizing some infrequent discourse relations.

pdf bib abs
Domain Transferable Semantic Frames for Expert Interview Dialogues
Taishi Chika | Taro Okahisa | Takashi Kodama | Yin Jou Huang | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Interviews are an effective method to elicit critical skills to perform particular processes in various domains. In order to understand the knowledge structure of these domain-specific processes, we consider semantic role and predicate annotation based on Frame Semantics. We introduce a dataset of interview dialogues with experts in the culinary and gardening domains, each annotated with semantic frames. This dataset consists of (1) 308 interview dialogues related to the culinary domain, originally assembled by Okahisa et al. (2022), and (2) 100 interview dialogues associated with the gardening domain, which we newly acquired. The labeling specifications take into account the domain-transferability by adopting domain-agnostic labels for frame elements. In addition, we conducted domain transfer experiments from the culinary domain to the gardening domain to examine the domain transferability with our dataset. The experimental results showed the effectiveness of our domain-agnostic labeling scheme.

pdf bib abs
Identifying Source Language Expressions for Pre-editing in Machine Translation
Norizo Sakaguchi | Yugo Murawaki | Chenhui Chu | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Machine translation-mediated communication can benefit from pre-editing source language texts to ensure accurate transmission of intended meaning in the target language. The primary challenge lies in identifying source language expressions that pose difficulties in translation. In this paper, we hypothesize that such expressions tend to be distinctive features of texts originally written in the source language (native language) rather than translations generated from the target language into the source language (machine translation). To identify such expressions, we train a neural classifier to distinguish native language from machine translation, and subsequently isolate the expressions that contribute to the model’s prediction of native language. Our manual evaluation revealed that our method successfully identified characteristic expressions of the native language, despite the noise and the inherent nuances of the task. We also present case studies where we edit the identified expressions to improve translation quality.

pdf bib abs
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Nobuhiro Ueda | Hideko Habe | Akishige Yuguchi | Seiya Kawano | Yasutomo Kawanishi | Sadao Kurohashi | Koichiro Yoshino
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Understanding expressions that refer to the physical world is crucial for such human-assisting systems in the real world, as robots that must perform actions that are expected by users. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home. The dataset is annotated with crossmodal tags between phrases in the utterances and the object bounding boxes in the video frames. These tags include indirect reference relations, such as predicate-argument structures and bridging references as well as direct reference relations. We also constructed an experimental model and clarified the challenges in multimodal reference resolution tasks.

pdf bib abs
Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese
Yikun Sun | Zhen Wan | Nobuhiro Ueda | Sakiko Yahata | Fei Cheng | Chenhui Chu | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The creation of instruction data and evaluation benchmarks for serving Large language models often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4. We first translate a small amount of English instructions into Japanese and post-edit them to obtain native-level quality. GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data. We also construct an evaluation benchmark containing 80 questions across 8 categories, using GPT-4 to automatically assess the response quality of LLMs without human references. The empirical results suggest that the models fine-tuned on our GPT-4 self-instruct data significantly outperformed the Japanese-Alpaca across all three base pre-trained models. Our GPT-4 self-instruct data allowed the LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37% win-rate. The human evaluation exhibits the consistency between GPT-4’s assessments and human preference. Our high-quality instruction data and evaluation benchmark are released here.

pdf bib abs
RecomMind: Movie Recommendation Dialogue with Seeker’s Internal State
Takashi Kodama | Hirokazu Kiyomaru | Yin Jou Huang | Sadao Kurohashi
Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)

Humans pay careful attention to the interlocutor’s internal state in dialogues. For example, in recommendation dialogues, we make recommendations while estimating the seeker’s internal state, such as his/her level of knowledge and interest. Since there are no existing annotated resources for the analysis and experiment, we constructed RecomMind, a movie recommendation dialogue dataset with annotations of the seeker’s internal state at the entity level. Each entity has a first-person label annotated by the seeker and a second-person label annotated by the recommender. Our analysis based on RecomMind reveals that the success of recommendations is enhanced when recommenders mention entities that seekers do not know but are interested in. We also propose a response generation framework that explicitly considers the seeker’s internal state, utilizing the chain-of-thought prompting. The human evaluation results show that our proposed method outperforms the baseline method in both consistency and the success of recommendations.

2023

pdf bib abs
Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation
Zhuoyuan Mao | Raj Dabre | Qianying Liu | Haiyue Song | Chenhui Chu | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often utilize the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the default. However, Xu et al. (2019) has revealed that PreNorm carries the risk of overfitting the training data. Based on this, we hypothesize that PreNorm may overfit supervised directions and thus have low generalizability for ZST. Through experiments on OPUS, IWSLT, and Europarl datasets for 54 ZST directions, we demonstrate that the original Transformer setting of LayerNorm after residual connections (PostNorm) consistently outperforms PreNorm by up to 12.3 BLEU points. We then study the performance disparities by analyzing the differences in off-target rates and structural variations between PreNorm and PostNorm. This study highlights the need for careful consideration of the LayerNorm setting for ZST.

pdf bib abs
MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting
Tatsuro Inaba | Hirokazu Kiyomaru | Fei Cheng | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large language models (LLMs) have achieved impressive performance on various reasoning tasks. To further improve the performance, we propose MultiTool-CoT, a novel framework that leverages chain-of-thought (CoT) prompting to incorporate multiple external tools, such as a calculator and a knowledge retriever, during the reasoning process. We apply MultiTool-CoT to the Task 2 dataset of NumGLUE, which requires both numerical reasoning and domain-specific knowledge. The experiments show that our method significantly outperforms strong baselines and achieves state-of-the-art performance.

pdf bib abs
KWJA: A Unified Japanese Analyzer Based on Foundation Models
Nobuhiro Ueda | Kazumasa Omura | Takashi Kodama | Hirokazu Kiyomaru | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

We present KWJA, a high-performance unified Japanese text analyzer based on foundation models.KWJA supports a wide range of tasks, including typo correction, word segmentation, word normalization, morphological analysis, named entity recognition, linguistic feature tagging, dependency parsing, PAS analysis, bridging reference resolution, coreference resolution, and discourse relation analysis, making it the most versatile among existing Japanese text analyzers.KWJA solves these tasks in a multi-task manner but still achieves competitive or better performance compared to existing analyzers specialized for each task.KWJA is publicly available under the MIT license at https://github.com/ku-nlp/kwja.

pdf bib abs
Is a Knowledge-based Response Engaging?: An Analysis on Knowledge-Grounded Dialogue with Information Source Annotation
Takashi Kodama | Hirokazu Kiyomaru | Yin Jou Huang | Taro Okahisa | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Currently, most knowledge-grounded dialogue response generation models focus on reflecting given external knowledge. However, even when conveying external knowledge, humans integrate their own knowledge, experiences, and opinions with external knowledge to make their utterances engaging. In this study, we analyze such human behavior by annotating the utterances in an existing knowledge-grounded dialogue corpus. Each entity in the corpus is annotated with its information source, either derived from external knowledge (database-derived) or the speaker’s own knowledge, experiences, and opinions (speaker-derived). Our analysis shows that the presence of speaker-derived information in the utterance improves dialogue engagingness. We also confirm that responses generated by an existing model, which is trained to reflect the given knowledge, cannot include speaker-derived information in responses as often as humans do.

pdf bib abs
ComSearch: Equation Searching with Combinatorial Strategy for Solving Math Word Problems with Weak Supervision
Qianying Liu | Wenyu Guan | Jianhao Shen | Fei Cheng | Sadao Kurohashi
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Previous studies have introduced a weakly-supervised paradigm for solving math word problems requiring only the answer value annotation. While these methods search for correct value equation candidates as pseudo labels, they search among a narrow sub-space of the enormous equation space. To address this problem, we propose a novel search algorithm with combinatorial strategy ComSearch, which can compress the search space by excluding mathematically equivalent equations. The compression allows the searching algorithm to enumerate all possible equations and obtain high-quality data. We investigate the noise in the pseudo labels that hold wrong mathematical logic, which we refer to as the false-matching problem, and propose a ranking model to denoise the pseudo labels. Our approach holds a flexible framework to utilize two existing supervised math word problem solvers to train pseudo labels, and both achieve state-of-the-art performance in the weak supervision task.

In spite of the potential for ground-breaking achievements offered by large language models (LLMs) (e.g., GPT-3) via in-context learning (ICL), they still lag significantly behind fully-supervised baselines (e.g., fine-tuned BERT) in relation extraction (RE). This is due to the two major shortcomings of ICL for RE: (1) low relevance regarding entity and relation in existing sentence-level demonstration retrieval approaches for ICL; and (2) the lack of explaining input-label mappings of demonstrations leading to poor ICL effectiveness. In this paper, we propose GPT-RE to successfully address the aforementioned issues by (1) incorporating task-aware representations in demonstration retrieval; and (2) enriching the demonstrations with gold label-induced reasoning logic. We evaluate GPT-RE on four widely-used RE datasets, and observe that GPT-RE achieves improvements over not only existing GPT-3 baselines, but also fully-supervised baselines as in Figure 1. Specifically, GPT-RE achieves SOTA performances on the Semeval and SciERC datasets, and competitive performances on the TACRED and ACE05 datasets. Additionally, a critical issue of LLMs revealed by previous work, the strong inclination to wrongly classify NULL examples into other pre-defined labels, is substantially alleviated by our method. We show an empirical analysis.

pdf bib abs
SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation
Junfeng Jiang | Chengzhang Dong | Sadao Kurohashi | Akiko Aizawa
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Dialogue segmentation is a crucial task for dialogue systems allowing a better understanding of conversational texts. Despite recent progress in unsupervised dialogue segmentation methods, their performances are limited by the lack of explicit supervised signals for training. Furthermore, the precise definition of segmentation points in conversations still remains as a challenging problem, increasing the difficulty of collecting manual annotations. In this paper, we provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues and release a large-scale supervised dataset called SuperDialseg, containing 9,478 dialogues based on two prevalent document-grounded dialogue corpora, and also inherit their useful dialogue-related annotations. Moreover, we provide a benchmark including 18 models across five categories for the dialogue segmentation task with several proper evaluation metrics. Empirical studies show that supervised learning is extremely effective in in-domain datasets and models trained on SuperDialseg can achieve good generalization ability on out-of-domain data. Additionally, we also conducted human verification on the test set and the Kappa score confirmed the quality of our automatically constructed dataset. We believe our work is an important step forward in the field of dialogue segmentation.

pdf bib abs
Video-Helpful Multimodal Machine Translation
Yihang Li | Shuichiro Shimizu | Chenhui Chu | Sadao Kurohashi | Wei Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective in generating appropriate translations. Recent work has constructed an ambiguous subtitles dataset to alleviate this problem but is still limited to the problem that videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English parallel subtitle pairs, 520k Chinese-English parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous, and videos are guaranteed helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation, aiming to use videos in EVA for disambiguation fully. Experiments on EVA show that visual information and the proposed methods can boost translation performance, and our model performs significantly better than existing MMT models.

Contrastive pre-training on distant supervision has shown remarkable effectiveness in improving supervised relation extraction tasks. However, the existing methods ignore the intrinsic noise of distant supervision during the pre-training stage. In this paper, we propose a weighted contrastive learning method by leveraging the supervised data to estimate the reliability of pre-training instances and explicitly reduce the effect of noise. Experimental results on three supervised datasets demonstrate the advantages of our proposed weighted contrastive learning approach compared to two state-of-the-art non-weighted baselines. Our code and models are available at: https://github.com/YukinoWan/WCL.

pdf bib abs
Towards Speech Dialogue Translation Mediating Speakers of Different Languages
Shuichiro Shimizu | Chenhui Chu | Sheng Li | Sadao Kurohashi
Findings of the Association for Computational Linguistics: ACL 2023

We present a new task, speech dialogue translation mediating speakers of different languages. We construct the SpeechBSD dataset for the task and conduct baseline experiments. Furthermore, we consider context to be an important aspect that needs to be addressed in this task and propose two ways of utilizing context, namely monolingual context and bilingual context. We conduct cascaded speech translation experiments using Whisper and mBART, and show that bilingual context performs better in our settings.

pdf bib abs
ARKitSceneRefer: Text-based Localization of Small Objects in Diverse Real-World 3D Indoor Scenes
Shunya Kato | Shuhei Kurita | Chenhui Chu | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2023

3D referring expression comprehension is a task to ground text representations onto objects in 3D scenes. It is a crucial task for indoor household robots or augmented reality devices to localize objects referred to in user instructions. However, existing indoor 3D referring expression comprehension datasets typically cover larger object classes that are easy to localize, such as chairs, tables, or doors, and often overlook small objects, such as cooking tools or office supplies. Based on the recently proposed diverse and high-resolution 3D scene dataset of ARKitScenes, we construct the ARKitSceneRefer dataset focusing on small daily-use objects that frequently appear in real-world indoor scenes. ARKitSceneRefer contains 15k objects of 1,605 indoor scenes, which are significantly larger than those of the existing 3D referring datasets, and covers diverse object classes of 583 from the LVIS dataset. In empirical experiments with both 2D and 3D state-of-the-art referring expression comprehension models, we observed the task difficulty of the localization in the diverse small object classes.

pdf bib
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract
Yun-Nung (Vivian) Chen | Sadao Kurohashi
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract

pdf bib
Variable-length Neural Interlingua Representations for Zero-shot Neural Machine Translation
Zhuoyuan Mao | Haiyue Song | Raj Dabre | Chenhui Chu | Sadao Kurohashi
Proceedings of the 1st International Workshop on Multilingual, Multimodal and Multitask Language Generation

This paper presents the results of the shared tasks from the 10th workshop on Asian translation (WAT2023). For the WAT2023, 2 teams submitted their translation results for the human evaluation. We also accepted 1 research paper. About 40 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

2022

pdf bib abs
BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation
Haiyue Song | Raj Dabre | Zhuoyuan Mao | Chenhui Chu | Sadao Kurohashi
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Existing subword segmenters are either 1) frequency-based without semantics information or 2) neural-based but trained on parallel corpora. To address this, we present BERTSeg, an unsupervised neural subword segmenter for neural machine translation, which utilizes the contextualized semantic embeddings of words from characterBERT and maximizes the generation probability of subword segmentations. Furthermore, we propose a generation probability-based regularization method that enables BERTSeg to produce multiple segmentations for one word to improve the robustness of neural machine translation. Experimental results show that BERTSeg with regularization achieves up to 8 BLEU points improvement in 9 translation directions on ALT, IWSLT15 Vi->En, WMT16 Ro->En, and WMT15 Fi->En datasets compared with BPE. In addition, BERTSeg is efficient, needing up to 5 minutes for training.

pdf bib abs
Seeking Diverse Reasoning Logic: Controlled Equation Expression Generation for Solving Math Word Problems
Yibin Shen | Qianying Liu | Zhuoyuan Mao | Zhen Wan | Fei Cheng | Sadao Kurohashi
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

To solve Math Word Problems, human students leverage diverse reasoning logic that reaches different possible equation solutions. However, the mainstream sequence-to-sequence approach of automatic solvers aims to decode a fixed solution equation supervised by human annotation. In this paper, we propose a controlled equation generation solver by leveraging a set of control codes to guide the model to consider certain reasoning logic and decode the corresponding equations expressions transformed from the human reference. The empirical results suggest that our method universally improves the performance on single-unknown (Math23K) and multiple-unknown (DRAW1K, HMWP) benchmarks, with substantial improvements up to 13.2% accuracy on the challenging multiple-unknown datasets.

pdf bib abs
Flexible Visual Grounding
Yongmin Kim | Chenhui Chu | Sadao Kurohashi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Existing visual grounding datasets are artificially made, where every query regarding an entity must be able to be grounded to a corresponding image region, i.e., answerable. However, in real-world multimedia data such as news articles and social media, many entities in the text cannot be grounded to the image, i.e., unanswerable, due to the fact that the text is unnecessarily directly describing the accompanying image. A robust visual grounding model should be able to flexibly deal with both answerable and unanswerable visual grounding. To study this flexible visual grounding problem, we construct a pseudo dataset and a social media dataset including both answerable and unanswerable queries. In order to handle unanswerable visual grounding, we propose a novel method by adding a pseudo image region corresponding to a query that cannot be grounded. The model is then trained to ground to ground-truth regions for answerable queries and pseudo regions for unanswerable queries. In our experiments, we show that our model can flexibly process both answerable and unanswerable queries with high accuracy on our datasets.

pdf bib abs
Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair
Iglika Nikolova-Stoupak | Shuichiro Shimizu | Chenhui Chu | Sadao Kurohashi
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

One of the main challenges within the rapidly developing field of neural machine translation is its application to low-resource languages. Recent attempts to provide large parallel corpora in rare language pairs include the generation of web-crawled corpora, which may be vast but are, unfortunately, excessively noisy. The corpus utilised to train machine translation models in the study is CCMatrix, provided by OPUS. Firstly, the corpus is cleaned based on a number of heuristic rules. Then, parts of it are selected in three discrete ways: at random, based on the “margin distance” metric that is native to the CCMatrix dataset, and based on scores derived through the application of a state-of-the-art classifier model (Acarcicek et al., 2020) utilised in a thematic WMT shared task. The performance of the issuing models is evaluated and compared. The classifier-based model does not reach high performance as compared with its margin-based counterpart, opening a discussion of ways for further improvement. Still, BLEU scores surpass those of Acarcicek et al.’s (2020) paper by over 15 points.

pdf bib abs
Improving Commonsense Contingent Reasoning by Pseudo-data and Its Application to the Related Tasks
Kazumasa Omura | Sadao Kurohashi
Proceedings of the 29th International Conference on Computational Linguistics

Contingent reasoning is one of the essential abilities in natural language understanding, and many language resources annotated with contingent relations have been constructed. However, despite the recent advances in deep learning, the task of contingent reasoning is still difficult for computers. In this study, we focus on the reasoning of contingent relation between basic events. Based on the existing data construction method, we automatically generate large-scale pseudo-problems and incorporate the generated data into training. We also investigate the generality of contingent knowledge through quantitative evaluation by performing transfer learning on the related tasks: discourse relation analysis, the Japanese Winograd Schema Challenge, and the JCommonsenseQA. The experimental results show the effectiveness of utilizing pseudo-problems for both the commonsense contingent reasoning task and the related tasks, which suggests the importance of contingent reasoning.

pdf bib abs
Improving Bridging Reference Resolution using Continuous Essentiality from Crowdsourcing
Nobuhiro Ueda | Sadao Kurohashi
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

Bridging reference resolution is the task of finding nouns that complement essential information of another noun. The essentiality varies depending on noun combination and context and has a continuous distribution. Despite the continuous nature of essentiality, existing datasets of bridging reference have only a few coarse labels to represent the essentiality. In this work, we propose a crowdsourcing-based annotation method that considers continuous essentiality. In the crowdsourcing task, we asked workers to select both all nouns with a bridging reference relation and a noun with the highest essentiality among them. Combining these annotations, we can obtain continuous essentiality. Experimental results demonstrated that the constructed dataset improves bridging reference resolution performance. The code is available at https://github.com/nobu-g/bridging-resolution.

pdf bib abs
Construction of Hierarchical Structured Knowledge-based Recommendation Dialogue Dataset and Dialogue System
Takashi Kodama | Ribeka Tanaka | Sadao Kurohashi
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

We work on a recommendation dialogue system to help a user understand the appealing points of some target (e.g., a movie). In such dialogues, the recommendation system needs to utilize structured external knowledge to make informative and detailed recommendations. However, there is no dialogue dataset with structured external knowledge designed to make detailed recommendations for the target. Therefore, we construct a dialogue dataset, Japanese Movie Recommendation Dialogue (JMRD), in which the recommender recommends one movie in a long dialogue (23 turns on average). The external knowledge used in this dataset is hierarchically structured, including title, casts, reviews, and plots. Every recommender’s utterance is associated with the external knowledge related to the utterance. We then create a movie recommendation dialogue system that considers the structure of the external knowledge and the history of the knowledge used. Experimental results show that the proposed model is superior in knowledge selection to the baseline models.

Relation extraction (RE) has achieved remarkable progress with the help of pre-trained language models. However, existing RE models are usually incapable of handling two situations: implicit expressions and long-tail relation types, caused by language complexity and data sparsity. In this paper, we introduce a simple enhancement of RE using k nearest neighbors (kNN-RE). kNN-RE allows the model to consult training relations at test time through a nearest-neighbor search and provides a simple yet effective means to tackle the two issues above. Additionally, we observe that kNN-RE serves as an effective way to leverage distant supervision (DS) data for RE. Experimental results show that the proposed kNN-RE achieves state-of-the-art performances on a variety of supervised RE datasets, i.e., ACE05, SciERC, and Wiki80, along with outperforming the best model to date on the i2b2 and Wiki80 datasets in the setting of allowing using DS. Our code and models are available at: https://github.com/YukinoWan/kNN-RE.

Word alignment has proven to benefit many-to-many neural machine translation (NMT). However, high-quality ground-truth bilingual dictionaries were used for pre-editing in previous methods, which are unavailable for most language pairs. Meanwhile, the contrastive objective can implicitly utilize automatically learned word alignment, which has not been explored in many-to-many NMT. This work proposes a word-level contrastive objective to leverage word alignments for many-to-many NMT. Empirical results show that this leads to 0.8 BLEU gains for several language pairs. Analyses reveal that in many-to-many NMT, the encoder’s sentence retrieval performance highly correlates with the translation quality, which explains when the proposed method impacts translation. This motivates future exploration for many-to-many NMT to improve the encoder’s sentence retrieval performance.

pdf bib abs
Textual Enhanced Contrastive Learning for Solving Math Word Problems
Yibin Shen | Qianying Liu | Zhuoyuan Mao | Fei Cheng | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2022

Solving math word problems is the task that analyses the relation of quantities e and requires an accurate understanding of contextual natural language information. Recent studies show that current models rely on shallow heuristics to predict solutions and could be easily misled by small textual perturbations. To address this problem, we propose a Textual Enhanced Contrastive Learning framework, which enforces the models to distinguish semantically similar examples while holding different mathematical logic. We adopt a self-supervised manner strategy to enrich examples with subtle textual variance by textual reordering or problem re-construction. We then retrieve the hardest to differentiate samples from both equation and textual perspectives and guide the model to learn their representations. Experimental results show that our method achieves state-of-the-art on both widely used benchmark datasets and also exquisitely designed challenge datasets in English and Chinese.

pdf bib abs
Constructing a Culinary Interview Dialogue Corpus with Video Conferencing Tool
Taro Okahisa | Ribeka Tanaka | Takashi Kodama | Yin Jou Huang | Sadao Kurohashi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Interview is an efficient way to elicit knowledge from experts of different domains. In this paper, we introduce CIDC, an interview dialogue corpus in the culinary domain in which interviewers play an active role to elicit culinary knowledge from the cooking expert. The corpus consists of 308 interview dialogues (each about 13 minutes in length), which add up to a total of 69,000 utterances. We use a video conferencing tool for data collection, which allows us to obtain the facial expressions of the interlocutors as well as the screen-sharing contents. To understand the impact of the interlocutors’ skill level, we divide the experts into “semi-professionals’” and “enthusiasts” and the interviewers into “skilled interviewers” and “unskilled interviewers.” For quantitative analysis, we report the statistics and the results of the post-interview questionnaire. We also conduct qualitative analysis on the collected interview dialogues and summarize the salient patterns of how interviewers elicit knowledge from the experts. The corpus serves the purpose to facilitate future research on the knowledge elicitation mechanism in interview dialogues.

pdf bib abs
JaMIE: A Pipeline Japanese Medical Information Extraction System with Novel Relation Annotation
Fei Cheng | Shuntaro Yada | Ribeka Tanaka | Eiji Aramaki | Sadao Kurohashi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In the field of Japanese medical information extraction, few analyzing tools are available and relation extraction is still an under-explored topic. In this paper, we first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with the practical annotation scenarios by separately annotating two different types of reports. We design a pipeline system with three components for recognizing medical entities, classifying entity modalities, and extracting relations. The empirical results show accurate analyzing performance and suggest the satisfactory annotation quality, the superiority of the latest contextual embedding models. and the feasible annotation strategy for high-accuracy demand.

pdf bib abs
Improving Event Duration Question Answering by Leveraging Existing Temporal Information Extraction Data
Felix Virgo | Fei Cheng | Sadao Kurohashi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Understanding event duration is essential for understanding natural language. However, the amount of training data for tasks like duration question answering, i.e., McTACO, is very limited, suggesting a need for external duration information to improve this task. The duration information can be obtained from existing temporal information extraction tasks, such as UDS-T and TimeBank, where more duration data is available. A straightforward two-stage fine-tuning approach might be less likely to succeed given the discrepancy between the target duration question answering task and the intermediary duration classification task. This paper resolves this discrepancy by automatically recasting an existing event duration classification task from UDS-T to a question answering task similar to the target McTACO. We investigate the transferability of duration information by comparing whether the original UDS-T duration classification or the recast UDS-T duration question answering can be transferred to the target task. Our proposed model achieves a 13% Exact Match score improvement over the baseline on the McTACO duration question answering task, showing that the two-stage fine-tuning approach succeeds when the discrepancy between the target and intermediary tasks are resolved.

pdf bib abs
VISA: An Ambiguous Subtitles Dataset for Visual Scene-aware Machine Translation
Yihang Li | Shuichiro Shimizu | Weiqi Gu | Chenhui Chu | Sadao Kurohashi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles which rarely contain linguistic ambiguity, making visual information not so effective to generate appropriate translations. We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips with the following key features: (1) the parallel sentences are subtitles from movies and TV episodes; (2) the source subtitles are ambiguous, which means they have multiple possible translations with different meanings; (3) we divide the dataset into Polysemy and Omission according to the cause of ambiguity. We show that VISA is challenging for the latest MMT system, and we hope that the dataset can facilitate MMT research.

pdf bib abs
Explicit Use of Topicality in Dialogue Response Generation
Takumi Yoshikoshi | Hayato Atarashi | Takashi Kodama | Sadao Kurohashi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

The current chat dialogue systems implicitly consider the topic given the context, but not explicitly. As a result, these systems often generate inconsistent responses with the topic of the moment. In this study, we propose a dialogue system that responds appropriately following the topic by selecting the entity with the highest “topicality.” In topicality estimation, the model is trained through self-supervised learning that regards entities that appear in both context and response as the topic entities. In response generation, the model is trained to generate topic-relevant responses based on the estimated topicality. Experimental results show that our proposed system can follow the topic more than the existing dialogue system that considers only the context.

pdf bib abs
Static and Dynamic Speaker Modeling based on Graph Neural Network for Emotion Recognition in Conversation
Prakhar Saxena | Yin Jou Huang | Sadao Kurohashi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

Each person has a unique personality which affects how they feel and convey emotions. Hence, speaker modeling is important for the task of emotion recognition in conversation (ERC). In this paper, we propose a novel graph-based ERC model which considers both conversational context and speaker personality. We model the internal state of the speaker (personality) as Static and Dynamic speaker state, where the Dynamic speaker state is modeled with a graph neural network based encoder. Experiments on benchmark dataset shows the effectiveness of our model. Our model outperforms baseline and other graph-based methods. Analysis of results also show the importance of explicit speaker modeling.

This paper presents the results of the shared tasks from the 9th workshop on Asian translation (WAT2022). For the WAT2022, 8 teams submitted their translation results for the human evaluation. We also accepted 4 research papers. About 300 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

2021

pdf bib abs
Lightweight Cross-Lingual Sentence Representation Learning
Zhuoyuan Mao | Prakhar Gupta | Chenhui Chu | Martin Jaggi | Sadao Kurohashi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks. However, further increases and modifications based on such large-scale models are usually impractical due to memory limitations. In this work, we introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations. We explore different training tasks and observe that current cross-lingual training tasks leave a lot to be desired for this shallow architecture. To ameliorate this, we propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task. We further augment the training task by the introduction of two computationally-lite sentence-level contrastive learning tasks to enhance the alignment of cross-lingual sentence representation space, which compensates for the learning bottleneck of the lightweight transformer for generative tasks. Our comparisons with competing models on cross-lingual sentence retrieval and multilingual document classification confirm the effectiveness of the newly proposed training tasks for a shallow model.

pdf bib abs
Video-guided Machine Translation with Spatial Hierarchical Attention Network
Weiqi Gu | Haiyue Song | Chenhui Chu | Sadao Kurohashi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Video-guided machine translation, as one type of multimodal machine translations, aims to engage video contents as auxiliary information to address the word sense ambiguity problem in machine translation. Previous studies only use features from pretrained action detection models as motion representations of the video to solve the verb sense ambiguity, leaving the noun sense ambiguity a problem. To address this problem, we propose a video-guided machine translation system by using both spatial and motion representations in videos. For spatial features, we propose a hierarchical attention network to model the spatial information from object-level to video-level. Experiments on the VATEX dataset show that our system achieves 35.86 BLEU-4 score, which is 0.51 score higher than the single model of the SOTA method.

pdf bib abs
Extractive Summarization Considering Discourse and Coreference Relations based on Heterogeneous Graph
Yin Jou Huang | Sadao Kurohashi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Modeling the relations between text spans in a document is a crucial yet challenging problem for extractive summarization. Various kinds of relations exist among text spans of different granularity, such as discourse relations between elementary discourse units and coreference relations between phrase mentions. In this paper, we propose a heterogeneous graph based model for extractive summarization that incorporates both discourse and coreference relations. The heterogeneous graph contains three types of nodes, each corresponds to text spans of different granularity. Experimental results on a benchmark summarization dataset verify the effectiveness of our proposed method.

pdf bib abs
Japanese Zero Anaphora Resolution Can Benefit from Parallel Texts Through Neural Transfer Learning
Masato Umakoshi | Yugo Murawaki | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2021

Parallel texts of Japanese and a non-pro-drop language have the potential of improving the performance of Japanese zero anaphora resolution (ZAR) because pronouns dropped in the former are usually mentioned explicitly in the latter. However, rule-based cross-lingual transfer is hampered by error propagation in an NLP pipeline and the frequent lack of transparency in translation correspondences. In this paper, we propose implicit transfer by injecting machine translation (MT) as an intermediate task between pretraining and ZAR. We employ a pretrained BERT model to initialize the encoder part of the encoder-decoder model for MT, and eject the encoder part for fine-tuning on ZAR. The proposed framework empirically demonstrates that ZAR performance can be improved by transfer learning from MT. In addition, we find that the incorporation of the masked language model training into MT leads to further gains.

pdf bib abs
Frustratingly Easy Edit-based Linguistic Steganography with a Masked Language Model
Honai Ueoka | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

With advances in neural language models, the focus of linguistic steganography has shifted from edit-based approaches to generation-based ones. While the latter’s payload capacity is impressive, generating genuine-looking texts remains challenging. In this paper, we revisit edit-based linguistic steganography, with the idea that a masked language model offers an off-the-shelf solution. The proposed method eliminates painstaking rule construction and has a high payload capacity for an edit-based model. It is also shown to be more secure against automatic detection than a generation-based method while offering better control of the security/payload capacity trade-off.

pdf bib abs
Contextualized and Generalized Sentence Representations by Contrastive Self-Supervised Learning: A Case Study on Discourse Relation Analysis
Hirokazu Kiyomaru | Sadao Kurohashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a method to learn contextualized and generalized sentence representations using contrastive self-supervised learning. In the proposed method, a model is given a text consisting of multiple sentences. One sentence is randomly selected as a target sentence. The model is trained to maximize the similarity between the representation of the target sentence with its context and that of the masked target sentence with the same context. Simultaneously, the model minimizes the similarity between the latter representation and the representation of a random sentence with the same context. We apply our method to discourse relation analysis in English and Japanese and show that it outperforms strong baseline methods based on BERT, XLNet, and RoBERTa.

This paper presents the results of the shared tasks from the 8th workshop on Asian translation (WAT2021). For the WAT2021, 28 teams participated in the shared tasks and 24 teams submitted their translation results for the human evaluation. We also accepted 5 research papers. About 2,100 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

2020

pdf bib abs
Building a Japanese Typo Dataset from Wikipedia’s Revision History
Yu Tanaka | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

User generated texts contain many typos for which correction is necessary for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented so that we cannot simply apply a spelling checker, and (2) the way people inputting kanji logographs results in typos with drastically different surface forms from correct ones. We address them by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.

pdf bib abs
Pre-training via Leveraging Assisting Languages for Neural Machine Translation
Haiyue Song | Raj Dabre | Zhuoyuan Mao | Fei Cheng | Sadao Kurohashi | Eiichiro Sumita
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks. However, large monolingual corpora might not always be available for the languages of interest (LOI). Thus, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. We utilize script mapping (Chinese to Japanese) to increase the similarity (number of cognates) between the monolingual corpora of helping languages and LOI. An empirical case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora, respectively, for S2S pre-training. Using only Chinese and French monolingual corpora, we were able to improve Japanese-English translation quality by up to 8.5 BLEU in low-resource scenarios.

pdf bib abs
BERT-based Cohesion Analysis of Japanese Texts
Nobuhiro Ueda | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 28th International Conference on Computational Linguistics

The meaning of natural language text is supported by cohesion among various kinds of entities, including coreference relations, predicate-argument structures, and bridging anaphora relations. However, predicate-argument structures for nominal predicates and bridging anaphora relations have not been studied well, and their analyses have been still very difficult. Recent advances in neural networks, in particular, self training-based language models including BERT (Devlin et al., 2019), have significantly improved many natural language processing tasks, making it possible to dive into the study on analysis of cohesion in the whole text. In this study, we tackle an integrated analysis of cohesion in Japanese texts. Our results significantly outperformed existing studies in each task, especially about 10 to 20 point improvement both for zero anaphora and coreference resolution. Furthermore, we also showed that coreference resolution is different in nature from the other tasks and should be treated specially.

pdf bib abs
Native-like Expression Identification by Contrasting Native and Proficient Second Language Speakers
Oleksandr Harust | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 28th International Conference on Computational Linguistics

We propose a novel task of native-like expression identification by contrasting texts written by native speakers and those by proficient second language speakers. This task is highly challenging mainly because 1) the combinatorial nature of expressions prevents us from choosing candidate expressions a priori and 2) the distributions of the two types of texts overlap considerably. Our solution to the first problem is to combine a powerful neural network-based classifier of sentence-level nativeness with an explainability method that measures an approximate contribution of a given expression to the classifier’s prediction. To address the second problem, we introduce a special label neutral and reformulate the classification task as complementary-label learning. Our crowdsourcing-based evaluation and in-depth analysis suggest that our method successfully uncovers linguistically interesting usages distinctive of native speech.

pdf bib abs
A Method for Building a Commonsense Inference Dataset based on Basic Events
Kazumasa Omura | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present a scalable, low-bias, and low-cost method for building a commonsense inference dataset that combines automatic extraction from a corpus and crowdsourcing. Each problem is a multiple-choice question that asks contingency between basic events. We applied the proposed method to a Japanese corpus and acquired 104k problems. While humans can solve the resulting problems with high accuracy (88.9%), the accuracy of a high-performance transfer learning model is reasonably low (76.0%). We also confirmed through dataset analysis that the resulting dataset contains low bias. We released the datatset to facilitate language understanding research.

Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These methods introduce exposure bias, which may cause the models overfit to the frequent label combination, thus limiting the generalization ability. We propose a novel Sequence-to-Unordered-Multi-Tree (Seq2UMTree) model to minimize the effects of exposure bias by limiting the decoding length to three within a triplet and removing the order among triplets. We evaluate our model on two datasets, DuIE and NYT, and systematically study how exposure bias alters the performance of Seq2Seq models. Experiments show that the state-of-the-art Seq2Seq model overfits to both datasets while Seq2UMTree shows significantly better generalization. Our code is available at https://github.com/WindChimeRan/OpenJERE.

pdf bib abs
Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Fei Cheng | Masayuki Asahara | Ichiro Kobayashi | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2020

Temporal relation classification is the pair-wise task for identifying the relation of a temporal link (TLINKs) between two mentions, i.e. event, time and document creation time (DCT). It leads to two crucial limits: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T and E2D) hinder from using the whole data. This paper presents an event centric model that allows to manage dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two strong transfer learning baselines on both the English and Japanese data.

pdf bib abs
Adapting BERT to Implicit Discourse Relation Classification with a Focus on Discourse Connectives
Yudai Kishimoto | Yugo Murawaki | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

BERT, a neural network-based language model pre-trained on large corpora, is a breakthrough in natural language processing, significantly outperforming previous state-of-the-art models in numerous tasks. However, there have been few reports on its application to implicit discourse relation classification, and it is not clear how BERT is best adapted to the task. In this paper, we test three methods of adaptation. (1) We perform additional pre-training on text tailored to discourse classification. (2) In expectation of knowledge transfer from explicit discourse relations to implicit discourse relations, we add a task named explicit connective prediction at the additional pre-training step. (3) To exploit implicit connectives given by treebank annotators, we add a task named implicit connective prediction at the fine-tuning step. We demonstrate that these three techniques can be combined straightforwardly in a single training pipeline. Through comprehensive experiments, we found that the first and second techniques provide additional gain while the last one did not.

pdf bib abs
Acquiring Social Knowledge about Personality and Driving-related Behavior
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we introduce our psychological approach to collect human-specific social knowledge from a text corpus, using NLP techniques. It is often not explicitly described but shared among people, which we call social knowledge. We focus on the social knowledge, especially personality and driving. We used the language resources that were developed based on psychological research methods; a Japanese personality dictionary (317 words) and a driving experience corpus (8,080 sentences) annotated with behavior and subjectivity. Using them, we automatically extracted collocations between personality descriptors and driving-related behavior from a driving behavior and subjectivity corpus (1,803,328 sentences after filtering) and obtained unique 5,334 collocations. To evaluate the collocations as social knowledge, we designed four step-by-step crowdsourcing tasks. They resulted in 266 pieces of social knowledge. They include the knowledge that might be difficult to recall by themselves but easy to agree with. We discuss the acquired social knowledge and the contribution to implementations into systems.

pdf bib abs
Development of a Japanese Personality Dictionary based on Psychological Methods
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

We propose a new approach to constructing a personality dictionary with psychological evidence. In this study, we collect personality words, using word embeddings, and construct a personality dictionary with weights for Big Five traits. The weights are calculated based on the responses of the large sample (N=1,938, female = 1,004, M=49.8years old:20-78, SD=16.3). All the respondents answered a 20-item personality questionnaire and 537 personality items derived from word embeddings. We present the procedures to examine the qualities of responses with psychological methods and to calculate the weights. These result in a personality dictionary with two sub-dictionaries. We also discuss an application of the acquired resources.

pdf bib abs
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
Haiyue Song | Raj Dabre | Atsushi Fujita | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese–English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we have released our code for parallel data creation.

pdf bib abs
JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
Zhuoyuan Mao | Fabien Cromieres | Raj Dabre | Haiyue Song | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel corpora. However, they do not account for linguistic information obtained using syntactic analyzers which is known to be invaluable for several Natural Language Processing (NLP) tasks. To this end, we propose JASS, Japanese-specific Sequence to Sequence, as a novel pre-training alternative to MASS for NMT involving Japanese as the source or target language. JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training which focuses on Japanese linguistic units called bunsetsus. In our experiments on ASPEC Japanese–English and News Commentary Japanese–Russian translation we show that JASS can give results that are competitive with if not better than those given by MASS. Furthermore, we show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods indicating their complementary nature. We will release our code, pre-trained models and bunsetsu annotated data as resources for researchers to use in their own NLP tasks.

pdf bib abs
Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases
Shuntaro Yada | Ayami Joh | Ribeka Tanaka | Fei Cheng | Eiji Aramaki | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Applying natural language processing (NLP) to medical and clinical texts can bring important social benefits by mining valuable information from unstructured text. A popular application for that purpose is named entity recognition (NER), but the annotation policies of existing clinical corpora have not been standardized across clinical texts of different types. This paper presents an annotation guideline aimed at covering medical documents of various types such as radiography interpretation reports and medical records. Furthermore, the annotation was designed to avoid burdensome requirements related to medical knowledge, thereby enabling corpus development without medical specialists. To achieve these design features, we specifically focus on critical lung diseases to stabilize linguistic patterns in corpora. After annotating around 1100 electronic medical records following the annotation scheme, we demonstrated its feasibility using an NER task. Results suggest that our guideline is applicable to large-scale clinical NLP projects.

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently by putting articles into different categories.

This paper presents the results of the shared tasks from the 7th workshop on Asian translation (WAT2020). For the WAT2020, 20 teams participated in the shared tasks and 14 teams submitted their translation results for the human evaluation. We also received 12 research paper submissions out of which 7 were accepted. About 500 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

pdf bib abs
Meta Ensemble for Japanese-Chinese Neural Machine Translation: Kyoto-U+ECNU Participation to WAT 2020
Zhuoyuan Mao | Yibin Shen | Chenhui Chu | Sadao Kurohashi | Cheqing Jin
Proceedings of the 7th Workshop on Asian Translation

This paper describes the Japanese-Chinese Neural Machine Translation (NMT) system submitted by the joint team of Kyoto University and East China Normal University (Kyoto-U+ECNU) to WAT 2020 (Nakazawa et al.,2020). We participate in APSEC Japanese-Chinese translation task. We revisit several techniques for NMT including various architectures, different data selection and augmentation methods, denoising pre-training, and also some specific tricks for Japanese-Chinese translation. We eventually perform a meta ensemble to combine all of the models into a single model. BLEU results of this meta ensembled model rank the first both on 2 directions of ASPEC Japanese-Chinese translation.

2019

pdf bib abs
Minimally Supervised Learning of Affective Events Using Discourse Relations
Jun Saito | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recognizing affective events that trigger positive or negative sentiment has a wide range of natural language processing applications but remains a challenging problem mainly because the polarity of an event is not necessarily predictable from its constituent words. In this paper, we propose to propagate affective polarity using discourse relations. Our method is simple and only requires a very small seed lexicon and a large raw corpus. Our experiments using Japanese data show that our method learns affective events effectively without manually labeled data. It also improves supervised learning results when labeled data are small.

This paper presents the results of the shared tasks from the 6th workshop on Asian translation (WAT2019) including Ja↔En, Ja↔Zh scientific paper translation subtasks, Ja↔En, Ja↔Ko, Ja↔En patent translation subtasks, Hi↔En, My↔En, Km↔En, Ta↔En mixed domain subtasks and Ru↔Ja news commentary translation task. For the WAT2019, 25 teams participated in the shared tasks. We also received 10 research paper submissions out of which 61 were accepted. About 400 translation results were submitted to the automatic evaluation server, and selected submis- sions were manually evaluated.

pdf bib abs
Machine Comprehension Improves Domain-Specific Japanese Predicate-Argument Structure Analysis
Norio Takahashi | Tomohide Shibata | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

To improve the accuracy of predicate-argument structure (PAS) analysis, large-scale training data and knowledge for PAS analysis are indispensable. We focus on a specific domain, specifically Japanese blogs on driving, and construct two wide-coverage datasets as a form of QA using crowdsourcing: a PAS-QA dataset and a reading comprehension QA (RC-QA) dataset. We train a machine comprehension (MC) model based on these datasets to perform PAS analysis. Our experiments show that a stepwise training method is the most effective, which pre-trains an MC model based on the RC-QA dataset to acquire domain knowledge and then fine-tunes based on the PAS-QA dataset.

pdf bib abs
Diversity-aware Event Prediction based on a Conditional Variational Autoencoder with Reconstruction
Hirokazu Kiyomaru | Kazumasa Omura | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Typical event sequences are an important class of commonsense knowledge. Formalizing the task as the generation of a next event conditioned on a current event, previous work in event prediction employs sequence-to-sequence (seq2seq) models. However, what can happen after a given event is usually diverse, a fact that can hardly be captured by deterministic models. In this paper, we propose to incorporate a conditional variational autoencoder (CVAE) into seq2seq for its ability to represent diverse next events as a probabilistic distribution. We further extend the CVAE-based seq2seq with a reconstruction mechanism to prevent the model from concentrating on highly typical events. To facilitate fair and systematic evaluation of the diversity-aware models, we also extend existing evaluation datasets by tying each current event to multiple next events. Experiments show that the CVAE-based models drastically outperform deterministic models in terms of precision and that the reconstruction mechanism improves the recall of CVAE-based models without sacrificing precision.

pdf bib abs
Improving Event Coreference Resolution by Learning Argument Compatibility from Unlabeled Data
Yin Jou Huang | Jing Lu | Sadao Kurohashi | Vincent Ng
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Argument compatibility is a linguistic condition that is frequently incorporated into modern event coreference resolution systems. If two event mentions have incompatible arguments in any of the argument roles, they cannot be coreferent. On the other hand, if these mentions have compatible arguments, then this may be used as information towards deciding their coreferent status. One of the key challenges in leveraging argument compatibility lies in the paucity of labeled data. In this work, we propose a transfer learning framework for event coreference resolution that utilizes a large amount of unlabeled data to learn argument compatibility of event mentions. In addition, we adopt an interactive inference network based model to better capture the compatible and incompatible relations between the context words of event mentions. Our experiments on the KBP 2017 English dataset confirm the effectiveness of our model in learning argument compatibility, which in turn improves the performance of the overall event coreference model.

pdf bib abs
Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

For languages without natural word boundaries, like Japanese and Chinese, word segmentation is a prerequisite for downstream analysis. For Japanese, segmentation is often done jointly with part of speech tagging, and this process is usually referred to as morphological analysis. Morphological analyzers are trained on data hand-annotated with segmentation boundaries and part of speech tags. A segmentation dictionary or character n-gram information is also provided as additional inputs to the model. Incorporating this extra information makes models large. Modern neural morphological analyzers can consume gigabytes of memory. We propose a compact alternative to these cumbersome approaches which do not rely on any externally provided n-gram or word representations. The model uses only unigram character embeddings, encodes them using either stacked bi-LSTM or a self-attention network, and independently infers both segmentation and part of speech information. The model is trained in an end-to-end and semi-supervised fashion, on labels produced by a state-of-the-art analyzer. We demonstrate that the proposed technique rivals performance of a previous dictionary-based state-of-the-art approach and can even surpass it when training with the combination of human-annotated and automatically-annotated data. Our model itself is significantly smaller than the dictionary-based one: it uses less than 15 megabytes of space.

pdf bib abs
Kyoto University Participation to the WMT 2019 News Shared Task
Fabien Cromieres | Sadao Kurohashi
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe here the experiments we did for the the news translation shared task of WMT 2019. We focused on the new German-to-French language direction, and mostly used current standard approaches to develop a Neural Machine Translation system. We make use of the Tensor2Tensor implementation of the Transformer model. After carefully cleaning the data and noting the importance of the good use of recent monolingual data for the task, we obtain our final result by combining the output of a diverse set of trained models through the use of their “checkpoint agreement”.

pdf bib
Applying Machine Translation to Psychology: Automatic Translation of Personality Adjectives
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

pdf bib abs
A Knowledge-Augmented Neural Network Model for Implicit Discourse Relation Classification
Yudai Kishimoto | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 27th International Conference on Computational Linguistics

Identifying discourse relations that are not overtly marked with discourse connectives remains a challenging problem. The absence of explicit clues indicates a need for the combination of world knowledge and weak contextual clues, which can hardly be learned from a small amount of manually annotated data. In this paper, we address this problem by augmenting the input text with external knowledge and context and by adopting a neural network model that can effectively handle the augmented text. Experiments show that external knowledge did improve the classification accuracy. Contextual information provided no significant gain for implicit discourse relations, but it did for explicit ones.

pdf bib abs
Cross-lingual Knowledge Projection Using Machine Translation and Target-side Knowledge Base Completion
Naoki Otani | Hirokazu Kiyomaru | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 27th International Conference on Computational Linguistics

Considerable effort has been devoted to building commonsense knowledge bases. However, they are not available in many languages because the construction of KBs is expensive. To bridge the gap between languages, this paper addresses the problem of projecting the knowledge in English, a resource-rich language, into other languages, where the main challenge lies in projection ambiguity. This ambiguity is partially solved by machine translation and target-side knowledge base completion, but neither of them is adequately reliable by itself. We show their combination can project English commonsense knowledge into Japanese and Chinese with high precision. Our method also achieves a top-10 accuracy of 90% on the crowdsourced English–Japanese benchmark. Furthermore, we use our method to obtain 18,747 facts of accurate Japanese commonsense within a very short period.

pdf bib abs
Juman++: A Morphological Analysis Toolkit for Scriptio Continua
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a three-part toolkit for developing morphological analyzers for languages without natural word boundaries. The first part is a C++11/14 lattice-based morphological analysis library that uses a combination of linear and recurrent neural net language models for analysis. The other parts are a tool for exposing problems in the trained model and a partial annotation tool. Our morphological analyzer of Japanese achieves new SOTA on Jumandic-based corpora while being 250 times faster than the previous one. We also perform a small experiment and quantitive analysis and experience of using development tools. All components of the toolkit is open source and available under a permissive Apache 2 License.

pdf bib
Comprehensive Annotation of Various Types of Temporal Information on the Time Axis
Tomohiro Sakaguchi | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Improving Crowdsourcing-Based Annotation of Japanese Discourse Relations
Yudai Kishimoto | Shinnosuke Sawada | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Knowledge-Enriched Two-Layered Attention Network for Sentiment Analysis
Abhishek Kumar | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory for sentiment analysis. The novel two-layered attention network takes advantage of the external knowledge bases to improve the sentiment prediction. It uses the Knowledge Graph Embedding generated using the WordNet. We build our model by combining the two-layered attention network with the supervised model based on Support Vector Regression using a Multilayer Perceptron network for sentiment analysis. We evaluate our model on the benchmark dataset of SemEval 2017 Task 5. Experimental results show that the proposed model surpasses the top system of SemEval 2017 Task 5. The model performs significantly better by improving the state-of-the-art system at SemEval 2017 Task 5 by 1.7 and 3.7 points for sub-tracks 1 and 2 respectively.

pdf bib abs
Neural Adversarial Training for Semi-supervised Japanese Predicate-argument Structure Analysis
Shuhei Kurita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Japanese predicate-argument structure (PAS) analysis involves zero anaphora resolution, which is notoriously difficult. To improve the performance of Japanese PAS analysis, it is straightforward to increase the size of corpora annotated with PAS. However, since it is prohibitively expensive, it is promising to take advantage of a large amount of raw corpora. In this paper, we propose a novel Japanese PAS analysis model based on semi-supervised adversarial training with a raw corpus. In our experiments, our model outperforms existing state-of-the-art models for Japanese PAS analysis.

pdf bib abs
Entity-Centric Joint Modeling of Japanese Coreference Resolution and Predicate Argument Structure Analysis
Tomohide Shibata | Sadao Kurohashi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Predicate argument structure analysis is a task of identifying structured events. To improve this field, we need to identify a salient entity, which cannot be identified without performing coreference resolution and predicate argument structure analysis simultaneously. This paper presents an entity-centric joint model for Japanese coreference resolution and predicate argument structure analysis. Each entity is assigned an embedding, and when the result of both analyses refers to an entity, the entity embedding is updated. The analyses take the entity embedding into consideration to access the global information of entities. Our experimental results demonstrate the proposed method can improve the performance of the inter-sentential zero anaphora resolution drastically, which is a notoriously difficult task in predicate argument structure analysis.

pdf bib
Annotating a Driving Experience Corpus with Behavior and Subjectivity
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2017

pdf bib abs
Kyoto University MT System Description for IWSLT 2017
Raj Dabre | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 14th International Conference on Spoken Language Translation

We describe here our Machine Translation (MT) model and the results we obtained for the IWSLT 2017 Multilingual Shared Task. Motivated by Zero Shot NMT [1] we trained a Multilingual Neural Machine Translation by combining all the training data into one single collection by appending the tokens to the source sentences in order to indicate the target language they should be translated to. We observed that even in a low resource situation we were able to get translations whose quality surpass the quality of those obtained by Phrase Based Statistical Machine Translation by several BLEU points. The most surprising result we obtained was in the zero shot setting for Dutch-German and Italian-Romanian where we observed that despite using no parallel corpora between these language pairs, the NMT model was able to translate between these languages and the translations were either as good as or better (in terms of BLEU) than the non zero resource setting. We also verify that the NMT models that use feed forward layers and self attention instead of recurrent layers are extremely fast in terms of training which is useful in a NMT experimental setting.

pdf bib
Proceedings of Machine Translation Summit XVI: Research Track
Sadao Kurohashi | Pascale Fung
Proceedings of Machine Translation Summit XVI: Research Track

pdf bib
Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages
Raj Dabre | Fabien Cromieres | Sadao Kurohashi
Proceedings of Machine Translation Summit XVI: Research Track

pdf bib abs
Improving Chinese Semantic Role Labeling using High-quality Surface and Deep Case Frames
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper presents a method for applying automatically acquired knowledge to semantic role labeling (SRL). We use a large amount of automatically extracted knowledge to improve the performance of SRL. We present two varieties of knowledge, which we call surface case frames and deep case frames. Although the surface case frames are compiled from syntactic parses and can be used as rich syntactic knowledge, they have limited capability for resolving semantic ambiguity. To compensate the deficiency of the surface case frames, we compile deep case frames from automatic semantic roles. We also consider quality management for both types of knowledge in order to get rid of the noise brought from the automatic analyses. The experimental results show that Chinese SRL can be improved using automatically acquired knowledge and the quality management shows a positive effect on this task.

pdf bib
Proceedings of the IJCNLP 2017, Tutorial Abstracts
Sadao Kurohashi | Michael Strube
Proceedings of the IJCNLP 2017, Tutorial Abstracts

pdf bib abs
Neural Joint Model for Transition-based Chinese Syntactic Analysis
Shuhei Kurita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present neural network-based joint models for Chinese word segmentation, POS tagging and dependency parsing. Our models are the first neural approaches for fully joint Chinese analysis that is known to prevent the error propagation problem of pipeline models. Although word embeddings play a key role in dependency parsing, they cannot be applied directly to the joint task in the previous work. To address this problem, we propose embeddings of character strings, in addition to words. Experiments show that our models outperform existing systems in Chinese word segmentation and POS tagging, and perform preferable accuracies in dependency parsing. We also explore bi-LSTM models with fewer features.

pdf bib abs
An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation
Chenhui Chu | Raj Dabre | Sadao Kurohashi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we propose a novel domain adaptation method named “mixed fine tuning” for neural machine translation (NMT). We combine two existing approaches namely fine tuning and multi domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus which is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against fine tuning and multi domain methods and discuss its benefits and shortcomings.

pdf bib abs
Improving Shared Argument Identification in Japanese Event Knowledge Acquisition
Yin Jou Huang | Sadao Kurohashi
Proceedings of the Events and Stories in the News Workshop

Event knowledge represents the knowledge of causal and temporal relations between events. Shared arguments of event knowledge encode patterns of role shifting in successive events. A two-stage framework was proposed for the task of Japanese event knowledge acquisition, in which related event pairs are first extracted, and shared arguments are then identified to form the complete event knowledge. This paper focuses on the second stage of this framework, and proposes a method to improve the shared argument identification of related event pairs. We constructed a gold dataset for shared argument learning. By evaluating our system on this gold dataset, we found that our proposed model outperformed the baseline models by a large margin.

pdf bib abs
Automatic Extraction of High-Quality Example Sentences for Word Learning Using a Determinantal Point Process
Arseny Tolmachev | Sadao Kurohashi
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Flashcard systems are effective tools for learning words but have their limitations in teaching word usage. To overcome this problem, we propose a novel flashcard system that shows a new example sentence on each repetition. This extension requires high-quality example sentences, automatically extracted from a huge corpus. To do this, we use a Determinantal Point Process which scales well to large data and allows to naturally represent sentence similarity and quality as features. Our human evaluation experiment on Japanese language indicates that the proposed method successfully extracted high-quality example sentences.

This paper presents the results of the shared tasks from the 4th workshop on Asian translation (WAT2017) including J↔E, J↔C scientific paper translation subtasks, C↔J, K↔J, E↔J patent translation subtasks, H↔E mixed domain subtasks, J↔E newswire subtasks and J↔E recipe subtasks. For the WAT2017, 12 institutions participated in the shared tasks. About 300 translation results have been submitted to the automatic evaluation server, and selected submissions were manually evaluated.

pdf bib abs
Kyoto University Participation to WAT 2017
Fabien Cromieres | Raj Dabre | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

We describe here our approaches and results on the WAT 2017 shared translation tasks. Following our good results with Neural Machine Translation in the previous shared task, we continue this approach this year, with incremental improvements in models and training methods. We focused on the ASPEC dataset and could improve the state-of-the-art results for Chinese-to-Japanese and Japanese-to-Chinese translations.

pdf bib abs
Automatically Acquired Lexical Knowledge Improves Japanese Joint Morphological and Dependency Analysis
Daisuke Kawahara | Yuta Hayashibe | Hajime Morita | Sadao Kurohashi
Proceedings of the 15th International Conference on Parsing Technologies

This paper presents a joint model for morphological and dependency analysis based on automatically acquired lexical knowledge. This model takes advantage of rich lexical knowledge to simultaneously resolve word segmentation, POS, and dependency ambiguities. In our experiments on Japanese, we show the effectiveness of our joint model over conventional pipeline models.

2016

pdf bib abs
Consistent Word Segmentation, Part-of-Speech Tagging and Dependency Labelling Annotation for Chinese Language
Mo Shen | Wingmui Li | HyunJeong Choe | Chenhui Chu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose a new annotation approach to Chinese word segmentation, part-of-speech (POS) tagging and dependency labelling that aims to overcome the two major issues in traditional morphology-based annotation: Inconsistency and data sparsity. We re-annotate the Penn Chinese Treebank 5.0 (CTB5) and demonstrate the advantages of this approach compared to the original CTB5 annotation through word segmentation, POS tagging and machine translation experiments.

pdf bib
IRT-based Aggregation Model of Crowdsourced Pairwise Comparison for Evaluating Machine Translations
Naoki Otani | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Insertion Position Selection Model for Flexible Non-Terminals in Dependency Tree-to-Tree Machine Translation
Toshiaki Nakazawa | John Richardson | Sadao Kurohashi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib abs
Paraphrasing Out-of-Vocabulary Words with Word Embeddings and Semantic Lexicons for Low Resource Statistical Machine Translation
Chenhui Chu | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Out-of-vocabulary (OOV) word is a crucial problem in statistical machine translation (SMT) with low resources. OOV paraphrasing that augments the translation model for the OOV words by using the translation knowledge of their paraphrases has been proposed to address the OOV problem. In this paper, we propose using word embeddings and semantic lexicons for OOV paraphrasing. Experiments conducted on a low resource setting of the OLYMPICS task of IWSLT 2012 verify the effectiveness of our proposed method.

pdf bib abs
Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons
Antoine Bourlon | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Sentence alignment is a task that consists in aligning the parallel sentences in a translated article pair. This paper describes a method to perform sentence boundary detection and alignment simultaneously, which significantly improves the alignment accuracy on languages like Chinese with uncertain sentence boundaries. It relies on the definition of hard (certain) and soft (uncertain) punctuation delimiters, the latter being possibly ignored to optimize the alignment result. The alignment method is used in combination with lexicons automatically generated from the input article pairs using pivot-based MT, achieving better coverage of the input words with fewer entries than pre-existing dictionaries. Pivot-based MT makes it possible to build dictionaries for language pairs that have scarce parallel data. The alignment method is implemented in a tool that will be freely available in the near future.

In this paper, we describe the details of the ASPEC (Asian Scientific Paper Excerpt Corpus), which is the first large-size parallel corpus of scientific paper domain. ASPEC was constructed in the Japanese-Chinese machine translation project conducted between 2006 and 2010 using the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of approximately 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of approximately 0.68 million parallel sentences (ASPEC-JC). ASPEC is used as the official dataset for the machine translation evaluation workshop WAT (Workshop on Asian Translation).

pdf bib abs
Parallel Sentence Extraction from Comparable Corpora with Neural Network Features
Chenhui Chu | Raj Dabre | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parallel corpora are crucial for machine translation (MT), however they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract parallel sentences from them for MT. In this paper, we exploit the neural network features acquired from neural MT for parallel sentence extraction. We observe significant improvements for both accuracy in sentence extraction and MT performance.

pdf bib
Flexible Non-Terminals for Dependency Tree-to-Tree Reordering
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Neural Network-Based Model for Japanese Predicate Argument Structure Analysis
Tomohide Shibata | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Dependency Forest based Word Alignment
Hitoshi Otsuki | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the ACL 2016 Student Research Workshop

pdf bib
M2L at SemEval-2016 Task 8: AMR Parsing with Neural Networks
Yevgeniy Puzikov | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
Design of Word Association Games using Dialog Systems for Acquisition of Word Association Knowledge
Yuichiro Machida | Daisuke Kawahara | Sadao Kurohashi | Manabu Sassano
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib
Cross-language Projection of Dependency Trees with Constrained Partial Parsing for Tree-to-Tree Machine Translation
Yu Shen | Chenhui Chu | Fabien Cromieres | Sadao Kurohashi
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

pdf bib
The Kyoto University Cross-Lingual Pronoun Translation System
Raj Dabre | Yevgeniy Puzikov | Fabien Cromieres | Sadao Kurohashi
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib abs
Large-Scale Acquisition of Commonsense Knowledge via a Quiz Game on a Dialogue System
Naoki Otani | Daisuke Kawahara | Sadao Kurohashi | Nobuhiro Kaji | Manabu Sassano
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

Commonsense knowledge is essential for fully understanding language in many situations. We acquire large-scale commonsense knowledge from humans using a game with a purpose (GWAP) developed on a smartphone spoken dialogue system. We transform the manual knowledge acquisition process into an enjoyable quiz game and have collected over 150,000 unique commonsense facts by gathering the data of more than 70,000 players over eight months. In this paper, we present a simple method for maintaining the quality of acquired knowledge and an empirical analysis of the knowledge acquisition process. To the best of our knowledge, this is the first work to collect large-scale knowledge via a GWAP on a widely-used spoken dialogue system.

This paper presents the results of the shared tasks from the 3rd workshop on Asian translation (WAT2016) including J ↔ E, J ↔ C scientific paper translation subtasks, C ↔ J, K ↔ J, E ↔ J patent translation subtasks, I ↔ E newswire subtasks and H ↔ E, H ↔ J mixed domain subtasks. For the WAT2016, 15 institutions participated in the shared tasks. About 500 translation results have been submitted to the automatic evaluation server, and selected submissions were manually evaluated.

pdf bib abs
Kyoto University Participation to WAT 2016
Fabien Cromieres | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

We describe here our approaches and results on the WAT 2016 shared translation tasks. We tried to use both an example-based machine translation (MT) system and a neural MT system. We report very good translation results, especially when using neural MT for Chinese-to-Japanese translation.

pdf bib abs
SCTB: A Chinese Treebank in Scientific Domain
Chenhui Chu | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Treebanks are curial for natural language processing (NLP). In this paper, we present our work for annotating a Chinese treebank in scientific domain (SCTB), to address the problem of the lack of Chinese treebanks in this domain. Chinese analysis and machine translation experiments conducted using this treebank indicate that the annotated treebank can significantly improve the performance on both tasks. This treebank is released to promote Chinese NLP research in scientific domain.

2015

pdf bib
Korean-Chinese word translation using Chinese character knowledge
Yuanmei Lu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of Machine Translation Summit XV: Papers

pdf bib
Enhancing function word translation with syntax-based statistical post-editing
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 6th Workshop on Patent and Scientific Literature Translation

pdf bib
Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model
Hajime Morita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Leveraging Small Multilingual Corpora for SMT Using Many Pivot Languages
Raj Dabre | Fabien Cromieres | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Classification and Acquisition of Contradictory Event Pairs using Crowdsourcing
Yu Takabatake | Hajime Morita | Daisuke Kawahara | Sadao Kurohashi | Ryuichiro Higashinaka | Yoshihiro Matsuo
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

pdf bib
Location Name Disambiguation Exploiting Spatial Proximity and Temporal Consistency
Takashi Awamura | Daisuke Kawahara | Eiji Aramaki | Tomohide Shibata | Sadao Kurohashi
Proceedings of the third International Workshop on Natural Language Processing for Social Media

pdf bib
Chinese Semantic Role Labeling using High-quality Syntactic Knowledge
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

pdf bib
Large-scale Dictionary Construction via Pivot-based Statistical Machine Translation with Significance Pruning and Neural Network Features
Raj Dabre | Chenhui Chu | Fabien Cromieres | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
Pivot-Based Topic Models for Low-Resource Lexicon Extraction
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
Cross-language Projection of Dependency Trees for Tree-to-tree Machine Translation
Yu Shen | Chenhui Chu | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters

2014

pdf bib abs
Post-editing user interface using visualization of a sentence structure
Yudai Kishimoto | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

Translation has become increasingly important by virtue of globalization. To reduce the cost of translation, it is necessary to use machine translation and further to take advantage of post-editing based on the result of a machine translation for accurate information dissemination. Such post-editing (e.g., PET [Aziz et al., 2012]) can be used practically for translation between European languages, which has a high performance in statistical machine translation. However, due to the low accuracy of machine translation between languages with different word order, such as Japanese-English and Japanese-Chinese, post-editing has not been used actively.

pdf bib
Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing
Daisuke Kawahara | Yuichiro Machida | Tomohide Shibata | Sadao Kurohashi | Hayato Kobayashi | Manabu Sassano
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Translation Rules with Right-Hand Side Lattices
Fabien Cromières | Sadao Kurohashi
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib abs
Bilingual Dictionary Construction with Transliteration Filtering
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present a bilingual transliteration lexicon of 170K Japanese-English technical terms in the scientific domain. Translation pairs are extracted by filtering a large list of transliteration candidates generated automatically from a phrase table trained on parallel corpora. Filtering uses a novel transliteration similarity measure based on a discriminative phrase-based machine translation approach. We demonstrate that the extracted dictionary is accurate and of high recall (F1 score 0.8). Our lexicon contains not only single words but also multi-word expressions, and is freely available. Our experiments focus on Katakana-English lexicon construction, however it would be possible to apply the proposed methods to transliteration extraction for a variety of language pairs.

pdf bib abs
A Large Scale Database of Strongly-related Events in Japanese
Tomohide Shibata | Shotaro Kohama | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The knowledge about the relation between events is quite useful for coreference resolution, anaphora resolution, and several NLP applications such as dialogue system. This paper presents a large scale database of strongly-related events in Japanese, which has been acquired with our proposed method (Shibata and Kurohashi, 2011). In languages, where omitted arguments or zero anaphora are often utilized, such as Japanese, the coreference-based event extraction methods are hard to be applied, and so our method extracts strongly-related events in a two-phrase construct. This method first calculates the co-occurrence measure between predicate-arguments (events), and regards an event pair, whose mutual information is high, as strongly-related events. To calculate the co-occurrence measure efficiently, we adopt an association rule mining method. Then, we identify the remaining arguments by using case frames. The database contains approximately 100,000 unique events, with approximately 340,000 strongly-related event pairs, which is much larger than an existing automatically-constructed event database. We evaluated randomly-chosen 100 event pairs, and the accuracy was approximately 68%.

pdf bib abs
Constructing a Chinese—Japanese Parallel Corpus from Wikipedia
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Parallel corpora are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese―Japanese. As comparable corpora are far more available, many studies have been conducted to automatically construct parallel corpora from comparable corpora. This paper presents a robust parallel sentence extraction system for constructing a Chinese―Japanese parallel corpus from Wikipedia. The system is inspired by previous studies that mainly consist of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve the system by using the common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than the previous studies for both accuracy in parallel sentence extraction and SMT performance. Using the system, we construct a Chinese―Japanese parallel corpus with more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/chu/resource/wiki_zh_ja.tgz.

pdf bib abs
Constructing a Corpus of Japanese Predicate Phrases for Synonym/Antonym Relations
Tomoko Izumi | Tomohide Shibata | Hisako Asano | Yoshihiro Matsuo | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We construct a large corpus of Japanese predicate phrases for synonym-antonym relations. The corpus consists of 7,278 pairs of predicates such as receive-permission (ACC) vs. obtain-permission (ACC), in which each predicate pair is accompanied by a noun phrase and case information. The relations are categorized as synonyms, entailment, antonyms, or unrelated. Antonyms are further categorized into three different classes depending on their aspect of oppositeness. Using the data as a training corpus, we conduct the supervised binary classification of synonymous predicates based on linguistically-motivated features. Combining features that are characteristic of synonymous predicates with those that are characteristic of antonymous predicates, we succeed in automatically identifying synonymous predicates at the high F-score of 0.92, a 0.4 improvement over the baseline method of using the Japanese WordNet. The results of an experiment confirm that the quality of the corpus is high enough to achieve automatic classification. To the best of our knowledge, this is the first and the largest publicly available corpus of Japanese predicate phrases for synonym-antonym relations.

pdf bib abs
A Framework for Compiling High Quality Knowledge Resources From Raw Corpora
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The identification of various types of relations is a necessary step to allow computers to understand natural language text. In particular, the clarification of relations between predicates and their arguments is essential because predicate-argument structures convey most of the information in natural languages. To precisely capture these relations, wide-coverage knowledge resources are indispensable. Such knowledge resources can be derived from automatic parses of raw corpora, but unfortunately parsing still has not achieved a high enough performance for precise knowledge acquisition. We present a framework for compiling high quality knowledge resources from raw corpora. Our proposed framework selects high quality dependency relations from automatic parses and makes use of them for not only the calculation of fundamental distributional similarity but also the acquisition of knowledge such as case frames.

pdf bib
Chinese Morphological Analysis with Character-level POS Tagging
Mo Shen | Hongxiao Liu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
KyotoEBMT: An Example-Based Dependency-to-Dependency Translation Framework
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of the 1st Workshop on Asian Translation (WAT2014)
Toshiaki Nakazawa | Hideya Mino | Isao Goto | Sadao Kurohashi | Eiichiro Sumita
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

pdf bib
Overview of the 1st Workshop on Asian Translation
Toshiaki Nakazawa | Hideya Mino | Isao Goto | Sadao Kurohashi | Eiichiro Sumita
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

pdf bib
KyotoEBMT System Description for the 1st Workshop on Asian Translation
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

pdf bib
Improving Statistical Machine Translation Accuracy Using Bilingual Lexicon Extractionwith Paraphrases
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

2013

pdf bib
Japanese Zero Reference Resolution Considering Exophora and Author/Reader Mentions
Masatsugu Hangyo | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Automatic Knowledge Acquisition for Case Alternation between the Passive and Active Voices in Japanese
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi | Manabu Okumura
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Precise Information Retrieval Exploiting Predicate-Argument Structures
Daisuke Kawahara | Keiji Shinzato | Tomohide Shibata | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
A Simple Approach to Unknown Word Processing in Japanese Morphological Analysis
Ryohei Sasano | Sadao Kurohashi | Manabu Okumura
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Chinese Word Segmentation by Mining Maximized Substrings
Mo Shen | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Robust Transliteration Mining from Comparable Corpora with Bilingual Topic Models
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
High Quality Dependency Selection from Automatic Parses
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Accurate Parallel Fragment Extraction from Quasi–Comparable Corpora using Alignment Model and Translation Lexicon
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Distortion Model Considering Rich Context for Statistical Machine Translation
Isao Goto | Masao Utiyama | Eiichiro Sumita | Akihiro Tamura | Sadao Kurohashi
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Chinese–Japanese Parallel Sentence Extraction from Quasi–Comparable Corpora
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

pdf bib
Towards Fully Lexicalized Dependency Parsing for Korean
Jungyeul Park | Daisuke Kawahara | Sadao Kurohashi | Key-Sun Choi
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

2012

pdf bib
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
Chenhui Chu | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf bib abs
EBMT system of Kyoto University in OLYMPICS task at IWSLT 2012
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the EBMT system of Kyoto University that participated in the OLYMPICS task at IWSLT 2012. When translating very different language pairs such as Chinese-English, it is very important to handle sentences in tree structures to overcome the difference. Many recent studies incorporate tree structures in some parts of translation process, but not all the way from model training (alignment) to decoding. Our system is a fully tree-based translation system where we use the Bayesian phrase alignment model on dependency trees and example-based translation. To improve the translation quality, we conduct some special processing for the IWSLT 2012 OLYMPICS task, including sub-sentence splitting, non-parallel sentence filtering, adoption of an optimized Chinese segmenter and rule-based decoding constraints.

pdf bib
Flexible Japanese Sentence Compression by Relaxing Unit Constraints
Jun Harashima | Sadao Kurohashi
Proceedings of COLING 2012

pdf bib
Semi-Supervised Noun Compound Analysis with Edge and Span Features
Yugo Murawaki | Sadao Kurohashi
Proceedings of COLING 2012

pdf bib
Alignment by Bilingual Generation and Monolingual Derivation
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of COLING 2012

pdf bib abs
Chinese Characters Mapping Table of Japanese, Traditional Chinese and Simplified Chinese
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Chinese characters are used both in Japanese and Chinese, which are called Kanji and Hanzi respectively. Chinese characters contain significant semantic information, a mapping table between Kanji and Hanzi can be very useful for many Japanese-Chinese bilingual applications, such as machine translation and cross-lingual information retrieval. Because Kanji characters are originated from ancient China, most Kanji have corresponding Chinese characters in Hanzi. However, the relation between Kanji and Hanzi is quite complicated. In this paper, we propose a method of making a Chinese characters mapping table of Japanese, Traditional Chinese and Simplified Chinese automatically by means of freely available resources. We define seven categories for Kanji based on the relation between Kanji and Hanzi, and classify mappings of Chinese characters into these categories. We use a resource from Wiktionary to show the completeness of the mapping table we made. Statistical comparison shows that our proposed method makes a more complete mapping table than the current version of Wiktionary.

pdf bib
Constrained Hidden Markov Model for Bilingual Keyword Pairs Alignment
Denny Cahyadi | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 10th Workshop on Asian Language Resources

pdf bib
A Reranking Approach for Dependency Parsing with Variable-sized Subtree Features
Mo Shen | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf bib
Building a Diverse Document Leads Corpus Annotated with Semantic Relations
Masatsugu Hangyo | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2011

pdf bib
Japanese-Chinese Phrase Alignment Using Common Chinese Characters Information
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
Fabien Cromieres | Sadao Kurohashi
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Non-parametric Bayesian Segmentation of Japanese Noun Phrases
Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Generative Modeling of Coordination by Factoring Parallelism and Selectional Preferences
Daisuke Kawahara | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
A Discriminative Approach to Japanese Zero Anaphora Resolution with Large-scale Lexicalized Case Frames
Ryohei Sasano | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Bayesian Subtree Alignment Model based on Dependency Trees
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Acquiring Strongly-related Events using Predicate-argument Co-occurring Statistics and Case Frames
Tomohide Shibata | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Relevance Feedback using Latent Information
Jun Harashima | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Extracting Paraphrases from Definition Sentences on the Web
Chikara Hashimoto | Kentaro Torisawa | Stijn De Saeger | Jun’ichi Kazama | Sadao Kurohashi
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the ACL-HLT 2011 System Demonstrations
Sadao Kurohashi
Proceedings of the ACL-HLT 2011 System Demonstrations

2010

pdf bib
Identifying Contradictory and Contrastive Relations between Statements to Outline Web Information on a Given Topic
Daisuke Kawahara | Kentaro Inui | Sadao Kurohashi
Coling 2010: Posters

pdf bib
Semantic Classification of Automatically Acquired Nouns using Lexico-Syntactic Clues
Yugo Murawaki | Sadao Kurohashi
Coling 2010: Posters

pdf bib abs
Online Japanese Unknown Morpheme Detection using Orthographic Variation
Yugo Murawaki | Sadao Kurohashi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

To solve the unknown morpheme problem in Japanese morphological analysis, we previously proposed a novel framework of online unknown morpheme acquisition and its implementation. This framework poses a previously unexplored problem, online unknown morpheme detection. Online unknown morpheme detection is a task of finding morphemes in each sentence that are not listed in a given lexicon. Unlike in English, it is a non-trivial task because Japanese does not delimit words by white space. We first present a baseline method that simply uses the output of the morphological analyzer. We then show that it fails to detect some unknown morphemes because they are over-segmented into shorter registered morphemes. To cope with this problem, we present a simple solution, the use of orthographic variation of Japanese. Under the assumption that orthographic variants behave similarly, each over-segmentation candidate is checked against its counterparts. Experiments show that the proposed method improves the recall of detection and contributes to improving unknown morpheme acquisition.

pdf bib abs
Acquiring Reliable Predicate-argument Structures from Raw Corpora for Case Frame Compilation
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a method for acquiring reliable predicate-argument structures from raw corpora for automatic compilation of case frames. Such lexicon compilation requires highly reliable predicate-argument structures to practically contribute to Natural Language Processing (NLP) applications, such as paraphrasing, text entailment, and machine translation. However, to precisely identify predicate-argument structures, case frames are required. This issue is similar to the question ""what came first: the chicken or the egg?"" In this paper, we propose the first step in the extraction of reliable predicate-argument structures without using case frames. We first apply chunking to raw corpora and then extract reliable chunks to ensure that high-quality predicate-argument structures are obtained from the chunks. We conducted experiments to confirm the effectiveness of our approach. We successfully extracted reliable chunks of an accuracy of 98% and high-quality predicate-argument structures of an accuracy of 97%. Our experiments confirmed that we succeeded in acquiring highly reliable predicate-argument structures that can be used to compile case frames.

pdf bib
Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables
Tetsuji Nakagawa | Kentaro Inui | Sadao Kurohashi
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
Manabu Sassano | Sadao Kurohashi
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)
Sadao Kurohashi | Takehito Utsuro
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

pdf bib
Exploiting Term Importance Categories and Dependency Relations for Natural Language Search
Keiji Shinzato | Sadao Kurohashi
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

pdf bib
Summarizing Search Results using PLSI
Jun Harashima | Sadao Kurohashi
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

2009

pdf bib
A Probabilistic Model for Associative Anaphora Resolution
Ryohei Sasano | Sadao Kurohashi
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
An Alignment Algorithm Using Belief Propagation and a Structure-Based Distortion Model
Fabien Cromières | Sadao Kurohashi
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
The Effect of Corpus Size on Case Frame Acquisition for Discourse Analysis
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
A Unified Single Scan Algorithm for Japanese Base Phrase Chunking and Dependency Parsing
Manabu Sassano | Sadao Kurohashi
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf bib
Statistical Phrase Alignment Model Using Dependency Relation Probability
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

pdf bib
Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama | Tomohide Shibata | Sadao Kurohashi
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

pdf bib
Capturing Consistency between Intra-clause and Inter-clause Relations in Knowledge-rich Dependency and Case Structure Analysis
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf bib abs
Linguistically-motivated Tree-based Probabilistic Phrase Alignment
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we propose a probabilistic phrase alignment model based on dependency trees. This model is linguistically-motivated, using syntactic information during alignment process. The main advantage of this model is that the linguistic difference between source and target languages is successfully absorbed. It is composed of two models: Model1 is using content word translation probability and function word translation probability; Model2 uses dependency relation probability which is defined for a pair of positional relations on dependency trees. Relation probability acts as tree-based phrase reordering model. Since this model is directed, we combine two alignment results from bi-directional training by symmetrization heuristics to get definitive alignment. We conduct experiments on a Japanese-English corpus, and achieve reasonably high quality of alignment compared with word-based alignment model.

pdf bib
Coordination Disambiguation without Any Similarities
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
A Fully-Lexicalized Probabilistic Model for Japanese Zero Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Chinese Dependency Parsing with Large Scale Automatically Constructed Case Structures
Kun Yu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Online Acquisition of Japanese Unknown Morphemes using Morphological Constraints
Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology
Keiji Shinzato | Tomohide Shibata | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Japanese Named Entity Recognition Using Structural Natural Language Processing
Ryohei Sasano | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib
SYNGRAPH: A Flexible Matching Method based on Synonymous Expression Extraction from an Ordinary Dictionary and a Web Corpus
Tomohide Shibata | Michitaka Odani | Jun Harashima | Takashi Oonishi | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib abs
A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure
Keiji Shinzato | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In recent years, language resources acquired from theWeb are released, and these data improve the performance of applications in several NLP tasks. Although the language resources based on the web page unit are useful in NLP tasks and applications such as knowledge acquisition, document retrieval and document summarization, such language resources are not released so far. In this paper, we propose a data format for results of web page processing, and a search engine infrastructure which makes it possible to share approximately 100 million Japanese web data. By obtaining the web data, NLP researchers are enabled to begin their own processing immediately without analyzing web pages by themselves.

pdf bib
Blog Categorization Exploiting Domain Dictionary and Dynamically Estimated Domains of Unknown Words
Chikara Hashimoto | Sadao Kurohashi
Proceedings of ACL-08: HLT, Short Papers

2007

pdf bib
Structural phrase alignment based on consistency criteria
Toshiaki Nakazawa | Yu Kun | Sadao Kurohashi
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Probabilistic Coordination Disambiguation in a Fully-Lexicalized Japanese Parser
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf bib
A Three-Step Deterministic Parser for Chinese Dependency Parsing
Kun Yu | Sadao Kurohashi | Hao Liu
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf bib
Construction of Domain Dictionary for Fundamental Vocabulary
Chikara Hashimoto | Sadao Kurohashi
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

pdf bib
Example-based machine translation based on deeper NLP
Toshiaki Nakazawa | Kun Yu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib abs
Case Frame Compilation from the Web using High-Performance Computing
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Case frames are important knowledge for a variety of NLP systems, especially when wide-coverage case frames are available. To acquire such large-scale case frames, it is necessary to automatically compile them from an enormous amount of corpus. In this paper, we consider the web as a corpus. We first build a huge text corpus from the web, and then construct case frames from the corpus. It is infeasible to do these processes by one CPU, and thus we employ a high-performance computing environment composed of 350 CPUs. The acquired corpus consists of 470M sentences, and the case frames compiled from them have 90,000 verb entries. The case frames contain most examples of usual use, and are ready to be applied to lots of NLP analyses and applications.

pdf bib
A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf bib
Unsupervised Topic Identification by Integrating Linguistic and Visual Information Based on Hidden Markov Models
Tomohide Shibata | Sadao Kurohashi
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
Chinese Word Segmentation and Named Entity Recognition by Character Tagging
Kun Yu | Sadao Kurohashi | Hao Liu | Toshiaki Nakazawa
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

2005

pdf bib
Example-based Machine Translation Pursuing Fully Structural NLP
Sadao Kurohashi | Toshiaki Nakazawa | Kauffmann Alexis | Daisuke Kawahara
Proceedings of the Second International Workshop on Spoken Language Translation

pdf bib abs
Probabilistic Model for Example-based Machine Translation
Eiji Aramaki | Sadao Kurohashi | Hideki Kashioka | Naoto Kato
Proceedings of Machine Translation Summit X: Papers

Example-based machine translation (EBMT) systems, so far, rely on heuristic measures in retrieving translation examples. Such a heuristic measure costs time to adjust, and might make its algorithm unclear. This paper presents a probabilistic model for EBMT. Under the proposed model, the system searches the translation example combination which has the highest probability. The proposed model clearly formalizes EBMT process. In addition, the model can naturally incorporate the context similarity of translation examples. The experimental results demonstrate that the proposed model has a slightly better translation quality than state-of-the-art EBMT systems.

pdf bib
PP-Attachment Disambiguation Boosted by a Gigantic Volume of Unambiguous Examples
Daisuke Kawahara | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus
Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Automatic Slide Generation Based on Discourse Structure Analysis
Tomohide Shibata | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Lexical Choice via Topic Adaptation for Paraphrasing Written Language to Spoken Language
Nobuhiro Kaji | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf bib
Example-based machine translation using structural translation examples
Eiji Aramaki | Sadao Kurohashi
Proceedings of the First International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
Improving Japanese Zero Pronoun Resolution by Global Word Sense Disambiguation
Daisuke Kawahara | Sadao Kurohashi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Automatic Construction of Nominal Case Frames and its Application to Indirect Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib abs
Toward Text Understanding: Integrating Relevance-tagged Corpus and Automatically Constructed Case Frames
Daisuke Kawahara | Ryohei Sasano | Sadao Kurohashi
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper proposes a wide-range anaphora resolution system toward text understanding. This system resolves zero, direct and indirect anaphors in Japanese texts by integrating two sorts of linguistic resources: a hand-annotated corpus with various relations and automatically constructed case frames. The corpus has relevance tags which consist of predicate-argument relations, relations between nouns and coreferences, and is utilized for learning parameters of the system and testing it. The case frames are indispensable knowledge both for detecting zero/indirect anaphors and estimating appropriate antecedents. Our preliminary experiments showed promising results.

pdf bib
Paraphrasing Predicates from Written Language to Spoken Language Using the Web
Nobuhiro Kaji | Masashi Okamoto | Sadao Kurohashi
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

This paper describes a system for finding phrasal translation correspondences from parallel parsed corpus that are collections paired English and Japanese sentences. First, the system finds phrasal correspondences by Japanese-English translation dictionary consultation. Then, the system finds correspondences in remaining phrases by using sentences dependency structures and the balance of all correspondences. The method is based on an assumption that in parallel corpus most fragments in a source sentence have corresponding fragments in a target sentence.

pdf bib
Japanese Case Frame Construction by Coupling the Verb and its Closest Case Component
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the First International Conference on Human Language Technology Research

pdf bib
SENSEVAL-2 Japanese Translation Task
Sadao Kurohashi
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

pdf bib
Building domain-independent text generation system
XinYu Deng | Sadao Kurohashi | Jun’ichi Nakamura
Proceedings of the 15th Pacific Asia Conference on Language, Information and Computation

2000

pdf bib
Japanese Case Structure Analysis
Daisuke Kawahara | Nobuhiro Kaji | Sadao Kurohashi
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
Finding Structural Correspondences from Bilingual Parsed Corpus for Corpus-based Translation
Hideo Watanabe | Sadao Kurohashi | Eiji Aramaki
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf bib
A Parallel English-Japanese Query Collection for the Evaluation of On-Line Help Systems
Richard F. E. Sutcliffe | Sadao Kurohashi
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
Dialogue Helpsystem based on Flexible Matching of User Query with Natural Language Knowledge Base
Sadao Kurohashi | Wataru Higasa
1st SIGdial Workshop on Discourse and Dialogue

pdf bib
Nonlocal Language Modeling based on Context Co-occurrence Vectors
Sadao Kurohashi | Manabu Ori
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

pdf bib
Discourse Structure Analysis for News Video
Yasuhiko Watanabe | Yoshihiro Okada | Sadao Kurohashi | Eiichi Iwanari
Proceedings of the COLING-2000 Workshop on Semantic Annotation and Intelligent Content

1999

pdf bib
Semantic Analysis of Japanese Noun Phrases - A New Approach to Dictionary-Based Understanding
Sadao Kurohashi | Yasuyuki Sakai
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

pdf bib
Construction of Japanese Nominal Semantic Dictionary using “A NO B” Phrases in Corpora
Sadao Kurohashi | Masaki Murata | Yasunori Yata | Mitsunobu Shimada | Makoto Nagao
The Computational Treatment of Nominals

pdf bib
General Word Sense Disambiguation Method Based on a Full Sentential Context
Jiri Stetina | Sadao Kurohashi | Makoto Nagao
Usage of WordNet in Natural Language Processing Systems

1995

pdf bib abs
Analyzing Coordinate Structures Including Punctuation in English
Sadao Kurohashi
Proceedings of the Fourth International Workshop on Parsing Technologies

We present a met hod of identifying coordinate structure scopes and determining usages of commas in sentences at the same time. All possible interpretations concerning comma usages and coordinate structure scopes are ranked by taking advantage of parallelism between conjoined phrases/clauses/sentences and calculating their similarity scores. We evaluated this method through experiments on held-out test sentences and obtained promising results: both the success ratio of interpreting commas and that of detecting CS scopes were about 80%.

1994

pdf bib
Automatic Detection of Discourse Structure by Checking Surface Information in Sentences
Sadao Kurohashi | Makoto Nagao
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

pdf bib
A Syntactic Analysis Method of Long Japanese Sentences Based on the Detection of Conjunctive Structures
Sadao Kurohashi | Makoto Nagao
Computational Linguistics, Volume 20, Number 4, December 1994

1993

pdf bib abs
Structural Disambiguation in Japanese by Evaluating Case Structures based on Examples in a Case Frame Dictionary
Sadao Kurohashi | Makoto Nagao
Proceedings of the Third International Workshop on Parsing Technologies

A case structure expression is one of the most important forms to represent the meaning of a sentence. Case structure analysis is usually performed by consulting case frame information in verb dictionaries and by selecting a proper case frame for an input sentence. However, this analysis is very difficult because of word sense ambiguity and structural ambiguity. A conventional method for solving these problems is to use the method of selectional restriction, but this method has a drawback in the semantic marker (SM) system – the trade-off between descriptive power and construction cost. This paper describes a method of case structure analysis of Japanese sentences which overcomes the drawback in the SM system, concentrating on the structural disambiguation. This method selects a proper case frame for an input by the similarity measure between the input and typical example sentences of each case frame. When there are two or more possible readings for an input because of structural ambiguity, the best reading will be selected by evaluating case structures in each possible reading by the similarity measure with typical example sentences of case frames.