2024
pdf
bib
abs
Coding Open-Ended Responses using Pseudo Response Generation by Large Language Models
Yuki Zenimoto
|
Ryo Hasegawa
|
Takehito Utsuro
|
Masaharu Yoshioka
|
Noriko Kando
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Survey research using open-ended responses is an important method that contributes to the discovery of unknown issues and new needs. However, survey research generally requires time- and cost-consuming manual data processing, which makes it difficult to analyze large datasets. To address this issue, we propose an LLM-based method to automate parts of the grounded theory approach (GTA), a representative approach to qualitative data analysis. We generated and annotated pseudo open-ended responses and used them as training data for the coding procedures of GTA. Through evaluations, we showed that models trained with pseudo open-ended responses are quite effective compared with those trained with manually annotated open-ended responses. We also demonstrate that the LLM-based approach is highly efficient and cost-saving compared to a human-based approach.
pdf
bib
abs
Embedded Topic Models Enhanced by Wikification
Takashi Shibuya
|
Takehito Utsuro
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not take the polysemy of words into consideration. In this study, we incorporate Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets: 1) news articles from the New York Times and 2) the AIDA-CoNLL dataset. Our experiments show that our method improves the generalizability of neural topic models. Moreover, we analyze frequent words in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models can capture the time-series development of topics well.
pdf
bib
abs
Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data
Minato Kondo
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
In this paper, we propose a two-phase training approach in which pre-trained large language models are continually pre-trained on parallel data and then supervised fine-tuned with a small amount of high-quality parallel data. To investigate the effectiveness of our proposed approach, we conducted continual pre-training with a 3.8B-parameter model and parallel data across eight different formats. We evaluated these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that when utilizing parallel data in continual pre-training, it is essential to alternate between source and target sentences. Additionally, we demonstrated that translation accuracy improves only for translation directions where the order of source and target sentences aligns between the continual pre-training data and inference. We further demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data compared to supervised encoder-decoder models. We also show that the highest accuracy is achieved when the data for continual pre-training consists of interleaved source and target sentences and when tags are added to the source sentences.
pdf
bib
abs
NTTSU at WMT2024 General Translation Task
Minato Kondo
|
Ryo Fukuda
|
Xiaotian Wang
|
Katsuki Chousa
|
Masato Nishimura
|
Kosei Buma
|
Takatomo Kano
|
Takehito Utsuro
Proceedings of the Ninth Conference on Machine Translation
The NTTSU team’s submission leverages several large language models developed through a training procedure that includes continual pre-training and supervised fine-tuning. For paragraph-level translation, we generated synthetic paragraph-aligned data and used it for training. In the Japanese-to-Chinese task, we focused in particular on speech-domain translation. Specifically, we built Whisper models for Japanese automatic speech recognition (ASR), trained on the YODAS dataset. Since this dataset contained many noisy data pairs, we combined the Whisper outputs using ROVER to polish the transcriptions. Furthermore, to enhance the robustness of the translation model against transcription errors, we performed data augmentation by forward translation from audio, using both the ASR and base translation models. To select the best translation from the models’ multiple hypotheses, we applied Minimum Bayes Risk decoding with reranking, incorporating scores such as COMET-QE, COMET, and cosine similarity by LaBSE.
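The Minimum Bayes Risk selection step mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team’s implementation: the token-overlap utility is a toy stand-in for learned metrics such as COMET, and the function names are hypothetical.

```python
def mbr_select(hypotheses, utility):
    """Minimum Bayes Risk selection: pick the hypothesis with the
    highest average utility against all other hypotheses, which
    serve as pseudo-references."""
    best, best_score = None, float("-inf")
    for hyp in hypotheses:
        others = [ref for ref in hypotheses if ref is not hyp]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best

def overlap(a, b):
    """Toy token-overlap (Jaccard) utility, standing in for
    COMET or LaBSE cosine similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)
```

Any pairwise scoring function can be plugged in as the utility; in the paper’s setup it would combine COMET-QE, COMET, and LaBSE-based cosine similarity.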
pdf
bib
abs
Document Alignment based on Overlapping Fixed-Length Segments
Xiaotian Wang
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Acquiring large-scale parallel corpora is crucial for NLP tasks such as Neural Machine Translation, and web crawling has become a popular methodology for this purpose. Previous studies have relied on sentence-based segmentation (SBS) when aligning documents in various languages obtained through web crawling. Among them, the TK-PERT method (Thompson and Koehn, 2020) achieved state-of-the-art results and handled the boilerplate text in web-crawled data well through a down-weighting approach. However, the question of how to better encode long texts remains open. Thus, we introduce the strategy of Overlapping Fixed-Length Segmentation (OFLS) in place of SBS, and observe a pronounced enhancement when applying the same approach to document alignment. In this paper, we compare SBS and OFLS using three previous methods, Mean-Pool, TK-PERT (Thompson and Koehn, 2020), and Optimal Transport (Clark et al., 2019; El-Kishky and Guzman, 2020), on the WMT16 document alignment shared task for French-English, as well as on our self-established Japanese-English dataset MnRN. As a result, for the WMT16 task, various SBS-based methods showed an increase in recall of 1% to 10% when reproduced with OFLS. On the MnRN data, OFLS demonstrated notable accuracy improvements and exhibited faster document embedding speed.
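The segmentation idea can be written as a short sketch. This is not the authors’ implementation; the segment length, stride, and tail-handling shown here are illustrative assumptions.

```python
def ofls(tokens, seg_len=4, stride=2):
    """Overlapping fixed-length segmentation: slide a fixed-size
    window over the token sequence with a stride smaller than the
    window, then add a final segment so the tail is always covered."""
    segments = []
    for start in range(0, max(len(tokens) - seg_len, 0) + 1, stride):
        segments.append(tokens[start:start + seg_len])
    if segments[-1] != tokens[-seg_len:]:  # cover any leftover tail
        segments.append(tokens[-seg_len:])
    return segments
```

Each segment would then be embedded independently, sidestepping the long-text encoding problem that sentence-based segmentation runs into with very long sentences.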
pdf
bib
abs
Aggregating Impressions on Celebrities and their Reasons from Microblog Posts and Web Search Pages
Hibiki Yokoyama
|
Rikuto Tsuchida
|
Kosei Buma
|
Sho Miyakawa
|
Takehito Utsuro
|
Masaharu Yoshioka
Proceedings of the 3rd Workshop on Knowledge Augmented Methods for NLP
This paper aims to augment fans’ ability to critique and explore information related to celebrities of interest. First, we collect posts from X (formerly Twitter) that discuss matters related to specific celebrities. To collect the major impressions in these posts, we employ ChatGPT as a large language model (LLM) to analyze and summarize key sentiments. Next, based on the collected impressions, we search for Web pages and collect the content of the top 30 ranked pages as the source for exploring the reasons behind those impressions. Once the Web page content collection is complete, we collect and aggregate detailed reasons for the impressions on the celebrities from the content of each page. For this part, we continue to use ChatGPT, enhanced by the retrieval augmented generation (RAG) framework, to ensure the reliability of the collected results compared to relying solely on the prior knowledge of the LLM. Evaluation results comparing a reference of manually collected and aggregated reasons with those predicted by ChatGPT revealed that ChatGPT achieves high accuracy in reason collection and aggregation. Furthermore, we compared the performance of ChatGPT with an existing mT5-based model in reason collection and confirmed that ChatGPT exhibits superior performance.
2023
pdf
bib
Style-sensitive Sentence Embeddings for Evaluating Similarity in Speech Style of Japanese Sentences by Contrastive Learning
Yuki Zenimoto
|
Shinzan Komata
|
Takehito Utsuro
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop
pdf
bib
Large Scale Evaluation of End-to-End Pipeline of Speaker to Dialogue Attribution in Japanese Novels
Yuki Zenimoto
|
Shinzan Komata
|
Takehito Utsuro
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
pdf
bib
Enhanced Retrieve-Edit-Rerank Framework with kNN-MT
Xiaotian Wang
|
Takuya Tamura
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
pdf
bib
abs
Target Language Monolingual Translation Memory based NMT by Cross-lingual Retrieval of Similar Translations and Reranking
Takuya Tamura
|
Xiaotian Wang
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Retrieve-edit-rerank is a text generation framework composed of three steps: retrieving sentences using the input sentence as a query, generating multiple output sentence candidates, and selecting the final output sentence from these candidates. This simple approach has outperformed other existing, more complex methods. This paper focuses on the retrieving and reranking steps. In the retrieving step, we propose retrieving similar target language sentences from a target language monolingual translation memory using language-independent sentence embeddings generated by mSBERT or LaBSE. We demonstrate that this approach significantly outperforms existing methods that use monolingual inter-sentence similarity measures such as edit distance, which are only applicable to a parallel translation memory. In the reranking step, we propose a new reranking score for selecting the best sentence, which considers both the log-likelihood of each candidate and the sentence-embedding-based similarity between the input and the candidate. We evaluated the proposed method on English-to-Japanese translation with the ASPEC corpus and English-to-French translation with the EU Bookshop Corpus (EUBC). The proposed method significantly exceeded the baseline in BLEU score, notably with a 1.4-point improvement on the EUBC dataset over the original retrieve-edit-rerank method.
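A combined reranking score of the kind this abstract describes might be sketched as below. The interpolation weight, the length normalization, and the candidate fields are illustrative assumptions, not the paper’s exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rerank(candidates, input_emb, alpha=0.5):
    """Sort candidates by an interpolation of length-normalized
    log-likelihood and input-candidate embedding similarity.
    Each candidate is a dict with 'tokens', 'logprob', and 'emb'."""
    def score(cand):
        loglik = cand["logprob"] / max(len(cand["tokens"]), 1)
        return alpha * loglik + (1 - alpha) * cosine(cand["emb"], input_emb)
    return sorted(candidates, key=score, reverse=True)
```

In practice the embeddings would come from a model like mSBERT or LaBSE rather than the toy vectors used here.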
pdf
bib
abs
Leveraging Highly Accurate Word Alignment for Low Resource Translation by Pretrained Multilingual Model
Jingyi Zhu
|
Minato Kondo
|
Takuya Tamura
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Recently, there has been growing interest in pretrained models in the field of natural language processing. As opposed to models trained from scratch, pretrained models have been shown to produce superior results in low-resource translation tasks. In this paper, we introduce the use of pretrained seq2seq models for preordering and translation tasks. We utilized manual word alignment data and mBERT-generated word alignment data for training preordering, and compared the effectiveness of various types of mT5 and mBART models for preordering. For the translation task, we chose mBART as our baseline model and evaluated several input manners. Our approach was evaluated on the Asian Language Treebank dataset, consisting of 20,000 parallel sentences in Japanese, English, and Hindi, where Japanese is on either the source or the target side. We also used 3,000 in-house parallel sentences in Chinese and Japanese. The results indicate that mT5-large trained with manual word alignment achieved a preordering performance exceeding a 0.9 RIBES score on the Ja-En and Ja-Zh pairs. Moreover, our proposed approach significantly outperformed the baseline model in most translation directions of the Ja-En, Ja-Zh, and Ja-Hi pairs in at least one of the BLEU/COMET scores.
pdf
bib
abs
Headline Generation for Stock Price Fluctuation Articles
Shunsuke Nishida
|
Yuki Zenimoto
|
Xiaotian Wang
|
Takuya Tamura
|
Takehito Utsuro
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing
The purpose of this paper is to construct a model for generating sophisticated headlines for stock price fluctuation articles, derived from the articles’ content. For this headline generation objective, the paper solves three distinct tasks: in addition to generating article headlines, the two further tasks of extracting security names and ascertaining whether stock prices are rising or declining. For the headline generation task, we also revise the task so that the model utilizes the outcomes of the security name extraction and rise/decline determination tasks, in order to prevent the inclusion of erroneous security names. We employed state-of-the-art pre-trained models from the field of natural language processing, fine-tuning these models on each task to enhance their precision. The dataset used for fine-tuning comprises a collection of articles delineating the rise and decline of stock prices. Consequently, we achieved remarkably high accuracy in the dual tasks of security name extraction and stock price rise/decline determination. For the headline generation task, a significant portion of the test data yielded fitting headlines.
2022
pdf
bib
Speaker Identification of Quotes in Japanese Novels based on Gender Classification Model by BERT
Yuki Zenimoto
|
Takehito Utsuro
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation
pdf
bib
Developing and Evaluating a Dataset for How-to Tip Machine Reading at Scale
Fuzhu Zhu
|
Shuting Bai
|
Tingxuan Li
|
Takehito Utsuro
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation
pdf
bib
Tweet Review Mining focusing on Celebrities by Machine Reading Comprehension based on BERT
Yuta Nozaki
|
Kotoe Sugawara
|
Yuki Zenimoto
|
Takehito Utsuro
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation
pdf
bib
abs
Detecting Causes of Stock Price Rise and Decline by Machine Reading Comprehension with BERT
Gakuto Tsutsumi
|
Takehito Utsuro
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022
In this paper, we focus on news reported when stock prices fluctuate significantly. Such news is a very useful source of information on what factors cause stock prices to change. However, because it is manually produced, not all events that cause stock prices to change are necessarily reported. Thus, in order to provide investors with information on the causes of stock price changes, it is necessary to develop a system that collects, from the Internet, information on events that could be closely related to the stock price changes of certain companies. As a first step towards developing such a system, this paper takes the approach of employing a BERT-based machine reading comprehension model, which extracts causes of stock price rise and decline from news reports on stock price changes. In the evaluation, the approach of using the title of the article as the question for machine reading comprehension performs well. We show that the fine-tuned machine reading comprehension model successfully detects additional causes of stock price rise and decline beyond those stated in the title of the article.
2020
pdf
bib
abs
MRC Examples Answerable by BERT without a Question Are Less Effective in MRC Model Training
Hongyu Li
|
Tengyang Chen
|
Shuting Bai
|
Takehito Utsuro
|
Yasuhide Kawada
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop
Models developed for Machine Reading Comprehension (MRC) are asked to predict an answer from a question and its related context. However, there exist cases that can be correctly answered by an MRC model using BERT when only the context is provided, without the question. In this paper, these types of examples are referred to as “easy to answer”, while the others are “hard to answer”, i.e., unanswerable by an MRC model using BERT without the question. Based on classifying examples as answerable or unanswerable by BERT without the question, we propose a BERT-based method that splits the training examples of the MRC dataset SQuAD1.1 into those that are “easy to answer” and those that are “hard to answer”. An experimental comparison of two models, one trained only with “easy to answer” examples and the other with “hard to answer” examples, demonstrates that the latter outperforms the former.
pdf
bib
abs
Automatic Annotation of Werewolf Game Corpus with Players Revealing Oneselves as Seer/Medium and Divination/Medium Results
Youchao Lin
|
Miho Kasamatsu
|
Tengyang Chen
|
Takuya Fujita
|
Huanjin Deng
|
Takehito Utsuro
Workshop on Games and Natural Language Processing
While playing the communication game “Are You a Werewolf”, a player constantly guesses other players’ roles through discussion, based on their own role and other players’ crucial utterances. The underlying goal of this paper is to construct an agent that can analyze the participating players’ utterances and play the werewolf game as a human would. As a step towards this goal, this paper studies how to accumulate werewolf game log data annotated with the identification of players revealing themselves as seer/medium, the acts of divination and mediumship, and the declaration of divination and medium results. We divide the whole task into four subtasks, apply CNN/SVM classifiers to each subtask, and evaluate their performance.
pdf
bib
Text Mining of Evidence on Infants’ Developmental Stages for Developmental Order Acquisition from Picture Book Reviews
Miho Kasamatsu
|
Takehito Utsuro
|
Yu Saito
|
Yumiko Ishikawa
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation
pdf
bib
abs
Integrating Disfluency-based and Prosodic Features with Acoustics in Automatic Fluency Evaluation of Spontaneous Speech
Huaijin Deng
|
Youchao Lin
|
Takehito Utsuro
|
Akio Kobayashi
|
Hiromitsu Nishizaki
|
Junichi Hoshino
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper describes automatic fluency evaluation of spontaneous speech. For this task, we integrate diverse acoustic, prosodic, and disfluency-based features, and attempt to reveal the contribution of each of these features. Although a variety of disfluencies are observed regularly in spontaneous speech, we focus on two types of phenomena, i.e., filled pauses and word fragments. The experimental results demonstrate that disfluency-based features derived from word fragments and filled pauses are effective for evaluating fluent/disfluent speech, especially when combined with prosodic features such as speech rate and pauses/silence. Next, we employed an LSTM-based framework to integrate the disfluency-based and prosodic features with time-sequential acoustic features. The evaluation results of these integrated features indicate that time-sequential acoustic features improve the model with disfluency-based and prosodic features when detecting fluent speech, but not when detecting disfluent speech. Furthermore, when detecting disfluent speech, the model without time-sequential acoustic features performs best even without word-fragment features, i.e., with only filled pauses and prosodic features.
pdf
bib
abs
University of Tsukuba’s Machine Translation System for IWSLT20 Open Domain Translation Task
Hongyi Cui
|
Yizhen Wei
|
Shohei Iida
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of the 17th International Conference on Spoken Language Translation
In this paper, we introduce the University of Tsukuba’s submission to the IWSLT20 Open Domain Translation Task. We participate in both the Chinese→Japanese and Japanese→Chinese directions. For both directions, our machine translation systems are based on the Transformer architecture. Several techniques are integrated to boost the performance of our models: data filtering, large-scale noised training, model ensembling, reranking, and postprocessing. Consequently, our efforts achieve BLEU scores of 33.0 for Chinese→Japanese translation and 32.3 for Japanese→Chinese translation.
pdf
bib
abs
Developing a How-to Tip Machine Comprehension Dataset and its Evaluation in Machine Comprehension by BERT
Tengyang Chen
|
Hongyu Li
|
Miho Kasamatsu
|
Takehito Utsuro
|
Yasuhide Kawada
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)
In the field of factoid question answering (QA), the state of the art is known to have achieved accuracy comparable to that of humans on a certain benchmark challenge. In the area of non-factoid QA, on the other hand, there is still only a limited number of datasets for training QA models, i.e., machine comprehension models. Considering this situation within the field of non-factoid QA, this paper aims to develop a dataset for training Japanese how-to tip QA models. We apply one of the state-of-the-art machine comprehension models to the Japanese how-to tip QA dataset. The trained how-to tip QA model is also compared with a factoid QA model trained on a Japanese factoid QA dataset. Evaluation results revealed that how-to tip machine comprehension performance was almost comparable to that of factoid machine comprehension even with the training data reduced to around 4% of the factoid training data. Thus, the how-to tip machine comprehension task requires much less training data than the factoid machine comprehension task.
2019
pdf
bib
abs
Attention over Heads: A Multi-Hop Attention for Neural Machine Translation
Shohei Iida
|
Ryuichiro Kimura
|
Hongyi Cui
|
Po-Hsuan Hung
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
In this paper, we propose a multi-hop attention for the Transformer. It refines the attention for an output symbol by integrating that of each head, and consists of two hops. The first hop is the scaled dot-product attention, the same attention mechanism used in the original Transformer. The second hop is a combination of multi-layer perceptron (MLP) attention and a head gate, which efficiently increases the complexity of the model by adding dependencies between heads. We demonstrate that the translation accuracy of the proposed multi-hop attention significantly outperforms the baseline Transformer, by +0.85 BLEU points on the IWSLT-2017 German-to-English task and +2.58 BLEU points on the WMT-2017 German-to-English task. We also find that a multi-hop attention requires fewer parameters than stacking another self-attention layer, and that the proposed model converges significantly faster than the original Transformer.
pdf
bib
abs
Mixed Multi-Head Self-Attention for Neural Machine Translation
Hongyi Cui
|
Shohei Iida
|
Po-Hsuan Hung
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of the 3rd Workshop on Neural Generation and Translation
Recently, the Transformer has become a state-of-the-art architecture in the field of neural machine translation (NMT). A key point of its high performance is the multi-head self-attention, which is supposed to allow the model to independently attend to information from different representation subspaces. However, there is no explicit mechanism to ensure that different attention heads indeed capture different features, and in practice redundancy occurs across multiple heads. In this paper, we argue that using the same global attention in multiple heads limits multi-head self-attention’s capacity for learning distinct features. In order to improve the expressiveness of multi-head self-attention, we propose a novel Mixed Multi-Head Self-Attention (MMA) that models not only global and local attention but also forward and backward attention in different attention heads. This enables the model to learn distinct representations explicitly among multiple heads. In our experiments on both the WAT17 English-Japanese and the IWSLT14 German-English translation tasks, we show that, without increasing the number of parameters, our models yield consistent and significant improvements (0.9 BLEU points on average) over the strong Transformer baseline.
pdf
bib
Selecting Informative Context Sentence by Forced Back-Translation
Ryuichiro Kimura
|
Shohei Iida
|
Hongyi Cui
|
Po-Hsuan Hung
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of Machine Translation Summit XVII: Research Track
pdf
bib
Proceedings of the 8th Workshop on Patent and Scientific Literature Translation
Takehito Utsuro
|
Katsuhito Sudoh
|
Takashi Tsunakawa
Proceedings of the 8th Workshop on Patent and Scientific Literature Translation
pdf
bib
A Multi-Hop Attention for RNN based Neural Machine Translation
Shohei Iida
|
Ryuichiro Kimura
|
Hongyi Cui
|
Po-Hsuan Hung
|
Takehito Utsuro
|
Masaaki Nagata
Proceedings of the 8th Workshop on Patent and Scientific Literature Translation
2018
pdf
bib
abs
Measuring Beginner Friendliness of Japanese Web Pages explaining Academic Concepts by Integrating Neural Image Feature and Text Features
Hayato Shiokawa
|
Kota Kawaguchi
|
Bingcai Han
|
Takehito Utsuro
|
Yasuhide Kawada
|
Masaharu Yoshioka
|
Noriko Kando
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications
Search engines are an important tool for modern academic study, but their results lack any measure of beginner friendliness. In order to improve the efficiency of using search engines for academic study, it is necessary to devise a technique for measuring the beginner friendliness of a Web page explaining academic concepts and to build an automatic measurement system. This paper studies how to integrate heterogeneous features, such as a neural image feature generated from the image of the Web page by a variant of CNN (convolutional neural network), as well as text features extracted from the body text of the HTML file of the Web page. Integration is performed through the framework of SVM classifier learning. Evaluation results show that the heterogeneous features perform better than each individual type of feature.
2017
pdf
bib
Neural Machine Translation Model with a Large Vocabulary Selected by Branching Entropy
Zi Long
|
Ryuichiro Kimura
|
Takehito Utsuro
|
Tomoharu Mitsuhashi
|
Mikio Yamamoto
Proceedings of Machine Translation Summit XVI: Research Track
pdf
bib
abs
Patent NMT integrated with Large Vocabulary Phrase Translation by SMT at WAT 2017
Zi Long
|
Ryuichiro Kimura
|
Takehito Utsuro
|
Tomoharu Mitsuhashi
|
Mikio Yamamoto
Proceedings of the 4th Workshop on Asian Translation (WAT2017)
Neural machine translation (NMT) cannot handle a large vocabulary because the training and decoding complexity increases proportionally with the number of target words. This problem becomes even more serious when translating patent documents, which contain many technical terms that are observed infrequently. Long et al. (2017) proposed selecting phrases that contain out-of-vocabulary words using the statistical approach of branching entropy. The selected phrases are then replaced with tokens during training and post-translated by the phrase translation table of SMT. In this paper, we apply the method of Long et al. (2017) to the WAT 2017 Japanese-Chinese and Japanese-English patent datasets. Evaluation on Japanese-to-Chinese, Chinese-to-Japanese, Japanese-to-English, and English-to-Japanese patent sentence translation proved the effectiveness of the phrases selected with branching entropy: the NMT model of Long et al. (2017) achieves a substantial improvement over a baseline NMT model without the proposed technique.
2016
pdf
bib
abs
Translation of Patent Sentences with a Large Vocabulary of Technical Terms Using Neural Machine Translation
Zi Long
|
Takehito Utsuro
|
Tomoharu Mitsuhashi
|
Mikio Yamamoto
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
Neural machine translation (NMT), a new approach to machine translation, has achieved promising results comparable to those of traditional approaches such as statistical machine translation (SMT). Despite its recent success, NMT cannot handle a large vocabulary because training complexity and decoding complexity increase proportionally with the number of target words. This problem becomes even more serious when translating patent documents, which contain many technical terms that are observed infrequently. In NMT, words that are out of vocabulary are represented by a single unknown token. In this paper, we propose a method that enables NMT to translate patent sentences comprising a large vocabulary of technical terms. We train an NMT system on bilingual data in which technical terms are replaced with technical-term tokens; this allows it to translate most of the source sentence except the technical terms. Further, we use the system as a decoder to translate source sentences with technical-term tokens and replace the tokens with technical term translations using SMT. We also use it to rerank the 1,000-best SMT translations on the basis of the average of the SMT score and the NMT rescoring of the translated sentences with technical-term tokens. Our experiments on Japanese-Chinese patent sentences show that the proposed NMT system achieves a substantial improvement of up to 3.1 BLEU points and 2.3 RIBES points over traditional SMT systems, and an improvement of approximately 0.6 BLEU points and 0.8 RIBES points over an equivalent NMT system without our proposed technique.
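The replace-then-post-translate scheme can be illustrated with a minimal sketch. The placeholder token format and the plain dict standing in for the SMT phrase translation table are assumptions for illustration, not the paper’s implementation.

```python
def mask_terms(sentence, terms):
    """Replace known technical terms with indexed placeholder tokens;
    return the masked sentence and the token-to-term mapping."""
    mapping = {}
    index = 0
    for term in terms:
        if term in sentence:
            token = f"<term{index}>"
            sentence = sentence.replace(term, token)
            mapping[token] = term
            index += 1
    return sentence, mapping

def unmask(translation, mapping, phrase_table):
    """Post-translate each placeholder token using a phrase table
    (here a plain dict stands in for the SMT translation table)."""
    for token, src_term in mapping.items():
        translation = translation.replace(token, phrase_table.get(src_term, src_term))
    return translation
```

The NMT model sees only the masked sentences at training and decoding time, so rare technical terms never enter its vocabulary; the SMT phrase table fills them back in afterwards.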
pdf
bib
abs
Analyzing Time Series Changes of Correlation between Market Share and Concerns on Companies measured through Search Engine Suggests
Takakazu Imada
|
Yusuke Inoue
|
Lei Chen
|
Syunya Doi
|
Tian Nie
|
Chen Zhao
|
Takehito Utsuro
|
Yasuhide Kawada
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper proposes how to utilize a search engine to predict market shares. Given a specific product domain, we propose comparing the rates of concern among Web searchers across several companies that supply products in that domain, measuring those concerns through search engine suggests. We then analyze whether the rates of searchers’ concerns correlate with actual market share, and show that these statistics exhibit certain correlations. We finally propose how to predict the market share of a specific product genre based on the rates of searchers’ concerns.
2015
pdf
bib
Collecting bilingual technical terms from patent families of character-segmented Chinese sentences and morpheme-segmented Japanese sentences
Zi Long
|
Takehito Utsuro
|
Tomoharu Mitsuhashi
|
Mikio Yamamoto
Proceedings of the 6th Workshop on Patent and Scientific Literature Translation
pdf
bib
Evaluating Features for Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families
Zi Long
|
Takehito Utsuro
|
Tomoharu Mitsuhashi
|
Mikio Yamamoto
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
pdf
bib
Detecting an Infant’s Developmental Reactions in Reviews on Picture Books
Hiroshi Uehara
|
Mizuho Baba
|
Takehito Utsuro
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters
2013
pdf
bib
Time Series Topic Modeling and Bursty Topic Detection of Correlated News and Twitter
Daichi Koike
|
Yusuke Takahashi
|
Takehito Utsuro
|
Masaharu Yoshioka
|
Noriko Kando
Proceedings of the Sixth International Joint Conference on Natural Language Processing
pdf
bib
Compositional translation of technical terms by integrating patent families as a parallel corpus and a comparable corpus
Itsuki Toyota
|
Zi Long
|
Lijuan Dong
|
Takehito Utsuro
|
Mikio Yamamoto
Proceedings of the 5th Workshop on Patent Translation
2012
pdf
bib
Cross-Lingual Topic Alignment in Time Series Japanese / Chinese News
Shuo Hu
|
Yusuke Takahashi
|
Liyi Zheng
|
Takehito Utsuro
|
Masaharu Yoshioka
|
Noriko Kando
|
Tomohiro Fukuhara
|
Hiroshi Nakagawa
|
Yoji Kiyota
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation
pdf
bib
abs
Detecting Japanese Compound Functional Expressions using Canonical/Derivational Relation
Takafumi Suzuki
|
Yusuke Abe
|
Itsuki Toyota
|
Takehito Utsuro
|
Suguru Matsuyoshi
|
Masatoshi Tsuchiya
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The Japanese language has various types of functional expressions. To organize Japanese functional expressions with various surface forms, a lexicon of Japanese functional expressions with a hierarchical organization was compiled. This paper proposes a framework for identifying more than 16,000 functional expressions in Japanese texts by utilizing the hierarchical organization of the lexicon. In our framework, these expressions are roughly divided into canonical and derived functional expressions. Each derived functional expression is identified by referring to the most similar occurrence of its canonical expression: contextual occurrence information of the much smaller set of canonical expressions is expanded to the whole set of derived forms and utilized when identifying those derived expressions. We also empirically show that the proposed method can correctly identify more than 80% of functional/content usages with fewer than 38,000 training instances of manually identified canonical expressions.
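The canonical/derivational lookup described above can be sketched as a two-step dictionary: derived surface forms map to canonical expressions, and a classifier trained only on canonical occurrences is reused for the derived forms. The lexicon entries and labels below are invented placeholders (the real lexicon covers 16,000+ expressions):

```python
# derived surface form -> canonical expression (tiny invented sample)
derivation = {
    "にたいする": "に対する",
    "に対しまして": "に対して",
    "に対する": "に対する",   # canonical forms map to themselves
}

# stub for a contextual model trained only on canonical expressions:
# canonical form -> usage label learned from annotated occurrences
canonical_model = {
    "に対する": "functional",
    "に対して": "functional",
}

def classify(surface):
    """Identify a (possibly derived) expression via its canonical form."""
    canonical = derivation.get(surface)
    if canonical is None:
        return "content"          # unknown surface form: content usage
    return canonical_model.get(canonical, "content")
```

The point of the design is that annotation effort is spent only on the much smaller canonical set, while coverage extends to every derived form.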
2011
pdf
bib
Semi-Automatic Identification of Bilingual Synonymous Technical Terms from Phrase Tables and Parallel Patent Sentences
Bing Liang
|
Takehito Utsuro
|
Mikio Yamamoto
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation
pdf
bib
Example-based Translation of Japanese Functional Expressions utilizing Semantic Equivalence Classes
Yusuke Abe
|
Takafumi Suzuki
|
Bing Liang
|
Takehito Utsuro
|
Mikio Yamamoto
|
Suguru Matsuyoshi
|
Yasuhide Kawada
Proceedings of the 4th Workshop on Patent Translation
2010
pdf
bib
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)
Sadao Kurohashi
|
Takehito Utsuro
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)
pdf
bib
abs
Utilizing Semantic Equivalence Classes of Japanese Functional Expressions in Translation Rule Acquisition from Parallel Patent Sentences
Taiji Nagasaka
|
Ran Shimanouchi
|
Akiko Sakamoto
|
Takafumi Suzuki
|
Yohei Morishita
|
Takehito Utsuro
|
Suguru Matsuyoshi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In the ``Sandglass'' MT architecture, we identify the class of monosemous Japanese functional expressions and utilize it in the task of translating Japanese functional expressions into English. We employ the semantic equivalence classes of a recently compiled large-scale hierarchical lexicon of Japanese functional expressions, and study whether the functional expressions within a class can be translated into a single canonical English expression. Based on the results of identifying monosemous semantic equivalence classes, this paper studies how to extract rules for translating functional expressions in Japanese patent documents into English. We use about 1.8M Japanese-English parallel sentences automatically extracted from Japanese-English patent families, which are distributed through the Patent Translation Task at the NTCIR-7 Workshop. Then, Moses, a toolkit for the phrase-based SMT (Statistical Machine Translation) model, is applied, and Japanese-English translation pairs are obtained in the form of a phrase translation table. Finally, we extract translation pairs of Japanese functional expressions from the phrase translation table. Through this study, we found that most of the semantic equivalence classes judged as monosemous based on manual translation into English have only one translation rule, even in the patent domain.
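The final extraction step above — pulling functional-expression translation pairs out of a Moses-style phrase translation table — can be sketched as a simple filter. The table rows, expression list, and probability threshold here are illustrative placeholders, not the paper's actual data:

```python
# Hypothetical Moses-style phrase table rows:
# (japanese phrase, english phrase, translation probability)
phrase_table = [
    ("に関する", "relating to", 0.42),
    ("に関する", "concerning", 0.31),
    ("装置", "apparatus", 0.88),
]

# Japanese functional expressions from monosemous equivalence classes
functional_expressions = {"に関する", "に対して"}

def extract_pairs(table, expressions, min_prob=0.3):
    """Keep table entries whose Japanese side is a functional
    expression and whose probability clears the threshold."""
    return [(ja, en, p) for ja, en, p in table
            if ja in expressions and p >= min_prob]

pairs = extract_pairs(phrase_table, functional_expressions)
```

Entries whose source side is ordinary content vocabulary (here, "装置") are left out, so the result contains only candidate translation rules for functional expressions.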
2009
pdf
bib
Towards Conceptual Indexing of the Blogosphere through Wikipedia Topic Hierarchy
Mariko Kawaba
|
Daisuke Yokomoto
|
Hiroyuki Nakasaki
|
Takehito Utsuro
|
Tomohiro Fukuhara
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2
pdf
bib
Identifying and Utilizing the Class of Monosemous Japanese Functional Expressions in Machine Translation
Akiko Sakamoto
|
Taiji Nagasaka
|
Takehito Utsuro
|
Suguru Matsuyoshi
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2
pdf
bib
Exploiting Patent Information for the Evaluation of Machine Translation
Atsushi Fujii
|
Masao Utiyama
|
Mikio Yamamoto
|
Takehito Utsuro
Proceedings of the Third Workshop on Patent Translation
pdf
bib
Meta-evaluation of Automatic Evaluation Methods for Machine Translation using Patent Translation Data in NTCIR-7
Hiroshi Echizen-ya
|
Terumasa Ehara
|
Sayori Shimohata
|
Atsushi Fujii
|
Masao Utiyama
|
Mikio Yamamoto
|
Takehito Utsuro
|
Noriko Kando
Proceedings of the Third Workshop on Patent Translation
2008
pdf
bib
abs
Producing a Test Collection for Patent Machine Translation in the Seventh NTCIR Workshop
Atsushi Fujii
|
Masao Utiyama
|
Mikio Yamamoto
|
Takehito Utsuro
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Aiming at research and development on machine translation, we produced a test collection for Japanese-English machine translation in the seventh NTCIR Workshop. This paper describes the details of our test collection. From patent documents published in Japan and the United States, we extracted patent families as a parallel corpus. A patent family is a set of patent documents for the same or related inventions; these documents are usually filed in more than one country in different languages. In the parallel corpus, we aligned Japanese sentences with their counterpart English sentences. Our test collection, which includes approximately 2,000,000 sentence pairs, can be used to train and test machine translation systems. It also includes search topics for cross-lingual patent retrieval, so the contribution of machine translation to a patent retrieval task can also be evaluated. Our test collection will be available to the public for research purposes after the NTCIR final meeting.
pdf
bib
abs
Toward the Evaluation of Machine Translation Using Patent Information
Atsushi Fujii
|
Masao Utiyama
|
Mikio Yamamoto
|
Takehito Utsuro
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers
To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2,000,000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and preliminary experiments.
pdf
bib
abs
Integrating a Phrase-based SMT Model and a Bilingual Lexicon for Semi-Automatic Acquisition of Technical Term Translation Lexicons
Yohei Morishita
|
Takehito Utsuro
|
Mikio Yamamoto
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers
This paper presents an attempt at developing a technique for acquiring translation pairs of technical terms with sufficiently high precision from parallel patent documents. The proposed technique integrates the phrase translation table of a state-of-the-art statistical phrase-based machine translation model with compositional translation generation based on an existing bilingual lexicon for human use. Our evaluation results clearly show that agreement between the two individual techniques definitely contributes to improving the precision of translation candidates. We then apply Support Vector Machines (SVMs) to the task of automatically validating translation candidates in the phrase translation table. Experimental evaluation results again show that the SVM-based approach to translation candidate validation can contribute to improving the precision of translation candidates in the phrase translation table.
2007
pdf
bib
Learning Dependency Relations of Japanese Compound Functional Expressions
Takehito Utsuro
|
Takao Shime
|
Masatoshi Tsuchiya
|
Suguru Matsuyoshi
|
Satoshi Sato
Proceedings of the Workshop on A Broader Perspective on Multiword Expressions
2006
pdf
bib
Japanese Idiom Recognition: Drawing a Line between Literal and Idiomatic Meanings
Chikara Hashimoto
|
Satoshi Sato
|
Takehito Utsuro
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions
pdf
bib
Compiling French-Japanese Terminologies from the Web
Xavier Robitaille
|
Yasuhiro Sasaki
|
Masatsugu Tonoike
|
Satoshi Sato
|
Takehito Utsuro
11th Conference of the European Chapter of the Association for Computational Linguistics
pdf
bib
Adjective-to-Verb Paraphrasing in Japanese Based on Lexical Constraints of Verbs
Atsushi Fujita
|
Naruaki Masuno
|
Satoshi Sato
|
Takehito Utsuro
Proceedings of the Fourth International Natural Language Generation Conference
pdf
bib
A comparative study on compositional translation estimation using a domain/topic-specific corpus collected from the Web
Masatsugu Tonoike
|
Mitsuhiro Kida
|
Toshihiro Takagi
|
Yasuhiro Sasaki
|
Takehito Utsuro
|
Satoshi Sato
Proceedings of the 2nd International Workshop on Web as Corpus
pdf
bib
Chunking Japanese Compound Functional Expressions by Machine Learning
Masatoshi Tsuchiya
|
Takao Shime
|
Toshihiro Takagi
|
Takehito Utsuro
|
Kiyotaka Uchimoto
|
Suguru Matsuyoshi
|
Satoshi Sato
|
Seiichi Nakagawa
Proceedings of the Workshop on Multi-word-expressions in a multilingual context
2005
pdf
bib
Effect of Domain-Specific Corpus in Compositional Translation Estimation for Technical Terms
Masatsugu Tonoike
|
Mitsuhiro Kida
|
Toshihiro Takagi
|
Yasuhiro Sasaki
|
Takehito Utsuro
|
Satoshi Sato
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts
2004
pdf
bib
Answer validation by keyword association
Masatsugu Tonoike
|
Takehito Utsuro
|
Satoshi Sato
Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2004)
pdf
bib
Integrating Cross-Lingually Relevant News Articles and Monolingual Web Documents in Bilingual Lexicon Acquisition
Takehito Utsuro
|
Kohei Hino
|
Mitsuhiro Kida
|
Seiichi Nakagawa
|
Satoshi Sato
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics
pdf
bib
An Empirical Study on Multiple LVCSR Model Combination by Machine Learning
Takehito Utsuro
|
Yasuhiro Kodama
|
Tomohiro Watanabe
|
Hiromitsu Nishizaki
|
Seiichi Nakagawa
Proceedings of HLT-NAACL 2004: Short Papers
2003
pdf
bib
Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora
Takehito Utsuro
|
Takashi Horiuchi
|
Kohei Hino
|
Takeshi Hamamoto
|
Takeaki Nakayama
10th Conference of the European Chapter of the Association for Computational Linguistics
2002
pdf
bib
Combining Outputs of Multiple Japanese Named Entity Chunkers by Stacking
Takehito Utsuro
|
Manabu Sassano
|
Kiyotaka Uchimoto
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)
pdf
bib
A Web-based English Abstract Writing Tool Using a Tagged E-J Parallel Corpus
Masumi Narita
|
Kazuya Kurokawa
|
Takehito Utsuro
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
pdf
bib
abs
Semi-automatic compilation of bilingual lexicon entries from cross-lingually relevant news articles on WWW news sites
Takehito Utsuro
|
Takashi Horiuchi
|
Yasunobu Chiba
|
Takeshi Hamamoto
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers
To overcome the resource-scarcity bottleneck in corpus-based translation knowledge acquisition research, this paper takes the approach of semi-automatically acquiring domain-specific translation knowledge from collections of bilingual news articles on WWW news sites. We present the results of applying standard co-occurrence-frequency-based techniques for estimating bilingual term correspondences from parallel corpora to relevant article pairs automatically collected from WWW news sites. The experimental evaluation results are very encouraging, showing that many useful bilingual term correspondences can be efficiently discovered with little human intervention from relevant article pairs on WWW news sites.
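One standard co-occurrence-frequency-based estimator of the kind this abstract refers to is the Dice coefficient computed over automatically paired articles. This sketch uses invented article pairs and terms; the actual paper's feature set and pairing method may differ:

```python
def dice(term_ja, term_en, article_pairs):
    """Dice coefficient between a Japanese and an English term.
    article_pairs: list of (japanese_terms, english_terms) sets,
    one pair per cross-lingually relevant article pair."""
    both = sum(1 for ja, en in article_pairs
               if term_ja in ja and term_en in en)
    f_ja = sum(1 for ja, _ in article_pairs if term_ja in ja)
    f_en = sum(1 for _, en in article_pairs if term_en in en)
    if f_ja + f_en == 0:
        return 0.0
    return 2.0 * both / (f_ja + f_en)

# Invented relevant article pairs for illustration.
article_pairs = [
    ({"首相", "経済"}, {"prime minister", "economy"}),
    ({"首相"}, {"prime minister"}),
    ({"経済"}, {"economy", "market"}),
]
score = dice("首相", "prime minister", article_pairs)  # 2*2/(2+2) = 1.0
```

Term pairs whose score clears a threshold become candidate bilingual lexicon entries for a human to validate, which is what makes the compilation semi-automatic.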
2000
pdf
bib
Analyzing Dependencies of Japanese Subordinate Clauses based on Statistics of Scope Embedding Preference
Takehito Utsuro
|
Shigeyuki Nishiokayama
|
Masakazu Fujio
|
Yuji Matsumoto
1st Meeting of the North American Chapter of the Association for Computational Linguistics
pdf
bib
Learning Preference of Dependency between Japanese Subordinate Clauses and its Evaluation in Parsing
Takehito Utsuro
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
pdf
bib
Minimally Supervised Japanese Named Entity Recognition: Resources and Evaluation
Takehito Utsuro
|
Manabu Sassano
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
pdf
bib
IPA Japanese Dictation Free Software Project
Katsunobu Itou
|
Kiyohiro Shikano
|
Tatsuya Kawahara
|
Kazuya Takeda
|
Atsushi Yamada
|
Akinori Itou
|
Takehito Utsuro
|
Tetsunori Kobayashi
|
Nobuaki Minematsu
|
Mikio Yamamoto
|
Shigeki Sagayama
|
Akinobu Lee
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
pdf
bib
Named Entity Chunking Techniques in Supervised Learning for Japanese Named Entity Recognition
Manabu Sassano
|
Takehito Utsuro
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics
1998
pdf
bib
General-to-Specific Model Selection for Subcategorization Preference
Takehito Utsuro
|
Takashi Miyata
|
Yuji Matsumoto
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2
pdf
bib
General-to-Specific Model Selection for Subcategorization Preference
Takehito Utsuro
|
Takashi Miyata
|
Yuji Matsumoto
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics
1997
pdf
bib
Learning Probabilistic Subcategorization Preference by Identifying Case Dependencies and Optimal Noun Class Generalization Level
Takehito Utsuro
|
Yuji Matsumoto
Fifth Conference on Applied Natural Language Processing
pdf
bib
Maximum Entropy Model Learning of Subcategorization Preference
Takehito Utsuro
|
Takashi Miyata
Fifth Workshop on Very Large Corpora
1996
pdf
bib
Sense Classification of Verbal Polysemy based on Bilingual Class/Class Association
Takehito Utsuro
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics
1994
pdf
bib
Thesaurus-based Efficient Example Retrieval by Generating Retrieval Queries from Similarities
Takehito Utsuro
|
Kiyotaka Uchimoto
|
Mitsutaka Matsumoto
|
Makoto Nagao
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics
pdf
bib
Bilingual Text Matching using Bilingual Dictionary and Statistics
Takehito Utsuro
|
Hiroshi Ikeda
|
Masaya Yamane
|
Yuji Matsumoto
|
Makoto Nagao
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics
1993
pdf
bib
Structural Matching of Parallel Texts
Yuji Matsumoto
|
Takehito Utsuro
|
Hiroyuki Ishimoto
31st Annual Meeting of the Association for Computational Linguistics
1992
pdf
bib
Lexical Knowledge Acquisition from Bilingual Corpora
Takehito Utsuro
|
Yuji Matsumoto
|
Makoto Nagao
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics