Hideki Tanaka


NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages
Francois Meyer | Haiyue Song | Abhisek Chakrabarty | Jan Buys | Raj Dabre | Hideki Tanaka
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The Nguni languages have over 20 million home language speakers in South Africa. There has been considerable growth in the datasets for Nguni languages, but so far no analysis of the performance of NLP models for these languages has been reported across languages and tasks. In this paper we study pretrained language models for the 4 Nguni languages - isiXhosa, isiZulu, isiNdebele, and Siswati. We compile publicly available datasets for natural language understanding and generation, spanning 6 tasks and 11 datasets. This benchmark, which we call NGLUEni, is the first centralised evaluation suite for the Nguni languages, allowing us to systematically evaluate the Nguni-language capabilities of pretrained language models (PLMs). Besides evaluating existing PLMs, we develop new PLMs for the Nguni languages through multilingual adaptive finetuning. Our models, Nguni-XLMR and Nguni-ByT5, outperform their base models and large-scale adapted models, showing that performance gains are obtainable through limited language group-based adaptation. We also perform experiments on cross-lingual transfer and machine translation. Our models achieve notable cross-lingual transfer improvements in the lower resourced Nguni languages (isiNdebele and Siswati). To facilitate future use of NGLUEni as a standardised evaluation suite for the Nguni languages, we create a web portal to access the collection of datasets and publicly release our models.

Robust Neural Machine Translation for Abugidas by Glyph Perturbation
Hour Kaing | Chenchen Ding | Hideki Tanaka | Masao Utiyama
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Neural machine translation (NMT) systems are vulnerable when trained on limited data. This is a common scenario in low-resource tasks in the real world. To increase robustness, one solution is to intentionally add realistic noise in the training phase. Noise simulation using text perturbation has been proven to be efficient in writing systems that use Latin letters. In this study, we further explore perturbation techniques on more complex abugida writing systems, for which the visual similarity of complex glyphs is considered to capture the essential nature of these writing systems. Besides the generated noise, we propose a training strategy to improve robustness. We conducted experiments on six languages: Bengali, Hindi, Myanmar, Khmer, Lao, and Thai. By overcoming the introduced noise, we obtained non-degenerate NMT systems with improved robustness for low-resource tasks for abugida glyphs.

Overcoming Early Saturation on Low-Resource Languages in Multilingual Dependency Parsing
Jiannan Mao | Chenchen Ding | Hour Kaing | Hideki Tanaka | Masao Utiyama | Tadahiro Matsumoto
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

UDify is a multilingual and multi-task parser fine-tuned on mBERT that achieves remarkable performance in high-resource languages. However, the performance saturates early and decreases gradually in low-resource languages as training proceeds. This work applies a data augmentation method and conducts experiments on seven few-shot and four zero-shot languages. Unlabeled attachment scores improved on the dependency parsing tasks for the zero-shot languages, with the average score rising from 67.1% to 68.7%. Meanwhile, dependency parsing tasks for high-resource languages and other tasks were hardly affected. Experimental results indicate the data augmentation method is effective for low-resource languages in multilingual dependency parsing.


Improving Embedding Transfer for Low-Resource Machine Translation
Van Hien Tran | Chenchen Ding | Hideki Tanaka | Masao Utiyama
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Low-resource machine translation (LRMT) poses a substantial challenge due to the scarcity of parallel training data. This paper introduces a new method to improve the transfer of the embedding layer from the Parent model to the Child model in LRMT, utilizing trained token embeddings in the Parent model’s high-resource vocabulary. Our approach involves projecting all tokens into a shared semantic space and measuring the semantic similarity between tokens in the low-resource and high-resource languages. These measures are then utilized to initialize token representations in the Child model’s low-resource vocabulary. We evaluated our approach on three benchmark datasets of low-resource language pairs: Myanmar-English, Indonesian-English, and Turkish-English. The experimental results demonstrate that our method outperforms previous methods regarding translation quality. Additionally, our approach is computationally efficient, leading to reduced training time compared to prior works.

A Study on the Effectiveness of Large Language Models for Translation with Markup
Raj Dabre | Bianka Buschbeck | Miriam Exel | Hideki Tanaka
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

In this paper we evaluate the utility of large language models (LLMs) for translation of text with markup in which the most important and challenging aspect is to correctly transfer markup tags while ensuring that the content, both inside and outside tags, is correctly translated. While LLMs have been shown to be effective for plain text translation, their effectiveness for structured document translation is not well understood. To this end, we experiment with BLOOM and BLOOMZ, which are open-source multilingual LLMs, using zero, one and few-shot prompting, and compare with a domain-specific in-house NMT system using a detag-and-project approach for markup tags. We observe that LLMs with in-context learning exhibit poorer translation quality compared to the domain-specific NMT system; however, they are effective in transferring markup tags, especially the large BLOOM model (176 billion parameters). This is further confirmed by our human evaluation which also reveals the types of errors of the different tag transfer techniques. While LLM-based approaches come with the risk of losing, hallucinating and corrupting tags, they excel at placing them correctly in the translation.

Improving Zero-Shot Dependency Parsing by Unsupervised Learning
Jiannan Mao | Chenchen Ding | Hour Kaing | Hideki Tanaka | Masao Utiyama | Tadahiro Matsumoto
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Subset Retrieval Nearest Neighbor Machine Translation
Hiroyuki Deguchi | Taro Watanabe | Yusuke Matsui | Masao Utiyama | Hideki Tanaka | Eiichiro Sumita
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

k-nearest-neighbor machine translation (kNN-MT) (Khandelwal et al., 2021) boosts the translation performance of trained neural machine translation (NMT) models by incorporating example-search into the decoding algorithm. However, decoding is seriously time-consuming, i.e., roughly 100 to 1,000 times slower than standard NMT, because neighbor tokens are retrieved from all target tokens of parallel data in each timestep. In this paper, we propose “Subset kNN-MT”, which improves the decoding speed of kNN-MT by two methods: (1) retrieving neighbor target tokens from a subset that is the set of neighbor sentences of the input sentence, not from all sentences, and (2) an efficient distance computation technique that is suitable for subset neighbor search using a look-up table. Our proposed method achieved a speed-up of up to 132.2 times and an improvement in BLEU score of up to 1.6 compared with kNN-MT in the WMT’19 De-En translation task and the domain adaptation tasks in De-En and En-Ja.


A Multilingual Multiway Evaluation Data Set for Structured Document Translation of Asian Languages
Bianka Buschbeck | Raj Dabre | Miriam Exel | Matthias Huck | Patrick Huy | Raphael Rubino | Hideki Tanaka
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Translation of structured content is an important application of machine translation, but the scarcity of evaluation data sets, especially for Asian languages, limits progress. In this paper we present a novel multilingual multiway evaluation data set for the translation of structured documents of the Asian languages Japanese, Korean and Chinese. We describe the data set, its creation process and important characteristics, followed by establishing and evaluating baselines using the direct translation as well as detag-project approaches. Our data set is well suited for multilingual evaluation, and it contains richer annotation tag sets than existing data sets. Our results show that massively multilingual translation models like M2M-100 and mBART-50 perform surprisingly well despite not being explicitly trained to handle structured content. The data set described in this paper and used in our experiments is released publicly.

FeatureBART: Feature Based Sequence-to-Sequence Pre-Training for Low-Resource NMT
Abhisek Chakrabarty | Raj Dabre | Chenchen Ding | Hideki Tanaka | Masao Utiyama | Eiichiro Sumita
Proceedings of the 29th International Conference on Computational Linguistics

In this paper we present FeatureBART, a linguistically motivated sequence-to-sequence monolingual pre-training strategy in which syntactic features such as lemma, part-of-speech and dependency labels are incorporated into the span prediction based pre-training framework (BART). These automatically extracted features are incorporated via approaches such as concatenation and relevance mechanisms, among which the latter is known to be better than the former. When used for low-resource NMT as a downstream task, we show that these feature based models give large improvements in bilingual settings and modest ones in multilingual settings over their counterparts that do not use features.


Field Experiments of Real Time Foreign News Distribution Powered by MT
Keiji Yasuda | Ichiro Yamada | Naoaki Okazaki | Hideki Tanaka | Hidehiro Asaka | Takeshi Anzai | Fumiaki Sugaya
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

Field experiments on a foreign news distribution system using two key technologies are reported. The first technology is a summarization component, which is used for generating news headlines. This component is a transformer-based abstractive text summarization system which is trained to output headlines from the leading sentences of news articles. The second technology is machine translation (MT), which enables users to read foreign news articles in their mother language. Since the system uses MT, users can immediately access the latest foreign news. 139 Japanese LINE users participated in the field experiments for two weeks, viewing about 40,000 articles which had been translated from English to Japanese. We carried out surveys both during and after the experiments. According to the results, 79.3% of users evaluated the headlines as adequate, while 74.7% of users evaluated the automatically translated articles as intelligible. According to the post-experiment survey, 59.7% of users wished to continue using the system; 11.5% of users did not. We also report several statistics of the experiments.


Content-Equivalent Translated Parallel News Corpus and Extension of Domain Adaptation for NMT
Hideya Mino | Hideki Tanaka | Hitoshi Ito | Isao Goto | Ichiro Yamada | Takenobu Tokunaga
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we deal with two problems in Japanese-English machine translation of news articles. The first problem is the quality of parallel corpora. Neural machine translation (NMT) systems suffer degraded performance when trained with noisy data. Because there is no clean Japanese-English parallel data for news articles, we build a novel parallel news corpus consisting of Japanese news articles translated into English in a content-equivalent manner. This is the first content-equivalent Japanese-English news corpus translated specifically for training NMT systems. The second problem involves the domain-adaptation technique. NMT systems suffer degraded performance when trained with mixed data having different features, such as noisy data and clean data. Though the existing methods try to overcome this problem by using tags for distinguishing the differences between corpora, it is not sufficient. We thus extend a domain-adaptation method using multi-tags to train an NMT model effectively with the clean corpus and existing parallel news corpora with some types of noise. Experimental results show that our corpus increases the translation quality, and that our domain-adaptation method is more effective for learning with the multiple types of corpora than existing domain-adaptation methods are.

Neural Machine Translation Using Extracted Context Based on Deep Analysis for the Japanese-English Newswire Task at WAT 2020
Isao Goto | Hideya Mino | Hitoshi Ito | Kazutaka Kinugawa | Ichiro Yamada | Hideki Tanaka
Proceedings of the 7th Workshop on Asian Translation

This paper describes the system of the NHK-NES team for the WAT 2020 Japanese–English newswire task. There are two main problems in Japanese-English news translation: translation of dropped subjects and compatibility between equivalent translations and English news-style outputs. We address these problems by extracting subjects from the context based on predicate-argument structures and using them as additional inputs, and constructing parallel Japanese-English news sentences equivalently translated from English news sentences. The evaluation results confirm the effectiveness of our context-utilization method.


Neural Machine Translation System using a Content-equivalently Translated Parallel Corpus for the Newswire Translation Tasks at WAT 2019
Hideya Mino | Hitoshi Ito | Isao Goto | Ichiro Yamada | Hideki Tanaka | Takenobu Tokunaga
Proceedings of the 6th Workshop on Asian Translation

This paper describes NHK and NHK Engineering System (NHK-ES)’s submission to the newswire translation tasks of WAT 2019 in both directions of Japanese→English and English→Japanese. In addition to the JIJI Corpus that was officially provided by the task organizer, we developed a corpus of 0.22M sentence pairs by manually translating Japanese news sentences into English content-equivalently. The content-equivalent corpus was effective for improving translation quality, and our systems achieved the best human evaluation scores in the newswire translation tasks at WAT 2019.


Detecting Untranslated Content for Neural Machine Translation
Isao Goto | Hideki Tanaka
Proceedings of the First Workshop on Neural Machine Translation

Despite its promise, neural machine translation (NMT) has a serious problem in that source content may be mistakenly left untranslated. The ability to detect untranslated content is important for the practical use of NMT. We evaluate two types of probability with which to detect untranslated content: the cumulative attention (ATN) probability and back translation (BT) probability from the target sentence to the source sentence. Experiments on detecting untranslated content in Japanese-English patent translations show that ATN and BT are each more effective than random choice, BT is more effective than ATN, and the combination of the two provides further improvements. We also confirmed the effectiveness of using ATN and BT to rerank the n-best NMT outputs.


Japanese news simplification: task design, data set construction, and analysis of simplified text
Isao Goto | Hideki Tanaka | Tadashi Kumano
Proceedings of Machine Translation Summit XV: Papers

The “News Web Easy” news service as a resource for teaching and learning Japanese: An assessment of the comprehension difficulty of Japanese sentence-end expressions
Hideki Tanaka | Tadashi Kumano | Isao Goto
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications


Measuring the Similarity between TV Programs using Semantic Relations
Ichiro Yamada | Masaru Miyazaki | Hideki Sumiyoshi | Atsushi Matsui | Hironori Furumiya | Hideki Tanaka
Proceedings of COLING 2012


Syntax-Driven Sentence Revision for Broadcast News Summarization
Hideki Tanaka | Akinori Kinoshita | Takeshi Kobayakawa | Tadashi Kumano | Naoto Katoh
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)


Extracting phrasal alignments from comparable corpora by using joint probability SMT model
Tadashi Kumano | Hideki Tanaka | Takenobu Tokunaga
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers


Analysis and Modeling of Manual Summarization of Japanese Broadcast News
Hideki Tanaka | Tadashi Kumano | Masamichi Nishiwaki | Takayuki Itoh
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts


Back Transliteration from Japanese to English using Target English Context
Isao Goto | Naoto Kato | Terumasa Ehara | Hideki Tanaka
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics


Word Selection for EBMT based on Monolingual Similarity and Translation Confidence
Eiji Aramaki | Sadao Kurohashi | Hideki Kashioka | Hideki Tanaka
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Comparing the Sentence Alignment Yield from Two News Corpora Using a Dictionary-Based Alignment System
Stephen Nightingale | Hideki Tanaka
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Construction and Analysis of Japanese-English Broadcast News Corpus with Named Entity Tags
Tadashi Kumano | Hideki Kashioka | Hideki Tanaka | Takahiro Fukusima
Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition

Building a parallel corpus for monologues with clause alignment
Hideki Kashioka | Takehiko Maruyama | Hideki Tanaka
Proceedings of Machine Translation Summit IX: Papers

Many studies have been reported in the domain of speech-to-speech machine translation systems for travel conversation use. Therefore, a large number of travel domain corpora have become available in recent years. From a wider viewpoint, speech-to-speech systems are required for many purposes other than travel conversation. One of these is monologues (e.g., TV news, lectures, technical presentations). However, in monologues, sentences tend to be long and complicated, which often causes problems for parsing and translation. Therefore, we need a suitable translation unit, rather than the sentence. We propose the clause as a unit for translation. To develop a speech-to-speech machine translation system for monologues based on the clause as the translation unit, we need a monologue parallel corpus with clause alignment. In this paper, we describe how to build a Japanese-English monologue parallel corpus with clauses aligned, and discuss the features of this corpus.

A multi-language translation example browser
Isao Goto | Naoto Kato | Noriyoshi Uratani | Terumasa Ehara | Tadashi Kumano | Hideki Tanaka
Proceedings of Machine Translation Summit IX: System Presentations

This paper describes a Multi-language Translation Example Browser, a type of translation memory system. The system is able to retrieve translation examples from bilingual news databases, which consist of news transcripts of past broadcasts. We put a Japanese-English system to practical use and undertook trial operations of a system of eight language-pairs.


Automatic Alignment of Japanese and English Newspaper Articles using an MT System and a Bilingual Company Name Dictionary
Kenji Matsumoto | Hideki Tanaka
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)


ATR-SLT System for SENSEVAL-2 Japanese Translation Task
Tadashi Kumano | Hideki Kashioka | Hideki Tanaka
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems


An Efficient Statistical Speech Act Type Tagging System for Speech Translation Systems
Hideki Tanaka | Akio Yokoo
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics


Context Management with Topics for Spoken Dialogue Systems
Kristiina Jokinen | Hideki Tanaka | Akio Yokoo
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Context Management with Topics for Spoken Dialogue Systems
Kristiina Jokinen | Hideki Tanaka | Akio Yokoo
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

Planning Dialogue Contributions With New Information
Kristiina Jokinen | Hideki Tanaka | Akio Yokoo
Natural Language Generation


Decision Tree Learning Algorithm with Structured Attributes: Application to Verbal Case Frame Acquisition
Hideki Tanaka
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics


Verbal Case Frame Acquisition From a Bilingual Corpus: Gradual Knowledge Acquisition
Hideki Tanaka
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics


A Method of Translating English Delexical Structures Into Japanese
Hideki Tanaka | Teruaki Aizawa | Yeun-Bae Kim | Nobuko Hatada
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics


A Machine Translation System for Foreign News in Satellite Broadcasting
Teruaki Aizawa | Terumasa Ehara | Noriyoshi Uratani | Hideki Tanaka | Naoto Kato | Sumio Nakase | Norikazu Aruga | Takeo Matsuda
COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics