Minato Kondo

2025

Patent Claim Translation via Continual Pre-training of Large Language Models with Parallel Data
Haruto Azami | Minato Kondo | Takehito Utsuro | Masaaki Nagata
Proceedings of Machine Translation Summit XX: Volume 1

Recent advancements in large language models (LLMs) have enabled their application across various domains. However, in the field of patent translation, Transformer encoder-decoder based models remain the standard approach, and the potential of LLMs for translation tasks has not been thoroughly explored. In this study, we conducted patent claim translation using an LLM fine-tuned with parallel data through continual pre-training and supervised fine-tuning, following the methodology proposed by Guo et al. (2024) and Kondo et al. (2024). Comparative evaluation against the Transformer encoder-decoder based translations revealed that the LLM achieved high scores for both BLEU and COMET. This demonstrated improvements in addressing issues such as omissions and repetitions. Nonetheless, hallucination errors, which were not observed in the traditional models, occurred in some cases and negatively affected the translation quality. This study highlights the promise of LLMs for patent translation while identifying the challenges that warrant further investigation.

2024

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data
Minato Kondo | Takehito Utsuro | Masaaki Nagata
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

In this paper, we propose a two-phase training approach in which pre-trained large language models are continually pre-trained on parallel data and then fine-tuned in a supervised manner on a small amount of high-quality parallel data. To investigate the effectiveness of the proposed approach, we conduct continual pre-training with a 3.8B-parameter model and parallel data in eight different formats, and evaluate these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that, when utilizing parallel data in continual pre-training, it is essential to alternate between source and target sentences. Translation accuracy improves only for translation directions where the order of source and target sentences matches between the continual pre-training data and inference. We also demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data than supervised encoder-decoder models. Finally, the highest accuracy is achieved when the data for continual pre-training consists of interleaved source and target sentences and when tags are added to the source sentences.
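The interleaved, tagged data format described in this abstract could look roughly like the following. This is a minimal sketch only: the function name `interleave_pairs`, the tag string `<ja2en>`, and the one-sentence-per-line layout are illustrative assumptions, not the format the paper actually used.

```python
from typing import Iterable, Tuple


def interleave_pairs(pairs: Iterable[Tuple[str, str]], src_tag: str = "<ja2en>") -> str:
    """Build one training document that alternates source and target sentences.

    Each source sentence is prefixed with a tag (hypothetical format) and is
    immediately followed by its target sentence on the next line.
    """
    lines = []
    for src, tgt in pairs:
        lines.append(f"{src_tag} {src}")  # tagged source sentence
        lines.append(tgt)                 # its translation follows directly
    return "\n".join(lines)


if __name__ == "__main__":
    doc = interleave_pairs([
        ("これはペンです。", "This is a pen."),
        ("ありがとう。", "Thank you."),
    ])
    print(doc)
```

Because the abstract reports that accuracy improves only when the source/target order in the pre-training data matches the order used at inference, a sketch like this would be run once per translation direction, with the pair order swapped for the opposite direction.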

The NTTSU team’s submission leverages several large language models developed through a training procedure that includes continual pre-training and supervised fine-tuning. For paragraph-level translation, we generated synthetic paragraph-aligned data and used it for training. In the Japanese-to-Chinese translation task, we focused particularly on speech-domain translation. Specifically, we built Whisper models for Japanese automatic speech recognition (ASR), using the YODAS dataset for Whisper training. Since this data contained many noisy data pairs, we combined the Whisper outputs using ROVER to polish the transcriptions. Furthermore, to make the translation model more robust to transcription errors, we performed data augmentation by forward translation from audio, using both ASR and base translation models. To select the best translation from the models’ multiple hypotheses, we applied Minimum Bayes Risk decoding and reranking, incorporating scores such as COMET-QE, COMET, and LaBSE cosine similarity.
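The hypothesis-selection step mentioned in this abstract can be illustrated with a small Minimum Bayes Risk sketch. It assumes a toy token-overlap F1 as the utility function in place of COMET, COMET-QE, or LaBSE similarity, and the names `overlap_f1` and `mbr_select` are hypothetical; it is not the team's implementation and omits the reranking stage.

```python
from collections import Counter
from typing import Callable, List


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility: token-overlap F1 between two strings (stand-in for COMET)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses: List[str],
               utility: Callable[[str, str], float] = overlap_f1) -> str:
    """Return the hypothesis with the highest average utility against the others."""
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(hypotheses):
        # Treat every other hypothesis as a pseudo-reference.
        others = [o for j, o in enumerate(hypotheses) if j != i]
        score = sum(utility(hyp, o) for o in others) / max(len(others), 1)
        if score > best_score:  # ties keep the earliest hypothesis
            best, best_score = hyp, score
    return best


if __name__ == "__main__":
    print(mbr_select(["a b c d", "a b c", "a b", "z"]))
```

The selected hypothesis is the one that agrees most with the rest of the candidate pool, which is why an outlier such as a hallucinated output tends to be rejected.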

2023

Leveraging Highly Accurate Word Alignment for Low Resource Translation by Pretrained Multilingual Model
Jingyi Zhu | Minato Kondo | Takuya Tamura | Takehito Utsuro | Masaaki Nagata
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Recently, there has been growing interest in pretrained models in the field of natural language processing. Compared with models trained from scratch, pretrained models have been shown to produce superior results in low-resource translation tasks. In this paper, we introduce the use of pretrained seq2seq models for preordering and translation tasks. We used manual word alignment data and mBERT-generated word alignment data to train preordering models, and compared the effectiveness of various types of mT5 and mBART models for preordering. For the translation task, we chose mBART as our baseline model and evaluated several input formats. Our approach was evaluated on the Asian Language Treebank dataset, consisting of 20,000 parallel examples in Japanese, English, and Hindi, where Japanese is on either the source or the target side. We also used 3,000 in-house parallel examples in Chinese and Japanese. The results indicated that mT5-large trained with manual word alignment achieved a preordering performance exceeding 0.9 RIBES on the Ja-En and Ja-Zh pairs. Moreover, our proposed approach significantly outperformed the baseline model in most translation directions of the Ja-En, Ja-Zh, and Ja-Hi pairs on at least one of BLEU and COMET.
