Ryosuke Takahashi
2024
Document-level Translation with LLM Reranking: Team-J at WMT 2024 General Translation Task
Keito Kudo | Hiroyuki Deguchi | Makoto Morishita | Ryo Fujii | Takumi Ito | Shintaro Ozaki | Koki Natsumi | Kai Sato | Kazuki Yano | Ryosuke Takahashi | Subaru Kimura | Tomomasa Hara | Yusuke Sakai | Jun Suzuki
Proceedings of the Ninth Conference on Machine Translation
We participated in the constrained track for English-to-Japanese and Japanese-to-Chinese translation at the WMT 2024 General Machine Translation Task. Our approach was to generate a large number of sentence-level translation candidates and to select the most probable translation using minimum Bayes risk (MBR) decoding and document-level large language model (LLM) reranking. We first generated hundreds of translation candidates from multiple translation models and retained the top 30 candidates using MBR decoding. In addition, we continually pre-trained LLMs on target-language corpora to leverage document-level information. We then used these LLMs to select the most probable sentence in context, proceeding sequentially from the beginning of the document.
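As a rough illustration of the MBR filtering step described in the abstract, the Python sketch below keeps the top-k candidates by expected utility, treating the other candidates as pseudo-references with uniform weights. The token-overlap F1 utility and the function names are placeholders: the abstract does not specify which utility metric the team used, and a learned metric may well have been used instead.

```python
# Minimal sketch of sentence-level MBR candidate filtering.
# The utility function (token-overlap F1) is an illustrative stand-in,
# not the metric used in the actual submission.

from collections import Counter


def utility(hyp: str, ref: str) -> float:
    """Token-overlap F1 between a hypothesis and a pseudo-reference."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)


def mbr_filter(candidates: list[str], top_k: int = 30) -> list[str]:
    """Keep the top_k candidates by expected utility, using every
    other candidate as a pseudo-reference (uniform weighting)."""
    scores = []
    for i, hyp in enumerate(candidates):
        refs = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(utility(hyp, r) for r in refs) / max(len(refs), 1))
    ranked = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)
    return [candidates[i] for i in ranked[:top_k]]
```

In this scheme the surviving 30 candidates would then be passed to the document-level LLM, which scores each one in the context of the previously selected sentences of the document.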
2022
Leveraging Three Types of Embeddings from Masked Language Models in Idiom Token Classification
Ryosuke Takahashi | Ryohei Sasano | Koichi Takeda
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics
Many linguistic expressions have idiomatic and literal interpretations, and the automatic distinction between these two interpretations has been studied for decades. Recent research has shown that contextualized word embeddings derived from masked language models (MLMs) can give promising results for idiom token classification. This indicates that a contextualized word embedding alone contains information about whether the word is being used in a literal sense or not. However, we believe that more types of information can be derived from MLMs and that leveraging such information can improve idiom token classification. In this paper, we leverage three types of embeddings from MLMs: uncontextualized token embeddings and masked token embeddings, in addition to the standard contextualized word embeddings, and show that the newly added embeddings significantly improve idiom token classification on both English and Japanese datasets.
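The three embedding types named in the abstract can be sketched with the Hugging Face Transformers API as below. The choice of bert-base-uncased, the example sentence, and the final concatenation into a single feature vector are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: extracting three embedding types for one target word
# from a masked language model. Model choice and pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "He finally kicked the bucket after a long illness."
target = "bucket"  # assumes the target word is a single vocabulary token

inputs = tokenizer(sentence, return_tensors="pt")
ids = inputs["input_ids"][0]
pos = ids.tolist().index(tokenizer.convert_tokens_to_ids(target))

with torch.no_grad():
    # 1) Contextualized word embedding: hidden state at the target position.
    contextual = model(**inputs).last_hidden_state[0, pos]

    # 2) Uncontextualized token embedding: the static input-embedding row.
    static = model.get_input_embeddings()(ids[pos])

    # 3) Masked token embedding: hidden state at the same position after
    #    replacing the target token with [MASK].
    masked_ids = inputs["input_ids"].clone()
    masked_ids[0, pos] = tokenizer.mask_token_id
    masked = model(
        input_ids=masked_ids, attention_mask=inputs["attention_mask"]
    ).last_hidden_state[0, pos]

# One plausible way to combine them as classifier input features.
features = torch.cat([contextual, static, masked])
```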