@inproceedings{imamura-sumita-2019-recycling,
title = "Recycling a Pre-trained {BERT} Encoder for Neural Machine Translation",
author = "Imamura, Kenji and
Sumita, Eiichiro",
editor = "Birch, Alexandra and
Finch, Andrew and
Hayashi, Hiroaki and
Konstas, Ioannis and
Luong, Thang and
Neubig, Graham and
Oda, Yusuke and
Sudoh, Katsuhito",
booktitle = "Proceedings of the 3rd Workshop on Neural Generation and Translation",
month = nov,
year = "2019",
address = "Hong Kong",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D19-5603",
doi = "10.18653/v1/D19-5603",
pages = "23--31",
abstract = "In this paper, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is applied to Transformer-based neural machine translation (NMT). In contrast to monolingual tasks, the number of unlearned model parameters in an NMT decoder is as huge as the number of learned parameters in the BERT model. To train all the models appropriately, we employ two-stage optimization, which first trains only the unlearned parameters by freezing the BERT model, and then fine-tunes all the sub-models. In our experiments, stable two-stage optimization was achieved, in contrast the BLEU scores of direct fine-tuning were extremely low. Consequently, the BLEU scores of the proposed method were better than those of the Transformer base model and the same model without pre-training. Additionally, we confirmed that NMT with the BERT encoder is more effective in low-resource settings.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="imamura-sumita-2019-recycling">
<titleInfo>
<title>Recycling a Pre-trained BERT Encoder for Neural Machine Translation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kenji</namePart>
<namePart type="family">Imamura</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Eiichiro</namePart>
<namePart type="family">Sumita</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2019-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 3rd Workshop on Neural Generation and Translation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Alexandra</namePart>
<namePart type="family">Birch</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Andrew</namePart>
<namePart type="family">Finch</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hiroaki</namePart>
<namePart type="family">Hayashi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ioannis</namePart>
<namePart type="family">Konstas</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Thang</namePart>
<namePart type="family">Luong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Graham</namePart>
<namePart type="family">Neubig</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yusuke</namePart>
<namePart type="family">Oda</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Katsuhito</namePart>
<namePart type="family">Sudoh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Hong Kong</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>In this paper, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is applied to Transformer-based neural machine translation (NMT). In contrast to monolingual tasks, the number of unlearned model parameters in an NMT decoder is as large as the number of learned parameters in the BERT model. To train all the models appropriately, we employ two-stage optimization, which first trains only the unlearned parameters by freezing the BERT model, and then fine-tunes all the sub-models. In our experiments, stable two-stage optimization was achieved; in contrast, the BLEU scores of direct fine-tuning were extremely low. Consequently, the BLEU scores of the proposed method were better than those of the Transformer base model and the same model without pre-training. Additionally, we confirmed that NMT with the BERT encoder is more effective in low-resource settings.</abstract>
<identifier type="citekey">imamura-sumita-2019-recycling</identifier>
<identifier type="doi">10.18653/v1/D19-5603</identifier>
<location>
<url>https://aclanthology.org/D19-5603</url>
</location>
<part>
<date>2019-11</date>
<extent unit="page">
<start>23</start>
<end>31</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Recycling a Pre-trained BERT Encoder for Neural Machine Translation
%A Imamura, Kenji
%A Sumita, Eiichiro
%Y Birch, Alexandra
%Y Finch, Andrew
%Y Hayashi, Hiroaki
%Y Konstas, Ioannis
%Y Luong, Thang
%Y Neubig, Graham
%Y Oda, Yusuke
%Y Sudoh, Katsuhito
%S Proceedings of the 3rd Workshop on Neural Generation and Translation
%D 2019
%8 November
%I Association for Computational Linguistics
%C Hong Kong
%F imamura-sumita-2019-recycling
%X In this paper, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is applied to Transformer-based neural machine translation (NMT). In contrast to monolingual tasks, the number of unlearned model parameters in an NMT decoder is as large as the number of learned parameters in the BERT model. To train all the models appropriately, we employ two-stage optimization, which first trains only the unlearned parameters by freezing the BERT model, and then fine-tunes all the sub-models. In our experiments, stable two-stage optimization was achieved; in contrast, the BLEU scores of direct fine-tuning were extremely low. Consequently, the BLEU scores of the proposed method were better than those of the Transformer base model and the same model without pre-training. Additionally, we confirmed that NMT with the BERT encoder is more effective in low-resource settings.
%R 10.18653/v1/D19-5603
%U https://aclanthology.org/D19-5603
%U https://doi.org/10.18653/v1/D19-5603
%P 23-31
Markdown (Informal)
[Recycling a Pre-trained BERT Encoder for Neural Machine Translation](https://aclanthology.org/D19-5603) (Imamura & Sumita, NGT 2019)
ACL
Kenji Imamura and Eiichiro Sumita. 2019. Recycling a Pre-trained BERT Encoder for Neural Machine Translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 23–31, Hong Kong. Association for Computational Linguistics.
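
The abstract describes a two-stage optimization scheme: first train only the randomly initialized (unlearned) parameters while the pre-trained BERT encoder stays frozen, then fine-tune all sub-models jointly. Below is a minimal, hypothetical PyTorch sketch of that schedule; the `model.encoder` attribute, the `train_step` callback (assumed to perform one forward/backward/update step), and the learning rates are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the two-stage optimization described in the abstract.
# Assumes an encoder-decoder NMT model where `model.encoder` is the pre-trained
# BERT encoder; names and hyperparameters are illustrative, not from the paper.
import torch


def train_two_stage(model, train_step, stage1_steps, stage2_steps,
                    lr_stage1=1e-4, lr_stage2=2e-5):
    # Stage 1: freeze the pre-trained BERT encoder and train only the
    # randomly initialized ("unlearned") parameters, i.e. the decoder.
    for p in model.encoder.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=lr_stage1)
    for _ in range(stage1_steps):
        train_step(model, optimizer)

    # Stage 2: unfreeze everything and fine-tune all sub-models jointly,
    # typically with a smaller learning rate.
    for p in model.parameters():
        p.requires_grad = True
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_stage2)
    for _ in range(stage2_steps):
        train_step(model, optimizer)
```

Skipping stage 1 in this sketch corresponds to the direct fine-tuning setting that the abstract reports as yielding extremely low BLEU scores.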