Pretraining Language- and Domain-Specific BERT on Automatically Translated Text

Tatsuya Ishigaki; Yui Uehara; Goran Topić; Hiroya Takamura

Abstract

Domain-specific pretrained language models such as SciBERT are effective for various tasks involving text in specific domains. However, pretraining BERT requires a large-scale language resource, which is not necessarily available in fine-grained domains, especially in non-English languages. In this study, we focus on a setting in which no domain-specific text is available for pretraining. To this end, we propose a simple framework that pretrains a BERT model on target-language text automatically translated from a resource-rich language, e.g., English. In this paper, we focus in particular on the materials science domain in Japanese. We evaluate on the task of entity and relation extraction for this domain and language. The experiments demonstrate that the various models pretrained on translated text consistently outperform the general BERT in terms of F1 score, even though the domain-specific BERTs do not use any human-authored domain-specific text. These results imply that BERTs for various low-resource domains can be successfully trained on text automatically translated from resource-rich languages.
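
The framework described in the abstract can be illustrated with a minimal sketch: translate source-language domain text into the target language, then turn the translated text into masked-language-model training examples. Everything named here is an assumption for illustration and not the authors' implementation: `build_translated_corpus`, `mask_tokens`, and `toy_translate` are hypothetical helpers, the dictionary lookup stands in for a real machine translation system, character-level tokens stand in for a real Japanese tokenizer, and the 15% masking rate follows the original BERT recipe rather than anything stated on this page. The masking is also simplified: every selected token becomes `[MASK]`, without BERT's random-replacement and keep-as-is cases.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple

MASK = "[MASK]"


def build_translated_corpus(
    source_sentences: Iterable[str],
    translate: Callable[[str], str],
) -> List[str]:
    """Translate each source sentence and keep only non-empty outputs."""
    corpus = []
    for sentence in source_sentences:
        translated = translate(sentence).strip()
        if translated:
            corpus.append(translated)
    return corpus


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace each token with [MASK] with probability mask_prob.

    Returns (masked_tokens, labels). A label holds the original token at
    masked positions and None elsewhere.
    """
    rng = rng if rng is not None else random.Random(0)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Hypothetical stand-in for a machine translation system (toy data only).
TOY_MT = {"Titanium oxide is a semiconductor.": "酸化チタンは半導体である。"}


def toy_translate(sentence: str) -> str:
    """Look up a canned translation; return "" when none is available."""
    return TOY_MT.get(sentence, "")


corpus = build_translated_corpus(
    ["Titanium oxide is a semiconductor.", "Unknown sentence."],
    toy_translate,
)
# Character-level tokens are an assumption made to keep the sketch dependency-free.
examples = [mask_tokens(list(sentence)) for sentence in corpus]
```

In a real setup, `toy_translate` would be replaced by an actual translation system and `examples` would feed a BERT pretraining loop; the sketch only shows how translated text can take the place of human-authored domain text in that pipeline.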

Anthology ID: 2023.ranlp-1.60
Volume: Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Month: September
Year: 2023
Address: Varna, Bulgaria
Editors: Ruslan Mitkov, Galia Angelova
Venue: RANLP
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 548–555
URL: https://aclanthology.org/2023.ranlp-1.60/
Cite (ACL): Tatsuya Ishigaki, Yui Uehara, Goran Topić, and Hiroya Takamura. 2023. Pretraining Language- and Domain-Specific BERT on Automatically Translated Text. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 548–555, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): Pretraining Language- and Domain-Specific BERT on Automatically Translated Text (Ishigaki et al., RANLP 2023)
PDF: https://aclanthology.org/2023.ranlp-1.60.pdf