PhoBERT: Pre-trained language models for Vietnamese

Dat Quoc Nguyen, Anh Tuan Nguyen


Abstract
We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at https://github.com/VinAIResearch/PhoBERT
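For downstream use, the released checkpoints can be loaded through the Hugging Face transformers library. The sketch below is a minimal illustration, assuming the models are published on the Hugging Face Hub under the vinai namespace (vinai/phobert-base, vinai/phobert-large) and that input text has already been word-segmented (e.g. with VnCoreNLP's RDRSegmenter), as PhoBERT was pre-trained on word-segmented Vietnamese; these details come from the linked repository rather than this page.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load PhoBERT-base; swap in "vinai/phobert-large" for the large model.
# (Hub model IDs assumed from the project README.)
phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

# PhoBERT expects word-segmented input: syllables of a multi-syllable
# Vietnamese word are joined by underscores before BPE tokenization.
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    outputs = phobert(input_ids)

# Contextualized token embeddings for the sentence,
# shape: (batch_size, sequence_length, hidden_size).
features = outputs.last_hidden_state
```

These features can then be fed to a task-specific head (e.g. for POS tagging, NER, or NLI) in the usual fine-tuning setup.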
Anthology ID:
2020.findings-emnlp.92
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1037–1042
URL:
https://aclanthology.org/2020.findings-emnlp.92
DOI:
10.18653/v1/2020.findings-emnlp.92
Cite (ACL):
Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042, Online. Association for Computational Linguistics.
Cite (Informal):
PhoBERT: Pre-trained language models for Vietnamese (Nguyen & Tuan Nguyen, Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.92.pdf
Code:
VinAIResearch/PhoBERT