Pre-training Universal Language Representation

Yian Li, Hai Zhao


Abstract
Despite well-developed cutting-edge representation learning for language, most language representation models usually focus on specific levels of linguistic units. This work introduces universal language representation learning, i.e., embeddings of different levels of linguistic units or text with quite diverse lengths in a uniform vector space. We propose the training objective MiSAD, which utilizes meaningful n-grams extracted from a large unlabeled corpus by a simple but effective algorithm for pre-trained language models. Then we empirically verify that a well-designed pre-training scheme can effectively yield universal language representation, which brings great convenience when handling multiple layers of linguistic objects in a unified way. In particular, our model achieves the highest accuracy on analogy tasks at different language levels and significantly improves performance on downstream tasks in the GLUE benchmark and on a question answering dataset.
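The abstract says meaningful n-grams are extracted from a large unlabeled corpus by a simple but effective algorithm, but does not specify the procedure here. The following is a minimal, hypothetical sketch assuming a PMI-style co-occurrence filter; the function name, thresholds, and whitespace tokenization are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of "meaningful n-gram" extraction via pointwise mutual
# information (PMI); this is NOT the algorithm from the paper, only one
# plausible stand-in for illustration.
from collections import Counter
from math import log

def extract_ngrams(sentences, n=2, min_count=5, pmi_threshold=3.0):
    """Return n-grams whose PMI over their unigrams exceeds a threshold."""
    unigram_counts = Counter()
    ngram_counts = Counter()
    total_tokens = 0
    for tokens in sentences:
        unigram_counts.update(tokens)
        total_tokens += len(tokens)
        ngram_counts.update(tuple(tokens[i:i + n])
                            for i in range(len(tokens) - n + 1))
    total_ngrams = sum(ngram_counts.values()) or 1

    meaningful = []
    for gram, count in ngram_counts.items():
        if count < min_count:
            continue
        # PMI: log of the joint n-gram probability over the product of the
        # independent unigram probabilities.
        joint = count / total_ngrams
        indep = 1.0
        for tok in gram:
            indep *= unigram_counts[tok] / total_tokens
        pmi = log(joint / indep)
        if pmi >= pmi_threshold:
            meaningful.append((gram, pmi))
    return sorted(meaningful, key=lambda x: -x[1])

# Toy usage with a whitespace-tokenized corpus.
corpus = [line.split() for line in [
    "new york is a big city",
    "she moved to new york last year",
    "new york has many museums",
    "the city is big and busy",
    "new york city never sleeps",
]]
print(extract_ngrams(corpus, n=2, min_count=2, pmi_threshold=1.0))
```

With suitable count and PMI thresholds, frequent collocations such as "new york" survive the filter while incidental token pairs are discarded, which is the kind of multi-word unit a span-level pre-training objective could target.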
Anthology ID:
2021.acl-long.398
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
August
Year:
2021
Address:
Online
Editors:
Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
5122–5133
URL:
https://aclanthology.org/2021.acl-long.398
DOI:
10.18653/v1/2021.acl-long.398
Cite (ACL):
Yian Li and Hai Zhao. 2021. Pre-training Universal Language Representation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5122–5133, Online. Association for Computational Linguistics.
Cite (Informal):
Pre-training Universal Language Representation (Li & Zhao, ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-long.398.pdf
Video:
https://aclanthology.org/2021.acl-long.398.mp4
Data
CoLA | GLUE | MRPC | MultiNLI | QNLI | SST | SST-2