From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding

Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, Cha Zhang


Abstract
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes. Such a fixed vocabulary limits a model's robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach: one at the word level and another at the sequence level. Concretely, we design an intra-word module that uses a shallow Transformer architecture to learn word representations from their characters, and a deep inter-word Transformer module that contextualizes each word representation by attending to the entire word sequence. Our model thus operates directly on character sequences with explicit awareness of word boundaries, but without relying on a biased sub-word or word-level vocabulary. Experiments on various downstream tasks show that our method outperforms strong baselines. We also demonstrate that our hierarchical model is robust to textual corruption and domain shift.
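
To make the two-level design concrete, the following is a minimal PyTorch sketch of the intra-word/inter-word structure described in the abstract. The layer counts, dimensions, character inventory, and mean pooling are illustrative assumptions, not the authors' published configuration.

import torch
import torch.nn as nn

class HierarchicalCharToWordEncoder(nn.Module):
    """Sketch of a hierarchical open-vocabulary encoder: a shallow
    intra-word Transformer over characters, followed by a deep
    inter-word Transformer over pooled word vectors."""
    def __init__(self, n_chars=256, d_model=512,
                 intra_layers=2, inter_layers=12, n_heads=8):
        super().__init__()
        # The only lookup table is over raw characters/bytes,
        # so any input string can be encoded (open vocabulary).
        self.char_emb = nn.Embedding(n_chars, d_model)
        # Shallow intra-word module: attends within each word's characters.
        self.intra_word = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=intra_layers)
        # Deep inter-word module: attends across the whole word sequence.
        self.inter_word = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=inter_layers)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, chars_per_word) character IDs,
        # already split on word boundaries (e.g. whitespace).
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.reshape(b * w, c))  # (b*w, c, d)
        x = self.intra_word(x)                         # character-level attention
        words = x.mean(dim=1).reshape(b, w, -1)        # pool chars -> word vectors
        return self.inter_word(words)                  # (b, w, d) contextual words

Because the character embedding is the only lookup table, any byte sequence can be encoded, which is what makes the model open-vocabulary; the shallow/deep split keeps character-level attention cheap while reserving most of the depth for cross-word context.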
Anthology ID:
2023.acl-long.200
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3605–3620
URL:
https://aclanthology.org/2023.acl-long.200
DOI:
10.18653/v1/2023.acl-long.200
Cite (ACL):
Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, and Cha Zhang. 2023. From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3605–3620, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding (Sun et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.200.pdf
Video:
https://aclanthology.org/2023.acl-long.200.mp4