To Split or Not to Split: Composing Compounds in Contextual Vector Spaces

Christopher Jenkins, Filip Miletic, Sabine Schulte im Walde


Abstract
We investigate the effect of sub-word tokenization on representations of German noun compounds: single orthographic words which are composed of two or more constituents but often tokenized into units that are not morphologically motivated or meaningful. Using variants of BERT models and tokenization strategies on domain-specific restricted diachronic data, we introduce a suite of evaluations relying on the masked language modelling task and compositionality prediction. We obtain the most consistent improvements by pre-splitting compounds into constituents.
Anthology ID:
2023.emnlp-main.1002
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
16131–16136
URL:
https://aclanthology.org/2023.emnlp-main.1002
DOI:
10.18653/v1/2023.emnlp-main.1002
Cite (ACL):
Christopher Jenkins, Filip Miletic, and Sabine Schulte im Walde. 2023. To Split or Not to Split: Composing Compounds in Contextual Vector Spaces. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16131–16136, Singapore. Association for Computational Linguistics.
Cite (Informal):
To Split or Not to Split: Composing Compounds in Contextual Vector Spaces (Jenkins et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.1002.pdf
Video:
https://aclanthology.org/2023.emnlp-main.1002.mp4