Constructivist Tokenization for English

Allison Fan, Weiwei Sun


Abstract
This paper revisits tokenization from a theoretical perspective and argues that a constructivist approach to tokenization is necessary for semantic parsing and for modeling language acquisition. We consider two problems: (1) (semi-)automatically converting existing lexicalist annotations, e.g. those of the Penn TreeBank, into constructivist annotations, and (2) automatically tokenizing raw text. We demonstrate that (1) a heuristic rule-based constructivist tokenizer yields relatively satisfactory accuracy when gold-standard Penn TreeBank part-of-speech tags are available, although some manual annotation is still necessary to obtain gold-standard results, and (2) a neural tokenizer provides accurate automatic constructivist tokenization from raw character sequences. Our research output also includes a set of high-quality morpheme-tokenized corpora, which enable the training of computational models that align more closely with language comprehension and acquisition.
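To make the idea of a heuristic, tag-driven tokenizer concrete, the following is a minimal sketch of how part-of-speech tags could guide morpheme splitting. It is an illustration only and not the rule set used in the paper: the suffix table, the function names `split_token` and `tokenize`, and the `-suffix` output convention are all assumptions introduced here.

```python
# Hypothetical suffix table keyed by Penn TreeBank POS tag.
# These mappings are illustrative assumptions, not the paper's rules.
SUFFIX_BY_TAG = {
    "NNS": "s",    # plural noun
    "VBZ": "s",    # 3rd person singular present verb
    "VBD": "ed",   # past tense verb
    "VBG": "ing",  # gerund / present participle
    "JJR": "er",   # comparative adjective
    "JJS": "est",  # superlative adjective
}


def split_token(word, tag):
    """Split one word into [stem, "-suffix"] when the tag licenses a suffix.

    The word is returned unsplit if the tag has no entry, the word does not
    end in the expected suffix, or the remaining stem would be too short.
    """
    suffix = SUFFIX_BY_TAG.get(tag)
    if suffix and word.lower().endswith(suffix) and len(word) > len(suffix) + 1:
        return [word[: -len(suffix)], "-" + suffix]
    return [word]


def tokenize(tagged_words):
    """Apply split_token to a sequence of (word, tag) pairs."""
    tokens = []
    for word, tag in tagged_words:
        tokens.extend(split_token(word, tag))
    return tokens


if __name__ == "__main__":
    sentence = [("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
    print(tokenize(sentence))
```

A sketch this naive leaves irregular forms such as "went" unsplit and over-splits words such as "has", which is one way to see why purely heuristic conversion may still need manual correction to reach gold-standard quality.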
Anthology ID:
2023.cxgsnlp-1.5
Volume:
Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023)
Month:
March
Year:
2023
Address:
Washington, D.C.
Editors:
Claire Bonial, Harish Tayyar Madabushi
Venues:
CxGsNLP | SyntaxFest
SIG:
SIGPARSE
Publisher:
Association for Computational Linguistics
Pages:
36–40
URL:
https://aclanthology.org/2023.cxgsnlp-1.5
Cite (ACL):
Allison Fan and Weiwei Sun. 2023. Constructivist Tokenization for English. In Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023), pages 36–40, Washington, D.C. Association for Computational Linguistics.
Cite (Informal):
Constructivist Tokenization for English (Fan & Sun, CxGsNLP-SyntaxFest 2023)
PDF:
https://aclanthology.org/2023.cxgsnlp-1.5.pdf