Long Unit Word Tokenization and Bunsetsu Segmentation of Historical Japanese

Hiroaki Ozaki, Kanako Komiya, Masayuki Asahara, Toshinobu Ogiso


Abstract
In Japanese, the natural minimal phrase of a sentence is the “bunsetsu”. For native speakers, the bunsetsu, rather than the word, serves as the natural boundary within a sentence, and grammatical analysis in Japanese linguistics therefore commonly operates on bunsetsu units. In contrast, because Japanese does not have delimiters between words, there are two major categories of word definition, namely Short Unit Words (SUWs) and Long Unit Words (LUWs). Although an SUW dictionary is available, an LUW dictionary is not. Hence, this study focuses on providing a deep learning-based (or LLM-based) bunsetsu and LUW analyzer for the Heian period (AD 794-1185) and evaluating its performance. We model the parser as a transformer-based joint sequence labeling model, which combines a bunsetsu BI tag, an LUW BI tag, and an LUW Part-of-Speech (POS) tag for each SUW token. We train our models on corpora from each period, covering both contemporary and historical Japanese. The F1 scores range from 0.976 to 0.996 for both bunsetsu and LUW reconstruction, indicating that our models achieve performance comparable to that of models for a contemporary Japanese corpus. A statistical analysis and a diachronic case study suggest that the estimation of bunsetsu could be influenced by the grammaticalization of morphemes.
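The joint labeling described in the abstract can be illustrated with a minimal sketch of the decoding step, in which per-SUW labels are turned back into bunsetsu and LUWs. This is not the authors' implementation. The record layout, the function names (`group_by_bi`, `decode`), the POS labels, the rule that an LUW takes the POS of its first SUW, and the contemporary example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical per-SUW record (an assumption, not the paper's format):
# (surface, bunsetsu BI tag, LUW BI tag, LUW POS tag)
Token = Tuple[str, str, str, str]


def group_by_bi(surfaces: List[str], tags: List[str]) -> List[str]:
    """Concatenate surfaces into spans: 'B' opens a span, 'I' extends it."""
    spans: List[str] = []
    for surface, tag in zip(surfaces, tags):
        if tag == "B" or not spans:
            spans.append(surface)      # start a new span
        else:
            spans[-1] += surface       # continue the current span
    return spans


def decode(tokens: List[Token]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Rebuild bunsetsu and (LUW, POS) pairs from labeled SUW tokens."""
    surfaces = [t[0] for t in tokens]
    bunsetsu = group_by_bi(surfaces, [t[1] for t in tokens])

    luws: List[Tuple[str, str]] = []
    for surface, _, luw_tag, pos in tokens:
        if luw_tag == "B" or not luws:
            luws.append((surface, pos))            # new LUW with its POS
        else:
            prev_surface, prev_pos = luws[-1]
            # Assumption: the LUW keeps the POS of its first SUW.
            luws[-1] = (prev_surface + surface, prev_pos)
    return bunsetsu, luws


# Illustrative contemporary example (not taken from the paper's corpora).
tokens: List[Token] = [
    ("国立", "B", "B", "NOUN"),
    ("国語", "I", "I", "NOUN"),
    ("研究", "I", "I", "NOUN"),
    ("所", "I", "I", "NOUN"),
    ("で", "I", "B", "PARTICLE"),
    ("働く", "B", "B", "VERB"),
]

bunsetsu, luws = decode(tokens)
print(bunsetsu)
print(luws)
```

In this sketch, four SUWs merge into one LUW, and the following particle starts a new LUW while staying inside the same bunsetsu, which is why the two BI tag sequences have to be predicted separately for each SUW token.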
Anthology ID:
2024.ml4al-1.6
Volume:
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Month:
August
Year:
2024
Address:
Hybrid in Bangkok, Thailand and online
Editors:
John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson
Venues:
ML4AL | WS
Publisher:
Association for Computational Linguistics
Pages:
48–55
URL:
https://aclanthology.org/2024.ml4al-1.6
DOI:
10.18653/v1/2024.ml4al-1.6
Cite (ACL):
Hiroaki Ozaki, Kanako Komiya, Masayuki Asahara, and Toshinobu Ogiso. 2024. Long Unit Word Tokenization and Bunsetsu Segmentation of Historical Japanese. In Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 48–55, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics.
Cite (Informal):
Long Unit Word Tokenization and Bunsetsu Segmentation of Historical Japanese (Ozaki et al., ML4AL-WS 2024)
PDF:
https://aclanthology.org/2024.ml4al-1.6.pdf