Towards Context-aware Normalization of Variant Characters in Classical Chinese Using Parallel Editions and BERT

Florian Kessler


Abstract
For the automatic processing of Classical Chinese texts, it is highly desirable to normalize variant characters, i.e., characters with different visual forms that are used to represent the same morpheme, into a single form. However, some variant characters are used interchangeably by some writers but deliberately employed by others to distinguish between different meanings. Hence, to avoid losing information in the normalization process by conflating meaningful distinctions between variants, an intelligent normalization system that takes context into account is needed. Towards the goal of developing such a system, in this study we describe how a dataset with usage samples of variant characters can be extracted from a corpus of paired editions of multiple texts. Using the dataset, we conduct two experiments testing whether models can be trained with contextual word embeddings to predict variant characters. The results show that while this is often possible for single texts, most conventions learned do not transfer well between documents.
Anthology ID:
2024.ml4al-1.15
Volume:
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Month:
August
Year:
2024
Address:
Hybrid in Bangkok, Thailand and online
Editors:
John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson
Venues:
ML4AL | WS
Publisher:
Association for Computational Linguistics
Pages:
141–151
URL:
https://aclanthology.org/2024.ml4al-1.15
DOI:
10.18653/v1/2024.ml4al-1.15
Cite (ACL):
Florian Kessler. 2024. Towards Context-aware Normalization of Variant Characters in Classical Chinese Using Parallel Editions and BERT. In Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 141–151, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics.
Cite (Informal):
Towards Context-aware Normalization of Variant Characters in Classical Chinese Using Parallel Editions and BERT (Kessler, ML4AL-WS 2024)
PDF:
https://aclanthology.org/2024.ml4al-1.15.pdf