Original-Transcribed Text Alignment for Manyosyu Written by Old Japanese Language

Teruaki Oka, Tomoaki Kono


Abstract
We are constructing annotated diachronic corpora of the Japanese language. As part of this work, we are building a corpus of Manyosyu, an Old Japanese poetry anthology. In this paper, we describe how to semiautomatically align the transcribed text with its original text so that the two can be cross-referenced in our Manyosyu corpus. Although we align the original characters to the transcribed words manually, we first align the transcribed and original characters with an unsupervised automatic alignment technique from statistical machine translation to reduce the manual work. We found that automatic alignment achieves an F1-measure of 0.83; thus, each poem has 1–2 alignment errors. However, finding and correcting these errors is less work-intensive and more efficient than fully manual annotation. The alignment probabilities can be used in this correction. Moreover, we found that the alignment probabilities let us locate uncertain transcriptions in our corpus and compare them with other transcriptions.
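
The abstract does not name the alignment model, so the following is only a minimal sketch of the general idea, assuming IBM Model 1 as a stand-in for the unsupervised statistical machine translation aligner. The function names (`train_ibm1`, `align`, `flag_uncertain`), the 0.6 threshold, and the toy symbol strings are all hypothetical; the toy strings are placeholders, not Manyosyu data.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM estimate of t[(o, c)]: probability that transcribed char c yields original char o.

    pairs: list of (transcribed, original) strings. IBM Model 1 is an assumption here.
    """
    orig_vocab = {o for _, orig in pairs for o in orig}
    t = defaultdict(lambda: 1.0 / len(orig_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for trans, orig in pairs:
            for o in orig:
                norm = sum(t[(o, c)] for c in trans)
                for c in trans:
                    frac = t[(o, c)] / norm  # expected share of o explained by c
                    count[(o, c)] += frac
                    total[c] += frac
        t = defaultdict(float, {k: v / total[k[1]] for k, v in count.items()})
    return t

def align(trans, orig, t):
    """Link each original char to its most probable transcribed char.

    Returns (original_index, transcribed_index, posterior_probability) triples.
    """
    links = []
    for j, o in enumerate(orig):
        norm = sum(t[(o, c)] for c in trans)
        i, best = max(enumerate(trans), key=lambda ic: t[(o, ic[1])])
        links.append((j, i, t[(o, best)] / norm if norm else 0.0))
    return links

def flag_uncertain(links, threshold=0.6):
    """Return links whose probability is below a (hypothetical) review threshold."""
    return [link for link in links if link[2] < threshold]

# Toy placeholder data: lowercase = transcribed side, uppercase = original side.
toy_pairs = [("ab", "XY"), ("a", "X")]
model = train_ibm1(toy_pairs)
links = align("ab", "XY", model)
to_review = flag_uncertain(links)
```

In this sketch, the low-probability links returned by `flag_uncertain` play the role of the candidates an annotator would check first when correcting the automatic alignment.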
Anthology ID:
W16-4006
Volume:
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Erhard Hinrichs, Marie Hinrichs, Thorsten Trippel
Venue:
LT4DH
Publisher:
The COLING 2016 Organizing Committee
Pages:
35–44
URL:
https://aclanthology.org/W16-4006
Cite (ACL):
Teruaki Oka and Tomoaki Kono. 2016. Original-Transcribed Text Alignment for Manyosyu Written by Old Japanese Language. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 35–44, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Original-Transcribed Text Alignment for Manyosyu Written by Old Japanese Language (Oka & Kono, LT4DH 2016)
PDF:
https://aclanthology.org/W16-4006.pdf