Spelling Normalization of Historical Documents by Using a Machine Translation Approach

Miguel Domingo, Francisco Casacuberta


Abstract
The lack of a spelling convention in historical documents causes their orthography to vary depending on the author and the time period in which each document was written. This poses a problem for cultural heritage preservation, which strives to create digital text versions of historical documents. To address this problem, we propose three approaches, based on statistical, neural and character-based machine translation, to adapt a document's spelling to modern standards. We tested these approaches in different scenarios and obtained very encouraging results.
Anthology ID:
2018.eamt-main.13
Volume:
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Month:
May
Year:
2018
Address:
Alicante, Spain
Editors:
Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Miquel Esplà-Gomis, Maja Popović, Celia Rico, André Martins, Joachim Van den Bogaert, Mikel L. Forcada
Venue:
EAMT
Pages:
149–158
URL:
https://aclanthology.org/2018.eamt-main.13
Cite (ACL):
Miguel Domingo and Francisco Casacuberta. 2018. Spelling Normalization of Historical Documents by Using a Machine Translation Approach. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 149–158, Alicante, Spain.
Cite (Informal):
Spelling Normalization of Historical Documents by Using a Machine Translation Approach (Domingo & Casacuberta, EAMT 2018)
PDF:
https://aclanthology.org/2018.eamt-main.13.pdf