Semi-supervised Contextual Historical Text Normalization

Peter Makarov, Simon Clematide


Abstract
Historical text normalization, the task of mapping historical word forms to their modern counterparts, has recently attracted a lot of interest (Bollmann, 2019; Tang et al., 2018; Lusetti et al., 2018; Bollmann et al., 2018; Robertson and Goldwater, 2018; Bollmann et al., 2017; Korchagina, 2017). Yet virtually all approaches suffer from two limitations: 1) they consider a fully supervised setup, often with impractically large manually normalized datasets; 2) normalization happens on words in isolation. By utilizing a simple generative normalization model and obtaining powerful contextualization from the target-side language model, we train accurate models with unlabeled historical data. In realistic training scenarios, our approach often reduces the amount of manually normalized data needed to reach the same accuracy levels.
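To make the idea of pairing a generative normalization model with a target-side language model more concrete, here is a minimal sketch of noisy-channel-style decoding over a sentence. It is an illustration under stated assumptions, not the authors' implementation: the tables `CHANNEL` and `BIGRAM`, the function `normalize`, and all probabilities are hypothetical toy values, whereas a real system would learn both models from data (including the unlabeled historical text, which this sketch does not cover).

```python
import math

# Hypothetical toy channel model: CHANNEL[historical][modern] = P(historical | modern).
CHANNEL = {
    "to": {"to": 1.0},
    "the": {"the": 1.0},
    "bee": {"be": 0.3, "bee": 0.9},  # "bee" may be an old spelling of "be"
}

# Hypothetical toy target-side bigram language model: P(modern_i | modern_{i-1}).
BIGRAM = {
    ("<s>", "to"): 0.5,
    ("<s>", "the"): 0.5,
    ("to", "be"): 0.9,
    ("to", "bee"): 0.1,
    ("the", "bee"): 0.9,
    ("the", "be"): 0.1,
}

UNSEEN = 1e-6  # assumed floor probability for bigrams missing from the toy table


def normalize(historical_tokens):
    """Viterbi search for the modern sequence maximizing
    sum_i [log P(modern_i | modern_{i-1}) + log P(historical_i | modern_i)]."""
    # best[w] = (log score, path) of the best modern sequence ending in word w
    best = {"<s>": (0.0, [])}
    for hist in historical_tokens:
        # Assumption: unknown historical forms are copied unchanged.
        candidates = CHANNEL.get(hist, {hist: 1.0})
        new_best = {}
        for modern, p_channel in candidates.items():
            options = []
            for prev, (score, path) in best.items():
                p_lm = BIGRAM.get((prev, modern), UNSEEN)
                new_score = score + math.log(p_lm) + math.log(p_channel)
                options.append((new_score, path + [modern]))
            new_best[modern] = max(options, key=lambda o: o[0])
        best = new_best
    return max(best.values(), key=lambda o: o[0])[1]


print(normalize(["to", "bee"]))   # context favors the verb reading
print(normalize(["the", "bee"]))  # context favors the noun reading
```

The same historical form `bee` is normalized differently depending on the preceding modern word, which is the kind of contextual disambiguation that word-in-isolation normalization cannot provide.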
Anthology ID:
2020.acl-main.650
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
7284–7295
URL:
https://aclanthology.org/2020.acl-main.650
DOI:
10.18653/v1/2020.acl-main.650
Cite (ACL):
Peter Makarov and Simon Clematide. 2020. Semi-supervised Contextual Historical Text Normalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7284–7295, Online. Association for Computational Linguistics.
Cite (Informal):
Semi-supervised Contextual Historical Text Normalization (Makarov & Clematide, ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.650.pdf
Video:
http://slideslive.com/38929200