Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches

Talha Çolakoğlu, Umut Sulubacak, Ahmet Cüneyd Tantuğ


Abstract
With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is composed of a token-level pipeline of modules, heavily dependent on external linguistic resources and manually defined rules. Instead, we propose a fully automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.
Anthology ID:
P19-2037
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Fernando Alva-Manchego, Eunsol Choi, Daniel Khashabi
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
267–272
URL:
https://aclanthology.org/P19-2037
DOI:
10.18653/v1/P19-2037
Cite (ACL):
Talha Çolakoğlu, Umut Sulubacak, and Ahmet Cüneyd Tantuğ. 2019. Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 267–272, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches (Çolakoğlu et al., ACL 2019)
PDF:
https://aclanthology.org/P19-2037.pdf
Data
OpenSubtitles