Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches

Talha Çolakoğlu; Umut Sulubacak; Ahmet Cüneyd Tantuğ

doi:10.18653/v1/P19-2037

Abstract

With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is composed of a token-level pipeline of modules, heavily dependent on external linguistic resources and manually defined rules. Instead, we propose a fully automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.

Anthology ID: P19-2037
Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month: July
Year: 2019
Address: Florence, Italy
Editors: Fernando Alva-Manchego, Eunsol Choi, Daniel Khashabi
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 267–272
URL: https://aclanthology.org/P19-2037/
DOI: 10.18653/v1/P19-2037
Cite (ACL): Talha Çolakoğlu, Umut Sulubacak, and Ahmet Cüneyd Tantuğ. 2019. Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 267–272, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches (Çolakoğlu et al., ACL 2019)
PDF: https://aclanthology.org/P19-2037.pdf
