Neural Sequence-to-sequence Learning of Internal Word Structure

Tatyana Ruzsics, Tanja Samardžić


Abstract
Learning internal word structure has recently been recognized as an important step in various multilingual processing tasks and in theoretical language comparison. In this paper, we present a neural encoder-decoder model for learning canonical morphological segmentation. Our model combines character-level sequence-to-sequence transformation with a language model over canonical segments. We obtain up to 4% improvement over a strong character-level encoder-decoder baseline for three languages. Our model outperforms the previous state-of-the-art for two languages, while eliminating the need for external resources such as large dictionaries. Finally, by comparing the performance of encoder-decoder and classical statistical machine translation systems trained with and without corpus counts, we show that including corpus counts is beneficial to both approaches.
Anthology ID:
K17-1020
Volume:
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)
Month:
August
Year:
2017
Address:
Vancouver, Canada
Editors:
Roger Levy, Lucia Specia
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
184–194
URL:
https://aclanthology.org/K17-1020/
DOI:
10.18653/v1/K17-1020
Cite (ACL):
Tatyana Ruzsics and Tanja Samardžić. 2017. Neural Sequence-to-sequence Learning of Internal Word Structure. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 184–194, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Neural Sequence-to-sequence Learning of Internal Word Structure (Ruzsics & Samardžić, CoNLL 2017)
PDF:
https://aclanthology.org/K17-1020.pdf