Show Some Love to Your n-grams: A Bit of Progress and Stronger n-gram Language Modeling Baselines

Ehsan Shareghi, Daniela Gerz, Ivan Vulić, Anna Korhonen


Abstract
In recent years neural language models (LMs) have set state-of-the-art performance on several benchmark datasets. While the reasons for their success and their computational demands are well documented, a comparison between neural models and more recent developments in n-gram models has been neglected. In this paper, we examine recent progress in the n-gram literature, running experiments on 50 languages covering all morphological language families. Experimental results show that a simple extension of Modified Kneser-Ney outperforms an LSTM language model on 42 languages, while a word-level Bayesian n-gram LM (Shareghi et al., 2017) outperforms the character-aware neural model (Kim et al., 2016) on average across all languages, and outperforms its extension that explicitly injects linguistic knowledge (Gerz et al., 2018) on 8 languages. Further experiments on larger Europarl datasets for 3 languages indicate that neural architectures are able to outperform computationally much cheaper n-gram models: n-gram training is up to 15,000x faster. Our experiments illustrate that standalone n-gram models are natural choices for resource-lean or morphologically rich languages, and that recent progress has significantly improved their accuracy.
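The abstract's central baseline builds on Kneser-Ney smoothing. As a rough illustration of the underlying idea only (not the authors' implementation, and without the multiple-discount "Modified" variant or the paper's extension), here is a minimal interpolated Kneser-Ney bigram estimator; all function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(corpus, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model from a list of
    token lists. Returns a function p(word, prev) -> probability.

    Sketch only: single fixed discount, no OOV handling, bigram order.
    """
    bigrams = Counter()
    context_count = Counter()        # how often each word occurs as a context
    continuation = defaultdict(set)  # distinct contexts that precede each word
    followers = defaultdict(set)     # distinct words that follow each context
    for sent in corpus:
        for prev, word in zip(sent, sent[1:]):
            bigrams[(prev, word)] += 1
            context_count[prev] += 1
            continuation[word].add(prev)
            followers[prev].add(word)
    # total number of distinct bigram types, for the continuation distribution
    total_types = sum(len(ctxs) for ctxs in continuation.values())

    def prob(word, prev):
        # continuation probability: fraction of bigram types ending in `word`
        p_cont = len(continuation[word]) / total_types
        c_prev = context_count[prev]
        if c_prev == 0:
            return p_cont  # unseen context: back off entirely
        # absolute discounting of the bigram count, plus interpolation
        discounted = max(bigrams[(prev, word)] - discount, 0) / c_prev
        lam = discount * len(followers[prev]) / c_prev
        return discounted + lam * p_cont

    return prob
```

For any observed context, the returned probabilities sum to one over the vocabulary, which is the property that makes the discount-plus-interpolation decomposition a valid distribution.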
Anthology ID:
N19-1417
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
4113–4118
URL:
https://aclanthology.org/N19-1417
DOI:
10.18653/v1/N19-1417
Cite (ACL):
Ehsan Shareghi, Daniela Gerz, Ivan Vulić, and Anna Korhonen. 2019. Show Some Love to Your n-grams: A Bit of Progress and Stronger n-gram Language Modeling Baselines. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4113–4118, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Show Some Love to Your n-grams: A Bit of Progress and Stronger n-gram Language Modeling Baselines (Shareghi et al., NAACL 2019)
PDF:
https://aclanthology.org/N19-1417.pdf
Video:
https://vimeo.com/361819010