Language and Dialect Discrimination Using Compression-Inspired Language Models

Paul McNamee

Abstract

The DSL 2016 shared task continued previous evaluations from 2014 and 2015 that facilitated the study of automated language and dialect identification. This paper describes results for this year’s shared task and for several related experiments conducted at the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLTCOE). Previously, the HLTCOE has explored the use of compression-inspired language modeling for language and dialect identification, using news, Wikipedia, blog post, and Twitter corpora. The technique we have relied upon is based on prediction by partial matching (PPM), a state-of-the-art text compression technique. Because adaptive compression and language modeling are closely related, such compression techniques can also be applied to multi-way text classification problems, and previous studies have examined tasks such as authorship attribution, email spam detection, and topical classification. We applied our approach as a multi-class decision in which each dialect or language was considered a candidate for a given shared task input line. Results for test-set A were in accord with our expectations; however, results for test-sets B and C appear to be markedly worse. We had not anticipated that test-set B (social media) input lines would contain multiple communications in differing languages, and we had not expected the test-set C (dialectal Arabic) data to be represented phonetically instead of in native orthography.
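
To illustrate the multi-class decision described in the abstract, the following is a minimal sketch of classification by minimum code length: one character-level model is trained per language, and an input line is assigned to the language whose model would encode it in the fewest bits. This is not the author's system and it is not PPM. It substitutes a fixed-order character n-gram model with additive smoothing for PPM's blending of context orders, and the class name `CharNGramModel`, the function `classify`, the parameter values, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


class CharNGramModel:
    """Fixed-order character model with additive smoothing.

    A simplified stand-in for a PPM-style model (assumption): real PPM
    blends several context orders with an escape mechanism.
    """

    def __init__(self, order=2, alpha=0.1, vocab_size=256):
        self.order = order            # number of context characters
        self.alpha = alpha            # smoothing constant (assumed value)
        self.vocab_size = vocab_size  # nominal alphabet size (assumed value)
        self.context_counts = Counter()
        self.ngram_counts = Counter()

    def train(self, text):
        # Pad with spaces so the first characters also have a context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.ngram_counts[(context, padded[i])] += 1
            self.context_counts[context] += 1

    def bits(self, text):
        """Estimated code length of `text` in bits under this model."""
        padded = " " * self.order + text
        total = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            num = self.ngram_counts[(context, padded[i])] + self.alpha
            den = self.context_counts[context] + self.alpha * self.vocab_size
            total -= math.log2(num / den)
        return total


def classify(line, models):
    """Return the label whose model gives `line` the shortest code length."""
    return min(models, key=lambda label: models[label].bits(line))


# Toy training data (hypothetical; real systems train on large corpora).
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato se sentó en la alfombra",
}

models = {}
for label, text in training.items():
    model = CharNGramModel()
    model.train(text)
    models[label] = model

if __name__ == "__main__":
    for line in ["the dog sat on the mat", "el gato salta sobre el perro"]:
        print(line, "->", classify(line, models))
```

Because every candidate language is scored on the whole input line, a sketch like this would share the limitation noted for test-set B: a line that mixes several languages still receives a single label.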

Anthology ID: W16-4825
Volume: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month: December
Year: 2016
Address: Osaka, Japan
Editors: Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue: VarDial
Publisher: The COLING 2016 Organizing Committee
Pages: 195–203
URL: https://aclanthology.org/W16-4825/
Cite (ACL): Paul McNamee. 2016. Language and Dialect Discrimination Using Compression-Inspired Language Models. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 195–203, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal): Language and Dialect Discrimination Using Compression-Inspired Language Models (McNamee, VarDial 2016)
PDF: https://aclanthology.org/W16-4825.pdf
