An Assessment of Language Identification Methods on Tweets and Wikipedia Articles

Pedro Vernetti, Larissa Freitas


Abstract
Language identification is the task of determining the language in which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches to language identification are the N-gram and stopword models. In this paper, these two models were tested on different types of documents: short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
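To make the two approaches named in the abstract concrete, here is a minimal sketch of each. It is not the authors' implementation: the stopword lists, the function names (`char_ngrams`, `identify_by_stopwords`, `identify_by_ngrams`), the trigram size, and the overlap score are all assumptions chosen for illustration.

```python
from collections import Counter

# Tiny illustrative stopword lists (assumed for this sketch, not from the paper).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "a", "that"},
    "pt": {"o", "a", "e", "de", "que", "em", "um", "para", "é"},
    "es": {"el", "la", "y", "de", "que", "en", "un", "para", "es"},
}


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def identify_by_stopwords(text):
    """Pick the language whose stopword list matches the most whitespace-separated tokens."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def identify_by_ngrams(text, profiles, n=3):
    """Pick the language whose n-gram profile shares the most n-gram counts with the text."""
    grams = char_ngrams(text, n)

    def overlap(profile):
        # Counter returns 0 for n-grams missing from the profile.
        return sum(min(count, profile[gram]) for gram, count in grams.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


# Hypothetical training sentences standing in for per-language corpora.
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "pt": char_ngrams("a raposa marrom pula sobre o cachorro preguiçoso"),
}
print(identify_by_stopwords("o gato é um animal que vive em casa"))
print(identify_by_ngrams("the dog", profiles))
```

The stopword score depends on whole-word matches, so it may find little to count in very short or informally spelled texts such as tweets, whereas character n-grams still produce evidence from partial words. This is one plausible reason to compare the two models across both document types.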
Anthology ID:
2020.winlp-1.15
Volume:
Proceedings of the Fourth Widening Natural Language Processing Workshop
Month:
July
Year:
2020
Address:
Seattle, USA
Editors:
Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
Venue:
WiNLP
Publisher:
Association for Computational Linguistics
Pages:
58–60
URL:
https://aclanthology.org/2020.winlp-1.15
DOI:
10.18653/v1/2020.winlp-1.15
Cite (ACL):
Pedro Vernetti and Larissa Freitas. 2020. An Assessment of Language Identification Methods on Tweets and Wikipedia Articles. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 58–60, Seattle, USA. Association for Computational Linguistics.
Cite (Informal):
An Assessment of Language Identification Methods on Tweets and Wikipedia Articles (Vernetti & Freitas, WiNLP 2020)
Video:
 http://slideslive.com/38929551