Larissa Freitas


2021

2020

Language identification is the task of determining the language which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches for language identification are the N-grams and stopwords models. In this paper, these two models were tested on different types of documents such as short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
In our work, we applied different techniques for the task of genre classification using lyrics. Utilizing our dataset with lyrics of typical genres in Brazil divided into seven classes, we apply some models used in machine learning and deep learning classification tasks. We explore the performance of usual models for text classification using an input in the Portuguese language. We also compare the use of RNN and classic machine learning approaches for text classification, exploring the most used methods in the field.