2019
SenZi: A Sentiment Analysis Lexicon for the Latinised Arabic (Arabizi)
Taha Tobaili, Miriam Fernandez, Harith Alani, Sanaa Sharafeddine, Hazem Hajj, Goran Glavaš
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Arabizi is an informal written form of dialectal Arabic transcribed in Latin alphanumeric characters. It is demonstrably popular on chat platforms and social media, yet it suffers from a severe lack of natural language processing (NLP) resources. As a result, texts written in Arabizi are often disregarded in sentiment analysis tasks for Arabic. In this paper we describe the creation of a sentiment lexicon for Arabizi that was enriched with word embeddings. The result is a new Arabizi lexicon consisting of 11.3K positive and 13.3K negative words. We evaluated this lexicon by classifying the sentiment of Arabizi tweets, achieving an F1-score of 0.72. We provide a detailed error analysis to present the challenges that impact the sentiment analysis of Arabizi.
2014
On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter
Hassan Saif, Miriam Fernandez, Yulan He, Harith Alani
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier’s feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining a high classification performance while reducing the data sparsity and shrinking the feature space.
2011
Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
Yulan He, Chenghua Lin, Harith Alani
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
2010
Exploring English Lexicon Knowledge for Chinese Sentiment Analysis
Yulan He, Harith Alani, Deyu Zhou
CIPS-SIGHAN Joint Conference on Chinese Language Processing
2004
Data Driven Ontology Evaluation
Christopher Brewster, Harith Alani, Srinandan Dasmahapatra, Yorick Wilks
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)