Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa - A Large Romanian Sentiment Data Set

Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf’s law distribution, which is known to govern natural language. We also demonstrate the generalization capacity of using SOMs for the clustering of word embeddings on another recently-introduced Romanian data set, for text categorization by topic.

We experiment with two baseline methods on our novel data set. The first baseline employs string kernels, an approach based on low-level features (character n-grams) that was found to work well for sentiment analysis across multiple languages, e.g. English (Giménez-Pérez et al., 2017), Chinese (Zhang et al., 2008) and Arabic (Popescu et al., 2017), requiring no linguistic resources besides a labeled training set of samples. The second baseline employs bag-of-word-embeddings (Fu et al., 2018), an approach based on high-level features (clusters of word embeddings generated by k-means) that attains good results in various text classification tasks (Cozma et al., 2018; Fu et al., 2018), including sentiment analysis. As an additional contribution, we replace the k-means clustering algorithm in the second baseline method with self-organizing maps (SOMs) (Kohonen, 2001), obtaining better results because the generated clusters of word embeddings are closer to the Zipf's law distribution (see Figure 2), which is known to govern natural language (Powers, 1998). To our knowledge, we are the first to apply SOMs to cluster word embeddings, showing performance gains for both word2vec (Mikolov et al., 2013) and Romanian BERT (Dumitrescu et al., 2020) embeddings. We also demonstrate the generalization capacity of using SOMs in the bag-of-word-embeddings on another recently-introduced Romanian data set, for the task of text categorization by topic.
In summary, our contribution is twofold:
• We introduce LaRoSeDa, one of the largest corpora for Romanian sentiment analysis, along with a set of strong baselines to be used as reference in future research.
• To our knowledge, we are the first to employ SOMs as a technique to cluster word embeddings. We provide empirical evidence showing that SOMs produce better results than the popular k-means.

Related Work
To date, a small number of works targeting sentiment classification in the Romanian language have been published. Preceding the sentiment analysis efforts on Romanian texts, there are a few studies on subjectivity that have introduced two corpora built through cross-lingual projections from English to Romanian (Mihalcea et al., 2007) or through machine translation (Banea et al., 2008). An extensive study conducted by Banea et al. (2011) looks at sentiment and subjectivity from a computational linguistics perspective, in a multilingual setup in which Romanian is also included. However, in these initial works, Romanian is studied only from a subjectivity perspective, which does not go down to the level of polarity. On our topic (i.e. sentiment analysis in Romanian), the first study that we have found describes two word sets tagged with polarity for Romanian and Russian (Sokolova and Bobicev, 2009). Gînscȃ et al. (2011) introduced a sentiment analysis service intended for multiple languages, that also supports Romanian. They perform sentiment identification using a list of manually-built triggers which, to our knowledge, is not publicly available. Another effort (Colhon et al., 2016) in creating an opinion lexicon with polarity annotations introduced a collection of 2,521 Romanian tourist reviews and an extensive linguistic analysis of the corpus. The data set is not released for public use. Similarly, we did not find any public link to RoEmoLex, a lexicon with approximately 11,000 Romanian words tagged for emotion and sentiment. The only Romanian data set annotated for sentiment that we have found freely available is rather small, with 1,000 movie reviews manually extracted from several blogs and sites (Russu et al., 2014). With 15,000 reviews, our corpus is much larger.

Figure 1: The rating distribution of Romanian product reviews. Negative reviews are those rated with one or two stars, while positive reviews are those rated with four or five stars. Neutral reviews are not included in our data set. Best viewed in color.

Data Set
In order to build LaRoSeDa, we collected product reviews from one of the largest e-commerce websites in Romania. Along with the textual content of each review, we collected the associated star ratings in order to automatically assign labels to the collected text samples. Following the same approach used for data sets containing English reviews (Blitzer et al., 2007; Maas et al., 2011; Pang and Lee, 2005), we assigned positive labels to the reviews rated with four or five stars and negative labels to the reviews rated with one or two stars. However, the star rating might not always reflect the polarity of the text. We thus acknowledge that the automatic labeling process is not optimal, i.e. some labels might be noisy. Since automatic labeling based on star ratings is a commonly accepted practice for opinion mining data sets of product reviews, we leave the analysis of noisy labels and manual labeling for future work.
We also imitate the data collection approach used for English review data sets (Blitzer et al., 2007; Maas et al., 2011; Pang and Lee, 2005), selecting a balanced set of Romanian reviews. More precisely, LaRoSeDa is formed of a total of 15,000 reviews that are perfectly balanced, i.e. half of them (7,500) are positive reviews and the other half (7,500) are negative reviews. In Figure 1, we show the distribution of reviews with respect to the star ratings. We note that most of the negative reviews (5,561) are rated with one star. Similarly, most of the positive reviews (6,238) are rated with five stars. Hence, the corpus is highly polarized. We divide LaRoSeDa into a training set containing 80% of the data samples and a test set containing the remaining 20%.
In Table 1, we present the number of positive and negative reviews inside each subset, along with the number of words. Our data set contains a total of 540,287 words, with an average of 36 words per review. We observe that positive reviews contain 235,474 words (43.6%) and negative reviews contain 304,813 words (56.4%). We note that, in negative reviews, people are likely to complain about several points or to explain what is wrong with the reviewed products. This could provide a natural explanation for the fact that the negative reviews contain more words than the positive reviews.
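The labeling rule and the 80%/20% split described in this section can be illustrated with a short sketch. The function names, the use of a seeded random shuffle, and the treatment of every rating other than 1, 2, 4 or 5 stars as neutral are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random


def label_from_stars(stars):
    """Map a star rating to a polarity label; None marks a neutral review."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # assumption: any other rating is neutral and is dropped


def build_split(reviews, train_fraction=0.8, seed=0):
    """reviews: list of (text, stars). Returns (train, test) lists of (text, label)."""
    labeled = [(text, label_from_stars(stars)) for text, stars in reviews]
    labeled = [(text, label) for text, label in labeled if label is not None]
    rng = random.Random(seed)  # assumption: a plain random shuffle before splitting
    rng.shuffle(labeled)
    cut = int(round(train_fraction * len(labeled)))
    return labeled[:cut], labeled[cut:]
```

With 15,000 labeled reviews, `build_split` would return 12,000 training samples and 3,000 test samples.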

Methods
String kernels. A simple language-independent and linguistic-theory-neutral approach is to interpret text samples as sequences of characters (strings) and to use character n-grams as features.
The number of character n-grams is usually much higher than the number of samples, so representing the text samples as feature vectors may require a lot of space. String kernels provide an efficient way to avoid storing and using the feature vectors (primal form), by representing the data through a kernel matrix (dual form). Each component $K_{ij}$ in a kernel matrix represents the similarity between data samples $x_i$ and $x_j$. In our experiments, we use the histogram intersection string kernel (HISK) (Ionescu et al., 2014, 2016) as the similarity function. For two strings $x_i$ and $x_j$ over a set of characters $S$, HISK is defined as follows:
$$k^{\cap}(x_i, x_j) = \sum_{g \in S^n} \min\{\#(x_i, g), \#(x_j, g)\},$$
where $\#(x, g)$ is a function that returns the number of occurrences of n-gram $g$ in $x$, and $n$ is the length of n-grams. While being a rather shallow approach, string kernels attained strong results in some specific tasks. For instance, string kernels ranked first in the Arabic Dialect Identification tasks of VarDial 2017 and VarDial 2018.

Bag-of-word-embeddings. Following the seminal paper of Mikolov et al. (2013) introducing word2vec, word embeddings became one of the mainstream approaches in various computational linguistics tasks (Cheng et al., 2018; Conneau et al., 2017; Cozma et al., 2018; Fu et al., 2018; Kim, 2014; Kiros et al., 2015; Shen et al., 2018; Torki, 2018; Zhou et al., 2018). In order to build the bag-of-word-embeddings (BOWE), we first trained word2vec on the collected Romanian reviews using the continuous bag-of-words (CBOW) model. Before training, we transformed all letters to lowercase and removed punctuation. In addition to word2vec, we consider a recently introduced Romanian BERT model (Dumitrescu et al., 2020) as an alternative way to produce word embeddings, which is likely to produce much better results, considering the success of BERT (Devlin et al., 2019) in English NLP tasks.
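As a side illustration of the string kernel baseline described earlier in this section, the following minimal sketch computes a histogram intersection kernel over character n-grams, assuming the standard form in which each n-gram contributes the minimum of its two occurrence counts. The function names are hypothetical, and treating several n-gram lengths by summing their contributions is an assumption; no kernel normalization is applied here.

```python
from collections import Counter

import numpy as np


def ngram_counts(text, n_values=(3, 4, 5)):
    """Count every character n-gram of the given lengths in text."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def hisk(x, y, n_values=(3, 4, 5)):
    """Sum over shared n-grams of the minimum occurrence count in x and y."""
    cx, cy = ngram_counts(x, n_values), ngram_counts(y, n_values)
    if len(cx) > len(cy):
        cx, cy = cy, cx  # iterate over the smaller histogram
    return sum(min(count, cy[g]) for g, count in cx.items())


def kernel_matrix(texts, n_values=(3, 4, 5)):
    """Dual-form representation: K[i, j] is the similarity of texts i and j."""
    size = len(texts)
    K = np.zeros((size, size))
    for i in range(size):
        for j in range(i, size):
            K[i, j] = K[j, i] = hisk(texts[i], texts[j], n_values)
    return K
```

Only the kernel matrix is stored, so the (potentially very large) n-gram feature vectors never need to be materialized for the whole corpus at once.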
Instead of averaging the word embeddings to obtain document-level representations (Shen et al., 2018), we follow a different and more effective path suggested by some recent works (Cozma et al., 2018; Fu et al., 2018). More specifically, we cluster the word embeddings collected from the entire training set using k-means. For a document $D$ of $n$ words, $D = (w_1, w_2, \ldots, w_n)$, a word embedding model, be it word2vec or BERT, outputs a matrix of $n \times m$ components (or a set of $n$ $m$-dimensional vectors), the $m$-dimensional vector at index $i$ corresponding to word $w_i$. We apply clustering on the word vectors extracted from all training documents, thus obtaining a set of $k$ clusters. A document $D$ is then represented as a bag-of-word-embeddings (histogram) $H = (h_1, h_2, \ldots, h_k)$ in which each component $h_i$ retains the number of word embeddings from the document $D$ that fall in cluster $i$, where $i \in \{1, 2, \ldots, k\}$. We note that the size of the bag-of-word-embeddings is equal to the number of clusters $k$. In the case of BERT, we emphasize that, although the embedding vector of a word depends on the context, it is likely that the embedding vectors corresponding to a specific word will fall in the same cluster. Hence, BOWE is able to cope well with this situation.
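The construction of the histogram can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments: the plain Lloyd's k-means, the random embeddings standing in for word2vec or BERT outputs, and all function names are assumptions.

```python
import numpy as np


def assign(vectors, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per vector."""
    # Builds an (N, k, m) array, which is fine for a small illustration.
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(vectors, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a (k, m) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = vectors[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids


def bowe_histogram(doc_vectors, centroids):
    """h_i = number of the document's word embeddings that fall in cluster i."""
    return np.bincount(assign(doc_vectors, centroids), minlength=len(centroids))


# Toy data: three "documents" of 5, 7 and 9 words with 8-dimensional embeddings.
rng = np.random.default_rng(0)
train_docs = [rng.normal(size=(n, 8)) for n in (5, 7, 9)]
centroids = kmeans(np.vstack(train_docs), k=4)  # cluster all training word vectors
H = bowe_histogram(train_docs[0], centroids)    # histogram of length k for one document
```

The entries of `H` sum to the number of words in the document, and its length equals the number of clusters regardless of document length.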
Replacing k-means with SOMs. Quantitative linguistics studies (Powers, 1998) have pointed out that, given a corpus of text documents, the frequency of any word is inversely proportional to its rank in the frequency table, giving rise to a Zipf's law distribution of words in natural language. However, the k-means algorithm tends to ignore the data density, producing equally-sized clusters. We therefore propose to replace the k-means algorithm with an approach that takes into account the density in the word embedding space, producing a set of clusters that follow Zipf's law. We propose to perform clustering using self-organizing maps (SOMs) (Kohonen, 2001), since these models are known to preserve the topological properties of the input space. Indeed, Figure 2 shows that SOMs produce clusters of Romanian word embeddings closer to the Zipf's law distribution than k-means. It is important to emphasize that k-means can produce clusters of different sizes, as shown in Figure 2. Our observation refers only to the fact that the data density is not particularly modeled by the k-means optimization process, while SOMs are optimized by shifting the neural units following the density of the data (units tend to migrate where the space is more dense). Our observation with respect to k-means is also confirmed by other studies. For example, Raykov et al. (2016) note that: "even when all other implicit geometric assumptions of k-means are satisfied, it will fail to learn a correct, or even meaningful, clustering when there are significant differences in cluster density". Since natural language involves such significant differences (due to the presence of Zipf's law), we believe that k-means is a sub-optimal choice.
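To make the mechanism concrete, here is a minimal NumPy sketch of a self-organizing map whose units can be used as cluster centers, together with a Zipf-like reference curve against which ranked cluster sizes may be compared. It is not the open source implementation used in the experiments: the rectangular grid, the Gaussian neighborhood, the linear decay schedules, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def best_matching_unit(x, weights):
    """Index of the unit with the smallest cosine distance to x."""
    norms = np.linalg.norm(weights, axis=1) * np.linalg.norm(x) + 1e-12
    return int(np.argmax(weights @ x / norms))


def train_som(data, rows, cols, epochs=20, lr0=0.25, seed=0):
    """Train a rows x cols SOM; returns a (rows * cols, m) weight matrix."""
    rng = np.random.default_rng(seed)
    k = rows * cols  # requires len(data) >= k
    # Initialize the units from randomly chosen training samples.
    weights = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    # Grid coordinates of each unit, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # assumed linear decay
            sigma = sigma0 * (1.0 - frac) + 1e-3    # assumed shrinking radius
            x = data[idx]
            bmu = best_matching_unit(x, weights)
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Pull the winning unit and its grid neighbors toward the sample.
            weights += lr * influence[:, None] * (x - weights)
            step += 1
    return weights


def cluster_sizes(data, weights):
    """Number of samples assigned to each unit, sorted in decreasing order."""
    labels = np.array([best_matching_unit(x, weights) for x in data])
    return np.sort(np.bincount(labels, minlength=len(weights)))[::-1]


def zipf_reference(k, total):
    """Ideal sizes proportional to 1 / rank, scaled to sum to total."""
    inv_rank = 1.0 / np.arange(1, k + 1)
    return total * inv_rank / inv_rank.sum()
```

Because every sample moves its best matching unit and that unit's neighbors, dense regions of the embedding space attract more updates and therefore more units, which is the density-following behavior discussed above. Plotting `cluster_sizes` against `zipf_reference` on the same ranks gives a rough visual check in the spirit of Figure 2.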

Experiments
Corpora. First and foremost, we perform experiments on LaRoSeDa with the goal of introducing some benchmark results on our new data set. We also perform experiments on MOROCO, a data set with Moldavian and Romanian news articles, with the goal of showing the generalization capacity of using SOMs instead of k-means.

Experimental setup. On LaRoSeDa, we present two sets of results, one based on the established train-test split and one based on 10-fold cross-validation. On MOROCO, we choose to present 10-fold cross-validation results for the intra-dialect multi-class categorization by topic task, on the 18,161 samples written in the Romanian dialect.

Parameter and model choices. For HISK, we combined character 3-grams, 4-grams and 5-grams. For BOWE-word2vec and BOWE-BERT, we set the number of clusters to $k = 500$, just as Cozma et al. (2018). We trained word2vec to produce 300-dimensional Romanian word embeddings, while the Romanian BERT outputs 768-dimensional embeddings. In the learning stage, we employed the linear Support Vector Machines (SVM) implementation from Scikit-learn (Pedregosa et al., 2011), providing as input pre-computed kernels. For BOWE-word2vec and BOWE-BERT, we opt for the PQ kernel, based on prior findings. We set the regularization parameter of SVM to $C = 10^3$ in all the experiments. We also fuse HISK with BOWE-word2vec or BOWE-BERT in the dual form by summing up the corresponding kernel matrices. We employed an open source implementation of SOMs. We used the default choices for most hyperparameters, the modifications being detailed next. We set the learning rate to 0.25 and the number of epochs to 200. Before starting the training, the SOM is configured to randomly choose a number of training samples equal to the number of expected outputs. We opted for the cosine distance between data samples and the SOM's weights.
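The dual-form fusion and the learning stage can be sketched as follows. This is an illustration under stated assumptions: `SVC(kernel="precomputed")` is used here because it is a Scikit-learn estimator that accepts pre-computed kernel matrices, the two linear kernels on random features are stand-ins for HISK and the PQ kernel, and the helper name `fuse_kernels` is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC


def fuse_kernels(*kernels):
    """Fuse representations in the dual form by summing their kernel matrices."""
    fused = np.zeros_like(kernels[0], dtype=float)
    for K in kernels:
        fused += K
    return fused


# Toy data: 40 samples, two classes, two feature views standing in for the
# low-level (string kernel) and high-level (BOWE) representations.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
feat_a = rng.normal(size=(40, 6)) + y[:, None]
feat_b = rng.normal(size=(40, 4)) + 2 * y[:, None]

K_a = feat_a @ feat_a.T  # placeholder for the HISK matrix
K_b = feat_b @ feat_b.T  # placeholder for the PQ kernel matrix on BOWE histograms
K = fuse_kernels(K_a, K_b)

# 80% / 20% split expressed as index arrays.
test_idx = np.arange(0, 40, 5)
train_idx = np.setdiff1d(np.arange(40), test_idx)

clf = SVC(kernel="precomputed", C=1e3)
clf.fit(K[np.ix_(train_idx, train_idx)], y[train_idx])
# At test time, rows are test samples and columns are the training samples.
pred = clf.predict(K[np.ix_(test_idx, train_idx)])
```

Summing kernel matrices corresponds to concatenating the underlying feature maps, which is why the fusion can be performed without ever building the primal feature vectors.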
Results on LaRoSeDa. In Table 2, we present the results on LaRoSeDa. Among the individual baselines, we observe that HISK attains the best accuracy rates, surpassing all BOWE configurations. We also note that by replacing k-means with SOMs, the accuracy rate of BOWE-word2vec grows by 4 or 5%. The improvements brought by SOMs can be explained by the fact that, unlike k-means, SOMs produce clusters that are closer to the Zipf's law distribution. This is supported by the word embedding counts per cluster illustrated in Figure 2. When we combine HISK with BOWE-BERT, we notice significant performance gains.

Results on MOROCO. In Table 3, we present the results on MOROCO, for the Romanian intra-dialect multi-class categorization by topic task. We notice that HISK attains better results than BOWE-BERT with k-means and BOWE-BERT with SOMs, although the differences in terms of accuracy seem to be smaller. As on LaRoSeDa, we observe a significant improvement (higher than 5%) when k-means is replaced by SOMs. There is an observable improvement over the plain HISK when HISK is combined with BOWE-BERT based on k-means. Nonetheless, we notice a larger improvement when we combine HISK and BOWE-BERT based on SOMs.

Conclusion
In this paper, (i) we introduced LaRoSeDa, a large data set for polarity classification of Romanian reviews, and (ii) we employed self-organizing maps, a clustering approach that preserves the density of words in the embedding space, resulting in a more effective bag-of-word-embeddings representation. Our top accuracy rates on LaRoSeDa are 89.54% for the cross-validation procedure and 90.90% on the test set. We note that SOMs contributed significantly to attaining these high accuracy rates. We conclude that the combination of HISK and BOWE-BERT with SOMs is a strong baseline which should encourage future research in proposing non-trivial models for Romanian polarity classification. Furthermore, the results obtained on MOROCO confirm that SOMs provide better accuracy rates than k-means when it comes to building document-level representations based on clustering word embeddings.