Mining Lexical Variants from Microblogs: An Unsupervised Multilingual Approach

User-generated content has become a recurrent resource for NLP tools and applications, and many efforts have therefore been made lately to handle the noise present in short social media texts. Normalisation techniques have proven useful for identifying and replacing lexical variants in some of the most informal genres, such as microblogs. However, annotated data is needed in order to train and evaluate these systems, and producing it is usually costly. Until now, most approaches have focused on English and have not taken into account demographic variables such as user location and gender. In this paper we describe the methodology used for automatically mining a corpus of variant-normalisation pairs from English and Spanish tweets.


Introduction
User-generated content (UGC), and especially the microblog genre, has become an interesting resource for Natural Language Processing (NLP) tools and applications. Exploiting this real-time stream of multilingual textual data offers many advantages. Popular platforms such as Twitter have a heterogeneous user base of almost 600 million users who generate more than 60 million new tweets every day. For this reason, Twitter has become one of the most widely used sources of textual data for NLP, with applications such as sentiment analysis (Tumasjan et al., 2010) or real-time event detection (Sakaki et al., 2010). Recent advances in machine translation and information retrieval have also made extensive use of UGC for both training and evaluation purposes. However, tweets can be very noisy and sometimes hard to understand for both humans and NLP applications (Wang and Ng, 2013), so an additional preprocessing step is usually required.
There have been different perceptions regarding the lexical quality of social media (Rello and Baeza-Yates, 2012), and some have even suggested that 40% of Twitter messages are "pointless babble" (PearAnalytics, 2009). Most of the out-of-vocabulary (OOV) words present in social media texts can be catalogued as lexical variants (e.g. "See u 2moro" → "See you tomorrow"), that is, words lexically related to their canonical form.
Text normalisation techniques have proven useful for cleaning short, informal texts such as tweets. However, the evaluation of these systems requires annotated data, which usually involves costly human annotation. There is previous work on automatically constructing normalisation dictionaries, but until now most approaches have focused on English and have not taken demographic variables into account. In this paper we describe the methodology used for automatically mining lexical variants associated with a set of headwords from English and Spanish tweets. These formal-informal pairs can later be used to train and evaluate existing social media text normalisation systems. Additional Twitter metadata such as geographic location and user gender is also collected, opening the possibility of modelling and analysing gender- or location-specific variants. This paper is organised as follows. We describe related work in Section 2. We then describe our variant mining methodology in Section 3. The obtained results are presented in Section 4. Section 5 draws conclusions and outlines future work.

Related Work
One way to handle the performance drop of NLP tools on user-generated content (Foster et al., 2011) is to re-train existing models on these informal genres (Gimpel et al., 2011; Liu et al., 2011b). Other approaches make use of preprocessing techniques such as text normalisation in order to minimise social media textual noise (Han et al., 2013), where OOV words are first identified and then substituted using lexical and phonetic edit distances; both OOV detection and translation dictionaries were used to enhance precision and recall. Moreover, the creative nature of informal writing and the low availability of manually-annotated corpora can make the improvement and evaluation of these systems challenging.
Motivated by the lack of annotated data and the large amount of OOV words contained in Twitter, several approaches for automatically constructing a lexical normalisation dictionary have been proposed. In (Gouws et al., 2011), a normalisation lexicon is generated from Twitter based on distributional and string similarity (Lodhi et al., 2002). Using a similar technique, a wider-coverage dictionary is constructed in (Han et al., 2012) based on contextually similar (OOV, IV) pairs. More recently, (Hassan and Menezes, 2013) introduced another context-based approach using random walks over a contextual similarity graph.
Distributional methods can have some drawbacks: they rely heavily on pairwise comparisons, which makes them computationally expensive, and since normalisation candidates are selected based on context similarity, they can be sensitive to domain-specific variants that share similar contexts. Moreover, these approaches have focused on extracting English lexical variants from social media texts, yet, due to the heterogeneity of its users, lexical distributions can be influenced by geographical factors (Eisenstein et al., 2010) or even gender (Thomson and Murachver, 2001).
To the best of our knowledge, there are no multilingual approaches for mining lexical variants from short, noisy texts that also take demographic variables into account. For this reason, we present an unsupervised method for mining English and Spanish lexical variants from Twitter that also collects demographic and contextual information. The obtained pairs can later be used for training and evaluating text normalisation and inverse text normalisation systems.

Lexical Variant Mining
Lexical variants are typically formed from their standard forms through regular processes (Thurlow and Brown, 2003), which can be modelled using a set of basic character transformation rules such as letter insertion, deletion or substitution (Liu et al., 2011a), e.g. ("tmrrw" → "2morrow"), and combinations of these ("2moro"). The relation between formal and informal pairs is not always one-to-one: two different formal words can share the same lexical variant ("t" in Spanish can represent "te" or "tú"), and one formal word can have many different variants (e.g. "see you" is commonly shortened to "c ya" or "see u"). Unlike previous approaches based on contextual and distributional similarity, we have chosen to generate variant candidates from a set of headwords using transformation rules. These candidates are later validated based on their presence on a popular microblog service, used in this case as a high-coverage corpus.
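The "lexically related" constraint above can be made concrete with a standard edit-distance check. The following sketch (not the authors' implementation; the distance threshold t <= 3 follows Han et al., 2013, cited below) tests whether an OOV token is a plausible variant of a headword:

```python
# Minimal sketch: a variant is "lexically related" to a headword when
# their Levenshtein distance is small (t <= 3 per Han et al., 2013).
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_candidate_variant(oov: str, headword: str, t: int = 3) -> bool:
    """True when the OOV token lies within edit distance t of the headword."""
    return edit_distance(oov.lower(), headword.lower()) <= t

print(is_candidate_variant("tmrrw", "tomorrow"))  # → True (distance 3)
```

Note that heavily compressed variants such as "2moro" exceed this purely character-level threshold, which is one motivation for the rule-based generation described next.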

Candidate Generation
We have defined a set of six basic transformation rules (see Table 1) in order to automatically generate candidate lexical variants from the 300k most frequent words of the Web 1T 5-gram (English) (Brants and Franz, 2006) and SUBTLEX-SP (Spanish) (Cuetos et al., 2011) corpora.

Rule                          Example
a) Character duplication      "goal" → "gooal"
b) Number transliteration     "cansados" → "cansa2"
c) Character deletion         "tomorrow" → "tomrrw"
d) Character replacement      "friend" → "freend"
e) Character transposition    "maybe" → "mabye"
f) Phonetic substitution      "coche" → "coxe"
g) Combination of the above   "coche" → "coxeee"

As modelling some variants may require more than one basic operation, and lexically-related variants usually lie within an edit distance t, where t <= 3 (Han et al., 2013), the aforementioned rules were implemented using an engine based on stacked transducers, allowing a maximum of three concurrent transformations:

(a) Character duplication: each character of the word is duplicated n times (∀ n>0, n<4), generating the corresponding candidate variants.
(b) Number transliteration: Words and numbers are transliterated following the language rules defined in Table 2.
(c) Character deletion: candidate variants are generated by deleting characters from the original word (e.g. "tomorrow" → "tomrrw").

(d) Character replacement: candidate variants are generated by replacing n characters (∀ n>0, n<7) with their neighbours on a QWERTY keyboard, at an edit distance of 1.

(e) Character transposition: candidate lexical variants are generated by exchanging the positions of adjacent characters.

(f) Phonetic substitution: a maximum of three character n-grams are substituted with similar-sounding characters, following different rules for Spanish (Table 3) and English (Table 4).
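A few of the rules above can be sketched as simple candidate generators. This is an illustrative toy implementation, not the stacked-transducer engine described above, and the transliteration table is a small assumed sample rather than the full Table 2:

```python
# Toy sketch of rules (a), (b) and (e); the real system composes up to
# three such transformations with stacked transducers.
TRANSLIT_EN = {"to": "2", "for": "4", "ate": "8"}  # assumed sample of Table 2

def duplicate_chars(word, max_dup=3):
    """Rule (a): duplicate each character up to max_dup extra times."""
    out = set()
    for i, ch in enumerate(word):
        for n in range(1, max_dup + 1):
            out.add(word[:i + 1] + ch * n + word[i + 1:])
    return out

def transliterate(word, table=TRANSLIT_EN):
    """Rule (b): replace letter sequences with similar-sounding digits."""
    return {word.replace(src, dst) for src, dst in table.items() if src in word}

def transpose(word):
    """Rule (e): exchange each pair of adjacent characters."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)}

print(sorted(transpose("maybe")))  # includes the Table 1 example "mabye"
```

Each generator over-produces: most outputs are never used by real users, which is why the Twitter validation step described in Section 3 is needed.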

Intentionality Filtering
Given an OOV word a and its IV version b, we have extracted character transformation rules from a to b using the longest common substring (LCS) algorithm (see Table 5). These lists of transformations were encoded as a numeric array storing the count of each transformation. We have used NLTK (Bird, 2006) and the SequenceMatcher Python class in order to extract these sets of transformations, also taking into account the position of the character (at the beginning, middle or end of the word). A two-class SVM (Vapnik, 1995) model has been trained using a linear kernel with a corpus composed of 4200 formal-variant pairs extracted from Twitter 1, SMS 2 and a corpus of the 4200 most common misspellings 3. In Table 6 we show the k-fold cross-validation results (k=10) of the model, which obtained an 87% F1. This model has been used to filter out the English candidate variants classified as non-intentional.
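The feature extraction step can be sketched with Python's difflib.SequenceMatcher, mentioned above. The SVM training itself is omitted, and the feature names are illustrative, not the authors' exact encoding:

```python
# Sketch of transformation-feature extraction: count edit operations
# between a variant and its formal form, split by operation type and
# by coarse position in the word (beginning / middle / end).
from difflib import SequenceMatcher

def position(i, length):
    """Coarse position of an edit within the variant."""
    if i == 0:
        return "begin"
    if i >= length - 1:
        return "end"
    return "middle"

def transformation_features(variant, formal):
    """Return a dict of counts like {'replace_begin': 1, 'delete_middle': 2}."""
    feats = {}
    for op, i1, i2, j1, j2 in SequenceMatcher(None, variant, formal).get_opcodes():
        if op == "equal":
            continue
        key = f"{op}_{position(i1, len(variant))}"
        feats[key] = feats.get(key, 0) + 1
    return feats

print(transformation_features("c", "see"))  # single replacement at the beginning
```

Such count vectors are then the input to a binary classifier separating intentional variants (e.g. "2moro") from unintentional misspellings.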

Twitter Search
The variants that passed the previous filtering step were searched in the real-time Twitter stream over a period of two months, processing more than 7.5 million tweets. Their absolute frequencies n were used as a weighting factor, discarding unobserved candidates and keeping only those with n > 0. Additionally, variants occurring in languages other than English or Spanish were ignored by using the language identification tags present in the Twitter metadata.
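The validation step amounts to counting candidates over a language-filtered stream and keeping those with non-zero frequency. The sketch below uses simplified (text, lang) tweet pairs as a stand-in for the actual Twitter metadata:

```python
# Simplified sketch of candidate validation against a tweet stream:
# count occurrences in English/Spanish tweets and keep candidates with n > 0.
from collections import Counter

ALLOWED_LANGS = {"en", "es"}  # language identification tags from the metadata

def count_candidates(tweets, candidates):
    """Count candidate occurrences over a stream of (text, lang) pairs."""
    freq = Counter()
    for text, lang in tweets:
        if lang not in ALLOWED_LANGS:
            continue
        for token in text.lower().split():
            token = token.strip(".,!?¡¿")  # naive tokenisation for the sketch
            if token in candidates:
                freq[token] += 1
    return freq

def select_variants(freq, candidates):
    """Discard generated candidates never observed in the stream (n = 0)."""
    return {c for c in candidates if freq[c] > 0}

tweets = [("c u 2moro!", "en"), ("hoy toca madrugar", "es"), ("à demain", "fr")]
cands = {"2moro", "xfa"}
print(select_variants(count_candidates(tweets, cands), cands))  # → {'2moro'}
```

In the real pipeline the counts also serve as weights, so rare but attested variants are kept with low confidence rather than treated the same as frequent ones.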
There were important differences between the final numbers of selected candidates for the two languages, with Spanish yielding six times fewer variant pairs than English (see Table 7). Spanish uses diacritics that are commonly omitted in informal writing; for this reason, there is a higher number of possible combinations for candidate words that do not correspond to valid or actually used lexical variants.

Results
Besides the original message and the context of the searched variant, additional metadata has been collected from each tweet, such as the gender and location of the user. Gender is not explicitly available in Twitter, so we applied a heuristic approach based on the first name as reported in the user profile. To this end, two lists of male and female names were used: the 1990 US census data 4 and the popular baby names from the US Social Security Administration's statistics between 1960 and 2010 5. We have analysed the gender and language distribution of the six transformation rules across the mined pairs (see Figure 1). On the one hand, lexical variants generated by duplicating characters were the most popular, especially among female users, with 5% more than their male counterparts. On the other hand, variants generated by character replacement and deletion were found 2% more often in tweets from male users. The differences between English and Spanish were notable, mostly regarding the use of transliterations, which were not found in Spanish tweets, and phonetic substitutions, ten times less frequent than in English tweets.
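The name-based gender heuristic can be sketched as a simple lookup, resolving only unambiguous first names. The name sets below are tiny illustrative samples standing in for the census and SSA lists:

```python
# Hedged sketch of the first-name gender heuristic; the real system uses
# the 1990 US census and SSA baby-name lists, sampled here for illustration.
MALE_NAMES = {"james", "john", "robert"}      # assumed sample
FEMALE_NAMES = {"mary", "patricia", "linda"}  # assumed sample

def guess_gender(profile_name):
    """Assign a gender from the first token of the profile name, or None
    when the name is missing, ambiguous or not in either list."""
    tokens = profile_name.strip().lower().split()
    if not tokens:
        return None
    first = tokens[0]
    if first in MALE_NAMES and first not in FEMALE_NAMES:
        return "male"
    if first in FEMALE_NAMES and first not in MALE_NAMES:
        return "female"
    return None

print(guess_gender("Mary Smith"))  # → "female"
```

Returning None for unknown or ambiguous names keeps the gender-specific statistics conservative, at the cost of excluding users with nicknames or non-US names.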
For the distribution of transformations across geographic areas, we only took into account the countries where the analysed languages have official status. Lexical variants found in tweets from other areas were grouped under the "Non-official" label (see Figure 2). The biggest differences were found in the use of transliterations (higher in the UK and Ireland, with more than 5%) and phonetic substitutions (higher among Pakistani users, with more than 22%). Transformation frequencies from non-official English-speaking countries were very similar to those registered for users based in the United States and Canada.
The Spanish results were less uniform and showed more variance with respect to the use of character duplication (57% in Argentina), character replacement (more than 24% in Mexico and Guatemala) and character transposition (more than 19% for users from Cuba, Colombia and Mexico) (see Figure 3).

Conclusions and Future Work
In this paper we have described a multilingual, unsupervised method for mining English and Spanish lexical variants from Twitter, with the aim of closing the gap caused by the lack of annotated corpora. The obtained pairs can later be used for the training and evaluation of text normalisation systems without the need for costly human annotations. Furthermore, the gathered demographic and contextual information can be used to model and generate variants similar to those found in specific geographic areas. This has interesting applications in the field of inverse text normalisation, which we leave to future work. We also intend to explore the benefits of feature engineering for the detection and categorisation of lexical variants using machine learning techniques.