<?xml version="1.0" encoding="UTF-8"?>
<algorithms version="110505">
<algorithm name="SectLabel" version="110505">
<variant no="0" confidence="0.001968">
<title confidence="0.966342">
Challenges of studying and processing dialects in social media
</title>
<author confidence="0.995963">
Anna Katrine Jørgensen, Dirk Hovy, and Anders Søgaard
</author>
<affiliation confidence="0.997241">
University of Copenhagen
</affiliation>
<address confidence="0.8667955">
Njalsgade 140
DK-2300 Copenhagen S
</address>
<email confidence="0.998262">
soegaard@hum.ku.dk
</email>
<sectionHeader confidence="0.993846" genericHeader="abstract">
Abstract
</sectionHeader>
<bodyText confidence="0.983551045454546">
Dialect features typically do not make
it into formal writing, but flourish
in social media. This enables large-
scale variational studies. We fo-
cus on three phonological features of
African American Vernacular English
and their manifestation as spelling
variations on Twitter. We discuss to
what extent our data can be used to
falsify eight sociolinguistic hypothe-
ses. To go beyond the spelling
level, we require automatic analysis
such as POS tagging, but social me-
dia language still challenges language
technologies. We show how both
newswire- and Twitter-adapted state-
of-the-art POS taggers perform signif-
icantly worse on AAVE tweets, sug-
gesting that large-scale dialect studies
of language variation beyond the sur-
face level are not feasible with out-of-
the-box NLP tools.
</bodyText>
<sectionHeader confidence="0.999132" genericHeader="keywords">
1 Introduction
</sectionHeader>
<bodyText confidence="0.999958525">
Dialectal and sociolinguistic studies are tradi-
tionally based on interviews of small sets of
speakers of each variety. The Atlas of North
American English (Labov et al., 2005) has
been the reference point for American dialec-
tology since its completion, but is based on
only 762 speakers. Dallas is represented by
four subjects, the New York City dialect by
six, etc. Data is costly to collect, and, as a
consequence, scarce.
Written language was traditionally used for
formal purposes, and therefore differed in
style from colloquial, spoken language. How-
ever, with the rise of social media platforms
and the vast production of user generated con-
tent, differences between written and spoken
language diminish. A number of recent papers
have explored social media with respect to
sociolinguistic and dialectological questions
(Rao et al., 2010; Eisenstein, 2013; Volkova
et al., 2013; Doyle, 2014; Hovy et al., 2015;
Volkova et al., 2015; Johannsen et al., 2015;
Hovy and Søgaard, 2015; Eisenstein, to ap-
pear). Emails, chats and social media posts
serve purposes similar to those of spoken lan-
guage, and consequently, features of spoken
language, such as interjections, ellipses, and
phonological variation, have found their way
into this type of written language. Our work
differs from most previous approaches by in-
vestigating several phonological spelling cor-
relates of a specific language variety.
The 284 million active users on Twitter post
more than half a billion tweets every day, and
some fraction of these tweets are geo-located.
Eisenstein (2013) and Doyle (2014) studied
the effect of phonological variation across the
US on spelling in Twitter posts, and both
found some evidence that dialectal phonolog-
ical variation has a direct impact on spelling
</bodyText>
<page confidence="0.970939">
9
</page>
<note confidence="0.988359">
Proceedings of the ACL 2015 Workshop on Noisy User-generated Text, pages 9–18,
Beijing, China, July 31, 2015. © 2015 Association for Computational Linguistics
</note>
<bodyText confidence="0.999917634146342">
on Twitter. Both authors note various method-
ological problems using Twitter as a source of
evidence for dialectal and sociolinguistic stud-
ies, including what we refer to as USER POP-
ULATION BIAS and TOPIC BIAS below.
In this paper, we collect Twitter data to
test eight (8) research hypotheses originating
in sociolinguistic studies of African-American
Vernacular English (AAVE). The hypotheses
relate to three phonological features of AAVE,
namely derhotacization, interdental fricative
mutation, and backing in /str/. Some of our
findings shed an interesting light on existing
hypotheses, but our main focus in this paper
is to identify the methodological challenges in
using social media for testing sociolinguistic
hypotheses.
Almost all previous large-scale variational
studies using social media have focused on
spelling variation and lexical markers of di-
alect. Ours is no exception. However, di-
alectal variation also manifests itself at the
morpho-syntactic level. To investigate this
variation, we also annotate some data with
part-of-speech (POS) tags, using two NLP
systems. This approach reveals a severe
methodological challenge: sentences contain-
ing AAVE features are associated with signif-
icant drops in tagger performance.
This result challenges large-scale varia-
tional studies on social media that require au-
tomated analyses. The observed drops in per-
formance are prohibitive for studying syntac-
tic and semantic variation, and we believe the
NLP community should make an effort to pro-
vide better and more robust dialect-adapted
models to researchers and industry interested
in processing social media. The findings also
raise the question of whether NLP technology
systematically disadvantages groups of non-
standard language users.
</bodyText>
<subsectionHeader confidence="0.990296">
1.1 Contributions
</subsectionHeader>
<listItem confidence="0.998711913043478">
• We identify eight (8) research hypotheses
from the sociolinguistic literature. We
test them in a study of the distribution of
three phonological features typically as-
sociated with AAVE in Twitter data. We
test the features’ correlations with vari-
ous demographic variables. Our results
falsify the hypothesis that AAVE is male-
dominated (but see §3.1).
• We identify five (5) methodological
problems common to variational studies
in social media and discuss to what ex-
tent they compromise the validity of re-
sults.
• Further, we show that state-of-the-art
newswire and Twitter POS taggers per-
form much worse on tweets containing
AAVE features. This suggests an addi-
tional limitation to large-scale sociolin-
guistic research using social media data,
namely that it is hard to analyze varia-
tion beyond the lexical level with current
tools.
</listItem>
<subsectionHeader confidence="0.995279">
1.2 Sociolinguistic hypotheses
</subsectionHeader>
<bodyText confidence="0.9997259">
AAVE is, in contrast to other North American
dialects, not geographically restricted. Al-
though variation in AAVE does exist, AAVE
in urban settings has been established as a
uniform system with suprasegmental norms
(Ash and Myhill, 1986; Labov et al., 2005;
Labov, 2006; Wolfram, 2004). This paper
considers the following eight (8) hypotheses
from the sociolinguistic literature about
AAVE as an ethnolect:
</bodyText>
<footnote confidence="0.4263052">
H1: AAVE is an urban ethnolect (Rickford, 1999;
Wolfram, 2004).
H2: AAVE features are more present in the Gulf states
than in the rest of the United States (Rastogi et al.,
2011).
</footnote>
<page confidence="0.947115">
10
</page>
<listItem confidence="0.8813898125">
H3: The likelihood of speaking AAVE correlates
negatively with income and educational level,
and AAVE is more frequently appropriated by
men (Rickford, 1999; Rickford, 2010).
H4: Derhotacization is more frequent in African
Americans than in European Americans (Labov
et al., 2005; Rickford, 1999).
H5: Derhotacization is negatively correlated with in-
come and educational level (Rickford, 1999).
H6: Interdental fricative mutation is more frequent in
AAVE than in European American speech (Pol-
lock et al., 1998; Thomas, 2007).
H7: Interdental fricative mutation is predominantly
found in the Gulf states (Rastogi et al., 2011).
H8: Backing in /str/ (to /skr/) is unique to AAVE
(Rickford, 1999; Thomas, 2007; Labov, 2006).
</listItem>
<bodyText confidence="0.999811823529412">
Hypotheses 1–8 are investigated by corre-
lating the distribution of phonological variants
in geo-located tweets with demographic infor-
mation.
Our method is similar to those proposed
by Eisenstein (2013) and Doyle (2014), lend-
ing statistical power to sociolinguistic analy-
ses, and circumventing traditional issues with
data collection such as the Observer’s Para-
dox (Labov, 1972b; Meyerhof, 2006). Our
work differs from previous work by studying
phonological rules associated with specific di-
alects, as well as considering a wide range of
actual sociolinguistic research hypotheses, but
our main focus is the methodological prob-
lems doing this kind of work, as well as as-
sessing the limitations of such work.
</bodyText>
<subsectionHeader confidence="0.998867">
1.3 Methodological problems
</subsectionHeader>
<bodyText confidence="0.9992818125">
One obvious challenge relating social media
data to sociolinguistic studies is that there
is generally not a one-to-one relationship
between phonological variation and spelling
variation. People, in other words, do not spell
the way they pronounce. Eisenstein (2013)
discusses this challenge ((1) WRITING BIAS),
but shows that effects of the phonological en-
vironment carry over to social media, which
he interprets as evidence that there is at least
some causal link between pronunciation and
spelling variation.
A related problem is that non-speakers of
AAVE may cite known features of AAVE with
specific purposes in mind. They may use it in
citations, for example:
</bodyText>
<listItem confidence="0.9943135">
(1) My 5 year old sister texted me on my mums phone
saying “why did you take a picher in da bafroom”
lool okay b (Twitter, Feb 21 2015)
or in meta-linguistic discussions:
(2) Whenever I hear a black person inquire about the
location of the ”bafroom”... (Twitter, Jan 20 2015)
</listItem>
<bodyText confidence="0.98040325">
We refer to these phenomena as (2) META-
USE BIAS. This bias is important with rare
phenomena. With ”bafroom”, it seems that
about 1 in 20 occurrences on Twitter are meta-
uses. Meta-uses may also serve social func-
tions. AAVE features are used as cultural
markers by Latinos in North Carolina (Carter,
2013), for example.
Some of the research hypotheses consid-
ered (H3 and H5) relate to demographic vari-
ables such as income and educational levels.
While we do not have socio-economic infor-
mation about the individual Twitter user, we
can use the geo-located tweets to study the
correlation between socio-economic variables
and linguistic features at the level of cities or
ZIP codes.1
Eisenstein et al. (2011) note that this level
of abstraction introduces some noise. Since
Twitter users do not form representative sam-
ples of the population, the mean income for a
city or ZIP code is not necessarily the mean
income for the Twitter users in that area. We
refer to this problem as the (3) USER POPU-
LATION BIAS.
Another serious methodological problem
known as (4) GALTON’S PROBLEM (Naroll,
1961; Roberts and Winters, 2013), is the ob-
servation that cross-cultural associations are
1Unlike many others, we rely on physical locations
rather than user-entered profile locations. See Graham
et al. (2014) for discussion.
</bodyText>
<page confidence="0.995706">
11
</page>
<bodyText confidence="0.99998318367347">
often explained by geographical diffusion. In
other words, it is the problem of discrimi-
nating historical from functional associations
in cross-cultural surveys. Briefly put, when
we sample tweets and income-levels from US
cities, there is little independence between
the city data points. Linguistic features dif-
fuse geographically and do not change at ran-
dom, and we can therefore expect to see more
spurious correlations than usual. Like with
the famous example of chocolate and Nobel
Prize winners, our positive findings may be
explained by hidden background variables. A
positive correlation between income-level and
a phonological pattern may also have cultural,
religious or geographical explanations.
Reasons to be less worried about GAL-
TON’S PROBLEM in our case, include that a)
we only consider standard hypotheses from
the sociolinguistics literature and not a huge
set of previously unexplored, automatically
generated hypotheses, b) we sample data
points at random from all across the US, giv-
ing us a very sparse distribution compared
to country-level data, but more notably, c)
location is an important, explicit variable in
our study. GALTON’S PROBLEM is typically
identified by clustering tests based on loca-
tion (Naroll, 1961). Obviously, the phono-
logical features considered here cluster geo-
graphically, as evidenced by our geographic
correlations in Table 2, but since our studies
explicitly test the influence of location, it is
not the case for most of the hypotheses con-
sidered here that geographic diffusion is the
underlying explanation for something else.
In §3, we discuss whether these four
methodological problems compromise the va-
lidity of our findings. One other methodolog-
ical problem that may be relevant for other
studies of dialect in social media, is almost
completely irrelevant for our study: It is often
important to control for topic in dialectal and
sociolinguistic studies (Bamman et al., 2014),
e.g., when studying the lexical preferences of
speakers of urban ethnolects. We call this
problem (5) TOPIC BIAS. Using word pairs
with equivalent meanings for our studies, we
implicitly control for topic (but see §3.1).
</bodyText>
<table confidence="0.999142612903226">
Feature Positive Negative Total count
brotha brother 9528
foreva forever 3673
hea here 4352
lova lover 1273
motha mother 4668
/r/ /Ø/ or /@/ ova over 3441
sista sister 5325
wateva whatever 2974
wea where 5153
total 40,387
kreet street 1226
/str/ /skr/ skrong strong 1629
skrip strip 1101
total 3956
brova brother 3715
dat that 2610
deez these 4477
/D/ /d/or/v/ dem them 3645
dey they 2434
dis this 2135
mova mother 2462
total 21,478
mouf mouth 3861
nuffin nothing 2861
souf south 1102
/T/ /t/ or /f/ teef teeth 1857
trough through 2804
trow throw 1090
total 13,575
All tweets 79,396
</table>
<tableCaption confidence="0.999683">
Table 1: Word pairs and counts
</tableCaption>
<sectionHeader confidence="0.872754" genericHeader="introduction">
2 Data and Method
</sectionHeader>
<bodyText confidence="0.99911725">
We focus on derhotacization, backing in /str/,
and interdental fricative mutation. Specifi-
cally, we collect data to study the following
four phonological variations (the latter two are
both instances of interdental fricative muta-
tion): a) derhotacization: /r/ → /Ø/ or /@/,
b) /str/ → /skr/, c) /D/ → /d/ or /v/ and, d) /T/
→ /t/ or /f/.
In non-rhotic dialects, /r/ is either not pro-
nounced or is approximated as a vocalization
in the surface form, when /r/ is in a pre-vocalic
position. This can result in an elongation of
the preceding vowel or in an off-glide schwa
/@/, e.g., guard → /gA:d/, car → /ka:/, fear →
/fi@/ (Thomas, 2007).
Backing in /str/ denotes the substitution
</bodyText>
<page confidence="0.994">
12
</page>
<bodyText confidence="0.999508512195122">
of /str/ for /skr/ in word-initial positions re-
sulting in pronunciations such as /skrit/ for
street, /skrON/ for strong and /skrIp/ for strip.
Backing in /str/ has been reported to be a
unique feature in AAVE, as it is unheard
in other North American dialects (Rickford,
1999; Labov, 1972a; Thomas, 2007).
The two interdental fricative mutations re-
late to substitutions of /D/ and /T/ by /d/, /v/
and /t/, /f/ in words such as that and mother
or nothing and with. It has been reported
that mutations of /D/ and /T/ are more com-
mon among African Americans than among
European Americans and that the frequency
of the mutations is inversely correlated with
socio-economic levels and formality of speak-
ing (Rickford, 1999).
We follow Eisenstein (2013) and Doyle
(2014) in assuming that spelling variation may
be a result of phonological differences and
select 25 word pairs for our study (Table
1). For each word pair, we collect positive
(e.g., ”skreet”) and negative occurrences (e.g.,
”street”), resulting in a total number of 79,396
tweets. The word pairs were chosen based on
the unambiguity, frequency and representabil-
ity of the phonological variations. Uniquely,
backing in /str/ is represented by three word
pairs of high similarity, which is due to phono-
logical restrictions on the variation of /str/ to
/skr/ and to the fact that backing in /str/ is a
very rare phenomenon.
The Twitter data used in the experiments
was gathered from May to August 2014 us-
ing TwitterSearch.2 We only collected tweets
with geo-locations in the contiguous United
States, from users reporting to tweet in En-
glish, and which were also predicted to be
in English using langid.py.3 The demo-
graphic information was obtained from the
2012 American Community Survey from the
</bodyText>
<footnote confidence="0.9984245">
2https://pypi.python.org/pypi/TwitterSearch/
3https://pypi.python.org/pypi/langid
</footnote>
<bodyText confidence="0.9809626">
United States Census Bureau, as was informa-
tion about population sizes in US cities. We
linked each tweet in our data to demographic
information using the geo-coordinates of the
tweet and its nearest city in the following way.
</bodyText>
<figureCaption confidence="0.9254915">
Figure 1: The ratio of AAVE examples over
US states
</figureCaption>
<bodyText confidence="0.999940642857143">
For the 110 US cities of ≥ 200,000 inhabi-
tants, we gathered information about: a) per-
centage high school graduates, b) percent-
age below poverty level, c) population size,
d) median household income, e) percentage of
males, f) percentage between 15 and 24 years
old, g) percentage of African Americans and
h) unemployment rate.
The overall geographical distribution of our
data is shown in Figure 1. The map shows that
we see more tweets with AAVE features in
the Gulf states, in particular Louisiana, Mis-
sissippi and Georgia. This lends preliminary
support to H2.
</bodyText>
<sectionHeader confidence="0.999118" genericHeader="method">
3 Results with phonological features
</sectionHeader>
<bodyText confidence="0.999745111111111">
Occurrences of the phonological variations
related to AAVE were correlated with the
geographic and demographic variables using
Spearman’s ρ (Table 2–3), at the level of in-
dividual tweets. From the correlation coeffi-
cients we see that the distributions of the three
chosen AAVE rules are best explained by lon-
gitude, the distinction between the Gulf states
and the rest of the US, and by the distribution
</bodyText>
<page confidence="0.998998">
13
</page>
<tableCaption confidence="0.994739">
– = p &gt; 0.05, * = 0.05 &gt; p &gt; 0.01, ** = p &lt; 0.01, *** = p &lt; 0.0005
Shading corresponds to negative correlations
Table 3: Demographic correlations
</tableCaption>
<bodyText confidence="0.944546">
Feature word pairs male black 15-24 citysize highschool income poverty unemployment
</bodyText>
<equation confidence="0.952916347826087">
skreet/street – – – ** * ** * **
skrong/strong ** *** – * ** ** ** *
skrip/strip * – * *** *** – *** ***
/str/ —. /skr/
brova/brother *** *** *** *** *** – *** ***
dat/that – *** – – – ** ** –
deez/these – – – ** *** – ** ***
dem/them***********–– –
dey/they *** *** ** * ** ** *** –
dis/this–*****–– –**
mova/mother *** *** *** – *** *** – ***
/D/ —. /d/ or /v/
total *** *** *** *** – ** – *
/r/ —. /fb/ or /@/
brotha/brother *** ** – – ** – –
foreva/forever ** *** – – – – ** –
hea/here – *** ** *** *** *** *** *
lova/lover ––––****** –
motha/mother –**–*–**– –
ova/over ******–––****** –
sista/sister * *** – – ** – –
wateva/whatever *** *** – – – *** *** –
wea/where ** *** *** *** *** *** ***
</equation>
<figure confidence="0.8814922">
total ***
*** ***
***
***
***
***
–
–
*
– -
***
total ***
***
***
***
/T/ —. /t/ or /f/
total *** *** *** – *** – *** ***
mouf/mouth ** – – – – –
nuffin/nothing *** *** *** *** *** *** – ***
souf/south *** – ** – ** – *** ***
teef/teeth – – – – ** – – –
trough/through – – –
trow/throw * – –
–
***
** – *
** * **
– –
*
**
</figure>
<bodyText confidence="0.994996209302326">
of African Americans (with explained vari-
ances in the range of 0.03-0.05).
Our data suggests that H2, namely that
AAVE is more prevalent in the Gulf states,
is probably true. Hypothesis H1, that AAVE
is an urban ethnolect, lends some support in
our data, but the correlation with urbanicity
is weaker (and negatively correlated or non-
significant in half of the cases).
Our data only lends limited support to the
first half of hypothesis H3. While derhota-
cization and /str/ correlate (negatively) signif-
icantly with income levels, we see no signifi-
cant correlations within /D/ and a positive cor-
relation within /T/. However, our data does not
suggest that H3 is false, either. Our data does
lend support to the more specific hypothesis
H5, namely that derhoticization is sensitive to
income level, while the strong correlation with
the distribution of African Americans lends
support to H4.
More interestingly, our data suggests that
women use AAVE features more often than
men, i.e., there is a negative correlation be-
tween male gender and AAVE features, con-
trary to the second half of H3, namely that
AAVE is more frequently appropriated by
men. Note, however, that our gender ratios
are aggregated for city areas, and with the de-
mographic bias of Twitter, these correlations
should be taken with a grain of salt. Consider-
ing the small gender ratio differences, we also
compute correlations between our linguistic
features and gender using the Rovereto Twit-
ter N-gram Corpus (RTC) (Herdagdelen and
Baroni, 2011).4 The RTC corpus contains in-
formation about the gender of the tweeter as-
sociated with n-grams. While there is too lit-
tle data in the corpus to correlate gender and
backing in /str/, derhotacization and both in-
terdental fricative mutations (/D/ → /d/ or /v/
and /T/ → /t/ or /f/) correlate significantly with
women. Out of our words, 10 correlate sig-
</bodyText>
<footnote confidence="0.932756">
4http://clic.cimec.unitn.it/amac/
twitter_ngram/
</footnote>
<page confidence="0.99812">
14
</page>
<bodyText confidence="0.94444">
Feature word pairs latitude longitude urban Gulf
</bodyText>
<equation confidence="0.906226">
– = p &gt; 0.05, * = 0.05 &gt; p &gt; 0.01, ** = p &lt; 0.01, ***
= p &lt; 0.0001
</equation>
<tableCaption confidence="0.90128">
Shading corresponds to negative correlations
Table 2: Geographic correlations
</tableCaption>
<bodyText confidence="0.9995195">
nificantly with female speakers; seven with
male. The correlations are found in Table 4.
For each feature, certain words correlate sig-
nificantly with female speakers, while oth-
ers correlate significantly with male speakers.
Consequently, neither our Twitter data nor the
Twitter data in the RTC suggest that AAVE is
more often appropriated by men. We discuss
whether our data provides a basis for falsify-
ing the second half of H3 in §3.1.
The high correlation between mutations of
/D/ and longitude supports the presence of
these mutations of /D/ in non-standard north-
ern varieties (Rickford, 1999). The mutation
of /T/ is also correlated with longitude, and
with latitude, suggesting an Eastern Ameri-
can feature rather than a distinct Southern fea-
ture (Rickford, 1999). The variation in muta-
tions could possibly be explained by both ge-
ography as well as the distribution of African
Americans.
There is evidence in our data that backing
in /str/ (to /skr/) is appropriated more often by
AAVE speakers than by speakers of other di-
alects (H8). There is also a negative correla-
tion between latitude and backing in /str/ as
well as a strong positive correlation with the
Gulf states, suggesting that backing in /str/ is a
feature primarily seen in this region. The data
thereby suggests that the feature is appropri-
ated significantly more by African Americans
than by speakers of the Southern dialect.
In sum, while our data lends support to sev-
eral of the common hypotheses from the so-
ciolinguistics literature, we found one unex-
pected tendency, going against the second half
of H3, namely that AAVE features were found
more often with females. We now discuss this
finding in light of the methodological prob-
lems discussed in §1.2.
</bodyText>
<table confidence="0.997626739130435">
Feature word pairs male
brotha-brother **
foreva-forever **
hea-here *
lova-lover –
/r/ —. /fb/ or /@/ motha-mother **
ova-over **
sista-sister –
wateva-whatever –
wea-where **
brova-brother *
dat-that **
deez-these **
D —. /d/ or /v/ dem-them **
dey-they **
dis-this **
mova-mother –
mouf-mouth **
nuffin-nothing **
souf-south **
T —. /f/ or /t/ teef-teeth –
trough-through **
trow-throw **
</table>
<tableCaption confidence="0.967264">
– = p &gt; 0.05, * = 0.05 &gt; p &gt; 0.01, ** = p &lt; 0.01
Shading corresponds to negative correlations
Table 4: Gender correlations in RTC
</tableCaption>
<subsectionHeader confidence="0.996744">
3.1 Is AAVE not male-dominated?
</subsectionHeader>
<bodyText confidence="0.999428">
We now discuss whether our data falsifies
the second half of H3, one methodological
problem at a time (see §1.3). If WRITING
BIAS were to bias our conclusions, one gen-
der should be more likely to exhibit more
phonologically motivated spelling variation.
This may actually be true, since it is well-
</bodyText>
<figure confidence="0.999356969230769">
total
***
*** *** ***
trow/throw
***
** – ***
brotha/brother *** ***
** – ***
*** * ***
*** ***
lova/lover *** *** ** ***
motha/mother––*** –
ova/over *** – – ***
/r/
***
***
***
***
wateva/whatever
wea/where
total
skreet/street
skrong/strong
skrip/strip
total
***
***
***
***
***
***
***
*** ** ***
***
***
–
* *** ***
*** ***
–
** – ***
sista/sister – ***
/str/
–
brova/brother *** *** *** ***
dat/that *** * – ***
deez/these****– –
/D/ dem/them *** *** – ***
dey/they *** *** – ***
dis/this *** – – ***
mova/mother * *** *** ***
mouf/mouth *** – – ***
nuffin/nothing *** *** *** ***
souf/south *** *** *** ***
teef/teeth ** – ** ***
trough/through – ***
total * *** *** ***
/T/
– –
foreva/forever
hea/here
***
***
**
***
***
</figure>
<page confidence="0.964008">
15
</page>
<bodyText confidence="0.999755809523809">
established that women tend to be more lin-
guistically creative and have larger vocabular-
ies (Labov, 1990; Brizendine, 2006). Whether
women are also more meta-linguistic (META-
USE BIAS), has to the best of our knowl-
edge not been studied. Since genders are al-
most equally geographically distributed, and
since Twitter is generally considered gender-
balanced, neither USER POPULATION BIAS
nor GALTON’S PROBLEM is likely to bias our
conclusions. TOPIC BIAS, on the other hand,
may. While our semantically equivalent pairs
control for topic, the pragmatics sometimes
differ. Just like code-switching is a strategy
for bilinguals, using the spelling motha in-
stead of mother could mean something, say
irony, which one gender is more prone for. In
sum, while we do believe that our data should
lead sociolinguists to question whether AAVE
is male-dominated, our findings may be bi-
ased by WRITING BIAS.
</bodyText>
<sectionHeader confidence="0.995763" genericHeader="method">
4 POS tagging
</sectionHeader>
<bodyText confidence="0.999788266666667">
We need automated syntactic analysis to study
morpho-syntactic dialectal variation. We
ran a state-of-the-art POS tagger trained on
newswire5 (STANFORD), as well as two state-
of-the-art POS taggers adapted to Twitter,
namely GATE6 and ARK7, on our data. We
had one professional annotator manually an-
notate 100 positive (AAVE) and 100 nega-
tive (non-AAVE) sentences using the coarse-
grained tags proposed by Petrov et al. (2011).
We map the tagger outputs to those tags and
report tagging accuracies. See Table 5 for re-
sults, with A(+, −) being the absolute dif-
ference in performance from non-AAVE to
AAVE.
</bodyText>
<footnote confidence="0.977766333333333">
5http://nlp.stanford.edu/software/
tagger.shtml
6https://gate.ac.uk/wiki/
twitter-postagger.html
7http://www.ark.cs.cmu.edu/
TweetNLP/
</footnote>
<table confidence="0.997828">
STANFORD GATE ARK
AAVE 61.4 79.1 77.5
non-AAVE 74.5 83.3 77.9
Δ(+,-) 13.1 4.2 0.4
</table>
<tableCaption confidence="0.998666">
Table 5: POS tagging accuracies (%)
</tableCaption>
<bodyText confidence="0.999778529411765">
While GATE is certainly better than STAN-
FORD on our data, performance is generally
poor and prohibitive of many downstream ap-
plications and variational studies. We also
note that both the best and worst tagger per-
form significantly worse on AAVE tweets
than on non-AAVE tweets. What are the
sources of error in the AAVE data? One ex-
ample is the word brotha, which is tagged
both as an adverb, a verb, and as X (foreign
words, mark-up, etc.). Contractions like finna
(”fixing to” meaning ”going to”) and gimme
(”give me”) are often tagged as particles, but
annotated as verbs or, as in the case of witchu
(”with you”), as a preposition. Another inter-
esting mistake is tagging adverbial like as a
verb.
</bodyText>
<sectionHeader confidence="0.998833" genericHeader="conclusions">
5 Conclusion
</sectionHeader>
<bodyText confidence="0.999948769230769">
Large-scale variational studies of social me-
dia can be used to question received wisdom
about dialects, lending support to some soci-
olinguistic research hypotheses and question-
ing others. However, we caution that our re-
sults were biased by several factors, includ-
ing the representativity of the social media
user bases. We also show how state-of-the-
art POS taggers are more likely to fail on
dialects in social media. The performance
drops may be considered prohibitive of study-
ing morpho-syntactic patterns across dialects
and as a challenge to us as a community.
</bodyText>
<sectionHeader confidence="0.987933" genericHeader="references">
References
</sectionHeader>
<bodyText confidence="0.5651825">
Sharon Ash and John Myhill. 1986. Linguis-
tic correlates of inter-ethnic contact. In David
</bodyText>
<page confidence="0.994112">
16
</page>
<reference confidence="0.995123674157304">
Sankoff, editor, Diversity and Diachronyc,
pages 33–44, Amsterdam and Philadelphia.
John Benjamins Publishing Co.
David Bamman, Jacob Eisenstein, and Tyler Sch-
noebelen. 2014. Gender identity and lexical
variation in social media. Journal of Sociolin-
guistics, 18.
Louann Brizendine. 2006. The Female Brain.
Morgan Road Books.
Phillip Carter. 2013. Shared spaces, shared
structures: Latino social formation and African
American English in the U.S. south. Journal of
Sociolinguistics, 17:66–92.
Gabriel Doyle. 2014. Mapping dialectal varia-
tion by querying social media. In EACL, pages
98–106, Gothenburg, Sweden. Association for
Computational Linguistics.
Jacob Eisenstein, Noah A. Smith, and Eric Xing.
2011. Discovering sociolinguistic associations
with structured sparsity. In ACL.
Jacob Eisenstein. 2013. Phonological factors in
social media writing. In NAACL Workshop on
Language Analysis in Social Media, pages 11–
19, Atlanta, Georgia. Association for Computa-
tional Linguistics.
Jacob Eisenstein. to appear. Systematic patterning
in phonologically-motivated orthographic vari-
ation. Journal of Sociolinguistics.
Mark Graham, Scott Hale, and Devin Gaffney.
2014. Where in the world are you? Geoloca-
tion and language identification on Twitter. The
Professional Geographer, 66(4).
Amac Herdagdelen and Marco Baroni. 2011.
Stereotypical gender actions can be extracted
from web text. Journal of the American So-
ciety for Information Science and Technology,
62:1741–1749.
Dirk Hovy and Anders Søgaard. 2015. Tag-
ging performance correlates with author age. In
ACL.
Dirk Hovy, Anders Johannsen, and Anders
Søgaard. 2015. User review-sites as a
source for large-scale sociolinguistic studies. In
WWW.
Anders Johannsen, Dirk Hovy, and Anders
Søgaard. 2015. Cross-lingual syntactic varia-
tion over age and gender. In CoNLL.
William Labov, Sharon Ash, and Charles Boberg.
2005. The Atlas of North American En-
glish Phonetics, Phonology and Sound Change.
Mouton de Gruyter, New York, NY.
William Labov. 1972a. Language in the Inner
City: Studies in the Black English Vernacular.
University of Pennsylvania Press.
William Labov. 1972b. Sociolinguistic Patterns.
University of Pennsylvania Press, Philadelphia,
PA.
William Labov. 1990. The intersection of sex and
social class in the course of linguistic change.
Language Variation and Change, 2:205–254, 7.
William Labov. 2006. Unendangered dialects,
endangered people. In Natalie Schilling-Estes,
editor, GURT’06.
Miriam Meyerhof. 2006. Introducing Sociolin-
guistics. Routledge.
R Naroll. 1961. Two solutions to Galton’s prob-
lem. Philosophy of Science, 28.
Slav Petrov, Dipanjan Das, and Ryan McDonald.
2011. A universal part-of-speech tagset. CoRR
abs/1104.2086.
K.E. Pollock, G. Bailey, M. Berni, D. Fletcher,
L. Hinton, I. Johnson, J. Roberts,
and R. Weaver. 1998. Phonologi-
cal features of african american english.
http://www.rehabmed.ualberta.ca/spa/phono-
logy/features.htm.
Delip Rao, David Yarowsky, Abhishek Shreevats,
and Manaswi Gupta. 2010. Classifying la-
tent user attributes in twitter. In Proceedings of
the 2nd International Workshop on Search and
Mining User-generated Contents, pages 37–44.
ACM.
Sonya Rastogi, Tallese D. Johnson, Elizabeth M.
Hoeffel, and Malcolm P. Drewery Jr. 2011. The
black population: 2010. Technical report, US
Census, September.
John Rickford. 1999. African American Vernac-
ular English: Features, Evolution, Educational
Implications. Blackwell, Malden, MA.
</reference>
<page confidence="0.987736">
17
</page>
<reference confidence="0.99991032">
John Rickford. 2010. Geographical diversity, res-
idential segregation, and the vitality of african
american vernacular english and its speakers.
Transforming Anthropology, 18(1):28–34.
Sean Roberts and James Winters. 2013. Linguis-
tic diversity and traffic accidents: lessons from
statistical studies of cultural traits. PLoS ONE,
8(8).
Eric Thomas. 2007. Phonological and phonetic
characteristics of african american vernacular
english. Language and Linguistic Compass,
1(5):450–475.
Svitlana Volkova, Theresa Wilson, and David
Yarowsky. 2013. Exploring demographic lan-
guage variations to improve multilingual senti-
ment analysis in social media. In EMNLP.
Svitlana Volkova, Yoram Bachrach, Michael Arm-
strong, and Vijay Sharma. 2015. Inferring la-
tent user properties from texts published in so-
cial media (demo). In AAAI.
Walt Wolfram. 2004. The grammar of urban
african american vernacular english. In Kor-
mann B. and E. Schneider, editors, Handbook
of Varieties of English, pages 111–132, Berlin.
Mouton de Gruyter.
</reference>
<page confidence="0.99929">
18
</page>
</variant>
</algorithm>
<algorithm name="ParsHed" version="110505">
<variant no="0" confidence="0.354083">
<title confidence="0.560695">Challenges of studying and processing dialects in social media</title>
<author confidence="0.570859">Anna Katrine Jørgensen</author>
<author confidence="0.570859">Dirk Hovy</author>
<author confidence="0.570859">Anders</author>
<affiliation confidence="0.8851035">University of Njalsgade</affiliation>
<address confidence="0.956358">DK-2300 Copenhagen</address>
<email confidence="0.992394">soegaard@hum.ku.dk</email>
<abstract confidence="0.994671">Dialect features typically do not make it into formal writing, but flourish in social media. This enables large-scale variational studies. We focus on three phonological features of African American Vernacular English and their manifestation as spelling variations on Twitter. We discuss to what extent our data can be used to falsify eight sociolinguistic hypotheses. To go beyond the spelling level, we require automatic analysis such as POS tagging, but social media language still challenges language technologies. We show how both newswire- and Twitter-adapted state-of-the-art POS taggers perform significantly worse on AAVE tweets, suggesting that large-scale dialect studies of language variation beyond the surface level are not feasible with out-of-the-box NLP tools.</abstract>
</variant>
</algorithm>
<algorithm name="ParsCit" version="110505">
<citationList>
<citation valid="false">
<booktitle>Diversity and Diachrony,</booktitle>
<pages>33--44</pages>
<editor>Sankoff, editor,</editor>
<publisher>John Benjamins Publishing Co.</publisher>
<location>Amsterdam and Philadelphia.</location>
<marker></marker>
<rawString>Sankoff, editor, Diversity and Diachronyc, pages 33–44, Amsterdam and Philadelphia. John Benjamins Publishing Co.</rawString>
</citation>
<citation valid="true">
<authors>
<author>David Bamman</author>
<author>Jacob Eisenstein</author>
<author>Tyler Schnoebelen</author>
</authors>
<title>Gender identity and lexical variation in social media.</title>
<date>2014</date>
<journal>Journal of Sociolinguistics,</journal>
<volume>18</volume>
<contexts>
<context position="11793" citStr="Bamman et al., 2014" startWordPosition="1823" endWordPosition="1826">ally, as evidenced by our geographic correlations in Table 2, but since our studies explicitly test the influence of location, it is not the case for most of the hypotheses considered here that geographic diffusion is the underlying explanation for something else. In §3, we discuss whether these four methodological problems compromise the validity of our findings. One other methodological problems that may be relevant for other studies of dialect in social media, is almost completely irrelevant for our study: It is often important to control for topic in dialectal and sociolinguistic studies (Bamman et al., 2014), e.g., when studying the lexical preferences of speakers of urban ethnolects. We call this problem (5) TOPIC BIAS. Using word pairs with equivalent meanings for our studies, we implicitly control for topic (but see §3.1). Feature Positive Negative Total count brotha brother 9528 foreva forever 3673 hea here 4352 lova lover 1273 motha mother 4668 /r/ /Ø/ or /@/ ova over 3441 sista sister 5325 wateva whatever 2974 wea where 5153 total 40,387 kreet street 1226 /str/ /skr/ :krong strong 1629 skrip strip 1101 total 3956 brova brother 3715 dat that 2610 deez these 4477 /D/ /d/or/v/ dem them 3645 de</context>
</contexts>
<marker>Bamman, Eisenstein, Schnoebelen, 2014</marker>
<rawString>David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Louann Brizendine</author>
</authors>
<title>The Female Brain.</title>
<date>2006</date>
<publisher>Morgan Road Books.</publisher>
<contexts>
<context position="23249" citStr="Brizendine, 2006" startWordPosition="3835" endWordPosition="3836">trong skrip/strip total *** *** *** *** *** *** *** *** ** *** *** *** – * *** *** *** *** – ** – *** sista/sister – *** /str/ – brova/brother *** *** *** *** dat/that *** * – *** deez/these****– – /D/ dem/them *** *** – *** dey/they *** *** – *** dis/this *** – – *** mova/mother * *** *** *** mouf/mouth *** – – *** nuffin/nothing *** *** *** *** souf/south *** *** *** *** teef/teeth ** – ** *** trough/through – *** total * *** *** *** /T/ – – foreva/forever hea/here *** *** ** *** *** 15 established that women tend to be more linguistically creative and have larger vocabularies (Labov, 1990; Brizendine, 2006). Whether women are also more meta-linguistic (METAUSE BIAS), has to the best of our knowledge not been studied. Since genders are almost equally geographically distributed, and since Twitter is generally considered genderbalanced, neither USER POPULATION BIAS nor GALTON’S PROBLEM is likely to bias our conclusions. TOPIC BIAS, on the other hand, may. While our semantically equivalent pairs control for topic, the pragmatics sometimes differ. Just like code-switching is a strategy for bilinguals, using the spelling motha instead of mother could mean something, say irony, which one gender is more</context>
</contexts>
<marker>Brizendine, 2006</marker>
<rawString>Louann Brizendine. 2006. The Female Brain. Morgan Road Books.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Phillip Carter</author>
</authors>
<title>Shared spaces, shared structures: Latino social formation and African</title>
<date>2013</date>
<journal>American English in the U.S. south. Journal of Sociolinguistics,</journal>
<pages>17--66</pages>
<contexts>
<context position="8844" citStr="Carter, 2013" startWordPosition="1364" endWordPosition="1365">use it in citations, for example: (1) My 5 year old sister texted me on my mums phone saying “why did you take a picher in da bafroom” lool okay b (Twitter, Feb 21 2015) or in meta-linguistic discussions: (2) Whenever I hear a black person inquire about the location of the ”bafroom”... (Twitter, Jan 20 2015) We refer to these phenomena as (2) METAUSE BIAS. This bias is important with rare phenomena. With ”bafroom”, it seems that about 1 in 20 occurrences on Twitter are metauses. Meta-uses may also serve social functions. AAVE features are used as cultural markers by Latinos in North Carolina (Carter, 2013), for example. Some of the research hypotheses considered (113 and 115) relate to demographic variables such as income and educational levels. While we do not have socio-economic information about the individual Twitter user, we can use the geo-located tweets to study the correlation between socio-economic variables and linguistic features at the level of cities or ZIP codes.1 Eisenstein et al. (2011) note that this level of abstraction introduces some noise. Since Twitter users do not form representative samples of the population, the mean income for a city or ZIP code is not necessarily the </context>
</contexts>
<marker>Carter, 2013</marker>
<rawString>Phillip Carter. 2013. Shared spaces, shared structures: Latino social formation and African American English in the U.S. south. Journal of Sociolinguistics, 17:66–92.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Gabriel Doyle</author>
</authors>
<title>Mapping dialectal variation by querying social media. In</title>
<date>2014</date>
<booktitle>EACL,</booktitle>
<pages>98--106</pages>
<institution>Gothenburg, Sweden. Association for Computational Linguistics.</institution>
<contexts>
<context position="1890" citStr="Doyle, 2014" startWordPosition="288" endWordPosition="289">rs. Dallas is represented by four subjects, the New York City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter post more than half a billion tweets every day, and so</context>
<context position="7081" citStr="Doyle (2014)" startWordPosition="1078" endWordPosition="1079">ely correlated with income and educational level (Rickford, 1999). H6: Interdental fricative mutation is more frequent in AAVE than in European American speech (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the distribution of phonological variants in geo-located tweets with demographic information. Our method is similar to those proposed by Eisenstein (2013) and Doyle (2014), lending statistical power to sociolinguistic analyses, and circumventing traditional issues with data collection such as the Observer’s Paradox (Labov, 1972b; Meyerhof, 2006). Our work differs from previous work by studying phonological rules associated with specific dialects, as well as considering a wide range of actual sociolinguistic research hypotheses, but our main focus is the methodological problems doing this kind of work, as well as assessing the limitations of such work. 1.3 Methodological problems One obvious challenge relating social media data to sociolinguistic studies is that</context>
<context position="14094" citStr="Doyle (2014)" startWordPosition="2221" endWordPosition="2222">n /str/ has been reported to be a unique feature in AAVE, as it is unheard in other North American dialects (Rickford, 1999; Labov, 1972a; Thomas, 2007). The two interdental fricative mutations relate to substitutions of /d/ and /0/ by /d/, /v/ and /t/, /f/ in words such as that and mother or nothing and with. It has been reported that mutations of /d/ and /0/ are more common among African Americans than among European Americans and that the frequency of the mutations is inversely correlated with socio-economic levels and formality of speaking (Rickford, 1999). We follow Eisenstein (2013) and Doyle (2014) in assuming that spelling variation may be a result of phonological differences and select 25 word pairs for our study (Tabel 1). For each word pair, we collect positive (e.g., ”skreet”) and negative occurrences (e.g., ”street”), resulting in a total number of 79,396 tweets. The word pairs were chosen based on the unambiguity, frequency and representability of the phonological variations. Uniquely, backing in /str/ is represented by three word pairs of high similarity, which is due to phonological restrictions on the variation of /str/ to /skr/ and to the fact that backing in /str/ is a very </context>
</contexts>
<marker>Doyle, 2014</marker>
<rawString>Gabriel Doyle. 2014. Mapping dialectal variation by querying social media. In EACL, pages 98–106, Gothenburg, Sweden. Association for Computational Linguistics.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Jacob Eisenstein</author>
<author>Noah A Smith</author>
<author>Eric Xing</author>
</authors>
<title>Discovering sociolinguistic associations with structured sparsity.</title>
<date>2011</date>
<booktitle>In ACL.</booktitle>
<contexts>
<context position="9248" citStr="Eisenstein et al. (2011)" startWordPosition="1426" endWordPosition="1429">henomena. With ”bafroom”, it seems that about 1 in 20 occurrences on Twitter are metauses. Meta-uses may also serve social functions. AAVE features are used as cultural markers by Latinos in North Carolina (Carter, 2013), for example. Some of the research hypotheses considered (113 and 115) relate to demographic variables such as income and educational levels. While we do not have socio-economic information about the individual Twitter user, we can use the geo-located tweets to study the correlation between socio-economic variables and linguistic features at the level of cities or ZIP codes.1 Eisenstein et al. (2011) note that this level of abstraction introduces some noise. Since Twitter users do not form representative samples of the population, the mean income for a city or ZIP code is not necessarily the mean income for the Twitter users in that area. We refer to this problem as the (3) USER POPULATION BIAS. Another serious methodological problem known as (4) GALTON’S PROBLEM (Naroll, 1961; Roberts and Winters, 2013), is the observation that cross-cultural associations are 1Unlike many others, we rely on physical locations rather than user-entered profile locations. See Graham et al. (2014) for discus</context>
</contexts>
<marker>Eisenstein, Smith, Xing, 2011</marker>
<rawString>Jacob Eisenstein, Noah A. Smith, and Eric Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In ACL.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Jacob Eisenstein</author>
</authors>
<title>Phonological factors in social media writing.</title>
<date>2013</date>
<booktitle>In NAACL Workshop on Language Analysis in Social Media,</booktitle>
<pages>11--19</pages>
<location>Atlanta,</location>
<contexts>
<context position="1855" citStr="Eisenstein, 2013" startWordPosition="282" endWordPosition="283">pletion, but is based on only 762 speakers. Dallas is represented by four subjects, the New York City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter post more than half</context>
<context position="7064" citStr="Eisenstein (2013)" startWordPosition="1075" endWordPosition="1076">otacization is negatively correlated with income and educational level (Rickford, 1999). H6: Interdental fricative mutation is more frequent in AAVE than in European American speech (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the distribution of phonological variants in geo-located tweets with demographic information. Our method is similar to those proposed by Eisenstein (2013) and Doyle (2014), lending statistical power to sociolinguistic analyses, and circumventing traditional issues with data collection such as the Observer’s Paradox (Labov, 1972b; Meyerhof, 2006). Our work differs from previous work by studying phonological rules associated with specific dialects, as well as considering a wide range of actual sociolinguistic research hypotheses, but our main focus is the methodological problems doing this kind of work, as well as assessing the limitations of such work. 1.3 Methodological problems One obvious challenge relating social media data to sociolinguisti</context>
<context position="14077" citStr="Eisenstein (2013)" startWordPosition="2218" endWordPosition="2219">/ for strip. Backing in /str/ has been reported to be a unique feature in AAVE, as it is unheard in other North American dialects (Rickford, 1999; Labov, 1972a; Thomas, 2007). The two interdental fricative mutations relate to substitutions of /d/ and /0/ by /d/, /v/ and /t/, /f/ in words such as that and mother or nothing and with. It has been reported that mutations of /d/ and /0/ are more common among African Americans than among European Americans and that the frequency of the mutations is inversely correlated with socio-economic levels and formality of speaking (Rickford, 1999). We follow Eisenstein (2013) and Doyle (2014) in assuming that spelling variation may be a result of phonological differences and select 25 word pairs for our study (Tabel 1). For each word pair, we collect positive (e.g., ”skreet”) and negative occurrences (e.g., ”street”), resulting in a total number of 79,396 tweets. The word pairs were chosen based on the unambiguity, frequency and representability of the phonological variations. Uniquely, backing in /str/ is represented by three word pairs of high similarity, which is due to phonological restrictions on the variation of /str/ to /skr/ and to the fact that backing in</context>
</contexts>
<marker>Eisenstein, 2013</marker>
<rawString>Jacob Eisenstein. 2013. Phonological factors in social media writing. In NAACL Workshop on Language Analysis in Social Media, pages 11– 19, Atlanta, Georgia. Association for Computational Linguistics.</rawString>
</citation>
<citation valid="false">
<authors>
<author>Jacob Eisenstein</author>
</authors>
<title>to appear. Systematic patterning in phonologically-motivated orthographic variation.</title>
<journal>Journal of Sociolinguistics.</journal>
<marker>Eisenstein, </marker>
<rawString>Jacob Eisenstein. to appear. Systematic patterning in phonologically-motivated orthographic variation. Journal of Sociolinguistics.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Mark Graham</author>
<author>Scott Hale</author>
<author>Devin Gaffney</author>
</authors>
<title>Where in the world are you? Geolocation and language identification on Twitter.</title>
<date>2014</date>
<booktitle>The Professional Geographer,</booktitle>
<volume>66</volume>
<issue>4</issue>
<contexts>
<context position="9837" citStr="Graham et al. (2014)" startWordPosition="1522" endWordPosition="1525">odes.1 Eisenstein et al. (2011) note that this level of abstraction introduces some noise. Since Twitter users do not form representative samples of the population, the mean income for a city or ZIP code is not necessarily the mean income for the Twitter users in that area. We refer to this problem as the (3) USER POPULATION BIAS. Another serious methodological problem known as (4) GALTON’S PROBLEM (Naroll, 1961; Roberts and Winters, 2013), is the observation that cross-cultural associations are 1Unlike many others, we rely on physical locations rather than user-entered profile locations. See Graham et al. (2014) for discussion. 11 often explained by geographical diffusion. In other words, it is the problem of discriminating historical from functional associations in cross-cultural surveys. Briefly put, when we sample tweets and income-levels from US cities, there is little independence between the city data points. Linguistic features diffuse geographically and do not change at random, and we can therefore expect to see more spurious correlations than usual. Like with the famous example of chocolate and Nobel Prize winners, our positive findings may be explained by hidden background variables. A posi</context>
</contexts>
<marker>Graham, Hale, Gaffney, 2014</marker>
<rawString>Mark Graham, Scott Hale, and Devin Gaffney. 2014. Where in the world are you? Geolocation and language identification on Twitter. The Professional Geographer, 66(4).</rawString>
</citation>
<citation valid="true">
<authors>
<author>Amac Herdagdelen</author>
<author>Marco Baroni</author>
</authors>
<title>Stereotypical gender actions can be extracted from web text.</title>
<date>2011</date>
<journal>Journal of the American Society for Information Science and Technology,</journal>
<pages>62--1741</pages>
<contexts>
<context position="19219" citStr="Herdagdelen and Baroni, 2011" startWordPosition="3133" endWordPosition="3136">to H4. More interestingly, our data suggests that women use AAVE features more often than men, i.e., there is a negative correlation between male gender and AAVE features, contrary to the second half of H3, namely that AAVE is more frequently appropriated by men. Note, however, that our gender ratios are aggregated for city areas, and with the demographic bias of Twitter, these correlations should be taken with a grain of salt. Considering the small gender ratio differences, we also compute correlations between our linguistic features and gender using the Rovereto Twitter N-gram Corpus (RTC) (Herdagdelen and Baroni, 2011).4 The RTC corpus contains information about the gender of the tweeter associated with n-grams. While there is too little data in the corpus to correlate gender and backing in /str/, derhotacization and both interdental fricative mutations (/D/ → /d/ or /v/ and /T/ → /t/ or /f/) correlate significantly with women. Out of our words, 10 correlate sig4http://clic.cimec.unitn.it/amac/ twitter_ngram/ 14 Feature word pairs latitude longitude urban Gulf – = p &gt; 0.05, * = 0.05 &gt; p &gt; 0.01, ** = p &lt; 0.01, *** = p &lt; 0.0001 Shading corresponds to negative correlations Table 2: Geographic correlations nifi</context>
</contexts>
<marker>Herdagdelen, Baroni, 2011</marker>
<rawString>Amac Herdagdelen and Marco Baroni. 2011. Stereotypical gender actions can be extracted from web text. Journal of the American Society for Information Science and Technology, 62:1741–1749.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Dirk Hovy</author>
<author>Anders Søgaard</author>
</authors>
<title>Tagging performance correlates with author age.</title>
<date>2015</date>
<booktitle>In ACL.</booktitle>
<contexts>
<context position="1979" citStr="Hovy and Søgaard, 2015" startWordPosition="302" endWordPosition="305">etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter post more than half a billion tweets every day, and some fraction of these tweets are geo-located. Eisenstein (2013) and Doyle (2014) studied t</context>
</contexts>
<marker>Hovy, Søgaard, 2015</marker>
<rawString>Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In ACL.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Dirk Hovy</author>
<author>Anders Johannsen</author>
<author>Anders Søgaard</author>
</authors>
<title>User review-sites as a source for large-scale sociolinguistic studies.</title>
<date>2015</date>
<booktitle>In WWW.</booktitle>
<contexts>
<context position="1909" citStr="Hovy et al., 2015" startWordPosition="290" endWordPosition="293"> represented by four subjects, the New York City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter post more than half a billion tweets every day, and some fraction of thes</context>
</contexts>
<marker>Hovy, Johannsen, Søgaard, 2015</marker>
<rawString>Dirk Hovy, Anders Johannsen, and Anders Søgaard. 2015. User review-sites as a source for large-scale sociolinguistic studies. In WWW.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Anders Johannsen</author>
<author>Dirk Hovy</author>
<author>Anders Søgaard</author>
</authors>
<title>Cross-lingual syntactic variation over age and gender.</title>
<date>2015</date>
<booktitle>In CoNLL.</booktitle>
<contexts>
<context position="1955" citStr="Johannsen et al., 2015" startWordPosition="298" endWordPosition="301">rk City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter post more than half a billion tweets every day, and some fraction of these tweets are geo-located. Eisenstein (2013) an</context>
</contexts>
<marker>Johannsen, Hovy, Søgaard, 2015</marker>
<rawString>Anders Johannsen, Dirk Hovy, and Anders Søgaard. 2015. Cross-lingual syntactic variation over age and gender. In CoNLL.</rawString>
</citation>
<citation valid="true">
<authors>
<author>William Labov</author>
<author>Sharon Ash</author>
<author>Charles Boberg</author>
</authors>
<date>2005</date>
<journal>The Atlas of North American English Phonetics, Phonology and Sound Change. Mouton de Gruyter,</journal>
<location>New York, NY.</location>
<contexts>
<context position="1170" citStr="Labov et al., 2005" startWordPosition="173" endWordPosition="176">inguistic hypotheses. To go beyond the spelling level, we require automatic analysis such as POS tagging, but social media language still challenges language technologies. We show how both newswire- and Twitter-adapted stateof-the-art POS taggers perform significantly worse on AAVE tweets, suggesting that large-scale dialect studies of language variation beyond the surface level are not feasible with out-ofthe-box NLP tools. 1 Introduction Dialectal and sociolinguistic studies are traditionally based on interviews of small sets of speakers of each variety. The Atlas of North American English (Labov et al., 2005) has been the reference point for American dialectology since its completion, but is based on only 762 speakers. Dallas is represented by four subjects, the New York City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect</context>
<context position="5808" citStr="Labov et al., 2005" startWordPosition="882" endWordPosition="885">results. • Further, we show that state-of-the-art newswire and Twitter POS taggers perform much worse on tweets containing AAVE features. This suggests an additional limitation to large-scale sociolinguistic research using social media data, namely that it is hard to analyze variation beyond the lexical level with current tools. 1.2 Sociolinguistic hypotheses AAVE is, in contrast to other North American dialects, not geographically restricted. Although variation in AAVE does exist, AAVE in urban settings has been established as a uniform system with suprasegmental norms (Ash and Myhill, 1986; Labov et al., 2005; Labov, 2006; Wolfram, 2004). This paper considers the following eight (8) hypotheses from the sociolinguistic literature about AAVE as a ethnolect: H1: AAVE is an urban ethnolect (Rickford, 1999; Wolfram, 2004). H2: AAVE features are more present in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov </context>
</contexts>
<marker>Labov, Ash, Boberg, 2005</marker>
<rawString>William Labov, Sharon Ash, and Charles Boberg. 2005. The Atlas of North American English Phonetics, Phonology and Sound Change. Mouton de Gruyter, New York, NY.</rawString>
</citation>
<citation valid="true">
<authors>
<author>William Labov</author>
</authors>
<title>Language in the Inner City: Studies in the Black English Vernacular.</title>
<date>1972</date>
<publisher>University of Pennsylvania Press.</publisher>
<contexts>
<context position="7239" citStr="Labov, 1972" startWordPosition="1101" endWordPosition="1102"> (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the distribution of phonological variants in geo-located tweets with demographic information. Our method is similar to those proposed by Eisenstein (2013) and Doyle (2014), lending statistical power to sociolinguistic analyses, and circumventing traditional issues with data collection such as the Observer’s Paradox (Labov, 1972b; Meyerhof, 2006). Our work differs from previous work by studying phonological rules associated with specific dialects, as well as considering a wide range of actual sociolinguistic research hypotheses, but our main focus is the methodological problems doing this kind of work, as well as assessing the limitations of such work. 1.3 Methodological problems One obvious challenge relating social media data to sociolinguistic studies is that there is generally not a one-to-one relationship between phonological variation and spelling variation. People, in other words, do not spell the way they pro</context>
<context position="13618" citStr="Labov, 1972" startWordPosition="2141" endWordPosition="2142">s, /r/ is either not pronounced or is approximated as a vocalization in the surface form, when /r/ is in a pre-vocalic position. This can result in an elongation of the preceding vowel or in an off-glide schwa /@/, e.g., guard → /gA:d/, car → /ka:/, fear → /fi@/ (Thomas, 2007). Backing in /skr/ denotes the substitution 12 of /str/ for /skr/ in word-initial positions resulting in pronunciations such as /skrit/ for street, /skraq/ for strong and /skrT/ for strip. Backing in /str/ has been reported to be a unique feature in AAVE, as it is unheard in other North American dialects (Rickford, 1999; Labov, 1972a; Thomas, 2007). The two interdental fricative mutations relate to substitutions of /d/ and /0/ by /d/, /v/ and /t/, /f/ in words such as that and mother or nothing and with. It has been reported that mutations of /d/ and /0/ are more common among African Americans than among European Americans and that the frequency of the mutations is inversely correlated with socio-economic levels and formality of speaking (Rickford, 1999). We follow Eisenstein (2013) and Doyle (2014) in assuming that spelling variation may be a result of phonological differences and select 25 word pairs for our study (Tab</context>
</contexts>
<marker>Labov, 1972</marker>
<rawString>William Labov. 1972a. Language in the Inner City: Studies in the Black English Vernacular. University of Pennsylvania Press.</rawString>
</citation>
<citation valid="true">
<authors>
<author>William Labov</author>
</authors>
<title>Sociolinguistic Patterns.</title>
<date>1972</date>
<publisher>University of Pennsylvania Press,</publisher>
<location>Philadelphia, PA.</location>
<contexts>
<context position="7239" citStr="Labov, 1972" startWordPosition="1101" endWordPosition="1102"> (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the distribution of phonological variants in geo-located tweets with demographic information. Our method is similar to those proposed by Eisenstein (2013) and Doyle (2014), lending statistical power to sociolinguistic analyses, and circumventing traditional issues with data collection such as the Observer’s Paradox (Labov, 1972b; Meyerhof, 2006). Our work differs from previous work by studying phonological rules associated with specific dialects, as well as considering a wide range of actual sociolinguistic research hypotheses, but our main focus is the methodological problems doing this kind of work, as well as assessing the limitations of such work. 1.3 Methodological problems One obvious challenge relating social media data to sociolinguistic studies is that there is generally not a one-to-one relationship between phonological variation and spelling variation. People, in other words, do not spell the way they pro</context>
<context position="13618" citStr="Labov, 1972" startWordPosition="2141" endWordPosition="2142">s, /r/ is either not pronounced or is approximated as a vocalization in the surface form, when /r/ is in a pre-vocalic position. This can result in an elongation of the preceding vowel or in an off-glide schwa /@/, e.g., guard → /gA:d/, car → /ka:/, fear → /fi@/ (Thomas, 2007). Backing in /skr/ denotes the substitution 12 of /str/ for /skr/ in word-initial positions resulting in pronunciations such as /skrit/ for street, /skraq/ for strong and /skrT/ for strip. Backing in /str/ has been reported to be a unique feature in AAVE, as it is unheard in other North American dialects (Rickford, 1999; Labov, 1972a; Thomas, 2007). The two interdental fricative mutations relate to substitutions of /d/ and /0/ by /d/, /v/ and /t/, /f/ in words such as that and mother or nothing and with. It has been reported that mutations of /d/ and /0/ are more common among African Americans than among European Americans and that the frequency of the mutations is inversely correlated with socio-economic levels and formality of speaking (Rickford, 1999). We follow Eisenstein (2013) and Doyle (2014) in assuming that spelling variation may be a result of phonological differences and select 25 word pairs for our study (Tab</context>
</contexts>
<marker>Labov, 1972</marker>
<rawString>William Labov. 1972b. Sociolingustic Patterns. University of Pennsylvania Press, Philadelphia, PA.</rawString>
</citation>
<citation valid="true">
<authors>
<author>William Labov</author>
</authors>
<title>The intersection of sex and social class in the course of linguistic change.</title>
<date>1990</date>
<journal>Language Variation and Change,</journal>
<volume>2</volume>
<pages>205--254</pages>
<contexts>
<context position="23230" citStr="Labov, 1990" startWordPosition="3833" endWordPosition="3834">reet skrong/strong skrip/strip total *** *** *** *** *** *** *** *** ** *** *** *** – * *** *** *** *** – ** – *** sista/sister – *** /str/ – brova/brother *** *** *** *** dat/that *** * – *** deez/these****– – /D/ dem/them *** *** – *** dey/they *** *** – *** dis/this *** – – *** mova/mother * *** *** *** mouf/mouth *** – – *** nuffin/nothing *** *** *** *** souf/south *** *** *** *** teef/teeth ** – ** *** trough/through – *** total * *** *** *** /T/ – – foreva/forever hea/here *** *** ** *** *** 15 established that women tend to be more linguistically creative and have larger vocabularies (Labov, 1990; Brizendine, 2006). Whether women are also more meta-linguistic (METAUSE BIAS), has to the best of our knowledge not been studied. Since genders are almost equally geographically distributed, and since Twitter is generally considered genderbalanced, neither USER POPULATION BIAS nor GALTON’S PROBLEM is likely to bias our conclusions. TOPIC BIAS, on the other hand, may. While our semantically equivalent pairs control for topic, the pragmatics sometimes differ. Just like code-switching is a strategy for bilinguals, using the spelling motha instead of mother could mean something, say irony, which</context>
</contexts>
<marker>Labov, 1990</marker>
<rawString>William Labov. 1990. The intersection of sex and social class in the course of linguistic change. Language Variation and Change, 2:205–254, 7.</rawString>
</citation>
<citation valid="true">
<authors>
<author>William Labov</author>
</authors>
<title>Unendangered dialects, endangered people.</title>
<date>2006</date>
<booktitle>GURT’06.</booktitle>
<editor>In Natalie Schilling-Estes, editor,</editor>
<contexts>
<context position="5821" citStr="Labov, 2006" startWordPosition="886" endWordPosition="887">we show that state-of-the-art newswire and Twitter POS taggers perform much worse on tweets containing AAVE features. This suggests an additional limitation to large-scale sociolinguistic research using social media data, namely that it is hard to analyze variation beyond the lexical level with current tools. 1.2 Sociolinguistic hypotheses AAVE is, in contrast to other North American dialects, not geographically restricted. Although variation in AAVE does exist, AAVE in urban settings has been established as a uniform system with suprasegmental norms (Ash and Myhill, 1986; Labov et al., 2005; Labov, 2006; Wolfram, 2004). This paper considers the following eight (8) hypotheses from the sociolinguistic literature about AAVE as a ethnolect: H1: AAVE is an urban ethnolect (Rickford, 1999; Wolfram, 2004). H2: AAVE features are more present in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov et al., 2005;</context>
</contexts>
<marker>Labov, 2006</marker>
<rawString>William Labov. 2006. Unendangered dialects, endangered people. In Natalie Schilling-Estes, editor, GURT’06.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Miriam Meyerhof</author>
</authors>
<title>Introducing Sociolinguistics.</title>
<date>2006</date>
<publisher>Routledge.</publisher>
<contexts>
<context position="7257" citStr="Meyerhof, 2006" startWordPosition="1103" endWordPosition="1104">l., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the distribution of phonological variants in geo-located tweets with demographic information. Our method is similar to those proposed by Eisenstein (2013) and Doyle (2014), lending statistical power to sociolinguistic analyses, and circumventing traditional issues with data collection such as the Observer’s Paradox (Labov, 1972b; Meyerhof, 2006). Our work differs from previous work by studying phonological rules associated with specific dialects, as well as considering a wide range of actual sociolinguistic research hypotheses, but our main focus is the methodological problems doing this kind of work, as well as assessing the limitations of such work. 1.3 Methodological problems One obvious challenge relating social media data to sociolinguistic studies is that there is generally not a one-to-one relationship between phonological variation and spelling variation. People, in other words, do not spell the way they pronounce. Eisenstein</context>
</contexts>
<marker>Meyerhof, 2006</marker>
<rawString>Miriam Meyerhof. 2006. Introducing Sociolinguistics. Routledge.</rawString>
</citation>
<citation valid="true">
<authors>
<author>R Naroll</author>
</authors>
<title>Two solutions to Galton’s problem.</title>
<date>1961</date>
<journal>Philosophy of Science,</journal>
<volume>28</volume>
<contexts>
<context position="9632" citStr="Naroll, 1961" startWordPosition="1494" endWordPosition="1495">ic information about the individual Twitter user, we can use the geo-located tweets to study the correlation between socio-economic variables and linguistic features at the level of cities or ZIP codes.1 Eisenstein et al. (2011) note that this level of abstraction introduces some noise. Since Twitter users do not form representative samples of the population, the mean income for a city or ZIP code is not necessarily the mean income for the Twitter users in that area. We refer to this problem as the (3) USER POPULATION BIAS. Another serious methodological problem known as (4) GALTON’S PROBLEM (Naroll, 1961; Roberts and Winters, 2013), is the observation that cross-cultural associations are 1Unlike many others, we rely on physical locations rather than user-entered profile locations. See Graham et al. (2014) for discussion. 11 often explained by geographical diffusion. In other words, it is the problem of discriminating historical from functional associations in cross-cultural surveys. Briefly put, when we sample tweets and income-levels from US cities, there is little independence between the city data points. Linguistic features diffuse geographically and do not change at random, and we can th</context>
<context position="11100" citStr="Naroll, 1961" startWordPosition="1716" endWordPosition="1717">ogical pattern may also have cultural, religious or geographical explanations. Reasons to be less worried about GALTON’S PROBLEM in our case, include that a) we only consider standard hypotheses from the sociolinguistics literature and not a huge set of previously unexplored, automatically generated hypotheses, b) we sample data points at random from all across the US, giving us a very sparse distribution compared to country-level data, but more notably, c) location is an important, explicit variable in our study. GALTON’S PROBLEM is typically identified by clustering tests based on location (Naroll, 1961). Obviously, the phonological features considered here cluster geographically, as evidenced by our geographic correlations in Table 2, but since our studies explicitly test the influence of location, it is not the case for most of the hypotheses considered here that geographic diffusion is the underlying explanation for something else. In §3, we discuss whether these four methodological problems compromise the validity of our findings. One other methodological problems that may be relevant for other studies of dialect in social media, is almost completely irrelevant for our study: It is often </context>
</contexts>
<marker>Naroll, 1961</marker>
<rawString>R Naroll. 1961. Two solutions to Galton’s problem. Philosophy of Science, 28.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Slav Petrov</author>
<author>Dipanjan Das</author>
<author>Ryan McDonald</author>
</authors>
<title>A universal part-of-speech tagset.</title>
<date>2011</date>
<note>CoRR abs/1104.2086.</note>
<contexts>
<context position="24456" citStr="Petrov et al. (2011)" startWordPosition="4025" endWordPosition="4028">ender is more prone for. In sum, while we do believe that our data should lead sociolinguists to question whether AAVE is male-dominated, our findings may be biased by WRITTEN BIAS. 4 POS tagging We need automated syntactic analysis to study morpho-syntactic dialectal variation. We ran a state-of-the-art POS tagger trained on newswire5 (STANFORD), as well as two stateof-the-art POS taggers adapted to Twitter, namely GATE6 and ARK7, on our data. We had one professional annotator manually annotate 100 positive (AAVE) and 100 negative (non-AAVE) sentences using the coarsegrained tags proposed by Petrov et al. (2011). We map the tagger outputs to those tags and report tagging accuracies. See Table 5 for results, with A(+, −) being the absolute difference in performance from non-AAVE to AAVE. 5http://nlp.stanford.edu/software/ tagger.shtml 6https://gate.ac.uk/wiki/ twitter-postagger.html 7http://www.ark.cs.cmu.edu/ TweetNLP/ STANFORD GATE ARK AAVE 61.4 79.1 77.5 non-AAVE 74.5 83.3 77.9 Δ(+,-) 13.1 4.2 0.4 Table 5: POS tagging accuracies (%) While GATE is certainly better than STANFORD on our data, performance is generally poor and prohibitive of many downstream applications and variational studies. We also</context>
</contexts>
<marker>Petrov, Das, McDonald, 2011</marker>
<rawString>Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. CoRR abs/1104.2086.</rawString>
</citation>
<citation valid="true">
<authors>
<author>K E Pollock</author>
<author>G Bailey</author>
<author>M Berni</author>
<author>D Fletcher</author>
<author>L Hinton</author>
<author>I Johnson</author>
<author>J Roberts</author>
<author>R Weaver</author>
</authors>
<date>1998</date>
<note>Phonological features of african american english. http://www.rehabmed.ualberta.ca/spa/phonology/features.htm.</note>
<contexts>
<context position="6650" citStr="Pollock et al., 1998" startWordPosition="1010" endWordPosition="1014"> features are more present in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov et al., 2005; Rickford, 1999). H5: Derhotacization is negatively correlated with income and educational level (Rickford, 1999). H6: Interdental fricative mutation is more frequent in AAVE than in European American speech (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the distribution of phonological variants in geo-located tweets with demographic information. Our method is similar to those proposed by Eisenstein (2013) and Doyle (2014), lending statistical power to sociolinguistic analyses, and circumventing traditional issues with data collection such as the Observer’s Paradox (Labov, 1972b; Meyerhof</context>
</contexts>
<marker>Pollock, Bailey, Berni, Fletcher, Hinton, Johnson, Roberts, Weaver, 1998</marker>
<rawString>K.E. Pollock, G. Bailey, M. Berni, D. Fletcher, L. Hinton, I. Johnson, J. Roberts, and R. Weaver. 1998. Phonological features of african american english. http://www.rehabmed.ualberta.ca/spa/phonology/features.htm.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Delip Rao</author>
<author>David Yarowsky</author>
<author>Abhishek Shreevats</author>
<author>Manaswi Gupta</author>
</authors>
<title>Classifying latent user attributes in twitter.</title>
<date>2010</date>
<booktitle>In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents,</booktitle>
<pages>37--44</pages>
<publisher>ACM.</publisher>
<contexts>
<context position="1837" citStr="Rao et al., 2010" startWordPosition="278" endWordPosition="281">logy since its completion, but is based on only 762 speakers. Dallas is represented by four subjects, the New York City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter p</context>
</contexts>
<marker>Rao, Yarowsky, Shreevats, Gupta, 2010</marker>
<rawString>Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents, pages 37–44. ACM.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Sonya Rastogi</author>
<author>Tallese D Johnson</author>
<author>Elizabeth M Hoeffel</author>
<author>Malcolm P Drewery Jr</author>
</authors>
<title>The black population: 2010.</title>
<date>2011</date>
<tech>Technical report, US Census,</tech>
<contexts>
<context position="6136" citStr="Rastogi et al., 2011" startWordPosition="935" endWordPosition="938">ools. 1.2 Sociolinguistic hypotheses AAVE is, in contrast to other North American dialects, not geographically restricted. Although variation in AAVE does exist, AAVE in urban settings has been established as a uniform system with suprasegmental norms (Ash and Myhill, 1986; Labov et al., 2005; Labov, 2006; Wolfram, 2004). This paper considers the following eight (8) hypotheses from the sociolinguistic literature about AAVE as a ethnolect: H1: AAVE is an urban ethnolect (Rickford, 1999; Wolfram, 2004). H2: AAVE features are more present in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov et al., 2005; Rickford, 1999). H5: Derhotacization is negatively correlated with income and educational level (Rickford, 1999). H6: Interdental fricative mutation is more frequent in AAVE than in European American speech (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf</context>
</contexts>
<marker>Rastogi, Johnson, Hoeffel, Jr, 2011</marker>
<rawString>Sonya Rastogi, Tallese D. Johnson, Elizabeth M. Hoeffel, and Malcolm P. Drewery Jr. 2011. The black population: 2010. Technical report, US Census, September.</rawString>
</citation>
<citation valid="true">
<authors>
<author>John Rickford</author>
</authors>
<title>African American Vernacular English: Features, Evolution, Educational Implications.</title>
<date>1999</date>
<publisher>Blackwell,</publisher>
<location>Malden, MA.</location>
<contexts>
<context position="6004" citStr="Rickford, 1999" startWordPosition="913" endWordPosition="914">guistic research using social media data, namely that it is hard to analyze variation beyond the lexical level with current tools. 1.2 Sociolinguistic hypotheses AAVE is, in contrast to other North American dialects, not geographically restricted. Although variation in AAVE does exist, AAVE in urban settings has been established as a uniform system with suprasegmental norms (Ash and Myhill, 1986; Labov et al., 2005; Labov, 2006; Wolfram, 2004). This paper considers the following eight (8) hypotheses from the sociolinguistic literature about AAVE as a ethnolect: H1: AAVE is an urban ethnolect (Rickford, 1999; Wolfram, 2004). H2: AAVE features are more present in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov et al., 2005; Rickford, 1999). H5: Derhotacization is negatively correlated with income and educational level (Rickford, 1999). H6: Interdental fricative mutation is more frequent in AAVE than in </context>
<context position="13605" citStr="Rickford, 1999" startWordPosition="2139" endWordPosition="2140">n-rhotic dialects, /r/ is either not pronounced or is approximated as a vocalization in the surface form, when /r/ is in a pre-vocalic position. This can result in an elongation of the preceding vowel or in an off-glide schwa /@/, e.g., guard → /gA:d/, car → /ka:/, fear → /fi@/ (Thomas, 2007). Backing in /skr/ denotes the substitution 12 of /str/ for /skr/ in word-initial positions resulting in pronunciations such as /skrit/ for street, /skraq/ for strong and /skrT/ for strip. Backing in /str/ has been reported to be a unique feature in AAVE, as it is unheard in other North American dialects (Rickford, 1999; Labov, 1972a; Thomas, 2007). The two interdental fricative mutations relate to substitutions of /d/ and /0/ by /d/, /v/ and /t/, /f/ in words such as that and mother or nothing and with. It has been reported that mutations of /d/ and /0/ are more common among African Americans than among European Americans and that the frequency of the mutations is inversely correlated with socio-economic levels and formality of speaking (Rickford, 1999). We follow Eisenstein (2013) and Doyle (2014) in assuming that spelling variation may be a result of phonological differences and select 25 word pairs for o</context>
<context position="20413" citStr="Rickford, 1999" startWordPosition="3334" endWordPosition="3335">hic correlations nificantly with female speakers; seven with male. The correlations are found in Table 4. For each feature, certain words correlate significantly with female speakers, while others correlate significantly with male speakers. Consequently, neither our Twitter data not the Twitter data in the RTC suggest that AAVE is more often appropriated by men. We discuss whether our data provides a basis for falsifying the second half of H3 in §3.1. The high correlation between mutations of /D/ and longitude supports the presence of these mutations of /D/ in non-standard northern varieties (Rickford, 1999). The mutation of /T/ is also correlated with longitude, and with latitude, suggesting an Eastern American feature rather than a distinct Southern feature (Rickford, 1999). The variation in mutations could possibly be explained by both geography as well as the distribution og African Americans. There is evidence in our data that backing in /str/ (to /skr/) is appropriated more often by AAVE speakers than by speakers of other dialects (H8). There is also a negative correlation between latitude and backing in /str/ as well as a strong positive correlation with the Gulf states, suggesting that ba</context>
</contexts>
<marker>Rickford, 1999</marker>
<rawString>John Rickford. 1999. African American Vernacular English: Features, Evolution, Educational Implications. Blackwell, Malden, MA.</rawString>
</citation>
<citation valid="true">
<authors>
<author>John Rickford</author>
</authors>
<title>Geographical diversity, residential segregation, and the vitality of african american vernacular english and its speakers.</title>
<date>2010</date>
<journal>Transforming Anthropology,</journal>
<volume>18</volume>
<issue>1</issue>
<contexts>
<context position="6314" citStr="Rickford, 2010" startWordPosition="963" endWordPosition="964"> has been established as a uniform system with suprasegmental norms (Ash and Myhill, 1986; Labov et al., 2005; Labov, 2006; Wolfram, 2004). This paper considers the following eight (8) hypotheses from the sociolinguistic literature about AAVE as a ethnolect: H1: AAVE is an urban ethnolect (Rickford, 1999; Wolfram, 2004). H2: AAVE features are more present in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov et al., 2005; Rickford, 1999). H5: Derhotacization is negatively correlated with income and educational level (Rickford, 1999). H6: Interdental fricative mutation is more frequent in AAVE than in European American speech (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the </context>
</contexts>
<marker>Rickford, 2010</marker>
<rawString>John Rickford. 2010. Geographical diversity, residential segregation, and the vitality of african american vernacular english and its speakers. Transforming Anthropology, 18(1):28–34.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Sean Roberts</author>
<author>James Winters</author>
</authors>
<title>Linguistic diversity and traffic accidents: lessons from statistical studies of cultural traits.</title>
<date>2013</date>
<journal>PLoS ONE,</journal>
<volume>8</volume>
<issue>8</issue>
<contexts>
<context position="9660" citStr="Roberts and Winters, 2013" startWordPosition="1496" endWordPosition="1499"> about the individual Twitter user, we can use the geo-located tweets to study the correlation between socio-economic variables and linguistic features at the level of cities or ZIP codes.1 Eisenstein et al. (2011) note that this level of abstraction introduces some noise. Since Twitter users do not form representative samples of the population, the mean income for a city or ZIP code is not necessarily the mean income for the Twitter users in that area. We refer to this problem as the (3) USER POPULATION BIAS. Another serious methodological problem known as (4) GALTON’S PROBLEM (Naroll, 1961; Roberts and Winters, 2013), is the observation that cross-cultural associations are 1Unlike many others, we rely on physical locations rather than user-entered profile locations. See Graham et al. (2014) for discussion. 11 often explained by geographical diffusion. In other words, it is the problem of discriminating historical from functional associations in cross-cultural surveys. Briefly put, when we sample tweets and income-levels from US cities, there is little independence between the city data points. Linguistic features diffuse geographically and do not change at random, and we can therefore expect to see more s</context>
</contexts>
<marker>Roberts, Winters, 2013</marker>
<rawString>Sean Roberts and James Winters. 2013. Linguistic diversity and traffic accidents: lessons from statistical studies of cultural traits. PLoS ONE, 8(8).</rawString>
</citation>
<citation valid="true">
<authors>
<author>Eric Thomas</author>
</authors>
<title>Phonological and phonetic characteristics of African American Vernacular English.</title>
<date>2007</date>
<journal>Language and Linguistics Compass</journal>
<volume>1</volume>
<issue>5</issue>
<contexts>
<context position="6665" citStr="Thomas, 2007" startWordPosition="1015" endWordPosition="1016">sent in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov et al., 2005; Rickford, 1999). H5: Derhotacization is negatively correlated with income and educational level (Rickford, 1999). H6: Interdental fricative mutation is more frequent in AAVE than in European American speech (Pollock et al., 1998; Thomas, 2007). H7: Interdental fricative mutation is predominantly found in the Gulf states (Rastogi et al., 2011). H8: Backing in /str/ (to /skr/) is unique to AAVE (Rickford, 1999; Thomas, 2007; Labov, 2006). Hypotheses 1–8 are investigated by correlating the distribution of phonological variants in geo-located tweets with demographic information. Our method is similar to those proposed by Eisenstein (2013) and Doyle (2014), lending statistical power to sociolinguistic analyses, and circumventing traditional issues with data collection such as the Observer’s Paradox (Labov, 1972b; Meyerhof, 2006). Our wo</context>
<context position="13284" citStr="Thomas, 2007" startWordPosition="2085" endWordPosition="2086">tion, backing in /str/, and interdental fricative mutation. Specifically, we collect data to study the following four phonological variations (the latter two are both instances of interdental fricative mutation): a) derhotacization: /r/ → /Ø/ or /@/, b) /str/ → /skr/, c) /D/ → /d/ or /v/ and, d) /T/ → /t/ or /f/. In non-rhotic dialects, /r/ is either not pronounced or is approximated as a vocalization in the surface form, when /r/ is in a pre-vocalic position. This can result in an elongation of the preceding vowel or in an off-glide schwa /@/, e.g., guard → /gA:d/, car → /ka:/, fear → /fi@/ (Thomas, 2007). Backing in /skr/ denotes the substitution 12 of /str/ for /skr/ in word-initial positions resulting in pronunciations such as /skrit/ for street, /skraq/ for strong and /skrT/ for strip. Backing in /str/ has been reported to be a unique feature in AAVE, as it is unheard in other North American dialects (Rickford, 1999; Labov, 1972a; Thomas, 2007). The two interdental fricative mutations relate to substitutions of /d/ and /0/ by /d/, /v/ and /t/, /f/ in words such as that and mother or nothing and with. It has been reported that mutations of /d/ and /0/ are more common among African Americans</context>
</contexts>
<marker>Thomas, 2007</marker>
<rawString>Eric Thomas. 2007. Phonological and phonetic characteristics of African American Vernacular English. Language and Linguistics Compass, 1(5):450–475.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Svitlana Volkova</author>
<author>Theresa Wilson</author>
<author>David Yarowsky</author>
</authors>
<title>Exploring demographic language variations to improve multilingual sentiment analysis in social media.</title>
<date>2013</date>
<booktitle>In EMNLP.</booktitle>
<contexts>
<context position="1877" citStr="Volkova et al., 2013" startWordPosition="284" endWordPosition="287">sed on only 762 speakers. Dallas is represented by four subjects, the New York City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter post more than half a billion tweets ever</context>
</contexts>
<marker>Volkova, Wilson, Yarowsky, 2013</marker>
<rawString>Svitlana Volkova, Theresa Wilson, and David Yarowsky. 2013. Exploring demographic language variations to improve multilingual sentiment analysis in social media. In EMNLP.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Svitlana Volkova</author>
<author>Yoram Bachrach</author>
<author>Michael Armstrong</author>
<author>Vijay Sharma</author>
</authors>
<title>Inferring latent user properties from texts published in social media (demo).</title>
<date>2015</date>
<booktitle>In AAAI.</booktitle>
<contexts>
<context position="1931" citStr="Volkova et al., 2015" startWordPosition="294" endWordPosition="297">r subjects, the New York City dialect by six, etc. Data is costly to collect, and, as a consequence, scarce. Written language was traditionally used for formal purposes, and therefore differed in style from colloquial, spoken language. However, with the rise of social media platforms and the vast production of user generated content, differences between written and spoken language diminish. A number of recent papers have explored social media with respect to sociolinguistic and dialectological questions (Rao et al., 2010; Eisenstein, 2013; Volkova et al., 2013; Doyle, 2014; Hovy et al., 2015; Volkova et al., 2015; Johannsen et al., 2015; Hovy and Søgaard, 2015; Eisenstein, to appear). Emails, chats and social media posts serve purposes similar to those of spoken language, and consequently, features of spoken language, such as interjections, ellipses, and phonological variation, have found their way into this type of written language. Our work differs from most previous approaches by investigating several phonological spelling correlates of a specific language variety. The 284 million active users on Twitter post more than half a billion tweets every day, and some fraction of these tweets are geo-locat</context>
</contexts>
<marker>Volkova, Bachrach, Armstrong, Sharma, 2015</marker>
<rawString>Svitlana Volkova, Yoram Bachrach, Michael Armstrong, and Vijay Sharma. 2015. Inferring latent user properties from texts published in social media (demo). In AAAI.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Walt Wolfram</author>
</authors>
<title>The grammar of urban African American Vernacular English.</title>
<date>2004</date>
<booktitle>Handbook of Varieties of English</booktitle>
<pages>111--132</pages>
<editor>In Kortmann B. and E. Schneider, editors,</editor>
<location>Berlin</location>
<publisher>Mouton de Gruyter</publisher>
<contexts>
<context position="5837" citStr="Wolfram, 2004" startWordPosition="888" endWordPosition="889">state-of-the-art newswire and Twitter POS taggers perform much worse on tweets containing AAVE features. This suggests an additional limitation to large-scale sociolinguistic research using social media data, namely that it is hard to analyze variation beyond the lexical level with current tools. 1.2 Sociolinguistic hypotheses AAVE is, in contrast to other North American dialects, not geographically restricted. Although variation in AAVE does exist, AAVE in urban settings has been established as a uniform system with suprasegmental norms (Ash and Myhill, 1986; Labov et al., 2005; Labov, 2006; Wolfram, 2004). This paper considers the following eight (8) hypotheses from the sociolinguistic literature about AAVE as a ethnolect: H1: AAVE is an urban ethnolect (Rickford, 1999; Wolfram, 2004). H2: AAVE features are more present in the Gulf states than in the rest of the United States (Rastogi et al., 2011). 10 H3: The likelihood of speaking AAVE correlates negatively with income and educational level, and AAVE is more frequently appropriated by men (Rickford, 1999; Rickford, 2010). H4: Derhotacization is more frequent in African Americans than in European Americans (Labov et al., 2005; Rickford, 1999)</context>
</contexts>
<marker>Wolfram, 2004</marker>
<rawString>Walt Wolfram. 2004. The grammar of urban African American Vernacular English. In Kortmann B. and E. Schneider, editors, Handbook of Varieties of English, pages 111–132, Berlin. Mouton de Gruyter.</rawString>
</citation>
</citationList>
</algorithm>
</algorithms>