This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
This paper measures the impact of increased exposure on whether learned construction grammars converge onto shared representations when trained on data from different registers. Register influences the frequency of constructions, with some structures common in formal but not informal usage. We expect that a grammar induction algorithm exposed to different registers will acquire different constructions. To what degree does increased exposure lead to the convergence of register-specific grammars? The experiments in this paper simulate language learning in 12 languages (half Germanic and half Romance) with corpora representing three registers (Twitter, Wikipedia, Web). These simulations are repeated with increasing amounts of exposure, from 100k to 2 million words, to measure the impact of exposure on the convergence of grammars. The results show that increased exposure does lead to converging grammars across all languages. In addition, a shared core of register-universal constructions remains constant across increasing amounts of exposure.
This paper asks whether a distinction between production-based and perception-based grammar induction influences either (i) the growth curve of grammars and lexicons or (ii) the similarity between representations learned from independent sub-sets of a corpus. A production-based model is trained on the usage of a single individual, thus simulating the grammatical knowledge of a single speaker. A perception-based model is trained on an aggregation of many individuals, thus simulating grammatical generalizations learned from exposure to many different speakers. To ensure robustness, the experiments are replicated across two registers of written English, with four additional registers reserved as a control. A set of three computational experiments shows that production-based grammars are significantly different from perception-based grammars across all conditions, with a steeper growth curve that can be explained by substantial inter-individual grammatical differences.
Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-referenced social media and web data. The goal, however, has been to describe these corpora themselves rather than to make inferences about underlying populations. This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias in digital corpora that is introduced by non-local populations. These methods tell us where significant changes have taken place and whether this leads to increased or decreased diversity. This is an important step in aligning digital corpora like social media with the real-world populations that have produced them.
While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics. For example, recent work has shown that existing English gigaword corpora over-represent inner-circle varieties from the US and the UK. To correct implicit geographic and demographic biases, this paper uses country-level population demographics to guide the construction of gigaword web corpora. The resulting corpora explicitly match the ground-truth geographic distribution of each language, thus equally representing language users from around the world. This is important because it ensures that speakers of under-resourced language varieties (i.e., Indian English or Algerian French) are represented, both in the corpora themselves but also in derivative resources like word embeddings.
This paper evaluates global-scale dialect identification for 14 national varieties of English on both web-crawled data and Twitter data. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.
A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The experiments show that association-based models produce better generalizations across all languages by a significant margin.