This article establishes that, unlike the legacy tf*idf representation, recent natural language representations (word embedding vectors) tend to exhibit a so-called concentration of measure phenomenon, in the sense that, as the representation size p and database size n are both large, their behavior is similar to that of large dimensional Gaussian random vectors. This phenomenon may have important consequences as machine learning algorithms for natural language data could be amenable to improvement, thereby providing new theoretical insights into the field of natural language processing.
The widespread usage of Twitter during emergencies has provided a new opportunity and timely resource to crisis responders for various disaster management tasks. Geolocation information of pertinent tweets is crucial for gaining situational awareness and delivering aid. However, the majority of tweets do not come with geoinformation. In this work, we focus on the task of location mention recognition from crisis-related tweets. Specifically, we investigate the influence of different types of labeled training data on the performance of a BERT-based classification model. We explore several training settings such as combing in- and out-domain data from news articles and general-purpose and crisis-related tweets. Furthermore, we investigate the effect of geospatial proximity while training on near or far-away events from the target event. Using five different datasets, our extensive experiments provide answers to several critical research questions that are useful for the research community to foster research in this important direction. For example, results show that, for training a location mention recognition model, Twitter-based data is preferred over general-purpose data; and crisis-related data is preferred over general-purpose Twitter data. Furthermore, training on data from geographically-nearby disaster events to the target event boosts the performance compared to training on distant events.
The success of deep neural networks (DNNs) is heavily dependent on the availability of labeled data. However, obtaining labeled data is a big challenge in many real-world problems. In such scenarios, a DNN model can leverage labeled and unlabeled data from a related domain, but it has to deal with the shift in data distributions between the source and the target domains. In this paper, we study the problem of classifying social media posts during a crisis event (e.g., Earthquake). For that, we use labeled and unlabeled data from past similar events (e.g., Flood) and unlabeled data for the current event. We propose a novel model that performs adversarial learning based domain adaptation to deal with distribution drifts and graph based semi-supervised learning to leverage unlabeled data within a single unified deep learning framework. Our experiments with two real-world crisis datasets collected from Twitter demonstrate significant improvements over several baselines.
Microblogging platforms such as Twitter provide active communication channels during mass convergence and emergency events such as earthquakes, typhoons. During the sudden onset of a crisis situation, affected people post useful information on Twitter that can be used for situational awareness and other humanitarian disaster response efforts, if processed timely and effectively. Processing social media information pose multiple challenges such as parsing noisy, brief and informal messages, learning information categories from the incoming stream of messages and classifying them into different classes among others. One of the basic necessities of many of these tasks is the availability of data, in particular human-annotated data. In this paper, we present human-annotated Twitter corpora collected during 19 different crises that took place between 2013 and 2015. To demonstrate the utility of the annotations, we train machine learning classifiers. Moreover, we publish first largest word2vec word embeddings trained on 52 million crisis-related tweets. To deal with tweets language issues, we present human-annotated normalized lexical resources for different lexical variations.