<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="4400">
    <title>Proceedings of the 3rd Workshop on Noisy User-generated Text</title>
    <editor>Leon Derczynski</editor>
    <editor>Wei Xu</editor>
    <editor>Alan Ritter</editor>
    <editor>Tim Baldwin</editor>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-44</url>
    <bibtype>book</bibtype>
    <bibkey>WNUT:2017</bibkey>
  </paper>

  <paper id="4401">
    <title>Boundary-based MWE segmentation with text partitioning</title>
    <author><first>Jake</first><last>Williams</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;10</pages>
    <url>http://www.aclweb.org/anthology/W17-4401</url>
<abstract>This submission describes the development of a fine-grained text-chunking
	algorithm for the task of comprehensive MWE segmentation. This task notably
	focuses on the identification of colloquial and idiomatic language. The
	submission also includes a thorough model evaluation in the context of two
	recent shared tasks, spanning 19 different languages and many text domains,
	including noisy, user-generated text. Evaluations show the presented model
	to be the best overall for purposes of MWE segmentation, and open-source
	software is released with the submission.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>williams:2017:WNUT</bibkey>
  </paper>

  <paper id="4402">
    <title>Towards the Understanding of Gaming Audiences by Modeling Twitch Emotes</title>
    <author><first>Francesco</first><last>Barbieri</last></author>
    <author><first>Luis</first><last>Espinosa Anke</last></author>
    <author><first>Miguel</first><last>Ballesteros</last></author>
    <author><first>Juan</first><last>Soler</last></author>
    <author><first>Horacio</first><last>Saggion</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>11&#8211;20</pages>
    <url>http://www.aclweb.org/anthology/W17-4402</url>
<abstract>Videogame streaming platforms have become a paramount example of noisy
	user-generated text. These are websites where gaming is broadcast and where
	viewers can interact via integrated chatrooms. Probably the best-known
	platform of this kind is Twitch, which has more than 100 million monthly
	viewers. Despite these numbers, and unlike other platforms featuring short
	messages (e.g. Twitter), Twitch has not received much attention from the
	Natural Language Processing community. In this paper we aim at bridging this
	gap by proposing two important tasks specific to the Twitch platform, namely
	(1) Emote prediction; and (2) Trolling detection. In our experiments, we
	evaluate three models: a BOW baseline, a supervised logistic classifier based
	on word embeddings, and a bidirectional long short-term memory recurrent neural
	network (LSTM). Our results show that the LSTM model, in which explicit
	features with proven effectiveness for similar tasks were encoded, outperforms
	the other two models.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>barbieri-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4403">
    <title>Churn Identification in Microblogs using Convolutional Neural Networks with Structured Logical Knowledge</title>
    <author><first>Mourad</first><last>Gridach</last></author>
    <author><first>Hatem</first><last>Haddad</last></author>
    <author><first>Hala</first><last>Mulki</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>21&#8211;30</pages>
    <url>http://www.aclweb.org/anthology/W17-4403</url>
    <attachment type="attachment">W17-4403.Attachment.txt</attachment>
<abstract>For brands, gaining a new customer is more expensive than keeping an existing
	one; the ability to retain customers is therefore becoming more challenging
	these days. Churn happens when a customer leaves a brand for a competitor.
	Most previous work considers the problem of churn prediction using Call Detail
	Records (CDRs). In this paper, we use micro-posts to classify customers as
	churny or non-churny. We explore the power of convolutional neural networks
	(CNNs), since they have achieved state-of-the-art results in various computer
	vision and NLP applications. However, the robustness of end-to-end models has
	some limitations, such as the need for a large amount of labeled data and the
	uninterpretability of these models. We investigate the use of CNNs augmented
	with structured logic rules to overcome or reduce these issues. We developed a
	system called Churn_teacher using an iterative distillation method that
	transfers the knowledge, extracted using just the combination of three logic
	rules, directly into the weights of the DNNs. Furthermore, we used weight
	normalization to speed up the training of our convolutional neural networks.
	Experimental results show that with just these three rules, we were able to
	achieve state-of-the-art results on a publicly available Twitter dataset about
	three Telecom brands.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gridach-haddad-mulki:2017:WNUT</bibkey>
  </paper>

  <paper id="4404">
    <title>To normalize, or not to normalize: The impact of normalization on Part-of-Speech tagging</title>
    <author><first>Rob</first><last>van der Goot</last></author>
    <author><first>Barbara</first><last>Plank</last></author>
    <author><first>Malvina</first><last>Nissim</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>31&#8211;39</pages>
    <url>http://www.aclweb.org/anthology/W17-4404</url>
<abstract>Does normalization help Part-of-Speech (POS) tagging accuracy on noisy,
	non-canonical data?
	To the best of our knowledge, little is known about the actual impact of
	normalization in a real-world scenario, where gold error detection is not
	available. We investigate the effect of automatic normalization on POS tagging
	of tweets.
	We also compare normalization to strategies that leverage large amounts of
	unlabeled data kept in its raw form. Our results show that normalization
	helps, but does not consistently add improvements beyond just word embedding
	layer initialization. The latter approach yields a tagging model that is
	competitive with a state-of-the-art Twitter tagger.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vandergoot-plank-nissim:2017:WNUT</bibkey>
  </paper>

  <paper id="4405">
    <title>Constructing an Alias List for Named Entities during an Event</title>
    <author><first>Anietie</first><last>Andy</last></author>
    <author><first>Mark</first><last>Dredze</last></author>
    <author><first>Mugizi</first><last>Rwebangira</last></author>
    <author><first>Chris</first><last>Callison-Burch</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>40&#8211;44</pages>
    <url>http://www.aclweb.org/anthology/W17-4405</url>
    <abstract>In certain fields, real-time knowledge from events can help in making informed
	decisions. In order to extract pertinent real-time knowledge related to an
	event, it is important to identify the named entities and their corresponding
	aliases related to the event. The problem of identifying aliases of named
	entities that spike has remained unexplored. In this paper, we introduce an
	algorithm, EntitySpike, that identifies entities that spike in popularity in
	tweets from a given time period, and constructs an alias list for these spiked
	entities. EntitySpike uses a temporal heuristic to identify named entities with
	similar context that occur in the same time period (within minutes) during an
	event. Each entity is encoded as a vector using this temporal heuristic. We
	show how these entity-vectors can be used to create a named entity alias list. 
We evaluated our algorithm on a dataset of temporally ordered tweets from a
	single event, the 2013 Grammy Awards show. We carried out various experiments
	on tweets published in the same time period and show that our algorithm
	identifies most entity name aliases and outperforms a competitive
	baseline.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>andy-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4406">
    <title>Incorporating Metadata into Content-Based User Embeddings</title>
    <author><first>Linzi</first><last>Xing</last></author>
    <author><first>Michael J.</first><last>Paul</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>45&#8211;49</pages>
    <url>http://www.aclweb.org/anthology/W17-4406</url>
    <abstract>Low-dimensional vector representations of social media users can benefit
	applications like recommendation systems and user attribute inference. Recent
	work has shown that user embeddings can be improved by combining different
	types of information, such as text and network data. We propose a data
	augmentation method that allows novel feature types to be used within
	off-the-shelf embedding models. Experimenting with the task of friend
	recommendation on a dataset of 5,019 Twitter users, we show that our approach
	can lead to substantial performance gains with the simple addition of network
	and geographic features.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>xing-paul:2017:WNUT</bibkey>
  </paper>

  <paper id="4407">
    <title>Simple Queries as Distant Labels for Predicting Gender on Twitter</title>
    <author><first>Chris</first><last>Emmery</last></author>
    <author><first>Grzegorz</first><last>Chrupa&#x142;a</last></author>
    <author><first>Walter</first><last>Daelemans</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>50&#8211;55</pages>
    <url>http://www.aclweb.org/anthology/W17-4407</url>
<abstract>The majority of research on extracting missing user attributes from social
	media profiles uses costly hand-annotated labels for supervised learning.
	Distantly supervised methods exist, although these generally rely on knowledge
	gathered using external sources. This paper demonstrates the effectiveness of
	gathering distant labels for self-reported gender on Twitter using simple
	queries. We confirm the reliability of this query heuristic by comparing with
	manual annotation. Moreover, using these labels for distant supervision, we
	demonstrate competitive model performance on the same data as models trained on
	manual annotations. As such, we offer a cheap, extensible, and fast alternative
	that can be employed beyond the task of gender classification.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>emmery-chrupala-daelemans:2017:WNUT</bibkey>
  </paper>

  <paper id="4408">
    <title>A Dataset and Classifier for Recognizing Social Media English</title>
    <author><first>Su Lin</first><last>Blodgett</last></author>
    <author><first>Johnny</first><last>Wei</last></author>
    <author><first>Brendan</first><last>O'Connor</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>56&#8211;61</pages>
    <url>http://www.aclweb.org/anthology/W17-4408</url>
<abstract>While language identification works well on standard texts, it performs much
	worse on social media language, in particular dialectal language &#8211; even for
	English. First, to support work on English language identification, we
	contribute a new dataset of tweets annotated for English versus non-English,
	with attention to ambiguity, code-switching, and automatic generation issues.
	It is randomly sampled from all public messages, avoiding biases towards
	pre-existing language classifiers. Second, we find that a demographic language
	model &#8211; which identifies messages with language similar to that used by several
	U.S. ethnic populations on Twitter &#8211; can be used to improve English language
	identification performance when combined with a traditional supervised language
	identifier. It increases recall with almost no loss of precision, including,
	surprisingly, for English messages written by non-U.S. authors.
	Our dataset and identifier ensemble are available online.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>blodgett-wei-oconnor:2017:WNUT</bibkey>
  </paper>

  <paper id="4409">
    <title>Evaluating hypotheses in geolocation on a very large sample of Twitter</title>
    <author><first>Bahar</first><last>Salehi</last></author>
    <author><first>Anders</first><last>S&#248;gaard</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>62&#8211;67</pages>
    <url>http://www.aclweb.org/anthology/W17-4409</url>
<abstract>Recent work in geolocation has made several hypotheses about what linguistic
	markers are relevant to detect where people write from. In this paper, we
	examine six hypotheses against a corpus consisting of all geo-tagged tweets
	from the US, or whose geo-tags could be inferred, in a 19% sample of Twitter
	history. Our experiments lend support to all six hypotheses, including that
	spelling variants and hashtags are strong predictors of location. We also
	study what kinds of common nouns are predictive of location after controlling
	for named entities such as dolphins or sharks.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>salehi-sogaard:2017:WNUT</bibkey>
  </paper>

  <paper id="4410">
    <title>The Effect of Error Rate in Artificially Generated Data for Automatic Preposition and Determiner Correction</title>
    <author><first>Fraser</first><last>Bowen</last></author>
    <author><first>Jon</first><last>Dehdari</last></author>
    <author><first>Josef</first><last>Van Genabith</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>68&#8211;76</pages>
    <url>http://www.aclweb.org/anthology/W17-4410</url>
    <abstract>In this research we investigate the impact of mismatches in the density and
	type of error between training and test data on a neural system correcting
	preposition and determiner errors. We use synthetically produced training data
	to control error density and type, and "real" error data for testing. Our
	results show it is possible to combine error types, although prepositions and
	determiners behave differently in terms of how much error should be
	artificially introduced into the training data in order to get the best
	results.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bowen-dehdari-vangenabith:2017:WNUT</bibkey>
  </paper>

  <paper id="4411">
    <title>An Entity Resolution Approach to Isolate Instances of Human Trafficking Online</title>
    <author><first>Chirag</first><last>Nagpal</last></author>
    <author><first>Kyle</first><last>Miller</last></author>
    <author><first>Benedikt</first><last>Boecking</last></author>
    <author><first>Artur</first><last>Dubrawski</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>77&#8211;84</pages>
    <url>http://www.aclweb.org/anthology/W17-4411</url>
    <abstract>Human trafficking is a challenging law enforcement problem, and traces of
	victims of such activity manifest as ‘escort advertisements’ on various
	online forums. Given the large, heterogeneous and noisy structure of this data,
	building models to predict instances of trafficking is a convoluted task. In
	this paper we propose an entity resolution pipeline using a notion of proxy
	labels, in order to extract clusters from this data with prior history of human
	trafficking activity. We apply this pipeline to 5M records from backpage.com
	and report on the performance of this approach, challenges in terms of
	scalability, and some significant domain specific characteristics of our
	resolved entities.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nagpal-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4412">
    <title>Noisy Uyghur Text Normalization</title>
    <author><first>Osman</first><last>Tursun</last></author>
    <author><first>Ruket</first><last>Cakici</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>85&#8211;93</pages>
    <url>http://www.aclweb.org/anthology/W17-4412</url>
<abstract>Uyghur is the second largest and most actively used social media language in
	China. However, a non-negligible part of the Uyghur text appearing in social
	media is unsystematically written in the Latin alphabet, and its volume
	continues to increase. Uyghur text in this format is incomprehensible and
	ambiguous even to native Uyghur speakers, and lacks the potential for any kind
	of advancement in NLP tasks related to the Uyghur language. Restoring noisy
	Uyghur text written in unsystematic Latin script, and preventing its spread,
	will be essential to protecting the Uyghur language and improving the accuracy
	of Uyghur NLP tasks. To this end, in this work we propose and compare the
	noisy channel model and the neural encoder-decoder model as normalization
	methods.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>tursun-cakici:2017:WNUT</bibkey>
  </paper>

  <paper id="4413">
    <title>Crowdsourcing Multiple Choice Science Questions</title>
    <author><first>Johannes</first><last>Welbl</last></author>
    <author><first>Nelson F.</first><last>Liu</last></author>
    <author><first>Matt</first><last>Gardner</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>94&#8211;106</pages>
    <url>http://www.aclweb.org/anthology/W17-4413</url>
    <abstract>We present a novel method for obtaining high-quality, domain-targeted multiple
	choice questions from crowd workers. Generating these questions can be
	difficult without trading away originality, relevance or diversity in the
	answer options. Our method addresses these problems by leveraging a large
	corpus of domain-specific
	text and a small set of existing questions. It produces model suggestions for
	document selection and answer distractor choice which aid the human question
	generation process. With this method we have assembled SciQ, a dataset of 13.7K
	multiple choice science exam questions. We demonstrate that the method produces
	in-domain questions by providing an analysis of this new dataset and by showing
	that humans cannot distinguish the crowdsourced questions from original
	questions. When using SciQ as additional training data to existing questions,
	we observe accuracy improvements on real science exams.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>welbl-liu-gardner:2017:WNUT</bibkey>
  </paper>

  <paper id="4414">
    <title>A Text Normalisation System for Non-Standard English Words</title>
    <author><first>Emma</first><last>Flint</last></author>
    <author><first>Elliot</first><last>Ford</last></author>
    <author><first>Olivia</first><last>Thomas</last></author>
    <author><first>Andrew</first><last>Caines</last></author>
    <author><first>Paula</first><last>Buttery</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>107&#8211;115</pages>
    <url>http://www.aclweb.org/anthology/W17-4414</url>
    <abstract>This paper investigates the problem of text normalisation; specifically, the
	normalisation of non-standard words (NSWs) in English. Non-standard words can
	be defined as those word tokens which do not have a dictionary entry, and
	cannot be pronounced using the usual letter-to-phoneme conversion rules; e.g.
	lbs, 99.3%, #EMNLP2017. NSWs pose a challenge to the proper functioning of
	text-to-speech technology, and the solution is to spell them out in such a way
	that they can be pronounced appropriately. We describe our four-stage
	normalisation system made up of components for detection, classification,
	division and expansion of NSWs. Performance is favourable compared to previous
	work in the field (Sproat et al. 2001, Normalization of non-standard words), as
	well as state-of-the-art text-to-speech software. Further, we update Sproat et
	al.'s NSW taxonomy, and create a more customisable system where users are able
	to input their own abbreviations and specify into which variety of English
	(currently available: British or American) they wish to normalise.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>flint-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4415">
    <title>Huntsville, hospitals, and hockey teams: Names can reveal your location</title>
    <author><first>Bahar</first><last>Salehi</last></author>
    <author><first>Dirk</first><last>Hovy</last></author>
    <author><first>Eduard</first><last>Hovy</last></author>
    <author><first>Anders</first><last>S&#248;gaard</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>116&#8211;121</pages>
    <url>http://www.aclweb.org/anthology/W17-4415</url>
<abstract>Geolocation is the task of identifying a social media user’s primary
	location, and in natural language processing, there is a growing literature on
	to what extent automated analysis of social media posts can help. However, not
	all content features are equally revealing of a user’s location.
	In this paper, we evaluate nine named entity (NE) types. Using various metrics,
	we find that GEO-LOC, FACILITY and SPORT-TEAM are more informative for
	geolocation than other NE types. Using these types, we improve geolocation
	accuracy and reduce distance error over several well-known text-based methods.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>salehi-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4416">
    <title>Improving Document Clustering by Removing Unnatural Language</title>
    <author><first>Myungha</first><last>Jang</last></author>
    <author><first>Jinho D.</first><last>Choi</last></author>
    <author><first>James</first><last>Allan</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>122&#8211;130</pages>
    <url>http://www.aclweb.org/anthology/W17-4416</url>
<abstract>Technical documents contain a fair amount of unnatural language, such as
	tables, formulas, and pseudo-code. Unnatural language can be an important
	factor in confusing existing NLP tools. This paper presents an effective
	method of distinguishing unnatural language from natural language, and
	evaluates the impact of unnatural language detection on NLP tasks such as
	document clustering. We view this problem as an information extraction task
	and build a multiclass classification model that identifies unnatural language
	components in four categories. First, we create a new annotated corpus by
	collecting slides and papers in various formats, PPT, PDF, and HTML, where
	unnatural language components are annotated into four categories. We then
	explore features available from plain text to build a statistical model that
	can handle any format as long as it is converted into plain text. Our
	experiments show that removing unnatural language components gives an absolute
	improvement in document clustering of up to 15%. Our corpus and tool are
	publicly available.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jang-choi-allan:2017:WNUT</bibkey>
  </paper>

  <paper id="4417">
    <title>Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text on Social Media</title>
    <author><first>Preeti</first><last>Bhargava</last></author>
    <author><first>Nemanja</first><last>Spasojevic</last></author>
    <author><first>Guoning</first><last>Hu</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>131&#8211;139</pages>
    <url>http://www.aclweb.org/anthology/W17-4417</url>
<abstract>In this paper, we describe the Lithium Natural Language Processing (NLP)
	system &#8211; a resource-constrained, high-throughput and language-agnostic system
	for information extraction from noisy user generated text on social media.
	Lithium NLP extracts a rich set of information including entities, topics,
	hashtags and sentiment from text. We discuss several real-world applications
	of the system currently incorporated in Lithium products. We also compare our
	system with existing commercial and academic NLP systems in terms of
	performance, information extracted and languages supported. We show that
	Lithium NLP is on par with, and in some cases outperforms, state-of-the-art
	commercial NLP systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bhargava-spasojevic-hu:2017:WNUT</bibkey>
  </paper>

  <paper id="4418">
    <title>Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition</title>
    <author><first>Leon</first><last>Derczynski</last></author>
    <author><first>Eric</first><last>Nichols</last></author>
    <author><first>Marieke</first><last>van Erp</last></author>
    <author><first>Nut</first><last>Limsopatham</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>140&#8211;147</pages>
    <url>http://www.aclweb.org/anthology/W17-4418</url>
    <abstract>This shared task focuses on identifying unusual, previously-unseen entities in
	the context of emerging discussions. Named entities form the basis of many
	modern approaches to other tasks (like event clustering and summarization), but
	recall on them is a real problem in noisy text &#8211; even among annotators. This
	drop tends to be due to novel entities and surface forms. Take for example the
	tweet "so.. kktny in 30 mins?!" &#8211; even human experts find the entity 'kktny'
	hard to detect and resolve. The goal of this task is to provide a definition of
	emerging and of rare entities, and based on that, also datasets for detecting
	these entities. The task as described in this paper evaluated the ability of
	participating entries to detect and classify novel and emerging named entities
	in noisy text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>derczynski-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4419">
    <title>A Multi-task Approach for Named Entity Recognition in Social Media Data</title>
    <author><first>Gustavo</first><last>Aguilar</last></author>
    <author><first>Suraj</first><last>Maharjan</last></author>
    <author><first>Adrian Pastor</first><last>L&#243;pez Monroy</last></author>
    <author><first>Thamar</first><last>Solorio</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>148&#8211;153</pages>
    <url>http://www.aclweb.org/anthology/W17-4419</url>
    <abstract>Named Entity Recognition for social media data is challenging because of its
	inherent noisiness. In addition to improper grammatical structures, it contains
	spelling inconsistencies and numerous informal abbreviations. We propose a
	novel multi-task approach by employing a more general secondary task of Named
	Entity (NE) segmentation together with the primary task of fine-grained NE
	categorization. The multi-task neural network architecture learns higher order
	feature representations from word and character sequences along with basic
	Part-of-Speech tags and gazetteer information. This neural network acts as a
	feature extractor to feed a Conditional Random Fields classifier. We were able
	to obtain the first position in the 3rd Workshop on Noisy User-generated Text
	(WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>aguilar-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4420">
    <title>Distributed Representation, LDA Topic Modelling and Deep Learning for Emerging Named Entity Recognition from Social Media</title>
    <author><first>Patrick</first><last>Jansson</last></author>
    <author><first>Shuhua</first><last>Liu</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>154&#8211;159</pages>
    <url>http://www.aclweb.org/anthology/W17-4420</url>
    <abstract>This paper reports our participation in the W-NUT 2017 shared task on emerging
	and rare entity recognition from user-generated noisy text such as tweets,
	online reviews and forum discussions. To accomplish this challenging task, we
	explore an approach that combines LDA topic modelling with deep learning on
	word-level and character-level embeddings. The LDA topic modelling generates a
	topic representation for each tweet, which is used as a feature for each word in
	the tweet. The deep learning component consists of a two-layer bidirectional LSTM
	and a CRF output layer. Our submitted system achieved F1-scores of 39.98 on
	entities and 37.77 on surface forms. Our post-submission experiments reached a
	best performance of 41.81 on entities and 40.57 on surface forms.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jansson-liu:2017:WNUT</bibkey>
  </paper>

  <paper id="4421">
    <title>Multi-channel BiLSTM-CRF Model for Emerging Named Entity Recognition in Social Media</title>
    <author><first>Bill Y.</first><last>Lin</last></author>
    <author><first>Frank</first><last>Xu</last></author>
    <author><first>Zhiyi</first><last>Luo</last></author>
    <author><first>Kenny</first><last>Zhu</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>160&#8211;165</pages>
    <url>http://www.aclweb.org/anthology/W17-4421</url>
    <abstract>In this paper, we present our multi-channel neural architecture for recognizing
	emerging named entities in social media messages, which we applied in the Novel
	and Emerging Named Entity Recognition shared task at the EMNLP 2017 Workshop on
	Noisy User-generated Text (W-NUT). We propose a novel approach that
	incorporates comprehensive word representations with multi-channel information
	and Conditional Random Fields (CRF) into a traditional Bidirectional Long
	Short-Term Memory (BiLSTM) neural network, without using any additional
	hand-crafted features such as gazetteers. In comparison with the other systems
	participating in the shared task, our system ranked second.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lin-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4422">
    <title>Transfer Learning and Sentence Level Features for Named Entity Recognition on Tweets</title>
    <author><first>Pius</first><last>von D&#228;niken</last></author>
    <author><first>Mark</first><last>Cieliebak</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>166&#8211;171</pages>
    <url>http://www.aclweb.org/anthology/W17-4422</url>
    <abstract>We present our system for the WNUT 2017 Named Entity Recognition challenge on
	Twitter data. We describe two modifications of a basic neural network
	architecture for sequence tagging. First, we show how we exploit additional
	labeled data, where the Named Entity tags differ from the target task. Then, we
	propose a way to incorporate sentence-level features. Our system uses both
	methods and ranked second for entity-level annotations, achieving an F1-score
	of 40.78, and second for surface-form annotations, achieving an F1-score of
	39.33.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vondaniken-cieliebak:2017:WNUT</bibkey>
  </paper>

  <paper id="4423">
    <title>Context-Sensitive Recognition for Emerging and Rare Entities</title>
    <author><first>Jake</first><last>Williams</last></author>
    <author><first>Giovanni</first><last>Santia</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>172&#8211;176</pages>
    <url>http://www.aclweb.org/anthology/W17-4423</url>
    <abstract>This paper is a shared task system description for the 2017 W-NUT shared task
	on Rare and Emerging Named Entities. Our paper describes the development and
	application of a novel algorithm for named entity recognition that relies only
	on the contexts of word forms. A comparison against the other submitted systems
	is provided.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>williams-santia:2017:WNUT</bibkey>
  </paper>

  <paper id="4424">
    <title>A Feature-based Ensemble Approach to Recognition of Emerging and Rare Named Entities</title>
    <author><first>Utpal Kumar</first><last>Sikdar</last></author>
    <author><first>Bj&#246;rn</first><last>Gamb&#228;ck</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>177&#8211;181</pages>
    <url>http://www.aclweb.org/anthology/W17-4424</url>
    <abstract>Detecting previously unseen named entities in text is a challenging task. The
	paper describes how three initial classifier models were built using
	Conditional Random Fields (CRFs), Support Vector Machines (SVMs) and a Long
	Short-Term Memory (LSTM) recurrent neural network. The outputs of these three
	classifiers were then used as features to train another CRF classifier working
	as an ensemble. 5-fold cross-validation based on training and development data
	for the emerging and rare named entity recognition shared task showed precision,
	recall and F1-score of 66.87%, 46.75% and 54.97%, respectively. For surface-form
	evaluation, the CRF ensemble-based system achieved precision, recall and
	F1-scores of 65.18%, 45.20% and 53.30%. When applied to unseen test data, the
	model reached 47.92% precision, 31.97% recall and 38.55% F1-score for
	entity-level evaluation, with corresponding surface-form evaluation values of
	44.91%, 30.47% and 36.31%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sikdar-gamback:2017:WNUT</bibkey>
  </paper>

</volume>