<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="4400">
    <title>Proceedings of the 3rd Workshop on Noisy User-generated Text</title>
    <editor>Leon Derczynski</editor>
    <editor>Wei Xu</editor>
    <editor>Alan Ritter</editor>
    <editor>Tim Baldwin</editor>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-44</url>
    <bibtype>book</bibtype>
    <bibkey>WNUT:2017</bibkey>
  </paper>

  <paper id="4401">
    <title>Boundary-based MWE segmentation with text partitioning</title>
    <author><first>Jake</first><last>Williams</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;10</pages>
    <url>http://www.aclweb.org/anthology/W17-4401</url>
<abstract>This submission describes the development of a fine-grained text-chunking
	algorithm for the task of comprehensive MWE segmentation. This task notably
	focuses on the identification of colloquial and idiomatic language. The
	submission also includes a thorough model evaluation in the context of two
	recent shared tasks, spanning 19 different languages and many text domains,
	including noisy, user-generated text. Evaluations show the presented model
	to be the best overall for purposes of MWE segmentation, and open-source
	software is released with the submission.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>williams:2017:WNUT</bibkey>
  </paper>

  <paper id="4402">
    <title>Towards the Understanding of Gaming Audiences by Modeling Twitch Emotes</title>
    <author><first>Francesco</first><last>Barbieri</last></author>
    <author><first>Luis</first><last>Espinosa Anke</last></author>
    <author><first>Miguel</first><last>Ballesteros</last></author>
    <author><first>Juan</first><last>Soler</last></author>
    <author><first>Horacio</first><last>Saggion</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>11&#8211;20</pages>
    <url>http://www.aclweb.org/anthology/W17-4402</url>
<abstract>Videogame streaming platforms have become a paramount example of noisy
	user-generated text. These are websites where gaming is broadcast and where
	viewers can interact via integrated chatrooms. Probably the best-known
	platform of this kind is Twitch, which has more than 100 million monthly
	viewers. Despite these numbers, and unlike other platforms featuring short
	messages (e.g. Twitter), Twitch has not received much attention from the
	Natural Language Processing community. In this paper we aim at bridging this
	gap by proposing two important tasks specific to the Twitch platform, namely
	(1) Emote prediction; and (2) Trolling detection. In our experiments, we
	evaluate three models: a BOW baseline, a supervised logistic classifier based
	on word embeddings, and a bidirectional long short-term memory recurrent neural
	network (LSTM). Our results show that the LSTM model, in which explicit
	features with proven effectiveness for similar tasks were encoded, outperforms
	the other two models.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>barbieri-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4403">
    <title>Churn Identification in Microblogs using Convolutional Neural Networks with Structured Logical Knowledge</title>
    <author><first>Mourad</first><last>Gridach</last></author>
    <author><first>Hatem</first><last>Haddad</last></author>
    <author><first>Hala</first><last>Mulki</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>21&#8211;30</pages>
    <url>http://www.aclweb.org/anthology/W17-4403</url>
    <attachment type="attachment">W17-4403.Attachment.txt</attachment>
<abstract>For brands, gaining a new customer is more expensive than keeping an existing
	one; the ability to retain customers is therefore becoming more challenging
	these days. Churn happens when a customer leaves a brand for a competitor.
	Most previous work considers the problem of churn prediction using Call Detail
	Records (CDRs). In this paper, we use micro-posts to classify customers as
	churny or non-churny. We explore the power of convolutional neural networks
	(CNNs), since they have achieved state-of-the-art results in various computer
	vision and NLP applications. However, the robustness of end-to-end models has
	some limitations, such as the need for a large amount of labeled data and the
	uninterpretability of these models. We investigate the use of CNNs augmented
	with structured logic rules to overcome or reduce these issues. We developed a
	system called Churn_teacher using an iterative distillation method that
	transfers the knowledge, extracted using just the combination of three logic
	rules, directly into the weights of the DNNs. Furthermore, we used weight
	normalization to speed up the training of our convolutional neural networks.
	Experimental results show that with just these three rules, we were able to
	achieve state-of-the-art results on a publicly available Twitter dataset about
	three Telecom brands.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gridach-haddad-mulki:2017:WNUT</bibkey>
  </paper>

  <paper id="4404">
    <title>To normalize, or not to normalize: The impact of normalization on Part-of-Speech tagging</title>
    <author><first>Rob</first><last>van der Goot</last></author>
    <author><first>Barbara</first><last>Plank</last></author>
    <author><first>Malvina</first><last>Nissim</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>31&#8211;39</pages>
    <url>http://www.aclweb.org/anthology/W17-4404</url>
<abstract>Does normalization help Part-of-Speech (POS) tagging accuracy on noisy,
	non-canonical data?
	To the best of our knowledge, little is known about the actual impact of
	normalization in a real-world scenario, where gold error detection is not
	available. We investigate the effect of automatic normalization on POS tagging
	of tweets.
	We also compare normalization to strategies that leverage large amounts of
	unlabeled data kept in its raw form. Our results show that normalization
	helps, but does not consistently add improvements beyond just word embedding
	layer initialization. The latter approach yields a tagging model that is
	competitive with a state-of-the-art Twitter tagger.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vandergoot-plank-nissim:2017:WNUT</bibkey>
  </paper>

  <paper id="4405">
    <title>Constructing an Alias List for Named Entities during an Event</title>
    <author><first>Anietie</first><last>Andy</last></author>
    <author><first>Mark</first><last>Dredze</last></author>
    <author><first>Mugizi</first><last>Rwebangira</last></author>
    <author><first>Chris</first><last>Callison-Burch</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>40&#8211;44</pages>
    <url>http://www.aclweb.org/anthology/W17-4405</url>
    <abstract>In certain fields, real-time knowledge from events can help in making informed
	decisions. In order to extract pertinent real-time knowledge related to an
	event, it is important to identify the named entities and their corresponding
	aliases related to the event. The problem of identifying aliases of named
	entities that spike has remained unexplored. In this paper, we introduce an
	algorithm, EntitySpike, that identifies entities that spike in popularity in
	tweets from a given time period, and constructs an alias list for these spiked
	entities. EntitySpike uses a temporal heuristic to identify named entities with
	similar context that occur in the same time period (within minutes) during an
	event. Each entity is encoded as a vector using this temporal heuristic. We
	show how these entity-vectors can be used to create a named entity alias list. 
We evaluated our algorithm on a dataset of temporally ordered tweets from a
	single event, the 2013 Grammy Awards show. We carried out various experiments
	on tweets published in the same time period and show that our algorithm
	identifies most entity name aliases and outperforms a competitive
	baseline.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>andy-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4406">
    <title>Incorporating Metadata into Content-Based User Embeddings</title>
    <author><first>Linzi</first><last>Xing</last></author>
    <author><first>Michael J.</first><last>Paul</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>45&#8211;49</pages>
    <url>http://www.aclweb.org/anthology/W17-4406</url>
    <abstract>Low-dimensional vector representations of social media users can benefit
	applications like recommendation systems and user attribute inference. Recent
	work has shown that user embeddings can be improved by combining different
	types of information, such as text and network data. We propose a data
	augmentation method that allows novel feature types to be used within
	off-the-shelf embedding models. Experimenting with the task of friend
	recommendation on a dataset of 5,019 Twitter users, we show that our approach
	can lead to substantial performance gains with the simple addition of network
	and geographic features.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>xing-paul:2017:WNUT</bibkey>
  </paper>

  <paper id="4407">
    <title>Simple Queries as Distant Labels for Predicting Gender on Twitter</title>
    <author><first>Chris</first><last>Emmery</last></author>
    <author><first>Grzegorz</first><last>Chrupa&#x142;a</last></author>
    <author><first>Walter</first><last>Daelemans</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>50&#8211;55</pages>
    <url>http://www.aclweb.org/anthology/W17-4407</url>
<abstract>The majority of research on extracting missing user attributes from social
	media profiles uses costly hand-annotated labels for supervised learning.
	Distantly supervised methods exist, although these generally rely on knowledge
	gathered using external sources. This paper demonstrates the effectiveness of
	gathering distant labels for self-reported gender on Twitter using simple
	queries. We confirm the reliability of this query heuristic by comparing with
	manual annotation. Moreover, using these labels for distant supervision, we
	demonstrate competitive model performance on the same data as models trained on
	manual annotations. As such, we offer a cheap, extensible, and fast alternative
	that can be employed beyond the task of gender classification.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>emmery-chrupala-daelemans:2017:WNUT</bibkey>
  </paper>

  <paper id="4408">
    <title>A Dataset and Classifier for Recognizing Social Media English</title>
    <author><first>Su Lin</first><last>Blodgett</last></author>
    <author><first>Johnny</first><last>Wei</last></author>
    <author><first>Brendan</first><last>O'Connor</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>56&#8211;61</pages>
    <url>http://www.aclweb.org/anthology/W17-4408</url>
<abstract>While language identification works well on standard texts, it performs much
	worse on social media language, in particular dialectal language &#8211; even for
	English. First, to support work on English language identification, we
	contribute a new dataset of tweets annotated for English versus non-English,
	with attention to ambiguity, code-switching, and automatic generation issues.
	It is randomly sampled from all public messages, avoiding biases towards
	pre-existing language classifiers. Second, we find that a demographic language
	model &#8211; which identifies messages with language similar to that used by several
	U.S. ethnic populations on Twitter &#8211; can be used to improve English language
	identification performance when combined with a traditional supervised language
	identifier. It increases recall with almost no loss of precision, including,
	surprisingly, for English messages written by non-U.S. authors.
	Our dataset and identifier ensemble are available online.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>blodgett-wei-oconnor:2017:WNUT</bibkey>
  </paper>

  <paper id="4409">
    <title>Evaluating hypotheses in geolocation on a very large sample of Twitter</title>
    <author><first>Bahar</first><last>Salehi</last></author>
    <author><first>Anders</first><last>S&#248;gaard</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>62&#8211;67</pages>
    <url>http://www.aclweb.org/anthology/W17-4409</url>
<abstract>Recent work in geolocation has made several hypotheses about what linguistic
	markers are relevant to detect where people write from. In this paper, we
	examine six hypotheses against a corpus consisting of all geo-tagged tweets
	from the US, or whose geo-tags could be inferred, in a 19% sample of Twitter
	history. Our experiments lend support to all six hypotheses, including that
	spelling variants and hashtags are strong predictors of location. We also
	study what kinds of common nouns are predictive of location after controlling
	for named entities such as dolphins or sharks.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>salehi-sogaard:2017:WNUT</bibkey>
  </paper>

  <paper id="4410">
    <title>The Effect of Error Rate in Artificially Generated Data for Automatic Preposition and Determiner Correction</title>
    <author><first>Fraser</first><last>Bowen</last></author>
    <author><first>Jon</first><last>Dehdari</last></author>
    <author><first>Josef</first><last>Van Genabith</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>68&#8211;76</pages>
    <url>http://www.aclweb.org/anthology/W17-4410</url>
    <abstract>In this research we investigate the impact of mismatches in the density and
	type of error between training and test data on a neural system correcting
	preposition and determiner errors. We use synthetically produced training data
	to control error density and type, and "real" error data for testing. Our
	results show it is possible to combine error types, although prepositions and
	determiners behave differently in terms of how much error should be
	artificially introduced into the training data in order to get the best
	results.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bowen-dehdari-vangenabith:2017:WNUT</bibkey>
  </paper>

  <paper id="4411">
    <title>An Entity Resolution Approach to Isolate Instances of Human Trafficking Online</title>
    <author><first>Chirag</first><last>Nagpal</last></author>
    <author><first>Kyle</first><last>Miller</last></author>
    <author><first>Benedikt</first><last>Boecking</last></author>
    <author><first>Artur</first><last>Dubrawski</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>77&#8211;84</pages>
    <url>http://www.aclweb.org/anthology/W17-4411</url>
    <abstract>Human trafficking is a challenging law enforcement problem, and traces of
	victims of such activity manifest as ‘escort advertisements’ on various
	online forums. Given the large, heterogeneous and noisy structure of this data,
	building models to predict instances of trafficking is a convoluted task. In
	this paper we propose an entity resolution pipeline using a notion of proxy
	labels, in order to extract clusters from this data with prior history of human
	trafficking activity. We apply this pipeline to 5M records from backpage.com
	and report on the performance of this approach, challenges in terms of
	scalability, and some significant domain specific characteristics of our
	resolved entities.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nagpal-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4412">
    <title>Noisy Uyghur Text Normalization</title>
    <author><first>Osman</first><last>Tursun</last></author>
    <author><first>Ruket</first><last>Cakici</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>85&#8211;93</pages>
    <url>http://www.aclweb.org/anthology/W17-4412</url>
<abstract>Uyghur is the second largest and most actively used social media language in
	China. However, a non-negligible part of the Uyghur text appearing in social
	media is unsystematically written in the Latin alphabet, and its volume
	continues to increase. Uyghur text in this format is incomprehensible and
	ambiguous even to native Uyghur speakers, and lacks the potential for any kind
	of advancement in NLP tasks related to the Uyghur language. Restoring noisy
	Uyghur text written in unsystematic Latin script, and preventing its spread,
	will be essential to protecting the Uyghur language and improving the accuracy
	of Uyghur NLP tasks. To this end, in this work we propose and compare the
	noisy channel model and the neural encoder-decoder model as normalization
	methods.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>tursun-cakici:2017:WNUT</bibkey>
  </paper>

  <paper id="4413">
    <title>Crowdsourcing Multiple Choice Science Questions</title>
    <author><first>Johannes</first><last>Welbl</last></author>
    <author><first>Nelson F.</first><last>Liu</last></author>
    <author><first>Matt</first><last>Gardner</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>94&#8211;106</pages>
    <url>http://www.aclweb.org/anthology/W17-4413</url>
    <abstract>We present a novel method for obtaining high-quality, domain-targeted multiple
	choice questions from crowd workers. Generating these questions can be
	difficult without trading away originality, relevance or diversity in the
	answer options. Our method addresses these problems by leveraging a large
	corpus of domain-specific
	text and a small set of existing questions. It produces model suggestions for
	document selection and answer distractor choice which aid the human question
	generation process. With this method we have assembled SciQ, a dataset of 13.7K
	multiple choice science exam questions. We demonstrate that the method produces
	in-domain questions by providing an analysis of this new dataset and by showing
	that humans cannot distinguish the crowdsourced questions from original
	questions. When using SciQ as additional training data to existing questions,
	we observe accuracy improvements on real science exams.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>welbl-liu-gardner:2017:WNUT</bibkey>
  </paper>

  <paper id="4414">
    <title>A Text Normalisation System for Non-Standard English Words</title>
    <author><first>Emma</first><last>Flint</last></author>
    <author><first>Elliot</first><last>Ford</last></author>
    <author><first>Olivia</first><last>Thomas</last></author>
    <author><first>Andrew</first><last>Caines</last></author>
    <author><first>Paula</first><last>Buttery</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>107&#8211;115</pages>
    <url>http://www.aclweb.org/anthology/W17-4414</url>
    <abstract>This paper investigates the problem of text normalisation; specifically, the
	normalisation of non-standard words (NSWs) in English. Non-standard words can
	be defined as those word tokens which do not have a dictionary entry, and
	cannot be pronounced using the usual letter-to-phoneme conversion rules; e.g.
	lbs, 99.3%, #EMNLP2017. NSWs pose a challenge to the proper functioning of
	text-to-speech technology, and the solution is to spell them out in such a way
	that they can be pronounced appropriately. We describe our four-stage
	normalisation system made up of components for detection, classification,
	division and expansion of NSWs. Performance is favourable compared to previous
	work in the field (Sproat et al. 2001, Normalization of non-standard words), as
	well as state-of-the-art text-to-speech software. Further, we update Sproat et
	al.'s NSW taxonomy, and create a more customisable system where users are able
	to input their own abbreviations and specify into which variety of English
	(currently available: British or American) they wish to normalise.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>flint-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4415">
    <title>Huntsville, hospitals, and hockey teams: Names can reveal your location</title>
    <author><first>Bahar</first><last>Salehi</last></author>
    <author><first>Dirk</first><last>Hovy</last></author>
    <author><first>Eduard</first><last>Hovy</last></author>
    <author><first>Anders</first><last>S&#248;gaard</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>116&#8211;121</pages>
    <url>http://www.aclweb.org/anthology/W17-4415</url>
<abstract>Geolocation is the task of identifying a social media user’s primary
	location, and in natural language processing, there is a growing literature on
	to what extent automated analysis of social media posts can help. However, not
	all content features are equally revealing of a user’s location.
	In this paper, we evaluate nine named entity (NE) types. Using various metrics,
	we find that GEO-LOC, FACILITY and SPORT-TEAM are more informative for
	geolocation than other NE types. Using these types, we improve geolocation
	accuracy and reduce distance error over several well-known text-based methods.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>salehi-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4416">
    <title>Improving Document Clustering by Removing Unnatural Language</title>
    <author><first>Myungha</first><last>Jang</last></author>
    <author><first>Jinho D.</first><last>Choi</last></author>
    <author><first>James</first><last>Allan</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>122&#8211;130</pages>
    <url>http://www.aclweb.org/anthology/W17-4416</url>
<abstract>Technical documents contain a fair amount of unnatural language, such as
	tables, formulas, and pseudo-code. Unnatural language can be an important
	factor in confusing existing NLP tools. This paper presents an effective
	method of distinguishing unnatural language from natural language, and
	evaluates the impact of unnatural language detection on NLP tasks such as
	document clustering. We view this problem as an information extraction task
	and build a multiclass classification model that identifies unnatural language
	components in four categories. First, we create a new annotated corpus by
	collecting slides and papers in various formats, PPT, PDF, and HTML, where
	unnatural language components are annotated into four categories. We then
	explore features available from plain text to build a statistical model that
	can handle any format as long as it is converted into plain text. Our
	experiments show that removing unnatural language components gives an absolute
	improvement in document clustering of up to 15%. Our corpus and tool are
	publicly available.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jang-choi-allan:2017:WNUT</bibkey>
  </paper>

  <paper id="4417">
    <title>Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text on Social Media</title>
    <author><first>Preeti</first><last>Bhargava</last></author>
    <author><first>Nemanja</first><last>Spasojevic</last></author>
    <author><first>Guoning</first><last>Hu</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>131&#8211;139</pages>
    <url>http://www.aclweb.org/anthology/W17-4417</url>
<abstract>In this paper, we describe the Lithium Natural Language Processing (NLP)
	system &#8211; a resource-constrained, high-throughput and language-agnostic system
	for information extraction from noisy user generated text on social media.
	Lithium NLP extracts a rich set of information including entities, topics,
	hashtags and sentiment from text. We discuss several real-world applications
	of the system currently incorporated in Lithium products. We also compare our
	system with existing commercial and academic NLP systems in terms of
	performance, information extracted and languages supported. We show that
	Lithium NLP is on par with, and in some cases outperforms, state-of-the-art
	commercial NLP systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bhargava-spasojevic-hu:2017:WNUT</bibkey>
  </paper>

  <paper id="4418">
    <title>Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition</title>
    <author><first>Leon</first><last>Derczynski</last></author>
    <author><first>Eric</first><last>Nichols</last></author>
    <author><first>Marieke</first><last>van Erp</last></author>
    <author><first>Nut</first><last>Limsopatham</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>140&#8211;147</pages>
    <url>http://www.aclweb.org/anthology/W17-4418</url>
    <abstract>This shared task focuses on identifying unusual, previously-unseen entities in
	the context of emerging discussions. Named entities form the basis of many
	modern approaches to other tasks (like event clustering and summarization), but
	recall on them is a real problem in noisy text &#8211; even among annotators. This
	drop tends to be due to novel entities and surface forms. Take for example the
	tweet "so.. kktny in 30 mins?!" &#8211; even human experts find the entity 'kktny'
	hard to detect and resolve. The goal of this task is to provide a definition of
	emerging and of rare entities, and based on that, also datasets for detecting
	these entities. The task as described in this paper evaluated the ability of
	participating entries to detect and classify novel and emerging named entities
	in noisy text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>derczynski-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4419">
    <title>A Multi-task Approach for Named Entity Recognition in Social Media Data</title>
    <author><first>Gustavo</first><last>Aguilar</last></author>
    <author><first>Suraj</first><last>Maharjan</last></author>
    <author><first>Adrian Pastor</first><last>L&#243;pez Monroy</last></author>
    <author><first>Thamar</first><last>Solorio</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>148&#8211;153</pages>
    <url>http://www.aclweb.org/anthology/W17-4419</url>
    <abstract>Named Entity Recognition for social media data is challenging because of its
	inherent noisiness. In addition to improper grammatical structures, it contains
	spelling inconsistencies and numerous informal abbreviations. We propose a
	novel multi-task approach by employing a more general secondary task of Named
	Entity (NE) segmentation together with the primary task of fine-grained NE
	categorization. The multi-task neural network architecture learns higher order
	feature representations from word and character sequences along with basic
	Part-of-Speech tags and gazetteer information. This neural network acts as a
	feature extractor to feed a Conditional Random Fields classifier. We were able
	to obtain the first position in the 3rd Workshop on Noisy User-generated Text
	(WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>aguilar-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4420">
    <title>Distributed Representation, LDA Topic Modelling and Deep Learning for Emerging Named Entity Recognition from Social Media</title>
    <author><first>Patrick</first><last>Jansson</last></author>
    <author><first>Shuhua</first><last>Liu</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>154&#8211;159</pages>
    <url>http://www.aclweb.org/anthology/W17-4420</url>
    <abstract>This paper reports our participation in the W-NUT 2017 shared task on emerging
	and rare entity recognition from user-generated noisy text such as tweets,
	online reviews and forum discussions. To accomplish this challenging task, we
	explore an approach that combines LDA topic modelling with deep learning on
	word-level and character-level embeddings. The LDA topic modelling generates a
	topic representation for each tweet, which is used as a feature for each word in
	the tweet. The deep learning component consists of a two-layer bidirectional LSTM
	and a CRF output layer. Our submitted system achieved F1-scores of 39.98 on
	entities and 37.77 on surface forms. Our post-submission experiments reached a
	best performance of 41.81 on entities and 40.57 on surface forms.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jansson-liu:2017:WNUT</bibkey>
  </paper>

  <paper id="4421">
    <title>Multi-channel BiLSTM-CRF Model for Emerging Named Entity Recognition in Social Media</title>
    <author><first>Bill Y.</first><last>Lin</last></author>
    <author><first>Frank</first><last>Xu</last></author>
    <author><first>Zhiyi</first><last>Luo</last></author>
    <author><first>Kenny</first><last>Zhu</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>160&#8211;165</pages>
    <url>http://www.aclweb.org/anthology/W17-4421</url>
    <abstract>In this paper, we present our multi-channel neural architecture for recognizing
	emerging named entities in social media messages, which we applied in the Novel
	and Emerging Named Entity Recognition shared task at the EMNLP 2017 Workshop on
	Noisy User-generated Text (W-NUT). We propose a novel approach that
	incorporates comprehensive word representations with multi-channel information
	and Conditional Random Fields (CRF) into a traditional Bidirectional Long
	Short-Term Memory (BiLSTM) neural network, without using any additional
	hand-crafted features such as gazetteers. In comparison with the other systems
	participating in the shared task, our system ranked second.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lin-EtAl:2017:WNUT</bibkey>
  </paper>

  <paper id="4422">
    <title>Transfer Learning and Sentence Level Features for Named Entity Recognition on Tweets</title>
    <author><first>Pius</first><last>von D&#228;niken</last></author>
    <author><first>Mark</first><last>Cieliebak</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>166&#8211;171</pages>
    <url>http://www.aclweb.org/anthology/W17-4422</url>
    <abstract>We present our system for the WNUT 2017 Named Entity Recognition challenge on
	Twitter data. We describe two modifications of a basic neural network
	architecture for sequence tagging. First, we show how we exploit additional
	labeled data, where the Named Entity tags differ from the target task. Then, we
	propose a way to incorporate sentence-level features. Our system uses both
	methods and ranked second for entity-level annotations, achieving an F1-score
	of 40.78, and second for surface-form annotations, achieving an F1-score of
	39.33.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vondaniken-cieliebak:2017:WNUT</bibkey>
  </paper>

  <paper id="4423">
    <title>Context-Sensitive Recognition for Emerging and Rare Entities</title>
    <author><first>Jake</first><last>Williams</last></author>
    <author><first>Giovanni</first><last>Santia</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>172&#8211;176</pages>
    <url>http://www.aclweb.org/anthology/W17-4423</url>
    <abstract>This paper is a shared task system description for the 2017 W-NUT shared task
	on Rare and Emerging Named Entities. Our paper describes the development and
	application of a novel algorithm for named entity recognition that relies only
	on the contexts of word forms. A comparison against the other submitted systems
	is provided.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>williams-santia:2017:WNUT</bibkey>
  </paper>

  <paper id="4424">
    <title>A Feature-based Ensemble Approach to Recognition of Emerging and Rare Named Entities</title>
    <author><first>Utpal Kumar</first><last>Sikdar</last></author>
    <author><first>Bj&#246;rn</first><last>Gamb&#228;ck</last></author>
    <booktitle>Proceedings of the 3rd Workshop on Noisy User-generated Text</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>177&#8211;181</pages>
    <url>http://www.aclweb.org/anthology/W17-4424</url>
    <abstract>Detecting previously unseen named entities in text is a challenging task. The
	paper describes how three initial classifier models were built using
	Conditional Random Fields (CRFs), Support Vector Machines (SVMs) and a Long
	Short-Term Memory (LSTM) recurrent neural network. The outputs of these three
	classifiers were then used as features to train another CRF classifier working
	as an ensemble. 5-fold cross-validation based on training and development data
	for the emerging and rare named entity recognition shared task showed precision,
	recall and F1-score of 66.87%, 46.75% and 54.97%, respectively. For surface-form
	evaluation, the CRF ensemble-based system achieved precision, recall and
	F1-scores of 65.18%, 45.20% and 53.30%. When applied to unseen test data, the
	model reached 47.92% precision, 31.97% recall and 38.55% F1-score for
	entity-level evaluation, with corresponding surface-form evaluation values of
	44.91%, 30.47% and 36.31%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sikdar-gamback:2017:WNUT</bibkey>
  </paper>

</volume>