<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W16">
  <paper id="3900">
    <title>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</title>
    <editor>Bo Han</editor>
    <editor>Alan Ritter</editor>
    <editor>Leon Derczynski</editor>
    <editor>Wei Xu</editor>
    <editor>Tim Baldwin</editor>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <url>http://aclweb.org/anthology/W16-39</url>
    <bibtype>book</bibtype>
    <bibkey>WNUT:2016</bibkey>
  </paper>

  <paper id="3901">
    <title>Processing non-canonical or noisy text: fortuitous data to the rescue</title>
    <author><first>Barbara</first><last>Plank</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>1</pages>
    <url>http://aclweb.org/anthology/W16-3901</url>
    <abstract>Real-world data differs radically from the benchmark corpora we use in NLP,
	resulting in large performance drops. The reason for this problem is obvious:
	NLP models are trained on limited samples from canonical varieties considered
	standard. However, there are many dimensions (e.g., sociodemographic, language,
	genre, sentence type) on which texts can differ from the standard. The
	solution is not obvious: we cannot control for all factors, and it is not clear
	how to best go beyond the current practice of training on homogeneous data from
	a single domain and language.  
	In this talk, I review the notion of canonicity, and how it shapes our
	community's approach to language. I argue for the use of fortuitous data.
	Fortuitous data is data out there that just waits to be harvested. It includes
	data which is in plain sight, but is often neglected, and more distant sources
	like behavioral data, which first need to be refined. These sources provide
	additional context and a myriad of opportunities to build more adaptive language
	technology, some of which I will explore in this talk.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>plank:2016:WNUT</bibkey>
  </paper>

  <paper id="3902">
    <title>From Entity Linking to Question Answering &#8211; Recent Progress on Semantic Grounding Tasks</title>
    <author><first>Ming-Wei</first><last>Chang</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>2</pages>
    <url>http://aclweb.org/anthology/W16-3902</url>
    <abstract>Entity linking and semantic parsing have been shown to be crucial to important
	applications such as question answering and document understanding. These tasks
	often require structured learning models, which make predictions on multiple
	interdependent variables. In this talk, I argue that carefully designed
	structured learning algorithms play a central role in entity linking and
	semantic parsing tasks. In particular, I will present several new structured
	learning models for entity linking, which jointly detect mentions and
	disambiguate entities as well as capture non-textual information. I will then
	show how to use a staged search procedure to build a state-of-the-art
	knowledge base question answering system. Finally, if time permits, I will
	discuss different supervision protocols for training semantic parsers and the
	value of labeling semantic parses.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chang:2016:WNUT</bibkey>
  </paper>

  <paper id="3903">
    <title>DISAANA and D-SUMM: Large-scale Real Time NLP Systems for Analyzing Disaster Related Reports in Tweets</title>
    <author><first>Kentaro</first><last>Torisawa</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>3</pages>
    <url>http://aclweb.org/anthology/W16-3903</url>
    <abstract>This talk presents two NLP systems that were developed for helping disaster
	victims and rescue workers in the aftermath of large-scale disasters. DISAANA
	provides answers to questions such as "What is in short supply in Tokyo?" and
	displays locations related to each answer on a map. D-SUMM automatically
	summarizes a large number of disaster-related reports concerning a specified
	area and helps rescue workers to understand disaster situations from a macro
	perspective. Both systems are publicly available as Web services. In the
	aftermath of the 2016 Kumamoto Earthquake (M7.0), the Japanese government
	actually used DISAANA to analyze the situation.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>torisawa:2016:WNUT</bibkey>
  </paper>

  <paper id="3904">
    <title>Private or Corporate? Predicting User Types on Twitter</title>
    <author><first>Nikola</first><last>Ljube&#x161;i&#x107;</last></author>
    <author><first>Darja</first><last>Fi&#x161;er</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>4&#8211;12</pages>
    <url>http://aclweb.org/anthology/W16-3904</url>
    <abstract>In this paper we present a series of experiments on discriminating between
	private and corporate accounts on Twitter. We define features based on Twitter
	metadata, morphosyntactic tags and surface forms, showing that the simple
	bag-of-words model achieves single best results that can, however, be improved
	by building a weighted soft ensemble of classifiers based on each feature type.
	Investigating the time and language dependence of each feature type delivers
	quite unexpected results, showing that features based on metadata are sensitive
	to both time and language, as the way the two user groups use the social
	network varies heavily through time and space.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ljubevsic-fivser:2016:WNUT</bibkey>
  </paper>

  <paper id="3905">
    <title>From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenario</title>
    <author><first>H&#233;ctor</first><last>Mart&#237;nez Alonso</last></author>
    <author><first>Djam&#233;</first><last>Seddah</last></author>
    <author><first>Beno&#238;t</first><last>Sagot</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>13&#8211;23</pages>
    <url>http://aclweb.org/anthology/W16-3905</url>
    <abstract>User-generated content presents many challenges for its automatic processing.
	While many of them do come from out-of-vocabulary effects, others spawn from
	different linguistic phenomena such as unusual syntax. In this work we present
	a French three-domain data set made up of question headlines from a cooking
	forum, game chat logs and associated forums from two popular online games
	(MINECRAFT &#38; LEAGUE OF LEGENDS). We chose these domains because they encompass
	different degrees of lexical and syntactic compliance with canonical language.
	We conduct an automatic and manual evaluation of the difficulties of processing
	these domains for part-of-speech prediction, and introduce a pilot study to
	determine whether dependency analysis lends itself well to annotating these data.
	We also discuss the development cost of our data set.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>martinezalonso-seddah-sagot:2016:WNUT</bibkey>
  </paper>

  <paper id="3906">
    <title>Disaster Analysis using User-Generated Weather Report</title>
    <author><first>Yasunobu</first><last>Asakura</last></author>
    <author><first>Masatsugu</first><last>Hangyo</last></author>
    <author><first>Mamoru</first><last>Komachi</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>24&#8211;32</pages>
    <url>http://aclweb.org/anthology/W16-3906</url>
    <abstract>Information extraction from user-generated text has gained much attention with
	the growth of the Web. Disaster analysis using information from social media
	provides valuable, real-time geolocation information for helping people caught
	up in these disasters. However, it is not convenient to analyze texts posted on
	social media because disaster keywords match any text that contains those words. For
	collecting posts about a disaster from social media, we need to develop a
	classifier to filter posts irrelevant to disasters. Moreover, because of the
	nature of social media, we can take advantage of posts that come with GPS
	information.
	However, a post does not always refer to an event occurring at the place where
	it has been posted.
	Therefore, we propose a new task of classifying whether a flood disaster
	occurred, in addition to predicting the geolocation of events from
	user-generated text. We report the annotation of the flood disaster corpus and
	develop a classifier to demonstrate the use of this corpus for disaster
	analysis.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>asakura-hangyo-komachi:2016:WNUT</bibkey>
  </paper>

  <paper id="3907">
    <title>Veracity Computing from Lexical Cues and Perceived Certainty Trends</title>
    <author><first>Uwe</first><last>Reichel</last></author>
    <author><first>Piroska</first><last>Lendvai</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>33&#8211;42</pages>
    <url>http://aclweb.org/anthology/W16-3907</url>
    <abstract>We present a data-driven method for determining the veracity of a set of
	rumorous claims on social media data. Tweets from different sources pertaining
	to a rumor are processed on three levels: first, factuality values are assigned
	to each tweet based on four textual cue categories relevant for our journalism
	use case; these amalgamate speaker support in terms of polarity and commitment
	in terms of certainty and speculation. Next, the proportions of these lexical
	cues are utilized as predictors for tweet certainty in a generalized linear
	regression model. Subsequently, lexical cue proportions, predicted certainty,
	as well as their time course characteristics are used to compute veracity for
	each rumor in terms of the identity of the rumor-resolving tweet and its binary
	resolution value judgment. The system operates without access to
	extralinguistic resources. Evaluated on the data portion for which hand-labeled
	examples were available, it achieves .74 F1-score on identifying rumor
	resolving tweets and .76 F1-score on predicting if a rumor is resolved as true
	or false.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>reichel-lendvai:2016:WNUT</bibkey>
  </paper>

  <paper id="3908">
    <title>A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation</title>
    <author><first>Marlies</first><last>van der Wees</last></author>
    <author><first>Arianna</first><last>Bisazza</last></author>
    <author><first>Christof</first><last>Monz</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>43&#8211;50</pages>
    <url>http://aclweb.org/anthology/W16-3908</url>
    <abstract>A major challenge for statistical machine translation (SMT) of
	Arabic-to-English user-generated text is the prevalence of text written in
	Arabizi, or Romanized Arabic. When facing such texts, a translation system
	trained on conventional Arabic-English data will suffer from extremely low
	model coverage. In addition, Arabizi is not regulated by any official
	standardization and is therefore highly ambiguous, which prevents rule-based
	approaches from achieving good translation results. In this paper, we improve
	Arabizi-to-English machine translation by presenting a simple but effective
	Arabizi-to-Arabic transliteration pipeline that does not require knowledge from
	experts or native Arabic speakers. We incorporate this pipeline into a
	phrase-based SMT system, and show that automatically transliterating Arabizi to
	Arabic yields translation quality comparable to that achieved after human
	transliteration.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vanderwees-bisazza-monz:2016:WNUT</bibkey>
  </paper>

  <paper id="3909">
    <title>Name Variation in Community Question Answering Systems</title>
    <author><first>Anietie</first><last>Andy</last></author>
    <author><first>Satoshi</first><last>Sekine</last></author>
    <author><first>Mugizi</first><last>Rwebangira</last></author>
    <author><first>Mark</first><last>Dredze</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>51&#8211;60</pages>
    <url>http://aclweb.org/anthology/W16-3909</url>
    <abstract>Community question answering systems are forums where users can ask and answer
	questions in various categories. Examples are Yahoo! Answers, Quora, and Stack
	Overflow. A common challenge with such systems is that a significant percentage
	of asked questions are left unanswered. In this paper, we propose an algorithm
	to reduce the number of unanswered questions in Yahoo! Answers by reusing the
	answer to the past resolved question on the site that is most similar to the
	unanswered question. Semantically similar questions could be worded differently,
	thereby making it difficult to find questions that have shared needs. For
	example, "Who is the best player for the Reds?" and "Who is currently the
	biggest star at Manchester United?" have a shared need but
	are worded differently; also, "Reds" and "Manchester United" are used to refer
	to the soccer team Manchester United football club. In this research, we focus
	on question categories that contain a large number of named entities and entity
	name variations. We show that in these categories, entity linking can be used
	to identify relevant past resolved questions that share needs with a given
	question, by disambiguating named entities and matching these questions based on
	the disambiguated entities, identified entities, and knowledge base information
	related to these entities. We evaluated our algorithm on a new dataset
	constructed from Yahoo! Answers. The dataset contains annotated question pairs,
	(Qgiven, [Qpast, Answer]). We carried out experiments on several question
	categories and show that an entity-based approach gives good performance when
	searching for similar questions in entity rich categories.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>andy-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3910">
    <title>Whose Nickname is This? Recognizing Politicians from Their Aliases</title>
    <author><first>Wei-Chung</first><last>Wang</last></author>
    <author><first>Hung-Chen</first><last>Chen</last></author>
    <author><first>Zhi-Kai</first><last>Ji</last></author>
    <author><first>Hui-I</first><last>Hsiao</last></author>
    <author><first>Yu-Shian</first><last>Chiu</last></author>
    <author><first>Lun-Wei</first><last>Ku</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>61&#8211;69</pages>
    <url>http://aclweb.org/anthology/W16-3910</url>
    <abstract>Using aliases to refer to public figures is one way to make fun of people, to
	express sarcasm, or even to sidestep legal issues when expressing opinions on
	social media. However, linking an alias back to the real name is difficult, as
	it entails phonemic, graphemic, and semantic challenges. In this paper, we
	propose a phonemic-based approach and inject semantic information to align
	aliases with politicians' Chinese formal names. The proposed approach creates
	an HMM for each name to model its phonemes and takes into account
	document-level pairwise mutual information to capture the semantic relations to
	the alias. In this work we also introduce two new datasets consisting of 167
	phonemic pairs and 279 mixed pairs of aliases and formal names. Experimental
	results show that the proposed approach models both phonemic and semantic
	information and outperforms previous work on both the phonemic and mixed
	datasets, with best top-1 accuracies of 0.78 and 0.59, respectively.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>wang-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3911">
    <title>Towards Accurate Event Detection in Social Media: A Weakly Supervised Approach for Learning Implicit Event Indicators</title>
    <author><first>Ajit</first><last>Jain</last></author>
    <author><first>Girish</first><last>Kasiviswanathan</last></author>
    <author><first>Ruihong</first><last>Huang</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>70&#8211;77</pages>
    <url>http://aclweb.org/anthology/W16-3911</url>
    <abstract>Accurate event detection in social media is very challenging because
	user-generated content is extremely noisy and sparse. Event indicators
	are generally words or phrases that act as triggers that help us understand
	the semantics of the context they occur in. We present a weakly supervised
	approach that relies on using a single strong event indicator phrase as a seed
	to acquire a variety of additional event cues. We propose to leverage various
	types of implicit event indicators, such as props, actors and precursor events,
	to achieve precise event detection. We experimented with civil
	unrest events and show that the automatically learnt event indicators are
	effective in identifying specific types of events.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jain-kasiviswanathan-huang:2016:WNUT</bibkey>
  </paper>

  <paper id="3912">
    <title>Unsupervised Stemmer for Arabic Tweets</title>
    <author><first>Fahad</first><last>Albogamy</last></author>
    <author><first>Allan</first><last>Ramsay</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>78&#8211;84</pages>
    <url>http://aclweb.org/anthology/W16-3912</url>
    <abstract>Stemming is an essential processing step in a wide range of high level text
	processing applications such as information extraction, machine translation and
	sentiment analysis. It is used to reduce words to their stems. Many stemming
	algorithms have been developed for Modern Standard Arabic (MSA). Although
	Arabic tweets and MSA are closely related and share many characteristics, there
	are substantial differences between them in lexicon and syntax. In this paper,
	we introduce a light Arabic stemmer for Arabic tweets. Our results show
	improvements over the performance of a number of well-known stemmers for
	Arabic.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>albogamy-ramsay:2016:WNUT</bibkey>
  </paper>

  <paper id="3913">
    <title>Topic Stability over Noisy Sources</title>
    <author><first>Jing</first><last>Su</last></author>
    <author><first>Derek</first><last>Greene</last></author>
    <author><first>Oisin</first><last>Boydell</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>85&#8211;93</pages>
    <url>http://aclweb.org/anthology/W16-3913</url>
    <abstract>Topic modelling techniques such as LDA have recently been applied to speech
	transcripts and OCR output. These corpora may contain noisy or erroneous texts
	which may undermine topic stability. Therefore, it is important to know how
	well a topic modelling algorithm will perform when applied to noisy data. In
	this paper we show that different types of textual noise can have diverse
	effects on the stability of topic models. On the other hand, topic model
	stability is not consistent across different levels of the same type of noise.
	We introduce a dictionary filtering approach to address this challenge, with
	the result that a topic model with the correct number of topics is always
	identified across different levels of noise.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>su-greene-boydell:2016:WNUT</bibkey>
  </paper>

  <paper id="3914">
    <title>Analysis of Twitter Data for Postmarketing Surveillance in Pharmacovigilance</title>
    <author><first>Julie</first><last>Pain</last></author>
    <author><first>Jessie</first><last>Levacher</last></author>
    <author><first>Adam</first><last>Quinquenel</last></author>
    <author><first>Anja</first><last>Belz</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>94&#8211;101</pages>
    <url>http://aclweb.org/anthology/W16-3914</url>
    <abstract>Postmarketing surveillance (PMS) has the vital aim of monitoring the effects of
	drugs after their release for use by the general population, but suffers from
	under-reporting and limited coverage. Automatic methods for detecting drug
	effect reports, especially for social media, could vastly increase the scope of
	PMS. Very few automatic PMS methods are currently available, in particular for
	the messy text types encountered on Twitter. In this paper we describe first
	results for developing PMS methods specifically for tweets. We describe the
	corpus of 125,669 tweets we have created and annotated to train and test the
	tools. We find that generic tools perform well for tweet-level language
	identification and tweet-level sentiment analysis (both 0.94 F1-Score). For
	detection of effect mentions we are able to achieve 0.87 F1-Score, while
	effect-level adverse-vs.-beneficial analysis proves harder with an F1-Score of
	0.64. Among other things, our results indicate that MetaMap semantic types
	provide a very promising basis for identifying drug effect mentions in tweets.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>pain-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3915">
    <title>Named Entity Recognition and Hashtag Decomposition to Improve the Classification of Tweets</title>
    <author><first>Billal</first><last>Belainine</last></author>
    <author><first>Alexsandro</first><last>Fonseca</last></author>
    <author><first>Fatiha</first><last>Sadat</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>102&#8211;111</pages>
    <url>http://aclweb.org/anthology/W16-3915</url>
    <abstract>In social network services like Twitter, users are overwhelmed with a huge
	amount of social data, most of which is short, unstructured and highly noisy.
	Identifying accurate information in this huge amount of data is indeed a hard
	task. Classifying tweets into an organized form will help users easily access
	the information they need. Our first contribution relates to filtering parts
	of speech and preprocessing this kind of highly noisy and short data. Our
	second contribution concerns named entity recognition (NER) in tweets; here,
	adapting existing language tools, built for canonical natural language, to the
	noisy and inaccurate language of tweets is necessary. Our third contribution
	involves segmentation of hashtags and a semantic enrichment using a
	combination of relations from WordNet, which helps the performance of our
	classification system, including disambiguation of named entities,
	abbreviations and acronyms. Graph theory is used to cluster the words
	extracted from WordNet and tweets, based on the idea of connected components.
	We test our automatic classification system with four categories: politics,
	economy, sports and the medical field. We evaluate and compare several
	automatic classification systems using part or all of the items described in
	our contributions and find that filtering by part of speech and named entity
	recognition dramatically increases the classification precision to 77.3%.
	Moreover, a classification system incorporating segmentation of hashtags and
	semantic enrichment by two relations from WordNet, synonymy and hyperonymy,
	increases classification precision to 83.4%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>belainine-fonseca-sadat:2016:WNUT</bibkey>
  </paper>

  <paper id="3916">
    <title>Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization</title>
    <author><first>Thales Felipe</first><last>Costa Bertaglia</last></author>
    <author><first>Maria das Gra&#231;as</first><last>Volpe Nunes</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>112&#8211;120</pages>
    <url>http://aclweb.org/anthology/W16-3916</url>
    <abstract>Text normalization techniques based on rules, lexicons or supervised training
	requiring large corpora are neither scalable nor domain-interchangeable, and
	this makes them unsuitable for normalizing user-generated content (UGC).
	Current tools available for Brazilian Portuguese make use of such techniques.
	In this work we propose a technique based on distributed representations of
	words (or word embeddings). It generates continuous numeric vectors of high
	dimensionality to represent words. The vectors explicitly encode many
	linguistic regularities and patterns, as well as syntactic and semantic word
	relationships. Words that share semantic similarity are represented by
	similar vectors. Based on these features, we present a totally unsupervised,
	expandable and language- and domain-independent method for learning
	normalization lexicons from word embeddings. Our approach obtains a high
	correction rate for orthographic errors and internet slang in product
	reviews, outperforming the currently available tools for Brazilian
	Portuguese.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>costabertaglia-volpenunes:2016:WNUT</bibkey>
  </paper>

  <paper id="3917">
    <title>How Document Pre-processing affects Keyphrase Extraction Performance</title>
    <author><first>Florian</first><last>Boudin</last></author>
    <author><first>Hugo</first><last>Mougard</last></author>
    <author><first>Damien</first><last>Cram</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>121&#8211;128</pages>
    <url>http://aclweb.org/anthology/W16-3917</url>
    <abstract>The SemEval-2010 benchmark dataset has brought renewed attention to the task of
	automatic keyphrase extraction. This dataset is made up of scientific articles
	that were automatically converted from PDF format to plain text and thus
	require careful preprocessing so that irrelevant spans of text do not
	negatively affect keyphrase extraction performance. In previous work, a wide
	range of document preprocessing techniques were described but their impact on
	the overall performance of keyphrase extraction models is still unexplored.
	Here, we re-assess the performance of several keyphrase extraction models and
	measure their robustness against increasingly sophisticated levels of document
	preprocessing.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>boudin-mougard-cram:2016:WNUT</bibkey>
  </paper>

  <paper id="3918">
    <title>Japanese Text Normalization with Encoder-Decoder Model</title>
    <author><first>Taishi</first><last>Ikeda</last></author>
    <author><first>Hiroyuki</first><last>Shindo</last></author>
    <author><first>Yuji</first><last>Matsumoto</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>129&#8211;137</pages>
    <url>http://aclweb.org/anthology/W16-3918</url>
    <abstract>Text normalization is the task of transforming lexical variants to their
	canonical forms.
	We model the problem of text normalization as a character-level
	sequence-to-sequence learning problem
	and present a neural encoder-decoder model for solving it.
	Training an encoder-decoder model generally requires many sentence pairs;
	however, parallel corpora pairing Japanese non-standard forms with their
	canonical forms are scarce.
	To address this issue, we propose a data augmentation method that increases
	the data size
	by converting existing resources into synthesized non-standard forms using
	handcrafted rules.
	We conducted an experiment to demonstrate that the synthesized corpus
	contributes to stable training of an encoder-decoder model and improves the
	performance of Japanese text normalization.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ikeda-shindo-matsumoto:2016:WNUT</bibkey>
  </paper>

  <paper id="3919">
    <title>Results of the WNUT16 Named Entity Recognition Shared Task</title>
    <author><first>Benjamin</first><last>Strauss</last></author>
    <author><first>Bethany</first><last>Toma</last></author>
    <author><first>Alan</first><last>Ritter</last></author>
    <author><first>Marie-Catherine</first><last>de Marneffe</last></author>
    <author><first>Wei</first><last>Xu</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>138&#8211;144</pages>
    <url>http://aclweb.org/anthology/W16-3919</url>
    <abstract>This paper presents the results of the Twitter Named Entity Recognition shared
	task associated with W-NUT 2016: a named entity tagging task with 10 teams
	participating. We outline the shared task, annotation process and dataset
	statistics, and provide a high-level overview of the participating
	systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>strauss-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3920">
    <title>Bidirectional LSTM for Named Entity Recognition in Twitter Messages</title>
    <author><first>Nut</first><last>Limsopatham</last></author>
    <author><first>Nigel</first><last>Collier</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>145&#8211;152</pages>
    <url>http://aclweb.org/anthology/W16-3920</url>
    <abstract>In this paper, we present our approach for named entity recognition in Twitter
	messages that we used in our participation in the Named Entity Recognition in
	Twitter shared task at the COLING 2016 Workshop on Noisy User-generated Text
	(WNUT). The main challenge that we aim to tackle in our participation is the
	short, noisy and colloquial nature of tweets, which makes named entity
	recognition in Twitter messages a challenging task. In particular, we
	investigate an approach for dealing with this problem by enabling bidirectional
	long short-term memory (LSTM) to automatically learn orthographic features
	without requiring feature engineering. In comparison with other systems
	participating in the shared task, our system achieved the most effective
	performance on both the `segmentation and categorisation' and the `segmentation
	only' sub-tasks.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>limsopatham-collier:2016:WNUT</bibkey>
  </paper>

  <paper id="3921">
    <title>Learning to recognise named entities in tweets by exploiting weakly labelled data</title>
    <author><first>Kurt Junshean</first><last>Espinosa</last></author>
    <author><first>Riza Theresa</first><last>Batista-Navarro</last></author>
    <author><first>Sophia</first><last>Ananiadou</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>153&#8211;163</pages>
    <url>http://aclweb.org/anthology/W16-3921</url>
    <abstract>Named entity recognition (NER) in social media (e.g., Twitter) is a challenging
	task due to the noisy nature of text. As part of our participation in the W-NUT
	2016 Named Entity Recognition Shared Task, we proposed an unsupervised learning
	approach using deep neural networks and leveraged a knowledge base (i.e.,
	DBpedia) to bootstrap sparse entity types with weakly labelled data. To further
	boost the performance, we employed a more sophisticated tagging scheme and
	applied dropout as a regularisation technique in order to reduce overfitting.
	Even without hand-crafting linguistic features or leveraging any of the
	W-NUT-provided gazetteers, we obtained robust performance with our approach,
	which ranked third amongst all shared task participants according to the
	official evaluation on a gold standard named entity-annotated corpus of 3,856
	tweets.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>espinosa-batistanavarro-ananiadou:2016:WNUT</bibkey>
  </paper>

  <paper id="3922">
    <title>Feature-Rich Twitter Named Entity Recognition and Classification</title>
    <author><first>Utpal Kumar</first><last>Sikdar</last></author>
    <author><first>Bj&#246;rn</first><last>Gamb&#228;ck</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>164&#8211;170</pages>
    <url>http://aclweb.org/anthology/W16-3922</url>
    <abstract>Twitter named entity recognition is the process of identifying proper names and
	classifying them into some predefined labels/categories. The paper introduces a
	Twitter named entity system using a supervised machine learning approach,
	namely Conditional Random Fields. A large set of different features was
	developed and the system was trained using these. The Twitter named entity task
	can be divided into two parts: i) Named entity extraction from tweets and ii)
	Twitter name classification into ten different types. For Twitter named entity
	recognition on unseen test data, our system obtained the second highest F1
	score in the shared task: 63.22%. The system performance on the classification
	task was worse, with an F1 measure of 40.06% on unseen test data, which was the
	fourth best of the ten systems participating in the shared task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sikdar-gamback:2016:WNUT</bibkey>
  </paper>

  <paper id="3923">
    <title>Learning to Search for Recognizing Named Entities in Twitter</title>
    <author><first>Ioannis</first><last>Partalas</last></author>
    <author><first>C&#233;dric</first><last>Lopez</last></author>
    <author><first>Nadia</first><last>Derbas</last></author>
    <author><first>Ruslan</first><last>Kalitvianski</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>171&#8211;177</pages>
    <url>http://aclweb.org/anthology/W16-3923</url>
    <abstract>This work presents our participation in the 2nd Named Entity Recognition
	for Twitter shared task. The task was cast as a sequence labeling one, and we
	employed a learning-to-search approach in order to tackle it. We also
	leveraged LOD for extracting rich contextual features for the named entities.
	Our submission achieved F-scores of 46.16 and 60.24 for the classification
	and the segmentation tasks and ranked 2nd and 3rd respectively. The
	post-analysis showed that LOD features substantially improved the performance
	of our system, as they counterbalance the lack of context in tweets. The
	shared task gave us the opportunity to test the performance of NER systems on
	short and noisy textual data. The results of the participating systems show
	that the task is far from being solved, and methods with stellar performance
	on normal texts need to be revised.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>partalas-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3924">
    <title>DeepNNNER: Applying BLSTM-CNNs and Extended Lexicons to Named Entity Recognition in Tweets</title>
    <author><first>Fabrice</first><last>Dugas</last></author>
    <author><first>Eric</first><last>Nichols</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>178&#8211;187</pages>
    <url>http://aclweb.org/anthology/W16-3924</url>
    <abstract>In this paper, we describe the DeepNNNER entry to The 2nd Workshop on Noisy
	User-generated Text (WNUT) Shared Task #2: Named Entity Recognition in Twitter.
	Our shared task submission adopts the bidirectional LSTM-CNN model of Chiu and
	Nichols (2016), as it has been shown to perform well on both newswire and Web
	texts. It uses word embeddings trained on large-scale Web text collections
	together with text normalization to cope with the diversity in Web texts, and
	lexicons for target named entity classes constructed from publicly-available
	sources. Extended evaluation comparing the effectiveness of various word
	embeddings, text normalization, and lexicon settings shows that our system
	achieves a maximum F1-score of 47.24, performance surpassing that of the shared
	task's second-ranked system.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>dugas-nichols:2016:WNUT</bibkey>
  </paper>

  <paper id="3925">
    <title>ASU: An Experimental Study on Applying Deep Learning in Twitter Named Entity Recognition.</title>
    <author><first>Michel Naim</first><last>Gerguis</last></author>
    <author><first>Cherif</first><last>Salama</last></author>
    <author><first>M. Watheq</first><last>El-Kharashi</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>188&#8211;196</pages>
    <url>http://aclweb.org/anthology/W16-3925</url>
    <abstract>This paper describes the ASU system submitted in the COLING W-NUT 2016 Twitter
	Named Entity Recognition (NER) task.
	We present an experimental study on applying deep learning to extracting named
	entities (NEs) from tweets.
	We built two Long Short-Term Memory (LSTM) models for the task.
	The first model was built to extract named entities without types while the
	second model was built to extract and then classify them into 10 fine-grained
	entity classes.
	In this effort, we show detailed experimentation results on the effectiveness
	of word embeddings, Brown clusters, part-of-speech (POS) tags, shape features,
	gazetteers, and local context for the tweet input vector representation to the
	LSTM model.
	Also, we present a set of experiments to better design the network parameters
	for the Twitter NER task.
	Our system ranked fifth out of ten participants, with final F1-scores of 39%
	for the typed classes and 55% for the non-typed ones.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gerguis-salama-elkharashi:2016:WNUT</bibkey>
  </paper>

  <paper id="3926">
    <title>UQAM-NTL: Named entity recognition in Twitter messages</title>
    <author><first>Ngoc Tan</first><last>LE</last></author>
    <author><first>Fatma</first><last>Mallek</last></author>
    <author><first>Fatiha</first><last>Sadat</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>197&#8211;202</pages>
    <url>http://aclweb.org/anthology/W16-3926</url>
    <abstract>This paper describes our system used in the 2nd Workshop on Noisy
	User-generated Text (WNUT) shared task for Named Entity Recognition (NER) in
	Twitter, held in conjunction with COLING 2016. Our system is based on supervised
	machine learning, applying Conditional Random Fields (CRF) to train two
	classifiers for two evaluations. The first evaluation aims at predicting the 10
	fine-grained types of named entities, while the second aims at predicting
	named entities without types. The experimental results show that our
	method significantly improves Twitter NER performance.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>le-mallek-sadat:2016:WNUT</bibkey>
  </paper>

  <paper id="3927">
    <title>Semi-supervised Named Entity Recognition in noisy-text</title>
    <author><first>Shubhanshu</first><last>Mishra</last></author>
    <author><first>Jana</first><last>Diesner</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>203&#8211;212</pages>
    <url>http://aclweb.org/anthology/W16-3927</url>
    <abstract>Many of the existing Named Entity Recognition (NER) solutions are built based
	on news corpus data with proper syntax. These solutions might not lead to
	highly accurate results when being applied to noisy, user generated data, e.g.,
	tweets, which can feature sloppy spelling, concept drift, and limited
	contextualization of terms and concepts due to length constraints. The models
	described in this paper are based on linear chain conditional random fields
	(CRFs), use the BIEOU encoding scheme, and leverage random feature dropout for
	up-sampling the training data. The considered features include word clusters
	and pre-trained distributed word representations, updated gazetteer features,
	and global context predictions. The latter feature allows for ingesting the
	meaning of new or rare tokens into the system via unsupervised learning and for
	alleviating the need to learn lexicon based features, which usually tend to be
	high dimensional. In this paper, we report on the solution [ST] we submitted to
	the WNUT 2016 NER shared task. We also present an improvement over our original
	submission [SI], which we built by using semi-supervised learning on labelled
	training data and pre-trained resources constructed from unlabelled tweet data.
	Our ST solution achieved an F1 score 1.2% higher than the baseline (35.1%
	F1) for the task of extracting 10 entity types. The SI resulted in an increase
	of 8.2% in F1 score over the baseline (7.08% over ST). Finally, the SI
	model’s evaluation on the test data achieved an F1 score of 47.3% (a ~1.15%
	increase over the 2nd best submitted solution). Our experimental setup and
	results are available as a standalone Twitter NER tool at
	https://github.com/napsternxg/TwitterNER.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mishra-diesner:2016:WNUT</bibkey>
  </paper>

  <paper id="3928">
    <title>Twitter Geolocation Prediction Shared Task of the 2016 Workshop on Noisy User-generated Text</title>
    <author><first>Bo</first><last>Han</last></author>
    <author><first>Afshin</first><last>Rahimi</last></author>
    <author><first>Leon</first><last>Derczynski</last></author>
    <author><first>Timothy</first><last>Baldwin</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>213&#8211;217</pages>
    <url>http://aclweb.org/anthology/W16-3928</url>
    <abstract>This paper presents the shared task for English Twitter geolocation prediction
	in WNUT 2016. We discuss details of task settings, data preparation and
	participant systems. The derived dataset and performance figures from each
	system provide baselines for future research in this realm.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>han-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3929">
    <title>CSIRO Data61 at the WNUT Geo Shared Task</title>
    <author><first>Gaya</first><last>Jayasinghe</last></author>
    <author><first>Brian</first><last>Jin</last></author>
    <author><first>James</first><last>Mchugh</last></author>
    <author><first>Bella</first><last>Robinson</last></author>
    <author><first>Stephen</first><last>Wan</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>218&#8211;226</pages>
    <url>http://aclweb.org/anthology/W16-3929</url>
    <abstract>In this paper, we describe CSIRO Data61’s participation in the Geolocation
	shared task at the Workshop on Noisy User-generated Text. Our approach was to
	use ensemble methods to capitalise on four component methods: heuristics
	based on metadata, a label propagation method, timezone text classifiers, and
	an information retrieval approach. The ensembles we explored focused on
	examining the role of language technologies in geolocation prediction and
	also on examining the use of hard voting and cascading ensemble methods.
	Based on the accuracy of city-level predictions, our systems were the best
	performing submissions at this year’s shared task. Furthermore, when
	estimating the latitude and longitude of a user, our median error distance
	was within 30 kilometers.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jayasinghe-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3930">
    <title>Geolocation Prediction in Twitter Using Location Indicative Words and Textual Features</title>
    <author><first>Lianhua</first><last>Chi</last></author>
    <author><first>Kwan Hui</first><last>Lim</last></author>
    <author><first>Nebula</first><last>Alam</last></author>
    <author><first>Christopher J.</first><last>Butler</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>227&#8211;234</pages>
    <url>http://aclweb.org/anthology/W16-3930</url>
    <abstract>Knowing the location of a social media user and their posts is important for
	various purposes, such as the recommendation of location-based items/services,
	and locality detection of crisis/disasters. This paper describes our submission
	to the shared task "Geolocation Prediction in Twitter" of the 2nd Workshop on
	Noisy User-generated Text. In this shared task, we propose an algorithm to
	predict the location of Twitter users and tweets using a multinomial Naive
	Bayes classifier trained on Location Indicative Words and various textual
	features (such as city/country names, #hashtags and @mentions). We compared our
	approach against various baselines based on Location Indicative Words,
	city/country names, #hashtags and @mentions as individual feature sets, and
	experimental results show that our approach outperforms these baselines in
	terms of classification accuracy, mean and median error distance.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chi-EtAl:2016:WNUT</bibkey>
  </paper>

  <paper id="3931">
    <title>A Simple Scalable Neural Networks based Model for Geolocation Prediction in Twitter</title>
    <author><first>Yasuhide</first><last>Miura</last></author>
    <author><first>Motoki</first><last>Taniguchi</last></author>
    <author><first>Tomoki</first><last>Taniguchi</last></author>
    <author><first>Tomoko</first><last>Ohkuma</last></author>
    <booktitle>Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>235&#8211;239</pages>
    <url>http://aclweb.org/anthology/W16-3931</url>
    <abstract>This paper describes a model that we submitted to W-NUT 2016 Shared task #1:
	Geolocation Prediction in Twitter. Our model classifies a tweet or a user to a
	city using a simple neural networks structure with fully-connected layers and
	average pooling processes. From the findings of previous geolocation prediction
	approaches, we integrated various user metadata along with message texts and
	trained the model with them. In the test run of the task, the model achieved
	an accuracy of 40.91% and a median distance error of 69.50 km in
	message-level prediction, and an accuracy of 47.55% and a median distance
	error of 16.13 km in user-level prediction. These results represent moderate
	performance in terms of accuracy and the best performance in terms of distance.
	The results show a promising extension of neural networks based models for
	geolocation prediction where recent advances in neural networks can be added to
	enhance our current simple model.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>miura-EtAl:2016:WNUT</bibkey>
  </paper>

</volume>

