<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W18">
  <paper id="6100">
    <title>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</title>
    <editor>Wei Xu</editor>
    <editor>Alan Ritter</editor>
    <editor>Tim Baldwin</editor>
    <editor>Afshin Rahimi</editor>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W18-61</url>
    <bibtype>book</bibtype>
    <bibkey>W-NUT2018:2018</bibkey>
  </paper>

  <paper id="6101">
    <title>Inducing a lexicon of sociolinguistic variables from code-mixed text</title>
    <author><first>Philippa</first><last>Shoemark</last></author>
    <author><first>James</first><last>Kirby</last></author>
    <author><first>Sharon</first><last>Goldwater</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;6</pages>
    <url>http://www.aclweb.org/anthology/W18-6101</url>
    <abstract>Sociolinguistics is often concerned with how variants of a linguistic item (e.g., nothing vs. nothin’) are used by different groups or in different situations. We introduce the task of inducing lexical variables from code-mixed text: that is, identifying equivalence pairs such as (football, fitba) along with their linguistic code (football→British, fitba→Scottish). We adapt a framework for identifying gender-biased word pairs to this new task, and present results on three different pairs of English dialects, using tweets as the code-mixed text. Our system achieves precision of over 70% for two of these three datasets, and produces useful results even without extensive parameter tuning. Our success in adapting this framework from gender to language variety suggests that it could be used to discover other types of analogous pairs as well.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>shoemark-kirby-goldwater:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6102">
    <title>Twitter Geolocation using Knowledge-Based Methods</title>
    <author><first>Taro</first><last>Miyazaki</last></author>
    <author><first>Afshin</first><last>Rahimi</last></author>
    <author><first>Trevor</first><last>Cohn</last></author>
    <author><first>Timothy</first><last>Baldwin</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>7&#8211;16</pages>
    <url>http://www.aclweb.org/anthology/W18-6102</url>
    <abstract>Geolocation of user posts on Twitter is useful for many applications, including disaster monitoring and news material gathering. However, the vast majority of tweets have no explicit geotag, motivating the need for automatic geolocation prediction methods. We propose the use of named entity linking in geolocation prediction, modelled using graph convolutional networks over a knowledge base of entity relations, which is combined with text-based models in an end-to-end deep learning framework. We show that our method improves on text-based models, and learns effective representations for named entities that do not appear in the training data.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>miyazaki-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6103">
    <title>Geocoding Without Geotags: A Text-based Approach for reddit</title>
    <author><first>Keith</first><last>Harrigian</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>17&#8211;27</pages>
    <url>http://www.aclweb.org/anthology/W18-6103</url>
    <abstract>In this paper, we introduce the first geolocation inference approach for reddit, a social media platform where user pseudonymity has thus far made supervised demographic inference difficult to implement and validate. In particular, we design a text-based heuristic schema to generate ground truth location labels for reddit users in the absence of explicitly geotagged data. After evaluating the accuracy of our labeling procedure, we train and test several geolocation inference models across our reddit data set and three benchmark Twitter geolocation data sets. Ultimately, we show that geolocation models trained and applied on the same domain substantially outperform models attempting to transfer training data across domains, even more so on reddit where platform-specific interest-group metadata can be used to improve inferences.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>harrigian:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6104">
    <title>Assigning people to tasks identified in email: The EPA dataset for addressee tagging for detected task intent</title>
    <author><first>Revanth</first><last>Rameshkumar</last></author>
    <author><first>Peter</first><last>Bailey</last></author>
    <author><first>Abhishek</first><last>Jha</last></author>
    <author><first>Chris</first><last>Quirk</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>28&#8211;32</pages>
    <url>http://www.aclweb.org/anthology/W18-6104</url>
    <abstract>We describe the Enron People Assignment (EPA) dataset, in which tasks that are described in emails are associated with the person(s) responsible for carrying out these tasks. We identify tasks and the responsible people in the Enron email dataset. We define evaluation methods for this challenge and report scores for our model and naïve baselines. The resulting model enables a user experience operating within a commercial email service: given a person and a task, it determines if the person should be notified of the task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rameshkumar-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6105">
    <title>How do you correct run-on sentences it's not as easy as it seems</title>
    <author><first>Junchao</first><last>Zheng</last></author>
    <author><first>Courtney</first><last>Napoles</last></author>
    <author><first>Joel</first><last>Tetreault</last></author>
    <author><first>Kostiantyn</first><last>Omelianchuk</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>33&#8211;38</pages>
    <url>http://www.aclweb.org/anthology/W18-6105</url>
    <abstract>Run-on sentences are common grammatical mistakes but little research has tackled this problem to date. This work introduces two machine learning models to correct run-on sentences that outperform leading methods for related tasks, punctuation restoration and whole-sentence grammatical error correction. Due to the limited annotated data for this error, we experiment with artificially generating training data from clean newswire text. Our findings suggest artificial training data is viable for this task. We discuss implications for correcting run-ons and other types of mistakes that have low coverage in error-annotated corpora.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zheng-napoles-tetreault:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6106">
    <title>A POS Tagging Model Adapted to Learner English</title>
    <author><first>Ryo</first><last>Nagata</last></author>
    <author><first>Tomoya</first><last>Mizumoto</last></author>
    <author><first>Yuta</first><last>Kikuchi</last></author>
    <author><first>Yoshifumi</first><last>Kawasaki</last></author>
    <author><first>Kotaro</first><last>Funakoshi</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>39&#8211;48</pages>
    <url>http://www.aclweb.org/anthology/W18-6106</url>
    <abstract>There has been very limited work on the adaptation of Part-Of-Speech (POS) tagging to learner English despite the fact that POS tagging is widely used in related tasks. In this paper, we explore how we can adapt POS tagging to learner English efficiently and effectively. Based on the discussion of possible causes of POS tagging errors in learner English, we show that deep neural models are particularly suitable for this. Considering the previous findings and the discussion, we introduce the design of our model based on bidirectional Long Short-Term Memory. In addition, we describe how to adapt it to a wide variety of native languages (potentially, hundreds of them). In the evaluation section, we empirically show that it is effective for POS tagging in learner English, achieving an accuracy of 0.964, which significantly outperforms the state-of-the-art POS-tagger. We further investigate the tagging results in detail, revealing which part of the model design does or does not improve the performance.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nagata-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6107">
    <title>Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model &#38; Levenshtein Distance</title>
    <author><first>Soumil</first><last>Mandal</last></author>
    <author><first>Karthick</first><last>Nanmaran</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>49&#8211;53</pages>
    <url>http://www.aclweb.org/anthology/W18-6107</url>
    <abstract>Building tools for code-mixed data is rapidly gaining popularity in the NLP research community, as such data is rising exponentially on social media. Working with code-mixed data presents several challenges, especially due to grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media scenarios. In this article, we present a novel architecture focusing on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that in addition to normalizing, it can also be utilized for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mandal-nanmaran:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6108">
    <title>Robust Word Vectors: Context-Informed Embeddings for Noisy Texts</title>
    <author><first>Valentin</first><last>Malykh</last></author>
    <author><first>Varvara</first><last>Logacheva</last></author>
    <author><first>Taras</first><last>Khakhulin</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>54&#8211;63</pages>
    <url>http://www.aclweb.org/anthology/W18-6108</url>
    <abstract>We suggest a new language-independent architecture of robust word vectors (RoVe). It is designed to alleviate the issue of typos, which are common in almost any user-generated content, and hinder automatic text processing. Our model is morphologically motivated, which allows it to deal with unseen word forms in morphologically rich languages. </abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>malykh-logacheva-khakhulin:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6109">
    <title>Paraphrase Detection on Noisy Subtitles in Six Languages</title>
    <author><first>Eetu</first><last>Sjöblom</last></author>
    <author><first>Mathias</first><last>Creutz</last></author>
    <author><first>Mikko</first><last>Aulamo</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>64&#8211;73</pages>
    <url>http://www.aclweb.org/anthology/W18-6109</url>
    <abstract>We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, without reaching the same level of performance, because of a domain mismatch between training and test data.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sjblom-creutz-aulamo:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6110">
    <title>Distantly Supervised Attribute Detection from Reviews</title>
    <author><first>Lisheng</first><last>Fu</last></author>
    <author><first>Pablo</first><last>Barrio</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>74&#8211;78</pages>
    <url>http://www.aclweb.org/anthology/W18-6110</url>
    <abstract>This paper aims to detect specific attributes of a place (e.g., if it has a romantic atmosphere, or if it offers outdoor seating) from its user reviews via distant supervision: without direct annotation of review text, we use the crowdsourced attribute labels of a place as labels of the review text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>fu-barrio:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6111">
    <title>Using Wikipedia Edits in Low Resource Grammatical Error Correction</title>
    <author><first>Adriane</first><last>Boyd</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>79&#8211;84</pages>
    <url>http://www.aclweb.org/anthology/W18-6111</url>
    <abstract>We develop a grammatical error correction (GEC) system for German using a small gold GEC corpus augmented with edits extracted from Wikipedia revision history. We extend the automatic error annotation tool ERRANT (Bryant et al., 2017) for German and use it to analyze both gold GEC corrections and Wikipedia edits (Grundkiewicz and Junczys-Dowmunt, 2014) in order to select as additional training data Wikipedia edits containing grammatical corrections similar to those in the gold corpus. Using a multilayer convolutional encoder-decoder neural network GEC approach (Chollampatt and Ng, 2018), we evaluate the contribution of Wikipedia edits and find that carefully selected Wikipedia edits increase performance by over 5%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>boyd:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6112">
    <title>Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts</title>
    <author><first>Kemal</first><last>Kurniawan</last></author>
    <author><first>Samuel</first><last>Louvan</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>85&#8211;92</pages>
    <url>http://www.aclweb.org/anthology/W18-6112</url>
    <abstract>Despite the long history of the named-entity recognition (NER) task in the natural language processing community, previous work has rarely studied the task on conversational texts. Such texts are challenging because they contain many word variations, which increase the number of out-of-vocabulary (OOV) words. The high number of OOV words poses a difficulty for word-based neural models. Meanwhile, there is plenty of evidence for the effectiveness of character-based neural models in mitigating this OOV problem. We report an empirical evaluation of neural sequence labeling models with character embeddings to tackle the NER task in Indonesian conversational texts. Our experiments show that (1) character models outperform word embedding-only models by up to 4 F1 points, (2) character models perform better in OOV cases with an improvement of as high as 15 F1 points, and (3) character models are robust against a very high OOV rate.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kurniawan-louvan:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6113">
    <title>Orthogonal Matching Pursuit for Text Classification</title>
    <author><first>Konstantinos</first><last>Skianis</last></author>
    <author><first>Nikolaos</first><last>Tziortziotis</last></author>
    <author><first>Michalis</first><last>Vazirgiannis</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>93&#8211;103</pages>
    <url>http://www.aclweb.org/anthology/W18-6113</url>
    <abstract>In text classification, the problem of overfitting arises due to the high dimensionality, making regularization essential.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>skianis-tziortziotis-vazirgiannis:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6114">
    <title>Training and Prediction Data Discrepancies: Challenges of Text Classification with Noisy, Historical Data</title>
    <author><first>R. Andrew</first><last>Kreek</last></author>
    <author><first>Emilia</first><last>Apostolova</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>104&#8211;109</pages>
    <url>http://www.aclweb.org/anthology/W18-6114</url>
    <abstract>Industry datasets used for text classification are rarely created for that purpose. In most cases, the data and target predictions are a by-product of accumulated historical data, typically fraught with noise, present both in the text-based documents and in the targeted labels. In this work, we address the question of how well performance metrics computed on noisy, historical data reflect the performance on the intended future machine learning model input. The results demonstrate the utility of dirty training datasets used to build prediction models for cleaner (and different) prediction inputs.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kreek-apostolova:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6115">
    <title>Detecting Code-Switching between Turkish-English Language Pair</title>
    <author><first>Zeynep</first><last>Yirmibeşoğlu</last></author>
    <author><first>G&#252;lşen</first><last>Eryiğit</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>110&#8211;115</pages>
    <url>http://www.aclweb.org/anthology/W18-6115</url>
    <abstract>Code-switching (the alternating use of different languages within a single conversational context) is an increasingly common phenomenon in social media and colloquial usage, and it poses various challenges for natural language processing. This paper introduces the first study of the detection of Turkish-English code-switching, along with a small test set collected from social media, in order to smooth the way for further studies.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yirmibeolu-eryiit:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6116">
    <title>Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture</title>
    <author><first>Soumil</first><last>Mandal</last></author>
    <author><first>Anil Kumar</first><last>Singh</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>116&#8211;120</pages>
    <url>http://www.aclweb.org/anthology/W18-6116</url>
    <abstract>An accurate language identification tool is an absolute necessity for building complex NLP systems to be used on code-mixed data. A lot of work has been done recently on this task, but there is still room for improvement. Inspired by recent advancements in neural network architectures for computer vision tasks, we have implemented multichannel neural networks combining CNN and LSTM for word-level language identification of code-mixed data. Combining this with a Bi-LSTM-CRF context capture module, accuracies of 93.28% and 93.32% are achieved on our two test sets.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mandal-singh:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6117">
    <title>Modeling Student Response Times: Towards Efficient One-on-one Tutoring Dialogues</title>
    <author><first>Luciana</first><last>Benotti</last></author>
    <author><first>Jayadev</first><last>Bhaskaran</last></author>
    <author><first>Sigtryggur</first><last>Kjartansson</last></author>
    <author><first>David</first><last>Lang</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>121&#8211;131</pages>
    <url>http://www.aclweb.org/anthology/W18-6117</url>
    <abstract>In this paper we investigate the task of modeling how long it would take a student to respond to a tutor question during a tutoring dialogue.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>benotti-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6118">
    <title>Content Extraction and Lexical Analysis from Customer-Agent Interactions</title>
    <author><first>Sergiu</first><last>Nisioi</last></author>
    <author><first>Anca</first><last>Bucur</last></author>
    <author><first>Liviu P.</first><last>Dinu</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>132&#8211;136</pages>
    <url>http://www.aclweb.org/anthology/W18-6118</url>
    <abstract>In this paper, we provide a lexical comparative analysis of the vocabulary used by customers and agents in an </abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nisioi-bucur-dinu:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6119">
    <title>Preferred Answer Selection in Stack Overflow: Better Text Representations ... and Metadata, Metadata, Metadata</title>
    <author><first>Steven</first><last>Xu</last></author>
    <author><first>Andrew</first><last>Bennett</last></author>
    <author><first>Doris</first><last>Hoogeveen</last></author>
    <author><first>Jey Han</first><last>Lau</last></author>
    <author><first>Timothy</first><last>Baldwin</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>137&#8211;147</pages>
    <url>http://www.aclweb.org/anthology/W18-6119</url>
    <abstract>Community question answering (cQA) forums provide a rich source of</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>xu-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6120">
    <title>Word-like character n-gram embedding</title>
    <author><first>Geewook</first><last>Kim</last></author>
    <author><first>Kazuki</first><last>Fukui</last></author>
    <author><first>Hidetoshi</first><last>Shimodaira</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>148&#8211;152</pages>
    <url>http://www.aclweb.org/anthology/W18-6120</url>
    <abstract>We propose a new word embedding method called "word-like character n-gram embedding", which learns distributed representations of words by embedding word-like character n-grams.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kim-fukui-shimodaira:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6121">
    <title>Classification of Tweets about Reported Events using Neural Networks</title>
    <author><first>Kiminobu</first><last>Makino</last></author>
    <author><first>Yuka</first><last>Takei</last></author>
    <author><first>Taro</first><last>Miyazaki</last></author>
    <author><first>Jun</first><last>Goto</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>153&#8211;163</pages>
    <url>http://www.aclweb.org/anthology/W18-6121</url>
    <abstract>We developed a system that automatically extracts "Event-describing Tweets" which include incidents or accidents information for creating news reports. Event-describing Tweets can be classified into "Reported-event Tweets" and "New-information Tweets." Reported-event Tweets cite news agencies or user generated content sites, and New-information Tweets are other Event-describing Tweets. A system is needed to classify them so that creators of factual TV programs can use them in their productions. Proposing this Tweet classification task is one of the contributions of this paper, because no prior papers have used the same task even though program creators and other events information collectors have to do it to extract required information from social networking sites. To classify Tweets in this task, this paper proposes a method to input and concatenate character and word sequences in Japanese Tweets by using convolutional neural networks. This proposed method is another contribution of this paper. For comparison, character or word input methods and other neural networks are also used. Results show that a system using the proposed method and architectures can classify Tweets with an F1 score of 88%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>makino-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6122">
    <title>Learning to Define Terms in the Software Domain</title>
    <author><first>Vidhisha</first><last>Balachandran</last></author>
    <author><first>Dheeraj</first><last>Rajagopal</last></author>
    <author><first>Rose Catherine</first><last>Kanjirathinkal</last></author>
    <author><first>William</first><last>Cohen</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>164&#8211;172</pages>
    <url>http://www.aclweb.org/anthology/W18-6122</url>
    <abstract>One way to test a person's knowledge of a domain is to ask them to define domain-specific terms. Here, we investigate the task of automatically generating definitions of technical terms by reading text from the technical domain. Specifically, we learn definitions of software entities from a large corpus built from the user forum Stack Overflow. To model definitions, we train a language model and incorporate additional domain-specific information like word-word co-occurrence, and ontological category information. Our approach improves previous baselines by 2 BLEU points for the definition generation task. Our experiments also show the additional challenges associated with the task and the short-comings of language-model based architectures for definition generation.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>balachandran-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6123">
    <title>FrameIt: Ontology Discovery for Noisy User-Generated Text</title>
    <author><first>Dan</first><last>Iter</last></author>
    <author><first>Alon</first><last>Halevy</last></author>
    <author><first>Wang-Chiew</first><last>Tan</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>173&#8211;183</pages>
    <url>http://www.aclweb.org/anthology/W18-6123</url>
    <abstract>A common need of NLP applications is to extract structured data from text corpora in order to perform analytics or trigger an appropriate action. The ontology defining the structure is typically application dependent and in many cases it is not known a priori. We describe the FrameIt System that provides a workflow for (1) quickly discovering an ontology to model a text corpus and (2) learning an SRL model that extracts the instances of the ontology from sentences in the corpus. FrameIt exploits data that is obtained in the ontology discovery phase as weak supervision data to bootstrap the SRL model and then enables the user to refine the model with active learning. We present empirical results and qualitative analysis of the performance of FrameIt on three corpora of noisy user-generated text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>iter-halevy-tan:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6124">
    <title>Using Author Embeddings to Improve Tweet Stance Classification</title>
    <author><first>Adrian</first><last>Benton</last></author>
    <author><first>Mark</first><last>Dredze</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>184&#8211;194</pages>
    <url>http://www.aclweb.org/anthology/W18-6124</url>
    <abstract>Many social media classification tasks analyze the content of a message, but do not consider</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>benton-dredze:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6125">
    <title>Low-resource named entity recognition via multi-source projection: Not quite there yet?</title>
    <author><first>Jan Vium</first><last>Enghoff</last></author>
    <author><first>Søren</first><last>Harrison</last></author>
    <author><first>Željko</first><last>Agić</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>195&#8211;201</pages>
    <url>http://www.aclweb.org/anthology/W18-6125</url>
    <abstract>Projecting linguistic annotations through word alignments is one of the most prevalent approaches to cross-lingual transfer learning. Conventional wisdom suggests that annotation projection &#x201c;just works&#x201d; regardless of the task at hand. We carefully consider multi-source projection for named entity recognition. Our experiment with 17 languages shows that to detect named entities in true low-resource languages, annotation projection may not be the right way to move forward. On a more positive note, we also uncover the conditions that do favor named entity projection from multiple sources. We argue these are infeasible under noisy low-resource constraints.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>enghoff-harrison-agi:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6126">
    <title>A Case Study on Learning a Unified Encoder of Relations</title>
    <author><first>Lisheng</first><last>Fu</last></author>
    <author><first>Bonan</first><last>Min</last></author>
    <author><first>Thien Huu</first><last>Nguyen</last></author>
    <author><first>Ralph</first><last>Grishman</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>202&#8211;207</pages>
    <url>http://www.aclweb.org/anthology/W18-6126</url>
    <abstract>Typical relation extraction models are trained on a single corpus annotated with a pre-defined relation schema.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>fu-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6127">
    <title>Convolutions Are All You Need (For Classifying Character Sequences)</title>
    <author><first>Zach</first><last>Wood-Doughty</last></author>
    <author><first>Nicholas</first><last>Andrews</last></author>
    <author><first>Mark</first><last>Dredze</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>208&#8211;213</pages>
    <url>http://www.aclweb.org/anthology/W18-6127</url>
    <abstract>While recurrent neural networks (RNNs)</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>wooddoughty-andrews-dredze:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6128">
    <title>Step or Not: Discriminator for The Real Instructions in User-generated Recipes</title>
    <author><first>Shintaro</first><last>Inuzuka</last></author>
    <author><first>Takahiko</first><last>Ito</last></author>
    <author><first>Jun</first><last>Harashima</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>214</pages>
    <url>http://www.aclweb.org/anthology/W18-6128</url>
    <abstract>In a recipe sharing service, users publish recipe instructions in the form of a series of steps. However, some of the "steps" are not actually part of the cooking process. Specifically, advertisements of the recipes themselves (e.g., "introduced on TV") and comments (e.g., "Thanks for many messages") may be included in the step section of the recipe, as if it were the recipe author's communication tool. Such fake steps can cause problems when indexing recipes for search or when steps are read aloud by devices such as smart speakers. As presented in this talk, we have constructed a discriminator that distinguishes between such fake steps and the steps actually used for cooking. This project includes, but is not limited to, the creation of annotation data by classifying and analyzing recipe steps and the construction of identification models. Our models use only text information to identify the steps. In our tests, machine learning models achieved higher accuracy than rule-based methods that use manually chosen words.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>inuzuka-ito-harashima:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6129">
    <title>Combining Human and Machine Transcriptions on the Zooniverse Platform</title>
    <author><first>Daniel</first><last>Hanson</last></author>
    <author><first>Andrea</first><last>Simenstad</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>215&#8211;216</pages>
    <url>http://www.aclweb.org/anthology/W18-6129</url>
    <abstract>This is a 1-page abstract on a work-in-progress for the Workshop on Noisy User-generated Text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hanson-simenstad:2018:W-NUT2018</bibkey>
  </paper>

</volume>