<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W18">
  <paper id="6100">
    <title>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</title>
    <editor>Wei Xu</editor>
    <editor>Alan Ritter</editor>
    <editor>Tim Baldwin</editor>
    <editor>Afshin Rahimi</editor>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W18-61</url>
    <bibtype>book</bibtype>
    <bibkey>W-NUT2018:2018</bibkey>
  </paper>

  <paper id="6101">
    <title>Inducing a lexicon of sociolinguistic variables from code-mixed text</title>
    <author><first>Philippa</first><last>Shoemark</last></author>
    <author><first>James</first><last>Kirby</last></author>
    <author><first>Sharon</first><last>Goldwater</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;6</pages>
    <url>http://www.aclweb.org/anthology/W18-6101</url>
    <abstract>Sociolinguistics is often concerned with how variants of a linguistic item (e.g., nothing vs. nothin’) are used by different groups or in different situations. We introduce the task of inducing lexical variables from code-mixed text: that is, identifying equivalence pairs such as (football, fitba) along with their linguistic code (football→British, fitba→Scottish). We adapt a framework for identifying gender-biased word pairs to this new task, and present results on three different pairs of English dialects, using tweets as the code-mixed text. Our system achieves precision of over 70% for two of these three datasets, and produces useful results even without extensive parameter tuning. Our success in adapting this framework from gender to language variety suggests that it could be used to discover other types of analogous pairs as well.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>shoemark-kirby-goldwater:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6102">
    <title>Twitter Geolocation using Knowledge-Based Methods</title>
    <author><first>Taro</first><last>Miyazaki</last></author>
    <author><first>Afshin</first><last>Rahimi</last></author>
    <author><first>Trevor</first><last>Cohn</last></author>
    <author><first>Timothy</first><last>Baldwin</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>7&#8211;16</pages>
    <url>http://www.aclweb.org/anthology/W18-6102</url>
    <abstract>Geolocation of user posts on Twitter is useful for many applications, including disaster monitoring and news material gathering. However, the vast majority of tweets have no explicit geotag, motivating the need for automatic geolocation prediction methods. We propose the use of named entity linking in geolocation prediction, modelled using graph convolutional networks over a knowledge base of entity relations, which is combined with text-based models in an end-to-end deep learning framework. We show that our method improves on text-based models, and learns effective representations for named entities that do not appear in the training data.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>miyazaki-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6103">
    <title>Geocoding Without Geotags: A Text-based Approach for reddit</title>
    <author><first>Keith</first><last>Harrigian</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>17&#8211;27</pages>
    <url>http://www.aclweb.org/anthology/W18-6103</url>
    <abstract>In this paper, we introduce the first geolocation inference approach for reddit, a social media platform where user pseudonymity has thus far made supervised demographic inference difficult to implement and validate. In particular, we design a text-based heuristic schema to generate ground truth location labels for reddit users in the absence of explicitly geotagged data. After evaluating the accuracy of our labeling procedure, we train and test several geolocation inference models across our reddit data set and three benchmark Twitter geolocation data sets. Ultimately, we show that geolocation models trained and applied on the same domain substantially outperform models attempting to transfer training data across domains, even more so on reddit where platform-specific interest-group metadata can be used to improve inferences.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>harrigian:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6104">
    <title>Assigning people to tasks identified in email: The EPA dataset for addressee tagging for detected task intent</title>
    <author><first>Revanth</first><last>Rameshkumar</last></author>
    <author><first>Peter</first><last>Bailey</last></author>
    <author><first>Abhishek</first><last>Jha</last></author>
    <author><first>Chris</first><last>Quirk</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>28&#8211;32</pages>
    <url>http://www.aclweb.org/anthology/W18-6104</url>
    <abstract>We describe the Enron People Assignment (EPA) dataset, in which tasks that are described in emails are associated with the person(s) responsible for carrying out these tasks. We identify tasks and the responsible people in the Enron email dataset. We define evaluation methods for this challenge and report scores for our model and naïve baselines. The resulting model enables a user experience operating within a commercial email service: given a person and a task, it determines if the person should be notified of the task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rameshkumar-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6105">
    <title>How do you correct run-on sentences it's not as easy as it seems</title>
    <author><first>Junchao</first><last>Zheng</last></author>
    <author><first>Courtney</first><last>Napoles</last></author>
    <author><first>Joel</first><last>Tetreault</last></author>
    <author><first>Kostiantyn</first><last>Omelianchuk</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>33&#8211;38</pages>
    <url>http://www.aclweb.org/anthology/W18-6105</url>
    <abstract>Run-on sentences are common grammatical mistakes but little research has tackled this problem to date. This work introduces two machine learning models to correct run-on sentences that outperform leading methods for related tasks, punctuation restoration and whole-sentence grammatical error correction. Due to the limited annotated data for this error, we experiment with artificially generating training data from clean newswire text. Our findings suggest artificial training data is viable for this task. We discuss implications for correcting run-ons and other types of mistakes that have low coverage in error-annotated corpora.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zheng-napoles-tetreault:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6106">
    <title>A POS Tagging Model Adapted to Learner English</title>
    <author><first>Ryo</first><last>Nagata</last></author>
    <author><first>Tomoya</first><last>Mizumoto</last></author>
    <author><first>Yuta</first><last>Kikuchi</last></author>
    <author><first>Yoshifumi</first><last>Kawasaki</last></author>
    <author><first>Kotaro</first><last>Funakoshi</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>39&#8211;48</pages>
    <url>http://www.aclweb.org/anthology/W18-6106</url>
    <abstract>There has been very limited work on the adaptation of Part-Of-Speech (POS) tagging to learner English despite the fact that POS tagging is widely used in related tasks. In this paper, we explore how we can adapt POS tagging to learner English efficiently and effectively. Based on the discussion of possible causes of POS tagging errors in learner English, we show that deep neural models are particularly suitable for this. Considering the previous findings and the discussion, we introduce the design of our model based on bidirectional Long Short-Term Memory. In addition, we describe how to adapt it to a wide variety of native languages (potentially, hundreds of them). In the evaluation section, we empirically show that it is effective for POS tagging in learner English, achieving an accuracy of 0.964, which significantly outperforms the state-of-the-art POS-tagger. We further investigate the tagging results in detail, revealing which part of the model design does or does not improve the performance.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nagata-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6107">
    <title>Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model &#38; Levenshtein Distance</title>
    <author><first>Soumil</first><last>Mandal</last></author>
    <author><first>Karthick</first><last>Nanmaran</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>49&#8211;53</pages>
    <url>http://www.aclweb.org/anthology/W18-6107</url>
    <abstract>Building tools for code-mixed data is rapidly gaining popularity in the NLP research community, as such data is rising exponentially on social media. Working with code-mixed data presents several challenges, especially due to grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media scenarios. In this article, we present a novel architecture focusing on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that in addition to normalizing, it can also be utilized for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mandal-nanmaran:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6108">
    <title>Robust Word Vectors: Context-Informed Embeddings for Noisy Texts</title>
    <author><first>Valentin</first><last>Malykh</last></author>
    <author><first>Varvara</first><last>Logacheva</last></author>
    <author><first>Taras</first><last>Khakhulin</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>54&#8211;63</pages>
    <url>http://www.aclweb.org/anthology/W18-6108</url>
    <abstract>We suggest a new language-independent architecture of robust word vectors (RoVe). It is designed to alleviate the issue of typos, which are common in almost any user-generated content, and hinder automatic text processing. Our model is morphologically motivated, which allows it to deal with unseen word forms in morphologically rich languages. </abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>malykh-logacheva-khakhulin:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6109">
    <title>Paraphrase Detection on Noisy Subtitles in Six Languages</title>
    <author><first>Eetu</first><last>Sjöblom</last></author>
    <author><first>Mathias</first><last>Creutz</last></author>
    <author><first>Mikko</first><last>Aulamo</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>64&#8211;73</pages>
    <url>http://www.aclweb.org/anthology/W18-6109</url>
    <abstract>We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, without reaching the same level of performance, because of a domain mismatch between training and test data.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sjblom-creutz-aulamo:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6110">
    <title>Distantly Supervised Attribute Detection from Reviews</title>
    <author><first>Lisheng</first><last>Fu</last></author>
    <author><first>Pablo</first><last>Barrio</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>74&#8211;78</pages>
    <url>http://www.aclweb.org/anthology/W18-6110</url>
    <abstract>This paper aims to detect specific attributes of a place (e.g., if it has a romantic atmosphere, or if it offers outdoor seating) from its user reviews via distant supervision: without direct annotation of review text, we use the crowdsourced attribute labels of a place as labels of the review text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>fu-barrio:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6111">
    <title>Using Wikipedia Edits in Low Resource Grammatical Error Correction</title>
    <author><first>Adriane</first><last>Boyd</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>79&#8211;84</pages>
    <url>http://www.aclweb.org/anthology/W18-6111</url>
    <abstract>We develop a grammatical error correction (GEC) system for German using a small gold GEC corpus augmented with edits extracted from Wikipedia revision history. We extend the automatic error annotation tool ERRANT (Bryant et al., 2017) for German and use it to analyze both gold GEC corrections and Wikipedia edits (Grundkiewicz and Junczys-Dowmunt, 2014) in order to select as additional training data Wikipedia edits containing grammatical corrections similar to those in the gold corpus. Using a multilayer convolutional encoder-decoder neural network GEC approach (Chollampatt and Ng, 2018), we evaluate the contribution of Wikipedia edits and find that carefully selected Wikipedia edits increase performance by over 5%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>boyd:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6112">
    <title>Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts</title>
    <author><first>Kemal</first><last>Kurniawan</last></author>
    <author><first>Samuel</first><last>Louvan</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>85&#8211;92</pages>
    <url>http://www.aclweb.org/anthology/W18-6112</url>
    <abstract>Despite the long history of the named-entity recognition (NER) task in the natural language processing community, previous work has rarely studied the task on conversational texts. Such texts are challenging because they contain many word variations, which increase the number of out-of-vocabulary (OOV) words. The high number of OOV words poses a difficulty for word-based neural models. Meanwhile, there is plenty of evidence for the effectiveness of character-based neural models in mitigating this OOV problem. We report an empirical evaluation of neural sequence labeling models with character embeddings to tackle the NER task in Indonesian conversational texts. Our experiments show that (1) character models outperform word embedding-only models by up to 4 F1 points, (2) character models perform better in OOV cases with an improvement of as high as 15 F1 points, and (3) character models are robust against a very high OOV rate.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kurniawan-louvan:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6113">
    <title>Orthogonal Matching Pursuit for Text Classification</title>
    <author><first>Konstantinos</first><last>Skianis</last></author>
    <author><first>Nikolaos</first><last>Tziortziotis</last></author>
    <author><first>Michalis</first><last>Vazirgiannis</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>93&#8211;103</pages>
    <url>http://www.aclweb.org/anthology/W18-6113</url>
    <abstract>In text classification, the problem of overfitting arises due to the high dimensionality, making regularization essential.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>skianis-tziortziotis-vazirgiannis:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6114">
    <title>Training and Prediction Data Discrepancies: Challenges of Text Classification with Noisy, Historical Data</title>
    <author><first>R. Andrew</first><last>Kreek</last></author>
    <author><first>Emilia</first><last>Apostolova</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>104&#8211;109</pages>
    <url>http://www.aclweb.org/anthology/W18-6114</url>
    <abstract>Industry datasets used for text classification are rarely created for that purpose. In most cases, the data and target predictions are a by-product of accumulated historical data, typically fraught with noise, present both in the text-based documents and in the targeted labels. In this work, we address the question of how well performance metrics computed on noisy, historical data reflect the performance on the intended future machine learning model input. The results demonstrate the utility of dirty training datasets used to build prediction models for cleaner (and different) prediction inputs.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kreek-apostolova:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6115">
    <title>Detecting Code-Switching between Turkish-English Language Pair</title>
    <author><first>Zeynep</first><last>Yirmibeşoğlu</last></author>
    <author><first>G&#252;lşen</first><last>Eryiğit</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>110&#8211;115</pages>
    <url>http://www.aclweb.org/anthology/W18-6115</url>
    <abstract>Code-switching (the alternating use of different languages within a single conversational context) is an increasingly common phenomenon in social media and colloquial usage, and it poses various challenges for natural language processing. This paper introduces the first study of the detection of Turkish-English code-switching, along with a small test set collected from social media, in order to smooth the way for further studies.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yirmibeolu-eryiit:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6116">
    <title>Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture</title>
    <author><first>Soumil</first><last>Mandal</last></author>
    <author><first>Anil Kumar</first><last>Singh</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>116&#8211;120</pages>
    <url>http://www.aclweb.org/anthology/W18-6116</url>
    <abstract>An accurate language identification tool is an absolute necessity for building complex NLP systems to be used on code-mixed data. A lot of work has been done recently on this task, but there is still room for improvement. Inspired by recent advancements in neural network architectures for computer vision tasks, we have implemented multichannel neural networks combining CNN and LSTM for word-level language identification of code-mixed data. Combining this with a Bi-LSTM-CRF context capture module, accuracies of 93.28% and 93.32% are achieved on our two test sets.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mandal-singh:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6117">
    <title>Modeling Student Response Times: Towards Efficient One-on-one Tutoring Dialogues</title>
    <author><first>Luciana</first><last>Benotti</last></author>
    <author><first>Jayadev</first><last>Bhaskaran</last></author>
    <author><first>Sigtryggur</first><last>Kjartansson</last></author>
    <author><first>David</first><last>Lang</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>121&#8211;131</pages>
    <url>http://www.aclweb.org/anthology/W18-6117</url>
    <abstract>In this paper we investigate the task of modeling how long it would take a student to respond to a tutor question during a tutoring dialogue.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>benotti-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6118">
    <title>Content Extraction and Lexical Analysis from Customer-Agent Interactions</title>
    <author><first>Sergiu</first><last>Nisioi</last></author>
    <author><first>Anca</first><last>Bucur</last></author>
    <author><first>Liviu P.</first><last>Dinu</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>132&#8211;136</pages>
    <url>http://www.aclweb.org/anthology/W18-6118</url>
    <abstract>In this paper, we provide a lexical comparative analysis of the vocabulary used by customers and agents in an </abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nisioi-bucur-dinu:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6119">
    <title>Preferred Answer Selection in Stack Overflow: Better Text Representations ... and Metadata, Metadata, Metadata</title>
    <author><first>Steven</first><last>Xu</last></author>
    <author><first>Andrew</first><last>Bennett</last></author>
    <author><first>Doris</first><last>Hoogeveen</last></author>
    <author><first>Jey Han</first><last>Lau</last></author>
    <author><first>Timothy</first><last>Baldwin</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>137&#8211;147</pages>
    <url>http://www.aclweb.org/anthology/W18-6119</url>
    <abstract>Community question answering (cQA) forums provide a rich source of</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>xu-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6120">
    <title>Word-like character n-gram embedding</title>
    <author><first>Geewook</first><last>Kim</last></author>
    <author><first>Kazuki</first><last>Fukui</last></author>
    <author><first>Hidetoshi</first><last>Shimodaira</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>148&#8211;152</pages>
    <url>http://www.aclweb.org/anthology/W18-6120</url>
    <abstract>We propose a new word embedding method called "word-like character n-gram embedding", which learns distributed representations of words by embedding word-like character n-grams.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kim-fukui-shimodaira:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6121">
    <title>Classification of Tweets about Reported Events using Neural Networks</title>
    <author><first>Kiminobu</first><last>Makino</last></author>
    <author><first>Yuka</first><last>Takei</last></author>
    <author><first>Taro</first><last>Miyazaki</last></author>
    <author><first>Jun</first><last>Goto</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>153&#8211;163</pages>
    <url>http://www.aclweb.org/anthology/W18-6121</url>
    <abstract>We developed a system that automatically extracts "Event-describing Tweets" which include incidents or accidents information for creating news reports. Event-describing Tweets can be classified into "Reported-event Tweets" and "New-information Tweets." Reported-event Tweets cite news agencies or user generated content sites, and New-information Tweets are other Event-describing Tweets. A system is needed to classify them so that creators of factual TV programs can use them in their productions. Proposing this Tweet classification task is one of the contributions of this paper, because no prior papers have used the same task even though program creators and other events information collectors have to do it to extract required information from social networking sites. To classify Tweets in this task, this paper proposes a method to input and concatenate character and word sequences in Japanese Tweets by using convolutional neural networks. This proposed method is another contribution of this paper. For comparison, character or word input methods and other neural networks are also used. Results show that a system using the proposed method and architectures can classify Tweets with an F1 score of 88%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>makino-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6122">
    <title>Learning to Define Terms in the Software Domain</title>
    <author><first>Vidhisha</first><last>Balachandran</last></author>
    <author><first>Dheeraj</first><last>Rajagopal</last></author>
    <author><first>Rose Catherine</first><last>Kanjirathinkal</last></author>
    <author><first>William</first><last>Cohen</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>164&#8211;172</pages>
    <url>http://www.aclweb.org/anthology/W18-6122</url>
    <abstract>One way to test a person's knowledge of a domain is to ask them to define domain-specific terms. Here, we investigate the task of automatically generating definitions of technical terms by reading text from the technical domain. Specifically, we learn definitions of software entities from a large corpus built from the user forum Stack Overflow. To model definitions, we train a language model and incorporate additional domain-specific information like word-word co-occurrence, and ontological category information. Our approach improves previous baselines by 2 BLEU points for the definition generation task. Our experiments also show the additional challenges associated with the task and the short-comings of language-model based architectures for definition generation.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>balachandran-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6123">
    <title>FrameIt: Ontology Discovery for Noisy User-Generated Text</title>
    <author><first>Dan</first><last>Iter</last></author>
    <author><first>Alon</first><last>Halevy</last></author>
    <author><first>Wang-Chiew</first><last>Tan</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>173&#8211;183</pages>
    <url>http://www.aclweb.org/anthology/W18-6123</url>
    <abstract>A common need of NLP applications is to extract structured data from text corpora in order to perform analytics or trigger an appropriate action. The ontology defining the structure is typically application dependent and in many cases it is not known a priori. We describe the FrameIt System that provides a workflow for (1) quickly discovering an ontology to model a text corpus and (2) learning an SRL model that extracts the instances of the ontology from sentences in the corpus. FrameIt exploits data that is obtained in the ontology discovery phase as weak supervision data to bootstrap the SRL model and then enables the user to refine the model with active learning. We present empirical results and qualitative analysis of the performance of FrameIt on three corpora of noisy user-generated text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>iter-halevy-tan:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6124">
    <title>Using Author Embeddings to Improve Tweet Stance Classification</title>
    <author><first>Adrian</first><last>Benton</last></author>
    <author><first>Mark</first><last>Dredze</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>184&#8211;194</pages>
    <url>http://www.aclweb.org/anthology/W18-6124</url>
    <abstract>Many social media classification tasks analyze the content of a message, but do not consider</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>benton-dredze:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6125">
    <title>Low-resource named entity recognition via multi-source projection: Not quite there yet?</title>
    <author><first>Jan Vium</first><last>Enghoff</last></author>
    <author><first>Søren</first><last>Harrison</last></author>
    <author><first>Željko</first><last>Agić</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>195&#8211;201</pages>
    <url>http://www.aclweb.org/anthology/W18-6125</url>
    <abstract>Projecting linguistic annotations through word alignments is one of the most prevalent approaches to cross-lingual transfer learning. Conventional wisdom suggests that annotation projection &#x201c;just works&#x201d; regardless of the task at hand. We carefully consider multi-source projection for named entity recognition. Our experiment with 17 languages shows that to detect named entities in true low-resource languages, annotation projection may not be the right way to move forward. On a more positive note, we also uncover the conditions that do favor named entity projection from multiple sources. We argue these are infeasible under noisy low-resource constraints.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>enghoff-harrison-agi:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6126">
    <title>A Case Study on Learning a Unified Encoder of Relations</title>
    <author><first>Lisheng</first><last>Fu</last></author>
    <author><first>Bonan</first><last>Min</last></author>
    <author><first>Thien Huu</first><last>Nguyen</last></author>
    <author><first>Ralph</first><last>Grishman</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>202&#8211;207</pages>
    <url>http://www.aclweb.org/anthology/W18-6126</url>
    <abstract>Typical relation extraction models are trained on a single corpus annotated with a pre-defined relation schema.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>fu-EtAl:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6127">
    <title>Convolutions Are All You Need (For Classifying Character Sequences)</title>
    <author><first>Zach</first><last>Wood-Doughty</last></author>
    <author><first>Nicholas</first><last>Andrews</last></author>
    <author><first>Mark</first><last>Dredze</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>208&#8211;213</pages>
    <url>http://www.aclweb.org/anthology/W18-6127</url>
    <abstract>While recurrent neural networks (RNNs)</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>wooddoughty-andrews-dredze:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6128">
    <title>Step or Not: Discriminator for The Real Instructions in User-generated Recipes</title>
    <author><first>Shintaro</first><last>Inuzuka</last></author>
    <author><first>Takahiko</first><last>Ito</last></author>
    <author><first>Jun</first><last>Harashima</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>214</pages>
    <url>http://www.aclweb.org/anthology/W18-6128</url>
    <abstract>In a recipe sharing service, users publish recipe instructions in the form of a series of steps. However, some of the "steps" are not actually part of the cooking process. Specifically, advertisements of the recipes themselves (e.g., "introduced on TV") and comments (e.g., "Thanks for many messages") may be included in the step section of the recipe, as if it were the recipe author's communication tool. Such fake steps can cause problems when indexing recipes for search or when steps are read aloud by devices such as smart speakers. As presented in this talk, we have constructed a discriminator that distinguishes between such fake steps and the steps actually used for cooking. This project includes, but is not limited to, the creation of annotation data by classifying and analyzing recipe steps and the construction of identification models. Our models use only text information to identify the steps. In our tests, machine learning models achieved higher accuracy than rule-based methods that use manually chosen words.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>inuzuka-ito-harashima:2018:W-NUT2018</bibkey>
  </paper>

  <paper id="6129">
    <title>Combining Human and Machine Transcriptions on the Zooniverse Platform</title>
    <author><first>Daniel</first><last>Hanson</last></author>
    <author><first>Andrea</first><last>Simenstad</last></author>
    <booktitle>Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>215&#8211;216</pages>
    <url>http://www.aclweb.org/anthology/W18-6129</url>
    <abstract>This is a 1-page abstract on a work-in-progress for the Workshop on Noisy User-generated Text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hanson-simenstad:2018:W-NUT2018</bibkey>
  </paper>

</volume>