Other Workshops and Events (2017)


Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
Antti Arppe | Jeff Good | Mans Hulden | Jordan Lachler | Alexis Palmer | Lane Schwartz

A Morphological Parser for Odawa
Dustin Bowers | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud

Creating lexical resources for polysynthetic languages—the case of Arapaho
Ghazaleh Kazeminejad | Andrew Cowell | Mans Hulden

From Small to Big Data: paper manuscripts to RDF triples of Australian Indigenous Vocabularies
Nick Thieberger | Conal Tuohy

Issues in digital text representation, on-line dissemination, sharing and re-use for African minority languages
Emmanuel Ngué Um

Developing collection management tools to create more robust and reliable linguistic data
Gary Holton | Kavon Hooshiar | Nick Thieberger

STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
Gina-Anne Levow | Emily M. Bender | Patrick Littell | Kristen Howell | Shobhana Chelliah | Joshua Crowgey | Dan Garrette | Jeff Good | Sharon Hargus | David Inman | Michael Maxwell | Michael Tjalve | Fei Xia

Work With What You’ve Got
Lucy Bell | Lawrence Bell

Converting a comprehensive lexical database into a computational model: The case of East Cree verb inflection
Antti Arppe | Marie-Odile Junker | Delasie Torkornoo

Instant annotations in ELAN corpora of spoken and written Komi, an endangered language of the Barents Sea region
Ciprian Gerstenberger | Niko Partanen | Michael Rießler

Inferring Case Systems from IGT: Enriching the Enrichment
Kristen Howell | Emily M. Bender | Michel Lockwood | Fei Xia | Olga Zamaraeva

Case Studies in the Automatic Characterization of Grammars from Small Wordlists
Jordan Kodner | Spencer Caplan | Hongzhi Xu | Mitchell P. Marcus | Charles Yang

Endangered Data for Endangered Languages: Digitizing Print dictionaries
Michael Maxwell | Aric Bills

A computationally-assisted procedure for discovering poetic organization within oral tradition
David Meyer

Improving Coverage of an Inuktitut Morphological Analyzer Using a Segmental Recurrent Neural Network
Jeffrey Micher

Click reduction in fluent speech: a semi-automated analysis of Mangetti Dune !Xung
Amanda Miller | Micha Elsner

DECCA Repurposed: Detecting transcription inconsistencies without an orthographic standard
C. Anton Rytting | Julie Yelle

Jejueo talking dictionary: A collaborative online database for language revitalization
Moira Saltzman

Computational Support for Finding Word Classes: A Case Study of Abui
Olga Zamaraeva | František Kratochvíl | Emily M. Bender | Fei Xia | Kristen Howell

Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages
Patrick Littell | Aidan Pine | Henry Davis

Connecting Documentation and Revitalization: A New Approach to Language Apps
Alexa N. Little

Developing a Suite of Mobile Applications for Collaborative Language Documentation
Mat Bettinson | Steven Bird

Cross-language forced alignment to assist community-based linguistics for low resource languages
Timothy Kempton

A case study on using speech-to-translation alignments for language documentation
Antonios Anastasopoulos | David Chiang


Proceedings of the 21st Nordic Conference on Computational Linguistics
Jörg Tiedemann | Nina Tahmasebi

Joint UD Parsing of Norwegian Bokmål and Nynorsk
Erik Velldal | Lilja Øvrelid | Petter Hohle

Replacing OOV Words For Dependency Parsing With Distributional Semantics
Prasanth Kolachina | Martin Riedl | Chris Biemann

Real-valued Syntactic Word Vectors (RSV) for Greedy Neural Dependency Parsing
Ali Basirat | Joakim Nivre

Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger
Kimmo Kettunen | Laura Löfberg

Machine Learning for Rhetorical Figure Detection: More Chiasmus with Less Annotation
Marie Dubremetz | Joakim Nivre

Coreference Resolution for Swedish and German using Distant Supervision
Alexander Wallin | Pierre Nugues

Aligning phonemes using finite-state methods
Kimmo Koskenniemi

Acoustic Model Compression with MAP adaptation
Katri Leino | Mikko Kurimo

OCR and post-correction of historical Finnish texts
Senka Drobac | Pekka Kauppinen | Krister Lindén

Twitter Topic Modeling by Tweet Aggregation
Asbjørn Steinskog | Jonas Therkelsen | Björn Gambäck

A Multilingual Entity Linker Using PageRank and Semantic Graphs
Anton Södergren | Pierre Nugues

Linear Ensembles of Word Embedding Models
Avo Muromägi | Kairit Sirts | Sven Laur

Using Pseudowords for Algorithm Comparison: An Evaluation Framework for Graph-based Word Sense Induction
Flavio Massimiliano Cecchini | Chris Biemann | Martin Riedl

North-Sámi to Finnish rule-based machine translation system
Tommi Pirinen | Francis M. Tyers | Trond Trosterud | Ryan Johnson | Kevin Unhammer | Tiina Puolakainen

Machine translation with North Saami as a pivot language
Lene Antonsen | Ciprian Gerstenberger | Maja Kappfjell | Sandra Nystø Rahka | Marja-Liisa Olthuis | Trond Trosterud | Francis M. Tyers

SWEGRAM – A Web-Based Tool for Automatic Annotation and Analysis of Swedish Texts
Jesper Näsman | Beáta Megyesi | Anne Palmér

Optimizing a PoS Tagset for Norwegian Dependency Parsing
Petter Hohle | Lilja Øvrelid | Erik Velldal

Creating register sub-corpora for the Finnish Internet Parsebank
Veronika Laippala | Juhani Luotolahti | Aki-Juhani Kyröläinen | Tapio Salakoski | Filip Ginter

KILLE: a Framework for Situated Agents for Learning Language Through Interaction
Simon Dobnik | Erik de Graaf

Data Collection from Persons with Mild Forms of Cognitive Impairment and Healthy Controls - Infrastructure for Classification and Prediction of Dementia
Dimitrios Kokkinakis | Kristina Lundholm Fors | Eva Björkner | Arto Nordlund

Evaluation of language identification methods using 285 languages
Tommi Jauhiainen | Krister Lindén | Heidi Jauhiainen

Can We Create a Tool for General Domain Event Analysis?
Siim Orasmaa | Heiki-Jaan Kaalep

From Treebank to Propbank: A Semantic-Role and VerbNet Corpus for Danish
Eckhard Bick

Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations
Johannes Bjerva | Robert Östling

Will my auxiliary tagging task help? Estimating Auxiliary Tasks Effectivity in Multi-Task Learning
Johannes Bjerva

Iconic Locations in Swedish Sign Language: Mapping Form to Meaning with Lexical Databases
Carl Börstell | Robert Östling

Docforia: A Multilayer Document Model
Marcus Klang | Pierre Nugues

Finnish resources for evaluating language model semantics
Viljami Venekoski | Jouko Vankka

Málrómur: A Manually Verified Corpus of Recorded Icelandic Speech
Steinþór Steingrímsson | Jón Guðnason | Sigrún Helgadóttir | Eiríkur Rögnvaldsson

The Effect of Translationese on Tuning for Statistical Machine Translation
Sara Stymne

Multilingwis² – Explore Your Parallel Corpus
Johannes Graën | Dominique Sandoz | Martin Volk

A modernised version of the Glossa corpus search system
Anders Nøklestad | Kristin Hagen | Janne Bondi Johannessen | Michał Kosek | Joel Priestley

Dep_search: Efficient Search Tool for Large Dependency Parsebanks
Juhani Luotolahti | Jenna Kanerva | Filip Ginter

Proto-Indo-European Lexicon: The Generative Etymological Dictionary of Indo-European Languages
Jouna Pyysalo

Tilde MODEL - Multilingual Open Data for EU Languages
Roberts Rozis | Raivis Skadiņš

Mainstreaming August Strindberg with Text Normalization
Adam Ek | Sofia Knuutinen

Word vectors, reuse, and replicability: Towards a community repository of large-text resources
Murhaf Fares | Andrey Kutuzov | Stephan Oepen | Erik Velldal

Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing
Mika Koistinen | Kimmo Kettunen | Tuula Pääkkönen

Redefining Context Windows for Word Embedding Models: An Experimental Study
Pierre Lison | Andrey Kutuzov

The Effect of Excluding Out of Domain Training Data from Supervised Named-Entity Recognition
Adam Persson

Quote Extraction and Attribution from Norwegian Newspapers
Andrew Salway | Paul Meurer | Knut Hofland | Øystein Reigem

Wordnet extension via word embeddings: Experiments on the Norwegian Wordnet
Heidi Sand | Erik Velldal | Lilja Øvrelid

Universal Dependencies for Swedish Sign Language
Robert Östling | Carl Börstell | Moa Gärdenfors | Mats Wirén

Services for text simplification and analysis
Johan Falkenjack | Evelina Rennes | Daniel Fahlborg | Vida Johansson | Arne Jönsson

Exploring Properties of Intralingual and Interlingual Association Measures Visually
Johannes Graën | Christof Bless

TALERUM - Learning Danish by Doing Danish
Peter Juel Henrichsen

Cross-Lingual Syntax: Relating Grammatical Framework with Universal Dependencies
Aarne Ranta | Prasanth Kolachina | Thomas Hallgren

Exploring Treebanks with INESS Search
Victoria Rosén | Helge Dyvik | Paul Meurer | Koenraad De Smedt

A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
Aleksi Vesanto | Filip Ginter | Hannu Salmi | Asko Nivala | Tapio Salakoski



Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)
Marie-Catherine de Marneffe | Joakim Nivre | Sebastian Schuster

Cross-Lingual Parser Selection for Low-Resource Languages
Željko Agić

Swedish Prepositions are not Pure Function Words
Lars Ahrenberg

Increasing Return on Annotation Investment: The Automatic Construction of a Universal Dependency Treebank for Dutch
Gosse Bouma | Gertjan van Noord

Converting the TüBa-D/Z Treebank of German to Universal Dependencies
Çağrı Çöltekin | Ben Campbell | Erhard Hinrichs | Heike Telljohann

Universal Dependencies for Afrikaans
Peter Dirix | Liesbeth Augustinus | Daniel van Niekerk | Frank Van Eynde

Elliptic Constructions: Spotting Patterns in UD Treebanks
Kira Droganova | Daniel Zeman

Dependency Tree Transformation with Tree Transducers
Felix Hennig | Arne Köhn

Towards Universal Dependencies for Learner Chinese
John Lee | Herman Leung | Keying Li

Does Syntactic Informativity Predict Word Length? A Cross-Linguistic Study Based on the Universal Dependencies Corpora
Natalia Levshina

Estonian Copular and Existential Constructions as an UD Annotation Problem
Kadri Muischnek | Kaili Müürisep

Universal Dependency Evaluation
Joakim Nivre | Chiao-Ting Fang

Udapi: Universal API for Universal Dependencies
Martin Popel | Zdeněk Žabokrtský | Martin Vojtek

Universal Dependencies for Greek
Prokopis Prokopidis | Haris Papageorgiou

From Universal Dependencies to Abstract Syntax
Aarne Ranta | Prasanth Kolachina

Empirically Sampling Universal Dependencies
Natalie Schluter | Željko Agić

Gapping Constructions in Universal Dependencies v2
Sebastian Schuster | Matthew Lamm | Christopher D. Manning

Toward Universal Dependencies for Ainu
Hajime Senuma | Akiko Aizawa

Automatic Morpheme Segmentation and Labeling in Universal Dependencies Resources
Miikka Silfverberg | Mans Hulden

A Systematic Comparison of Syntactic Representations of Dependency Parsing
Guillaume Wisniewski | Ophélie Lacroix


Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
Gerlof Bouma | Yvonne Adesam

Variance in Historical Data: How bad is it and how can we profit from it for historical linguistics?
Stefanie Dipper

Improving POS Tagging in Old Spanish Using TEITOK
Maarten Janssen | Josep Ausensi | Josep Fontana

The Making of the Royal Society Corpus
Jörg Knappen | Stefan Fischer | Hannah Kermes | Elke Teich | Peter Fankhauser

Normalizing Medieval German Texts: from rules to deep learning
Natalia Korchagina

Ambiguity in Semantically Related Word Substitutions: an investigation in historical Bible translations
Maria Moritz | Marco Büchler

The Lemlat 3.0 Package for Morphological Analysis of Latin
Marco Passarotti | Marco Budassi | Eleonora Litta | Paolo Ruffolo

HistoBankVis: Detecting Language Change via Data Visualization
Christin Schätzle | Michael Hund | Frederik Dennig | Miriam Butt | Daniel Keim

Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts
Gerold Schneider | Eva Pettersson | Michael Percillier

Data-driven Morphology and Sociolinguistics for Early Modern Dutch
Marijn Schraagen | Marjo van Koppen | Feike Dietz

Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910
Aleksi Vesanto | Asko Nivala | Heli Rantala | Tapio Salakoski | Hannu Salmi | Filip Ginter



Proceedings of the 7th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2017)
Ted Gibson | Tal Linzen | Asad Sayeed | Martin van Schijndel | William Schuler

Entropy Reduction correlates with temporal lobe activity
Matthew Nelson | Stanislas Dehaene | Christophe Pallier | John Hale

Using the Entropy Reduction incremental complexity metric, we relate high gamma power signals from the brains of epileptic patients to incremental stages of syntactic analysis in English and French. We find that signals recorded intracranially from the anterior Inferior Temporal Sulcus (aITS) and the posterior Inferior Temporal Gyrus (pITG) correlate with word-by-word Entropy Reduction values derived from phrase structure grammars for those languages. In the anterior region, this correlation persists even in combination with surprisal co-predictors from PCFG and ngram models. The result confirms the idea that the brain’s temporal lobe houses a parsing function, one whose incremental processing difficulty profile reflects changes in grammatical uncertainty.

Learning an Input Filter for Argument Structure Acquisition
Laurel Perkins | Naomi Feldman | Jeffrey Lidz

How do children learn a verb’s argument structure when their input contains non-basic clauses that obscure verb transitivity? Here we present a new model that infers verb transitivity by learning to filter out non-basic clauses that were likely parsed in error. In simulations with child-directed speech, we show that this model accurately categorizes the majority of 50 frequent transitive, intransitive and alternating verbs, and jointly learns appropriate parameters for filtering parsing errors. Our model is thus able to filter out problematic data for verb learning without knowing in advance which data need to be filtered.

Grounding sound change in ideal observer models of perception
Zachary Burchill | T. Florian Jaeger

An important predictor of historical sound change, functional load, fails to capture insights from speech perception. Building on ideal observer models of word recognition, we devise a new definition of functional load that incorporates both a priori predictability and perceptual information. We explore this new measure with a simple model and find that it outperforms traditional measures.

“Oh, I’ve Heard That Before”: Modelling Own-Dialect Bias After Perceptual Learning by Weighting Training Data
Rachael Tatman

Human listeners are able to quickly and robustly adapt to new accents and do so by using information about speakers’ identities. This paper presents experimental evidence that, even considering information about speakers’ identities, listeners retain a strong bias towards the acoustics of their own dialect after dialect learning. Participants’ behaviour was accurately mimicked by a classifier which was trained on more cases from the base dialect and fewer from the target dialect. This suggests that imbalanced training data may result in automatic speech recognition errors consistent with those of speakers from populations over-represented in the training data.

Inherent Biases of Recurrent Neural Networks for Phonological Assimilation and Dissimilation
Amanda Doucette

A recurrent neural network model of phonological pattern learning is proposed. The model is a relatively simple neural network with one recurrent layer, and displays biases in learning that mimic observed biases in human learning. Single-feature patterns are learned faster than two-feature patterns, and vowel or consonant-only patterns are learned faster than patterns involving vowels and consonants, mimicking the results of laboratory learning experiments. In non-recurrent models, capturing these biases requires the use of alpha features or some other representation of repeated features, but with a recurrent neural network, these elaborations are not necessary.

Predicting Japanese scrambling in the wild
Naho Orita

Japanese speakers have a choice between canonical SOV and scrambled OSV word order to express the same meaning. Although previous experiments examine the influence of one or two factors for scrambling in a controlled setting, it is not yet known what kinds of multiple effects contribute to scrambling. This study uses naturally distributed data to test the multiple effects on scrambling simultaneously. A regression analysis replicates the NP length effect and suggests the influence of noun types, but it provides no evidence for syntactic priming, given-new ordering, and the animacy effect. These findings show evidence only for sentence-internal factors; we find no evidence that discourse-level factors play a role.

Proceedings of the 11th Linguistic Annotation Workshop
Nathan Schneider | Nianwen Xue

Readers vs. Writers vs. Texts: Coping with Different Perspectives of Text Understanding in Emotion Annotation
Sven Buechel | Udo Hahn

We here examine how different perspectives of understanding written discourse, like the reader’s, the writer’s or the text’s point of view, affect the quality of emotion annotations. We conducted a series of annotation experiments on two corpora, a popular movie review corpus and a genre- and domain-balanced corpus of standard English. We found statistical evidence that the writer’s perspective yields superior annotation quality overall. However, the quality one perspective yields compared to the other(s) seems to depend on the domain the utterance originates from. Our data further suggest that the popular movie review data set suffers from an atypical bimodal distribution which may decrease model performance when used as a training resource.

Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus
Courtney Napoles | Joel Tetreault | Aasish Pappu | Enrica Rosato | Brian Provenzale

This work presents a dataset and annotation scheme for the new task of identifying “good” conversations that occur online, which we call ERICs: Engaging, Respectful, and/or Informative Conversations. We develop a taxonomy to reflect features of entire threads and individual comments which we believe contribute to identifying ERICs; code a novel dataset of Yahoo News comment threads (2.4k threads and 10k comments) and 1k threads from the Internet Argument Corpus; and analyze the features characteristic of ERICs. This is one of the largest annotated corpora of online human dialogues, with the most detailed set of annotations. It will be valuable for identifying ERICs and other aspects of argumentation, dialogue, and discourse.

Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task
Merel Scholman | Vera Demberg

Traditional discourse annotation tasks are considered costly and time-consuming, and the reliability and validity of these tasks are in question. In this paper, we investigate whether crowdsourcing can be used to obtain reliable discourse relation annotations. We also examine the influence of context on the reliability of the data. The results of a crowdsourced connective insertion task showed that the method can be used to obtain reliable annotations: the majority of the inserted connectives converged with the original label. Further, the method is sensitive to the fact that multiple senses can often be inferred for a single relation. Regarding the presence of context, the results show no significant difference in distributions of insertions between conditions overall. However, a by-item comparison revealed several characteristics of segments that determine whether the presence of context makes a difference in annotations. The findings discussed in this paper can be taken as evidence that crowdsourcing can be used as a valuable method to obtain insights into the sense(s) of relations.

A Code-Switching Corpus of Turkish-German Conversations
Özlem Çetinoğlu

We present a code-switching corpus of Turkish-German that is collected by recording conversations of bilinguals. The recordings are then transcribed in two layers following speech and orthography conventions, and annotated with sentence boundaries and intersentential, intrasentential, and intra-word switch points. The total amount of data is 5 hours of speech which corresponds to 3614 sentences. The corpus aims at serving as a resource for speech or text analysis, as well as a collection for linguistic inquiries.

Annotating omission in statement pairs
Héctor Martínez Alonso | Amaury Delamaire | Benoît Sagot

We focus on the identification of omission in statement pairs. We compare three annotation schemes, namely two different crowdsourcing schemes and manual expert annotation. We show that the simplest of the two crowdsourcing approaches yields a better annotation quality than the more complex one. We use a dedicated classifier to assess whether the annotators’ behavior can be explained by straightforward linguistic features. The classifier benefits from a modeling that uses lexical information beyond length and overlap measures. However, for our task, we argue that expert and not crowdsourcing-based annotation is the best compromise between annotation cost and quality.

Annotating Speech, Attitude and Perception Reports
Corien Bary | Leopold Hess | Kees Thijs | Peter Berck | Iris Hendrickx

We present REPORTS, an annotation scheme for the annotation of speech, attitude and perception reports. Such a scheme makes it possible to annotate the various text elements involved in such reports (e.g. embedding entity, complement, complement head) and their relations in a uniform way, which in turn facilitates the automatic extraction of information on, for example, complementation and vocabulary distribution. We also present the Ancient Greek corpus RAG (Thucydides’ History of the Peloponnesian War), to which we have applied this scheme using the annotation tool BRAT. We discuss some of the issues, both theoretical and practical, that we encountered, show how the corpus helps in answering specific questions, and conclude that REPORTS fitted in well with our needs.

Consistent Classification of Translation Revisions: A Case Study of English-Japanese Student Translations
Atsushi Fujita | Kikuko Tanabe | Chiho Toyoshima | Mayuka Yamamoto | Kyo Kageura | Anthony Hartley

Consistency is a crucial requirement in text annotation. It is especially important in educational applications, as lack of consistency directly affects learners’ motivation and learning performance. This paper presents a quality assessment scheme for English-to-Japanese translations produced by learner translators at university. We constructed a revision typology and a decision tree manually through an application of the OntoNotes method, i.e., an iteration of assessing learners’ translations and hypothesizing the conditions for consistent decision making, as well as re-organizing the typology. Intrinsic evaluation of the created scheme confirmed its potential contribution to the consistent classification of identified erroneous text spans, achieving visibly higher Cohen’s kappa values, up to 0.831, than previous work. This paper also describes an application of our scheme to an English-to-Japanese translation exercise course for undergraduate students at a university in Japan.

Representation and Interchange of Linguistic Annotation. An In-Depth, Side-by-Side Comparison of Three Designs
Richard Eckart de Castilho | Nancy Ide | Emanuele Lapponi | Stephan Oepen | Keith Suderman | Erik Velldal | Marc Verhagen

For decades, most self-respecting linguistic engineering initiatives have designed and implemented custom representations for various layers of, for example, morphological, syntactic, and semantic analysis. Despite occasional efforts at harmonization or even standardization, our field today is blessed with a multitude of ways of encoding and exchanging linguistic annotations of these types, at the levels of ‘abstract syntax’, naming choices, and of course file formats. To a large degree, it is possible to work within and across design plurality by conversion, and often there may be good reasons for divergent design reflecting differences in use. However, it is likely that some abstract commonalities across choices of representation are obscured by more superficial differences, and conversely there is no obvious procedure to tease apart what actually constitute contentful vs. mere technical divergences. In this study, we seek to conceptually align three representations for common types of morpho-syntactic analysis, pinpoint what in our view constitute contentful differences, and reflect on the underlying principles and specific requirements that led to individual choices. We expect that a more in-depth understanding of these choices across designs may lead to increased harmonization, or at least to more informed design of future representations.

TDB 1.1: Extensions on Turkish Discourse Bank
Deniz Zeyrek | Murathan Kurfalı

This paper presents recent developments in the Turkish Discourse Bank (TDB). First, the resource is summarized and an evaluation is presented. Then, TDB 1.1 is described, i.e., enrichments on 10% of the corpus (namely, senses for explicit discourse connectives, and new annotations for three discourse relation types: implicit relations, entity relations and alternative lexicalizations). The method of annotation is explained and the data are evaluated.

Two Layers of Annotation for Representing Event Mentions in News Stories
Maria Pia di Buono | Martin Tutek | Jan Šnajder | Goran Glavaš | Bojana Dalbelo Bašić | Nataša Milić-Frayling

In this paper, we describe our preliminary study on annotating event mentions as a part of our research on high-precision news event extraction models. To this end, we propose a two-layer annotation scheme, designed to separately capture the functional and conceptual aspects of event mentions. We hypothesize that the precision of models can be improved by modeling and extracting separately the different aspects of news events, and then combining the extracted information by leveraging the complementarities of the models. In addition, we carry out a preliminary annotation using the proposed scheme and analyze the annotation quality in terms of inter-annotator agreement.

Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems
Syed Sarfaraz Akhtar | Arihant Gupta | Avijit Vajpayee | Arjit Srivastava | Manish Shrivastava

With the advent of word representations, word similarity tasks are becoming increasingly popular as an evaluation metric for the quality of the representations. In this paper, we present manually annotated monolingual word similarity datasets of six Indian languages: Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. These are the most spoken Indian languages worldwide after Hindi and Bengali. For the construction of these datasets, our approach relies on translation and re-annotation of word similarity datasets of English. We also present baseline scores for word representation models using state-of-the-art techniques for Urdu, Telugu and Marathi by evaluating them on the newly created word similarity datasets.

The BECauSE Corpus 2.0: Annotating Causality and Overlapping Relations
Jesse Dunietz | Lori Levin | Jaime Carbonell

Language of cause and effect captures an essential component of the semantics of a text. However, causal language is also intertwined with other semantic relations, such as temporal precedence and correlation. This makes it difficult to determine when causation is the primary intended meaning. This paper presents BECauSE 2.0, a new version of the BECauSE corpus with exhaustively annotated expressions of causal language, but also seven semantic relations that are frequently co-present with causation. The new corpus shows high inter-annotator agreement, and yields insights both about the linguistic expressions of causation and about the process of annotating co-present semantic relations.

Catching the Common Cause: Extraction and Annotation of Causal Relations and their Participants
Ines Rehbein | Josef Ruppenhofer

In this paper, we present a simple, yet effective method for the automatic identification and extraction of causal relations from text, based on a large English-German parallel corpus. The goal of this effort is to create a lexical resource for German causal relations. The resource will consist of a lexicon that describes constructions that trigger causality as well as the participants of the causal event, and will be augmented by a corpus with annotated instances for each entry, that can be used as training data to develop a system for automatic classification of causal relations. Focusing on verbs, our method harvested a set of 100 different lexical triggers of causality, including support verb constructions. At the moment, our corpus includes over 1,000 annotated instances. The lexicon and the annotated data will be made available to the research community.

Assessing SRL Frameworks with Automatic Training Data Expansion
Silvana Hartmann | Éva Mújdricza-Maydt | Ilia Kuznetsov | Iryna Gurevych | Anette Frank

We present the first experiment-based study that explicitly contrasts the three major semantic role labeling frameworks. As a prerequisite, we create a dataset labeled with parallel FrameNet-, PropBank-, and VerbNet-style labels for German. We train a state-of-the-art SRL tool for German for the different annotation styles and provide a comparative analysis across frameworks. We further explore the behavior of the frameworks with automatic training data generation. VerbNet provides larger semantic expressivity than PropBank, and we find that its generalization capacity approaches PropBank in SRL training, but it benefits less from training data expansion than the sparse-data affected FrameNet.

Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics
Michael Roth | Nasrin Mostafazadeh | Nathanael Chambers | Annie Louis

pdf bib
Inducing Script Structure from Crowdsourced Event Descriptions via Semi-Supervised Clustering
Lilian Wanzare | Alessandra Zarcone | Stefan Thater | Manfred Pinkal

We present a semi-supervised clustering approach to induce script structure from crowdsourced descriptions of event sequences by grouping event descriptions into paraphrase sets (representing event types) and inducing their temporal order. Our approach exploits semantic and positional similarity and allows for flexible event order, thus overcoming the rigidity of previous approaches. We incorporate crowdsourced alignments as prior knowledge and show that exploiting a small number of alignments results in a substantial improvement in cluster quality over state-of-the-art models and provides an appropriate basis for the induction of temporal order. We also show a coverage study to demonstrate the scalability of our approach.

pdf bib
A Consolidated Open Knowledge Representation for Multiple Texts
Rachel Wities | Vered Shwartz | Gabriel Stanovsky | Meni Adler | Ori Shapira | Shyam Upadhyay | Dan Roth | Eugenio Martinez Camara | Iryna Gurevych | Ido Dagan

We propose to move from Open Information Extraction (OIE) ahead to Open Knowledge Representation (OKR), aiming to represent information conveyed jointly in a set of texts in an open text-based manner. We do so by consolidating OIE extractions using entity and predicate coreference, while modeling information containment between coreferring elements via lexical entailment. We suggest that generating OKR structures can be a useful step in the NLP pipeline, to give semantic applications an easy handle on consolidated information across multiple texts.

pdf bib
Event-Related Features in Feedforward Neural Networks Contribute to Identifying Causal Relations in Discourse
Edoardo Maria Ponti | Anna Korhonen

Causal relations play a key role in information extraction and reasoning. Most of the time, their expression is ambiguous or implicit, i.e. without signals in the text. This makes their identification challenging. We aim to improve their identification by implementing a Feedforward Neural Network with a novel set of features for this task. In particular, these are based on the position of event mentions and the semantics of events and participants. The resulting classifier outperforms strong baselines on two datasets (the Penn Discourse Treebank and the CSTNews corpus) annotated with different schemes and containing examples in two languages, English and Portuguese. This result demonstrates the importance of events for identifying discourse relations.

pdf bib
Stance Detection in Facebook Posts of a German Right-wing Party
Manfred Klenner | Don Tuggener | Simon Clematide

We argue that in order to detect stance, not only the explicit attitudes of the stance holder towards the targets are crucial. It is the whole narrative the writer drafts that counts, including the way he hypostasizes the discourse referents: as benefactors or villains, as victims or beneficiaries. We exemplify the ability of our system to identify targets and detect the writer’s stance towards them on the basis of about 100 000 Facebook posts of a German right-wing party. A reader and writer model on top of our verb-based attitude extraction directly reveal stance conflicts.

pdf bib
Behind the Scenes of an Evolving Event Cloze Test
Nathanael Chambers

This paper analyzes the narrative event cloze test and its recent evolution. The test removes one event from a document’s chain of events, and systems predict the missing event. Originally proposed to evaluate learned knowledge of event scenarios (e.g., scripts and frames), most recent work now builds ngram-like language models (LM) to beat the test. This paper argues that the test has slowly/unknowingly been altered to accommodate LMs. Most notably, tests are auto-generated rather than by hand, and no effort is taken to include core script events. Recent work is not clear on evaluation goals and contains contradictory results. We implement several models, and show that the test’s bias to high-frequency events explains the inconsistencies. We conclude with recommendations on how to return to the test’s original intent, and offer brief suggestions on a path forward.

pdf bib
LSDSem 2017 Shared Task: The Story Cloze Test
Nasrin Mostafazadeh | Michael Roth | Annie Louis | Nathanael Chambers | James Allen

The LSDSem’17 shared task is the Story Cloze Test, a new evaluation for story understanding and script learning. This test provides a system with a four-sentence story and two possible endings, and the system must choose the correct ending to the story. Successful narrative understanding (getting closer to human performance of 100%) requires systems to link various levels of semantics to commonsense knowledge. A total of eight systems participated in the shared task, with a variety of approaches.

pdf bib
Story Cloze Task: UW NLP System
Roy Schwartz | Maarten Sap | Ioannis Konstas | Leila Zilles | Yejin Choi | Noah A. Smith

This paper describes University of Washington NLP’s submission for the Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem 2017) shared task—the Story Cloze Task. Our system is a linear classifier with a variety of features, including both the scores of a neural language model and style features. We report 75.2% accuracy on the task. A further discussion of our results can be found in Schwartz et al. (2017).

pdf bib
LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test
Michael Bugert | Yevgeniy Puzikov | Andreas Rücklé | Judith Eckle-Kohler | Teresa Martin | Eugenio Martínez-Cámara | Daniil Sorokin | Maxime Peyrard | Iryna Gurevych

The Story Cloze test is a recent effort in providing a common test scenario for text understanding systems. As part of the LSDSem 2017 shared task, we present a system based on a deep learning architecture combined with a rich set of manually-crafted linguistic features. The system outperforms all known baselines for the task, suggesting that the chosen approach is promising. We additionally present two methods for generating further training data based on stories from the ROCStories corpus.

pdf bib
Sentiment Analysis and Lexical Cohesion for the Story Cloze Task
Michael Flor | Swapna Somasundaran

We present two NLP components for the Story Cloze Task – dictionary-based sentiment analysis and lexical cohesion. While previous research found no contribution from sentiment analysis to the accuracy on this task, we demonstrate that sentiment is an important aspect. We describe a new approach, using a rule that estimates sentiment congruence in a story. Our sentiment-based system achieves strong results on this task. Our lexical cohesion system achieves accuracy comparable to previously published baseline results. A combination of the two systems achieves better accuracy than published baselines. We argue that sentiment analysis should be considered an integral part of narrative comprehension.

pdf bib
Resource-Lean Modeling of Coherence in Commonsense Stories
Niko Schenk | Christian Chiarcos

We present a resource-lean neural recognizer for modeling coherence in commonsense stories. Our lightweight system is inspired by successful attempts at modeling discourse relations and stands out due to its simplicity and easy optimization compared to prior approaches to narrative script learning. We evaluate our approach in the Story Cloze Test, demonstrating an absolute improvement in accuracy of 4.7% over state-of-the-art implementations.

pdf bib
An RNN-based Binary Classifier for the Story Cloze Test
Melissa Roemmele | Sosuke Kobayashi | Naoya Inoue | Andrew Gordon

The Story Cloze Test consists of choosing a sentence that best completes a story given two choices. In this paper we present a system that performs this task using a supervised binary classifier on top of a recurrent neural network to predict the probability that a given story ending is correct. The classifier is trained to distinguish correct story endings given in the training data from incorrect ones that we artificially generate. Our experiments evaluate different methods for generating these negative examples, as well as different embedding-based representations of the stories. Our best result obtains 67.2% accuracy on the test set, outperforming the existing top baseline of 58.5%.

pdf bib
IIT (BHU): System Description for LSDSem’17 Shared Task
Pranav Goel | Anil Kumar Singh

This paper describes an ensemble system submitted as part of the LSDSem Shared Task 2017 - the Story Cloze Test. The main conclusion from our results is that an approach based on semantic similarity alone may not be enough for this task. We test various approaches and compare them with two ensemble systems. One is based on voting and the other on a logistic regression-based classifier. Our final system is able to outperform the previous state of the art for the Story Cloze test. Another very interesting observation is the performance of the sentiment-based approach, which works almost as well on its own as our final ensemble system.

pdf bib
Story Cloze Ending Selection Baselines and Data Examination
Todor Mihaylov | Anette Frank

This paper describes two supervised baseline systems for the Story Cloze Test Shared Task (Mostafazadeh et al., 2016a). We first build a classifier using features based on word embeddings and semantic similarity computation. We further implement a neural LSTM system with different encoding strategies that try to model the relation between the story and the provided endings. Our experiments show that a model using representation features based on average word embedding vectors over the given story words and the candidate ending sentence words, combined with similarity features between the story and candidate ending representations, performed better than the neural models. Our best model achieves an accuracy of 72.42, ranking 3rd in the official evaluation.

Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

pdf bib
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres
George Giannakopoulos | Elena Lloret | John M. Conroy | Josef Steinberger | Marina Litvak | Peter Rankel | Benoit Favre

pdf bib
MultiLing 2017 Overview
George Giannakopoulos | John Conroy | Jeff Kubina | Peter A. Rankel | Elena Lloret | Josef Steinberger | Marina Litvak | Benoit Favre

In this brief report we present an overview of the MultiLing 2017 effort and workshop, as implemented within EACL 2017. MultiLing is a community-driven initiative that pushes the state-of-the-art in Automatic Summarization by providing data sets and fostering further research and development of summarization systems. This year the scope of the workshop was widened, bringing together researchers that work on summarization across sources, languages and genres. We summarize the main tasks planned and implemented this year, the contributions received, and we also provide insights on next steps.

pdf bib
Decoupling Encoder and Decoder Networks for Abstractive Document Summarization
Ying Xu | Jey Han Lau | Timothy Baldwin | Trevor Cohn

Abstractive document summarization seeks to automatically generate a summary for a document, based on some abstract “understanding” of the original document. State-of-the-art techniques traditionally use attentive encoder–decoder architectures. However, due to the large number of parameters in these models, they require large training datasets and long training times. In this paper, we propose decoupling the encoder and decoder networks, and training them separately. We encode documents using an unsupervised document encoder, and then feed the document vector to a recurrent neural network decoder. With this decoupled architecture, we decrease the number of parameters in the decoder substantially, and shorten its training time. Experiments show that the decoupled model achieves comparable performance with state-of-the-art models for in-domain documents, but performs less well on out-of-domain documents.

pdf bib
Centroid-based Text Summarization through Compositionality of Word Embeddings
Gaetano Rossiello | Pierpaolo Basile | Giovanni Semeraro

Textual similarity is a crucial aspect of many extractive text summarization methods. A bag-of-words representation cannot capture the semantic relationships between concepts when comparing strongly related sentences with no words in common. To overcome this issue, in this paper we propose a centroid-based method for text summarization that exploits the compositional capabilities of word embeddings. The evaluations on multi-document and multilingual datasets show the effectiveness of the continuous vector representation of words compared to the bag-of-words model. Despite its simplicity, our method achieves good performance even in comparison to more complex deep learning models. Our method is unsupervised and it can be adopted in other summarization tasks.
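
As a rough illustration of the centroid idea described in this abstract, the sketch below scores each sentence by the cosine similarity between its summed word vectors and a document centroid. It is not the authors' implementation: the toy `embeddings` table, the whitespace tokenizer, and the use of all document words for the centroid (rather than a selected subset of salient words) are assumptions made for the example.

```python
import math

def sum_vectors(vectors, dim):
    """Component-wise sum of an iterable of vectors (composition by addition)."""
    total = [0.0] * dim
    for vec in vectors:
        for i, value in enumerate(vec):
            total[i] += value
    return total

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def embed(tokens, embeddings, dim):
    """Represent a token list as the sum of its known word vectors."""
    return sum_vectors((embeddings[t] for t in tokens if t in embeddings), dim)

def summarize(sentences, embeddings, dim, k=1):
    """Return the k sentences closest to the document centroid, in document order."""
    tokenized = [s.lower().split() for s in sentences]
    # Assumption: the centroid is built from every word in the document.
    centroid = embed([t for toks in tokenized for t in toks], embeddings, dim)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(embed(tokenized[i], embeddings, dim), centroid),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "stocks": [0.0, 1.0]}
toy_sentences = ["cat dog", "dog cat cat", "stocks"]
print(summarize(toy_sentences, toy_embeddings, dim=2, k=2))
```

Because sentences are compared in embedding space, two sentences with no words in common can still score as similar, which is the limitation of bag-of-words models that the abstract points to.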

pdf bib
Query-based summarization using MDL principle
Marina Litvak | Natalia Vanetik

Query-based text summarization is aimed at extracting essential information that answers the query from the original text. The answer is presented in a minimal, often predefined, number of words. In this paper we introduce a new unsupervised approach for query-based extractive summarization, based on the minimum description length (MDL) principle, that employs the Krimp compression algorithm (Vreeken et al., 2011). The key idea of our approach is to select frequent word sets related to a given query that compress document sentences better and therefore describe the document better. A summary is extracted by selecting sentences that best cover query-related frequent word sets. The approach is evaluated on the DUC 2005 and DUC 2006 datasets, which are specifically designed for query-based summarization (DUC, 2005, 2006). It competes with the best results.

pdf bib
Word Embedding and Topic Modeling Enhanced Multiple Features for Content Linking and Argument / Sentiment Labeling in Online Forums
Lei Li | Liyuan Mao | Moye Chen

Multiple grammatical and semantic features are adopted in content linking and argument/sentiment labeling for online forums in this paper. There are mainly two different methods for content linking. First, we utilize the deep features obtained from a word embedding model and compute sentence similarity. Second, we use multiple traditional features to locate candidate linking sentences, and then adopt a voting method to obtain the final result. LDA topic modeling is used to mine latent semantic features and K-means clustering is implemented for argument labeling, while features from sentiment dictionaries and rule-based sentiment analysis are integrated for sentiment labeling. Experimental results have shown that our methods are valid.

pdf bib
Ultra-Concise Multi-genre Summarisation of Web2.0: towards Intelligent Content Generation
Elena Lloret | Ester Boldrini | Patricio Martínez-Barco | Manuel Palomar

The electronic Word of Mouth has become the most powerful communication channel thanks to the wide usage of social media. Our research proposes an approach towards the production of automatic ultra-concise summaries from multiple Web 2.0 sources. We exploit user-generated content from reviews and microblogs in different domains, and compile and analyse four types of ultra-concise summaries: a) positive information, b) negative information, c) both, or d) objective information. The appropriateness and usefulness of our model is demonstrated by its successful results and great potential in real-life applications, thus representing a relevant advancement over state-of-the-art approaches.

pdf bib
Machine Learning Approach to Evaluate MultiLingual Summaries
Samira Ellouze | Maher Jaoua | Lamia Hadrich Belguith

The present paper introduces a new MultiLing text summary evaluation method. This method relies on a machine learning approach that combines multiple features to build models that predict the human score (overall responsiveness) of a new summary. We have tried several single and “ensemble learning” classifiers to build the best model. We have tested our method in summary-level evaluation, where we evaluate each text summary separately. The correlation between built models and human score is better than the correlation between baselines and manual score.

Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media

pdf bib
Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media
Lun-Wei Ku | Cheng-Te Li

pdf bib
A Survey on Hate Speech Detection using Natural Language Processing
Anna Schmidt | Michael Wiegand

This paper presents a survey on hate speech detection. Given the steadily growing body of social media content, the amount of online hate speech is also increasing. Due to the massive scale of the web, methods that automatically detect hate speech are required. Our survey describes key areas that have been explored to automatically recognize these types of utterances using natural language processing. We also discuss limits of those approaches.

pdf bib
Facebook sentiment: Reactions and Emojis
Ye Tian | Thiago Galery | Giulio Dulcinati | Emilia Molimpakis | Chao Sun

Emojis are used frequently in social media. A widely assumed view is that emojis express the emotional state of the user, which has led to research focusing on the expressiveness of emojis independent from the linguistic context. We argue that emojis and the linguistic texts can modify the meaning of each other. The overall communicated meaning is not a simple sum of the two channels. In order to study the meaning interplay, we need data indicating the overall sentiment of the entire message as well as the sentiment of the emojis stand-alone. We propose that Facebook Reactions are a good data source for such a purpose. FB reactions (e.g. “Love” and “Angry”) indicate the readers’ overall sentiment, against which we can investigate the types of emojis used in the comments under different reaction profiles. We present a data set of 21,000 FB posts (57 million reactions and 8 million comments) from public media pages across four countries.

pdf bib
Potential and Limitations of Cross-Domain Sentiment Classification
Jan Milan Deriu | Martin Weilenmann | Dirk Von Gruenigen | Mark Cieliebak

In this paper we investigate the cross-domain performance of a current state-of-the-art sentiment analysis system. For this purpose we train a convolutional neural network (CNN) on data from different domains and evaluate its performance on other domains. Furthermore, we evaluate the usefulness of combining a large number of different smaller annotated corpora into one large corpus. Our results show that more sophisticated approaches are required to train a system that works equally well on various domains.

pdf bib
Aligning Entity Names with Online Aliases on Twitter
Kevin McKelvey | Peter Goutzounis | Stephen da Cruz | Nathanael Chambers

This paper presents new models that automatically align online aliases with their real entity names. Many research applications rely on identifying entity names in text, but people often refer to entities with unexpected nicknames and aliases. For example, The King and King James are aliases for LeBron James, a professional basketball player. Recent work on entity linking attempts to resolve mentions to knowledge base entries, like a Wikipedia page, but linking is unfortunately limited to well-known entities with pre-built pages. This paper asks a more basic question: can aliases be aligned without background knowledge of the entity? Further, can the semantics surrounding alias mentions be used to inform alignments? We describe statistical models that make decisions based on the lexicographic properties of the aliases with their semantic context in a large corpus of tweets. We experiment on a database of Twitter users and their usernames, and present the first human evaluation for this task. Alignment accuracy approaches human performance at 81%, and we show that while lexicographic features are most important, the semantic context of an alias further improves classification accuracy.

pdf bib
Character-based Neural Embeddings for Tweet Clustering
Svitlana Vakulenko | Lyndon Nixon | Mihai Lupu

In this paper we show how the performance of tweet clustering can be improved by leveraging character-based neural networks. The proposed approach overcomes the limitations related to the vocabulary explosion in the word-based models and allows for the seamless processing of the multilingual content. Our evaluation results and code are available on-line: https://github.com/vendi12/tweet2vec_clustering.

pdf bib
A Twitter Corpus and Benchmark Resources for German Sentiment Analysis
Mark Cieliebak | Jan Milan Deriu | Dominic Egger | Fatih Uzdilli

In this paper we present SB10k, a new corpus for sentiment analysis with approx. 10,000 German tweets. We use this new corpus and two existing corpora to provide state-of-the-art benchmarks for sentiment analysis in German: we implemented a CNN (based on the winning system of SemEval-2016) and a feature-based SVM and compare their performance on all three corpora. For the CNN, we also created German word embeddings trained on 300M tweets. These word embeddings were then optimized for sentiment analysis using distant-supervised learning. The new corpus, the German word embeddings (plain and optimized), and source code to re-run the benchmarks are publicly available.

Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

pdf bib
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Preslav Nakov | Marcos Zampieri | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali

pdf bib
Findings of the VarDial Evaluation Campaign 2017
Marcos Zampieri | Shervin Malmasi | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann | Yves Scherrer | Noëmi Aepli

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

pdf bib
Dialectometric analysis of language variation in Twitter
Gonzalo Donoso | David Sánchez

In the last few years, microblogging platforms such as Twitter have given rise to a deluge of textual data that can be used for the analysis of informal communication between millions of individuals. In this work, we propose an information-theoretic approach to geographic language variation using a corpus based on Twitter. We test our models with tens of concepts and their associated keywords detected in Spanish tweets geolocated in Spain. We employ dialectometric measures (cosine similarity and Jensen-Shannon divergence) to quantify the linguistic distance on the lexical level between cells created in a uniform grid over the map. This can be done for a single concept or in the general case taking into account an average of the considered variants. The latter permits an analysis of the dialects that naturally emerge from the data. Interestingly, our results reveal the existence of two dialect macrovarieties. The first group includes a region-specific speech spoken in small towns and rural areas whereas the second cluster encompasses cities that tend to use a more uniform variety. Since the results obtained with the two different metrics qualitatively agree, our work suggests that social media corpora can be efficiently used for dialectometric analyses.
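
The two dialectometric measures named in this abstract can be sketched as follows. This is only an illustration of the standard definitions, not the authors' code: the keyword counts for the two grid cells are made-up values, and using base-2 logarithms (so that the divergence lies between 0 and 1) is an assumption.

```python
import math

def normalize(counts):
    """Turn raw keyword counts for one grid cell into relative frequencies."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine of the angle between two keyword-frequency vectors."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2, defined on the union of keys.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a[k] == 0 contribute nothing.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical counts of two variants of one concept in two map cells.
cell_a = normalize({"coche": 8, "auto": 2})
cell_b = normalize({"coche": 3, "auto": 7})
print(cosine_similarity(cell_a, cell_b), js_divergence(cell_a, cell_b))
```

Cosine similarity grows as two cells use the variants in similar proportions, while the Jensen-Shannon divergence shrinks, so the two measures should rank cell pairs in roughly opposite order.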

pdf bib
Computational analysis of Gondi dialects
Taraka Rama | Çağrı Çöltekin | Pavel Sofroniev

This paper presents a computational analysis of Gondi dialects spoken in central India. We present a digitized data set of the dialect area, and analyze the data using different techniques from dialectometry, deep learning, and computational biology. We show that the methods largely agree with each other and with the earlier non-computational analyses of the language group.

pdf bib
Investigating Diatopic Variation in a Historical Corpus
Stefanie Dipper | Sandra Waldenberger

This paper investigates diatopic variation in a historical corpus of German. Based on equivalent word forms from different language areas, replacement rules and mappings are derived which describe the relations between these word forms. These rules and mappings are then interpreted as reflections of morphological, phonological or graphemic variation. Based on sample rules and mappings, we show that our approach can replicate results from historical linguistics. While previous studies were restricted to predefined word lists, or confined to single authors or texts, our approach uses a much wider range of data available in historical corpora.

pdf bib
Author Profiling at PAN: from Age and Gender Identification to Language Variety Identification (invited talk)
Paolo Rosso

Author profiling is the study of how language is shared by people, a problem of growing importance in applications dealing with security, in order to understand who could be behind an anonymous threat message, and marketing, where companies may be interested in knowing the demographics of people that in online reviews liked or disliked their products. In this talk we will give an overview of the PAN shared tasks that since 2013 have been organised at CLEF and FIRE evaluation forums, mainly on age and gender identification in social media, although personality recognition in Twitter as well as in source code was also addressed. In 2017 the PAN author profiling shared task addresses jointly gender and language variety identification in Twitter where tweets have been annotated with authors’ gender and their specific variation of their native language: English (Australia, Canada, Great Britain, Ireland, New Zealand, United States), Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela), Portuguese (Brazil, Portugal), and Arabic (Egypt, Gulf, Levantine, Maghrebi).

pdf bib
The similarity and Mutual Intelligibility between Amharic and Tigrigna Varieties
Tekabe Legesse Feleke

The present study has examined the similarity and the mutual intelligibility between Amharic and Tigrigna using three tools, namely Levenshtein distance, an intelligibility test and questionnaires. The study has shown that both Tigrigna varieties have almost equal phonetic and lexical distances from Amharic. The study also indicated that Amharic speakers understand less than 50% of the two varieties. Furthermore, the study showed that Amharic speakers are more positive about the Ethiopian Tigrigna variety than the Eritrean variety. However, their attitude towards the two varieties does not have an impact on their intelligibility. The Amharic speakers’ familiarity with the Tigrigna varieties is largely dependent on the genealogical relation between Amharic and the two Tigrigna varieties.
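
For readers unfamiliar with the first of these tools, the following is a minimal sketch of Levenshtein distance between two transcribed word forms. Dividing by the length of the longer form is one common normalization and is an assumption here; the abstract does not say how the study normalized its distances.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Levenshtein distance scaled to [0, 1] by the longer form's length (assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

print(levenshtein("kitten", "sitting"))  # 3
```

Averaging `normalized_distance` over a list of cognate pairs gives a single distance figure between two varieties, which is the kind of aggregate a comparison like the one above relies on.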

pdf bib
Why Catalan-Spanish Neural Machine Translation? Analysis, comparison and combination with standard Rule and Phrase-based technologies
Marta R. Costa-jussà

Catalan and Spanish are two related languages given that both derive from Latin. They share similarities at several linguistic levels including morphology, syntax and semantics. This makes them particularly interesting for the MT task. Given the recent appearance and popularity of neural MT, this paper analyzes the performance of this new approach compared to the well-established rule-based and phrase-based MT systems. Experiments are reported on a large database of 180 million words. Results, in terms of standard automatic measures, show that neural MT clearly outperforms the rule-based and phrase-based MT systems on the in-domain test set, but it is worse on the out-of-domain test set. A naive system combination works especially well for the latter. In-domain manual analysis shows that neural MT tends to improve both adequacy and fluency, for example, by being able to generate more natural translations instead of literal ones, choosing the adequate target word when the source word has several translations and improving gender agreement. However, out-of-domain manual analysis shows how neural MT is more affected by unknown words or contexts.

pdf bib
Kurdish Interdialect Machine Translation
Hossein Hassani

This research suggests a method for machine translation between two Kurdish dialects. We chose the two widely spoken dialects, Kurmanji and Sorani, which are considered to be mutually unintelligible. Also, despite being spoken by about 30 million people in different countries, Kurdish is among the less-resourced languages. The research used bi-dialectal dictionaries and showed that the lack of parallel corpora is not a major obstacle in machine translation between the two dialects. The experiments showed that the machine translated texts are comprehensible to those who do not speak the dialect. The research is the first attempt at inter-dialect machine translation in Kurdish and particularly could help in making online texts in one dialect comprehensible to those who only speak the target dialect. The results showed that the translated texts are rated as understandable in 71% and 79% of cases for Kurmanji and Sorani respectively. They are rated as slightly-understandable in 29% of cases for Kurmanji and 21% for Sorani.

pdf bib
Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth
Jennifer Williams | Charlie Dagli

We present a new method to bootstrap filter Twitter language ID labels in our dataset for automatic language identification (LID). Our method combines geo-location, original Twitter LID labels, and Amazon Mechanical Turk to resolve missing and unreliable labels. We are the first to compare LID classification performance using the MIRA algorithm and langid.py. We show classifier performance on different versions of our dataset with high accuracy using only Twitter data, without ground truth, and very few training examples. We also show how Platt Scaling can be used to calibrate MIRA classifier output values into a probability distribution over candidate classes, making the output more intuitive. Our method allows for fine-grained distinctions between similar languages and dialects and allows us to rediscover the language composition of our Twitter dataset.
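
Platt scaling, mentioned in this abstract, fits a sigmoid that maps raw classifier scores to probabilities. The sketch below is a simplified version under stated assumptions: it fits the two sigmoid parameters by plain gradient descent on log-loss, omits the target smoothing of Platt's original formulation, and uses made-up scores and labels. The `to_distribution` helper, which renormalizes per-class probabilities so they sum to one, is a hypothetical way to obtain a distribution over candidate languages and is not taken from the paper.

```python
import math

def platt_probability(score, a, b):
    """Calibrated probability sigmoid(a * score + b) for one raw score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def fit_platt(scores, labels, learning_rate=0.1, epochs=2000):
    """Fit sigmoid parameters (a, b) by gradient descent on mean log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for score, label in zip(scores, labels):
            error = platt_probability(score, a, b) - label
            grad_a += error * score
            grad_b += error
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b

def to_distribution(class_scores, params):
    """Calibrate each class score with its own (a, b), then renormalize (assumption)."""
    probs = {c: platt_probability(s, *params[c]) for c, s in class_scores.items()}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

# Made-up held-out scores from a margin classifier, with 0/1 gold labels.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
print(platt_probability(1.5, a, b))
```

The calibration data should be held out from classifier training; otherwise the fitted sigmoid tends to be overconfident.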

pdf bib
Multi-source morphosyntactic tagging for spoken Rusyn
Yves Scherrer | Achim Rabus

This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolkit, we show that a tagger trained on a balanced set of the four source languages outperforms single language taggers by about 9%, and that additional automatically induced morphosyntactic lexicons lead to further improvements. The best observed accuracies for Rusyn are 82.4% for part-of-speech tagging and 75.5% for full morphological tagging.

pdf bib
Identifying dialects with textual and acoustic cues
Abualsoud Hanani | Aziz Qaroush | Stephen Taylor

We describe several systems for identifying short samples of Arabic or Swiss-German dialects, which were prepared for the shared task of the 2017 DSL Workshop (Zampieri et al., 2017). The Arabic data comprises both text and acoustic files, and our best run combined both. The Swiss-German data is text-only. Coincidentally, our best runs achieved an accuracy of nearly 63% on both the Swiss-German and Arabic dialect tasks.

pdf bib
Evaluating HeLI with Non-Linear Mappings
Tommi Jauhiainen | Krister Lindén | Heidi Jauhiainen

In this paper we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the 4th edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2017 workshop. Our SUKI team participated on the closed track together with 10 other teams. Our system reached the 7th position in the track. We describe the HeLI method and the non-linear mappings in mathematical notation. The HeLI method uses a probabilistic model with character n-grams and word-based backoff. We also describe our trials using the non-linear mappings instead of relative frequencies and we present statistics about the back-off function of the HeLI method.

pdf bib
A Perplexity-Based Method for Similar Languages Discrimination
Pablo Gamallo | Jose Ramom Pichel | Iñaki Alegria

This article describes the system submitted by the Citius_Ixa_Imaxin team to VarDial 2017 (DSL and GDI tasks). The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we have tested is a voting system making use of several n-gram models of both words and characters, although word unigrams turned out to be a very competitive model with reasonable results in the tasks in which we participated. An error analysis was performed in which we identified many test examples with no linguistic evidence to distinguish among the variants.
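
To make the perplexity-based idea concrete, here is a minimal sketch that scores a text against per-language character n-gram models and prefers the language with the lowest perplexity. The add-one smoothing, the trigram order, and all names are assumptions of this example; the submitted system votes over several word and character models and is not reproduced here.

```python
import math
from collections import Counter

def train_char_model(text, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    return ngrams, contexts, set(text), n

def perplexity(model, text):
    """Per-character perplexity of a non-empty text under an add-one smoothed model."""
    ngrams, contexts, vocab, n = model
    v = len(vocab) + 1  # one extra slot for unseen characters
    padded = " " * (n - 1) + text
    log_prob = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))

def closest_language(models, text):
    """Return the language whose model gives the lowest perplexity."""
    return min(models, key=lambda lang: perplexity(models[lang], text))
```

With `models = {"es": train_char_model(spanish_text), "pt": train_char_model(portuguese_text)}`, the call `closest_language(models, sentence)` picks the variety whose model is least "surprised" by the sentence.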

pdf bib
Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets
Yves Bestgen

This paper describes the system developed by the Centre for English Corpus Linguistics (CECL) to discriminate between similar languages, language varieties and dialects. Based on an SVM with character and POStag n-grams as features and the BM25 weighting scheme, it achieved 92.7% accuracy in the Discriminating between Similar Languages (DSL) task, ranking first among eleven systems but with a lead over the next three teams of only 0.2%. A simpler version of the system ranked second in the German Dialect Identification (GDI) task thanks to several ad hoc postprocessing steps. Complementary analyses carried out by a cross-validation procedure suggest that the BM25 weighting scheme could be competitive in this type of task, at least in comparison with the sublinear TF-IDF. POStag n-grams also improved the system performance.
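
For readers unfamiliar with BM25 as a feature weighting scheme, here is a minimal sketch that weights character n-gram features with a common BM25 formulation. The parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the function names are assumptions of this example, not settings taken from the paper. It also assumes every document has at least `n` characters.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bm25_vectors(docs, n=3, k1=1.2, b=0.75):
    """Return one {n-gram: BM25 weight} dictionary per document."""
    tokenized = [char_ngrams(d, n) for d in docs]
    num_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / num_docs
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for t in tokenized for g in set(t))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: longer-than-average documents are damped.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        vectors.append({
            g: math.log((num_docs - df[g] + 0.5) / (df[g] + 0.5) + 1)
               * f * (k1 + 1) / (f + norm)
            for g, f in tf.items()
        })
    return vectors
```

The resulting sparse vectors could then be fed to a linear classifier such as an SVM. Unlike raw term frequency, the weight saturates as an n-gram repeats within a document, and n-grams found in every document receive a small weight.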

pdf bib
Discriminating between Similar Languages with Word-level Convolutional Neural Networks
Marcelo Criscuolo | Sandra Maria Aluísio

Discriminating between Similar Languages (DSL) is a challenging task addressed at the VarDial Workshop series. We report on our participation in the DSL shared task with a two-stage system. In the first stage, character n-grams are used to separate language groups, then specialized classifiers distinguish similar language varieties. We have conducted experiments with three system configurations and submitted one run for each. Our main approach is a word-level convolutional neural network (CNN) that learns task-specific vectors with minimal text preprocessing. We also experiment with multi-layer perceptron (MLP) networks and another hybrid configuration. Our best run achieved an accuracy of 90.76%, ranking 8th among 11 participants and getting very close to the system that ranked first (less than 2 points). Even though the CNN model could not achieve the best results, it still makes a viable approach to discriminating between similar languages.

pdf bib
Cross-lingual dependency parsing for closely related languages - Helsinki’s submission to VarDial 2017
Jörg Tiedemann

This paper describes the submission from the University of Helsinki to the shared task on cross-lingual dependency parsing at VarDial 2017. We present work on annotation projection and treebank translation that gave good results for all three target languages in the test set. In particular, Slovak seems to work well with information coming from the Czech treebank, which is in line with related work. The attachment scores for cross-lingual models even surpass the fully supervised models trained on the target language treebank. Croatian is the most difficult language in the test set and the improvements over the baseline are rather modest. Norwegian works best with information coming from Swedish whereas Danish contributes surprisingly little.

pdf bib
Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
Helena Gomez | Ilia Markov | Jorge Baptista | Grigori Sidorov | David Pinto

This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.

pdf bib
Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing
Çağrı Çöltekin | Taraka Rama

This paper describes our systems and results on VarDial 2017 shared tasks. Besides three language/dialect discrimination tasks, we also participated in the cross-lingual dependency parsing (CLP) task using a simple methodology which we also briefly describe in this paper. For all the discrimination tasks, we used linear SVMs with character and word features. The system achieves competitive results among other systems in the shared task. We also report additional experiments with neural network models. The performance of neural network models was close but always below the corresponding SVM classifiers in the discrimination tasks. For the cross-lingual parsing task, we experimented with an approach based on automatically translating the source treebank to the target language, and training a parser on the translated treebank. We used off-the-shelf tools for both translation and parsing. Despite achieving better-than-baseline results, our scores in CLP tasks were substantially lower than the scores of the other participants.

pdf bib
When Sparse Traditional Models Outperform Dense Neural Networks: the Curious Case of Discriminating between Similar Languages
Maria Medvedeva | Martin Kroon | Barbara Plank

We present the results of our participation in the VarDial 4 shared task on discriminating closely related languages. Our submission includes simple traditional models using linear support vector machines (SVMs) and a neural network (NN). The main idea was to leverage language group information. We did so with a two-layer approach in the traditional model and a multi-task objective in the neural network case. Our results confirm earlier findings: simple traditional models outperform neural networks consistently for this task, at least given the number of systems we could examine in the available time. Our two-layer linear SVM ranked 2nd in the shared task.

pdf bib
German Dialect Identification in Interview Transcriptions
Shervin Malmasi | Marcos Zampieri

This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017. The task consists of training models to identify the dialect of Swiss-German speech transcripts. The dialects included in the GDI dataset are Basel, Bern, Lucerne, and Zurich. The three systems we submitted are based on: a plurality ensemble, a mean probability ensemble, and a meta-classifier trained on character and word n-grams. The best results were obtained by the meta-classifier achieving 68.1% accuracy and 66.2% F1-score, ranking first among the 10 teams which participated in the GDI shared task.
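
The difference between the first two ensemble types named in this abstract can be sketched as follows. This is an illustrative example only: the dialect codes and probabilities are invented, the function names are hypothetical, and the meta-classifier (a classifier trained on the base classifiers' outputs) is not shown.

```python
from collections import Counter

def plurality_vote(predicted_labels):
    """Pick the label predicted by the most classifiers (first seen wins ties)."""
    return Counter(predicted_labels).most_common(1)[0][0]

def mean_probability(distributions):
    """Average the classifiers' probability estimates and pick the best class."""
    classes = distributions[0].keys()
    mean = {c: sum(d[c] for d in distributions) / len(distributions) for c in classes}
    return max(mean, key=mean.get)

# Invented outputs of three base classifiers for one utterance.
outputs = [
    {"BE": 0.40, "ZH": 0.35, "BS": 0.15, "LU": 0.10},
    {"BE": 0.45, "ZH": 0.40, "BS": 0.10, "LU": 0.05},
    {"BE": 0.05, "ZH": 0.90, "BS": 0.03, "LU": 0.02},
]
hard_labels = [max(d, key=d.get) for d in outputs]

vote_winner = plurality_vote(hard_labels)
mean_winner = mean_probability(outputs)
```

In this toy case the two rules disagree: two classifiers narrowly prefer `BE`, so the vote selects it, while the third classifier's confident `ZH` estimate dominates the averaged probabilities.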

pdf bib
CLUZH at VarDial GDI 2017: Testing a Variety of Machine Learning Tools for the Classification of Swiss German Dialects
Simon Clematide | Peter Makarov

Our submissions for the GDI 2017 Shared Task are the results from three different types of classifiers: Naïve Bayes, Conditional Random Fields (CRF), and Support Vector Machine (SVM). Our CRF-based run achieves a weighted F1 score of 65% (third rank) being beaten by the best system by 0.9%. Measured by classification accuracy, our ensemble run (Naïve Bayes, CRF, SVM) reaches 67% (second rank) being 1% lower than the best system. We also describe our experiments with Recurrent Neural Network (RNN) architectures. Since they performed worse than our non-neural approaches we did not include them in the submission.

pdf bib
Arabic Dialect Identification Using iVectors and ASR Transcripts
Shervin Malmasi | Marcos Zampieri

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017. The goal of the task is to evaluate computational models to identify the dialect of Arabic utterances using both audio and text transcriptions. The ADI shared task dataset included Modern Standard Arabic (MSA) and four Arabic dialects: Egyptian, Gulf, Levantine, and North-African. The three systems submitted by MAZA are based on combinations of multiple machine learning classifiers arranged as (1) voting ensemble; (2) mean probability ensemble; (3) meta-classifier. The best results were obtained by the meta-classifier achieving 71.7% accuracy, ranking second among the six teams which participated in the ADI shared task.

pdf bib
Discriminating between Similar Languages using Weighted Subword Features
Adrien Barbaresi

The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. It features the following characteristics: (1) the preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way; my approach outperforms most of the systems used in the DSL shared task (3rd rank).

pdf bib
Exploring Lexical and Syntactic Features for Language Variety Identification
Chris van der Lee | Antal van den Bosch

We present a method to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language. The method draws on a feature bundle representing text statistics, syntactic features, and word n-grams. Text statistics include average word length and sentence length, while syntactic features include ratios of function words and part-of-speech n-grams. The effectiveness of the classifier was measured by classifying Dutch subtitles developed for either Dutch or Flemish television. Several machine learning algorithms were compared as well as feature combination methods in order to find the optimal generalization performance. A machine-learning meta classifier based on AdaBoost attained the best F-score of 0.92.

pdf bib
Learning to Identify Arabic and German Dialects using Multiple Kernels
Radu Tudor Ionescu | Andrei Butnaru

We present a machine learning approach for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the DSL 2017 Challenge. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided only for the Arabic data. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Our approach is shallow and simple, but the empirical results obtained in the shared tasks show that it achieves very good results. Indeed, we ranked first in the ADI Shared Task with a weighted F1 score of 76.32% (4.62% above the second place) and fifth in the GDI Shared Task with a weighted F1 score of 63.67% (2.57% below the first place).

pdf bib
Slavic Forest, Norwegian Wood
Rudolf Rosa | Daniel Zeman | David Mareček | Zdeněk Žabokrtský

We once had a corp, or should we say, it once had us They showed us its tags, isn’t it great, unified tags They asked us to parse and they told us to use everything So we looked around and we noticed there was near nothing We took other langs, bitext aligned: words one-to-one We played for two weeks, and then they said, here is the test The parser kept training till morning, just until deadline So we had to wait and hope what we get would be just fine And, when we awoke, the results were done, we saw we’d won So, we wrote this paper, isn’t it good, Norwegian wood.

pdf (full)
bib (full)
Proceedings of the Third Arabic Natural Language Processing Workshop

pdf bib
Proceedings of the Third Arabic Natural Language Processing Workshop
Nizar Habash | Mona Diab | Kareem Darwish | Wassim El-Hajj | Hend Al-Khalifa | Houda Bouamor | Nadi Tomeh | Mahmoud El-Haj

pdf bib
Identification of Languages in Algerian Arabic Multilingual Documents
Wafia Adouane | Simon Dobnik

This paper presents a language identification system designed to detect the language of each word, in its context, in multilingual documents as generated in social media by bilingual/multilingual communities, in our case speakers of Algerian Arabic. We frame the task as a sequence tagging problem and use supervised machine learning with standard methods like HMM and Ngram classification tagging. We also experiment with a lexicon-based method. Combining all the methods in a fall-back mechanism and introducing some linguistic rules, to deal with unseen tokens and ambiguous words, gives an overall accuracy of 93.14%. Finally, we introduced rules for language identification from sequences of recognised words.

pdf bib
Arabic Diacritization: Stats, Rules, and Hacks
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali

In this paper, we present a new and fast state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. We employ a Viterbi decoder at word-level with back-off to stem, morphological patterns, and transliteration and sequence labeling based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules to properly guess case endings. We achieve a low word level diacritization error of 3.29% and 12.77% without and with case endings respectively on a new multi-genre free of copyright test set. We are making the diacritizer available for free for research purposes.
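
As background for the word-level Viterbi decoding mentioned in this abstract, here is a minimal sketch that picks the best sequence of diacritized candidates under a bigram score. It is not the authors' decoder: the romanized candidate forms, the log-probability values, and the penalty for unseen pairs are all invented for illustration, and the back-off to stems and patterns is not modeled.

```python
def viterbi(candidates, bigram_logprob):
    """Return the highest-scoring path through per-position candidate lists.

    candidates: list of lists, the possible diacritized forms for each word
    bigram_logprob(prev, cur): log-probability of cur after prev (prev is None at the start)
    """
    # best[i][form] = (score of the best path ending in form, previous form on that path)
    best = [{c: (bigram_logprob(None, c), None) for c in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for cur in candidates[i]:
            prev, score = max(
                ((p, s + bigram_logprob(p, cur)) for p, (s, _) in best[i - 1].items()),
                key=lambda pair: pair[1],
            )
            layer[cur] = (score, prev)
        best.append(layer)
    # Backtrack from the best final form.
    last = max(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(len(candidates) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Toy scores (assumptions): romanized forms stand in for diacritized Arabic words.
TOY_SCORES = {
    (None, "kataba"): -0.5, (None, "kutiba"): -1.5,
    ("kataba", "al-darsa"): -0.2, ("kataba", "al-darsu"): -3.0,
    ("kutiba", "al-darsu"): -0.2, ("kutiba", "al-darsa"): -3.0,
}

def toy_logprob(prev, cur):
    return TOY_SCORES.get((prev, cur), -10.0)  # flat penalty for unseen pairs

decoded = viterbi([["kataba", "kutiba"], ["al-darsa", "al-darsu"]], toy_logprob)
```

The decoder scores whole sequences rather than choosing each word independently, so the choice for one word can be revised in light of its neighbours.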

pdf bib
Semantic Similarity of Arabic Sentences with Word Embeddings
El Moatez Billah Nagoudi | Didier Schwab

Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system devoted to calculating the semantic similarity of Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to support the identification of words that are highly descriptive in each sentence. The performance of our proposed system is confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgments.
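
The core of an embedding-based similarity measure with IDF weighting can be sketched as below. This is a simplified illustration under stated assumptions: the two-dimensional vectors and IDF values are invented, English placeholder tokens stand in for Arabic words, and the Part-of-Speech weighting described in the abstract is omitted.

```python
import math

def sentence_vector(tokens, embeddings, idf):
    """IDF-weighted sum of the word vectors of the known tokens."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for tok in tokens:
        if tok in embeddings:
            weight = idf.get(tok, 1.0)  # unknown IDF defaults to 1.0 (assumption)
            for k, x in enumerate(embeddings[tok]):
                vec[k] += weight * x
    return vec

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity(tokens_a, tokens_b, embeddings, idf):
    return cosine(sentence_vector(tokens_a, embeddings, idf),
                  sentence_vector(tokens_b, embeddings, idf))

# Invented toy resources.
EMB = {"the": [0.5, 0.5], "cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
IDF = {"the": 0.1, "cat": 1.0, "kitten": 1.0, "car": 1.0}
```

Because the frequent word "the" carries a low IDF weight, `similarity(["the", "cat"], ["the", "kitten"], EMB, IDF)` is driven by the content words and comes out much higher than the score for `["the", "cat"]` against `["the", "car"]`.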

pdf bib
Morphological Analysis for the Maltese Language: The challenges of a hybrid system
Claudia Borg | Albert Gatt

Maltese is a morphologically rich language with a hybrid morphological system which features both concatenative and non-concatenative processes. This paper analyses the impact of this hybridity on the performance of machine learning techniques for morphological labelling and clustering. In particular, we analyse a dataset of morphologically related word clusters to evaluate the difference in results for concatenative and non-concatenative clusters. We also describe research carried out in morphological labelling, with a particular focus on the verb category. Two evaluations were carried out, one using an unseen dataset, and another one using a gold standard dataset which was manually labelled. The gold standard dataset was split into concatenative and non-concatenative to analyse the difference in results between the two morphological systems.

pdf bib
A Morphological Analyzer for Gulf Arabic Verbs
Salam Khalifa | Sara Hassan | Nizar Habash

We present CALIMA-GLF, a Gulf Arabic morphological analyzer currently covering over 2,600 verbal lemmas. We describe in detail the process of building the analyzer starting from phonetic dictionary entries to fully inflected orthographic paradigms and associated lexicon and orthographic variants. We evaluate the coverage of CALIMA-GLF against Modern Standard Arabic and Egyptian Arabic analyzers on part of a Gulf Arabic novel. CALIMA-GLF verb analysis token recall for identifying correct POS tag outperforms both the Modern Standard Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute, respectively.

pdf bib
A Neural Architecture for Dialectal Arabic Segmentation
Younes Samih | Mohammed Attia | Mohamed Eldesouki | Ahmed Abdelali | Hamdy Mubarak | Laura Kallmeyer | Kareem Darwish

The automated processing of Arabic Dialects is challenging due to the lack of spelling standards and to the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks without any normalization or use of lexical features or lexical resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that rely on additional resources.

pdf bib
Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments
Salima Medhaffar | Fethi Bougares | Yannick Estève | Lamia Hadrich-Belguith

Dialectal Arabic (DA) is significantly different from the Arabic language taught in schools and used in written communication and formal speech (broadcast news, religion, politics, etc.). There is much existing research in the field of Arabic language Sentiment Analysis (SA); however, it is generally restricted to Modern Standard Arabic (MSA) or some dialects of economic or political interest. In this paper we are interested in the SA of the Tunisian Dialect. We utilize Machine Learning techniques to determine the polarity of comments written in Tunisian Dialect. First, we evaluate the performance of SA systems with models trained using freely available MSA and Multi-dialectal data sets. We then collect and annotate a Tunisian Dialect corpus of 17,000 comments from Facebook. This corpus yields a significant accuracy improvement compared to the best model trained on other Arabic dialects or MSA data. We believe that this first freely available corpus will be valuable to researchers working in the field of Tunisian Sentiment Analysis and similar areas.

pdf bib
CAT: Credibility Analysis of Arabic Content on Twitter
Rim El Ballouli | Wassim El-Hajj | Ahmad Ghandour | Shady Elbassuoni | Hazem Hajj | Khaled Shaban

Data generated on Twitter has become a rich source for various data mining tasks. Those data analysis tasks that are dependent on the tweet semantics, such as sentiment analysis, emotion mining, and rumor detection among others, suffer considerably if the tweet is not credible, not real, or spam. In this paper, we perform an extensive analysis on credibility of Arabic content on Twitter. We also build a classification model (CAT) to automatically predict the credibility of a given Arabic tweet. Of particular originality is the inclusion of features extracted directly or indirectly from the author’s profile and timeline. To train and test CAT, we annotated for credibility a data set of 9,000 Arabic tweets that are topic independent. CAT achieved consistent improvements in predicting the credibility of the tweets when compared to several baselines and when compared to the state-of-the-art approach with an improvement of 21% in weighted average F-measure. We also conducted experiments to highlight the importance of the user-based features as opposed to the content-based features. We conclude our work with a feature reduction experiment that highlights the best indicative features of credibility.

pdf bib
A New Error Annotation for Dyslexic texts in Arabic
Maha Alamri | William J Teahan

This paper aims to develop a new classification of errors made in Arabic by those suffering from dyslexia to be used in the annotation of the Arabic dyslexia corpus (BDAC). The dyslexic error classification for Arabic texts (DECA) comprises a list of spelling errors extracted from previous studies and a collection of texts written by people with dyslexia that can provide a framework to help analyse specific errors committed by dyslexic writers. The classification comprises 37 types of errors, grouped into nine categories. The paper also discusses building a corpus of dyslexic Arabic texts that uses the error annotation scheme and provides an analysis of the errors that were found in the texts.

pdf bib
An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems
Hany Ahmed | Mohamed Elaraby | Abdullah M. Mousa | Mostafa Elhosiny | Sherif Abdou | Mohsen Rashwan

In this paper, we introduce an enhancement for speech recognition systems using an unsupervised speaker clustering technique. The proposed technique is mainly based on I-vectors and Self-Organizing Map Neural Network (SOM). The input to the proposed algorithm is a set of speech utterances. For each utterance, we extract a 100-dimensional I-vector, and then SOM is used to group the utterances into different speakers. In our experiments, we compared our technique with Normalized Cross Likelihood ratio Clustering (NCLR). Results show that the proposed technique reduces the speaker error rate in comparison with NCLR. Finally, we experimented with the effect of speaker clustering on Speaker Adaptive Training (SAT) in a speech recognition system implemented to test the performance of the proposed technique. The proposed technique reduced the WER compared with clustering speakers with NCLR.

pdf bib
SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts
Amany Fashwan | Sameh Alansary

This paper sheds light on a system that would be able to diacritize Arabic texts automatically (SHAKKIL). In this system, the diacritization problem will be handled through two levels: morphological and syntactic processing. The adopted morphological disambiguation algorithm depends on four layers: a uni-morphological form layer, a rule-based morphological disambiguation layer, a statistical-based disambiguation layer and an Out Of Vocabulary (OOV) layer. The adopted syntactic disambiguation algorithm is concerned with detecting the case ending diacritics depending on a rule-based approach simulating the shallow parsing technique. This will be achieved using an annotated corpus for extracting the Arabic linguistic rules, building the language models and testing the system output. This system is considered a good trial of the interaction between the rule-based approach and the statistical approach, where the rules can help the statistics in detecting the right diacritization and vice versa. At this point, the morphological Word Error Rate (WER) is 4.56% while the morphological Diacritic Error Rate (DER) is 1.88% and the syntactic WER is 9.36%. The best WER is 14.78%, compared to the best published results of 11.68% (Abandah, 2015), 12.90% (Rashwan et al., 2015) and 13.70% (Metwally, Rashwan, & Atiya, 2016).

pdf bib
Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach
Fahad Albogamy | Allan Ramsay | Hanady Ahmed

In this paper, we propose using a “bootstrapping” method for constructing a dependency treebank of Arabic tweets. This method uses a rule-based parser to create a small treebank of one thousand Arabic tweets and a data-driven parser to create a larger treebank by using the small treebank as a seed training set. We are able to create a dependency treebank from unlabelled tweets without any manual intervention. Experimental results show that this method can improve the speed of training the parser and the accuracy of the resulting parsers.

pdf bib
Identifying Effective Translations for Cross-lingual Arabic-to-English User-generated Speech Search
Ahmad Khwileh | Haithem Afli | Gareth Jones | Andy Way

Cross Language Information Retrieval (CLIR) systems are a valuable tool to enable speakers of one language to search for content of interest expressed in a different language. A group for whom this is of particular interest is bilingual Arabic speakers who wish to search for English language content using information needs expressed in Arabic queries. A key challenge in CLIR is crossing the language barrier between the query and the documents. The most common approach to bridging this gap is automated query translation, which can be unreliable for vague or short queries. In this work, we examine the potential for improving CLIR effectiveness by predicting the translation effectiveness using Query Performance Prediction (QPP) techniques. We propose a novel QPP method to estimate the quality of translation for an Arabic-English Cross-lingual User-generated Speech Search (CLUGS) task. We present an empirical evaluation that demonstrates the quality of our method on alternative translation outputs extracted from an Arabic-to-English Machine Translation system developed for this task. Finally, we show how this framework can be integrated in CLUGS to find relevant translations for improved retrieval performance.

pdf bib
A Characterization Study of Arabic Twitter Data with a Benchmarking for State-of-the-Art Opinion Mining Models
Ramy Baly | Gilbert Badaro | Georges El-Khoury | Rawan Moukalled | Rita Aoun | Hazem Hajj | Wassim El-Hajj | Nizar Habash | Khaled Shaban

Opinion mining in Arabic is a challenging task given the rich morphology of the language. The task becomes more challenging when it is applied to Twitter data, which contains additional sources of noise, such as the use of unstandardized dialectal variations, the nonconformation to grammatical rules, the use of Arabizi and code-switching, and the use of non-text objects such as images and URLs to express opinion. In this paper, we perform an analytical study to observe how such linguistic phenomena vary across different Arab regions. This study of Arabic Twitter characterization aims at providing better understanding of Arabic Tweets, and fostering advanced research on the topic. Furthermore, we explore the performance of the two schools of machine learning on Arabic Twitter, namely the feature engineering approach and the deep learning approach. We consider models that have achieved state-of-the-art performance for opinion mining in English. Results highlight the advantages of using deep learning-based models, and confirm the importance of using morphological abstractions to address Arabic’s complex morphology.

pdf bib
Robust Dictionary Lookup in Multiple Noisy Orthographies
Lingliang Zhang | Nizar Habash | Godfried Toussaint

We present the MultiScript Phonetic Search algorithm to address the problem of language learners looking up unfamiliar words that they heard. We apply it to Arabic dictionary lookup with noisy queries done using both the Arabic and Roman scripts. Our algorithm is based on a computational phonetic distance metric that can be optionally machine learned. To benchmark our performance, we created the ArabScribe dataset, containing 10,000 noisy transcriptions of random Arabic dictionary words. Our algorithm outperforms Google Translate’s “did you mean” feature, as well as the Yamli smart Arabic keyboard.

pdf bib
Arabic POS Tagging: Don’t Abandon Feature Engineering Just Yet
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali | Mohamed Eldesouki

This paper focuses on comparing Support Vector Machine based ranking (SVM-Rank) and Bidirectional Long-Short-Term-Memory (bi-LSTM) neural-network based sequence labeling in building a state-of-the-art Arabic part-of-speech tagging system. Using SVM-Rank leads to state-of-the-art results, but with a fair amount of feature engineering. Using bi-LSTM, particularly when combined with word embeddings, may lead to competitive POS-tagging results by automatically deducing latent linguistic features. However, we show that augmenting bi-LSTM sequence labeling with some of the features that we used for the SVM-Rank based tagger yields further improvements. We also show that gains realized by using embeddings may not be additive with the gains achieved by the features. We are open-sourcing both the SVM-Rank and the bi-LSTM based systems for free.

pdf bib
Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties
Soumia Bougrine | Aicha Chorana | Abdallah Lakhdari | Hadda Cherroun

The success of machine learning for automatic speech processing has raised the need for large-scale datasets. However, collecting such data is often a challenging task as it implies a significant investment of time and money. In this paper, we devise a recipe for building large-scale Speech Corpora by harnessing Web resources, namely YouTube, other Social Media, Online Radio and TV. We illustrate our methodology by building KALAM’DZ, an Arabic spoken corpus dedicated to Algerian dialectal varieties. The preliminary version of our dataset covers all major Algerian dialects. In addition, we make sure that this material takes into account numerous aspects that foster its richness. In fact, we have targeted various speech topics. Some automatic and manual annotations are provided. They gather useful information related to the speakers and sub-dialect information at the utterance level. Our corpus encompasses the 8 major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours segmented in utterances of at least 6 s.

pdf bib
Not All Segments are Created Equal: Syntactically Motivated Sentiment Analysis in Lexical Space
Muhammad Abdul-Mageed

Although there is by now a considerable amount of research on subjectivity and sentiment analysis on morphologically-rich languages, it is still unclear how lexical information can best be modeled in these languages. To bridge this gap, we build effective models exploiting exclusively gold- and machine-segmented lexical input and successfully employ syntactically motivated feature selection to improve classification. Our best models perform significantly above the baselines, with 67.93% and 69.37% accuracies for subjectivity and sentiment classification respectively.

pdf bib
An enhanced automatic speech recognition system for Arabic
Mohamed Amine Menacer | Odile Mella | Dominique Fohr | Denis Jouvet | David Langlois | Kamel Smaili

Automatic speech recognition for Arabic is a very challenging task. Although the classical techniques for Automatic Speech Recognition (ASR) can be efficiently applied to Arabic speech recognition, it is essential to take the specificities of the language into consideration to improve system performance. In this article, we focus on Modern Standard Arabic (MSA) speech recognition. We introduce the challenges related to the Arabic language, namely its complex morphology and the absence of short vowels in written text, which leads to several potential, often conflicting, vowelizations for each grapheme. We develop an ASR system for MSA using the Kaldi toolkit. Several acoustic and language models are trained. We obtain a Word Error Rate (WER) of 14.42 for the baseline system and a 12.2 relative improvement by rescoring the lattice and by rewriting the output with the correct hamza above or below Alif.

pdf bib
Universal Dependencies for Arabic
Dima Taji | Nizar Habash | Daniel Zeman

We describe the process of creating NUDAR, a Universal Dependency treebank for Arabic. We present the conversion from the Penn Arabic Treebank to the Universal Dependency syntactic representation through an intermediate dependency representation. We discuss the challenges faced in the conversion of the trees, the decisions we made to solve them, and the validation of our conversion. We also present initial parsing results on NUDAR.

pdf bib
A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic
Mohamed Al-Badrashiny | Abdelati Hawwari | Mona Diab

In this paper we present a system for automatic Arabic text diacritization using three levels of analysis granularity in a layered back-off manner. We build and exploit diacritized language models (LM) for each of three different levels of granularity: surface form, morphologically segmented into prefix/stem/suffix, and character level. For each of the passes, we use Viterbi search to pick the most probable diacritization per word in the input. We start with the surface form LM, followed by the morphological level, and finally leverage the character level LM. Our system outperforms all of the published systems evaluated against the same training and test data. It achieves a 10.87% WER for complete full diacritization including lexical and syntactic diacritization, and 3.0% WER for lexical diacritization, ignoring syntactic diacritization.

pdf bib
Arabic Textual Entailment with Word Embeddings
Nada Almarwani | Mona Diab

Determining the textual entailment between texts is important in many NLP tasks, such as summarization, question answering, and information extraction and retrieval. Various methods have been suggested based on external knowledge sources; however, such resources are not always available in all languages and their acquisition is typically laborious and very costly. Distributional word representations such as word embeddings learned over large corpora have been shown to capture syntactic and semantic word relationships. Such models have contributed to improving the performance of several NLP tasks. In this paper, we address the problem of textual entailment in Arabic. We employ both traditional features and distributional representations. Crucially, we do not depend on any external resources in the process. Our suggested approach yields state-of-the-art performance on a standard data set, ArbTE, achieving an accuracy of 76.2% compared to the previous state of the art of 69.3%.

pdf (full)
bib (full)
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

pdf bib
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber

pdf bib
Toward Pan-Slavic NLP: Some Experiments with Language Adaptation
Serge Sharoff

There is great variation in the amount of NLP resources available for Slavonic languages. For example, the Universal Dependency treebank (Nivre et al., 2016) has about 2 MW of training resources for Czech and more than 1 MW for Russian, but only 950 words for Ukrainian and nothing for Belorussian, Bosnian or Macedonian. Similarly, the Autodesk Machine Translation dataset only covers three Slavonic languages (Czech, Polish and Russian). In this talk I will discuss a general approach, which can be called Language Adaptation, by analogy with Domain Adaptation. In this approach, a model for a particular language processing task is built by lexical transfer of cognate words and by learning a new feature representation for a lesser-resourced (recipient) language starting from a better-resourced (donor) language. More specifically, I will demonstrate how language adaptation works in such training scenarios as Translation Quality Estimation, Part-of-Speech tagging and Named Entity Recognition.

pdf bib
Clustering of Russian Adjective-Noun Constructions using Word Embeddings
Andrey Kutuzov | Elizaveta Kuzmenko | Lidia Pivovarova

This paper presents a method of automatic construction extraction from a large corpus of Russian. The term ‘construction’ here means a multi-word expression in which a variable can be replaced with another word from the same semantic class, for example, ‘a glass of [water/juice/milk]’. We deal with constructions that consist of a noun and its adjective modifier. We propose a method of grouping such constructions into semantic classes via 2-step clustering of word vectors in distributional models. We compare it with other clustering techniques and evaluate it against A Russian-English Collocational Dictionary of the Human Body, which contains manually annotated groups of constructions with nouns denoting human body parts. The best performing method is used to cluster all adjective-noun bigrams in the Russian National Corpus. The results of this procedure are publicly available and can be used for building a Russian construction dictionary as well as for accelerating theoretical studies of constructions.

pdf bib
A Preliminary Study of Croatian Lexical Substitution
Domagoj Alagić | Jan Šnajder

Lexical substitution is the task of determining a meaning-preserving replacement for a word in context. We report on a preliminary study of this task for the Croatian language on a small-scale lexical sample dataset, manually annotated using three different annotation schemes. We compare the annotations, analyze the inter-annotator agreement, and observe a number of interesting language-specific details in the obtained lexical substitutes. Furthermore, we apply a recently proposed, dependency-based lexical substitution model to our dataset. The model achieves a P@3 score of 0.35, which indicates the difficulty of the task.

pdf bib
Projecting Multiword Expression Resources on a Polish Treebank
Agata Savary | Jakub Waszczuk

Multiword expressions (MWEs) are linguistic objects containing two or more words and showing idiosyncratic behavior at different levels. Treebanks with annotated MWEs enable studies of such properties, as well as training and evaluation of MWE-aware parsers. However, few treebanks contain full-fledged MWE annotations. We show how this gap can be bridged in Polish by projecting 3 MWE resources on a constituency treebank.

pdf bib
Lexicon Induction for Spoken Rusyn – Challenges and Results
Achim Rabus | Yves Scherrer

This paper reports on challenges and results in developing NLP resources for spoken Rusyn. Being a Slavic minority language, Rusyn does not have any resources to make use of. We propose to build a morphosyntactic dictionary for Rusyn, combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn by using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which results in a tagging recall increased by 11.6% relative (9.1% absolute). Our research confirms and expands the results of previous studies showing the efficiency of using NLP resources from neighboring languages for low-resourced languages.

pdf bib
The Universal Dependencies Treebank for Slovenian
Kaja Dobrovoljc | Tomaž Erjavec | Simon Krek

This paper introduces the Universal Dependencies Treebank for Slovenian. We overview the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2. We explain the mapping of part-of-speech categories, morphosyntactic features, and the dependency relations, focusing on the more problematic language-specific issues. We conclude with a quantitative overview of the treebank and directions for further work.

pdf bib
Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages
Tanja Samardžić | Mirjana Starović | Željko Agić | Nikola Ljubešić

The paper documents the procedure of building a new Universal Dependencies (UDv2) treebank for Serbian starting from an existing Croatian UDv1 treebank and taking into account the other Slavic UD annotation guidelines. We describe the automatic and manual annotation procedures, discuss the annotation of Slavic-specific categories (case governing quantifiers, reflexive pronouns, question particles) and propose an approach to handling deverbal nouns in Slavic languages.

pdf bib
Spelling Correction for Morphologically Rich Language: a Case Study of Russian
Alexey Sorokin

We present an algorithm for automatic correction of spelling errors on the sentence level, which uses a noisy channel model and feature-based reranking of hypotheses. Our system is designed for Russian and clearly outperforms the winner of the SpellRuEval-2016 competition. We show that language model size has the greatest influence on spelling correction quality. We also experiment with different types of features and show that morphological and semantic information also improves the accuracy of spellchecking.

pdf bib
Debunking Sentiment Lexicons: A Case of Domain-Specific Sentiment Classification for Croatian
Paula Gombar | Zoran Medić | Domagoj Alagić | Jan Šnajder

Sentiment lexicons are widely used as an intuitive and inexpensive way of tackling sentiment classification, often within a simple lexicon word-counting approach or as part of a supervised model. However, it is an open question whether these approaches can compete with supervised models that use only word-representation features. We address this question in the context of domain-specific sentiment classification for Croatian. We experiment with the graph-based acquisition of sentiment lexicons, analyze their quality, and investigate how effectively they can be used in sentiment classification. Our results indicate that, even with as few as 500 labeled instances, a supervised model substantially outperforms a word-counting model. We also observe that adding lexicon-based features does not significantly improve supervised sentiment classification.

pdf bib
Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text
Nikola Ljubešić | Tomaž Erjavec | Darja Fišer

In this paper we present the adaptations of a state-of-the-art tagger for South Slavic languages to non-standard texts, using the Slovene language as an example. We investigate the impact of introducing in-domain training data as well as additional supervision through external resources or tools like word clusters and word normalization. We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error. The final configuration achieves a tagging accuracy of 87.41% on the full morphosyntactic description, which is, nevertheless, still quite far from the accuracy of 94.27% achieved on standard text.

pdf bib
Comparison of Short-Text Sentiment Analysis Methods for Croatian
Leon Rotim | Jan Šnajder

We focus on the task of supervised sentiment classification of short and informal texts in Croatian, using two simple yet effective methods: word embeddings and string kernels. We investigate whether word embeddings offer any advantage over corpus- and preprocessing-free string kernels, and how these compare to bag-of-words baselines. We conduct a comparison on three different datasets, using different preprocessing methods and kernel functions. Results show that, on two out of three datasets, word embeddings outperform string kernels, which in turn outperform word and n-gram bag-of-words baselines.

pdf bib
The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber

This paper describes the outcomes of the first challenge on multilingual named entity recognition that aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching. It was organised in the context of the 6th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2017 conference. Although eleven teams signed up for the evaluation, due to the complexity of the task(s) and the short time available for elaborating a solution, only two teams submitted results on time. The reported evaluation figures reflect the relatively higher level of complexity of named entity-related tasks in the context of processing texts in Slavic languages. Since the duration of the challenge goes beyond the date of the publication of this paper, an updated picture of the participating systems and their corresponding performance can be found on the web page of the challenge.

pdf bib
Liner2 — a Generic Framework for Named Entity Recognition
Michał Marcińczuk | Jan Kocoń | Marcin Oleksy

In this paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.

pdf bib
Language-Independent Named Entity Analysis Using Parallel Projection and Rule-Based Disambiguation
James Mayfield | Paul McNamee | Cash Costello

The 2017 shared task at the Balto-Slavic NLP workshop requires identifying coarse-grained named entities in seven languages, identifying each entity’s base form, and clustering name mentions across the multilingual set of documents. The fact that no training data is provided to systems for building supervised classifiers further adds to the complexity. To complete the task we first use publicly available parallel texts to project named entity recognition capability from English to each evaluation language. We ignore entirely the subtask of identifying non-inflected forms of names. Finally, we create cross-document entity identifiers by clustering named mentions using a procedure-based approach.

pdf bib
Comparison of String Similarity Measures for Obscenity Filtering
Ekaterina Chernyak

In this paper we address the problem of filtering obscene lexis in Russian texts. We use string similarity measures to find words similar or identical to words from a stop list and establish both a test collection and a baseline for the task. Our experiments show that a novel string similarity measure based on the notion of an annotated suffix tree outperforms some of the other well known measures.

pdf bib
Stylometric Analysis of Parliamentary Speeches: Gender Dimension
Justina Mandravickaitė | Tomas Krilavičius

The relation between gender and language has been studied by many authors; however, there is still some uncertainty regarding the influence of gender on language usage in the professional environment. Often, the studied data sets are too small or the texts of individual authors are too short to capture gender-related differences in language usage successfully. This study draws on a larger corpus of transcripts of speeches in the Lithuanian Parliament (1990-2013) to explore language differences in political debates by gender via stylometric analysis. The experimental setup consists of stylistic features that indicate lexical style and do not require external linguistic tools, namely the most frequent words, in combination with unsupervised machine learning algorithms. Results show that gender differences in language use persist in the professional environment, not only in the usage of function words and preferred linguistic constructions, but in the presented topics as well.

pdf bib
Towards Never Ending Language Learning for Morphologically Rich Languages
Kseniya Buraya | Lidia Pivovarova | Sergey Budkov | Andrey Filchenkov

This work deals with ontology learning from unstructured Russian text. We implement one of the components of the Never Ending Language Learner and introduce extensions to the algorithm aimed at capturing the specificity of a morphologically rich free-word-order language. We demonstrate that this method may be successfully applied to Russian data. In addition, we perform several experiments comparing different settings of the training process. We demonstrate that utilizing morphological features significantly improves the system’s precision, while using seed patterns helps to improve coverage.

pdf bib
Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style
Ben Verhoeven | Iza Škrjanec | Senja Pollak

To our knowledge, we present results of the first gender classification experiments on Slovene text. Inspired by the TwiSty corpus and experiments (Verhoeven et al., 2016), we employed the Janes corpus (Erjavec et al., 2016) and its gender annotations to perform gender classification experiments on Twitter text, comparing a token-based and a lemma-based approach. We find that the token-based approach (92.6% accuracy), containing gender markings related to the author, outperforms the lemma-based approach by about 5%. Especially in the lemmatized version, we also observe stylistic and content-based differences in writing between men (e.g. more profane language, numerals and beer mentions) and women (e.g. more pronouns, emoticons and character flooding). Many of our findings corroborate previous research on other languages.

pdf (full)
bib (full)
Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017)

pdf bib
Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017)
Maciej Ogrodniczuk | Vincent Ng

pdf bib
Use Generalized Representations, But Do Not Forget Surface Features
Nafise Sadat Moosavi | Michael Strube

Only a year ago, all state-of-the-art coreference resolvers were using an extensive amount of surface features. Recently, there was a paradigm shift towards using word embeddings and deep neural networks, where the use of surface features is very limited. In this paper, we show that a simple SVM model with surface features outperforms more complex neural models for detecting anaphoric mentions. Our analysis suggests that generalized representations and surface features have different strengths that should both be taken into account for improving coreference resolution.

pdf bib
Enriching Basque Coreference Resolution System using Semantic Knowledge sources
Ander Soraluze | Olatz Arregi | Xabier Arregi | Arantza Díaz de Ilarraza

In this paper we present a Basque coreference resolution system enriched with semantic knowledge. An error analysis revealed the deficiencies of the system in resolving coreference cases in which semantic or world knowledge is needed. We attempt to address these deficiencies using two semantic knowledge sources, specifically Wikipedia and WordNet.

pdf bib
Improving Polish Mention Detection with Valency Dictionary
Maciej Ogrodniczuk | Bartłomiej Nitoń

This paper presents the results of an experiment integrating information from a valency dictionary of Polish into a mention detection system. Two types of information are acquired: positions of syntactic schemata for nominal and verbal constructs, and secondary prepositions present in schemata. The syntactic schemata are used to prevent (for verbal realizations) or encourage (for nominal groups) constructing mentions from phrases filling multiple schema positions, while the secondary prepositions are used to filter out artificial mentions created from their nominal components. Mention detection is evaluated against the manual annotation of the Polish Coreference Corpus in two settings: taking into account only mention heads or exact borders.

pdf bib
A Google-Proof Collection of French Winograd Schemas
Pascal Amsili | Olga Seminck

This article presents the first collection of French Winograd Schemas. Winograd Schemas form anaphora resolution problems that can only be resolved with extensive world knowledge. For this reason the Winograd Schema Challenge has been proposed as an alternative to the Turing Test. A very important feature of Winograd Schemas is that it should be impossible to resolve them with statistical information about word co-occurrences: they should be Google-proof. We propose a measure of Google-proofness based on Mutual Information, and demonstrate the method on our collection of French Winograd Schemas.

pdf bib
Using Coreference Links to Improve Spanish-to-English Machine Translation
Lesly Miculicich Werlen | Andrei Popescu-Belis

In this paper, we present a proof-of-concept implementation of a coreference-aware decoder for document-level machine translation. We consider that better translations should have coreference links that are closer to those in the source text, and implement this criterion in two ways. First, we define a similarity measure between source and target coreference structures, by projecting the target ones onto the source and reusing existing coreference metrics. Based on this similarity measure, we re-rank the translation hypotheses of a baseline system for each sentence. Alternatively, to address the lack of diversity of mentions in the MT hypotheses, we focus on mention pairs and integrate their coreference scores with MT ones, resulting in post-editing decisions for mentions. The experimental results for Spanish to English MT on the AnCora-ES corpus show that the second approach yields a substantial increase in the accuracy of pronoun translation, with BLEU scores remaining constant.

pdf bib
Multi-source annotation projection of coreference chains: assessing strategies and testing opportunities
Yulia Grishina | Manfred Stede

In this paper, we examine the possibility of using annotation projection from multiple sources for automatically obtaining coreference annotations in the target language. We implement a multi-source annotation projection algorithm and apply it on an English-German-Russian parallel corpus in order to transfer coreference chains from two sources to the target side. Operating in two settings – a low-resource and a more linguistically-informed one – we show that automatic coreference transfer could benefit from combining information from multiple languages, and assess the quality of both the extraction and the linking of target coreference mentions.

pdf bib
CORBON 2017 Shared Task: Projection-Based Coreference Resolution
Yulia Grishina

The CORBON 2017 Shared Task, organised as part of the Coreference Resolution Beyond OntoNotes workshop at EACL 2017, presented a new challenge for multilingual coreference resolution: we offer a projection-based setting in which one is supposed to build a coreference resolver for a new language exploiting little or even no knowledge of it, with our languages of interest being German and Russian. We additionally offer a more traditional setting, targeting the development of a multilingual coreference resolver without any restrictions on the resources and methods used. In this paper, we describe the task setting and provide the results of one participant who successfully completed the task, comparing their results to the closely related previous research. Analysing the task setting and the results, we discuss the major challenges and make suggestions on the future directions of coreference evaluation.

pdf bib
Projection-based Coreference Resolution Using Deep Syntax
Michal Novák | Anna Nedoluzhko | Zdeněk Žabokrtský

The paper describes a system for coreference resolution in German and Russian, trained exclusively on coreference relations projected through a parallel corpus from English. The resolver operates on the level of deep syntax and makes use of multiple specialized models. It achieves 32 and 22 points in terms of CoNLL score for Russian and German, respectively. Analysis of the evaluation results shows that the resolver for Russian is able to preserve 66% of the English resolver’s quality in terms of CoNLL score. The system was submitted to the Closed track of the CORBON 2017 Shared Task.

pdf (full)
bib (full)
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

pdf bib
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing
Dirk Hovy | Shannon Spruit | Margaret Mitchell | Emily M. Bender | Michael Strube | Hanna Wallach

pdf bib
Gender as a Variable in Natural-Language Processing: Ethical Considerations
Brian Larson

Researchers and practitioners in natural-language processing (NLP) and related fields should attend to ethical principles in study design, ascription of categories/variables to study participants, and reporting of findings or results. This paper discusses theoretical and ethical frameworks for using gender as a variable in NLP studies and proposes four guidelines for researchers and practitioners. The principles outlined here should guide practitioners, researchers, and peer reviewers, and they may be applicable to other social categories, such as race, applied to human beings connected to NLP research.

pdf bib
These are not the Stereotypes You are Looking For: Bias and Fairness in Authorial Gender Attribution
Corina Koolen | Andreas van Cranenburgh

Stylometric and text categorization results show that author gender can be discerned in texts with relatively high accuracy. However, it is difficult to explain what gives rise to these results and there are many possible confounding factors, such as the domain, genre, and target audience of a text. More fundamentally, such classification efforts risk invoking stereotyping and essentialism. We explore this issue in two datasets of Dutch literary novels, using commonly used descriptive (LIWC, topic modeling) and predictive (machine learning) methods. Our results show the importance of controlling for variables in the corpus and we argue for taking care not to overgeneralize from the results.

pdf bib
A Quantitative Study of Data in the NLP community
Margot Mieskes

We present the results of a quantitative analysis of how publications in the NLP domain handle the collection, publication and availability of research data. We find that a wide range of publications rely on data crawled from the web, but few give details on how potentially sensitive data was treated. Additionally, we find that while links to repositories of data are given, they often do not work even a short time after publication. We put together several suggestions on how to improve this situation based on publications from the NLP domain, but also other research areas.

pdf bib
Ethical by Design: Ethics Best Practices for Natural Language Processing
Jochen L. Leidner | Vassilis Plachouras

Natural language processing (NLP) systems analyze and/or generate human language, typically on users’ behalf. One natural and necessary question that needs to be addressed in this context, both in research projects and in production settings, is the question of how ethical the work is, both regarding the process and its outcome. Towards this end, we articulate a set of issues, propose a set of best practices, notably a process featuring an ethics review board, and sketch how they could be meaningfully applied. Our main argument is that ethical outcomes ought to be achieved by design, i.e. by following a process aligned with ethical values. We also offer some response options for those facing ethics issues. While a number of previous works exist that discuss ethical issues, in particular around big data and machine learning, to the authors’ knowledge this is the first account of NLP and ethics from the perspective of a principled process.

pdf bib
Building Better Open-Source Tools to Support Fairness in Automated Scoring
Nitin Madnani | Anastassia Loukina | Alina von Davier | Jill Burstein | Aoife Cahill

Automated scoring of written and spoken responses is an NLP application that can significantly impact lives especially when deployed as part of high-stakes tests such as the GRE® and the TOEFL®. Ethical considerations require that automated scoring algorithms treat all test-takers fairly. The educational measurement community has done significant research on fairness in assessments and automated scoring systems must incorporate their recommendations. The best way to do that is by making available automated, non-proprietary tools to NLP researchers that directly incorporate these recommendations and generate the analyses needed to help identify and resolve biases in their scoring systems. In this paper, we attempt to provide such a solution.

pdf bib
Gender and Dialect Bias in YouTube’s Automatic Captions
Rachael Tatman

This project evaluates the accuracy of YouTube’s automatically-generated captions across two genders and five dialect groups. Speakers’ dialect and gender were controlled for by using videos uploaded as part of the “accent tag challenge”, where speakers explicitly identify their language background. The results show robust differences in accuracy across both gender and dialect, with lower accuracy for 1) women and 2) speakers from Scotland. This finding builds on earlier research finding that a speaker’s sociolinguistic identity may negatively impact their ability to use automatic speech recognition, and demonstrates the need for sociolinguistically-stratified validation of systems.

pdf bib
Integrating the Management of Personal Data Protection and Open Science with Research Ethics
Dave Lewis | Joss Moorkens | Kaniz Fatema

We examine the impact of the EU General Data Protection Regulation and the push from research funders to provide open access research data on the current practices in Language Technology Research. We analyse the challenges that arise and the opportunities to address many of them through the use of existing open data practices. We discuss the impact of this also on current practice in research ethics.

pdf bib
Ethical Considerations in NLP Shared Tasks
Carla Parra Escartín | Wessel Reijers | Teresa Lynn | Joss Moorkens | Andy Way | Chao-Hong Liu

Shared tasks are increasingly common in our field, and new challenges are suggested at almost every conference and workshop. However, as this has become an established way of pushing research forward, it is important to discuss how we researchers organise and participate in shared tasks, and make that information available to the community to allow further research improvements. In this paper, we present a number of ethical issues along with other areas of concern that are related to the competitive nature of shared tasks. As such issues could potentially impact research ethics in the Natural Language Processing community, we also propose the development of a framework for the organisation of and participation in shared tasks that can help mitigate these issues.

pdf bib
Social Bias in Elicited Natural Language Inferences
Rachel Rudinger | Chandler May | Benjamin Van Durme

We analyze the Stanford Natural Language Inference (SNLI) corpus in an investigation of bias and stereotyping in NLP data. The SNLI human-elicitation protocol makes it prone to amplifying bias and stereotypical associations, which we demonstrate statistically (using pointwise mutual information) and with qualitative examples.
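
As a rough illustration of the statistic named in this abstract, the sketch below computes pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), over co-occurring word pairs. The function name `pmi_table` and the placeholder tokens are assumptions made for illustration; the abstract does not say how the authors counted co-occurrences or whether they applied smoothing.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Return PMI(x, y) = log2(p(x, y) / (p(x) * p(y))) for each observed pair."""
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    total = len(pairs)
    return {
        (x, y): math.log2((n / total) / ((x_counts[x] / total) * (y_counts[y] / total)))
        for (x, y), n in pair_counts.items()
    }

# Placeholder tokens (hypothetical): first element stands for a premise word,
# second for a hypothesis word that co-occurs with it.
toy_pairs = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
scores = pmi_table(toy_pairs)
```

A positive score means the pair co-occurs more often than independence would predict, and a negative score means less often. Unsmoothed PMI is unstable for rare words, so counts this small are only useful for checking the arithmetic.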

pdf bib
A Short Review of Ethical Challenges in Clinical Natural Language Processing
Simon Šuster | Stéphan Tulkens | Walter Daelemans

Clinical NLP has an immense potential in contributing to how clinical practice will be revolutionized by the advent of large scale processing of clinical records. However, this potential has remained largely untapped due to slow progress primarily caused by strict data access policies for researchers. In this paper, we discuss the concern for privacy and the measures it entails. We also suggest sources of less sensitive data. Finally, we draw attention to biases that can compromise the validity of empirical research and lead to socially harmful applications.

pdf bib
Goal-Oriented Design for Ethical Machine Learning and NLP
Tyler Schnoebelen

The argument made in this paper is that to act ethically in machine learning and NLP requires focusing on goals. NLP projects are often classificatory systems that deal with human subjects, which means that goals from people affected by the systems should be included. The paper takes as its core example a model that detects criminality, showing the problems of training data, categories, and outcomes. The paper is oriented to the kinds of critiques on power and the reproduction of inequality that are found in social theory, but it also includes concrete suggestions on how to put goal-oriented design into practice.

pdf bib
Ethical Research Protocols for Social Media Health Research
Adrian Benton | Glen Coppersmith | Mark Dredze

Social media have transformed data-driven research in political science, the social sciences, health, and medicine. Since health research often touches on sensitive topics that relate to ethics of treatment and patient privacy, similar ethical considerations should be acknowledged when using social media data in health research. While much has been said regarding the ethical considerations of social media research, health research leads to an additional set of concerns. We provide practical suggestions in the form of guidelines for researchers working with social media data in health research. These guidelines can inform an IRB proposal for researchers new to social media health research.

pdf bib
Say the Right Thing Right: Ethics Issues in Natural Language Generation Systems
Charese Smiley | Frank Schilder | Vassilis Plachouras | Jochen L. Leidner

We discuss the ethical implications of Natural Language Generation systems. We use one particular system as a case study to identify and classify issues, and we provide an ethics checklist, in the hope that future system designers may benefit from conducting their own ethics reviews based on our checklist.

pdf (full)
bib (full)
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

pdf bib
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Stella Markantonatou | Carlos Ramisch | Agata Savary | Veronika Vincze

pdf bib
ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs
Petra Barančíková | Václava Kettnerová

We present a new freely available dictionary of paraphrases of Czech complex predicates with light verbs, ParaDi. Candidates for single predicative paraphrases of selected complex predicates have been extracted automatically from large monolingual data using word2vec. They have been manually verified and further refined. We demonstrate one of many possible applications of ParaDi in an experiment with improving machine translation quality.

pdf bib
Multi-word Entity Classification in a Highly Multilingual Environment
Sophie Chesney | Guillaume Jacquet | Ralf Steinberger | Jakub Piskorski

This paper describes an approach for the classification of millions of existing multi-word entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. In order to classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained and tested distantly-supervised classifiers in 43 languages based on MWEntities extracted from BabelNet. The best-performing classifier was the multi-class SVM using a TF.IDF-weighted data representation. Interestingly, one unique classifier trained on a mix of all languages consistently performed better than classifiers trained for individual languages, reaching an averaged F1-value of 88.8%. In this paper, we present the training and test data, including a human evaluation of its accuracy, describe the methods used to train the classifiers, and discuss the results.

pdf bib
Using bilingual word-embeddings for multilingual collocation extraction
Marcos Garcia | Marcos García-Salido | Margarita Alonso-Ramos

This paper presents a new strategy for multilingual collocation extraction which takes advantage of parallel corpora to learn bilingual word-embeddings. Monolingual collocation candidates are retrieved using Universal Dependencies, while the distributional models are then applied to search for equivalents of the elements of each collocation in the target languages. The proposed method extracts not only collocation equivalents with direct translation between languages, but also other cases where the collocations in the two languages are not literal translations of each other. Several experiments (evaluating collocations with three syntactic patterns) in English, Spanish, and Portuguese show that our approach can effectively extract large sets of bilingual equivalents with an average precision of about 90%. Moreover, preliminary results on comparable corpora suggest that the distributional models can be applied for identifying new bilingual collocations in different domains.
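
The lookup step described in this abstract, searching a bilingual embedding space for equivalents of each element of a collocation, might be sketched as a nearest-neighbour search by cosine similarity. The helper names and the two-dimensional vectors are hypothetical stand-ins, not the authors' models or data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, source_vecs, target_vecs):
    """Return the target-language word whose vector is closest to the source word's vector."""
    query = source_vecs[word]
    return max(target_vecs, key=lambda t: cosine(query, target_vecs[t]))

# Hand-made 2-d vectors standing in for a shared bilingual space (assumption).
english = {"take": [0.9, 0.1], "decision": [0.1, 0.9]}
spanish = {"tomar": [0.8, 0.2], "decisión": [0.2, 0.8]}
equivalent = tuple(nearest(w, english, spanish) for w in ("take", "decision"))
```

Translating each element separately is what lets such a search propose equivalents that are not word-for-word translations, provided the embeddings place the non-literal counterpart closest to the source word.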

pdf bib
The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Agata Savary | Carlos Ramisch | Silvio Cordeiro | Federico Sangati | Veronika Vincze | Behrang QasemiZadeh | Marie Candito | Fabienne Cap | Voula Giouli | Ivelina Stoyanova | Antoine Doucet

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

pdf bib
USzeged: Identifying Verbal Multiword Expressions with POS Tagging and Parsing Techniques
Katalin Ilona Simkó | Viktória Kovács | Veronika Vincze

The paper describes our system submitted for the Workshop on Multiword Expressions’ shared task on automatic identification of verbal multiword expressions. It uses POS tagging and dependency parsing to identify single- and multi-token verbal MWEs in text. Our system is language independent and competed on nine of the eighteen languages. Our paper describes how our system works and gives its error analysis for the languages it was submitted for.

pdf bib
Parsing and MWE Detection: Fips at the PARSEME Shared Task
Vasiliki Foufi | Luka Nerima | Éric Wehrli

Identifying multiword expressions (MWEs) in a sentence, in order to ensure their proper processing in subsequent applications such as machine translation, and performing the syntactic analysis of the sentence are interrelated processes. In our approach, priority is given to parsing alternatives involving collocations, so that collocational information helps the parser through the maze of alternatives, with the aim of substantially improving the performance of both tasks (collocation identification and parsing) and of a subsequent task (machine translation). In this paper, we present our system and the procedure that we followed in order to participate in the open track of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) in running texts.

pdf bib
Neural Networks for Multi-Word Expression Detection
Natalia Klyueva | Antoine Doucet | Milan Straka

In this paper we describe the MUMULS system that participated in the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks using the open source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages, and it was one of the few systems that could categorize VMWE types in nearly all languages.

pdf bib
Factoring Ambiguity out of the Prediction of Compositionality for German Multi-Word Expressions
Stefan Bott | Sabine Schulte im Walde

Ambiguity represents an obstacle for distributional semantic models (DSMs), which typically subsume the contexts of all word senses within one vector. While individual vector space approaches have been concerned with sense discrimination (e.g., Schütze 1998, Erk 2009, Erk and Pado 2010), such discrimination has rarely been integrated into DSMs across semantic tasks. This paper presents a soft-clustering approach to sense discrimination that filters sense-irrelevant features when predicting the degrees of compositionality for German noun-noun compounds and German particle verbs.

pdf bib
Multiword expressions and lexicalism: the view from LFG
Jamie Y. Findlay

Multiword expressions (MWEs) pose a problem for lexicalist theories like Lexical Functional Grammar (LFG), since they are prima facie counterexamples to a strong form of the lexical integrity principle, which entails that a lexical item can only be realised as a single, syntactically atomic word. In this paper, I demonstrate some of the problems facing any strongly lexicalist account of MWEs, and argue that the lexical integrity principle must be weakened. I conclude by sketching a formalism which integrates a Tree Adjoining Grammar into the LFG architecture, taking advantage of this relaxation.

pdf bib
Understanding Idiomatic Variation
Kristina Geeraert | R. Harald Baayen | John Newman

This study investigates the processing of idiomatic variants through an eye-tracking experiment. Four types of idiom variants were included, in addition to the canonical form and the literal meaning. Results suggest that modifications to idioms, modulo obvious effects of length differences, are not more difficult to process than the canonical forms themselves. This fits with recent corpus findings.

pdf bib
Discovering Light Verb Constructions and their Translations from Parallel Corpora without Word Alignment
Natalie Vargas | Carlos Ramisch | Helena Caseli

We propose a method for joint unsupervised discovery of multiword expressions (MWEs) and their translations from parallel corpora. First, we apply independent monolingual MWE extraction in source and target languages simultaneously. Then, we calculate translation probability, association score and distributional similarity of co-occurring pairs. Finally, we rank all translations of a given MWE using a linear combination of these features. Preliminary experiments on light verb constructions show promising results.
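
The final ranking step in this abstract, a linear combination of translation probability, association score and distributional similarity, can be sketched as follows. The weights, feature values and candidate names are invented for illustration; the abstract does not state how the authors set the coefficients.

```python
def rank_translations(candidates, weights):
    """Sort candidate translations by a weighted sum of their feature values, best first."""
    def score(features):
        return sum(weights[name] * value for name, value in features.items())
    return sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)

# Hypothetical weights and feature values for three candidate translations of one MWE.
weights = {"translation_prob": 0.5, "association": 0.3, "similarity": 0.2}
candidates = {
    "cand_a": {"translation_prob": 0.6, "association": 0.2, "similarity": 0.5},
    "cand_b": {"translation_prob": 0.3, "association": 0.9, "similarity": 0.4},
    "cand_c": {"translation_prob": 0.1, "association": 0.1, "similarity": 0.1},
}
ranking = rank_translations(candidates, weights)
```

With these made-up numbers, `cand_b` outranks `cand_a` despite a lower translation probability, because its association score is much higher. This kind of trade-off is what combining the features allows.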

pdf bib
Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach
Justina Mandravickaitė | Tomas Krilavičius

We discuss an experiment on automatic identification of bi-gram multi-word expressions in parallel Latvian and Lithuanian corpora. Raw corpora, lexical association measures (LAMs) and supervised machine learning (ML) are used because of the scarcity and limited quality of lexical resources and tools (e.g., POS-tagger, parser). Combining LAMs with ML, which is rather effective for other languages, has shown good results for Lithuanian and Latvian as well: we have achieved 92.4% precision and 52.2% recall for Latvian and 95.1% precision and 77.8% recall for Lithuanian.

pdf bib
Show Me Your Variance and I Tell You Who You Are - Deriving Compound Compositionality from Word Alignments
Fabienne Cap

We use word alignment variance as an indicator for the non-compositionality of German and English noun compounds. Our work-in-progress results are on their own not competitive with state-of-the-art approaches, but they show that alignment variance is correlated with compositionality and thus worth a closer look in the future.

pdf bib
Semantic annotation to characterize contextual variation in terminological noun compounds: a pilot study
Melania Cabezas-García | Antonio San Martín

Noun compounds (NCs) are semantically complex and, contrary to what is often assumed, not fully compositional. This paper presents a pilot study regarding the semantic annotation of environmental NCs with a view to accessing their semantics and exploring their domain-based contextual variation. Our results showed that the semantic annotation of NCs afforded important insights into how context impacts their conceptualization.

pdf bib
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking
Alfredo Maldonado | Lifeng Han | Erwan Moreau | Ashjan Alsulaimani | Koel Dutta Chowdhury | Carl Vogel | Qun Liu

A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.

pdf bib
A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
Tiberiu Boros | Sonia Pipa | Verginica Barbu Mititelu | Dan Tufis

“Multiword expressions” are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions are the subgroup of multiword expressions in which a verb is the syntactic head of the group when considered in its canonical (or dictionary) form. All multiword expressions are a great challenge for natural language processing, but the verbal ones are particularly interesting for tasks such as parsing, as the verb is the central element in the syntactic organization of a sentence. In this paper we introduce our data-driven approach to verbal multiword expressions, which was objectively validated during the PARSEME shared task on verbal multiword expression identification. We tested our approach on 12 languages, and we provide detailed information about corpora composition, feature selection process, validation procedure and performance on all languages.

pdf bib
The ATILF-LLF System for Parseme Shared Task: a Transition-based Verbal Multiword Expression Tagger
Hazem Al Saied | Matthieu Constant | Marie Candito

We describe the ATILF-LLF system built for the MWE 2017 Shared Task on automatic identification of verbal multiword expressions. We participated in the closed track only, for all the 18 available languages. Our system is a robust greedy transition-based system, in which MWEs are identified through a MERGE transition. The system was meant to accommodate the variety of linguistic resources provided for each language, in terms of accompanying morphological and syntactic information. Using per-MWE F-score, the system was ranked first for all but two languages (Hungarian and Romanian).

pdf bib
Investigating the Opacity of Verb-Noun Multiword Expression Usages in Context
Shiva Taslimipoor | Omid Rohanian | Ruslan Mitkov | Afsaneh Fazly

This study investigates the supervised token-based identification of Multiword Expressions (MWEs). This ongoing research exploits the information contained in the contexts in which different instances of an expression can occur. This information is used to investigate whether an expression is used literally or as an MWE. Lexical and syntactic context features derived from vector representations are shown to be more effective than traditional statistical measures for identifying tokens of MWEs.

pdf bib
Compositionality in Verb-Particle Constructions
Archna Bhatia | Choh Man Teng | James Allen

We are developing a broad-coverage deep semantic lexicon for a system that parses sentences into a logical form expressed in a rich ontology that supports reasoning. In this paper we look at verb-particle constructions (VPCs), and the extent to which they can be treated compositionally vs idiomatically. First we distinguish between the different types of VPCs based on their compositionality and then present a set of heuristics for classifying specific instances as compositional or not. We then identify a small set of general sense classes for particles when used compositionally and discuss the resulting lexical representations that are being added to the lexicon. By treating VPCs as compositional whenever possible, we attain broad coverage in a compact way, and also enable interpretations of novel VPC usages not explicitly present in the lexicon.

pdf bib
Rule-Based Translation of Spanish Verb-Noun Combinations into Basque
Uxoa Iñurrieta | Itziar Aduriz | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola

This paper presents a method to improve the translation of Verb-Noun Combinations (VNCs) in a rule-based Machine Translation (MT) system for Spanish-Basque. Linguistic information about a set of VNCs is gathered from the public database Konbitzul and integrated into the MT system, leading to an improvement in BLEU, NIST and TER scores, as well as to results that human evaluators judged to be clearly better.

pdf bib
Verb-Particle Constructions in Questions
Veronika Vincze

In this paper, we investigate the behavior of verb-particle constructions in English questions. We present a small dataset that contains questions and verb-particle construction candidates. We demonstrate that there are significant differences in the distribution of WH-words, verbs and prepositions/particles in sentences that contain VPCs and sentences that contain only verb + prepositional phrase combinations both by statistical means and in machine learning experiments. Hence, VPCs and non-VPCs can be effectively separated from each other by using a rich feature set, containing several novel features.

pdf bib
Simple Compound Splitting for German
Marion Weller-Di Marco

This paper presents a simple method for German compound splitting that combines a basic frequency-based approach with a form-to-lemma mapping to approximate morphological operations. With the exception of a small set of hand-crafted rules for modeling transitional elements, this approach is resource-poor. In our evaluation, the simple splitter outperforms a splitter relying on rich morphological resources.
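
To make the idea of a basic frequency-based splitter concrete, the sketch below scores every binary split of a word by the geometric mean of its parts' corpus frequencies and keeps the split only if it beats the frequency of the unsplit word. This is a generic baseline written under stated assumptions, not the method of the paper: the single linking element "s", the `min_len` threshold, the restriction to two parts, and the toy frequency table are all illustrative choices, and the form-to-lemma mapping mentioned in the abstract is not modelled.

```python
import math

def split_compound(word, freq, linkers=("", "s"), min_len=3):
    """Return the best binary split by geometric-mean frequency, or [word] if none wins."""
    best_parts, best_score = [word], freq.get(word, 0)
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        for linker in linkers:
            # A non-empty linker must actually appear at the end of the first part.
            if linker and not head.endswith(linker):
                continue
            modifier = head[: len(head) - len(linker)]
            if len(modifier) < min_len:
                continue
            score = math.sqrt(freq.get(modifier, 0) * freq.get(tail, 0))
            if score > best_score:
                best_parts, best_score = [modifier, tail], score
    return best_parts

# Made-up lowercase frequencies for illustration only.
toy_freq = {"arbeit": 500, "zimmer": 300, "arbeitszimmer": 5}
parts = split_compound("arbeitszimmer", toy_freq)
```

Here the split "arbeit" + "zimmer" is found only because the linking "s" is stripped from the first part. Handling such transitional elements is where the abstract says a small set of hand-crafted rules comes in.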

pdf bib
Identification of Ambiguous Multiword Expressions Using Sequence Models and Lexical Resources
Manon Scholivet | Carlos Ramisch

We present a simple and efficient tagger capable of identifying highly ambiguous multiword expressions (MWEs) in French texts. It is based on conditional random fields (CRF), using local context information as features. We show that this approach can obtain results that, in some cases, approach more sophisticated parser-based MWE identification methods without requiring syntactic trees from a treebank. Moreover, we study how well the CRF can take into account external information coming from a lexicon.

pdf bib
Comparing Recurring Lexico-Syntactic Trees (RLTs) and Ngram Techniques for Extended Phraseology Extraction
Agnès Tutin | Olivier Kraif

This paper aims at assessing to what extent a syntax-based method (Recurring Lexico-syntactic Trees (RLT) extraction) allows us to extract large phraseological units such as prefabricated routines, e.g. “as previously said” or “as far as we/I know” in scientific writing. In order to evaluate this method, we compare it to the classical ngram extraction technique, on a subset of recurring segments including speech verbs in a French corpus of scientific writing. Results show that the RLT extraction technique is far more efficient for extended MWEs such as routines or collocations but performs more poorly for surface phenomena such as syntactic constructions or fully frozen expressions.

pdf bib
Benchmarking Joint Lexical and Syntactic Analysis on Multiword-Rich Data
Matthieu Constant | Héctor Martinez Alonso

This article evaluates the extension of a dependency parser that performs joint syntactic analysis and multiword expression identification. We show that, given sufficient training data, the parser benefits from explicit multiword information and improves overall labeled accuracy score in eight of the ten evaluation cases.

pdf bib
Semi-Automated Resolution of Inconsistency for a Harmonized Multiword Expression and Dependency Parse Annotation
King Chan | Julian Brooke | Timothy Baldwin

This paper presents a methodology for identifying and resolving various kinds of inconsistency in the context of merging dependency and multiword expression (MWE) annotations, to generate a dependency treebank with comprehensive MWE annotations. Candidates for correction are identified using a variety of heuristics, including an entirely novel one which identifies violations of MWE constituency in the dependency tree, and resolved by arbitration with minimal human intervention. Using this technique, we identified and corrected several hundred errors across both parse and MWE annotations, representing changes to a significant percentage (well over 10%) of the MWE instances in the joint corpus.

pdf bib
Combining Linguistic Features for the Detection of Croatian Multiword Expressions
Maja Buljan | Jan Šnajder

As multiword expressions (MWEs) exhibit a range of idiosyncrasies, their automatic detection warrants the use of many different features. Tsvetkov and Wintner (2014) proposed a Bayesian network model that combines linguistically motivated features and also models their interactions. In this paper, we extend their model with new features and apply it to Croatian, a morphologically complex language with relatively free word order, achieving a satisfactory performance of 0.823 F1-score. Furthermore, by comparing against (semi)naive Bayes models, we demonstrate that manually modeling feature interactions is indeed important. We make our annotated dataset of Croatian MWEs freely available.

pdf bib
Complex Verbs are Different: Exploring the Visual Modality in Multi-Modal Models to Predict Compositionality
Maximilian Köper | Sabine Schulte im Walde

This paper compares a neural network DSM relying on textual co-occurrences with a multi-modal model integrating visual information. We focus on nominal vs. verbal compounds, and zoom into lexical, empirical and perceptual target properties to explore the contribution of the visual modality. Our experiments show that (i) visual features contribute differently for verbs than for nouns, and (ii) images complement textual information, if (a) the textual modality by itself is poor and appropriate image subsets are used, or (b) the textual modality by itself is rich and large (potentially noisy) images are added.

pdf (full)
bib (full)
Proceedings of the Workshop Computational Semantics Beyond Events and Roles

pdf bib
Proceedings of the Workshop Computational Semantics Beyond Events and Roles
Eduardo Blanco | Roser Morante | Roser Saurí

pdf bib
Understanding the Semantics of Narratives of Interpersonal Violence through Reader Annotations and Physiological Reactions
Alexander Calderwood | Elizabeth A. Pruett | Raymond Ptucha | Christopher Homan | Cecilia Ovesdotter Alm

Interpersonal violence (IPV) is a prominent sociological problem that affects people of all demographic backgrounds. By analyzing how readers interpret, perceive, and react to experiences narrated in social media posts, we explore an understudied source for discourse about abuse. We asked readers to annotate Reddit posts about relationships with vs. without IPV for stakeholder roles and emotion, while measuring their galvanic skin response (GSR), pulse, and facial expression. We map annotations to coreference resolution output to obtain a labeled coreference chain for stakeholders in texts, and apply automated semantic role labeling for analyzing IPV discourse. Findings provide insights into how readers process roles and emotion in narratives. For example, abusers tend to be linked with violent actions and certain affect states. We train classifiers to predict stakeholder categories of coreference chains. We also find that subjects’ GSR noticeably changed for IPV texts, suggesting that co-collected measurement-based data about annotators can be used to support text annotation.

pdf bib
Intension, Attitude, and Tense Annotation in a High-Fidelity Semantic Representation
Gene Kim | Lenhart Schubert

This paper describes current efforts in developing an annotation schema and guidelines for sentences in Episodic Logic (EL). We focus on important distinctions for representing modality, attitudes, and tense and present an annotation schema that makes these distinctions. EL has proved competitive with other logical formulations in speed and inference-enablement, while expressing a wider array of natural language phenomena including intensional modification of predicates and sentences, propositional attitudes, and tense and aspect.

pdf bib
Towards a lexicon of event-selecting predicates for a French FactBank
Ingrid Falk | Fabienne Martin

This paper presents ongoing work for the construction of a French FactBank and a lexicon of French event-selecting predicates (ESPs), by applying the factuality detection algorithm introduced by Saurí and Pustejovsky (2012). This algorithm relies on a lexicon of ESPs, specifying how these predicates influence the polarity of their embedded events. For this pilot study, we focused on French factive and implicative verbs, and capitalised on a lexical resource for the English counterparts of these verbs provided by the CSLI Group (Nairn et al., 2006; Karttunen, 2012).

pdf bib
Universal Dependencies to Logical Form with Negation Scope
Federico Fancellu | Siva Reddy | Adam Lopez | Bonnie Webber

Many language technology applications would benefit from the ability to represent negation and its scope on top of widely-used linguistic resources. In this paper, we investigate the possibility of obtaining a first-order logic representation with negation scope marked using Universal Dependencies. To do so, we enhance UDepLambda, a framework that converts dependency graphs to logical forms. The resulting UDepLambda¬ is able to handle phenomena related to scope by means of a higher-order type theory, relevant not only to negation but also to universal quantification and other complex semantic phenomena. The initial conversion we did for English is promising, in that one can represent the scope of negation also in the presence of more complex phenomena such as universal quantifiers.

pdf bib
Meaning Banking beyond Events and Roles
Johan Bos

In this talk I will discuss the analysis of several semantic phenomena that need meaning representations that can describe attributes of propositional contexts. I will do this in a version of Discourse Representation Theory, using a universal semantic tagset developed as part of a project that aims to produce a large meaning bank (a semantically-annotated corpus) for four languages (English, Dutch, German and Italian).

pdf bib
The Scope and Focus of Negation: A Complete Annotation Framework for Italian
Begoña Altuna | Anne-Lyse Minard | Manuela Speranza

In this paper we present a complete framework for the annotation of negation in Italian, which accounts for both negation scope and negation focus, and also for language-specific phenomena such as negative concord. In our view, the annotation of negation complements more comprehensive Natural Language Processing tasks, such as temporal information processing and sentiment analysis. We applied the proposed framework and the guidelines built on top of it to the annotation of written texts, namely news articles and tweets, thus producing annotated data for a total of over 36,000 tokens.

pdf bib
Annotation of negation in the IULA Spanish Clinical Record Corpus
Montserrat Marimon | Jorge Vivaldi | Núria Bel

This paper presents the IULA Spanish Clinical Record Corpus, a corpus of 3,194 sentences extracted from anonymized clinical records and manually annotated with negation markers and their scope. The corpus was conceived as a resource to support clinical text-mining systems, but it is also a useful resource for other Natural Language Processing systems handling clinical texts: automatic encoding of clinical records, diagnosis support, term extraction, among others, as well as for the study of clinical texts. The corpus is publicly available with a CC-BY-SA 3.0 license.

pdf bib
Annotating Negation in Spanish Clinical Texts
Noa Cruz | Roser Morante | Manuel J. Maña López | Jacinto Mata Vázquez | Carlos L. Parra Calderón

In this paper we present on-going work on annotating negation in Spanish clinical documents. A corpus of anamnesis and radiology reports has been annotated by two domain expert annotators with negation markers and negated events. The Dice coefficient for inter-annotator agreement is higher than 0.94 for negation markers and higher than 0.72 for negated events. The corpus will be publicly released when the annotation process is finished, constituting the first corpus annotated with negation for Spanish clinical reports available for the NLP community.
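
For readers unfamiliar with the agreement measure reported in this abstract, the Dice coefficient between two annotation sets A and B is 2|A ∩ B| / (|A| + |B|). The sketch below assumes each annotation is an exact-match tuple of (sentence id, start token, end token). That representation is an assumption, since the abstract does not say how annotations from the two annotators were matched.

```python
def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical negation-marker spans as (sentence_id, start_token, end_token).
annotator_1 = {(0, 3, 5), (1, 0, 2), (2, 7, 9)}
annotator_2 = {(0, 3, 5), (1, 0, 2), (3, 1, 4)}
agreement = dice(annotator_1, annotator_2)
```

In this toy case the annotators share two of three spans each, giving 2 × 2 / (3 + 3) ≈ 0.67. Exact-span matching is strict, and a token-level comparison would typically yield higher values.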

pdf bib
Neural Networks for Negation Cue Detection in Chinese
Hangfeng He | Federico Fancellu | Bonnie Webber

Negation cue detection involves identifying the span inherently expressing negation in a negative sentence. In Chinese, negative cue detection is complicated by morphological properties of the language. Previous work has shown that negative cue detection in Chinese can benefit from specific lexical and morphemic features, as well as cross-lingual information. We show here that they are not necessary: a bi-directional LSTM can perform equally well, with minimal feature engineering. In particular, the use of a character-based model allows us to capture characteristics of negation cues in Chinese using word-embedding information only. Not only does our model perform on par with previous work, but further error analysis also clarifies what problems remain to be addressed.

pdf bib
An open-source tool for negation detection: a maximum-margin approach
Martine Enger | Erik Velldal | Lilja Øvrelid

This paper presents an open-source toolkit for negation detection. It identifies negation cues and their corresponding scope in either raw or parsed text using maximum-margin classification. The system design draws on best practice from the existing literature on negation detection, aiming for a simple and portable system that still achieves competitive performance. Pre-trained models and experimental results are provided for English.

Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications
Jose Camacho-Collados | Mohammad Taher Pilehvar

Compositional Semantics using Feature-Based Models from WordNet
Pablo Gamallo | Martín Pereira-Fariña

This article describes a method to build semantic representations of composite expressions in a compositional way by using WordNet relations to represent the meaning of words. The meaning of a target word is modelled as a vector in which its semantically related words are assigned weights according to both the type of the relationship and the distance to the target word. Word vectors are compositionally combined by syntactic dependencies. Each syntactic dependency triggers two complementary compositional functions: the named head function and dependent function. The experiments show that the proposed compositional method outperforms the state-of-the-art for both intransitive subject-verb and transitive subject-verb-object constructions.

Automated WordNet Construction Using Word Embeddings
Mikhail Khodak | Andrej Risteski | Christiane Fellbaum | Sanjeev Arora

We present a fully unsupervised method for automated construction of WordNets based upon recent advances in distributional representations of sentences and word-senses combined with readily available machine translation tools. The approach requires very few linguistic resources and is thus extensible to multiple target languages. To evaluate our method we construct two 600-word testsets for word-to-synset matching in French and Russian using native speakers and evaluate the performance of our method along with several other recent approaches. Our method exceeds the best language-specific and multi-lingual automated WordNets in F-score for both languages. The databases we construct for French and Russian, both languages without large publicly available manually constructed WordNets, will be publicly released along with the testsets.

Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrases and Individual Senses
Maximilian Köper | Sabine Schulte im Walde

Abstract words refer to things that cannot be seen, heard, felt, smelled, or tasted, as opposed to concrete words. Among other applications, the degree of abstractness has been shown to be useful information for metaphor detection. Our contributions to this topic are as follows: i) we compare supervised techniques to learn and extend abstractness ratings for huge vocabularies; ii) we learn and investigate norms for larger units by propagating abstractness to verb-noun pairs, which leads to better metaphor detection; iii) we overcome the limitation of learning a single rating per word and show that multi-sense abstractness ratings are potentially useful for metaphor detection. Finally, with this paper we publish automatically created abstractness norms for 3 million English words and multi-words as well as automatically created sense-specific abstractness ratings.

Improving Clinical Diagnosis Inference through Integration of Structured and Unstructured Knowledge
Yuan Ling | Yuan An | Sadid Hasan

This paper presents a novel approach to the task of automatically inferring the most probable diagnosis from a given clinical narrative. Structured Knowledge Bases (KBs) can be useful for such complex tasks but not sufficient. Hence, we leverage a vast amount of unstructured free text to integrate with structured KBs. The key innovative ideas include building a concept graph from both structured and unstructured knowledge sources and ranking the diagnosis concepts using the enhanced word embedding vectors learned from integrated sources. Experiments on the TREC CDS and HumanDx datasets showed that our methods improved the results of clinical diagnosis inference.

Classifying Lexical-semantic Relationships by Exploiting Sense/Concept Representations
Kentaro Kanada | Tetsunori Kobayashi | Yoshihiko Hayashi

This paper proposes a method for classifying the type of lexical-semantic relation between a given pair of words. Given an inventory of target relationships, this task can be seen as a multi-class classification problem. We train a supervised classifier by assuming: (1) a specific type of lexical-semantic relation between a pair of words would be indicated by a carefully designed set of relation-specific similarities associated with the words; and (2) the similarities could be effectively computed by “sense representations” (sense/concept embeddings). The experimental results show that the proposed method clearly outperforms an existing state-of-the-art method that does not utilize sense/concept embeddings, thereby demonstrating the effectiveness of the sense representations.

Supervised and unsupervised approaches to measuring usage similarity
Milton King | Paul Cook

Usage similarity (USim) is an approach to determining word meaning in context that does not rely on a sense inventory. Instead, pairs of usages of a target lemma are rated on a scale. In this paper we propose unsupervised approaches to USim based on embeddings for words, contexts, and sentences, and achieve state-of-the-art results over two USim datasets. We further consider supervised approaches to USim, and find that although they outperform unsupervised approaches, they are unable to generalize to lemmas that are unseen in the training data.

Lexical Disambiguation of Igbo using Diacritic Restoration
Ignatius Ezeani | Mark Hepple | Ikechukwu Onyenwe

Properly written texts in Igbo, a low-resource African language, are rich in both orthographic and tonal diacritics. Diacritics are essential in capturing the distinctions in pronunciation and meaning of words, as well as in lexical disambiguation. Unfortunately, most electronic texts in diacritic languages are written without diacritics. This makes diacritic restoration a necessary step in corpus building and language processing tasks for languages with diacritics. In our previous work, we built some n-gram models with simple smoothing techniques based on a closed-world assumption. However, as a classification task, diacritic restoration is well suited for and will be more generalisable with machine learning. This paper, therefore, presents a more standard approach to dealing with the task which involves the application of machine learning algorithms.
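
To make the task concrete, diacritic restoration can be framed as mapping a diacritic-free form back to one of its marked variants. The sketch below is a most-frequent-form baseline only, not the machine learning approach the paper describes; the function names and the toy training tokens are assumptions chosen for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks (tone marks, dots) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train_restorer(tokens):
    """Map each diacritic-free form to its most frequent marked variant."""
    variants = defaultdict(Counter)
    for token in tokens:
        token = unicodedata.normalize("NFC", token)
        variants[strip_diacritics(token)][token] += 1
    return {plain: counts.most_common(1)[0][0] for plain, counts in variants.items()}


def restore(word, restorer):
    """Return the learned variant, or the word unchanged if it was never seen."""
    return restorer.get(word, word)


# Toy training tokens (illustrative only): two marked variants share one plain form.
training_tokens = ["ákwà", "ákwà", "àkwá"]
restorer = train_restorer(training_tokens)
restored = restore("akwa", restorer)
print(restored)
```

Because this baseline ignores context, it always picks the same variant for an ambiguous form, which is the limitation that context-aware classifiers are meant to address.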

Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds
Mahmoud El-Haj | Paul Rayson | Scott Piao | Stephen Wattam

Creating high-quality, wide-coverage multilingual semantic lexicons to support knowledge-based approaches is a challenging, time-consuming manual task. This has traditionally been performed by linguistic experts: a slow and expensive process. We present an experiment in which we adapt and evaluate crowdsourcing methods employing native speakers to generate a list of coarse-grained senses under a common multilingual semantic taxonomy for sets of words in six languages. 451 non-experts (including 427 Mechanical Turk workers) and 15 expert participants semantically annotated 250 words manually for Arabic, Chinese, English, Italian, Portuguese and Urdu lexicons. In order to avoid erroneous (spam) crowdsourced results, we used a novel task-specific two-phase filtering process where users were asked to identify synonyms in the target language, and remove erroneous senses.

Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Alexander Panchenko | Stefano Faralli | Simone Paolo Ponzetto | Chris Biemann

We introduce a new method for unsupervised knowledge-based word sense disambiguation (WSD) based on a resource that links two types of sense-aware lexical networks: one is induced from a corpus using distributional semantics, the other is manually constructed. The combination of the two networks reduces the sparsity of sense representations used for WSD. We evaluate these enriched representations within two lexical sample sense disambiguation benchmarks. Our results indicate that (1) features extracted from the corpus-based resource help to significantly outperform a model based solely on the lexical resource; (2) our method achieves results comparable to or better than four state-of-the-art unsupervised knowledge-based WSD systems, including three hybrid systems that also rely on text corpora. In contrast to these hybrid methods, our approach does not require access to web search engines, texts mapped to a sense inventory, or machine translation systems.

One Representation per Word - Does it make Sense for Composition?
Thomas Kober | Julie Weeds | John Wilkie | Jeremy Reffin | David Weir

In this paper, we investigate whether an a priori disambiguation of word senses is strictly necessary or whether the meaning of a word in context can be disambiguated through composition alone. We evaluate the performance of off-the-shelf single-vector and multi-sense vector models on a benchmark phrase similarity task and a novel task for word-sense discrimination. We find that single-sense vector models perform as well as or better than multi-sense vector models despite arguably less clean elementary representations. Our findings furthermore show that simple composition functions such as pointwise addition are able to recover sense-specific information from a single-sense vector model remarkably well.
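
The intuition behind disambiguation through composition can be sketched with pointwise addition and cosine similarity. The two-dimensional vectors below are invented toy values (one "finance" and one "river" dimension), not embeddings from the paper; they only show how adding a context word shifts an ambiguous vector toward one sense.

```python
import math


def add_vectors(u, v):
    """Compose two word vectors by pointwise addition."""
    return [x + y for x, y in zip(u, v)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy single-sense vectors: dimensions are (finance, river). Values are assumptions.
vectors = {
    "bank": [1.0, 1.0],     # ambiguous: mixes both senses in one vector
    "account": [1.0, 0.0],
    "river": [0.0, 1.0],
    "loan": [1.0, 0.0],
}

bank_account = add_vectors(vectors["bank"], vectors["account"])
river_bank = add_vectors(vectors["river"], vectors["bank"])

sim_finance = cosine(bank_account, vectors["loan"])
sim_river = cosine(river_bank, vectors["loan"])
print(f"bank+account vs loan: {sim_finance:.3f}")
print(f"river+bank vs loan:   {sim_river:.3f}")
```

In this toy setting the composed phrase "bank account" ends up closer to "loan" than "river bank" does, even though "bank" itself has only one vector.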

Elucidating Conceptual Properties from Word Embeddings
Kyoung-Rok Jang | Sung-Hyon Myaeng

In this paper, we introduce a method of identifying the components (i.e. dimensions) of word embeddings that strongly signify properties of a word. By elucidating such properties hidden in word embeddings, we could make word embeddings more interpretable, and could also perform property-based meaning comparison. With this capability, we can answer questions like “To what degree does a given word have the property cuteness?” or “In what perspective are two words similar?”. We verify our method by examining how the strength of property-signifying components correlates with the degree of prototypicality of a target word.

TTCS: a Vectorial Resource for Computing Conceptual Similarity
Enrico Mensa | Daniele P. Radicioni | Antonio Lieto

In this paper we introduce the TTCS, a linguistic resource that relies on BabelNet, NASARI and ConceptNet, and that has now been used to compute the conceptual similarity between concept pairs. The conceptual representation herein provides uniform access to concepts based on BabelNet synset IDs, and consists of a vector-based semantic representation which is compliant with the Conceptual Spaces, a geometric framework for common-sense knowledge representation and reasoning. The TTCS has been evaluated in a preliminary experiment on a conceptual similarity task.

Measuring the Italian-English lexical gap for action verbs and its impact on translation
Lorenzo Gregori | Alessandro Panunzi

This paper describes a method to measure the lexical gap of action verbs in Italian and English by using the IMAGACT ontology of action. The fine-grained categorization of action concepts in the data source allowed us to obtain a wide overview of the relation between concepts in the two languages. The calculated lexical gap for both English and Italian is about 30% of the action concepts, much higher than previous results. Beyond these general numbers, a deeper analysis has been performed in order to evaluate the impact that lexical gaps can have on translation. In particular, a distinction has been made between the cases in which the presence of a lexical gap affects translation correctness and completeness at a semantic level. The results highlight a high percentage of concepts that can be considered hard to translate (about 18% from English to Italian and 20% from Italian to English) and confirm that action verbs are a critical lexical class for translation tasks.

Word Sense Filtering Improves Embedding-Based Lexical Substitution
Anne Cocos | Marianna Apidianaki | Chris Callison-Burch

The role of word sense disambiguation in lexical substitution has been questioned due to the high performance of vector space models which propose good substitutes without explicitly accounting for sense. We show that a filtering mechanism based on a sense inventory optimized for substitutability can improve the results of these models. Our sense inventory is constructed using a clustering method which generates paraphrase clusters that are congruent with lexical substitution annotations in a development set. The results show that lexical substitution can still benefit from senses which can improve the output of vector space paraphrase ranking models.

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms
Aleksander Wawer | Agnieszka Mykowiecka

This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first is an unsupervised method based on computing log probability from sequences of word embedding vectors, taking into account ambiguous word senses and guessing the correct sense from context. The second method is supervised. We use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both methods on corpora with manual annotations of word senses from the Polish wordnet (plWordnet).

Proceedings of the Sixth Workshop on Vision and Language

Proceedings of the Sixth Workshop on Vision and Language
Anya Belz | Erkut Erdem | Katerina Pastra | Krystian Mikolajczyk

The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings
Yanchao Yu | Arash Eshghi | Gregory Mills | Oliver Lemon

We motivate and describe a new freely available human-human dialogue data set for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner. The data has been collected using a novel, character-by-character variant of the DiET chat tool (Healey et al., 2003; anon.) with a novel task, where a Learner needs to learn invented visual attribute words (such as “burchak” for square) from a tutor. As such, the text-based interactions closely resemble face-to-face conversation and thus contain many of the linguistic phenomena encountered in natural, spontaneous dialogue. These include self- and other-correction, mid-sentence continuations, interruptions, turn overlaps, fillers, hedges and many kinds of ellipsis. We also present a generic n-gram framework for building user (i.e. tutor) simulations from this type of incremental dialogue data, which is freely available to researchers. We show that the simulations produce outputs that are similar to the original data (e.g. 78% turn match similarity). Finally, we train and evaluate a Reinforcement Learning dialogue control agent for learning visually grounded word meanings, trained from the BURCHAK corpus. The learned policy shows comparable performance to a rule-based system built previously.

The Use of Object Labels and Spatial Prepositions as Keywords in a Web-Retrieval-Based Image Caption Generation System
Brandon Birmingham | Adrian Muscat

In this paper, a retrieval-based caption generation system that searches the web for suitable image descriptions is studied. Google’s reverse image search is used to find potentially relevant web multimedia content for query images. Sentences are extracted from web pages and the likelihood of the descriptions is computed to select one sentence from the retrieved text documents. The search mechanism is modified to replace the caption generated by Google with a caption composed of labels and spatial prepositions as part of the query’s text alongside the image. The object labels are obtained using an off-the-shelf R-CNN and a machine learning model is developed to predict the prepositions. The effect on the caption generation system performance when using the generated text is investigated. Both human evaluations and automatic metrics are used to evaluate the retrieved descriptions. Results show that the web-retrieval-based approach performed better when describing single-object images with sentences extracted from stock photography websites. On the other hand, images with two image objects were better described with template-generated sentences composed of object labels and prepositions.

Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision
Aparna Nurani Venkitasubramanian | Tinne Tuytelaars | Marie-Francine Moens

We investigate animal recognition models learned from wildlife video documentaries by using the weak supervision of the textual subtitles. This is a particularly challenging setting, since i) the animals occur in their natural habitat and are often largely occluded and ii) subtitles are to a large degree complementary to the visual content, providing a very weak supervisory signal. This is in contrast to most work on integrated vision and language in the literature, where textual descriptions are tightly linked to the image content, and often generated in a curated fashion for the task at hand. In particular, we investigate different image representations and models, including a support vector machine on top of activations of a pretrained convolutional neural network, as well as a Naive Bayes framework on a ‘bag-of-activations’ image representation, where each element of the bag is considered separately. This representation allows key components in the image to be isolated, in spite of largely varying backgrounds and image clutter, without an object detection or image segmentation step. The methods are evaluated based on how well they transfer to unseen camera-trap images captured across diverse topographical regions under different environmental conditions and illumination settings, involving a large domain shift.

Human Evaluation of Multi-modal Neural Machine Translation: A Case-Study on E-Commerce Listing Titles
Iacer Calixto | Daniel Stein | Evgeny Matusov | Sheila Castilho | Andy Way

In this paper, we study how humans perceive the use of images as an additional knowledge source to machine-translate user-generated product listings in an e-commerce company. We conduct a human evaluation where we assess how a multi-modal neural machine translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attention-based NMT and a phrase-based statistical machine translation (PBSMT) model. We evaluate translations obtained with different systems and also discuss the data set of user-generated product listings, which in our case comprises both product listings and associated images. We found that humans preferred translations obtained with a PBSMT system to both text-only and multi-modal NMT over 56% of the time. Nonetheless, human evaluators ranked translations from a multi-modal NMT model as better than those of a text-only NMT over 88% of the time, which suggests that images do help NMT in this use-case.

The BreakingNews Dataset
Arnau Ramisa | Fei Yan | Francesc Moreno-Noguer | Krystian Mikolajczyk

We present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (e.g. GPS coordinates and popularity metrics). The tenuous connection between the images and text in news data is appropriate to take work at the intersection of Computer Vision and Natural Language Processing to the next step, hence we hope this dataset will help spur progress in the field.

Automatic identification of head movements in video-recorded conversations: can words help?
Patrizia Paggio | Costanza Navarretta | Bart Jongejan

We present an approach where an SVM classifier learns to classify head movements based on measurements of velocity, acceleration, and the third derivative of position with respect to time, jerk. Consequently, annotations of head movements are added to new video data. The results of the automatic annotation are evaluated against manual annotations in the same data and show an accuracy of 68% with respect to these. The results also show that using jerk improves accuracy. We then conduct an investigation of the overlap between temporal sequences classified as either movement or non-movement and the speech stream of the person performing the gesture. The statistics derived from this analysis show that using word features may help increase the accuracy of the model.
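
The three movement measurements named above are successive time derivatives of position, and one simple way to approximate them from tracked coordinates is by finite differences. The sketch below assumes evenly spaced samples and a hypothetical one-dimensional head position track; the abstract does not say how the features were actually extracted.

```python
def finite_difference(values, dt):
    """Forward difference: approximate the time derivative of a sampled signal."""
    return [(later - earlier) / dt for earlier, later in zip(values, values[1:])]


def motion_features(positions, dt):
    """Return velocity, acceleration and jerk series for a sampled position track."""
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return velocity, acceleration, jerk


# Hypothetical vertical head position per frame, with an assumed unit time step.
positions = [float(t ** 3) for t in range(6)]
velocity, acceleration, jerk = motion_features(positions, dt=1.0)
print(velocity, acceleration, jerk)
```

Each differentiation step shortens the series by one sample, so in practice the three feature series would need to be aligned (for example by trimming to the shortest) before being passed to a classifier.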

Multi-Modal Fashion Product Retrieval
Antonio Rubio Romano | LongLong Yu | Edgar Simo-Serra | Francesc Moreno-Noguer

Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites are updated with thousands of images and their associated metadata (textual information), deepening the problem. In this paper, we leverage both the images and textual metadata and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset.

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Beatrice Alex | Stefania Degaetano-Ortlieb | Anna Feldman | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

Metaphor Detection in a Poetry Corpus
Vaibhav Kesarwani | Diana Inkpen | Stan Szpakowicz | Chris Tanasescu

Metaphor is indispensable in poetry. It showcases the poet’s creativity, and contributes to the overall emotional pertinence of the poem while honing its specific rhetorical impact. Previous work on metaphor detection relies on either rule-based or statistical models, none of them applied to poetry. Our method focuses on metaphor detection in a poetry corpus. It combines rule-based and statistical models (word embeddings) to develop a new classification system. Our system has achieved a precision of 0.759 and a recall of 0.804 in identifying one type of metaphor in poetry.

Machine Translation and Automated Analysis of the Sumerian Language
Émilie Pagé-Perron | Maria Sukhareva | Ilya Khait | Christian Chiarcos

This paper presents a newly funded international project for machine translation and automated analysis of ancient cuneiform languages where NLP specialists and Assyriologists collaborate to create an information retrieval system for Sumerian. This research is conceived in response to the need to translate large numbers of administrative texts that are only available in transcription, in order to make them accessible to a wider audience. The methodology includes creation of a specialized NLP pipeline and also the use of linguistic linked open data to increase access to the results.

Investigating the Relationship between Literary Genres and Emotional Plot Development
Evgeny Kim | Sebastian Padó | Roman Klinger

Literary genres are commonly viewed as being defined in terms of content and stylistic features. In this paper, we focus on one particular class of lexical features, namely emotion information, and investigate the hypothesis that emotion-related information correlates with particular genres. Using genre classification as a testbed, we compare a model that computes lexicon-based emotion scores globally for complete stories with a model that tracks emotion arcs through stories on a subset of Project Gutenberg with five genres. Our main findings are: (a) the global emotion model is competitive with a large-vocabulary bag-of-words genre classifier (80% F1); (b) the emotion arc model shows a lower performance (59% F1) but shows complementary behavior to the global model, as indicated by a very good performance of an oracle model (94% F1) and an improved performance of an ensemble model (84% F1); (c) genres differ in the extent to which stories follow the same emotional arcs, with particularly uniform behavior for anger (mystery) and fear (adventures, romance, humor, science fiction).

Enjambment Detection in a Large Diachronic Corpus of Spanish Sonnets
Pablo Ruiz | Clara Martínez Cantón | Thierry Poibeau | Elena González-Blanco

Enjambment takes place when a syntactic unit is broken up across two lines of poetry, giving rise to different stylistic effects. In Spanish literary studies, there are unclear points about the types of stylistic effects that can arise, and under which linguistic conditions. To systematically gather evidence about this, we developed a system to automatically identify enjambment (and its type) in Spanish. For evaluation, we manually annotated a reference corpus covering different periods. To apply the tool to a scholarly corpus, we created from public HTML sources a diachronic corpus covering four centuries of sonnets (3750 poems), and we analyzed the occurrence of enjambment across stanzaic boundaries in different periods. We also found examples that highlight limitations in current definitions of enjambment.

Plotting Markson’s “Mistress”
Conor Kelleher | Mark Keane

The post-modern novel “Wittgenstein’s Mistress” by David Markson (1988) presents the reader with a very challenging non-linear narrative, which itself appears to be one of the novel’s themes. We present a distant reading of this work designed to complement a close reading of it by David Foster Wallace (1990). Using a combination of text analysis, entity recognition and networks, we plot repetitive structures in the novel’s narrative, relating them to its critical analysis.

Annotation Challenges for Reconstructing the Structural Elaboration of Middle Low German
Nina Seemann | Marie-Luis Merten | Michaela Geierhos | Doris Tophinke | Eyke Hüllermeier

In this paper, we present the annotation challenges we have encountered when working on a historical language that was undergoing elaboration processes. We especially focus on syntactic ambiguity and gradience in Middle Low German, which causes uncertainty to some extent. Since current annotation tools consider construction contexts and the dynamics of the grammaticalization only partially, we plan to extend CorA - a web-based annotation tool for historical and other non-standard language data - to capture elaboration phenomena and annotator unsureness. Moreover, we seek to interactively learn morphological as well as syntactic annotations.

Phonological Soundscapes in Medieval Poetry
Christopher Hench

The oral component of medieval poetry was integral to its performance and reception. Yet many believe that the medieval voice has been forever lost, and any attempts at rediscovering it are doomed to failure due to scribal practices, manuscript mouvance, and linguistic normalization in editing practices. This paper offers a method to abstract from this noise and better understand relative differences in phonological soundscapes by considering syllable qualities. The presented syllabification method and soundscape analysis offer themselves as cross-disciplinary tools for low-resource languages. As a case study, we examine medieval German lyric and argue that the heavily debated lyrical ‘I’ follows a unique trajectory through soundscapes, shedding light on the performance and practice of these poets.

An End-to-end Environment for Research Question-Driven Entity Extraction and Network Analysis
Andre Blessing | Nora Echelmeyer | Markus John | Nils Reiter

This paper presents an approach to extract co-occurrence networks from literary texts. It is a deliberate decision not to aim for a fully automatic pipeline, as the literary research questions need to guide both the definition of the nature of the things that co-occur and how to decide co-occurrence. We showcase the approach on a Middle High German romance, Parzival. Manual inspection and discussion show the huge impact various choices have.

Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns
Stefania Degaetano-Ortlieb | Elke Teich

We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).
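
For readers unfamiliar with the two measures, surprisal is the negative log probability of an item and entropy is the expected surprisal over a distribution. The sketch below estimates both from raw relative frequencies of hypothetical phrasal patterns in one article section; this is a simplification, since the paper's models may estimate probabilities differently (for example, conditioned on context).

```python
import math
from collections import Counter


def surprisal(probability):
    """Surprisal in bits: rarer patterns carry more information."""
    return -math.log2(probability)


def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts.values())
    return sum((c / total) * surprisal(c / total) for c in counts.values())


# Hypothetical pattern occurrences in one section of a research article.
patterns = ["we show that", "we show that", "was measured by", "in order to"]
counts = Counter(patterns)

section_entropy = entropy(counts)
total = sum(counts.values())
for pattern, count in counts.items():
    print(f"{pattern!r}: surprisal = {surprisal(count / total):.2f} bits")
print(f"entropy = {section_entropy:.2f} bits")
```

Comparing such values across sections is one way to see which patterns are typical of a section (low surprisal there) and how varied a section's patterns are overall (entropy).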

Finding a Character’s Voice: Stylome Classification on Literary Characters
Liviu P. Dinu | Ana Sabina Uban

We investigate in this paper the problem of classifying the stylome of characters in a literary work. Previous research in the field of authorship attribution has shown that the writing style of an author can be characterized and distinguished from that of other authors automatically. In this paper we take a look at the less approached problem of how the styles of different characters can be distinguished, trying to verify if an author managed to create believable characters with individual styles. We present the results of some initial experiments developed on the novel “Liaisons Dangereuses”, showing that a simple bag of words model can be used to classify the characters.

An Ontology-Based Method for Extracting and Classifying Domain-Specific Compositional Nominal Compounds
Maria Pia di Buono

In this paper, we present our preliminary study on an ontology-based method to extract and classify compositional nominal compounds in specific domains of knowledge. This method is based on the assumption that, by applying a conceptual model to represent a knowledge domain, it is possible to improve the extraction and classification of lexicon occurrences for that domain in a semi-automatic way. We explore the possibility of extracting and classifying a specific construction type (nominal compounds) spanning a specific domain (Cultural Heritage) and a specific language (Italian).

Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin
Géraldine Walther | Benoît Sagot

In this paper, we present ongoing work for developing language resources and basic NLP tools for an undocumented variety of Romansh, in the context of a language documentation and language acquisition project. Our tools are meant to improve the speed and reliability of corpus annotations for noisy data involving large amounts of code-switching, occurrences of child-speech and orthographic noise. Being able to increase the efficiency of language resource development for language documentation and acquisition research also constitutes a step towards solving the data sparsity issues with which researchers have been struggling.

pdf bib
Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite
Maria Sukhareva | Francesco Fuscagni | Johannes Daxenberger | Susanne Görke | Doris Prechel | Iryna Gurevych

This paper presents a statistical approach to automatic morphosyntactic annotation of Hittite transcripts. Hittite is an extinct Indo-European language using the cuneiform script. There are currently no morphosyntactic annotations available for Hittite, so we explored methods of distant supervision. The annotations were projected from parallel German translations of the Hittite texts. In order to reduce data sparsity, we applied stemming of German and Hittite texts. As there is no off-the-shelf Hittite stemmer, a stemmer for Hittite was developed for this purpose. The resulting annotation projections were used to train a POS tagger, achieving an accuracy of 69% on a test sample. To our knowledge, this is the first attempt at statistical POS tagging of a cuneiform language.

pdf bib
A Dataset for Sanskrit Word Segmentation
Amrith Krishna | Pavan Kumar Satuluri | Pawan Goyal

The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments.

pdf bib
Lexical Correction of Polish Twitter Political Data
Maciej Ogrodniczuk | Mateusz Kopeć

Language processing architectures are often evaluated in near-to-perfect conditions with respect to processed content. The tools which perform sufficiently well on electronic press, books and other types of non-interactive content may poorly handle littered, colloquial and multilingual textual data, which make up the majority of communication today. This paper aims at investigating how Polish Twitter data (in a slightly controlled ‘political’ flavour) differs from the expectations of linguistic tools and how it could be corrected to be ready for processing by standard language processing chains available for Polish. The setting includes specialised components for spelling correction of tweets as well as hashtag and username decoding.

BioNLP 2017

pdf bib
BioNLP 2017
Kevin Bretonnel Cohen | Dina Demner-Fushman | Sophia Ananiadou | Junichi Tsujii

pdf bib
Target word prediction and paraphasia classification in spoken discourse
Joel Adams | Steven Bedrick | Gerasimos Fergadiotis | Kyle Gorman | Jan van Santen

We present a system for automatically detecting and classifying phonologically anomalous productions in the speech of individuals with aphasia. Working from transcribed discourse samples, our system identifies neologisms, and uses a combination of string alignment and language models to produce a lattice of plausible words that the speaker may have intended to produce. We then score this lattice according to various features, and attempt to determine whether the anomalous production represented a phonemic error or a genuine neologism. This approach has the potential to be expanded to consider other types of paraphasic errors, and could be applied to a wide variety of screening and therapeutic applications.

pdf bib
Extracting Drug-Drug Interactions with Attention CNNs
Masaki Asada | Makoto Miwa | Yutaka Sasaki

We propose a novel attention mechanism for a Convolutional Neural Network (CNN)-based Drug-Drug Interaction (DDI) extraction model. CNNs have been shown to have great potential on DDI extraction tasks; however, attention mechanisms, which emphasize important words in the sentence of a target-entity pair, have not been investigated with CNNs, despite the fact that attention mechanisms have been shown to be effective for a general-domain relation classification task. We evaluated our model on Task 9.2 of the DDIExtraction-2013 shared task. As a result, our attention mechanism improved the performance of our base CNN-based DDI model, and the model achieved an F-score of 69.12%, which is competitive with the state-of-the-art models.

pdf bib
Insights into Analogy Completion from the Biomedical Domain
Denis Newman-Griffis | Albert Lai | Eric Fosler-Lussier

Analogy completion has been a popular task in recent years for evaluating the semantic properties of word embeddings, but the standard methodology makes a number of assumptions about analogies that do not always hold, either in recent benchmark datasets or when expanding into other domains. Through an analysis of analogies in the biomedical domain, we identify three assumptions: that of a Single Answer for any given analogy, that the pairs involved describe the Same Relationship, and that each pair is Informative with respect to the other. We propose modifying the standard methodology to relax these assumptions by allowing for multiple correct answers, reporting MAP and MRR in addition to accuracy, and using multiple example pairs. We further present BMASS, a novel dataset for evaluating linguistic regularities in biomedical embeddings, and demonstrate that the relationships described in the dataset pose significant semantic challenges to current word embedding methods.
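
The two ranking metrics named in this abstract can be sketched as follows. This is only an illustration of the standard textbook definitions of mean reciprocal rank (MRR) and mean average precision (MAP) when an analogy has several correct answers; the function names and toy data are hypothetical, and normalising average precision by the number of gold answers is one common convention that the paper may or may not use.

```python
from typing import Iterable, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average precision@k over the ranks k at which a correct answer appears.

    Assumption: the sum is divided by the number of gold answers, so
    correct answers that are never retrieved lower the score.
    """
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank
    return total / len(gold)


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean, 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy queries: (ranked predictions, set of acceptable answers).
queries = [
    (["a", "b", "c", "d"], {"b", "d"}),
    (["x", "y"], {"x"}),
]
mrr = mean(reciprocal_rank(r, g) for r, g in queries)
map_score = mean(average_precision(r, g) for r, g in queries)
```

Unlike accuracy, both metrics give partial credit when a correct answer is ranked below first place, which is why they suit analogies with more than one valid completion.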

pdf bib
Deep learning for extracting protein-protein interactions from biomedical literature
Yifan Peng | Zhiyong Lu

State-of-the-art methods for protein-protein interaction (PPI) extraction are primarily feature-based or kernel-based by leveraging lexical and syntactic information. But how to incorporate such knowledge in the recent deep learning methods remains an open question. In this paper, we propose a multichannel dependency-based convolutional neural network model (McDepCNN). It applies one channel to the embedding vector of each word in the sentence, and another channel to the embedding vector of the head of the corresponding word. Therefore, the model can use richer information obtained from different channels. Experiments on two public benchmarking datasets, AIMed and BioInfer, demonstrate that McDepCNN provides up to 6% F1-score improvement over rich feature-based methods and single-kernel methods. In addition, McDepCNN achieves 24.4% relative improvement in F1-score over the state-of-the-art methods on cross-corpus evaluation and 12% improvement in F1-score over kernel-based methods on “difficult” instances. These results suggest that McDepCNN generalizes more easily over different corpora, and is capable of capturing long distance features in the sentences.

pdf bib
Stacking With Auxiliary Features for Entity Linking in the Medical Domain
Nazneen Fatema Rajani | Mihaela Bornea | Ken Barker

Linking spans of natural language text to concepts in a structured source is an important task for many problems. It allows intelligent systems to leverage rich knowledge available in those sources (such as concept properties and relations) to enhance the semantics of the mentions of these concepts in text. In the medical domain, it is common to link text spans to medical concepts in large, curated knowledge repositories such as the Unified Medical Language System. Different approaches have different strengths: some are precision-oriented, some recall-oriented; some better at considering context but more prone to hallucination. The variety of techniques suggests that ensembling could outperform component technologies at this task. In this paper, we describe our process for building a Stacking ensemble using additional, auxiliary features for Entity Linking in the medical domain. We report experiments that show that naive ensembling does not always outperform component Entity Linking systems, that stacking usually outperforms naive ensembling, and that auxiliary features added to the stacker further improve its performance on three distinct datasets. Our best model produces state-of-the-art results on several medical datasets.

pdf bib
Results of the fifth edition of the BioASQ Challenge
Anastasios Nentidis | Konstantinos Bougiatiotis | Anastasia Krithara | Georgios Paliouras | Ioannis Kakadiaris

The goal of the BioASQ challenge is to engage researchers in creating cutting-edge biomedical information systems. Specifically, it aims at the promotion of systems and methodologies that are able to deal with a plethora of different tasks in the biomedical domain. This is achieved through the organization of challenges. The fifth challenge consisted of three tasks: semantic indexing, question answering and a new task on information extraction. In total, 29 teams with more than 95 systems participated in the challenge. Overall, as in previous years, the best systems were able to outperform the strong baselines. This suggests that state-of-the-art systems are continuously improving, pushing the frontier of research.

pdf bib
Tackling Biomedical Text Summarization: OAQA at BioASQ 5B
Khyathi Chandu | Aakanksha Naik | Aditya Chandrasekar | Zi Yang | Niloy Gupta | Eric Nyberg

In this paper, we describe our participation in phase B of task 5b of the fifth edition of the annual BioASQ challenge, which includes answering factoid, list, yes-no and summary questions from biomedical data. We describe our techniques with an emphasis on ideal answer generation, where the goal is to produce a relevant, precise, non-redundant, query-oriented summary from multiple relevant documents. We make use of extractive summarization techniques to address this task and experiment with different biomedical ontologies and various algorithms including agglomerative clustering, Maximum Marginal Relevance (MMR) and sentence compression. We propose a novel word embedding based tf-idf similarity metric and a soft positional constraint which improve our system performance. We evaluate our techniques on test batch 4 from the fourth edition of the challenge. Our best system achieves a ROUGE-2 score of 0.6534 and ROUGE-SU4 score of 0.6536.
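
As a rough illustration of the Maximum Marginal Relevance (MMR) step mentioned in this abstract, the sketch below greedily picks sentences that are similar to the query but dissimilar to sentences already chosen. It is not the authors' system: the similarity function here is a plain bag-of-words cosine swapped in for their embedding-based tf-idf metric, and the names `mmr_select` and `lam` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query: str, sentences: List[str], k: int, lam: float = 0.7) -> List[str]:
    """Greedy MMR: score = lam * sim(sentence, query)
    - (1 - lam) * max sim(sentence, already selected)."""
    q = Counter(query.lower().split())
    vecs = [Counter(s.lower().split()) for s in sentences]
    selected: List[int] = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is 0.0 while nothing has been selected yet.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], q) - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical snippets: the second is an exact duplicate of the first.
snippets = [
    "aspirin reduces fever",
    "aspirin reduces fever",
    "ibuprofen reduces inflammation",
]
summary = mmr_select("aspirin reduces fever", snippets, k=2, lam=0.3)
```

With a low `lam` the redundancy penalty dominates, so the duplicate snippet is skipped in favour of the one that adds new content; a higher `lam` weights query relevance more heavily.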

pdf bib
Macquarie University at BioASQ 5b – Query-based Summarisation Techniques for Selecting the Ideal Answers
Diego Mollá

Macquarie University’s contribution to the BioASQ challenge (Task 5b Phase B) focused on the use of query-based extractive summarisation techniques for the generation of the ideal answers. Four runs were submitted, with approaches ranging from a trivial system that selected the first n snippets, to the use of deep learning approaches under a regression framework. Our experiments and the ROUGE results of the five test batches of BioASQ indicate surprisingly good results for the trivial approach. Overall, most of our runs on the first three test batches achieved the best ROUGE-SU4 results in the challenge.

pdf bib
Neural Question Answering at BioASQ 5B
Georg Wiese | Dirk Weissenborn | Mariana Neves

This paper describes our submission to the 2017 BioASQ challenge. We participated in Task B, Phase B, which is concerned with biomedical question answering (QA). We focus on factoid and list questions, using an extractive QA model, that is, we restrict our system to output substrings of the provided text snippets. At the core of our system, we use FastQA, a state-of-the-art neural QA system. We extended it with biomedical word embeddings and changed its answer layer to be able to answer list questions in addition to factoid questions. We pre-trained the model on a large-scale open-domain QA dataset, SQuAD, and then fine-tuned the parameters on the BioASQ training set. With our approach, we achieve state-of-the-art results on factoid questions and competitive results on list questions.

pdf bib
End-to-End System for Bacteria Habitat Extraction
Farrokh Mehryary | Kai Hakala | Suwisa Kaewphan | Jari Björne | Tapio Salakoski | Filip Ginter

We introduce an end-to-end system capable of named-entity detection, normalization and relation extraction for extracting information about bacteria and their habitats from biomedical literature. Our system is based on deep learning, CRF classifiers and vector space models. We train and evaluate the system on the BioNLP 2016 Shared Task Bacteria Biotope data. The official evaluation shows that the joint performance of our entity detection and relation extraction models outperforms the winning team of the Shared Task by 19pp on F1-score, establishing a new top score for the task. We also achieve state-of-the-art results in the normalization task. Our system is open source and freely available at https://github.com/TurkuNLP/BHE.

pdf bib
Creation and evaluation of a dictionary-based tagger for virus species and proteins
Helen Cook | Rūdolfs Bērziņš | Cristina Leal Rodríguez | Juan Miguel Cejuela | Lars Juhl Jensen

Text mining automatically extracts information from the literature with the goal of making it available for further analysis, for example by incorporating it into biomedical databases. A key first step towards this goal is to identify and normalize the named entities, such as proteins and species, which are mentioned in text. Despite the large detrimental impact that viruses have on human and agricultural health, very little previous text-mining work has focused on identifying virus species and proteins in the literature. Here, we present an improved dictionary-based system for viral species and the first dictionary for viral proteins, which we benchmark on a new corpus of 300 manually annotated abstracts. We achieve 81.0% precision and 72.7% recall at the task of recognizing and normalizing viral species and 76.2% precision and 34.9% recall on viral proteins. These results are achieved despite the many challenges involved with the names of viral species and, especially, proteins. This work provides a foundation that can be used to extract more complicated relations about viruses from the literature.

pdf bib
Representation of complex terms in a vector space structured by an ontology for a normalization task
Arnaud Ferré | Pierre Zweigenbaum | Claire Nédellec

We propose in this paper a semi-supervised method for labeling terms of texts with concepts of a domain ontology. The method generates continuous vector representations of complex terms in a semantic space structured by the ontology. The proposed method relies on a distributional semantics approach, which generates initial vectors for each of the extracted terms. Then these vectors are embedded in the vector space constructed from the structure of the ontology. This embedding is carried out by training a linear model. Finally, we apply a distance calculation to determine the proximity between vectors of terms and vectors of concepts and thus to assign ontology labels to terms. We have evaluated the quality of these representations for a normalization task by using the concepts of an ontology as semantic labels. Normalization of terms is an important step in extracting part of the information contained in texts, but the vector space generated might find other applications. The performance of this method is comparable to that of the state of the art for this normalization task, opening up encouraging prospects.

pdf bib
Improving Correlation with Human Judgments by Integrating Semantic Similarity with Second–Order Vectors
Bridget McInnes | Ted Pedersen

Vector space methods that measure semantic similarity and relatedness often rely on distributional information such as co–occurrence frequencies or statistical measures of association to weight the importance of particular co–occurrences. In this paper, we extend these methods by incorporating a measure of semantic similarity based on a human curated taxonomy into a second–order vector representation. This results in a measure of semantic relatedness that combines the contextual information available in a corpus–based vector space representation with the semantic knowledge found in a biomedical ontology. Our results show that incorporating semantic similarity into second-order co-occurrence matrices improves correlation with human judgments for both similarity and relatedness, and that our method compares favorably to various word embedding methods that have recently been evaluated on the same reference standards we have used.

pdf bib
Proactive Learning for Named Entity Recognition
Maolin Li | Nhung Nguyen | Sophia Ananiadou

The goal of active learning is to minimise the cost of producing an annotated dataset, in which annotators are assumed to be perfect, i.e., they always choose the correct labels. However, in practice, annotators are not infallible, and they are likely to assign incorrect labels to some instances. Proactive learning is a generalisation of active learning that can model different kinds of annotators. Although proactive learning has been applied to certain labelling tasks, such as text classification, there is little work on its application to named entity (NE) tagging. In this paper, we propose a proactive learning method for producing NE annotated corpora, using two annotators with different levels of expertise, and who charge different amounts based on their levels of experience. To optimise both cost and annotation quality, we also propose a mechanism to present multiple sentences to annotators at each iteration. Experimental results for several corpora show that our method facilitates the construction of high-quality NE labelled datasets at minimal cost.

pdf bib
Biomedical Event Extraction using Abstract Meaning Representation
Sudha Rao | Daniel Marcu | Kevin Knight | Hal Daumé III

We propose a novel, Abstract Meaning Representation (AMR) based approach to identifying molecular events/interactions in biomedical text. Our key contributions are: (1) an empirical validation of our hypothesis that an event is a subgraph of the AMR graph, (2) a neural network-based model that identifies such an event subgraph given an AMR, and (3) a distant supervision based approach to gather additional training data. We evaluate our approach on the 2013 Genia Event Extraction dataset and show promising results.

pdf bib
Detecting Personal Medication Intake in Twitter: An Annotated Corpus and Baseline Classification System
Ari Klein | Abeed Sarker | Masoud Rouhizadeh | Karen O’Connor | Graciela Gonzalez

Social media sites (e.g., Twitter) have been used for surveillance of drug safety at the population level, but studies that focus on the effects of medications on specific sets of individuals have had to rely on other sources of data. Mining social media data for this information would require the ability to distinguish indications of personal medication intake in this media. Towards that end, this paper presents an annotated corpus that can be used to train machine learning systems to determine whether a tweet that mentions a medication indicates that the individual posting has taken that medication at a specific time. To demonstrate the utility of the corpus as a training set, we present baseline results of supervised classification.

pdf bib
Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-Gram Embeddings
Pieter Fivez | Simon Šuster | Walter Daelemans

We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. We greatly outperform two baseline off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of an optimized noisy channel model, showing that neural embeddings can be successfully exploited to include context-awareness in a spelling correction model.
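
The ranking idea in this abstract, scoring each replacement candidate by its cosine similarity to a weighted representation of the surrounding context, can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the two-dimensional toy vectors, the reciprocal-distance weighting of context words, and the function name `rank_candidates` are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(
    candidates: List[str],
    left_context: List[str],
    right_context: List[str],
    vectors: Dict[str, List[float]],
) -> List[str]:
    """Rank candidates by cosine similarity to a distance-weighted context vector.

    Assumption: a context word at distance d from the misspelling
    contributes its vector scaled by 1/d.
    """
    dim = len(next(iter(vectors.values())))
    context = [0.0] * dim
    # Reverse the left context so that distance 1 is the word adjacent to the gap.
    for words in (list(reversed(left_context)), right_context):
        for distance, word in enumerate(words, start=1):
            if word in vectors:
                for d in range(dim):
                    context[d] += vectors[word][d] / distance
    scored = [(cosine(vectors[c], context), c) for c in candidates if c in vectors]
    return [c for _, c in sorted(scored, reverse=True)]


# Hypothetical toy embeddings; real systems would use trained vectors.
toy_vectors = {
    "pain": [1.0, 0.0],
    "paint": [0.0, 1.0],
    "chest": [0.9, 0.1],
    "severe": [0.8, 0.2],
}
# Misspelling "pian" in the phrase "severe chest pian".
ranking = rank_candidates(["paint", "pain"], ["severe", "chest"], [], toy_vectors)
```

In this toy setting "pain" outranks "paint" because its vector points in nearly the same direction as the clinical context words, which is the intuition behind ranking by semantic fit rather than by corpus frequency alone.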

pdf bib
Characterization of Divergence in Impaired Speech of ALS Patients
Archna Bhatia | Bonnie Dorr | Kristy Hollingshead | Samuel L. Phillips | Barbara McKenzie

Approximately 80% to 95% of patients with Amyotrophic Lateral Sclerosis (ALS) eventually develop speech impairments, such as defective articulation, slow laborious speech and hypernasality. The relationship between impaired speech and asymptomatic speech may be seen as a divergence from a baseline. This relationship can be characterized in terms of measurable combinations of phonological characteristics that are indicative of the degree to which the two diverge. We demonstrate that divergence measurements based on phonological characteristics of speech correlate with physiological assessments of ALS. Speech-based assessments offer benefits over commonly-used physiological assessments in that they are inexpensive, non-intrusive, and do not require trained clinical personnel for administering and interpreting the results.

pdf bib
Deep Learning for Punctuation Restoration in Medical Reports
Wael Salloum | Greg Finley | Erik Edwards | Mark Miller | David Suendermann-Oeft

In clinical dictation, speakers try to be as concise as possible to save time, often resulting in utterances without explicit punctuation commands. Since the end product of a dictated report, e.g. an out-patient letter, does require correct orthography, including exact punctuation, the latter need to be restored, preferably by automated means. This paper describes a method for punctuation restoration based on a state-of-the-art stack of NLP and machine learning techniques including B-RNNs with an attention mechanism and late fusion, as well as a feature extraction technique tailored to the processing of medical terminology using a novel vocabulary reduction model. To the best of our knowledge, the resulting performance is superior to that reported in prior art on similar tasks.

pdf bib
Unsupervised Domain Adaptation for Clinical Negation Detection
Timothy Miller | Steven Bethard | Hadi Amiri | Guergana Savova

Detecting negated concepts in clinical texts is an important part of NLP information extraction systems. However, generalizability of negation systems is lacking, as cross-domain experiments suffer dramatic performance losses. We examine the performance of multiple unsupervised domain adaptation algorithms on clinical negation detection, finding only modest gains that fall well short of in-domain performance.

pdf bib
BioCreative VI Precision Medicine Track: creating a training corpus for mining protein-protein interactions affected by mutations
Rezarta Islamaj Doğan | Andrew Chatr-aryamontri | Sun Kim | Chih-Hsuan Wei | Yifan Peng | Donald Comeau | Zhiyong Lu

The Precision Medicine Track in BioCreative VI aims to bring together the BioNLP community for a novel challenge focused on mining the biomedical literature in search of mutations and protein-protein interactions (PPI). In order to support this track with an effective training dataset with limited curator time, the track organizers carefully reviewed PubMed articles from two different sources: curated public PPI databases, and the results of state-of-the-art public text mining tools. We detail here the data collection, manual review and annotation process and describe this training corpus characteristics. We also describe a corpus performance baseline. This analysis will provide useful information to developers and researchers for comparing and developing innovative text mining approaches for the BioCreative VI challenge and other Precision Medicine related applications.

pdf bib
Painless Relation Extraction with Kindred
Jake Lever | Steven Jones

Relation extraction methods are essential for creating robust text mining tools to help researchers find useful knowledge in the vast published literature. Easy-to-use and generalizable methods are needed to encourage an ecosystem in which researchers can easily use shared resources and build upon each other’s methods. We present the Kindred Python package for relation extraction. It builds upon methods from the most successful tools in the recent BioNLP Shared Task to produce high-quality predictions with low computational cost. It also integrates with PubAnnotation, PubTator, and BioNLP Shared Task data in order to allow easy development and application of relation extraction models.

pdf bib
Noise Reduction Methods for Distantly Supervised Biomedical Relation Extraction
Gang Li | Cathy Wu | K. Vijay-Shanker

Distant supervision has been applied to automatically generate labeled data for biomedical relation extraction. Noise exists in both positively and negatively-labeled data and affects the performance of supervised machine learning methods. In this paper, we propose three novel heuristics based on the notion of proximity, trigger word and confidence of patterns to leverage lexical and syntactic information to reduce the level of noise in the distantly labeled data. Experiments on three different tasks, extraction of protein-protein-interaction, miRNA-gene regulation relation and protein-localization event, show that the proposed methods can improve the F-score over the baseline by 6, 10 and 14 points for the three tasks, respectively. We also show that when the models are configured to output high-confidence results, high precisions can be obtained using the proposed methods, making them promising for facilitating manual curation for databases.

pdf bib
Role-Preserving Redaction of Medical Records to Enable Ontology-Driven Processing
Seth Polsley | Atif Tahir | Muppala Raju | Akintayo Akinleye | Duane Steward

Electronic medical records (EMR) have largely replaced hand-written patient files in healthcare. The growing pool of EMR data presents a significant resource in medical research, but the U.S. Health Insurance Portability and Accountability Act (HIPAA) mandates redacting medical records before performing any analysis on the same. This process complicates obtaining medical data and can remove much useful information from the record. As part of a larger project involving ontology-driven medical processing, we employ a method of recognizing protected health information (PHI) that maps to ontological terms. We then use the relationships defined in the ontology to redact medical texts so that roles and semantics of terms are retained without compromising anonymity. The method is evaluated by clinical experts on several hundred medical documents, achieving up to a 98.8% f-score, and has already shown promise for retaining semantic information in later processing.

pdf bib
Annotation of pain and anesthesia events for surgery-related processes and outcomes extraction
Wen-wai Yim | Dario Tedesco | Catherine Curtin | Tina Hernandez-Boussard

Pain and anesthesia information are crucial elements in identifying surgery-related processes and outcomes. However, pain is not consistently recorded in the electronic medical record. Even when recorded, the rich, complex granularity of the pain experience may be lost. Similarly, anesthesia information is recorded using local electronic collection systems, though the accuracy and completeness of the information are unknown. We propose an annotation schema to capture pain, pain management, and anesthesia event information.

pdf bib
Identifying Comparative Structures in Biomedical Text
Samir Gupta | A.S.M. Ashique Mahmood | Karen Ross | Cathy Wu | K. Vijay-Shanker

Comparison sentences are very commonly used by authors in biomedical literature to report results of experiments. In such comparisons, authors typically make observations under two different scenarios. In this paper, we present a system to automatically identify such comparative sentences and their components, i.e., the compared entities, the scale of the comparison and the aspect on which the entities are being compared. Our methodology is based on dependencies obtained by applying a parser to extract a wide range of comparison structures. We evaluated our system for its effectiveness in identifying comparisons and their components. The system achieved an F-score of 0.87 for comparison sentence identification and 0.77-0.81 for identifying its components.

pdf bib
Tagging Funding Agencies and Grants in Scientific Articles using Sequential Learning Models
Subhradeep Kayal | Zubair Afzal | George Tsatsaronis | Sophia Katrenko | Pascal Coupet | Marius Doornenbal | Michelle Gregory

In this paper we present a solution for tagging funding bodies and grants in scientific articles using a combination of trained sequential learning models, namely conditional random fields (CRF), hidden Markov models (HMM) and maximum entropy models (MaxEnt), on a benchmark set created in-house. We apply the trained models to address the BioASQ challenge 5c, which is a newly introduced task that aims to solve the problem of funding information extraction from scientific articles. Results in the dry-run data set of BioASQ task 5c show that the suggested approach can achieve a micro-recall of more than 85% in tagging both funding bodies and grants.

pdf bib
Deep Learning for Biomedical Information Retrieval: Learning Textual Relevance from Click Logs
Sunil Mohan | Nicolas Fiorini | Sun Kim | Zhiyong Lu

We describe a Deep Learning approach to modeling the relevance of a document’s text to a query, applied to biomedical literature. Instead of mapping each document and query to a common semantic space, we compute a variable-length difference vector between the query and document which is then passed through a deep convolution stage followed by a deep regression network to produce the estimated probability of the document’s relevance to the query. Despite the small amount of training data, this approach produces a more robust predictor than computing similarities between semantic vector representations of the query and document, and also results in significant improvements over traditional IR text factors. In the future, we plan to explore its application in improving PubMed search.

pdf bib
Detecting Dementia through Retrospective Analysis of Routine Blog Posts by Bloggers with Dementia
Vaden Masrani | Gabriel Murray | Thalia Field | Giuseppe Carenini

We investigate if writers with dementia can be automatically distinguished from those without by analyzing linguistic markers in written text, in the form of blog posts. We have built a corpus of several thousand blog posts, some by people with dementia and others by people with loved ones with dementia. We use this dataset to train and test several machine learning methods, and achieve prediction performance at a level far above the baseline.

pdf bib
Protein Word Detection using Text Segmentation Techniques
Devi Ganesan | Ashish V. Tendulkar | Sutanu Chakraborti

Literature in Molecular Biology is abundant with linguistic metaphors. There have been works in the past that attempt to draw parallels between linguistics and biology, driven by the fundamental premise that proteins have a language of their own. Since word detection is crucial to the decipherment of any unknown language, we attempt to establish a problem mapping from natural language text to protein sequences at the level of words. Towards this end, we explore the use of an unsupervised text segmentation algorithm for the task of extracting “biological words” from protein sequences. In particular, we demonstrate the effectiveness of using domain knowledge to complement data-driven approaches in the text segmentation task, as well as in its biological counterpart. We also propose a novel extrinsic evaluation measure for protein words through protein family classification.

External Evaluation of Event Extraction Classifiers for Automatic Pathway Curation: An extended study of the mTOR pathway
Wojciech Kusa | Michael Spranger

This paper evaluates the impact of various event extraction systems on automatic pathway curation using the popular mTOR pathway. We quantify the impact of training data sets as well as different machine learning classifiers and show that some improve the quality of automatically extracted pathways.

Toward Automated Early Sepsis Alerting: Identifying Infection Patients from Nursing Notes
Emilia Apostolova | Tom Velez

Severe sepsis and septic shock are conditions that affect millions of patients and have a mortality rate close to 50%. Early identification of at-risk patients significantly improves outcomes. Electronic surveillance tools have been developed to monitor structured Electronic Medical Records and automatically recognize early signs of sepsis. However, many sepsis risk factors (e.g. symptoms and signs of infection) are often captured only in free text clinical notes. In this study, we developed a method for automatic monitoring of nursing notes for signs and symptoms of infection. We utilized a creative approach to automatically generate an annotated dataset. The dataset was used to create a Machine Learning model that achieved an F1-score ranging from 79 to 96%.

Enhancing Automatic ICD-9-CM Code Assignment for Medical Texts with PubMed
Danchen Zhang | Daqing He | Sanqiang Zhao | Lei Li

Assigning a standard ICD-9-CM code to disease symptoms in medical texts is an important task in the medical domain. Automating this process could greatly reduce costs. However, the effectiveness of an automatic ICD-9-CM code classifier can be seriously undermined by unbalanced training data. Frequent diseases often have more training data, so they tend to be classified more accurately than infrequent diseases. However, a disease’s frequency does not necessarily reflect its importance. To resolve this training data shortage problem, we propose to strategically draw data from PubMed to enrich the training data when needed. We validate our method on the CMC dataset, and the evaluation results indicate that our method can significantly improve the code assignment classifiers’ performance at the macro-averaging level.
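
The reference to macro-averaging can be made concrete with a small sketch. The Python below is illustrative only: the label names and the (true positive, false positive, false negative) counts are invented toy values, not figures from the paper. It shows why a macro-averaged F1 is sensitive to rare codes while a micro-averaged F1 is dominated by frequent ones.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts maps a label to its (tp, fp, fn) tuple."""
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent labels dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Hypothetical toy counts: one frequent code, one rare code.
toy_counts = {"frequent_code": (90, 10, 10), "rare_code": (1, 1, 9)}
macro, micro = macro_micro_f1(toy_counts)
print(f"macro-F1={macro:.3f} micro-F1={micro:.3f}")
```

In this toy setting the rare code pulls the macro average down sharply while barely moving the micro average, which is why gains on infrequent codes show up mainly at the macro level.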

Evaluating Feature Extraction Methods for Knowledge-based Biomedical Word Sense Disambiguation
Sam Henry | Clint Cuffy | Bridget McInnes

In this paper, we present an analysis of feature extraction methods via dimensionality reduction for the task of biomedical Word Sense Disambiguation (WSD). We modify the vector representations in the 2-MRD WSD algorithm, and evaluate four dimensionality reduction methods: Word Embeddings using Continuous Bag of Words and Skip Gram, Singular Value Decomposition (SVD), and Principal Component Analysis (PCA). We also evaluate the effects of vector size on the performance of each of these methods. Results are evaluated on five standard evaluation datasets (Abbrev.100, Abbrev.200, Abbrev.300, NLM-WSD, and MSH-WSD). We find that vector sizes of 100 are sufficient for all techniques except SVD, for which a vector size of 1500 is preferred. We also show that SVD performs on par with Word Embeddings for all but one dataset.

Investigating the Documentation of Electronic Cigarette Use in the Veteran Affairs Electronic Health Record: A Pilot Study
Danielle Mowery | Brett South | Olga Patterson | Shu-Hong Zhu | Mike Conway

In this paper, we present pilot work on characterising the documentation of electronic cigarettes (e-cigarettes) in the United States Veterans Administration Electronic Health Record. The Veterans Health Administration is the largest health care system in the United States with 1,233 health care facilities nationwide, serving 8.9 million veterans per year. We identified a random sample of 2000 Veterans Administration patients, coded as current tobacco users, from 2008 to 2014. Using simple keyword matching techniques combined with qualitative analysis, we investigated the prevalence and distribution of e-cigarette terms in these clinical notes, discovering that for current smokers, 11.9% of patient records contain an e-cigarette related term.
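
A minimal sketch of simple keyword matching over clinical notes follows. The term list, the note texts and the function names are all hypothetical; the abstract does not give the study's actual keywords, and the real analysis also involved qualitative review.

```python
import re

# Hypothetical term list; the study's actual keywords are not given in the abstract.
ECIG_PATTERN = re.compile(
    r"\b(?:e-?cig(?:arette)?s?|electronic\s+cigarettes?|vap(?:e|es|ing))\b",
    re.IGNORECASE,
)


def patients_with_mention(notes_by_patient):
    """Return the ids of patients with at least one note matching the pattern."""
    return {
        pid
        for pid, notes in notes_by_patient.items()
        if any(ECIG_PATTERN.search(note) for note in notes)
    }


def prevalence(notes_by_patient):
    """Fraction of patients whose records contain an e-cigarette related term."""
    if not notes_by_patient:
        return 0.0
    return len(patients_with_mention(notes_by_patient)) / len(notes_by_patient)


# Invented example notes, for illustration only.
toy_notes = {
    "p1": ["Pt reports vaping daily."],
    "p2": ["Denies tobacco use."],
    "p3": ["Switched to E-cigarettes in 2013.", "No change."],
    "p4": ["Smokes 1 ppd."],
}
print(sorted(patients_with_mention(toy_notes)), prevalence(toy_notes))
```

Counting at the patient level rather than the note level matches the way the prevalence figure is reported, as a share of patient records.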

Automated Preamble Detection in Dictated Medical Reports
Wael Salloum | Greg Finley | Erik Edwards | Mark Miller | David Suendermann-Oeft

Dictated medical reports very often feature a preamble containing metainformation about the report such as patient and physician names, location and name of the clinic, date of procedure, and so on. In the medical transcription process, the preamble is usually omitted from the final report, as it contains information already available in the electronic medical record. We present a method which is able to automatically identify preambles in medical dictations. The method makes use of state-of-the-art NLP techniques including word embeddings and Bi-LSTMs and achieves preamble detection performance superior to humans.

A Biomedical Question Answering System in BioASQ 2017
Mourad Sarrouti | Said Ouatik El Alaoui

Question answering, the identification of short accurate answers to users’ questions, is a longstanding challenge widely studied over the last decades in the open domain. However, it still requires further efforts in the biomedical domain. In this paper, we describe our participation in phase B of task 5b in the 2017 BioASQ challenge using our biomedical question answering system. Our system, dealing with four types of questions (i.e., yes/no, factoid, list, and summary), is based on (1) a dictionary-based approach for generating the exact answers of yes/no questions, (2) UMLS metathesaurus and term frequency metric for extracting the exact answers of factoid and list questions, and (3) the BM25 model and UMLS concepts for retrieving the ideal answers (i.e., paragraph-sized summaries). Preliminary results show that our system achieves good and competitive results in both exact and ideal answers extraction tasks as compared with the participating systems.
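
For readers unfamiliar with the BM25 model mentioned in item (3), the sketch below implements one standard Okapi BM25 variant in plain Python. It is a generic illustration, not the authors' system: the parameter defaults `k1=1.2` and `b=0.75` are common conventions assumed here, the example passages are invented, and the UMLS concept component is not modelled.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented toy passages, already tokenized and lowercased.
passages = [
    "insulin regulates glucose uptake in muscle cells".split(),
    "the mtor pathway controls cell growth".split(),
    "glucose levels rise when insulin signalling fails".split(),
]
print(bm25_scores("insulin glucose".split(), passages))
```

The function assumes a non-empty document list; passages that share no term with the query score exactly zero.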

Adapting Pre-trained Word Embeddings For Use In Medical Coding
Kevin Patel | Divya Patel | Mansi Golakiya | Pushpak Bhattacharyya | Nilesh Birari

Word embeddings are a crucial component in modern NLP. Pre-trained embeddings released by different groups have been a major reason for their popularity. However, they are trained on generic corpora, which limits their direct use for domain specific tasks. In this paper, we propose a method to add task specific information to pre-trained word embeddings. Such information can improve their utility. We add information from medical coding data, as well as the first level from the hierarchy of ICD-10 medical code set to different pre-trained word embeddings. We adapt CBOW algorithm from the word2vec package for our purpose. We evaluated our approach on five different pre-trained word embeddings. Both the original word embeddings, and their modified versions (the ones with added information) were used for automated review of medical coding. The modified word embeddings give an improvement in f-score by 1% on the 5-fold evaluation on a private medical claims dataset. Our results show that adding extra information is possible and beneficial for the task at hand.

Initializing neural networks for hierarchical multi-label text classification
Simon Baker | Anna Korhonen

Many tasks in the biomedical domain require the assignment of one or more predefined labels to input text, where the labels are a part of a hierarchical structure (such as a taxonomy). The conventional approach is to use a one-vs.-rest (OVR) classification setup, where a binary classifier is trained for each label in the taxonomy or ontology where all instances not belonging to the class are considered negative examples. The main drawbacks to this approach are that dependencies between classes are not leveraged in the training and classification process, and the additional computational cost of training parallel classifiers. In this paper, we apply a new method for hierarchical multi-label text classification that initializes a neural network model final hidden layer such that it leverages label co-occurrence relations such as hypernymy. This approach elegantly lends itself to hierarchical classification. We evaluated this approach using two hierarchical multi-label text classification tasks in the biomedical domain using both sentence- and document-level classification. Our evaluation shows promising results for this approach.

Biomedical Event Trigger Identification Using Bidirectional Recurrent Neural Network Based Models
Rahul V S S Patchigolla | Sunil Sahu | Ashish Anand

Biomedical events describe complex interactions between various biomedical entities. An event trigger is a word or phrase that typically signifies the occurrence of an event. Event trigger identification is an important first step in all event extraction methods. However, many current approaches either rely on complex hand-crafted features or consider features only within a window. In this paper, we propose a method that takes advantage of a recurrent neural network (RNN) to extract higher-level features present across the sentence. Using the hidden state representation of the RNN along with word and entity type embeddings as features thus avoids relying on complex hand-crafted features generated using various NLP toolkits. Our experiments achieve a state-of-the-art F1-score on the Multi Level Event Extraction (MLEE) corpus. We also perform a category-wise analysis of the results and discuss the importance of various features in the trigger identification task.

Representations of Time Expressions for Temporal Relation Extraction with Convolutional Neural Networks
Chen Lin | Timothy Miller | Dmitriy Dligach | Steven Bethard | Guergana Savova

Token sequences are often used as the input for Convolutional Neural Networks (CNNs) in natural language processing. However, they might not be an ideal representation for time expressions, which are long, highly varied, and semantically complex. We describe a method for representing time expressions with single pseudo-tokens for CNNs. With this method, we establish a new state-of-the-art result for a clinical temporal relation extraction task.
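
The idea of representing a time expression with a single pseudo-token can be sketched as a preprocessing step. The Python below is a minimal illustration under stated assumptions: the pseudo-token string `<TIMEX>`, the function name and the example sentence are hypothetical, and the time-expression spans are assumed to come from an upstream annotator.

```python
def replace_time_expressions(tokens, spans, pseudo_token="<TIMEX>"):
    """Collapse each time-expression span into a single pseudo-token.

    spans holds half-open, non-overlapping (start, end) token index ranges.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.extend(tokens[cursor:start])  # copy tokens before the span
        out.append(pseudo_token)          # one token stands in for the span
        cursor = end
    out.extend(tokens[cursor:])           # copy the remaining tokens
    return out


# Invented example sentence with two annotated time expressions.
tokens = "She was admitted on March 3 , 2014 and discharged two days later".split()
spans = [(4, 8), (10, 13)]
print(replace_time_expressions(tokens, spans))
```

The CNN then sees a shorter, more regular token sequence in which each time expression occupies exactly one position.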

Automatic Diagnosis Coding of Radiology Reports: A Comparison of Deep Learning and Conventional Classification Methods
Sarvnaz Karimi | Xiang Dai | Hamed Hassanzadeh | Anthony Nguyen

Diagnosis autocoding services and research intend to both improve the productivity of clinical coders and the accuracy of the coding. It is an important step in data analysis for funding and reimbursement, as well as health services planning and resource allocation. We investigate the applicability of deep learning at autocoding of radiology reports using International Classification of Diseases (ICD). Deep learning methods are known to require large training data. Our goal is to explore how to use these methods when the training data is sparse, skewed and relatively small, and how their effectiveness compares to conventional methods. We identify optimal parameters that could be used in setting up a convolutional neural network for autocoding with comparable results to that of conventional methods.

Automatic classification of doctor-patient questions for a virtual patient record query task
Leonardo Campillos Llanos | Sophie Rosset | Pierre Zweigenbaum

We present the work-in-progress of automating the classification of doctor-patient questions in the context of a simulated consultation with a virtual patient. We classify questions according to the computational strategy (rule-based or other) needed for looking up data in the clinical record. We compare ‘traditional’ machine learning methods (Gaussian and Multinomial Naive Bayes, and Support Vector Machines) and a neural network classifier (FastText). We obtained the best results with the SVM using semantic annotations, whereas the neural classifier achieved promising results without it.

Assessing the performance of Olelo, a real-time biomedical question answering application
Mariana Neves | Fabian Eckert | Hendrik Folkerts | Matthias Uflacker

Question answering (QA) can support physicians and biomedical researchers to find answers to their questions in the scientific literature. Such systems process large collections of documents in real time and include many natural language processing (NLP) procedures. We recently developed Olelo, a QA system for biomedicine which includes various NLP components, such as question processing, document and passage retrieval, answer processing and multi-document summarization. In this work, we present an evaluation of our system on the fifth BioASQ challenge. We participated with the current state of the application and with an extension based on semantic role labeling that we are currently investigating. In addition to the BioASQ evaluation, we compared our system to other on-line biomedical QA systems in terms of the response time and the quality of the answers.

Clinical Event Detection with Hybrid Neural Architecture
Adyasha Maharana | Meliha Yetisgen

Event detection from clinical notes has been traditionally solved with rule based and statistical natural language processing (NLP) approaches that require extensive domain knowledge and feature engineering. In this paper, we have explored the feasibility of approaching this task with recurrent neural networks, clinical word embeddings and introduced a hybrid architecture to improve detection for entities with smaller representation in the dataset. A comparative analysis is also done which reveals the complementary behavior of neural networks and conditional random fields in clinical entity detection.

Extracting Personal Medical Events for User Timeline Construction using Minimal Supervision
Aakanksha Naik | Chris Bogart | Carolyn Rose

In this paper, we describe a system for automatic construction of user disease progression timelines from their posts in online support groups using minimal supervision. In recent years, several online support groups have been established which has led to a huge increase in the amount of patient-authored text available. Creating systems which can automatically extract important medical events and create disease progression timelines for users from such text can help in patient health monitoring as well as studying links between medical events and users’ participation in support groups. Prior work in this domain has used manually constructed keyword sets to detect medical events. In this work, our aim is to perform medical event detection using minimal supervision in order to develop a more general timeline construction system. Our system achieves an accuracy of 55.17%, which is 92% of the performance achieved by a supervised baseline system.

Detecting mentions of pain and acute confusion in Finnish clinical text
Hans Moen | Kai Hakala | Farrokh Mehryary | Laura-Maria Peltonen | Tapio Salakoski | Filip Ginter | Sanna Salanterä

We study and compare two different approaches to the task of automatic assignment of predefined classes to clinical free-text narratives. In the first approach this is treated as a traditional mention-level named-entity recognition task, while the second approach treats it as a sentence-level multi-label classification task. Performance comparison across these two approaches is conducted in the form of sentence-level evaluation and state-of-the-art methods for both approaches are evaluated. The experiments are done on two data sets consisting of Finnish clinical text, manually annotated with respect to the topics pain and acute confusion. Our results suggest that the mention-level named-entity recognition approach outperforms sentence-level classification overall, but the latter approach still manages to achieve the best prediction scores on several annotation classes.

A Multi-strategy Query Processing Approach for Biomedical Question Answering: USTB_PRIR at BioASQ 2017 Task 5B
Zan-Xia Jin | Bo-Wen Zhang | Fan Fang | Le-Le Zhang | Xu-Cheng Yin

This paper describes the participation of the USTB_PRIR team in the 2017 BioASQ 5B on question answering, including the document retrieval, snippet retrieval, and concept retrieval tasks. We introduce different multimodal query processing strategies to enrich query terms and assign different weights to them. Specifically, sequential dependence model (SDM), pseudo-relevance feedback (PRF), fielded sequential dependence model (FSDM) and Divergence from Randomness model (DFRM) are respectively performed on different fields of PubMed articles, sentences extracted from relevant articles, the five terminologies or ontologies (MeSH, GO, Jochem, Uniprot and DO) to achieve better search performances. Preliminary results show that our systems outperform others in the document and snippet retrieval tasks in the first two batches.

Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing
Martin Riedl | Swapna Somasundaran | Goran Glavaš | Eduard Hovy

On the “Calligraphy” of Books
Vanessa Queiroz Marinho | Henrique Ferraz de Arruda | Thales Sinelli | Luciano da Fontoura Costa | Diego Raphael Amancio

Authorship attribution is a natural language processing task that has been widely studied, often by considering small order statistics. In this paper, we explore a complex network approach to assign the authorship of texts based on their mesoscopic representation, in an attempt to capture the flow of the narrative. Indeed, as reported in this work, such an approach allowed the identification of the dominant narrative structure of the studied authors. This has been achieved due to the ability of the mesoscopic approach to take into account relationships between different, not necessarily adjacent, parts of the text, which is able to capture the story flow. The potential of the proposed approach has been illustrated through principal component analysis, a comparison with the chance baseline method, and network visualization. Such visualizations reveal individual characteristics of the authors, which can be understood as a kind of calligraphy.

Adapting predominant and novel sense discovery algorithms for identifying corpus-specific sense differences
Binny Mathew | Suman Kalyan Maity | Pratip Sarkar | Animesh Mukherjee | Pawan Goyal

Word senses are not static and may have temporal, spatial or corpus-specific scopes. Identifying such scopes might greatly benefit existing WSD systems. In this paper, while studying corpus-specific word senses, we adapt three existing predominant and novel-sense discovery algorithms to identify these corpus-specific senses. We make use of text data available in the form of millions of digitized books and newspaper archives as two different sources of corpora and propose automated methods to identify corpus-specific word senses at various time points. We conduct an extensive and thorough human judgement experiment to rigorously evaluate and compare the performance of these approaches. Post adaptation, the outputs of the three algorithms are in the same format and the accuracy results are also comparable, with roughly 45-60% of the reported corpus-specific senses being judged as genuine.

Merging knowledge bases in different languages
Jerónimo Hernández-González | Estevam R. Hruschka Jr. | Tom M. Mitchell

Recently, different systems which learn to populate and extend a knowledge base (KB) from the web in different languages have been presented. Although a large set of concepts should be learnt independently from the language used to read, there are facts which are expected to be more easily gathered in local language (e.g., culture or geography). A system that merges KBs learnt in different languages will benefit from the complementary information as long as common beliefs are identified, as well as from redundancy present in web pages written in different languages. In this paper, we deal with the problem of identifying equivalent beliefs (or concepts) across language specific KBs, assuming that they share the same ontology of categories and relations. In a case study with two KBs independently learnt from different inputs, namely web pages written in English and web pages written in Portuguese respectively, we report on the results of two methodologies: an approach based on personalized PageRank and an inference technique to find out common relevant paths through the KBs. The proposed inference technique efficiently identifies relevant paths, outperforming the baseline (a dictionary-based classifier) in the vast majority of tested categories.
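
The personalized PageRank primitive named in the abstract can be sketched with a short power iteration. This is a generic illustration rather than the authors' procedure: the toy graph, the damping value `0.85`, the iteration count and the choice to return the mass of dangling nodes to the seed are all assumptions made here.

```python
def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Power iteration with all restart probability placed on one seed node.

    graph maps each node to a list of successor nodes; every successor
    must also appear as a key.
    """
    nodes = list(graph)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        # Restart step: (1 - damping) of the mass jumps back to the seed.
        nxt = {n: (1.0 - damping) if n == seed else 0.0 for n in nodes}
        for n in nodes:
            successors = graph[n]
            if successors:
                share = damping * rank[n] / len(successors)
                for m in successors:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seed (an assumption).
                nxt[seed] += damping * rank[n]
        rank = nxt
    return rank


# Hypothetical toy graph standing in for a small knowledge-base neighbourhood.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(personalized_pagerank(toy_graph, seed="a"))
```

Nodes close to the seed receive higher scores, which is what makes the measure usable for comparing the neighbourhoods of candidate equivalent concepts.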

Parameter Free Hierarchical Graph-Based Clustering for Analyzing Continuous Word Embeddings
Thomas Alexander Trost | Dietrich Klakow

Word embeddings are high-dimensional vector representations of words and are thus difficult to interpret. In order to deal with this, we introduce an unsupervised parameter free method for creating a hierarchical graphical clustering of the full ensemble of word vectors and show that this structure is a geometrically meaningful representation of the original relations between the words. This newly obtained representation can be used for better understanding and thus improving the embedding algorithm and exhibits semantic meaning, so it can also be utilized in a variety of language processing tasks like categorization or measuring similarity.

Spectral Graph-Based Method of Multimodal Word Embedding
Kazuki Fukui | Takamasa Oshikiri | Hidetoshi Shimodaira

In this paper, we propose a novel method for multimodal word embedding, which exploits a generalized framework of multi-view spectral graph embedding to take into account visual appearances or scenes denoted by words in a corpus. We evaluated our method through word similarity tasks and a concept-to-image search task, and found that it provides word representations that reflect visual information, while somewhat trading off performance on the word similarity tasks. Moreover, we demonstrate that our method captures multimodal linguistic regularities, which enable recovering relational similarities between words and images by vector arithmetic.

Graph Methods for Multilingual FrameNets
Collin F. Baker | Michael Ellsworth

This paper introduces a new, graph-based view of the data of the FrameNet project, which we hope will make it easier to understand the mixture of semantic and syntactic information contained in FrameNet annotation. We show how English FrameNet and other Frame Semantic resources can be represented as sets of interconnected graphs of frames, frame elements, semantic types, and annotated instances of them in text. We display examples of the new graphical representation based on the annotations, which combine Frame Semantics and Construction Grammar, thus capturing most of the syntax and semantics of each sentence. We consider how graph theory could help researchers to make better use of FrameNet data for tasks such as automatic Frame Semantic role labeling, paraphrasing, and translation. Finally, we describe the development of FrameNet-like lexical resources for other languages in the current Multilingual FrameNet project, which seeks to discover cross-lingual alignments, both in the lexicon (for frames and lexical units within frames) and across parallel or comparable texts. We conclude with an example showing graphically the semantic and syntactic similarities and differences between parallel sentences in English and Japanese. We will release software for displaying such graphs from the current data releases.

Extract with Order for Coherent Multi-Document Summarization
Mir Tafseer Nayeem | Yllias Chali

In this work, we aim at developing an extractive summarizer in the multi-document setting. We implement a rank based sentence selection using continuous vector representations along with key-phrases. Furthermore, we propose a model to tackle summary coherence for increasing readability. We conduct experiments on the Document Understanding Conference (DUC) 2004 datasets using ROUGE toolkit. Our experiments demonstrate that the methods bring significant improvements over the state of the art methods in terms of informativity and coherence.

Work Hard, Play Hard: Email Classification on the Avocado and Enron Corpora
Sakhar Alkhereyf | Owen Rambow

In this paper, we present an empirical study of email classification into two main categories “Business” and “Personal”. We train on the Enron email corpus, and test on the Enron and Avocado email corpora. We show that information from the email exchange networks improves the performance of classification. We represent the email exchange networks as social networks with graph structures. For this classification task, we extract social networks features from the graphs in addition to lexical features from email content and we compare the performance of SVM and Extra-Trees classifiers using these features. Combining graph features with lexical features improves the performance on both classifiers. We also provide manually annotated sets of the Avocado and Enron email corpora as a supplementary contribution.

A Graph Based Semi-Supervised Approach for Analysis of Derivational Nouns in Sanskrit
Amrith Krishna | Pavankumar Satuluri | Harshavardhan Ponnada | Muneeb Ahmed | Gulab Arora | Kaustubh Hiware | Pawan Goyal

Derivational nouns are widely used in Sanskrit corpora and represent an important cornerstone of productivity in the language. Currently there exists no analyser that identifies the derivational nouns. We propose a semi-supervised approach for identification of derivational nouns in Sanskrit. We not only identify the derivational words, but also link them to their corresponding source words. Our novelty comes in the design of the network structure for the task. The edge weights are featurised based on the phonetic, morphological, syntactic and semantic similarity shared between the words to be identified. We find that our model is effective for the task, even when we employ a labelled dataset that is only 5% of the entire dataset.

Evaluating text coherence based on semantic similarity graph
Jan Wira Gotama Putra | Takenobu Tokunaga

Coherence is a crucial feature of text because it is indispensable for conveying its communication purpose and meaning to its readers. In this paper, we propose an unsupervised text coherence scoring based on graph construction in which edges are established between semantically similar sentences represented by vertices. The sentence similarity is calculated based on the cosine similarity of semantic vectors representing sentences. We provide three graph construction methods establishing an edge from a given vertex to a preceding adjacent vertex, to a single similar vertex, or to multiple similar vertices. We evaluated our methods in the document discrimination task and the insertion task by comparing our proposed methods to the supervised (Entity Grid) and unsupervised (Entity Graph) baselines. In the document discrimination task, our method outperformed the unsupervised baseline but did not outperform the supervised baseline, while in the insertion task, our method outperformed both baselines.
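
The three graph construction strategies can be sketched as follows. This Python is a loose illustration, not the authors' implementation: the strategy names, the similarity threshold of `0.5`, the toy sentence vectors and the scoring rule (total edge weight divided by the number of sentences) are assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_edges(vectors, strategy, threshold=0.5):
    """Return (source, target, weight) edges between sentence vertices."""
    n = len(vectors)
    edges = []
    for i in range(n):
        others = [(cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i]
        if strategy == "adjacent":      # edge to the preceding sentence
            if i > 0:
                edges.append((i, i - 1, cosine(vectors[i], vectors[i - 1])))
        elif strategy == "single":      # edge to the single most similar sentence
            if others:
                weight, j = max(others)
                edges.append((i, j, weight))
        elif strategy == "multiple":    # edges to every sufficiently similar sentence
            edges.extend((i, j, w) for w, j in others if w >= threshold)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return edges


def coherence_score(vectors, strategy, threshold=0.5):
    """Assumed scoring rule: total edge weight divided by sentence count."""
    if not vectors:
        return 0.0
    edges = build_edges(vectors, strategy, threshold)
    return sum(w for _, _, w in edges) / len(vectors)


# Toy sentence vectors: the first two sentences are identical in meaning.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for name in ("adjacent", "single", "multiple"):
    print(name, round(coherence_score(toy_vectors, name), 3))
```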

Proceedings of the 10th Workshop on Building and Using Comparable Corpora
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp

Users and Data: The Two Neglected Children of Bilingual Natural Language Processing Research
Philippe Langlais

Despite numerous studies devoted to mining parallel material from bilingual data, we have yet to see the resulting technologies wholeheartedly adopted by professional translators and terminologists alike. I argue that this state of affairs is mainly due to two factors: the emphasis published authors put on models (even though data is as important), and the conspicuous lack of concern for actual end-users.

Deep Investigation of Cross-Language Plagiarism Detection Methods
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès

This paper is a deep investigation of cross-language plagiarism detection methods on a new recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs on 2 granularities of text units in order to draw robust conclusions on the best methods while deeply analyzing correlations across document styles and languages.

Sentence Alignment using Unfolding Recursive Autoencoders
Jeenu Grover | Pabitra Mitra

In this paper, we propose a novel two-step algorithm for sentence alignment in monolingual corpora using Unfolding Recursive Autoencoders. First, we use unfolding recursive auto-encoders (RAE) to learn feature vectors for phrases in the syntactic tree of the sentence. To compare two sentences we use a similarity matrix which has dimensions proportional to the size of the two sentences. Since the similarity matrix generated to compare two sentences has varying dimensions due to different sentence lengths, a dynamic pooling layer is used to map it to a matrix of fixed dimension. The resulting matrix is used to calculate the similarity scores between the two sentences. The second step of the algorithm captures the contexts in which the sentences occur in the document by using a dynamic programming algorithm for global alignment.
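
A dynamic pooling layer of the kind described can be sketched in a few lines. The version below is an assumption-laden illustration: it uses max pooling over roughly equal regions of a similarity matrix, whereas the paper's exact pooling operator and grid size are not stated in the abstract; the matrix values are invented.

```python
def dynamic_pool(matrix, size):
    """Map a variable-sized matrix to a fixed size x size grid by max pooling.

    Rows and columns are split into `size` roughly equal regions; when a
    dimension is smaller than `size`, cells are reused (a simple upsampling).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])

    def bounds(k, n):
        start = k * n // size
        end = max(start + 1, (k + 1) * n // size)  # never an empty region
        return start, end

    pooled = []
    for p in range(size):
        r0, r1 = bounds(p, n_rows)
        row = []
        for q in range(size):
            c0, c1 = bounds(q, n_cols)
            row.append(max(matrix[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Invented 2 x 4 similarity matrix pooled down to 2 x 2.
similarity = [
    [0.1, 0.9, 0.2, 0.3],
    [0.4, 0.5, 0.8, 0.6],
]
print(dynamic_pool(similarity, 2))
```

Whatever the lengths of the two sentences, the output always has the same shape, so it can feed a fixed-size classifier or scoring function.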

Acquisition of Translation Lexicons for Historically Unwritten Languages via Bridging Loanwords
Michael Bloodgood | Benjamin Strauss

With the advent of informal electronic communications such as social media, colloquial languages that were historically unwritten are being written for the first time in heavily code-switched environments. We present a method for inducing portions of translation lexicons through the use of expert knowledge in these settings where there are approximately zero resources available other than a language informant, potentially not even large amounts of monolingual data. We investigate inducing a Moroccan Darija-English translation lexicon via French loanwords bridging into English and find that a useful lexicon is induced for human-assisted translation and statistical machine translation.

Toward a Comparable Corpus of Latvian, Russian and English Tweets
Dmitrijs Milajevs

Twitter has become a rich source for linguistic data. Here, a possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia is investigated. Such a corpus, once constructed, might be of great use for multiple purposes including training machine translation models, examining cross-lingual phenomena and studying the population of Riga. This pilot study shows that it is feasible to build such a resource by collecting and analysing a pilot corpus, which is made publicly available and can be used to construct a large comparable corpus.

Automatic Extraction of Parallel Speech Corpora from Dubbed Movies
Alp Öktem | Mireia Farrús | Leo Wanner

This paper presents a methodology to extract parallel speech corpora based on any language pair from dubbed movies, together with an application framework in which some corresponding prosodic parameters are extracted. The obtained parallel corpora are especially suitable for speech-to-speech translation applications when a prosody transfer between source and target languages is desired.

pdf bib
A parallel collection of clinical trials in Portuguese and English
Mariana Neves

Parallel collections of documents are crucial resources for training and evaluating machine translation (MT) systems. Even though large collections are available for certain domains and language pairs, these are still scarce in the biomedical domain. We developed a parallel corpus of clinical trials in Portuguese and English. The documents are derived from the Brazilian Clinical Trials Registry and the corpus currently contains a total of 1188 documents. In this paper, we describe the corpus construction and discuss the quality of the translation and the sentence alignment that we obtained.

pdf bib
Weighted Set-Theoretic Alignment of Comparable Sentences
Andoni Azpeitia | Thierry Etchegoyhen | Eva Martínez Garcia

This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had been previously shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task.
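
To make the set-theoretic idea concrete, here is a minimal sketch of a weighted overlap score between two bags of words. It is not the STACCw formula, which the abstract does not give: the function name `weighted_overlap`, the weighted Jaccard form, and the example weights are all assumptions, and both sentences are assumed to have been mapped into a shared vocabulary beforehand (for instance via dictionary lookup).

```python
def weighted_overlap(source_tokens, target_tokens, weights, default_weight=1.0):
    """Weighted Jaccard score between two bags of words (illustrative only)."""
    source, target = set(source_tokens), set(target_tokens)
    union = source | target
    if not union:
        return 0.0

    def total(tokens):
        # Tokens without an explicit weight fall back to default_weight.
        return sum(weights.get(token, default_weight) for token in tokens)

    return total(source & target) / total(union)


# Hypothetical weights: content words count more than function words.
example_weights = {"the": 0.1, "treaty": 2.0, "signed": 1.0, "was": 0.1}
score = weighted_overlap(
    ["the", "treaty", "was", "signed"],
    ["the", "treaty", "signed", "yesterday"],
    example_weights,
)
```

With uniform weights the score reduces to the plain Jaccard coefficient, so a weighting scheme can be compared against the unweighted baseline by swapping only the `weights` argument.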

pdf bib
BUCC 2017 Shared Task: a First Attempt Toward a Deep Learning Framework for Identifying Parallel Sentences in Comparable Corpora
Francis Grégoire | Philippe Langlais

This paper describes our participation in BUCC 2017 shared task: identifying parallel sentences in comparable corpora. Our goal is to leverage continuous vector representations and distributional semantics with a minimal use of external preprocessing and postprocessing tools. We report experiments that were conducted after transmitting our results.

pdf bib
zNLP: Identifying Parallel Sentences in Chinese-English Comparable Corpora
Zheng Zhang | Pierre Zweigenbaum

This paper describes the zNLP system for the BUCC 2017 shared task. Our system identifies parallel sentence pairs in Chinese-English comparable corpora by translating Chinese sentences word by word into English, using the search engine Solr to select near-parallel sentences, and then using an SVM classifier to identify true parallel sentences from the previous results. It obtains an F1-score of 45% (resp. 32%) on the test (training) set.

pdf bib
BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay

A Statistical Machine Translation (SMT) system is always trained using a large parallel corpus to produce effective translations. Not only are such corpora scarce, but building them also involves a lot of manual labor and cost. A parallel corpus can be prepared from comparable corpora, where a pair of corpora is in two different languages pointing to the same domain. In the present work, we try to build a parallel corpus for the French-English language pair from a given comparable corpus. The data and the problem set are provided as part of the shared task organized by BUCC 2017. We propose a system that first translates the sentences, relying heavily on Moses, and then groups the sentences based on sentence length similarity. Finally, one-to-one sentence selection is done based on cosine similarity.
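
The final selection step relies on cosine similarity. A minimal sketch of that measure over bags of words follows. It is not the authors' implementation: the function name `cosine_similarity`, the use of raw term counts, and whitespace-style token lists are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both sentences contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# A machine-translated source sentence compared with two candidate targets.
translated = "the cat sleeps".split()
close = cosine_similarity(translated, "the cat sleeps".split())
far = cosine_similarity(translated, "stock markets fell".split())
```

Picking, for each translated sentence, the candidate with the highest score gives a one-to-one selection of the kind described above.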

pdf bib
Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp

This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics and the methods used to evaluate system results. 13 runs were submitted to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We examined manually a small sample of the false negative sentence pairs for the most precise French-English runs and estimated the number of parallel sentence pairs not yet in the provided gold standard. Adding them to the gold standard leads to revised estimates for the French-English F-scores of at most +1.5pt. This suggests that the BUCC 2017 datasets provide a reasonable approximate evaluation of the parallel sentence spotting task.
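
The F-scores above combine precision and recall over predicted sentence pairs. The sketch below shows the computation. It is not the shared task's official scorer: the function name `precision_recall_f1` and the representation of a sentence pair as a tuple of identifiers are assumptions.

```python
def precision_recall_f1(predicted_pairs, gold_pairs):
    """Precision, recall and F1 of predicted pairs against a gold standard."""
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical identifiers: (source sentence id, target sentence id).
gold = {("fr-1", "en-1"), ("fr-2", "en-2"), ("fr-3", "en-3"), ("fr-4", "en-4")}
predicted = {("fr-1", "en-1"), ("fr-2", "en-2"), ("fr-3", "en-9")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

If a predicted pair is truly parallel but missing from `gold`, it is counted as an error, which is why scores measured against an incomplete gold standard are lower bounds.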

Proceedings of the 2nd Workshop on Representation Learning for NLP

pdf bib
Proceedings of the 2nd Workshop on Representation Learning for NLP
Phil Blunsom | Antoine Bordes | Kyunghyun Cho | Shay Cohen | Chris Dyer | Edward Grefenstette | Karl Moritz Hermann | Laura Rimell | Jason Weston | Scott Yih

pdf bib
Sense Contextualization in a Dependency-Based Compositional Distributional Model
Pablo Gamallo

Little attention has been paid to distributional compositional methods which employ syntactically structured vector models. As word vectors belonging to different syntactic categories have incompatible syntactic distributions, no trivial compositional operation can be applied to combine them into a new compositional vector. In this article, we generalize the method described by Erk and Padó (2009) by proposing a dependency-based framework that contextualizes not only lemmas but also selectional preferences. The main contribution of the article is to expand their model to a fully compositional framework in which syntactic dependencies are put at the core of semantic composition. We claim that semantic composition is mainly driven by syntactic dependencies. Each syntactic dependency generates two new compositional vectors representing the contextualized sense of the two related lemmas. The sequential application of the compositional operations associated with the dependencies results in as many contextualized vectors as lemmas the composite expression contains. At the end of the semantic process, we do not obtain a single compositional vector representing the semantic denotation of the whole composite expression, but one contextualized vector for each lemma of the whole expression. Our method avoids the troublesome high-order tensor representations by defining lemmas and selectional restrictions as first-order tensors (i.e. standard vectors). A corpus-based experiment is performed both to evaluate the quality of the compositional vectors built with our strategy and to compare them to other approaches on distributional compositional semantics. The experiments show that our dependency-based compositional method performs as well as (or even better than) the state of the art.

pdf bib
Context encoders as a simple but powerful extension of word2vec
Franziska Horn

With a strikingly simple architecture and the ability to learn meaningful word embeddings efficiently from texts containing billions of words, word2vec remains one of the most popular neural language models used today. However, as only a single embedding is learned for every word in the vocabulary, the model fails to optimally represent words with multiple meanings and, additionally, it is not possible to create embeddings for new (out-of-vocabulary) words on the spot. Based on an intuitive interpretation of the continuous bag-of-words (CBOW) word2vec model’s negative sampling training objective in terms of predicting context based similarities, we motivate an extension of the model we call context encoders (ConEc). By multiplying the matrix of trained word2vec embeddings with a word’s average context vector, out-of-vocabulary (OOV) embeddings and representations for words with multiple meanings can be created based on the words’ local contexts. The benefits of this approach are illustrated by using these word embeddings as features in the CoNLL 2003 named entity recognition (NER) task.
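
The matrix-times-context-vector step can be sketched as follows. This is an illustration and not the ConEc reference implementation: the function name `context_embedding`, the dictionary-based embedding table, and the normalisation of context counts to relative frequencies are assumptions.

```python
def context_embedding(embeddings, context_counts):
    """Embed a word as the frequency-weighted average of its context words.

    Equivalent to multiplying the (transposed) embedding matrix with the
    word's normalised context vector.
    """
    dim = len(next(iter(embeddings.values())))
    # Context words without a trained embedding are ignored.
    known = {w: c for w, c in context_counts.items() if w in embeddings}
    total = sum(known.values())
    vector = [0.0] * dim
    if total == 0:
        return vector
    for word, count in known.items():
        for i, value in enumerate(embeddings[word]):
            vector[i] += (count / total) * value
    return vector


# Toy two-dimensional embeddings and the local context of an unseen word.
toy_embeddings = {"river": [1.0, 0.0], "money": [0.0, 1.0]}
oov_vector = context_embedding(toy_embeddings, {"river": 3, "money": 1, "zzz": 5})
```

Restricting `context_counts` to a single document or sentence gives a local representation, which is how one word form can receive different vectors for different meanings.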

pdf bib
Machine Comprehension by Text-to-Text Neural Question Generation
Xingdi Yuan | Tong Wang | Caglar Gulcehre | Alessandro Sordoni | Philip Bachman | Saizheng Zhang | Sandeep Subramanian | Adam Trischler

We propose a recurrent neural model that generates natural-language questions from documents, conditioned on answers. We show how to train the model using a combination of supervised and reinforcement learning. After teacher forcing for standard maximum likelihood training, we fine-tune the model using policy gradient techniques to maximize several rewards that measure question quality. Most notably, one of these rewards is the performance of a question-answering system. We motivate question generation as a means to improve the performance of question answering systems. Our model is trained and evaluated on the recent question-answering dataset SQuAD.

pdf bib
Emergent Predication Structure in Hidden State Vectors of Neural Readers
Hai Wang | Takeshi Onishi | Kevin Gimpel | David McAllester

A significant number of neural architectures for reading comprehension have recently been developed and evaluated on large cloze-style datasets. We present experiments supporting the emergence of “predication structure” in the hidden state vectors of these readers. More specifically, we provide evidence that the hidden state vectors represent atomic formulas 𝛷c where 𝛷 is a semantic property (predicate) and c is a constant symbol entity identifier.

pdf bib
Towards Harnessing Memory Networks for Coreference Resolution
Joe Cheri | Pushpak Bhattacharyya

The coreference resolution task demands comprehending a discourse, especially for anaphoric mentions which require semantic information for resolving antecedents. We investigate how memory networks can be helpful for coreference resolution when posed as a question answering problem. The comprehension capability of memory networks assists coreference resolution, particularly for the mentions that require semantic and context information. We experiment with memory networks for coreference resolution, using 4 synthetic datasets generated for coreference resolution with varying difficulty levels. Our system’s performance is compared with a traditional coreference resolution system to show why memory networks can be promising for coreference resolution.

pdf bib
Combining Word-Level and Character-Level Representations for Relation Classification of Informal Text
Dongyun Liang | Weiran Xu | Yinge Zhao

Word representation models have achieved great success in natural language processing tasks, such as relation classification. However, they do not always work on informal text, and the morphemes of some misspelled words may carry important short-distance semantic information. We propose a hybrid model, combining the merits of word-level and character-level representations to learn better representations on informal text. Experiments on two datasets for relation classification, SemEval-2010 Task8 and a large-scale one we compile from informal text, show that our model achieves a competitive result on the former and state-of-the-art results on the latter.

pdf bib
Transfer Learning for Neural Semantic Parsing
Xing Fan | Emilio Monti | Lambert Mathias | Markus Dreyer

The goal of semantic parsing is to map natural language to a machine interpretable meaning representation language (MRL). One of the constraints that limits full exploration of deep learning technologies for semantic parsing is the lack of sufficient annotated training data. In this paper, we propose using sequence-to-sequence in a multi-task setup for semantic parsing with a focus on transfer learning. We explore three multi-task architectures for sequence-to-sequence models and compare their performance with the independently trained model. Our experiments show that the multi-task setup aids transfer learning from an auxiliary task with large labeled data to the target task with smaller labeled data. We see an absolute accuracy gain ranging from 1.0% to 4.4% in our in-house data set and we also see good gains ranging from 2.5% to 7.0% on the ATIS semantic parsing tasks with syntactic and semantic auxiliary tasks.

pdf bib
Modeling Large-Scale Structured Relationships with Shared Memory for Knowledge Base Completion
Yelong Shen | Po-Sen Huang | Ming-Wei Chang | Jianfeng Gao

Recent studies on knowledge base completion, the task of recovering missing relationships based on recorded relations, demonstrate the importance of learning embeddings from multi-step relations. However, due to the size of knowledge bases, learning multi-step relations directly on top of observed triplets could be costly. Hence, a manually designed procedure is often used when training the models. In this paper, we propose Implicit ReasoNets (IRNs), which are designed to perform multi-step inference implicitly through a controller and shared memory. Without a human-designed inference procedure, IRNs use training data to learn to perform multi-step inference in an embedding neural space through the shared memory and controller. While the inference procedure does not explicitly operate on top of observed triplets, our proposed model outperforms all previous approaches on the popular FB15k benchmark by more than 5.7%.

pdf bib
Knowledge Base Completion: Baselines Strike Back
Rudolf Kadlec | Ondrej Bajgar | Jan Kleindienst

Many papers have been published on the knowledge base completion task in the past few years. Most of these introduce novel architectures for relation learning that are evaluated on standard datasets like FB15k and WN18. This paper shows that the accuracy of almost all models published on the FB15k can be outperformed by an appropriately tuned baseline — our reimplementation of the DistMult model. Our findings cast doubt on the claim that the performance improvements of recent models are due to architectural changes as opposed to hyper-parameter tuning or different training objectives. This should prompt future research to re-consider how the performance of models is evaluated and reported.

pdf bib
Sequential Attention: A Context-Aware Alignment Function for Machine Reading
Sebastian Brarda | Philip Yeres | Samuel Bowman

In this paper we propose a neural network model with a novel Sequential Attention layer that extends soft attention by assigning weights to words in an input sequence in a way that takes into account not just how well that word matches a query, but how well surrounding words match. We evaluate this approach on the task of reading comprehension (on the Who did What and CNN datasets) and show that it dramatically improves a strong baseline—the Stanford Reader—and is competitive with the state of the art.

pdf bib
Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
Jan Rygl | Jan Pomikálek | Radim Řehůřek | Michal Růžička | Vít Novotný | Petr Sojka

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.

pdf bib
Multi-task Domain Adaptation for Sequence Tagging
Nanyun Peng | Mark Dredze

Many domain adaptation approaches rely on learning cross domain shared representations to transfer the knowledge learned in one domain to other domains. Traditional domain adaptation only considers adapting for one task. In this paper, we explore multi-task representation learning under the domain adaptation scenario. We propose a neural network framework that supports domain adaptation for multiple tasks simultaneously, and learns shared representations that better generalize for domain adaptation. We apply the proposed framework to domain adaptation for sequence tagging problems considering two tasks: Chinese word segmentation and named entity recognition. Experiments show that multi-task domain adaptation works better than disjoint domain adaptation for each task, and achieves the state-of-the-art results for both tasks in the social media domain.

pdf bib
Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context
Shyam Upadhyay | Kai-Wei Chang | Matt Taddy | Adam Kalai | James Zou

Word embeddings, which represent a word as a point in a vector space, have become ubiquitous in several NLP tasks. A recent line of work uses bilingual (two languages) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) using a principled approach to learn a variable number of senses per word, in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training, by allowing us to combine different parallel corpora to leverage multilingual context. Multilingual training yields comparable performance to a state of the art monolingual model trained on five times more training data.

pdf bib
DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging
Sheng Chen | Akshay Soni | Aasish Pappu | Yashar Mehdad

Tagging news articles or blog posts with relevant tags from a collection of predefined ones is referred to as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec – two popular models for learning distributed representation of words and documents. In DocTag2Vec, we simultaneously learn the representation of words, documents, and tags in a joint vector space during training, and employ the simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec directly deals with raw text instead of provided feature vectors, and in addition, enjoys advantages like the learning of tag representations and the ability to handle newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods.
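
The prediction step, a k-nearest neighbor search in the joint space, can be sketched as below. This is a simplified illustration and not the DocTag2Vec code: the function name `nearest_tags`, the choice of cosine similarity as the distance, and the toy vectors are assumptions, and the document vector is assumed to have been inferred already.

```python
import math


def nearest_tags(doc_vector, tag_vectors, k=2):
    """Return the k tags whose embeddings are closest to the document vector."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    ranked = sorted(
        tag_vectors,
        key=lambda tag: cosine(doc_vector, tag_vectors[tag]),
        reverse=True,
    )
    return ranked[:k]


# Toy tag embeddings living in the same space as the document vector.
toy_tags = {"sports": [1.0, 0.0], "politics": [0.0, 1.0], "football": [0.9, 0.1]}
predicted_tags = nearest_tags([0.8, 0.2], toy_tags, k=2)
```

Because tags are just points in the space, a newly created tag only needs its own vector to become retrievable, with no change to the search itself.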

pdf bib
Binary Paragraph Vectors
Karol Grzegorczyk | Marcin Kurdziel

Recently Le & Mikolov described two log-linear models, called Paragraph Vector, that can be used to learn state-of-the-art distributed representations of documents. Inspired by this work, we present Binary Paragraph Vector models: simple neural networks that learn short binary codes for fast information retrieval. We show that binary paragraph vectors outperform autoencoder-based binary codes, despite using fewer bits. We also evaluate their precision in transfer learning settings, where binary codes are inferred for documents unrelated to the training corpus. Results from these experiments indicate that binary paragraph vectors can capture semantics relevant for various domain-specific documents. Finally, we present a model that simultaneously learns short binary codes and longer, real-valued representations. This model can be used to rapidly retrieve a short list of highly relevant documents from a large document collection.
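
Short binary codes make retrieval fast because documents can be compared with bit operations. The sketch below ranks documents by Hamming distance to a query code. It illustrates the retrieval side only and is not the authors' model: the names `hamming_distance` and `shortlist`, the integer encoding of codes, and the 8-bit toy codes are assumptions.

```python
def hamming_distance(code_a, code_b):
    """Number of bit positions in which two integer-encoded codes differ."""
    return bin(code_a ^ code_b).count("1")


def shortlist(query_code, document_codes, top_n=2):
    """Return the ids of the top_n documents closest to the query in Hamming space."""
    ranked = sorted(
        document_codes,
        key=lambda doc_id: hamming_distance(query_code, document_codes[doc_id]),
    )
    return ranked[:top_n]


# Hypothetical 8-bit codes for three documents.
toy_codes = {"doc_a": 0b10110010, "doc_b": 0b10110011, "doc_c": 0b01001101}
candidates = shortlist(0b10110110, toy_codes, top_n=2)
```

A shortlist produced this way could then be re-ranked with longer real-valued representations, in the spirit of the combined model mentioned at the end of the abstract.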

pdf bib
Representing Compositionality based on Multiple Timescales Gated Recurrent Neural Networks with Adaptive Temporal Hierarchy for Character-Level Language Models
Dennis Singh Moirangthem | Jegyung Son | Minho Lee

A novel character-level neural language model is proposed in this paper. The proposed model incorporates a biologically inspired temporal hierarchy in the architecture for representing multiple compositions of language in order to handle longer sequences for the character-level language model. The temporal hierarchy is introduced in the language model by utilizing a Gated Recurrent Neural Network with multiple timescales. The proposed model incorporates a timescale adaptation mechanism for enhancing the performance of the language model. We evaluate our proposed model using the popular Penn Treebank and Text8 corpora. The experiments show that the use of multiple timescales in a Neural Language Model (NLM) enables improved performance despite having fewer parameters and with no additional computation requirements. Our experiments also demonstrate the ability of the adaptive temporal hierarchies to represent multiple compositionality without the help of complex hierarchical architectures and show that better representation of the longer sequences leads to enhanced performance of the probabilistic language model.

pdf bib
Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation
Pranava Swaroop Madhyastha | Cristina España-Bonet

We propose a simple log-bilinear softmax-based model to deal with vocabulary expansion in machine translation. Our model uses word embeddings trained on significantly large unlabelled monolingual corpora and learns over a fairly small, word-to-word bilingual dictionary. Given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language using the trained bilingual embeddings. We integrate these translation options into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English–Spanish language pair. When tested over an out-of-domain testset, we get a significant improvement of 3.9 BLEU points.

pdf bib
Prediction of Frame-to-Frame Relations in the FrameNet Hierarchy with Frame Embeddings
Teresa Botschen | Hatem Mousselly-Sergieh | Iryna Gurevych

Automatic completion of frame-to-frame (F2F) relations in the FrameNet (FN) hierarchy has received little attention, although they incorporate meta-level commonsense knowledge and are used in downstream approaches. We address the problem of sparsely annotated F2F relations. First, we examine whether the manually defined F2F relations emerge from text by learning text-based frame embeddings. Our analysis reveals insights about the difficulty of reconstructing F2F relations purely from text. Second, we present different systems for predicting F2F relations; our best-performing one uses the FN hierarchy to train on and to ground embeddings in. A comparison of systems and embeddings exposes the crucial influence of knowledge-based embeddings on a system’s performance in predicting F2F relations.

pdf bib
Learning Joint Multilingual Sentence Representations with Neural Machine Translation
Holger Schwenk | Matthijs Douze

In this paper, we use the framework of neural machine translation to learn joint sentence representations across six very different languages. Our hypothesis is that a representation which is independent of the language is likely to capture the underlying semantics. We define a new cross-lingual similarity measure, compare up to 1.4M sentence representations and study the characteristics of close sentences. We provide experimental evidence that sentences that are close in embedding space are indeed semantically highly related, but often have quite different structure and syntax. These relations also hold when comparing sentences in different languages.

pdf bib
Transfer Learning for Speech Recognition on a Budget
Julius Kunze | Louis Kirsch | Ilia Kurenkov | Andreas Krug | Jens Johannsmeier | Sebastian Stober

End-to-end training of automated speech recognition (ASR) systems requires massive data and compute resources. We explore transfer learning based on model adaptation as an approach for training ASR models under constrained GPU memory, throughput and training data. We conduct several systematic experiments adapting a Wav2Letter convolutional neural network originally trained for English ASR to the German language. We show that this technique allows faster training on consumer-grade resources while requiring less training data in order to achieve the same accuracy, thereby lowering the cost of training ASR models in other languages. Model introspection revealed that small adaptations to the network’s weights were sufficient for good performance, especially for inner layers.

pdf bib
Gradual Learning of Matrix-Space Models of Language for Sentiment Analysis
Shima Asaadi | Sebastian Rudolph

Learning word representations to capture the semantics and compositionality of language has received much research interest in natural language processing. Beyond the popular vector space models, matrix representations for words have been proposed, since matrix multiplication can then serve as a natural composition operation. In this work, we investigate the problem of learning matrix representations of words. We present a learning approach for compositional matrix-space models for the task of sentiment analysis. We show that our approach, which learns the matrices gradually in two steps, outperforms other approaches and a gradient-descent baseline in terms of quality and computational cost.
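
Composition by matrix multiplication can be sketched in a few lines. The example covers only the composition step and not the learning procedure or the way a sentiment score is read off the product, neither of which the abstract specifies. The function names and the toy 2x2 word matrices are assumptions.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def compose(word_matrices, tokens):
    """Represent a phrase as the ordered product of its word matrices."""
    dim = len(next(iter(word_matrices.values())))
    # Start from the identity so that an empty phrase maps to the identity.
    result = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for token in tokens:
        result = matmul(result, word_matrices[token])
    return result


# Hypothetical word matrices, chosen only to show order sensitivity.
toy_matrices = {
    "not": [[0.0, 1.0], [1.0, 0.0]],
    "good": [[2.0, 0.0], [0.0, 1.0]],
}
not_good = compose(toy_matrices, ["not", "good"])
good_not = compose(toy_matrices, ["good", "not"])
```

Since matrix multiplication is not commutative, the two phrases receive different representations, which a bag-of-words vector sum cannot express.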

pdf bib
Improving Language Modeling using Densely Connected Recurrent Neural Networks
Fréderic Godin | Joni Dambre | Wesley De Neve

In this paper, we introduce the novel concept of densely connected layers into recurrent neural networks. We evaluate our proposed architecture on the Penn Treebank language modeling task. We show that we can obtain similar perplexity scores with six times fewer parameters compared to a standard stacked 2-layer LSTM model trained with dropout (Zaremba et al., 2014). In contrast with the current usage of skip connections, we show that densely connecting only a few stacked layers with skip connections already yields significant perplexity reductions.

pdf bib
NewsQA: A Machine Comprehension Dataset
Adam Trischler | Tong Wang | Xingdi Yuan | Justin Harris | Alessandro Sordoni | Philip Bachman | Kaheer Suleman

We present NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text in the articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. Analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (13.3% F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available online.

pdf bib
Intrinsic and Extrinsic Evaluation of Spatiotemporal Text Representations in Twitter Streams
Lawrence Phillips | Kyle Shaffer | Dustin Arendt | Nathan Hodas | Svitlana Volkova

Language in social media is a dynamic system, constantly evolving and adapting, with words and concepts rapidly emerging, disappearing, and changing their meaning. These changes can be estimated using word representations in context, over time and across locations. A number of methods have been proposed to track these spatiotemporal changes but no general method exists to evaluate the quality of these representations. Previous work largely focused on qualitative evaluation, which we improve by proposing a set of visualizations that highlight changes in text representation over both space and time. We demonstrate the usefulness of novel spatiotemporal representations to explore and characterize specific aspects of the corpus of tweets collected from European countries over a two-week period centered around the terrorist attacks in Brussels in March 2016. In addition, we quantitatively evaluate spatiotemporal representations by feeding them into a downstream classification task – event type prediction. Thus, our work is the first to provide both intrinsic (qualitative) and extrinsic (quantitative) evaluation of text representations for spatiotemporal trends.

pdf bib
Rethinking Skip-thought: A Neighborhood based Approach
Shuai Tang | Hailin Jin | Chen Fang | Zhaowen Wang | Virginia de Sa

We study the skip-thought model with neighborhood information as weak supervision. More specifically, we propose a skip-thought neighbor model to consider the adjacent sentences as a neighborhood. We train our skip-thought neighbor model on a large corpus with continuous sentences, and then evaluate the trained model on 7 tasks, which include semantic relatedness, paraphrase detection, and classification benchmarks. Both quantitative comparison and qualitative investigation are conducted. We empirically show that our skip-thought neighbor model performs as well as the skip-thought model on evaluation tasks. In addition, we found that incorporating an autoencoder path in our model did not help it perform better, while it hurts the performance of the skip-thought model.

pdf bib
A Frame Tracking Model for Memory-Enhanced Dialogue Systems
Hannes Schulz | Jeremie Zumer | Layla El Asri | Shikhar Sharma

Recently, resources and tasks were proposed to go beyond state tracking in dialogue systems. An example is the frame tracking task, which requires recording multiple frames, one for each user goal set during the dialogue. This allows a user, for instance, to compare items corresponding to different goals. This paper proposes a model which takes as input the list of frames created so far during the dialogue, the current user utterance as well as the dialogue acts, slot types, and slot values associated with this utterance. The model then outputs the frame being referenced by each triple of dialogue act, slot type, and slot value. We show that on the recently published Frames dataset, this model significantly outperforms a previously proposed rule-based baseline. In addition, we propose an extensive analysis of the frame tracking task by dividing it into sub-tasks and assessing their difficulty with respect to our model.

pdf bib
Plan, Attend, Generate: Character-Level Neural Machine Translation with Planning
Caglar Gulcehre | Francis Dutil | Adam Trischler | Yoshua Bengio

We investigate the integration of a planning mechanism into an encoder-decoder architecture with attention. We develop a model that can plan ahead when it computes alignments between the source and target sequences, not only for a single time-step but for the next k time-steps as well, by constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the strategic attentive reader and writer (STRAW) model, a recent neural architecture for planning with hierarchical reinforcement learning that can also learn higher level temporal abstractions. Our proposed model is end-to-end trainable with differentiable operations. We show that our model outperforms strong baselines on character-level translation tasks from WMT’15 with fewer parameters and computes alignments that are qualitatively intuitive.

Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology-Based Representations
Paul Michel | Abhilasha Ravichander | Shruti Rijhwani

We investigate the pertinence of methods from algebraic topology for text data analysis. These methods enable the development of mathematically-principled isometric-invariant mappings from a set of vectors to a document embedding, which is stable with respect to the geometry of the document in the selected metric space. In this work, we evaluate the utility of these topology-based document representations in traditional NLP tasks, specifically document clustering and sentiment classification. We find that the embeddings do not benefit text analysis. In fact, performance is worse than simple techniques like tf-idf, indicating that the geometry of the document does not provide enough variability for classification on the basis of topic or sentiment in the chosen datasets.

Adversarial Generation of Natural Language
Sandeep Subramanian | Sai Rajeswar | Francis Dutil | Chris Pal | Aaron Courville

Generative Adversarial Networks (GANs) have gathered a lot of attention from the computer vision community, yielding impressive results for image generation. Advances in the adversarial generation of natural language from noise, however, are not commensurate with the progress made in generating images, and still lag far behind likelihood-based methods. In this paper, we take a step towards generating natural language with a GAN objective alone. We introduce a simple baseline that addresses the discrete output space problem without relying on gradient estimators and show that it is able to achieve state-of-the-art results on a Chinese poem generation dataset. We present quantitative results on generating sentences from context-free and probabilistic context-free grammars, and qualitative language modeling results. A conditional version is also described that can generate sequences conditioned on sentence characteristics.

Deep Active Learning for Named Entity Recognition
Yanyao Shen | Hyokun Yun | Zachary Lipton | Yakov Kronrod | Animashree Anandkumar

Deep neural networks have advanced the state of the art in named entity recognition. However, under typical training procedures, advantages over classical methods emerge only with large datasets. As a result, deep learning is employed only when large public datasets or a large budget for manually labeling data is available. In this work, we show otherwise: by combining deep learning with active learning, we can outperform classical methods even with a significantly smaller amount of training data.

Learning when to skim and when to read
Alexander Johansen | Richard Socher

Many recent advances in deep learning for natural language processing have come at increasing computational cost, but the power of these state-of-the-art models is not needed for every example in a dataset. We demonstrate two approaches to reducing unnecessary computation in cases where a fast but weak baseline classifier and a stronger, slower model are both available. Applying an AUC-based metric to the task of sentiment classification, we find significant efficiency gains with both a probability-threshold method for reducing computational cost and one that uses a secondary decision network.
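
The probability-threshold idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cascade_predict`, the threshold value, and the two toy models are hypothetical stand-ins chosen only to show the control flow, in which the slow model runs only when the fast model is not confident enough.

```python
from typing import Callable, List, Tuple

# A "model" here is any callable mapping a text to class probabilities.
Model = Callable[[str], List[float]]


def argmax(probs: List[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def cascade_predict(text: str, fast: Model, slow: Model,
                    threshold: float = 0.9) -> Tuple[int, str]:
    """Return (predicted class, which model answered).

    The fast model's answer is kept when its top probability reaches
    the threshold; otherwise the slow model is consulted.
    """
    fast_probs = fast(text)
    best = argmax(fast_probs)
    if fast_probs[best] >= threshold:
        return best, "fast"
    return argmax(slow(text)), "slow"


# Hypothetical toy models (assumptions for illustration only).
def toy_fast(text: str) -> List[float]:
    # Confident only when an obvious cue word is present.
    return [0.95, 0.05] if "great" in text else [0.55, 0.45]


def toy_slow(text: str) -> List[float]:
    # Stands in for an expensive model; always predicts class 1 here.
    return [0.2, 0.8]
```

Raising the threshold sends more examples to the slow model, trading computation for accuracy; the secondary decision network mentioned in the abstract would replace the fixed threshold with a learned rule.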

Learning to Embed Words in Context for Syntactic Tasks
Lifu Tu | Kevin Gimpel | Karen Livescu

We present models for embedding words in the context of surrounding words. Such models, which we refer to as token embeddings, represent the characteristics of a word that are specific to a given context, such as word sense, syntactic category, and semantic role. We explore simple, efficient token embedding models based on standard neural network architectures. We learn token embeddings on a large amount of unannotated text and evaluate them as features for part-of-speech taggers and dependency parsers trained on much smaller amounts of annotated data. We find that predictors endowed with token embeddings consistently outperform baseline predictors across a range of context window and training set sizes.

Proceedings of the Events and Stories in the News Workshop

Proceedings of the Events and Stories in the News Workshop
Tommaso Caselli | Ben Miller | Marieke van Erp | Piek Vossen | Martha Palmer | Eduard Hovy | Teruko Mitamura | David Caswell

newsLens: building and visualizing long-ranging news stories
Philippe Laban | Marti Hearst

We propose a method to aggregate and organize a large, multi-source dataset of news articles into a collection of major stories, and automatically name and visualize these stories in a working system. The approach is able to run online, as new articles are added, processing 4 million news articles from 20 news sources, and extracting 80000 major stories, some of which span several years. The visual interface consists of lanes of timelines, each annotated with information that is deemed important for the story, including extracted quotations. The working system allows a user to search and navigate 8 years of story information.

Detecting Changes in Twitter Streams using Temporal Clusters of Hashtags
Yunli Wang | Cyril Goutte

Detecting events from social media data has important applications in public security, political issues, and public health. Many studies have focused on detecting specific or unspecific events from Twitter streams. However, not much attention has been paid to detecting changes, and their impact, in online conversations related to an event. We propose methods for detecting such changes, using clustering of temporal profiles of hashtags, and three change point detection algorithms. The methods were tested on two Twitter datasets: one covering the 2014 Ottawa shooting event, and one covering the Sochi winter Olympics. We compare our approach to a baseline consisting of detecting change from raw counts in the conversation. We show that our method produces large gains in change detection accuracy on both datasets.

Event Detection Using Frame-Semantic Parser
Evangelia Spiliopoulou | Eduard Hovy | Teruko Mitamura

Recent methods for Event Detection focus on Deep Learning for automatic feature generation and feature ranking. However, most of those approaches fail to exploit rich semantic information, which results in relatively poor recall. This paper is a small and focused contribution in which we introduce an Event Detection and classification system based on deep semantic information retrieved from a frame-semantic parser. Our experiments show that our system achieves higher recall than state-of-the-art systems. Further, we claim that enhancing our system with deep learning techniques like feature ranking can achieve even better results, as it can benefit from both approaches.

Improving Shared Argument Identification in Japanese Event Knowledge Acquisition
Yin Jou Huang | Sadao Kurohashi

Event knowledge represents the knowledge of causal and temporal relations between events. Shared arguments of event knowledge encode patterns of role shifting in successive events. A two-stage framework was proposed for the task of Japanese event knowledge acquisition, in which related event pairs are first extracted, and shared arguments are then identified to form the complete event knowledge. This paper focuses on the second stage of this framework, and proposes a method to improve the shared argument identification of related event pairs. We constructed a gold dataset for shared argument learning. By evaluating our system on this gold dataset, we found that our proposed model outperformed the baseline models by a large margin.

Tracing armed conflicts with diachronic word embedding models
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid

Recent studies have shown that word embedding models can be used to trace time-related (diachronic) semantic shifts in particular words. In this paper, we evaluate some of these approaches on the new task of predicting the dynamics of global armed conflicts on a year-to-year basis, using a dataset from the conflict research field as the gold standard and the Gigaword news corpus as the training data. The results show that much work still remains in extracting ‘cultural’ semantic shifts from diachronic word embedding models. At the same time, we present a new task complete with an evaluation set and introduce the ‘anchor words’ method which outperforms previous approaches on this set.

The Circumstantial Event Ontology (CEO)
Roxane Segers | Tommaso Caselli | Piek Vossen

In this paper we describe the ongoing work on the Circumstantial Event Ontology (CEO), a newly developed ontology for calamity events that models semantic circumstantial relations between event classes. The circumstantial relations are designed manually, based on the shared properties of each event class. We discuss and contrast two types of event circumstantial relations: semantic circumstantial relations and episodic circumstantial relations. Further, we show the metamodel and the current contents of the ontology and outline the evaluation of the CEO.

Event Detection and Semantic Storytelling: Generating a Travelogue from a large Collection of Personal Letters
Georg Rehm | Julian Moreno Schneider | Peter Bourgonje | Ankit Srivastava | Jan Nehring | Armin Berger | Luca König | Sören Räuchle | Jens Gerth

We present an approach to identifying a specific class of events, movement action events (MAEs), in a data set that consists of ca. 2,800 personal letters exchanged by the German architect Erich Mendelsohn and his wife, Luise. A backend system uses these and other semantic analysis results as input for an authoring environment that digital curators can use to produce new pieces of digital content. In our example case, the human expert will receive recommendations from the system with the goal of putting together a travelogue, i.e., a description of the trips and journeys undertaken by the couple. We describe the components and architecture and also apply the system to news data.

Inference of Fine-Grained Event Causality from Blogs and Films
Zhichao Hu | Elahe Rahimtoroghi | Marilyn Walker

Human understanding of narrative is mainly driven by reasoning about causal relations between events, and thus recognizing them is a key capability for computational models of language understanding. Computational work in this area has approached this via two different routes: by focusing on acquiring a knowledge base of common causal relations between events, or by attempting to understand a particular story or macro-event, along with its storyline. In this position paper, we focus on the knowledge acquisition approach and claim that newswire is a relatively poor source for learning fine-grained causal relations between everyday events. We describe experiments using an unsupervised method to learn causal relations between events in the narrative genres of first-person narratives and film scene descriptions. We show that our method learns fine-grained causal relations, judged by humans as likely to be causal over 80% of the time. We also demonstrate that the learned event pairs do not exist in publicly available event-pair datasets extracted from newswire.

On the Creation of a Security-Related Event Corpus
Martin Atkinson | Jakub Piskorski | Hristo Tanev | Vanni Zavarella

This paper reports on an effort to create a corpus of structured information on security-related events automatically extracted from on-line news, part of which has been manually curated. The main motivation behind this effort is to provide material to the NLP community working on event extraction that could be used both for training and evaluation purposes.

Inducing Event Types and Roles in Reverse: Using Function to Discover Theme
Natalie Ahn

With growing interest in automated event extraction, there is an increasing need to overcome the labor costs of hand-written event templates, entity lists, and annotated corpora. In the last few years, more inductive approaches have emerged, seeking to discover unknown event types and roles in raw text. The main recent efforts use probabilistic generative models, as in topic modeling, which are formally concise but do not always yield stable or easily interpretable results. We argue that event schema induction can benefit from greater structure in the process and in linguistic features that distinguish words’ functions and themes. To maximize our use of limited data, we reverse the typical schema induction steps and introduce new similarity measures, building an intuitive process for inducing the structure of unknown events.

The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction
Tommaso Caselli | Piek Vossen

This paper reports on the Event StoryLine Corpus (ESC) v1.0, a new benchmark dataset for temporal and causal relation detection. By developing this dataset, we also introduce a new task, StoryLine Extraction from news data, which aims at extracting and classifying events relevant for stories, from across news documents spread in time and clustered around a single seminal event or topic. In addition to describing the dataset, we also report on three baseline systems whose results show the complexity of the task and suggest directions for the development of more robust systems.

The Rich Event Ontology
Susan Brown | Claire Bonial | Leo Obrst | Martha Palmer

In this paper we describe a new lexical semantic resource, The Rich Event Ontology, which provides an independent conceptual backbone to unify existing semantic role labeling (SRL) schemas and augment them with event-to-event causal and temporal relations. By unifying the FrameNet, VerbNet, Automatic Content Extraction, and Rich Entities, Relations and Events resources, the ontology serves as a shared hub for the disparate annotation schemas and therefore enables the combination of SRL training data into a larger, more diverse corpus. By adding temporal and causal relational information not found in any of the independent resources, the ontology facilitates reasoning on and across documents, revealing relationships between events that come together in temporal and causal chains to build more complex scenarios. We envision the open resource serving as a valuable tool for both moving from the ontology to text to query for event types and scenarios of interest, and for moving from text to the ontology to access interpretations of events using the combined semantic information housed there.

Integrating Decompositional Event Structures into Storylines
William Croft | Pavlína Pešková | Michael Regan

Storyline research links together events in stories and specifies shared participants in those stories. In these analyses, an atomic event is assumed to be a single clause headed by a single verb. However, many analyses of verbal semantics assume a decompositional analysis of events expressed in single clauses. We present a formalization of a decompositional analysis of events in which each participant in a clausal event has their own temporally extended subevent, and the subevents are related through causal and other interactions. This decomposition allows us to represent storylines as an evolving set of interactions between participants over time.

Proceedings of the First Workshop on Language Grounding for Robotics

Proceedings of the First Workshop on Language Grounding for Robotics
Mohit Bansal | Cynthia Matuszek | Jacob Andreas | Yoav Artzi | Yonatan Bisk

Grounding Language for Interactive Task Learning
Peter Lindes | Aaron Mininger | James R. Kirk | John E. Laird

This paper describes how language is grounded by a comprehension system called Lucia within a robotic agent called Rosie that can manipulate objects and navigate indoors. The whole system is built within the Soar cognitive architecture and uses Embodied Construction Grammar (ECG) as a formalism for describing linguistic knowledge. Grounding is performed using knowledge from the grammar itself, from the linguistic context, from the agent’s perception, and from an ontology of long-term knowledge about object categories and properties and actions the agent can perform. The paper also describes a benchmark corpus of 200 sentences in this domain along with test versions of the world model and ontology and gold-standard meanings for each of the sentences. The benchmark is contained in the supplemental materials.

Learning how to Learn: An Adaptive Dialogue Agent for Incrementally Learning Visually Grounded Word Meanings
Yanchao Yu | Arash Eshghi | Oliver Lemon

We present an optimised multi-modal dialogue agent for interactive learning of visually grounded word meanings from a human tutor, trained on real human-human tutoring data. Within a life-long interactive learning period, the agent, trained using Reinforcement Learning (RL), must be able to handle natural conversations with human users, and achieve good learning performance (i.e. accuracy) while minimising human effort in the learning process. We train and evaluate this system in interaction with a simulated human tutor, which is built on the BURCHAK corpus – a Human-Human Dialogue dataset for the visual learning task. The results show that: 1) The learned policy can coherently interact with the simulated user to achieve the goal of the task (i.e. learning visual attributes of objects, e.g. colour and shape); and 2) it finds a better trade-off between classifier accuracy and tutoring costs than hand-crafted rule-based policies, including ones with dynamic policies.

Guiding Interaction Behaviors for Multi-modal Grounded Language Learning
Jesse Thomason | Jivko Sinapov | Raymond Mooney

Multi-modal grounded language learning connects language predicates to physical properties of objects in the world. Sensing with multiple modalities, such as audio, haptics, and visual colors and shapes, while performing interaction behaviors like lifting, dropping, and looking at objects enables a robot to ground non-visual predicates like “empty” as well as visual predicates like “red”. Previous work has established that grounding in multi-modal space improves performance on object retrieval from human descriptions. In this work, we gather behavior annotations from humans and demonstrate that these improve language grounding performance by allowing a system to focus on relevant behaviors for words like “white” or “half-full” that can be understood by looking or lifting, respectively. We also explore adding modality annotations (whether to focus on audio or haptics when performing a behavior), which improves performance, and sharing information between linguistically related predicates (if “green” is a color, “white” is a color), which improves grounding recall but at the cost of precision.

Structured Learning for Context-aware Spoken Language Understanding of Robotic Commands
Andrea Vanzo | Danilo Croce | Roberto Basili | Daniele Nardi

Service robots are expected to operate in specific environments, where the presence of humans plays a key role. A major feature of such robotics platforms is thus the ability to react to spoken commands. This requires the understanding of the user utterance with an accuracy able to trigger the robot reaction. Such correct interpretation of linguistic exchanges depends on physical, cognitive and language-dependent aspects related to the environment. In this work, we present the empirical evaluation of an adaptive Spoken Language Understanding chain for robotic commands, that explicitly depends on the operational environment during both the learning and recognition stages. The effectiveness of such a context-sensitive command interpretation is tested against an extension of an already existing corpus of commands, that introduced explicit perceptual knowledge: this enabled deeper measures proving that more accurate disambiguation capabilities can be actually obtained.

Natural Language Grounding and Grammar Induction for Robotic Manipulation Commands
Muhannad Alomari | Paul Duckworth | Majd Hawasly | David C. Hogg | Anthony G. Cohn

We present a cognitively plausible system capable of acquiring knowledge in language and vision from pairs of short video clips and linguistic descriptions. The aim of this work is to teach a robot manipulator how to execute natural language commands by demonstration. This is achieved in three steps: first, learning a set of visual ‘concepts’ that abstract the visual feature spaces into concepts that have human-level meaning; second, learning the mapping/grounding between words and the extracted visual concepts; third, inducing grammar rules via a semantic representation known as Robot Control Language (RCL). We evaluate our approach against state-of-the-art supervised and unsupervised grounding and grammar induction systems, and show that a robot can learn to execute never-seen-before commands from pairs of unlabelled linguistic and visual inputs.

Communication with Robots using Multilayer Recurrent Networks
Bedřich Pišl | David Mareček

In this paper, we describe an improvement on the task of giving instructions to robots in a simulated block world using unrestricted natural language commands.

Grounding Symbols in Multi-Modal Instructions
Yordan Hristov | Svetlin Penkov | Alex Lascarides | Subramanian Ramamoorthy

As robots begin to cohabit with humans in semi-structured environments, the need arises to understand instructions involving rich variability, for instance, learning to ground symbols in the physical world. Realistically, this task must cope with small datasets consisting of a particular user’s contextual assignment of meaning to terms. We present a method for processing a raw stream of cross-modal input (i.e., linguistic instructions, visual perception of a scene and a concurrent trace of 3D eye tracking fixations) to produce the segmentation of objects with a correspondent association to high-level concepts. To test our framework we present experiments in a table-top object manipulation scenario. Our results show our model learns the user’s notion of colour and shape from a small number of physical demonstrations, generalising to identifying physical referents for novel combinations of the words.

Exploring Variation of Natural Human Commands to a Robot in a Collaborative Navigation Task
Matthew Marge | Claire Bonial | Ashley Foots | Cory Hayes | Cassidy Henry | Kimberly Pollard | Ron Artstein | Clare Voss | David Traum

Robot-directed communication is variable, and may change based on human perception of robot capabilities. To collect training data for a dialogue system and to investigate possible communication changes over time, we developed a Wizard-of-Oz study that (a) simulates a robot’s limited understanding, and (b) collects dialogues where human participants build a progressively better mental model of the robot’s understanding. With ten participants, we collected ten hours of human-robot dialogue. We analyzed the structure of instructions that participants gave to a remote robot before it responded. Our findings show a general initial preference for including metric information (e.g., move forward 3 feet) over landmarks (e.g., move to the desk) in motion commands, but this decreased over time, suggesting changes in perception.

A Tale of Two DRAGGNs: A Hybrid Approach for Interpreting Action-Oriented and Goal-Oriented Instructions
Siddharth Karamcheti | Edward Clem Williams | Dilip Arumugam | Mina Rhee | Nakul Gopalan | Lawson L.S. Wong | Stefanie Tellex

Robots operating alongside humans in diverse, stochastic environments must be able to accurately interpret natural language commands. These instructions often fall into one of two categories: those that specify a goal condition or target state, and those that specify explicit actions, or how to perform a given task. Recent approaches have used reward functions as a semantic representation of goal-based commands, which allows for the use of a state-of-the-art planner to find a policy for the given task. However, these reward functions cannot be directly used to represent action-oriented commands. We introduce a new hybrid approach, the Deep Recurrent Action-Goal Grounding Network (DRAGGN), for task grounding and execution that handles natural language from either category as input, and generalizes to unseen environments. Our robot-simulation results demonstrate that a system successfully interpreting both goal-oriented and action-oriented task specifications brings us closer to robust natural language understanding for human-robot interaction.

Are Distributional Representations Ready for the Real World? Evaluating Word Vectors for Grounded Perceptual Meaning
Li Lucy | Jon Gauthier

Distributional word representation methods exploit word co-occurrences to build compact vector encodings of words. While these representations enjoy widespread use in modern natural language processing, it is unclear whether they accurately encode all necessary facets of conceptual meaning. In this paper, we evaluate how well these representations can predict perceptual and conceptual features of concrete concepts, drawing on two semantic norm datasets sourced from human participants. We find that several standard word representations fail to encode many salient perceptual features of concepts, and show that these deficits correlate with word-word similarity prediction errors. Our analyses provide motivation for grounded and embodied language learning approaches, which may help to remedy these deficits.

Sympathy Begins with a Smile, Intelligence Begins with a Word: Use of Multimodal Features in Spoken Human-Robot Interaction
Jekaterina Novikova | Christian Dondrup | Ioannis Papaioannou | Oliver Lemon

Recognition of social signals, coming from human facial expressions or prosody of human speech, is a popular research topic in human-robot interaction studies. There is also a long line of research in the spoken dialogue community that investigates user satisfaction in relation to dialogue characteristics. However, very little research relates a combination of multimodal social signals and language features detected during spoken face-to-face human-robot interaction to the resulting user perception of a robot. In this paper we show how different emotional facial expressions of human users, in combination with prosodic characteristics of human speech and features of human-robot dialogue, correlate with users’ impressions of the robot after a conversation. We find that happiness in the user’s recognised facial expression strongly correlates with likeability of a robot, while dialogue-related features (such as number of human turns or number of sentences per robot utterance) correlate with perceiving a robot as intelligent. In addition, we show that the facial expression emotional features and prosody are better predictors of human ratings related to perceived robot likeability and anthropomorphism, while linguistic and non-linguistic features more often predict perceived robot intelligence and interpretability. As such, these characteristics may in future be used as an online reward signal for in-situ Reinforcement Learning-based adaptive human-robot dialogue systems.

Towards Problem Solving Agents that Communicate and Learn
Anjali Narayan-Chen | Colin Graber | Mayukh Das | Md Rakibul Islam | Soham Dan | Sriraam Natarajan | Janardhan Rao Doppa | Julia Hockenmaier | Martha Palmer | Dan Roth

Agents that communicate back and forth with humans to help them execute non-linguistic tasks are a long sought goal of AI. These agents need to translate between utterances and actionable meaning representations that can be interpreted by task-specific problem solvers in a context-dependent manner. They should also be able to learn such actionable interpretations for new predicates on the fly. We define an agent architecture for this scenario and present a series of experiments in the Blocks World domain that illustrate how our architecture supports language learning and problem solving in this domain.

Proceedings of the Second Workshop on NLP and Computational Social Science

Proceedings of the Second Workshop on NLP and Computational Social Science
Dirk Hovy | Svitlana Volkova | David Bamman | David Jurgens | Brendan O’Connor | Oren Tsur | A. Seza Doğruöz

Language-independent Gender Prediction on Twitter
Nikola Ljubešić | Darja Fišer | Tomaž Erjavec

In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users’ tweets. We perform our experiments on the TwiSty dataset containing manual gender annotations for users speaking six different languages. Our classification results show that, while the prediction model based on language-independent features performs worse than the bag-of-words model when training and testing on the same language, it regularly outperforms the bag-of-words model when applied to different languages, showing very stable results across various languages. Finally we perform a comparative analysis of feature effect sizes across the six languages and show that differences in our features correspond to cultural distances.

When does a compliment become sexist? Analysis and classification of ambivalent sexism using twitter data
Akshita Jha | Radhika Mamidi

Sexism is prevalent in today’s society, both offline and online, and poses a credible threat to social equality with respect to gender. According to ambivalent sexism theory (Glick and Fiske, 1996), it comes in two forms: Hostile and Benevolent. While hostile sexism is characterized by an explicitly negative attitude, benevolent sexism is more subtle. Previous work on computationally detecting sexism online is restricted to identifying the hostile form. Our objective is to investigate the less pronounced form of sexism demonstrated online. We achieve this by creating and analyzing a dataset of tweets that exhibit benevolent sexism. Using Support Vector Machines (SVM), sequence-to-sequence models and a FastText classifier, we classify tweets into the ‘Hostile’, ‘Benevolent’ or ‘Others’ class depending on the kind of sexism they exhibit. We achieve an F1-score of 87.22% using the FastText classifier. Our work helps analyze and understand the highly prevalent ambivalent sexism in social media.

Personality Driven Differences in Paraphrase Preference
Daniel Preoţiuc-Pietro | Jordan Carpenter | Lyle Ungar

Personality plays a decisive role in how people behave in different scenarios, including online social media. Researchers have used such data to study how personality can be predicted from language use. In this paper, we study phrase choice as a particular stylistic linguistic difference, as opposed to the mostly topical differences identified previously. Building on previous work on demographic preferences, we quantify differences in paraphrase choice from a massive Facebook data set with posts from over 115,000 users. We quantify the predictive power of phrase choice in user profiling and use phrase choice to study psycholinguistic hypotheses. This work is relevant to future applications that aim to personalize text generation to specific personality types.

community2vec: Vector representations of online communities encode semantic relationships
Trevor Martin

Vector embeddings of words have been shown to encode meaningful semantic relationships that enable solving of complex analogies. This vector embedding concept has been extended successfully to many different domains and in this paper we both create and visualize vector representations of an unstructured collection of online communities based on user participation. Further, we quantitatively and qualitatively show that these representations allow solving of semantically meaningful community analogies and also other more general types of relationships. These results could help improve community recommendation engines and also serve as a tool for sociological studies of community relatedness.
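
Analogy solving over embeddings, as mentioned in this abstract, is commonly done with vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below assumes that standard recipe rather than reproducing the paper's exact procedure; the community names and the three-dimensional vectors are made up for illustration and are not learned from user participation data.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors: Dict[str, Vector], a: str, b: str, c: str) -> str:
    """Answer "a is to b as c is to ?".

    Builds the target vec(b) - vec(a) + vec(c) and returns the most
    similar community, excluding the three query items themselves.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [name for name in vectors if name not in {a, b, c}]
    return max(candidates, key=lambda name: cosine(vectors[name], target))


# Hypothetical embeddings; dimensions loosely mean (city A, city B, sports).
toy_vectors: Dict[str, Vector] = {
    "city_a": [1.0, 0.0, 0.0],
    "city_a_team": [1.0, 0.0, 1.0],
    "city_b": [0.0, 1.0, 0.0],
    "city_b_team": [0.0, 1.0, 1.0],
    "both_cities": [1.0, 1.0, 0.0],
}
```

With these toy vectors, `solve_analogy(toy_vectors, "city_a", "city_a_team", "city_b")` selects `"city_b_team"`, the community whose vector best matches the offset between a city and its team.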

pdf bib
Telling Apart Tweets Associated with Controversial versus Non-Controversial Topics
Aseel Addawood | Rezvaneh Rezapour | Omid Abdar | Jana Diesner

In this paper, we evaluate the predictability of tweets associated with controversial versus non-controversial topics. As a first step, we crowd-sourced the scoring of a predefined set of topics on a Likert scale from non-controversial to controversial. Our feature set entails and goes beyond sentiment features, e.g., by leveraging empathic language and other features that have been previously used but are new for this particular study. We find focusing on the structural characteristics of tweets to be beneficial for this task. Using a combination of emphatic, language-specific, and Twitter-specific features for supervised learning resulted in 87% accuracy (F1) for cross-validation of the training set and 63.4% accuracy when using the test set. Our analysis shows that features specific to Twitter or social media, in general, are more prevalent in tweets on controversial topics than in non-controversial ones. To test the premise of the paper, we conducted two additional sets of experiments, which led to mixed results. This finding will inform our future investigations into the relationship between language use on social media and the perceived controversiality of topics.

pdf bib
Cross-Lingual Classification of Topics in Political Texts
Goran Glavaš | Federico Nanni | Simone Paolo Ponzetto

In this paper, we propose an approach for cross-lingual topical coding of sentences from electoral manifestos of political parties in different languages. To this end, we exploit continuous semantic text representations and induce a joint multilingual semantic vector space to enable supervised learning using manually-coded sentences across different languages. Our experimental results show that classifiers trained on multilingual data yield performance boosts over monolingual topic classification.

pdf bib
Mining Social Science Publications for Survey Variables
Andrea Zielinski | Peter Mutschke

Research in Social Science is usually based on survey data where individual research questions relate to observable concepts (variables). However, due to a lack of standards for data citations a reliable identification of the variables used is often difficult. In this paper, we present a work-in-progress study that seeks to provide a solution to the variable detection task based on supervised machine learning algorithms, using a linguistic analysis pipeline to extract a rich feature set, including terminological concepts and similarity metric scores. Further, we present preliminary results on a small dataset that has been specifically designed for this task, yielding a significant increase in performance over the random baseline.

pdf bib
Linguistic Markers of Influence in Informal Interactions
Shrimai Prabhumoye | Samridhi Choudhary | Evangelia Spiliopoulou | Christopher Bogart | Carolyn Rose | Alan W Black

There has been a long-standing interest in understanding ‘Social Influence’ both in Social Sciences and in Computational Linguistics. In this paper, we present a novel approach to study and measure interpersonal influence in daily interactions. Motivated by the basic principles of influence, we attempt to identify indicative linguistic features of the posts in an online knitting community. We present the scheme used to operationalize and label the posts as influential or non-influential. Experiments with the identified features show an improvement in the classification accuracy of influence by 3.15%. Our results illustrate the important correlation between the structure of the language and its potential to influence others.

pdf bib
Non-lexical Features Encode Political Affiliation on Twitter
Rachael Tatman | Leo Stewart | Amandalynne Paullada | Emma Spiro

Previous work on classifying Twitter users’ political alignment has mainly focused on lexical and social network features. This study provides evidence that political affiliation is also reflected in features which have been previously overlooked: users’ discourse patterns (proportion of Tweets that are retweets or replies) and their rate of use of capitalization and punctuation. We find robust differences between politically left- and right-leaning communities with respect to these discourse and sub-lexical features, although they are not enough to train a high-accuracy classifier.

pdf bib
Modelling Participation in Small Group Social Sequences with Markov Rewards Analysis
Gabriel Murray

We explore a novel computational approach for analyzing member participation in small group social sequences. Using a complex state representation combining information about dialogue act types, sentiment expression, and participant roles, we explore which sequence states are associated with high levels of member participation. Using a Markov Rewards framework, we associate particular states with immediate positive and negative rewards, and employ a Value Iteration algorithm to calculate the expected value of all states. In our findings, we focus on discourse states belonging to team leaders and project managers which are either very likely or very unlikely to lead to participation from the rest of the group members.
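As a rough illustration of the value-iteration step mentioned in this abstract, the sketch below computes expected state values for a tiny Markov reward process. The state names, transition probabilities, rewards, and discount factor are all hypothetical placeholders chosen for the example; they are not taken from the paper.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Iterate V(s) = r(s) + gamma * sum_t P(t | s) * V(t) until values stabilize."""
    values = {state: 0.0 for state in transitions}
    for _ in range(max_iters):
        new_values = {
            state: rewards[state]
            + gamma * sum(p * values[nxt] for nxt, p in transitions[state].items())
            for state in transitions
        }
        # Largest change in any state's value during this sweep.
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    return values


# Hypothetical toy process: three discourse states with made-up probabilities.
transitions = {
    "manager_question": {"member_reply": 0.7, "manager_statement": 0.3},
    "manager_statement": {"manager_statement": 0.6, "manager_question": 0.4},
    "member_reply": {"manager_question": 0.5, "member_reply": 0.5},
}
# Assumed immediate rewards: positive when another member participates.
rewards = {"manager_question": 0.0, "manager_statement": -1.0, "member_reply": 1.0}

state_values = value_iteration(transitions, rewards)
for state, value in sorted(state_values.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {value:.3f}")
```

In this toy setting, states that tend to lead to member replies end up with higher expected value than states that tend to lead to further manager statements.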

pdf bib
Code-Switching as a Social Act: The Case of Arabic Wikipedia Talk Pages
Michael Yoder | Shruti Rijhwani | Carolyn Rosé | Lori Levin

Code-switching has been found to have social motivations in addition to syntactic constraints. In this work, we explore the social effect of code-switching in an online community. We present a task from the Arabic Wikipedia to capture language choice, in this case code-switching between Arabic and other languages, as a predictor of social influence in collaborative editing. We find that code-switching is positively associated with Wikipedia editor success, particularly borrowing technical language on pages with topics less directly related to Arabic-speaking regions.

pdf bib
How Does Twitter User Behavior Vary Across Demographic Groups?
Zach Wood-Doughty | Michael Smith | David Broniatowski | Mark Dredze

Demographically-tagged social media messages are a common source of data for computational social science. While these messages can indicate differences in beliefs and behaviors between demographic groups, we do not have a clear understanding of how different demographic groups use platforms such as Twitter. This paper presents a preliminary analysis of how groups’ differing behaviors may confound analyses of the groups themselves. We analyzed one million Twitter users by first inferring demographic attributes, and then measuring several indicators of Twitter behavior. We find differences in these indicators across demographic groups, suggesting that there may be underlying differences in how different demographic groups use Twitter.

pdf bib
Ideological Phrase Indicators for Classification of Political Discourse Framing on Twitter
Kristen Johnson | I-Ta Lee | Dan Goldwasser

Politicians carefully word their statements in order to influence how others view an issue, a political strategy called framing. Simultaneously, these frames may also reveal the beliefs or positions on an issue of the politician. Simple language features such as unigrams, bigrams, and trigrams are important indicators for identifying the general frame of a text, for both longer congressional speeches and shorter tweets of politicians. However, tweets may contain multiple unigrams across different frames which limits the effectiveness of this approach. In this paper, we present a joint model which uses both linguistic features of tweets and ideological phrase indicators extracted from a state-of-the-art embedding-based model to predict the general frame of political tweets.

Proceedings of the First Workshop on Abusive Language Online

pdf bib
Proceedings of the First Workshop on Abusive Language Online
Zeerak Waseem | Wendy Hui Kyong Chung | Dirk Hovy | Joel Tetreault

pdf bib
Dimensions of Abusive Language on Twitter
Isobelle Clarke | Jack Grieve

In this paper, we use a new categorical form of multidimensional register analysis to identify the main dimensions of functional linguistic variation in a corpus of abusive language, consisting of racist and sexist Tweets. By analysing the use of a wide variety of parts-of-speech and grammatical constructions, as well as various features related to Twitter and computer-mediated communication, we discover three dimensions of linguistic variation in this corpus, which we interpret as being related to the degree of interactive, antagonistic and attitudinal language exhibited by individual Tweets. We then demonstrate that there is a significant functional difference between racist and sexist Tweets, with sexist Tweets tending to be more interactive and attitudinal than racist Tweets.

pdf bib
Constructive Language in News Comments
Varada Kolhatkar | Maite Taboada

We discuss the characteristics of constructive news comments, and present methods to identify them. First, we define the notion of constructiveness. Second, we annotate a corpus for constructiveness. Third, we explore whether available argumentation corpora can be useful to identify constructiveness in news comments. Our model trained on argumentation corpora achieves a top accuracy of 72.59% (baseline=49.44%) on our crowd-annotated test data. Finally, we examine the relation between constructiveness and toxicity. In our crowd-annotated data, 21.42% of the non-constructive comments and 17.89% of the constructive comments are toxic, suggesting that non-constructive comments are not much more toxic than constructive comments.

pdf bib
Rephrasing Profanity in Chinese Text
Hui-Po Su | Zhen-Jie Huang | Hao-Tsung Chang | Chuan-Jie Lin

This paper proposes a system that can detect and rephrase profanity in Chinese text. Rather than just masking detected profanity, we want to revise the input sentence by using inoffensive words while keeping its original meaning. Twenty-nine such rephrasing rules were devised after observing sentences on real-world social websites. The overall accuracy of the proposed system is 85.56%.

pdf bib
Deep Learning for User Comment Moderation
John Pavlopoulos | Prodromos Malakasiotis | Ion Androutsopoulos

Experimenting with a new dataset of 1.6M user comments from a Greek news portal and existing datasets of English Wikipedia comments, we show that an RNN outperforms the previous state of the art in moderation. A deep, classification-specific attention mechanism further improves the overall performance of the RNN. We also compare against a CNN and a word-list baseline, considering both fully automatic and semi-automatic moderation.

pdf bib
Class-based Prediction Errors to Detect Hate Speech with Out-of-vocabulary Words
Joan Serrà | Ilias Leontiadis | Dimitris Spathis | Gianluca Stringhini | Jeremy Blackburn | Athena Vakali

Common approaches to text categorization essentially rely either on n-gram counts or on word embeddings. This presents important difficulties in highly dynamic or quickly-interacting environments, where the appearance of new words and/or varied misspellings is the norm. A paradigmatic example of this situation is abusive online behavior, with social networks and media platforms struggling to effectively combat uncommon or non-blacklisted hate words. To better deal with these issues in those fast-paced environments, we propose using the error signal of class-based language models as input to text classification algorithms. In particular, we train a next-character prediction model for any given class and then exploit the error of such class-based models to inform a neural network classifier. This way, we shift from the ‘ability to describe’ seen documents to the ‘ability to predict’ unseen content. Preliminary studies using out-of-vocabulary splits from abusive tweet data show promising results, outperforming competitive text categorization strategies by 4-11%.

pdf bib
One-step and Two-step Classification for Abusive Language Detection on Twitter
Ji Ho Park | Pascale Fung

Automatic abusive language detection is a difficult but important task for online social media. Our research explores a two-step approach that first classifies text as abusive or not and then classifies it into specific types, and compares it with a one-step approach that performs a single multi-class classification to detect sexist and racist language. On a public English Twitter corpus of 20 thousand tweets covering sexism and racism, our approach shows promising performance: 0.827 F-measure using HybridCNN in one step and 0.824 F-measure using logistic regression in two steps.

pdf bib
Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene
Darja Fišer | Tomaž Erjavec | Nikola Ljubešić

In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia. On this basis we aim to train an automatic identification and classification system with which we wish to contribute towards an improved methodology, understanding and treatment of such practices in the contemporary, increasingly multicultural information society.

pdf bib
Abusive Language Detection on Arabic Social Media
Hamdy Mubarak | Kareem Darwish | Walid Magdy

In this paper, we present our work on detecting abusive language on Arabic social media. We extract a list of obscene words and hashtags using common patterns used in offensive and rude communications. We also classify Twitter users according to whether they use any of these words or not in their tweets. We expand the list of obscene words using this classification, and we report results on a newly created dataset of classified Arabic tweets (obscene, offensive, and clean). We make this dataset freely available for research, in addition to the list of obscene words and hashtags. We are also publicly releasing a large corpus of classified user comments that were deleted from a popular Arabic news site due to violations of the site’s rules and guidelines.

pdf bib
Vectors for Counterspeech on Twitter
Lucas Wright | Derek Ruths | Kelly P Dillon | Haji Mohammad Saleem | Susan Benesch

A study of conversations on Twitter found that some arguments between strangers led to favorable change in discourse and even in attitudes. The authors propose that such exchanges can be usefully distinguished according to whether individuals or groups take part on each side, since the opportunity for a constructive exchange of views seems to vary accordingly.

pdf bib
Detecting Nastiness in Social Media
Niloofar Safi Samghabadi | Suraj Maharjan | Alan Sprague | Raquel Diaz-Sprague | Thamar Solorio

Although social media has made it easy for people to connect on a virtually unlimited basis, it has also opened doors to people who misuse it to undermine, harass, humiliate, threaten and bully others. There is a lack of adequate resources to detect and hinder its occurrence. In this paper, we present our initial NLP approach to detect invective posts as a first step to eventually detect and deter cyberbullying. We crawl data containing profanities and then determine whether or not it contains invective. Annotations on this data are improved iteratively by in-lab annotations and crowdsourcing. We pursue different NLP approaches containing various typical and some newer techniques to distinguish the use of swear words in a neutral way from those instances in which they are used in an insulting way. We also show that this model not only works for our data set, but also can be successfully applied to different data sets.

pdf bib
Technology Solutions to Combat Online Harassment
George Kennedy | Andrew McCollough | Edward Dixon | Alexei Bastidas | John Ryan | Chris Loo | Saurav Sahay

This work is part of a new initiative to use machine learning to identify online harassment in social media and comment streams. Online harassment goes under-reported due to the reliance on humans to identify and report harassment, reporting that is further slowed by requirements to fill out forms providing context. In addition, the time for moderators to respond and apply human judgment can take days, but response times in terms of minutes are needed in the online context. Though some of the major social media companies have been doing proprietary work in automating the detection of harassment, there are few tools available for use by the public. In addition, the amount of labeled online harassment data and availability of cross-platform online harassment datasets is limited. We present the methodology used to create a harassment dataset and classifier and the dataset used to help the system learn what harassment looks like.

pdf bib
Understanding Abuse: A Typology of Abusive Language Detection Subtasks
Zeerak Waseem | Thomas Davidson | Dana Warmsley | Ingmar Weber

As the body of research on abusive language detection and analysis grows, there is a need for critical consideration of the relationships between different subtasks that have been grouped under this label. Based on work on hate speech, cyberbullying, and online abuse we propose a typology that captures central similarities and differences between subtasks and discuss the implications of this for data annotation and feature construction. We emphasize the practical actions that can be taken by researchers to best approach their abusive language detection subtask of interest.

pdf bib
Using Convolutional Neural Networks to Classify Hate-Speech
Björn Gambäck | Utpal Kumar Sikdar

The paper introduces a deep learning-based Twitter hate-speech text classification system. The classifier assigns each tweet to one of four predefined categories: racism, sexism, both (racism and sexism) and non-hate-speech. Four Convolutional Neural Network models were trained on, respectively, character 4-grams, word vectors based on semantic information built using word2vec, randomly generated word vectors, and word vectors combined with character n-grams. The feature set was down-sized in the networks by max-pooling, and a softmax function was used to classify tweets. Tested by 10-fold cross-validation, the model based on word2vec embeddings performed best, with higher precision than recall, and a 78.3% F-score.

pdf bib
Illegal is not a Noun: Linguistic Form for Detection of Pejorative Nominalizations
Alexis Palmer | Melissa Robinson | Kristy K. Phillips

This paper focuses on a particular type of abusive language, targeting expressions in which typically neutral adjectives take on pejorative meaning when used as nouns - compare ‘gay people’ to ‘the gays’. We first collect and analyze a corpus of hand-curated, expert-annotated pejorative nominalizations for four target adjectives: female, gay, illegal, and poor. We then collect a second corpus of automatically-extracted and POS-tagged, crowd-annotated tweets. For both corpora, we find support for the hypothesis that some adjectives, when nominalized, take on negative meaning. The targeted constructions are non-standard yet widely-used, and part-of-speech taggers mistag some nominal forms as adjectives. We implement a tool called NomCatcher to correct these mistaggings, and find that the same tool is effective for identifying new adjectives subject to transformation via nominalization into abusive language.

Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology — From Linguistic Signal to Clinical Reality

pdf bib
Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology — From Linguistic Signal to Clinical Reality
Kristy Hollingshead | Molly E. Ireland | Kate Loveys

pdf bib
A Cross-modal Review of Indicators for Depression Detection Systems
Michelle Morales | Stefan Scherer | Rivka Levitan

Automatic detection of depression has attracted increasing attention from researchers in psychology, computer science, linguistics, and related disciplines. As a result, promising depression detection systems have been reported. This paper surveys these efforts by presenting the first cross-modal review of depression detection systems and discusses best practices and most promising approaches to this task.

pdf bib
In your wildest dreams: the language and psychological features of dreams
Kate Niederhoffer | Jonathan Schler | Patrick Crutchley | Kate Loveys | Glen Coppersmith

In this paper, we provide the first quantified exploration of the structure of the language of dreams, their linguistic style and emotional content. We present a collection of digital dream logs as a viable corpus for the growing study of mental health through the lens of language, complementary to the work done examining more traditional social media. This paper is largely exploratory in nature to lay the groundwork for subsequent research in mental health, rather than optimizing a particular text classification task.

pdf bib
A Corpus Analysis of Social Connections and Social Isolation in Adolescents Suffering from Depressive Disorders
Jia-Wen Guo | Danielle L Mowery | Djin Lai | Katherine Sward | Mike Conway

Social connection and social isolation are associated with depressive symptoms, particularly in adolescents and young adults, but how these concepts are documented in clinical notes is unknown. This pilot study aimed to identify the topics relevant to social connection and isolation by analyzing 145 clinical notes from patients with a depression diagnosis. We found that providers, including physicians, nurses, social workers, and psychologists, document descriptions of both social connection and social isolation.

pdf bib
Monitoring Tweets for Depression to Detect At-risk Users
Zunaira Jamil | Diana Inkpen | Prasadith Buddhitha | Kenton White

We propose an automated system that can identify at-risk users from their public social media activity, more specifically, from Twitter. The data that we collected is from the #BellLetsTalk campaign, which is a wide-reaching, multi-year program designed to break the silence around mental illness and support mental health across Canada. To achieve our goal, we trained a user-level classifier that can detect at-risk users that achieves a reasonable precision and recall. We also trained a tweet-level classifier that predicts if a tweet indicates depression. This task was much more difficult due to the imbalanced data. In the dataset that we labeled, we came across 5% depression tweets and 95% non-depression tweets. To handle this class imbalance, we used undersampling methods. The resulting classifier had high recall, but low precision. Therefore, we only use this classifier to compute the estimated percentage of depressed tweets and to add this value as a feature for the user-level classifier.
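The abstract does not say which undersampling method was used. As one common variant, the sketch below shows random undersampling, which discards examples from larger classes until every class matches the size of the smallest one. The function name, the placeholder tweets, and the label strings are hypothetical; only the 5% to 95% ratio mirrors the figures quoted above.

```python
import random


def undersample(examples, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for example, label in zip(examples, labels):
        by_label.setdefault(label, []).append(example)
    # Size of the rarest class sets the per-class quota.
    quota = min(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        for example in rng.sample(items, quota):
            balanced.append((example, label))
    rng.shuffle(balanced)
    return balanced


# Placeholder data with a 5% / 95% class split.
tweets = [f"tweet {i}" for i in range(100)]
labels = ["depression" if i < 5 else "other" for i in range(100)]

balanced = undersample(tweets, labels)
print(len(balanced))  # 5 examples per class, 10 in total
```

Balancing this way throws away most of the majority class, which is one reason a classifier trained on the result may trade precision for recall.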

pdf bib
Investigating Patient Attitudes Towards the use of Social Media Data to Augment Depression Diagnosis and Treatment: a Qualitative Study
Jude Mikal | Samantha Hurst | Mike Conway

In this paper, we use qualitative research methods to investigate the attitudes of social media users towards the (opt-in) integration of social media data with routine mental health care and diagnosis. Our investigation was based on secondary analysis of a series of five focus groups with Twitter users, including three groups consisting of participants with a self-reported history of depression, and two groups consisting of participants without a self-reported history of depression. Our results indicate that, overall, research participants were enthusiastic about the possibility of using social media (in conjunction with automated Natural Language Processing algorithms) for mood tracking under the supervision of a mental health practitioner. However, for at least some participants, there was skepticism related to how well social media represents the mental health of users, and hence its usefulness in the clinical context.

pdf bib
Natural-language Interactive Narratives in Imaginal Exposure Therapy for Obsessive-Compulsive Disorder
Melissa Roemmele | Paola Mardo | Andrew Gordon

Obsessive-compulsive disorder (OCD) is an anxiety-based disorder that affects around 2.5% of the population. A common treatment for OCD is exposure therapy, where the patient repeatedly confronts a feared experience, which has the long-term effect of decreasing their anxiety. Some exposures consist of reading and writing stories about an imagined anxiety-provoking scenario. In this paper, we present a technology that enables patients to interactively contribute to exposure stories by supplying natural language input (typed or spoken) that advances a scenario. This interactivity could potentially increase the patient’s sense of immersion in an exposure and contribute to its success. We introduce the NLP task behind processing inputs to predict new events in the scenario, and describe our initial approach. We then illustrate the future possibility of this work with an example of an exposure scenario authored with our application.

pdf bib
Detecting Anxiety through Reddit
Judy Hanwen Shen | Frank Rudzicz

Previous investigations into detecting mental illnesses through social media have predominately focused on detecting depression through Twitter corpora. In this paper, we study anxiety disorders through personal narratives collected through the popular social media website, Reddit. We build a substantial data set of typical and anxiety-related posts, and we apply N-gram language modeling, vector embeddings, topic analysis, and emotional norms to generate features that accurately classify posts related to binary levels of anxiety. We achieve an accuracy of 91% with vector-space word embeddings, and an accuracy of 98% when combined with lexicon-based features.

pdf bib
Detecting and Explaining Crisis
Rohan Kshirsagar | Robert Morris | Samuel Bowman

Individuals on social media may reveal themselves to be in various states of crisis (e.g. suicide, self-harm, abuse, or eating disorders). Detecting crisis from social media text automatically and accurately can have profound consequences. However, detecting a general state of crisis without explaining why has limited applications. An explanation in this context is a coherent, concise subset of the text that rationalizes the crisis detection. We explore several methods to detect and explain crisis using a combination of neural and non-neural techniques. We evaluate these techniques on a unique data set obtained from Koko, an anonymous emotional support network available through various messaging applications. We annotate a small subset of the samples labeled with crisis with corresponding explanations. Our best technique significantly outperforms the baseline for detection and explanation.

pdf bib
A Dictionary-Based Comparison of Autobiographies by People and Murderous Monsters
Micah Iserman | Molly Ireland

People typically assume that killers are mentally ill or fundamentally different from the rest of humanity. Similarly, people often associate mental health conditions (such as schizophrenia or autism) with violence and otherness - treatable perhaps, but not empathically understandable. We take a dictionary approach to explore word use in a set of autobiographies, comparing the narratives of 2 killers (Adolf Hitler and Elliot Rodger) and 39 non-killers. Although results suggest several dimensions that differentiate these autobiographies - such as sentiment, temporal orientation, and references to death - they appear to reflect subject matter rather than psychology per se. Additionally, the Rodger text shows roughly typical developmental arcs in its use of words relating to friends, family, sex, and affect. From these data, we discuss the challenges of understanding killers and people in general.

pdf bib
Small but Mighty: Affective Micropatterns for Quantifying Mental Health from Social Media Language
Kate Loveys | Patrick Crutchley | Emily Wyatt | Glen Coppersmith

Many psychological phenomena occur in small time windows, measured in minutes or hours. However, most computational linguistic techniques look at data on the order of weeks, months, or years. We explore micropatterns in sequences of messages occurring over a short time window for their prevalence and power for quantifying psychological phenomena, specifically, patterns in affect. We examine affective micropatterns in social media posts from users with anxiety, eating disorders, panic attacks, schizophrenia, suicidality, and matched controls.

Proceedings of the First Workshop on Neural Machine Translation

pdf bib
Proceedings of the First Workshop on Neural Machine Translation
Thang Luong | Alexandra Birch | Graham Neubig | Andrew Finch

pdf bib
An Empirical Study of Adequate Vision Span for Attention-Based Neural Machine Translation
Raphael Shu | Hideki Nakayama

Recently, the attention mechanism has played a key role in achieving high performance for Neural Machine Translation models. However, as it computes a score function for the encoder states in all positions at each decoding step, the attention model greatly increases the computational complexity. In this paper, we investigate the adequate vision span of attention models in the context of machine translation, by proposing a novel attention framework that is capable of reducing redundant score computation dynamically. The term “vision span” means a window of the encoder states considered by the attention model in one step. In our experiments, we found that the average window size of vision span can be reduced by over 50% with modest loss in accuracy on English-Japanese and German-English translation tasks.

pdf bib
Analyzing Neural MT Search and Model Performance
Jan Niehues | Eunah Cho | Thanh-Le Ha | Alex Waibel

In this paper, we offer an in-depth analysis of modeling and search performance. We address the question of whether a more complex search algorithm is necessary. Furthermore, we investigate whether more complex models, which might only be applicable during rescoring, are promising. By separating the search space and the modeling using n-best list reranking, we analyze the influence of both parts of an NMT system independently. By comparing differently performing NMT systems, we show that the better translation is already in the search space of the lower-performing translation systems. These results indicate that the current search algorithms are sufficient for the NMT systems. Furthermore, we show that even a relatively small n-best list of 50 hypotheses already contains notably better translations.

pdf bib
Stronger Baselines for Trustable Results in Neural Machine Translation
Michael Denkowski | Graham Neubig

Interest in neural machine translation has grown rapidly as its effectiveness has been demonstrated across language and data scenarios. New research regularly introduces architectural and algorithmic improvements that lead to significant gains over “vanilla” NMT implementations. However, these new techniques are rarely evaluated in the context of previously published techniques, specifically those that are widely used in state-of-the-art production and shared-task systems. As a result, it is often difficult to determine whether improvements from research will carry over to systems deployed for real-world use. In this work, we recommend three specific methods that are relatively easy to implement and result in much stronger experimental systems. Beyond reporting significantly higher BLEU scores, we conduct an in-depth analysis of where improvements originate and what inherent weaknesses of basic NMT models are being addressed. We then compare the relative gains afforded by several other techniques proposed in the literature when starting with vanilla systems versus our stronger baselines, showing that experimental conclusions may change depending on the baseline chosen. This indicates that choosing a strong baseline is crucial for reporting reliable experimental results.

pdf bib
Six Challenges for Neural Machine Translation
Philipp Koehn | Rebecca Knowles

We explore six challenges for neural machine translation: domain mismatch, amount of training data, rare words, long sentences, word alignment, and beam search. We show both deficiencies and improvements over the quality of phrase-based statistical machine translation.

pdf bib
Cost Weighting for Neural Machine Translation Domain Adaptation
Boxing Chen | Colin Cherry | George Foster | Samuel Larkin

In this paper, we propose a new domain adaptation technique for neural machine translation called cost weighting, which is appropriate for adaptation scenarios in which a small in-domain data set and a large general-domain data set are available. Cost weighting incorporates a domain classifier into the neural machine translation training algorithm, using features derived from the encoder representation in order to distinguish in-domain from out-of-domain data. Classifier probabilities are used to weight sentences according to their domain similarity when updating the parameters of the neural translation model. We compare cost weighting to two traditional domain adaptation techniques developed for statistical machine translation: data selection and sub-corpus weighting. Experiments on two large-data tasks show that both the traditional techniques and our novel proposal lead to significant gains, with cost weighting outperforming the traditional methods.
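
The idea of weighting sentences by classifier probability during training can be sketched as below. The weight form `1 + p` and the function name are assumptions chosen for illustration; the abstract does not specify the exact weighting formula.

```python
from typing import List

def cost_weighted_loss(sentence_nll: List[float],
                       in_domain_prob: List[float]) -> float:
    """Average per-sentence loss, scaled by a domain-similarity weight.

    The weight 1 + p is an illustrative assumption: every sentence still
    contributes, but sentences the classifier deems in-domain count more.
    """
    if len(sentence_nll) != len(in_domain_prob):
        raise ValueError("one probability per sentence is required")
    weights = [1.0 + p for p in in_domain_prob]
    weighted = [w * nll for w, nll in zip(weights, sentence_nll)]
    return sum(weighted) / len(weighted)

# Two sentences with equal loss; the first is judged fully in-domain.
nll = [2.0, 2.0]
probs = [1.0, 0.0]
loss = cost_weighted_loss(nll, probs)
```

In a real system the per-sentence losses would come from the translation model and the probabilities from the domain classifier; here both are plain lists so the weighting step can be read in isolation.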

pdf bib
Detecting Untranslated Content for Neural Machine Translation
Isao Goto | Hideki Tanaka

Despite its promise, neural machine translation (NMT) has a serious problem in that source content may be mistakenly left untranslated. The ability to detect untranslated content is important for the practical use of NMT. We evaluate two types of probability with which to detect untranslated content: the cumulative attention (ATN) probability and back translation (BT) probability from the target sentence to the source sentence. Experiments on detecting untranslated content in Japanese-English patent translations show that ATN and BT are each more effective than random choice, BT is more effective than ATN, and the combination of the two provides further improvements. We also confirmed the effectiveness of using ATN and BT to rerank the n-best NMT outputs.
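
A simplified view of the cumulative attention signal is sketched below: attention weights are summed over target steps for each source position, and positions that receive little total attention are flagged. The threshold, function names, and toy matrix are assumptions for illustration, and the probability model evaluated in the paper may differ from this plain sum.

```python
from typing import List

def cumulative_attention(attention: List[List[float]]) -> List[float]:
    """Sum attention over target steps for each source position.

    attention[t][j] is the weight target step t places on source position j.
    """
    if not attention:
        return []
    source_len = len(attention[0])
    return [sum(row[j] for row in attention) for j in range(source_len)]

def flag_untranslated(attention: List[List[float]], threshold: float) -> List[int]:
    """Return source positions whose cumulative attention is below threshold."""
    totals = cumulative_attention(attention)
    return [j for j, total in enumerate(totals) if total < threshold]

# Three target steps attending over three source words.
attention = [[0.9, 0.1, 0.0],
             [0.2, 0.8, 0.0],
             [0.1, 0.8, 0.1]]
suspects = flag_untranslated(attention, threshold=0.5)
```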

pdf bib
Beam Search Strategies for Neural Machine Translation
Markus Freitag | Yaser Al-Onaizan

The basic concept in Neural Machine Translation (NMT) is to train a large Neural Network that maximizes the translation performance on a given parallel corpus. NMT then uses a simple left-to-right beam-search decoder to generate new translations that approximately maximize the trained conditional probability. The current beam search strategy generates the target sentence word by word from left to right while keeping a fixed number of active candidates at each time step. First, this simple search is less adaptive as it also expands candidates whose scores are much worse than the current best. Secondly, it does not expand hypotheses if they are not within the best scoring candidates, even if their scores are close to the best one. The latter can be avoided by increasing the beam size until no performance improvement can be observed. While this can yield better performance, it has the drawback of a slower decoding speed. In this paper, we concentrate on speeding up the decoder by applying a more flexible beam search strategy whose candidate size may vary at each time step depending on the candidate scores. We speed up the original decoder by up to 43% for the two language pairs German to English and Chinese to English without losing any translation quality.
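
One way a beam whose size varies with the candidate scores could look is sketched below: at each time step, candidates that fall too far behind the best score are dropped, with a hard cap on the beam size. The function name and the `max_gap` parameter are assumptions for illustration; the abstract does not state which pruning criteria the authors use.

```python
from typing import List, Tuple

def prune_candidates(candidates: List[Tuple[str, float]],
                     max_beam: int,
                     max_gap: float) -> List[Tuple[str, float]]:
    """Keep at most max_beam candidates whose log-score is within max_gap of the best."""
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_score = ranked[0][1]
    kept = [c for c in ranked if best_score - c[1] <= max_gap]
    return kept[:max_beam]

# Four partial hypotheses with log-scores at one time step.
candidates = [("a", -0.1), ("b", -0.5), ("c", -3.0), ("d", -4.0)]
pruned = prune_candidates(candidates, max_beam=3, max_gap=1.0)
```

With these toy scores only two candidates survive even though the cap is three, so fewer hypotheses are expanded at the next step than a fixed-size beam would expand.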

pdf bib
An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation
Makoto Morishita | Yusuke Oda | Graham Neubig | Koichiro Yoshino | Katsuhito Sudoh | Satoshi Nakamura

Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes. During the mini-batched training process, it is necessary to pad shorter sentences in a mini-batch to be equal in length to the longest sentence therein for efficient computation. Previous work has noted that sorting the corpus based on the sentence length before making mini-batches reduces the amount of padding and increases the processing speed. However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated or compared. This work investigates mini-batch creation strategies with experiments over two different datasets. Our results suggest that the choice of a mini-batch creation strategy has a large effect on NMT training and some length-based sorting strategies do not always work well compared with simple shuffling.
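
The padding cost discussed above can be made concrete with a small sketch that counts padding tokens for a given batching order. The function name and the toy sentence lengths are assumptions for illustration. The sketch only measures padding; it says nothing about the effect on training quality, which is the question the paper studies.

```python
from typing import List

def padding_tokens(lengths: List[int], batch_size: int) -> int:
    """Count padding tokens when consecutive sentences are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sentence is padded up to the longest sentence in its batch.
        total += sum(max(batch) - n for n in batch)
    return total

# Sentence lengths in corpus order, then sorted by length before batching.
lengths = [3, 20, 4, 18, 5, 19]
unsorted_padding = padding_tokens(lengths, batch_size=2)
sorted_padding = padding_tokens(sorted(lengths), batch_size=2)
```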

pdf bib
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
Marine Carpuat | Yogarshi Vyas | Xing Niu

Parallel corpora are often not as parallel as one might assume: non-literal translations and noisy translations abound, even in curated corpora routinely used for training and evaluation. We use a cross-lingual textual entailment system to distinguish sentence pairs that are parallel in meaning from those that are not, and show that filtering out divergent examples from training improves translation quality.


Proceedings of the 10th International Conference on Natural Language Generation

pdf bib
Proceedings of the 10th International Conference on Natural Language Generation
Jose M. Alonso | Alberto Bugarín | Ehud Reiter

pdf bib
Linguistic realisation as machine translation: Comparing different MT models for AMR-to-text generation
Thiago Castro Ferreira | Iacer Calixto | Sander Wubben | Emiel Krahmer

In this paper, we study AMR-to-text generation, framing it as a translation task and comparing two different MT approaches (Phrase-based and Neural MT). We systematically study the effects of 3 AMR preprocessing steps (Delexicalisation, Compression, and Linearisation) applied before the MT phase. Our results show that preprocessing indeed helps, although the benefits differ for the two MT models.

pdf bib
A Survey on Intelligent Poetry Generation: Languages, Features, Techniques, Reutilisation and Evaluation
Hugo Gonçalo Oliveira

Poetry generation is becoming popular among researchers of Natural Language Generation, Computational Creativity and, broadly, Artificial Intelligence. To produce text that may be regarded as poetry, poetry generation systems are typically knowledge-intensive and have to deal with several levels of language, from lexical to semantic. Interest in the topic has resulted in the development of several poetry generators described in the literature, with different features covered or handled differently by a broad range of alternative approaches, as well as different perspectives on evaluation, another challenging aspect due to the underlying subjectivity. This paper surveys intelligent poetry generators around a set of relevant axes for poetry generation (targeted languages, form and content features, techniques, reutilisation of material, and evaluation) and aims to organise the work developed on this topic so far.

pdf bib
Cross-linguistic differences and similarities in image descriptions
Emiel van Miltenburg | Desmond Elliott | Piek Vossen

Automatic image description systems are commonly trained and evaluated on large image description datasets. Recently, researchers have started to collect such datasets for languages other than English. An unexplored question is how different these datasets are from English and, if there are any differences, what causes them to differ. This paper provides a cross-linguistic comparison of Dutch, English, and German image descriptions. We find that these descriptions are similar in many respects, but the familiarity of crowd workers with the subjects of the images has a noticeable influence on the specificity of the descriptions.

pdf bib
Content Selection for Real-time Sports News Construction from Commentary Texts
Jin-ge Yao | Jianmin Zhang | Xiaojun Wan | Jianguo Xiao

We study the task of constructing sports news reports automatically from live commentary and focus on content selection. Rather than receiving every piece of text of a sports match before news construction, as in previous related work, we verify the feasibility of a more challenging but more useful setting: generating news reports on the fly by treating live text input as a stream. Specifically, we design various scoring functions to address different requirements of the task. The near submodularity of the scoring functions makes it possible to adapt efficient greedy algorithms even in stream data settings. Experiments suggest that our proposed framework can already produce results comparable with previous work that relies on a supervised learning-to-rank model with heavy feature engineering.

pdf bib
Improving the Naturalness and Expressivity of Language Generation for Spanish
Cristina Barros | Dimitra Gkatzia | Elena Lloret

We present a flexible Natural Language Generation approach for Spanish, focused on the surface realisation stage, which integrates an inflection module in order to improve the naturalness and expressivity of the generated language. This inflection module inflects the verbs using an ensemble of trainable algorithms, whereas the other types of words (e.g. nouns, determiners, etc.) are inflected using hand-crafted rules. We show that our approach achieves 2% higher accuracy than two state-of-the-art inflection generation approaches. Furthermore, our proposed approach also predicts an extra feature: the inflection of the imperative mood, which was not taken into account by previous work. We also present a user evaluation, where we demonstrate that the proposed method significantly improves the perceived naturalness of the generated language.

pdf bib
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?
Marc Tanti | Albert Gatt | Kenneth Camilleri

Image captioning has evolved into a core task for Natural Language Generation and has also proved to be an important testbed for deep learning approaches to handling multimodal representations. Most contemporary approaches rely on a combination of a convolutional network to handle image features, and a recurrent network to encode linguistic information. The latter is typically viewed as the primary “generation” component. Beyond this high-level characterisation, a CNN+RNN model supports a variety of architectural designs. The dominant model in the literature is one in which visual features encoded by a CNN are “injected” as part of the linguistic encoding process, driving the RNN’s linguistic choices. By contrast, it is possible to envisage an architecture in which visual and linguistic features are encoded separately, and merged at a subsequent stage. In this paper, we address two related questions: (1) Is direct injection the best way of combining multimodal information, or is a late merging alternative better for the image captioning task? (2) To what extent should a recurrent network be viewed as actually generating, rather than simply encoding, linguistic information?

pdf bib
Exploring the Behavior of Classic REG Algorithms in the Description of Characters in 3D Images
Gonzalo Méndez | Raquel Hervás | Susana Bautista | Adrián Rabadán | Teresa Rodríguez

Describing people and characters can be very useful in different contexts, such as computational narrative or image description for the visually impaired. However, a review of the existing literature shows that the automatic generation of people descriptions has not received much attention. Our work focuses on the description of people in snapshots from a 3D environment. First, we have conducted a survey to identify the way in which people describe other people under different conditions. We have used the information extracted from this survey to design several Referring Expression Generation algorithms which produce similar results. We have evaluated these algorithms with users in order to identify which ones generate the best description for specific characters in different situations. The evaluation has shown that, in order to generate good descriptions, a combination of different algorithms has to be used depending on the features and situation of the person to be described.

pdf bib
Co-PoeTryMe: a Co-Creative Interface for the Composition of Poetry
Hugo Gonçalo Oliveira | Tiago Mendes | Ana Boavida

Co-PoeTryMe is a web application for poetry composition, guided by the user, though with the help of automatic features, such as the generation of full (editable) drafts, as well as the acquisition of additional well-formed lines, or semantically-related words, possibly constrained by the number of syllables, rhyme, or polarity. Towards the final poem, the latter can replace lines or words in the draft.

pdf bib
Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images
Sina Zarrieß | M. Soledad López Gambino | David Schlangen

Current referring expression generation systems mostly deliver their output as one-shot, written expressions. We present on-going work on incremental generation of spoken expressions referring to objects in real-world images. This approach extends upon previous work using the words-as-classifier model for generation. We implement this generator in an incremental dialogue processing framework such that we can exploit an existing interface to incremental text-to-speech synthesis. Our system generates and synthesizes referring expressions while continuously observing non-verbal user reactions.

pdf bib
Finding the “right” answers for customers
Frank Schilder

This talk will present a few NLG systems developed within Thomson Reuters providing information to professionals such as lawyers, accountants or traders. Based on the experience of developing these systems, I will discuss the usefulness of automatic metrics, crowd-sourced evaluation, corpus studies and expert reviews. I will conclude by exploring the question of whether developers of NLG systems need to follow ethical guidelines and how those guidelines could be established.

pdf bib
Referring Expression Generation under Uncertainty: Algorithm and Evaluation Framework
Tom Williams | Matthias Scheutz

For situated agents to effectively engage in natural-language interactions with humans, they must be able to refer to entities such as people, locations, and objects. While classic referring expression generation (REG) algorithms like the Incremental Algorithm (IA) assume perfect, complete, and accessible knowledge of all referents, this is not always possible. In this work, we show how a previously presented consultant framework (which facilitates reference resolution when knowledge is uncertain, heterogeneous and distributed) can be used to extend the IA to produce DIST-PIA, a domain-independent algorithm for REG under uncertain, heterogeneous, and distributed knowledge. We also present a novel framework that can be used to evaluate such REG algorithms without conflating the performance of the algorithm with the performance of classifiers it employs.
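
For readers unfamiliar with the classic Incremental Algorithm that the abstract takes as its starting point, a minimal sketch follows. It assumes perfect, complete knowledge of all entities (exactly the assumption the paper relaxes), omits refinements such as always including the type attribute, and is not DIST-PIA. The entity representation and names are assumptions for illustration.

```python
from typing import Dict, List, Optional

Entity = Dict[str, str]

def incremental_algorithm(target: Entity,
                          distractors: List[Entity],
                          preference_order: List[str]) -> Optional[Dict[str, str]]:
    """Select attribute-value pairs that distinguish target from distractors.

    Returns None if the listed attributes cannot rule out every distractor.
    """
    description: Dict[str, str] = {}
    remaining = list(distractors)
    for attribute in preference_order:
        if not remaining:
            break
        if attribute not in target:
            continue
        value = target[attribute]
        ruled_out = [d for d in remaining if d.get(attribute) != value]
        if ruled_out:  # keep the attribute only if it removes a distractor
            description[attribute] = value
            remaining = [d for d in remaining if d.get(attribute) == value]
    return description if not remaining else None

target = {"type": "mug", "colour": "red", "size": "big"}
distractors = [{"type": "mug", "colour": "blue", "size": "big"},
               {"type": "plate", "colour": "red", "size": "big"}]
description = incremental_algorithm(target, distractors, ["type", "colour", "size"])
```

Here the size attribute is never added because type and colour already rule out both distractors, which reflects the algorithm's preference-ordered, non-backtracking behaviour.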

pdf bib
Natural Language Descriptions for Human Activities in Video Streams
Nouf Alharbi | Yoshihiko Gotoh

There has been continuous growth in the volume and ubiquity of video material. It has become essential to define video semantics in order to aid the searchability and retrieval of this data. We present a framework that produces textual descriptions of video, based on the visual semantic content. Detected action classes are rendered as verbs, participant objects are converted to noun phrases, visual properties of detected objects are rendered as adjectives, and spatial relations between objects are rendered as prepositions. Further, in cases of zero-shot action recognition, a language model is used to infer a missing verb, aided by the detection of objects and scene settings. These extracted features are converted into textual descriptions using a template-based approach. The proposed video description framework is evaluated on the NLDHA dataset using ROUGE scores and human judgment evaluation.

pdf bib
PASS: A Dutch data-to-text system for soccer, targeted towards specific audiences
Chris van der Lee | Emiel Krahmer | Sander Wubben

We present PASS, a data-to-text system that generates Dutch soccer reports from match statistics. One of the novel elements of PASS is the fact that the system produces corpus-based texts tailored towards fans of one club or the other, which can most prominently be observed in the tone of voice used in the reports. Furthermore, the system is open source and uses a modular design, which makes it relatively easy for people to add extensions. Human-based evaluation shows that people are generally positive towards PASS in regards to its clarity and fluency, and that the tailoring is accurately recognized in most cases.

pdf bib
Evaluation of a Runyankore grammar engine for healthcare messages
Joan Byamugisha | C. Maria Keet | Brian DeRenzi

Natural Language Generation (NLG) can be used to generate personalized health information, which is especially useful when provided in one’s own language. However, the NLG technique widely used in different domains and languages—templates—was shown to be inapplicable to Bantu languages, due to their characteristic agglutinative structure. We present here our use of the grammar engine NLG technique to generate text in Runyankore, a Bantu language indigenous to Uganda. Our grammar engine adds to previous work in this field with new rules for cardinality constraints, prepositions in roles, the passive, and phonological conditioning. We evaluated the generated text with linguists and non-linguists, who regarded most text as grammatically correct and understandable; and over 60% of them regarded all the text generated by our system to have been authored by a human being.

pdf bib
Talking about the world with a distributed model
Gemma Boleda

We use language to talk about the world, and so reference is a crucial property of language. However, modeling reference is particularly difficult, as it involves both continuous and discrete aspects of language. For instance, referring expressions like “the big mug” or “it” typically contain content words (“big”, “mug”), which are notoriously fuzzy or vague in their meaning, and also function words (“the”, “it”) that largely serve as discrete pointers. Data-driven, distributed models based on distributional semantics or deep learning excel at the former, but struggle with the latter, and the reverse is true for symbolic models. I present ongoing work on modeling reference with a distributed model that aims to capture both aspects and learns to refer directly from reference acts.

pdf bib
The Code2Text Challenge: Text Generation in Source Libraries
Kyle Richardson | Sina Zarrieß | Jonas Kuhn

We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction, and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets.

pdf bib
Shared Task Proposal: Multilingual Surface Realization Using Universal Dependency Trees
Simon Mille | Bernd Bohnet | Leo Wanner | Anja Belz

We propose a shared task on multilingual Surface Realization, i.e., on mapping unordered and uninflected universal dependency trees to correctly ordered and inflected sentences in a number of languages. A second deeper input will be available in which, in addition, functional words, fine-grained PoS and morphological information will be removed from the input trees. The first shared task on Surface Realization was carried out in 2011 with a similar setup, with a focus on English. We think that it is time for relaunching such a shared task effort in view of the arrival of Universal Dependencies annotated treebanks for a large number of languages on the one hand, and the increasing dominance of Deep Learning, which proved to be a game changer for NLP, on the other hand.

pdf bib
The WebNLG Challenge: Generating Text from RDF Data
Claire Gardent | Anastasia Shimorina | Shashi Narayan | Laura Perez-Beltrachini

The WebNLG challenge consists in mapping sets of RDF triples to text. It provides a common benchmark on which to train, evaluate and compare “microplanners”, i.e. generation systems that verbalise a given content by making a range of complex interacting choices including referring expression generation, aggregation, lexicalisation, surface realisation and sentence segmentation. In this paper, we introduce the microplanning task, describe data preparation, introduce our evaluation methodology, analyse participant results and provide a brief description of the participating systems.

pdf bib
A Commercial Perspective on Reference
Ehud Reiter

I briefly describe some of the commercial work which XXX is doing in referring expression algorithms, and highlight differences between what is commercially important (at least to XXX) and the NLG research literature. In particular, XXX is less interested in generic reference algorithms than in high-quality algorithms for specific types of references, such as components of machines, named entities, and dates.

pdf bib
Integrated sentence generation using charts
Alexander Koller | Nikos Engonopoulos

Integrating surface realization and the generation of referring expressions into a single algorithm can improve the quality of the generated sentences. Existing algorithms for doing this, such as SPUD and CRISP, are search-based and can be slow or incomplete. We offer a chart-based algorithm for integrated sentence generation and demonstrate its runtime efficiency.

pdf bib
Adapting SimpleNLG to Spanish
Alejandro Ramos-Soto | Julio Janeiro-Gallardo | Alberto Bugarín Diz

We describe SimpleNLG-ES, an adaptation of the SimpleNLG realization library for the Spanish language. Our implementation is based on the bilingual English-French SimpleNLG-EnFr adaptation. The library has been tested using a battery of examples that ensure that the most common syntax, morphology and orthography rules for Spanish are met. The library is currently being used in three different projects for the development of data-to-text systems in the meteorological, statistical data information, and business intelligence application domains.

pdf bib
G-TUNA: a corpus of referring expressions in German, including duration information
David Howcroft | Jorrig Vogels | Vera Demberg

Corpora of referring expressions elicited from human participants in a controlled environment are an important resource for research on automatic referring expression generation. We here present G-TUNA, a new corpus of referring expressions for German. Using the furniture stimuli set developed for the TUNA and D-TUNA corpora, our corpus extends on these corpora by providing data collected in a simulated driving dual-task setting, and additionally provides exact duration annotations for the spoken referring expressions. This corpus will hence allow researchers to analyze the interaction between referring expression length and speech rate, under conditions where the listener is under high vs. low cognitive load.

pdf bib
Toward an NLG System for Bantu languages: first steps with Runyankore (demo)
Joan Byamugisha | C. Maria Keet | Brian DeRenzi

There are many domain-specific and language-specific NLG systems, some of which it may be possible to adapt to related domains and languages. The languages in the Bantu language family have their own set of features distinct from other major groups, which severely limits the options to bootstrap an NLG system from existing ones. We present here our first proof-of-concept application for knowledge-to-text NLG as a plugin to the Protege 5.x ontology development system, tailored to Runyankore, a Bantu language indigenous to Uganda. It comprises a basic annotation model for linguistic information such as noun class, an implementation of existing verbalisation rules and a CFG for verbs, and a basic interface for data entry.

pdf bib
A working, non-trivial, topically indifferent NLG System for 17 languages
Robert Weißgraeber | Andreas Madsack

A fully fledged practical working application for a rule-based NLG system is presented that is able to create non-trivial, human sounding narrative from structured data, in any language and for any topic.

pdf bib
Generating titles for millions of browse pages on an e-Commerce site
Prashant Mathur | Nicola Ueffing | Gregor Leusch

We present two approaches to generate titles for browse pages in five different languages, namely English, German, French, Italian and Spanish. These browse pages are structured search pages in an e-commerce domain. We first present a rule-based approach to generate these browse page titles. In addition, we also present a hybrid approach which uses a phrase-based statistical machine translation engine on top of the rule-based system to assemble the best title. For the two languages English and German we have access to a large amount of already available rule-based generated and curated titles. For these languages we present an automatic post-editing approach which learns how to post-edit the rule-based titles into curated titles.

pdf bib
Towards Automatic Generation of Product Reviews from Aspect-Sentiment Scores
Hongyu Zang | Xiaojun Wan

Data-to-text generation is essential in machine writing applications. Recent deep learning models, such as Recurrent Neural Networks (RNNs), have shown promise for relevant text generation tasks. However, little work has been done on the automatic generation of long reviews from user opinions. In this paper, we introduce a deep neural network model to generate long Chinese reviews from aspect-sentiment scores representing users’ opinions. We conduct our study within the framework of encoder-decoder networks, and we propose a hierarchical structure with aligned attention in the Long-Short Term Memory (LSTM) decoder. Experiments show that our model outperforms retrieval-based baseline methods, and also beats the sequential generation models in qualitative evaluations.

pdf bib
A model of suspense for narrative generation
Richard Doust | Paul Piwek

Most work on automatic generation of narratives, and more specifically suspenseful narrative, has focused on detailed domain-specific modelling of character psychology and plot structure. Recent work in computational linguistics on the automatic learning of narrative schemas suggests an alternative approach that exploits such schemas as a starting point for modelling and measuring suspense. We propose a domain-independent model for tracking suspense in a story which can be used to predict the audience’s suspense response on a sentence-by-sentence basis at the content determination stage of narrative generation. The model lends itself as the theoretical foundation for a suspense module that is compatible with alternative narrative generation theories. The proposal is evaluated with human judges, whose normalised average scores correlate strongly with predicted values.

pdf bib
Data-Driven News Generation for Automated Journalism
Leo Leppänen | Myriam Munezero | Mark Granroth-Wilding | Hannu Toivonen

Despite increasing amounts of data and ever improving natural language generation techniques, work on automated journalism is still relatively scarce. In this paper, we explore the field and challenges associated with building a journalistic natural language generation system. We present a set of requirements that should guide system design, including transparency, accuracy, modifiability and transferability. Guided by the requirements, we present a data-driven architecture for automated journalism that is largely domain and language independent. We illustrate its practical application in the production of news articles about the 2017 Finnish municipal elections in three languages, demonstrating the successfulness of the data-driven, modular approach of the design. We then draw some lessons for future automated journalism.

pdf bib
Data Augmentation for Visual Question Answering
Kushal Kafle | Mohammed Yousefhussien | Christopher Kanan

Data augmentation is widely used to train deep neural networks for image classification tasks. Simply flipping images can help learning tremendously by increasing the number of training images by a factor of two. However, little work has been done studying data augmentation in natural language processing. Here, we describe two methods for data augmentation for Visual Question Answering (VQA). The first uses existing semantic annotations to generate new questions. The second method is a generative approach using recurrent neural networks. Experiments show that the proposed data augmentation improves performance of both baseline and state-of-the-art VQA algorithms.

pdf bib
Personalized Questions, Answers and Grammars: Aiding the Search for Relevant Web Information
Marta Gatius

This work proposes an organization of knowledge to facilitate the generation of personalized questions, answers and grammars from web documents. To reduce the human effort needed in the generation of the linguistic resources for a new domain, the general aspects that can be reused across domains are separated from the more specific ones. The proposed approach is based on the representation of the main domain concepts as a set of attributes. These attributes are related to a syntactico-semantic taxonomy representing the general relationships between conceptual and linguistic knowledge. User models are incorporated by distinguishing different user groups and relating each group to the appropriate conceptual attributes. Then, the data is extracted from the web documents and represented as instances of the domain concepts. Questions, answers and grammars are generated from these instances.

pdf bib
A Comparison of Neural Models for Word Ordering
Eva Hasler | Felix Stahlberg | Marcus Tomalin | Adrià de Gispert | Bill Byrne

We compare several language models for the word-ordering task and propose a new bag-to-sequence neural model based on attention-based sequence-to-sequence models. We evaluate the model on a large German WMT data set where it significantly outperforms existing models. We also describe a novel search strategy for LM-based word ordering and report results on the English Penn Treebank. Our best model setup outperforms prior work both in terms of speed and quality.

pdf bib
Investigating the content and form of referring expressions in Mandarin: introducing the Mtuna corpus
Kees van Deemter | Le Sun | Rint Sybesma | Xiao Li | Bo Chen | Muyun Yang

East Asian languages are thought to handle reference differently from languages such as English, particularly in terms of the marking of definiteness and number. We present the first Data-Text corpus for Referring Expressions in Mandarin, and we use this corpus to test some initial hypotheses inspired by the theoretical linguistics literature. Our findings suggest that function words deserve more attention in Referring Expressions Generation than they have so far received, and they have a bearing on the debate about whether different languages make different trade-offs between clarity and brevity.

pdf bib
Realization of long sentences using chunking
Ewa Muszyńska | Ann Copestake

We propose sentence chunking as a way to reduce the time and memory costs of realization of long sentences. During chunking we divide the semantic representation of a sentence into smaller components which can be processed and recombined without loss of information. Our meaning representation of choice is the Dependency Minimal Recursion Semantics (DMRS). We show that realizing chunks of a sentence and combining the results of such realizations increases the coverage for long sentences, significantly reduces the resources required and does not affect the quality of the realization.

pdf bib
SaToS: Assessing and Summarising Terms of Services from German Webshops
Daniel Braun | Elena Scepankova | Patrick Holl | Florian Matthes

Every time we buy something online, we are confronted with Terms of Services. However, few people actually read these terms before accepting them, often to their disadvantage. In this paper, we present the SaToS browser plugin which summarises and simplifies Terms of Services from German webshops.

pdf bib
Textually Summarising Incomplete Data
Stephanie Inglis | Ehud Reiter | Somayajulu Sripada

Many data-to-text NLG systems work with data sets which are incomplete, i.e. some of the data is missing. We have worked with data journalists to understand how they describe incomplete data, and are building NLG algorithms based on these insights. A pilot evaluation showed mixed results, and highlighted several areas where we need to improve our system.

pdf bib
Improving the generation of personalised descriptions
Thiago Castro Ferreira | Ivandré Paraboni

Referring expression generation (REG) models that use speaker-dependent information require a considerable amount of training data produced by every individual speaker, or may otherwise perform poorly. In this work we propose a simple personalised method for this task, in which speakers are grouped into profiles according to their referential behaviour. Intrinsic evaluation shows that the use of speaker profiles generally outperforms the personalised method found in previous work.

pdf bib
Analysing Data-To-Text Generation Benchmarks
Laura Perez-Beltrachini | Claire Gardent

A generation system can only be as good as the data it is trained on. In this short paper, we propose a methodology for analysing data-to-text corpora used for training Natural Language Generation (NLG) systems. We apply this methodology to three existing benchmarks. We conclude by eliciting a set of criteria for the creation of a data-to-text benchmark which could help better support the development, evaluation and comparison of linguistically sophisticated data-to-text generators.

pdf bib
Linguistic Description of Complex Phenomena with the rLDCP R Package
Jose Alonso | Patricia Conde-Clemente | Gracian Trivino

Monitoring and analysis of complex phenomena attract the attention of both academia and industry. Dealing with data produced by complex phenomena requires the use of advanced computational intelligence techniques. Namely, linguistic description of complex phenomena constitutes a mature research line. It is supported by the Computational Theory of Perceptions grounded on the Fuzzy Sets Theory. Its aim is the development of computational systems with the ability to generate vague descriptions of the world in a way similar to how humans do. This is a human-centric and multi-disciplinary research work. Moreover, its success is a matter of careful design; thus, developers play a key role. The rLDCP R package was designed to facilitate the development of new applications. This demo introduces the use of rLDCP, for both beginners and advanced developers, in practical use cases.

pdf bib
A demo of FORGe: the Pompeu Fabra Open Rule-based Generator
Simon Mille | Leo Wanner

This demo paper presents the multilingual deep sentence generator developed by the TALN group at Universitat Pompeu Fabra, implemented as a series of rule-based graph-transducers for the syntacticization of the input graphs, the resolution of morphological agreements, and the linearization of the trees.

pdf bib
Referential Success of Set Referring Expressions with Fuzzy Properties
Nicolás Marín | Gustavo Rivas-Gervilla | Daniel Sánchez

We introduce the properties to be satisfied by measures of referential success of set referring expressions with fuzzy properties. We define families of measures on the basis of n-cardinality measures and we illustrate some of them with a toy example.

pdf bib
Neural Response Generation for Customer Service based on Personality Traits
Jonathan Herzig | Michal Shmueli-Scheuer | Tommy Sandbank | David Konopnicki

We present a neural response generation model that generates responses conditioned on a target personality. The model learns high level features based on the target personality, and uses them to update its hidden state. Our model achieves performance improvements in both perplexity and BLEU scores over a baseline sequence-to-sequence model, and is validated by human judges.

pdf bib
Neural Paraphrase Generation using Transfer Learning
Florin Brad | Traian Rebedea

Progress in statistical paraphrase generation has been hindered for a long time by the lack of large monolingual parallel corpora. In this paper, we adapt the neural machine translation approach to paraphrase generation and perform transfer learning from the closely related task of entailment generation. We evaluate the model on the Microsoft Research Paraphrase (MSRP) corpus and show that the model is able to generate sentences that capture part of the original meaning, but fails to pick up on important words or to show large lexical variation.

pdf (full)
bib (full)
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms

pdf bib
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms
M. Taboada | I. da Cunha | E.G. Maziero | P. Cardoso | J.D. Antonio | M. Iruskieta

pdf bib
Deliberation as Genre: Mapping Argumentation through Relational Discourse Structure
Oier Imaz | Mikel Iruskieta

pdf bib
The Good, the Bad, and the Disagreement: Complex ground truth in rhetorical structure analysis
Debopam Das | Manfred Stede | Maite Taboada

pdf bib
A Distributional View of Discourse Encapsulation: Multifactorial Prediction of Coreference Density in RST
Amir Zeldes

pdf bib
Rhetorical relations markers in Russian RST Treebank
Svetlana Toldova | Dina Pisarevskaya | Margarita Ananyeva | Maria Kobozeva | Alexander Nasedkin | Sofia Nikiforova | Irina Pavlova | Alexey Shelepov

pdf bib
Applying the Rhetorical Structure Theory in Alzheimer patients’ speech
Anayeli Paulino | Gerardo Sierra

pdf bib
Using lexical level information in discourse structures for Basque sentiment analysis
Jon Alkorta | Koldo Gojenola | Mikel Iruskieta | Maite Taboada

pdf bib
Framework for the Analysis of Simplified Texts Taking Discourse into Account: the Basque Causal Relations as Case Study
Itziar Gonzalez-Dios | Arantza Diaz de Ilarraza | Mikel Iruskieta

pdf bib
Using Rhetorical Structure Theory for Detection of Fake Online Reviews
Olu Popoola

pdf bib
“Haters gonna hate”: challenges for sentiment analysis of Facebook comments in Brazilian Portuguese
Juliano D. Antonio | Ana Carolina L. Santin

pdf bib
Discourse Segmentation for Building a RST Chinese Treebank
Shuyuan Cao | Nianwen Xue | Iria da Cunha | Mikel Iruskieta | Chuan Wang

pdf (full)
bib (full)
Proceedings of the First Workshop on Subword and Character Level Models in NLP

pdf bib
Proceedings of the First Workshop on Subword and Character Level Models in NLP
Manaal Faruqui | Hinrich Schuetze | Isabel Trancoso | Yadollah Yaghoobzadeh

pdf bib
Character and Subword-Based Word Representation for Neural Language Modeling Prediction
Matthieu Labeau | Alexandre Allauzen

Most neural language models use different kinds of embeddings for word prediction. While word embeddings can be associated with each word in the vocabulary or derived from characters as well as factored morphological decomposition, these word representations are mainly used to parametrize the input, i.e. the context of prediction. This work investigates the effect of using subword units (character and factored morphological decomposition) to build output representations for neural language modeling. We present a case study on Czech, a morphologically-rich language, experimenting with different input and output representations. When working with the full training vocabulary, despite unstable training, our experiments show that augmenting the output word representations with character-based embeddings can significantly improve the performance of the model. Moreover, reducing the size of the output look-up table, to let the character-based embeddings represent rare words, brings further improvement.

pdf bib
Learning variable length units for SMT between related languages via Byte Pair Encoding
Anoop Kunchukuttan | Pushpak Bhattacharyya

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare them with orthographic syllables, which are currently the best performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, with up to an 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing system independent and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.
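
The idea of treating the most frequent character sequences as basic units can be illustrated with a minimal Python sketch of the generic BPE merge-learning loop. This is not the authors' implementation: the function names `learn_bpe`, `merge_pair` and `segment`, the toy corpus, the absence of an end-of-word marker, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of word tokens."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption
        # made here only so that the sketch is deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into BPE units by replaying the learnt merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical data, not from the paper).
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, 4)
print(segment("lowest", merges))  # ['low', 'est']
```

The number of merges is the only hyperparameter in this sketch; no knowledge of vowels or syllable structure is needed, which is what makes the unit inventory independent of the writing system.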

pdf bib
Character Based Pattern Mining for Neology Detection
Gaël Lejeune | Emmanuel Cartier

Detecting neologisms is essential in real-time natural language processing applications. Not only does it make it possible to follow the lexical evolution of languages, but it is also essential for updating linguistic resources and parsers. In this paper, neology detection is considered as a classification task where a system has to assess whether a given lexical item is an actual neologism or not. We propose a combination of an unsupervised data mining technique and a supervised machine learning approach. It is inspired by current research in stylometry and on token-level and character-level patterns. We train and evaluate our system on a manually designed reference dataset in French and Russian. We show that this approach is able to largely outperform state-of-the-art neology detection systems. Furthermore, character-level patterns exhibit good properties for multilingual extensions of the system.

pdf bib
Automated Word Stress Detection in Russian
Maria Ponomareva | Kirill Milintsevich | Ekaterina Chernyak | Anatoly Starostin

In this study we address the problem of automated word stress detection in Russian using character-level models and no part-of-speech taggers. We use a simple bidirectional RNN with LSTM nodes and achieve accuracy of 90% or higher. We experiment with two training datasets and show that using the data from an annotated corpus is much more efficient than using only a dictionary, since it allows the context of the word and its morphological features to be retained.

pdf bib
A Syllable-based Technique for Word Embeddings of Korean Words
Sanghyuk Choi | Taeuk Kim | Jinseok Seol | Sang-goo Lee

Word embedding has become a fundamental component of many NLP tasks such as named entity recognition and machine translation. However, popular models that learn such embeddings are unaware of the morphology of words, so they are not directly applicable to highly agglutinative languages such as Korean. We propose a syllable-based learning model for Korean using a convolutional neural network, in which word representation is composed of trained syllable vectors. Our model successfully produces morphologically meaningful representation of Korean words compared to the original Skip-gram embeddings. The results also show that it is quite robust to the Out-of-Vocabulary problem.

pdf bib
Supersense Tagging with a Combination of Character, Subword, and Word-level Representations
Youhyun Shin | Sang-goo Lee

Recently, there has been increased interest in utilizing characters or subwords for natural language processing (NLP) tasks. However, the effect of utilizing character, subword, and word-level information simultaneously has not been examined so far. In this paper, we propose a model to leverage various levels of input features to improve on the performance of a supersense tagging task. Detailed analysis of experimental results shows that different levels of input representation offer distinct characteristics that explain performance discrepancy among different tasks.

pdf bib
Weakly supervised learning of allomorphy
Miikka Silfverberg | Mans Hulden

Most NLP resources that offer annotations at the word segment level provide morphological annotation that includes features indicating tense, aspect, modality, gender, case, and other inflectional information. Such information is rarely aligned to the relevant parts of the words—i.e. the allomorphs, as such annotation would be very costly. These unaligned weak labelings are commonly provided by annotated NLP corpora such as treebanks in various languages. Although they lack alignment information, the presence/absence of labels at the word level is also consistent with the amount of supervision assumed to be provided to L1 and L2 learners. In this paper, we explore several methods to learn this latent alignment between parts of word forms and the grammatical information provided. All the methods under investigation favor hypotheses regarding allomorphs of morphemes that re-use a small inventory, i.e. implicitly minimize the number of allomorphs that a morpheme can be realized as. We show that the provided information offers a significant advantage for both word segmentation and the learning of allomorphy.

pdf bib
Character-based recurrent neural networks for morphological relational reasoning
Olof Mogren | Richard Johansson

We present a model for predicting word forms based on morphological relational reasoning with analogies. While previous work has explored tasks such as morphological inflection and reinflection, these models rely on an explicit enumeration of morphological features, which may not be available in all cases. To address the task of predicting a word form given a demo relation (a pair of word forms) and a query word, we devise a character-based recurrent neural network architecture using three separate encoders and a decoder. We also investigate a multiclass learning setup, where the prediction of the relation type label is used as an auxiliary task. Our results show that the exact form can be predicted for English with an accuracy of 94.7%. For Swedish, which has a more complex morphology with more inflectional patterns for nouns and verbs, the accuracy is 89.3%. We also show that using the auxiliary task of learning the relation type speeds up convergence and improves the prediction accuracy for the word generation task.

pdf bib
Glyph-aware Embedding of Chinese Characters
Falcon Dai | Zheng Cai

Given the advantage and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character’s glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Being inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks of language modeling and word segmentation, the model learns to represent each character’s task-relevant semantic and syntactic information in the character-level embedding.

pdf bib
Exploring Cross-Lingual Transfer of Morphological Knowledge In Sequence-to-Sequence Models
Huiming Jin | Katharina Kann

Multi-task training is an effective method to mitigate the data sparsity problem. It has recently been applied for cross-lingual transfer learning for paradigm completion (the task of producing inflected forms of lemmata) with sequence-to-sequence networks. However, it is still unclear how the model transfers knowledge across languages, as well as if and which information is shared. To investigate this, we propose a set of data-dependent experiments using an existing encoder-decoder recurrent neural network for the task. Our results show that indeed the performance gains surpass a pure regularization effect and that knowledge about language and morphology can be transferred.

pdf bib
Unlabeled Data for Morphological Generation With Character-Based Sequence-to-Sequence Models
Katharina Kann | Hinrich Schütze

We present a semi-supervised way of training a character-based encoder-decoder recurrent neural network for morphological reinflection—the task of generating one inflected wordform from another. This is achieved by using unlabeled tokens or random strings as training data for an autoencoding task, adapting a network for morphological reinflection, and performing multi-task training. We thus use limited labeled data more effectively, obtaining up to 9.92% improvement over state-of-the-art baselines for 8 different languages.

pdf bib
Vowel and Consonant Classification through Spectral Decomposition
Patricia Thaine | Gerald Penn

We consider two related problems in this paper. Given an undeciphered alphabetic writing system or mono-alphabetic cipher, determine: (1) which of its letters are vowels and which are consonants; and (2) whether the writing system is a vocalic alphabet or an abjad. We are able to show that a very simple spectral decomposition based on character co-occurrences provides nearly perfect performance with respect to answering both question types.

pdf bib
Syllable-level Neural Language Model for Agglutinative Language
Seunghak Yu | Nilesh Kulkarni | Haejun Lee | Jihie Kim

We introduce a novel method to diminish the problem of out-of-vocabulary words by introducing an embedding method which leverages the agglutinative property of language. We propose an additional embedding derived from syllables and morphemes for the words to improve the performance of the language model. We apply the above method to input prediction tasks and achieve state-of-the-art performance in terms of Key Stroke Saving (KSS) w.r.t. existing device input prediction methods.

pdf bib
Character-based Bidirectional LSTM-CRF with words and characters for Japanese Named Entity Recognition
Shotaro Misawa | Motoki Taniguchi | Yasuhide Miura | Tomoko Ohkuma

Recently, neural models have shown superior performance over conventional models in NER tasks. These models use a CNN to extract sub-word information along with an RNN to predict a tag for each word. However, these models have been tested almost entirely on English texts. It remains unclear whether they perform similarly in other languages. We worked on Japanese NER using neural models and discovered two shortcomings of the state-of-the-art model. First, a CNN is unsuitable for extracting Japanese sub-word information. Second, a model predicting a tag for each word cannot extract an entity when a part of a word composes an entity. The contributions of this work are (1) verifying the effectiveness of the state-of-the-art NER model for Japanese, and (2) proposing a neural model for predicting a tag for each character using word and character information. Experimentally obtained results demonstrate that our model outperforms the state-of-the-art neural English NER model in Japanese.

pdf bib
Word Representation Models for Morphologically Rich Languages in Neural Machine Translation
Ekaterina Vylomova | Trevor Cohn | Xuanli He | Gholamreza Haffari

Out-of-vocabulary words present a great challenge for Machine Translation. Recently various character-level compositional models were proposed to address this issue. In this work we incorporate the two most popular neural architectures, namely LSTM and CNN, into hard- and soft-attentional models of translation for character-level representation of the source. We propose semantic and morphological intrinsic evaluation of encoder-level representations. Our analysis of the learned representations reveals that the character-based LSTM seems to be better at capturing morphological aspects compared to the character-based CNN. We also show that the hard-attentional model provides better character-level representations compared to the vanilla one.

pdf bib
Spell-Checking based on Syllabification and Character-level Graphs for a Peruvian Agglutinative Language
Carlo Alva | Arturo Oncevay

There are several native languages in Peru which are mostly agglutinative. These languages are transmitted from generation to generation mainly in oral form, causing different forms of writing across different communities. For this reason, there are recent efforts to standardize the spelling in the written texts, and it would be beneficial to support these tasks with an automatic tool such as a spell-checker. To this end, a spelling corrector is being developed based on two steps: an automatic rule-based syllabification method and a character-level graph to detect the degree of error in a misspelled word. The experiments were carried out on Shipibo-konibo, a highly agglutinative Amazonian language, and the results obtained have been promising on a dataset built for the purpose.

pdf bib
What do we need to know about an unknown word when parsing German
Bich-Ngoc Do | Ines Rehbein | Anette Frank

We propose a new type of subword embedding designed to provide more information about unknown compounds, a major source of OOV words in German. We present an extrinsic evaluation where we use the compound embeddings as input to a neural dependency parser and compare the results to the ones obtained with other types of embeddings. Our evaluation shows that adding compound embeddings yields a significant improvement of 2% LAS over using word embeddings when no POS information is available. When adding POS embeddings to the input, however, the effect levels out. This suggests that it is not the missing information about the semantics of the unknown words that causes problems for parsing German, but the lack of morphological information for unknown words. To augment our evaluation, we also test the new embeddings in a language modelling task that requires both syntactic and semantic information.

pdf bib
A General-Purpose Tagger with Convolutional Neural Networks
Xiang Yu | Agnieszka Falenska | Ngoc Thang Vu

We present a general-purpose tagger based on convolutional neural networks (CNN), used for both composing word vectors and encoding context information. The CNN tagger is robust across different tagging tasks: without task-specific tuning of hyper-parameters, it achieves state-of-the-art results in part-of-speech tagging, morphological tagging and supertagging. The CNN tagger is also robust against the out-of-vocabulary problem; it performs well on artificially unnormalized texts.

pdf bib
Reconstruction of Word Embeddings from Sub-Word Parameters
Karl Stratos

Pre-trained word embeddings improve the performance of a neural model at the cost of increasing the model size. We propose to benefit from this resource without paying the cost by operating strictly at the sub-lexical level. Our approach is quite simple: before task-specific training, we first optimize sub-word parameters to reconstruct pre-trained word embeddings using various distance measures. We report interesting results on a variety of tasks: word similarity, word analogy, and part-of-speech tagging.

pdf bib
Inflection Generation for Spanish Verbs using Supervised Learning
Cristina Barros | Dimitra Gkatzia | Elena Lloret

We present a novel supervised approach to inflection generation for verbs in Spanish. Our system takes as input the verb’s lemma form and the desired features such as person, number, tense, and is able to predict the appropriate grammatical conjugation. Even though our approach learns from fewer examples compared to previous work, it is able to deal with all the Spanish moods (indicative, subjunctive and imperative) in contrast to previous work which only focuses on indicative and subjunctive moods. We show that in an intrinsic evaluation, our system achieves 99% accuracy, outperforming (although not significantly) two competitive state-of-the-art systems. The successful results obtained clearly indicate that our approach could be integrated into wider approaches related to text generation in Spanish.

pdf bib
Neural Paraphrase Identification of Questions with Noisy Pretraining
Gaurav Singh Tomar | Thyago Duque | Oscar Täckström | Jakob Uszkoreit | Dipanjan Das

We present a solution to the problem of paraphrase identification of questions. We focus on a recent dataset of question pairs annotated with binary paraphrase labels and show that a variant of the decomposable attention model (replacing the word embeddings of the decomposable attention model of Parikh et al. 2016 with character n-gram representations) results in accurate performance on this task, while being far simpler than many competing neural architectures. Furthermore, when the model is pretrained on a noisy dataset of automatically collected question paraphrases, it obtains the best reported performance on the dataset.
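
As a rough illustration of what replacing word embeddings with character n-gram representations involves, the following Python sketch extracts boundary-padded character n-grams and maps them to hashed bucket indices. It is a sketch under stated assumptions, not the model from the paper: the helper names `char_ngrams`, `ngram_bucket` and `shared_ngrams`, the `#` boundary symbol, the n-gram lengths and the CRC32 hashing scheme are all hypothetical choices.

```python
import zlib


def char_ngrams(word, n_min=3, n_max=4, boundary="#"):
    """Return all character n-grams of the boundary-padded word, shortest first."""
    padded = f"{boundary}{word}{boundary}"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def ngram_bucket(ngram, num_buckets=1000):
    """Map an n-gram to a row index of a (hypothetical) embedding table."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


def shared_ngrams(a, b, **kwargs):
    """N-grams that two word forms have in common."""
    return set(char_ngrams(a, **kwargs)) & set(char_ngrams(b, **kwargs))


print(char_ngrams("what", 3, 3))
# ['#wh', 'wha', 'hat', 'at#']
print(sorted(shared_ngrams("question", "questoin", n_min=3, n_max=3)))
# ['#qu', 'est', 'que', 'ues']
```

A word vector could then be formed, for example, by summing the embedding rows selected by `ngram_bucket`; because a misspelling such as "questoin" still shares several n-grams with "question", the two forms would receive related representations even if one of them never occurred in training.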

pdf bib
Sub-character Neural Language Modelling in Japanese
Viet Nguyen | Julian Brooke | Timothy Baldwin

In East Asian languages such as Japanese and Chinese, the semantics of a character are (somewhat) reflected in its sub-character elements. This paper examines the effect of using sub-characters for language modeling in Japanese. This is achieved by decomposing characters according to a range of character decomposition datasets, and training a neural language model over variously decomposed character representations. Our results indicate that language modelling can be improved through the inclusion of sub-characters, though this result depends on a good choice of decomposition dataset and the appropriate granularity of decomposition.

pdf bib
Byte-based Neural Machine Translation
Marta R. Costa-jussà | Carlos Escolano | José A. R. Fonollosa

This paper presents experiments comparing character-based and byte-based neural machine translation systems. The main motivation of the byte-based neural machine translation system is to build multi-lingual neural machine translation systems that can share the same vocabulary. We compare the performance of both systems in several language pairs and we see that the performance in test is similar for most language pairs while the training time is slightly reduced in the case of byte-based neural machine translation.
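
The shared-vocabulary motivation can be made concrete with a short Python sketch contrasting the two tokenizations. It assumes UTF-8 as the byte encoding, which the abstract does not specify, and the helper names and example strings are illustrative only.

```python
def char_tokens(text):
    """Character-level tokens: one symbol per Unicode code point."""
    return list(text)


def byte_tokens(text):
    """Byte-level tokens: one integer in 0..255 per UTF-8 byte (encoding assumed)."""
    return list(text.encode("utf-8"))


def detokenize_bytes(ids):
    """Invert byte_tokens; raises UnicodeDecodeError on an invalid byte sequence."""
    return bytes(ids).decode("utf-8")


for sentence in ["señor", "мир", "世界"]:
    chars = char_tokens(sentence)
    ids = byte_tokens(sentence)
    print(sentence, len(chars), "characters ->", len(ids), "bytes")
    assert detokenize_bytes(ids) == sentence
```

Under this assumption the byte vocabulary has at most 256 entries regardless of script, so Latin, Cyrillic and Chinese text all draw on the same symbol inventory, at the price of longer sequences for non-ASCII text.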

pdf bib
Improving Opinion-Target Extraction with Character-Level Word Embeddings
Soufian Jebbara | Philipp Cimiano

Fine-grained sentiment analysis has received increasing attention in recent years. Extracting opinion target expressions (OTE) in reviews is often an important step in fine-grained, aspect-based sentiment analysis. Retrieving this information from user-generated text, however, can be difficult. Customer reviews, for instance, are prone to contain misspelled words and are difficult to process due to their domain-specific language. In this work, we investigate whether character-level models can improve the performance for the identification of opinion target expressions. We integrate information about the character structure of a word into a sequence labeling system using character-level word embeddings and show their positive impact on the system’s performance. Specifically, we obtain an increase of 3.3 F1 points with respect to our baseline model. In further experiments, we reveal encoded character patterns of the learned embeddings and give a nuanced view of the performance differences of both models.

pdf (full)
bib (full)
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

pdf bib
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism
Octavian Popescu | Carlo Strapparava

pdf bib
Predicting News Values from Headline Text and Emotions
Maria Pia di Buono | Jan Šnajder | Bojana Dalbelo Bašić | Goran Glavaš | Martin Tutek | Natasa Milic-Frayling

We present a preliminary study on predicting news values from headline text and emotions. We perform a multivariate analysis on a dataset manually annotated with news values and emotions, discovering interesting correlations among them. We then train two competitive machine learning models – an SVM and a CNN – to predict news values from headline text and emotions as features. We find that, while both models yield a satisfactory performance, some news values are more difficult to detect than others, while some profit more from including emotion information.

pdf bib
Predicting User Views in Online News
Daniel Hardt | Owen Rambow

We analyze user viewing behavior on an online news site. We collect data from 64,000 news articles, and use text features to predict frequency of user views. We compare predictiveness of the headline and “teaser” (viewed before clicking) and the body (viewed after clicking). Both are predictive of clicking behavior, with the full article text being most predictive.

pdf bib
Tracking Bias in News Sources Using Social Media: the Russia-Ukraine Maidan Crisis of 2013–2014
Peter Potash | Alexey Romanov | Mikhail Gronas | Anna Rumshisky

This paper addresses the task of identifying the bias in news articles published during a political or social conflict. We create a silver-standard corpus based on the actions of users in social media. Specifically, we reconceptualize bias in terms of how likely a given article is to be shared or liked by each of the opposing sides. We apply our methodology to a dataset of links collected in relation to the Russia-Ukraine Maidan crisis from 2013-2014. We show that on the task of predicting which side is likely to prefer a given article, a Naive Bayes classifier can record 90.3% accuracy looking only at domain names of the news sources. The best accuracy of 93.5% is achieved by a feed forward neural netwo