Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis (Editors)

Anthology ID:
Reykjavik, Iceland
European Language Resources Association (ELRA)
Bib Export formats:

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Nicoletta Calzolari | Khalid Choukri | Thierry Declerck | Hrafn Loftsson | Bente Maegaard | Joseph Mariani | Asuncion Moreno | Jan Odijk | Stelios Piperidis

pdf bib
CLiPS Stylometry Investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text
Ben Verhoeven | Walter Daelemans

We present the CLiPS Stylometry Investigation (CSI) corpus, a new Dutch corpus containing reviews and essays written by university students. It is designed to serve multiple purposes: detection of age, gender, authorship, personality, sentiment, deception, topic and genre. Another major advantage is its planned yearly expansion with each year’s new students. The corpus currently contains about 305,000 tokens spread over 749 documents. The average review length is 128 tokens; the average essay length is 1126 tokens. The corpus will be made available on the CLiPS website ( and can freely be used for academic research purposes. An initial deception detection experiment was performed on this data. Deception detection is the task of automatically classifying a text as being either truthful or deceptive, in our case by examining the writing style of the author. This task has never been investigated for Dutch before. We performed a supervised machine learning experiment using the SVM algorithm in a 10-fold cross-validation setup. The only features were the token unigrams present in the training data. Using this simple method, we reached a state-of-the-art F-score of 72.2%.

pdf bib
On Paraphrase Identification Corpora
Vasile Rus | Rajendra Banjade | Mihai Lintean

We analyze in this paper a number of data sets proposed over the last decade or so for the task of paraphrase identification. The goal of the analysis is to identify the advantages as well as shortcomings of the previously proposed data sets. Based on the analysis, we then make recommendations about how to improve the process of creating and using such data sets for evaluating in the future approaches to the task of paraphrase identification or the more general task of semantic similarity. The recommendations are meant to improve our understanding of what a paraphrase is, offer a more fair ground for comparing approaches, increase the diversity of actual linguistic phenomena that future data sets will cover, and offer ways to improve our understanding of the contributions of various modules or approaches proposed for solving the task of paraphrase identification or similar tasks.

pdf bib
A Corpus of Comparisons in Product Reviews
Wiltrud Kessler | Jonas Kuhn

Sentiment analysis (or opinion mining) deals with the task of determining the polarity of an opinionated document or sentence. Users often express sentiment about one product by comparing it to a different product. In this work, we present a corpus of comparison sentences from English camera reviews. For our purposes we define a comparison to be any statement about the similarity or difference of two entities. For each sentence we have annotated detailed information about the comparisons it contains: The comparative predicate that expresses the comparison, the type of the comparison, the two entities that are being compared, and the aspect they are compared in. The results of our agreement study show that the decision whether a sentence contains a comparison is difficult to make even for trained human annotators. Once that decision is made, we can achieve consistent results for the very detailed annotations. In total, we have annotated 2108 comparisons in 1707 sentences from camera reviews which makes our corpus the largest resource currently available. The corpus and the annotation guidelines are publicly available on our website.

pdf bib
Generating and using probabilistic morphological resources for the biomedical domain
Vincent Claveau | Ewa Kijak

In most Indo-European languages, many biomedical terms are rich morphological structures composed of several constituents mainly originating from Greek or Latin. The interpretation of these compounds are keystones to access information. In this paper, we present morphological resources aiming at coping with these biomedical morphological compounds. Following previous work (Claveau et al. 2011,Claveau et al. 12), these resources are automatically built using Japanese terms in Kanjis as a pivot language and alignment techniques. We show how these alignment information can be used for segmenting compounds, attaching semantic interpretation to each part, proposing definitions (gloses) of the compounds... When possible, these tasks are compared with state-of-the-art tools, and the results show the interest of our automatically built probabilistic resources.

pdf bib
A System for Experiments with Dependency Parsers
Kiril Simov | Iliana Simova | Ginka Ivanova | Maria Mateva | Petya Osenova

In this paper we present a system for experimenting with combinations of dependency parsers. The system supports initial training of different parsing models, creation of parsebank(s) with these models, and different strategies for the construction of ensemble models aimed at improving the output of the individual models by voting. The system employs two algorithms for construction of dependency trees from several parses of the same sentence and several ways for ranking of the arcs in the resulting trees. We have performed experiments with state-of-the-art dependency parsers including MaltParser, MSTParser, TurboParser, and MATEParser, on the data from the Bulgarian treebank -- BulTreeBank. Our best result from these experiments is slightly better then the best result reported in the literature for this language.

pdf bib
Sockpuppet Detection in Wikipedia: A Corpus of Real-World Deceptive Writing for Linking Identities
Thamar Solorio | Ragib Hasan | Mainul Mizan

This paper describes a corpus of sockpuppet cases from Wikipedia. A sockpuppet is an online user account created with a fake identity for the purpose of covering abusive behavior and/or subverting the editing regulation process. We used a semi-automated method for crawling and curating a dataset of real sockpuppet investigation cases. To the best of our knowledge, this is the first corpus available on real-world deceptive writing. We describe the process for crawling the data and some preliminary results that can be used as baseline for benchmarking research. The dataset has been released under a Creative Commons license from our project website (

pdf bib
Textual Emigration Analysis (TEA)
Andre Blessing | Jonas Kuhn

We present a web-based application which is called TEA (Textual Emigration Analysis) as a showcase that applies textual analysis for the humanities. The TEA tool is used to transform raw text input into a graphical display of emigration source and target countries (under a global or an individual perspective). It provides emigration-related frequency information, and gives access to individual textual sources, which can be downloaded by the user. Our application is built on top of the CLARIN infrastructure which targets researchers of the humanities. In our scenario, we focus on historians, literary scientists, and other social scientists that are interested in the semantic interpretation of text. Our application processes a large set of documents to extract information about people who emigrated. The current implementation integrates two data sets: A data set from the Global Migrant Origin Database, which does not need additional processing, and a data set which was extracted from the German Wikipedia edition. The TEA tool can be accessed by using the following URL:

pdf bib
Phoneme Set Design Using English Speech Database by Japanese for Dialogue-Based English CALL Systems
Xiaoyun Wang | Jinsong Zhang | Masafumi Nishida | Seiichi Yamamoto

This paper describes a method of generating a reduced phoneme set for dialogue-based computer assisted language learning (CALL)systems. We designed a reduced phoneme set consisting of classified phonemes more aligned with the learners’ speech characteristics than the canonical set of a target language. This reduced phoneme set provides an inherently more appropriate model for dealing with mispronunciation by second language speakers. In this study, we used a phonetic decision tree (PDT)-based top-down sequential splitting method to generate the reduced phoneme set and then applied this method to a translation-game type English CALL system for Japanese to determine its effectiveness. Experimental results showed that the proposed method improves the performance of recognizing non-native speech.

pdf bib
Toward a unifying model for Opinion, Sentiment and Emotion information extraction
Amel Fraisse | Patrick Paroubek

This paper presents a logical formalization of a set 20 semantic categories related to opinion, emotion and sentiment. Our formalization is based on the BDI model (Belief, Desire and Intetion) and constitues a first step toward a unifying model for subjective information extraction. The separability of the subjective classes that we propose was assessed both formally and on two subjective reference corpora.

pdf bib
Towards automatic quality assessment of component metadata
Thorsten Trippel | Daan Broeder | Matej Durco | Oddrun Ohren

Measuring the quality of metadata is only possible by assessing the quality of the underlying schema and the metadata instance. We propose some factors that are measurable automatically for metadata according to the CMD framework, taking into account the variability of schemas that can be defined in this framework. The factors include among others the number of elements, the (re-)use of reusable components, the number of filled in elements. The resulting score can serve as an indicator of the overall quality of the CMD instance, used for feedback to metadata providers or to provide an overview of the overall quality of metadata within a reposi-tory. The score is independent of specific schemas and generalizable. An overall assessment of harvested metadata is provided in form of statistical summaries and the distribution, based on a corpus of harvested metadata. The score is implemented in XQuery and can be used in tools, editors and repositories.

pdf bib
PropBank: Semantics of New Predicate Types
Claire Bonial | Julia Bonn | Kathryn Conger | Jena D. Hwang | Martha Palmer

This research focuses on expanding PropBank, a corpus annotated with predicate argument structures, with new predicate types; namely, noun, adjective and complex predicates, such as Light Verb Constructions. This effort is in part inspired by a sister project to PropBank, the Abstract Meaning Representation project, which also attempts to capture “who is doing what to whom” in a sentence, but does so in a way that abstracts away from syntactic structures. For example, alternate realizations of a ‘destroying’ event in the form of either the verb ‘destroy’ or the noun ‘destruction’ would receive the same Abstract Meaning Representation. In order for PropBank to reach the same level of coverage and continue to serve as the bedrock for Abstract Meaning Representation, predicate types other than verbs, which have previously gone without annotation, must be annotated. This research describes the challenges therein, including the development of new annotation practices that walk the line between abstracting away from language-particular syntactic facts to explore deeper semantics, and maintaining the connection between semantics and syntactic structures that has proven to be very valuable for PropBank as a corpus of training data for Natural Language Processing applications.

pdf bib
RESTful Annotation and Efficient Collaboration
Jonathan Wright

As linguistic collection and annotation scale up and collaboration across sites increases, novel technologies are necessary to support projects. Recent events at LDC, namely the move to a web-based infrastructure, the formation of the Software Group, and our involvement in the NSF LAPPS Grid project, have converged on concerns of efficient collaboration. The underlying design of the Web, typically referred to as RESTful principles, is crucial for collaborative annotation, providing data and processing services, and participating in the Linked Data movement. This paper outlines recommendations that will facilitate such collaboration.

pdf bib
ALICO: a multimodal corpus for the study of active listening
Hendrik Buschmeier | Zofia Malisz | Joanna Skubisz | Marcin Wlodarczak | Ipke Wachsmuth | Stefan Kopp | Petra Wagner

The Active Listening Corpus (ALICO) is a multimodal database of spontaneous dyadic conversations with diverse speech and gestural annotations of both dialogue partners. The annotations consist of short feedback expression transcription with corresponding communicative function interpretation as well as segmentation of interpausal units, words, rhythmic prominence intervals and vowel-to-vowel intervals. Additionally, ALICO contains head gesture annotation of both interlocutors. The corpus contributes to research on spontaneous human--human interaction, on functional relations between modalities, and timing variability in dialogue. It also provides data that differentiates between distracted and attentive listeners. We describe the main characteristics of the corpus and present the most important results obtained from analyses in recent years.

pdf bib
PoliTa: A multitagger for Polish
Łukasz Kobyliński

Part-of-Speech (POS) tagging is a crucial task in Natural Language Processing (NLP). POS tags may be assigned to tokens in text manually, by trained linguists, or using algorithmic approaches. Particularly, in the case of annotated text corpora, the quantity of textual data makes it unfeasible to rely on manual tagging and automated methods are used extensively. The quality of such methods is of critical importance, as even 1% tagger error rate results in introducing millions of errors in a corpus consisting of a billion tokens. In case of Polish several POS taggers have been proposed to date, but even the best of the taggers achieves an accuracy of ca. 93%, as measured on the one million subcorpus of the National Corpus of Polish (NCP). As the task of tagging is an example of classification, in this article we introduce a new POS tagger for Polish, which is based on the idea of combining several classifiers to produce higher quality tagging results than using any of the taggers individually.

pdf bib
A Corpus of Participant Roles in Contentious Discussions
Siddharth Jain | Archna Bhatia | Angelique Rein | Eduard Hovy

The expansion of social roles is, nowadays, a fact due to the ability of users to interact, discuss, exchange ideas and opinions, and form social networks though social media. Users in online social environment play a variety of social roles. The concept of “social role” has long been used in social science describe the intersection of behavioural, meaningful, and structural attributes that emerge regularly in particular settings. In this paper, we present a new corpus for social roles in online contentious discussions. We explore various behavioural attributes such as stubbornness, sensibility, influence, and ignorance to create a model of social roles to distinguish among various social roles participants assume in such setup. We annotate discussions drawn from two different sets of corpora in order to ensure that our model of social roles and their signals hold up in general. We discuss the various criteria for deciding values for each behavioural attributes which define the roles.

pdf bib
Bilingual Dictionary Construction with Transliteration Filtering
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi

In this paper we present a bilingual transliteration lexicon of 170K Japanese-English technical terms in the scientific domain. Translation pairs are extracted by filtering a large list of transliteration candidates generated automatically from a phrase table trained on parallel corpora. Filtering uses a novel transliteration similarity measure based on a discriminative phrase-based machine translation approach. We demonstrate that the extracted dictionary is accurate and of high recall (F1 score 0.8). Our lexicon contains not only single words but also multi-word expressions, and is freely available. Our experiments focus on Katakana-English lexicon construction, however it would be possible to apply the proposed methods to transliteration extraction for a variety of language pairs.

pdf bib
Revising the annotation of a Broadcast News corpus: a linguistic approach
Vera Cabarrão | Helena Moniz | Fernando Batista | Ricardo Ribeiro | Nuno Mamede | Hugo Meinedo | Isabel Trancoso | Ana Isabel Mata | David Martins de Matos

This paper presents a linguistic revision process of a speech corpus of Portuguese broadcast news focusing on metadata annotation for rich transcription, and reports on the impact of the new data on the performance for several modules. The main focus of the revision process consisted on annotating and revising structural metadata events, such as disfluencies and punctuation marks. The resultant revised data is now being extensively used, and was of extreme importance for improving the performance of several modules, especially the punctuation and capitalization modules, but also the speech recognition system, and all the subsequent modules. The resultant data has also been recently used in disfluency studies across domains.

pdf bib
Gold-standard for Topic-specific Sentiment Analysis of Economic Texts
Pyry Takala | Pekka Malo | Ankur Sinha | Oskar Ahlgren

Public opinion, as measured by media sentiment, can be an important indicator in the financial and economic context. These are domains where traditional sentiment estimation techniques often struggle, and existing annotated sentiment text collections are of less use. Though considerable progress has been made in analyzing sentiments at sentence-level, performing topic-dependent sentiment analysis is still a relatively uncharted territory. The computation of topic-specific sentiments has commonly relied on naive aggregation methods without much consideration to the relevance of the sentences to the given topic. Clearly, the use of such methods leads to a substantial increase in noise-to-signal ratio. To foster development of methods for measuring topic-specific sentiments in documents, we have collected and annotated a corpus of financial news that have been sampled from Thomson Reuters newswire. In this paper, we describe the annotation process and evaluate the quality of the dataset using a number of inter-annotator agreement metrics. The annotations of 297 documents and over 9000 sentences can be used for research purposes when developing methods for detecting topic-wise sentiment in financial text.

pdf bib
Visualization of Language Relations and Families: MultiTree
Damir Cavar | Malgorzata Cavar

MultiTree is an NFS-funded project collecting scholarly hypotheses about language relationships, and visualizing them on a web site in the form of trees or graphs. Two open online interfaces allow scholars, students, and the general public an easy access to search for language information or comparisons of competing hypotheses. One objective of the project was to facilitate research in historical linguistics. MultiTree has evolved to a much more powerful tool, it is not just a simple repository of scholarly information. In this paper we present the MultiTree interfaces and the impact of the project beyond the field of historical linguistics, including, among others, the use of standardized ISO language codes, and creating an interconnected database of language and dialect names, codes, publications, and authors. Further, we offer the dissemination of linguistic findings world-wide to both scholars and the general public, thus boosting the collaboration and accelerating the scientific exchange. We discuss also the ways MultiTree will develop beyond the time of the duration of the funding.

pdf bib
HiEve: A Corpus for Extracting Event Hierarchies from News Stories
Goran Glavaš | Jan Šnajder | Marie-Francine Moens | Parisa Kordjamshidi

In news stories, event mentions denote real-world events of different spatial and temporal granularity. Narratives in news stories typically describe some real-world event of coarse spatial and temporal granularity along with its subevents. In this work, we present HiEve, a corpus for recognizing relations of spatiotemporal containment between events. In HiEve, the narratives are represented as hierarchies of events based on relations of spatiotemporal containment (i.e., superevent―subevent relations). We describe the process of manual annotation of HiEve. Furthermore, we build a supervised classifier for recognizing spatiotemporal containment between events to serve as a baseline for future research. Preliminary experimental results are encouraging, with classifier performance reaching 58% F1-score, only 11% less than the inter annotator agreement.

pdf bib
Image Annotation with ISO-Space: Distinguishing Content from Structure
James Pustejovsky | Zachary Yocum

Natural language descriptions of visual media present interesting problems for linguistic annotation of spatial information. This paper explores the use of ISO-Space, an annotation specification to capturing spatial information, for encoding spatial relations mentioned in descriptions of images. Especially, we focus on the distinction between references to representational content and structural components of images, and the utility of such a distinction within a compositional semantics. We also discuss how such a structure-content distinction within the linguistic annotation can be leveraged to compute further inferences about spatial configurations depicted by images with verbal captions. We construct a composition table to relate content-based relations to structure-based relations in the image, as expressed in the captions. While still preliminary, our initial results suggest that a weak composition table is both sound and informative for deriving new spatial relations.

pdf bib
The ETAPE speech processing evaluation
Olivier Galibert | Jeremy Leixa | Gilles Adda | Khalid Choukri | Guillaume Gravier

The ETAPE evaluation is the third evaluation in automatic speech recognition and associated technologies in a series which started with ESTER. This evaluation proposed some new challenges, by proposing TV and radio shows with prepared and spontaneous speech, annotation and evaluation of overlapping speech, a cross-show condition in speaker diarization, and new, complex but very informative named entities in the information extraction task. This paper presents the whole campaign, including the data annotated, the metrics used and the anonymized system results. All the data created in the evaluation, hopefully including system outputs, will be distributed through the ELRA catalogue in the future.

pdf bib
PanLex: Building a Resource for Panlingual Lexical Translation
David Kamholz | Jonathan Pool | Susan Colowick

PanLex, a project of The Long Now Foundation, aims to enable the translation of lexemes among all human languages in the world. By focusing on lexemic translations, rather than grammatical or corpus data, it achieves broader lexical and language coverage than related projects. The PanLex database currently documents 20 million lexemes in about 9,000 language varieties, with 1.1 billion pairwise translations. The project primarily engages in content procurement, while encouraging outside use of its data for research and development. Its data acquisition strategy emphasizes broad, high-quality lexical and language coverage. The project plans to add data derived from 4,000 new sources to the database by the end of 2016. The dataset is publicly accessible via an HTTP API and monthly snapshots in CSV, JSON, and XML formats. Several online applications have been developed that query PanLex data. More broadly, the project aims to make a contribution to the preservation of global linguistic diversity.

pdf bib
Large SMT data-sets extracted from Wikipedia
Dan Tufiş

The article presents experiments on mining Wikipedia for extracting SMT useful sentence pairs in three language pairs. Each extracted sentence pair is associated with a cross-lingual lexical similarity score based on which, several evaluations have been conducted to estimate the similarity thresholds which allow the extraction of the most useful data for training three-language pairs SMT systems. The experiments showed that for a similarity score higher than 0.7 all sentence pairs in the three language pairs were fully parallel. However, including in the training sets less parallel sentence pairs (that is with a lower similarity score) showed significant improvements in the translation quality (BLEU-based evaluations). The optimized SMT systems were evaluated on unseen test-sets also extracted from Wikipedia. As one of the main goals of our work was to help Wikipedia contributors to translate (with as little post editing as possible) new articles from major languages into less resourced languages and vice-versa, we call this type of translation experiments “in-genre” translation. As in the case of “in-domain” translation, our evaluations showed that using only “in-genre” training data for translating same genre new texts is better than mixing the training data with “out-of-genre” (even) parallel texts.

pdf bib
NomLex-PT: A Lexicon of Portuguese Nominalizations
Valeria de Paiva | Livy Real | Alexandre Rademaker | Gerard de Melo

This paper presents NomLex-PT, a lexical resource describing Portuguese nominalizations. NomLex-PT connects verbs to their nominalizations, thereby enabling NLP systems to observe the potential semantic relationships between the two words when analysing a text. NomLex-PT is freely available and encoded in RDF for easy integration with other resources. Most notably, we have integrated NomLex-PT with OpenWordNet-PT, an open Portuguese Wordnet.

pdf bib
VERTa: Facing a Multilingual Experience of a Linguistically-based MT Evaluation
Elisabet Comelles | Jordi Atserias | Victoria Arranz | Irene Castellón | Jordi Sesé

There are several MT metrics used to evaluate translation into Spanish, although most of them use partial or little linguistic information. In this paper we present the multilingual capability of VERTa, an automatic MT metric that combines linguistic information at lexical, morphological, syntactic and semantic level. In the experiments conducted we aim at identifying those linguistic features that prove the most effective to evaluate adequacy in Spanish segments. This linguistic information is tested both as independent modules (to observe what each type of feature provides) and in a combinatory fastion (where different kinds of information interact with each other). This allows us to extract the optimal combination. In addition we compare these linguistic features to those used in previous versions of VERTa aimed at evaluating adequacy for English segments. Finally, experiments show that VERTa can be easily adapted to other languages than English and that its collaborative approach correlates better with human judgements on adequacy than other well-known metrics.

pdf bib
Corpus and Method for Identifying Citations in Non-Academic Text
Yifan He | Adam Meyers

We attempt to identify citations in non-academic text such as patents. Unlike academic articles which often provide bibliographies and follow consistent citation styles, non-academic text cites scientific research in a more ad-hoc manner. We manually annotate citations in 50 patents, train a CRF classifier to find new citations, and apply a reranker to incorporate non-local information. Our best system achieves 0.83 F-score on 5-fold cross validation.

pdf bib
Building a Dataset for Summarization and Keyword Extraction from Emails
Vanessa Loza | Shibamouli Lahiri | Rada Mihalcea | Po-Hsiang Lai

This paper introduces a new email dataset, consisting of both single and thread emails, manually annotated with summaries and keywords. A total of 349 emails and threads have been annotated. The dataset is our first step toward developing automatic methods for summarization and keyword extraction from emails. We describe the email corpus, along with the annotation interface, annotator guidelines, and agreement studies.

pdf bib
Improving Open Relation Extraction via Sentence Re-Structuring
Jordan Schmidek | Denilson Barbosa

Information Extraction is an important task in Natural Language Processing, consisting of finding a structured representation for the information expressed in natural language text. Two key steps in information extraction are identifying the entities mentioned in the text, and the relations among those entities. In the context of Information Extraction for the World Wide Web, unsupervised relation extraction methods, also called Open Relation Extraction (ORE) systems, have become prevalent, due to their effectiveness without domain-specific training data. In general, these systems exploit part-of-speech tags or semantic information from the sentences to determine whether or not a relation exists, and if so, its predicate. This paper discusses some of the issues that arise when even moderately complex sentences are fed into ORE systems. A process for re-structuring such sentences is discussed and evaluated. The proposed approach replaces complex sentences by several others that, together, convey the same meaning and are more amenable to extraction by current ORE systems. The results of an experimental evaluation show that this approach succeeds in reducing the processing time and increasing the accuracy of the state-of-the-art ORE systems.

pdf bib
How to Use less Features and Reach Better Performance in Author Gender Identification
Juan Soler Company | Leo Wanner

Over the last years, author profiling in general and author gender identification in particular have become a popular research area due to their potential attractive applications that range from forensic investigations to online marketing studies. However, nearly all state-of-the-art works in the area still very much depend on the datasets they were trained and tested on, since they heavily draw on content features, mostly a large number of recurrent words or combinations of words extracted from the training sets. We show that using a small number of features that mainly depend on the structure of the texts we can outperform other approaches that depend mainly on the content of the texts and that use a huge number of features in the process of identifying if the author of a text is a man or a woman. Our system has been tested against a dataset constructed for our work as well as against two datasets that were previously used in other papers.

pdf bib
Semi-supervised methods for expanding psycholinguistics norms by integrating distributional similarity with the structure of WordNet
Michael Mohler | Marc Tomlinson | David Bracewell | Bryan Rink

In this work, we present two complementary methods for the expansion of psycholinguistics norms. The first method is a random-traversal spreading activation approach which transfers existing norms onto semantically related terms using notions of synonymy, hypernymy, and pertainymy to approach full coverage of the English language. The second method makes use of recent advances in distributional similarity representation to transfer existing norms to their closest neighbors in a high-dimensional vector space. These two methods (along with a naive hybrid approach combining the two) have been shown to significantly outperform a state-of-the-art resource expansion system at our pilot task of imageability expansion. We have evaluated these systems in a cross-validation experiment using 8,188 norms found in existing pscholinguistics literature. We have also validated the quality of these combined norms by performing a small study using Amazon Mechanical Turk (AMT).

pdf bib - Including Multimedia Language Resources to disseminate Knowledge and Create Educational Material on less-Resourced Languages
Dagmar Jung | Katarzyna Klessa | Zsuzsa Duray | Beatrix Oszkó | Mária Sipos | Sándor Szeverényi | Zsuzsa Várnai | Paul Trilsbeek | Tamás Váradi

The present paper describes the development of the interactive website as an example of including multimedia language resources to disseminate knowledge and create educational material on less-resourced languages. The website is a product of INNET (Innovative networking in infrastructure for endangered languages), European FP7 project. Its main functions can be summarized as related to the three following areas: (1) raising students’ awareness of language endangerment and arouse their interest in linguistic diversity, language maintenance and language documentation; (2) informing both students and teachers about these topics and show ways how they can enlarge their knowledge further with a special emphasis on information about language archives; (3) helping teachers include these topics into their classes. The website has been localized into five language versions with the intention to be accessible to both scientific and non-scientific communities such as (primarily) secondary school teachers and students, beginning university students of linguistics, journalists, the interested public, and also members of speech communities who speak minority languages.

pdf bib
A Cross-language Corpus for Studying the Phonetics and Phonology of Prominence
Bistra Andreeva | William Barry | Jacques Koreman

The present article describes a corpus which was collected for the cross-language comparison of prominence. In the data analysis, the acoustic-phonetic properties of words spoken with two different levels of accentuation (de-accented and nuclear accented in non-contrastive narrow-focus) are examined in question-answer elicited sentences and iterative imitations (on the syllable ‘da’) produced by Bulgarian, Russian, French, German and Norwegian speakers (3 male and 3 female per language). Normalized parameter values allow a comparison of the properties employed in differentiating the two levels of accentuation. Across the five languages there are systematic differences in the degree to which duration, f0, intensity and spectral vowel definition change with changing prominence under different focus conditions. The link with phonological differences between the languages is discussed.

pdf bib
DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German
Benoît Sagot

We introduce DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the developement of DeLex involved some manual work, we show that is represents a good tradeoff between development cost, lexical coverage and resource accuracy.

pdf bib
Using Resource-Rich Languages to Improve Morphological Analysis of Under-Resourced Languages
Peter Baumann | Janet Pierrehumbert

The world-wide proliferation of digital communications has created the need for language and speech processing systems for under-resourced languages. Developing such systems is challenging if only small data sets are available, and the problem is exacerbated for languages with highly productive morphology. However, many under-resourced languages are spoken in multi-lingual environments together with at least one resource-rich language and thus have numerous borrowings from resource-rich languages. Based on this insight, we argue that readily available resources from resource-rich languages can be used to bootstrap the morphological analyses of under-resourced languages with complex and productive morphological systems. In a case study of two such languages, Tagalog and Zulu, we show that an easily obtainable English wordlist can be deployed to seed a morphological analysis algorithm from a small training set of conversational transcripts. Our method achieves a precision of 100% and identifies 28 and 66 of the most productive affixes in Tagalog and Zulu, respectively.

pdf bib
Accommodations in Tuscany as Linked Data
Clara Bacciu | Angelica Lo Duca | Andrea Marchetti | Maurizio Tesconi

The OpeNER Linked Dataset (OLD) contains 19.140 entries about accommodations in Tuscany (Italy). For each accommodation, it describes the type, e.g. hotel, bed and breakfast, hostel, camping etc., and other useful information, such as a short description, the Web address, its location and the features it provides. OLD is the linked data version of the open dataset provided by Fondazione Sistema Toscana, the representative system for tourism in Tuscany. In addition, to the original dataset, OLD provides also the link of each accommodation to the most common social media (Facebook, Foursquare, Google Places and Booking). OLD exploits three common ontologies of the accommodation domain: Acco, Hontology and GoodRelations. The idea is to provide a flexible dataset, which speaks more than one ontology. OLD is available as a SPARQL node and is released under the Creative Commons release. Finally, OLD is developed within the OpeNER European project, which aims at building a set of ready to use tools to recognize and disambiguate entity mentions and perform sentiment analysis and opinion detection on texts. Within the project, OLD provides a named entity repository for entity disambiguation.

pdf bib
The DWAN framework: Application of a web annotation framework for the general humanities to the domain of language resources
Przemyslaw Lenkiewicz | Olha Shkaravska | Twan Goosen | Daan Broeder | Menzo Windhouwer | Stephanie Roth | Olof Olsson

Researchers share large amounts of digital resources, which offer new chances for cooperation. Collaborative annotation systems are meant to support this. Often these systems are targeted at a specific task or domain, e.g., annotation of a corpus. The DWAN framework for web annotation is generic and can support a wide range of tasks and domains. A key feature of the framework is its support for caching representations of the annotated resource. This allows showing the context of the annotation even if the resource has changed or has been removed. The paper describes the design and implementation of the framework. Use cases provided by researchers are well in line with the key characteristics of the DWAN annotation framework.

pdf bib
Croatian Memories
Arjan van Hessen | Franciska de Jong | Stef Scagliola | Tanja Petrovic

In this contribution we describe a collection of approximately 400 video interviews recorded in the context of the project Croatian Memories (CroMe) with the objective of documenting personal war-related experiences. The value of this type of sources is threefold: they contain information that is missing in written sources, they can contribute to the process of reconciliation, and they provide a basis for reuse of data in disciplines with an interest in narrative data. The CroMe collection is not primarily designed as a linguistic corpus, but is the result of an archival effort to collect so-called oral history data. For researchers in the fields of natural language processing and speech analy¬sis this type of life-stories may function as an object trouvé containing real-life language data that can prove to be useful for the purpose of modelling specific aspects of human expression and communication.

pdf bib
Combining elicited imitation and fluency features for oral proficiency measurement
Deryle Lonsdale | Carl Christensen

The automatic grading of oral language tests has been the subject of much research in recent years. Several obstacles lie in the way of achieving this goal. Recent work suggests a testing technique called elicited imitation (EI) that can serve to accurately approximate global oral proficiency. This testing methodology, however, does not incorporate some fundamental aspects of language, such as fluency. Other work has suggested another testing technique, simulated speech (SS), as a supplement or an alternative to EI that can provide automated fluency metrics. In this work, we investigate a combination of fluency features extracted from SS tests and EI test scores as a means to more accurately predict oral language proficiency. Using machine learning and statistical modeling, we identify which features automatically extracted from SS tests best predicted hand-scored SS test results, and demonstrate the benefit of adding EI scores to these models. Results indicate that the combination of EI and fluency features do indeed more effectively predict hand-scored SS test scores. We finally discuss implications of this work for future automated oral testing scenarios.

pdf bib
Semantic Search in Documents Enriched by LOD-based Annotations
Pavel Smrz | Jan Kouril

This paper deals with information retrieval on semantically enriched web-scale document collections. It particularly focuses on web-crawled content in which mentions of entities appearing in Freebase, DBpedia and other Linked Open Data resources have been identified. A special attention is paid to indexing structures and advanced query mechanisms that have been employed into a new semantic retrieval system. Scalability features are discussed together with performance statistics and results of experimental evaluation of presented approaches. Examples given to demonstrate key features of the developed solution correspond to the cultural heritage domain in which the results of our work have been primarily applied.

pdf bib
A Meta-data Driven Platform for Semi-automatic Configuration of Ontology Mediators
Manuel Fiorelli | Maria Teresa Pazienza | Armando Stellato

Ontology mediators often demand extensive configuration, or even the adaptation of the input ontologies for remedying unsupported modeling patterns. In this paper we propose MAPLE (MAPping Architecture based on Linguistic Evidences), an architecture and software platform that semi-automatically solves this configuration problem, by reasoning on metadata about the linguistic expressivity of the input ontologies, the available mediators and other components relevant to the mediation task. In our methodology mediators should access the input ontologies through uniform interfaces abstracting many low-level details, while depending on generic third-party linguistic resources providing external information. Given a pair of ontologies to reconcile, MAPLE ranks the available mediators according to their ability to exploit most of the input ontologies content, while coping with the exhibited degree of linguistic heterogeneity. MAPLE provides the chosen mediator with concrete linguistic resources and suitable implementations of the required interfaces. The resulting mediators are more robust, as they are isolated from many low-level issues, and their applicability and performance may increase over time as new and better resources and other components are made available. To sustain this trend, we foresee the use of the Web as a large scale repository.

pdf bib
Semantic approaches to software component retrieval with English queries
Huijing Deng | Grzegorz Chrupała

Enabling code reuse is an important goal in software engineering, and it depends crucially on effective code search interfaces. We propose to ground word meanings in source code and use such language-code mappings in order to enable a search engine for programming library code where users can pose queries in English. We exploit the fact that there are large programming language libraries which are documented both via formally specified function or method signatures as well as descriptions written in natural language. Automatically learned associations between words in descriptions and items in signatures allows us to use queries formulated in English to retrieve methods which are not documented via natural language descriptions, only based on their signatures. We show that the rankings returned by our model substantially outperforms a strong term-matching baseline.

pdf bib
The Ellogon Pattern Engine: Context-free Grammars over Annotations
Georgios Petasis

This paper presents the pattern engine that is offered by the Ellogon language engineering platform. This pattern engine allows the application of context-free grammars over annotations, which are metadata generated during the processing of documents by natural language tools. In addition, grammar development is aided by a graphical grammar editor, giving grammar authors the capability to test and debug grammars.

pdf bib
Missed opportunities in translation memory matching
Friedel Wolff | Laurette Pretorius | Paul Buitelaar

A translation memory system stores a data set of source-target pairs of translations. It attempts to respond to a query in the source language with a useful target text from the data set to assist a human translator. Such systems estimate the usefulness of a target text suggestion according to the similarity of its associated source text to the source text query. This study analyses two data sets in two language pairs each to find highly similar target texts, which would be useful mutual suggestions. We further investigate which of these useful suggestions can not be selected through source text similarity, and we do a thorough analysis of these cases to categorise and quantify them. This analysis provides insight into areas where the recall of translation memory systems can be improved. Specifically, source texts with an omission, and semantically very similar source texts are some of the more frequent cases with useful target text suggestions that are not selected with the baseline approach of simple edit distance between the source texts.

pdf bib
Universal Stanford dependencies: A cross-linguistic typology
Marie-Catherine de Marneffe | Timothy Dozat | Natalia Silveira | Katri Haverinen | Filip Ginter | Joakim Nivre | Christopher D. Manning

Revisiting the now de facto standard Stanford dependency representation, we propose an improved taxonomy to capture grammatical relations across languages, including morphologically rich ones. We suggest a two-layered taxonomy: a set of broadly attested universal grammatical relations, to which language-specific relations can be added. We emphasize the lexicalist stance of the Stanford Dependencies, which leads to a particular, partially new treatment of compounding, prepositions, and morphology. We show how existing dependency schemes for several languages map onto the universal taxonomy proposed here and close with consideration of practical implications of dependency representation choices for NLP applications, in particular parsing.

pdf bib
Getting Reliable Annotations for Sarcasm in Online Dialogues
Reid Swanson | Stephanie Lukin | Luke Eisenberg | Thomas Corcoran | Marilyn Walker

The language used in online forums differs in many ways from that of traditional language resources such as news. One difference is the use and frequency of nonliteral, subjective dialogue acts such as sarcasm. Whether the aim is to develop a theory of sarcasm in dialogue, or engineer automatic methods for reliably detecting sarcasm, a major challenge is simply the difficulty of getting enough reliably labelled examples. In this paper we describe our work on methods for achieving highly reliable sarcasm annotations from untrained annotators on Mechanical Turk. We explore the use of a number of common statistical reliability measures, such as Kappa, Karger’s, Majority Class, and EM. We show that more sophisticated measures do not appear to yield better results for our data than simple measures such as assuming that the correct label is the one that a majority of Turkers apply.

pdf bib
Collaboratively Annotating Multilingual Parallel Corpora in the Biomedical Domain—some MANTRAs
Johannes Hellrich | Simon Clematide | Udo Hahn | Dietrich Rebholz-Schuhmann

The coverage of multilingual biomedical resources is high for the English language, yet sparse for non-English languages―an observation which holds for seemingly well-resourced, yet still dramatically low-resourced ones such as Spanish, French or German but even more so for really under-resourced ones such as Dutch. We here present experimental results for automatically annotating parallel corpora and simultaneously acquiring new biomedical terminology for these under-resourced non-English languages on the basis of two types of language resources, namely parallel corpora (i.e. full translation equivalents at the document unit level) and (admittedly deficient) multilingual biomedical terminologies, with English as their anchor language. We automatically annotate these parallel corpora with biomedical named entities by an ensemble of named entity taggers and harmonize non-identical annotations the outcome of which is a so-called silver standard corpus. We conclude with an empirical assessment of this approach to automatically identify both known and new terms in multilingual corpora.

pdf bib
On the Importance of Text Analysis for Stock Price Prediction
Heeyoung Lee | Mihai Surdeanu | Bill MacCartney | Dan Jurafsky

We investigate the importance of text analysis for stock price prediction. In particular, we introduce a system that forecasts companies’ stock price changes (UP, DOWN, STAY) in response to financial events reported in 8-K documents. Our results indicate that using text boosts prediction accuracy over 10% (relative) over a strong baseline that incorporates many financially-rooted features. This impact is most important in the short term (i.e., the next day after the financial event) but persists for up to five days.

pdf bib
On the use of a fuzzy classifier to speed up the Sp_ToBI labeling of the Glissando Spanish corpus
David Escudero | Lourdes Aguilar-Cuevas | César González-Ferreras | Yurena Gutiérrez-González | Valentín Cardeñoso-Payo

In this paper, we present the application of a novel automatic prosodic labeling methodology for speeding up the manual labeling of the Glissando corpus (Spanish read news items). The methodology is based on the use of soft classification techniques. The output of the automatic system consists on a set of label candidates per word. The number of predicted candidates depends on the degree of certainty assigned by the classifier to each of the predictions. The manual transcriber checks the sets of predictions to select the correct one. We describe the fundamentals of the fuzzy classification tool and its training with a corpus labeled with Sp TOBI labels. Results show a clear coherence between the most confused labels in the output of the automatic classifier and the most confused labels detected in inter-transcriber consistency tests. More importantly, in a preliminary test, the real time ratio of the labeling process was 1:66 when the template of predictions is used and 1:80 when it is not.

pdf bib
Definition patterns for predicative terms in specialized lexical resources
Antonio San Martín | Marie-Claude L’Homme

The research presented in this paper is part of a larger project on the semi-automatic generation of definitions of semantically-related terms in specialized resources. The work reported here involves the formulation of instructions to generate the definitions of sets of morphologically-related predicative terms, based on the definition of one of the members of the set. In many cases, it is assumed that the definition of a predicative term can be inferred by combining the definition of a related lexical unit with the information provided by the semantic relation (i.e. lexical function) that links them. In other words, terminographers only need to know the definition of “pollute” and the semantic relation that links it to other morphologically-related terms (“polluter”, “polluting”, “pollutant”, etc.) in order to create the definitions of the set. The results show that rules can be used to generate a preliminary set of definitions (based on specific lexical functions). They also show that more complex rules would need to be devised for other morphological pairs.

pdf bib
Native Language Identification Using Large, Longitudinal Data
Xiao Jiang | Yufan Guo | Jeroen Geertzen | Dora Alexopoulou | Lin Sun | Anna Korhonen

Native Language Identification (NLI) is a task aimed at determining the native language (L1) of learners of second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to the recently released EFCamDat corpus which is not only multiple times larger than previous L2 corpora but also provides longitudinal data at several proficiency levels. Our investigation using accurate machine learning with a wide range of linguistic features reveals interesting patterns in the longitudinal data which are useful for both further development of NLI and its application to research on L2 acquisition.

pdf bib
Production of Phrase Tables in 11 European Languages using an Improved Sub-sentential Aligner
Juan Luo | Yves Lepage

This paper is a partial report of an on-going Kakenhi project which aims to improve sub-sentential alignment and release multilingual syntactic patterns for statistical and example-based machine translation. Here we focus on improving a sub-sentential aligner which is an instance of the association approach. Phrase table is not only an essential component in the machine translation systems but also an important resource for research and usage in other domains. As part of this project, all phrase tables produced in the experiments will also be made freely available.

pdf bib
Construction and Annotation of a French Folkstale Corpus
Anne Garcia-Fernandez | Anne-Laure Ligozat | Anne Vilnat

In this paper, we present the digitization and annotation of a tales corpus - which is to our knowledge the only French tales corpus available and classified according to the Aarne&Thompson classification - composed of historical texts (with old French parts). We first studied whether the pre-processing tools, namely OCR and PoS-tagging, have good enough accuracies to allow automatic analysis. We also manually annotated this corpus according to several types of information which could prove useful for future work: character references, episodes, and motifs. The contributions are the creation of an corpus of French tales from classical anthropology material, which will be made available to the community; the evaluation of OCR and NLP tools on this corpus; and the annotation with anthropological information.

pdf bib
The Making of Ancient Greek WordNet
Yuri Bizzoni | Federico Boschetti | Harry Diakoff | Riccardo Del Gratta | Monica Monachini | Gregory Crane

This paper describes the process of creation and review of a new lexico-semantic resource for the classical studies: AncientGreekWordNet. The candidate sets of synonyms (synsets) are extracted from Greek-English dictionaries, on the assumption that Greek words translated by the same English word or phrase have a high probability of being synonyms or at least semantically closely related. The process of validation and the web interface developed to edit and query the resource are described in detail. The lexical coverage of Ancient Greek WordNet is illustrated and the accuracy is evaluated. Finally, scenarios for exploiting the resource are discussed.

pdf bib
Enriching ODIN
Fei Xia | William Lewis | Michael Wayne Goodman | Joshua Crowgey | Emily M. Bender

In this paper, we describe the expansion of the ODIN resource, a database containing many thousands of instances of Interlinear Glossed Text (IGT) for over a thousand languages harvested from scholarly linguistic papers posted to the Web. A database containing a large number of instances of IGT, which are effectively richly annotated and heuristically aligned bitexts, provides a unique resource for bootstrapping NLP tools for resource-poor languages. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we propose a new XML format for IGT, called Xigt. We call the updated release ODIN-II.

pdf bib
Turkish Treebank as a Gold Standard for Morphological Disambiguation and Its Influence on Parsing
Özlem Çetinoğlu

So far predicted scenarios for Turkish dependency parsing have used a morphological disambiguator that is trained on the data distributed with the tool(Sak et al., 2008). Although models trained on this data have high accuracy scores on the test and development data of the same set, the accuracy drastically drops when the model is used in the preprocessing of Turkish Treebank parsing experiments. We propose to use the Turkish Treebank(Oflazer et al., 2003) as a morphological resource to overcome this problem and convert the treebank to the morphological disambiguator’s format. The experimental results show that we achieve improvements in disambiguating the Turkish Treebank and the results also carry over to parsing. With the help of better morphological analysis, we present the best labelled dependency parsing scores to date on Turkish.

pdf bib
CroDeriV: a new resource for processing Croatian morphology
Krešimir Šojat | Matea Srebačić | Marko Tadić | Tin Pavelić

The paper deals with the processing of Croatian morphology and presents CroDeriV ― a newly developed language resource that contains data about morphological structure and derivational relatedness of verbs in Croatian. In its present shape, CroDeriV contains 14 192 Croatian verbs. Verbs in CroDeriV are analyzed for morphemes and segmented into lexical, derivational and inflectional morphemes. The structure of CroDeriV enables the detection of verbal derivational families in Croatian as well as the distribution and frequency of particular affixes and lexical morphemes. Derivational families consist of a verbal base form and all prefixed or suffixed derivatives detected in available machine readable Croatian dictionaries and corpora. Language data structured in this way was further used for the expansion of other language resources for Croatian, such as Croatian WordNet and the Croatian Morphological Lexicon. Matching the data from CroDeriV on one side and Croatian WordNet and the Croatian Morphological Lexicon on the other resulted in significant enrichment of Croatian WordNet and enlargement of the Croatian Morphological Lexicon.

pdf bib
Building a Database of Japanese Adjective Examples from Special Purpose Web Corpora
Masaya Yamaguchi

It is often difficult to collect many examples for low-frequency words from a single general purpose corpus. In this paper, I present a method of building a database of Japanese adjective examples from special purpose Web corpora (SPW corpora) and investigates the characteristics of examples in the database by comparison with examples that are collected from a general purpose Web corpus (GPW corpus). My proposed method construct a SPW corpus for each adjective considering to collect examples that have the following features: (i) non-bias, (ii) the distribution of examples extracted from every SPW corpus bears much similarity to that of examples extracted from a GPW corpus. The results of experiments shows the following: (i) my proposed method can collect many examples rapidly. The number of examples extracted from SPW corpora is more than 8.0 times (median value) greater than that from the GPW corpus. (ii) the distributions of co-occurrence words for adjectives in the database are similar to those taken from the GPW corpus.

pdf bib
Praaline: Integrating Tools for Speech Corpus Research
George Christodoulides

This paper presents Praaline, an open-source software system for managing, annotating, analysing and visualising speech corpora. Researchers working with speech corpora are often faced with multiple tools and formats, and they need to work with ever-increasing amounts of data in a collaborative way. Praaline integrates and extends existing time-proven tools for spoken corpora analysis (Praat, Sonic Visualiser and a bridge to the R statistical package) in a modular system, facilitating automation and reuse. Users are exposed to an integrated, user-friendly interface from which to access multiple tools. Corpus metadata and annotations may be stored in a database, locally or remotely, and users can define the metadata and annotation structure. Users may run a customisable cascade of analysis steps, based on plug-ins and scripts, and update the database with the results. The corpus database may be queried, to produce aggregated data-sets. Praaline is extensible using Python or C++ plug-ins, while Praat and R scripts may be executed against the corpus data. A series of visualisations, editors and plug-ins are provided. Praaline is free software, released under the GPL license (

pdf bib
Extracting a bilingual semantic grammar from FrameNet-annotated corpora
Dana Dannélls | Normunds Gruzitis

We present the creation of an English-Swedish FrameNet-based grammar in Grammatical Framework. The aim of this research is to make existing framenets computationally accessible for multilingual natural language applications via a common semantic grammar API, and to facilitate the porting of such grammar to other languages. In this paper, we describe the abstract syntax of the semantic grammar while focusing on its automatic extraction possibilities. We have extracted a shared abstract syntax from ~58,500 annotated sentences in Berkeley FrameNet (BFN) and ~3,500 annotated sentences in Swedish FrameNet (SweFN). The abstract syntax defines 769 frame-specific valence patterns that cover 77,8% examples in BFN and 74,9% in SweFN belonging to the shared set of 471 frames. As a side result, we provide a unified method for comparing semantic and syntactic valence patterns across framenets.

pdf bib
Mapping Between English Strings and Reentrant Semantic Graphs
Fabienne Braune | Daniel Bauer | Kevin Knight

We investigate formalisms for capturing the relation between semantic graphs and English strings. Semantic graph corpora have spurred recent interest in graph transduction formalisms, but it is not yet clear whether such formalisms are a good fit for natural language data―in particular, for describing how semantic reentrancies correspond to English pronouns, zero pronouns, reflexives, passives, nominalizations, etc. We introduce a data set that focuses on these problems, we build grammars to capture the graph/string relation in this data, and we evaluate those grammars for conciseness and accuracy.

pdf bib
The KiezDeutsch Korpus (KiDKo) Release 1.0
Ines Rehbein | Sören Schalowski | Heike Wiese

This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multiethnic urban areas in Germany. The first release of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we present our approach to using the tagger ensembles for identifying error patterns in the automatically tagged data.

pdf bib
Etymological Wordnet: Tracing The History of Words
Gerard de Melo

Research on the history of words has led to remarkable insights about language and also about the history of human civilization more generally. This paper presents the Etymological Wordnet, the first database that aims at making word origin information available as a large, machine-readable network of words in many languages. The information in this resource is obtained from Wiktionary. Extracting a network of etymological information from Wiktionary requires significant effort, as much of the etymological information is only given in prose. We rely on custom pattern matching techniques and mine a large network with over 500,000 word origin links as well as over 2 million derivational/compositional links.

pdf bib
The Interplay Between Lexical and Syntactic Resources in Incremental Parsebanking
Victoria Rosén | Petter Haugereid | Martha Thunes | Gyri S. Losnegaard | Helge Dyvik

Automatic syntactic analysis of a corpus requires detailed lexical and morphological information that cannot always be harvested from traditional dictionaries. In building the INESS Norwegian treebank, it is often the case that necessary lexical information is missing in the morphology or lexicon. The approach used to build the treebank is incremental parsebanking; a corpus is parsed with an existing grammar, and the analyses are efficiently disambiguated by annotators. When the intended analysis is unavailable after parsing, the reason is often that necessary information is not available in the lexicon. INESS has therefore implemented a text preprocessing interface where annotators can enter unrecognized words before parsing. This may concern words that are unknown to the morphology and/or lexicon, and also words that are known, but for which important information is missing. When this information is added, either during text preprocessing or during disambiguation, the result is that after reparsing the intended analysis can be chosen and stored in the treebank. The lexical information added to the lexicon in this way may be of great interest both to lexicographers and to other language technology efforts, and the enriched lexical resource being developed will be made available at the end of the project.

pdf bib
Interoperability and Customisation of Annotation Schemata in Argo
Rafal Rak | Jacob Carter | Andrew Rowley | Riza Theresa Batista-Navarro | Sophia Ananiadou

The process of annotating text corpora involves establishing annotation schemata which define the scope and depth of an annotation task at hand. We demonstrate this activity in Argo, a Web-based workbench for the analysis of textual resources, which facilitates both automatic and manual annotation. Annotation tasks in the workbench are defined by building workflows consisting of a selection of available elementary analytics developed in compliance with the Unstructured Information Management Architecture specification. The architecture accommodates complex annotation types that may define primitive as well as referential attributes. Argo aids the development of custom annotation schemata and supports their interoperability by featuring a schema editor and specialised analytics for schemata alignment. The schema editor is a self-contained graphical user interface for defining annotation types. Multiple heterogeneous schemata can be aligned by including one of two type mapping analytics currently offered in Argo. One is based on a simple mapping syntax and, although limited in functionality, covers most common use cases. The other utilises a well established graph query language, SPARQL, and is superior to other state-of-the-art solutions in terms of expressiveness. We argue that the customisation of annotation schemata does not need to compromise their interoperability.

pdf bib
Polish Coreference Corpus in Numbers
Maciej Ogrodniczuk | Mateusz Kopeć | Agata Savary

This paper attempts a preliminary interpretation of the occurrence of different types of linguistic constructs in the manually-annotated Polish Coreference Corpus by providing analyses of various statistical properties related to mentions, clusters and near-identity links. Among others, frequency of mentions, zero subjects and singleton clusters is presented, as well as the average mention and cluster size. We also show that some coreference clustering constraints, such as gender or number agreement, are frequently not valid in case of Polish. The need for lemmatization for automatic coreference resolution is supported by an empirical study. Correlation between cluster and mention count within a text is investigated, with short characteristics of outlier cases. We also examine this correlation in each of the 14 text domains present in the corpus and show that none of them has abnormal frequency of outlier texts regarding the cluster/mention ratio. Finally, we report on our negative experiences concerning the annotation of the near-identity relation. In the conclusion we put forward some guidelines for the future research in the area.

pdf bib
A Gold Standard Dependency Corpus for English
Natalia Silveira | Timothy Dozat | Marie-Catherine de Marneffe | Samuel Bowman | Miriam Connor | John Bauer | Chris Manning

We present a gold standard annotation of syntactic dependencies in the English Web Treebank corpus using the Stanford Dependencies formalism. This resource addresses the lack of a gold standard dependency treebank for English, as well as the limited availability of gold standard syntactic annotations for English informal text genres. We also present experiments on the use of this resource, both for training dependency parsers and for evaluating the quality of different versions of the Stanford Parser, which includes a converter tool to produce dependency annotation from constituency trees. We show that training a dependency parser on a mix of newswire and web data leads to better performance on that type of data without hurting performance on newswire text, and therefore gold standard annotations for non-canonical text can be a valuable resource for parsing. Furthermore, the systematic annotation effort has informed both the SD formalism and its implementation in the Stanford Parser’s dependency converter. In response to the challenges encountered by annotators in the EWT corpus, the formalism has been revised and extended, and the converter has been improved.

pdf bib A High-Coverage Derivational Morphology Resource for Croatian
Jan Šnajder

Knowledge about derivational morphology has been proven useful for a number of natural language processing (NLP) tasks. We describe the construction and evaluation of, a large-coverage morphological resource for Croatian. groups 100k lemmas from web corpus hrWaC into 56k clusters of derivationally related lemmas, so-called derivational families. We focus on suffixal derivation between and within nouns, verbs, and adjectives. We propose two approaches: an unsupervised approach and a knowledge-based approach based on a hand-crafted morphology model but without using any additional lexico-semantic resources The resource acquisition procedure consists of three steps: corpus preprocessing, acquisition of an inflectional lexicon, and the induction of derivational families. We describe an evaluation methodology based on manually constructed derivational families from which we sample and annotate pairs of lemmas. We evaluate on the so-obtained sample, and show that the knowledge-based version attains good clustering quality of 81.2% precision, 76.5% recall, and 78.8% F1 -score. As with similar resources for other languages, we expect to be useful for a number of NLP tasks.

pdf bib
Extracting Information for Context-aware Meeting Preparation
Simon Scerri | Behrang Q. Zadeh | Maciej Dabrowski | Ismael Rivera

People working in an office environment suffer from large volumes of information that they need to manage and access. Frequently, the problem is due to machines not being able to recognise the many implicit relationships between office artefacts, and also due to them not being aware of the context surrounding them. In order to expose these relationships and enrich artefact context, text analytics can be employed over semi-structured and unstructured content, including free text. In this paper, we explain how this strategy is applied and partly evaluated for a specific use-case: supporting the attendees of a calendar event to prepare for the meeting.

pdf bib
A Repository of State of the Art and Competitive Baseline Summaries for Generic News Summarization
Kai Hong | John Conroy | Benoit Favre | Alex Kulesza | Hui Lin | Ani Nenkova

In the period since 2004, many novel sophisticated approaches for generic multi-document summarization have been developed. Intuitive simple approaches have also been shown to perform unexpectedly well for the task. Yet it is practically impossible to compare the existing approaches directly, because systems have been evaluated on different datasets, with different evaluation measures, against different sets of comparison systems. Here we present a corpus of summaries produced by several state-of-the-art extractive summarization systems or by popular baseline systems. The inputs come from the 2004 DUC evaluation, the latest year in which generic summarization was addressed in a shared task. We use the same settings for ROUGE automatic evaluation to compare the systems directly and analyze the statistical significance of the differences in performance. We show that in terms of average scores the state-of-the-art systems appear similar but that in fact they produce very different summaries. Our corpus will facilitate future research on generic summarization and motivates the need for development of more sensitive evaluation measures and for approaches to system combination in summarization.

pdf bib
Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus
Zhiyi Song | Stephanie Strassel | Haejoong Lee | Kevin Walker | Jonathan Wright | Jennifer Garland | Dana Fore | Brian Gainor | Preston Cabe | Thomas Thomas | Brendan Callahan | Ann Sawyer

The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language genres. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally-occurring chat and SMS in English, Chinese and Egyptian Arabic.

pdf bib
Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them
Bruno Laranjeira | Viviane Moreira | Aline Villavicencio | Carlos Ramisch | Maria José Finatto

Comparable corpora have been used as an alternative for parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms using them to collect comparable corpora on a specific domain. Then, we compare the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. Also, we propose a novel approach for focused crawling, exploiting the expressive power of multiword expressions.

pdf bib
Augmenting English Adjective Senses with Supersenses
Yulia Tsvetkov | Nathan Schneider | Dirk Hovy | Archna Bhatia | Manaal Faruqui | Chris Dyer

We develop a supersense taxonomy for adjectives, based on that of GermaNet, and apply it to English adjectives in WordNet using human annotation and supervised classification. Results show that accuracy for automatic adjective type classification is high, but synsets are considerably more difficult to classify, even for trained human annotators. We release the manually annotated data, the classifier, and the induced supersense labeling of 12,304 WordNet adjective synsets.

pdf bib
N-gram Counts and Language Models from the Common Crawl
Christian Buck | Kenneth Heafield | Bas van Ooyen

We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed with emphasis on the problems that arise in working with data at this scale. Our unpruned Kneser-Ney English $5$-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5-1.4 BLEU by using large language models to translate into various languages.

pdf bib
ColLex.en: Automatically Generating and Evaluating a Full-form Lexicon for English
Tim vor der Brück | Alexander Mehler | Zahurul Islam

The paper describes a procedure for the automatic generation of a large full-form lexicon of English. We put emphasis on two statistical methods to lexicon extension and adjustment: in terms of a letter-based HMM and in terms of a detector of spelling variants and misspellings. The resulting resource, \collexen, is evaluated with respect to two tasks: text categorization and lexical coverage by example of the SUSANNE corpus and the \openanc.

pdf bib
Evaluation of Simple Distributional Compositional Operations on Longer Texts
Tamara Polajnar | Laura Rimell | Stephen Clark

Distributional semantic models have been effective at representing linguistic semantics at the word level, and more recently research has moved on to the construction of distributional representations for larger segments of text. However, it is not well understood how the composition operators that work well on short phrase-based models scale up to full-length sentences. In this paper we test several simple compositional methods on a sentence-length similarity task and discover that their performance peaks at fewer than ten operations. We also introduce a novel sentence segmentation method that reduces the number of compositional operations.

pdf bib
Creating Summarization Systems with SUMMA
Horacio Saggion

Automatic text summarization, the reduction of a text to its essential content is fundamental for an on-line information society. Although many summarization algorithms exist, there are few tools or infrastructures providing capabilities for developing summarization applications. This paper presents a new version of SUMMA, a text summarization toolkit for the development of adaptive summarization applications. SUMMA includes algorithms for computation of various sentence relevance features and functionality for single and multidocument summarization in various languages. It also offers methods for content-based evaluation of summaries.

pdf bib
BiographyNet: Methodological Issues when NLP supports historical research
Antske Fokkens | Serge ter Braake | Niels Ockeloen | Piek Vossen | Susan Legêne | Guus Schreiber

When NLP is used to support research in the humanities, new methodological issues come into play. NLP methods may introduce a bias in their analysis that can influence the results of the hypothesis a humanities scholar is testing. This paper addresses this issue in the context of BiographyNet a multi-disciplinary project involving NLP, Linked Data and history. We introduce the project to the NLP community. We argue that it is essential for historians to get insight into the provenance of information, including how information was extracted from text by NLP tools.

pdf bib
Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks
Anthony Rousseau | Paul Deléglise | Yannick Estève

In this paper, we present improvements made to the TED-LIUM corpus we released in 2012. These enhancements fall into two categories. First, we describe how we filtered publicly available monolingual data and used it to estimate well-suited language models (LMs), using open-source tools. Then, we describe the process of selection we applied to new acoustic data from TED talks, providing additions to our previously released corpus. Finally, we report some experiments we made around these improvements.

pdf bib
Enrichment of Bilingual Dictionary through News Stream Data
Ajay Dubey | Parth Gupta | Vasudeva Varma | Paolo Rosso

Bilingual dictionaries are the key component of the cross-lingual similarity estimation methods. Usually such dictionary generation is accomplished by manual or automatic means. Automatic generation approaches include to exploit parallel or comparable data to derive dictionary entries. Such approaches require large amount of bilingual data in order to produce good quality dictionary. Many time the language pair does not have large bilingual comparable corpora and in such cases the best automatic dictionary is upper bounded by the quality and coverage of such corpora. In this work we propose a method which exploits continuous quasi-comparable corpora to derive term level associations for enrichment of such limited dictionary. Though we propose our experiments for English and Hindi, our approach can be easily extendable to other languages. We evaluated dictionary by manually computing the precision. In experiments we show our approach is able to derive interesting term level associations across languages.

pdf bib
sloWCrowd: A crowdsourcing tool for lexicographic tasks
Darja Fišer | Aleš Tavčar | Tomaž Erjavec

The paper presents sloWCrowd, a simple tool developed to facilitate crowdsourcing lexicographic tasks, such as error correction in automatically generated wordnets and semantic annotation of corpora. The tool is open-source, language-independent and can be adapted to a broad range of crowdsourcing tasks. Since volunteers who participate in our crowdsourcing tasks are not trained lexicographers, the tool has been designed to obtain multiple answers to the same question and compute the majority vote, making sure individual unreliable answers are discarded. We also make sure unreliable volunteers, who systematically provide unreliable answers, are not taken into account. This is achieved by measuring their accuracy against a gold standard, the questions from which are posed to the annotators on a regular basis in between the real question. We tested the tool in an extensive crowdsourcing task, i.e. error correction of the Slovene wordnet, the results of which are encouraging, motivating us to use the tool in other annotation tasks in the future as well.

pdf bib
A Large Scale Database of Strongly-related Events in Japanese
Tomohide Shibata | Shotaro Kohama | Sadao Kurohashi

The knowledge about the relation between events is quite useful for coreference resolution, anaphora resolution, and several NLP applications such as dialogue system. This paper presents a large scale database of strongly-related events in Japanese, which has been acquired with our proposed method (Shibata and Kurohashi, 2011). In languages, where omitted arguments or zero anaphora are often utilized, such as Japanese, the coreference-based event extraction methods are hard to be applied, and so our method extracts strongly-related events in a two-phrase construct. This method first calculates the co-occurrence measure between predicate-arguments (events), and regards an event pair, whose mutual information is high, as strongly-related events. To calculate the co-occurrence measure efficiently, we adopt an association rule mining method. Then, we identify the remaining arguments by using case frames. The database contains approximately 100,000 unique events, with approximately 340,000 strongly-related event pairs, which is much larger than an existing automatically-constructed event database. We evaluated randomly-chosen 100 event pairs, and the accuracy was approximately 68%.

pdf bib
FLELex: a graded lexical resource for French foreign learners
Thomas François | Nùria Gala | Patrick Watrin | Cédrick Fairon

In this paper we present FLELex, the first graded lexicon for French as a foreign language (FFL) that reports word frequencies by difficulty level (according to the CEFR scale). It has been obtained from a tagged corpus of 777,000 words from available textbooks and simplified readers intended for FFL learners. Our goal is to freely provide this resource to the community to be used for a variety of purposes going from the assessment of the lexical difficulty of a text, to the selection of simpler words within text simplification systems, and also as a dictionary in assistive tools for writing.

pdf bib
Building Domain Specific Bilingual Dictionaries
Lucas Hilgert | Lucelene Lopes | Artur Freitas | Renata Vieira | Denise Hogetop | Aline Vanin

This paper proposes a method to build bilingual dictionaries for specific domains defined by a parallel corpora. The proposed method is based on an original method that is not domain specific. Both the original and the proposed methods are constructed with previously available natural language processing tools. Therefore, this paper contribution resides in the choice and parametrization of the chosen tools. To illustrate the proposed method benefits we conduct an experiment over technical manuals in English and Portuguese. The results of our proposed method were analyzed by human specialists and our results indicates significant increases in precision for unigrams and muli-grams. Numerically, the precision increase is as big as 15% according to our evaluation.

pdf bib
A Corpus of Machine Translation Errors Extracted from Translation Students Exercises
Guillaume Wisniewski | Natalie Kübler | François Yvon

In this paper, we present a freely available corpus of automatic translations accompanied with post-edited versions, annotated with labels identifying the different kinds of errors made by the MT system. These data have been extracted from translation students exercises that have been corrected by a senior professor. This corpus can be useful for training quality estimation tools and for analyzing the types of errors made MT system.

pdf bib
Finding Romanized Arabic Dialect in Code-Mixed Tweets
Clare Voss | Stephen Tratz | Jamal Laoudi | Douglas Briesch

Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects however also appear written in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a Romanized Arabic dialect and distinguishes it from French and English in tweets. We focus on Moroccan Darija, one of several spoken vernaculars in the family of Maghrebi Arabic dialects. Even given noisy, code-mixed tweets,the classifier achieved token-level recall of 93.2% on Romanized Arabic dialect, 83.2% on English, and 90.1% on French. The classifier, now integrated into our tweet conversation annotation tool (Tratz et al. 2013), has semi-automated the construction of a Romanized Arabic-dialect lexicon. Two datasets, a full list of Moroccan Darija surface token forms and a table of lexical entries derived from this list with spelling variants, as extracted from our tweet corpus collection, will be made available in the LRE MAP.

pdf bib
Co-Training for Classification of Live or Studio Music Recordings
Nicolas Auguin | Pascale Fung

The fast-spreading development of online streaming services has enabled people from all over the world to listen to music. However, it is not always straightforward for a given user to find the “right” song version he or she is looking for. As streaming services may be affected by the potential dissatisfaction among their customers, the quality of songs and the presence of tags (or labels) associated with songs returned to the users are very important. Thus, the need for precise and reliable metadata becomes paramount. In this work, we are particularly interested in distinguishing between live and studio versions of songs. Specifically, we tackle the problem in the case where very little-annotated training data are available, and demonstrate how an original co-training algorithm in a semi-supervised setting can alleviate the problem of data scarcity to successfully discriminate between live and studio music recordings.

pdf bib
#mygoal: Finding Motivations on Twitter
Marc Tomlinson | David Bracewell | Wayne Krug | David Hinote

Our everyday language reflects our psychological and cognitive state and effects the states of other individuals. In this contribution we look at the intersection between motivational state and language. We create a set of hashtags, which are annotated for the degree to which they are used by individuals to mark-up language that is indicative of a collection of factors that interact with an individual’s motivational state. We look for tags that reflect a goal mention, reward, or a perception of control. Finally, we present results for a language-model based classifier which is able to predict the presence of one of these factors in a tweet with between 69\% and 80\% accuracy on a balanced testing set. Our approach suggests that hashtags can be used to understand, not just the language of topics, but the deeper psychological and social meaning of a tweet.

pdf bib
The RATS Collection: Supporting HLT Research with Degraded Audio Data
David Graff | Kevin Walker | Stephanie Strassel | Xiaoyi Ma | Karen Jones | Ann Sawyer

The DARPA RATS program was established to foster development of language technology systems that can perform well on speaker-to-speaker communications over radio channels that evince a wide range in the type and extent of signal variability and acoustic degradation. Creating suitable corpora to address this need poses an equally wide range of challenges for the collection, annotation and quality assessment of relevant data. This paper describes the LDC’s multi-year effort to build the RATS data collection, summarizes the content and properties of the resulting corpora, and discusses the novel problems and approaches involved in ensuring that the data would satisfy its intended use, to provide speech recordings and annotations for training and evaluating HLT systems that perform 4 specific tasks on difficult radio channels: Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Keyword Spotting (KWS).

pdf bib
Modeling Language Proficiency Using Implicit Feedback
Chris Hokamp | Rada Mihalcea | Peter Schuelke

We describe the results of several experiments with interactive interfaces for native and L2 English students, designed to collect implicit feedback from students as they complete a reading activity. In this study, implicit means that all data is obtained without asking the user for feedback. To test the value of implicit feedback for assessing student proficiency, we collect features of user behavior and interaction, which are then used to train classification models. Based upon the feedback collected during these experiments, a student’s performance on a quiz and proficiency relative to other students can be accurately predicted, which is a step on the path to our goal of providing automatic feedback and unintrusive evaluation in interactive learning environments.

pdf bib
Event Extraction Using Distant Supervision
Kevin Reschke | Martin Jankowiak | Mihai Surdeanu | Christopher Manning | Daniel Jurafsky

Distant supervision is a successful paradigm that gathers training data for information extraction systems by automatically aligning vast databases of facts with text. Previous work has demonstrated its usefulness for the extraction of binary relations such as a person’s employer or a film’s director. Here, we extend the distant supervision approach to template-based event extraction, focusing on the extraction of passenger counts, aircraft types, and other facts concerning airplane crash events. We present a new publicly available dataset and event extraction task in the plane crash domain based on Wikipedia infoboxes and newswire text. Using this dataset, we conduct a preliminary evaluation of four distantly supervised extraction models which assign named entity mentions in text to entries in the event template. Our results indicate that joint inference over sequences of candidate entity mentions is beneficial. Furthermore, we demonstrate that the Searn algorithm outperforms a linear-chain CRF and strong baselines with local inference.

pdf bib
First Insight into Quality-Adaptive Dialogue
Stefan Ultes | Hüseyin Dikme | Wolfgang Minker

While Spoken Dialogue Systems have gained in importance in recent years, most systems applied in the real world are still static and error-prone. To overcome this, the user is put into the focus of dialogue management. Hence, an approach for adapting the course of the dialogue to Interaction Quality, an objective variant of user satisfaction, is presented in this work. In general, rendering the dialogue adaptive to user satisfaction enables the dialogue system to improve the course of the dialogue and to handle problematic situations better. In this contribution, we present a pilot study of quality-adaptive dialogue. By selecting the confirmation strategy based on the current IQ value, the course of the dialogue is adapted in order to improve the overall user experience. In a user experiment comparing three different confirmation strategies in a train booking domain, the adaptive strategy performs successful and is among the two best rated strategies based on the overall user experience.

pdf bib
TLAXCALA: a multilingual corpus of independent news
Antonio Toral

We acquire corpora from the domain of independent news from the Tlaxcala website. We build monolingual corpora for 15 languages and parallel corpora for all the combinations of those 15 languages. These corpora include languages for which only very limited such resources exist (e.g. Tamazight). We present the acquisition process in detail and we also present detailed statistics of the produced corpora, concerning mainly quantitative dimensions such as the size of the corpora per language (for the monolingual corpora) and per language pair (for the parallel corpora). To the best of our knowledge, these are the first publicly available parallel and monolingual corpora for the domain of independent news. We also create models for unsupervised sentence splitting for all the languages of the study.

pdf bib
Creating and using large monolingual parallel corpora for sentential paraphrase generation
Sander Wubben | Antal van den Bosch | Emiel Krahmer

In this paper we investigate the automatic generation of paraphrases by using machine translation techniques. Three contributions we make are the construction of a large paraphrase corpus for English and Dutch, a re-ranking heuristic to use machine translation for paraphrase generation and a proper evaluation methodology. A large parallel corpus is constructed by aligning clustered headlines that are scraped from a news aggregator site. To generate sentential paraphrases we use a standard phrase-based machine translation (PBMT) framework modified with a re-ranking component (henceforth PBMT-R). We demonstrate this approach for Dutch and English and evaluate by using human judgements collected from 76 participants. The judgments are compared to two automatic machine translation evaluation metrics. We observe that as the paraphrases deviate more from the source sentence, the performance of the PBMT-R system degrades less than that of the word substitution baseline system.

pdf bib
Benchmarking of English-Hindi parallel corpora
Jayendra Rakesh Yeka | Prasanth Kolachina | Dipti Misra Sharma

In this paper we present several parallel corpora for English↔Hindi and talk about their natures and domains. We also discuss briefly a few previous attempts in MT for translation from English to Hindi. The lack of uniformly annotated data makes it difficult to compare these attempts and precisely analyze their strengths and shortcomings. With this in mind, we propose a standard pipeline to provide uniform linguistic annotations to these resources using state-of-art NLP technologies. We conclude the paper by presenting evaluation scores of different statistical MT systems on the corpora detailed in this paper for English→Hindi and present the proposed plans for future work. We hope that both these annotated parallel corpora resources and MT systems will serve as benchmarks for future approaches to MT in English→Hindi. This was and remains the main motivation for the attempts detailed in this paper.

pdf bib
A New Framework for Sign Language Recognition based on 3D Handshape Identification and Linguistic Modeling
Mark Dilsizian | Polina Yanovich | Shu Wang | Carol Neidle | Dimitris Metaxas

Current approaches to sign recognition by computer generally have at least some of the following limitations: they rely on laboratory conditions for sign production, are limited to a small vocabulary, rely on 2D modeling (and therefore cannot deal with occlusions and off-plane rotations), and/or achieve limited success. Here we propose a new framework that (1) provides a new tracking method less dependent than others on laboratory conditions and able to deal with variations in background and skin regions (such as the face, forearms, or other hands); (2) allows for identification of 3D hand configurations that are linguistically important in American Sign Language (ASL); and (3) incorporates statistical information reflecting linguistic constraints in sign production. For purposes of large-scale computer-based sign language recognition from video, the ability to distinguish hand configurations accurately is critical. Our current method estimates the 3D hand configuration to distinguish among 77 hand configurations linguistically relevant for ASL. Constraining the problem in this way makes recognition of 3D hand configuration more tractable and provides the information specifically needed for sign recognition. Further improvements are obtained by incorporation of statistical information about linguistic dependencies among handshapes within a sign derived from an annotated corpus of almost 10,000 sign tokens.

pdf bib
Evaluating Improvised Hip Hop Lyrics - Challenges and Observations
Karteek Addanki | Dekai Wu

We investigate novel challenges involved in comparing model performance on the task of improvising responses to hip hop lyrics and discuss observations regarding inter-evaluator agreement on judging improvisation quality. We believe the analysis serves as a first step toward designing robust evaluation strategies for improvisation tasks, a relatively neglected area to date. Unlike most natural language processing tasks, improvisation tasks suffer from a high degree of subjectivity, making it difficult to design discriminative evaluation strategies to drive model development. We propose a simple strategy with fluency and rhyming as the criteria for evaluating the quality of generated responses, which we apply to both our inversion transduction grammar based FREESTYLE hip hop challenge-response improvisation system, as well as various contrastive systems. We report inter-evaluator agreement for both English and French hip hop lyrics, and analyze correlation with challenge length. We also compare the extent of agreement in evaluating fluency with that of rhyming, and quantify the difference in agreement with and without precise definitions of evaluation criteria.

pdf bib
Votter Corpus: A Corpus of Social Polling Language
Nathan Green | Septina Dian Larasati

The Votter Corpus is a new annotated corpus of social polling questions and answers. The Votter Corpus is novel in its use of the mobile application format and novel in its coverage of specific demographics. With over 26,000 polls and close to 1 millions votes, the Votter Corpus covers everyday question and answer language, primarily for users who are female and between the ages of 13-24. The corpus is annotated by topic and by popularity of particular answers. The corpus contains many unique characteristics such as emoticons, common mobile misspellings, and images associated with many of the questions. The corpus is a collection of questions and answers from The Votter App on the Android operating system. Data is created solely on this mobile platform which differs from most social media corpora. The Votter Corpus is being made available online in XML format for research and non-commercial use. The Votter android app can be downloaded for free in most android app stores.

pdf bib
SinoCoreferencer: An End-to-End Chinese Event Coreference Resolver
Chen Chen | Vincent Ng

Compared to entity coreference resolution, there is a relatively small amount of work on event coreference resolution. Much work on event coreference was done for English. In fact, to our knowledge, there are no publicly available results on Chinese event coreference resolution. This paper describes the design, implementation, and evaluation of SinoCoreferencer, an end-to-end state-of-the-art ACE-style Chinese event coreference system. We have made SinoCoreferencer publicly available, in hope to facilitate the development of high-level Chinese natural language applications that can potentially benefit from event coreference information.

pdf bib
Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development
Mohamed Maamouri | Ann Bies | Seth Kulick | Michael Ciul | Nizar Habash | Ramy Eskander

This paper describes the parallel development of an Egyptian Arabic Treebank and a morphological analyzer for Egyptian Arabic (CALIMA). By the very nature of Egyptian Arabic, the data collected is informal, for example Discussion Forum text, which we use for the treebank discussed here. In addition, Egyptian Arabic, like other Arabic dialects, is sufficiently different from Modern Standard Arabic (MSA) that tools and techniques developed for MSA cannot be simply transferred over to work on Egyptian Arabic work. In particular, a morphological analyzer for Egyptian Arabic is needed to mediate between the written text and the segmented, vocalized form used for the syntactic trees. This led to the necessity of a feedback loop between the treebank team and the analyzer team, as improvements in each area were fed to the other. Therefore, by necessity, there needed to be close cooperation between the annotation team and the tool development team, which was to their mutual benefit. Collaboration on this type of challenge, where tools and resources are limited, proved to be remarkably synergistic and opens the way to further fruitful work on Arabic dialects.

pdf bib
A German Twitter Snapshot
Tatjana Scheffler

We present a new corpus of German tweets. Due to the relatively small number of German messages on Twitter, it is possible to collect a virtually complete snapshot of German twitter messages over a period of time. In this paper, we present our collection method which produced a 24 million tweet corpus, representing a large majority of all German tweets sent in April, 2013. Further, we analyze this representative data set and characterize the German twitterverse. While German Twitter data is similar to other Twitter data in terms of its temporal distribution, German Twitter users are much more reluctant to share geolocation information with their tweets. Finally, the corpus collection method allows for a study of discourse phenomena in the Twitter data, structured into discussion threads.

pdf bib
A Comparative Evaluation Methodology for NLG in Interactive Systems
Helen Hastie | Anja Belz

Interactive systems have become an increasingly important type of application for deployment of NLG technology over recent years. At present, we do not yet have commonly agreed terminology or methodology for evaluating NLG within interactive systems. In this paper, we take steps towards addressing this gap by presenting a set of principles for designing new evaluations in our comparative evaluation methodology. We start with presenting a categorisation framework, giving an overview of different categories of evaluation measures, in order to provide standard terminology for categorising existing and new evaluation techniques. Background on existing evaluation methodologies for NLG and interactive systems is presented. The comparative evaluation methodology is presented. Finally, a methodology for comparative evaluation of NLG components embedded within interactive systems is presented in terms of the comparative evaluation methodology, using a specific task for illustrative purposes.

pdf bib
Relating Frames and Constructions in Japanese FrameNet
Kyoko Ohara

Relations between frames and constructions must be made explicit in FrameNet-style linguistic resources such as Berkeley FrameNet (Fillmore & Baker, 2010, Fillmore, Lee-Goldman & Rhomieux, 2012), Japanese FrameNet (Ohara, 2013), and Swedish Constructicon (Lyngfelt et al., 2013). On the basis of analyses of Japanese constructions for the purpose of building a constructicon in the Japanese FrameNet project, this paper argues that constructions can be classified based on whether they evoke frames or not. By recognizing such a distinction among constructions, it becomes possible for FrameNet-style linguistic resources to have a proper division of labor between frame annotations and construction annotations. In addition to the three kinds of “meaningless” constructions which have been proposed already, this paper suggests there may be yet another subtype of constructions without meanings. Furthermore, the present paper adds support to the claim that there may be constructions without meanings (Fillmore, Lee-Goldman & Rhomieux, 2012) in a current debate concerning whether all constructions should be seen as meaning-bearing (Goldberg, 2006: 166-182).

pdf bib
TMO — The Federated Ontology of the TrendMiner Project
Hans-Ulrich Krieger | Thierry Declerck

This paper describes work carried out in the European project TrendMiner which partly deals with the extraction and representation of real time information from dynamic data streams. The focus of this paper lies on the construction of an integrated ontology, TMO, the TrendMiner Ontology, that has been assembled from several independent multilingual taxonomies and ontologies which are brought together by an interface specification, expressed in OWL. Within TrendMiner, TMO serves as a common language that helps to interlink data, delivered from both symbolic and statistical components of the TrendMiner system. Very often, the extracted data is supplied as quintuples, RDF triples that are extended by two further temporal arguments, expressing the temporal extent in which an atemporal statement is true. In this paper, we will also sneak a peek on the temporal entailment rules and queries that are built into the semantic repository hosting the data and which can be used to derive useful new information.

pdf bib
A Graph-Based Approach for Computing Free Word Associations
Gemma Bel Enguix | Reinhard Rapp | Michael Zock

A graph-based algorithm is used to analyze the co-occurrences of words in the British National Corpus. It is shown that the statistical regularities detected can be exploited to predict human word associations. The corpus-derived associations are evaluated using a large test set comprising several thousand stimulus/response pairs as collected from humans. The finding is that there is a high agreement between the two types of data. The considerable size of the test set allows us to split the stimulus words into a number of classes relating to particular word properties. For example, we construct six saliency classes, and for the words in each of these classes we compare the simulation results with the human data. It turns out that for each class there is a close relationship between the performance of our system and human performance. This is also the case for classes based on two other properties of words, namely syntactic and semantic word ambiguity. We interpret these findings as evidence for the claim that human association acquisition must be based on the statistical analysis of perceived language and that when producing associations the detected statistical regularities are replicated.

pdf bib
Developing Text Resources for Ten South African Languages
Roald Eiselen | Martin Puttkammer

The development of linguistic resources for use in natural language processing is of utmost importance for the continued growth of research and development in the field, especially for resource-scarce languages. In this paper we describe the process and challenges of simultaneously developing multiple linguistic resources for ten of the official languages of South Africa. The project focussed on establishing a set of foundational resources that can foster further development of both resources and technologies for the NLP industry in South Africa. The development efforts during the project included creating monolingual unannotated corpora, of which a subset of the corpora for each language was annotated on token, orthographic, morphological and morphosyntactic layers. The annotated subsets includes both development and test sets and were used in the creation of five core-technologies, viz. a tokeniser, sentenciser, lemmatiser, part of speech tagger and morphological decomposer for each language. We report on the quality of these tools for each language and discuss the importance of the resources within the South African context.

pdf bib
Momresp: A Bayesian Model for Multi-Annotator Document Labeling
Paul Felt | Robbie Haertel | Eric Ringger | Kevin Seppi

Data annotation in modern practice often involves multiple, imperfect human annotators. Multiple annotations can be used to infer estimates of the ground-truth labels and to estimate individual annotator error characteristics (or reliability). We introduce MomResp, a model that incorporates information from both natural data clusters as well as annotations from multiple annotators to infer ground-truth labels and annotator reliability for the document classification task. We implement this model and show dramatic improvements over majority vote in situations where both annotations are scarce and annotation quality is low as well as in situations where annotators disagree consistently. Because MomResp predictions are subject to label switching, we introduce a solution that finds nearly optimal predicted class reassignments in a variety of settings using only information available to the model at inference time. Although MomResp does not perform well in annotation-rich situations, we show evidence suggesting how this shortcoming may be overcome in future work.

pdf bib
Towards Automatic Detection of Narrative Structure
Jessica Ouyang | Kathy McKeown

We present novel computational experiments using William Labov’s theory of narrative analysis. We describe his six elements of narrative structure and construct a new corpus based on his most recent work on narrative. Using this corpus, we explore the correspondence between Labov’s elements of narrative structure and the implicit discourse relations of the Penn Discourse Treebank, and we construct a mapping between the elements of narrative structure and the discourse relation classes of the PDTB. We present first experiments on detecting Complicating Actions, the most common of the elements of narrative structure, achieving an f-score of 71.55. We compare the contributions of features derived from narrative analysis, such as the length of clauses and the tenses of main verbs, with those of features drawn from work on detecting implicit discourse relations. Finally, we suggest directions for future research on narrative structure, such as applications in assessing text quality and in narrative generation.

pdf bib
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
Anabela Barreiro | Fernando Batista | Ricardo Ribeiro | Helena Moniz | Isabel Trancoso

This paper presents 3 sets of OpenLogos resources, namely the English-German, the English-French, and the English-Italian bilingual dictionaries. In addition to the usual information on part-of-speech, gender, and number for nouns, offered by most dictionaries currently available, OpenLogos bilingual dictionaries have some distinctive features that make them unique: they contain cross-language morphological information (inflectional and derivational), semantico-syntactic knowledge, indication of the head word in multiword units, information about whether a source word corresponds to an homograph, information about verb auxiliaries, alternate words (i.e., predicate or process nouns), causatives, reflexivity, verb aspect, among others. The focal point of the paper will be the semantico-syntactic knowledge that is important for disambiguation and translation precision. The resources are publicly available at the METANET platform for free use by the research community.

pdf bib
Using Large Biomedical Databases as Gold Annotations for Automatic Relation Extraction
Tilia Ellendorff | Fabio Rinaldi | Simon Clematide

We show how to use large biomedical databases in order to obtain a gold standard for training a machine learning system over a corpus of biomedical text. As an example we use the Comparative Toxicogenomics Database (CTD) and describe by means of a short case study how the obtained data can be applied. We explain how we exploit the structure of the database for compiling training material and a testset. Using a Naive Bayes document classification approach based on words, stem bigrams and MeSH descriptors we achieve a macro-average F-score of 61% on a subset of 8 action terms. This outperforms a baseline system based on a lookup of stemmed keywords by more than 20%. Furthermore, we present directions of future work, taking the described system as a vantage point. Future work will be aiming towards a weakly supervised system capable of discovering complete biomedical interactions and events.

pdf bib
Crowdsourcing for the identification of event nominals: an experiment
Rachele Sprugnoli | Alessandro Lenci

This paper presents the design and results of a crowdsourcing experiment on the recognition of Italian event nominals. The aim of the experiment was to assess the feasibility of crowdsourcing methods for a complex semantic task such as distinguishing the eventive interpretation of polysemous nominals taking into consideration various types of syntagmatic cues. Details on the theoretical background and on the experiment set up are provided together with the final results in terms of accuracy and inter-annotator agreement. These results are compared with the ones obtained by expert annotators on the same task. The low values in accuracy and Fleiss’ kappa of the crowdsourcing experiment demonstrate that crowdsourcing is not always optimal for complex linguistic tasks. On the other hand, the use of non-expert contributors allows to understand what are the most ambiguous patterns of polysemy and the most useful syntagmatic cues to be used to identify the eventive reading of nominals.

pdf bib
Automatic Refinement of Syntactic Categories in Chinese Word Structures
Jianqiang Ma

Annotated word structures are useful for various Chinese NLP tasks, such as word segmentation, POS tagging and syntactic parsing. Chinese word structures are often represented by binary trees, the nodes of which are labeled with syntactic categories, due to the syntactic nature of Chinese word formation. It is desirable to refine the annotation by labeling nodes of word structure trees with more proper syntactic categories so that the combinatorial properties in the word formation process are better captured. This can lead to improved performances on the tasks that exploit word structure annotations. We propose syntactically inspired algorithms to automatically induce syntactic categories of word structure trees using POS tagged corpus and branching in existing Chinese word structure trees. We evaluate the quality of our annotation by comparing the performances of models based on our annotation and another publicly available annotation, respectively. The results on two variations of Chinese word segmentation task show that using our annotation can lead to significant performance improvements.

pdf bib
Incorporating Alternate Translations into English Translation Treebank
Ann Bies | Justin Mott | Seth Kulick | Jennifer Garland | Colin Warner

New annotation guidelines and new processing methods were developed to accommodate English treebank annotation of a parallel English/Chinese corpus of web data that includes alternate English translations (one fluent, one literal) of expressions that are idiomatic in the Chinese source. In previous machine translation programs, alternate translations of idiomatic expressions had been present in untreebanked data only, but due to the high frequency of such expressions in informal genres such as discussion forums, machine translation system developers requested that alternatives be added to the treebanked data as well. In consultation with machine translation researchers, we chose a pragmatic approach of syntactically annotating only the fluent translation, while retaining the alternate literal translation as a segregated node in the tree. Since the literal translation alternates are often incompatible with English syntax, this approach allows us to create fluent trees without losing information. This resource is expected to support machine translation efforts, and the flexibility provided by the alternate translations is an enhancement to the treebank for this purpose.

pdf bib
Zmorge: A German Morphological Lexicon Extracted from Wiktionary
Rico Sennrich | Beat Kunz

We describe a method to automatically extract a German lexicon from Wiktionary that is compatible with the finite-state morphological grammar SMOR. The main advantage of the resulting lexicon over existing lexica for SMOR is that it is open and permissively licensed. A recall-oriented evaluation shows that a morphological analyser built with our lexicon has comparable coverage compared to existing lexica, and continues to improve as Wiktionary grows. We also describe modifications to the SMOR grammar that result in a more conventional lemmatisation of words.

pdf bib
Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon
Mona Diab | Mohamed Al-Badrashiny | Maryam Aminian | Mohammed Attia | Heba Elfardy | Nizar Habash | Abdelati Hawwari | Wael Salloum | Pradeep Dasigi | Ramy Eskander

We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwa’s creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from both monolingual and bilingual existing resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both Theoretical Linguistics as well as Computational Linguistics research.

pdf bib
A SKOS-based Schema for TEI encoded Dictionaries at ICLTT
Thierry Declerck | Karlheinz Mörth | Eveline Wandl-Vogt

At our institutes we are working with quite some dictionaries and lexical resources in the field of less-resourced language data, like dialects and historical languages. We are aiming at publishing those lexical data in the Linked Open Data framework in order to link them with available data sets for highly-resourced languages and elevating them thus to the same “digital dignity” the mainstream languages have gained. In this paper we concentrate on two TEI encoded variants of the Arabic language and propose a mapping of this TEI encoded data onto SKOS, showing how the lexical entries of the two dialectal dictionaries can be linked to other language resources available in the Linked Open Data cloud.

pdf bib
Semantic Technologies for Querying Linguistic Annotations: An Experiment Focusing on Graph-Structured Data
Milen Kouylekov | Stephan Oepen

With growing interest in the creation and search of linguistic annotations that form general graphs (in contrast to formally simpler, rooted trees), there also is an increased need for infrastructures that support the exploration of such representations, for example logical-form meaning representations or semantic dependency graphs. In this work, we heavily lean on semantic technologies and in particular the data model of the Resource Description Framework (RDF) to represent, store, and efficiently query very large collections of text annotated with graph-structured representations of sentence meaning.

pdf bib
Eliciting and Annotating Uncertainty in Spoken Language
Heather Pon-Barry | Stuart Shieber | Nicholas Longenbaugh

A major challenge in the field of automatic recognition of emotion and affect in speech is the subjective nature of affect labels. The most common approach to acquiring affect labels is to ask a panel of listeners to rate a corpus of spoken utterances along one or more dimensions of interest. For applications ranging from educational technology to voice search to dictation, a speaker’s level of certainty is a primary dimension of interest. In such applications, we would like to know the speaker’s actual level of certainty, but past research has only revealed listeners’ perception of the speaker’s level of certainty. In this paper, we present a method for eliciting spoken utterances using stimuli that we design such that they have a quantitative, crowdsourced legibility score. While we cannot control a speaker’s actual internal level of certainty, the use of these stimuli provides a better estimate of internal certainty compared to existing speech corpora. The Harvard Uncertainty Speech Corpus, containing speech data, certainty annotations, and prosodic features, is made available to the research community.

pdf bib
A hierarchical taxonomy for classifying hardness of inference tasks
Martin Gleize | Brigitte Grau

Exhibiting inferential capabilities is one of the major goals of many modern Natural Language Processing systems. However, if attempts have been made to define what textual inferences are, few seek to classify inference phenomena by difficulty. In this paper we propose a hierarchical taxonomy for inferences, relatively to their hardness, and with corpus annotation and system design and evaluation in mind. Indeed, a fine-grained assessment of the difficulty of a task allows us to design more appropriate systems and to evaluate them only on what they are designed to handle. Each of seven classes is described and provided with examples from different tasks like question answering, textual entailment and coreference resolution. We then test the classes of our hierarchy on the specific task of question answering. Our annotation process of the testing data at the QA4MRE 2013 evaluation campaign reveals that it is possible to quantify the contrasts in types of difficulty on datasets of the same task.

pdf bib
Automatic Methods for the Extension of a Bilingual Dictionary using Comparable Corpora
Michael Rosner | Kurt Sultana

Bilingual dictionaries define word equivalents from one language to another, thus acting as an important bridge between languages. No bilingual dictionary is complete since languages are in a constant state of change. Additionally, dictionaries are unlikely to achieve complete coverage of all language terms. This paper investigates methods for extending dictionaries using non-aligned corpora, by finding translations through context similarity. Most methods compute word contexts from general corpora. This can lead to errors due to data sparsity. We investigate the hypothesis that this problem can be addressed by carefully choosing smaller corpora in which domain-specific terms are more predominant. We also introduce the notion of efficiency which we consider as the effort required to obtain a set of dictionary entries from a given corpus

pdf bib
A Method for Building Burst-Annotated Co-Occurrence Networks for Analysing Trends in Textual Data
Yutaka Mitsuishi | Vít Nováček | Pierre-Yves Vandenbussche

This paper presents a method for constructing a specific type of language resources that are conveniently applicable to analysis of trending topics in time-annotated textual data. More specifically, the method consists of building a co-occurrence network from the on-line content (such as New York Times articles) that conform to key words selected by users (e.g., ‘Arab Spring’). Within the network, burstiness of the particular nodes (key words) and edges (co-occurrence relations) is computed. A service deployed on the network then facilitates exploration of the underlying text in order to identify trending topics. Using the graph structure of the network, one can assess also a broader context of the trending events. To limit the information overload of users, we filter the edges and nodes displayed by their burstiness scores to show only the presumably more important ones. The paper gives details on the proposed method, including a step-by-step walk through with plenty of real data examples. We report on a specific application of our method to the topic of `Arab Spring’ and make the language resource applied therein publicly available for experimentation. Last but not least, we outline a methodology of an ongoing evaluation of our method.

pdf bib
Casa de la Lhéngua: a set of language resources and natural language processing tools for Mirandese
José Pedro Ferreira | Cristiano Chesi | Daan Baldewijns | Fernando Miguel Pinto | Margarita Correia | Daniela Braga | Hyongsil Cho | Amadeu Ferreira | Miguel Dias

This paper describes the efforts for the construction of Language Resources and NLP tools for Mirandese, a minority language spoken in North-eastern Portugal, now available on a community-led portal, Casa de la Lhéngua. The resources were developed in the context of a collaborative citizenship project led by Microsoft, in the context of the creation of the first TTS system for Mirandese. Development efforts encompassed the compilation of a corpus with over 1M tokens, the construction of a GTP system, syllable-division, inflection and a Part-of-Speech (POS) tagger modules, leading to the creation of an inflected lexicon of about 200.000 entries with phonetic transcription, detailed POS tagging, syllable division, and stress mark-up. Alongside these tasks, which were made easier through the adaptation and reuse of existing tools for closely related languages, a casting for voice talents among the speaking community was conducted and the first speech database for speech synthesis was recorded for Mirandese. These resources were combined to fulfil the requirements of a well-tested statistical parameter synthesis model, leading to an intelligible voice font. These language resources are available freely at Casa de la Lhéngua, aiming at promoting the development of real-life applications and fostering linguistic research on Mirandese.

pdf bib
Untrained Forced Alignment of Transcriptions and Audio for Language Documentation Corpora using WebMAUS
Jan Strunk | Florian Schiel | Frank Seifart

Language documentation projects supported by recent funding intiatives have created a large number of multimedia corpora of typologically diverse languages. Most of these corpora provide a manual alignment of transcription and audio data at the level of larger units, such as sentences or intonation units. Their usefulness both for corpus-linguistic and psycholinguistic research and for the development of tools and teaching materials could, however, be increased by achieving a more fine-grained alignment of transcription and audio at the word or even phoneme level. Since most language documentation corpora contain data on small languages, there usually do not exist any speech recognizers or acoustic models specifically trained on these languages. We therefore investigate the feasibility of untrained forced alignment for such corpora. We report on an evaluation of the tool (Web)MAUS (Kisler, 2012) on several language documentation corpora and discuss practical issues in the application of forced alignment. Our evaluation shows that (Web)MAUS with its existing acoustic models combined with simple grapheme-to-phoneme conversion can be successfully used for word-level forced alignment of a diverse set of languages without additional training, especially if a manual prealignment of larger annotation units is already avaible.

pdf bib
MultiVal - towards a multilingual valence lexicon
Lars Hellan | Dorothee Beermann | Tore Bruland | Mary Esther Kropp Dakubu | Montserrat Marimon

MultiVal is a valence lexicon derived from lexicons of computational HPSG grammars for Norwegian, Spanish and Ga (ISO 639-3, gaa), with altogether about 22,000 verb entries and on average more than 200 valence types defined for each language. These lexical resources are mapped onto a common set of discriminants with a common array of values, and stored in a relational database linked to a web demo and a wiki presentation. Search discriminants are ‘syntactic argument structure’ (SAS), functional specification, situation type and aspect, for any subset of languages, as well as the verb type systems of the grammars. Search results are lexical entries satisfying the discriminants entered, exposing the specifications from the respective provenance grammars. The Ga grammar lexicon has in turn been converted from a Ga Toolbox lexicon. Aside from the creation of such a multilingual valence resource through converging or converting existing resources, the paper also addresses a tool for the creation of such a resource as part of corpus annotation for less resourced languages.

pdf bib
The Sweet-Home speech and multimodal corpus for home automation interaction
Michel Vacher | Benjamin Lecouteux | Pedro Chahuara | François Portet | Brigitte Meillon | Nicolas Bonnefond

Ambient Assisted Living aims at enhancing the quality of life of older and disabled people at home thanks to Smart Homes and Home Automation. However, many studies do not include tests in real settings, because data collection in this domain is very expensive and challenging and because of the few available data sets. The S WEET-H OME multimodal corpus is a dataset recorded in realistic conditions in D OMUS, a fully equipped Smart Home with microphones and home automation sensors, in which participants performed Activities of Daily living (ADL). This corpus is made of a multimodal subset, a French home automation speech subset recorded in Distant Speech conditions, and two interaction subsets, the first one being recorded by 16 persons without disabilities and the second one by 6 seniors and 5 visually impaired people. This corpus was used in studies related to ADL recognition, context aware interaction and distant speech recognition applied to home automation controled through voice.

pdf bib
Global Intelligent Content: Active Curation of Language Resources using Linked Data
David Lewis | Rob Brennan | Leroy Finn | Dominic Jones | Alan Meehan | Declan O’Sullivan | Sebastian Hellmann | Felix Sasaki

As language resources start to become available in linked data formats, it becomes relevant to consider how linked data interoperability can play a role in active language processing workflows as well as for more static language resource publishing. This paper proposes that linked data may have a valuable role to play in tracking the use and generation of language resources in such workflows in order to assess and improve the performance of the language technologies that use the resources, based on feedback from the human involvement typically required within such processes. We refer to this as Active Curation of the language resources, since it is performed systematically over language processing workflows to continuously improve the quality of the resource in specific applications, rather than via dedicated curation steps. We use modern localisation workflows, i.e. assisted by machine translation and text analytics services, to explain how linked data can support such active curation. By referencing how a suitable linked data vocabulary can be assembled by combining existing linked data vocabularies and meta-data from other multilingual content processing annotations and tool exchange standards we aim to demonstrate the relative ease with which active curation can be deployed more broadly.

pdf bib
On the Romance Languages Mutual Intelligibility
Liviu Dinu | Alina Maria Ciobanu

We propose a method for computing the similarity of natural languages and for clustering them based on their lexical similarity. Our study provides evidence to be used in the investigation of the written intelligibility, i.e., the ability of people writing in different languages to understand one another without prior knowledge of foreign languages. We account for etymons and cognates, we quantify lexical similarity and we extend our analysis from words to languages. Based on the introduced methodology, we compute a matrix of Romance languages intelligibility.

pdf bib
Aggregation methods for efficient collocation detection
Anca Dinu | Liviu Dinu | Ionut Sorodoc

In this article we propose a rank aggregation method for the task of collocations detection. It consists of applying some well-known methods (e.g. Dice method, chi-square test, z-test and likelihood ratio) and then aggregating the resulting collocations rankings by rank distance and Borda score. These two aggregation methods are especially well suited for the task, since the results of each individual method naturally forms a ranking of collocations. Combination methods are known to usually improve the results, and indeed, the proposed aggregation method performs better then each individual method taken in isolation.

pdf bib
Annotating Inter-Sentence Temporal Relations in Clinical Notes
Jennifer D’Souza | Vincent Ng

Owing in part to the surge of interest in temporal relation extraction, a number of datasets manually annotated with temporal relations between event-event pairs and event-time pairs have been produced recently. However, it is not uncommon to find missing annotations in these manually annotated datasets. Many researchers attributed this problem to “annotator fatigue”. While some of these missing relations can be recovered automatically, many of them cannot. Our goals in this paper are to (1) manually annotate certain types of missing links that cannot be automatically recovered in the i2b2 Clinical Temporal Relations Challenge Corpus, one of the recently released evaluation corpora for temporal relation extraction; and (2) empirically determine the usefulness of these additional annotations. We will make our annotations publicly available, in hopes of enabling a more accurate evaluation of temporal relation extraction systems.

pdf bib
Terminology localization guidelines for the national scenario
Juris Borzovs | Ilze Ilziņa | Iveta Keiša | Mārcis Pinnis | Andrejs Vasiļjevs

This paper presents a set of principles and practical guidelines for terminology work in the national scenario to ensure a harmonized approach in term localization. These linguistic principles and guidelines are elaborated by the Terminology Commission in Latvia in the domain of Information and Communication Technology (ICT). We also present a novel approach in a corpus-based selection and an evaluation of the most frequently used terms. Analysis of the terms proves that, in general, in the normative terminology work in Latvia localized terms are coined according to these guidelines. We further evaluate how terms included in the database of official terminology are adopted in the general use such as newspaper articles, blogs, forums, websites etc. Our evaluation shows that in a non-normative context the official terminology faces a strong competition from other variations of localized terms. Conclusions and recommendations from lexical analysis of localized terms are provided. We hope that presented guidelines and approach in evaluation will be useful to terminology institutions, regulative authorities and researchers in different countries that are involved in the national terminology work.

pdf bib
Tools for Arabic Natural Language Processing: a case study in qalqalah prosody
Claire Brierley | Majdi Sawalha | Eric Atwell

In this paper, we focus on the prosodic effect of qalqalah or “vibration” applied to a subset of Arabic consonants under certain constraints during correct Qur’anic recitation or taǧwīd, using our Boundary-Annotated Qur’an dataset of 77430 words (Brierley et al 2012; Sawalha et al 2014). These qalqalah events are rule-governed and are signified orthographically in the Arabic script. Hence they can be given abstract definition in the form of regular expressions and thus located and collected automatically. High frequency qalqalah content words are also found to be statistically significant discriminators or keywords when comparing Meccan and Medinan chapters in the Qur’an using a state-of-the-art Visual Analytics toolkit: Semantic Pathways. Thus we hypothesise that qalqalah prosody is one way of highlighting salient items in the text. Finally, we implement Arabic transcription technology (Brierley et al under review; Sawalha et al forthcoming) to create a qalqalah pronunciation guide where each word is transcribed phonetically in IPA and mapped to its chapter-verse ID. This is funded research under the EPSRC “Working Together” theme.

pdf bib
Teenage and adult speech in school context: building and processing a corpus of European Portuguese
Ana Isabel Mata | Helena Moniz | Fernando Batista | Julia Hirschberg

We present a corpus of European Portuguese spoken by teenagers and adults in school context, CPE-FACES, with an overview of the differential characteristics of high school oral presentations and the challenges this data poses to automatic speech processing. The CPE-FACES corpus has been created with two main goals: to provide a resource for the study of prosodic patterns in both spontaneous and prepared unscripted speech, and to capture inter-speaker and speaking style variations common at school, for research on oral presentations. Research on speaking styles is still largely based on adult speech. References to teenagers are sparse and cross-analyses of speech types comparing teenagers and adults are rare. We expect CPE-FACES, currently a unique resource in this domain, will contribute to filling this gap in European Portuguese. Focusing on disfluencies and phrase-final phonetic-phonological processes we show the impact of teenage speech on the automatic segmentation of oral presentations. Analyzing fluent final intonation contours in declarative utterances, we also show that communicative situation specificities, speaker status and cross-gender differences are key factors in speaking style variation at school.

pdf bib
A Unified Annotation Scheme for the Semantic/Pragmatic Components of Definiteness
Archna Bhatia | Mandy Simons | Lori Levin | Yulia Tsvetkov | Chris Dyer | Jordan Bender

We present a definiteness annotation scheme that captures the semantic, pragmatic, and discourse information, which we call communicative functions, associated with linguistic descriptions such as “a story about my speech”, “the story”, “every time I give it”, “this slideshow”. A survey of the literature suggests that definiteness does not express a single communicative function but is a grammaticalization of many such functions, for example, identifiability, familiarity, uniqueness, specificity. Our annotation scheme unifies ideas from previous research on definiteness while attempting to remove redundancy and make it easily annotatable. This annotation scheme encodes the communicative functions of definiteness rather than the grammatical forms of definiteness. We assume that the communicative functions are largely maintained across languages while the grammaticalization of this information may vary. One of the final goals is to use our semantically annotated corpora to discover how definiteness is grammaticalized in different languages. We release our annotated corpora for English and Hindi, and sample annotations for Hebrew and Russian, together with an annotation manual.

pdf bib
Aligning Predicate-Argument Structures for Paraphrase Fragment Extraction
Michaela Regneri | Rui Wang | Manfred Pinkal

Paraphrases and paraphrasing algorithms have been found of great importance in various natural language processing tasks. While most paraphrase extraction approaches extract equivalent sentences, sentences are an inconvenient unit for further processing, because they are too specific, and often not exact paraphrases. Paraphrase fragment extraction is a technique that post-processes sentential paraphrases and prunes them to more convenient phrase-level units. We present a new approach that uses semantic roles to extract paraphrase fragments from sentence pairs that share semantic content to varying degrees, including full paraphrases. In contrast to previous systems, the use of semantic parses allows for extracting paraphrases with high wording variance and different syntactic categories. The approach is tested on four different input corpora and compared to two previous systems for extracting paraphrase fragments. Our system finds three times as many good paraphrase fragments per sentence pair as the baselines, and at the same time outputs 30% fewer unrelated fragment pairs.

pdf bib
An evaluation of the role of statistical measures and frequency for MWE identification
Sandra Antunes | Amália Mendes

We report on an experiment to evaluate the role of statistical association measures and frequency for the identification of MWE. We base our evaluation on a lexicon of 14.000 MWE comprising different types of word combinations: collocations, nominal compounds, light verbs + predicate, idioms, etc. These MWE were manually validated from a list of n-grams extracted from a 50 million word corpus of Portuguese (a subcorpus of the Reference Corpus of Contemporary Portuguese), using several criteria: syntactic fixedness, idiomaticity, frequency and Mutual Information measure, although no threshold was established, either in terms of group frequency or MI. We report on MWE that were selected on the basis of their syntactic and semantics properties while the MI or both the MI and the frequency show low values, which would constitute difficult cases to establish a cutting point. We analyze the MI values of the MWE selected in our gold dataset and, for some specific cases, compare these values with two other statistical measures.

pdf bib
On the reliability and inter-annotator agreement of human semantic MT evaluation via HMEANT
Chi-kiu Lo | Dekai Wu

We present analyses showing that HMEANT is a reliable, accurate and fine-grained semantic frame based human MT evaluation metric with high inter-annotator agreement (IAA) and correlation with human adequacy judgments, despite only requiring a minimal training of about 15 minutes for lay annotators. Previous work shows that the IAA on the semantic role labeling (SRL) subtask within HMEANT is over 70%. In this paper we focus on (1) the IAA on the semantic role alignment task and (2) the overall IAA of HMEANT. Our results show that the IAA on the alignment task of HMEANT is over 90% when humans align SRL output from the same SRL annotator, which shows that the instructions on the alignment task are sufficiently precise, although the overall IAA where humans align SRL output from different SRL annotators falls to only 61% due to the pipeline effect on the disagreement in the two annotation task. We show that instead of manually aligning the semantic roles using an automatic algorithm not only helps maintaining the overall IAA of HMEANT at 70%, but also provides a finer-grained assessment on the phrasal similarity of the semantic role fillers. This suggests that HMEANT equipped with automatic alignment is reliable and accurate for humans to evaluate MT adequacy while achieving higher correlation with human adequacy judgments than HTER.

pdf bib
Dual Subtitles as Parallel Corpora
Shikun Zhang | Wang Ling | Chris Dyer

In this paper, we leverage the existence of dual subtitles as a source of parallel data. Dual subtitles present viewers with two languages simultaneously, and are generally aligned in the segment level, which removes the need to automatically perform this alignment. This is desirable as extracted parallel data does not contain alignment errors present in previous work that aligns different subtitle files for the same movie. We present a simple heuristic to detect and extract dual subtitles and show that more than 20 million sentence pairs can be extracted for the Mandarin-English language pair. We also show that extracting data from this source can be a viable solution for improving Machine Translation systems in the domain of subtitles.

pdf bib
REFRACTIVE: An Open Source Tool to Extract Knowledge from Syntactic and Semantic Relations
Peter Exner | Pierre Nugues

The extraction of semantic propositions has proven instrumental in applications like IBM Watson and in Google’s knowledge graph . One of the core components of IBM Watson is the PRISMATIC knowledge base consisting of one billion propositions extracted from the English version of Wikipedia and the New York Times. However, extracting the propositions from the English version of Wikipedia is a time-consuming process. In practice, this task requires multiple machines and a computation distribution involving a good deal of system technicalities. In this paper, we describe Refractive, an open-source tool to extract propositions from a parsed corpus based on the Hadoop variant of MapReduce. While the complete process consists of a parsing part and an extraction part, we focus here on the extraction from the parsed corpus and we hope this tool will help computational linguists speed up the development of applications.

pdf bib
Variations on quantitative comparability measures and their evaluations on synthetic French-English comparable corpora
Guiyao Ke | Pierre-Francois Marteau | Gildas Menier

Following the pioneering work by \cite{Li-Gaussier-10}, we address in this paper the analysis of a family of quantitative comparability measures dedicated to the construction and evaluation of topical comparable corpora. After recalling the definition of the quantitative comparability measure proposed by \cite{Li-Gaussier-10}, we develop some variants of this measure based primarily on the consideration that the occurrence frequencies of lexical entries and the number of their translations are important. We compare the respective advantages and disadvantages of these variants in the context of an evaluation framework that is based on the progressive degradation of the Europarl parallel corpus. The degradation is obtained by replacing either deterministically or randomly a varying amount of lines in blocks that compose partitions of the initial Europarl corpus. The impact of the coverage of bilingual dictionaries on these measures is also discussed and perspectives are finally presented.

pdf bib
Using a machine learning model to assess the complexity of stress systems
Liviu Dinu | Alina Maria Ciobanu | Ioana Chitoran | Vlad Niculae

We address the task of stress prediction as a sequence tagging problem. We present sequential models with averaged perceptron training for learning primary stress in Romanian words. We use character n-grams and syllable n-grams as features and we account for the consonant-vowel structure of the words. We show in this paper that Romanian stress is predictable, though not deterministic, by using data-driven machine learning techniques.

pdf bib
Prosodic, syntactic, semantic guidelines for topic structures across domains and corpora
Ana Isabel Mata | Helena Moniz | Telmo Móia | Anabela Gonçalves | Fátima Silva | Fernando Batista | Inês Duarte | Fátima Oliveira | Isabel Falé

This paper presents the annotation guidelines applied to naturally occurring speech, aiming at an integrated account of contrast and parallel structures in European Portuguese. These guidelines were defined to allow for the empirical study of interactions among intonation and syntax-discourse patterns in selected sets of different corpora (monologues and dialogues, by adults and teenagers). In this paper we focus on the multilayer annotation process of left periphery structures by using a small sample of highly spontaneous speech in which the distinct types of topic structures are displayed. The analysis of this sample provides fundamental training and testing material for further application in a wider range of domains and corpora. The annotation process comprises the following time-linked levels (manual and automatic): phone, syllable and word level transcriptions (including co-articulation effects); tonal events and break levels; part-of-speech tagging; syntactic-discourse patterns (construction type; construction position; syntactic function; discourse function), and disfluency events as well. Speech corpora with such a multi-level annotation are a valuable resource to look into grammar module relations in language use from an integrated viewpoint. Such viewpoint is innovative in our language, and has not been often assumed by studies for other languages.

pdf bib
Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage
Kevin Black | Eric Ringger | Paul Felt | Kevin Seppi | Kristian Heal | Deryle Lonsdale

The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word-senses. Such CDL resources are essential in learning a language and in linguistic research, translation, and philology. Lemmatization is a common approximation to automating corpus-dictionary linkage, where lemmas are treated as dictionary entry headwords. We intend to use data-driven lemmatization models to provide machine assistance to human annotators in the form of pre-annotations, and thereby reduce the costs of CDL annotation. In this work we adapt the discriminative string transducer DirecTL+ to perform lemmatization for classical Syriac, a low-resource language. We compare the accuracy of DirecTL+ with the Morfette discriminative lemmatizer. DirecTL+ achieves 96.92% overall accuracy but only by a margin of 0.86% over Morfette at the cost of a longer time to train the model. Error analysis on the models provides guidance on how to apply these models in a machine assistance setting for corpus-dictionary linkage.

pdf bib
Finite-state morphological transducers for three Kypchak languages
Jonathan Washington | Ilnar Salimzyanov | Francis Tyers

This paper describes the development of free/open-source finite-state morphological transducers for three Turkic languages―Kazakh, Tatar, and Kumyk―representing one language from each of the three sub-branches of the Kypchak branch of Turkic. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST). This paper describes how the development of a transducer for each subsequent closely-related language took less development time. An evaluation is presented which shows that the transducers all have a reasonable coverage―around 90\%―on freely available corpora of the languages, and high precision over a manually verified test set.

pdf bib
Automatic creation of WordNets from parallel corpora
Antoni Oliver | Salvador Climent

In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordNet creation is converted into a word alignment problem, where we want to alignWordNet synsets in the English part of the corpus with lemmata on the target language part of the corpus. The word alignment algorithm used in these experiments is a simple most frequent translation algorithm implemented into the WN-Toolkit. The obtained precision values are quite satisfactory, but the overall number of extracted synset-variant pairs is too low, leading into very poor recall values. In the conclusions, the use of more advanced word alignment algorithms, such as Giza++, Fast Align or Berkeley aligner is suggested.

pdf bib
The Polish Summaries Corpus
Maciej Ogrodniczuk | Mateusz Kopeć

This article presents the Polish Summaries Corpus, a new resource created to support the development and evaluation of the tools for automated single-document summarization of Polish. The Corpus contains a large number of manual summaries of news articles, with many independently created summaries for a single text. Such approach is supposed to overcome the annotator bias, which is often described as a problem during the evaluation of the summarization algorithms against a single gold standard. There are several summarizers developed specifically for Polish language, but their in-depth evaluation and comparison was impossible without a large, manually created corpus. We present in detail the process of text selection, annotation process and the contents of the corpus, which includes both abstract free-word summaries, as well as extraction-based summaries created by selecting text spans from the original document. Finally, we describe how that resource could be used not only for the evaluation of the existing summarization tools, but also for studies on the human summarization process in Polish language.

pdf bib
GlobalPhone: Pronunciation Dictionaries in 20 Languages
Tanja Schultz | Tim Schlippe

This paper describes the advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages. GlobalPhone was designed to be uniform across languages with respect to the amount of data, speech quality, the collection scenario, the transcription and phone set conventions. With more than 400 hours of transcribed audio data from more than 2000 native speakers GlobalPhone supplies an excellent basis for research in the areas of multilingual speech recognition, rapid deployment of speech processing systems to yet unsupported languages, language identification tasks, speaker recognition in multiple languages, multilingual speech synthesis, as well as monolingual speech recognition in a large variety of languages. Very recently the GlobalPhone pronunciation dictionaries have been made available for research and commercial purposes by the European Language Resources Association (ELRA).

pdf bib
Pre-ordering of phrase-based machine translation input in translation workflow
Alexandru Ceausu | Sabine Hunsicker

Word reordering is a difficult task for decoders when the languages involved have a significant difference in syntax. Phrase-based statistical machine translation (PBSMT), preferred in commercial settings due to its maturity, is particularly prone to errors in long range reordering. Source sentence pre-ordering, as a pre-processing step before PBSMT, proved to be an efficient solution that can be achieved using limited resources. We propose a dependency-based pre-ordering model with parameters optimized using a reordering score to pre-order the source sentence. The source sentence is then translated using an existing phrase-based system. The proposed solution is very simple to implement. It uses a hierarchical phrase-based statistical machine translation system (HPBSMT) for pre-ordering, combined with a PBSMT system for the actual translation. We show that the system can provide alternate translations of less post-editing effort in a translation workflow with German as the source language.

pdf bib
A Vector Space Model for Syntactic Distances Between Dialects
Emanuele Di Buccio | Giorgio Maria Di Nunzio | Gianmaria Silvello

Syntactic comparison across languages is essential in the research field of linguistics, e.g. when investigating the relationship among closely related languages. In IR and NLP, the syntactic information is used to understand the meaning of word occurrences according to the context in which their appear. In this paper, we discuss a mathematical framework to compute the distance between languages based on the data available in current state-of-the-art linguistic databases. This framework is inspired by approaches presented in IR and NLP.

pdf bib
A finite-state morphological analyzer for a Lakota HPSG grammar
Christian Curtis

This paper reports on the design and implementation of a morphophonological analyzer for Lakota, a member of the Siouan language family. The initial motivation for this work was to support development of a precision implemented grammar for Lakota on the basis of the LinGO Grammar Matrix. A finite-state transducer (FST) was developed to adapt Lakota’s complex verbal morphology into a form directly usable as input to the Grammar Matrix-derived grammar. As the FST formalism can be applied in both directions, this approach also supports generative output of correct surface forms from the implemented grammar. This article describes the approach used to model Lakota verbal morphology using finite-state methods. It also discusses the results of developing a lexicon from existing text and evaluating its application to related but novel text. The analyzer presented here, along with its companion precision grammar, explores an approach that may have application in enabling machine translation for endangered and under-resourced languages.

pdf bib
A Wikipedia-based Corpus for Contextualized Machine Translation
Jennifer Drexler | Pushpendre Rastogi | Jacqueline Aguilar | Benjamin Van Durme | Matt Post

We describe a corpus for target-contextualized machine translation (MT), where the task is to improve the translation of source documents using language models built over presumably related documents in the target language. The idea presumes a situation where most of the information about a topic is in a foreign language, yet some related target-language information is known to exist. Our corpus comprises a set of curated English Wikipedia articles describing news events, along with (i) their Spanish counterparts and (ii) some of the Spanish source articles cited within them. In experiments, we translated these Spanish documents, treating the English articles as target-side context, and evaluate the effect on translation quality when including target-side language models built over this English context and interpolated with other, separately-derived language model data. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score.

pdf bib
Mapping WordNet Domains, WordNet Topics and Wikipedia Categories to Generate Multilingual Domain Specific Resources
Spandana Gella | Carlo Strapparava | Vivi Nastase

In this paper we present the mapping between WordNet domains and WordNet topics, and the emergent Wikipedia categories. This mapping leads to a coarse alignment between WordNet and Wikipedia, useful for producing domain-specific and multilingual corpora. Multilinguality is achieved through the cross-language links between Wikipedia categories. Research in word-sense disambiguation has shown that within a specific domain, relevant words have restricted senses. The multilingual, and comparable, domain-specific corpora we produce have the potential to enhance research in word-sense disambiguation and terminology extraction in different languages, which could enhance the performance of various NLP tasks.

pdf bib
Annotating Events in an Emotion Corpus
Sophia Lee | Shoushan Li | Chu-Ren Huang

This paper presents the development of a Chinese event-based emotion corpus. It specifically describes the corpus design, collection and annotation. The proposed annotation scheme provides a consistent way of identifying some emotion-associated events (namely pre-events and post-events). Corpus data show that there are significant interactions between emotions and pre-events as well as that of between emotion and post-events. We believe that emotion as a pivot event underlies an innovative approach towards a linguistic model of emotion as well as automatic emotion detection and classification.

pdf bib
Statistical Analysis of Multilingual Text Corpus and Development of Language Models
Shyam Sundar Agrawal | Abhimanue | Shweta Bansal | Minakshi Mahajan

This paper presents two studies, first a statistical analysis for three languages i.e. Hindi, Punjabi and Nepali and the other, development of language models for three Indian languages i.e. Indian English, Punjabi and Nepali. The main objective of this study is to find distinction among these languages and development of language models for their identification. Detailed statistical analysis have been done to compute the information about entropy, perplexity, vocabulary growth rate etc. Based on statistical features a comparative analysis has been done to find the similarities and differences among these languages. Subsequently an effort has been made to develop a trigram model of Indian English, Punjabi and Nepali. A corpus of 500000 words of each language has been collected and used to develop their models (unigram, bigram and trigram models). The models have been tried in two different databases- Parallel corpora of French and English and Non-parallel corpora of Indian English, Punjabi and Nepali. In the second case, the performance of the model is comparable. Usage of JAVA platform has provided a special effect for dealing with a very large database with high computational speed. Furthermore various enhancive concepts like Smoothing, Discounting, Back off, and Interpolation have been included for the designing of an effective model. The results obtained from this experiment have been described. The information can be useful for development of Automatic Speech Language Identification System.

pdf bib
New Directions for Language Resource Development and Distribution
Christopher Cieri | Denise DiPersio | Mark Liberman | Andrea Mazzucchi | Stephanie Strassel | Jonathan Wright

Despite the growth in the number of linguistic data centers around the world, their accomplishments and expansions and the advances they have help enable, the language resources that exist are a small fraction of those required to meet the goals of Human Language Technologies (HLT) for the world’s languages and the promises they offer: broad access to knowledge, direct communication across language boundaries and engagement in a global community. Using the Linguistic Data Consortium as a focus case, this paper sketches the progress of data centers, summarizes recent activities and then turns to several issues that have received inadequate attention and proposes some new approaches to their resolution.

pdf bib
Rediscovering 15 Years of Discoveries in Language Resources and Evaluation: The LREC Anthology Analysis
Joseph Mariani | Patrick Paroubek | Gil Francopoulo | Olivier Hamon

This paper aims at analyzing the content of the LREC conferences contained in the ELRA Anthology over the past 15 years (1998-2013). It follows similar exercises that have been conducted, such as the survey on the IEEE ICASSP conference series from 1976 to 1990, which served in the launching of the ESCA Eurospeech conference, a survey of the Association of Computational Linguistics (ACL) over 50 years of existence, which was presented at the ACL conference in 2012, or a survey over the 25 years (1987-2012) of the conferences contained in the ISCA Archive, presented at Interspeech 2013. It contains first an analysis of the evolution of the number of papers and authors over time, including the study of their gender, nationality and affiliation, and of the collaboration among authors. It then studies the funding sources of the research investigations that are reported in the papers. It conducts an analysis of the evolution of the research topics within the community over time. It finally looks at reuse and plagiarism in the papers. The survey shows the present trends in the conference series and in the Language Resources and Evaluation scientific community. Conducting this survey also demonstrated the importance of a clear and unique identification of authors, papers and other sources to facilitate the analysis. This survey is preliminary, as many other aspects also deserve attention. But we hope it will help better understanding and forging our community in the global village.

pdf bib
Annotating Question Decomposition on Complex Medical Questions
Kirk Roberts | Kate Masterton | Marcelo Fiszman | Halil Kilicoglu | Dina Demner-Fushman

This paper presents a method for annotating question decomposition on complex medical questions. The annotations cover multiple syntactic ways that questions can be decomposed, including separating independent clauses as well as recognizing coordinations and exemplifications. We annotate a corpus of 1,467 multi-sentence consumer health questions about genetic and rare diseases. Furthermore, we label two additional medical-specific annotations: (1) background sentences are annotated with a number of medical categories such as symptoms, treatments, and family history, and (2) the central focus of the complex question (a disease) is marked. We present simple baseline results for automatic classification of these annotations, demonstrating the challenging but important nature of this task.

pdf bib
Boosting Open Information Extraction with Noun-Based Relations
Clarissa Xavier | Vera Lima

Open Information Extraction (Open IE) is a strategy for learning relations from texts, regardless the domain and without predefining these relations. Work in this area has focused mainly on verbal relations. In order to extend Open IE to extract relationships that are not expressed by verbs, we present a novel Open IE approach that extracts relations expressed in noun compounds (NCs), such as (oil, extracted from, olive) from “olive oil”, or in adjective-noun pairs (ANs), such as (moon, that is, gorgeous) from “gorgeous moon”. The approach consists of three steps: detection of NCs and ANs, interpretation of these compounds in view of corpus enrichment and extraction of relations from the enriched corpus. To confirm the feasibility of this method we created a prototype and evaluated the impact of the application of our proposal in two state-of-the-art Open IE extractors. Based on these tests we conclude that the proposed approach is an important step to fulfil the gap concerning the extraction of relations within the noun compounds and adjective-noun pairs in Open IE.

pdf bib
Bootstrapping Open-Source English-Bulgarian Computational Dictionary
Krasimir Angelov

We present an open-source English-Bulgarian dictionary which is a unification and consolidation of existing and freely available resources for the two languages. The new resource can be used as either a pair of two monolingual morphological lexicons, or as a bidirectional translation dictionary between the languages. The structure of the resource is compatible with the existing synchronous English-Bulgarian grammar in Grammatical Framework (GF). This makes it possible to immediately plug it in as a component in a grammar-based translation system that is currently under development in the same framework. This also meant that we had to enrich the dictionary with additional syntactic and semantic information that was missing in the original resources.

pdf bib
MotàMot project: conversion of a French-Khmer published dictionary for building a multilingual lexical system
Mathieu Mangeot

Economic issues related to the information processing techniques are very important. The development of such technologies is a major asset for developing countries like Cambodia and Laos, and emerging ones like Vietnam, Malaysia and Thailand. The MotAMot project aims to computerize an under-resourced language: Khmer, spoken mainly in Cambodia. The main goal of the project is the development of a multilingual lexical system targeted for Khmer. The macrostructure is a pivot one with each word sense of each language linked to a pivot axi. The microstructure comes from a simplification of the explanatory and combinatory dictionary. The lexical system has been initialized with data coming mainly from the conversion of the French-Khmer bilingual dictionary of Denis Richer from Word to XML format. The French part was completed with pronunciation and parts-of-speech coming from the FeM French-english-Malay dictionary. The Khmer headwords noted in IPA in the Richer dictionary were converted to Khmer writing with OpenFST, a finite state transducer tool. The resulting resource is available online for lookup, editing, download and remote programming via a REST API on a Jibiki platform.

pdf bib
JUST.ASK, a QA system that learns to answer new questions from previous interactions
Sérgio Curto | Ana C. Mendes | Pedro Curto | Luísa Coheur | Ângela Costa

We present JUST.ASK, a publicly available Question Answering system, which is freely available. Its architecture is composed of the usual Question Processing, Passage Retrieval and Answer Extraction components. Several details on the information generated and manipulated by each of these components are also provided to the user when interacting with the demonstration. Since JUST.ASK also learns to answer new questions based on users’ feedback, (s)he is invited to identify the correct answers. These will then be used to retrieve answers to future questions.

pdf bib
Design and Development of an Online Computational Framework to Facilitate Language Comprehension Research on Indian Languages
Manjira Sinha | Tirthankar Dasgupta | Anupam Basu

In this paper we have developed an open-source online computational framework that can be used by different research groups to conduct reading researches on Indian language texts. The framework can be used to develop a large annotated Indian language text comprehension data from different user based experiments. The novelty in this framework lies in the fact that it brings different empirical data-collection techniques for text comprehension under one roof. The framework has been customized specifically to address language particularities for Indian languages. It will also offer many types of automatic analysis on the data at different levels such as full text, sentence and word level. To address the subjectivity of text difficulty perception, the framework allows to capture user background against multiple factors. The assimilated data can be automatically cross referenced against varying strata of readers.

pdf bib
The Nijmegen Corpus of Casual Czech
Mirjam Ernestus | Lucie Kočková-Amortová | Petr Pollak

This article introduces a new speech corpus, the Nijmegen Corpus of Casual Czech (NCCCz), which contains more than 30 hours of high-quality recordings of casual conversations in Common Czech, among ten groups of three male and ten groups of three female friends. All speakers were native speakers of Czech, raised in Prague or in the region of Central Bohemia, and were between 19 and 26 years old. Every group of speakers consisted of one confederate, who was instructed to keep the conversations lively, and two speakers naive to the purposes of the recordings. The naive speakers were engaged in conversations for approximately 90 minutes, while the confederate joined them for approximately the last 72 minutes. The corpus was orthographically annotated by experienced transcribers and this orthographic transcription was aligned with the speech signal. In addition, the conversations were videotaped. This corpus can form the basis for all types of research on casual conversations in Czech, including phonetic research and research on how to improve automatic speech recognition. The corpus will be freely available.

pdf bib
Modern Chinese Helps Archaic Chinese Processing: Finding and Exploiting the Shared Properties
Yan Song | Fei Xia

Languages change over time and ancient languages have been studied in linguistics and other related fields. A main challenge in this research area is the lack of empirical data; for instance, ancient spoken languages often leave little trace of their linguistic properties. From the perspective of natural language processing (NLP), while the NLP community has created dozens of annotated corpora, very few of them are on ancient languages. As an effort toward bridging the gap, we have created a word segmented and POS tagged corpus for Archaic Chinese using articles from Huainanzi, a book written during China’s Western Han Dynasty (206 BC-9 AD). We then compare this corpus with the Chinese Penn Treebank (CTB), a well-known corpus for Modern Chinese, and report several interesting differences and similarities between the two corpora. Finally, we demonstrate that the CTB can be used to improve the performance of word segmenters and POS taggers for Archaic Chinese, but only through features that have similar behaviors in the two corpora.

pdf bib
Digital Library 2.0: Source of Knowledge and Research Collaboration Platform
Włodzimierz Gruszczyński | Maciej Ogrodniczuk

Digital libraries are frequently treated just as a new method of storage of digitized artifacts, with all consequences of transferring long-established ways of dealing with physical objects into the digital world. Such attitude improves availability, but often neglects other opportunities offered by global and immediate access, virtuality and linking ― as easy as never before. The article presents the idea of transforming a conventional digital library into knowledge source and research collaboration platform, facilitating content augmentation, interpretation and co-operation of geographically distributed researchers representing different academic fields. This concept has been verified by the process of extending descriptions stored in thematic Digital Library of Polish and Poland-related Ephemeral Prints from the 16th, 17th and 18th Centuries with extended item-associated information provided by historians, philologists, librarians and computer scientists. It resulted in associating the customary fixed metadata and digitized content with historical comments, mini-dictionaries of foreign interjections or explanation of less-known background details.

pdf bib
Open-domain Interaction and Online Content in the Sami Language
Kristiina Jokinen

This paper presents data collection and collaborative community events organised within the project Digital Natives on the North Sami language. The project is one of the collaboration initiatives on endangered Finno-Ugric languages, supported by the larger framework between the Academy of Finland and the Hungarian Academy of Sciences. The goal of the project is to improve digital visibility and viability of the targeted Finno-Ugric languages, as well as to develop language technology tools and resources in order to assist automatic language processing and experimenting with multilingual interactive applications.

pdf bib
A Character-based Approach to Distributional Semantic Models: Exploiting Kanji Characters for Constructing JapaneseWord Vectors
Akira Utsumi

Many Japanese words are made of kanji characters, which themselves represent meanings. However traditional word-based distributional semantic models (DSMs) do not benefit from the useful semantic information of kanji characters. In this paper, we propose a method for exploiting the semantic information of kanji characters for constructing Japanese word vectors in DSMs. In the proposed method, the semantic representations of kanji characters (i.e, kanji vectors) are constructed first using the techniques of DSMs, and then word vectors are computed by combining the vectors of constituent kanji characters using vector composition methods. The evaluation experiment using a synonym identification task demonstrates that the kanji-based DSM achieves the best performance when a kanji-kanji matrix is weighted by positive pointwise mutual information and word vectors are composed by weighted multiplication. Comparison between kanji-based DSMs and word-based DSMs reveals that our kanji-based DSMs generally outperform latent semantic analysis, and also surpasses the best score word-based DSM for infrequent words comprising only frequent kanji characters. These findings clearly indicate that kanji-based DSMs are beneficial in improvement of quality of Japanese word vectors.

pdf bib
Cross-Language Authorship Attribution
Dasha Bogdanova | Angeliki Lazaridou

This paper presents a novel task of cross-language authorship attribution (CLAA), an extension of authorship attribution task to multilingual settings: given data labelled with authors in language X, the objective is to determine the author of a document written in language Y , where X is different from Y . We propose a number of cross-language stylometric features for the task of CLAA, such as those based on sentiment and emotional markers. We also explore an approach based on machine translation (MT) with both lexical and cross-language features. We experimentally show that MT could be used as a starting point to CLAA, since it allows good attribution accuracy to be achieved. The cross-language features provide acceptable accuracy while using jointly with MT, though do not outperform lexical features.

pdf bib
Using Transfer Learning to Assist Exploratory Corpus Annotation
Paul Felt | Eric Ringger | Kevin Seppi | Kristian Heal

We describe an under-studied problem in language resource management: that of providing automatic assistance to annotators working in exploratory settings. When no satisfactory tagset already exists, such as in under-resourced or undocumented languages, it must be developed iteratively while annotating data. This process naturally gives rise to a sequence of datasets, each annotated differently. We argue that this problem is best regarded as a transfer learning problem with multiple source tasks. Using part-of-speech tagging data with simulated exploratory tagsets, we demonstrate that even simple transfer learning techniques can significantly improve the quality of pre-annotations in an exploratory annotation.

pdf bib
ANCOR_Centre, a large free spoken French coreference corpus: description of the resource and reliability measures
Judith Muzerelle | Anaïs Lefeuvre | Emmanuel Schang | Jean-Yves Antoine | Aurore Pelletier | Denis Maurel | Iris Eshkol | Jeanne Villaneau

This article presents ANCOR_Centre, a French coreference corpus, available under the Creative Commons Licence. With a size of around 500,000 words, the corpus is large enough to serve the needs of data-driven approaches in NLP and represents one of the largest coreference resources currently available. The corpus focuses exclusively on spoken language, it aims at representing a certain variety of spoken genders. ANCOR_Centre includes anaphora as well as coreference relations which involve nominal and pronominal mentions. The paper describes into details the annotation scheme and the reliability measures computed on the resource.

pdf bib
Guampa: a Toolkit for Collaborative Translation
Alex Rudnick | Taylor Skidmore | Alberto Samaniego | Michael Gasser

Here we present Guampa, a new software package for online collaborative translation. This system grows out of our discussions with Guarani-language activists and educators in Paraguay, and attempts to address problems faced by machine translation researchers and by members of any community speaking an under-represented language. Guampa enables volunteers and students to work together to translate documents into heritage languages, both to make more materials available in those languages, and also to generate bitext suitable for training machine translation systems. While many approaches to crowdsourcing bitext corpora focus on Mechanical Turk and temporarily engaging anonymous workers, Guampa is intended to foster an online community in which discussions can take place, language learners can practice their translation skills, and complete documents can be translated. This approach is appropriate for the Spanish-Guarani language pair as there are many speakers of both languages, and Guarani has a dedicated activist community. Our goal is to make it easy for anyone to set up their own instance of Guampa and populate it with documents -- such as automatically imported Wikipedia articles -- to be translated for their particular language pair. Guampa is freely available and relatively easy to use.

pdf bib
Experiences with the ISOcat Data Category Registry
Daan Broeder | Ineke Schuurman | Menzo Windhouwer

The ISOcat Data Category Registry has been a joint project of both ISO TC 37 and the European CLARIN infrastructure. In this paper the experiences of using ISOcat in CLARIN are described and evaluated. This evaluation clarifies the requirements of CLARIN with regard to a semantic registry to support its semantic interoperability needs. A simpler model based on concepts instead of data cate-gories and a simpler workflow based on community recommendations will address these needs better and offer the required flexibility.

pdf bib
RELISH LMF: Unlocking the Full Power of the Lexical Markup Framework
Menzo Windhouwer | Justin Petro | Shakila Shayan

The Lexical Markup Framework (ISO 24613:2008) provides a core class diagram and various extensions as the basis for constructing lexical resources. Unfortunately the informative Document Type Definition provided by the standard and other available LMF serializations lack support for many of the powerful features of the model. This paper describes RELISH LMF, which unlocks the full power of the LMF model by providing a set of extensible modern schema modules. As use cases RELISH LL LMF and support by LEXUS, an online lexicon tool, are described.

pdf bib
Building a Corpus of Manually Revised Texts from Discourse Perspective
Ryu Iida | Takenobu Tokunaga

This paper presents building a corpus of manually revised texts which includes both before and after-revision information. In order to create such a corpus, we propose a procedure for revising a text from a discourse perspective, consisting of dividing a text to discourse units, organising and reordering groups of discourse units and finally modifying referring and connective expressions, each of which imposes limits on freedom of revision. Following the procedure, six revisers who have enough experience in either teaching Japanese or scoring Japanese essays revised 120 Japanese essays written by Japanese native speakers. Comparing the original and revised texts, we found some specific manual revisions frequently occurred between the original and revised texts, e.g. ‘thesis’ statements were frequently placed at the beginning of a text. We also evaluate text coherence using the original and revised texts on the task of pairwise information ordering, identifying a more coherent text. The experimental results using two text coherence models demonstrated that the two models did not outperform the random baseline.

pdf bib
The CMD Cloud
Matej Ďurčo | Menzo Windhouwer

The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records. In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer: the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metadata modellers, for metadata curation tasks as well as for general audience seeking for a ‘big picture’ of the complex CMD data domain.

pdf bib
Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics
Lars Borin | Anju Saxena | Taraka Rama | Bernard Comrie

Like many other research fields, linguistics is entering the age of big data. We are now at a point where it is possible to see how new research questions can be formulated - and old research questions addressed from a new angle or established results verified - on the basis of exhaustive collections of data, rather than small, carefully selected samples. For example, South Asia is often mentioned in the literature as a classic example of a linguistic area, but there is no systematic, empirical study substantiating this claim. Examination of genealogical and areal relationships among South Asian languages requires a large-scale quantitative and qualitative comparative study, encompassing more than one language family. Further, such a study cannot be conducted manually, but needs to draw on extensive digitized language resources and state-of-the-art computational tools. We present some preliminary results of our large-scale investigation of the genealogical and areal relationships among the languages of this region, based on the linguistic descriptions available in the 19 tomes of Grierson’s monumental “Linguistic Survey of India” (1903-1927), which is currently being digitized with the aim of turning the linguistic information in the LSI into a digital language resource suitable for a broad array of linguistic investigations.

pdf bib
Linguistic Evaluation of Support Verb Constructions by OpenLogos and Google Translate
Anabela Barreiro | Johanna Monti | Brigitte Orliac | Susanne Preuß | Kutz Arrieta | Wang Ling | Fernando Batista | Isabel Trancoso

This paper presents a systematic human evaluation of translations of English support verb constructions produced by a rule-based machine translation (RBMT) system (OpenLogos) and a statistical machine translation (SMT) system (Google Translate) for five languages: French, German, Italian, Portuguese and Spanish. We classify support verb constructions by means of their syntactic structure and semantic behavior and present a qualitative analysis of their translation errors. The study aims to verify how machine translation (MT) systems translate fine-grained linguistic phenomena, and how well-equipped they are to produce high-quality translation. Another goal of the linguistically motivated quality analysis of SVC raw output is to reinforce the need for better system hybridization, which leverages the strengths of RBMT to the benefit of SMT, especially in improving the translation of multiword units. Taking multiword units into account, we propose an effective method to achieve MT hybridization based on the integration of semantico-syntactic knowledge into SMT.

pdf bib
Single-Person and Multi-Party 3D Visualizations for Nonverbal Communication Analysis
Michael Kipp | Levin Freiherr von Hollen | Michael Christopher Hrstka | Franziska Zamponi

The qualitative analysis of nonverbal communication is more and more relying on 3D recording technology. However, the human analysis of 3D data on a regular 2D screen can be challenging as 3D scenes are difficult to visually parse. To optimally exploit the full depth of the 3D data, we propose to enhance the 3D view with a number of visualizations that clarify spatial and conceptual relationships and add derived data like speed and angles. In this paper, we present visualizations for directional body motion, hand movement direction, gesture space location, and proxemic dimensions like interpersonal distance, movement and orientation. The proposed visualizations are available in the open source tool JMocap and are planned to be fully integrated into the ANVIL video annotation tool. The described techniques are intended to make annotation more efficient and reliable and may allow the discovery of entirely new phenomena.

pdf bib
Collection of a Simultaneous Translation Corpus for Comparative Analysis
Hiroaki Shimizu | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura

This paper describes the collection of an English-Japanese/Japanese-English simultaneous interpretation corpus. There are two main features of the corpus. The first is that professional simultaneous interpreters with different amounts of experience cooperated with the collection. By comparing data from simultaneous interpretation of each interpreter, it is possible to compare better interpretations to those that are not as good. The second is that for part of our corpus there are already translation data available. This makes it possible to compare translation data with simultaneous interpretation data. We recorded the interpretations of lectures and news, and created time-aligned transcriptions. A total of 387k words of transcribed data were collected. The corpus will be helpful to analyze differences in interpretations styles and to construct simultaneous interpretation systems.

pdf bib
The AV-LASYN Database : A synchronous corpus of audio and 3D facial marker data for audio-visual laughter synthesis
Hüseyin Çakmak | Jérôme Urbain | Thierry Dutoit | Joëlle Tilmanne

A synchronous database of acoustic and 3D facial marker data was built for audio-visual laughter synthesis. Since the aim is to use this database for HMM-based modeling and synthesis, the amount of collected data from one given subject had to be maximized. The corpus contains 251 utterances of laughter from one male participant. Laughter was elicited with the help of humorous videos. The resulting database is synchronous between modalities (audio and 3D facial motion capture data). Visual 3D data is available in common formats such as BVH and C3D with head motion and facial deformation independently available. Data is segmented and audio has been annotated. Phonetic transcriptions are available in the HTK-compatible format. Principal component analysis has been conducted on visual data and has shown that a dimensionality reduction might be relevant. The corpus may be obtained under a research license upon request to authors.

pdf bib
Interoperability of Dialogue Corpora through ISO 24617-2-based Querying
Volha Petukhova | Andrei Malchanau | Harry Bunt

This paper explores a way of achieving interoperability: developing a query format for accessing existing annotated corpora whose expressions make use of the annotation language defined by the standard. The interpretation of expressions in the query implements a mapping from ISO 24617-2 concepts to those of the annotation scheme used in the corpus. We discuss two possible ways to query existing annotated corpora using DiAML. One way is to transform corpora into DiAML compliant format, and subsequently query these data using XQuery or XPath. The second approach is to define a DiAML query that can be directly used to retrieve requested information from the annotated data. Both approaches are valid. The first one presents a standard way of querying XML data. The second approach is a DiAML-oriented querying of dialogue act annotated data, for which we designed an interface. The proposed approach is tested on two important types of existing dialogue corpora: spoken two-person dialogue corpora collected and annotated within the HCRC Map Task paradigm, and multiparty face-to-face dialogues of the AMI corpus. We present the results and evaluate them with respect to accuracy and completeness through statistical comparisons between retrieved and manually constructed reference annotations.

pdf bib
ASR-based CALL systems and learner speech data: new resources and opportunities for research and development in second language learning
Catia Cucchiarini | Steve Bodnar | Bart Penning de Vries | Roeland van Hout | Helmer Strik

In this paper we describe the language resources developed within the project “Feedback and the Acquisition of Syntax in Oral Proficiency” (FASOP), which is aimed at investigating the effectiveness of various forms of practice and feedback on the acquisition of syntax in second language (L2) oral proficiency, as well as their interplay with learner characteristics such as education level, learner motivation and confidence. For this purpose, use is made of a Computer Assisted Language Learning (CALL) system that employs Automatic Speech Recognition (ASR) technology to allow spoken interaction and to create an experimental environment that guarantees as much control over the language learning setting as possible. The focus of the present paper is on the resources that are being produced in FASOP. In line with the theme of this conference, we present the different types of resources developed within this project and the way in which these could be used to pursue innovative research in second language acquisition and to develop and improve ASR-based language learning applications.

pdf bib
The DBOX Corpus Collection of Spoken Human-Human and Human-Machine Dialogues
Volha Petukhova | Martin Gropp | Dietrich Klakow | Gregor Eigner | Mario Topf | Stefan Srb | Petr Motlicek | Blaise Potard | John Dines | Olivier Deroo | Ronny Egeler | Uwe Meinz | Steffen Liersch | Anna Schmidt

This paper describes the data collection and annotation carried out within the DBOX project ( Eureka project, number E! 7152). This project aims to develop interactive games based on spoken natural language human-computer dialogues, in 3 European languages: English, German and French. We collect the DBOX data continuously. We first start with human-human Wizard of Oz experiments to collect human-human data in order to model natural human dialogue behaviour, for better understanding of phenomena of human interactions and predicting interlocutors actions, and then replace the human Wizard by an increasingly advanced dialogue system, using evaluation data for system improvement. The designed dialogue system relies on a Question-Answering (QA) approach, but showing truly interactive gaming behaviour, e.g., by providing feedback, managing turns and contact, producing social signals and acts, e.g., encouraging vs. downplaying, polite vs. rude, positive vs. negative attitude towards players or their actions, etc. The DBOX dialogue corpus has required substantial investment. We expect it to have a great impact on the rest of the project. The DBOX project consortium will continue to maintain the corpus and to take an interest in its growth, e.g., expand to other languages. The resulting corpus will be publicly released.

pdf bib
The Database for Spoken German — DGD2
Thomas Schmidt

The Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD2, is the central platform for publishing and disseminating spoken language corpora from the Archive of Spoken German (Archiv für Gesprochenes Deutsch, AGD, at the Institute for the German Language in Mannheim. The corpora contained in the DGD2 come from a variety of sources, some of them in-house projects, some of them external projects. Most of the corpora were originally intended either for research into the (dialectal) variation of German or for studies in conversation analysis and related fields. The AGD has taken over the task of permanently archiving these resources and making them available for reuse to the research community. To date, the DGD2 offers access to 19 different corpora, totalling around 9000 speech events, 2500 hours of audio recordings or 8 million transcribed words. This paper gives an overview of the data made available via the DGD2, of the technical basis for its implementation, and of the most important functionalities it offers. The paper concludes with information about the users of the database and future plans for its development.

pdf bib
Building a Dataset of Multilingual Cognates for the Romanian Lexicon
Liviu Dinu | Alina Maria Ciobanu

Identifying cognates is an interesting task with applications in numerous research areas, such as historical and comparative linguistics, language acquisition, cross-lingual information retrieval, readability and machine translation. We propose a dictionary-based approach to identifying cognates based on etymology and etymons. We account for relationships between languages and we extract etymology-related information from electronic dictionaries. We employ the dataset of cognates that we obtain as a gold standard for evaluating to which extent orthographic methods can be used to detect cognate pairs. The question that arises is whether they are able to discriminate between cognates and non-cognates, given the orthographic changes undergone by foreign words when entering new languages. We investigate some orthographic approaches widely used in this research area and some original metrics as well. We run our experiments on the Romanian lexicon, but the method we propose is adaptable to any language, as far as resources are available.

pdf bib
Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web
Giuseppe Rizzo | Marieke van Erp | Raphaël Troncy

Named entity recognition and disambiguation are of primary importance for extracting information and for populating knowledge bases. Detecting and classifying named entities has traditionally been taken on by the natural language processing community, whilst linking of entities to external resources, such as those in DBpedia, has been tackled by the Semantic Web community. As these tasks are treated in different communities, there is as yet no oversight on the performance of these tasks combined. We present an approach that combines the state-of-the art from named entity recognition in the natural language processing domain and named entity linking from the semantic web community. We report on experiments and results to gain more insights into the strengths and limitations of current approaches on these tasks. Our approach relies on the numerous web extractors supported by the NERD framework, which we combine with a machine learning algorithm to optimize recognition and linking of named entities. We test our approach on four standard data sets that are composed of two diverse text types, namely newswire and microposts.

pdf bib
Automatic Expansion of the MRC Psycholinguistic Database Imageability Ratings
Ting Liu | Kit Cho | G. Aaron Broadwell | Samira Shaikh | Tomek Strzalkowski | John Lien | Sarah Taylor | Laurie Feldman | Boris Yamrom | Nick Webb | Umit Boz | Ignacio Cases | Ching-sheng Lin

Recent studies in metaphor extraction across several languages (Broadwell et al., 2013; Strzalkowski et al., 2013) have shown that word imageability ratings are highly correlated with the presence of metaphors in text. Information about imageability of words can be obtained from the MRC Psycholinguistic Database (MRCPD) for English words and Léxico Informatizado del Español Programa (LEXESP) for Spanish words, which is a collection of human ratings obtained in a series of controlled surveys. Unfortunately, word imageability ratings were collected for only a limited number of words: 9,240 words in English, 6,233 in Spanish; and are unavailable at all in the other two languages studied: Russian and Farsi. The present study describes an automated method for expanding the MRCPD by conferring imageability ratings over the synonyms and hyponyms of existing MRCPD words, as identified in Wordnet. The result is an expanded MRCPD+ database with imagea-bility scores for more than 100,000 words. The appropriateness of this expansion process is assessed by examining the structural coherence of the expanded set and by validating the expanded lexicon against human judgment. Finally, the performance of the metaphor extraction system is shown to improve significantly with the expanded database. This paper describes the process for English MRCPD+ and the resulting lexical resource. The process is analogous for other languages.

pdf bib
Towards building a Kashmiri Treebank: Setting up the Annotation Pipeline
Riyaz Ahmad Bhat | Shahid Mushtaq Bhat | Dipti Misra Sharma

Kashmiri is a resource poor language with very less computational and language resources available for its text processing. As the main contribution of this paper, we present an initial version of the Kashmiri Dependency Treebank. The treebank consists of 1,000 sentences (17,462 tokens), annotated with part-of-speech (POS), chunk and dependency information. The treebank has been manually annotated using the Paninian Computational Grammar (PCG) formalism (Begum et al., 2008; Bharati et al., 2009). This version of Kashmiri treebank is an extension of its earlier verion of 500 sentences (Bhat, 2012), a pilot experiment aimed at defining the annotation guidelines on a small subset of Kashmiri corpora. In this paper, we have refined the guidelines with some significant changes and have carried out inter-annotator agreement studies to ascertain its quality. We also present a dependency parsing pipeline, consisting of a tokenizer, a stemmer, a POS tagger, a chunker and an inter-chunk dependency parser. It, therefore, constitutes the first freely available, open source dependency parser of Kashmiri, setting the initial baseline for Kashmiri dependency parsing.

pdf bib
SenTube: A Corpus for Sentiment Analysis on YouTube Social Media
Olga Uryupina | Barbara Plank | Aliaksei Severyn | Agata Rotondi | Alessandro Moschitti

In this paper we present SenTube -- a dataset of user-generated comments on YouTube videos annotated for information content and sentiment polarity. It contains annotations that allow to develop classifiers for several important NLP tasks: (i) sentiment analysis, (ii) text categorization (relatedness of a comment to video and/or product), (iii) spam detection, and (iv) prediction of comment informativeness. The SenTube corpus favors the development of research on indexing and searching YouTube videos exploiting information derived from comments. The corpus will cover several languages: at the moment, we focus on English and Italian, with Spanish and Dutch parts scheduled for the later stages of the project. For all the languages, we collect videos for the same set of products, thus offering possibilities for multi- and cross-lingual experiments. The paper provides annotation guidelines, corpus statistics and annotator agreement details.

pdf bib
CIEMPIESS: A New Open-Sourced Mexican Spanish Radio Corpus
Carlos Daniel Hernandez Mena | Abel Herrera Camacho

Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social” (CIEMPIESS) is a new open-sourced corpus extracted from Spanish spoken FM podcasts in the dialect of the center of Mexico. The CIEMPIESS corpus was designed to be used in the field of automatic speech recongnition (ASR) and it is provided with two different kind of pronouncing dictionaries, one of them containing the phonemes of Mexican Spanish and the other containing this same phonemes plus allophones. Corpus annotation took into account the tonic vowel of every word and the four different sounds that letter “x” presents in the Spanish language. CIEMPIESS corpus is also provided with two different language models extracted from electronic newsletters, one of them takes into account the tonic vowels but not the other one. Both the dictionaries and the language models allow users to experiment different scenarios for the recognition task in order to adequate the corpus to their needs.

pdf bib
SAVAS: Collecting, Annotating and Sharing Audiovisual Language Resources for Automatic Subtitling
Arantza del Pozo | Carlo Aliprandi | Aitor Álvarez | Carlos Mendes | Joao P. Neto | Sérgio Paulo | Nicola Piccinini | Matteo Raffaelli

This paper describes the data collection, annotation and sharing activities carried out within the FP7 EU-funded SAVAS project. The project aims to collect, share and reuse audiovisual language resources from broadcasters and subtitling companies to develop large vocabulary continuous speech recognisers in specific domains and new languages, with the purpose of solving the automated subtitling needs of the media industry.

pdf bib
Exploring and Visualizing Variation in Language Resources
Peter Fankhauser | Jörg Knappen | Elke Teich

Language resources are often compiled for the purpose of variational analysis, such as studying differences between genres, registers, and disciplines, regional and diachronic variation, influence of gender, cultural context, etc. Often the sheer number of potentially interesting contrastive pairs can get overwhelming due to the combinatorial explosion of possible combinations. In this paper, we present an approach that combines well understood techniques for visualization heatmaps and word clouds with intuitive paradigms for exploration drill down and side by side comparison to facilitate the analysis of language variation in such highly combinatorial situations. Heatmaps assist in analyzing the overall pattern of variation in a corpus, and word clouds allow for inspecting variation at the level of words.

pdf bib
Simple Effective Microblog Named Entity Recognition: Arabic as an Example
Kareem Darwish | Wei Gao

Despite many recent papers on Arabic Named Entity Recognition (NER) in the news domain, little work has been done on microblog NER. NER on microblogs presents many complications such as informality of language, shortened named entities, brevity of expressions, and inconsistent capitalization (for cased languages). We introduce simple effective language-independent approaches for improving NER on microblogs, based on using large gazetteers, domain adaptation, and a two-pass semi-supervised method. We use Arabic as an example language to compare the relative effectiveness of the approaches and when best to use them. We also present a new dataset for the task. Results of combining the proposed approaches show an improvement of 35.3 F-measure points over a baseline system trained on news data and an improvement of 19.9 F-measure points over the same system but trained on microblog data.

pdf bib
Priberam Compressive Summarization Corpus: A New Multi-Document Summarization Corpus for European Portuguese
Miguel B. Almeida | Mariana S. C. Almeida | André F. T. Martins | Helena Figueira | Pedro Mendes | Cláudia Pinto

In this paper, we introduce the Priberam Compressive Summarization Corpus, a new multi-document summarization corpus for European Portuguese. The corpus follows the format of the summarization corpora for English in recent DUC and TAC conferences. It contains 80 manually chosen topics referring to events occurred between 2010 and 2013. Each topic contains 10 news stories from major Portuguese newspapers, radio and TV stations, along with two human generated summaries up to 100 words. Apart from the language, one important difference from the DUC/TAC setup is that the human summaries in our corpus are \emph{compressive}: the annotators performed only sentence and word deletion operations, as opposed to generating summaries from scratch. We use this corpus to train and evaluate learning-based extractive and compressive summarization systems, providing an empirical comparison between these two approaches. The corpus is made freely available in order to facilitate research on automatic summarization.

pdf bib
Hope and Fear: How Opinions Influence Factuality
Chantal van Son | Marieke van Erp | Antske Fokkens | Piek Vossen

Both sentiment and event factuality are fundamental information levels for our understanding of events mentioned in news texts. Most research so far has focused on either modeling opinions or factuality. In this paper, we propose a model that combines the two for the extraction and interpretation of perspectives on events. By doing so, we can explain the way people perceive changes in (their belief of) the world as a function of their fears of changes to the bad or their hopes of changes to the good. This study seeks to examine the effectiveness of this approach by applying factuality annotations, based on FactBank, on top of the MPQA Corpus, a corpus containing news texts annotated for sentiments and other private states. Our findings suggest that this approach can be valuable for the understanding of perspectives, but that there is still some work to do on the refinement of the integration.

pdf bib
Linking Pictographs to Synsets: Sclera2Cornetto
Vincent Vandeghinste | Ineke Schuurman

Social inclusion of people with Intellectual and Developmental Disabilities can be promoted by offering them ways to independently use the internet. People with reading or writing disabilities can use pictographs instead of text. We present a resource in which we have linked a set of 5710 pictographs to lexical-semantic concepts in Cornetto, a Wordnet-like database for Dutch. We show that, by using this resource in a text-to-pictograph translation system, we can greatly improve the coverage comparing with a baseline where words are converted into pictographs only if the word equals the filename.

pdf bib
Characterizing and Predicting Bursty Events: The Buzz Case Study on Twitter
Mohamed Morchid | Georges Linarès | Richard Dufour

The prediction of bursty events on the Internet is a challenging task. Difficulties are due to the diversity of information sources, the size of the Internet, dynamics of popularity, user behaviors... On the other hand, Twitter is a structured and limited space. In this paper, we present a new method for predicting bursty events using content-related indices. Prediction is performed by a neural network that combines three features in order to predict the number of retweets of a tweet on the Twitter platform. The indices are related to popularity, expressivity and singularity. Popularity index is based on the analysis of RSS streams. Expressivity uses a dictionary that contains words annotated in terms of expressivity load. Singularity represents outlying topic association estimated via a Latent Dirichlet Allocation (LDA) model. Experiments demonstrate the effectiveness of the proposal with a 72% F-measure prediction score for the tweets that have been forwarded at least 60 times.

pdf bib
Information Extraction from German Patient Records via Hybrid Parsing and Relation Extraction Strategies
Hans-Ulrich Krieger | Christian Spurk | Hans Uszkoreit | Feiyu Xu | Yi Zhang | Frank Müller | Thomas Tolxdorff

In this paper, we report on first attempts and findings to analyzing German patient records, using a hybrid parsing architecture and a combination of two relation extraction strategies. On a practical level, we are interested in the extraction of concepts and relations among those concepts, a necessary cornerstone for building medical information systems. The parsing pipeline consists of a morphological analyzer, a robust chunk parser adapted to Latin phrases used in medical diagnosis, a repair rule stage, and a probabilistic context-free parser that respects the output from the chunker. The relation extraction stage is a combination of two systems: SProUT, a shallow processor which uses hand-written rules to discover relation instances from local text units and DARE which extracts relation instances from complete sentences, using rules that are learned in a bootstrapping process, starting with semantic seeds. Two small experiments have been carried out for the parsing pipeline and the relation extraction stage.

pdf bib
The MMASCS multi-modal annotated synchronous corpus of audio, video, facial motion and tongue motion data of normal, fast and slow speech
Dietmar Schabus | Michael Pucher | Phil Hoole

In this paper, we describe and analyze a corpus of speech data that we have recorded in multiple modalities simultaneously: facial motion via optical motion capturing, tongue motion via electro-magnetic articulography, as well as conventional video and high-quality audio. The corpus consists of 320 phonetically diverse sentences uttered by a male Austrian German speaker at normal, fast and slow speaking rate. We analyze the influence of speaking rate on phone durations and on tongue motion. Furthermore, we investigate the correlation between tongue and facial motion. The data corpus is available free of charge for research use, including phonetic annotations and a playback software which visualizes the 3D data, from the website

pdf bib
Genres in the Prague Discourse Treebank
Lucie Poláková | Pavlína Jínová | Jiří Mírovský

We present the project of classification of Prague Discourse Treebank documents (Czech journalistic texts) for their genres. Our main interest lies in opening the possibility to observe how text coherence is realized in different types (in the genre sense) of language data and, in the future, in exploring the ways of using genres as a feature for multi-sentence-level language technologies. In the paper, we first describe the motivation and the concept of the genre annotation, and briefly introduce the Prague Discourse Treebank. Then, we elaborate on the process of manual annotation of genres in the treebank, from the annotators’ manual work to post-annotation checks and to the inter-annotator agreement measurements. The annotated genres are subsequently analyzed together with discourse relations (already annotated in the treebank) ― we present distributions of the annotated genres and results of studying distinctions of distributions of discourse relations across the individual genres.

pdf bib
Speech Recognition Web Services for Dutch
Joris Pelemans | Kris Demuynck | Hugo Van hamme | Patrick Wambacq

In this paper we present 3 applications in the domain of Automatic Speech Recognition for Dutch, all of which are developed using our in-house speech recognition toolkit SPRAAK. The speech-to-text transcriber is a large vocabulary continuous speech recognizer, optimized for Southern Dutch. It is capable to select components and adjust parameters on the fly, based on the observed conditions in the audio and was recently extended with the capability of adding new words to the lexicon. The grapheme-to-phoneme converter generates possible pronunciations for Dutch words, based on lexicon lookup and linguistic rules. The speech-text alignment system takes audio and text as input and constructs a time aligned output where every word receives exact begin and end times. All three of the applications (and others) are freely available, after registration, as a web application on and in addition, can be accessed as a web service in automated tools.

pdf bib
Translation errors from English to Portuguese: an annotated corpus
Angela Costa | Tiago Luís | Luísa Coheur

Analysing the translation errors is a task that can help us finding and describing translation problems in greater detail, but can also suggest where the automatic engines should be improved. Having these aims in mind we have created a corpus composed of 150 sentences, 50 from the TAP magazine, 50 from a TED talk and the other 50 from the from the TREC collection of factoid questions. We have automatically translated these sentences from English into Portuguese using Google Translate and Moses. After we have analysed the errors and created the error annotation taxonomy, the corpus was annotated by a linguist native speaker of Portuguese. Although Google’s overall performance was better in the translation task (we have also calculated the BLUE and NIST scores), there are some error types that Moses was better at coping with, specially discourse level errors.

pdf bib
Amazigh Verb Conjugator
Fadoua Ataa Allah | Siham Boulaknadel

With the aim of preserving the Amazigh heritage from being threatened with disappearance, it seems suitable to provide Amazigh with required resources to confront the stakes of access to the domain of New Information and Communication Technologies (ICT). In this context and in the perspective to build linguistic resources and natural language processing tools for this language, we have undertaken to develop an online conjugating tool that generates the inflectional forms of the Amazigh verbs. This tool is based on novel linguistically motivated morphological rules describing the verbal paradigm for all the Moroccan Amazigh varieties. Furthermore, it is based on the notion of morphological tree structure and uses transformational rules which are attached to the leaf nodes. Each rule may have numerous mutually exclusive clauses, where each part of a clause is a regular expression pattern that is matched against the radical pattern. This tool is an interactive conjugator that provides exhaustive coverage of linguistically accurate conjugation paradigms for over 3584 Armazigh verbs. It has been made simple and easy to use and designed from the ground up to be a highly effective learning aid that stimulates a desire to learn.

pdf bib
The liability of service providers in e-Research Infrastructures: killing the messenger?
Pawel Kamocki

Hosting Providers play an essential role in the development of Internet services such as e-Research Infrastructures. In order to promote the development of such services, legislators on both sides of the Atlantic Ocean introduced “safe harbour” provisions to protect Service Providers (a category which includes Hosting Providers) from legal claims (e.g. of copyright infringement). Relevant provisions can be found in § 512 of the United States Copyright Act and in art. 14 of the Directive 2000/31/EC (and its national implementations). The cornerstone of this framework is the passive role of the Hosting Provider through which he has no knowledge of the content that he hosts. With the arrival of Web 2.0, however, the role of Hosting Providers on the Internet changed; this change has been reflected in court decisions that have reached varying conclusions in the last few years. The purpose of this article is to present the existing framework (including recent case law from the US, Germany and France).

pdf bib
Adapting VerbNet to French using existing resources
Quentin Pradet | Laurence Danlos | Gaël de Chalendar

VerbNet is an English lexical resource for verbs that has proven useful for English NLP due to its high coverage and coherent classification. Such a resource doesn’t exist for other languages, despite some (mostly automatic and unsupervised) attempts. We show how to semi-automatically adapt VerbNet using existing resources designed for different purposes. This study focuses on French and uses two French resources: a semantic lexicon (Les Verbes Français) and a syntactic lexicon (Lexique-Grammaire).

pdf bib
English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling
Sharid Loáiciga | Thomas Meyer | Andrei Popescu-Belis

This paper presents a method for verb phrase (VP) alignment in an English-French parallel corpus and its use for improving statistical machine translation (SMT) of verb tenses. The method starts from automatic word alignment performed with GIZA++, and relies on a POS tagger and a parser, in combination with several heuristics, in order to identify non-contiguous components of VPs, and to label the aligned VPs with their tense and voice on each side. This procedure is applied to the Europarl corpus, leading to the creation of a smaller, high-precision parallel corpus with about 320,000 pairs of finite VPs, which is made publicly available. This resource is used to train a tense predictor for translation from English into French, based on a large number of surface features. Three MT systems are compared: (1) a baseline phrase-based SMT; (2) a tense-aware SMT system using the above predictions within a factored translation model; and (3) a system using oracle predictions from the aligned VPs. For several tenses, such as the French “imparfait”, the tense-aware SMT system improves significantly over the baseline and is closer to the oracle system.

pdf bib
The evolving infrastructure for language resources and the role for data scientists
Nelleke Oostdijk | Henk van den Heuvel

In the context of ongoing developments as regards the creation of a sustainable, interoperable language resource infrastructure and spreading ideas of the need for open access, not only of research publications but also of the underlying data, various issues present themselves which require that different stakeholders reconsider their positions. In the present paper we relate the experiences from the CLARIN-NL data curation service (DCS) over the two years that it has been operational, and the future role we envisage for expertise centres like the DCS in the evolving infrastructure.

pdf bib
A New Form of Humor — Mapping Constraint-Based Computational Morphologies to a Finite-State Representation
Attila Novák

MorphoLogic’s Humor morphological analyzer engine has been used for the development of several high-quality computational morphologies, among them ones for complex agglutinative languages. However, Humor’s closed source licensing scheme has been an obstacle to making these resources widely available. Moreover, there are other limitations of the rule-based Humor engine: lack of support for morphological guessing and for the integration of frequency information or other weighting of the models. These problems were solved by converting the databases to a finite-state representation that allows for morphological guessing and the addition of weights. Moreover, it has open-source implementations.

pdf bib
SLMotion - An extensible sign language oriented video analysis tool
Matti Karppa | Ville Viitaniemi | Marcos Luzardo | Jorma Laaksonen | Tommi Jantunen

We present a software toolkit called SLMotion which provides a framework for automatic and semiautomatic analysis, feature extraction and annotation of individual sign language videos, and which can easily be adapted to batch processing of entire sign language corpora. The program follows a modular design, and exposes a Numpy-compatible Python application programming interface that makes it easy and convenient to extend its functionality through scripting. The program includes support for exporting the annotations in ELAN format. The program is released as free software, and is available for GNU/Linux and MacOS platforms.

pdf bib
Constructing a Chinese—Japanese Parallel Corpus from Wikipedia
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi

Parallel corpora are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese―Japanese. As comparable corpora are far more available, many studies have been conducted to automatically construct parallel corpora from comparable corpora. This paper presents a robust parallel sentence extraction system for constructing a Chinese―Japanese parallel corpus from Wikipedia. The system is inspired by previous studies that mainly consist of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve the system by using the common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than the previous studies for both accuracy in parallel sentence extraction and SMT performance. Using the system, we construct a Chinese―Japanese parallel corpus with more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at

pdf bib
CFT13: A resource for research into the post-editing process
Michael Carl | Mercedes Martínez García | Bartolomé Mesa-Lao

This paper describes the most recent dataset that has been added to the CRITT Translation Process Research Database (TPR-DB). Under the name CFT13, this new study contains user activity data (UAD) in the form of key-logging and eye-tracking collected during the second CasMaCat field trial in June 2013. The CFT13 is a publicly available resource featuring a number of simple and compound process and product units suited to investigate human-computer interaction while post-editing machine translation outputs.

pdf bib
Mörkum Njálu. An annotated corpus to analyse and explain grammatical divergences between 14th-century manuscripts of Njál’s saga.
Ludger Zeevaert

The work of the research project “Variance of Njáls saga” at the Árni Magnússon Institute for Icelandic Studies in Reykjavík relies mainly on an annotated XML-corpus of manuscripts of Brennu-Njáls saga or ‘The Story of Burnt Njál’, an Icelandic prose narrative from the end of the 13th century. One part of the project is devoted to linguistic variation in the earliest transmission of the text in parchment manuscripts and fragments from the 14th century. The article gives a short overview over the design of the corpus that has to serve quite different purposes from palaeographic over stemmatological to literary research. It focuses on features important for the analysis of certain linguistic variables and the challenge lying in their implementation in a corpus consisting of close transcriptions of medieval manuscripts and gives examples for the use of the corpus for linguistic research in the frame of the project that mainly consists of the analysis of different grammatical/syntactic constructions that are often referred to in connection with stylistic research (narrative inversion, historical present tense, indirect-speech constructions).

pdf bib
A Crowdsourcing Smartphone Application for Swiss German: Putting Language Documentation in the Hands of the Users
Jean-Philippe Goldman | Adrian Leeman | Marie-José Kolly | Ingrid Hove | Ibrahim Almajai | Volker Dellwo | Steven Moran

This contribution describes an on-going projects a smartphone application called Voice Ãpp, which is a follow-up of a previous application called Dialäkt Ãpp. The main purpose of both apps is to identify the user’s Swiss German dialect on the basis of the dialectal variations of 15 words. The result is returned as one or more geographical points on a map. In Dialäkt Ãpp, launched in 2013, the user provides his or her own pronunciation through buttons, while the Voice Ãpp, currently in development, asks users to pronounce the word and uses speech recognition techniques to identify the variants and localize the user. This second app is more challenging from a technical point of view but nevertheless recovers the nature of dialect variation of spoken language. Besides, the Voice Ãpp takes its users on a journey in which they explore the individuality of their own voices, answering questions such as: How high is my voice? How fast do I speak? Do I speak faster than users in the neighbouring city?

pdf bib
ClearTK 2.0: Design Patterns for Machine Learning in UIMA
Steven Bethard | Philip Ogren | Lee Becker

ClearTK adds machine learning functionality to the UIMA framework, providing wrappers to popular machine learning libraries, a rich feature extraction library that works across different classifiers, and utilities for applying and evaluating machine learning models. Since its inception in 2008, ClearTK has evolved in response to feedback from developers and the community. This evolution has followed a number of important design principles including: conceptually simple annotator interfaces, readable pipeline descriptions, minimal collection readers, type system agnostic code, modules organized for ease of import, and assisting user comprehension of the complex UIMA framework.

pdf bib
A Conventional Orthography for Tunisian Arabic
Inès Zribi | Rahma Boujelbane | Abir Masmoudi | Mariem Ellouze | Lamia Belguith | Nizar Habash

Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum dominated by mixed forms. In this paper, we present a conventional orthography for Tunisian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA) demonstrated for Egyptian Arabic. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Tunisian Arabic.

pdf bib
Creating a massively parallel Bible corpus
Thomas Mayer | Michael Cysouw

We present our ongoing effort to create a massively parallel Bible corpus. While an ever-increasing number of Bible translations is available in electronic form on the internet, there is no large-scale parallel Bible corpus that allows language researchers to easily get access to the texts and their parallel structure for a large variety of different languages. We report on the current status of the corpus, with over 900 translations in more than 830 language varieties. All translations are tokenized (e.g., separating punctuation marks) and Unicode normalized. Mainly due to copyright restrictions only portions of the texts are made publicly available. However, we provide co-occurrence information for each translation in a (sparse) matrix format. All word forms in the translation are given together with their frequency and the verses in which they occur.

pdf bib
Corpus-Based Computation of Reverse Associations
Reinhard Rapp

According to psychological learning theory an important principle governing language acquisition is co-occurrence. For example, when we perceive language, our brain seems to unconsciously analyze and store the co-occurrence patterns of the words. And during language production, these co-occurrence patterns are reproduced. The applicability of this principle is particularly obvious in the case of word associations. There is evidence that the associative responses people typically come up with upon presentation of a stimulus word are often words which frequently co-occur with it. It is thus possible to predict a response by looking at co-occurrence data. The work presented here is along these lines. However, it differs from most previous work in that it investigates the direction from the response to the stimulus rather than vice-versa, and that it also deals with the case when several responses are known. Our results indicate that it is possible to predict a stimulus word from its responses, and that it helps if several responses are given.

pdf bib
LexTec — a rich language resource for technical domains in Portuguese
Palmira Marrafa | Raquel Amaro | Sara Mendes

The growing amount of available information and the importance given to the access to technical information enhance the potential role of NLP applications in enabling users to deal with information for a variety of knowledge domains. In this process, language resources are crucial. This paper presents Lextec, a rich computational language resource for technical vocabulary in Portuguese. Encoding a representative set of terms for ten different technical domains, this concept-based relational language resource combines a wide range of linguistic information by integrating each entry in a domain-specific wordnet and associating it with a precise definition for each lexicalization in the technical domain at stake, illustrative texts and information for translation into English.

pdf bib
Boosting the creation of a treebank
Blanca Arias | Núria Bel | Mercè Lorente | Montserrat Marimón | Alba Milà | Jorge Vivaldi | Muntsa Padró | Marina Fomicheva | Imanol Larrea

In this paper we present the results of an ongoing experiment of bootstrapping a Treebank for Catalan by using a Dependency Parser trained with Spanish sentences. In order to save time and cost, our approach was to profit from the typological similarities between Catalan and Spanish to create a first Catalan data set quickly by automatically: (i) annotating with a de-lexicalized Spanish parser, (ii) manually correcting the parses, and (iii) using the Catalan corrected sentences to train a Catalan parser. The results showed that the number of parsed sentences required to train a Catalan parser is about 1000 that were achieved in 4 months, with 2 annotators.

pdf bib
The Dutch LESLLA Corpus
Eric Sanders | Ineke van de Craats | Vanja de Lint

This paper describes the Dutch LESLLA data and its curation. LESLLA stands for Low-Educated Second Language and Literacy Acquisition. The data was collected for research in this field and would have been disappeared if it were not saved. Within the CLARIN project Data Curation Service the data was made into a spoken language resource and made available to other researchers.

pdf bib
Improvements to Dependency Parsing Using Automatic Simplification of Data
Tomáš Jelínek

In dependency parsing, much effort is devoted to the development of new methods of language modeling and better feature settings. Less attention is paid to actual linguistic data and how appropriate they are for automatic parsing: linguistic data can be too complex for a given parser, morphological tags may not reflect well syntactic properties of words, a detailed, complex annotation scheme may be ill suited for automatic parsing. In this paper, I present a study of this problem on the following case: automatic dependency parsing using the data of the Prague Dependency Treebank with two dependency parsers: MSTParser and MaltParser. I show that by means of small, reversible simplifications of the text and of the annotation, a considerable improvement of parsing accuracy can be achieved. In order to facilitate the task of language modeling performed by the parser, I reduce variability of lemmas and forms in the text. I modify the system of morphological annotation to adapt it better for parsing. Finally, the dependency annotation scheme is also partially modified. All such modifications are automatic and fully reversible: after the parsing is done, the original data and structures are automatically restored. With MaltParser, I achieve an 8.3% error rate reduction.

pdf bib
The Meta-knowledge of Causality in Biomedical Scientific Discourse
Claudiu Mihăilă | Sophia Ananiadou

Causality lies at the heart of biomedical knowledge, being involved in diagnosis, pathology or systems biology. Thus, automatic causality recognition can greatly reduce the human workload by suggesting possible causal connections and aiding in the curation of pathway models. For this, we rely on corpora that are annotated with classified, structured representations of important facts and findings contained within text. However, it is impossible to correctly interpret these annotations without additional information, e.g., classification of an event as fact, hypothesis, experimental result or analysis of results, confidence of authors about the validity of their analyses etc. In this study, we analyse and automatically detect this type of information, collectively termed meta-knowledge (MK), in the context of existing discourse causality annotations. Our effort proves the feasibility of identifying such pieces of information, without which the understanding of causal relations is limited.

pdf bib
Discosuite - A parser test suite for German discontinuous structures
Wolfgang Maier | Miriam Kaeshammer | Peter Baumann | Sandra Kübler

Parser evaluation traditionally relies on evaluation metrics which deliver a single aggregate score over all sentences in the parser output, such as PARSEVAL. However, for the evaluation of parser performance concerning a particular phenomenon, a test suite of sentences is needed in which this phenomenon has been identified. In recent years, the parsing of discontinuous structures has received a rising interest. Therefore, in this paper, we present a test suite for testing the performance of dependency and constituency parsers on non-projective dependencies and discontinuous constituents for German. The test suite is based on the newly released TIGER treebank version 2.2. It provides a unique possibility of benchmarking parsers on non-local syntactic relationships in German, for constituents and dependencies. We include a linguistic analysis of the phenomena that cause discontinuity in the TIGER annotation, thereby closing gaps in previous literature. The linguistic phenomena we investigate include extraposition, a placeholder/repeated element construction, topicalization, scrambling, local movement, parentheticals, and fronting of pronouns.

pdf bib
Modelling Irony in Twitter: Feature Analysis and Evaluation
Francesco Barbieri | Horacio Saggion

Irony, a creative use of language, has received scarce attention from the computational linguistics research point of view. We propose an automatic system capable of detecting irony with good accuracy in the social network Twitter. Twitter allows users to post short messages (140 characters) which usually do not follow the expected rules of the grammar, users tend to truncate words and use particular punctuation. For these reason automatic detection of Irony in Twitter is not trivial and requires specific linguistic tools. We propose in this paper a new set of experiments to assess the relevance of the features included in our model. Our model does not include words or sequences of words as features, aiming to detect inner characteristic of Irony.

pdf bib
DBpedia Domains: augmenting DBpedia with domain information
Gregor Titze | Volha Bryl | Cäcilia Zirn | Simone Paolo Ponzetto

We present an approach for augmenting DBpedia, a very large ontology lying at the heart of the Linked Open Data (LOD) cloud, with domain information. Our approach uses the thematic labels provided for DBpedia entities by Wikipedia categories, and groups them based on a kernel based k-means clustering algorithm. Experiments on gold-standard data show that our approach provides a first solution to the automatic annotation of DBpedia entities with domain labels, thus providing the largest LOD domain-annotated ontology to date.

pdf bib
Mining a multimodal corpus for non-verbal behavior sequences conveying attitudes
Mathieu Chollet | Magalie Ochs | Catherine Pelachaud

Interpersonal attitudes are expressed by non-verbal behaviors on a variety of different modalities. The perception of these behaviors is influenced by how they are sequenced with other behaviors from the same person and behaviors from other interactants. In this paper, we present a method for extracting and generating sequences of non-verbal signals expressing interpersonal attitudes. These sequences are used as part of a framework for non-verbal expression with Embodied Conversational Agents that considers different features of non-verbal behavior: global behavior tendencies, interpersonal reactions, sequencing of non-verbal signals, and communicative intentions. Our method uses a sequence mining technique on an annotated multimodal corpus to extract sequences characteristic of different attitudes. New sequences of non-verbal signals are generated using a probabilistic model, and evaluated using the previously mined sequences.

pdf bib
Biomedical entity extraction using machine-learning based approaches
Cyril Grouin

In this paper, we present the experiments we made to process entities from the biomedical domain. Depending on the task to process, we used two distinct supervised machine-learning techniques: Conditional Random Fields to perform both named entity identification and classification, and Maximum Entropy to classify given entities. Machine-learning approaches outperformed knowledge-based techniques on categories where sufficient annotated data was available. We showed that the use of external features (unsupervised clusters, information from ontology and taxonomy) improved the results significantly.

pdf bib
Parsing Heterogeneous Corpora with a Rich Dependency Grammar
Achim Stein

Grammar models conceived for parsing purposes are often poorer than models that are motivated linguistically. We present a grammar model which is linguistically satisfactory and based on the principles of traditional dependency grammar. We show how a state-of-the-art dependency parser (mate tools) performs with this model, trained on the Syntactic Reference Corpus of Medieval French (SRCMF), a manually annotated corpus of medieval (Old French) texts. We focus on the problems caused by small and heterogeneous training sets typical for corpora of older periods. The result is the first publicly available dependency parser for Old French. On a 90/10 training/evaluation split of eleven OF texts (206000 words), we obtained an UAS of 89.68% and a LAS of 82.62%. Three experiments showed how heterogeneity, typical of medieval corpora, affects the parsing results: (a) a ‘one-on-one’ cross evaluation for individual texts, (b) a ‘leave-one-out’ cross evaluation, and (c) a prose/verse cross evaluation.

pdf bib
ML-Optimization of Ported Constraint Grammars
Eckhard Bick

In this paper, we describe how a Constraint Grammar with linguist-written rules can be optimized and ported to another language using a Machine Learning technique. The effects of rule movements, sorting, grammar-sectioning and systematic rule modifications are discussed and quantitatively evaluated. Statistical information is used to provide a baseline and to enhance the core of manual rules. The best-performing parameter combinations achieved part-of-speech F-scores of over 92 for a grammar ported from English to Danish, a considerable advance over both the statistical baseline (85.7), and the raw ported grammar (86.1). When the same technique was applied to an existing native Danish CG, error reduction was 10% (F=96.94).

pdf bib
A Multi-Cultural Repository of Automatically Discovered Linguistic and Conceptual Metaphors
Samira Shaikh | Tomek Strzalkowski | Ting Liu | George Aaron Broadwell | Boris Yamrom | Sarah Taylor | Laurie Feldman | Kit Cho | Umit Boz | Ignacio Cases | Yuliya Peshkova | Ching-Sheng Lin

In this article, we present details about our ongoing work towards building a repository of Linguistic and Conceptual Metaphors. This resource is being developed as part of our research effort into the large-scale detection of metaphors from unrestricted text. We have stored a large amount of automatically extracted metaphors in American English, Mexican Spanish, Russian and Iranian Farsi in a relational database, along with pertinent metadata associated with these metaphors. A substantial subset of the contents of our repository has been systematically validated via rigorous social science experiments. Using information stored in the repository, we are able to posit certain claims in a cross-cultural context about how peoples in these cultures (America, Mexico, Russia and Iran) view particular concepts related to Governance and Economic Inequality through the use of metaphor. Researchers in the field can use this resource as a reference of typical metaphors used across these cultures. In addition, it can be used to recognize metaphors of the same form or pattern, in other domains of research.

pdf bib
First approach toward Semantic Role Labeling for Basque
Haritz Salaberri | Olatz Arregi | Beñat Zapirain

In this paper, we present the first Semantic Role Labeling system developed for Basque. The system is implemented using machine learning techniques and trained with the Reference Corpus for the Processing of Basque (EPEC). In our experiments the classifier that offers the best results is based on Support Vector Machines. Our system achieves 84.30 F1 score in identifying the PropBank semantic role for a given constituent and 82.90 F1 score in identifying the VerbNet role. Our study establishes a baseline for Basque SRL. Although there are no directly comparable systems for English we can state that the results we have achieved are quite good. In addition, we have performed a Leave-One-Out feature selection procedure in order to establish which features are the worthiest regarding argument classification. This will help smooth the way for future stages of Basque SRL and will help draw some of the guidelines of our research.

pdf bib
Generating a Lexicon of Errors in Portuguese to Support an Error Identification System for Spanish Native Learners
Lianet Sepúlveda Torres | Magali Sanches Duran | Sandra Aluísio

Portuguese is a less resourced language in what concerns foreign language learning. Aiming to inform a module of a system designed to support scientific written production of Spanish native speakers learning Portuguese, we developed an approach to automatically generate a lexicon of wrong words, reproducing language transfer errors made by such foreign learners. Each item of the artificially generated lexicon contains, besides the wrong word, the respective Spanish and Portuguese correct words. The wrong word is used to identify the interlanguage error and the correct Spanish and Portuguese forms are used to generate the suggestions. Keeping control of the correct word forms, we can provide correction or, at least, useful suggestions for the learners. We propose to combine two automatic procedures to obtain the error correction: i) a similarity measure and ii) a translation algorithm based on aligned parallel corpus. The similarity-based method achieved a precision of 52%, whereas the alignment-based method achieved a precision of 90%. In this paper we focus only on interlanguage errors involving suffixes that have different forms in both languages. The approach, however, is very promising to tackle other types of errors, such as gender errors.

pdf bib
xLiD-Lexica: Cross-lingual Linked Data Lexica
Lei Zhang | Michael Färber | Achim Rettinger

In this paper, we introduce our cross-lingual linked data lexica, called xLiD-Lexica, which are constructed by exploiting the multilingual Wikipedia and linked data resources from Linked Open Data (LOD). We provide the cross-lingual groundings of linked data resources from LOD as RDF data, which can be easily integrated into the LOD data sources. In addition, we build a SPARQL endpoint over our xLiD-Lexica to allow users to easily access them using SPARQL query language. Multilingual and cross-lingual information access can be facilitated by the availability of such lexica, e.g., allowing for an easy mapping of natural language expressions in different languages to linked data resources from LOD. Many tasks in natural language processing, such as natural language generation, cross-lingual entity linking, text annotation and question answering, can benefit from our xLiD-Lexica.

pdf bib
A Study on Expert Sourcing Enterprise Question Collection and Classification
Yuan Luo | Thomas Boucher | Tolga Oral | David Osofsky | Sara Weber

Large enterprises, such as IBM, accumulate petabytes of free-text data within their organizations. To mine this big data, a critical ability is to enable meaningful question answering beyond keywords search. In this paper, we present a study on the characteristics and classification of IBM sales questions. The characteristics are analyzed both semantically and syntactically, from where a question classification guideline evolves. We adopted an enterprise level expert sourcing approach to gather questions, annotate questions based on the guideline and manage the quality of annotations via enhanced inter-annotator agreement analysis. We developed a question feature extraction system and experimented with rule-based, statistical and hybrid question classifiers. We share our annotated corpus of questions and report our experimental results. Statistical classifiers separately based on n-grams and hand-crafted rule features give reasonable macro-f1 scores at 61.7% and 63.1% respectively. Rule based classifier gives a macro-f1 at 77.1%. The hybrid classifier with n-gram and rule features using a second guess model further improves the macro-f1 to 83.9%.

pdf bib
Annotating Relation Mentions in Tabloid Press
Hong Li | Sebastian Krause | Feiyu Xu | Hans Uszkoreit | Robert Hummel | Veselina Mironova

This paper presents a new resource for the training and evaluation needed by relation extraction experiments. The corpus consists of annotations of mentions for three semantic relations: marriage, parent―child, siblings, selected from the domain of biographic facts about persons and their social relationships. The corpus contains more than one hundred news articles from Tabloid Press. In the current corpus, we only consider the relation mentions occurring in the individual sentences. We provide multi-level annotations which specify the marked facts from relation, argument, entity, down to the token level, thus allowing for detailed analysis of linguistic phenomena and their interactions. A generic markup tool Recon developed at the DFKI LT lab has been utilised for the annotation task. The corpus has been annotated by two human experts, supported by additional conflict resolution conducted by a third expert. As shown in the evaluation, the annotation is of high quality as proved by the stated inter-annotator agreements both on sentence level and on relationmention level. The current corpus is already in active use in our research for evaluation of the relation extraction performance of our automatically learned extraction patterns.

pdf bib
Efficient Reuse of Structured and Unstructured Resources for Ontology Population
Chetana Gavankar | Ashish Kulkarni | Ganesh Ramakrishnan

We study the problem of ontology population for a domain ontology and present solutions based on semi-automatic techniques. A domain ontology for an organization, often consists of classes whose instances are either specific to, or independent of the organization. E.g. in an academic domain ontology, classes like Professor, Department could be organization (university) specific, while Conference, Programming languages are organization independent. This distinction allows us to leverage data sources both―within the organization and those in the Internet ― to extract entities and populate an ontology. We propose techniques that build on those for open domain IE. Together with user input, we show through comprehensive evaluation, how these semi-automatic techniques achieve high precision. We experimented with the academic domain and built an ontology comprising of over 220 classes. Intranet documents from five universities formed our organization specific corpora and we used open domain knowledge bases like Wikipedia, Linked Open Data, and web pages from the Internet as the organization independent data sources. The populated ontology that we built for one of the universities comprised of over 75,000 instances. We adhere to the semantic web standards and tools and make the resources available in the OWL format. These could be useful for applications such as information extraction, text annotation, and information retrieval.

pdf bib
Mapping Diatopic and Diachronic Variation in Spoken Czech: The ORTOFON and DIALEKT Corpora
Marie Kopřivová | Hana Goláňová | Petra Klimešová | David Lukeš

ORTOFON and DIALEKT are two corpora of spoken Czech (recordings + transcripts) which are currently being built at the Institute of the Czech National Corpus. The first one (ORTOFON) continues the tradition of the CNC’s ORAL series of spoken corpora by focusing on collecting recordings of unscripted informal spoken interactions (“prototypically spoken texts”), but also provides new features, most notably an annotation scheme with multiple tiers per speaker, including orthographic and phonetic transcripts and allowing for a more precise treatment of overlapping speech. Rich speaker- and situation-related metadata are also collected for possible use as factors in sociolinguistic analyses. One of the stated goals is to make the data in the corpus balanced with respect to a subset of these. The second project, DIALEKT, consists in annotating (in a way partially compatible with the ORTOFON corpus) and providing electronic access to historical (1960s--80s) dialect recordings, mainly of a monological nature, from all over the Czech Republic. The goal is to integrate both corpora into one map-based browsing interface, allowing an intuitive and informative spatial visualization of query results or dialect feature maps, confrontation with isoglosses previously established through the effort of dialectologists etc.

pdf bib
Corpus and Evaluation of Handwriting Recognition of Historical Genealogical Records
Patrick Schone | Heath Nielson | Mark Ward

Over the last few decades, significant strides have been made in handwriting recognition (HR), which is the automatic transcription of handwritten documents. HR often focuses on modern handwritten material, but in the electronic age, the volume of handwritten material is rapidly declining. However, we believe HR is on the verge of having major application to historical record collections. In recent years, archives and genealogical organizations have conducted huge campaigns to transcribe valuable historical record content with such transcription being largely done through human-intensive labor. HR has the potential of revolutionizing these transcription endeavors. To test the hypothesis that this technology is close to applicability, and to provide a testbed for reducing any accuracy gaps, we have developed an evaluation paradigm for historical record handwriting recognition. We created a huge test corpus consisting of four historical data collections of four differing genres and three languages. In this paper, we provide the details of these extensive resources which we intend to release to the research community for further study. Since several research organizations have already participated in this evaluation, we also show initial results and comparisons to human levels of performance.

pdf bib
Reusing Swedish FrameNet for training semantic roles
Ildikó Pilán | Elena Volodina

In this article we present the first experiences of reusing the Swedish FrameNet (SweFN) as a resource for training semantic roles. We give an account of the procedure we used to adapt SweFN to the needs of students of Linguistics in the form of an automatically generated exercise. During this adaptation, the mapping of the fine-grained distinction of roles from SweFN into learner-friendlier coarse-grained roles presented a major challenge. Besides discussing the details of this mapping, we describe the resulting multiple-choice exercise and its graphical user interface. The exercise was made available through Lärka, an online platform for students of Linguistics and learners of Swedish as a second language. We outline also aspects underlying the selection of the incorrect answer options which include semantic as well as frequency-based criteria. Finally, we present our own observations and initial user feedback about the applicability of such a resource in the pedagogical domain. Students’ answers indicated an overall positive experience, the majority found the exercise useful for learning semantic roles.

pdf bib
A Database of Freely Written Texts of German School Students for the Purpose of Automatic Spelling Error Classification
Kay Berkling | Johanna Fay | Masood Ghayoomi | Katrin Hein | Rémi Lavalley | Ludwig Linhuber | Sebastian Stüker

The spelling competence of school students is best measured on freely written texts, instead of pre-determined, dictated texts. Since the analysis of the error categories in these kinds of texts is very labor intensive and costly, we are working on an automatic systems to perform this task. The modules of the systems are derived from techniques from the area of natural language processing, and are learning systems that need large amounts of training data. To obtain the data necessary for training and evaluating the resulting system, we conducted data collection of freely written, German texts by school children. 1,730 students from grade 1 through 8 participated in this data collection. The data was transcribed electronically and annotated with their corrected version. This resulted in a total of 14,563 sentences that can now be used for research regarding spelling diagnostics. Additional meta-data was collected regarding writers’ language biography, teaching methodology, age, gender, and school year. In order to do a detailed manual annotation of the categories of the spelling errors committed by the students we developed a tool specifically tailored to the task.

pdf bib
PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphone
Christian Haenig | Andreas Niekler | Carsten Wuensch

In this paper, we describe a publicly available multilingual evaluation corpus for phrase-level Sentiment Analysis that can be used to evaluate real world applications in an industrial context. This corpus contains data from English and German Internet forums (1000 posts each) focusing on the automotive domain. The major topic of the corpus is connecting and using cellphones to/in cars. The presented corpus contains different types of annotations: objects (e.g. my car, my new cellphone), features (e.g. address book, sound quality) and phrase-level polarities (e.g. the best possible automobile, big problem). Each of the posts has been annotated by at least four different annotators ― these annotations are retained in their original form. The reliability of the annotations is evaluated by inter-annotator agreement scores. Besides the corpus data and format, we provide comprehensive corpus statistics. This corpus is one of the first lexical resources focusing on real world applications that analyze the voice of the customer which is crucial for various industrial use cases.

pdf bib
Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Veronika Vincze | Viktor Varga | Katalin Ilona Simkó | János Zsibrita | Ágoston Nagy | Richárd Farkas | János Csirik

The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, in which the new harmonized morphological coding system of Hungarian has been employed and, on the other hand, the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative / modal / frequentative verbs, adverbial pronouns and punctuation marks, moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS-tagger of Hungarian on a dataset with the new harmonized codes. According to the results, magyarlanc is able to achieve a state-of-the-art accuracy score on the 2.5 version as well.

pdf bib
Linked Open Data and Web Corpus Data for noun compound bracketing
Pierre André Ménard | Caroline Barrière

This research provides a comparison of a linked open data resource (DBpedia) and web corpus data resources (Google Web Ngrams and Google Books Ngrams) for noun compound bracketing. Large corpus statistical analysis has often been used for noun compound bracketing, and our goal is to introduce a linked open data (LOD) resource for such task. We show its particularities and its performance on the task. Results obtained on resources tested individually are promising, showing a potential for DBpedia to be included in future hybrid systems.

pdf bib
Multimodal Corpora for Silent Speech Interaction
João Freitas | António Teixeira | Miguel Dias

A Silent Speech Interface (SSI) allows for speech communication to take place in the absence of an acoustic signal. This type of interface is an alternative to conventional Automatic Speech Recognition which is not adequate for users with some speech impairments or in the presence of environmental noise. The work presented here produces the conditions to explore and analyze complex combinations of input modalities applicable in SSI research. By exploring non-invasive and promising modalities, we have selected the following sensing technologies used in human-computer interaction: Video and Depth input, Ultrasonic Doppler sensing and Surface Electromyography. This paper describes a novel data collection methodology where these independent streams of information are synchronously acquired with the aim of supporting research and development of a multimodal SSI. The reported recordings were divided into two rounds: a first one where the acquired data was silently uttered and a second round where speakers pronounced the scripted prompts in an audible and normal tone. In the first round of recordings, a total of 53.94 minutes were captured where 30.25% was estimated to be silent speech. In the second round of recordings, a total of 30.45 minutes were obtained and 30.05% of the recordings were audible speech.

pdf bib
Constructing a Corpus of Japanese Predicate Phrases for Synonym/Antonym Relations
Tomoko Izumi | Tomohide Shibata | Hisako Asano | Yoshihiro Matsuo | Sadao Kurohashi

We construct a large corpus of Japanese predicate phrases for synonym-antonym relations. The corpus consists of 7,278 pairs of predicates such as “receive-permission (ACC)” vs. “obtain-permission (ACC)”, in which each predicate pair is accompanied by a noun phrase and case information. The relations are categorized as synonyms, entailment, antonyms, or unrelated. Antonyms are further categorized into three different classes depending on their aspect of oppositeness. Using the data as a training corpus, we conduct the supervised binary classification of synonymous predicates based on linguistically-motivated features. Combining features that are characteristic of synonymous predicates with those that are characteristic of antonymous predicates, we succeed in automatically identifying synonymous predicates at the high F-score of 0.92, a 0.4 improvement over the baseline method of using the Japanese WordNet. The results of an experiment confirm that the quality of the corpus is high enough to achieve automatic classification. To the best of our knowledge, this is the first and the largest publicly available corpus of Japanese predicate phrases for synonym-antonym relations.

pdf bib
The Impact of Cohesion Errors in Extraction Based Summaries
Evelina Rennes | Arne Jönsson

We present results from an eye tracking study of automatic text summarization. Automatic text summarization is a growing field due to the modern world’s Internet based society, but to automatically create perfect summaries is challenging. One problem is that extraction based summaries often have cohesion errors. By the usage of an eye tracking camera, we have studied the nature of four different types of cohesion errors occurring in extraction based summaries. A total of 23 participants read and rated four different texts and marked the most difficult areas of each text. Statistical analysis of the data revealed that absent cohesion or context and broken anaphoric reference (pronouns) caused some disturbance in reading, but that the impact is restricted to the effort to read rather than the comprehension of the text. However, erroneous anaphoric references (pronouns) were not always detected by the participants which poses a problem for automatic text summarizers. The study also revealed other potential disturbing factors.

pdf bib
The CUHK Discourse TreeBank for Chinese: Annotating Explicit Discourse Connectives for the Chinese TreeBank
Lanjun Zhou | Binyang Li | Zhongyu Wei | Kam-Fai Wong

The lack of open discourse corpus for Chinese brings limitations for many natural language processing tasks. In this work, we present the first open discourse treebank for Chinese, namely, the Discourse Treebank for Chinese (DTBC). At the current stage, we annotated explicit intra-sentence discourse connectives, their corresponding arguments and senses for all 890 documents of the Chinese Treebank 5. We started by analysing the characteristics of discourse annotation for Chinese, adapted the annotation scheme of Penn Discourse Treebank 2 (PDTB2) to Chinese language while maintaining the compatibility as far as possible. We made adjustments to 3 essential aspects according to the previous study of Chinese linguistics. They are sense hierarchy, argument scope and semantics of arguments. Agreement study showed that our annotation scheme could achieve highly reliable results.

pdf bib
Extraction of Daily Changing Words for Question Answering
Kugatsu Sadamitsu | Ryuichiro Higashinaka | Yoshihiro Matsuo

This paper proposes a method for extracting Daily Changing Words (DCWs), words that indicate which questions are real-time dependent. Our approach is based on two types of template matching using time and named entity slots from large size corpora and adding simple filtering methods from news corpora. Extracted DCWs are utilized for detecting and sorting real-time dependent questions. Experiments confirm that our DCW method achieves higher accuracy in detecting real-time dependent questions than existing word classes and a simple supervised machine learning approach.

pdf bib
MTWatch: A Tool for the Analysis of Noisy Parallel Data
Sandipan Dandapat | Declan Groves

State-of-the-art statistical machine translation (SMT) technique requires a good quality parallel data to build a translation model. The availability of large parallel corpora has rapidly increased over the past decade. However, often these newly developed parallel data contains contain significant noise. In this paper, we describe our approach for classifying good quality parallel sentence pairs from noisy parallel data. We use 10 different features within a Support Vector Machine (SVM)-based model for our classification task. We report a reasonably good classification accuracy and its positive effect on overall MT accuracy.

pdf bib
Distributed Distributional Similarities of Google Books Over the Centuries
Martin Riedl | Richard Steuer | Chris Biemann

This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which is extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis leads to much better results than using smaller corpora like Wikipedia. We also provide distributional thesauri for equal-sized time slices of the corpus. While distributional thesauri can be used as lexical resources in NLP tasks, comparing word similarities over time can unveil sense change of terms across different decades or centuries, and can serve as a resource for diachronic lexicography. Thesauri and clusters are available for download.

pdf bib
The CLE Urdu POS Tagset
Saba Urooj | Sarmad Hussain | Asad Mustafa | Rahila Parveen | Farah Adeeba | Tafseer Ahmed Khan | Miriam Butt | Annette Hautli

The paper presents a design schema and details of a new Urdu POS tagset. This tagset is designed due to challenges encountered in working with existing tagsets for Urdu. It uses tags that judiciously incorporate information about special morpho-syntactic categories found in Urdu. With respect to the overall naming schema and the basic divisions, the tagset draws on the Penn Treebank and a Common Tagset for Indian Languages. The resulting CLE Urdu POS Tagset consists of 12 major categories with subdivisions, resulting in 32 tags. The tagset has been used to tag 100k words of the CLE Urdu Digest Corpus, giving a tagging accuracy of 96.8%.

pdf bib
NoSta-D Named Entity Annotation for German: Guidelines and Dataset
Darina Benikova | Chris Biemann | Marc Reznicek

We describe the annotation of a new dataset for German Named Entity Recognition (NER). The need for this dataset is motivated by licensing issues and consistency issues of existing datasets. We describe our approach to creating annotation guidelines based on linguistic and semantic considerations, and how we iteratively refined and tested them in the early stages of annotation in order to arrive at the largest publicly available dataset for German NER, consisting of over 31,000 manually annotated sentences (over 591,000 tokens) from German Wikipedia and German online news. We provide a number of statistics on the dataset, which indicate its high quality, and discuss legal aspects of distributing the data as a compilation of citations. The data is released under the permissive CC-BY license, and will be fully available for download in September 2014 after it has been used for the GermEval 2014 shared task on NER. We further provide the full annotation guidelines and links to the annotation tool used for the creation of this resource.

pdf bib
Phone Boundary Annotation in Conversational Speech
Yi-Fen Liu | Shu-Chuan Tseng | J.-S. Roger Jang

Phone-aligned spoken corpora are indispensable language resources for quantitative linguistic analyses and automatic speech systems. However, producing this type of data resources is not an easy task due to high costs of time and man power as well as difficulties of applying valid annotation criteria and achieving reliable inter-labeler’s consistency. Among different types of spoken corpora, conversational speech that is often filled with extreme reduction and varying pronunciation variants is particularly challenging. By adopting a combined verification procedure, we obtained reasonably good annotation results. Preliminary phone boundaries that were automatically generated by a phone aligner were provided to human labelers for verifying. Instead of making use of the visualization of acoustic cues, the labelers should solely rely on their perceptual judgments to locate a position that best separates two adjacent phones. Impressionistic judgments in cases of reduction and segment deletion were helpful and necessary, as they balanced subtle nuance caused by differences in perception.

pdf bib
A Colloquial Corpus of Japanese Sign Language: Linguistic Resources for Observing Sign Language Conversations
Mayumi Bono | Kouhei Kikuchi | Paul Cibulka | Yutaka Osugi

We began building a corpus of Japanese Sign Language (JSL) in April 2011. The purpose of this project was to increase awareness of sign language as a distinctive language in Japan. This corpus is beneficial not only to linguistic research but also to hearing-impaired and deaf individuals, as it helps them to recognize and respect their linguistic differences and communication styles. This is the first large-scale JSL corpus developed for both academic and public use. We collected data in three ways: interviews (for introductory purposes only), dialogues, and lexical elicitation. In this paper, we focus particularly on data collected during a dialogue to discuss the application of conversation analysis (CA) to signed dialogues and signed conversations. Our annotation scheme was designed not only to elucidate theoretical issues related to grammar and linguistics but also to clarify pragmatic and interactional phenomena related to the use of JSL.

pdf bib
Walenty: Towards a comprehensive valence dictionary of Polish
Adam Przepiórkowski | Elżbieta Hajnicz | Agnieszka Patejuk | Marcin Woliński | Filip Skwarski | Marek Świdziński

This paper presents Walenty, a comprehensive valence dictionary of Polish, with a number of novel features, as compared to other such dictionaries. The notion of argument is based on the coordination test and takes into consideration the possibility of diverse morphosyntactic realisations. Some aspects of the internal structure of phraseological (idiomatic) arguments are handled explicitly. While the current version of the dictionary concentrates on syntax, it already contains some semantic features, including semantically defined arguments, such as locative, temporal or manner, as well as control and raising, and work on extending it with semantic roles and selectional preferences is in progress. Although Walenty is still being intensively developed, it is already by far the largest Polish valence dictionary, with around 8600 verbal lemmata and almost 39 000 valence schemata. The dictionary is publicly available on the Creative Commons BY SA licence and may be downloaded from

pdf bib
Can the Crowd be Controlled?: A Case Study on Crowd Sourcing and Automatic Validation of Completed Tasks based on User Modeling
Balamurali A.R

Annotation is an essential step in the development cycle of many Natural Language Processing (NLP) systems. Lately, crowd-sourcing has been employed to facilitate large scale annotation at a reduced cost. Unfortunately, verifying the quality of the submitted annotations is a daunting task. Existing approaches address this problem either through sampling or redundancy. However, these approaches do have a cost associated with it. Based on the observation that a crowd-sourcing worker returns to do a task that he has done previously, a novel framework for automatic validation of crowd-sourced task is proposed in this paper. A case study based on sentiment analysis is presented to elucidate the framework and its feasibility. The result suggests that validation of the crowd-sourced task can be automated to a certain extent.

pdf bib
Computational Narratology: Extracting Tense Clusters from Narrative Texts
Thomas Bögel | Jannik Strötgen | Michael Gertz

Computational Narratology is an emerging field within the Digital Humanities. In this paper, we tackle the problem of extracting temporal information as a basis for event extraction and ordering, as well as further investigations of complex phenomena in narrative texts. While most existing systems focus on news texts and extract explicit temporal information exclusively, we show that this approach is not feasible for narratives. Based on tense information of verbs, we define temporal clusters as an annotation task and validate the annotation schema by showing that the task can be performed with high inter-annotator agreement. To alleviate and reduce the manual annotation effort, we propose a rule-based approach to robustly extract temporal clusters using a multi-layered and dynamic NLP pipeline that combines off-the-shelf components in a heuristic setting. Comparing our results against human judgements, our system is capable of predicting the tense of verbs and sentences with very high reliability: for the most prevalent tense in our corpus, more than 95% of all verbs are annotated correctly.

pdf bib
Designing the Latvian Speech Recognition Corpus
Mārcis Pinnis | Ilze Auziņa | Kārlis Goba

In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus designing process through analysis of related work on speech corpora creation for different languages. The authors provide also guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus creation guidelines are fairly general for them to be re-used by other researchers when working on different language speech recognition corpora. The corpus consists of two parts ― an orthographically annotated corpus containing 100 hours of orthographically transcribed audio data and a phonetically annotated corpus containing 4 hours of phonetically transcribed audio data. Metadata files in XML format provide additional details about the speakers, noise levels, speech styles, etc. The speech recognition corpus is phonetically balanced and phonetically rich and the paper describes also the methodology how the phonetical balancedness has been assessed.

pdf bib
Aligning parallel texts with InterText
Pavel Vondřička

InterText is a flexible manager and editor for alignment of parallel texts aimed both at individual and collaborative creation of parallel corpora of any size or translational memories. It is available in two versions: as a multi-user server application with a web-based interface and as a native desktop application for personal use. Both versions are able to cooperate with each other. InterText can process plain text or custom XML documents, deploy existing automatic aligners and provide a comfortable interface for manual post-alignment correction of both the alignment and the text contents and segmentation of the documents. One language version may be aligned with several other versions (using stand-off alignment) and the application ensures consistency between them. The server version supports different user levels and privileges and it can also track changes made to the texts for easier supervision. It also allows for batch import, alignment and export and can be connected to other tools and scripts for better integration in a more complex project workflow.

pdf bib
Corpus for Coreference Resolution on Scientific Papers
Panot Chaimongkol | Akiko Aizawa | Yuka Tateisi

The ever-growing number of published scientific papers prompts the need for automatic knowledge extraction to help scientists keep up with the state-of-the-art in their respective fields. To construct a good knowledge extraction system, annotated corpora in the scientific domain are required to train machine learning models. As described in this paper, we have constructed an annotated corpus for coreference resolution in multiple scientific domains, based on an existing corpus. We have modified the annotation scheme from Message Understanding Conference to better suit scientific texts. Then we applied that to the corpus. The annotated corpus is then compared with corpora in general domains in terms of distribution of resolution classes and performance of the Stanford Dcoref coreference resolver. Through these comparisons, we have demonstrated quantitatively that our manually annotated corpus differs from a general-domain corpus, which suggests deep differences between general-domain texts and scientific texts and which shows that different approaches can be made to tackle coreference resolution for general texts and scientific texts.

pdf bib
From Non Word to New Word: Automatically Identifying Neologisms in French Newspapers
Ingrid Falk | Delphine Bernhard | Christophe Gérard

In this paper we present a statistical machine learning approach to formal neologism detection going some way beyond the use of exclusion lists. We explore the impact of three groups of features: form related, morpho-lexical and thematic features. The latter type of features has not yet been used in this kind of application and represents a way to access the semantic context of new words. The results suggest that form related features are helpful at the overall classification task, while morpho-lexical and thematic features better single out true neologisms.

pdf bib
Evaluating the effects of interactivity in a post-editing workbench
Nancy Underwood | Bartolomé Mesa-Lao | Mercedes García Martínez | Michael Carl | Vicent Alabau | Jesús González-Rubio | Luis A. Leiva | Germán Sanchis-Trilles | Daniel Ortíz-Martínez | Francisco Casacuberta

This paper describes the field trial and subsequent evaluation of a post-editing workbench which is currently under development in the EU-funded CasMaCat project. Based on user evaluations of the initial prototype of the workbench, this second prototype of the workbench includes a number of interactive features designed to improve productivity and user satisfaction. Using CasMaCat’s own facilities for logging keystrokes and eye tracking, data were collected from nine post-editors in a professional setting. These data were then used to investigate the effects of the interactive features on productivity, quality, user satisfaction and cognitive load as reflected in the post-editors’ gaze activity. These quantitative results are combined with the qualitative results derived from user questionnaires and interviews conducted with all the participants.

pdf bib
Using Twitter and Sentiment Analysis for event detection
Georgios Paltoglou

pdf bib
The Research and Teaching Corpus of Spoken German — FOLK
Thomas Schmidt

FOLK is the “Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK)” (eng.: research and teaching corpus of spoken German). The project has set itself the aim of building a corpus of German conversations which a) covers a broad range of interaction types in private, institutional and public settings, b) is sufficiently large and diverse and of sufficient quality to support different qualitative and quantitative research approaches, c) is transcribed, annotated and made accessible according to current technological standards, and d) is available to the scientific community on a sound legal basis and without unnecessary restrictions of usage. This paper gives an overview of the corpus design, the strategies for acquisition of a diverse range of interaction data, and the corpus construction workflow from recording via transcription an annotation to dissemination.

pdf bib
Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
Stefania Degaetano-Ortlieb | Peter Fankhauser | Hannah Kermes | Ekaterina Lapshinova-Koltunski | Noam Ordan | Elke Teich

We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.

pdf bib
On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter
Hassan Saif | Miriam Fernandez | Yulan He | Harith Alani

Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations on the level of data sparsity, the size of the classifier’s feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method to maintaining a high classification performance while reducing the data sparsity and shrinking the feature space.

pdf bib
Adapting Freely Available Resources to Build an Opinion Mining Pipeline in Portuguese
Patrik Lambert | Carlos Rodríguez-Penagos

We present a complete UIMA-based pipeline for sentiment analysis in Portuguese news using freely available resources and a minimal set of manually annotated training data. We obtained good precision on binary classification but concluded that news feed is a challenging environment to detect the extent of opinionated text.

pdf bib
The SYN-series corpora of written Czech
Milena Hnátková | Michal Křen | Pavel Procházka | Hana Skoumalová

The paper overviews the SYN series of synchronic corpora of written Czech compiled within the framework of the Czech National Corpus project. It describes their design and processing with a focus on the annotation, i.e. lemmatization and morphological tagging. The paper also introduces SYN2013PUB, a new 935-million newspaper corpus of Czech published in 2013 as the most recent addition to the SYN series before planned revision of its architecture. SYN2013PUB can be seen as a completion of the series in terms of titles and publication dates of major Czech newspapers that are now covered by complete volumes in comparable proportions. All SYN-series corpora can be characterized as traditional, with emphasis on cleared copyright issues, well-defined composition, reliable metadata and high-quality data processing; their overall size currently exceeds 2.2 billion running words.

pdf bib
ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT
Liane Guillou | Christian Hardmeier | Aaron Smith | Jörg Tiedemann | Bonnie Webber

We present ParCor, a parallel corpus of texts in which pronoun coreference ― reduced coreference in which pronouns are used as referring expressions ― has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.

pdf bib
A Benchmark Database of Phonetic Alignments in Historical Linguistics and Dialectology
Johann-Mattis List | Jelena Prokić

In the last two decades, alignment analyses have become an important technique in quantitative historical linguistics and dialectology. Phonetic alignment plays a crucial role in the identification of regular sound correspondences and deeper genealogical relations between and within languages and language families. Surprisingly, up to today, there are no easily accessible benchmark data sets for phonetic alignment analyses. Here we present a publicly available database of manually edited phonetic alignments which can serve as a platform for testing and improving the performance of automatic alignment algorithms. The database consists of a great variety of alignments drawn from a large number of different sources. The data is arranged in a such way that typical problems encountered in phonetic alignment analyses (metathesis, diversity of phonetic sequences) are represented and can be directly tested.

pdf bib
Extracting News Web Page Creation Time with DCTFinder
Xavier Tannier

Web pages do not offer reliable metadata concerning their creation date and time. However, getting the document creation time is a necessary step for allowing to apply temporal normalization systems to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts from its content the title and the creation date of this web page. DCTFinder combines heuristic title detection, supervised learning with Conditional Random Fields (CRFs) for document date extraction, and rule-based creation time recognition. Using such a system allows further deep and efficient temporal analysis of web pages. Evaluation on three corpora of English and French web pages indicates that the tool can extract document creation times with reasonably high accuracy (between 87 and 92\%). DCTFinder is made freely available on, as well as all resources (vocabulary and annotated documents) built for training and evaluating the system in English and French, and the English trained model itself.

pdf bib
Corpus of 19th-century Czech Texts: Problems and Solutions
Karel Kučera | Martin Stluka

Although the Czech language of the 19th century represents the roots of modern Czech and many features of the 20th- and 21st-century language cannot be properly understood without this historical background, the 19th-century Czech has not been thoroughly and consistently researched so far. The long-term project of a corpus of 19th-century Czech printed texts, currently in its third year, is intended to stimulate the research as well as to provide a firm material basis for it. The reason why, in our opinion, the project is worth mentioning is that it is faced with an unusual concentration of problems following mostly from the fact that the 19th century was arguably the most tumultuous period in the history of Czech, as well as from the fact that Czech is a highly inflectional language with a long history of sound changes, orthography reforms and rather discontinuous development of its vocabulary. The paper will briefly characterize the general background of the problems and present the reasoning behind the solutions that have been implemented in the ongoing project.

pdf bib
Investigating the Image of Entities in Social Media: Dataset Design and First Results
Julien Velcin | Young-Min Kim | Caroline Brun | Jean-Yves Dormagen | Eric SanJuan | Leila Khouas | Anne Peradotto | Stephane Bonnevay | Claude Roux | Julien Boyadjian | Alejandro Molina | Marie Neihouser

The objective of this paper is to describe the design of a dataset that deals with the image (i.e., representation, web reputation) of various entities populating the Internet: politicians, celebrities, companies, brands etc. Our main contribution is to build and provide an original annotated French dataset. This dataset consists of 11527 manually annotated tweets expressing the opinion on specific facets (e.g., ethic, communication, economic project) describing two French policitians over time. We believe that other researchers might benefit from this experience, since designing and implementing such a dataset has proven quite an interesting challenge. This design comprises different processes such as data selection, formal definition and instantiation of an image. We have set up a full open-source annotation platform. In addition to the dataset design, we present the first results that we obtained by applying clustering methods to the annotated dataset in order to extract the entity images.

pdf bib
The Norwegian Dependency Treebank
Per Erik Solberg | Arne Skjærholt | Lilja Øvrelid | Kristin Hagen | Janne Bondi Johannessen

The Norwegian Dependency Treebank is a new syntactic treebank for Norwegian Bokmäl and Nynorsk with manual syntactic and morphological annotation, developed at the National Library of Norway in collaboration with the University of Oslo. It is the first publically available treebank for Norwegian. This paper presents the core principles behind the syntactic annotation and how these principles were employed in certain specific cases. We then present the selection of texts and distribution between genres, as well as the annotation process and an evaluation of the inter-annotator agreement. Finally, we present the first results of data-driven dependency parsing of Norwegian, contrasting four state-of-the-art dependency parsers trained on the treebank. The consistency and the parsability of this treebank is shown to be comparable to other large treebank initiatives.

pdf bib
Extending standoff annotation
Maik Stührenberg

Textual information is sometimes accompanied by additional encodings (such as visuals). These multimodal documents may be interesting objects of investigation for linguistics. Another class of complex documents are pre-annotated documents. Classic XML inline annotation often fails for both document classes because of overlapping markup. However, standoff annotation, that is the separation of primary data and markup, is a valuable and common mechanism to annotate multiple hierarchies and/or read-only primary data. We demonstrate an extended version of the XStandoff meta markup language, that allows the definition of segments in spatial and pre-annotated primary data. Together with the ability to import already established (linguistic) serialization formats as annotation levels and layers in an XStandoff instance, we are able to annotate a variety of primary data files, including text, audio, still and moving images. Application scenarios that may benefit from using XStandoff are the analyzation of multimodal documents such as instruction manuals, or sports match analysis, or the less destructive cleaning of web pages.

pdf bib
An efficient language independent toolkit for complete morphological disambiguation
László Laki | György Orosz

In this paper a Moses SMT toolkit-based language-independent complete morphological annotation tool is presented called HuLaPos2. Our system performs PoS tagging and lemmatization simultaneously. Amongst others, the algorithm used is able to handle phrases instead of unigrams, and can perform the tagging in a not strictly left-to-right order. With utilizing these gains, our system outperforms the HMM-based ones. In order to handle the unknown words, a suffix-tree based guesser was integrated into HuLaPos2. To demonstrate the performance of our system it was compared with several systems in different languages and PoS tag sets. In general, it can be concluded that the quality of HuLaPos2 is comparable with the state-of-the-art systems, and in the case of PoS tagging it outperformed many available systems.

pdf bib
A decade of HLT Agency activities in the Low Countries: from resource maintenance (BLARK) to service offerings (BLAISE)
Peter Spyns | Remco van Veenendaal

In this paper we report on the Flemish-Dutch Agency for Human Language Technologies (HLT Agency or TST-Centrale in Dutch) in the Low Countries. We present its activities in its first decade of existence. The main goal of the HLT Agency is to ensure the sustainability of linguistic resources for Dutch. 10 years after its inception, the HLT Agency faces new challenges and opportunities. An important contextual factor is the rise of the infrastructure networks and proliferation of resource centres. We summarise some lessons learnt and we propose as future work to define and build for Dutch (which by extension can apply to any national language) a set of Basic LAnguage Infrastructure SErvices (BLAISE). As a conclusion, we state that the HLT Agency, also by its peculiar institutional status, has fulfilled and still is fulfilling an important role in maintaining Dutch as a digitally fully fledged functional language.

pdf bib
A Corpus of Spontaneous Speech in Lectures: The KIT Lecture Corpus for Spoken Language Processing and Translation
Eunah Cho | Sarah Fünfer | Sebastian Stüker | Alex Waibel

With the increasing number of applications handling spontaneous speech, the needs to process spoken languages become stronger. Speech disfluency is one of the most challenging tasks to deal with in automatic speech processing. As most applications are trained with well-formed, written texts, many issues arise when processing spontaneous speech due to its distinctive characteristics. Therefore, more data with annotated speech disfluencies will help the adaptation of natural language processing applications, such as machine translation systems. In order to support this, we have annotated speech disfluencies in German lectures at KIT. In this paper we describe how we annotated the disfluencies in the data and provide detailed statistics on the size of the corpus and the speakers. Moreover, machine translation performance on a source text including disfluencies is compared to the results of the translation of a source text without different sorts of disfluencies or no disfluencies at all.

pdf bib
Free Acoustic and Language Models for Large Vocabulary Continuous Speech Recognition in Swedish
Niklas Vanhainen | Giampiero Salvi

This paper presents results for large vocabulary continuous speech recognition (LVCSR) in Swedish. We trained acoustic models on the public domain NST Swedish corpus and made them freely available to the community. The training procedure corresponds to the reference recogniser (RefRec) developed for the SpeechDat databases during the COST249 action. We describe the modifications we made to the procedure in order to train on the NST database, and the language models we created based on the N-gram data available at the Norwegian Language Council. Our tests include medium vocabulary isolated word recognition and LVCSR. Because no previous results are available for LVCSR in Swedish, we use as baseline the performance of the SpeechDat models on the same tasks. We also compare our best results to the ones obtained in similar conditions on resource rich languages such as American English. We tested the acoustic models with HTK and Julius and plan to make them available in CMU Sphinx format as well in the near future. We believe that the free availability of these resources will boost research in speech and language technology in Swedish, even in research groups that do not have resources to develop ASR systems.

pdf bib
Turkish Resources for Visual Word Recognition
Begüm Erten | Cem Bozsahin | Deniz Zeyrek

We report two tools to conduct psycholinguistic experiments on Turkish words. KelimetriK allows experimenters to choose words based on desired orthographic scores of word frequency, bigram and trigram frequency, ON, OLD20, ATL and subset/superset similarity. Turkish version of Wuggy generates pseudowords from one or more template words using an efficient method. The syllabified version of the words are used as the input, which are decomposed into their sub-syllabic components. The bigram frequency chains are constructed by the entire words’ onset, nucleus and coda patterns. Lexical statistics of stems and their syllabification are compiled by us from BOUN corpus of 490 million words. Use of these tools in some experiments is shown.

pdf bib
An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis
Eshrag Refaee | Verena Rieser

We present a newly collected data set of 8,868 gold-standard annotated Arabic feeds. The corpus is manually labelled for subjectivity and sentiment analysis (SSA) ( = 0:816). In addition, the corpus is annotated with a variety of motivated feature-sets that have previously shown positive impact on performance. The paper highlights issues posed by twitter as a genre, such as mixture of language varieties and topic-shifts. Our next step is to extend the current corpus, using online semi-supervised learning. A first sub-corpus will be released via the ELRA repository as part of this submission.

pdf bib
The IMAGACT Visual Ontology. An Extendable Multilingual Infrastructure for the representation of lexical encoding of Action
Massimo Moneglia | Susan Brown | Francesca Frontini | Gloria Gagliardi | Fahad Khan | Monica Monachini | Alessandro Panunzi

Action verbs have many meanings, covering actions in different ontological types. Moreover, each language categorizes action in its own way. One verb can refer to many different actions and one action can be identified by more than one verb. The range of variations within and across languages is largely unknown, causing trouble for natural language processing tasks. IMAGACT is a corpus-based ontology of action concepts, derived from English and Italian spontaneous speech corpora, which makes use of the universal language of images to identify the different action types extended by verbs referring to action in English, Italian, Chinese and Spanish. This paper presents the infrastructure and the various linguistic information the user can derive from it. IMAGACT makes explicit the variation of meaning of action verbs within one language and allows comparisons of verb variations within and across languages. Because the action concepts are represented with videos, extension into new languages beyond those presently implemented in IMAGACT is done using competence-based judgments by mother-tongue informants without intense lexicographic work involving underdetermined semantic description

pdf bib
Collaboration in the Production of a Massively Multilingual Lexicon
Martin Benjamin

This paper discusses the multiple approaches to collaboration that the Kamusi Project is employing in the creation of a massively multilingual lexical resource. The project’s data structure enables the inclusion of large amounts of rich data within each sense-specific entry, with transitive concept-based links across languages. Data collection involves mining existing data sets, language experts using an online editing system, crowdsourcing, and games with a purpose. The paper discusses the benefits and drawbacks of each of these elements, and the steps the project is taking to account for those. Special attention is paid to guiding crowd members with targeted questions that produce results in a specific format. Collaboration is seen as an essential method for generating large amounts of linguistic data, as well as for validating the data so it can be considered trustworthy.

pdf bib
An Effortless Way To Create Large-Scale Datasets For Famous Speakers
François Salmon | Félicien Vallet

The creation of large-scale multimedia datasets has become a scientific matter in itself. Indeed, the fully-manual annotation of hundreds or thousands of hours of video and/or audio turns out to be practically infeasible. In this paper, we propose an extremly handy approach to automatically construct a database of famous speakers from TV broadcast news material. We then run a user experiment with a correctly designed tool that demonstrates that very reliable results can be obtained with this method. In particular, a thorough error analysis demonstrates the value of the approach and provides hints for the improvement of the quality of the dataset.

pdf bib
Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems
Bogdan Ludusan | Maarten Versteegh | Aren Jansen | Guillaume Gravier | Xuan-Nga Cao | Mark Johnson | Emmanuel Dupoux

The unsupervised discovery of linguistic terms from either continuous phoneme transcriptions or from raw speech has seen an increasing interest in the past years both from a theoretical and a practical standpoint. Yet, there exists no common accepted evaluation method for the systems performing term discovery. Here, we propose such an evaluation toolbox, drawing ideas from both speech technology and natural language processing. We first transform the speech-based output into a symbolic representation and compute five types of evaluation metrics on this representation: the quality of acoustic matching, the quality of the clusters found, and the quality of the alignment with real words (type, token, and boundary scores). We tested our approach on two term discovery systems taking speech as input, and one using symbolic input. The latter was run using both the gold transcription and a transcription obtained from an automatic speech recognizer, in order to simulate the case when only imperfect symbolic information is available. The results obtained are analysed through the use of the proposed evaluation metrics and the implications of these metrics are discussed.

pdf bib
Modeling and evaluating dialog success in the LAST MINUTE corpus
Dietmar Rösner | Rafael Friesen | Stephan Günther | Rico Andrich

The LAST MINUTE corpus comprises records and transcripts of naturalistic problem solving dialogs between N = 130 subjects and a companion system simulated in a Wizard of Oz experiment. Our goal is to detect dialog situations where subjects might break up the dialog with the system which might happen when the subject is unsuccessful. We present a dialog act based representation of the dialog courses in the problem solving phase of the experiment and propose and evaluate measures for dialog success or failure derived from this representation. This dialog act representation refines our previous coarse measure as it enables the correct classification of many dialog sequences that were ambiguous before. The dialog act representation is useful for the identification of different subject groups and the exploration of interesting dialog courses in the corpus. We find young females to be most successful in the challenging last part of the problem solving phase and young subjects to have the initiative in the dialog more often than the elderly.

pdf bib
Comparison of Gender- and Speaker-adaptive Emotion Recognition
Maxim Sidorov | Stefan Ultes | Alexander Schmitt

Deriving the emotion of a human speaker is a hard task, especially if only the audio stream is taken into account. While state-of-the-art approaches already provide good results, adaptive methods have been proposed in order to further improve the recognition accuracy. A recent approach is to add characteristics of the speaker, e.g., the gender of the speaker. In this contribution, we argue that adding information unique for each speaker, i.e., by using speaker identification techniques, improves emotion recognition simply by adding this additional information to the feature vector of the statistical classification algorithm. Moreover, we compare this approach to emotion recognition adding only the speaker gender being a non-unique speaker attribute. We justify this by performing adaptive emotion recognition using both gender and speaker information on four different corpora of different languages containing acted and non-acted speech. The final results show that adding speaker information significantly outperforms both adding gender information and solely using a generic speaker-independent approach.

pdf bib
The pragmatic annotation of a corpus of academic lectures
Siân Alsop | Hilary Nesi

This paper will describe a process of ‘pragmatic annotation’ (c.f. Simpson-Vlach and Leicher 2006) which systematically identifies pragmatic meaning in spoken text. The annotation of stretches of text that perform particular pragmatic functions allows conclusions to be drawn across data sets at a different level than that of the individual lexical item, or structural content. The annotation of linguistic features, which cannot be identified by purely objective means, is distinguished here from structural mark-up of speaker identity, turns, pauses etc. The features annotated are ‘explaining’, ‘housekeeping’, ‘humour’, ‘storytelling’ and ‘summarising’. Twenty-two subcategories are attributed to these elements. Data is from the Engineering Lecture Corpus (ELC), which includes 76 English-medium engineering lectures from the UK, New Zealand and Malaysia. The annotation allows us to compare differences in the use of these discourse features across cultural subcorpora. Results show that cultural context does impact on the linguistic realisation of commonly occurring discourse features in engineering lectures.

pdf bib
Using TEI, CMDI and ISOcat in CLARIN-DK
Dorte Haltrup Hansen | Lene Offersgaard | Sussi Olsen

This paper presents the challenges and issues encountered in the conversion of TEI header metadata into the CMDI format. The work is carried out in the Danish research infrastructure, CLARIN-DK, in order to enable the exchange of language resources nationally as well as internationally, in particular with other partners of CLARIN ERIC. The paper describes the task of converting an existing TEI specification applied to all the text resources deposited in DK-CLARIN. During the task we have tried to reuse and share CMDI profiles and components in the CLARIN Component Registry, as well as linking the CMDI components and elements to the relevant data categories in the ISOcat Data Category Registry. The conversion of the existing metadata into the CMDI format turned out not to be a trivial task and the experience and insights gained from this work have resulted in a proposal for a work flow for future use. We also present a core TEI header metadata set.

pdf bib
Comparative analysis of verbal alignment in human-human and human-agent interactions
Sabrina Campano | Jessica Durand | Chloé Clavel

Engagement is an important feature in human-human and human-agent interaction. In this paper, we investigate lexical alignment as a cue of engagement, relying on two different corpora : CID and SEMAINE. Our final goal is to build a virtual conversational character that could use alignment strategies to maintain user’s engagement. To do so, we investigate two alignment processes : shared vocabulary and other-repetitions. A quantitative and qualitative approach is proposed to characterize these aspects in human-human (CID) and human-operator (SEMAINE) interactions. Our results show that these processes are observable in both corpora, indicating a stable pattern that can be further modelled in conversational agents.

pdf bib
Transliteration and alignment of parallel texts from Cyrillic to Latin
Mircea Petic | Daniela Gîfu

This article describes a methodology of recovering and preservation of old Romanian texts and problems related to their recognition. Our focus is to create a gold corpus for Romanian language (the novella Sania), for both alphabets used in Transnistria ― Cyrillic and Latin. The resource is available for similar researches. This technology is based on transliteration and semiautomatic alignment of parallel texts at the level of letter/lexem/multiwords. We have analysed every text segment present in this corpus and discovered other conventions of writing at the level of transliteration, academic norms and editorial interventions. These conventions allowed us to elaborate and implement some new heuristics that make a correct automatic transliteration process. Sometimes the words of Latin script are modified in Cyrillic script from semantic reasons (for instance, editor’s interpretation). Semantic transliteration is seen as a good practice in introducing multiwords from Cyrillic to Latin. Not only does it preserve how a multiwords sound in the source script, but also enables the translator to modify in the original text (here, choosing the most common sense of an expression). Such a technology could be of interest to lexicographers, but also to specialists in computational linguistics to improve the actual transliteration standards.

pdf bib
How to Tell a Schneemann from a Milchmann: An Annotation Scheme for Compound-Internal Relations
Corina Dima | Verena Henrich | Erhard Hinrichs | Christina Hoppermann

This paper presents a language-independent annotation scheme for the semantic relations that link the constituents of noun-noun compounds, such as Schneemann ‘snow man’ or Milchmann ‘milk man’. The annotation scheme is hybrid in the sense that it assigns each compound a two-place label consisting of a semantic property and a prepositional paraphrase. The resulting inventory combines the insights of previous annotation schemes that rely exclusively on either semantic properties or prepositions, thus avoiding the known weaknesses that result from using only one of the two label types. The proposed annotation scheme has been used to annotate a set of 5112 German noun-noun compounds. A release of the dataset is currently being prepared and will be made available via the CLARIN Center Tübingen. In addition to the presentation of the hybrid annotation scheme, the paper also reports on an inter-annotator agreement study that has resulted in a substantial agreement among annotators.

pdf bib
Can Numerical Expressions Be Simpler? Implementation and Demostration of a Numerical Simplification System for Spanish
Susana Bautista | Horacio Saggion

Information in newspapers is often showed in the form of numerical expressions which present comprehension problems for many people, including people with disabilities, illiteracy or lack of access to advanced technology. The purpose of this paper is to motivate, describe, and demonstrate a rule-based lexical component that simplifies numerical expressions in Spanish texts. We propose an approach that makes news articles more accessible to certain readers by rewriting difficult numerical expressions in a simpler way. We will showcase the numerical simplification system with a live demo based on the execution of our components over different texts, and which will consider both successful and unsuccessful simplification cases.

pdf bib
4FX: Light Verb Constructions in a Multilingual Parallel Corpus
Anita Rácz | István Nagy T. | Veronika Vincze

In this paper, we describe 4FX, a quadrilingual (English-Spanish-German-Hungarian) parallel corpus annotated for light verb constructions. We present the annotation process, and report statistical data on the frequency of LVCs in each language. We also offer inter-annotator agreement rates and we highlight some interesting facts and tendencies on the basis of comparing multilingual data from the four corpora. According to the frequency of LVC categories and the calculated Kendall’s coefficient for the four corpora, we found that Spanish and German are very similar to each other, Hungarian is also similar to both, but German differs from all these three. The qualitative and quantitative data analysis might prove useful in theoretical linguistic research for all the four languages. Moreover, the corpus will be an excellent testbed for the development and evaluation of machine learning based methods aiming at extracting or identifying light verb constructions in these four languages.

pdf bib
The eIdentity Text Exploration Workbench
Fritz Kliche | André Blessing | Ulrich Heid | Jonathan Sonntag

We work on tools to explore text contents and metadata of newspaper articles as provided by news archives. Our tool components are being integrated into an “Exploration Workbench” for Digital Humanities researchers. Next to the conversion of different data formats and character encodings, a prominent feature of our design is its “Wizard” function for corpus building: Researchers import raw data and define patterns to extract text contents and metadata. The Workbench also comprises different tools for data cleaning. These include filtering of off-topic articles, duplicates and near-duplicates, corrupted and empty articles. We currently work on ca. 860.000 newspaper articles from different media archives, provided in different data formats. We index the data with state-of-the-art systems to allow for large scale information retrieval. We extract metadata on publishing dates, author names, newspaper sections, etc., and split articles into segments such as headlines, subtitles, paragraphs, etc. After cleaning the data and compiling a thematically homogeneous corpus, the sample can be used for quantitative analyses which are not affected by noise. Users can retrieve sets of articles on different topics, issues or otherwise defined research questions (“subcorpora”) and investigate quantitatively their media attention on the timeline (“Issue Cycles”).

pdf bib
Emilya: Emotional body expression in daily actions database
Nesrine Fourati | Catherine Pelachaud

The studies of bodily expression of emotion have been so far mostly focused on body movement patterns associated with emotional expression. Recently, there is an increasing interest on the expression of emotion in daily actions, called also non-emblematic movements (such as walking or knocking at the door). Previous studies were based on database limited to a small range of movement tasks or emotional states. In this paper, we describe our new database of emotional body expression in daily actions, where 11 actors express 8 emotions in 7 actions. We use motion capture technology to record body movements, but we recorded as well synchronized audio-visual data to enlarge the use of the database for different research purposes. We investigate also the matching between the expressed emotions and the perceived ones through a perceptive study. The first results of this study are discussed in this paper.

pdf bib
Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging
Kareem Darwish | Ahmed Abdelali | Hamdy Mubarak

This paper presents an end-to-end automatic processing system for Arabic. The system performs: correction of common spelling errors pertaining to different forms of alef, ta marbouta and ha, and alef maqsoura and ya; context sensitive word segmentation into underlying clitics, POS tagging, and gender and number tagging of nouns and adjectives. We introduce the use of stem templates as a feature to improve POS tagging by 0.5\% and to help ascertain the gender and number of nouns and adjectives. For gender and number tagging, we report accuracies that are significantly higher on previously unseen words compared to a state-of-the-art system.

pdf bib
Construction of Diachronic Ontologies from People’s Daily of Fifty Years
Shaoda He | Xiaojun Zou | Liumingjing Xiao | Junfeng Hu

This paper presents an Ontology Learning From Text (OLFT) method follows the well-known OLFT cake layer framework. Based on the distributional similarity, the proposed method generates multi-level ontologies from comparatively small corpora with the aid of HITS algorithm. Currently, this method covers terms extraction, synonyms recognition, concepts discovery and concepts hierarchical clustering. Among them, both concepts discovery and concepts hierarchical clustering are aided by the HITS authority, which is obtained from the HITS algorithm by an iteratively recommended way. With this method, a set of diachronic ontologies is constructed for each year based on People’s Daily corpora of fifty years (i.e., from 1947 to 1996). Preliminary experiments show that our algorithm outperforms the Google’s RNN and K-means based algorithm in both concepts discovery and concepts hierarchical clustering.

pdf bib
ROOTS: a toolkit for easy, fast and consistent processing of large sequential annotated data collections
Jonathan Chevelu | Gwénolé Lecorvé | Damien Lolive

The development of new methods for given speech and natural language processing tasks usually consists in annotating large corpora of data before applying machine learning techniques to train models or to extract information. Beyond scientific aspects, creating and managing such annotated data sets is a recurrent problem. While using human annotators is obviously expensive in time and money, relying on automatic annotation processes is not a simple solution neither. Typically, the high diversity of annotation tools and of data formats, as well as the lack of efficient middleware to interface them all together, make such processes very complex and painful to design. To circumvent this problem, this paper presents the toolkit ROOTS, a freshly released open source toolkit ( for easy, fast and consistent management of heterogeneously annotated data. ROOTS is designed to efficiently handle massive complex sequential data and to allow quick and light prototyping, as this is often required for research purposes. To illustrate these properties, three sample applications are presented in the field of speech and language processing, though ROOTS can more generally be easily extended to other application domains.

pdf bib
Computer-Aided Quality Assurance of an Icelandic Pronunciation Dictionary
Martin Jansche

We propose a model-driven method for ensuring the quality of pronunciation dictionaries. The key ingredient is computing an alignment between letter strings and phoneme strings, a standard technique in pronunciation modeling. The novel aspect of our method is the use of informative, parametric alignment models which are refined iteratively as they are tested against the data. We discuss the use of alignment failures as a signal for detecting and correcting problematic dictionary entries. We illustrate this method using an existing pronunciation dictionary for Icelandic. Our method is completely general and has been applied in the construction of pronunciation dictionaries for commercially deployed speech recognition systems in several languages.

pdf bib
Disambiguating Verbs by Collocation: Corpus Lexicography meets Natural Language Processing
Ismail El Maarouf | Jane Bradbury | Vít Baisa | Patrick Hanks

This paper reports the results of Natural Language Processing (NLP) experiments in semantic parsing, based on a new semantic resource, the Pattern Dictionary of English Verbs (PDEV) (Hanks, 2013). This work is set in the DVC (Disambiguating Verbs by Collocation) project , a project in Corpus Lexicography aimed at expanding PDEV to a large scale. This project springs from a long-term collaboration of lexicographers with computer scientists which has given rise to the design and maintenance of specific, adapted, and user-friendly editing and exploration tools. Particular attention is drawn on the use of NLP deep semantic methods to help in data processing. Possible contributions of NLP include pattern disambiguation, the focus of this article. The present article explains how PDEV differs from other lexical resources and describes its structure in detail. It also presents new classification experiments on a subset of 25 verbs. The SVM model obtained a micro-average F1 score of 0.81.

pdf bib
Automatic Error Detection concerning the Definite and Indefinite Conjugation in the HunLearner Corpus
Veronika Vincze | János Zsibrita | Péter Durst | Martina Katalin Szabó

In this paper we present the results of automatic error detection, concerning the definite and indefinite conjugation in the extended version of the HunLearner corpus, the learners’ corpus of the Hungarian language. We present the most typical structures that trigger definite or indefinite conjugation in Hungarian and we also discuss the most frequent types of errors made by language learners in the corpus texts. We also illustrate the error types with sentences taken from the corpus. Our results highlight grammatical structures that might pose problems for learners of Hungarian, which can be fruitfully applied in the teaching and practicing of such constructions from the language teacher’s or learners’ point of view. On the other hand, these results may be exploited in extending the functionalities of a grammar checker, concerning the definiteness of the verb. Our automatic system was able to achieve perfect recall, i.e. it could find all the mismatches between the type of the object and the conjugation of the verb, which is promising for future studies in this area.

pdf bib
Speech-Based Emotion Recognition: Feature Selection by Self-Adaptive Multi-Criteria Genetic Algorithm
Maxim Sidorov | Christina Brester | Wolfgang Minker | Eugene Semenkin

Automated emotion recognition has a number of applications in Interactive Voice Response systems, call centers, etc. While employing existing feature sets and methods for automated emotion recognition has already achieved reasonable results, there is still a lot to do for improvement. Meanwhile, an optimal feature set, which should be used to represent speech signals for performing speech-based emotion recognition techniques, is still an open question. In our research, we tried to figure out the most essential features with self-adaptive multi-objective genetic algorithm as a feature selection technique and a probabilistic neural network as a classifier. The proposed approach was evaluated using a number of multi-languages databases (English, German), which were represented by 37- and 384-dimensional feature sets. According to the obtained results, the developed technique allows to increase the emotion recognition performance by up to 26.08 relative improvement in accuracy. Moreover, emotion recognition performance scores for all applied databases are improved.

pdf bib
Constructing and exploiting an automatically annotated resource of legislative texts
Stefan Höfler | Kyoko Sugisaki

In this paper, we report on the construction of a resource of Swiss legislative texts that is automatically annotated with structural, morphosyntactic and content-related information, and we discuss the exploitation of this resource for the purposes of legislative drafting, legal linguistics and translation and for the evaluation of legislation. Our resource is based on the classified compilation of Swiss federal legislation. All texts contained in the classified compilation exist in German, French and Italian, some of them are also available in Romansh and English. Our resource is currently being exploited (a) as a testing environment for developing methods of automated style checking for legislative drafts, (b) as the basis of a statistical multilingual word concordance, and (c) for the empirical evaluation of legislation. The paper describes the domain- and language-specific procedures that we have implemented to provide the automatic annotations needed for these applications.

pdf bib
GenitivDB — a Corpus-Generated Database for German Genitive Classification
Roman Schneider

We present a novel NLP resource for the explanation of linguistic phenomena, built and evaluated exploring very large annotated language corpora. For the compilation, we use the German Reference Corpus (DeReKo) with more than 5 billion word forms, which is the largest linguistic resource worldwide for the study of contemporary written German. The result is a comprehensive database of German genitive formations, enriched with a broad range of intra- und extralinguistic metadata. It can be used for the notoriously controversial classification and prediction of genitive endings (short endings, long endings, zero-marker). We also evaluate the main factors influencing the use of specific endings. To get a general idea about a factor’s influences and its side effects, we calculate chi-square-tests and visualize the residuals with an association plot. The results are evaluated against a gold standard by implementing tree-based machine learning algorithms. For the statistical analysis, we applied the supervised LMT Logistic Model Trees algorithm, using the WEKA software. We intend to use this gold standard to evaluate GenitivDB, as well as to explore methodologies for a predictive genitive model.

pdf bib
Resources in Conflict: A Bilingual Valency Lexicon vs. a Bilingual Treebank vs. a Linguistic Theory
Jana Šindlerová | Zdeňka Urešová | Eva Fucikova

In this paper, we would like to exemplify how a syntactically annotated bilingual treebank can help us in exploring and revising a developed linguistic theory. On the material of the Prague Czech-English Dependency Treebank we observe sentences in which an Addressee argument in one language is linked translationally to a Patient argument in the other one, and make generalizations about the theoretical grounds of the argument non-correspondences and its relations to the valency theory beyond the annotation practice. Exploring verbs of three semantic classes (Judgement verbs, Teaching verbs and Attempt Suasion verbs) we claim that the Functional Generative Description argument labelling is highly dependent on the morphosyntactic realization of the individual participants, which then results in valency frame differences. Nevertheless, most of the differences can be overcome without substantial changes to the linguistic theory itself.

pdf bib
The NewSoMe Corpus: A Unifying Opinion Annotation Framework across Genres and in Multiple Languages
Roser Saurí | Judith Domingo | Toni Badia

We present the NewSoMe (News and Social Media) Corpus, a set of subcorpora with annotations on opinion expressions across genres (news reports, blogs, product reviews and tweets) and covering multiple languages (English, Spanish, Catalan and Portuguese). NewSoMe is the result of an effort to increase the opinion corpus resources available in languages other than English, and to build a unifying annotation framework for analyzing opinion in different genres, including controlled text, such as news reports, as well as different types of user generated contents (UGC). Given the broad design of the resource, most of the annotation effort were carried out resorting to crowdsourcing platforms: Amazon Mechanical Turk and CrowdFlower. This created an excellent opportunity to research on the feasibility of crowdsourcing methods for annotating big amounts of text in different languages.

pdf bib
Buy one get one free: Distant annotation of Chinese tense, event type and modality
Nianwen Xue | Yuchen Zhang

We describe a “distant annotation” method where we mark up the semantic tense, event type, and modality of Chinese events via a word-aligned parallel corpus. We first map Chinese verbs to their English counterparts via word alignment, and then annotate the resulting English text spans with coarse-grained categories for semantic tense, event type, and modality that we believe apply to both English and Chinese. Because English has richer morpho-syntactic indicators for semantic tense, event type and modality than Chinese, our intuition is that this distant annotation approach will yield more consistent annotation than if we annotate the Chinese side directly. We report experimental results that show stable annotation agreement statistics and that event type and modality have significant influence on tense prediction. We also report the size of the annotated corpus that we have obtained, and how different domains impact annotation consistency.

pdf bib
Multimodal dialogue segmentation with gesture post-processing
Kodai Takahashi | Masashi Inoue

We investigate an automatic dialogue segmentation method using both verbal and non-verbal modalities. Dialogue contents are used for the initial segmentation of dialogue; then, gesture occurrences are used to remove the incorrect segment boundaries. A unique characteristic of our method is to use verbal and non-verbal information separately. We use a three-party dialogue that is rich in gesture as data. The transcription of the dialogue is segmented into topics without prior training by using the TextTiling and U00 algorithm. Some candidates for segment boundaries - where the topic continues - are irrelevant. Those boundaries can be found and removed by locating gestures that stretch over the boundary candidates. This ltering improves the segmentation accuracy of text-only segmentation.

pdf bib
The Dangerous Myth of the Star System
André Bittar | Luca Dini | Sigrid Maurel | Mathieu Ruhlmann

In recent years we have observed two parallel trends in computational linguistics research and e-commerce development. On the research side, there has been an increasing interest in algorithms and approaches that are able to capture the polarity of opinions expressed by users on products, institutions and services. On the other hand, almost all big e-commerce and aggregator sites are by now providing users the possibility of writing comments and expressing their appreciation with a numeric score (usually represented as a number of stars). This generates the impression that the work carried out in the research community is made partially useless (at least for economic exploitation) by an evolution in web practices. In this paper we describe an experiment on a large corpus which shows that the score judgments provided by users are often conflicting with the text contained in the opinion, and to such a point that a rule-based opinion mining system can be demonstrated to perform better than the users themselves in ranking their opinions.

pdf bib
Comparison of the Impact of Word Segmentation on Name Tagging for Chinese and Japanese
Haibo Li | Masato Hagiwara | Qi Li | Heng Ji

Word Segmentation is usually considered an essential step for many Chinese and Japanese Natural Language Processing tasks, such as name tagging. This paper presents several new observations and analysis on the impact of word segmentation on name tagging; (1). Due to the limitation of current state-of-the-art Chinese word segmentation performance, a character-based name tagger can outperform its word-based counterparts for Chinese but not for Japanese; (2). It is crucial to keep segmentation settings (e.g. definitions, specifications, methods) consistent between training and testing for name tagging; (3). As long as (2) is ensured, the performance of word segmentation does not have appreciable impact on Chinese and Japanese name tagging.

pdf bib
CoRoLa — The Reference Corpus of Contemporary Romanian Language
Verginica Barbu Mititelu | Elena Irimia | Dan Tufiș

We present the project of creating CoRoLa, a reference corpus of contemporary Romanian (from 1945 onwards). In the international context, the project finds its place among the initiatives of gathering huge collections of texts, of pre-processing and annotating them at several levels, and also of documenting them with metadata (CMDI). Our project is a joined effort of two institutes of the Romanian Academy. We foresee a corpus of more than 500 million word forms, covering all functional styles of the language. Although the vast majority of texts will be in written form, we target about 300 hours of oral texts, too, obligatorily with associated transcripts. Most of the texts will be from books, while the rest will be harvested from newspapers, booklets, technical reports, etc. The pre-processing includes cleaning the data and harmonising the diacritics, sentence splitting and tokenization. Annotation will be done at a morphological level in a first stage, followed by lemmatization, with the possibility of adding syntactic, semantic and discourse annotation in a later stage. A core of CoRoLa is described in the article. The target users of our corpus will be researchers in linguistics and language processing, teachers of Romanian, students.

pdf bib
Building a reference lexicon for countability in English
Tibor Kiss | Francis Jeffry Pelletier | Tobias Stadtfeld

The present paper describes the construction of a resource to determine the lexical preference class of a large number of English noun-senses ($\approx$ 14,000) with respect to the distinction between mass and count interpretations. In constructing the lexicon, we have employed a questionnaire-based approach based on existing resources such as the Open ANC (\url{}) and WordNet \cite{Miller95}. The questionnaire requires annotators to answer six questions about a noun-sense pair. Depending on the answers, a given noun-sense pair can be assigned to fine-grained noun classes, spanning the area between count and mass. The reference lexicon contains almost 14,000 noun-sense pairs. An initial data set of 1,000 has been annotated together by four native speakers, while the remaining 12,800 noun-sense pairs have been annotated in parallel by two annotators each. We can confirm the general feasibility of the approach by reporting satisfactory values between 0.694 and 0.755 in inter-annotator agreement using Krippendorff’s $\alpha$.

pdf bib
The LIMA Multilingual Analyzer Made Free: FLOSS Resources Adaptation and Correction
Gaël de Chalendar

At CEA LIST, we have decided to release our multilingual analyzer LIMA as Free software. As we were not proprietary of all the language resources used we had to select and adapt free ones in order to attain results good enough and equivalent to those obtained with our previous ones. For English and French, we found and adapted a full-form dictionary and an annotated corpus for learning part-of-speech tagging models.

pdf bib
A SICK cure for the evaluation of compositional distributional semantic models
Marco Marelli | Stefano Menini | Marco Baroni | Luisa Bentivogli | Raffaella Bernardi | Roberto Zamparelli

Shared and internationally recognized benchmarks are fundamental for the development of any computational system. We aim to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowldedge), a large size English benchmark tailored for them. SICK consists of about 10,000 English sentence pairs that include many examples of the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but do not require dealing with other aspects of existing sentential data sets (idiomatic multiword expressions, named entities, telegraphic language) that are not within the scope of CDSMs. By means of crowdsourcing techniques, each pair was annotated for two crucial semantic tasks: relatedness in meaning (with a 5-point rating scale as gold score) and entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral). The SICK data set was used in SemEval-2014 Task 1, and it freely available for research purposes.

pdf bib
Twente Debate Corpus — A Multimodal Corpus for Head Movement Analysis
Bayu Rahayudi | Ronald Poppe | Dirk Heylen

This paper introduces a multimodal discussion corpus for the study into head movement and turn-taking patterns in debates. Given that participants either acted alone or in a pair, cooperation and competition and their nonverbal correlates can be analyzed. In addition to the video and audio of the recordings, the corpus contains automatically estimated head movements, and manual annotations of who is speaking and who is looking where. The corpus consists of over 2 hours of debates, in 6 groups with 18 participants in total. We describe the recording setup and present initial analyses of the recorded data. We found that the person who acted as single debater speaks more and also receives more attention compared to the other debaters, also when corrected for the time speaking. We also found that a single debater was more likely to speak after a team debater. Future work will be aimed at further analysis of the relation between speaking and looking patterns, the outcome of the debate and perceived dominance of the debaters.

pdf bib
The EASR Corpora of European Portuguese, French, Hungarian and Polish Elderly Speech
Annika Hämäläinen | Jairo Avelar | Silvia Rodrigues | Miguel Sales Dias | Artur Kolesiński | Tibor Fegyó | Géza Németh | Petra Csobánka | Karine Lan | David Hewson

Currently available speech recognisers do not usually work well with elderly speech. This is because several characteristics of speech (e.g. fundamental frequency, jitter, shimmer and harmonic noise ratio) change with age and because the acoustic models used by speech recognisers are typically trained with speech collected from younger adults only. To develop speech-driven applications capable of successfully recognising elderly speech, this type of speech data is needed for training acoustic models from scratch or for adapting acoustic models trained with younger adults’ speech. However, the availability of suitable elderly speech corpora is still very limited. This paper describes an ongoing project to design, collect, transcribe and annotate large elderly speech corpora for four European languages: Portuguese, French, Hungarian and Polish. The Portuguese, French and Polish corpora contain read speech only, whereas the Hungarian corpus also contains spontaneous command and control type of speech. Depending on the language in question, the corpora contain 76 to 205 hours of speech collected from 328 to 986 speakers aged 60 and over. The final corpora will come with manually verified orthographic transcriptions, as well as annotations for filled pauses, noises and damaged words.

pdf bib
Sharing Cultural Heritage: the Clavius on the Web Project
Matteo Abrate | Angelo Mario Del Grosso | Emiliano Giovannetti | Angelica Lo Duca | Damiana Luzzi | Lorenzo Mancini | Andrea Marchetti | Irene Pedretti | Silvia Piccini

In the last few years the amount of manuscripts digitized and made available on the Web has been constantly increasing. However, there is still a considarable lack of results concerning both the explicitation of their content and the tools developed to make it available. The objective of the Clavius on the Web project is to develop a Web platform exposing a selection of Christophorus Clavius letters along with three different levels of analysis: linguistic, lexical and semantic. The multilayered annotation of the corpus involves a XML-TEI encoding followed by a tokenization step where each token is univocally identified through a CTS urn notation and then associated to a part-of-speech and a lemma. The text is lexically and semantically annotated on the basis of a lexicon and a domain ontology, the former structuring the most relevant terms occurring in the text and the latter representing the domain entities of interest (e.g. people, places, etc.). Moreover, each entity is connected to linked and non linked resources, including DBpedia and VIAF. Finally, the results of the three layers of analysis are gathered and shown through interactive visualization and storytelling techniques. A demo version of the integrated architecture was developed.

pdf bib
3D Face Tracking and Multi-Scale, Spatio-temporal Analysis of Linguistically Significant Facial Expressions and Head Positions in ASL
Bo Liu | Jingjing Liu | Xiang Yu | Dimitris Metaxas | Carol Neidle

Essential grammatical information is conveyed in signed languages by clusters of events involving facial expressions and movements of the head and upper body. This poses a significant challenge for computer-based sign language recognition. Here, we present new methods for the recognition of nonmanual grammatical markers in American Sign Language (ASL) based on: (1) new 3D tracking methods for the estimation of 3D head pose and facial expressions to determine the relevant low-level features; (2) methods for higher-level analysis of component events (raised/lowered eyebrows, periodic head nods and head shakes) used in grammatical markings―with differentiation of temporal phases (onset, core, offset, where appropriate), analysis of their characteristic properties, and extraction of corresponding features; (3) a 2-level learning framework to combine low- and high-level features of differing spatio-temporal scales. This new approach achieves significantly better tracking and recognition results than our previous methods.

pdf bib
Exploring factors that contribute to successful fingerspelling comprehension
Leah Geer | Jonathan Keane

Using a novel approach, we examine which cues in a fingerspelling stream, namely holds or transitions, allow for more successful comprehension by students learning American Sign Language (ASL). Sixteen university-level ASL students participated in this study. They were shown video clips of a native signer fingerspelling common English words. Clips were modified in the following ways: all were slowed down to half speed, one-third of the clips were modified to black out the transition portion of the fingerspelling stream, and one-third modified to have holds blacked out. The remaining third of clips were free of blacked out portions, which we used to establish a baseline of comprehension. Research by Wilcox (1992), among others, suggested that transitions provide more rich information, and thus items with the holds blacked out should be easier to comprehend than items with the transitions blacked out. This was not found to be the case here. Students achieved higher comprehension scores when hold information was provided. Data from this project can be used to design training tools to help students become more proficient at fingerspelling comprehension, a skill with which most students struggle.

pdf bib
The DARE Corpus: A Resource for Anaphora Resolution in Dialogue Based Intelligent Tutoring Systems
Nobal Niraula | Vasile Rus | Rajendra Banjade | Dan Stefanescu | William Baggett | Brent Morgan

We describe the DARE corpus, an annotated data set focusing on pronoun resolution in tutorial dialogue. Although data sets for general purpose anaphora resolution exist, they are not suitable for dialogue based Intelligent Tutoring Systems. To the best of our knowledge, no data set is currently available for pronoun resolution in dialogue based intelligent tutoring systems. The described DARE corpus consists of 1,000 annotated pronoun instances collected from conversations between high-school students and the intelligent tutoring system DeepTutor. The data set is publicly available.

pdf bib
On the annotation of TMX translation memories for advanced leveraging in computer-aided translation
Mikel Forcada

The term advanced leveraging refers to extensions beyond the current usage of translation memory (TM) in computer-aided translation (CAT). One of these extensions is the ability to identify and use matches on the sub-segment level ― for instance, using sub-sentential elements when segments are sentences― to help the translator when a reasonable fuzzy-matched proposal is not available; some such functionalities have started to become available in commercial CAT tools. Resources such as statistical word aligners, external machine translation systems, glossaries and term bases could be used to identify and annotate segment-level translation units at the sub-segment level, but there is currently no single, agreed standard supporting the interchange of sub-segmental annotation of translation memories to create a richer translation resource. This paper discusses the capabilities and limitations of some current standards, envisages possible alternatives, and ends with a tentative proposal which slightly abuses (repurposes) the usage of existing elements in the TMX standard.

pdf bib
The D-ANS corpus: the Dublin-Autonomous Nervous System corpus of biosignal and multimodal recordings of conversational speech
Shannon Hennig | Ryad Chellali | Nick Campbell

Biosignals, such as electrodermal activity (EDA) and heart rate, are increasingly being considered as potential data sources to provide information about the temporal fluctuations in affective experience during human interaction. This paper describes an English-speaking, multiple session corpus of small groups of people engaged in informal, unscripted conversation while wearing wireless, wrist-based EDA sensors. Additionally, one participant per recording session wore a heart rate monitor. This corpus was collected in order to observe potential interactions between various social and communicative phenomena and the temporal dynamics of the recorded biosignals. Here we describe the communicative context, technical set-up, synchronization process, and challenges in collecting and utilizing such data. We describe the segmentation and annotations to date, including laughter annotations, and how the research community can access and collaborate on this corpus now and in the future. We believe this corpus is particularly relevant to researchers interested in unscripted social conversation as well as to researchers with a specific interest in observing the dynamics of biosignals during informal social conversation rich with examples of laughter, conversational turn-taking, and non-task-based interaction.

pdf bib
Annotating the MASC Corpus with BabelNet
Andrea Moro | Roberto Navigli | Francesco Maria Tucci | Rebecca J. Passonneau

In this paper we tackle the problem of automatically annotating, with both word senses and named entities, the MASC 3.0 corpus, a large English corpus covering a wide range of genres of written and spoken text. We use BabelNet 2.0, a multilingual semantic network which integrates both lexicographic and encyclopedic knowledge, as our sense/entity inventory together with its semantic structure, to perform the aforementioned annotation task. Word sense annotated corpora have been around for more than twenty years, helping the development of Word Sense Disambiguation algorithms by providing both training and testing grounds. More recently Entity Linking has followed the same path, with the creation of huge resources containing annotated named entities. However, to date, there has been no resource that contains both kinds of annotation. In this paper we present an automatic approach for performing this annotation, together with its output on the MASC corpus. We use this corpus because its goal of integrating different types of annotations goes exactly in our same direction. Our overall aim is to stimulate research on the joint exploitation and disambiguation of word senses and named entities. Finally, we estimate the quality of our annotations using both manually-tagged named entities and word senses, obtaining an accuracy of roughly 70% for both named entities and word sense annotations.

pdf bib
All Fragments Count in Parser Evaluation
Jasmijn Bastings | Khalil Sima’an

PARSEVAL, the default paradigm for evaluating constituency parsers, calculates parsing success (Precision/Recall) as a function of the number of matching labeled brackets across the test set. Nodes in constituency trees, however, are connected together to reflect important linguistic relations such as predicate-argument and direct-dominance relations between categories. In this paper, we present FREVAL, a generalization of PARSEVAL, where the precision and recall are calculated not only for individual brackets, but also for co-occurring, connected brackets (i.e. fragments). FREVAL fragments precision (FLP) and recall (FLR) interpolate the match across the whole spectrum of fragment sizes ranging from those consisting of individual nodes (labeled brackets) to those consisting of full parse trees. We provide evidence that FREVAL is informative for inspecting relative parser performance by comparing a range of existing parsers.

pdf bib
TexAFon 2.0: A text processing tool for the generation of expressive speech in TTS applications
Juan María Garrido | Yesika Laplaza | Benjamin Kolz | Miquel Cornudella

This paper presents TexAfon 2.0, an improved version of the text processing tool TexAFon, specially oriented to the generation of synthetic speech with expressive content. TexAFon is a text processing module in Catalan and Spanish for TTS systems, which performs all the typical tasks needed for the generation of synthetic speech from text: sentence detection, pre-processing, phonetic transcription, syllabication, prosodic segmentation and stress prediction. These improvements include a new normalisation module for the standardisation on chat text in Spanish, a module for the detection of the expressed emotions in the input text, and a module for the automatic detection of the intended speech acts, which are briefly described in the paper. The results of the evaluations carried out for each module are also presented.

pdf bib
A Persian Treebank with Stanford Typed Dependencies
Mojgan Seraji | Carina Jahani | Beáta Megyesi | Joakim Nivre

We present the Uppsala Persian Dependency Treebank (UPDT) with a syntactic annotation scheme based on Stanford Typed Dependencies. The treebank consists of 6,000 sentences and 151,671 tokens with an average sentence length of 25 words. The data is from different genres, including newspaper articles and fiction, as well as technical descriptions and texts about culture and art, taken from the open source Uppsala Persian Corpus (UPC). The syntactic annotation scheme is extended for Persian to include all syntactic relations that could not be covered by the primary scheme developed for English. In addition, we present open source tools for automatic analysis of Persian containing a text normalizer, a sentence segmenter and tokenizer, a part-of-speech tagger, and a parser. The treebank and the parser have been developed simultaneously in a bootstrapping procedure. The result of a parsing experiment shows an overall labeled attachment score of 82.05% and an unlabeled attachment score of 85.29%. The treebank is freely available as an open source resource.

pdf bib
A Language-independent Approach to Extracting Derivational Relations from an Inflectional Lexicon
Marion Baranes | Benoît Sagot

In this paper, we describe and evaluate an unsupervised method for acquiring pairs of lexical entries belonging to the same morphological family, i.e., derivationally related words, starting from a purely inflectional lexicon. Our approach relies on transformation rules that relate lexical entries with the one another, and which are automatically extracted from the inflected lexicon based on surface form analogies and on part-of-speech information. It is generic enough to be applied to any language with a mainly concatenative derivational morphology. Results were obtained and evaluated on English, French, German and Spanish. Precision results are satisfying, and our French results favorably compare with another resource, although its construction relied on manually developed lexicographic information whereas our approach only requires an inflectional lexicon.

pdf bib
Named Entity Recognition on Turkish Tweets
Dilek Küçük | Guillaume Jacquet | Ralf Steinberger

Various recent studies show that the performance of named entity recognition (NER) systems developed for well-formed text types drops significantly when applied to tweets. The only existing study for the highly inflected agglutinative language Turkish reports a drop in F-Measure from 91% to 19% when ported from news articles to tweets. In this study, we present a new named entity-annotated tweet corpus and a detailed analysis of the various tweet-specific linguistic phenomena. We perform comparative NER experiments with a rule-based multilingual NER system adapted to Turkish on three corpora: a news corpus, our new tweet corpus, and another tweet corpus. Based on the analysis and the experimentation results, we suggest system features required to improve NER results for social media like Twitter.

pdf bib
Rhapsodie: a Prosodic-Syntactic Treebank for Spoken French
Anne Lacheret | Sylvain Kahane | Julie Beliao | Anne Dister | Kim Gerdes | Jean-Philippe Goldman | Nicolas Obin | Paola Pietrandrea | Atanas Tchobanov

The main objective of the Rhapsodie project (ANR Rhapsodie 07 Corp-030-01) was to define rich, explicit, and reproducible schemes for the annotation of prosody and syntax in different genres (± spontaneous, ± planned, face-to-face interviews vs. broadcast, etc.), in order to study the prosody/syntax/discourse interface in spoken French, and their roles in the segmentation of speech into discourse units (Lacheret, Kahane, & Pietrandrea forthcoming). We here describe the deliverable, a syntactic and prosodic treebank of spoken French, composed of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and 33000 words), orthographically and phonetically transcribed. The transcriptions and the annotations are all aligned on the speech signal: phonemes, syllables, words, speakers, overlaps. This resource is freely available at The sound samples (wav/mp3), the acoustic analysis (original F0 curve manually corrected and automatic stylized F0, pitch format), the orthographic transcriptions (txt), the microsyntactic annotations (tabular format), the macrosyntactic annotations (txt, tabular format), the prosodic annotations (xml, textgrid, tabular format), and the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France. The metadata are encoded in the IMDI-CMFI format and can be parsed on line.

pdf bib
The IULA Spanish LSP Treebank
Montserrat Marimon | Núria Bel | Beatriz Fisas | Blanca Arias | Silvia Vázquez | Jorge Vivaldi | Carlos Morell | Mercè Lorente

This paper presents the IULA Spanish LSP Treebank, a dependency treebank of over 41,000 sentences of different domains (Law, Economy, Computing Science, Environment, and Medicine), developed in the framework of the European project METANET4U. Dependency annotations in the treebank were automatically derived from manually selected parses produced by an HPSG-grammar by a deterministic conversion algorithm that used the identifiers of grammar rules to identify the heads, the dependents, and some dependency types that were directly transferred onto the dependency structure (e.g., subject, specifier, and modifier), and the identifiers of the lexical entries to identify the argument-related dependency functions (e.g. direct object, indirect object, and oblique complement). The treebank is accessible with a browser that provides concordance-based search functions and delivers the results in two formats: (i) a column-based format, in the style of CoNLL-2006 shared task, and (ii) a dependency graph, where dependency relations are noted by an oriented arrow which goes from the dependent node to the head node. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at surface syntactic level following the dependency grammar theory. The treebank has been made publicly and freely available from the META-SHARE platform with a Creative Commons CC-by licence.

pdf bib
Morpho-Syntactic Study of Errors from Speech Recognition System
Maria Goryainova | Cyril Grouin | Sophie Rosset | Ioana Vasilescu

The study provides an original standpoint of the speech transcription errors by focusing on the morpho-syntactic features of the erroneous chunks and of the surrounding left and right context. The typology concerns the forms, the lemmas and the POS involved in erroneous chunks, and in the surrounding contexts. Comparison with error free contexts are also provided. The study is conducted on French. Morpho-syntactic analysis underlines that three main classes are particularly represented in the erroneous chunks: (i) grammatical words (to, of, the), (ii) auxiliary verbs (has, is), and (iii) modal verbs (should, must). Such items are widely encountered in the ASR outputs as frequent candidates to transcription errors. The analysis of the context points out that some left 3-grams contexts (e.g., repetitions, that is disfluencies, bracketing formulas such as “c’est”, etc.) may be better predictors than others. Finally, the surface analysis conducted through a Levensthein distance analysis, highlighted that the most common distance is of 2 characters and mainly involves differences between inflected forms of a unique item.

pdf bib
Not an Interlingua, But Close: Comparison of English AMRs to Chinese and Czech
Nianwen Xue | Ondřej Bojar | Jan Hajič | Martha Palmer | Zdeňka Urešová | Xiuhong Zhang

Abstract Meaning Representations (AMRs) are rooted, directional and labeled graphs that abstract away from morpho-syntactic idiosyncrasies such as word category (verbs and nouns), word order, and function words (determiners, some prepositions). Because these syntactic idiosyncrasies account for many of the cross-lingual differences, it would be interesting to see if this representation can serve, e.g., as a useful, minimally divergent transfer layer in machine translation. To answer this question, we have translated 100 English sentences that have existing AMRs into Chinese and Czech to create AMRs for them. A cross-linguistic comparison of English to Chinese and Czech AMRs reveals both cases where the AMRs for the language pairs align well structurally and cases of linguistic divergence. We found that the level of compatibility of AMR between English and Chinese is higher than between English and Czech. We believe this kind of comparison is beneficial to further refining the annotation standards for each of the three languages and will lead to more compatible annotation guidelines between the languages.

pdf bib
Annotating Relations in Scientific Articles
Adam Meyers | Giancarlo Lee | Angus Grieve-Smith | Yifan He | Harriet Taber

Relations (ABBREVIATE, EXEMPLIFY, ORIGINATE, REL_WORK, OPINION) between entities (citations, jargon, people, organizations) are annotated for PubMed scientific articles. We discuss our specifications, pre-processing and evaluation

pdf bib
Annotating Clinical Events in Text Snippets for Phenotype Detection
Prescott Klassen | Fei Xia | Lucy Vanderwende | Meliha Yetisgen

Early detection and treatment of diseases that onset after a patient is admitted to a hospital, such as pneumonia, is critical to improving and reducing costs in healthcare. NLP systems that analyze the narrative data embedded in clinical artifacts such as x-ray reports can help support early detection. In this paper, we consider the importance of identifying the change of state for events - in particular, clinical events that measure and compare the multiple states of a patient’s health across time. We propose a schema for event annotation comprised of five fields and create preliminary annotation guidelines for annotators to apply the schema. We then train annotators, measure their performance, and finalize our guidelines. With the complete guidelines, we then annotate a corpus of snippets extracted from chest x-ray reports in order to integrate the annotations as a new source of features for classification tasks.

pdf bib
Phoneme Similarity Matrices to Improve Long Audio Alignment for Automatic Subtitling
Pablo Ruiz | Aitor Álvarez | Haritz Arzelus

Long audio alignment systems for Spanish and English are presented, within an automatic subtitling application. Language-specific phone decoders automatically recognize audio contents at phoneme level. At the same time, language-dependent grapheme-to-phoneme modules perform a transcription of the script for the audio. A dynamic programming algorithm (Hirschberg’s algorithm) finds matches between the phonemes automatically recognized by the phone decoder and the phonemes in the script’s transcription. Alignment accuracy is evaluated when scoring alignment operations with a baseline binary matrix, and when scoring alignment operations with several continuous-score matrices, based on phoneme similarity as assessed through comparing multivalued phonological features. Alignment accuracy results are reported at phoneme, word and subtitle level. Alignment accuracy when using the continuous scoring matrices based on phonological similarity was clearly higher than when using the baseline binary matrix.

pdf bib
Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports
Maria Evangelia Chatzimina | Cyril Grouin | Pierre Zweigenbaum

Unsupervised word classes induced from unannotated text corpora are increasingly used to help tasks addressed by supervised classification, such as standard named entity detection. This paper studies the contribution of unsupervised word classes to a medical entity detection task with two specific objectives: How do unsupervised word classes compare to available knowledge-based semantic classes? Does syntactic information help produce unsupervised word classes with better properties? We design and test two syntax-based methods to produce word classes: one applies the Brown clustering algorithm to syntactic dependencies, the other collects latent categories created by a PCFG-LA parser. When added to non-semantic features, knowledge-based semantic classes gain 7.28 points of F-measure. In the same context, basic unsupervised word classes gain 4.16pt, reaching 60% of the contribution of knowledge-based semantic classes and outperforming Wikipedia, and adding PCFG-LA unsupervised word classes gain one more point at 5.11pt, reaching 70%. Unsupervised word classes could therefore provide a useful semantic back-off in domains where no knowledge-based semantic classes are available. The combination of both knowledge-based and basic unsupervised classes gains 8.33pt. Therefore, unsupervised classes are still useful even when rich knowledge-based classes exist.

pdf bib
Three dimensions of the so-called “interoperability” of annotation schemes”
Eva Hajičová

“Interoperability” of annotation schemes is one of the key words in the discussions about annotation of corpora. In the present contribution, we propose to look at the so-called interoperability from (at least) three angles, namely (i) as a relation (and possible interaction or cooperation) of different annotation schemes for different layers or phenomena of a single language, (ii) the possibility to annotate different languages by a single (modified or not) annotation scheme, and (iii) the relation between different annotation schemes for a single language, or for a single phenomenon or layer of the same language. The pros and cons of each of these aspects are discussed as well as their contribution to linguistic studies and natural language processing. It is stressed that a communication and collaboration between different annotation schemes requires an explicit specification and consistency of each of the schemes.

pdf bib
On Complex Word Alignment Configurations
Miriam Kaeshammer | Anika Westburg

Resources of manual word alignments contain configurations that are beyond the alignment capacity of current translation models, hence the term complex alignment configuration. They have been the matter of some debate in the machine translation community, as they call for more powerful translation models that come with further complications. In this work we investigate instances of complex alignment configurations in data sets of four different language pairs to shed more light on the nature and cause of those configurations. For the English-German alignments from Padó and Lapata (2006), for instance, we find that only a small fraction of the complex configurations are due to real annotation errors. While a third of the complex configurations in this data set could be simplified when annotating according to a different style guide, the remaining ones are phenomena that one would like to be able to generate during translation. Those instances are mainly caused by the different word order of English and German. Our findings thus motivate further research in the area of translation beyond phrase-based and context-free translation modeling.

pdf bib
HFST-SweNER — A New NER Resource for Swedish
Dimitrios Kokkinakis | Jyrki Niemi | Sam Hardwick | Krister Lindén | Lars Borin

Named entity recognition (NER) is a knowledge-intensive information extraction task that is used for recognizing textual mentions of entities that belong to a predefined set of categories, such as locations, organizations and time expressions. NER is a challenging, difficult, yet essential preprocessing technology for many natural language processing applications, and particularly crucial for language understanding. NER has been actively explored in academia and in industry especially during the last years due to the advent of social media data. This paper describes the conversion, modeling and adaptation of a Swedish NER system from a hybrid environment, with integrated functionality from various processing components, to the Helsinki Finite-State Transducer Technology (HFST) platform. This new HFST-based NER (HFST-SweNER) is a full-fledged open source implementation that supports a variety of generic named entity types and consists of multiple, reusable resource layers, e.g., various n-gram-based named entity lists (gazetteers).

pdf bib
Building and Modelling Multilingual Subjective Corpora
Motaz Saad | David Langlois | Kamel Smaïli

Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such kind of corpora are rare. We consider opinions in this work as subjective or objective. In this paper, we introduce an annotation method that can be reliably transferred across topic domains and across languages. The method starts by building a classifier that annotates sentences into subjective/objective label using a training data from “movie reviews” domain which is in English language. The annotation can be transferred to another language by classifying English sentences in parallel corpora and transferring the same annotation to the same sentences of the other language. We also shed the light on the link between opinion mining and statistical language modelling, and how such corpora are useful for domain specific language modelling. We show the distinction between subjective and objective sentences which tends to be stable across domains and languages. Our experiments show that language models trained on objective (respectively subjective) corpus lead to better perplexities on objective (respectively subjective) test.

pdf bib
GRASS: the Graz corpus of Read And Spontaneous Speech
Barbara Schuppler | Martin Hagmueller | Juan A. Morales-Cordovilla | Hannes Pessentheiner

This paper provides a description of the preparation, the speakers, the recordings, and the creation of the orthographic transcriptions of the first large scale speech database for Austrian German. It contains approximately 1900 minutes of (read and spontaneous) speech produced by 38 speakers. The corpus consists of three components. First, the Conversation Speech (CS) component contains free conversations of one hour length between friends, colleagues, couples, or family members. Second, the Commands Component (CC) contains commands and keywords which were either read or elicited by pictures. Third, the Read Speech (RS) component contains phonetically balanced sentences and digits. The speech of all components has been recorded at super-wideband quality in a soundproof recording-studio with head-mounted microphones, large-diaphragm microphones, a laryngograph, and with a video camera. The orthographic transcriptions, which have been created and subsequently corrected manually, contain approximately 290 000 word tokens from 15 000 different word types.

pdf bib
Linguistic resources and cats: how to use ISOcat, RELcat and SCHEMAcat
Menzo Windhouwer | Ineke Schuurman

Within the European CLARIN infrastructure ISOcat is used to enable both humans and computer programs to find specific resources even when they use different terminology or data structures. In order to do so, it should be clear which concepts are used in these resources, both at the level of metadata for the resource as well as its content, and what is meant by them. The concepts can be specified in ISOcat. SCHEMAcat enables us to relate the concepts used by a resource, while RELcat enables to type these relationships and add relationships beyond resource boundaries. This way these three registries together allow us (and the programs) to find what we are looking for.

pdf bib
Bring vs. MTRoget: Evaluating automatic thesaurus translation
Lars Borin | Jens Allwood | Gerard de Melo

Evaluation of automatic language-independent methods for language technology resource creation is difficult, and confounded by a largely unknown quantity, viz. to what extent typological differences among languages are significant for results achieved for one language or language pair to be applicable across languages generally. In the work presented here, as a simplifying assumption, language-independence is taken as axiomatic within certain specified bounds. We evaluate the automatic translation of Roget’s “Thesaurus” from English into Swedish using an independently compiled Roget-style Swedish thesaurus, S.C. Bring’s “Swedish vocabulary arranged into conceptual classes” (1930). Our expectation is that this explicit evaluation of one of the thesaureses created in the MTRoget project will provide a good estimate of the quality of the other thesauruses created using similar methods.

pdf bib
Introducing a Framework for the Evaluation of Music Detection Tools
Paula Lopez-Otero | Laura Docio-Fernandez | Carmen Garcia-Mateo

The huge amount of multimedia information available nowadays makes its manual processing prohibitive, requiring tools for automatic labelling of these contents. This paper describes a framework for assessing a music detection tool; this framework consists of a database, composed of several hours of radio recordings that include different types of radio programmes, and a set of evaluation measures for evaluating the performance of a music detection tool in detail. A tool for automatically detecting music in audio streams, with application to music information retrieval tasks, is presented as well. The aim of this tool is to discard the audio excerpts that do not contain music in order to avoid their unnecessary processing. This tool applies fingerprinting to different acoustic features extracted from the audio signal in order to remove perceptual irrelevancies, and a support vector machine is trained for classifying these fingerprints in classes music and no-music. The validity of this tool is assessed in the proposed evaluation framework.

pdf bib
WordNet—Wikipedia—Wiktionary: Construction of a Three-way Alignment
Tristan Miller | Iryna Gurevych

The coverage and quality of conceptual information contained in lexical semantic resources is crucial for many tasks in natural language processing. Automatic alignment of complementary resources is one way of improving this coverage and quality; however, past attempts have always been between pairs of specific resources. In this paper we establish some set-theoretic conventions for describing concepts and their alignments, and use them to describe a method for automatically constructing n-way alignments from arbitrary pairwise alignments. We apply this technique to the production of a three-way alignment from previously published WordNet-Wikipedia and WordNet-Wiktionary alignments. We then present a quantitative and informal qualitative analysis of the aligned resource. The three-way alignment was found to have greater coverage, an enriched sense representation, and coarser sense granularity than both the original resources and their pairwise alignments, though this came at the cost of accuracy. An evaluation of the induced word sense clusters in a word sense disambiguation task showed that they were no better than random clusters of equivalent granularity. However, use of the alignments to enrich a sense inventory with additional sense glosses did significantly improve the performance of a baseline knowledge-based WSD algorithm.

pdf bib
Cross-linguistic annotation of narrativity for English/French verb tense disambiguation
Cristina Grisot | Thomas Meyer

This paper presents manual and automatic annotation experiments for a pragmatic verb tense feature (narrativity) in English/French parallel corpora. The feature is considered to play an important role for translating English Simple Past tense into French, where three different tenses are available. Whether the French Passe ́ Compose ́, Passe ́ Simple or Imparfait should be used is highly dependent on a longer-range context, in which either narrative events ordered in time or mere non-narrative state of affairs in the past are described. This longer-range context is usually not available to current machine translation (MT) systems, that are trained on parallel corpora. Annotating narrativity prior to translation is therefore likely to help current MT systems. Our experiments show that narrativity can be reliably identified with kappa-values of up to 0.91 in manual annotation and with F1 scores of up to 0.72 in automatic annotation.

pdf bib
The tara corpus of human-annotated machine translations
Eleftherios Avramidis | Aljoscha Burchardt | Sabine Hunsicker | Maja Popović | Cindy Tscherwinka | David Vilar | Hans Uszkoreit

Human translators are the key to evaluating machine translation (MT) quality and also to addressing the so far unanswered question when and how to use MT in professional translation workflows. This paper describes the corpus developed as a result of a detailed large scale human evaluation consisting of three tightly connected tasks: ranking, error classification and post-editing.

pdf bib
Detecting Document Structure in a Very Large Corpus of UK Financial Reports
Mahmoud El-Haj | Paul Rayson | Steve Young | Martin Walker

In this paper we present the evaluation of our automatic methods for detecting and extracting document structure in annual financial reports. The work presented is part of the Corporate Financial Information Environment (CFIE) project in which we are using Natural Language Processing (NLP) techniques to study the causes and consequences of corporate disclosure and financial reporting outcomes. We aim to uncover the determinants of financial reporting quality and the factors that influence the quality of information disclosed to investors beyond the financial statements. The CFIE consists of the supply of information by firms to investors, and the mediating influences of information intermediaries on the timing, relevance and reliability of information available to investors. It is important to compare and contrast specific elements or sections of each annual financial report across our entire corpus rather than working at the full document level. We show that the values of some metrics e.g. readability will vary across sections, thus improving on previous research research based on full texts.

pdf bib
Latent Semantic Analysis Models on Wikipedia and TASA
Dan Ștefănescu | Rajendra Banjade | Vasile Rus

This paper introduces a collection of freely available Latent Semantic Analysis models built on the entire English Wikipedia and the TASA corpus. The models differ not only on their source, Wikipedia versus TASA, but also on the linguistic items they focus on: all words, content-words, nouns-verbs, and main concepts. Generating such models from large datasets (e.g. Wikipedia), that can provide a large coverage for the actual vocabulary in use, is computationally challenging, which is the reason why large LSA models are rarely available. Our experiments show that for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures, and comparable to the state of the art.

pdf bib
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

pdf bib
The Weltmodell: A Data-Driven Commonsense Knowledge Base
Alan Akbik | Thilo Michael

We present the Weltmodell, a commonsense knowledge base that was automatically generated from aggregated dependency parse fragments gathered from over 3.5 million English language books. We leverage the magnitude and diversity of this dataset to arrive at close to ten million distinct N-ary commonsense facts using techniques from open-domain Information Extraction (IE). Furthermore, we compute a range of measures of association and distributional similarity on this data. We present the results of our efforts using a browsable web demonstrator and publicly release all generated data for use and discussion by the research community. In this paper, we give an overview of our knowledge acquisition method and representation model, and present our web demonstrator.

pdf bib
German Alcohol Language Corpus - the Question of Dialect
Florian Schiel | Thomas Kisler

Speech uttered under the influence of alcohol is known to deviate from the speech of the same person when sober. This is an important feature in forensic investigations and could also be used to detect intoxication in the automotive environment. Aside from acoustic-phonetic features and speech content which have already been studied by others in this contribution we address the question whether speakers use dialectal variation or dialect words more frequently when intoxicated than when sober. We analyzed 300,000 recorded word tokens in read and spontaneous speech uttered by 162 female and male speakers within the German Alcohol Language Corpus. We found that contrary to our expectations the frequency of dialectal forms decreases significantly when speakers are under the influence. We explain this effect with a compensatory over-shoot mechanism: speakers are aware of their intoxication and that they are being monitored. In forensic analysis of speech this `awareness factor’ must be taken into account.

pdf bib
CLARA: A New Generation of Researchers in Common Language Resources and Their Applications
Koenraad De Smedt | Erhard Hinrichs | Detmar Meurers | Inguna Skadiņa | Bolette Pedersen | Costanza Navarretta | Núria Bel | Krister Lindén | Markéta Lopatková | Jan Hajič | Gisle Andersen | Przemyslaw Lenkiewicz

CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multimodal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.

pdf bib
A Large Corpus of Product Reviews in Portuguese: Tackling Out-Of-Vocabulary Words
Nathan Hartmann | Lucas Avanço | Pedro Balage | Magali Duran | Maria das Graças Volpe Nunes | Thiago Pardo | Sandra Aluísio

Web 2.0 has allowed a never imagined communication boom. With the widespread use of computational and mobile devices, anyone, in practically any language, may post comments in the web. As such, formal language is not necessarily used. In fact, in these communicative situations, language is marked by the absence of more complex syntactic structures and the presence of internet slang, with missing diacritics, repetitions of vowels, and the use of chat-speak style abbreviations, emoticons and colloquial expressions. Such language use poses severe new challenges for Natural Language Processing (NLP) tools and applications, which, so far, have focused on well-written texts. In this work, we report the construction of a large web corpus of product reviews in Brazilian Portuguese and the analysis of its lexical phenomena, which support the development of a lexical normalization tool for, in future work, subsidizing the use of standard NLP products for web opinion mining and summarization purposes.

pdf bib
Shata-Anuvadak: Tackling Multiway Translation of Indian Languages
Anoop Kunchukuttan | Abhijit Mishra | Rajen Chatterjee | Ritesh Shah | Pushpak Bhattacharyya

We present a compendium of 110 Statistical Machine Translation systems built from parallel corpora of 11 Indian languages belonging to both Indo-Aryan and Dravidian families. We analyze the relationship between translation accuracy and the language families involved. We feel that insights obtained from this analysis will provide guidelines for creating machine translation systems of specific Indian language pairs. We build phrase based systems and some extensions. Across multiple languages, we show improvements on the baseline phrase based systems using these extensions: (1) source side reordering for English-Indian language translation, and (2) transliteration of untranslated words for Indian language-Indian language translation. These enhancements harness shared characteristics of Indian languages. To stimulate similar innovation widely in the NLP community, we have made the trained models for these language pairs publicly available.

pdf bib
The CLARIN Research Infrastructure: Resources and Tools for eHumanities Scholars
Erhard Hinrichs | Steven Krauwer

CLARIN is the short name for the Common Language Resources and Technology Infrastructure, which aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. CLARIN is in the process of building a networked federation of European data repositories, service centers and centers of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centers will be interoperable so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work. Interoperability of language resources and tools in the federation of CLARIN Centers is ensured by adherence to TEI and ISO standards for text encoding, by the use of persistent identifiers, and by the observance of common protocols. The purpose of the present paper is to give an overview of language resources, tools, and services that CLARIN presently offers.

pdf bib
Sprinter: Language Technologies for Interactive and Multimedia Language Learning
Renlong Ai | Marcela Charfuelan | Walter Kasper | Tina Klüwer | Hans Uszkoreit | Feiyu Xu | Sandra Gasber | Philip Gienandt

Modern language learning courses are no longer exclusively based on books or face-to-face lectures. More and more lessons make use of multimedia and personalized learning methods. Many of these are based on e-learning solutions. Learning via the Internet provides 7/24 services that require sizeable human resources. Therefore we witness a growing economic pressure to employ computer-assisted methods for improving language learning in quality, efficiency and scalability. In this paper, we will address three applications of language technologies for language learning: 1) Methods and strategies for pronunciation training in second language learning, e.g., multimodal feedback via visualization of sound features, speech verification and prosody transplantation; 2) Dialogue-based language learning games; 3) Application of parsing and generation technologies to the automatic generation of paraphrases for the semi-automatic production of learning material.

pdf bib
Bilingual Dictionary Induction as an Optimization Problem
Wushouer Mairidan | Toru Ishida | Donghui Lin | Katsutoshi Hirayama

Bilingual dictionaries are vital in many areas of natural language processing, but such resources are rarely available for lower-density language pairs, especially for those that are closely related. Pivot-based induction consists of using a third language to bridge a language pair. As an approach to create new dictionaries, it can generate wrong translations due to polysemy and ambiguous words. In this paper we propose a constraint approach to pivot-based dictionary induction for the case of two closely related languages. In order to take into account the word senses, we use an approach based on semantic distances, in which possibly missing translations are considered, and instance of induction is encoded as an optimization problem to generate new dictionary. Evaluations show that the proposal achieves 83.7% accuracy and approximately 70.5% recall, thus outperforming the baseline pivot-based method.

pdf bib
Two Approaches to Metaphor Detection
Brian MacWhinney | Davida Fromm

Methods for automatic detection and interpretation of metaphors have focused on analysis and utilization of the ways in which metaphors violate selectional preferences (Martin, 2006). Detection and interpretation processes that rely on this method can achieve wide coverage and may be able to detect some novel metaphors. However, they are prone to high false alarm rates, often arising from imprecision in parsing and supporting ontological and lexical resources. An alternative approach to metaphor detection emphasizes the fact that many metaphors become conventionalized collocations, while still preserving their active metaphorical status. Given a large enough corpus for a given language, it is possible to use tools like SketchEngine (Kilgariff, Rychly, Smrz, & Tugwell, 2004) to locate these high frequency metaphors for a given target domain. In this paper, we examine the application of these two approaches and discuss their relative strengths and weaknesses for metaphors in the target domain of economic inequality in English, Spanish, Farsi, and Russian.

pdf bib
A Japanese Word Dependency Corpus
Shinsuke Mori | Hideki Ogura | Tetsuro Sasada

In this paper, we present a corpus annotated with dependency relationships in Japanese. It contains about 30 thousand sentences in various domains. Six domains in Balanced Corpus of Contemporary Written Japanese have part-of-speech and pronunciation annotation as well. Dictionary example sentences have pronunciation annotation and cover basic vocabulary in Japanese with English sentence equivalent. Economic newspaper articles also have pronunciation annotation and the topics are similar to those of Penn Treebank. Invention disclosures do not have other annotation, but it has a clear application, machine translation. The unit of our corpus is word like other languages contrary to existing Japanese corpora whose unit is phrase called bunsetsu. Each sentence is manually segmented into words. We first present the specification of our corpus. Then we give a detailed explanation about our standard of word dependency. We also report some preliminary results of an MST-based dependency parser on our corpus.

pdf bib
Crowdsourcing and annotating NER for Twitter #drift
Hege Fromreide | Dirk Hovy | Anders Søgaard

We present two new NER datasets for Twitter; a manually annotated set of 1,467 tweets (kappa=0.942) and a set of 2,975 expert-corrected, crowdsourced NER annotated tweets from the dataset described in Finin et al. (2010). In our experiments with these datasets, we observe two important points: (a) language drift on Twitter is significant, and while off-the-shelf systems have been reported to perform well on in-sample data, they often perform poorly on new samples of tweets, (b) state-of-the-art performance across various datasets can be obtained from crowdsourced annotations, making it more feasible to “catch up” with language drift.

pdf bib
Language Resources and Annotation Tools for Cross-Sentence Relation Extraction
Sebastian Krause | Hong Li | Feiyu Xu | Hans Uszkoreit | Robert Hummel | Luise Spielhagen

In this paper, we present a novel combination of two types of language resources dedicated to the detection of relevant relations (RE) such as events or facts across sentence boundaries. One of the two resources is the sar-graph, which aggregates for each target relation ten thousands of linguistic patterns of semantically associated relations that signal instances of the target relation (Uszkoreit and Xu, 2013). These have been learned from the Web by intra-sentence pattern extraction (Krause et al., 2012) and after semantic filtering and enriching have been automatically combined into a single graph. The other resource is cockrACE, a specially annotated corpus for the training and evaluation of cross-sentence RE. By employing our powerful annotation tool Recon, annotators mark selected entities and relations (including events), coreference relations among these entities and events, and also terms that are semantically related to the relevant relations and events. This paper describes how the two resources are created and how they complement each other.

pdf bib
Evaluating corpora documentation with regards to the Ethics and Big Data Charter
Alain Couillault | Karën Fort | Gilles Adda | Hugues de Mazancourt

The authors have written the Ethic and Big Data Charter in collaboration with various agencies, private bodies and associations. This Charter aims at describing any large or complex resources, and in particular language resources, from a legal and ethical viewpoint and ensuring the transparency of the process of creating and distributing such resources. We propose in this article an analysis of the documentation coverage of the most frequently mentioned language resources with regards to the Charter, in order to show the benefit it offers.

pdf bib
Bootstrapping Term Extractors for Multiple Languages
Ahmet Aker | Monica Paramita | Emma Barker | Robert Gaizauskas

Terminology extraction resources are needed for a wide range of human language technology applications, including knowledge management, information extraction, semantic search, cross-language information retrieval and automatic and assisted translation. We create a low cost method for creating terminology extraction resources for 21 non-English EU languages. Using parallel corpora and a projection method, we create a General POS Tagger for these languages. We also investigate the use of EuroVoc terms and Wikipedia corpus to automatically create term grammar for each language. Our results show that these automatically generated resources can assist term extraction process with similar performance to manually generated resources. All resources resulted in this experiment are freely available for download.

pdf bib
Evaluation of Automatic Hypernym Extraction from Technical Corpora in English and Dutch
Els Lefever | Marjan Van de Kauter | Véronique Hoste

In this research, we evaluate different approaches for the automatic extraction of hypernym relations from English and Dutch technical text. The detected hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data. We investigated three different hypernymy extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model and a morpho-syntactic method. To test the performance of the different approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domain. The experimental results show that especially the morpho-syntactic approach obtains good results for automatic hypernym extraction from technical and domain-specific texts.

pdf bib
Measuring Readability of Polish Texts: Baseline Experiments
Bartosz Broda | Bartłomiej Nitoń | Włodzimierz Gruszczyński | Maciej Ogrodniczuk

Measuring readability of a text is the first sensible step to its simplification. In this paper we present an overview of the most common approaches to automatic measuring of readability. Of the described ones, we implemented and evaluated: Gunning FOG index, Flesch-based Pisarek method. We also present two other approaches. The first one is based on measuring distributional lexical similarity of a target text and comparing it to reference texts. In the second one, we propose a novel method for automation of Taylor test ― which, in its base form, requires performing a large amount of surveys. The automation of Taylor test is performed using a technique called statistical language modelling. We have developed a free on-line web-based system and constructed plugins for the most common text editors, namely Microsoft Word and Inner workings of the system are described in detail. Finally, extensive evaluations are performed for Polish ― a Slavic, highly inflected language. We show that Pisarek’s method is highly correlated to Gunning FOG Index, even if different in form, and that both the similarity-based approach and automated Taylor test achieve high accuracy. Merits of using either of them are discussed.

pdf bib
Introducing a web application for labeling, visualizing speech and correcting derived speech signals
Raphael Winkelmann | Georg Raess

The advent of HTML5 has sparked a great increase in interest in the web as a development platform for a variety of different research applications. Due to its ability to easily deploy software to remote clients and the recent development of standardized browser APIs, we argue that the browser has become a good platform to develop a speech labeling tool for. This paper introduces a preliminary version of an open-source client-side web application for labeling speech data, visualizing speech and segmentation information and manually correcting derived speech signals such as formant trajectories. The user interface has been designed to be as user-friendly as possible in order to make the sometimes tedious task of transcribing as easy and efficient as possible. The future integration into the next iteration of the EMU speech database management system and its general architecture will also be outlined, as the work presented here is only one of several components contributing to the future system.

pdf bib
An Iterative Approach for Mining Parallel Sentences in a Comparable Corpus
Lise Rebout | Phillippe Langlais

We describe an approach for mining parallel sentences in a collection of documents in two languages. While several approaches have been proposed for doing so, our proposal differs in several respects. First, we use a document level classifier in order to focus on potentially fruitful document pairs, an understudied approach. We show that mining less, but more parallel documents can lead to better gains in machine translation. Second, we compare different strategies for post-processing the output of a classifier trained to recognize parallel sentences. Last, we report a simple bootstrapping experiment which shows that promising sentence pairs extracted in a first stage can help to mine new sentence pairs in a second stage. We applied our approach on the English-French Wikipedia. Gains of a statistical machine translation (SMT) engine are analyzed along different test sets.

pdf bib
Development of a TV Broadcasts Speech Recognition System for Qatari Arabic
Mohamed Elmahdy | Mark Hasegawa-Johnson | Eiman Mustafawi

A major problem with dialectal Arabic speech recognition is due to the sparsity of speech resources. In this paper, a transfer learning framework is proposed to jointly use a large amount of Modern Standard Arabic (MSA) data and little amount of dialectal Arabic data to improve acoustic and language modeling. The Qatari Arabic (QA) dialect has been chosen as a typical example for an under-resourced Arabic dialect. A wide-band speech corpus has been collected and transcribed from several Qatari TV series and talk-show programs. A large vocabulary speech recognition baseline system was built using the QA corpus. The proposed MSA-based transfer learning technique was performed by applying orthographic normalization, phone mapping, data pooling, acoustic model adaptation, and system combination. The proposed approach can achieve more than 28% relative reduction in WER.

pdf bib
Can Crowdsourcing be used for Effective Annotation of Arabic?
Wajdi Zaghouani | Kais Dukes

Crowdsourcing has been used recently as an alternative to traditional costly annotation by many natural language processing groups. In this paper, we explore the use of Amazon Mechanical Turk (AMT) in order to assess the feasibility of using AMT workers (also known as Turkers) to perform linguistic annotation of Arabic. We used a gold standard data set taken from the Quran corpus project annotated with part-of-speech and morphological information. An Arabic language qualification test was used to filter out potential non-qualified participants. Two experiments were performed, a part-of-speech tagging task in where the annotators were asked to choose a correct word-category from a multiple choice list and case ending identification task. The results obtained so far showed that annotating Arabic grammatical case is harder than POS tagging, and crowdsourcing for Arabic linguistic annotation requiring expert annotators could be not as effective as other crowdsourcing experiments requiring less expertise and qualifications.

pdf bib
Design and development of an RDB version of the Corpus of Spontaneous Japanese
Hanae Koiso | Yasuharu Den | Ken’ya Nishikawa | Kikuo Maekawa

In this paper, we describe the design and development of a new version of the Corpus of Spontaneous Japanese (CSJ), which is a large-scale spoken corpus released in 2004. CSJ contains various annotations that are represented in XML format (CSJ-XML). CSJ-XML, however, is very complicated and suffers from some problems. To overcome this problem, we have developed and released, in 2013, a relational database version of CSJ (CSJ-RDB). CSJ-RDB is based on an extension of the segment and link-based annotation scheme, which we adapted to handle multi-channel and multi-modal streams. Because this scheme adopts a stand-off framework, CSJ-RDB can represent three hierarchical structures at the same time: inter-pausal-unit-top, clause-top, and intonational-phrase-top. CSJ-RDB consists of five different types of tables: segment, unaligned-segment, link, relation, and meta-information tables. The database was automatically constructed from annotation files extracted from CSJ-XML by using general-purpose corpus construction tools. CSJ-RDB enables us to easily and efficiently conduct complex searches required for corpus-based studies of spoken language.

pdf bib
Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic Speech
Mohamed Elmahdy | Mark Hasegawa-Johnson | Eiman Mustafawi

In this paper, a framework for long audio alignment for conversational Arabic speech is proposed. Accurate alignments help in many speech processing tasks such as audio indexing, speech recognizer acoustic model (AM) training, audio summarizing and retrieving, etc. We have collected more than 1,400 hours of conversational Arabic besides the corresponding human generated non-aligned transcriptions. Automatic audio segmentation is performed using a split and merge approach. A biased language model (LM) is trained using the corresponding text after a pre-processing stage. Because of the dominance of non-standard Arabic in conversational speech, a graphemic pronunciation model (PM) is utilized. The proposed alignment approach is performed in two passes. Firstly, a generic standard Arabic AM is used along with the biased LM and the graphemic PM in a fast speech recognition pass. In a second pass, a more restricted LM is generated for each audio segment, and unsupervised acoustic model adaptation is applied. The recognizer output is aligned with the processed transcriptions using Levenshtein algorithm. The proposed approach resulted in an initial alignment accuracy of 97.8-99.0% depending on the amount of disfluencies. A confidence scoring metric is proposed to accept/reject aligner output. Using confidence scores, it was possible to reject the majority of mis-aligned segments resulting in alignment accuracy of 99.0-99.8% depending on the speech domain and the amount of disfluencies.

pdf bib
Vocabulary-Based Language Similarity using Web Corpora
Dirk Goldhahn | Uwe Quasthoff

This paper will focus on the evaluation of automatic methods for quantifying language similarity. This is achieved by ascribing language similarity to the similarity of text corpora. This corpus similarity will first be determined by the resemblance of the vocabulary of languages. Thereto words or parts of them such as letter n-grams are examined. Extensions like transliteration of the text data will ensure the independence of the methods from text characteristics such as the writing system used. Further analyzes will show to what extent knowledge about the distribution of words in parallel text can be used in the context of language similarity.

pdf bib
NewsReader: recording history from daily news streams
Piek Vossen | German Rigau | Luciano Serafini | Pim Stouten | Francis Irving | Willem Van Hage

The European project NewsReader develops technology to process daily news streams in 4 languages, extracting what happened, when, where and who was involved. NewsReader does not just read a single newspaper but massive amounts of news coming from thousands of sources. It compares the results across sources to complement information and determine where they disagree. Furthermore, it merges news of today with previous news, creating a long-term history rather than separate events. The result is stored in a KnowledgeStore, that cumulates information over time, producing an extremely large knowledge graph that is visualized using new techniques to provide more comprehensive access. We present the first version of the system and the results of processing first batches of data.

pdf bib
A set of open source tools for Turkish natural language processing
Çağrı Çöltekin

This paper introduces a set of freely available, open-source tools for Turkish that are built around TRmorph, a morphological analyzer introduced earlier in Coltekin (2010). The article first provides an update on the analyzer, which includes a complete rewrite using a different finite-state description language and tool set as well as major tagset changes to comply better with the state-of-the-art computational processing of Turkish and the user requests received so far. Besides these major changes to the analyzer, this paper introduces tools for morphological segmentation, stemming and lemmatization, guessing unknown words, grapheme to phoneme conversion, hyphenation and a morphological disambiguation.

pdf bib
The Gulf of Guinea Creole Corpora
Tjerk Hagemeijer | Michel Généreux | Iris Hendrickx | Amália Mendes | Abigail Tiny | Armando Zamora

We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d’Ambô. We faced the typical difficulties of languages lacking an official status, such as lack of standard spelling, language variation, lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. In order to tackle these problems, the compiled written and transcribed spoken data collected during field work trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded meta data for each file and added morphosyntactic information based on a part-of-speech tag set that was designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via an online web interface.

pdf bib
S-pot - a benchmark in spotting signs within continuous signing
Ville Viitaniemi | Tommi Jantunen | Leena Savolainen | Matti Karppa | Jorma Laaksonen

In this paper we present S-pot, a benchmark setting for evaluating the performance of automatic spotting of signs in continuous sign language videos. The benchmark includes 5539 video files of Finnish Sign Language, ground truth sign spotting results, a tool for assessing the spottings against the ground truth, and a repository for storing information on the results. In addition we will make our sign detection system and results made with it publicly available as a baseline for comparison and further developments.

pdf bib
Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
Masood Ghayoomi | Jonas Kuhn

A treebank is an important language resource for supervised statistical parsers. The parser induces the grammatical properties of a language from this language resource and uses the model to parse unseen data automatically. Since developing such a resource is very time-consuming and tedious, one can take advantage of already extant resources by adapting them to a particular application. This reduces the amount of human effort required to develop a new language resource. In this paper, we introduce an algorithm to convert an HPSG-based treebank into its parallel dependency-based treebank. With this converter, we can automatically create a new language resource from an existing treebank developed based on a grammar formalism. Our proposed algorithm is able to create both projective and non-projective dependency trees.

pdf bib
TweetNorm_es: an annotated corpus for Spanish microtext normalization
Iñaki Alegria | Nora Aranberri | Pere Comas | Víctor Fresno | Pablo Gamallo | Lluis Padró | Iñaki San Vicente | Jordi Turmo | Arkaitz Zubiaga

In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish language, which we make publicly available under the terms of the CC-BY license. This corpus is intended for development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort from different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, as the first evaluation environment where the corpus was used.

pdf bib
The Procedure of Lexico-Semantic Annotation of Składnica Treebank
Elżbieta Hajnicz

In this paper, the procedure of lexico-semantic annotation of Składnica Treebank using Polish WordNet is presented. Other semantically annotated corpora, in particular treebanks, are outlined first. Resources involved in annotation as well as a tool called Semantikon used for it are described. The main part of the paper is the analysis of the applied procedure. It consists of the basic and correction phases. During basic phase all nouns, verbs and adjectives are annotated with wordnet senses. The annotation is performed independently by two linguists. During the correction phase, conflicts are resolved by the linguist supervising the process. Multi-word units obtain special tags, synonyms and hypernyms are used for senses absent in Polish WordNet. Additionally, each sentence receives its general assessment. Finally, some statistics of the results of annotation are given, including inter-annotator agreement. The final resource is represented in XML files preserving the structure of Składnica.

pdf bib
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
Júlia Pajzs | Ralf Steinberger | Maud Ehrmann | Mohamed Ebrahim | Leonida Della Rocca | Stefano Bucci | Eszter Simon | Tamás Váradi

The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page

pdf bib
French Resources for Extraction and Normalization of Temporal Expressions with HeidelTime
Véronique Moriceau | Xavier Tannier

In this paper, we describe the development of French resources for the extraction and normalization of temporal expressions with HeidelTime, a open-source multilingual, cross-domain temporal tagger. HeidelTime extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. Several types of temporal expressions are extracted: dates, times, durations and temporal sets. French resources have been evaluated in two different ways: on the French TimeBank corpus, a corpus of newspaper articles in French annotated according to the ISO-TimeML standard, and on a user application for automatic building of event timelines. Results on the French TimeBank are quite satisfaying as they are comparable to those obtained by HeidelTime in English and Spanish on newswire articles. Concerning the user application, we used two temporal taggers for the preprocessing of the corpus in order to compare their performance and results show that the performances of our application on French documents are better with HeidelTime. The French resources and evaluation scripts are publicly available with HeidelTime.

pdf bib
Legal aspects of text mining
Maarten Truyens | Patrick Van Eecke

Unlike data mining, text mining has received only limited attention in legal circles. Nevertheless, interesting legal stumbling blocks exist, both with respect to the data collection and data sharing phases, due to the strict rules of copyright and database law. Conflicts are particularly likely when content is extracted from commercial databases, and when texts that have a minimal level of creativity are stored in a permanent way. In all circumstances, even with non-commercial research, license agreements and website terms of use can impose further restrictions. Accordingly, only for some delineated areas (very old texts for which copyright expired, legal statutes, texts in the public domain) strong legal certainty can be obtained without case-by-case assessments. As a result, while prior permission is certainly not required in all cases, many researchers tend to err on the side of caution, and seek permission from publishers, institutions and individual authors before including texts in their corpora, although this process can be difficult and very time-consuming. In the United States, the legal assessment is very different, due to the open-ended nature and flexibility offered by the “fair use” doctrine.

pdf bib
Treelet Probabilities for HPSG Parsing and Error Correction
Angelina Ivanova | Gertjan van Noord

Most state-of-the-art parsers take an approach to produce an analysis for any input despite errors. However, small grammatical mistakes in a sentence often cause parser to fail to build a correct syntactic tree. Applications that can identify and correct mistakes during parsing are particularly interesting for processing user-generated noisy content. Such systems potentially could take advantage of linguistic depth of broad-coverage precision grammars. In order to choose the best correction for an utterance, probabilities of parse trees of different sentences should be comparable which is not supported by discriminative methods underlying parsing software for processing deep grammars. In the present work we assess the treelet model for determining generative probabilities for HPSG parsing with error correction. In the first experiment the treelet model is applied to the parse selection task and shows superior exact match accuracy than the baseline and PCFG. In the second experiment it is tested for the ability to score the parse tree of the correct sentence higher than the constituency tree of the original version of the sentence containing grammatical error.

pdf bib
A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition
Abir Masmoudi | Mariem Ellouze Khmekhem | Yannick Estève | Lamia Hadrich Belguith | Nizar Habash

In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. The method proposed in this paper, to automatically generate a phonetic dictionary, is rule based. For that reason, we define a set of pronunciation rules and a lexicon of exceptions. To determine the performance of our phonetic rules, we chose to evaluate our pronunciation dictionary on two types of corpora. The word error rate of word grapheme-to-phoneme mapping is around 9%.

pdf bib
Discovering frames in specialized domains
Marie-Claude L’Homme | Benoît Robichaud | Carlos Subirats Rüggeberg

This paper proposes a method for discovering semantic frames (Fillmore, 1982, 1985; Fillmore et al., 2003) in specialized domains. It is assumed that frames are especially relevant for capturing the lexical structure in specialized domains and that they complement structures such as ontologies that appear better suited to represent specific relationships between entities. The method we devised is based on existing lexical entries recorded in a specialized database related to the field of the environment (erode, impact, melt, recycling, warming). The frames and the data encoded in FrameNet are used as a reference. Selected information was extracted automatically from the database on the environment (and, when possible, compared to FrameNet), and presented to a linguist who analyzed this information to discover potential frames. Several different frames were discovered with this method. About half of them correspond to frames already described in FrameNet; some new frames were also defined and part of these might be specific to the field of the environment.

pdf bib
Resources for the Detection of Conventionalized Metaphors in Four Languages
Lori Levin | Teruko Mitamura | Brian MacWhinney | Davida Fromm | Jaime Carbonell | Weston Feely | Robert Frederking | Anatole Gershman | Carlos Ramirez

This paper describes a suite of tools for extracting conventionalized metaphors in English, Spanish, Farsi, and Russian. The method depends on three significant resources for each language: a corpus of conventionalized metaphors, a table of conventionalized conceptual metaphors (CCM table), and a set of extraction rules. Conventionalized metaphors are things like “escape from poverty” and “burden of taxation”. For each metaphor, the CCM table contains the metaphorical source domain word (such as “escape”) the target domain word (such as “poverty”) and the grammatical construction in which they can be found. The extraction rules operate on the output of a dependency parser and identify the grammatical configurations (such as a verb with a prepositional phrase complement) that are likely to contain conventional metaphors. We present results on detection rates for conventional metaphors and analysis of the similarity and differences of source domains for conventional metaphors in the four languages.

pdf bib
CLARIN-NL: Major results
Jan Odijk

In this paper I provide a high level overview of the major results of CLARIN-NL so far. I will show that CLARIN-NL is starting to provide the data, facilities and services in the CLARIN infrastructure to carry out humanities research supported by large amounts of data and tools. These services have easy interfaces and easy search options (no technical background needed). Still some training is required, to understand both the possibilities and the limitations of the data and the tools. Actual use of the facilities leads to suggestions for improvements and to suggestions for new functionality. All researchers are therefore invited to start using the elements in the CLARIN infrastructure offered by CLARIN-NL. Though I will show that a lot has been achieved in the CLARIN-NL project, I will also provide a long list of functionality and interoperability cases that have not been dealt with in CLARIN-NL and must remain for future work.

pdf bib
Exploiting Portuguese Lexical Knowledge Bases for Answering Open Domain Cloze Questions Automatically
Hugo Gonçalo Oliveira | Inês Coelho | Paulo Gomes

We present the task of answering cloze questions automatically and how it can be tackled by exploiting lexical knowledge bases (LKBs). This task was performed in what can be seen as an indirect evaluation of Portuguese LKB. We introduce the LKBs used and the algorithms applied, and then report on the obtained results and draw some conclusions: LKBs are definitely useful resources for this challenging task, and exploiting them, especially with PageRanking-based algorithms, clearly improves the baselines. Moreover, larger LKB, created automatically and not sense-aware led to the best results, as opposed to handcrafted LKB structured on synsets.

pdf bib
Annotation of Computer Science Papers for Semantic Relation Extrac-tion
Yuka Tateisi | Yo Shidahara | Yusuke Miyao | Akiko Aizawa

We designed a new annotation scheme for formalising relation structures in research papers, through the investigation of computer science papers. The annotation scheme is based on the hypothesis that identifying the role of entities and events that are described in a paper is useful for intelligent information retrieval in academic literature, and the role can be determined by the relationship between the author and the described entities or events, and relationships among them. Using the scheme, we have annotated research abstracts from the IPSJ Journal published in Japanese by the Information Processing Society of Japan. On the basis of the annotated corpus, we have developed a prototype information extraction system which has the facility to classify sentences according to the relationship between entities mentioned, to help find the role of the entity in which the searcher is interested.

pdf bib
Identifying Idioms in Chinese Translations
Wan Yu Ho | Christine Kng | Shan Wang | Francis Bond

Optimally, a translated text should preserve information while maintaining the writing style of the original. When this is not possible, as is often the case with figurative speech, a common practice is to simplify and make explicit the implications. However, in our investigations of translations from English to another language, English-to-Chinese texts were often found to include idiomatic expressions (usually in the form of Chengyu 成è ̄) where there were originally no idiomatic, metaphorical, or even figurative expressions. We have created an initial small lexicon of Chengyu, with which we can use to find all occurrences of Chengyu in a given corpus, and will continue to expand the database. By examining the rates and patterns of occurrence across four genres in the NTU Multilingual Corpus, a resource may be created to aid machine translation or, going further, predict Chinese translational trends in any given genre.

pdf bib
Machine Translation for Subtitling: A Large-Scale Evaluation
Thierry Etchegoyhen | Lindsay Bywood | Mark Fishel | Panayota Georgakopoulou | Jie Jiang | Gerard van Loenhout | Arantza del Pozo | Mirjam Sepesy Maučec | Anja Turner | Martin Volk

This article describes a large-scale evaluation of the use of Statistical Machine Translation for professional subtitling. The work was carried out within the FP7 EU-funded project SUMAT and involved two rounds of evaluation: a quality evaluation and a measure of productivity gain/loss. We present the SMT systems built for the project and the corpora they were trained on, which combine professionally created and crowd-sourced data. Evaluation goals, methodology and results are presented for the eleven translation pairs that were evaluated by professional subtitlers. Overall, a majority of the machine translated subtitles received good quality ratings. The results were also positive in terms of productivity, with a global gain approaching 40%. We also evaluated the impact of applying quality estimation and filtering of poor MT output, which resulted in higher productivity gains for filtered files as opposed to fully machine-translated files. Finally, we present and discuss feedback from the subtitlers who participated in the evaluation, a key aspect for any eventual adoption of machine translation technology in professional subtitling.

pdf bib
T-PAS; A resource of Typed Predicate Argument Structures for linguistic analysis and semantic processing
Elisabetta Jezek | Bernardo Magnini | Anna Feltracco | Alessia Bianchini | Octavian Popescu

The goal of this paper is to introduce T-PAS, a resource of typed predicate argument structures for Italian, acquired from corpora by manual clustering of distributional information about Italian verbs, to be used for linguistic analysis and semantic processing tasks. T-PAS is the first resource for Italian in which semantic selection properties and sense-in-context distinctions of verbs are characterized fully on empirical ground. In the paper, we first describe the process of pattern acquisition and corpus annotation (section 2) and its ongoing evaluation (section 3). We then demonstrate the benefits of pattern tagging for NLP purposes (section 4), and discuss current effort to improve the annotation of the corpus (section 5). We conclude by reporting on ongoing experiments using semiautomatic techniques for extending coverage (section 6).

pdf bib
Narrowing the Gap Between Termbases and Corpora in Commercial Environments
Kara Warburton

Terminological resources offer potential to support applications beyond translation, such as controlled authoring and indexing, which are increasingly of interest to commercial enterprises. The ad-hoc semasiological approach adopted by commercial terminographers diverges considerably from methodologies prescribed by conventional theory. The notion of termhood in such production-oriented environments is driven by pragmatic criteria such as frequency and repurposability of the terminological unit. A high degree of correspondence between the commercial corpus and the termbase is desired. Research carried out at the City University of Hong Kong using four IT companies as case studies revealed a large gap between corpora and termbases. Problems in selecting terms and in encoding them properly in termbases account for a significant portion of this gap. A rigorous corpus-based approach to term selection would significantly reduce this gap and improve the effectiveness of commercial termbases. In particular, single-word terms (keywords) identified by comparison to a reference corpus offer great potential for identifying important multi-word terms in this context. We conclude that terminography for production purposes should be more corpus-based than is currently the norm.

pdf bib
Author-Specific Sentiment Aggregation for Polarity Prediction of Reviews
Subhabrata Mukherjee | Sachindra Joshi

In this work, we propose an author-specific sentiment aggregation model for polarity prediction of reviews using an ontology. We propose an approach to construct a Phrase Annotated Author Specific Sentiment Ontology Tree (PASOT), where the facet nodes are annotated with opinion phrases of the author, used to describe the facets, as well as the author’s preference for the facets. We show that an author-specific aggregation of sentiment over an ontology fares better than a flat classification model, which does not take the domain-specific facet importance or author-specific facet preference into account. We compare our approach to supervised classification using Support Vector Machines, as well as other baselines from previous works, where we achieve an accuracy improvement of 7.55% over the SVM baseline. Furthermore, we also show the effectiveness of our approach in capturing thwarting in reviews, achieving an accuracy improvement of 11.53% over the SVM baseline.

pdf bib
Clustering of Multi-Word Named Entity variants: Multilingual Evaluation
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger

Multi-word entities, such as organisation names, are frequently written in many different ways. We have previously automatically identified over one million acronym pairs in 22 languages, consisting of their short form (e.g. EC) and their corresponding long forms (e.g. European Commission, European Union Commission). In order to automatically group such long form variants as belonging to the same entity, we cluster them, using bottom-up hierarchical clustering and pair-wise string similarity metrics. In this paper, we address the issue of how to evaluate the named entity variant clusters automatically, with minimal human annotation effort. We present experiments that make use of Wikipedia redirection tables and we show that this method produces good results.

pdf bib
A Database for Measuring Linguistic Information Content
Richard Sproat | Bruno Cartoni | HyunJeong Choe | David Huynh | Linne Ha | Ravindran Rajakumar | Evelyn Wenzel-Grondie

Which languages convey the most information in a given amount of space? This is a question often asked of linguists, especially by engineers who often have some information theoretic measure of “information” in mind, but rarely define exactly how they would measure that information. The question is, in fact remarkably hard to answer, and many linguists consider it unanswerable. But it is a question that seems as if it ought to have an answer. If one had a database of close translations between a set of typologically diverse languages, with detailed marking of morphosyntactic and morphosemantic features, one could hope to quantify the differences between how these different languages convey information. Since no appropriate database exists we decided to construct one. The purpose of this paper is to present our work on the database, along with some preliminary results. We plan to release the dataset once complete.

pdf bib
Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing
Noushin Rezapour Asheghi | Serge Sharoff | Katja Markert

Research in Natural Language Processing often relies on a large collection of manually annotated documents. However, currently there is no reliable genre-annotated corpus of web pages to be employed in Automatic Genre Identification (AGI). In AGI, documents are classified based on their genres rather than their topics or subjects. The major shortcoming of available web genre collections is their relatively low inter-coder agreement. Reliability of annotated data is an essential factor for reliability of the research result. In this paper, we present the first web genre corpus which is reliably annotated. We developed precise and consistent annotation guidelines which consist of well-defined and well-recognized categories. For annotating the corpus, we used crowd-sourcing which is a novel approach in genre annotation. We computed the overall as well as the individual categories’ chance-corrected inter-annotator agreement. The results show that the corpus has been annotated reliably.

pdf bib
Crowdsourcing as a preprocessing for complex semantic annotation tasks
Héctor Martínez Alonso | Lauren Romeo

This article outlines a methodology that uses crowdsourcing to reduce the workload of experts for complex semantic tasks. We split turker-annotated datasets into a high-agreement block, which is not modified, and a low-agreement block, which is re-annotated by experts. The resulting annotations have higher observed agreement. We identify different biases in the annotation for both turkers and experts.

pdf bib
Automatic Annotation of Machine Translation Datasets with Binary Quality Judgements
Marco Turchi | Matteo Negri

The automatic estimation of machine translation (MT) output quality is an active research area due to its many potential applications (e.g. aiding human translation and post-editing, re-ranking MT hypotheses, MT system combination). Current approaches to the task rely on supervised learning methods for which high-quality labelled data is fundamental. In this framework, quality estimation (QE) has been mainly addressed as a regression problem where models trained on (source, target) sentence pairs annotated with continuous scores (in the [0-1] interval) are used to assign quality scores (in the same interval) to unseen data. Such definition of the problem assumes that continuous scores are informative and easily interpretable by different users. These assumptions, however, conflict with the subjectivity inherent to human translation and evaluation. On one side, the subjectivity of human judgements adds noise and biases to annotations based on scaled values. This problem reduces the usability of the resulting datasets, especially in application scenarios where a sharp distinction between “good” and “bad” translations is needed. On the other side, continuous scores are not always sufficient to decide whether a translation is actually acceptable or not. To overcome these issues, we present an automatic method for the annotation of (source, target) pairs with binary judgements that reflect an empirical, and easily interpretable notion of quality. The method is applied to annotate with binary judgements three QE datasets for different language combinations. The three datasets are combined in a single resource, called BinQE, which can be freely downloaded from

pdf bib
Semantic Clustering of Pivot Paraphrases
Marianna Apidianaki | Emilia Verzeni | Diana McCarthy

Paraphrases extracted from parallel corpora by the pivot method (Bannard and Callison-Burch, 2005) constitute a valuable resource for multilingual NLP applications. In this study, we analyse the semantics of unigram pivot paraphrases and use a graph-based sense induction approach to unveil hidden sense distinctions in the paraphrase sets. The comparison of the acquired senses to gold data from the Lexical Substitution shared task (McCarthy and Navigli, 2007) demonstrates that sense distinctions exist in the paraphrase sets and highlights the need for a disambiguation step in applications using this resource.

pdf bib
When POS data sets don’t add up: Combatting sample bias
Dirk Hovy | Barbara Plank | Anders Søgaard

Several works in Natural Language Processing have recently looked into part-of-speech annotation of Twitter data and typically used their own data sets. Since conventions on Twitter change rapidly, models often show sample bias. Training on a combination of the existing data sets should help overcome this bias and produce more robust models than any trained on the individual corpora. Unfortunately, combining the existing corpora proves difficult: many of the corpora use proprietary tag sets that have little or no overlap. Even when mapped to a common tag set, the different corpora systematically differ in their treatment of various tags and tokens. This includes both pre-processing decisions, as well as default labels for frequent tokens, thus exhibiting data bias and label bias, respectively. Only if we address these biases can we combine the existing data sets to also overcome sample bias. We present a systematic study of several Twitter POS data sets, the problems of label and data bias, discuss their effects on model performance, and show how to overcome them to learn models that perform well on various test sets, achieving relative error reduction of up to 21%.

pdf bib
Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline
Matthew Shardlow

Lexical simplification is the task of automatically reducing the complexity of a text by identifying difficult words and replacing them with simpler alternatives. Whilst this is a valuable application of natural language generation, rudimentary lexical simplification systems suffer from a high error rate which often results in nonsensical, non-simple text. This paper seeks to characterise and quantify the errors which occur in a typical baseline lexical simplification system. We expose 6 distinct categories of error and propose a classification scheme for these. We also quantify these errors for a moderate size corpus, showing the magnitude of each error type. We find that for 183 identified simplification instances, only 19 (10.38%) result in a valid simplification, with the rest causing errors of varying gravity.

pdf bib
The N2 corpus: A semantically annotated collection of Islamist extremist stories
Mark Finlayson | Jeffry Halverson | Steven Corman

We describe the N2 (Narrative Networks) Corpus, a new language resource. The corpus is unique in three important ways. First, every text in the corpus is a story, which is in contrast to other language resources that may contain stories or story-like texts, but are not specifically curated to contain only stories. Second, the unifying theme of the corpus is material relevant to Islamist Extremists, having been produced by or often referenced by them. Third, every text in the corpus has been annotated for 14 layers of syntax and semantics, including: referring expressions and co-reference; events, time expressions, and temporal relationships; semantic roles; and word senses. In cases where analyzers were not available to do high-quality automatic annotations, layers were manually double-annotated and adjudicated by trained annotators. The corpus comprises 100 texts and 42,480 words. Most of the texts were originally in Arabic but all are provided in English translation. We explain the motivation for constructing the corpus, the process for selecting the texts, the detailed contents of the corpus itself, the rationale behind the choice of annotation layers, and the annotation procedure.

pdf bib
Learning from Domain Complexity
Robert Remus | Dominique Ziegelmayer

Sentiment analysis is genre and domain dependent, i.e. the same method performs differently when applied to text that originates from different genres and domains. Intuitively, this is due to different language use in different genres and domains. We measure such differences in a sentiment analysis gold standard dataset that contains texts from 1 genre and 10 domains. Differences in language use are quantified using certain language statistics, viz. domain complexity measures. We investigate 4 domain complexity measures: percentage of rare words, word richness, relative entropy and corpus homogeneity. We relate domain complexity measurements to performance of a standard machine learning-based classifier and find strong correlations. We show that we can accurately estimate its performance based on domain complexity using linear regression models fitted using robust loss functions. Moreover, we illustrate how domain complexity may guide us in model selection, viz. in deciding what word n-gram order to employ in a discriminative model and whether to employ aggressive or conservative word n-gram feature selection.

pdf bib
Benchmarking Twitter Sentiment Analysis Tools
Ahmed Abbasi | Ammar Hassan | Milan Dhar

Twitter has become one of the quintessential social media platforms for user-generated content. Researchers and industry practitioners are increasingly interested in Twitter sentiments. Consequently, an array of commercial and freely available Twitter sentiment analysis tools have emerged, though it remains unclear how well these tools really work. This study presents the findings of a detailed benchmark analysis of Twitter sentiment analysis tools, incorporating 20 tools applied to 5 different test beds. In addition to presenting detailed performance evaluation results, a thorough error analysis is used to highlight the most prevalent challenges facing Twitter sentiment analysis tools. The results have important implications for various stakeholder groups, including social media analytics researchers, NLP developers, and industry managers and practitioners using social media sentiments as input for decision-making.

pdf bib
Designing a Bilingual Speech Corpus for French and German Language Learners: a Two-Step Process
Camille Fauth | Anne Bonneau | Frank Zimmerer | Juergen Trouvain | Bistra Andreeva | Vincent Colotte | Dominique Fohr | Denis Jouvet | Jeanin Jügler | Yves Laprie | Odile Mella | Bernd Möbius

We present the design of a corpus of native and non-native speech for the language pair French-German, with a special emphasis on phonetic and prosodic aspects. To our knowledge there is no suitable corpus, in terms of size and coverage, currently available for the target language pair. To select the target L1-L2 interference phenomena we prepare a small preliminary corpus (corpus1), which is analyzed for coverage and cross-checked jointly by French and German experts. Based on this analysis, target phenomena on the phonetic and phonological level are selected on the basis of the expected degree of deviation from the native performance and the frequency of occurrence. 14 speakers performed both L2 (either French or German) and L1 material (either German or French). This allowed us to test, recordings duration, recordings material, the performance of our automatic aligner software. Then, we built corpus2 taking into account what we learned about corpus1. The aims are the same but we adapted speech material to avoid too long recording sessions. 100 speakers will be recorded. The corpus (corpus1 and corpus2) will be prepared as a searchable database, available for the scientific community after completion of the project.

pdf bib
Creative language explorations through a high-expressivity N-grams query language
Carlo Strapparava | Lorenzo Gatti | Marco Guerini | Oliviero Stock

In computation linguistics a combination of syntagmatic and paradigmatic features is often exploited. While the first aspects are typically managed by information present in large n-gram databases, domain and ontological aspects are more properly modeled by lexical ontologies such as WordNet and semantic similarity spaces. This interconnection is even stricter when we are dealing with creative language phenomena, such as metaphors, prototypical properties, puns generation, hyperbolae and other rhetorical phenomena. This paper describes a way to focus on and accomplish some of these tasks by exploiting NgramQuery, a generalized query language on Google N-gram database. The expressiveness of this query language is boosted by plugging semantic similarity acquired both from corpora (e.g. LSA) and from WordNet, also integrating operators for phonetics and sentiment analysis. The paper reports a number of examples of usage in some creative language tasks.

pdf bib
Using Word Familiarities and Word Associations to Measure Corpus Representativeness
Reinhard Rapp

The definition of corpus representativeness used here assumes that a representative corpus should reflect as well as possible the average language use a native speaker encounters in everyday life over a longer period of time. As it is not practical to observe people’s language input over years, we suggest to utilize two types of experimental data capturing two forms of human intuitions: Word familiarity norms and word association norms. If it is true that human language acquisition is corpus-based, such data should reflect people’s perceived language input. Assuming so, we compute a representativeness score for a corpus by extracting word frequency and word association statistics from it and by comparing these statistics to the human data. The higher the similarity, the more representative the corpus should be for the language environments of the test persons. We present results for five different corpora and for truncated versions thereof. The results confirm the expectation that corpus size and corpus balance are crucial aspects for corpus representativeness.

pdf bib
Deep Syntax Annotation of the Sequoia French Treebank
Marie Candito | Guy Perrier | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah | Éric de la Clergerie

We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.

pdf bib
Developing a French FrameNet: Methodology and First results
Marie Candito | Pascal Amsili | Lucie Barque | Farah Benamara | Gaël de Chalendar | Marianne Djemaa | Pauline Haas | Richard Huyghe | Yvette Yannick Mathieu | Philippe Muller | Benoît Sagot | Laure Vieu

The Asfalda project aims to develop a French corpus with frame-based semantic annotations and automatic tools for shallow semantic analysis. We present the first part of the project: focusing on a set of notional domains, we delimited a subset of English frames, adapted them to French data when necessary, and developed the corresponding French lexicon. We believe that working domain by domain helped us to enforce the coherence of the resulting resource, and also has the advantage that, though the number of frames is limited (around a hundred), we obtain full coverage within a given domain.

pdf bib
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Marta Sabou | Kalina Bontcheva | Leon Derczynski | Arno Scharl

Crowdsourcing is an emerging collaborative approach that can be used for the acquisition of annotated corpora and a wide range of other linguistic resources. Although the use of this approach is intensifying in all its key genres (paid-for crowdsourcing, games with a purpose, volunteering-based approaches), the community still lacks a set of best-practice guidelines similar to the annotation best practices for traditional, expert-based corpus acquisition. In this paper we focus on the use of crowdsourcing methods for corpus acquisition and propose a set of best practice guidelines based in our own experiences in this area and an overview of related literature. We also introduce GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.

pdf bib
Locating Requests among Open Source Software Communication Messages
Ioannis Korkontzelos | Sophia Ananiadou

As a first step towards assessing the quality of support offered online for Open Source Software (OSS), we address the task of locating requests, i.e., messages that raise an issue to be addressed by the OSS community, as opposed to any other message. We present a corpus of online communication messages randomly sampled from newsgroups and bug trackers, manually annotated as requests or non-requests. We identify several linguistically shallow, content-based heuristics that correlate with the classification and investigate the extent to which they can serve as independent classification criteria. Then, we train machine-learning classifiers on these heuristics. We experiment with a wide range of settings, such as different learners, excluding some heuristics and adding unigram features of various parts-of-speech and frequency. We conclude that some heuristics can perform well, while their accuracy can be improved further using machine learning, at the cost of obtaining manual annotations.

pdf bib
Valency and Word Order in Czech — A Corpus Probe
Kateřina Rysová | Jiří Mírovský

We present a part of broader research on word order aiming at finding factors influencing word order in Czech (i.e. in an inflectional language) and their intensity. The main aim of the paper is to test a hypothesis that obligatory adverbials (in terms of the valency) follow the non-obligatory (i.e. optional) ones in the surface word order. The determined hypothesis was tested by creating a list of features for the decision trees algorithm and by searching in data of the Prague Dependency Treebank using the search tool PML Tree Query. Apart from the valency, our experiment also evaluates importance of several other features, such as argument length and deep syntactic function. Neither of the used methods has proved the given hypothesis but according to the results, there are several other features that influence word order of contextually non-bound free modifiers of a verb in Czech, namely position of the sentence in the text, form and length of the verb modifiers (the whole subtrees), and the semantic dependency relation (functor) of the modifiers.

pdf bib
Harmonization of German Lexical Resources for Opinion Mining
Thierry Declerck | Hans-Ulrich Krieger

We present on-going work on the harmonization of existing German lexical resources in the field of opinion and sentiment mining. The input of our harmonization effort consisted in four distinct lexicons of German word forms, encoded either as lemmas or as full forms, marked up with polarity features, at distinct granularity levels. We describe how the lexical resources have been mapped onto each other, generating a unique list of entries, with unified Part-of-Speech information and basic polarity features. Future work will be dedicated to the comparison of the harmonized lexicon with German corpora annotated with polarity information. We are further aiming at both linking the harmonized German lexical resources with similar resources in other languages and publishing the resulting set of lexical data in the context of the Linguistic Linked Open Data cloud.

pdf bib
Word-Formation Network for Czech
Magda Ševčíková | Zdeněk Žabokrtský

In the present paper, we describe the development of the lexical network DeriNet, which captures core word-formation relations on the set of around 266 thousand Czech lexemes. The network is currently limited to derivational relations because derivation is the most frequent and most productive word-formation process in Czech. This limitation is reflected in the architecture of the network: each lexeme is allowed to be linked up with just a single base word; composition as well as combined processes (composition with derivation) are thus not included. After a brief summarization of theoretical descriptions of Czech derivation and the state of the art of NLP approaches to Czech derivation, we discuss the linguistic background of the network and introduce the formal structure of the network and the semi-automatic annotation procedure. The network was initialized with a set of lexemes whose existence was supported by corpus evidence. Derivational links were created using three sources of information: links delivered by a tool for morphological analysis, links based on an automatically discovered set of derivation rules, and on a grammar-based set of rules. Finally, we propose some research topics which could profit from the existence of such lexical network.

pdf bib
An Analysis of Older Users’ Interactions with Spoken Dialogue Systems
Jamie Bost | Johanna Moore

This study explores communication differences between older and younger users with a task-oriented spoken dialogue system. Previous analyses of the MATCH corpus show that older users have significantly longer dialogues than younger users and that they are less satisfied with the system. Open questions remain regarding the relationship between information recall and cognitive abilities. This study documents a length annotation scheme designed to explore causes of additional length in the dialogues and the relationships between length, cognitive abilities, user satisfaction, and information recall. Results show that primary causes of older users’ additional length include using polite vocabulary, providing additional information relevant to the task, and using full sentences to respond to the system. Regression models were built to predict length from cognitive abilities and user satisfaction from length. Overall, users with higher cognitive ability scores had shorter dialogues than users with lower cognitive ability scores, and users with shorter dialogues were more satisfied with the system than users with longer dialogues. Dialogue length and cognitive abilities were significantly correlated with information recall. Overall, older users tended to use a human-to-human communication style with the system, whereas younger users tended to adopt a factual interaction style.

pdf bib
Innovations in Parallel Corpus Search Tools
Martin Volk | Johannes Graën | Elena Callegaro

Recent years have seen an increased interest in and availability of parallel corpora. Large corpora from international organizations (e.g. European Union, United Nations, European Patent Office), or from multilingual Internet sites (e.g. OpenSubtitles) are now easily available and are used for statistical machine translation but also for online search by different user groups. This paper gives an overview of different usages and different types of search systems. In the past, parallel corpus search systems were based on sentence-aligned corpora. We argue that automatic word alignment allows for major innovations in searching parallel corpora. Some online query systems already employ word alignment for sorting translation variants, but none supports the full query functionality that has been developed for parallel treebanks. We propose to develop such a system for efficiently searching large parallel corpora with a powerful query language.

pdf bib
Machine Translationness: Machine-likeness in Machine Translation Evaluation
Joaquim Moré | Salvador Climent

Machine translationness (MTness) is the linguistic phenomena that make machine translations distinguishable from human translations. This paper intends to present MTness as a research object and suggests an MT evaluation method based on determining whether the translation is machine-like instead of determining its human-likeness as in evaluation current approaches. The method rates the MTness of a translation with a metric, the MTS (Machine Translationness Score). The MTS calculation is in accordance with the results of an experimental study on machine translation perception by common people. MTS proved to correlate well with human ratings on translation quality. Besides, our approach allows the performance of cheap evaluations since expensive resources (e.g. reference translations, training corpora) are not needed. The paper points out the challenge of dealing with MTness as an everyday phenomenon caused by the massive use of MT.

pdf bib
Towards an environment for the production and the validation of lexical semantic resources
Mikaël Morardo | Éric Villemonte de la Clergerie

We present the components of a processing chain for the creation, visualization, and validation of lexical resources (formed of terms and relations between terms). The core of the chain is a component for building lexical networks relying on Harris’ distributional hypothesis applied on the syntactic dependencies produced by the French parser FRMG on large corpora. Another important aspect concerns the use of an online interface for the visualization and collaborative validation of the resulting resources.

pdf bib
The Distress Analysis Interview Corpus of human and computer interviews
Jonathan Gratch | Ron Artstein | Gale Lucas | Giota Stratou | Stefan Scherer | Angela Nazarian | Rachel Wood | Jill Boberg | David DeVault | Stacy Marsella | David Traum | Skip Rizzo | Louis-Philippe Morency

The Distress Analysis Interview Corpus (DAIC) contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post traumatic stress disorder. The interviews are conducted by humans, human controlled agents and autonomous agents, and the participants include both distressed and non-distressed individuals. Data collected include audio and video recordings and extensive questionnaire responses; parts of the corpus have been transcribed and annotated for a variety of verbal and non-verbal features. The corpus has been used to support the creation of an automated interviewer agent, and for research on the automatic identification of psychological distress.

pdf bib
Representing Multimodal Linguistic Annotated data
Brigitte Bigi | Tatsuya Watanabe | Laurent Prévot

The question of interoperability for linguistic annotated resources covers different aspects. First, it requires a representation framework making it possible to compare, and eventually merge, different annotation schema. In this paper, a general description level representing the multimodal linguistic annotations is proposed. It focuses on time representation and on the data content representation: This paper reconsiders and enhances the current and generalized representation of annotations. An XML schema of such annotations is proposed. A Python API is also proposed. This framework is implemented in a multi-platform software and distributed under the terms of the GNU Public License.

pdf bib
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer
Timur Gilmanov | Olga Scrivner | Sandra Kübler

It is well known that word aligned parallel corpora are valuable linguistic resources. Since many factors affect automatic alignment quality, manual post-editing may be required in some applications. While there are several state-of-the-art word-aligners, such as GIZA++ and Berkeley, there is no simple visual tool that would enable correcting and editing aligned corpora of different formats. We have developed SWIFT Aligner, a free, portable software that allows for visual representation and editing of aligned corpora from several most commonly used formats: TALP, GIZA, and NAACL. In addition, our tool has incorporated part-of-speech and syntactic dependency transfer from an annotated source language into an unannotated target language, by means of word-alignment.

pdf bib
Semi-automatic annotation of the UCU accents speech corpus
Rosemary Orr | Marijn Huijbregts | Roeland van Beek | Lisa Teunissen | Kate Backhouse | David van Leeuwen

Annotation and labeling of speech tasks in large multitask speech corpora is a necessary part of preparing a corpus for distribution. We address three approaches to annotation and labeling: manual, semi automatic and automatic procedures for labeling the UCU Accent Project speech data, a multilingual multitask longitudinal speech corpus. Accuracy and minimal time investment are the priorities in assessing the efficacy of each procedure. While manual labeling based on aural and visual input should produce the most accurate results, this approach is error-prone because of its repetitive nature. A semi automatic event detection system requiring manual rejection of false alarms and location and labeling of misses provided the best results. A fully automatic system could not be applied to entire speech recordings because of the variety of tasks and genres. However, it could be used to annotate separate sentences within a specific task. Acoustic confidence measures can correctly detect sentences that do not match the text with an EER of 3.3%

pdf bib
Comparative Analysis of Portuguese Named Entities Recognition Tools
Daniela Amaral | Evandro Fonseca | Lucelene Lopes | Renata Vieira

This paper describes an experiment to compare four tools to recognize named entities in Portuguese texts. The experiment was made over the HAREM corpora, a golden standard for named entities recognition in Portuguese. The tools experimented are based on natural language processing techniques and also machine learning. Specifically, one of the tools is based on Conditional random fields, an unsupervised machine learning model that has being used to named entities recognition in several languages, while the other tools follow more traditional natural language approaches. The comparison results indicate advantages for different tools according to the different classes of named entities. Despite of such balance among tools, we conclude pointing out foreseeable advantages to the machine learning based tool.

pdf bib
A corpus of European Portuguese child and child-directed speech
Ana Lúcia Santos | Michel Généreux | Aida Cardoso | Celina Agostinho | Silvana Abalada

We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntax and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model.

pdf bib
Using C5.0 and Exhaustive Search for Boosting Frame-Semantic Parsing Accuracy
Guntis Barzdins | Didzis Gosko | Laura Rituma | Peteris Paikens

Frame-semantic parsing is a kind of automatic semantic role labeling performed according to the FrameNet paradigm. The paper reports a novel approach for boosting frame-semantic parsing accuracy through the use of the C5.0 decision tree classifier, a commercial version of the popular C4.5 decision tree classifier, and manual rule enhancement. Additionally, the possibility to replace C5.0 by an exhaustive search based algorithm (nicknamed C6.0) is described, leading to even higher frame-semantic parsing accuracy at the expense of slightly increased training time. The described approach is particularly efficient for languages with small FrameNet annotated corpora as it is for Latvian, which is used for illustration. Frame-semantic parsing accuracy achieved for Latvian through the C6.0 algorithm is on par with the state-of-the-art English frame-semantic parsers. The paper includes also a frame-semantic parsing use-case for extracting structured information from unstructured newswire texts, sometimes referred to as bridging of the semantic gap.

pdf bib
‘interHist’ ̶ an interactive visual interface for corpus exploration
Verena Lyding | Lionel Nicolas | Egon Stemle

In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISA corpus of Italian web texts, interHist aims at facilitating the exploration of large results sets to linguistic corpus searches. This objective is approached by providing an interactive visual overview of the data, which supports the user-steered navigation by means of interactive filtering. It allows to dynamically switch between an overview on the data and a detailed view on results in their immediate textual context, thus helping to detect and inspect relevant hits more efficiently. We provide background information on corpus linguistics and related work on visualizations for language and linguistic data. We introduce the architecture of interHist, by detailing the data structure it relies on, describing the visualization design and providing technical details of the implementation and its integration with the corpus querying environment. Finally, we illustrate its usage by presenting a use case for the analysis of the composition of Italian noun phrases.

pdf bib
Identification of Multiword Expressions in the brWaC
Rodrigo Boos | Kassius Prestes | Aline Villavicencio

Although corpus size is a well known factor that affects the performance of many NLP tasks, for many languages large freely available corpora are still scarce. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus kool initiative. To indirectly assess the quality of the resulting corpus we examined the impact of corpus origin in a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, the expressions obtained from each corpus are of comparable quality and indicate that corpus origin has no impact on this task.

pdf bib
Collocation or Free Combination? — Applying Machine Translation Techniques to identify collocations in Japanese
Lis Pereira | Elga Strafella | Yuji Matsumoto

This work presents an initial investigation on how to distinguish collocations from free combinations. The assumption is that, while free combinations can be literally translated, the overall meaning of collocations is different from the sum of the translation of its parts. Based on that, we verify whether a machine translation system can help us perform such distinction. Results show that it improves the precision compared with standard methods of collocation identification through statistical association measures.

pdf bib
Extrinsic Corpus Evaluation with a Collocation Dictionary Task
Adam Kilgarriff | Pavel Rychlý | Miloš Jakubíček | Vojtěch Kovář | Vít Baisa | Lucia Kocincová

The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of `general language’. The task we set is automatic creation of a publication-quality collocations dictionary. For a sample of 100 headwords of Czech and 100 of English, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.

pdf bib
AusTalk: an audio-visual corpus of Australian English
Dominique Estival | Steve Cassidy | Felicity Cox | Denis Burnham

This paper describes the AusTalk corpus, which was designed and created through the Big ASC, a collaborative project with the two main goals of providing a standardised infrastructure for audio-visual recordings in Australia and of producing a large audio-visual corpus of Australian English, with 3 hours of AV recordings for 1000 speakers. We first present the overall project, then describe the corpus itself and its components, the strict data collection protocol with high levels of standardisation and automation, and the processes put in place for quality control. We also discuss the annotation phase of the project, along with its goals and challenges; a major contribution of the project has been to explore procedures for automating annotations and we present our solutions. We conclude with the current status of the corpus and with some examples of research already conducted with this new resource. AusTalk is one of the corpora included in the HCS vLab, which is briefly sketched in the conclusion.

pdf bib
Comprehensive Annotation of Multiword Expressions in a Social Web Corpus
Nathan Schneider | Spencer Onuffer | Nora Kazour | Emily Danchik | Michael T. Mordowanec | Henrietta Conrad | Noah A. Smith

Multiword expressions (MWEs) are quite frequent in languages such as English, but their diversity, the scarcity of individual MWE types, and contextual ambiguity have presented obstacles to corpus-based studies and NLP systems addressing them as a class. Here we advocate for a comprehensive annotation approach: proceeding sentence by sentence, our annotators manually group tokens into MWEs according to guidelines that cover a broad range of multiword phenomena. Under this scheme, we have fully annotated an English web corpus for multiword expressions, including those containing gaps.

pdf bib
Automatic semantic relation extraction from Portuguese texts
Leonardo Sameshima Taba | Helena Caseli

Nowadays we are facing a growing demand for semantic knowledge in computational applications, particularly in Natural Language Processing (NLP). However, there aren’t sufficient human resources to produce that knowledge at the same rate of its demand. Considering the Portuguese language, which has few resources in the semantic area, the situation is even more alarming. Aiming to solve that problem, this work investigates how some semantic relations can be automatically extracted from Portuguese texts. The two main approaches investigated here are based on (i) textual patterns and (ii) machine learning algorithms. Thus, this work investigates how and to which extent these two approaches can be applied to the automatic extraction of seven binary semantic relations (is-a, part-of, location-of, effect-of, property-of, made-of and used-for) in Portuguese texts. The results indicate that machine learning, in particular Support Vector Machines, is a promising technique for the task, although textual patterns presented better results for the used-for relation.

pdf bib
A Multidialectal Parallel Corpus of Arabic
Houda Bouamor | Nizar Habash | Kemal Oflazer

The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic speaking communities. These dialects vary widely from region to region and to a lesser extent from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However they are the primary vehicles of communication (face-to-face and recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.

pdf bib
Transfer learning of feedback head expressions in Danish and Polish comparable multimodal corpora
Costanza Navarretta | Magdalena Lis

The paper is an investigation of the reusability of the annotations of head movements in a corpus in a language to predict the feedback functions of head movements in a comparable corpus in another language. The two corpora consist of naturally occurring triadic conversations in Danish and Polish, which were annotated according to the same scheme. The intersection of common annotation features was used in the experiments. A Naïve Bayes classifier was trained on the annotations of a corpus and tested on the annotations of the other corpus. Training and test datasets were then reversed and the experiments repeated. The results show that the classifier identifies more feedback behaviours than the majority baseline in both cases and the improvements are significant. The performance of the classifier decreases significantly compared with the results obtained when training and test data belong to the same corpus. Annotating multimodal data is resource consuming, thus the results are promising. However, they also confirm preceding studies that have identified both similarities and differences in the use of feedback head movements in different languages. Since our datasets are small and only regard a communicative behaviour in two languages, the experiments should be tested on more data types.

pdf bib
Comparing two acquisition systems for automatically building an English—Croatian parallel corpus from multilingual websites
Miquel Esplà-Gomis | Filip Klubička | Nikola Ljubešić | Sergio Ortiz-Rojas | Vassilis Papavassiliou | Prokopis Prokopidis

In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English―Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manually examined and the success rate was computed on the collection of pairs of documents detected by each setting. We compare the performance of the settings and the amount of different corpora detected by each setting. In addition, we describe the resource obtained, both by the settings and through the human evaluation, which has been released as a high-quality parallel corpus.

pdf bib
Hashtag Occurrences, Layout and Translation: A Corpus-driven Analysis of Tweets Published by the Canadian Government
Fabrizio Gotti | Phillippe Langlais | Atefeh Farzindar

We present an aligned bilingual corpus of 8758 tweet pairs in French and English, derived from Canadian government agencies. Hashtags appear in a tweet’s prologue, announcing its topic, or in the tweet’s text in lieu of traditional words, or in an epilogue. Hashtags are words prefixed with a pound sign in 80% of the cases. The rest is mostly multiword hashtags, for which we describe a segmentation algorithm. A manual analysis of the bilingual alignment of 5000 hashtags shows that 5% (French) to 18% (English) of them don’t have a counterpart in their containing tweet’s translation. This analysis shows that 80% of multiword hashtags are correctly translated by humans, and that the mistranslation of the rest may be due to incomplete translation directives regarding social media. We show how these resources and their analysis can guide the design of a machine translation pipeline, and its evaluation. A baseline system implementing a tweet-specific tokenizer yields promising results. The system is improved by translating epilogues, prologues, and text separately. We attempt to feed the SMT engine with the original hashtag and some alternatives (“dehashed” version or a segmented version of multiword hashtags), but translation quality improves at the cost of hashtag recall.

pdf bib
Towards an Integration of Syntactic and Temporal Annotations in Estonian
Siim Orasmaa

We investigate the question how manually created syntactic annotations can be used to analyse and improve consistency in manually created temporal annotations. Our work introduces an annotation project for Estonian, where temporal annotations in TimeML framework were manually added to a corpus containing gold standard morphological and dependency syntactic annotations. In the first part of our work, we evaluate the consistency of manual temporal annotations, focusing on event annotations. We use syntactic annotations to distinguish different event annotation models, and we observe highest inter-annotator agreements on models representing ”prototypical events” (event verbs and events being part of the syntactic predicate of clause). In the second part of our work, we investigate how to improve consistency between syntactic and temporal annotations. We test on whether syntactic annotations can be used to validate temporal annotations: to find missing or partial annotations. Although the initial results indicate that such validation is promising, we also note that a better bridging between temporal (semantic) and syntactic annotations is needed for a complete automatic validation.

pdf bib
HuRIC: a Human Robot Interaction Corpus
Emanuele Bastianelli | Giuseppe Castellucci | Danilo Croce | Luca Iocchi | Roberto Basili | Daniele Nardi

Recent years show the development of large scale resources (e.g. FrameNet for the Frame Semantics) that supported the definition of several state-of-the-art approaches in Natural Language Processing. However, the reuse of existing resources in heterogeneous domains such as Human Robot Interaction is not straightforward. The generalization offered by many data driven methods is strongly biased by the employed data, whose performance in out-of-domain conditions exhibit large drops. In this paper, we present the Human Robot Interaction Corpus (HuRIC). It is made of audio files paired with their transcriptions referring to commands for a robot, e.g. in a home environment. The recorded sentences are annotated with different kinds of linguistic information, ranging from morphological and syntactic information to rich semantic information, according to the Frame Semantics, to characterize robot actions, and Spatial Semantics, to capture the robot environment. All texts are represented through the Abstract Meaning Representation, to adopt a simple but expressive representation of commands, that can be easily translated into the internal representation of the robot.

pdf bib
On the origin of errors: A fine-grained analysis of MT and PE errors and their relationship
Joke Daems | Lieve Macken | Sonia Vandepitte

In order to improve the symbiosis between machine translation (MT) system and post-editor, it is not enough to know that the output of one system is better than the output of another system. A fine-grained error analysis is needed to provide information on the type and location of errors occurring in MT and the corresponding errors occurring after post-editing (PE). This article reports on a fine-grained translation quality assessment approach which was applied to machine translated-texts and the post-edited versions of these texts, made by student post-editors. By linking each error to the corresponding source text-passage, it is possible to identify passages that were problematic in MT, but not after PE, or passages that were problematic even after PE. This method provides rich data on the origin and impact of errors, which can be used to improve post-editor training as well as machine translation systems. We present the results of a pilot experiment on the post-editing of newspaper articles and highlight the advantages of our approach.

pdf bib
The WaveSurfer Automatic Speech Recognition Plugin
Giampiero Salvi | Niklas Vanhainen

This paper presents a plugin that adds automatic speech recognition (ASR) functionality to the WaveSurfer sound manipulation and visualisation program. The plugin allows the user to run continuous speech recognition on spoken utterances, or to align an already available orthographic transcription to the spoken material. The plugin is distributed as free software and is based on free resources, namely the Julius speech recognition engine and a number of freely available ASR resources for different languages. Among these are the acoustic and language models we have created for Swedish using the NST database.

pdf bib
Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license
Matěj Korvas | Ondřej Plátek | Ondřej Dušek | Lukáš Žilka | Filip Jurčíček

We present a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition (ASR) in spoken dialogue systems (SDSs). The data comprise 45 hours of speech in English and over 18 hours in Czech. Large part of the data, both audio and transcriptions, was collected using crowdsourcing, the rest are transcriptions by hired transcribers. We release the data together with scripts for data pre-processing and building acoustic models using the HTK and Kaldi ASR toolkits. We publish also the trained models described in this paper. The data are released under the CC-BY-SA~3.0 license, the scripts are licensed under Apache~2.0. In the paper, we report on the methodology of collecting the data, on the size and properties of the data, and on the scripts and their use. We verify the usability of the datasets by training and evaluating acoustic models using the presented data and scripts.

pdf bib
Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations from Patent Texts
Antje Schlaf | Claudia Bobach | Matthias Irmer

This paper describes the creation of a gold standard for chemistry-disease relations in patent texts. We start with an automated annotation of named entities of the domains chemistry (e.g. “propranolol”) and diseases (e.g. “hypertension”) as well as of related domains like methods and substances. After that, domain-relevant relations between these entities, e.g. “propranolol treats hypertension”, have been manually annotated. The corpus is intended to be suitable for developing and evaluating relation extraction methods. In addition, we present two reasoning methods of high precision for automatically extending the set of extracted relations. Chain reasoning provides a method to infer and integrate additional, indirectly expressed relations occurring in relation chains. Enumeration reasoning exploits the frequent occurrence of enumerations in patents and automatically derives additional relations. These two methods are applicable both for verifying and extending the manually annotated data as well as for potential improvements of automatic relation extraction.

pdf bib
The SSPNet-Mobile Corpus: Social Signal Processing Over Mobile Phones.
Anna Polychroniou | Hugues Salamin | Alessandro Vinciarelli

This article presents the SSPNet-Mobile Corpus, a collection of 60 mobile phone calls between unacquainted individuals (120 subjects). The corpus is designed to support research on non-verbal behavior and it has been manually annotated into conversational topics and behavioral events (laughter, fillers, back-channel, etc.). Furthermore, the corpus includes, for each subject, psychometric questionnaires measuring personality, conflict attitude and interpersonal attraction. Besides presenting the main characteristics of the corpus (scenario, subjects, experimental protocol, sensing approach, psychometric measurements), the paper reviews the main results obtained so far using the data.

pdf bib
Projection-based Annotation of a Polish Dependency Treebank
Alina Wróblewska | Adam Przepiórkowski

This paper presents an approach of automatic annotation of sentences with dependency structures. The approach builds on the idea of cross-lingual dependency projection. The presented method of acquiring dependency trees involves a weighting factor in the processes of projecting source dependency relations to target sentences and inducing well-formed target dependency trees from sets of projected dependency relations. Using a parallel corpus, source trees are transferred onto equivalent target sentences via an extended set of alignment links. Projected arcs are initially weighted according to the certainty of word alignment links. Then, arc weights are recalculated using a method based on the EM selection algorithm. Maximum spanning trees selected from EM-scored digraphs and labelled with appropriate grammatical functions constitute a target dependency treebank. Extrinsic evaluation shows that parsers trained on such a treebank may perform comparably to parsers trained on a manually developed treebank.

pdf bib
Bootstrapping an Italian VerbNet: data-driven analysis of verb alternations
Gianluca Lebani | Veronica Viola | Alessandro Lenci

The goal of this paper is to propose a classification of the syntactic alternations admitted by the most frequent Italian verbs. The data-driven two-steps procedure exploited and the structure of the identified classes of alternations are presented in depth and discussed. Even if this classification has been developed with a practical application in mind, namely the semi-automatic building of a VerbNet-like lexicon for Italian verbs, partly following the methodology proposed in the context of the VerbNet project, its availability may have a positive impact on several related research topics and Natural Language Processing tasks

pdf bib
Self-training a Constituency Parser using n-gram Trees
Arda Çelebi | Arzucan Özgür

In this study, we tackle the problem of self-training a feature-rich discriminative constituency parser. We approach the self-training problem with the assumption that while the full sentence parse tree produced by a parser may contain errors, some portions of it are more likely to be correct. We hypothesize that instead of feeding the parser the guessed full sentence parse trees of its own, we can break them down into smaller ones, namely n-gram trees, and perform self-training on them. We build an n-gram parser and transfer the distinct expertise of the $n$-gram parser to the full sentence parser by using the Hierarchical Joint Learning (HJL) approach. The resulting jointly self-trained parser obtains slight improvement over the baseline.

pdf bib
A Tagged Corpus and a Tagger for Urdu
Bushra Jawaid | Amir Kamran | Ondřej Bojar

In this paper, we describe a release of a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012) who use three different taggers and then apply a voting scheme to disambiguate among the different choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release the tagged corpus. Additionally, we use this data to train a single standalone tagger which will hopefully significantly simplify Urdu processing. The standalone tagger obtains the accuracy of 88.74% on test data.

pdf bib
Lexical Substitution Dataset for German
Kostadin Cholakov | Chris Biemann | Judith Eckle-Kohler | Iryna Gurevych

This article describes a lexical substitution dataset for German. The whole dataset contains 2,040 sentences from the German Wikipedia, with one target word in each sentence. There are 51 target nouns, 51 adjectives, and 51 verbs randomly selected from 3 frequency groups based on the lemma frequency list of the German WaCKy corpus. 200 sentences have been annotated by 4 professional annotators and the remaining sentences by 1 professional annotator and 5 additional annotators who have been recruited via crowdsourcing. The resulting dataset can be used to evaluate not only lexical substitution systems, but also different sense inventories and word sense disambiguation systems.

pdf bib
A cascade approach for complex-type classification
Lauren Romeo | Sara Mendes | Núria Bel

The work detailed in this paper describes a 2-step cascade approach for the classification of complex-type nominals. We describe an experiment that demonstrates how a cascade approach performs when the task consists in distinguishing nominals from a given complex-type from any other noun in the language. Overall, our classifier successfully identifies very specific and not highly frequent lexical items such as complex-types with high accuracy, and distinguishes them from those instances that are not complex types by using lexico-syntactic patterns indicative of the semantic classes corresponding to each of the individual sense components of the complex type. Although there is still room for improvement with regard to the coverage of the classifiers developed, the cascade approach increases the precision of classification of the complex-type nouns that are covered in the experiment presented.

pdf bib
Generating a Resource for Products and Brandnames Recognition. Application to the Cosmetic Domain.
Cédric Lopez | Frédérique Segond | Olivier Hondermarck | Paolo Curtoni | Luca Dini

Named Entity Recognition task needs high-quality and large-scale resources. In this paper, we present RENCO, a based-rules system focused on the recognition of entities in the Cosmetic domain (brandnames, product names, …). RENCO has two main objectives: 1) Generating resources for named entity recognition; 2) Mining new named entities relying on the previous generated resources. In order to build lexical resources for the cosmetic domain, we propose a system based on local lexico-syntactic rules complemented by a learning module. As the outcome of the system, we generate both a simple lexicon and a structured lexicon. Results of the evaluation show that even if RENCO outperforms a classic Conditional Random Fields algorithm, both systems should combine their respective strengths.

pdf bib
Annotation of specialized corpora using a comprehensive entity and relation scheme
Louise Deléger | Anne-Laure Ligozat | Cyril Grouin | Pierre Zweigenbaum | Aurélie Névéol

Annotated corpora are essential resources for many applications in Natural Language Processing. They provide insight on the linguistic and semantic characteristics of the genre and domain covered, and can be used for the training and evaluation of automatic tools. In the biomedical domain, annotated corpora of English texts have become available for several genres and subfields. However, very few similar resources are available for languages other than English. In this paper we present an effort to produce a high-quality corpus of clinical documents in French, annotated with a comprehensive scheme of entities and relations. We present the annotation scheme as well as the results of a pilot annotation study covering 35 clinical documents in a variety of subfields and genres. We show that high inter-annotator agreement can be achieved using a complex annotation scheme.

pdf bib
Annotation Pro + TGA: automation of speech timing analysis
Katarzyna Klessa | Dafydd Gibbon

This paper reports on two tools for the automatic statistical analysis of selected properties of speech timing on the basis of speech annotation files. The tools, one online (TGA, Time Group Analyser) and one offline (Annotation Pro+TGA), are intended to support the rapid analysis of speech timing data without the need to create specific scripts or spreadsheet functions for this purpose. The software calculates, inter alia, mean, median, rPVI, nPVI, slope and intercept functions within interpausal groups, provides visualisations of timing patterns, as well as correlations between these, and parses interpausal groups into hierarchies based on duration relations. Although many studies, especially in speech technology, use computational means, enquiries have shown that a large number of phoneticians and phonetics students do not have script creation skills and therefore use traditional copy+spreadsheet techniques, which are slow, preclude the analysis of large data sets, and are prone to inconsistencies. The present tools have been tested in a number of studies on English, Mandarin and Polish, and are introduced here with reference to results from these studies.

pdf bib
Polysemy Index for Nouns: an Experiment on Italian using the PAROLE SIMPLE CLIPS Lexical Database
Francesca Frontini | Valeria Quochi | Sebastian Padó | Monica Monachini | Jason Utt

An experiment is presented to induce a set of polysemous basic type alternations (such as Animal-Food, or Building-Institution) by deriving them from the sense alternations found in an existing lexical resource. The paper builds on previous work and applies those results to the Italian lexicon PAROLE SIMPLE CLIPS. The new results show how the set of frequent type alternations that can be induced from the lexicon is partly different from the set of polysemy relations selected and explicitely applied by lexicographers when building it. The analysis of mismatches shows that frequent type alternations do not always correpond to prototypical polysemy relations, nevertheless the proposed methodology represents a useful tool offered to lexicographers to systematically check for possible gaps in their resource.

pdf bib
YouDACC: the Youtube Dialectal Arabic Comment Corpus
Ahmed Salama | Houda Bouamor | Behrang Mohit | Kemal Oflazer

This paper presents YOUDACC, an automatically annotated large-scale multi-dialectal Arabic corpus collected from user comments on Youtube videos. Our corpus covers different groups of dialects: Egyptian (EG), Gulf (GU), Iraqi (IQ), Maghrebi (MG) and Levantine (LV). We perform an empirical analysis on the crawled corpus and demonstrate that our location-based proposed method is effective for the task of dialect labeling.

pdf bib
Towards an Encyclopedia of Compositional Semantics: Documenting the Interface of the English Resource Grammar
Dan Flickinger | Emily M. Bender | Stephan Oepen

We motivate and describe the design and development of an emerging encyclopedia of compositional semantics, pursuing three objectives. We first seek to compile a comprehensive catalogue of interoperable semantic analyses, i.e., a precise characterization of meaning representations for a broad range of common semantic phenomena. Second, we operationalize the discovery of semantic phenomena and their definition in terms of what we call their semantic fingerprint, a formal account of the building blocks of meaning representation involved and their configuration. Third, we ground our work in a carefully constructed semantic test suite of minimal exemplars for each phenomenon, along with a `target’ fingerprint that enables automated regression testing. We work towards these objectives by codifying and documenting the body of knowledge that has been constructed in a long-term collaborative effort, the development of the LinGO English Resource Grammar. Documentation of its semantic interface is a prerequisite to use by non-experts of the grammar and the analyses it produces, but this effort also advances our own understanding of relevant interactions among phenomena, as well as of areas for future work in the grammar.

pdf bib
Enriching the “Senso Comune” Platform with Automatically Acquired Data
Tommaso Caselli | Laure Vieu | Carlo Strapparava | Guido Vetere

This paper reports on research activities on automatic methods for the enrichment of the Senso Comune platform. At this stage of development, we will report on two tasks, namely word sense alignment with MultiWordNet and automatic acquisition of Verb Shallow Frames from sense annotated data in the MultiSemCor corpus. The results obtained are satisfying. We achieved a final F-measure of 0.64 for noun sense alignment and a F-measure of 0.47 for verb sense alignment, and an accuracy of 68\% on the acquisition of Verb Shallow Frames.

pdf bib
Online experiments with the Percy software framework - experiences and some early results
Christoph Draxler

In early 2012 the online perception experiment software Percy was deployed on a production server at our lab. Since then, 38 experiments have been made publicly available, with a total of 3078 experiment sessions. In the course of time, the software has been continuously updated and extended to adapt to changing user requirements. Web-based editors for the structure and layout of the experiments have been developed. This paper describes the system architecture, presents usage statistics, discusses typical characteristics of online experiments, and gives an outlook on ongoing work. lists all currently active experiments.

pdf bib
Improving the exploitation of linguistic annotations in ELAN
Onno Crasborn | Han Sloetjes

This paper discusses some improvements in recent and planned versions of the multimodal annotation tool ELAN, which are targeted at improving the usability of annotated files. Increased support for multilingual documents is provided, by allowing for multilingual vocabularies and by specifying a language per document, annotation layer (tier) or annotation. In addition, improvements in the search possibilities and the display of the results have been implemented, which are especially relevant in the interpretation of the results of complex multi-tier searches.

pdf bib
A Deep Context Grammatical Model For Authorship Attribution
Simon Fuller | Phil Maguire | Philippe Moser

We define a variable-order Markov model, representing a Probabilistic Context Free Grammar, built from the sentence-level, de-lexicalized parse of source texts generated by a standard lexicalized parser, which we apply to the authorship attribution task. First, we motivate this model in the context of previous research on syntactic features in the area, outlining some of the general strengths and limitations of the overall approach. Next we describe the procedure for building syntactic models for each author based on training cases. We then outline the attribution process - assigning authorship to the model which yields the highest probability for the given test case. We demonstrate the efficacy for authorship attribution over different Markov orders and compare it against syntactic features trained by a linear kernel SVM. We find that the model performs somewhat less successfully than the SVM over similar features. In the conclusion, we outline how we plan to employ the model for syntactic evaluation of literary texts.

pdf bib
Manual Analysis of Structurally Informed Reordering in German-English Machine Translation
Teresa Herrmann | Jan Niehues | Alex Waibel

Word reordering is a difficult task for translation. Common automatic metrics such as BLEU have problems reflecting improvements in target language word order. However, it is a crucial aspect for humans when deciding on translation quality. This paper presents a detailed analysis of a structure-aware reordering approach applied in a German-to-English phrase-based machine translation system. We compare the translation outputs of two translation systems applying reordering rules based on parts-of-speech and syntax trees on a sentence-by-sentence basis. For each sentence-pair we examine the global translation performance and classify local changes in the translated sentences. This analysis is applied to three data sets representing different genres. While the improvement in BLEU differed substantially between the data sets, the manual evaluation showed that both global translation performance as well as individual types of improvements and degradations exhibit a similar behavior throughout the three data sets. We have observed that for 55-64% of the sentences with different translations, the translation produced using the tree-based reordering was considered to be the better translation. As intended by the investigated reordering model, most improvements are achieved by improving the position of the verb or being able to translate a verb that could not be translated before.

pdf bib
Presenting a system of human-machine interaction for performing map tasks.
Gabriele Pallotti | Francesca Frontini | Fabio Affè | Monica Monachini | Stefania Ferrari

A system for human machine interaction is presented, that offers second language learners of Italian the possibility of assessing their competence by performing a map task, namely by guiding the a virtual follower through a map with written instructions in natural language. The underlying natural language processing algorithm is described, and the map authoring infrastructure is presented.

pdf bib
Automatic Extraction of Synonyms for German Particle Verbs from Parallel Data with Distributional Similarity as a Re-Ranking Feature
Moritz Wittmann | Marion Weller | Sabine Schulte im Walde

We present a method for the extraction of synonyms for German particle verbs based on a word-aligned German-English parallel corpus: by translating the particle verb to a pivot, which is then translated back, a set of synonym candidates can be extracted and ranked according to the respective translation probabilities. In order to deal with separated particle verbs, we apply re-ordering rules to the German part of the data. In our evaluation against a gold standard, we compare different pre-processing strategies (lemmatized vs. inflected forms) and introduce language model scores of synonym candidates in the context of the input particle verb as well as distributional similarity as additional re-ranking criteria. Our evaluation shows that distributional similarity as a re-ranking feature is more robust than language model scores and leads to an improved ranking of the synonym candidates. In addition to evaluating against a gold standard, we also present a small-scale manual evaluation.

pdf bib
NASTIA: Negotiating Appointment Setting Interface
Layla El Asri | Rémi Lemonnier | Romain Laroche | Olivier Pietquin | Hatim Khouzaimi

This paper describes a French Spoken Dialogue System (SDS) named NASTIA (Negotiating Appointment SeTting InterfAce). Appointment scheduling is a hybrid task halfway between slot-filling and negotiation. NASTIA implements three different negotiation strategies. These strategies were tested on 1734 dialogues with 385 users who interacted at most 5 times with the SDS and gave a rating on a scale of 1 to 10 for each dialogue. Previous appointment scheduling systems were evaluated with the same experimental protocol. NASTIA is different from these systems in that it can adapt its strategy during the dialogue. The highest system task completion rate with these systems was 81% whereas NASTIA had an 88% average and its best performing strategy even reached 92%. This strategy also significantly outperformed previous systems in terms of overall user rating with an average of 8.28 against 7.40. The experiment also enabled highlighting global recommendations for building spoken dialogue systems.

pdf bib
DINASTI: Dialogues with a Negotiating Appointment Setting Interface
Layla El Asri | Romain Laroche | Olivier Pietquin

This paper describes the DINASTI (DIalogues with a Negotiating Appointment SeTting Interface) corpus, which is composed of 1734 dialogues with the French spoken dialogue system NASTIA (Negotiating Appointment SeTting InterfAce). NASTIA is a reinforcement learning-based system. The DINASTI corpus was collected while the system was following a uniform policy. Each entry of the corpus is a system-user exchange annotated with 120 automatically computable features.The corpus contains a total of 21587 entries, with 385 testers. Each tester performed at most five scenario-based interactions with NASTIA. The dialogues last an average of 10.82 dialogue turns, with 4.45 reinforcement learning decisions. The testers filled an evaluation questionnaire after each dialogue. The questionnaire includes three questions to measure task completion. In addition, it comprises 7 Likert-scaled items evaluating several aspects of the interaction, a numerical overall evaluation on a scale of 1 to 10, and a free text entry. Answers to this questionnaire are provided with DINASTI. This corpus is meant for research on reinforcement learning modelling for dialogue management.

pdf bib
LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization
Annemarie Friedrich | Marina Valeeva | Alexis Palmer

We present LQVSumm, a corpus of about 2000 automatically created extractive multi-document summaries from the TAC 2011 shared task on Guided Summarization, which we annotated with several types of linguistic quality violations. Examples for such violations include pronouns that lack antecedents or ungrammatical clauses. We give details on the annotation scheme and show that inter-annotator agreement is good given the open-ended nature of the task. The annotated summaries have previously been scored for Readability on a numeric scale by human annotators in the context of the TAC challenge; we show that the number of instances of violations of linguistic quality of a summary correlates with these intuitively assigned numeric scores. On a system-level, the average number of violations marked in a system’s summaries achieves higher correlation with the Readability scores than current supervised state-of-the-art methods for assigning a single readability score to a summary. It is our hope that our corpus facilitates the development of methods that not only judge the linguistic quality of automatically generated summaries as a whole, but which also allow for detecting, labeling, and fixing particular violations in a text.

pdf bib
Potsdam Commentary Corpus 2.0: Annotation for Discourse Research
Manfred Stede | Arne Neumann

We present a revised and extended version of the Potsdam Commentary Corpus, a collection of 175 German newspaper commentaries (op-ed pieces) that has been annotated with syntax trees and three layers of discourse-level information: nominal coreference,connectives and their arguments (similar to the PDTB, Prasad et al. 2008), and trees reflecting discourse structure according to Rhetorical Structure Theory (Mann/Thompson 1988). Connectives have been annotated with the help of a semi-automatic tool, Conano (Stede/Heintze 2004), which identifies most connectives and suggests arguments based on their syntactic category. The other layers have been created manually with dedicated annotation tools. The corpus is made available on the one hand as a set of original XML files produced with the annotation tools, based on identical tokenization. On the other hand, it is distributed together with the open-source linguistic database ANNIS3 (Chiarcos et al. 2008; Zeldes et al. 2009), which provides multi-layer search functionality and layer-specific visualization modules. This allows for comfortable qualitative evaluation of the correlations between annotation layers.

pdf bib
GLÀFF, a Large Versatile French Lexicon
Nabil Hathout | Franck Sajous | Basilio Calderone

This paper introduces GLAFF, a large-scale versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLAFF contains, for each entry, inflectional features and phonemic transcriptions. It distinguishes itself from the other available French lexicons by its size, its potential for constant updating and its copylefted license. We explain how we have built GLAFF and compare it to other known resources in terms of coverage and quality of the phonemic transcriptions. We show that its size and quality are strong assets that could allow GLAFF to become a reference lexicon for French NLP and linguistics. Moreover, other derived lexicons can easily be based on GLAFF to satisfy specific needs of various fields such as psycholinguistics.

pdf bib
Dense Components in the Structure of WordNet
Ahti Lohk | Kaarel Allik | Heili Orav | Leo Võhandu

This paper introduces a test-pattern named a dense component for checking inconsistencies in the hierarchical structure of a wordnet. Dense component (viewed as substructure) points out the cases of regular polysemy in the context of multiple inheritance. Definition of the regular polysemy is redefined ― instead of lexical units there are used lexical concepts (synsets). All dense components are evaluated by expert lexicographer. Based on this experiment we give an overview of the inconsistencies which the test-pattern helps to detect. Special attention is turned to all different kind of corrections made by lexicographer. Authors of this paper find that the greatest benefit of the use of dense components is helping to detect if the regular polysemy is justified or not. In-depth analysis has been performed for Estonian Wordnet Version 66. Some comparative figures are also given for the Estonian Wordnet (EstWN) Version 67 and Princeton WordNet (PrWN) Version 3.1. Analysing hierarchies only hypernym-relations are used.

pdf bib
Choosing which to use? A study of distributional models for nominal lexical semantic classification
Lauren Romeo | Gianluca Lebani | Núria Bel | Alessandro Lenci

This paper empirically evaluates the performances of different state-of-the-art distributional models in a nominal lexical semantic classification task. We consider models that exploit various types of distributional features, which thereby provide different representations of nominal behavior in context. The experiments presented in this work demonstrate the advantages and disadvantages of each model considered. This analysis also considers a combined strategy that we found to be capable of leveraging the bottlenecks of each model, especially when large robust data is not available.

pdf bib
Extensions of the Sign Language Recognition and Translation Corpus RWTH-PHOENIX-Weather
Jens Forster | Christoph Schmidt | Oscar Koller | Martin Bellgardt | Hermann Ney

This paper introduces the RWTH-PHOENIX-Weather 2014, a video-based, large vocabulary, German sign language corpus which has been extended over the last two years, tripling the size of the original corpus. The corpus contains weather forecasts simultaneously interpreted into sign language which were recorded from German public TV and manually annotated using glosses on the sentence level and semi-automatically transcribed spoken German extracted from the videos using the open-source speech recognition system RASR. Spatial annotations of the signers’ hands as well as shape and orientation annotations of the dominant hand have been added for more than 40k respectively 10k video frames creating one of the largest corpora allowing for quantitative evaluation of object tracking algorithms. Further, over 2k signs have been annotated using the SignWriting annotation system, focusing on the shape, orientation, movement as well as spatial contacts of both hands. Finally, extended recognition and translation setups are defined, and baseline results are presented.

pdf bib
HESITA(te) in Portuguese
Sara Candeias | Dirce Celorico | Jorge Proença | Arlindo Veiga | Carla Lopes | Fernando Perdigão

Hesitations, so-called disfluencies, are a characteristic of spontaneous speech, playing a primary role in its structure, reflecting aspects of the language production and the management of inter-communication. In this paper we intend to present a database of hesitations in European Portuguese speech - HESITA - as a relevant base of work to study a variety of speech phenomena. Patterns of hesitations, hesitation distribution according to speaking style, and phonetic properties of the fillers are some of the characteristics we extrapolated from the HESITA database. This database also represents an important resource for improvement in synthetic speech naturalness as well as in robust acoustic modelling for automatic speech recognition. The HESITA database is the output of a project in the speech-processing field for European Portuguese held by an interdisciplinary group in intimate articulation between engineering tools and experience and the linguistic approach.

pdf bib
MUHIT: A Multilingual Harmonized Dictionary
Sameh Alansary

This paper discusses a trial to build a multilingual harmonized dictionary that contains more than 40 languages, with special reference to Arabic which represents about 20% of the whole size of the dictionary. This dictionary is called MUHIT which is an interactive multilingual dictionary application. It is a web application that makes it easily accessible to all users. MUHIT is developed within the Universal Networking Language (UNL) framework by the UNDL Foundation, in cooperation with Bibliotheca Alexandrina (BA). This application targets to serve specialists and non-specialists. It provides users with full linguistic description to each lexical item. This free application is useful to many NLP tasks such as multilingual translation and cross-language synonym search. This dictionary is built depending on WordNet and corpus based approaches, in a specially designed linguistic environment called UNLariam that is developed by the UNLD foundation. This dictionary is the first launched application by the UNLD foundation.

pdf bib
Predicate Matrix: extending SemLink through WordNet mappings
Maddalen Lopez de Lacalle | Egoitz Laparra | German Rigau

This paper presents the Predicate Matrix v1.1, a new lexical resource resulting from the integration of multiple sources of predicate information including FrameNet, VerbNet, PropBank and WordNet. We start from the basis of SemLink. Then, we use advanced graph-based algorithms to further extend the mapping coverage of SemLink. Second, we also exploit the current content of SemLink to infer new role mappings among the different predicate schemas. As a result, we have obtained a new version of the Predicate Matrix which largely extends the current coverage of SemLink and the previous version of the Predicate Matrix.

pdf bib
TaLAPi — A Thai Linguistically Annotated Corpus for Language Processing
AiTi Aw | Sharifah Mahani Aljunied | Nattadaporn Lertcheva | Sasiwimon Kalunsima

This paper discusses a Thai corpus, TaLAPi, fully annotated with word segmentation (WS), part-of-speech (POS) and named entity (NE) information with the aim to provide a high-quality and sufficiently large corpus for real-life implementation of Thai language processing tools. The corpus contains 2,720 articles (1,043,471words) from the entertainment and lifestyle (NE&L) domain and 5,489 articles (3,181,487 words) in the news (NEWS) domain, with a total of 35 POS tags and 10 named entity categories. In particular, we present an approach to segment and tag foreign and loan words expressed in transliterated or original form in Thai text corpora. We see this as an area for study as adapted and un-adapted foreign language sequences have not been well addressed in the literature and this poses a challenge to the annotation process due to the increasing use and adoption of foreign words in the Thai language nowadays. To reduce the ambiguities in POS tagging and to provide rich information for facilitating Thai syntactic analysis, we adapted the POS tags used in ORCHID and propose a framework to tag Thai text and also addresses the tagging of loan and foreign words based on the proposed segmentation strategy. TaLAPi also includes a detailed guideline for tagging the 10 named entity categories

pdf bib
T2K^2: a System for Automatically Extracting and Organizing Knowledge from Texts
Felice Dell’Orletta | Giulia Venturi | Andrea Cimino | Simonetta Montemagni

In this paper, we present T2K^2, a suite of tools for automatically extracting domain―specific knowledge from collections of Italian and English texts. T2K^2 (Text―To―Knowledge v2) relies on a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine learning which are dynamically integrated to provide an accurate and incremental representation of the content of vast repositories of unstructured documents. Extracted knowledge ranges from domain―specific entities and named entities to the relations connecting them and can be used for indexing document collections with respect to different information types. T2K^2 also includes “linguistic profiling” functionalities aimed at supporting the user in constructing the acquisition corpus, e.g. in selecting texts belonging to the same genre or characterized by the same degree of specialization or in monitoring the “added value” of newly inserted documents. T2K^2 is a web application which can be accessed from any browser through a personal account which has been tested in a wide range of domains.

pdf bib
EMOVO Corpus: an Italian Emotional Speech Database
Giovanni Costantini | Iacopo Iaderola | Andrea Paoloni | Massimiliano Todisco

This article describes the first emotional corpus, named EMOVO, applicable to Italian language,. It is a database built from the voices of up to 6 actors who played 14 sentences simulating 6 emotional states (disgust, fear, anger, joy, surprise, sadness) plus the neutral state. These emotions are the well-known Big Six found in most of the literature related to emotional speech. The recordings were made with professional equipment in the Fondazione Ugo Bordoni laboratories. The paper also describes a subjective validation test of the corpus, based on emotion-discrimination of two sentences carried out by two different groups of 24 listeners. The test was successful because it yielded an overall recognition accuracy of 80%. It is observed that emotions less easy to recognize are joy and disgust, whereas the most easy to detect are anger, sadness and the neutral state.

pdf bib
MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic
Arfath Pasha | Mohamed Al-Badrashiny | Mona Diab | Ahmed El Kholy | Ramy Eskander | Nizar Habash | Manoj Pooleery | Owen Rambow | Ryan Roth

In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see that highlights these aspects.

pdf bib
Developing Politeness Annotated Corpus of Hindi Blogs
Ritesh Kumar

In this paper I discuss the creation and annotation of a corpus of Hindi blogs. The corpus consists of a total of over 479,000 blog posts and blog comments. It is annotated with the information about the politeness level of each blog post and blog comment. The annotation is carried out using four levels of politeness ― neutral, appropriate, polite and impolite. For the annotation, three classifiers ― were trained and tested maximum entropy (MaxEnt), Support Vector Machines (SVM) and C4.5 - using around 30,000 manually annotated texts. Among these, C4.5 gave the best accuracy. It achieved an accuracy of around 78% which is within 2% of the human accuracy during annotation. Consequently this classifier is used to annotate the rest of the corpus

pdf bib
The CMU METAL Farsi NLP Approach
Weston Feely | Mehdi Manshadi | Robert Frederking | Lori Levin

While many high-quality tools are available for analyzing major languages such as English, equivalent freely-available tools for important but lower-resourced languages such as Farsi are more difficult to acquire and integrate into a useful NLP front end. We report here on an accurate and efficient Farsi analysis front end that we have assembled, which may be useful to others who wish to work with written Farsi. The pre-existing components and resources that we incorporated include the Carnegie Mellon TurboParser and TurboTagger (Martins et al., 2010) trained on the Dadegan Treebank (Rasooli et al., 2013), the Uppsala Farsi text normalizer PrePer (Seraji, 2013), the Uppsala Farsi tokenizer (Seraji et al., 2012a), and Jon Dehdari’s PerStem (Jadidinejad et al., 2010). This set of tools (combined with additional normalization and tokenization modules that we have developed and made available) achieves a dependency parsing labeled attachment score of 89.49%, unlabeled attachment score of 92.19%, and label accuracy score of 91.38% on a held-out parsing test data set. All of the components and resources used are freely available. In addition to describing the components and resources, we also explain the rationale for our choices.

pdf bib
Recognising suicidal messages in Dutch social media
Bart Desmet | Véronique Hoste

Early detection of suicidal thoughts is an important part of effective suicide prevention. Such thoughts may be expressed online, especially by young people. This paper presents on-going work on the automatic recognition of suicidal messages in social media. We present experiments for automatically detecting relevant messages (with suicide-related content), and those containing suicide threats. A sample of 1357 texts was annotated in a corpus of 2674 blog posts and forum messages from Netlog, indicating relevance, origin, severity of suicide threat and risks as well as protective factors. For the classification experiments, Naive Bayes, SVM and KNN algorithms are combined with shallow features, i.e. bag-of-words of word, lemma and character ngrams, and post length. The best relevance classification is achieved by using SVM with post length, lemma and character ngrams, resulting in an F-score of 85.6% (78.7% precision and 93.8% recall). For the second task (threat detection), a cascaded setup which first filters out irrelevant messages with SVM and then predicts the severity with KNN, performs best: 59.2% F-score (69.5% precision and 51.6% recall).

pdf bib
A Rank-based Distance Measure to Detect Polysemy and to Determine Salient Vector-Space Features for German Prepositions
Maximilian Köper | Sabine Schulte im Walde

This paper addresses vector space models of prepositions, a notoriously ambiguous word class. We propose a rank-based distance measure to explore the vector-spatial properties of the ambiguous objects, focusing on two research tasks: (i) to distinguish polysemous from monosemous prepositions in vector space; and (ii) to determine salient vector-space features for a classification of preposition senses. The rank-based measure predicts the polysemy vs. monosemy of prepositions with a precision of up to 88%, and suggests preposition-subcategorised nouns as more salient preposition features than preposition-subcategorising verbs.

pdf bib
Expanding n-gram analytics in ELAN and a case study for sign synthesis
Rosalee Wolfe | John McDonald | Larwan Berke | Marie Stumbo

Corpus analysis is a powerful tool for signed language synthesis. A new extension to ELAN offers expanded n-gram analysis tools including improved search capabilities and an extensive library of statistical measures of association for n-grams. Uncovering and exploring coarticulatory timing effects via corpus analysis requires n-gram analysis to discover the most frequently occurring bigrams. This paper presents an overview of the new tools and a case study in American Sign Language synthesis that exploits these capabilities for computing more natural timing in generated sentences. The new extension provides a time-saving convenience for language researchers using ELAN.

pdf bib
Sentence Rephrasing for Parsing Sentences with OOV Words
Hen-Hsen Huang | Huan-Yuan Chen | Chang-Sheng Yu | Hsin-Hsi Chen | Po-Ching Lee | Chun-Hsun Chen

This paper addresses the problems of out-of-vocabulary (OOV) words, named entities in particular, in dependency parsing. The OOV words, whose word forms are unknown to the learning-based parser, in a sentence may decrease the parsing performance. To deal with this problem, we propose a sentence rephrasing approach to replace each OOV word in a sentence with a popular word of the same named entity type in the training set, so that the knowledge of the word forms can be used for parsing. The highest-frequency-based rephrasing strategy and the information-retrieval-based rephrasing strategy are explored to select the word to replace, and the Chinese Treebank 6.0 (CTB6) corpus is adopted to evaluate the feasibility of the proposed sentence rephrasing strategies. Experimental results show that rephrasing some specific types of OOV words such as Corporation, Organization, and Competition increases the parsing performances. This methodology can be applied to domain adaptation to deal with OOV problems.

pdf bib
Mapping the Lexique des Verbes du Français (Lexicon of French Verbs) to a NLP lexicon using examples
Bruno Guillaume | Karën Fort | Guy Perrier | Paul Bédaride

This article presents experiments aiming at mapping the Lexique des Verbes du Français (Lexicon of French Verbs) to FRILEX, a Natural Language Processing (NLP) lexicon based on D ICOVALENCE. The two resources (Lexicon of French Verbs and D ICOVALENCE) were built by linguists, based on very different theories, which makes a direct mapping nearly impossible. We chose to use the examples provided in one of the resource to find implicit links between the two and make them explicit.

pdf bib
Language Resources for French in the Biomedical Domain
Aurélie Névéol | Julien Grosjean | Stéfan Darmoni | Pierre Zweigenbaum

The biomedical domain offers a wealth of linguistic resources for Natural Language Processing, including terminologies and corpora. While many of these resources are prominently available for English, other languages including French benefit from substantial coverage thanks to the contribution of an active community over the past decades. However, access to terminological resources in languages other than English may not be as straight-forward as access to their English counterparts. Herein, we review the extent of resource coverage for French and give pointers to access French-language resources. We also discuss the sources and methods for making additional material available for French.

pdf bib
The MERLIN corpus: Learner language and the CEFR
Adriane Boyd | Jirka Hana | Lionel Nicolas | Detmar Meurers | Katrin Wisniewski | Andrea Abel | Karin Schöne | Barbora Štindlová | Chiara Vettori

The MERLIN corpus is a written learner corpus for Czech, German,and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains 2,290 learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that enable research into the empirical foundations of the CEFR scales and provide language teachers, test developers, and Second Language Acquisition researchers with concrete examples of learner performance and progress across multiple proficiency levels. For computational linguistics, it provide a range of authentic learner data for three target languages, supporting a broadening of the scope of research in areas such as automatic proficiency classification or native language identification. The annotated corpus and related information will be freely available as a corpus resource and through a freely accessible, didactically-oriented online platform.

pdf bib
Computer-aided morphology expansion for Old Swedish
Yvonne Adesam | Malin Ahlberg | Peter Andersson | Gerlof Bouma | Markus Forsberg | Mans Hulden

In this paper we describe and evaluate a tool for paradigm induction and lexicon extraction that has been applied to Old Swedish. The tool is semi-supervised and uses a small seed lexicon and unannotated corpora to derive full inflection tables for input lemmata. In the work presented here, the tool has been modified to deal with the rich spelling variation found in Old Swedish texts. We also present some initial experiments, which are the first steps towards creating a large-scale morphology for Old Swedish.

pdf bib
Two-Step Machine Translation with Lattices
Bushra Jawaid | Ondřej Bojar

The idea of two-step machine translation was introduced to divide the complexity of the search space into two independent steps: (1) lexical translation and reordering, and (2) conjugation and declination in the target language. In this paper, we extend the two-step machine translation structure by replacing state-of-the-art phrase-based machine translation with the hierarchical machine translation in the 1st step. We further extend the fixed string-based input format of the 2nd step with word lattices (Dyer et al., 2008); this provides the 2nd step with the opportunity to choose among a sample of possible reorderings instead of relying on the single best one as produced by the 1st step.

pdf bib
The Munich Biovoice Corpus: Effects of Physical Exercising, Heart Rate, and Skin Conductance on Human Speech Production
Björn Schuller | Felix Friedmann | Florian Eyben

We introduce a spoken language resource for the analysis of impact that physical exercising has on human speech production. In particular, the database provides heart rate and skin conductance measurement information alongside the audio recordings. It contains recordings from 19 subjects in a relaxed state and after exercising. The audio material includes breathing, sustained vowels, and read text. Further, we describe pre-extracted audio-features from our openSMILE feature extractor together with baseline performances for the recognition of high and low heart rate using these features. The baseline results clearly show the feasibility of automatic estimation of heart rate from the human voice, in particular from sustained vowels. Both regression - in order to predict the exact heart rate value - and a binary classification setting for high and low heart rate classes are investigated. Finally, we give tendencies on feature group relevance in the named contexts of heart rate estimation and skin conductivity estimation.

pdf bib
DysList: An Annotated Resource of Dyslexic Errors
Luz Rello | Ricardo Baeza-Yates | Joaquim Llisterri

We introduce a language resource for Spanish, DysList, composed of a list of unique errors extracted from a collection of texts written by people with dyslexia. Each of the errors was annotated with a set of characteristics as well as visual and phonetic features. To the best of our knowledge this is the largest resource of this kind, especially given the difficulty of finding texts written by people with dyslexia

pdf bib