Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias (Editors)

Anthology ID:
Valletta, Malta
European Language Resources Association (ELRA)
Bib Export formats:

Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Nicoletta Calzolari | Khalid Choukri | Bente Maegaard | Joseph Mariani | Jan Odijk | Stelios Piperidis | Mike Rosner | Daniel Tapias

pdf bib
Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction
Hercules Dalianis | Hao-chun Xing | Xin Zhang

This paper first describes an experiment to construct an English-Chinese parallel corpus, then applying the Uplug word alignment tool on the corpus and finally produce and evaluate an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually translated from Chinese to English. The parallel corpus contains 104 563 Chinese characters equivalent to 59 918 Chinese words, and the corresponding English corpus contains 75 766 English words. However Chinese writing does not utilize any delimiters to mark word boundaries so we had to carry out word segmentation as a preprocessing step on the Chinese corpus. Moreover since the parallel corpus is downloaded from Internet the corpus is noisy regarding to alignment between corresponding translated sentences. Therefore we used 60 hours of manually work to align the sentences in the English and Chinese parallel corpus before performing automatic word alignment using Uplug. The word alignment with Uplug was carried out from English to Chinese. Nine respondents evaluated the resulting English-Chinese word list with frequency equal to or above three and we obtained an accuracy of 73.1 percent.

pdf bib
FreeLing 2.1: Five Years of Open-source Language Processing Tools
Lluís Padró | Miquel Collado | Samuel Reese | Marina Lloberes | Irene Castellón

FreeLing is an open-source multilingual language processing library providing a wide range of language analyzers for several languages. It offers text processing and language annotation facilities to natural language processing application developers, simplifying the task of building those applications. FreeLing is customizable and extensible. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.) directly, or extend them, adapt them to specific domains, or even develop new ones for specific languages. This paper overviews the recent history of this tool, summarizes the improvements and extensions incorporated in the latest version, and depicts the architecture of the library. Special focus is brought to the fact and consequences of the library being open-source: After five years and over 35,000 downloads, a growing user community has extended the initial threelanguages (English, Spanish and Catalan) to eight (adding Galician, Italian, Welsh, Portuguese, and Asturian), proving that the collaborative open model is a productive approach for the development of NLP tools and resources.

pdf bib
A General Method for Creating a Bilingual Transliteration Dictionary
Amit Kirschenbaum | Shuly Wintner

Transliteration is the rendering in one language of terms from another language (and, possibly, another writing system), approximating spelling and/or phonetic equivalents between the two languages. A transliteration dictionary is a crucial resource for a variety of natural language applications, most notably machine translation. We describe a general method for creating bilingual transliteration dictionaries from Wikipedia article titles. The method can be applied to any language pair with Wikipedia presence, independently of the writing systems involved, and requires only a single simple resource that can be provided by any literate bilingual speaker. It was successfully applied to extract a Hebrew-English transliteration dictionary which, when incorporated in a machine translation system, indeed improved its performance.

pdf bib
Comment Extraction from Blog Posts and Its Applications to Opinion Mining
Huan-An Kao | Hsin-Hsi Chen

Blog posts containing many personal experiences or perspectives toward specific subjects are useful. Blogs allow readers to interact with bloggers by placing comments on specific blog posts. The comments carry viewpoints of readers toward the targets described in the post, or supportive/non-supportive attitude toward the post. Comment extraction is challenging due to that there does not exist a unique template among all blog service providers. This paper proposes methods to deal with this problem. Firstly, the repetitive patterns and their corresponding blocks are extracted from input posts by pattern identification algorithm. Secondly, three filtering strategies, i.e., tag pattern loop filtering, rule overlap filtering, and longest rule first, are used to remove non-comment blocks. Finally, a comment/non-comment classifier is learned to distinguish comment blocks from non-comment blocks with 14 block-level features and 5 rule-level features. In the experiments, we randomly select 600 blog posts from 12 blog service providers. F-measure, recall, and precision are 0.801, 0.855, and 0.780, respectively, by using all of the three filtering strategies together with some selected features. The application of comment extraction to blog mining is also illustrated. We show how to identify the relevant opinionated objects ― say, opinion holders, opinions, and targets, from posts.

pdf bib
FOLKER: An Annotation Tool for Efficient Transcription of Natural, Multi-party Interaction
Thomas Schmidt | Wilfried Schütte

This paper presents FOLKER, an annotation tool developed for the efficient transcription of natural, multi-party interaction in a conversation analysis framework. FOLKER is being developed at the Institute for German Language in and for the FOLK project, whose aim is the construction of a large corpus of spoken present-day German, to be used for research and teaching purposes. FOLKER builds on the experience gained with multi-purpose annotation tools like ELAN and EXMARaLDA, but attempts to improve transcription efficiency by restricting and optimizing both data model and tool functionality to a single, well-defined purpose. The tool’s most important features in this respect are the possibility to freely switch between several editable views according to the requirements of different steps in the annotation process, and an automatic syntax check of annotations during input for their conformance to the GAT transcription convention. This paper starts with a description of the GAT transcription conventions and the data model underlying the tool. It then gives an overview of the tool functionality and compares this functionality to that of other widely used tools.

pdf bib
An Annotated Dataset for Extracting Definitions and Hypernyms from the Web
Roberto Navigli | Paola Velardi | Juana Maria Ruiz-Martínez

This paper presents and analyzes an annotated corpus of definitions, created to train an algorithm for the automatic extraction of definitions and hypernyms from web documents. As an additional resource, we also include a corpus of non-definitions with syntactic patterns similar to those of definition sentences, e.g.: ""An android is a robot"" vs. ""Snowcap is unmistakable"". Domain and style independence is obtained thanks to the annotation of a large and domain-balanced corpus and to a novel pattern generalization algorithm based on word-class lattices (WCL). A lattice is a directed acyclic graph (DAG), a subclass of nondeterministic finite state automata (NFA). The lattice structure has the purpose of preserving the salient differences among distinct sequences, while eliminating redundant information. The WCL algorithm will be integrated into an improved version of the GlossExtractor Web application (Velardi et al., 2008). This paper is mostly concerned with a description of the corpus, the annotation strategy, and a linguistic analysis of the data. A summary of the WCL algorithm is also provided for the sake of completeness.

pdf bib
Studying Word Sketches for Russian
Maria Khokhlova | Victor Zakharov

Without any doubt corpora are vital tools for linguistic studies and solution for applied tasks. Although corpora opportunities are very useful, there is a need of another kind of software for further improvement of linguistic research as it is impossible to process huge amount of linguistic data manually. The Sketch Engine representing itself a corpus tool which takes as input a corpus of any language and corresponding grammar patterns. The paper describes the writing of Sketch grammar for the Russian language as a part of the Sketch Engine system. The system gives information about a word’s collocability on concrete dependency models, and generates lists of the most frequent phrases for a given word based on appropriate models. The paper deals with two different approaches to writing rules for the grammar, based on morphological information, and also with applying word sketches to the Russian language. The data evidences that such results may find an extensive use in various fields of linguistics, such as dictionary compiling, language learning and teaching, translation (including machine translation), phraseology, information retrieval etc.

pdf bib
Using Linear Interpolation and Weighted Reordering Hypotheses in the Moses System
Marta R. Costa-jussà | José A. R. Fonollosa

This paper proposes to introduce a novel reordering model in the open-source Moses toolkit. The main idea is to provide weighted reordering hypotheses to the SMT decoder. These hypotheses are built using a first-step Ngram-based SMT translation from a source language into a third representation that is called reordered source language. Each hypothesis has its own weight provided by the Ngram-based decoder. This proposed reordering technique offers a better and more efficient translation when compared to both the distance-based and the lexicalized reordering. In addition to this reordering approach, this paper describes a domain adaptation technique which is based on a linear combination of an specific in-domain and an extra out-domain translation models. Results for both approaches are reported in the Arabic-to-English 2008 IWSLT task. When implementing the weighted reordering hypotheses and the domain adaptation technique in the final translation system, translation results reach improvements up to 2.5 BLEU compared to a standard state-of-the-art Moses baseline system.

pdf bib
The Sign Linguistics Corpora Network: Towards Standards for Signed Language Resources
Onno Crasborn

The Sign Linguistics Corpora Network is a three-year network initiative that aims to collect existing knowledge and practices on the creation and use of signed language resources. The concrete goals are to organise a series of four workshops in 2009 and 2010, create a stable Internet location for such knowledge, and generate new ideas for employing the most recent technologies for the study of signed languages. The network covers a wide range of subjects: data collection, metadata, annotation, and exploitation; these are the topics of the four workshops. The outcomes of the first two workshops are summarised in this paper; both workshops demonstrated that the need for dedicated knowledge on sign language corpora is especially salient in countries where researchers work alone or in small groups, which is still quite common in many places in Europe. While the original goal of the network was primarily to focus on corpus linguistics and language documentation, human language technology has gradually been incorporated as a user group of signed language resources.

pdf bib
A Bilingual Dictionary Mexican Sign Language-Spanish/Spanish-Mexican Sign Language
Antoinette Hawayek | Riccardo Del Gratta | Giuseppe Cappelli

We present a three-part bilingual specialized dictionary Mexican Sign Language-Spanish / Spanish-Mexican Sign Language. This dictionary will be the outcome of a three-years agreement between the Italian “Consiglio Nazionale delle Ricerche” and the Mexican Conacyt. Although many other sign language dictionaries have been provided to deaf communities, there are no Mexican Sign Language dictionaries in Mexico, yet. We want to stress on the specialized feature of the proposed dictionary: the bilingual dictionary will contain frequently used general Spanish forms along with scholastic course specific specialized words whose meanings warrant comprehension of school curricula. We emphasize that this aspect of the bilingual dictionary can have a deep social impact, since we will furnish to deaf people the possibility to get competence in official language, which is necessary to ensure access to school curriculum and to become full-fledged citizens. From a technical point of view, the dictionary consists of a relational database, where we have saved the sign parameters and a graphical user interface especially designed to allow deaf children to retrieve signs using the relevant parameters and,thus, the meaning of the sign in Spanish.

pdf bib
The Web Library of Babel: evaluating genre collections
Serge Sharoff | Zhili Wu | Katja Markert

We present experiments in automatic genre classification on web corpora, comparing a wide variety of features on several different genreannotated datasets (HGC, I-EN, KI-04, KRYS-I, MGC and SANTINIS).We investigate the performance of several types of features (POS n-grams, character n-grams and word n-grams) and show that simple character n-grams perform best on current collections because of their ability to generalise both lexical and syntactic phenomena related to genres. However, we also show that these impressive results might not be transferrable to the wider web due to the lack of comparability between different annotation labels (many webpages cannot be described in terms of the genre labels in individual collections), lack of representativeness of existing collections (many genres are represented by webpages coming from a small number of sources) as well as problems in the reliability of genre annotation (many pages from the web are difficult to interpret in terms of the labels available). This suggests that more research is needed to understand genres on the Web.

pdf bib
A General Methodology for Equipping Ontologies with Time
Hans-Ulrich Krieger

In the first part of this paper, we present a framework for enriching arbitrary upper or domain-specific ontologies with a concept of time. To do so, we need the notion of a time slice. Contrary to other approaches, we directly interpret the original entities as time slices in order to (i) avoid a duplication of the original ontology and (ii) to prevent a knowledge engineer from ontology rewriting. The diachronic representation of time is complemented by a sophisticated time ontology that supports underspecification and an arbitrarily fine granularity of time. As a showcase, we describe how the time ontology has been interfaced with the PROTON upper ontology. The second part investigates a temporal extension of RDF that replaces the usual triple notation by a more general tuple representation. In this setting, Hayes/ter Horst-like entailment rules are replaced by their temporal counterparts. Our motivation to move towards this direction is twofold: firstly, extending binary relation instances with time leads to a massive proliferation of useless objects (independently of the encoding); secondly, reasoning and querying with such extended relations is extremely complex, expensive, and error-prone.

pdf bib
A Python Toolkit for Universal Transliteration
Ting Qian | Kristy Hollingshead | Su-youn Yoon | Kyoung-young Kim | Richard Sproat

We describe ScriptTranscriber, an open source toolkit for extracting transliterations in comparable corpora from languages written in different scripts. The system includes various methods for extracting potential terms of interest from raw text, for providing guesses on the pronunciations of terms, and for comparing two strings as possible transliterations using both phonetic and temporal measures. The system works with any script in the Unicode Basic Multilingual Plane and is easily extended to include new modules. Given comparable corpora, such as newswire text, in a pair of languages that use different scripts, ScriptTranscriber provides an easy way to mine transliterations from the comparable texts. This is particularly useful for underresourced languages, where training data for transliteration may be lacking, and where it is thus hard to train good transliterators. ScriptTranscriber provides an open source package that allows for ready incorporation of more sophisticated modules ― e.g. a trained transliteration model for a particular language pair. ScriptTranscriber is available as part of the nltk contrib source tree at

pdf bib
Test Suite Design for Biomedical Ontology Concept Recognition Systems
K. Bretonnel Cohen | Christophe Roeder | William A. Baumgartner Jr. | Lawrence E. Hunter | Karin Verspoor

Systems that locate mentions of concepts from ontologies in free text are known as ontology concept recognition systems. This paper describes an approach to the evaluation of the workings of ontology concept recognition systems through use of a structured test suite and presents a publicly available test suite for this purpose. It is built using the principles of descriptive linguistic fieldwork and of software testing. More broadly, we also seek to investigate what general principles might inform the construction of such test suites. The test suite was found to be effective in identifying performance errors in an ontology concept recognition system. The system could not recognize 2.1% of all canonical forms and no non-canonical forms at all. Regarding the question of general principles of test suite construction, we compared this test suite to a named entity recognition test suite constructor. We found that they had twenty features in total and that seven were shared between the two models, suggesting that there is a core of feature types that may be applicable to test suite construction for any similar type of application.

pdf bib
Construction of a Benchmark Data Set for Cross-lingual Word Sense Disambiguation
Els Lefever | Véronique Hoste

Given the recent trend to evaluate the performance of word sense disambiguation systems in a more application-oriented set-up, we report on the construction of a multilingual benchmark data set for cross-lingual word sense disambiguation. The data set was created for a lexical sample of 25 English nouns, for which translations were retrieved in 5 languages, namely Dutch, German, French, Italian and Spanish. The corpus underlying the sense inventory was the parallel data set Europarl. The gold standard sense inventory was based on the automatic word alignments of the parallel corpus, which were manually verified. The resulting word alignments were used to perform a manual clustering of the translations over all languages in the parallel corpus. The inventory then served as input for the annotators of the sentences, who were asked to provide a maximum of three contextually relevant translations per language for a given focus word. The data set was released in the framework of the SemEval-2010 competition.

pdf bib
Corpus and Evaluation Measures for Automatic Plagiarism Detection
Alberto Barrón-Cedeño | Martin Potthast | Paolo Rosso | Benno Stein

The simple access to texts on digital libraries and the World Wide Web has led to an increased number of plagiarism cases in recent years, which renders manual plagiarism detection infeasible at large. Various methods for automatic plagiarism detection have been developed whose objective is to assist human experts in the analysis of documents for plagiarism. The methods can be divided into two main approaches: intrinsic and external. Unlike other tasks in natural language processing and information retrieval, it is not possible to publish a collection of real plagiarism cases for evaluation purposes since they cannot be properly anonymized. Therefore, current evaluations found in the literature are incomparable and, very often not even reproducible. Our contribution in this respect is a newly developed large-scale corpus of artificial plagiarism useful for the evaluation of intrinsic as well as external plagiarism detection. Additionally, new detection performance measures tailored to the evaluation of plagiarism detection algorithms are proposed.

pdf bib
An Evolving eScience Environment for Research Data in Linguistics
Claus Zinn | Peter Wittenburg | Jacquelijn Ringersma

The amount of research data in the Humanities is increasing at fast speed. Metadata helps describing and making accessible this data to interested researchers within and across institutions. While metadata interoperability is an issue that is being recognised and addressed, the systematic and user-driven provision of annotations and the linking together of resources into new organisational layers have received much less attention. This paper gives an overview of our evolving technological eScience environment to support such functionality. It describes two tools, ADDIT and ViCoS, which enable researchers, rather than archive managers, to organise and reorganise research data to fit their particular needs. The two tools, which are embedded into our institute's existing software landscape, are an initial step towards an eScience environment that gives our scientists easy access to (multimodal) research data of their interest, and empowers them to structure, enrich, link together, and share such data as they wish.

pdf bib
Classifying Action Items for Semantic Email
Simon Scerri | Gerhard Gossen | Brian Davis | Siegfried Handschuh

Email can be considered as a virtual working environment in which users are constantly struggling to manage the vast amount of exchanged data. Although most of this data belongs to well-defined workflows, these are implicit and largely unsupported by existing email clients. Semanta provides this support by enabling Semantic Email ― email enhanced with machine-processable metadata about specific types of email Action Items (e.g. Task Assignment, Meeting Proposal). In the larger picture, these items form part of ad-hoc workflows (e.g. Task Delegation, Meeting Scheduling). Semanta is faced with a knowledge-acquisition bottleneck, as users cannot be expected to annotate each action item, and their automatic recognition proves difficult. This paper focuses on applying computationally treatable aspects of speech act theory for the classification of email action items. A rule-based classification model is employed, based on the presence or form of a number of linguistic features. The technology’s evaluation suggests that whereas full automation is not feasible, the results are good enough to be presented as suggestions for the user to review. In addition the rule-based system will bootstrap a machine learning system that is currently in development, to generate the initial training sets which are then improved through the user’s reviewing.

pdf bib
Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content
Yulia Tsvetkov | Shuly Wintner

Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.

pdf bib
United we Stand: Improving Sentiment Analysis by Joining Machine Learning and Rule Based Methods
Vassiliki Rentoumi | Stefanos Petrakis | Manfred Klenner | George A. Vouros | Vangelis Karkaletsis

In the past, we have succesfully used machine learning approaches for sentiment analysis. In the course of those experiments, we observed that our machine learning method, although able to cope well with figurative language could not always reach a certain decision about the polarity orientation of sentences, yielding erroneous evaluations. We support the conjecture that these cases bearing mild figurativeness could be better handled by a rule-based system. These two systems, acting complementarily, could bridge the gap between machine learning and rule-based approaches. Experimental results using the corpus of the Affective Text Task of SemEval ’07, provide evidence in favor of this direction.

pdf bib
Handling of Missing Values in Lexical Acquisition
Núria Bel

In this work we propose a strategy to reduce the impact of the sparse data problem in the tasks of lexical information acquisition based on the observation of linguistic cues. We propose a way to handle the uncertainty created by missing values, that is, when a zero value could mean either that the cue has not been observed because the word in question does not belong to the class, i.e. negative evidence, or that the word in question has just not been observed in the context sought by chance, i.e. lack of evidence. This uncertainty creates problems to the learner, because zero values for incompatible labelled examples make the cue lose its predictive capacity and even though some samples display the sought context, it is not taken into account. In this paper we present the results of our experiments to try to reduce this uncertainty by, as other authors do (Joanis et al. 2007, for instance), substituting zero values for pre-processed estimates. Here we present a first round of experiments that have been the basis for the estimates of linguistic information motivated by lexical classes. We obtained experimental results that show a clear benefit of the proposed approach.

pdf bib
Influence of Module Order on Rule-Based De-identification of Personal Names in Electronic Patient Records Written in Swedish
Elin Carlsson | Hercules Dalianis

Electronic patient records (EPRs) are a valuable resource for research but for confidentiality reasons they cannot be used freely. In order to make EPRs available to a wider group of researchers, sensitive information such as personal names has to be removed. De-identification is a process that makes this possible. Both rule-based as well as statistical and machine learning based methods exist to perform de-identification, but the second method requires annotated training material which exists only very sparsely for patient names. It is therefore necessary to use rule-based methods for de-identification of EPRs. Not much is known, however, about the order in which the various rules should be applied and how the different rules influence precision and recall. This paper aims to answer this research question by implementing and evaluating four common rules for de-identification of personal names in EPRs written in Swedish: (1) dictionary name matching, (2) title matching, (3) common words filtering and (4) learning from previous modules. The results show that to obtain the highest recall and precision, the rules should be applied in the following order: title matching, common words filtering and dictionary name matching.

pdf bib
Automatic and Human Evaluation Study of a Rule-based and a Statistical Catalan-Spanish Machine Translation Systems
Marta R. Costa-jussà | Mireia Farrús | José B. Mariño | José A. R. Fonollosa

Machine translation systems can be classified into rule-based and corpus-based approaches, in terms of their core technology. Since both paradigms have largely been used during the last years, one of the aims in the research community is to know how these systems differ in terms of translation quality. To this end, this paper reports a study and comparison of a rule-based and a corpus-based (particularly, statistical) Catalan-Spanish machine translation systems, both of them freely available in the web. The translation quality analysis is performed under two different domains: journalistic and medical. The systems are evaluated by using standard automatic measures, as well as by native human evaluators. Automatic results show that the statistical system performs better than the rule-based system. Human judgements show that in the Spanish-to-Catalan direction the statistical system also performs better than the rule-based system, while in the Catalan-to-Spanish direction is the other way round. Although the statistical system obtains the best automatic scores, its errors tend to be more penalized by human judgements than the errors of the rule-based system. This can be explained because statistical errors are usually unexpected and they do not follow any pattern.

pdf bib
An Integrated Digital Tool for Accessing Language Resources
Anil Kumar Singh | Bharat Ram Ambati

Language resources can be classified under several categories. To be able to query and operate on all (or most of) these categories using a single digital tool would be very helpful for a large number of researchers working on languages. We describe such a tool in this paper. It is different from other such tools in that it allows querying and transformation on different kinds of resources (such as corpora, lexicon and language models) with the same framework. Search options can be given based on the kind of resource being queried. It is possible to select a matched resource and open it for editing in the specialized interfaces with which that resource is associated. The tool also allows the extracted or modified data to be saved separately, apart from having the usual facilities like displaying the results in KeyWord-In-Context (KWIC) format. We also present the notation used for querying and transformation, which is comparable to but different from the Corpus Query Language (CQL).

pdf bib
A Speech Corpus for Dyslexic Reading Training
Jakob Schou Pedersen | Lars Bo Larsen

Traditional Danish reading training for dyslexic readers typically involves the presence of a professional reading therapist for guidance, advice and evaluation. Allowing dyslexic readers to train their reading skills on their own could not only benefit the dyslexics themselves in terms of increased flexibility but could also allow professional therapists to increase the amount of dyslexic readers to whom they have a professional contact. It is envisioned that an automated reading training tool operating on the basis of ASR could provide dyslexic users with such independence. However, only limited experience in handling dyslexic input (in Danish) by a speech recognizer exists currently. This paper reports on the establishment of a speech corpus of Danish dyslexic speech along with an annotation hereof and the setup of a proof-of-concept training tool allowing dyslexic users to improve their reading skills on their own. Despite relatively limited ASR performance, a usability evaluation by dyslexic users shows an unconditional belief in the fairness of the system and indicates furthermore willingness for using such a training tool.

pdf bib
Arabic Word Segmentation for Better Unit of Analysis
Yassine Benajiba | Imed Zitouni

The Arabic language has a very rich morphology where a word is composed of zero or more prefixes, a stem and zero or more suffixes. This makes Arabic data sparse compared to other languages, such as English, and consequently word segmentation becomes very important for many Natural Language Processing tasks that deal with the Arabic language. We present in this paper two segmentation schemes that are morphological segmentation and Arabic TreeBank segmentation and we show their impact on an important natural language processing task that is mention detection. Experiments on Arabic TreeBank corpus show 98.1% accuracy on morphological segmentation and 99.4% on morphological segmentation. We also discuss the importance of segmenting the text; experiments show up to 6F points improvement of the mention detection system performance when morphological segmentation is used instead of not segmenting the text. Obtained results also show up to 3F points improvement is achieved when the appropriate segmentation style is used.

pdf bib
ISO-TimeML: An International Standard for Semantic Annotation
James Pustejovsky | Kiyong Lee | Harry Bunt | Laurent Romary

In this paper, we present ISO-TimeML, a revised and interoperable version of the temporal markup language, TimeML. We describe the changes and enrichments made, while framing the effort in a more general methodology of semantic annotation. In particular, we assume a principled distinction between the annotation of an expression and the representation which that annotation denotes. This involves not only the specification of an annotation language for a particular phenomenon, but also the development of a meta-model that allows one to interpret the syntactic expressions of the specification semantically.

pdf bib
GIS Application Improvement with Multilingual Lexical and Terminological Resources
Ranka Stanković | Ivan Obradović | Olivera Kitanović

This paper introduces the results of integration of lexical and terminological resources, most of them developed within the Human Language Technology (HLT) Group at the University of Belgrade, with the Geological information system of Serbia (GeolISS) developed at the Faculty of Mining and Geology and funded by the Ministry of the Environmental protection. The approach to GeolISS development, which is aimed at the integration of existing geologic archives, data from published maps on different scales, newly acquired field data, and intranet and internet publishing of geologic is given, followed by the description of the geologic multilingual vocabulary and other lexical and terminological resources used. Two basic results are outlined: multilingual map annotation and improvement of queries for the GeolISS geodatabase. Multilingual labelling and annotation of maps for their graphic display and printing have been tested with Serbian, which describes regional information in the local language, and English, used for sharing geographic information with the world, although the geological vocabulary offers the possibility for integration of other languages as well. The resources also enable semantic and morphological expansion of queries, the latter being very important in highly inflective languages, such as Serbian.

pdf bib
A Database of Narrative Schemas
Nathanael Chambers | Dan Jurafsky

This paper describes a new language resource of events and semantic roles that characterize real-world situations. Narrative schemas contain sets of related events (edit and publish), a temporal ordering of the events (edit before publish), and the semantic roles of the participants (authors publish books). This type of world knowledge was central to early research in natural language understanding, scripts being one of the main formalisms, they represented common sequences of events that occur in the world. Unfortunately, most of this knowledge was hand-coded and time consuming to create. Current machine learning techniques, as well as a new approach to learning through coreference chains, has allowed us to automatically extract rich event structure from open domain text in the form of narrative schemas. The narrative schema resource described in this paper contains approximately 5000 unique events combined into schemas of varying sizes. We describe the resource, how it is learned, and a new evaluation of the coverage of these schemas over unseen documents.

pdf bib
A Swedish Scientific Medical Corpus for Terminology Management and Linguistic Exploration
Dimitrios Kokkinakis | Ulla Gerdin

This paper describes the development of a new Swedish scientific medical corpus. We provide a detailed description of the characteristics of this new collection as well results of an application of the corpus on term management tasks, including terminology validation and terminology extraction. Although the corpus is representative for the scientific medical domain it still covers in detail a lot of specialised sub-disciplines such as diabetes and osteoporosis which makes it suitable for facilitating the production of smaller but more focused sub-corpora. We address this issue by making explicit some features of the corpus in order to demonstrate the usability of the corpus particularly for the quality assessment of subsets of official terminologies such as the Systematized NOmenclature of MEDicine - Clinical Terms (SNOMED CT). Domain-dependent language resources, labelled or not, are a crucial key components for progressing R&D in the human language technology field since such resources are an indispensable, integrated part for terminology management, evaluation, software prototyping and design validation and a prerequisite for the development and evaluation of a number of sublanguage dependent applications including information extraction, text mining and information retrieval.

pdf bib
Using High-Quality Resources in NLP: The Valency Dictionary of English as a Resource for Left-Associative Grammars
Thomas Proisl | Besim Kabashi

In Natural Language Processing (NLP), the quality of a system depends to a great extent on the quality of the linguistic resources it uses. One area where precise information is particularly needed is valency. The unpredictable character of valency properties requires a reliable source of information for syntactic and semantic analysis. There are several (electronic) dictionaries that provide the necessary information. One such dictionary that contains especially detailed valency descriptions is the Valency Dictionary of English. We will discuss how the Valency Dictionary of English in machine-readable form can be used as a resource for NLP. We will use valency descriptions that are freely available online via the Erlangen Valency Pattern Bank which contains most of the information from the printed dictionary. We will show that the valency data can be used for accurately parsing natural language with a rule-based approach by integrating it into a Left-Associative Grammar. The Valency Dictionary of English can therefore be regarded as being well suited for NLP purposes.

pdf bib
Dictionary and Monolingual Corpus-based Query Translation for Basque-English CLIR
Xabier Saralegi | Maddalen Lopez de Lacalle

This paper deals with the main problems that arise in the query translation process in dictionary-based Cross-lingual Information Retrieval (CLIR): translation selection, presence of Out-Of-Vocabulary (OOV) terms and translation of Multi-Word Expressions (MWE). We analyse to what extent each problem affects the retrieval performance for the Basque-English pair of languages, and the improvement obtained when using parallel corpora free methods to address them. To tackle the translation selection problem we provide novel extensions of an already existing monolingual target co-occurrence-based method, the Out-Of Vocabulary terms are dealt with by means of a cognate detection-based method and finally, for the Multi-Word Expression translation problem, a naïve matching technique is applied. The error analysis shows significant differences in the deterioration of the performance depending on the problem, in terms of Mean Average Precision (MAP), the translation selection problem being the cause of most of the errors. Otherwise, the proposed combined strategy shows a good performance to tackle the three above-mentioned main problems.

pdf bib
FastKwic, an “Intelligent“ Concordancer Using FASTR
Véronika Lux-Pogodalla | Dominique Besagni | Karën Fort

In this paper, we introduce the FastKwic (Key Word In Context using FASTR), a new concordancer for French and English that does not require users to learn any particular request language. Built on FASTR, it shows them not only occurrences of the searched term but also of several morphological, morpho-syntactic and syntactic variants (for example, image enhancement, enhancement of image, enhancement of fingerprint image, image texture enhancement). Fastkwic is freely available. It consists of two UTF-8 compliant Perl modules that depend on several external tools and resources : FASTR, TreeTagger, Flemm (for French). Licenses of theses tools and resources permitting, the FastKwic package is nevertheless self-sufficient. FastKwic first modules is for terminological resource compilation. Its input is a list of terms - as required by FASTR. FastKwic second module is for processing concordances. It relies on FASTR again for indexing the input corpus with terms and their variants. Its output is a concordancer: for each term and its variants, the context of occurrence is provided.

pdf bib
A Description of Morphological Features of Serbian: a Revision using Feature System Declaration
Cvetana Krstev | Ranka Stanković | Duško Vitas

In this paper we discuss some well-known morphological descriptions used in various projects and applications (most notably MULTEXT-East and Unitex) and illustrate the encountered problems on Serbian. We have spotted four groups of problems: the lack of a value for an existing category, the lack of a category, the interdependence of values and categories lacking some description, and the lack of a support for some types of categories. At the same time, various descriptions often describe exactly the same morphological property using different approaches. We propose a new morphological description for Serbian following the feature structure representation defined by the ISO standard. In this description we try do incorporate all characteristics of Serbian that need to be specified for various applications. We have developed several XSLT scripts that transform our description into descriptions needed for various applications. We have developed the first version of this new description, but we treat it as an ongoing project because for some properties we have not yet found the satisfactory solution.

pdf bib
Determining Reliability of Subjective and Multi-label Emotion Annotation through Novel Fuzzy Agreement Measure
Plaban Kr. Bhowmick | Anupam Basu | Pabitra Mitra

The paper presents a new fuzzy agreement measure $\gamma_f$ for determining the agreement in multi-label and subjective annotation task. In this annotation framework, one data item may belong to a category or a class with a belief value denoting the degree of confidence of an annotator in assigning the data item to that category. We have provided a notion of disagreement based on the belief values provided by the annotators with respect to a category. The fuzzy agreement measure $\gamma_f$ has been proposed by defining different fuzzy agreement sets based on the distribution of difference of belief values provided by the annotators. The fuzzy agreement has been computed by studying the average agreement over all the data items and annotators. Finally, we elaborate on the computation $\gamma_f$ measure with a case study on emotion text data where a data item (sentence) may belong to more than one emotion category with varying belief values.

pdf bib
FIDJI: Web Question-Answering at Quaero 2009
Xavier Tannier | Véronique Moriceau

This paper presents the participation of FIDJI system to the Web Question-Answering evaluation campaign organized by Quaero in 2009. FIDJI is an open-domain question-answering system which combines syntactic information with traditional QA techniques such as named entity recognition and term weighting in order to validate answers through multiple documents. It was originally designed to process ``clean'' document collections. Overall results are significantly lower than in traditional campaigns but results (for French evaluation) are quite good compared to other state-of-the-art systems. They show that a syntax-based strategy, applied on uncleaned Web data, can still obtain good results. Moreover, we obtain much higher scores on ``complex'' questions, i.e. `how' and `why' questions, which are more representative of real user needs. These results show that questioning the Web with advanced linguistic techniques can be done without heavy pre-processing and with results that come near to best systems that use strong resources and large structured indexes.

pdf bib
A Tool for Feature-Structure Stand-Off-Annotation on Transcriptions of Spoken Discourse
Kai Wörner

Annotation Science, a discipline dedicated to developing and maturing methodology for the annotation of language resources, is playing a prominent role in the fields of computational and corpus linguistics. While progress in the search for the right annotation model and format is undeniable, these results only sparsely become manifest in actual solutions (i.e. software tools) that could be used by researchers wishing to annotate their resources right away, even less so for resources of spoken language transcriptions. The paper presents a solution consisting of a data model and an annotation tool that tries to fill this gap between „annotation science“ and the practice of transcribing spoken language in the area of discourse analysis and pragmatics, where the lack of ready-to-use annotation solutions is especially remarkable. The chosen model combines feature structures in standoff-annotation and a data model based on annotation graphs, combining their advantages. It is ideally fitted for the transcription of spoken language by centering on the temporal relations of the speaker’s utterances and is implemented in reliable tools that support an iterative workflow. The standoff annotation allows for more complex annotations and relies on an established and well documented model.

pdf bib
The CLARIN-NL Project
Jan Odijk

In this paper I present the CLARIN-NL project, the Dutch national project that aims to play a central role in the European CLARIN infrastructure, not only for the preparatory phase, but also for the implementation and exploitation phases. I argue that the way the CLARIN-NL project has been set-up can serve as an excellent example for other national CLARIN projects, for the following reasons: (1) it is a mix between a programme and a project; (2) it offers opportunities to seriously test standards and protocols currently proposed by CLARIN, thus providing evidence-based requirements and desiderata for the CLARIN infrastructure and ensuring compatibility of CLARIN with national data and tools; (3) it brings the intended users (humanities researchers) and the technology providers (infrastructure specialists and language and speech technology researchers) together in concrete cooperation projects, with a central role for the user’s research questions,, thus ensuring that the infrastructure will provide functionality that is needed by its intended users.

pdf bib
Language Resource Management System for Asian WordNet Collaboration and Its Web Service Application
Virach Sornlertlamvanich | Thatsanee Charoenporn | Hitoshi Isahara

This paper presents the language resource management system for the development and dissemination of Asian WordNet (AWN) and its web service application. We develop the platform to establish a network for the cross language WordNet development. Each node of the network is designed for maintaining the WordNet for a language. Via the table that maps between each language WordNet and the Princeton WordNet (PWN), the Asian WordNet is realized to visualize the cross language WordNet between the Asian languages. We propose a language resource management system, called WordNet Management System (WNMS), as a distributed management system that allows the server to perform the cross language WordNet retrieval, including the fundamental web service applications for editing, visualizing and language processing. The WNMS is implemented on a web service protocol therefore each node can be independently maintained, and the service of each language WordNet can be called directly through the web service API. In case of cross language implementation, the synset ID (or synset offset) defined by PWN is used to determined the linkage between the languages.

pdf bib
Propbank Frameset Annotation Guidelines Using a Dedicated Editor, Cornerstone
Jinho D. Choi | Claire Bonial | Martha Palmer

This paper gives guidelines of how to create and update Propbank frameset files using a dedicated editor, Cornerstone. Propbank is a corpus in which the arguments of each verb predicate are annotated with their semantic roles in relation to the predicate. Propbank annotation also requires the choice of a sense ID for each predicate. Thus, for each predicate in Propbank, there exists a corresponding frameset file showing the expected predicate argument structure of each sense related to the predicate. Since most Propbank annotations are based on the predicate argument structure defined in the frameset files, it is important to keep the files consistent, simple to read as well as easy to update. The frameset files are written in XML, which can be difficult to edit when using a simple text editor. Therefore, it is helpful to develop a user-friendly editor such as Cornerstone, specifically customized to create and edit frameset files. Cornerstone runs platform independently, is light enough to run as an X11 application and supports multiple languages such as Arabic, Chinese, English, Hindi and Korean.

pdf bib
A Multilayered Declarative Approach to Cope with Morphotactics and Allomorphy in Derivational Morphology
Johannes Handl | Carsten Weber

This paper deals with the derivational morphology of automatic word form recognition. It presents a set of declarative rules which augment lexical entries with information governing the allomorphic changes of derivation in addition to the existing allomorphy rules for inflection. The resulting component generates a single lexicon for derivational and inflectional allomorphy from an elementary base-form lexicon. Thereby our focus lies both on avoiding redundant allomorph entries and on the suitability of the resulting lexical entries for morphological analysis. We prove the usability of our approach by using the generated allomorphs as the lexicon for automatic wordform recognition.

pdf bib
Bank of Russian Constructions and Valencies
Olga Lyashevskaya

The Bank of Russian Constructions and Valencies (Russian FrameBank) is an annotation project that takes as input samples from the Russian National Corpus ( Since Russian verbs and predicates from other POS classes have their particular and not always predictable case pattern, these words and their argument structures are to be described as lexical constructions. The slots of partially filled phrasal constructions (e.g. vzjal i uexal ‘he suddenly (lit. took and) went away’) are also under analysis. Thus, the notion of construction is understood in the sense of Fillmore’s Construction Grammar and is not limited to that of argument structure of verbs. FrameBank brings together the dictionary of constructions and the annotated collection of examples. Our goal is to mark the set of arguments and adjuncts of a certain construction. The main focus is on realization of the elements in the running text, to facilitate searches through pattern realizations by a certain combination of features. The relevant dataset involves lexical, POS and other morphosyntactic tags, semantic classes, as well as grammatical constructions that introduce or license the use of elements within a given construction.

pdf bib
Automatic Discovery of Semantic Relations using MindNet
Zareen Syed | Evelyne Viegas | Savas Parastatidis

Information extraction deals with extracting entities (such as people, organizations or locations) and named relations between entities (such as ""People born-in Country"") from text documents. An important challenge in information extraction is the labeling of training data which is usually done manually and is therefore very laborious and in certain cases impractical. This paper introduces a new “model” to extract semantic relations fully automatically from text using the Encarta encyclopedia and lexical-semantic relations discovered by MindNet. MindNet is a lexical knowledge base that can be constructed fully automatically from a given text corpus without any human intervention. Encarta articles are categorized and linked to related articles by experts. We demonstrate how the structured data available in Encarta and the lexical semantic relations between words in MindNet can be used to enrich MindNet with semantic relations between entities. With a slight trade off of accuracy a semantically enriched MindNet can be used to extract relations from a text corpus without any human intervention.

pdf bib
A Corpus Factory for Many Languages
Adam Kilgarriff | Siva Reddy | Jan Pomikálek | Avinesh PVS

For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of 'seed words' for the language from Wikipedia. Then, several hundred times over, we * randomly select three or four of the seed words * send as a query to Google or Yahoo or Bing, which returns a 'search hits' page * gather the pages that Google or Yahoo point to and save the text. This forms the corpus, which we then * 'clean' (to remove navigation bars, advertisements etc) * remove duplicates * tokenise and (if tools are available) lemmatise and part-of-speech tag * load into our corpus query tool, the Sketch Engine The corpora we have developed are available for use in the Sketch Engine corpus query tool.

pdf bib
A Flexible Representation of Heterogeneous Annotation Data
Richard Johansson | Alessandro Moschitti

This paper describes a new flexible representation for the annotation of complex structures of metadata over heterogeneous data collections containing text and other types of media such as images or audio files. We argue that existing frameworks are not suitable for this purpose, most importantly because they do not easily generalize to multi-document and multimodal corpora, and because they often require the use of particular software frameworks. In the paper, we define a data model to represent such structured data over multimodal collections. Furthermore, we define a surface realization of the data structure as a simple and readable XML format. We present two examples of annotation tasks to illustrate how the representation and format work for complex structures involving multimodal annotation and cross-document links. The representation described here has been used in a large-scale project focusing on the annotation of a wide range of information ― from low-level features to high-level semantics ― in a multimodal data collection containing both text and images.

pdf bib
Hybrid Citation Extraction from Patents
Olivier Galibert | Sophie Rosset | Xavier Tannier | Fanny Grandry

The Quaero project organized a set of evaluations of Named Entity recognition systems in 2009. One of the sub-tasks consists in extracting citations from patents, i.e. references to other documents, either other patents or general literature from English-language patents. We present in this paper the participation of LIMSI in this evaluation, with a complete system description and the evaluation results. The corpus shown that patent and non-patent citations have a very different nature. We then separated references to other patents and to general literature papers and we created a hybrid system. For patent citations, the system used rule-based expert knowledge on the form of regular expressions. The system for detecting non-patent citations, on the other hand, is purely stochastic (machine learning with CRF++). Then we mixed both approaches to provide a single output. 4 teams participated to this task and our system obtained the best results of this evaluation campaign, even if the difference between the first two systems is poorly significant.

pdf bib
The Impact of Grammar Enhancement on Semantic Resources Induction
Luca Dini | Giampaolo Mazzini

In this paper describes the effects of the evolution of an Italian dependency grammar on a task of multilingual FrameNet acquisition. The task is based on the creation of virtual English/Italian parallel annotation corpora, which are then aligned at dependency level by using two manually encoded grammar based dependency parsers. We show how the evolution of the LAS (Labeled Attachment Score) metric for the considered grammar has a direct impact on the quality of the induced FrameNet, thus proving that the evolution of the quality of syntactic resources is mirrored by an analogous evolution in semantic ones. In particular we show that an improvement of 30% in LAS causes an improvement of precision for the induced resource ranging from 5% to 10%, depending on the type of evaluation.

pdf bib
Adapting Chinese Word Segmentation for Machine Translation Based on Short Units
Yiou Wang | Kiyotaka Uchimoto | Jun’ichi Kazama | Canasai Kruengkrai | Kentaro Torisawa

In Chinese texts, words composed of single or multiple characters are not separated by spaces, unlike most western languages. Therefore Chinese word segmentation is considered an important first step in machine translation (MT) and its performance impacts MT results. Many factors affect Chinese word segmentations, including the segmentation standards and segmentation strategies. The performance of a corpus-based word segmentation model depends heavily on the quality and the segmentation standard of the training corpora. However, we observed that existing manually annotated Chinese corpora tend to have low segmentation granularity and provide poor morphological information due to the present segmentation standards. In this paper, we introduce a short-unit standard of Chinese word segmentation, which is particularly suitable for machine translation, and propose a semi-automatic method of transforming the existing corpora into the ones that can satisfy our standards. We evaluate the usefulness of our approach on the basis of translation tasks from the technology newswire domain and the scientific paper domain, and demonstrate that it significantly improves the performance of Chinese-Japanese machine translation (over 1.0 BLEU increase).

pdf bib
Data-Driven and Ontological Analysis of FrameNet for Natural Language Reasoning
Ekaterina Ovchinnikova | Laure Vieu | Alessandro Oltramari | Stefano Borgo | Theodore Alexandrov

This paper focuses on the improvement of the conceptual structure of FrameNet (FN) for the sake of applying this resource to knowledge-intensive NLP tasks requiring reasoning, such as question answering, information extraction etc. In this paper we show that in addition to coverage incompleteness, the current version of FN suffers from conceptual inconsistency and lacks axiomatization which can prevent appropriate inferences. For the sake of discovering and classifying conceptual problems in FN we investigate the FrameNet-Annotated corpus for Textual Entailment. Then we propose a methodology for improving the conceptual organization of FN. The main issue we focus on in our study is enriching, axiomatizing and cleaning up frame relations. Our methodology includes a data-driven analysis of frames resulting in discovering new frame relations and an ontological analysis of frames and frame relations resulting in axiomatizing relations and formulating constraints on them. In this paper, frames and frame relations are analyzed in terms of the DOLCE formal ontology. Additionally, we have described a case study aiming at demonstrating how the proposed methodology works in practice as well as investigating the impact of the restructured and axiomatized frame relations on recognizing textual entailment.

pdf bib
MPC: A Multi-Party Chat Corpus for Modeling Social Phenomena in Discourse
Samira Shaikh | Tomek Strzalkowski | Aaron Broadwell | Jennifer Stromer-Galley | Sarah Taylor | Nick Webb

In this paper, we describe our experience with collecting and creating an annotated corpus of multi-party online conversations in a chat-room environment. This effort is part of a larger project to develop computational models of social phenomena such as agenda control, influence, and leadership in on-line interactions. Such models will help capturing the dialogue dynamics that are essential for developing, among others, realistic human-machine dialogue systems, including autonomous virtual chat agents. In this paper we describe data collection method used and the characteristics of the initial dataset of English chat. We have devised a multi-tiered collection process in which the subjects start from simple, free-flowing conversations and progress towards more complex and structured interactions. In this paper, we report on the first two stages of this process, which were recently completed. The third, large-scale collection effort is currently being conducted. All English dialogue has been annotated at four levels: communication links, dialogue acts, local topics and meso-topics. Some details of these annotations will be discussed later in this paper, although a full description is impossible within the scope of this article.

pdf bib
Mining the Correlation between Human and Automatic Evaluation at Sentence Level
Yanli Sun

Automatic evaluation metrics are fast and cost-effective measurements of the quality of a Machine Translation (MT) system. However, as humans are the end-user of MT output, human judgement is the benchmark to assess the usefulness of automatic evaluation metrics. While most studies report the correlation between human evaluation and automatic evaluation at corpus level, our study examines their correlation at sentence level. In addition to the statistical correlation scores, such as Spearman's rank-order correlation coefficient, a finer-grained and detailed examination of the sensitivity of automatic metrics compared to human evaluation is also reported in this study. The results show that the threshold for human evaluators to agree with the judgements of automatic metrics varies with the automatic metrics at sentence level. While the automatic scores for two translations are greatly different, human evaluators may consider the translations to be qualitatively similar and vice versa. The detailed analysis of the correlation between automatic and human evaluation allows us determine with increased confidence whether an increase in the automatic scores will be agreed by human evaluators or not.

pdf bib
Processing and Extracting Data from Dicionário Aberto
Alberto Simões | José João Almeida | Rita Farinha

Synonyms dictionaries are useful resources for natural language processing. Unfortunately their availability in digital format is limited, as publishing companies do not release their dictionaries in open digital formats. Dicionário-Aberto (Simões and Farinha, 2010) is an open and free digital synonyms dictionary for the Portuguese language. It is under public domain and in textual digital format, which makes it usable for any task. Synonyms dictionaries are commonly used for the extraction of relations between words, the construction of complex structures like ontologies or thesaurus (comparable to WordNet (Miller et al., 1990)), or just the extraction of lists of words of specific type. This article will present Dicionário-Aberto, discussing how it was created, its main characteristics, the type of information present on it and the formats in which it is available. Follows the description of an API designed specifically to help Dicionário-Aberto processing without the need to tackle with the dictionary format. Finally, we will analyze the results on some data extraction experiments, extracting lists of words from a specific class, and extracting relationships between words.

pdf bib
GermanPolarityClues: A Lexical Resource for German Sentiment Analysis
Ulli Waltinger

In this paper, we propose GermanPolarityClues, a new publicly available lexical resource for sentiment analysis for the German language. While sentiment analysis and polarity classification has been extensively studied at different document levels (e.g. sentences and phrases), only a few approaches explored the effect of a polarity-based feature selection and subjectivity resources for the German language. This paper evaluates four different English and three different German sentiment resources in a comparative manner by combining a polarity-based feature selection with SVM-based machine learning classifier. Using a semi-automatic translation approach, we were able to construct three different resources for a German sentiment analysis. The manually finalized GermanPolarityClues dictionary offers thereby a number of 10, 141 polarity features, associated to three numerical polarity scores, determining the positive, negative and neutral direction of specific term features. While the results show that the size of dictionaries clearly correlate to polarity-based feature coverage, this property does not correlate to classification accuracy. Using a polarity-based feature selection, considering a minimum amount of prior polarity features, in combination with SVM-based machine learning methods exhibits for both languages the best performance (F1: 0.83-0.88).

pdf bib
Ontology-based Interoperation of Linguistic Tools for an Improved Lemma Annotation in Spanish
Antonio Pareja-Lora | Guadalupe Aguado de Cea

In this paper, we present an ontology-based methodology and architecture for the comparison, assessment, combination (and, to some extent, also contrastive evaluation) of the results of different linguistic tools. More specifically, we describe an experiment aiming at the improvement of the correctness of lemma tagging for Spanish. This improvement was achieved by means of the standardisation and combination of the results of three different linguistic annotation tools (Bitext’s DataLexica, Connexor’s FDG Parser and LACELL’s POS tagger), using (1) ontologies, (2) a set of lemma tagging correction rules, determined empirically during the experiment, and (3) W3C standard languages, such as XML, RDF(S) and OWL. As we show in the results of the experiment, the interoperation of these tools by means of ontologies and the correction rules applied in the experiment improved significantly the quality of the resulting lemma tagging (when compared to the separate lemma tagging performed by each of the tools that we made interoperate).

pdf bib
The More the Better? Assessing the Influence of Wikipedia’s Growth on Semantic Relatedness Measures
Torsten Zesch | Iryna Gurevych

Wikipedia has been used as a knowledge source in many areas of natural language processing. As most studies only use a certain Wikipedia snapshot, the influence of Wikipedia’s massive growth on the results is largely unknown. For the first time, we perform an in-depth analysis of this influence using semantic relatedness as an example application that tests a wide range of Wikipedia’s properties. We find that the growth of Wikipedia has almost no effect on the correlation of semantic relatedness measures with human judgments, while the coverage steadily increases.

pdf bib
A Software Toolkit for Viewing Annotated Multimodal Data Interactively over the Web
Nick Campbell | Akiko Tabata

This paper describes a software toolkit for the interactive display and analysis of automatically extracted or manually derived annotation features of visual and audio data. It has been extensively tested with material collected as part of the FreeTalk Multimodal Conversation Corpus. Both the corpus and the software are available for download from sites in Europe and Japan. The corpus consists of several hours of video and audio recordings from a variety of capture devices, and includes subjective annotations of the content, along with derived data obtained from image processing. Because of the large size of the corpus, it is unrealistic to expect researchers to download all the material before deciding whether it will be useful to them in their research. We have therefore devised a means for interactive browsing of the content and for viewing at different levels of granularity. This has resulted in a simple set of tools that can be added to any website to allow similar browsing of audio- video recordings and their related data and annotations.

pdf bib
Named Entity Recognition in Questions: Towards a Golden Collection
Ana Cristina Mendes | Luísa Coheur | Paula Vaz Lobo

Named Entity Recognition (NER) plays a relevant role in several Natural Language Processing tasks. Question-Answering (QA) is an example of such, since answers are frequently named entities in agreement with the semantic category expected by a given question. In this context, the recognition of named entities is usually applied in free text data. NER in natural language questions can also aid QA and, thus, should not be disregarded. Nevertheless, it has not yet been given the necessary importance. In this paper, we approach the identification and classification of named entities in natural language questions. We hypothesize that NER results can benefit with the inclusion of previously labeled questions in the training corpus. We present a broad study addressing that hypothesis, focusing on the balance to be achieved between the amount of free text and questions in order to build a suitable training corpus. This work also contributes by providing a set of nearly 5,500 annotated questions with their named entities, freely available for research purposes.

pdf bib
The NOMCO Multimodal Nordic Resource - Goals and Characteristics
Patrizia Paggio | Jens Allwood | Elisabeth Ahlsén | Kristiina Jokinen | Costanza Navarretta

This paper presents the multimodal corpora that are being collected and annotated in the Nordic NOMCO project. The corpora will be used to study communicative phenomena such as feedback, turn management and sequencing. They already include video material for Swedish, Danish, Finnish and Estonian, and several social activities are represented. The data will make it possible to verify empirically how gestures (head movements, facial displays, hand gestures and body postures) and speech interact in all the three mentioned aspects of communication. The data are being annotated following the MUMIN annotation scheme, which provides attributes concerning the shape and the communicative functions of head movements, face expressions, body posture and hand gestures. After having described the corpora, the paper discusses how they will be used to study the way feedback is expressed in speech and gestures, and reports results from two pilot studies where we investigated the function of head gestures ― both single and repeated ― in combination with feedback expressions. The annotated corpora will be valuable sources for research on intercultural communication as well as for interaction in the individual languages.

pdf bib
Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese
Kikuo Maekawa | Makoto Yamazaki | Takehiko Maruyama | Masaya Yamaguchi | Hideki Ogura | Wakako Kashino | Toshinobu Ogiso | Hanae Koiso | Yasuharu Den

Compilation of a 100 million words balanced corpus called the Balanced Corpus of Contemporary Written Japanese (or BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin board and blogs) and so forth, and when possible, samples are drawn from the rigidly defined statistical populations by means of random sampling. All texts are dually POS-analyzed based upon two different, but mutually related, definitions of ‘word.’ Currently, more than 90 million words have been sampled and XML annotated with respect to text-structure and lexical and character information. A preliminary linear discriminant analysis of text genres using the data of POS frequencies and sentence length revealed it was possible to classify the text genres with a correct identification rate of 88% as far as the samples of books, newspapers, whitepapers, and internet bulletin boards are concerned. When the samples of blogs were included in this data set, however, the identification rate went down to 68%, suggesting the considerable variance of the blog texts in terms of the textual register and style.

pdf bib
An Annotation Scheme and Gold Standard for Dutch-English Word Alignment
Lieve Macken

The importance of sentence-aligned parallel corpora has been widely acknowledged. Reference corpora in which sub-sentential translational correspondences are indicated manually are more labour-intensive to create, and hence less wide-spread. Such manually created reference alignments -- also called Gold Standards -- have been used in research projects to develop or test automatic word alignment systems. In most translations, translational correspondences are rather complex; for example word-by-word correspondences can be found only for a limited number of words. A reference corpus in which those complex translational correspondences are aligned manually is therefore also a useful resource for the development of translation tools and for translation studies. In this paper, we describe how we created a Gold Standard for the Dutch-English language pair. We present the annotation scheme, annotation guidelines, annotation tool and inter-annotator results. To cover a wide range of syntactic and stylistic phenomena that emerge from different writing and translation styles, our Gold Standard data set contains texts from different text types. The Gold Standard will be publicly available as part of the Dutch Parallel Corpus.

pdf bib
Developing an Expressive Speech Labeling Tool Incorporating the Temporal Characteristics of Emotion
Stefan Scherer | Ingo Siegert | Lutz Bigalke | Sascha Meudt

A lot of research effort has been spent on the development of emotion theories and modeling, however, their suitability and applicability to expressions in human computer interaction has not exhaustively been evaluated. Furthermore, investigations concerning the ability of the annotators to map certain expressions onto the developed emotion models is lacking proof. The proposed annotation tool, which incorporates the standard Geneva Emotional Wheel developed by Klaus Scherer and a novel temporal characteristic description feature, is aiming towards enabling the annotator to label expressions recorded in human computer interaction scenarios on an utterance level. Further, it is respecting key features of realistic and natural emotional expressions, such as their sequentiality, temporal characteristics, their mixed occurrences, and their expressivity or clarity of perception. Additionally, first steps towards evaluating the proposed tool, by analyzing utterance annotations taken from two expressive speech corpora, are undertaken and some future goals including the open source accessibility of the tool are given.

pdf bib
Model Summaries for Location-related Images
Ahmet Aker | Robert Gaizauskas

At present there is no publicly available data set to evaluate the performance of different summarization systems on the task of generating location-related extended image captions. In this paper we describe a corpus of human generated model captions in English and German. We have collected 932 model summaries in English from existing image descriptions and machine translated these summaries into German. We also performed post-editing on the translated German summaries to ensure high quality. Both English and German summaries are evaluated using a readability assessment as in DUC and TAC to assess their quality. Our model summaries performed similar to the ones reported in Dang (2005) and thus are suitable for evaluating automatic summarization systems on the task of generating image descriptions for location related images. In addition, we also investigated whether post-editing of machine-translated model summaries is necessary for automated ROUGE evaluations. We found a high correlation in ROUGE scores between post-edited and non-post-edited model summaries which indicates that the expensive process of post-editing is not necessary.

pdf bib
Incorporating Speech Synthesis in the Development of a Mobile Platform for e-learning.
Justus Roux | Pieter Scholtz | Daleen Klop | Claus Povlsen | Bart Jongejan | Asta Magnusdottir

This presentation and accompanying demonstration focuses on the development of a mobile platform for e-learning purposes with enhanced text-to-speech capabilities. It reports on an international consortium project entitled Mobile E-learning for Africa (MELFA), which includes a reading and literacy training component, particularly focusing on an African language, isiXhosa. The high penetration rate of mobile phones within the African continent has created new opportunities for delivering various kinds of information, including e-learning material to communities that have not had appropriate infrastructures. Aspects of the mobile platform development are described paying attention to basic functionalities of the user interface, as well as to the underlying web technologies involved. Some of the main features of the literacy training module are described, such as grapheme-sound correspondence, syllabification-sound relationships, varying tempo of presentation. A particular point is made for using HMM (HTS) synthesis in this case, as it seems to be very appropriate for less resourced languages.

pdf bib
A Derivational Rephrasing Experiment for Question Answering
Bernard Jacquemin

In Knowledge Management, variations in information expressions have proven a real challenge. In particular, classical semantic relations (e.g. synonymy) do not connect words with different parts-of-speech. The method proposed tries to address this issue. It consists in building a derivational resource from a morphological derivation tool together with derivational guidelines from a dictionary in order to store only correct derivatives. This resource, combined with a syntactic parser, a semantic disambiguator and some derivational patterns, helps to reformulate an original sentence while keeping the initial meaning in a convincing manner This approach has been evaluated in three different ways: the precision of the derivatives produced from a lemma; its ability to provide well-formed reformulations from an original sentence, preserving the initial meaning; its impact on the results coping with a real issue, \textit{ie} a question answering task . The evaluation of this approach through a question answering system shows the pros and cons of this system, while foreshadowing some interesting future developments.

pdf bib
Evaluation of Machine Translation Errors in English and Iraqi Arabic
Sherri Condon | Dan Parvaz | John Aberdeen | Christy Doran | Andrew Freeman | Marwan Awad

Errors in machine translations of English-Iraqi Arabic dialogues were analyzed at two different points in the systems? development using HTER methods to identify errors and human annotations to refine TER annotations. The analyses were performed on approximately 100 translations into each language from 4 translation systems collected at two annual evaluations. Although the frequencies of errors in the more mature systems were lower, the proportions of error types exhibited little change. Results include high frequencies of pronoun errors in translations to English, high frequencies of subject person inflection in translations to Iraqi Arabic, similar frequencies of word order errors in both translation directions, and very low frequencies of polarity errors. The problems with many errors can be generalized as the need to insert lexemes not present in the source or vice versa, which includes errors in multi-word expressions. Discourse context will be required to resolve some problems with deictic elements like pronouns.

pdf bib
A Persian Part-Of-Speech Tagger Based on Morphological Analysis
Mahdi Mohseni | Behrouz Minaei-bidgoli

This paper describes a method based on morphological analysis of words for a Persian Part-Of-Speech (POS) tagging system. This is a main part of a process for expanding a large Persian corpus called Peyekare (or Textual Corpus of Persian Language). Peykare is arranged into two parts: annotated and unannotated parts. We use the annotated part in order to create an automatic morphological analyzer, a main segment of the system. Morphosyntactic features of Persian words cause two problems: the number of tags is increased in the corpus (586 tags) and the form of the words is changed. This high number of tags debilitates any taggers to work efficiently. From other side the change of word forms reduces the frequency of words with the same lemma; and the number of words belonging to a specific tag reduces as well. This problem also has a bad effect on statistical taggers. The morphological analyzer by removing the problems helps the tagger to cover a large number of tags in the corpus. Using a Markov tagger the method is evaluated on the corpus. The experiments show the efficiency of the method in Persian POS tagging.

pdf bib
Evaluation of Document Citations in Phase 2 Gale Distillation
Olga Babko-Malaya | Dan Hunter | Connie Fournelle | Jim White

The focus of information retrieval evaluations, such as NIST’s TREC evaluations (e.g. Voorhees 2003), is on evaluation of the information content of system responses. On the other hand, retrieval tasks usually involve two different dimensions: reporting relevant information and providing sources of information, including corroborating evidence and alternative documents. Under the DARPA Global Autonomous Language Exploitation (GALE) program, Distillation provides succinct, direct responses to the formatted queries using the outputs of automated transcription and translation technologies. These responses are evaluated in two dimensions: information content, which measures the amount of relevant and non-redundant information, and document support, which measures the number of alternative sources provided in support of reported information. The final metric in the overall GALE distillation evaluation combines the results of scoring of both query responses and document citations. In this paper, we describe our evaluation framework with emphasis on the scoring of document citations and an analysis of how systems perform at providing sources of information.

pdf bib
A Freely Available Morphological Analyzer for Turkish
Çağrı Çöltekin

This paper presents TRmorph, a two-level morphological analyzer for Turkish. TRmorph is a fairly complete and accurate morphological analyzer for Turkish. However, strength of TRmorph is neither in its performance, nor in its novelty. The main feature of this analyzer is its availability. It has completely been implemented using freely available tools and resources, and the two-level description is also distributed with a license that allows others to use and modify it freely for different applications. To our knowledge, TRmorph is the first freely available morphological analyzer for Turkish. This makes TRmorph particularly suitable for applications where the analyzer has to be changed in some way, or as a starting point for morphological analyzers for similar languages. TRmorph's specification of Turkish morphology is relatively complete, and it is distributed with a large lexicon. Along with the description of how the analyzer is implemented, this paper provides an evaluation of the analyzer on two large corpora.

pdf bib
Challenges in Building a Multilingual Alpine Heritage Corpus
Martin Volk | Noah Bubenhofer | Adrian Althaus | Maya Bangerter | Lenz Furrer | Beni Ruef

This paper describes our efforts to build a multilingual heritage corpus of alpine texts. Currently we digitize the yearbooks of the Swiss Alpine Club which contain articles in French, German, Italian and Romansch. Articles comprise mountaineering reports from all corners of the earth, but also scientific topics such as topography, geology or glacierology as well as occasional poetry and lyrics. We have already scanned close to 70,000 pages which has resulted in a corpus of 25 million words, 10% of which is a parallel French-German corpus. We have solved a number of challenges in automatic language identification and text structure recognition. Our next goal is to identify the great variety of toponyms (e.g. names of mountains and valleys, glaciers and rivers, trails and cabins) in this corpus, and we sketch how a large gazetteer of Swiss topographical names can be exploited for this purpose. Despite the size of the resource, exact matching leads to a low recall because of spelling variations, language mixtures and partial repetitions.

pdf bib
Annotating Attribution Relations: Towards an Italian Discourse Treebank
Silvia Pareti | Irina Prodanof

In this paper we describe the development of a schema for the annotation of attribution relations and present the first findings and some relevant issues concerning this phenomenon. Following the D-LTAG approach to discourse, we have developed a lexically anchored description of attribution, considering this relation, contrary to the approach in the PDTB, independently from other discourse relations. This approach has allowed us to deal with the phenomenon in a broader perspective than previous studies, reaching therefore a more accurate description of it and making it possible to raise some still unaddressed issues. Following this analysis, we propose an annotation schema and discuss the first results concerning its applicability. The schema has been applied to a pilot portion of the ISST corpus of Italian and represents the initial phase of a project aiming at the creation of an Italian Discourse Treebank. We believe this work will raise some awareness concerning the fundamental importance of attribution relations. The identification of the source has in fact strong implications for the attributed material. Moreover, it will make overt the complexity of a phenomenon for long underestimated.

pdf bib
Evaluating Human-Machine Conversation for Appropriateness
Nick Webb | David Benyon | Preben Hansen | Oil Mival

Evaluation of complex, collaborative dialogue systems is a difficult task. Traditionally, developers have relied upon subjective feedback from the user, and parametrisation over observable metrics. However, both models place some reliance on the notion of a task; that is, the system is helping to user achieve some clearly defined goal, such as book a flight or complete a banking transaction. It is not clear that such metrics are as useful when dealing with a system that has a more complex task, or even no definable task at all, beyond maintain and performing a collaborative dialogue. Working within the EU funded COMPANIONS program, we investigate the use of appropriateness as a measure of conversation quality, the hypothesis being that good companions need to be good conversational partners . We report initial work in the direction of annotating dialogue for indicators of good conversation, including the annotation and comparison of the output of two generations of the same dialogue system.

pdf bib
The English-Swedish-Turkish Parallel Treebank
Beáta Megyesi | Bengt Dahlqvist | Éva Á. Csató | Joakim Nivre

We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish. The corpus consists of approximately 300 000 tokens in Swedish, 160 000 in Turkish and 150 000 in English, containing both fiction and technical documents. We build the corpus by using the Uplug toolkit for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment. In addition, we use basic language resource kits for the linguistic analysis of the languages involved. The annotation is carried on various layers from morphological and part of speech analysis to dependency structures. The tools used for linguistic annotation, e.g.,\ HunPos tagger and MaltParser, are freely available data-driven resources, trained on existing corpora and treebanks for each language. The parallel treebank is used in teaching and linguistic research to study the relationship between the structurally different languages. In order to study the treebank, several tools have been developed for the visualization of the annotation and alignment, allowing search for linguistic patterns.

pdf bib
A Use Case for Controlled Languages as Interfaces to Semantic Web Applications
Pradeep Dantuluri | Brian Davis | Siegfried Handschuh

Although the Semantic web is steadily gaining in popularity, it remains a mystery to a large percentage of Internet users. This can be attributed to the complexity of the technologies that form its core. Creating intuitive interfaces which completely abstract the technologies underneath, is one way to solve this problem. A contrasting approach is to ease the user into understanding the technologies. We propose a solution which anchors on using controlled languages as interfaces to semantic web applications. This paper describes one such approach for the domain of meeting minutes, status reports and other project specific documents. A controlled language is developed along with an ontology to handle semi-automatic knowledge extraction. The contributions of this paper include an ontology designed for the domain of meeting minutes and status reports, and a controlled language grammar tailored for the above domain to perform the semi-automatic knowledge acquisition and generate RDF triples. This paper also describes two grammar prototypes, which were developed and evaluated prior to the development of the final grammar, as well as the Link grammar, which was the grammar formalism of choice.

pdf bib
Resources for Calendar Expressions Semantic Tagging and Temporal Navigation through Texts
Charles Teissèdre | Delphine Battistelli | Jean-Luc Minel

The linguistic resources presented in this paper are designed for the recognition and semantic tagging of calendar expressions in French. While existing resources generally put the emphasis on describing calendar bases pointed out by calendar expressions (which are considered as named entities), our approach tries to explicit how references to calendar are linguistically built up, taking into account not only the calendar bases but as well the prepositions and units that operate on them, as they provide valuable information on how texts refer to the calendar. The modelling of these expressions led us to consider calendar expressions as a conjunction of operators interacting with temporal references. Though the resources aim to be generic and easily reusable, we illustrate the interest of our approach by using the resources output to feed a text navigation tool that is currently being improved, in order to offer users a way of temporally progressing or navigating in texts.

pdf bib
The Alborada-I3A Corpus of Disordered Speech
Oscar Saz | Eduardo Lleida | Carlos Vaquero | W.-Ricardo Rodríguez

This paper describes the Alborada-I3A corpus of disordered speech, acquired during the recent years for the research in different speech technologies for the handicapped like Automatic Speech Recognition or pronunciation assessment. It contains more than 2 hours of speech from 14 young impaired speakers and nearly 9 hours from 232 unimpaired age-matched peers whose collaboration was possible by the joint work with different educational and assistive institutions. Furthermore, some extra resources are provided with the corpus, including the results of a perceptual human-based labeling of the lexical mispronunciations made by the impaired speakers. The corpus has been used to achieve results in different tasks like analyses on the speech production in impaired children, acoustic and lexical adaptation for ASR and studies on the speech proficiency of the impaired speakers. Finally, the full corpus is freely available for the research community with the only restrictions of maintaining all its data and resources for research purposes only and keeping the privacy of the speakers and their speech data

pdf bib
Evaluating a Text Mining Based Educational Search Portal
Sophia Ananiadou | John McNaught | James Thomas | Mark Rickinson | Sandy Oliver

In this paper, we present the main features of a text mining based search engine for the UK Educational Evidence Portal available at the UK National Centre for Text Mining (NaCTeM), together with a user-centred framework for the evaluation of the search engine. The framework is adapted from an existing proposal by the ISLE (EAGLES) Evaluation Working group. We introduce the metrics employed for the evaluation, and explain how these relate to the text mining based search engine. Following this, we describe how we applied the framework to the evaluation of a number of key text mining features of the search engine, namely the automatic clustering of search results, classification of search results according to a taxonomy, and identification of topics and other documents that are related to a chosen document. Finally, we present the results of the evaluation in terms of the strengths, weaknesses and improvements identified for each of these features.

pdf bib
A Large List of Confusion Sets for Spellchecking Assessed Against a Corpus of Real-word Errors
Jennifer Pedler | Roger Mitton

One of the methods that has been proposed for dealing with real-word errors (errors that occur when a correctly spelled word is substituted for the one intended) is the ""confusion-set"" approach - a confusion set being a small group of words that are likely to be confused with one another. Using a list of confusion sets drawn up in advance, a spellchecker, on finding one of these words in a text, can assess whether one of the other members of its set would be a better fit and, if it appears to be so, propose that word as a correction. Much of the research using this approach has suffered from two weaknesses. The first is the small number of confusion sets used. The second is that systems have largely been tested on artificial errors. In this paper we address these two weaknesses. We describe the creation of a realistically sized list of confusion sets, then the assembling of a corpus of real-word errors, and then we assess the potential of that list in relation to that corpus.

pdf bib
WITcHCRafT: A Workbench for Intelligent exploraTion of Human ComputeR conversaTions
Alexander Schmitt | Gregor Bertrand | Tobias Heinroth | Wolfgang Minker | Jackson Liscombe

We present Witchcraft, an open-source framework for the evaluation of prediction models for spoken dialogue systems based on interaction logs and audio recordings. The use of Witchcraft is two fold: first, it provides an adaptable user interface to easily manage and browse thousands of logged dialogues (e.g. calls). Second, with help of the underlying models and the connected machine learning framework RapidMiner the workbench is able to display at each dialogue turn the probability of the task being completed based on the dialogue history. It estimates the emotional state, gender and age of the user. While browsing through a logged conversation, the user can directly observe the prediction result of the models at each dialogue step. By that, Witchcraft allows for spotting problematic dialogue situations and demonstrates where the current system and the prediction models have design flaws. Witchcraft will be made publically available to the community and will be deployed as open-source project.

pdf bib
Constructing the CODA Corpus: A Parallel Corpus of Monologues and Expository Dialogues
Svetlana Stoyanchev | Paul Piwek

We describe the construction of the CODA corpus, a parallel corpus of monologues and expository dialogues. The dialogue part of the corpus consists of expository, i.e., information-delivering rather than dramatic, dialogues written by several acclaimed authors. The monologue part of the corpus is a paraphrase in monologue form of these dialogues by a human annotator. The annotator-written monologue preserves all information present in the original dialogue and does not introduce any new information that is not present in the original dialogue. The corpus was constructed as a resource for extracting rules for automated generation of dialogue from monologue. Using authored dialogues allows us to analyse the techniques used by accomplished writers for presenting information in the form of dialogue. The dialogues are annotated with dialogue acts and the monologues with rhetorical structure. We developed annotation and translation guidelines together with a custom-developed tool for carrying out translation, alignment and annotation of the dialogues. The final parallel CODA corpus consists of 1000 dialogue turns that are tagged with dialogue acts and aligned with monologue that expresses the same information and has been annotated with rhetorical structure relations.

pdf bib
Annotation Process Management Revisited
Dain Kaplan | Ryu Iida | Takenobu Tokunaga

Proper annotation process management is crucial to the construction of corpora, which are in turn indispensable to the data-driven techniques that have come to the forefront in NLP during the last two decades. It is still common to see ad-hoc tools created for a specific annotation project, but it is time this changed; creation of such tools is labor and time expensive, and is secondary to corpus creation. In addition, such tools likely lack proper annotation process management, increasingly more important as corpora sizes grow in size and complexity. This paper first raises a list of ten needs that any general purpose annotation system should address moving forward, such as user & role management, delegation & monitoring of work, diffing & merging annotators’ work, versioning of corpora, multilingual support, import/export format flexibility, and so on. A framework to address these needs is then proposed, and how having proper annotation process management can be beneficial to the creation and maintenance of corpora explained. The paper then introduces SLATE (Segment and Link-based Annotation Tool Enhanced), the second iteration of a web-based annotation tool, which is being rewritten to implement the proposed framework.

pdf bib
Wikipedia-based Approach for Linking Ontology Concepts to their Realisations in Text
Giulio Paci | Giorgio Pedrazzi | Roberta Turra

A novel method to automatically associate ontological concepts to their realisations in texts is presented. The method has been developed in the context of the Papyrus project to annotate texts and audio transcripts with a set of relevant concepts from the Papyrus News Ontology. To avoid strong dependency on a specific ontology, the annotation process starts by performing a Wikipedia-based annotation of news items: the most relevant keywords are detected and the Wikipedia pages that best describe their actual meaning are identified. In a later step this annotation is translated into an Ontology-based one: keywords are connected to the most appropriate ontology classes on the basis of a relatedness measure that relies on Wikipedia knowledge. Wikipedia-annotation provides a domain independent abstraction layer that simplify the adaptation of the approach to other domains and ontologies. Evaluation has been performed on a set of manually annotated news, resulting in 58% F1 score for relevant Wikipedia pages and 64% for relevant ontology concepts identification.

pdf bib
Ad-hoc Evaluations Along the Lifecycle of Industrial Spoken Dialogue Systems: Heading to Harmonisation?
Marianne Laurent | Philippe Bretier | Carole Manquillet

With a view to rationalise the evaluation process within the Orange Labs spoken dialogue system projects, a field audit has been realised among the various related professionals. The article presents the study's main conclusions and draws work perspectives to enhance the evaluation process in such a complex organisation. We first present the typical spoken dialogue system project lifecycle and the involved communities of stakeholders. We then sketch a map of indicators used across the teams. It shows that each professional category designs its evaluation metrics according to a case-by-case strategy, each one targeting different goals and methodologies. And last, we identify weaknesses in the evaluation process is handled by the various teams. Among others, we mention: the dependency on the design and exploitation tools that may not be suitable for an adequate collection of relevant indicators, the need to refine some indicators' definition and analysis to obtain valuable information for system enhancement, the sharing issue that advocates for a common definition of indicators across the teams and, as a consequence, the need for shared applications that support and encourage such a rationalisation.

pdf bib
Construction of Text Summarization Corpus for the Credibility of Information on the Web
Masahiro Nakano | Hideyuki Shibuki | Rintaro Miyazaki | Madoka Ishioroshi | Koichi Kaneko | Tatsunori Mori

Recently, the credibility of information on the Web has become an important issue. In addition to telling about content of source documents, indicating how to interpret the content, especially showing interpretation of the relation between statements appeared to contradict each other, is important for helping a user judge the credibility of information. In this paper, we will describe the purpose and the way in the construction of a text summarization corpus. Our purpose in the construction of the corpus includes the following three points; to collect Web documents relevant to several query sentences, to prepare gold standard data to evaluate smaller sub-processes in the extraction process and the summary generation process, to investigate the summaries made by human summarizers. The constructed corpus contains six query sentences, 24 manually-constructed summaries, and 24 collections of source Web documents. We also investigated how the descriptions of interpretation, which help a user judge the credibility of other descriptions in the summary, appear in the corpus. As a result, we confirmed that showing interpretation on conflicts is important for helping a user judge the credibility of information.

pdf bib
Top-Performing Robust Constituency Parsing of Portuguese: Freely Available in as Many Ways as you Can Get it
João Silva | António Branco | Patricia Gonçalves

In this paper we present LX-Parser, a probabilistic, robust constituency parser for Portuguese. This parser achieves ca. 88% f-score in the labeled bracketing task, thus reaching a state-of-the-art performance score that is in line with those that are currently obtained by top-ranking parsers for English, the most studied natural language. To the best of our knowledge, LX-Parser is the first state-of-the-art, robust constituency parser for Portuguese that is made freely available. This parser is being distributed in a variety of ways, each suited for a different type of usage. More specifically, LX-Parser is being made available (i) as a downloadable, stand-alone parsing tool that can be run locally by its users; (ii) as a Web service that exposes an interface that can be invoked remotely and transparently by client applications; and finally (iii) as an on-line parsing service, aimed at human users, that can be accessed through any common Web browser.

pdf bib
Resources for Controlled Languages for Alert Messages and Protocols in the European Perspective
Sylviane Cardey | Krzysztof Bogacki | Xavier Blanco | Ruslan Mitkov

This paper is concerned with resources for controlled languages for alert messages and protocols in the European perspective. These resources have been produced as the outcome of a project (Alert Messages and Protocols: MESSAGE) which has been funded with the support of the European Commission - Directorate-General Justice, Freedom and Security, and with the specific objective of 'promoting and supporting the development of security standards, and an exchange of know-how and experience on protection of people'. The MESSAGE project involved the development and transfer of a methodology for writing safe and safely translatable alert messages and protocols created by Centre Tesnière in collaboration with the aircraft industry, the health profession, and emergency services by means of a consortium of four partners to their four European member states in their languages (ES, FR (Coordinator), GB, PL). The paper describes alert messages and protocols, controlled languages for safety and security, the target groups involved, controlled language evaluation, dissemination, the resources that are available, both “Freely available” and “From Owner”, together with illustrations of the resources, and the potential transferability to other sectors and users.

pdf bib
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
Tomaž Erjavec

The paper presents the fourth, ``Mondilex'' edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications; morphosyntactic lexica; and annotated parallel, comparable, and speech corpora. The fourth release of these resources introduces XML-encoded morphosyntactic specifications and adds six new languages, bringing the total to 16: to Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Serbian, Slovene, and the Resian dialect of Slovene it adds Macedonian, Persian, Polish, Russian, Slovak, and Ukrainian. This dataset, unique in terms of languages covered and the wealth of encoding, is extensively documented, and freely available for research purposes at

pdf bib
The JOS Linguistically Tagged Corpus of Slovene
Tomaž Erjavec | Darja Fišer | Simon Krek | Nina Ledinek

The JOS language resources are meant to facilitate developments of HLT and corpus linguistics for the Slovene language and consist of the morphosyntactic specifications, defining the Slovene morphosyntactic features and tagset; two annotated corpora (jos100k and jos1M); and two web services (a concordancer and text annotation tool). The paper introduces these components, and concentrates on jos100k, a 100,000 word sampled balanced monolingual Slovene corpus, manually annotated for three levels of linguistic description. On the morphosyntactic level, each word is annotated with its morphosyntactic description and lemma; on the syntactic level the sentences are annotated with dependency links; on the semantic level, all the occurrences of 100 top nouns in the corpus are annotated with their wordnet synset from the Slovene semantic lexicon sloWNet. The JOS corpora and specifications have a standardised encoding (Text Encoding Initiative Guidelines TEI P5) and are available for research from under the Creative Commons licence.

pdf bib
Corpus-based Semantics of Concession: Where do Expectations Come from?
Livio Robaldo | Eleni Miltsakaki | Alessia Bianchini

In this paper, we discuss our analysis and resulting new annotations of Penn Discourse Treebank (PDTB) data tagged as Concession. Concession arises whenever one of the two arguments creates an expectation, and the other ones denies it. In Natural Languages, typical discourse connectives conveying Concession are 'but', 'although', 'nevertheless', etc. Extending previous theoretical accounts, our corpus analysis reveals that concessive interpretations are due to different sources of expectation, each giving rise to critical inferences about the relationship of the involved eventualities. We identify four different sources of expectation: Causality, Implication, Correlation, and Implicature. The reliability of these categories is supported by a high inter-annotator agreement score, computed over a sample of one thousand tokens of explicit connectives annotated as Concession in PDTB. Following earlier work of (Hobbs, 1998) and (Davidson, 1967) notion of reification, we extend the logical account of Concession originally proposed in (Robaldo et al., 2008) to provide refined formal descriptions for the first three mentioned sources of expectations in Concessive relations.

pdf bib
Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources
Darja Fišer | Senja Pollak | Špela Vintar

The paper presents an innovative approach to extract Slovene definition candidates from domain-specific corpora using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. First, a classification model was trained on examples from Slovene Wikipedia which was then used to find well-formed definitions among the extracted candidates. The results of the experiment are encouraging, with accuracy ranging from 67% to 71%. The paper also addresses some drawbacks of the approach and suggests ways to overcome them in future work.

pdf bib
A Differential Semantics Approach to the Annotation of Synsets in WordNet
Dan Tufiş | Dan Ştefănescu

We describe a new method for sentiment load annotation of the synsets of a wordnet, along the principles of Osgood’s “Semantic Differential” theory and extending the Kamp and Marx calculus, by taking into account not only the WordNet structure but also the SUMO/MILO (Niles & Pease, 2001) and DOMAINS (Bentivogli et al., 2004) knowledge sources. We discuss the method to annotate all the synsets in PWN2.0, irrespective of their part of speech. As the number of possible factors (semantic oppositions, along which the synsets are ranked) is very large, we developed also an application allowing the text analyst to select the most discriminating factors for the type of text to be analyzed. Once the factors have been selected, the underlying wordnet is marked-up on the fly and it can be used for the intended textual analysis. We anticipate that these annotations can be imported in other language wordnets, provided they are aligned to PWN2.0. The method for the synsets annotation generalizes the usual subjectivity mark-up (positive, negative and objective) according to a user-based multi-criteria differential semantics model.

pdf bib
Multimodal Russian Corpus (MURCO): First Steps
Elena Grishina

The paper introduces the Multimodal Russian Corpus (MURCO), which has been created in the framework of the Russian National Corpus (RNC). The MURCO provides the users with the great amount of phonetic, orthoepic, intonational information related to Russian. Moreover, the deeply annotated part of the MURCO contains the data concerning Russian gesticulation, speech act system, types of vocal gestures and interjections in Russian, and so on. The Corpus is on free access. The paper describes the main types of annotation and the interface structure of the MURCO. The MURCO consists of two parts, the second part being the subset of the first: 1) the whole Corpus, which is annotated from the lexical (lemmatization), morphological, semantic, accentological, metatextual, socioligical point of view (these types of annotation are standard for the RNC), and also from the point of view of phonetics (the orthoepic annotation and the mark-up of accentological word structure), 2) the deeply annotated MURCO, which is annotated in addition from the point of view of gesticulation and speech act structure.

pdf bib
Lingua-Align: An Experimental Toolbox for Automatic Tree-to-Tree Alignment
Jörg Tiedemann

In this paper we present an experimental toolbox for automatic tree-to-tree alignment based on a binary classification model. The aligner implements a recurrent architecture for structural prediction using history features and a sequential classification procedure. The discriminative base classifier uses a log-linear model in the current setup which enables simple integration of various features extracted from the data. The Lingua-Align toolbox provides a flexible framework for feature extraction including contextual properties and implements several alignment inference procedures. Various settings and constraints can be controlled via a simple frontend or called from external scripts. Lingua-Align supports different treebank formats and includes additional tools for conversion and evaluation. In our experiments we can show that our tree aligner produces results with high quality and outperforms unsupervised techniques proposed otherwise. It also integrates well with another existing tool for manual tree alignment which makes it possible to quickly integrate additional training material and to run semi-automatic alignment strategies.

pdf bib
Human Judgements on Causation in French Texts
Cécile Grivaz

The annotation of causal relations in natural language texts can lead to a low inter-annotator agreement. A French corpus annotated with causal relations would be helpful for the evaluation of programs that extract causal knowledge, as well as for the study of the expression of causation. As previous theoretical work provides no necessary and sufficient condition that would allow an annotator to easily identify causation, we explore features that are associated with causation in human judgements. We present an experiment that allows us to elicit intuitive features of causation. We test the statistical association of features of causation from theoretical previous work with causation itself in human judgements in an annotation experiment. We then establish guidelines based on these features for annotating a French corpus. We argue that our approach leads to coherent annotation guidelines, since it allows us to obtain a κ = 0.84 agreement between the majority of the annotators answers and our own educated judgements. We present these annotation instructions in detail.

pdf bib
Lexical Semantic Resources in a Terminological Network
Rita Marinelli | Adriana Roventini | Giovanni Spadoni | Sebastiana Cucurullo

A research has been carried on and is still in progress aimed at the construction of three specialized lexicons organized as databases of relational type. The three databases contain terms belonging to the specialized knowledge fields of maritime terminology (technical-nautical and maritime transport domain), taxation law, and labour law with union labour rules, respectively. The EuroWordNet/ItalWordNet model was firstly used to structure the terminological database of maritime domain. The methodology experimented for its construction was applied to construct the next databases. It consists in i) the management of corpora of specialized languages and ii) the use of generic databases to identify and extract a set of candidate terms to be codified in the terminological databases. The three specialized resources are described highlighting the various kinds of lexical semantic relations linking each term to the others within the single terminological database and to the generic resources WordNet and ItalWordNet. The construction of these specialized lexicons was carried on in the framework of different projects; but they can be seen as a first nucleus of an organized network of generic and specialized lexicons with the purpose of making the meaning of each term clearer from a cognitive point of view.

pdf bib
Is Sentiment a Property of Synsets? Evaluating Resources for Sentiment Classification using Machine Learning
Aleksander Wawer

Existing approaches to classifying documents by sentiment include machine learning with features created from n-grams and part of speech. This paper explores a different approach and examines performance of one selected machine learning algorithm, Support Vector Machines, with features computed using existing lexical resources. Special attention has been paid to fine tuning of the algorithm regarding number of features. The immediate purpose of this experiment is to evaluate lexical and sentiment resources in document-level sentiment classification task. Results described in the paper are also useful to indicate how lexicon design, different language dimensions and semantic categories contribute to document-level sentiment recognition. In a less direct way (through the examination of evaluated resources), the experiment analyzes adequacy of lexemes, word senses and synsets as different possible layers for ascribing sentiment, or as candidates for sentiment carriers. The proposed approach of machine learning word category frequencies instead of n-grams and part of speech features can potentially exhibit improvements in domain independency, but this hypothesis has to be verified in future works.

pdf bib
A Morphological Processor Based on Foma for Biscayan (a Basque dialect)
Iñaki Alegria | Garbiñe Aranbarri | Klara Ceberio | Gorka Labaka | Bittor Laskurain | Ruben Urizar

We present a new morphological processor for Biscayan, a dialect of Basque, developed on the description of the morphology of standard Basque. The database for the standard morphology has been extended for dialects and an open-source tool for morphological description named foma is used for building the processor. Biscayan is a dialect of the Basque language spoken mainly in Biscay, a province on the western of the Basque Country. The description of the lexicon and the morphotactics (or word grammar) for the standard Basque was carried out using a relational database and the database has been extended in order to include dialectal variants linked to the standard entries. XuxenB, a spelling checker/corrector for this dialect, is the first application of this work. Additionally to the basic analyzer used for spelling, a new transducer is included. It is an enhanced analyzer for linking standard form with the corresponding standard ones. It is used in correction for generation of proposals when in the input text appear standard forms which we want to replace with dialectal forms.

pdf bib
Recent Developments in the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Marek Łaziński | Piotr Pęzik

The aim of the paper is to present recent ― as of March 2010 ― developments in the construction of the National Corpus of Polish (NKJP). The NKJP project was launched at the very end of 2007 and it is aimed at compiling a large, linguistically annotated corpus of contemporary Polish by the end of 2010. Out of the total pool of 1 billion words of text data collected in the project, a 300 million word balanced corpus will be selected to match a set of predefined representativeness criteria. This present paper outlines a number of recent developments in the NKJP project, including: 1) the design of text encoding XML schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, 3) numerous improvements in search tools. As the work on NKJP progresses, it becomes clear that this project serves as an important testbed for linguistic annotation and interoperability standards. We believe that our recent experiences will prove relevant to future large-scale language resource compilation efforts.

pdf bib
Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank
António Branco | Francisco Costa | João Silva | Sara Silveira | Sérgio Castro | Mariana Avelãs | Clara Pinto | João Graça

Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools and, above all, supporting computational grammars appear no longer as a matter of convenience but of necessity. In this paper, we report on the design features, the development conditions and the methodological options of a deep linguistic databank, the CINTIL DeepGramBank. In this corpus, sentences are annotated with fully fledged linguistically informed grammatical representations that are produced by a deep linguistic processing grammar, thus consistently integrating morphological, syntactic and semantic information. We also report on how such corpus permits to straightforwardly obtain a whole range of past generation annotated corpora (POS, NER and morphology), current generation treebanks (constituency treebanks, dependency banks, propbanks) and next generation databanks (logical form banks) simply by means of a very residual selection/extraction effort to get the appropriate ""views"" exposing the relevant layers of information.

pdf bib
Diabase: Towards a Diachronic BLARK in Support of Historical Studies
Lars Borin | Markus Forsberg | Dimitrios Kokkinakis

We present our ongoing work on language technology-based e-science in the humanities, social sciences and education, with a focus on text-based research in the historical sciences. An important aspect of language technology is the research infrastructure known by the acronym BLARK (Basic LAnguage Resource Kit). A BLARK as normally presented in the literature arguably reflects a modern standard language, which is topic- and genre-neutral, thus abstracting away from all kinds of language variation. We argue that this notion could fruitfully be extended along any of the three axes implicit in this characterization (the social, the topical and the temporal), in our case the temporal axis, towards a diachronic BLARK for Swedish, which can be used to develop e-science tools in support of historical studies.

pdf bib
The Grande Grammaire du Français Project
Anne Abeillé | Danièle Godard

We present a new reference Grammar of French (La Grande Grammaire du français), which is a collective project (gathering around fifty contributors), producing a book (about 2200 pages, to be published en 2011) and associated databases. Like the recent reference grammars of the other Romance Languages, it takes into account the important results of the linguistic research of the past thrity years, while aiming at a non specialist audience and avoiding formalization. We differ from existing French grammar by being focused on contemporary French from a purely descriptive point of view, and by taking spoken data into account. We include a description of all the syntactic phenomena, as well as lexical, semantic, pragmatic and prosodic insights, specially as they interact with syntax. The analysis concerns the data from contemporary written French, but also includes data from spoken corpora and regional or non standard French (when accessible). Throughout the grammar, a simple phrase structure grammar is used, in order to maintain a common representation. The analyses are modular with a strict division of labor between morphology, syntax and semantics. From the syntactic point of view, POS are also distinguished from grammatical relations (or functions). The databases include a terminological glossary, different lexical databases for certain POS, certain valence frames and certain semantic classes, and a bibliographical database.

pdf bib
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Satoshi Sekine | Kapil Dalwani

We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). The previous system (Sekine 08) can only handle tokens and unrestricted wildcards in the query, such as “* was established in *”. However, being able to constrain the wildcards by POS, chunk or NE is quite useful to filter out noise. For example, the new system can search for “NE=COMPANY was established in POS=CD”. This finer specification reduces the number of outputs to less than half and avoids the ngrams which have a comma or a common noun at the first position or location information at the last position. It outputs the matched ngrams with their frequencies as well as all the contexts (i.e. sentences, KWIC lists and document ID information) where the matched ngrams occur in the corpus. It takes a fraction of a second for a search on a single CPU Linux-PC (1GB memory and 500GB disk) environment.

pdf bib
The Influence of the Utterance Length on the Recognition of Aged Voices
Alexander Schmitt | Tim Polzehl | Wolfgang Minker | Jackson Liscombe

This paper addresses the recognition of elderly callers based on short and narrow-band utterances, which are typical for Interactive Voice Response (IVR) systems. Our study is based on 2308 short utterances from a deployed IVR application. We show that features such as speaking rate, jitter and shimmer that are considered as most meaningful ones for determining elderly users underperform when used in the IVR context while pitch and intensity features seem to gain importance. We further demonstrate the influence of the utterance length on the classifier’s performance: for both humans and classifier, the distinction between aged and non-aged voices becomes increasingly difficult the shorter the utterances get. Our setup based on a Support Vector Machine (SVM) with linear kernel reaches a comparably poor performance of 58% accuracy, which can be attributed to an average utterance length of only 1.6 seconds. The automatic distinction between aged and non-aged utterances drops to random when the utterance length falls below 1.2 seconds.

pdf bib
A Typology of Near-Identity Relations for Coreference (NIDENT)
Marta Recasens | Eduard Hovy | M. Antònia Martí

The task of coreference resolution requires people or systems to decide when two referring expressions refer to the 'same' entity or event. In real text, this is often a difficult decision because identity is never adequately defined, leading to contradictory treatment of cases in previous work. This paper introduces the concept of 'near-identity', a middle ground category between identity and non-identity, to handle such cases systematically. We present a typology of Near-Identity Relations (NIDENT) that includes fifteen types―grouped under four main families―that capture a wide range of ways in which (near-)coreference relations hold between discourse entities. We validate the theoretical model by annotating a small sample of real data and showing that inter-annotator agreement is high enough for stability (K=0.58, and up to K=0.65 and K=0.84 when leaving out one and two outliers, respectively). This work enables subsequent creation of the first internally consistent language resource of this type through larger annotation efforts.

pdf bib
Interacting Semantic Layers of Annotation in SoNaR, a Reference Corpus of Contemporary Written Dutch
Ineke Schuurman | Véronique Hoste | Paola Monachesi

This paper reports on the annotation of a corpus of 1 million words with four semantic annotation layers, including named entities, co- reference relations, semantic roles and spatial and temporal expressions. These semantic annotation layers can benefit from the manually verified part of speech tagging, lemmatization and syntactic analysis (dependency tree) information layers which resulted from an earlier project (Van Noord et al., 2006) and will thus result in a deeply syntactically and semantically annotated corpus. This annotation effort is carried out in the framework of a larger project which aims at the collection of a 500-million word corpus of contemporary Dutch, covering the variants used in the Netherlands and Flanders, the Dutch speaking part of Belgium. All the annotation schemes used were (co-)developed by the authors within the Flemish-Dutch STEVIN-programme as no previous schemes for Dutch were available. They were created taking into account standards (either de facto or official (like ISO)) used elsewhere.

pdf bib
A Data Category Registry- and Component-based Metadata Framework
Daan Broeder | Marc Kemps-Snijders | Dieter Van Uytvanck | Menzo Windhouwer | Peter Withers | Peter Wittenburg | Claus Zinn

We describe our computer-supported framework to overcome the rule of metadata schism. It combines the use of controlled vocabularies, managed by a data category registry, with a component-based approach, where the categories can be combined to yield complex metadata structures. A metadata scheme devised in this way will thus be grounded in its use of categories. Schema designers will profit from existing prefabricated larger building blocks, motivating re-use at a larger scale. The common base of any two metadata schemes within this framework will solve, at least to a good extent, the semantic interoperability problem, and consequently, further promote systematic use of metadata for existing resources and tools to be shared.

pdf bib
Annotation Scheme and Gold Standard for Dutch Subjective Adjectives
Isa Maks | Piek Vossen

Many techniques are developed to derive automatically lexical resources for opinion mining. In this paper we present a gold standard for Dutch adjectives developed for the evaluation of these techniques. In the first part of the paper we introduce our annotation guidelines. They are based upon guidelines recently developed for English which annotate subjectivity and polarity at word sense level. In addition to subjectivity and polarity we propose a third annotation category: that of the attitude holder. The identity of the attitude holder is partly implied by the word itself and may provide useful information for opinion mining systems. In the second part of paper we present the criteria adopted for the selection of items which should be included in this gold standard. Our design is aimed at an equal representation of all dimensions of the lexicon , like frequency and polysemy, in order to create a gold standard which can be used not only for benchmarking purposes but also may help to improve in a systematic way, the methods which derive the word lists. Finally we present the results of the annotation task including annotator agreement rates and disagreement analysis.

pdf bib
Indexing Methods for Faster and More Effective Person Name Search
Mark Arehart

This paper compares several indexing methods for person names extracted from text, developed for an information retrieval system with requirements for fast approximate matching of noisy and multicultural Romanized names. Such matching algorithms are computationally expensive and unacceptably slow when used without an indexing or blocking step. The goal is to create a small candidate pool containing all the true matches that can be exhaustively searched by a more effective but slower name comparison method. In addition to dramatically faster search, some of the methods evaluated here led to modest gains in effectiveness by eliminating false positives. Four indexing techniques using either phonetic keys or substrings of name segments, with and without name segment stopword lists, were combined with three name matching algorithms. On a test set of 700 queries run against 70K noisy and multicultural names, the best-performing technique took just 2.1% as long as a naive exhaustive search and increased F1 by 3 points, showing that an appropriate indexing technique can increase both speed and effectiveness.

pdf bib
Detection of Peculiar Examples using LOF and One Class SVM
Hiroyuki Shinnou | Minoru Sasaki

This paper proposes the method to detect peculiar examples of the target word from a corpus. In this paper we regard following examples as peculiar examples: (1) a meaning of the target word in the example is new, (2) a compound word consisting of the target word in the example is new or very technical. The peculiar example is regarded as an outlier in the given example set. Therefore we can apply many methods proposed in the data mining domain to our task. In this paper, we propose the method to combine the density based method, Local Outlier Factor (LOF), and One Class SVM, which are representative outlier detection methods in the data mining domain. In the experiment, we use the Whitepaper text in BCCWJ as the corpus, and 10 noun words as target words. Our method improved precision and recall of LOF and One Class SVM. And we show that our method can detect new meanings by using the noun `midori (green)'. The main reason of un-detections and wrong detection is that similarity measure of two examples is inadequacy. In future, we must improve it.

pdf bib
TRIOS-TimeBank Corpus: Extended TimeBank Corpus with Help of Deep Understanding of Text
Naushad UzZaman | James Allen

TimeBank (Pustejovsky et al, 2003a), a reference for TimeML (Pustejovsky et al, 2003b) compliant annotation, is widely used temporally annotated corpus in the community. It captures time expressions, events, and relations between events and event and temporal expression; but there is room for improvements in this hand-annotated widely used TimeBank corpus. This work is one such effort to extend the TimeBank corpus. Our first goal is to suggest missing TimeBank events and temporal expressions, i.e. events and temporal expressions that were missed by TimeBank annotators. Along with that this paper also suggests some additions to TimeML language by adding new event features (ontology type), some more SLINKs and also relations between events with their arguments, which we call RLINK (relation link). With our new suggestions we present the TRIOS-TimeBank corpus, an extended TimeBank corpus. We conclude by suggesting our future work to clean the TimeBank corpus even more and automatically generating larger temporally annotated corpus for the community.

pdf bib
Ontology-Based Categorization of Web Services with Machine Learning
Adam Funk | Kalina Bontcheva

We present the problem of categorizing web services according to a shallow ontology for presentation on a specialist portal, using their WSDL and associated textual documents found by a crawler. We treat this as a text classification problem and apply first information extraction (IE) techniques (voting using keywords weight according to their context), then machine learning (ML), and finally a combined approach in which ML has priority over weighted keywords, but the latter can still make up categorizations for services for which ML does not produce enough. We evaluate the techniques (using data manually annotated through the portal, which we also use as the training data for ML) according to standard IE measures for flat categorization as well as the Balanced Distance Metric (more suitable for ontological classification) and compare them with related work in web service categorization. The ML and combined categorization results are good and the system is designed to take users' contributions through the portal's Web 2.0 features as additional training data.

pdf bib
Online Japanese Unknown Morpheme Detection using Orthographic Variation
Yugo Murawaki | Sadao Kurohashi

To solve the unknown morpheme problem in Japanese morphological analysis, we previously proposed a novel framework of online unknown morpheme acquisition and its implementation. This framework poses a previously unexplored problem, online unknown morpheme detection. Online unknown morpheme detection is a task of finding morphemes in each sentence that are not listed in a given lexicon. Unlike in English, it is a non-trivial task because Japanese does not delimit words by white space. We first present a baseline method that simply uses the output of the morphological analyzer. We then show that it fails to detect some unknown morphemes because they are over-segmented into shorter registered morphemes. To cope with this problem, we present a simple solution, the use of orthographic variation of Japanese. Under the assumption that orthographic variants behave similarly, each over-segmentation candidate is checked against its counterparts. Experiments show that the proposed method improves the recall of detection and contributes to improving unknown morpheme acquisition.

pdf bib
A Morphologically-Analyzed CHILDES Corpus of Hebrew
Bracha Nir | Brian MacWhinney | Shuly Wintner

We present a corpus of transcribed spoken Hebrew that forms an integral part of a comprehensive data system that has been developed to suit the specific needs and interests of child language researchers: CHILDES (Child Language Data Exchange System). We introduce a dedicated transcription scheme for the spoken Hebrew data that is aware both of the phonology and of the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus.

pdf bib
Non-verbal Signals for Turn-taking and Feedback
Kristiina Jokinen

This paper concerns non-verbal communication, and describes especially the use of eye-gaze to signal turn-taking and feedback in conversational settings. Eye-gaze supports smooth interaction by providing signals that the interlocutors interpret with respect to such conversational functions as taking turns and giving feedback. New possibilities to study the effect of eye-gaze on the interlocutors’ communicative behaviour have appeared with the eye-tracking technology which in the past years has matured to the level where its use to study naturally occurring dialogues have become easier and more reliable to conduct. It enables the tracking of eye-fixations and gaze-paths, and thus allows analysis of the person’s turn-taking and feedback behaviour through the analysis of their focus of attention. In this paper, experiments on the interlocutors’ non-verbal communication in conversational settings using the eye-tracker are reported, and results of classifying turn-taking using eye-gaze and gesture information are presented. Also the hybrid method that combines signal level analysis with human interpretation is discussed.

pdf bib
A Study of the Influence of Speech Type on Automatic Language Recognition Performance
Alejandro Abejón | Doroteo T. Toledano | Danilo Spada | González Victor | Daniel Hernández López

Automatic language recognition on spontaneous speech has experienced a rapid development in the last few years. This development has been in part due to the competitive technological Language Recognition Evaluations (LRE) organized by the National Institute of Standards and Technology (NIST). Until now, the need to have clearly defined and consistent evaluations has kept some real-life application issues out of these evaluations. In particular, all past NIST LREs have used exclusively conversational telephone speech (CTS) for development and test. Fortunately this has changed in the current NIST LRE since it includes also broadcast speech. However, for testing only the telephone speech found in broadcast data will be used. In real-life applications, there could be several more types of speech and systems could be forced to use a mix of different types of data for training and development and recognition. In this article, we have defined a test-bed including several types of speech data and have analyzed how a typical language recognition system works using different types of speech, and also a combination of different types of speech, for training and testing.

pdf bib
Cross-lingual Ontology Alignment using EuroWordNet and Wikipedia
Gosse Bouma

This paper describes a system for linking the thesaurus of the Netherlands Institute for Sound and Vision to English WordNet and dbpedia. The thesaurus contains subject (concept) terms, and names of persons, locations, and miscalleneous names. We used EuroWordNet, a multilingual wordnet, and Dutch Wikipedia as intermediaries for the two alignments. EuroWordNet covers most of the subject terms in the thesaurus, but the organization of the cross-lingual links makes selection of the most appropriate English target term almost impossible. Precision and recall of the automatic alignment with WordNet for subject terms is 0.59. Using page titles, redirects, disambiguation pages, and anchor text harvested from Dutch Wikipedia gives reasonable performance on subject terms and geographical terms. Many person and miscalleneous names in the thesaurus could not be located in (Dutch or English) Wikipedia. Precision for miscellaneous names, subjects, persons and locations for the alignment with Wikipedia ranges from 0.63 to 0.94, while recall for subject terms is 0.62.

pdf bib
Video Retrieval in Sign Language Videos : How to Model and Compare Signs?
François Lefebvre-Albaret | Patrice Dalle

This paper deals with the problem of finding sign occurrences in a sign language (SL) video. It begins with an analysis of sign models and the way they can take into account the sign variability. Then, we review the most popular technics dedicated to automatic sign language processing and we focus on their adaptation to model sign variability. We present a new method to provide a parametric description of the sign as a set of continuous and discrete parameters. Signs are classified according to there categories (ballistic movements, circles ...), the symmetry between the hand movements, hand absolute and relative locations. Membership grades to sign categories and continuous parameter comparisons can be combined to estimate the similarity between two signs. We set out our system and we evaluate how much time can be saved when looking for a sign in a french sign language video. By now, our formalism only uses hand 2D locations, we finally discuss about the way of integrating other parameters as hand shape or facial expression in our framework.

pdf bib
Identifying Sources of Weakness in Syntactic Lexicon Extraction
Claire Gardent | Alejandra Lorenzo

Previous work has shown that large scale subcategorisation lexicons could be extracted from parsed corpora with reasonably high precision. In this paper, we apply a standard extraction procedure to a 100 millions words parsed corpus of french and obtain rather poor results. We investigate different factors likely to improve performance such as in particular, the specific extraction procedure and the parser used; the size of the input corpus; and the type of frames learned. We try out different ways of interleaving the output of several parsers with the lexicon extraction process and show that none of them improves the results. Conversely, we show that increasing the size of the input corpus and modifying the extraction procedure to better differentiate prepositional arguments from prepositional modifiers improves performance. In conclusion, we suggest that a more sophisticated approach to parser combination and better probabilistic models of the various types of prepositional objects in French are likely ways to get better results.

pdf bib
Improvements in Parsing the Index Thomisticus Treebank. Revision, Combination and a Feature Model for Medieval Latin
Marco Passarotti | Felice Dell’Orletta

The creation of language resources for less-resourced languages like the historical ones benefits from the exploitation of language-independent tools and methods developed over the years by many projects for modern languages. Along these lines, a number of treebanks for historical languages started recently to arise, including treebanks for Latin. Among the Latin treebanks, the Index Thomisticus Treebank is a 68,000 token dependency treebank based on the Index Thomisticus by Roberto Busa SJ, which contains the opera omnia of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million tokens. In this paper, we describe a number of modifications that we applied to the dependency parser DeSR, in order to improve the parsing accuracy rates on the Index Thomisticus Treebank. First, we adapted the parser to the specific processing of Medieval Latin, defining an ad-hoc configuration of its features. Then, in order to improve the accuracy rates provided by DeSR, we applied a revision parsing method and we combined the outputs produced by different algorithms. This allowed us to improve accuracy rates substantially, reaching results that are well beyond the state of the art of parsing for Latin.

pdf bib
Cultural Aspects of Spatiotemporal Analysis in Multilingual Applications
Ineke Schuurman | Vincent Vandeghinste

In this paper we want to point out some issues arising when a natural language processing task involves several languages (like multi- lingual, multidocument summarization and the machine translation aspects involved) which are often neglected. These issues are of a more cultural nature, and may even come into play when several documents in a single language are involved. We pay special attention to those aspects dealing with the spatiotemporal characteristics of a text. Correct automatic selection of (parts of) texts such as handling the same eventuality, presupposes spatiotemporal disambiguation at a rather specific level. The same holds for the analysis of the query. For generation and translation purposes, spatiotemporal aspects may be relevant as well. At the moment English (both the British and American variants) and Dutch (the Flemish and Dutch variant) are covered, all taking into account the perspective of a contemporary, Flemish user. In our approach the cultural aspects associated with for example the language of publication and the language used by the user play a crucial role.

pdf bib
An Associative Concept Dictionary for Verbs and its Application to Elliptical Word Estimation
Takehiro Teraoka | Jun Okamoto | Shun Ishizaki

Natural language processing technology has developed remarkably, but it is still difficult for computers to understand contextual meanings as humans do. The purpose of our work has been to construct an associative concept dictionary for Japanese verbs and make computers understand contextual meanings with a high degree of accuracy. We constructed an automatic system that can be used to estimate elliptical words. We present the result of comparing words that were estimated both by our proposed system (VNACD) and three baseline systems (VACD, NACD, and CF). We then calculated the mean reciprocal rank (MRR), top N accuracy (top 1, top 5, and top 10), and the mean average precision (MAP). Finally, we showed the effectiveness of our method for which both an associative concept dictionary for verbs (Verb-ACD) and one for nouns (Noun-ACD) were used. From the results, we conclude that both the Verb-ACD and the Noun-ACD play a key role in estimating elliptical words.

pdf bib
Resource Creation for Training and Testing of Transliteration Systems for Indian Languages
Sowmya V. B. | Monojit Choudhury | Kalika Bali | Tirthankar Dasgupta | Anupam Basu

Machine transliteration is used in a number of NLP applications ranging from machine translation and information retrieval to input mechanisms for non-roman scripts. Many popular Input Method Editors for Indian languages, like Baraha, Akshara, Quillpad etc, use back-transliteration as a mechanism to allow users to input text in a number of Indian language. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparisons of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of ~2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also help in the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman.

pdf bib
Building a Lexicon of French Deverbal Nouns from a Semantically Annotated Corpus
Antonio Balvet | Lucie Barque | Rafael Marín

This paper presents project Nomage, which aims at describing the aspectual properties of deverbal nouns in an empirical way. It is centered on the development of two resources: a semantically annotated corpus of deverbal nouns, and an electronic lexicon. They are both presented in this paper, and emphasize how the semantic annotations of the corpus allow the lexicographic description of deverbal nouns to be validated, in particular their polysemy. Nominalizations have occupied a central place in grammatical analysis, with a focus on morphological and syntactic aspects. More recently, researchers have begun to address a specific issue often neglected before, i.e. the semantics of nominalizations, and its implications for Natural Language Processing applications such as electronic ontologies or Information Retrieval. We focus on precisely this issue in the research project NOMAGE, funded by the French National Research Agency (ANR-07-JCJC-0085-01). In this paper, we present the Nomage corpus and the annotations we make on deverbal nouns (section 2). We then show how we build our lexicon with the semantically annotated corpus and illustrate the kind of generalizations we can make from such data (section 3).

pdf bib
Annotation of Discourse Relations for Conversational Spoken Dialogs
Sara Tonelli | Giuseppe Riccardi | Rashmi Prasad | Aravind Joshi

In this paper, we make a qualitative and quantitative analysis of discourse relations within the LUNA conversational spoken dialog corpus. In particular, we first describe the Penn Discourse Treebank (PDTB) and then we detail the adaptation of its annotation scheme to the LUNA corpus of Italian task-oriented dialogs in the domain of software/hardware assistance. We discuss similarities and differences between our approach and the PDTB paradigm and point out the peculiarities of spontaneous dialogs w.r.t. written text, which motivated some changes in the annotation strategy. In particular, we introduced the annotation of relations between non-contiguous arguments and we modified the sense hierarchy in order to take into account the important role of pragmatics in dialogs. In the final part of the paper, we present a comparison between the sense and connective frequency in a representative subset of the LUNA corpus and in the PDTB. Such analysis confirmed the differences between the two corpora and corroborates our choice to introduce dialog-specific adaptations.

pdf bib
A Fact-aligned Corpus of Numerical Expressions
Sandra Williams | Richard Power

We describe a corpus of numerical expressions, developed as part of the NUMGEN project. The corpus contains newspaper articles and scientific papers in which exactly the same numerical facts are presented many times (both within and across texts). Some annotations of numerical facts are original: for example, numbers are automatically classified as round or non-round by an algorithm derived from Jansen and Pollmann (2001); also, numerical hedges such as ‘about’ or ‘a little under’ are marked up and classified semantically using arithmetical relations. Through explicit alignment of phrases describing the same fact, the corpus can support research on the influence of various contextual factors (e.g., document position, intended readership) on the way in which numerical facts are expressed. As an example we present results from an investigation showing that when a fact is mentioned more than once in a text, there is a clear tendency for precision to increase from first to subsequent mentions, and for mathematical level either to remain constant or to increase.

pdf bib
Question Answering on Web Data: The QA Evaluation in Quæro
Ludovic Quintard | Olivier Galibert | Gilles Adda | Brigitte Grau | Dominique Laurent | Véronique Moriceau | Sophie Rosset | Xavier Tannier | Anne Vilnat

In the QA and information retrieval domains progress has been assessed via evaluation campaigns(Clef, Ntcir, Equer, Trec).In these evaluations, the systems handle independent questions and should provide one answer to each question, extracted from textual data, for both open domain and restricted domain. Quæro is a program promoting research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Among the many research areas concerned by Quæro. The Quaero project organized a series of evaluations of Question Answering on Web Data systems in 2008 and 2009. For each language, English and French the full corpus has a size of around 20Gb for 2.5M documents. We describe the task and corpora, and especially the methodologies used in 2008 to construct the test of question and a new one in the 2009 campaign. Six types of questions were addressed, factual, Non-factual(How, Why, What), List, Boolean. A description of the participating systems and the obtained results is provided. We show the difficulty for a question-answering system to work with complex data and questions.

pdf bib
A Comprehensive Resource to Evaluate Complex Open Domain Question Answering
Silvia Quarteroni | Alessandro Moschitti

Complex Question Answering is a discipline that involves a deep understanding of question/answer relations, such as those characterizing definition and procedural questions and their answers. To contribute to the improvement of this technology, we deliver two question and answer corpora for complex questions, WEB-QA and TREC-QA, extracted by the same Question Answering system, YourQA, from the Web and from the AQUAINT-6 data collection respectively. We believe that such corpora can be useful resources to address a type of QA that is far from being efficiently solved. WEB-QA and TREC-QA are available in two formats: judgment files and training/testing files. Judgment files contain a ranked list of candidate answers to TREC-10 complex questions, extracted using YourQA as a baseline system and manually labelled according to a Likert scale from 1 (completely incorrect) to 5 (totally correct). Training and testing files contain learning instances compatible with SVM-light; these are useful for experimenting with shallow and complex structural features such as parse trees and semantic role labels. Our experiments with the above corpora have allowed to prove that structured information representation is useful to improve the accuracy of complex QA systems and to re-rank answers.

pdf bib
Named and Specific Entity Detection in Varied Data: The Quæro Named Entity Baseline Evaluation
Olivier Galibert | Ludovic Quintard | Sophie Rosset | Pierre Zweigenbaum | Claire Nédellec | Sophie Aubin | Laurent Gillard | Jean-Pierre Raysz | Delphine Pois | Xavier Tannier | Louise Deléger | Dominique Laurent

The Quæro program that promotes research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Within its context a set of evaluations of Named Entity recognition systems was held in 2009. Four tasks were defined. The first two concerned traditional named entities in French broadcast news for one (a rerun of ESTER 2) and of OCR-ed old newspapers for the other. The third was a gene and protein name extraction in medical abstracts. The last one was the detection of references in patents. Four different partners participated, giving a total of 16 systems. We provide a synthetic descriptions of all of them classifying them by the main approaches chosen (resource-based, rules-based or statistical), without forgetting the fact that any modern system is at some point hybrid. The metric (the relatively standard Slot Error Rate) and the results are also presented and discussed. Finally, a process is ongoing with preliminary acceptance of the partners to ensure the availability for the community of all the corpora used with the exception of the non-Quæro produced ESTER 2 one.

pdf bib
Study of Word Sense Disambiguation System that uses Contextual Features - Approach of Combining Associative Concept Dictionary and Corpus -
Kyota Tsutsumida | Jun Okamoto | Shun Ishizaki | Makoto Nakatsuji | Akimichi Tanaka | Tadasu Uchiyama

We propose a Word Sense Disambiguation (WSD) method that accurately classifies ambiguous words to concepts in the Associative Concept Dictionary (ACD) even when the test corpus and the training corpus for WSD are acquired from different domains. Many WSD studies determine the context of the target ambiguous word by analyzing sentences containing the target word. However, they offer poor performance when they are applied to a corpus that differs from the training corpus. One solution is to use associated words that are domain-independently assigned to the concept in ACD; i.e. many users commonly imagine those words against a given concept. Furthermore, by using the associated words of a concept as search queries for a training corpus, our method extracts relevant words, those that are computationally judged to be related to that concept. By checking the frequency of associated words and relevant words that appear near to the target word in a sentence in the test corpus, our method classifies the target word to the concept in ACD. Our evaluation using two different types of corpus demonstrates its good accuracy.

pdf bib
Alignment-based Profiling of Europarl Data in an English-Swedish Parallel Corpus
Lars Ahrenberg

This paper profiles the Europarl part of an English-Swedish parallel corpus and compares it with three other subcorpora of the same parallel corpus. We first describe our method for comparison which is based on manually reviewed word alignments. We investigate relative frequences of different types of correspondence, including null alignments, many-to-one correspondences and crossings. In addition, both halves of the parallel corpus have been annotated with morpho-syntactic information. The syntactic annotation uses labelled dependency relations. Thus, we can see how different types of correspondences are distributed on different parts-of-speech and compute correspondences at the structural level. In spite of the fact that two of the other subcorpora contains fiction, it is found that the Europarl part is the one having the highest proportion of many types of restructurings, including additions, deletions, long distance reorderings and dependency reversals. We explain this by the fact that the majority of Europarl segments are parallel translations rather than source texts and their translations.

pdf bib
Transliterating Urdu for a Broad-Coverage Urdu/Hindi LFG Grammar
Muhammad Kamran Malik | Tafseer Ahmed | Sebastian Sulger | Tina Bögel | Atif Gulzar | Ghulam Raza | Sarmad Hussain | Miriam Butt

In this paper, we present a system for transliterating the Arabic-based script of Urdu to a Roman transliteration scheme. The system is integrated into a larger system consisting of a morphology module, implemented via finite state technologies, and a computational LFG grammar of Urdu that was developed with the grammar development platform XLE (Crouch et al. 2008). Our long-term goal is to handle Hindi alongside Urdu; the two languages are very similar with respect to syntax and lexicon and hence, one grammar can be used to cover both languages. However, they are not similar concerning the script -- Hindi is written in Devanagari, while Urdu uses an Arabic-based script. By abstracting away to a common Roman transliteration scheme in the respective transliterators, our system can be enabled to handle both languages in parallel. In this paper, we discuss the pipeline architecture of the Urdu-Roman transliterator, mention several linguistic and orthographic issues and present the integration of the transliterator into the LFG parsing system.

pdf bib
Towards an Integrated Scheme for Semantic Annotation of Multimodal Dialogue Data
Volha Petukhova | Harry Bunt

Recent years witness a growing interest in the use of multimodal data for modelling of communicative behaviour in dialogue. Dybkjaer and Bernsen (2002), point out that coding schemes for multimodal data are used solely by their creators. Standardisation has been achieved to some extent for coding behavioural features for certain nonverbal expressions, e.g. for facial expression, however, for the semantic annotation of such expressions combined with other modalities such as speech there is still a long way to go. The majority of existing dialogue act annotation schemes that are designed to code semantic and pragmatic dialogue information are limited to analysis of spoken modality. This paper investigates the applicability of existing dialogue act annotation schemes to the semantic annotation of multimodal data, and the way a dialogue act annotation scheme can be extended to cover dialogue phenomena from multiple modalities. The general conclusion of our explorative study is that a multidimensional dialogue act taxonomy is usable for this purpose when some adjustments are made. We proposed a solution for adding these aspects to a dialogue act annotation scheme without changing its set of communicative functions, in the form of qualifiers that can be attached to communicative function tags.

pdf bib
Comparing the Influence of Different Treebank Annotations on Dependency Parsing
Cristina Bosco | Simonetta Montemagni | Alessandro Mazzei | Vincenzo Lombardo | Felice Dell’Orletta | Alessandro Lenci | Leonardo Lesmo | Giuseppe Attardi | Maria Simi | Alberto Lavelli | Johan Hall | Jens Nilsson | Joakim Nivre

As the interest of the NLP community grows to develop several treebanks also for languages other than English, we observe efforts towards evaluating the impact of different annotation strategies used to represent particular languages or with reference to particular tasks. This paper contributes to the debate on the influence of resources used for the training and development on the performance of parsing systems. It presents a comparative analysis of the results achieved by three different dependency parsers developed and tested with respect to two treebanks for the Italian language, namely TUT and ISST--TANL, which differ significantly at the level of both corpus composition and adopted dependency representations.

pdf bib
Appraise: An Open-Source Toolkit for Manual Phrase-Based Evaluation of Translations
Christian Federmann

We describe a focused effort to investigate the performance of phrase-based, human evaluation of machine translation output achieving a high annotator agreement. We define phrase-based evaluation and describe the implementation of Appraise, a toolkit that supports the manual evaluation of machine translation results. Phrase ranking can be done using either a fine-grained six-way scoring scheme that allows to differentiate between ""much better"" and ""slightly better"", or a reduced subset of ranking choices. Afterwards we discuss kappa values for both scoring models from several experiments conducted with human annotators. Our results show that phrase-based evaluation can be used for fast evaluation obtaining significant agreement among annotators. The granularity of ranking choices should, however, not be too fine-grained as this seems to confuse annotators and thus reduces the overall agreement. The work reported in this paper confirms previous work in the field and illustrates that the usage of human evaluation in machine translation should be reconsidered. The Appraise toolkit is available as open-source and can be downloaded from the author's website.

pdf bib
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Hai Zhao | Yan Song | Chunyu Kit

We investigate the impact of input data scale in corpus-based learning using a study style of Zipf’s law. In our research, Chinese word segmentation is chosen as the study case and a series of experiments are specially conducted for it, in which two types of segmentation techniques, statistical learning and rule-based methods, are examined. The empirical results show that a linear performance improvement in statistical learning requires an exponential increasing of training corpus size at least. As for the rule-based method, an approximate negative inverse relationship between the performance and the size of the input lexicon can be observed.

pdf bib
Merging Specialist Taxonomies and Folk Taxonomies in Wordnets - A case Study of Plants, Animals and Foods in the Danish Wordnet
Bolette S. Pedersen | Sanni Nimb | Anna Braasch

In this paper we investigate the problem of merging specialist taxonomies with the more intuitive folk taxonomies in lexical-semantic resources like wordnets; and we focus in particular on plants, animals and foods. We show that a traditional dictionary like Den Danske Ordbog (DDO) survives well with several inconsistencies between different taxonomies of the vocabulary and that a restructuring is therefore necessary in order to compile a consistent wordnet resource on its basis. To this end, we apply Cruse’s definitions for hyponymies, namely those of natural kinds (such as plants and animals) on the one hand and functional kinds (such as foods) on the other. We pursue this distinction in the development of the Danish wordnet, DanNet, which has recently been built on the basis of DDO and is made open source for all potential users at Not surprisingly, we conclude that cultural background influences the structure of folk taxonomies quite radically, and that wordnet builders must therefore consider these carefully in order to capture their central characteristics in a systematic way.

pdf bib
Inducing Ontologies from Folksonomies using Natural Language Understanding
Marta Tatu | Dan Moldovan

Folksonomies are unsystematic, unsophisticated collections of keywords associated by social bookmarking users to web content and, despite their inconsistency problems (typographical errors, spelling variations, use of space or punctuation as delimiters, same tag applied in different context, synonymy of concepts, etc.), their popularity is increasing among Web 2.0 application developers. In this paper, in addition to eliminating folksonomic irregularities existing at the lexical, syntactic or semantic understanding levels, we propose an algorithm that automatically builds a semantic representation of the folksonomy by exploiting the tags, their social bookmarking associations (co-occuring tags) and, more importantly, the content of labeled documents. We derive the semantics of each tag, discover semantic links between the folksonomic tags and expose the underlying semantic structure of the folksonomy, thus, enabling a number of information discovery and ontology-based reasoning applications.

pdf bib
Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus
Orphée De Clercq | Maribel Montero Perez

After three years of work the Dutch Parallel Corpus (DPC) project has reached an end. The finalized corpus is a ten-million-word high-quality sentence-aligned bidirectional parallel corpus of Dutch, English and French, with Dutch as central language. In this paper we present the corpus and try to formulate some basic data collection principles, based on the work that was carried out for the project. Building a corpus is a difficult and time-consuming task, especially when every text sample included has to be cleared from copyrights. The DPC is balanced according to five text types (literature, journalistic texts, instructive texts, administrative texts and texts treating external communication) and four translation directions (Dutch-English, English-Dutch, Dutch-French and French-Dutch). All the text material was cleared from copyrights. The data collection process necessitated the involvement of different text providers, which resulted in drawing up four different licence agreements. Problems such as an unknown source language, copyright issues and changes to the corpus design are discussed in close detail and illustrated with examples so as to be of help to future corpus compilers.

pdf bib
Event Models for Historical Perspectives: Determining Relations between High and Low Level Events in Text, Based on the Classification of Time, Location and Participants.
Agata Cybulska | Piek Vossen

In this paper, we report on a study that was performed within the “Semantics of History” project on how descriptions of historical events are realized in different types of text and what the implications are for modeling the event information. We believe that different historical perspectives of writers correspond in some degree with genre distinction and correlate with variation in language use. To capture differences between event representations in diverse text types and thus to identify relations between historical events, we defined an event model. We observed clear relations between particular parts of event descriptions - actors, time and location modifiers. Texts, written shortly after an event happened, use more specific and uniquely occurring event descriptions than texts describing the same events but written from a longer time perspective. We carried out some statistical corpus research to confirm this hypothesis. The ability to automatically determine relations between historical events and their sub-events over textual data, based on the relations between event participants, time markers and locations, will have important repercussions for the design of historical information retrieval systems.

pdf bib
An Evaluation of Predicate Argument Clustering using Pseudo-Disambiguation
Christian Scheible

Schulte im Walde et al. (2008) presented a novel approach to semantic verb classication. The predicate argument model (PAC) presented in their paper models selectional preferences by using soft clustering that incorporates the Expectation Maximization (EM) algorithm and the MDL principle. In this paper, I will show how the model handles the task of differentiating between plausible and implau- sible combinations of verbs, subcategorization frames and arguments by applying the pseudo-disambiguation evaluation method. The predicate argument clustering model will be evaluated in comparison with the latent semantic clustering model by Rooth et al. (1999). In particular, the influences of the model parameters, data frequency, and the individual components of the predicate argument model are examined. The results of these experiments show that (i) the selectional preference model overgeneralizes over arguments for the purpose of a pseudo-disambiguation task and that (ii) pseudo-disambiguation should not be used as a universal indicator for the quality of a model.

pdf bib
Meaning Representation: From Continuity to Discreteness
Fabienne Venant

This paper presents a geometric approach to meaning representation within the framework of continuous mathematics. Meaning representation is a central issue in Natural Language Processing, in particular for tasks like word sense disambiguation or information extraction. We want here to discuss the relevance of using continuous models in semantics. We don’t want to argue the continuous or discrete nature of lexical meaning. We use continuity as a tool to access and manipulate lexical meaning. Following Victorri (1994), we assume that continuity or discreteness are not properties of phenomena but characterizations of theories upon phenomena. We briefly describe our theoretical framework, the dynamical construction of meaning (Victorri and Fuchs, 1996), then present the way we automatically build continuous semantic spaces from a graph of synonymy and discuss their relevance and utility. We also think that discreteness and continuity can collaborate. We show here how we can complete our geometric representations with informations from discrete descriptions of meaning.

pdf bib
Learning Subjectivity Phrases missing from Resources through a Large Set of Semantic Tests
Matthieu Vernier | Laura Monceaux | Béatrice Daille

In recent years, blogs and social networks have particularly boosted interests for opinion mining research. In order to satisfy real-scale applicative needs, a main task is to create or to enhance lexical and semantic resources on evaluative language. Classical resources of the area are mostly built for english, they contain simple opinion word markers and are far to cover the lexical richness of this linguistic phenomenon. In particular, infrequent subjective words, idiomatic expressions, and cultural stereotypes are missing from resources. We propose a new method, applied on french, to enhance automatically an opinion word lexicon. This learning method relies on linguistic uses of internet users and on semantic tests to infer the degree of subjectivity of many new adjectives, nouns, verbs, noun phrases, verbal phrases which are usually forgotten by other resources. The final appraisal lexicon contains 3,456 entries. We evaluate the lexicon enhancement with and without textual context.

pdf bib
Evaluation Metrics for Persuasive NLP with Google AdWords
Marco Guerini | Carlo Strapparava | Oliviero Stock

Evaluating systems and theories about persuasion represents a bottleneck for both theoretical and applied fields: experiments are usually expensive and time consuming. Still, measuring the persuasive impact of a message is of paramount importance. In this paper we present a new ``cheap and fast'' methodology for measuring the persuasiveness of communication. This methodology allows conducting experiments with thousands of subjects for a few dollars in a few hours, by tweaking and using existing commercial tools for advertising on the web, such as Google AdWords. The central idea is to use AdWords features for defining message persuasiveness metrics. Along with a description of our approach we provide some pilot experiments, conducted both with text and image based ads, that confirm the effectiveness of our ideas. We also discuss the possible application of research on persuasive systems to Google AdWords in order to add more flexibility in the wearing out of persuasive messages.

pdf bib
Towards a Balanced Named Entity Corpus for Dutch
Bart Desmet | Véronique Hoste

This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.

pdf bib
Transcriber Driving Strategies for Transcription Aid System
Grégory Senay | Georges Linarès | Benjamin Lecouteux | Stanislas Oger | Thierry Michel

Speech recognition technology suffers from a lack of robustness which limits its usability for fully automated speech-to-text transcription, and manual correction is generally required to obtain perfect transcripts. In this paper, we propose a general scheme for semi-automatic transcription, in which the system and the transcriptionist contribute jointly to the speech transcription. The proposed system relies on the editing of confusion networks and on reactive decoding, the latter one being supposed to take benefits from the manual correction and improve the error rates. In order to reduce the correction time, we evaluate various strategies aiming to guide the transcriptionist towards the critical areas of transcripts. These strategies are based on graph density-based criterion and two semantic consistency criterion; using a corpus-based method and a web-search engine. They allow to indicate to the user the areas which present severe lacks of understandability. We evaluate these driving strategies by simulating the correction process of French broadcast news transcriptions. Results show that interactive decoding improves the correction act efficiency with all driving strategies and semantic information must be integrated into the interactive decoding process.

pdf bib
Building a Gold Standard for Event Detection in Croatian
Nikola Ljubešić | Tomislava Lauc | Damir Boras

This paper describes the process of building a newspaper corpus annotated with events described in specific documents. The main difference to the corpora built as part of the TDT initiative is that documents are not annotated by topics, but by specific events they describe. Additionally, documents are gathered from sixteen sources and all documents in the corpus are annotated with the corresponding event. The annotation process consists of a browsing and a searching step. Experiments are performed with a threshold that could be used in the browsing step yielding the result of having to browse through only 1% of document pairs for a 2% loss of relevant document pairs. A statistical analysis of the annotated corpus is undertaken showing that most events are described by few documents while just some events are reported by many documents. The inter-annotator agreement measures show high agreement concerning grouping documents into event clusters, but show a much lower agreement concerning the number of events the documents are organized into. An initial experiment is described giving a baseline for further research on this corpus.

pdf bib
Improving Domain-specific Entity Recognition with Automatic Term Recognition and Feature Extraction
Ziqi Zhang | José Iria | Fabio Ciravegna

Domain specific entity recognition often relies on domain-specific knowledge to improve system performance. However, such knowledge often suffers from limited domain portability and is expensive to build and maintain. Therefore, obtaining it in a generic and unsupervised manner would be a desirable feature for domain-specific entity recognition systems. In this paper, we introduce an approach that exploits domain-specificity of words as a form of domain-knowledge for entity-recognition tasks. Compared to prior work in the field, our approach is generic and completely unsupervised. We empirically show an improvement in entity extraction accuracy when features derived by our unsupervised method are used, with respect to baseline methods that do not employ domain knowledge. We also compared the results against those of existing systems that use manually crafted domain knowledge, and found them to be competitive.

pdf bib
Czech Information Retrieval with Syntax-based Language Models
Jana Straková | Pavel Pecina

In recent years, considerable attention has been dedicated to language modeling methods in information retrieval. Although these approaches generally allow exploitation of any type of language model, most of the published experiments were conducted with a classical n-gram model, usually limited only to unigrams. A few works exploiting syntax in information retrieval can be cited in this context, but significant contribution of syntax based language modeling for information retrieval is yet to be proved. In this paper, we propose, implement, and evaluate an enrichment of language model employing syntactic dependency information acquired automatically from both documents and queries. Our experiments are conducted on Czech which is a morphologically rich language and has a considerably free word order, therefore a syntactic language model is expected to contribute positively to the unigram and bigram language model based on surface word order. By testing our model on the Czech test collection from Cross Language Evaluation Forum 2007 Ad-Hoc track, we show positive contribution of using dependency syntax in this context.

pdf bib
A Resource and Tool for Super-sense Tagging of Italian Texts
Giuseppe Attardi | Stefano Dei Rossi | Giulia Di Pietro | Alessandro Lenci | Simonetta Montemagni | Maria Simi

A SuperSense Tagger is a tool for the automatic analysis of texts that associates to each noun, verb, adjective and adverb a semantic category within a general taxonomy. The developed tagger, based on a statistical model (Maximum Entropy), required the creation of an Italian annotated corpus, to be used as a training set, and the improvement of various existing tools. The obtained results significantly improved the current state-of-the art for this particular task.

pdf bib
Building the Basque PropBank
Izaskun Aldezabal | María Jesús Aranzabe | Arantza Díaz de Ilarraza | Ainara Estarrona

This paper presents the work that has been carried out to annotate semantic roles in the Basque Dependency Treebank (BDT). We will describe the resources we have used and the way the annotation of 100 verbs has been done. We decide to follow the model proposed in the PropBank project that has been deployed in other languages, such as Chinese, Spanish, Catalan and Russian. The resources used are: an in-house database with syntactic/semantic subcategorization frames for Basque verbs, an English-Basque verb mapping based on Levin’s classification and the BDT itself. Detailed guidelines for human taggers have been established as a result of this annotation process. In addition, we have characterized the information associated to the semantic tag. Besides, and based on this study, we will define semi-automatic procedures that will facilitate the task of manual annotation for the rest of the verbs of the Treebank. We have also adapted AbarHitz, a tool used in the construction of the BDT, for the task of annotating semantic roles according to the proposed characterization.

pdf bib
Automatic Detection of Syllable Boundaries in Spontaneous Speech
Brigitte Bigi | Christine Meunier | Irina Nesterenko | Roxane Bertrand

This paper presents the outline and performance of an automatic syllable boundary detection system. The syllabification of phonemes is performed with a rule-based system, implemented in a Java program. Phonemes are categorized into 6 classes. A set of specific rules are developed and categorized as general rules which can be applied in all cases, and exception rules which are applied in some specific situations. These rules deal with a French spontaneous speech corpus. Moreover, the proposed phonemes, classes and rules are listed in an external configuration file of the tool (under GPL licence) that make the tool very easy to adapt to a specific corpus by adding or modifying rules, phoneme encoding or phoneme classes, by the use of a new configuration file. Finally, performances are evaluated and compared to 3 other French syllabification systems and show significant improvements. Automatic system output and expert's syllabification are in agreement for most of syllable boundaries in our corpus.

pdf bib
Examining the Effects of Rephrasing User Input on Two Mobile Spoken Language Systems
Nikos Tsourakis | Agnes Lisowska | Manny Rayner | Pierrette Bouillon

During the construction of a spoken dialogue system much effort is spent on improving the quality of speech recognition as possible. However, even if an application perfectly recognizes the input, its understanding may be far from what the user originally meant. The user should be informed about what the system actually understood so that an error will not have a negative impact in the later stages of the dialogue. One important aspect that this work tries to address is the effect of presenting the system’s understanding during interaction with users. We argue that for specific kinds of applications it’s important to confirm the understanding of the system before obtaining the output. In this way the user can avoid misconceptions and problems occurring in the dialogue flow and he can enhance his confidence in the system. Nevertheless this has an impact on the interaction, as the mental workload increases, and the user’s behavior may adapt to the system’s coverage. We focus on two applications that implement the notion of rephrasing user’s input in a different way. Our study took place among 14 subjects that used both systems on a Nokia N810 Internet Tablet.

pdf bib
Wikicorpus: A Word-Sense Disambiguated Multilingual Wikipedia Corpus
Samuel Reese | Gemma Boleda | Montse Cuadros | Lluís Padró | German Rigau

This article presents a new freely available trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia and has been automatically enriched with linguistic information. To our knowledge, this is the largest such corpus that is freely available to the community: In its present version, it contains over 750 million words. The corpora have been annotated with lemma and part of speech information using the open source library FreeLing. Also, they have been sense annotated with the state of the art Word Sense Disambiguation algorithm UKB. As UKB assigns WordNet senses, and WordNet has been aligned across languages via the InterLingual Index, this sort of annotation opens the way to massive explorations in lexical semantics that were not possible before. We present a first attempt at creating a trilingual lexical resource from the sense-tagged Wikipedia corpora, namely, WikiNet. Moreover, we present two by-products of the project that are of use for the NLP community: An open source Java-based parser for Wikipedia pages developed for the construction of the corpus, and the integration of the WSD algorithm UKB in FreeLing.

pdf bib
Evaluating Distributional Properties of Tagsets
Markus Dickinson | Charles Jochim

We investigate which distributional properties should be present in a tagset by examining different mappings of various current part-of-speech tagsets, looking at English, German, and Italian corpora. Given the importance of distributional information, we present a simple model for evaluating how a tagset mapping captures distribution, specifically by utilizing a notion of frames to capture the local context. In addition to an accuracy metric capturing the internal quality of a tagset, we introduce a way to evaluate the external quality of tagset mappings so that we can ensure that the mapping retains linguistically important information from the original tagset. Although most of the mappings we evaluate are motivated by linguistic concerns, we also explore an automatic, bottom-up way to define mappings, to illustrate that better distributional mappings are possible. Comparing our initial evaluations to POS tagging results, we find that more distributional tagsets can sometimes result in worse accuracy, underscring the need to carefully define the properties of a tagset.

pdf bib
Towards Improving English-Latvian Translation: A System Comparison and a New Rescoring Feature
Maxim Khalilov | José A. R. Fonollosa | Inguna Skadin̨a | Edgars Brālītis | Lauma Pretkalnin̨a

Translation into the languages with relatively free word order has received a lot less attention than translation into fixed word order languages (English), or into analytical languages (Chinese). At the same time this translation task is found among the most difficult challenges for machine translation (MT), and intuitively it seems that there is some space in improvement intending to reflect the free word order structure of the target language. This paper presents a comparative study of two alternative approaches to statistical machine translation (SMT) and their application to a task of English-to-Latvian translation. Furthermore, a novel feature intending to reflect the relatively free word order scheme of the Latvian language is proposed and successfully applied on the n-best list rescoring step. Moving beyond classical automatic scores of translation quality that are classically presented in MT research papers, we contribute presenting a manual error analysis of MT systems output that helps to shed light on advantages and disadvantages of the SMT systems under consideration.

pdf bib
English-Spanish Large Statistical Dictionary of Inflectional Forms
Grigori Sidorov | Alberto Barrón-Cedeño | Paolo Rosso

The paper presents an approach for constructing a weighted bilingual dictionary of inflectional forms using as input data a traditional bilingual dictionary, and not parallel corpora. An algorithm is developed that generates all possible morphological (inflectional) forms and weights them using information on distribution of corresponding grammar sets (grammar information) in large corpora for each language. The algorithm also takes into account the compatibility of grammar sets in a language pair; for example, verb in past tense in language L normally is expected to be translated by verb in past tense in Language L'. We consider that the developed method is universal, i.e. can be applied to any pair of languages. The obtained dictionary is freely available. It can be used in several NLP tasks, for example, statistical machine translation.

pdf bib
HIFI-AV: An Audio-visual Corpus for Spoken Language Human-Machine Dialogue Research in Spanish
Fernando Fernández-Martínez | Juan Manuel Lucas-Cuesta | Roberto Barra Chicote | Javier Ferreiros | Javier Macías-Guarasa

In this paper, we describe a new multi-purpose audio-visual database on the context of speech interfaces for controlling household electronic devices. The database comprises speech and video recordings of 19 speakers interacting with a HIFI audio box by means of a spoken dialogue system. Dialogue management is based on Bayesian Networks and the system is provided with contextual information handling strategies. Each speaker was requested to fulfil different sets of specific goals following predefined scenarios, according to both different complexity levels and degrees of freedom or initiative allowed to the user. Due to a careful design and its size, the recorded database allows comprehensive studies on speech recognition, speech understanding, dialogue modeling and management, microphone array based speech processing, and both speech and video-based acoustic source localisation. The database has been labelled for quality and efficiency studies on dialogue performance. The whole database has been validated through both objective and subjective tests.

pdf bib
Inter-Annotator Agreement on a Linguistic Ontology for Spatial Language - A Case Study for GUM-Space
Joana Hois

In this paper, we present a case study for measuring inter-annotator agreement on a linguistic ontology for spatial language, namely the spatial extension of the Generalized Upper Model. This linguistic ontology specifies semantic categories, and it is used in dialogue systems for natural language of space in the context of human-computer interaction and spatial assistance systems. Its core representation for spatial language distinguishes how sentences can be structured and categorized into units that contribute certain meanings to the expression. This representation is here evaluated in terms of inter-annotator agreement: four uninformed annotators were instructed by a manual how to annotate sentences with the linguistic ontology. They have been assigned to annotate 200 sentences with varying length and complexity. Their resulting agreements are calculated together with our own 'expert annotation' of the same sentences. We show that linguistic ontologies can be evaluated with respect to inter-annotator agreement, and we present encouraging results of calculating agreements for the spatial extension of the Generalized Upper Model.

pdf bib
New Tools for Web-Scale N-grams
Dekang Lin | Kenneth Church | Heng Ji | Satoshi Sekine | David Yarowsky | Shane Bergsma | Kailash Patil | Emily Pitler | Rachel Lathbury | Vikram Rao | Kapil Dalwani | Sushant Narsale

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.

pdf bib
ELAN as Flexible Annotation Framework for Sound and Image Processing Detectors
Eric Auer | Albert Russel | Han Sloetjes | Peter Wittenburg | Oliver Schreer | S. Masnieri | Daniel Schneider | Sebastian Tschöpel

Annotation of digital recordings in humanities research still is, to a large extend, a process that is performed manually. This paper describes the first pattern recognition based software components developed in the AVATecH project and their integration in the annotation tool ELAN. AVATecH (Advancing Video/Audio Technology in Humanities Research) is a project that involves two Max Planck Institutes (Max Planck Institute for Psycholinguistics, Nijmegen, Max Planck Institute for Social Anthropology, Halle) and two Fraunhofer Institutes (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS, Sankt Augustin, Fraunhofer Heinrich-Hertz-Institute, Berlin) and that aims to develop and implement audio and video technology for semi-automatic annotation of heterogeneous media collections as they occur in multimedia based research. The highly diverse nature of the digital recordings stored in the archives of both Max Planck Institutes, poses a huge challenge to most of the existing pattern recognition solutions and is a motivation to make such technology available to researchers in the humanities.

pdf bib
Acquisition and Annotation of Slovenian Lombard Speech Database
Damjan Vlaj | Aleksandra Zögling Markuš | Marko Kos | Zdravko Kačič

This paper presents the acquisition and annotation of Slovenian Lombard Speech Database, the recording of which started in the year 2008. The database was recorded at the University of Maribor, Slovenia. The goal of this paper is to describe the hardware platform used for the acquisition of speech material, recording scenarios and tools used for the annotation of Slovenian Lombard Speech Database. The database consists of recordings of 10 Slovenian native speakers. Five males and five females were recorded. Each speaker pronounced a set of eight corpuses in two recording sessions with at least one week pause between recordings. The structure of the corpus is similar to SpeechDat II database. Approximately 30 minutes of speech material per speaker and per session was recorded. The manual annotation of speech material is performed with the LombardSpeechLabel tool developed at the University of Maribor. The speech and annotation material was saved on 10 DVDs (one speaker on one DVD).

pdf bib
The MuLeXFoR Database: Representing Word-Formation Processes in a Multilingual Lexicographic Environment
Bruno Cartoni | Marie-Aude Lefer

This paper introduces a new lexicographic resource, the MuLeXFoR database, which aims to present word-formation processes in a multilingual environment. Morphological items represent a real challenge for lexicography, especially for the development of multilingual tools. Affixes can take part in several word-formation rules and, conversely, rules can be realised by means of a variety of affixes. Consequently, it is often difficult to provide enough information to help users understand the meaning(s) of an affix or familiarise with the most frequent strategies used to translate the meaning(s) conveyed by affixes. In fact, traditional dictionaries often fail to achieve this goal. The database introduced in this paper tries to take advantage of recent advances in electronic implementation and morphological theory. Word-formation is presented as a set of multilingual rules that users can access via different indexes (affixes, rules and constructed words). MuLeXFoR entries contain, among other things, detailed descriptions of morphological constraints and productivity notes, which are sorely lacking in currently available tools such as bilingual dictionaries.

pdf bib
Paragraph Acquisition and Selection for List Question Using Amazon’s Mechanical Turk
Fang Xu | Dietrich Klakow

Creating more fine-grained annotated data than previously relevent document sets is important for evaluating individual components in automatic question answering systems. In this paper, we describe using the Amazon's Mechanical Turk (AMT) to judge whether paragraphs in relevant documents answer corresponding list questions in TREC QA track 2004. Based on AMT results, we build a collection of 1300 gold-standard supporting paragraphs for list questions. Our online experiments suggested that recruiting more people per task assures better annotation quality. In order to learning true labels from AMT annotations, we investigated three approaches on two datasets with different levels of annotation errors. Experimental studies show that the Naive Bayesian model and EM-based GLAD model can generate results highly agreeing with gold-standard annotations, and dominate significantly over the majority voting method for true label learning. We also suggested setting higher HIT approval rate to assure better online annotation quality, which leads to better performance of learning methods.

pdf bib
Construction of a Chinese Opinion Treebank
Lun-Wei Ku | Ting-Hao Huang | Hsin-Hsi Chen

In this paper, we base on the syntactic structural Chinese Treebank corpus, construct the Chinese Opinon Treebank for the research of opinion analysis. We introduce the tagging scheme and develop a tagging tool for constructing this corpus. Annotated samples are described. Information including opinions (yes or no), their polarities (positive, neutral or negative), types (expression, status, or action), is defined and annotated. In addition, five structure trios are introduced according to the linguistic relations between two Chinese words. Four of them that are possibly related to opinions are also annotated in the constructed corpus to provide the linguistic cues. The number of opinion sentences together with the number of their polarities, opinion types, and trio types are calculated. These statistics are compared and discussed. To know the quality of the annotations in this corpus, the kappa values of the annotations are calculated. The substantial agreement between annotations ensures the applicability and reliability of the constructed corpus.

pdf bib
Community-based Construction of Draft and Final Translation Corpus Through a Translation Hosting Site Minna no Hon’yaku (MNH)
Takeshi Abekawa | Masao Utiyama | Eiichiro Sumita | Kyo Kageura

In this paper we report a way of constructing a translation corpus that contains not only source and target texts, but draft and final versions of target texts, through the translation hosting site Minna no Hon'yaku (MNH). We made MNH publicly available on April 2009. Since then, more than 1,000 users have registered and over 3,500 documents have been translated, as of February 2010, from English to Japanese and from Japanese to English. MNH provides an integrated translation-aid environment, QRedit, which enables translators to look up high-quality dictionaries and Wikipedia as well as to search Google seamlessly. As MNH keeps translation logs, a corpus consisting of source texts, draft translations in several versions, and final translations is constructed naturally through MNH. As of 7 February, 764 documents with multiple translation versions are accumulated, of which 110 are edited by more than one translators. This corpus can be used for self-learning by inexperienced translators on MNH, and potentially for improving machine translation.

pdf bib
Active Learning and Crowd-Sourcing for Machine Translation
Vamshi Ambati | Stephan Vogel | Jaime Carbonell

Large scale parallel data generation for new language pairs requires intensive human effort and availability of experts. It becomes immensely difficult and costly to provide Statistical Machine Translation (SMT) systems for most languages due to the paucity of expert translators to provide parallel data. Even if experts are present, it appears infeasible due to the impending costs. In this paper we propose Active Crowd Translation (ACT), a new paradigm where active learning and crowd-sourcing come together to enable automatic translation for low-resource language pairs. Active learning aims at reducing cost of label acquisition by prioritizing the most informative data for annotation, while crowd-sourcing reduces cost by using the power of the crowds to make do for the lack of expensive language experts. We experiment and compare our active learning strategies with strong baselines and see significant improvements in translation quality. Similarly, our experiments with crowd-sourcing on Mechanical Turk have shown that it is possible to create parallel corpora using non-experts and with sufficient quality assurance, a translation system that is trained using this corpus approaches expert quality.

pdf bib
How Certain are Clinical Assessments? Annotating Swedish Clinical Text for (Un)certainties, Speculations and Negations
Hercules Dalianis | Sumithra Velupillai

Clinical texts contain a large amount of information. Some of this information is embedded in contexts where e.g. a patient status is reasoned about, which may lead to a considerable amount of statements that indicate uncertainty and speculation. We believe that distinguishing such instances from factual statements will be very beneficial for automatic information extraction. We have annotated a subset of the Stockholm Electronic Patient Record Corpus for certain and uncertain expressions as well as speculative and negation keywords, with the purpose of creating a resource for the development of automatic detection of speculative language in Swedish clinical text. We have analyzed the results from the initial annotation trial by means of pairwise Inter-Annotator Agreement (IAA) measured with F-score. Our main findings are that IAA results for certain expressions and negations are very high, but for uncertain expressions and speculative keywords results are less encouraging. These instances need to be defined in more detail. With this annotation trial, we have created an important resource that can be used to further analyze the properties of speculative language in Swedish clinical text. Our intention is to release this subset to other research groups in the future after removing identifiable information.

pdf bib
Base Concepts in the African Languages Compared to Upper Ontologies and the WordNet Top Ontology
Winston Anderson | Laurette Pretorius | Albert Kotzé

Ontologies, and in particular upper ontologies, are foundational to the establishment of the Semantic Web. Upper ontologies are used as equivalence formalisms between domain specific ontologies. Multilingualism brings one of the key challenges to the development of these ontologies. Fundamental to the challenges of defining upper ontologies is the assumption that concepts are universally shared. The approach to developing linguistic ontologies aligned to upper ontologies, particularly in the non-Indo-European language families, has highlighted these challenges. Previously two approaches to developing new linguistic ontologies and the influence of these approaches on the upper ontologies have been well documented. These approaches are examined in a unique new context: the African, and in particular, the Bantu languages. In particular, we address the following two questions: Which approach is better for the alignment of the African languages to upper ontologies? Can the concepts that are linguistically shared amongst the African languages be aligned easily with upper ontology concepts claimed to be universally shared?

pdf bib
CASIA-CASSIL: a Chinese Telephone Conversation Corpus in Real Scenarios with Multi-leveled Annotation
Keyan Zhou | Aijun Li | Zhigang Yin | Chengqing Zong

CASIA-CASSIL is a large-scale corpus base of Chinese human-human naturally-occurring telephone conversations in restricted domains. The first edition consists of 792 90-second conversations belonging to tourism domain, which are selected from 7,639 spontaneous telephone recordings in real scenarios. The corpus is now being annotated with wide range of linguistic and paralinguistic information in multi-levels. The annotations include Turns, Speaker Gender, Orthographic Transcription, Chinese Syllable, Chinese Phonetic Transcription, Prosodic Boundary, Stress of Sentence, Non-Speech Sounds, Voice Quality, Topic, Dialog-act and Adjacency Pairs, Ill-formedness, and Expressive Emotion as well, 13 levels in total. The abundant annotation will be effective especially for studying Chinese spoken language phenomena. This paper describes the whole process to build the conversation corpus, including collecting and selecting the original data, and the follow-up process such as transcribing, annotating, and so on. CASIA-CASSIL is being extended to a large scale corpus base of annotated Chinese dialogs for spoken Chinese study.

pdf bib
Online Temporal Language Model Adaptation for a Thai Broadcast News Transcription System
Kwanchiva Saykham | Ananlada Chotimongkol | Chai Wutiwiwatchai

This paper investigates the effectiveness of online temporal language model adaptation when applied to a Thai broadcast news transcription task. Our adaptation scheme works as follow: first an initial language model is trained with broadcast news transcription available during the development period. Then the language model is adapted over time with more recent broadcast news transcription and online news articles available during deployment especially the data from the same time period as the broadcast news speech being recognized. We found that the data that are closer in time are more similar in terms of perplexity and are more suitable for language model adaptation. The LMs that are adapted over time with more recent news data are better, both in terms of perplexity and WER, than the static LM trained from only the initial set of broadcast news data. Adaptation data from broadcast news transcription improved perplexity by 38.3% and WER by 7.1% relatively. Though, online news articles achieved less improvement, it is still a useful resource as it can be obtained automatically. Better data pre-processing techniques and data selection techniques based on text similarity could be applied to the news articles to obtain further improvement from this promising result.

pdf bib
The D-TUNA Corpus: A Dutch Dataset for the Evaluation of Referring Expression Generation Algorithms
Ruud Koolen | Emiel Krahmer

We present the D-TUNA corpus, which is the first semantically annotated corpus of referring expressions in Dutch. Its primary function is to evaluate and improve the performance of REG algorithms. Such algorithms are computational models that automatically generate referring expressions by computing how a specific target can be identified to an addressee by distinguishing it from a set of distractor objects. We performed a large-scale production experiment, in which participants were asked to describe furniture items and people, and provided all descriptions with semantic information regarding the target and the distractor objects. Besides being useful for evaluating REG algorithms, the corpus addresses several other research goals. Firstly, the corpus contains both written and spoken referring expressions uttered in the direction of an addressee, which enables systematic analyses of how modality (text or speech) influences the human production of referring expressions. Secondly, due to its comparability with the English TUNA corpus, our Dutch corpus can be used to explore the differences between Dutch and English speakers regarding the production of referring expressions.

pdf bib
ADN-Classifier:Automatically Assigning Denotation Types to Nominalizations
Aina Peris | Mariona Taulé | Gemma Boleda | Horacio Rodríguez

This paper presents the ADN-Classifier, an Automatic classification system of Spanish Deverbal Nominalizations aimed at identifying its semantic denotation (i.e. event, result, underspecified, or lexicalized). The classifier can be used for NLP tasks such as coreference resolution or paraphrase detection. To our knowledge, the ADN-Classifier is the first effort in acquisition of denotations for nominalizations using Machine Learning. We compare the results of the classifier when using a decreasing number of Knowledge Sources, namely (1) the complete nominal lexicon (AnCora-Nom) that includes sense distictions, (2) the nominal lexicon (AnCora-Nom) removing the sense-specific information, (3) nominalizations’ context information obtained from a treebank corpus (AnCora-Es) and (4) the combination of the previous linguistic resources. In a realistic scenario, that is, without sense distinction, the best results achieved are those taking into account the information declared in the lexicon (89.40% accuracy). This shows that the lexicon contains crucial information (such as argument structure) that corpus-derived features cannot substitute for.

pdf bib
Reusing Grammatical Resources for New Languages
Lene Antonsen | Trond Trosterud | Linda Wiechetek

Grammatical approaches to language technology are often considered less optimal than statistical approaches in multilingual settings, where large-scale portability becomes an important issue. The present paper argues that there is a notable gain in reusing grammatical resources when porting technology to new languages. The pivot language is North Sámi, and the paper discusses portability with respect to the closely related Lule and South Sámi, and to the unrelated Faroese and Greenlandic languages.

pdf bib
NameDat: A Database of English Proper Names Spoken by Native Norwegians
Line Adde | Torbjørn Svendsen

This paper describes the design and collection of NameDat, a database containing English proper names spoken by native Norwegians. The database was designed to cover the typical acoustic and phonetic variations that appear when Norwegians pronounce English names. The intended use of the database is acoustic and lexical modeling of these phonetic variations. The English names in the database have been enriched with several annotation tiers. The recorded names were selected according to three selection criteria: the familiarity of the name, the expected recognition performance and the coverage of non-native phonemes. The validity of the manual annotations was verified by means of an automatic recognition experiment of non-native names. The experiment showed that the use of the manual transcriptions from NameDat yields an increase in recognition performance over automatically generated transcriptions.

pdf bib
The Study of Writing Variants in an Under-resourced Language: Some Evidence from Mobile N-Deletion in Luxembourgish
Natalie D. Snoeren | Martine Adda-Decker | Gilles Adda

The national language of the Grand-Duchy of Luxembourg, Luxembourgish, has often been characterized as one of Europe's under-described and under-resourced languages. Because of a limited written production of Luxembourgish, poorly observed writing standardization (as compared to other languages such as English and French) and a large diversity of spoken varieties, the study of Luxembourgish poses many interesting challenges to automatic speech processing studies as well as to linguistic enquiries. In the present paper, we make use of large corpora to focus on typical writing and derived pronunciation variants in Luxembourgish, elicited by mobile -n deletion (hereafter shortened to MND). Using transcriptions from the House of Parliament debates and 10k words from news reports, we examine the reality of MND variants in written transcripts of speech. The goal of this study is manyfold: quantify the potential of variation due to MND in written Luxembourgish, check the mandatory status of the MND rule and discuss the arising problems for automatic spoken Luxembourgish processing.

pdf bib
The Design of Syntactic Annotation Levels in the National Corpus of Polish
Katarzyna Głowińska | Adam Przepiórkowski

The paper presents the procedure of syntactic annotation of the National Corpus of Polish. The paper concentrates on the delimitation of syntactic words (analytical forms, reflexive verbs, discontinuous conjunctions, etc.) and syntactic groups, as well as on problems encountered during the annotation process: syntactic group boundaries, multiword entities, abbreviations, discontinuous phrases and syntactic words. It includes the complete tagset for syntactic words and the list of syntactic groups recognized in NKJP. The tagset defines grammatical classes and categories according to morphosyntactic and syntactic criteria only. Syntactic annotation in the National Corpus of Polish is limited to making constituents of combinations of words. Annotation depends on shallow parsing and manual post-editing of the results by annotators. Manual annotation is performed by two independents annotators, with a referee in cases of disagreement. The manually constructed grammar, both for syntactic words and for syntactic groups, is encoded in the shallow parsing system Spejd.

pdf bib
Construction of Back-Channel Utterance Corpus for Responsive Spoken Dialogue System Development
Yuki Kamiya | Tomohiro Ohno | Shigeki Matsubara | Hideki Kashioka

In spoken dialogues, if a spoken dialogue system does not respond at all during user’s utterances, the user might feel uneasy because the user does not know whether or not the system has recognized the utterances. In particular, back-channel utterances, which the system outputs as voices such as “yeah” and “uh huh” in English have important roles for a driver in in-car speech dialogues because the driver does not look owards a listener while driving. This paper describes construction of a back-channel utterance corpus and its analysis to develop the system which can output back-channel utterances at the proper timing in the responsive in-car speech dialogue. First, we constructed the back-channel utterance corpus by integrating the back-channel utterances that four subjects provided for the driver’s utterances in 60 dialogues in the CIAIR in-car speech dialogue corpus. Next, we analyzed the corpus and revealed the relation between back-channel utterance timings and information on bunsetsu, clause, pause and rate of speech. Based on the analysis, we examined the possibility of detecting back-channel utterance timings by machine learning technique. As the result of the experiment, we confirmed that our technique achieved as same detection capability as a human.

pdf bib
Human Language Technology and Communicative Disabilities: Requirements and Possibilities for the Future
Marina B. Ruiter | Toni C. M. Rietveld | Catia Cucchiarini | Emiel J. Krahmer | Helmer Strik

For some years now, the Nederlandse Taalunie (Dutch Language Union) has been active in promoting the development of human language technology (HLT) applications for users of Dutch with communication disabilities. The reason is that HLT products and services may enable these users to improve their verbal autonomy and communication skills. We sought to identify a minimum common set of HLT resources that is required to develop tools for a wide range of communication disabilities. In order to reach this goal, we investigated the specific HLT needs of communicatively disabled people and related these needs to the underlying HLT software components. By analysing the availability and quality of these essential HLT resources, we were able to identify which of the crucial elements need further research and development to become usable for developing applications for communicatively disabled users of Dutch. The results obtained in the current survey can be used to inform policy institutions on how they can stimulate the development of HLT resources for this target group. In the current study results were obtained for Dutch, but a similar approach can also be used for other languages.

pdf bib
A Database of Age and Gender Annotated Telephone Speech
Felix Burkhardt | Martin Eckert | Wiebke Johannsen | Joachim Stegmann

This article describes an age-annotated database of German telephone speech. All in all 47 hours of prompted and free text was recorded, uttered by 954 paid participants in a style typical for automated voice services. The participants were selected based on an equal distribution of males and females within four age cluster groups; children, youth, adults and seniors. Within the children, gender is not distinguished, because it doesn’t have a strong enough effect on the voice. The textual content was designed to be typical for automated voice services and consists mainly of short commands, single words and numbers. An additional database consists of 659 speakers (368 female and 291 male) that called an automated voice portal server and answered freely on one of the two questions “What is your favourite dish?” and “What would you take to an island?” (island set, 422 speakers). This data might be used for out-of domain testing. The data will be used to tune an age-detecting automated voice service and might be released to research institutes under controlled conditions as part of an open age and gender detection challenge.

pdf bib
DutchParl. The Parliamentary Documents in Dutch
Maarten Marx | Anne Schuth

A corpus called DutchParl is created which aims to contain all digitally available parliamentary documents written in the Dutch language. The first version of DutchParl contains documents from the parliaments of The Netherlands, Flanders and Belgium. The corpus is divided along three dimensions: per parliament, scanned or digital documents, written recordings of spoken text and others. The digital collection contains more than 800 million tokens, the scanned collection more than 1 billion. All documents are available as UTF-8 encoded XML files with extensive metadata in Dublin Core standard. The text itself is divided into pages which are divided into paragraphs. Every document, page and paragraph has a unique URN which resolves to a web page. Every page element in the XML files is connected to a facsimile image of that page in PDF or JPEG format. We created a viewer in which both versions can be inspected simultaneously. The corpus is available for download in several formats. The corpus can be used for corpus-linguistic and political science research, and is suitable for performing scalability tests for XML information systems.

pdf bib
GernEdiT - The GermaNet Editing Tool
Verena Henrich | Erhard Hinrichs

This paper introduces GernEdiT (short for: GermaNet Editing Tool), a new graphical user interface for the lexicographers and developers of GermaNet, the German version of the Princeton WordNet. GermaNet is a lexical-semantic net that relates German nouns, verbs, and adjectives. Traditionally, lexicographic work for extending the coverage of GermaNet utilized the Princeton WordNet development environment of lexicographer files. Due to a complex data format and no opportunity of automatic consistency checks, this process was very error prone and time consuming. The GermaNet Editing Tool GernEdiT was developed to overcome these shortcomings. The main purposes of the GernEdiT tool are, besides supporting lexicographers to access, modify, and extend GermaNet data in an easy and adaptive way, as follows: Replace the standard editing tools by a more user-friendly tool, use a relational database as data storage, support export formats in the form of XML, and facilitate internal consistency and correctness of the linguistic resource. All these core functionalities of GernEdiT along with the main aspects of the underlying lexical resource GermaNet and its current database format are presented in this paper.

pdf bib
Sustainability of Linguistic Data and Analysis in the Context of a Collaborative eScience Environment
Erhard Hinrichs | Verena Henrich | Thomas Zastrow

For researchers, it is especially important that primary research data are preserved and made available on a long-term basis and to a wide variety of researchers. In order to ensure long-term availability of the archived data, it is imperative that the data to be stored is conformant with standardized data formats and best practices followed by the relevant research communities. Storing, managing, and accessing such standard-conformant data requires a repository-based infrastructure. Two projects at the University of Tübingen are realizing a collaborative eScience research environment with the help of eSciDoc for the university that supports long-term preservation of all kinds of data as well as a fine-grained and contextualized data management: the INF project and the BW-eSci(T) project. The task of the infrastructure (INF) project within the collaborative research centre „Emergence of Meaning“ (SFB 833) is to guarantee the long-term availability of the SFBs data. BW-eSci(T) is a joint project of the University of Tübingen and the Fachinformationszentrums (FIZ) Karlsruhe. The goal of this project is to develop a prototypical eScience research environment for the University of Tübingen.

pdf bib
Pattern-Based Extraction of Negative Polarity Items from Dependency-Parsed Text
Fabienne Fritzinger | Frank Richter | Marion Weller

We describe a new method for extracting Negative Polarity Item candidates (NPI candidates) from dependency-parsed German text corpora. Semi-automatic extraction of NPIs is a challenging task since NPIs do not have uniform categorical or other syntactic properties that could be used for detecting them; they occur as single words or as multi-word expressions of almost any syntactic category. Their defining property is of a semantic nature, they may only occur in the scope of negation and related semantic operators. In contrast to an earlier approach to NPI extraction from corpora, we specifically target multi-word expressions. Besides applying statistical methods to measure the co-occurrence of our candidate expressions with negative contexts, we also apply linguistic criteria in an attempt to determine to which degree they are idiomatic. Our method is evaluated by comparing the set of NPIs we found with the most comprehensive electronic list of German NPIs, which currently contains 165 entries. Our method retrieved 142 NPIs, 114 of which are new.

pdf bib
Computer Assisted Semantic Annotation in the DutchSemCor Project
Attila Görög | Piek Vossen

The goal of this paper is to describe the annotation protocols and the Semantic Annotation Tool (SAT) used in the DutchSemCor project. The DutchSemCor project is aiming at aligning the Cornetto lexical database with the Dutch language corpus SoNaR. 250K corpus occurrences of the 3,000 most frequent and most ambiguous Dutch nouns, adjectives and verbs are being annotated manually using the SAT. This data is then used for bootstrapping 750K extra occurrences which in turn will be checked manually. Our main focus in this paper is the methodology applied in the project to attain the envisaged Inter-annotator Agreement (IA) of =80%. We will also discuss one of the main objectives of DutchSemCor i.e. to provide semantically annotated language data with high scores for quantity, quality and diversity. Sample data with high scores for these three features can yield better results for co-training WSD systems. Finally, we will take a brief look at our annotation tool.

pdf bib
WebLicht: Web-based LRT Services in a Distributed eScience Infrastructure
Marie Hinrichs | Thomas Zastrow | Erhard Hinrichs

eScience - enhanced science - is a new paradigm of scientific work and research. In the humanities, eScience environments can be helpful in establishing new workflows and lifecycles of scientific data. WebLicht is such an eScience environment for linguistic analysis, making linguistic tools and resources available network-wide. Today, most digital language resources and tools (LRT) are available by download only. This is inconvenient for someone who wants to use and combine several tools because these tools are normally not compatible with each other. To overcome this restriction, WebLicht makes the functionality of linguistic tools and the resources themselves available via the internet as web services. In WebLicht, several kinds of linguistic tools are available which cover the basic functionality of automatic and incremental creation of annotated text corpora. To make use of the more than 70 tools and resources currently available, the end user needs nothing more than just a common web browser.

pdf bib
The Nijmegen Corpus of Casual Spanish
Francisco Torreira | Mirjam Ernestus

This article describes the preparation, recording and orthographic transcription of a new speech corpus, the Nijmegen Corpus of Casual Spanish (NCCSp). The corpus contains around 30 hours of recordings of 52 Madrid Spanish speakers engaged in conversations with friends. Casual speech was elicited during three different parts, which together provided around ninety minutes of speech from every group of speakers. While Parts 1 and 2 did not require participants to perform any specific task, in Part 3 participants negotiated a common answer to general questions about society. Information about how to obtain a copy of the corpus can be found online at

pdf bib
GikiCLEF: Crosscultural Issues in Multilingual Information Access
Diana Santos | Luís Miguel Cabral | Corina Forascu | Pamela Forner | Fredric Gey | Katrin Lamm | Thomas Mandl | Petya Osenova | Anselmo Peñas | Álvaro Rodrigo | Julia Schulz | Yvonne Skalban | Erik Tjong Kim Sang

In this paper we describe GikiCLEF, the first evaluation contest that, to our knowledge, was specifically designed to expose and investigate cultural and linguistic issues involved in structured multimedia collections and searching, and which was organized under the scope of CLEF 2009. GikiCLEF evaluated systems that answered hard questions for both human and machine, in ten different Wikipedia collections, namely Bulgarian, Dutch, English, German, Italian, Norwegian (Bokmäl and Nynorsk), Portuguese, Romanian, and Spanish. After a short historical introduction, we present the task, together with its motivation, and discuss how the topics were chosen. Then we provide another description from the point of view of the participants. Before disclosing their results, we introduce the SIGA management system explaining the several tasks which were carried out behind the scenes. We quantify in turn the GIRA resource, offered to the community for training and further evaluating systems with the help of the 50 topics gathered and the solutions identified. We end the paper with a critical discussion of what was learned, advancing possible ways to reuse the data.

pdf bib
Virtual Language Observatory: The Portal to the Language Resources and Technology Universe
Dieter Van Uytvanck | Claus Zinn | Daan Broeder | Peter Wittenburg | Mariano Gardellini

Over the years, the field of Language Resources and Technology (LRT) has developed a tremendous amount of resources and tools. However, there is no ready-to-use map that researchers could use to gain a good overview and steadfast orientation when searching for, say corpora or software tools to support their studies. It is rather the case that information is scattered across project- or organisation-specific sites, which makes it hard if not impossible for less-experienced researchers to gather all relevant material. Clearly, the provision of metadata is central to resource and software exploration. However, in the LRT field, metadata comes in many forms, tastes and qualities, and therefore substantial harmonization and curation efforts are required to provide researchers with metadata-based guidance. To address this issue a broad alliance of LRT providers (CLARIN, the Linguist List, DOBES, DELAMAN, DFKI, ELRA) have initiated the Virtual Language Observatory portal to provide a low-barrier, easy-to-follow entry point to language resources and tools; it can be accessed via

pdf bib
A Fully Annotated Corpus of Russian Speech
Pavel Skrelin | Nina Volskaya | Daniil Kocharov | Karina Evgrafova | Olga Glotova | Vera Evdokimova

The paper introduces CORPRES ― a fully annotated Russian speech corpus developed at the Department of Phonetics, St. Petersburg State University as a result of a three-year project. The corpus includes samples of different speaking styles produced by 4 male and 4 female speakers. Six levels of annotation cover all phonetic and prosodic information about the recorded speech data, including labels for pitch marks, phonetic events, narrow and wide phonetic transcription, orthographic and prosodic transcription. Precise phonetic transcription of the data provides an especially valuable resource for both research and development purposes. Overall corpus size is 528 458 running words and contains 60 hours of speech made up of 7.5 hours from each speaker. 40% of the corpus was manually segmented and fully annotated on all six levels. 60% of the corpus was partly annotated; there are labels for pitch period and phonetic event labels. Orthographic, prosodic and ideal phonetic transcription for this part was generated and stored as text files. The fully annotated part of the corpus covers all speaking styles included in the corpus and all speakers. The paper contains information about CORPRES design and annotation principles, overall data description and some speculation about possible use of the corpus.

pdf bib
FAU IISAH Corpus – A German Speech Database Consisting of Human-Machine and Human-Human Interaction Acquired by Close-Talking and Far-Distance Microphones
Werner Spiegl | Korbinian Riedhammer | Stefan Steidl | Elmar Nöth

In this paper the FAU IISAH corpus and its recording conditions are described: a new speech database consisting of human-machine and human-human interaction recordings. Beside close-talking microphones for the best possible audio quality of the recorded speech, far-distance microphones were used to acquire the interaction and communication. The recordings took place during a Wizard-of-Oz experiment in the intelligent, senior-adapted house (ISA-House). That is a living room with a speech controlled home assistance system for elderly people, based on a dialogue system, which is able to process spontaneous speech. During the studies in the ISA-House more than eight hours of interaction data were recorded including 3 hours and 27 minutes of spontaneous speech. The data were annotated in terms of human-human (off-talk) and human-machine (on-talk) interaction. The test persons used 2891 turns of off-talk and 2752 turns of on-talk including 1751 different words. Still in progress is the analysis under statistical and linguistical aspects.

pdf bib
Morphological Annotation of Quranic Arabic
Kais Dukes | Nizar Habash

The Quranic Arabic Corpus ( is an annotated linguistic resource with multiple layers of annotation including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar. The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400 year old central religious text of Islam. This paper describes a new approach to morphological annotation of Quranic Arabic, a genre difficult to compare with other forms of Arabic. Processing Quranic Arabic is a unique challenge from a computational point of view, since the vocabulary and spelling differ from Modern Standard Arabic. The Quranic Arabic Corpus differs from other Arabic computational resources in adopting a tagset that closely follows traditional Arabic grammar. We made this decision in order to leverage a large body of existing historical grammatical analysis, and to encourage online collaborative annotation. In this paper, we discuss how the unique challenge of morphological annotation of Quranic Arabic is solved using a multi-stage approach. The different stages include automatic morphological tagging using diacritic edit-distance, two-pass manual verification, and online collaborative annotation. This process is evaluated to validate the appropriateness of the chosen methodology.

pdf bib
BAStat : New Statistical Resources at the Bavarian Archive for Speech Signals
Florian Schiel

A new type of language resource called 'BAStat' has been released by the Bavarian Archive for Speech Signals at Ludwig Maximilians Universitaet, Munich. In contrast to primary resources like speech and text corpora BAStat comprises statistical estimates based on a number of primary spoken language resources: first and second order occurrence probability of phones, syllables and words, duration statistics, probabilities of pronunciation variants of words and probabilities of context information. Unlike other statistical speech resources BAStat is based solely on recordings of conversational German and therefore models spoken language not text. The resource consists of a bundle of 7-bit ASCII tables and matrices to maximize inter-operability between different operation systems and can be downloaded for free from the BAS web-site. This contribution gives a detailed description about the empirical basis, the contained data types, the format of the resulting statistical data, some interesting interpretations of grand figures and a brief comparison to the text-based statistical resource CELEX.

pdf bib
Syntactic Annotation Guidelines for the Quranic Arabic Dependency Treebank
Kais Dukes | Eric Atwell | Abdul-Baquee M. Sharaf

The Quranic Arabic Dependency Treebank (QADT) is part of the Quranic Arabic Corpus (, an online linguistic resource organized by the University of Leeds, and developed through online collaborative annotation. The website has become a popular study resource for Arabic and the Quran, and is now used by over 1,500 researchers and students daily. This paper presents the treebank, explains the choice of syntactic representation, and highlights key parts of the annotation guidelines. The text being analyzed is the Quran, the central religious book of Islam, written in classical Quranic Arabic (c. 600 CE). To date, all 77,430 words of the Quran have a manually verified morphological analysis, and syntactic analysis is in progress. 11,000 words of Quranic Arabic have been syntactically annotated as part of a gold standard treebank. Annotation guidelines are especially important to promote consistency for a corpus which is being developed through online collaboration, since often many people will participate from different backgrounds and with different levels of linguistic expertise. The treebank is available online for collaborative correction to improve accuracy, with suggestions reviewed by expert Arabic linguists, and compared against existing published books of Quranic Syntax.

pdf bib
Language Identification of Short Text Segments with N-gram Models
Tommi Vatanen | Jaakko J. Väyrynen | Sami Virpioja

There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5-21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking method by Cavnar and Trenkle (1994). For the n-gram models, we test several standard smoothing techniques, including the current state-of-the-art, the modified Kneser-Ney interpolation. Experiments are conducted with 281 languages using the Universal Declaration of Human Rights. Advanced language model smoothing techniques improve the identification accuracy and the respective classifiers outperform the ranking method. The higher accuracy is obtained at the cost of larger models and slower classification speed. However, there are several methods to reduce the size of an n-gram model, and our experiments with model pruning show that it provides an easy way to balance the size and the identification accuracy. We also compare the results to the language identifier in Google AJAX Language API, using a subset of 50 languages.

pdf bib
Bootstrapping Named Entity Extraction for the Creation of Mobile Services
Joseph Polifroni | Imre Kiss | Mark Adler

As users become more accustomed to using their mobile devices to organize and schedule their lives, there is more of a demand for applications that can make that process easier. Automatic speech recognition technology has already been developed to enable essentially unlimited vocabulary in a mobile setting. Understanding the words that are spoken is the next challenge. In this paper, we describe efforts to develop a dataset and classifier to recognize named entities in speech. Using sets of both real and simulated data, in conjunction with a very large set of real named entities, we created a challenging corpus of training and test data. We use these data to develop a classifier to identify names and locations on a word-by-word basis. In this paper, we describe the process of creating the data and determining a set of features to use for named entity recognition. We report on our classification performance on these data, as well as point to future work in improving all aspects of the system.

pdf bib
Improving Proper Name Recognition by Adding Automatically Learned Pronunciation Variants to the Lexicon
Bert Réveil | Jean-Pierre Martens | Henk van den Heuvel

This paper deals with the task of large vocabulary proper name recognition. In order to accomodate a wide diversity of possible name pronunciations (due to non-native name origins or speaker tongues) a multilingual acoustic model is combined with a lexicon comprising 3 grapheme-to-phoneme (G2P) transcriptions from G2P transcribers for 3 different languages) and up to 4 so-called phoneme-to-phoneme (P2P) transcriptions. The latter are generated with (speaker tongue, name source) specific P2P converters that try to transform a set of baseline name transcriptions into a pool of transcription variants that lie closer to the `true’ name pronunciations. The experimental results show that the generated P2P variants can be employed to improve name recognition, and that the obtained accuracy is comparable to what is achieved with typical (TY) transcriptions (made by a human expert). Furthermore, it is demonstrated that the P2P conversion can best be instantiated from a baseline transcription in the name source language, and that knowledge of the speaker tongue is an important input as well for the P2P transcription process.

pdf bib
Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text
Majdi Sawalha | Eric Atwell

Morphological analyzers and part-of-speech taggers are key technologies for most text analysis applications. Our aim is to develop a part-of-speech tagger for annotating a wide range of Arabic text formats, domains and genres including both vowelized and non-vowelized text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications. We foresee the advantage of enriching the text with part-of-speech tags of very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, but not specific needs of end-users, because end-user applications are not known in advance. In this paper we review existing Arabic Part-of-Speech Taggers and tag-sets, and illustrate four different Arabic PoS tag-sets for a sample of Arabic text from the Quran. We describe the detailed fine-grained morphological feature tag set of Arabic, and the fine-grained Arabic morphological analyzer algorithm. We faced practical challenges in applying the morphological analyzer to the 100-million-word Web Arabic Corpus: we had to port the software to the National Grid Service, adapt the analyser to cope with spelling variations and errors, and utilise a Broad-Coverage Lexical Resource combining 23 traditional Arabic lexicons. Finally we outline the construction of a Gold Standard for comparative evaluation.

pdf bib
The Architecture of FunGramKB
Carlos Periñán-Pascual | Francisco Arcas-Túnez

Natural language understanding systems require a knowledge base provided with conceptual representations reflecting the structure of human beings’ cognitive system. Although surface semantics can be sufficient in some other systems, the construction of a robust knowledge base guarantees its use in most natural language processing applications, consolidating thus the concept of resource reuse. In this scenario, FunGramKB is presented as a multipurpose knowledge base whose model has been particularly designed for natural language understanding tasks. The theoretical basement of this knowledge engineering project lies in the construction of two complementary types of interlingua: the conceptual logical structure, i.e. a lexically-driven interlingua which can predict linguistic phenomena according to the Role and Reference Grammar syntax-semantics interface, and the COREL scheme, i.e. a concept-oriented interlingua on which our rule-based reasoning engine is able to make inferences effectively. The objective of the paper is to describe the different conceptual, lexical and grammatical modules which make up the architecture of FunGramKB, together with an exploratory outline on how to exploit such a knowledge base within an NLP system.

pdf bib
WTIMIT: The TIMIT Speech Corpus Transmitted Over The 3G AMR Wideband Mobile Network
Patrick Bauer | David Scheler | Tim Fingscheidt

In anticipation of upcoming mobile telephony services with higher speech quality, a wideband (50 Hz to 7 kHz) mobile telephony derivative of TIMIT has been recorded called WTIMIT. It opens up various scientific investigations; e.g., on speech quality and intelligibility, as well as on wideband upgrades of network-side interactive voice response (IVR) systems with retrained or bandwidth-extended acoustic models for automatic speech recognition (ASR). Wideband telephony could enable network-side speech recognition applications such as remote dictation or spelling without the need of distributed speech recognition techniques. The WTIMIT corpus was transmitted via two prepared Nokia 6220 mobile phones over T-Mobile's 3G wideband mobile network in The Hague, The Netherlands, employing the Adaptive Multirate Wideband (AMR-WB) speech codec. The paper presents observations of transmission effects and phoneme recognition experiments. It turns out that in the case of wideband telephony, server-side ASR should not be carried out by simply decimating received signals to 8 kHz and applying existent narrowband acoustic models. Nor do we recommend just simulating the AMR-WB codec for training of wideband acoustic models. Instead, real-world wideband telephony channel data (such as WTIMIT) provides the best training material for wideband IVR systems.

pdf bib
Towards an Improved Methodology for Automated Readability Prediction
Philip van Oosten | Dries Tanghe | Véronique Hoste

Since the first half of the 20th century, readability formulas have been widely employed to automatically predict the readability of an unseen text. In this article, the formulas and the text characteristics they are composed of are evaluated in the context of large Dutch and English corpora. We describe the behaviour of the formulas and the text characteristics by means of correlation matrices and a principal component analysis, and test the methodological validity of the formulas by means of collinearity tests. Both the correlation matrices and the principal component analysis show that the formulas described in this paper strongly correspond, regardless of the language for which they were designed. Furthermore, the collinearity test reveals shortcomings in the methodology that was used to create some of the existing readability formulas. All of this leads us to conclude that a new readability prediction method is needed. We finally make suggestions to come to a cleaner methodology and present web applications that will help us collect data to compile a new gold standard for readability prediction.

pdf bib
Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic
Majdi Sawalha | Eric Atwell

Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons were constructed; these lexicons are different in ordering, size and aim or goal of construction. We collected 23 machine-readable lexicons, which are freely available on the web. We combined lexical resources into one large broad-coverage lexical resource by extracting information from disparate formats and merging traditional Arabic lexicons. To evaluate the broad-coverage lexical resource we computed coverage over the Qur’an, the Corpus of Contemporary Arabic, and a sample from the Arabic Web Corpus, using two methods. Counting exact word matches between test corpora and lexicon scored about 65-68%; Arabic has a rich morphology with many combinations of roots, affixes and clitics, so about a third of words in the corpora did not have an exact match in the lexicon. The second approach is to compute coverage in terms of use in a lemmatizer program, which strips clitics to look for a match for the underlying lexeme; this scored about 82-85%.

pdf bib
The AVLaughterCycle Database
Jérôme Urbain | Elisabetta Bevacqua | Thierry Dutoit | Alexis Moinet | Radoslaw Niewiadomski | Catherine Pelachaud | Benjamin Picart | Joëlle Tilmanne | Johannes Wagner

This paper presents the large audiovisual laughter database recorded as part of the AVLaughterCycle project held during the eNTERFACE’09 Workshop in Genova. 24 subjects participated. The freely available database includes audio signal and video recordings as well as facial motion tracking, thanks to markers placed on the subjects’ face. Annotations of the recordings, focusing on laughter description, are also provided and exhibited in this paper. In total, the corpus contains more than 1000 spontaneous laughs and 27 acted laughs. The laughter utterances are highly variable: the laughter duration ranges from 250ms to 82s and the sounds cover voiced vowels, breath-like expirations, hum-, hiccup- or grunt-like sounds, etc. However, as the subjects had no one to interact with, the database contains very few speech-laughs. Acted laughs tend to be longer than spontaneous ones and are more often composed of voiced vowels. The database can be useful for automatic laughter processing or cognitive science works. For the AVLaughterCycle project, it has served to animate a laughing virtual agent with an output laugh linked to the conversational partner’s input laugh.

pdf bib
Towards a Large Parallel Corpus of Cleft Constructions
Gerlof Bouma | Lilja Øvrelid | Jonas Kuhn

We present our efforts to create a large-scale, semi-automatically annotated parallel corpus of cleft constructions. The corpus is intended to reduce or make more effective the manual task of finding examples of clefts in a corpus. The corpus is being developed in the context of the Collaborative Research Centre SFB 632, which is a large, interdisciplinary research initiative to study information structure, at the University of Potsdam and the Humboldt University in Berlin. The corpus is based on the Europarl corpus (version 3). We show how state-of-the-art NLP tools, like POS taggers and statistical dependency parsers, may facilitate powerful and precise searches. We argue that identifying clefts using automatically added syntactic structure annotation is ultimately to be preferred over using lower level, though more robust, extraction methods like regular expression matching. An evaluation of the extraction method for one of the languages also offers some support for this method. We end the paper by discussing the resulting corpus itself. We present some examples of interesting clefts and translational counterparts from the corpus and suggest ways of exploiting our newly created resource in the cross-linguistic study of clefts.

pdf bib
A Random Graph Walk based Approach to Computing Semantic Relatedness Using Knowledge from Wikipedia
Ziqi Zhang | Anna Lisa Gentile | Lei Xia | José Iria | Sam Chapman

Determining semantic relatedness between words or concepts is a fundamental process to many Natural Language Processing applications. Approaches for this task typically make use of knowledge resources such as WordNet and Wikipedia. However, these approaches only make use of limited number of features extracted from these resources, without investigating the usefulness of combining various different features and their importance in the task of semantic relatedness. In this paper, we propose a random walk model based approach to measuring semantic relatedness between words or concepts, which seamlessly integrates various features extracted from Wikipedia to compute semantic relatedness. We empirically study the usefulness of these features in the task, and prove that by combining multiple features that are weighed according to their importance, our system obtains competitive results, and outperforms other systems on some datasets.

pdf bib
Bilingual Lexicon Induction: Effortless Evaluation of Word Alignment Tools and Production of Resources for Improbable Language Pairs
Adrien Lardilleux | Julien Gosme | Yves Lepage

In this paper, we present a simple protocol to evaluate word aligners on bilingual lexicon induction tasks from parallel corpora. Rather than resorting to gold standards, it relies on a comparison of the outputs of word aligners against a reference bilingual lexicon. The quality of this reference bilingual lexicon does not need to be particularly high, because evaluation quality is ensured by systematically filtering this reference lexicon with the parallel corpus the word aligners are trained on. We perform a comparison of three freely available word aligners on numerous language pairs from the Bible parallel corpus (Resnik et al., 1999): MGIZA++ (Gao and Vogel, 2008), BerkeleyAligner (Liang et al., 2006), and Anymalign (Lardilleux and Lepage, 2009). We then select the most appropriate one to produce bilingual lexicons for all language pairs of this corpus. These involve Cebuano, Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swedish, and Vietnamese. The 66 resulting lexicons are made freely available.

pdf bib
Evaluating Repetitions, or how to Improve your Multilingual ASR System by doing Nothing
Marijn Schraagen | Gerrit Bloothooft

Repetition is a common concept in human communication. This paper investigates possible benefits of repetition for automatic speech recognition under controlled conditions. Testing is performed on the newly created Autonomata TOO speech corpus, consisting of multilingual names for Points-Of-Interest as spoken by both native and non-native speakers. During corpus recording, ASR was being performed under baseline conditions using a Nuance Vocon 3200 system. On failed recognition, additional attempts for the same utterances were added to the corpus. Substantial improvements in recognition results are shown for all categories of speakers and utterances, even if speakers did not noticeably alter their previously misrecognized pronunciation. A categorization is proposed for various types of differences between utterance realisations. The number of attempts, the pronunciation of an utterance over multiple attempts compared to both previous attempts and reference pronunciation is analyzed for difference type and frequency. Variables such as the native language of the speaker and the languages in the lexicon are taken into account. Possible implications for ASR research are discussed.

pdf bib
Exploring the Relationship between Semantic Spaces and Semantic Relations
Akira Utsumi

This study examines the relationship between two kinds of semantic spaces ― i.e., spaces based on term frequency (tf) and word cooccurrence frequency (co) ― and four semantic relations ― i.e., synonymy, coordination, superordination, and collocation ― by comparing, for each semantic relation, the performance of two semantic spaces in predicting word association. The simulation experiment demonstrates that the tf-based spaces perform better in predicting word association based on the syntagmatic relation (i.e., superordination and collocation), while the co-based semantic spaces are suited for predicting word association based on the paradigmatic relation (i.e., synonymy and coordination). In addition, the co-based space with a larger context size yields better performance for the syntagmatic relation, while the co-based space with a smaller context size tends to show better performance for the paradigmatic relation. These results indicate that different semantic spaces can be used depending on what kind of semantic relatedness should be computed.

pdf bib
Example-Based Automatic Phonetic Transcription
Christina Leitner | Martin Schickbichler | Stefan Petrik

Current state-of-the-art systems for automatic phonetic transcription (APT) are mostly phone recognizers based on Hidden Markov models (HMMs). We present a different approach for APT especially designed for transcription with a large inventory of phonetic symbols. In contrast to most systems which are model-based, our approach is non-parametric using techniques derived from concatenative speech synthesis and template-based speech recognition. This example-based approach not only produces draft transcriptions that just need to be corrected instead of created from scratch but also provides a validation mechanism for ensuring consistency within the corpus. Implementations of this transcription framework are available as standalone Java software and extension to the ELAN linguistic annotation software. The transcription system was tested with audio files and reference transcriptions from the Austrian Pronunciation Database (ADABA) and compared to an HMM-based system trained on the same data set. The example-based and the HMM-based system achieve comparable phone recognition rates. A combination of rule-based and example-based APT in a constrained phone recognition scenario returned the best results.

pdf bib
MACAQ : A Multi Annotated Corpus to Study how we Adapt Answers to Various Questions
Anne Garcia-Fernandez | Sophie Rosset | Anne Vilnat

This paper presents a corpus of human answers in natural language collected in order to build a base of examples useful when generating natural language answers. We present the corpus and the way we acquired it. Answers correspond to questions with fixed linguistic form, focus, and topic. Answers to a given question exist for two modalities of interaction: oral and written. The whole corpus of answers was annotated manually and automatically on different levels including words from the questions being reused in the answer, the precise element answering the question (or information-answer), and completions. A detailed description of the annotations is presented. Two examples of corpus analyses are described. The first analysis shows some differences between oral and written modality especially in terms of length of the answers. The second analysis concerns the reuse of the question focus in the answers.

pdf bib
Evaluation of HMM-based Models for the Annotation of Unsegmented Dialogue Turns
Carlos-D. Martínez-Hinarejos | Vicent Tamarit | José-M. Benedí

Corpus-based dialogue systems rely on statistical models, whose parameters are inferred from annotated dialogues. The dialogues are usually annotated in terms of Dialogue Acts (DA), and the manual annotation is difficult (as annotation rule are hard to define), error-prone and time-consuming. Therefore, several semi-automatic annotation processes have been proposed to speed-up the process and consequently obtain a dialogue system in less total time. These processes are usually based on statistical models. The standard statistical annotation model is based on Hidden Markov Models (HMM). In this work, we explore the impact of different types of HMM, with different number of states, on annotation accuracy. We performed experiments using these models on two dialogue corpora (Dihana and SwitchBoard) of dissimilar features. The results show that some types of models improve standard HMM in a human-computer task-oriented dialogue corpus (Dihana corpus), but their impact is lower in a human-human non-task-oriented dialogue corpus (SwitchBoard corpus).

pdf bib
Meta-Knowledge Annotation of Bio-Events
Raheel Nawaz | Paul Thompson | John McNaught | Sophia Ananiadou

Biomedical corpora annotated with event-level information provide an important resource for the training of domain-specific information extraction (IE) systems. These corpora concentrate primarily on creating classified, structured representations of important facts and findings contained within the text. However, bio-event annotations often do not take into account additional information (meta-knowledge) that is expressed within the textual context of the bio-event, e.g., the pragmatic/rhetorical intent and the level of certainty ascribed to a particular bio-event by the authors. Such additional information is indispensible for correct interpretation of bio-events. Therefore, an IE system that simply presents a list of “bare” bio-events, without information concerning their interpretation, is of little practical use. We have addressed this sparseness of meta-knowledge available in existing bio-event corpora by developing a multi-dimensional annotation scheme tailored to bio-events. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed about different bio-events. To our knowledge, our scheme is unique within the field with regards to the diversity of meta-knowledge aspects annotated for each event.

pdf bib
U-Compare: An Integrated Language Resource Evaluation Platform Including a Comprehensive UIMA Resource Library
Yoshinobu Kano | Ruben Dorado | Luke McCrohon | Sophia Ananiadou | Jun’ichi Tsujii

Language resources, including corpus and tools, are normally required to be combined in order to achieve a user’s specific task. However, resources tend to be developed independently in different, incompatible formats. In this paper we describe about U-Compare, which consists of the U-Compare component repository and the U-Compare platform. We have been building a highly interoperable resource library, providing the world largest ready-to-use UIMA component repository including wide variety of corpus readers and state-of-the-art language tools. These resources can be deployed as local services or web services, even possible to be hosted in clustered machines to increase the performance, while users do not need to be aware of such differences. In addition to the resource library, an integrated language processing platform is provided, allowing workflow creation, comparison, evaluation and visualization, using the resources in the library or any UIMA component, without any programming via graphical user interfaces, while a command line launcher is also available without GUIs. The evaluation itself is processed in a UIMA component, users can create and plug their own evaluation metrics in addition to the predefined metrics. U-Compare has been successfully used in many projects including BioCreative, Conll and the BioNLP shared task.

pdf bib
Enhancing Language Resources with Maps
Janne Bondi Johannessen | Kristin Hagen | Anders Nøklestad | Joel Priestley

We will look at how maps can be integrated in research resources, such as language databases and language corpora. By using maps, search results can be illustrated in a way that immediately gives the user information that words or numbers on their own would not give. We will illustrate with two different resources, into which we have now added a Google Maps application: The Nordic Dialect Corpus (Johannessen et al. 2009) and The Nordic Syntactic Judgments Database (Lindstad et al. 2009). We have integrated Google Maps into these applications. The database contains some hundred syntactic test sentences that have been evaluated by four speakers in more than hundred locations in Norway and Sweden. Searching for the evaluations of a particular sentence gives a list of several hundred judgments, which are difficult for a human researcher to assess. With the map option, isoglosses are immediately visible. We show in the paper that both with the maps depicting corpus hits and with the maps depicting database results, the map visualizations actually show clear geographical differences that would be very difficult to spot just by reading concordance lines or database tables.

pdf bib
Building a Textual Entailment Suite for the Evaluation of Automatic Content Scoring Technologies
Jana Z. Sukkarieh | Eleanor Bolge

Automatic content scoring for free-text responses has started to emerge as an application of Natural Language Processing in its own right, much like question answering or machine translation. The task, in general, is reduced to comparing a student’s answer to a model answer. Although a considerable amount of work has been done, common benchmarks and evaluation measures for this application do not currently exist. It is yet impossible to perform a comparative evaluation or progress tracking of this application across systems ― an application that we view as a textual entailment task. This paper concentrates on introducing an Educational Testing Service-built test suite that makes a step towards establishing such a benchmark. The suite can be used as regression and performance evaluations both intra-c-rater® or inter automatic content scoring technologies. It is important to note that existing textual entailment test suites like PASCAL RTE or FraCas, though beneficial, are not suitable for our purposes since we deal with atypical naturally-occurring student responses that need to be categorized in order to serve as regression test cases.

pdf bib
Evaluation of Textual Knowledge Acquisition Tools: a Challenging Task
Haïfa Zargayouna | Adeline Nazarenko

A large effort has been devoted to the development of textual knowledge acquisition (KA) tools, but it is still difficult to assess the progress that has been made. The results produced by these tools are difficult to compare, due to the heterogeneity of the proposed methods and of their goals. Various experiments have been made to evaluate terminological and ontological tools. They show that in terminology as well as in ontology acquisition, it remains difficult to compare existing tools and to analyse their advantages and drawbacks. From our own experiments in evaluating terminology and ontology acquisition tools, it appeared that the difficulties and solutions are similar for both tasks. We propose a unified approach for the evaluation of textual KA tools that can be instantiated in different ways for various tasks. The main originality of this approach lies in the way it takes into account the subjectivity of evaluation and the relativity of gold standards. In this paper, we highlight the major difficulties of KA evaluation, we then present a unified proposal for the evaluation of terminologies and ontologies acquisition tools and the associated experiments. The proposed protocols take into consideration the specificity of this type of evaluation.

pdf bib
Providing Multilingual, Multimodal Answers to Lexical Database Queries
Gerard de Melo | Gerhard Weikum

Language users are increasingly turning to electronic resources to address their lexical information needs, due to their convenience and their ability to simultaneously capture different facets of lexical knowledge in a single interface. In this paper, we discuss techniques to respond to a user's lexical queries by providing multilingual and multimodal information, and facilitating navigating along different types of links. To this end, structured information from sources like WordNet, Wikipedia, Wiktionary, as well as Web services is linked and integrated to provide a multi-faceted yet consistent response to user queries. The meanings of words in many different languages are characterized by mapping them to appropriate WordNet sense identifiers and adding multilingual gloss descriptions as well as example sentences. Relationships are derived from WordNet and Wiktionary to allow users to discover semantically related words, etymologically related words, alternative spellings, as well as misspellings. Last but not least, images, audio recordings, and geographical maps extracted from Wikipedia and Wiktionary allow for a multimodal experience.

pdf bib
Building an Italian FrameNet through Semi-automatic Corpus Analysis
Alessandro Lenci | Martina Johnson | Gabriella Lapesa

n this paper, we outline the methodology we adopted to develop a FrameNet for Italian. The main element of novelty with respect to the original FrameNet is represented by the fact that the creation and annotation of Lexical Units is strictly grounded in distributional information (statistical distribution of verbal subcategorization frames, lexical and semantic preferences of each frame) automatically acquired from a large, dependency-parsed corpus. We claim that this approach allows us to overcome some of the shortcomings of the classical lexicographic method used to create FrameNet, by complementing the accuracy of manual annotation with the robustness of data on the global distributional patterns of a verb. In the paper, we describe our method for extracting distributional data from the corpus and the way we used it for the encoding and annotation of LUs. The long-term goal of our project is to create an electronic lexicon for Italian similar to the original English FrameNet. For the moment, we have developed a database of syntactic valences that will be made freely accessible via a web interface. This represents an autonomous resource besides the FrameNet lexicon, of which we have a beginning nucleus consisting of 791 annotated sentences.

pdf bib
Spontal-N: A Corpus of Interactional Spoken Norwegian
Rein Ove Sikveland | Anton Öttl | Ingunn Amdal | Mirjam Ernestus | Torbjørn Svendsen | Jens Edlund

Spontal-N is a corpus of spontaneous, interactional Norwegian. To our knowledge, it is the first corpus of Norwegian in which the majority of speakers have spent significant parts of their lives in Sweden, and in which the recorded speech displays varying degrees of interference from Swedish. The corpus consists of studio quality audio- and video-recordings of four 30-minute free conversations between acquaintances, and a manual orthographic transcription of the entire material. On basis of the orthographic transcriptions, we automatically annotated approximately 50 percent of the material on the phoneme level, by means of a forced alignment between the acoustic signal and pronunciations listed in a dictionary. Approximately seven percent of the automatic transcription was manually corrected. Taking the manual correction as a gold standard, we evaluated several sources of pronunciation variants for the automatic transcription. Spontal-N is intended as a general purpose speech resource that is also suitable for investigating phonetic detail.

pdf bib
Bulgarian National Corpus Project
Svetla Koeva | Diana Blagoeva | Siya Kolkovska

The paper presents Bulgarian National Corpus project (BulNC) - a large-scale, representative, online available corpus of Bulgarian. The BulNC is also a monolingual general corpus, fully morpho-syntactically (and partially semantically) annotated, and manually provided with detailed meta-data descriptions. Presently the Bulgarian National corpus consists of about 320 000 000 graphical words and includes more than 10 000 samples. Briefly the corpus structure and the accepted criteria for representativeness and well-balancing are presented. The query language for advance search of collocations and concordances is demonstrated with some examples - it allows to retrieve word combinations, ordered queries, inflexionally and semantically related words, part-of-speech tags, utilising Boolean operations and grouping as well. The BulNC already plays a significant role in natural language processing of Bulgarian contributing to scientific advances in spelling and grammar checking, word sense disambiguation, speech recognition, text categorisation, topic extraction and machine translation. The BulNC can also be used in different investigations going beyond the linguistics: library studies, social sciences research, teaching methods studies, etc.

pdf bib
Composing Human and Machine Translation Services: Language Grid for Improving Localization Processes
Donghui Lin | Yoshiaki Murakami | Toru Ishida | Yohei Murakami | Masahiro Tanaka

With the development of the Internet environments, more and more language services become accessible for common people. However, the gap between human translators and machine translators remains huge especially for the domain of localization processes that requires high translation quality. Although efforts of combining human and machine translators for supporting multilingual communication have been reported in previous research, how to apply such approaches for improving localization processes are rarely discussed. In this paper, we aim at improving localization processes by composing human and machine translation services based on the Language Grid, which is a language service platform that we have developed. Further, we conduct experiments to compare the translation quality and translation cost using several translation processes, including absolute machine translation processes, absolute human translation processes and translation processes by human and machine translation services. The experiment results show that composing monolingual roles and dictionary services improves the translation quality of machine translators, and that collaboration of human and machine translators is possible to reduce the cost comparing with the absolute bilingual human translation. We also discuss the generality of the experimental results and further challenging issues of the proposed localization processes.

pdf bib
The Development of a Morphosyntactic Tagset for Afrikaans and its Use with Statistical Tagging
Boris Haselbach | Ulrich Heid

In this paper, we present a morphosyntactic tagset for Afrikaans based on the guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES). We compare our slim yet expressive tagset, MAATS (Morphosyntactic AfrikAans TagSet), with an existing one which primarily focuses on a detailed morphosyntactic and semantic description of word forms. MAATS will primarily be used for the extraction of lexical data from large pos-tagged corpora. We not only focus on morphosyntactic properties but also on the processability with statistical tagging. We discuss the tagset design and motivate our classification of Afrikaans word forms, in particular we focus on the categorization of verbs and conjunctions. The complete tagset in presented and we briefly discuss each word class. In a case study with an Afrikaans newspaper corpus, we evaluate our tagset with four different statistical taggers. Despite a relatively small amount of training data, however with a large tagger lexicon, TnT-Tagger scores 97.05 % accuracy. Additionally, we present some error sources and discuss future work.

pdf bib
Emotion Cause Events: Corpus Construction and Analysis
Sophia Yat Mei Lee | Ying Chen | Shoushan Li | Chu-Ren Huang

Emotion processing has always been a great challenge. Given the fact that an emotion is triggered by cause events and that cause events are an integral part of emotion, this paper constructs a Chinese emotion cause corpus as a first step towards automatic inference of cause-emotion correlation. The corpus focuses on five primary emotions, namely happiness, sadness, fear, anger, and surprise. It is annotated with emotion cause events based on our proposed annotation scheme. Corpus data shows that most emotions are expressed with causes, and that causes mostly occur before the corresponding emotion verbs. We also examine the correlations between emotions and cause events in terms of linguistic cues: causative verbs, perception verbs, epistemic markers, conjunctions, prepositions, and others. Results show that each group of linguistic cues serves as an indicator marking the cause events in different structures of emotional constructions. We believe that the emotion cause corpus will be the useful resource for automatic emotion cause detection as well as emotion detection and classification.

pdf bib
The DAD Parallel Corpora and their Uses
Costanza Navarretta

This paper deals with the uses of the annotations of third person singular neuter pronouns in the DAD parallel and comparable corpora of Danish and Italian texts and spoken data. The annotations contain information about the functions of these pronouns and their uses as abstract anaphora. Abstract anaphora have constructions such as verbal phrases, clauses and discourse segments as antecedents and refer to abstract objects comprising events, situations and propositions. The analysis of the annotated data shows the language specific characteristics of abstract anaphora in the two languages compared with the uses of abstract anaphora in English. Finally, the paper presents machine learning experiments run on the annotated data in order to identify the functions of third person singular neuter personal pronouns and neuter demonstrative pronouns. The results of these experiments vary from corpus to corpus. However, they are all comparable with the results obtained in similar tasks in other languages. This is very promising because the experiments have been run on both written and spoken data using a classification of the pronominal functions which is much more fine-grained than the classifications used in other studies.

pdf bib
LIPS: A Tool for Predicting the Lexical Isolation Point of a Word
Andrew Thwaites | Jeroen Geertzen | William D. Marslen-Wilson | Paula Buttery

We present LIPS (Lexical Isolation Point Software), a tool for accurate lexical isolation point (IP) prediction in recordings of speech. The IP is the point in time in which a word is correctly recognised given the acoustic evidence available to the hearer. The ability to accurately determine lexical IPs is of importance to work in the field of cognitive processing, since it enables the evaluation of competing models of word recognition. IPs are also of importance in the field of neurolinguistics, where the analyses of high-temporal-resolution neuroimaging data require a precise time alignment of the observed brain activity with the linguistic input. LIPS provides an attractive alternative to costly multi-participant perception experiments by automatically computing IPs for arbitrary words. On a test set of words, the LIPS system predicts IPs with a mean difference from the actual IP of within 1ms. The difference from the predicted and actual IP approximate to a normal distribution with a standard deviation of around 80ms (depending on the model used).

pdf bib
The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals
Caroline Williams | Andrew Thwaites | Paula Buttery | Jeroen Geertzen | Billi Randall | Meredith Shafto | Barry Devereux | Lorraine Tyler

Investigating differences in linguistic usage between individuals who have suffered brain injury (hereafter patients) and those who haven’t can yield a number of benefits. It provides a better understanding about the precise way in which impairments affect patients’ language, improves theories of how the brain processes language, and offers heuristics for diagnosing certain types of brain damage based on patients’ speech. One method for investigating usage differences involves the analysis of spontaneous speech. In the work described here we construct a text corpus consisting of transcripts of individuals’ speech produced during two tasks: the Boston-cookie-theft picture description task (Goodglass and Kaplan, 1983) and a spontaneous speech task, which elicits a semi-prompted monologue, and/or free speech. Interviews with patients from 19yrs to 89yrs were transcribed, as were interviews with a comparable number of healthy individuals (20yrs to 89yrs). Structural brain images are available for approximately 30% of participants. This unique data source provides a rich resource for future research in many areas of language impairment and has been constructed to facilitate analysis with natural language processing and corpus linguistics techniques.

pdf bib
The VeteranTapes: Research Corpus, Fragment Processing Tool, and Enhanced Publications for the e-Humanities
Henk van den Heuvel | René van Horik | Stef Scagliola | Eric Sanders | Paula Witkamp

Enhanced Publications are a new way to publish scientific and other results in an electronic article. The advantage of EPs is that the relation between the article and the underlying data facilitate the peer review process and other quality assessment activities. Due to the link between de publication and the research data the publication can be much richer than a paper edition permits. We present an example of EPs in which links are made to interview fragments that include transcripts, audio segments, annotations and metadata. EPs call for a new paradigm of research methodology in which digital persistent access to research data are a central issue. In this contribution we highlight 1. The research data as it is archived and curated, 2. the concept ""enhanced publication"" and its scientific value, 3. the ""fragment fitter tool"", a language processing tool to facilitate the creation of EPs, 4. IPR issues related to the re-use of the interview data.

pdf bib
Context Fusion: The Role of Discourse Structure and Centering Theory
Raffaella Bernardi | Manuel Kirschner | Zorana Ratkovic

Questions are not asked in isolation. Their context, viz. the preceding interactions, might be of help to understand them and retrieve the correct answer. Previous research in Interactive Question Answering showed that context fusion has a big potential to improve the performance of answer retrieval. In this paper, we study how much context, and what elements of it, should be considered to answer Follow-Up Questions (FU Qs). Following previous research, we exploit Logistic Regression Models to learn aspects of dialogue structure relevant to answering FU Qs. We enrich existing models based on shallow features with deep features, relying on the theory of discourse structure of (Chai and Jin, 2004), and on Centering Theory, respectively. Using models trained on realistic IQA data, we show which of the various theoretically motivated features hold up against empirical evidence. We also show that, while these deep features do not outperform the shallow ones on their own, an IQA system's answer correctness increases if the shallow and deep features are combined.

pdf bib
Homographic Ideogram Understanding Using Contextual Dynamic Network
Jun Okamoto | Shun Ishizaki

Conventional methods for disambiguation problems have been using statistical methods with co-occurrence of words in their contexts. It seems that human-beings assign appropriate word senses to the given ambiguous word in the sentence depending on the words which followed the ambiguous word when they could not disambiguate by using the previous contextual information. In this research, Contextual Dynamic Network Model is developed using the Associative Concept Dictionary which includes semantic relations among concepts/words and the relations can be represented with quantitative distances among them. In this model, an interactive activation method is used to identify a word’s meaning on the Contextual Semantic Network where the activation values on the network are calculated using the distances. The proposed method constructs dynamically the Contextual Semantic Network according to the input words sequentially that appear in the sentence including an ambiguous word. Therefore, in this research, after the model calculates the activation values, if there is little difference between the activation values, it reconstructs the network depending on the next words in input sentence. The evaluation of proposed method showed that the accuracy rates are high when Contextual Semantic Network has high density whose node are extended using around the ambiguous word.

pdf bib
Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)
Cheikh M. Bamba Dione | Jonas Kuhn | Sina Zarrieß

In this paper, we report on the design of a part-of-speech-tagset for Wolof and on the creation of a semi-automatically annotated gold standard. In order to achieve high-quality annotation relatively fast, we first generated an accurate lexicon that draws on existing word and name lists and takes into account inflectional and derivational morphology. The main motivation for the tagged corpus is to obtain data for training automatic taggers with machine learning approaches. Hence, we took machine learning considerations into account during tagset design and we present training experiments as part of this paper. The best automatic tagger achieves an accuracy of 95.2% in cross-validation experiments. We also wanted to create a basis for experimenting with annotation projection techniques, which exploit parallel corpora. For this reason, it was useful to use a part of the Bible as the gold standard corpus, for which sentence-aligned parallel versions in many languages are easy to obtain. We also report on preliminary experiments exploiting a statistical word alignment of the parallel text.

pdf bib
Descriptive Analysis of Negation Cues in Biomedical Texts
Roser Morante

In this paper we present a description of negation cues and their scope in biomedical texts, based on the cues that occur in the BioScope corpus. We provide information about the morphological type of the cue, the characteristics of the scope in relation to the morpho-syntactic features of the cue and of the clause, and the ambiguity level of the cue by describing in which cases certain negation cues do not express negation. Additionally, we provide positive and negative examples per cue from the BioScope corpus. We show that the scope depends mostly on the part-of-speech of the cue and on the syntactic features of the clause. Although several studies have focused on processing negation in biomedical texts, we are not aware of publicly available resources that describe the scope of negation cues in detail. This paper aims at providing information for producing guidelines to annotate corpora with a negation layer, and for building resources that find the scope of negation cues automatically.

pdf bib
PDTB XML: the XMLization of the Penn Discourse TreeBank 2.0
Xuchen Yao | Irina Borisova | Mehwish Alam

The current study presents a conversion and unification of the Penn Discourse TreeBank 2.0 (PDTB) and the Penn TreeBank (PTB) under XML format. The main goal of the PDTB XML is to create a tool for efficient and broad querying of the syntax and discourse information simultaneously. The key stages of the project are developing proper cross-references between different data types and their representation in the modified TIGER-XML format, and then writing the required declarative languages (XML Schema). PTB XML is compatible with TIGER-XML format. The PDTB XML is developed as a unified format for the convenience of XQuery users; it integrates discourse relations and XML structures into one unified hierarchy and builds the cross references between the syntactic trees and the discourse relations. The syntactic and discourse elements are assigned with unique IDs in order to build cross-references between them. The converted corpus allows for a simultaneous search for syntactically specified discourse information based on the XQuery standard, which is illustrated with a simple example in the article.

pdf bib
Domain-related Annotation of Polish Spoken Dialogue Corpus LUNA.PL
Agnieszka Mykowiecka | Katarzyna Głowińska | Joanna Rabiega-Wiśniewska

The paper presents a corpus of Polish spoken dialogues annotated on several levels, from transcription of dialogues and their morphosyntactic analysis, to semantic annotation. The LUNA.PL corpus is the first semantically annotated corpus of Polish spontaneous speech. It contains 500 dialogues recorded at the Warsaw Transport Authority call centre. For each dialogue, the corpus contains recorded audio signal, its transcription and five XML files with annotations on subsequent levels. Speech transcription was done manually. Text annotation was constructed using a combination of rule based programmes and computer-aided manual work. For morphological annotation we used the already existing analyzer and manually disambiguated the results. Morphologically annotated texts of dialogues were automatically segmented into elementary syntactic chunks. Semantic annotation was done by a set of specially designed rules and then manually corrected. The paper describes details of the domain related semantic annotation which consists of two levels - concept level at which around 200 attributes and their values are annotated, and predicate level at which 47 frame types are recognized. We describe the domain model accepted, and the statistics over the entire annotated set of dialogues.

pdf bib
A New Approach to Pseudoword Generation
Lubomir Otrusina | Pavel Smrz

Sense-tagged corpora are used to evaluate word sense disambiguation (WSD) systems. Manual creation of such resources is often prohibitively expensive. That is why the concept of pseudowords - conflations of two or more unambiguous words - has been integrated into WSD evaluation experiments. This paper presents a new method of pseudoword generation which takes into account semantic-relatedness of the candidate words forming parts of the pseudowords to the particular senses of the word to be disambiguated. We compare the new approach to its alternatives and show that the results on pseudowords, that are more similar to real ambiguous words, better correspond to the actual results. Two techniques assessing the similarity are studied - the first one takes advantage of manually created dictionaries (wordnets), the second one builds on the automatically computed statistical data obtained from large corpora. Pros and cons of the two techniques are discussed and the results on a standard task are demonstrated.

pdf bib
NLP Resources for the Analysis of Patient/Therapist Interviews
Horacio Saggion | Elena Stein-Sparvieri | David Maldavsky | Sandra Szasz

We present a set of tools and resources for the analysis of interviews during psychotherapy sessions. One of the main components of the work is a dictionary-based text interpretation tool for the Spanish language. The tool is designed to identify a subset of Freudian drives in patient and therapist discourse.

pdf bib
Medefaidrin: Resources Documenting the Birth and Death Language Life-cycle
Dafydd Gibbon | Moses Ekpenyong | Eno-Abasi Urua

Language resources are typically defined and created for application in speech technology contexts, but the documentation of languages which are unlikely ever to be provided with enabling technologies nevertheless plays an important role in defining the heritage of a speech community and in the provision of basic insights into the language oriented components of human cognition. This is particularly true of endangered languages. The present case study concerns the documentation both of the birth and of the endangerment within a rather short space of time of a ‘spirit language’, Medefaidrin, created and used as a vehicular language by a religious community in South-Eastern Nigeria. The documentation shows phonological, orthographic, morphological, syntactic and textual typological features of Medefaidrin which indicate that typological properties of English were a model for the creation of the language, rather than typological properties of the enclaving language, Ibibio. The documentation is designed as part of the West African Language Archive (WALA), following OLAC metadata standards.

pdf bib
A Person-Name Filter for Automatic Compilation of Bilingual Person-Name Lexicons
Satoshi Sato | Sayoko Kaide

This paper proposes a simple and fast person-name filter, which plays an important role in automatic compilation of a large bilingual person-name lexicon. This filter is based on pn_score, which is the sum of two component scores, the score of the first name and that of the last name. Each score is calculated from two term sets: one is a dense set in which most of the members are person names; another is a baseline set that contains less person names. The pn_score takes one of five values, {+2, +1, 0, -1, -2}, which correspond to strong positive, positive, undecidable, negative, and strong negative, respectively. This pn_score can be easily extended to bilingual pn_score that takes one of nine values, by summing scores of two languages. Experimental results show that our method works well for monolingual person names in English and Japanese; the F-score of each language is 0.929 and 0.939, respectively. The performance of the bilingual person-name filter is better; the F-score is 0.955.

pdf bib
Mining the Web for the Induction of a Dialectical Arabic Lexicon
Rania Al-Sabbagh | Roxana Girju

This paper describes the first phase of building a lexicon of Egyptian Cairene Arabic (ECA) ― one of the most widely understood dialects in the Arab World ― and Modern Standard Arabic (MSA). Each ECA entry is mapped to its MSA synonym, Part-of-Speech (POS) tag and top-ranked contexts based on Web queries; and thus each entry is provided with basic syntactic and semantic information for a generic lexicon compatible with multiple NLP applications. Moreover, through their MSA synonyms, ECA entries acquire access to MSA available NLP tools and resources which are considerably available. Using an associationist approach based on the correlations between word co-occurrence patterns in both dialects, we change the direction of the acquisition process from parallel to circular to overcome a bottleneck of current research on Arabic dialects, namely the lack of parallel corpora, and to alleviate accuracy rates for using unrelated Web documents which are more frequently available. Manually evaluated for 1,000 word entries by two native speakers of the ECA-MSA varieties, the proposed approach achieves a promising F-measured performance rate of 70.9%. In discussion to the proposed algorithm, different semantic issues are highlighted for upcoming phases of the induction of a more comprehensive ECA-MSA lexicon.

pdf bib
Efficient Spoken Dialogue Domain Representation and Interpretation
Tobias Heinroth | Dan Denich | Alexander Schmitt | Wolfgang Minker

We provide a detailed look on the functioning of the OwlSpeak Spoken Dialogue Manager, which is part of the EU-funded project ATRACO. OwlSpeak interprets Spoken Dialogue Ontologies and on this basis generates VoiceXML dialogue snippets. The dialogue snippets can be interpreted by all speech servers that provide VoiceXML support and therefore make the dialogue management independent from the hosting systems providing speech recognition and synthesis. Ontologies are used within the framework of our prototype to represent specific spoken dialogue domains that can dynamically be broadened or tightened during an ongoing dialogue. We provide an exemplary dialogue encoded as OWL model and explain how this model is interpreted by the dialogue manager. The combination of a unified model for dialogue domains and the strict model-view-controller architecture that underlies the dialogue manager lead to an efficient system that allows for a new way of spoken dialogue system development and can be used for further research on adaptive spoken dialogue strategies.

pdf bib
The SignSpeak Project - Bridging the Gap Between Signers and Speakers
Philippe Dreuw | Hermann Ney | Gregorio Martinez | Onno Crasborn | Justus Piater | Jose Miguel Moya | Mark Wheatley

The SignSpeak project will be the first step to approach sign language recognition and translation at a scientific level already reached in similar research fields such as automatic speech recognition or statistical machine translation of spoken languages. Deaf communities revolve around sign languages as they are their natural means of communication. Although deaf, hard of hearing and hearing signers can communicate without problems amongst themselves, there is a serious challenge for the deaf community in trying to integrate into educational, social and work environments. The overall goal of SignSpeak is to develop a new vision-based technology for recognizing and translating continuous sign language to text. New knowledge about the nature of sign language structure from the perspective of machine recognition of continuous sign language will allow a subsequent breakthrough in the development of a new vision-based technology for continuous sign language recognition and translation. Existing and new publicly available corpora will be used to evaluate the research progress throughout the whole project.

pdf bib
Automatic Term Recognition Based on the Statistical Differences of Relative Frequencies in Different Corpora
Junko Kubo | Keita Tsuji | Shigeo Sugimoto

In this paper, we propose a method for automatic term recognition (ATR) which uses the statistical differences of relative frequencies of terms in target domain corpus and elsewhere. Generally, the target terms appear more frequently in target domain corpus than in other domain corpora. Utilizing such characteristics will lead to the improvement of extraction performance. Most of the ATR methods proposed so far only use the target domain corpus and do not take such characteristics into account. For the extraction experiment, we used the abstracts of a women's studies journal as a target domain corpus and those of academic journals of 39 domains as other domain corpora. The women's studies terms which were used for extraction evaluation were manually identified terms in the abstracts. The extraction performance was analyzed and we found that our method outperformed earlier methods. The previous methods were based on C-value, FLR and methods which were also used with other domain corpora.

pdf bib
FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser
Violeta Seretan | Eric Wehrli | Luka Nerima | Gabriela Soare

We describe work in progress on the development of a full syntactic parser for Romanian. This work is part of a larger project of multilingual extension of the Fips parser (Wehrli, 2007), already available for French, English, German, Spanish, Italian, and Greek, to four new languages (Romanian, Romansh, Russian and Japanese). The Romanian version was built by starting with the Fips generic parsing architecture for the Romance languages and customising the grammatical component, in close relation to the development of the lexical component. We describe this process and report on preliminary results obtained for journalistic texts.

pdf bib
Spontal: A Swedish Spontaneous Dialogue Corpus of Audio, Video and Motion Capture
Jens Edlund | Jonas Beskow | Kjell Elenius | Kahl Hellmer | Sofia Strönbergsson | David House

We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.

pdf bib
Building a Domain-specific Document Collection for Evaluating Metadata Effects on Information Retrieval
Walid Magdy | Jinming Min | Johannes Leveling | Gareth J. F. Jones

This paper describes the development of a structured document collection containing user-generated text and numerical metadata for exploring the exploitation of metadata in information retrieval (IR). The collection consists of more than 61,000 documents extracted from YouTube video pages on basketball in general and NBA (National Basketball Association) in particular, together with a set of 40 topics and their relevance judgements. In addition, a collection of nearly 250,000 user profiles related to the NBA collection is available. Several baseline IR experiments report the effect of using video-associated metadata on retrieval effectiveness. The results surprisingly show that searching the videos titles only performs significantly better than searching additional metadata text fields of the videos such as the tags or the description.

pdf bib
Interpreting SentiWordNet for Opinion Classification
Horacio Saggion | Adam Funk

We describe a set of tools, resources, and experiments for opinion classification in business-related datasources in two languages. In particular we concentrate on SentiWordNet text interpretation to produce word, sentence, and text-based sentiment features for opinion classification. We achieve good results in experiments using supervised learning machine over syntactic and sentiment-based features. We also show preliminary experiments where the use of summaries before opinion classification provides competitive advantage over the use of full documents.

pdf bib
DiSCo - A German Evaluation Corpus for Challenging Problems in the Broadcast Domain
Doris Baum | Daniel Schneider | Rolf Bardeli | Jochen Schwenninger | Barbara Samlowski | Thomas Winkler | Joachim Köhler

Typical broadcast material contains not only studio-recorded texts read by trained speakers, but also spontaneous and dialect speech, debates with cross-talk, voice-overs, and on-site reports with difficult acoustic environments. Standard approaches to speech and speaker recognition usually deteriorate under such conditions. This paper reports on the design, construction, and experimental analysis of DiSCo, a German corpus for the evaluation of speech and speaker recognition on challenging material from the broadcast domain. One of the key requirements for the design of this corpus was a good coverage of different types of serious programmes beyond clean speech and planned speech broadcast news. Corpus annotation encompasses manual segmentation, an orthographic transcription, and labelling with speech mode, dialect, and noise type. We indicate typical use cases for the corpus by reporting results from ASR, speech search, and speaker recognition on the new corpus, thereby obtaining insights into the difficulty of audio recognition on the various classes.

pdf bib
The Creagest Project: a Digitized and Annotated Corpus for French Sign Language (LSF) and Natural Gestural Languages
Antonio Balvet | Cyril Courtin | Dominique Boutet | Christian Cuxac | Ivani Fusellier-Souza | Brigitte Garcia | Marie-Thérèse L’Huillier | Marie-Anne Sallandre

In this paper, we discuss the theoretical, sociolinguistic, methodological and technical objectives and issues of the French Creagest Project (2007-2012) in setting up, documenting and annotating a large corpus of adult and child French Sign Language (LSF) and of natural gestural language. The main objective of this ANR-funded research project is to set up a collaborative web-based platform for the study of semiogenesis in LSF (French Sign Language), i.e. the study of emerging structures and signs, be they used by Deaf adult signers, Deaf children, or even by Deaf and hearing subjects in interaction. In section 2, we address theoretical and practical issues, emphasizing the outstanding features of the Creagest Project. In section 3, we deal with methodological issues for data collection. Finally, in section 4, we examine technical aspects of LSF video data editing and corpus annotation, in the perspective of setting up a corpus-based formalized description of LSF.

pdf bib
Speech Translation in Pedagogical Environment Using Additional Sources of Knowledge
Jesús Tomás | Alejandro Canovas | Jaime Lloret | Miguel García Pineda | Jose L. Abad

A key aspect in the development of statistical translators is the synergic combination of different sources of knowledge. This work describes the effect and implications that would have adding additional other-than-voice information in a voice translation system. In the model discussed the additional information serves as the bases for the log-linear combination of several statistical models. A prototype that implements a real-time speech translation system from Spanish to English that is adapted to specific teaching-related environments is presented. In the scenario of analysis a teacher as speaker giving an educational class could use a real time translation system with foreign students. The teacher could add slides or class notes as additional reference to the voice translation system. Should notes be already translated into the destination language the system could have even more accuracy. We present the theoretical framework of the problem, summarize the overall architecture of the system, show how the system is enhanced with capabilities related to capturing the additional information; and finally present the initial performance results.

pdf bib
Design and Data Collection for the Accentological Corpus of the Russian Language
Elena Grishina | Svetlana Savchuk | Alexej Poljakov

Accentological corpus provides a researcher an opportunity to study word stress and stress variation, which are very important for the Russian language. Moreover, Accentological corpus allows studying the history of the Russian language stress development. The research presents the main characteristics of Accentological corpus available at Corpora size, type and sources of text material, the way it is represented in the corpora, types of linguistic annotation, corpora composition and ways of their effective use according to their purposes are described. There are two zones in the Accentological corpus. 1) The zone of prose includes oral texts and films transcripts, in which stressed syllables are marked according to the real pronunciation. 2) The zone of poetry contains texts with marked accented syllables, so it is possible to define the exact word stress using special rules. The Accentological corpus has four types of annotations (metatextual, morphological, semantic and sociological) and also has its own accentological mark-up. Due to accentological annotation each word is supplied with stress marks, so a user can make queries and retrieve the stressed or unstressed word forms in combination with grammatical and semantic features.

pdf bib
CCASH: A Web Application Framework for Efficient, Distributed Language Resource Development
Paul Felt | Owen Merkling | Marc Carmen | Eric Ringger | Warren Lemmon | Kevin Seppi | Robbie Haertel

We introduce CCASH (Cost-Conscious Annotation Supervised by Humans), an extensible web application framework for cost-efficient annotation. CCASH provides a framework in which cost-efficient annotation methods such as Active Learning can be explored via user studies and afterwards applied to large annotation projects. CCASH’s architecture is described as well as the technologies that it is built on. CCASH allows custom annotation tasks to be built from a growing set of useful annotation widgets. It also allows annotation methods (such as AL) to be implemented in any language. Being a web application framework, CCASH offers secure centralized data and annotation storage and facilitates collaboration among multiple annotations. By default it records timing information about each annotation and provides facilities for recording custom statistics. The CCASH framework has been used to evaluate a novel annotation strategy presented in a concurrently published paper, and will be used in the future to annotate a large Syriac corpus.

pdf bib
Resources for Speech Synthesis of Viennese Varieties
Michael Pucher | Friedrich Neubarth | Volker Strom | Sylvia Moosmüller | Gregor Hofer | Christian Kranzler | Gudrun Schuchmann | Dietmar Schabus

This paper describes our work on developing corpora of three varieties of Viennese for unit selection speech synthesis. The synthetic voices for Viennese varieties, implemented with the open domain unit selection speech synthesis engine Multisyn of Festival will also be released within Festival. The paper especially focuses on two questions: how we selected the appropriate speakers and how we obtained the text sources needed for the recording of these non-standard varieties. Regarding the first one, it turned out that working with a ‘prototypical’ professional speaker was much more preferable than striving for authenticity. In addition, we give a brief outline about the differences between the Austrian standard and its dialectal varieties and how we solved certain technical problems that are related to these differences. In particular, the specific set of phones applicable to each variety had to be determined by applying various constraints. Since such a set does not serve any descriptive purposes but rather is influencing the quality of speech synthesis, a careful design of such a set was an important task.

pdf bib
Predictive Features for Detecting Indefinite Polar Sentences
Michael Wiegand | Dietrich Klakow

In recent years, text classification in sentiment analysis has mostly focused on two types of classification, the distinction between objective and subjective text, i.e. subjectivity detection, and the distinction between positive and negative subjective text, i.e. polarity classification. So far, there has been little work examining the distinction between definite polar subjectivity and indefinite polar subjectivity. While the former are utterances which can be categorized as either positive or negative, the latter cannot be categorized as either of these two categories. This paper presents a small set of domain independent features to detect indefinite polar sentences. The features reflect the linguistic structure underlying these types of utterances. We give evidence for the effectiveness of these features by incorporating them into an unsupervised rule-based classifier for sentence-level analysis and compare its performance with supervised machine learning classifiers, i.e. Support Vector Machines (SVMs) and Nearest Neighbor Classifier (kNN). The data used for the experiments are web-reviews collected from three different domains.

pdf bib
Term and Collocation Extraction by Means of Complex Linguistic Web Services
Ulrich Heid | Fabienne Fritzinger | Erhard Hinrichs | Marie Hinrichs | Thomas Zastrow

We present a web service-based environment for the use of linguistic resources and tools to address issues of terminology and language varieties. We discuss the architecture, corpus representation formats, components and a chainer supporting the combination of tools into task-specific services. Integrated into this environment, single web services also become part of complex scenarios for web service use. Our web services take for example corpora of several million words as an input on which they perform preprocessing, such as tokenisation, tagging, lemmatisation and parsing, and corpus exploration, such as collocation extraction and corpus comparison. Here we present an example on extraction of single and multiword items typical of a specific domain or typical of a regional variety of German. We also give a critical review on needs and available functions from a user's point of view. The work presented here is part of ongoing experimentation in the D-SPIN project, the German national counterpart of CLARIN.

pdf bib
Preparing the field for an Open Resource Infrastructure: the role of the FLaReNet Network of Excellence
Nicoletta Calzolari | Claudia Soria

In order to overcome the fragmentation that affects the field of Language Resources and Technologies, an Open and Distributed Resource Infrastructure is the necessary step for building on each other achievements, integrating resources and technologies and avoiding dispersed or conflicting efforts. Since this endeavour represents a true cultural turnpoint in the LRs field, it needs a careful preparation, both in terms of acceptance by the community and thoughtful investigation of the various technical, organisational and practical aspects implied. To achieve this, we need to act as a community able to join forces on a set of shared priorities and we need to act at a worldwide level. FLaReNet ― Fostering Language Resources Network ― is a Thematic Network funded under the EU eContent program that aims at developing the needed common vision and fostering a European and International strategy for consolidating the sector, thus enhancing competitiveness at EU level and worldwide. In this paper we present the activities undertaken by FLaReNet in order to prepare and support the establishment of such an Infrastructure, which is becoming now a reality within the new MetaNet initiative.

pdf bib
The LREC Map of Language Resources and Technologies
Nicoletta Calzolari | Claudia Soria | Riccardo Del Gratta | Sara Goggi | Valeria Quochi | Irene Russo | Khalid Choukri | Joseph Mariani | Stelios Piperidis

In this paper we present the LREC Map of Language Resources and Tools, an innovative feature introduced with this LREC. The purpose of the Map is to shed light on the vast amount of resources and tools that represent the background of the research presented at LREC, in the attempt to fill in a gap in the community knowledge about the resources and tools that are used or created worldwide. It also aims at a change of culture in the field, actively engaging each researcher in the documentation task about resources. The Map has been developed on the basis of the information provided by LREC authors during the submission of papers to the LREC 2010 conference and the LREC workshops, and contains information about almost 2000 resources. The paper illustrates the motivation behind this initiative, its main characteristics, its relevance and future impact in the field, the metadata used to describe the resources, and finally presents some of the most relevant findings.

pdf bib
Evaluation Protocol and Tools for Question-Answering on Speech Transcripts
Nicolas Moreau | Olivier Hamon | Djamel Mostefa | Sophie Rosset | Olivier Galibert | Lori Lamel | Jordi Turmo | Pere R. Comas | Paolo Rosso | Davide Buscaldi | Khalid Choukri

Question Answering (QA) technology aims at providing relevant answers to natural language questions. Most Question Answering research has focused on mining document collections containing written texts to answer written questions. In addition to written sources, a large (and growing) amount of potentially interesting information appears in spoken documents, such as broadcast news, speeches, seminars, meetings or telephone conversations. The QAST track (Question-Answering on Speech Transcripts) was introduced in CLEF to investigate the problem of question answering in such audio documents. This paper describes in detail the evaluation protocol and tools designed and developed for the CLEF-QAST evaluation campaigns that have taken place between 2007 and 2009. We first remind the data, question sets, and submission procedures that were produced or set up during these three campaigns. As for the evaluation procedure, the interface that was developed to ease the assessors’ work is described. In addition, this paper introduces a methodology for a semi-automatic evaluation of QAST systems based on time slot comparisons. Finally, the QAST Evaluation Package 2007-2009 resulting from these evaluation campaigns is also introduced.

pdf bib
The Database of Catalan Adjectives
Roser Sanromà | Gemma Boleda

We present the Database of Catalan Adjectives (DCA), a database with 2,296 adjective lemmata enriched with morphological, syntactic and semantic information. This set of adjectives has been collected from a fragment of the Corpus Textual Informatitzat de la Llengua Catalana of the Institut d’Estudis Catalans and constitutes a representative sample of the adjective class in Catalan as a whole. The database includes both manually coded and automatically extracted information regarding the most prominent properties used in the literature regarding the semantics of adjectives, such as morphological origin, suffix (if any), predicativity, gradability, adjective position with respect to the head noun, adjective modifiers, or semantic class. The DCA can be useful for NLP applications using adjectives (from POS-taggers to Opinion Mining applications) and for linguistic analysis regarding the morphological, syntactic, and semantic properties of adjectives. We now make it available to the research community under a Creative Commons Attribution Share Alike 3.0 Spain license.

pdf bib
Can Syntactic and Logical Graphs help Word Sense Disambiguation?
Amal Zouaq | Michel Gagnon | Benoit Ozell

This paper presents a word sense disambiguation (WSD) approach based on syntactic and logical representations. The objective here is to run a number of experiments to compare standard contexts (word windows, sentence windows) with contexts provided by a dependency parser (syntactic context) and a logical analyzer (logico-semantic context). The approach presented here relies on a dependency grammar for the syntactic representations. We also use a pattern knowledge base over the syntactic dependencies to extract flat predicative logical representations. These representations (syntactic and logical) are then used to build context vectors that are exploited in the WSD process. Various state-of-the-art algorithms including Simplified Lesk, Banerjee and Pedersen and frequency of co-occurrences are tested with these syntactic and logical contexts. Preliminary results show that defining context vectors based on these features may improve WSD by comparison with classical word and sentence context windows. However, future experiments are needed to provide more evidence over these issues.

pdf bib
Automatic Acquisition of Chinese Novel Noun Compounds
Meng Wang | Chu-Ren Huang | Shiwen Yu | Weiwei Sun

Automatic acquisition of novel compounds is notoriously difficult because most novel compounds have relatively low frequency in a corpus. The current study proposes a new method to deal with the novel compound acquisition challenge. We model this task as a two-class classification problem in which a candidate compound is either classified as a compound or a non-compound. A machine learning method using SVM, incorporating two types of linguistically motivated features: semantic features and character features, is applied to identify rare but valid noun compounds. We explore two kinds of training data: one is virtual training data which is obtained by three statistical scores, i.e. co-occurrence frequency, mutual information and dependent ratio, from the frequent compounds; the other is real training data which is randomly selected from the infrequent compounds. We conduct comparative experiments, and the experimental results show that even with limited direct evidence in the corpus for the novel compounds, we can make full use of the typical frequent compounds to help in the discovery of the novel compounds.

pdf bib
Constructing a Broad-coverage Lexicon for Text Mining in the Patent Domain
Nelleke Oostdijk | Suzan Verberne | Cornelis Koster

For mining intellectual property texts (patents), a broad-coverage lexicon that covers general English words together with terminology from the patent domain is indispensable. The patent domain is very diffuse as it comprises a variety of technical domains (e.g. Human Necessities, Chemistry & Metallurgy and Physics in the International Patent Classification). As a result, collecting a lexicon that covers the language used in patent texts is not a straightforward task. In this paper we describe the approach that we have developed for the semi-automatic construction of a broad-coverage lexicon for classification and information retrieval in the patent domain and which combines information from multiple sources. Our contribution is twofold. First, we provide insight into the difficulties of developing lexical resources for information retrieval and text mining in the patent domain, a research and development field that is expanding quickly. Second, we create a broad coverage lexicon annotated with rich lexical information and containing both general English word forms and domain terminology for various technical domains.

pdf bib
Syntactic Testsuites and Textual Entailment Recognition
Paul Bedaride | Claire Gardent

We focus on textual entailments mediated by syntax and propose a new methodology to evaluate textual entailment recognition systems on such data. The main idea is to generate a syntactically annotated corpus of pairs of (non-)entailments and to use error mining methodology from the parsing field to identify the most likely sources of errors. To generate the evaluation corpus we use a template based generation approach where sentences, semantic representations and syntactic annotations are all created at the same time. Furthermore, we adapt the error mining methodology initially proposed for parsing to the field of textual entailment. To illustrate the approach, we apply the proposed methodology to the Afazio RTE system (an hybrid system focusing on syntactic entailment) and show how it permits identifying the most likely sources of errors made by this system on a testsuite of 10 000 (non-)entailment pairs which is balanced in term of (non-)entailment and in term of syntactic annotations.

pdf bib
Querying Diverse Treebanks in a Uniform Way
Jan Štěpánek | Petr Pajas

This paper presents a system for querying treebanks in a uniform way. The system is able to work with both dependency and constituency based treebanks in any language. We demonstrate its abilities on 11 different treebanks. The query language used by the system provides many features not available in other existing systems while still keeping the performance efficient. The paper also describes the conversion of ten treebanks into a common XML-based format used by the system, touching the question of standards and formats. The paper then shows several examples of linguistically interesting questions that the system is able to answer, for example browsing verbal clauses without subjects or extraposed relative clauses, generating the underlying grammar in a constituency treebank, searching for non-projective edges in a dependency treebank, or word-order typology of a language based on the treebank. The performance of several implementations of the system is also discussed by measuring the time requirements of some of the queries.

pdf bib
Deep Linguistic Processing with GETARUNS for Spoken Dialogue Understanding
Rodolfo Delmonte | Antonella Bristot | Vincenzo Pallotta

In this paper we will present work carried out to scale up the system for text understanding called GETARUNS, and port it to be used in dialogue understanding. The current goal is that of extracting automatically argumentative information in order to build argumentative structure. The long term goal is using argumentative structure to produce automatic summarization of spoken dialogues. Very much like other deep linguistic processing systems, our system is a generic text/dialogue understanding system that can be used in connection with an ontology ― WordNet - and other similar repositories of commonsense knowledge. We will present the adjustments we made in order to cope with transcribed spoken dialogues like those produced in the ICSI Berkeley project. In a final section we present preliminary evaluation of the system on two tasks: the task of automatic argumentative labeling and another frequently addressed task: referential vs. non-referential pronominal detection. Results obtained fair much higher than those reported in similar experiments with machine learning approaches.

pdf bib
Arabic Part of Speech Tagging
Emad Mohamed | Sandra Kübler

Arabic is a morphologically rich language, which presents a challenge for part of speech tagging. In this paper, we compare two novel methods for POS tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags that describe full words and does not require any word segmentation. The second approach is segmentation-based, using a machine learning segmenter. In this approach, the words are first segmented, then the segments are annotated with POS tags. Because of the word-based approach, we evaluate full word accuracy rather than segment accuracy. Word-based POS tagging yields better results than segment-based tagging (93.93% vs. 93.41%). Word based tagging also gives the best results on known words, the segmentation-based approach gives better results on unknown words. Combining both methods results in a word accuracy of 94.37%, which is very close to the result obtained by using gold standard segmentation (94.91%).

pdf bib
Twitter as a Corpus for Sentiment Analysis and Opinion Mining
Alexander Pak | Patrick Paroubek

Microblogging today has become a very popular communication tool among Internet users. Millions of users share opinions on different aspects of life everyday. Therefore microblogging web-sites are rich sources of data for opinion mining and sentiment analysis. Because microblogging has appeared relatively recently, there are a few research works that were devoted to this topic. In our paper, we focus on using Twitter, the most popular microblogging platform, for the task of sentiment analysis. We show how to automatically collect a corpus for sentiment analysis and opinion mining purposes. We perform linguistic analysis of the collected corpus and explain discovered phenomena. Using the corpus, we build a sentiment classifier, that is able to determine positive, negative and neutral sentiments for a document. Experimental evaluations show that our proposed techniques are efficient and performs better than previously proposed methods. In our research, we worked with English, however, the proposed technique can be used with any other language.

pdf bib
Word Boundaries in French: Evidence from Large Speech Corpora
Rena Nemoto | Martine Adda-Decker | Jacques Durand

The goal of this paper is to investigate French word segmentation strategies using phonemic and lexical transcriptions as well as prosodic and part-of-speech annotations. Average fundamental frequency (f0) profiles and phoneme duration profiles are measured using 13 hours of broadcast news speech to study prosodic regularities of French words. Some influential factors are taken into consideration for f0 and duration measurements: word syllable length, word-final schwa, part-of-speech. Results from average f0 profiles confirm word final syllable accentuation and from average duration profiles, we can observe long word final syllable length. Both are common tendencies in French. From noun phrase studies, results of average f0 profiles illustrate higher noun first syllable after determiner. Inter-vocalic duration profile results show long inter-vocalic duration between determiner vowel and preceding word vowel. These results reveal measurable cues contributing to word boundary location. Further studies will include more detailed within syllable f0 patterns, other speaking styles and languages.

pdf bib
A Lexicon of French Quotation Verbs for Automatic Quotation Extraction
Benoît Sagot | Laurence Danlos | Rosa Stern

Quotation extraction is an important information extraction task, especially when dealing with news wires. Quotations can be found in various configurations. In this paper, we focus on direct quotations introduced by a parenthetical clause, headed by a ""quotation verb"". Our study is based on a large French news wire corpus from the Agence France-Presse. We introduce and motivate an analysis at the discursive level of such quotations, which differs from the syntactic analyses generally proposed. We show how we enriched the Lefff syntactic lexicon so that it provides an account for quotation verbs heading a quotation parenthetical, especially those extracted from a news wire corpus. We also sketch how these lexical entries can be extended to the discursive level in order to model quotations introduced in a parenthetical clause in a complete way.

pdf bib
Ways of Evaluation of the Annotators in Building the Prague Czech-English Dependency Treebank
Marie Mikulová | Jan Štěpánek

In this paper, we present several ways to measure and evaluate the annotation and annotators, proposed and used during the building of the Czech part of the Prague Czech-English Dependency Treebank. At first, the basic principles of the treebank annotation project are introduced (division to three layers: morphological, analytical and tectogrammatical). The main part of the paper describes in detail one of the important phases of the annotation process: three ways of evaluation of the annotators - inter-annotator agreement, error rate and performance. The measuring of the inter-annotator agreement is complicated by the fact that the data contain added and deleted nodes, making the alignment between annotations non-trivial. The error rate is measured by a set of automatic checking procedures that guard the validity of some invariants in the data. The performance of the annotators is measured by a booking web application. All three measures are later compared and related to each other.

pdf bib
Assessing the Impact of English Language Skills and Education Level on PubMed Searches by Dutch-speaking Users
Klaar Vanopstal | Robert Vander Stichele | Godelieve Laureys | Joost Buysschaert

The aim of this study was to assess the retrieval effectiveness of nursing students in the Dutch-speaking part of Belgium. We tested two groups: students from the master of Nursing and Midwifery training, and students from the bachelor of Nursing program. The test consisted of five parts: first, the students completed an enquiry about their computer skills, experiences with PubMed and how they assessed their own language skills. Secondly, an introduction into the use of MeSH in PubMed was given, followed by a PubMed search. After the literature search, a second enquiry was completed in which the students were asked to give their opinion about the test. To conclude, an official language test was completed. The results of the PubMed search, i.e. a list of articles the students deemed relevant for a particular question, were compared to a gold standard. Precision, recall and F-score were calculated in order to evaluate the efficiency of the PubMed search. We used information from the search process, such as search term formulation and MeSH term selection to evaluate the search process and examined their relationship with the results of the language test and the level of education.

pdf bib
Two-level Annotation of Utterance-units in Japanese Dialogs: An Empirically Emerged Scheme
Yasuharu Den | Hanae Koiso | Takehiko Maruyama | Kikuo Maekawa | Katsuya Takanashi | Mika Enomoto | Nao Yoshida

In this paper, we propose a scheme for annotating utterance-level units in Japanese dialogs, which emerged from an analysis of the interrelationship among four schemes, i) inter-pausal units, ii) intonation units, iii) clause units, and iv) pragmatic units. The associations among the labels of these four units were illustrated by multiple correspondence analysis and hierarchical cluster analysis. Based on these results, we prescribe utterance-unit identification rules, which identify two sorts of utterance-units with different granularities: short and long utterance-units. Short utterance-units are identified by acoustic and prosodic disjuncture, and they are considered to constitute units of speaker's planning and hearer's understanding. Long utterance-units, on the other hand, are recognized by syntactic and pragmatic disjuncture, and they are regarded as units of interaction. We explore some characteristics of these utterance-units, focusing particularly on unit duration and syntactic property, other participants' responses, and mismatch between the two-levels. We also discuss how our two-level utterance-units are useful in analyzing cognitive and communicative aspects of spoken dialogs.

pdf bib
Statistical French Dependency Parsing: Treebank Conversion and First Results
Marie Candito | Benoît Crabbé | Pascal Denis

We first describe the automatic conversion of the French Treebank (Abeillé and Barrier, 2004), a constituency treebank, into typed projective dependency trees. In order to evaluate the overall quality of the resulting dependency treebank, and to quantify the cases where the projectivity constraint leads to wrong dependencies, we compare a subset of the converted treebank to manually validated dependency trees. We then compare the performance of two treebank-trained parsers that output typed dependency parses. The first parser is the MST parser (Mcdonald et al., 2006), which we directly train on dependency trees. The second parser is a combination of the Berkeley parser (Petrov et al., 2006) and a functional role labeler: trained on the original constituency treebank, the Berkeley parser first outputs constituency trees, which are then labeled with functional roles, and then converted into dependency trees. We found that used in combination with a high-accuracy French POS tagger, the MST parser performs a little better for unlabeled dependencies (UAS=90.3% versus 89.6%), and better for labeled dependencies (LAS=87.6% versus 85.6%).

pdf bib
Formal Description of Resources for Ontology-based Semantic Annotation
Yue Ma | Adeline Nazarenko | Laurent Audibert

Ontology-based semantic annotation aims at putting fragments of a text in correspondence with proper elements of an ontology such that the formal semantics encoded by the ontology can be exploited to represent text interpretation. In this paper, we formalize a resource for this goal. The main difficulty in achieving good semantic annotations consists in identifying fragments to be annotated and labels to be associated with them. To this end, our approach takes advantage of standard web ontology languages as well as rich linguistic annotation platforms. This in turn is concerned with how to formalize the combination of the ontological and linguistical information, which is a topical issue that has got an increasing discussion recently. Different from existing formalizations, our purpose is to extend ontologies by semantic annotation rules whose complexity increases along two dimensions: the linguistic complexity and the rule syntactic complexity. This solution allows reusing best NLP tools for the production of various levels of linguistic annotations. It also has the merit to distinguish clearly the process of linguistic analysis and the ontological interpretation.

pdf bib
KALAKA: A TV Broadcast Speech Database for the Evaluation of Language Recognition Systems
Luis Javier Rodríguez-Fuentes | Mikel Penagarikano | Germán Bordel | Amparo Varona | Mireia Díez

A speech database, named KALAKA, was created to support the Albayzin 2008 Evaluation of Language Recognition Systems, organized by the Spanish Network on Speech Technologies from May to November 2008. This evaluation, designed according to the criteria and methodology applied in the NIST Language Recognition Evaluations, involved four target languages: Basque, Catalan, Galician and Spanish (official languages in Spain), and included speech signals in other (unknown) languages to allow open-set verification trials. In this paper, the process of designing, collecting data and building the train, development and evaluation datasets of KALAKA is described. Results attained in the Albayzin 2008 LRE are presented as a means of evaluating the database. The performance of a state-of-the-art language recognition system on a closed-set evaluation task is also presented for reference. Future work includes extending KALAKA by adding Portuguese and English as target languages and renewing the set of unknown languages needed to carry out open-set evaluations.

pdf bib
Annotation of Morphological Meanings of Verbs Revisited
Jarmila Panevová | Magda Ševčíková

Meanings of morphological categories are an indispensable component of representation of sentence semantics. In the Prague Dependency Treebank 2.0, sentence semantics is represented as a dependency tree consisting of labeled nodes and edges. Meanings of morphological categories are captured as attributes of tree nodes; these attributes are called grammatemes. The present paper focuses on morphological meanings of verbs, i.e. on meanings of the morphological category of tense, mood, aspect etc. After several introductory remarks, seven verbal grammatemes used in the PDT 2.0 annotation scenario are briefly introduced. After that, each of the grammatemes is examined. Three verbal grammatemes of the original set were included in the new set without changes, one of the grammatemes was extended, and three of them were substituted for three new ones. The revised grammateme set is to be included in the forthcoming version of PDT (tentatively called PDT 3.0). Rules for automatic and manual assignment of the revised grammatemes are further discussed in the paper.

pdf bib
Unsupervised Discovery of Collective Action Frames for Socio-Cultural Analysis
Andrew Hickl | Sanda Harabagiu

pdf bib
Predicting Morphological Types of Chinese Bi-Character Words by Machine Learning Approaches
Ting-Hao Huang | Lun-Wei Ku | Hsin-Hsi Chen

This paper presented an overview of Chinese bi-character words’ morphological types, and proposed a set of features for machine learning approaches to predict these types based on composite characters’ information. First, eight morphological types were defined, and 6,500 Chinese bi-character words were annotated with these types. After pre-processing, 6,178 words were selected to construct a corpus named Reduced Set. We analyzed Reduced Set and conducted the inter-annotator agreement test. The average kappa value of 0.67 indicates a substantial agreement. Second, Bi-character words’ morphological types are considered strongly related with the composite characters’ parts of speech in this paper, so we proposed a set of features which can simply be extracted from dictionaries to indicate the characters’ “tendency” of parts of speech. Finally, we used these features and adopted three machine learning algorithms, SVM, CRF, and Naïve Bayes, to predict the morphological types. On the average, the best algorithm CRF achieved 75% of the annotators’ performance.

pdf bib
Studying the Lexicon of Dialogue Acts
Nicole Novielli | Carlo Strapparava

Dialogue Acts have been well studied in linguistics and attracted computational linguistics research for a long time: they constitute the basis of everyday conversations and can be identified with the communicative goal of a given utterance (e.g. asking for information, stating facts, expressing opinions, agreeing or disagreeing). Even if not constituting any deep understanding of the dialogue, automatic dialogue act labeling is a task that can be relevant for a wide range of applications in both human-computer and human-human interaction. We present a qualitative analysis of the lexicon of Dialogue Acts: we explore the relationship between the communicative goal of an utterance and its affective content as well as the salience of specific word classes for each speech act. The experiments described in this paper fit in the scope of a research study whose long-term goal is to build an unsupervised classifier that simply exploits the lexical semantics of utterances for automatically annotate dialogues with the proper speech acts.

pdf bib
Automatic Summarization Using Terminological and Semantic Resources
Jorge Vivaldi | Iria da Cunha | Juan-Manuel Torres-Moreno | Patricia Velázquez-Morales

This paper presents a new algorithm for automatic summarization of specialized texts combining terminological and semantic resources: a term extractor and an ontology. The term extractor provides the list of the terms that are present in the text together their corresponding termhood. The ontology is used to calculate the semantic similarity among the terms found in the main body and those present in the document title. The general idea is to obtain a relevance score for each sentence taking into account both the ”termhood” of the terms found in such sentence and the similarity among such terms and those terms present in the title of the document. The phrases with the highest score are chosen to take part of the final summary. We evaluate the algorithm with Rouge, comparing the resulting summaries with the summaries of other summarizers. The sentence selection algorithm was also tested as part of a standalone summarizer. In both cases it obtains quite good results although the perception is that there is a space for improvement.

pdf bib
Is my Judge a good One?
Olivier Hamon

This paper aims at measuring the reliability of judges in MT evaluation. The scope is two evaluation campaigns from the CESTA project, during which human evaluations were carried out on fluency and adequacy criteria for English-to-French documents. Our objectives were threefold: observe both inter- and intra-judge agreements, and then study the influence of the evaluation design especially implemented for the need of the campaigns. Indeed, a web interface was especially developed to help with the human judgments and store the results, but some design changes were made between the first and the second campaign. Considering the low agreements observed, the judges' behaviour has been analysed in that specific context. We also asked several judges to repeat their own evaluations a few times after the first judgments done during the official evaluation campaigns. Even if judges did not seem to agree fully at first sight, a less strict comparison led to a strong agreement. Furthermore, the evolution of the design during the project seemed to have been a source for the difficulties that judges encountered to keep the same interpretation of quality.

pdf bib
Building a System for Emotions Detection from Speech to Control an Affective Avatar
Mátyás Brendel | Riccardo Zaccarelli | Laurence Devillers

In this paper we describe a corpus set together from two sub-corpora. The CINEMO corpus contains acted emotional expression obtained by playing dubbing exercises. This new protocol is a way to collect mood-induced data in large amount which show several complex and shaded emotions. JEMO is a corpus collected with an emotion-detection game and contains more prototypical emotions than CINEMO. We show how the two sub-corpora balance and enrich each other and result in a better performance. We built male and female emotion models and use Sequential Fast Forward Feature Selection to improve detection performances. After feature-selection we obtain good results even with our strict speaker independent testing method. The global corpus contains 88 speakers (38 females, 50 males). This study has been done within the scope of the ANR (National Research Agency) Affective Avatar project which deals with building a system of emotions detection for monitoring an Artificial Agent by voice.

pdf bib
Facilitating Non-expert Users of the KYOTO Platform: the TMEKO Editing Protocol for Synset to Ontology Mappings
Roxane Segers | Piek Vossen

This paper presents the general architecture of the TMEKO protocol (Tutoring Methodology for Enriching the Kyoto Ontology) that guides non-expert users through the process of creating mappings from domain wordnet synsets to a shared ontology by answering natural language questions. TMEKO will be part of a Wiki-like community platform currently developed in the Kyoto project ( The platform provides the architecture for ontology based fact mining to enable knowledge sharing across languages and cultures. A central part of the platform is the Wikyoto editing environment in which users can create their own domain wordnet for seven different languages and define relations to the central and shared ontology based on DOLCE. A substantial part of the mappings will involve important processes and qualities associated with the concept. Therefore, the TMEKO protocol provides specific interviews for creating complex mappings that go beyond subclass and equivalence relations. The Kyoto platform and the TMEKO protocol are developed and applied to the environment domain for seven different languages (English, Dutch, Italian, Spanish, Basque, Japanese and Chinese), but can easily be extended and adapted to other languages and domains.

pdf bib
The GeneReg Corpus for Gene Expression Regulation Events — An Overview of the Corpus and its In-Domain and Out-of-Domain Interoperability
Ekaterina Buyko | Elena Beisswanger | Udo Hahn

Despite the large variety of corpora in the biomedical domain their annotations differ in many respects, e.g., the coverage of different, highly specialized knowledge domains, varying degrees of granularity of targeted relations, the specificity of linguistic anchoring of relations and named entities in documents, etc. We here present GeneReg (Gene Regulation Corpus), the result of an annotation campaign led by the Jena University Language & Information Engineering (JULIE) Lab. The GeneReg corpus consists of 314 abstracts dealing with the regulation of gene expression in the model organism E. coli. Our emphasis in this paper is on the compatibility of the GeneReg corpus with the alternative Genia event corpus and with several in-domain and out-of-domain lexical resources, e.g., the Specialist Lexicon, FrameNet, and WordNet. The links we established from the GeneReg corpus to these external resources will help improve the performance of the automatic relation extraction engine JREx trained and evaluated on GeneReg.

pdf bib
Word-based Partial Annotation for Efficient Corpus Construction
Graham Neubig | Shinsuke Mori

In order to utilize the corpus-based techniques that have proven effective in natural language processing in recent years, costly and time-consuming manual creation of linguistic resources is often necessary. Traditionally these resources are created on the document or sentence-level. In this paper, we examine the benefit of annotating only particular words with high information content, as opposed to the entire sentence or document. Using the task of Japanese pronunciation estimation as an example, we devise a machine learning method that can be trained on data annotated word-by-word. This is done by dividing the estimation process into two steps (word segmentation and word-based pronunciation estimation), and introducing a point-wise estimator that is able to make each decision independent of the other decisions made for a particular sentence. In an evaluation, the proposed strategy is shown to provide greater increases in accuracy using a smaller number of annotated words than traditional sentence-based annotation techniques.

pdf bib
Design and Application of a Gold Standard for Morphological Analysis: SMOR as an Example of Morphological Evaluation
Gertrud Faaß | Ulrich Heid | Helmut Schmid

This paper describes general requirements for evaluating and documenting NLP tools with a focus on morphological analysers and the design of a Gold Standard. It is argued that any evaluation must be measurable and documentation thereof must be made accessible for any user of the tool. The documentation must be of a kind that it enables the user to compare different tools offering the same service, hence the descriptions must contain measurable values. A Gold Standard presents a vital part of any measurable evaluation process, therefore, the corpus-based design of a Gold Standard, its creation and problems that occur are reported upon here. Our project concentrates on SMOR, a morphological analyser for German that is to be offered as a web-service. We not only utilize this analyser for designing the Gold Standard, but also evaluate the tool itself at the same time. Note that the project is ongoing, therefore, we cannot present final results.

pdf bib
Identification of Rare & Novel Senses Using Translations in a Parallel Corpus
Richard Schwarz | Hinrich Schütze | Fabienne Martin | Achim Stein

The identification of rare and novel senses is a challenge in lexicography. In this paper, we present a new method for finding such senses using a word aligned multilingual parallel corpus. We use the Europarl corpus and therein concentrate on French verbs. We represent each occurrence of a French verb as a high dimensional term vector. The dimensions of such a vector are the possible translations of the verb according to the underlying word alignment. The dimensions are weighted by a weighting scheme to adjust to the significance of any particular translation. After collecting these vectors we apply forms of the K-means algorithm on the resulting vector space to produce clusters of distinct senses, so that standard uses produce large homogeneous clusters while rare and novel uses appear in small or heterogeneous clusters. We show in a qualitative and quantitative evaluation that the method can successfully find rare and novel senses.

pdf bib
Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese
Cláudia Freitas | Cristina Mota | Diana Santos | Hugo Gonçalo Oliveira | Paula Carvalho

In this paper, we present Second HAREM, the second edition of an evaluation campaign for Portuguese, addressing named entity recognition (NER). This second edition also included two new tracks: the recognition and normalization of temporal entities (proposed by a group of participants, and hence not covered on this paper) and ReRelEM, the detection of semantic relations between named entities. We summarize the setup of Second HAREM by showing the preserved distinctive features and discussing the changes compared to the first edition. Furthermore, we present the main results achieved and describe the available resources and tools developed under this evaluation, namely,(i) the golden collections, i.e. a set of documents whose named entities and semantic relations between those entities were manually annotated, (ii) the Second HAREM collection (which contains the unannotated version of the golden collection), as well as the participating systems results on it, (iii) the scoring tools, and (iv) SAHARA, a Web application that allows interactive evaluation. We end the paper by offering some remarks about what was learned.

pdf bib
The German Reference Corpus DeReKo: A Primordial Sample for Linguistic Research
Marc Kupietz | Cyril Belica | Holger Keibel | Andreas Witt

This paper describes DeReKo (Deutsches Referenzkorpus), the Archive of General Reference Corpora of Contemporary Written German at the Institut für Deutsche Sprache (IDS) in Mannheim, and the rationale behind its development. We discuss its design, its legal background, how to access it, available metadata, linguistic annotation layers, underlying standards, ongoing developments, and aspects of using the archive for empirical linguistic research. The focus of the paper is on the advantages of DeReKo's design as a primordial sample from which virtual corpora can be drawn for the specific purposes of individual studies. Both concepts, primordial sample and virtual corpus are explained and illustrated in detail. Furthermore, we describe in more detail how DeReKo deals with the fact that all its texts are subject to third parties' intellectual property rights, and how it deals with the issue of replicability, which is particularly challenging given DeReKo's dynamic growth and the possibility to construct from it an open number of virtual corpora.

pdf bib
Cultural Heritage: Knowledge Extraction from Web Documents
Eva Sassolini | Alessandra Cinini

This article presents the use of NLP techniques (text mining, text analysis) to develop specific tools that allow to create linguistic resources related to the cultural heritage domain. The aim of our approach is to create tools for the building of an online “knowledge network”, automatically extracted from text materials concerning this domain. A particular methodology was experimented by dividing the automatic acquisition of texts, and consequently, the creation of reference corpus in two phases. In the first phase, on-line documents have been extracted from lists of links provided by human experts. All documents extracted from the web by means of automatic spider have been stored in a repository of text materials. On the basis of these documents, automatic parsers create the reference corpus for the cultural heritage domain. Relevant information and semantic concepts are then extracted from this corpus. In a second phase, all these semantically relevant elements (such as proper names, names of institutions, names of places, and other relevant terms) have been used as basis for a new search strategy of text materials from heterogeneous sources. In this case also specialized crawlers (TP-crawler) have been used to work on a bulk of text materials available on line.

pdf bib
A Case Study on Interoperability for Language Resources and Applications
Marta Villegas | Núria Bel | Santiago Bel | Víctor Rodríguez

This paper reports our experience when integrating differ resources and services into a grid environment. The use case we address implies the deployment of several NLP applications as web services. The ultimate objective of this task was to create a scenario where researchers have access to a variety of services they can operate. These services should be easy to invoke and able to interoperate between one another. We essentially describe the interoperability problems we faced, which involve metadata interoperability, data interoperability and service interoperability. We devote special attention to service interoperability and explore the possibility to define common interfaces and semantic description of services. While the web services paradigm suits the integration of different services very well, this requires mutual understanding and the accommodation to common interfaces that not only provide technical solution but also ease the user‟s work. Defining common interfaces benefits interoperability but requires the agreement about operations and the set of inputs/outputs. Semantic annotation allows defining some sort of taxonomy that organizes and collects the set of admissible operations and types input/output parameters.

pdf bib
Semi-Automated Extension of a Specialized Medical Lexicon for French
Bruno Cartoni | Pierre Zweigenbaum

This paper describes the development of a specialized lexical resource for a specialized domain, namely medicine. First, in order to assess the linguistic phenomena that need to be adressed, we based our observation on a large collection of more than 300'000 terms, organised around conceptual identifiers. Based on these observations, we highlight the specificities that such a lexicon should take into account, namely in terms of inflectional and derivational knowledge. In a first experiment, we show that general resources lack a large part of the words needed to process specialized language. Secondly, we describe an experiment to feed semi-automatically a medical lexicon and populate it with inflectional information. This experiment is based on a semi-automatic methods that tries to acquire inflectional knowledge from frequent endings of words recorded in existing lexicon. Thanks to this, we increased the coverage of the target vocabulary from 14.1% to 25.7%.

pdf bib
Heterogeneous Data Sources for Signed Language Analysis and Synthesis: The SignCom Project
Kyle Duarte | Sylvie Gibet

This paper describes how heterogeneous data sources captured in the SignCom project may be used for the analysis and synthesis of French Sign Language (LSF) utterances. The captured data combine video data and multimodal motion capture (mocap) data, including body and hand movements as well as facial expressions. These data are pre-processed, synchronized, and enriched by text annotations of signed language elicitation sessions. The addition of mocap data to traditional data structures provides additional phonetic data to linguists who desire to better understand the various parts of signs (handshape, movement, orientation, etc.) to very exacting levels, as well as their interactions and relative timings. We show how the phonologies of hand configurations and articulator movements may be studied using signal processing and statistical analysis tools to highlight regularities or temporal schemata between the different modalities. Finally, mocap data allows us to replay signs using a computer animation engine, specifically editing and rearranging movements and configurations in order to create novel utterances.

pdf bib
Annotations for Opinion Mining Evaluation in the Industrial Context of the DOXA project
Patrick Paroubek | Alexander Pak | Djamel Mostefa

After presenting opinion and sentiment analysis state of the art and the DOXA project, we review the few evaluation campaigns that have dealt in the past with opinion mining. Then we present the two level opinion and sentiment model that we will use for evaluation in the DOXA project and the annotation interface we use for hand annotating a reference corpus. We then present the corpus which will be used on DOXA and report on the hand-annotation task on a corpus of comments on video games and the solution adopted to obtain a sufficient level of inter-annotator agreement.

pdf bib
Mining Wikipedia for Large-scale Repositories of Context-Sensitive Entailment Rules
Milen Kouylekov | Yashar Mehdad | Matteo Negri

This paper focuses on the central role played by lexical information in the task of Recognizing Textual Entailment. In particular, the usefulness of lexical knowledge extracted from several widely used static resources, represented in the form of entailment rules, is compared with a method to extract lexical information from Wikipedia as a dynamic knowledge resource. The proposed acquisition method aims at maximizing two key features of the resulting entailment rules: coverage (i.e. the proportion of rules successfully applied over a dataset of TE pairs), and context sensitivity (i.e. the proportion of rules applied in appropriate contexts). Evaluation results show that Wikipedia can be effectively used as a source of lexical entailment rules, featuring both higher coverage and context sensitivity with respect to other resources.

pdf bib
Using a Grammar Checker for Evaluation and Postprocessing of Statistical Machine Translation
Sara Stymne | Lars Ahrenberg

One problem in statistical machine translation (SMT) is that the output often is ungrammatical. To address this issue, we have investigated the use of a grammar checker for two purposes in connection with SMT: as an evaluation tool and as a postprocessing tool. To assess the feasibility of the grammar checker on SMT output, we performed an error analysis, which showed that the precision of error identification in general was higher on SMT output than in previous studies on human texts. Using the grammar checker as an evaluation tool gives a complementary picture to standard metrics such as Bleu, which do not account well for grammaticality. We use the grammar checker as a postprocessing tool by automatically applying the error correction suggestions it gives. There are only small overall improvements of the postprocessing on automatic metrics, but the sentences that are affected by the changes are improved, as shown both by automatic metrics and by a human error analysis. These results indicate that grammar checker techniques are a useful complement to SMT.

pdf bib
Extraction of German Multiword Expressions from Parsed Corpora Using Context Features
Marion Weller | Ulrich Heid

We report about tools for the extraction of German multiword expressions (MWEs) from text corpora; we extract word pairs, but also longer MWEs of different patterns, e.g. verb-noun structures with an additional prepositional phrase or adjective. Next to standard association-based extraction, we focus on morpho-syntactic, syntactic and lexical-choice features of the MWE candidates. A broad range of such properties (e.g. number and definiteness of nouns, adjacency of the MWE’s components and their position in the sentence, preferred lexical modifiers, etc.) along with relevant example sentences, are extracted from dependency-parsed text and stored in a data base. A sample precision evaluation and an analysis of extraction errors are provided along with the discussion of our extraction architecture. We furthermore measure the contribution of the features to the precision of the extraction: by using both morpho-syntactic and syntactic features, we achieve a higher precision in the identification of idiomatic MWEs, than by using only properties of one type.

pdf bib
The Second Evaluation Campaign of PASSAGE on Parsing of French
Patrick Paroubek | Olivier Hamon | Eric de La Clergerie | Cyril Grouin | Anne Vilnat

pdf bib
Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus
Kepa Joseba Rodríguez | Francesca Delogu | Yannick Versley | Egon W. Stemle | Massimo Poesio

The Live Memories corpus is an Italian corpus annotated for anaphoric relations. This annotation effort aims to contribute to two significant issues for the CL research: the lack of annotated anaphoric resources for Italian and the increasing interest for the social Web. The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users' comments. It is planned to add a set of articles of local news papers. The corpus includes manual annotated information about morphosyntactic agreement, anaphoricity, and semantic class of the NPs. The anaphoric annotation includes discourse deixis, bridging relations and markes cases of ambiguity with the annotation of alternative interpretations. For the annotation of the anaphoric links the corpus takes into account specific phenomena of the Italian language like incorporated clitics and phonetically non realized pronouns. Reliability studies for the annotation of the mentioned phenomena and for annotation of anaphoric links in general offer satisfactory results. The Wikipedia and blogs dataset will be distributed under Creative Commons Attributions licence.

pdf bib
WikiWoods: Syntacto-Semantic Annotation for English Wikipedia
Dan Flickinger | Stephan Oepen | Gisle Ytrestøl

WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for English Wikipedia. We sketch an automated processing pipeline to extract relevant textual content from Wikipedia sources, segment documents into sentence-like units, parse and disambiguate using a broad-coverage precision grammar, and support the export of syntactic and semantic information in various formats. The full parsed corpus is accompanied by a subset of Wikipedia articles for which gold-standard annotations in the same format were produced manually. This subset was selected to represent a coherent domain, Wikipedia entries on the broad topic of Natural Language Processing.

pdf bib
Speaker Attribution in Cabinet Protocols
Josef Ruppenhofer | Caroline Sporleder | Fabian Shirokov

Historical cabinet protocols are a useful resource which enable historians to identify the opinions expressed by politicians on different subjects and at different points of time. While cabinet protocols are often available in digitized form, so far the only method to access their information content is by keyword-based search, which often returns sub-optimal results. We present a method for enriching German cabinet protocols with information about the originators of statements. This requires automatic speaker attribution. Unlike many other approaches, our method can also deal with cases in which the speaker is not explicitly identified in the sentence itself. Such cases are very common in our domain. To avoid costly manual annotation of training data, we design a rule-based system which exploits morpho-syntactic cues. We show that such a system obtains good results, especially with respect to recall which is particularly important for information access.

pdf bib
Wizard of Oz Experiments for a Companion Dialogue System: Eliciting Companionable Conversation
Nick Webb | David Benyon | Jay Bradley | Preben Hansen | Oil Mival

Within the EU-funded COMPANIONS project, we are working to evaluate new collaborative conversational models of dialogue. Such an evaluation requires us to benchmark approaches to companionable dialogue. In order to determine the impact of system strategies on our evaluation paradigm, we need to generate a range of companionable conversations, using dialogue strategies such as `empathy' and `positivity'. By companionable dialogue, we mean interactions that take user input of some scenario, and respond in a manner appropriate to the emotional content of the user utterance. In this paper, we describe our working Wizard of Oz (WoZ) system for systematically creating dialogues that fulfil these potential strategies, and enables us to deploy a range of potential techniques for selecting which parts of user input to address is which order, to inform the wizard response to the user based on a manual, on-the-fly assessment of the polarity of the user input.

pdf bib
A Database for the Exploration of Spanish Planning
Carlos Gómez Gallo | T. Florian Jaeger | Katrina Furth

We describe a new task-based corpus in the Spanish language. The corpus consists of videos, transcripts, and annotations of the inter- action between a naive speaker and a confederate listener. The speaker instructs the listener to MOVE, ROTATE, or PAINT objects on a computer screen. This resource can be used to study how participants produce instructions in a collaborative goal-oriented scenario, in Spanish. The data set is ideally suited for investigating incremental processes of the production and interpretation of language. We demonstrate here how to use this corpus to explore language-specific differences in utterance planning, for English and Spanish speakers.

pdf bib
Cognitive Evaluation Approach for a Controlled Language Post-Editing Experiment
Irina Temnikova

In emergency situations it is crucial that instructions are straightforward to understand. For this reason a controlled language for crisis management (CLCM), based on psycholinguistic studies of human comprehension under stress, was developed. In order to test the impact of CLCM machine translatability of this particular kind of sub-language text, a previous experiment involving machine translation and human post-editing has been conducted. Employing two automatic evaluation metrics, a previous evaluation of the experiment has proved that instructions written according to this CL can improve machine translation (MT) performance. This paper presents a new cognitive evaluation approach for MT post-editing, which is tested on the previous controlled and uncontrolled textual data. The presented evaluation approach allows a deeper look into the post-editing process and specifically how much effort post-editors put into correcting the different kinds of MT errors. The method is based on existing MT error classification, which is enriched with a new error ranking motivated by the cognitive effort involved in the detection and correction of these MT errors. The preliminary results of applying this approach to a subset of the original data confirmed once again the positive impact of CLCM on emergency instructions' machine translatability and thus the validity of the approach.

pdf bib
Work on Spoken (Multimodal) Language Corpora in South Africa
Jens Allwood | Harald Hammarström | Andries Hendrikse | Mtholeni N. Ngcobo | Nozibele Nomdebevana | Laurette Pretorius | Mac van der Merwe

This paper describes past, ongoing and planned work on the collection and transcription of spoken language samples for all the South African official languages and as part of this the training of researchers in corpus linguistic research skills. More specifically the work has involved (and still involves) establishing an international corpus linguistic network linked to a network hub at a UNISA website and the development of research tools, a corpus research guide and workbook for multimodal communication and spoken language corpus research. As an example of the work we are doing and hope to do more of in the future, we present a small pilot study of the influence of English and Afrikaans on the 100 most frequent words in spoken Xhosa as this is evidenced in the corpus of spoken interaction we have gathered so far. Other planned work, besides work on spoken language phenomena, involves comparison of spoken and written language and work on communicative body movements (gestures) and their relation to speech.

pdf bib
Using NLP Methods for the Analysis of Rituals
Nils Reiter | Oliver Hellwig | Anand Mishra | Anette Frank | Jens Burkhardt

This paper gives an overview of an interdisciplinary research project that is concerned with the application of computational linguistics methods to the analysis of the structure and variance of rituals, as investigated in ritual science. We present motivation and prospects of a computational approach to ritual research, and explain the choice of specific analysis techniques. We discuss design decisions for data collection and processing and present the general NLP architecture. For the analysis of ritual descriptions, we apply the frame semantics paradigm with newly invented frames where appropriate. Using scientific ritual research literature, we experimented with several techniques of automatic extraction of domain terms for the domain of rituals. As ritual research is a highly interdisciplinary endeavour, a vocabulary common to all sub-areas of ritual research can is hard to specify and highly controversial. The domain terms extracted from ritual research literature are used as a basis for a common vocabulary and thus help the creation of ritual specific frames. We applied the tf*idf, 2 and PageRank algorithm to our ritual research literature corpus and two non-domain corpora: The British National Corpus and the British Academic Written English corpus. All corpora have been part of speech tagged and lemmatized. The domain terms have been evaluated by two ritual experts independently. Interestingly, the results of the algorithms were different for different parts of speech. This finding is in line with the fact that the inter-annotator agreement also differs between parts of speech.

pdf bib
Error Correction for Arabic Dictionary Lookup
C. Anton Rytting | Paul Rodrigues | Tim Buckwalter | David Zajic | Bridget Hirsch | Jeff Carnes | Nathanael Lynn | Sarah Wayland | Chris Taylor | Jason White | Charles Blake III | Evelyn Browne | Corey Miller | Tristan Purvis

We describe a new Arabic spelling correction system which is intended for use with electronic dictionary search by learners of Arabic. Unlike other spelling correction systems, this system does not depend on a corpus of attested student errors but on student- and teacher-generated ratings of confusable pairs of phonemes or letters. Separate error modules for keyboard mistypings, phonetic confusions, and dialectal confusions are combined to create a weighted finite-state transducer that calculates the likelihood that an input string could correspond to each citation form in a dictionary of Iraqi Arabic. Results are ranked by the estimated likelihood that a citation form could be misheard, mistyped, or mistranscribed for the input given by the user. To evaluate the system, we developed a noisy-channel model trained on students’ speech errors and use it to perturb citation forms from a dictionary. We compare our system to a baseline based on Levenshtein distance and find that, when evaluated on single-error queries, our system performs 28% better than the baseline (overall MRR) and is twice as good at returning the correct dictionary form as the top-ranked result. We believe this to be the first spelling correction system designed for a spoken, colloquial dialect of Arabic.

pdf bib
Evaluating Complex Semantic Artifacts
Christopher R Walker | Hannah Copperman

Evaluating complex Natural Language Processing (NLP) systems can prove extremely difficult. In many cases, the best one can do is to evaluate these systems indirectly, by looking at the impact they have on the performance of the downstream use case. For complex end-to-end systems, these metrics are not always enlightening, especially from the perspective of NLP failure analysis, as component interaction can obscure issues specific to the NLP technology. We present an evaluation program for complex NLP systems designed to produce meaningful aggregate accuracy metrics with sufficient granularity to support active development by NLP specialists. Our goals were threefold: to produce reliable metrics, to produce useful metrics and to produce actionable data. Our use case is a graph-based Wikipedia search index. Since the evaluation of a complex graph structure is beyond the conceptual grasp of a single human judge, the problem needs to be broken down. Slices of complex data reflective of coherent Decision Points provide a good framework for evaluation using human judges (Medero et al., 2006). For NL semantics, there really is no substitute. Leveraging Decision Points allows complex semantic artifacts to be tracked with judge-driven evaluations that are accurate, timely and actionable.

pdf bib
Morphological Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach
Mohamed Altantawy | Nizar Habash | Owen Rambow | Ibrahim Saleh

MAGEAD is a morphological analyzer and generator for Modern Standard Arabic (MSA) and its dialects. We introduced MAGEAD in previous work with an implementation of MSA and Levantine Arabic verbs. In this paper, we port that system to MSA nominals (nouns and adjectives), which are far more complex to model than verbs. Our system is a functional morphological analyzer and generator, i.e., it analyzes to and generates from a representation consisting of a lexeme and linguistic feature-value pairs, where the features are syntactically (and perhaps semantically) meaningful, rather than just morphologically. A detailed evaluation of the current implementation comparing it to a commonly used morphological analyzer shows that it has good morphological coverage with precision and recall scores in the 90s. An error analysis reveals that the majority of recall and precision errors are problems in the gold standard or a result of the discrepancy between different models of form-based/functional morphology.

pdf bib
Using Comparable Corpora to Adapt a Translation Model to Domains
Hiroyuki Kaji | Takashi Tsunakawa | Daisuke Okada

Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains. To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the words associated with a source-language word, presently restricted to a noun, and its translations; word translation pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes an optimization of the parameters and an extension to estimate translation pseudo-probabilities for verbs.

pdf bib
Fred’s Reusable Evaluation Device: Providing Support for Quick and Reliable Linguistic Annotation
Hannah Copperman | Christopher R. Walker

This paper describes an interface that was developed for processing large amounts of human judgments of linguistically annotated data. Fred’s Reusable Evaluation Device (“Fred”) provides administrators with a tool to submit linguistic evaluation tasks to judges. Each evaluation task is then presented to exactly two judges, who can submit their judgments at their own leisure. Fred then provides several metrics to administrators. The most important metric is precision, which is provided for each evaluation task and each annotator. Administrators can look at precision for a given data set over time, as well as by evaluation type, data set, or annotator. Inter-annotator agreement is also reported, and that can be tracked over time as well. The interface was developed to provide a tool for evaluating semantically marked up text. The types of evaluations Fred has been used for so far include things like correctness of subject-relation identification, and correctness of temporal relations. However, Fred’s full versatility has not yet been fully exploited.

pdf bib
The Creation of a Large-Scale LFG-Based Gold Parsebank
Alexis Baird | Christopher R. Walker

Systems for syntactically parsing sentences have long been recognized as a priority in Natural Language Processing. Statistics-based systems require large amounts of high quality syntactically parsed data. Using the XLE toolkit developed at PARC and the LFG Parsebanker interface developed at Bergen, the Parsebank Project at Powerset has generated a rapidly increasing volume of syntactically parsed data. By using these tools, we are able to leverage the LFG framework to provide richer analyses via both constituent (c-) and functional (f-) structures. Additionally, the Parsebanking Project uses source data from Wikipedia rather than source data limited to a specific genre, such as the Wall Street Journal. This paper outlines the process we used in creating a large-scale LFG-Based Parsebank to address many of the shortcomings of previously-created parse banks such as the Penn Treebank. While the Parsebank corpus is still in progress, preliminary results using the data in a variety of contexts already show promise.

pdf bib
A Modality Lexicon and its use in Automatic Tagging
Kathryn Baker | Michael Bloodgood | Bonnie Dorr | Nathaniel W. Filardo | Lori Levin | Christine Piatko

This paper describes our resource-building results for an eight-week JHU Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically-Informed Machine Translation. Specifically, we describe the construction of a modality annotation scheme, a modality lexicon, and two automated modality taggers that were built using the lexicon and annotation scheme. Our annotation scheme is based on identifying three components of modality: a trigger, a target and a holder. We describe how our modality lexicon was produced semi-automatically, expanding from an initial hand-selected list of modality trigger words and phrases. The resulting expanded modality lexicon is being made publicly available. We demonstrate that one tagger―a structure-based tagger―results in precision around 86% (depending on genre) for tagging of a standard LDC data set. In a machine translation application, using the structure-based tagger to annotate English modalities on an English-Urdu training corpus improved the translation quality score for Urdu by 0.3 Bleu points in the face of sparse training data.

pdf bib
The ConceptMapper Approach to Named Entity Recognition
Michael Tanenblatt | Anni Coden | Igor Sominsky

ConceptMapper is an open source tool we created for classifying mentions in an unstructured text document based on concept terminologies (dictionaries) and yielding named entities as output. It is implemented as a UIMA (Unstructured Information Management Architecture) annotator and is highly configurable: concepts can come from standardised or proprietary terminologies; arbitrary attributes can be associated with dictionary entries, and those attributes can then be associated with the named entities in the output; numerous search strategies and search options can be specified; any tokenizer packaged as a UIMA annotator can be used to tokenize the dictionary, so the same tokenization can be guaranteed for the input and dictionary, minimising tokenization mismatch errors; and the types and features of UIMA annotations used as input and generated as output can also be controlled. We describe ConceptMapper and its configuration parameters and their trade-offs, then describe the results of an experiment wherein some of these parameters are varied and precision and recall are subsequently measured in the task of in identifying concepts in a collection English-language clinical reports (colon cancer pathology). ConceptMapper is available from the Apache UIMA Sandbox, covered by the Apache Open Source license.

pdf bib
LAF/GrAF-grounded Representation of Dependency Structures
Yoshihiko Hayashi | Thierry Declerck | Chiharu Narawa

This paper shows that a LAF/GrAF-based annotation schema can be used for the adequate representation of syntactic dependency structures possibly in many languages. We first argue that there are at least two types of textual units that can be annotated with dependency information: words/tokens and chunks/phrases. We especially focus on importance of the latter dependency unit: it is particularly useful for representing Japanese dependency structures, known as Kakari-Uke structure. Based on this consideration, we then discuss a sub-typing of GrAF to represent the corresponding dependency structures. We derive three node types, two edge types, and the associated constraints for properly representing both the token-based and the chunk-based dependency structures. We finally propose a wrapper program that, as a proof of concept, converts output data from different dependency parsers in proprietary XML formats to the GrAF-compliant XML representation. It partially proves the value of an international standard like LAF/GrAF in the Web service context: an existing dependency parser can be, in a sense, standardized, once wrapped by a data format conversion process.

pdf bib
Tag Dictionaries Accelerate Manual Annotation
Marc Carmen | Paul Felt | Robbie Haertel | Deryle Lonsdale | Peter McClanahan | Owen Merkling | Eric Ringger | Kevin Seppi

Expert human input can contribute in various ways to facilitate automatic annotation of natural language text. For example, a part-of-speech tagger can be trained on labeled input provided offline by experts. In addition, expert input can be solicited by way of active learning to make the most of annotator expertise. However, hiring individuals to perform manual annotation is costly both in terms of money and time. This paper reports on a user study that was performed to determine the degree of effect that a part-of-speech dictionary has on a group of subjects performing the annotation task. The user study was conducted using a modular, web-based interface created specifically for text annotation tasks. The user study found that for both native and non-native English speakers a dictionary with greater than 60% coverage was effective at reducing annotation time and increasing annotator accuracy. On the basis of this study, we predict that using a part-of-speech tag dictionary with coverage greater than 60% can reduce the cost of annotation in terms of both time and money.

pdf bib
Learning Language Identification Models: A Comparative Analysis of the Distinctive Features of Names and Common Words
Stasinos Konstantopoulos

The intuition and basic hypothesis that this paper explores is that names are more characteristic of their language than common words are, and that a single name can have enough clues to confidently identify its language where random text of the same length wouldn't. To test this hypothesis, n-gramm modelling is used to learn language models which identify the language of isolated names and equally short document fragments. As the empirical results corroborate the prior intuition, an explanation is sought for the higher accuracy at which the language of names can be identified. The results of the application of these models, as well as the models themselves, are quantitatively and qualitatively analysed and a hypothesis is formed about the explanation of this difference. The conclusions derived are both technologically useful in information extraction or text-to-speech tasks, and theoretically interesting as a tool for improving our understanding of the morphology and phonology of the languages involved in the experiments.

pdf bib
Feasibility of Automatically Bootstrapping a Persian WordNet
Chris Irwin Davis | Dan Moldovan

In this paper we describe a proof-of-concept for the bootstrapping of a Persian WordNet. This effort was motivated by previous work done at Stanford University on bootstrapping an Arabic WordNet using a parallel corpus and an English WordNet. The principle of that work is based on the premise that paradigmatic relations are by nature deeply semantic, and as such, are likely to remain intact between languages. We performed our task on a Persian-English bilingual corpus of George Orwell’s Nineteen Eighty-Four. The corpus was neither aligned nor sense tagged, so it was necessary that these were undertaken first. A combination of manual and semiautomated methods were used to tag and sentence align the corpus. Actual mapping of English word senses onto Persian was done using automated techniques. Although Persian is written in Arabic script, it is an Indo-European language, while Arabic is a Central Semitic language. Despite their linguistic differences, we endeavor to test the applicability of the Stanford strategy to our task.

pdf bib
The South African Human Language Technologies Audit
Aditi Sharma Grover | Gerhard B. van Huyssteen | Marthinus W. Pretorius

Human language technologies (HLT) can play a vital role in bridging the digital divide and thus the HLT field has been recognised as a priority area by the South African government. We present our work on conducting a technology audit on the South African HLT landscape across the country’s eleven official languages. The process and the instruments employed in conducting the audit are described and an overview of the various complementary approaches used in the results’ analysis is provided. We find that a number of HLT language resources (LRs) are available in SA but they are of a very basic and exploratory nature. Lessons learnt in conducting a technology audit in a young and multilingual context are also discussed.

pdf bib
BabyExp: Constructing a Huge Multimodal Resource to Acquire Commonsense Knowledge Like Children Do
Massimo Poesio | Marco Baroni | Oswald Lanz | Alessandro Lenci | Alexandros Potamianos | Hinrich Schütze | Sabine Schulte im Walde | Luca Surian

There is by now widespread agreement that the most realistic way to construct the large-scale commonsense knowledge repositories required by natural language and artificial intelligence applications is by letting machines learn such knowledge from large quantities of data, like humans do. A lot of attention has consequently been paid to the development of increasingly sophisticated machine learning algorithms for knowledge extraction. However, the nature of the input that humans are exposed to while learning commonsense knowledge has received much less attention. The BabyExp project is collecting very dense audio and video recordings of the first 3 years of life of a baby. The corpus constructed in this way will be transcribed with automated techniques and made available to the research community. Moreover, techniques to extract commonsense conceptual knowledge incrementally from these multimodal data are also being explored within the project. The current paper describes BabyExp in general, and presents pilot studies on the feasibility of the automated audio and video transcriptions.

pdf bib
TTS Evaluation Campaign with a Common Spanish Database
Iñaki Sainz | Eva Navas | Inma Hernáez | Antonio Bonafonte | Francisco Campillo

This paper describes the first TTS evaluation campaign designed for Spanish. Seven research institutions took part in the evaluation campaign and developed a voice from a common speech database provided by the organisation. Each participating team had a period of seven weeks to generate a voice. Next, a set of sentences were released and each team had to synthesise them within a week period. Finally, some of the synthesised test audio files were subjectively evaluated via an online test according to the following criteria: similarity to the original voice, naturalness and intelligibility. Box-plots, Wilcoxon tests and WER have been generated in order to analyse the results. Two main conclusions can be drawn: On the one hand, there is considerable margin for improvement to reach the quality level of the natural voice. On the other hand, two systems get significantly better results than the rest: one is based on statistical parametric synthesis and the other one is a concatenative system that makes use of a sinusoidal model to modify both prosody and smooth spectral joints. Therefore, it seems that some kind of spectral control is needed when building voices with a medium size database for unrestricted domains.

pdf bib
Experiments in Human-computer Cooperation for the Semantic Annotation of Portuguese Corpora
Diana Santos | Cristina Mota

In this paper, we present a system to aid human annotation of semantic information in the scope of the project AC/DC, called corte-e-costura. This system leverages on the human annotation effort, by providing the annotator with a simple system that applies rules incrementally. Our goal was twofold: first, to develop an easy-to-use system that required a minimum of learning from the part of the linguist; second, one that provided a straightforward way of checking the results obtained, in order to immediately evaluate the results of the rules devised. After explaining the motivation for its development from scratch, we present the current status of the AC/DC project and provide a quantitative description of its material in what concerns semantic annotation. We then present the corte-e-costura system in detail, providing the result of our first experiments with the semantic fields of colour and clothing. We end the paper with some discussion of future work as well as of the experience gained.

pdf bib
Language Modeling Approach for Retrieving Passages in Lecture Audio Data
Koichiro Honda | Tomoyosi Akiba

Spoken Document Retrieval (SDR) is a promising technology for enhancing the utility of spoken materials. After the spoken documents have been transcribed by using a Large Vocabulary Continuous Speech Recognition (LVCSR) decoder, a text-based ad hoc retrieval method can be applied directly to the transcribed documents. However, recognition errors will significantly degrade the retrieval performance. To address this problem, we have previously proposed a method that aimed to fill the gap between automatically transcribed text and correctly transcribed text by using a statistical translation technique. In this paper, we extend the method by (1) using neighboring context to index the target passage, and (2) applying a language modeling approach for document retrieval. Our experimental evaluation shows that context information can improve retrieval performance, and that the language modeling approach is effective in incorporating context information into the proposed SDR method, which uses a translation model.

pdf bib
Evaluating Multilingual Question Answering Systems at CLEF
Pamela Forner | Danilo Giampiccolo | Bernardo Magnini | Anselmo Peñas | Álvaro Rodrigo | Richard Sutcliffe

The paper offers an overview of the key issues raised during the seven years’ activity of the Multilingual Question Answering Track at the Cross Language Evaluation Forum (CLEF). The general aim of the Multilingual Question Answering Track has been to test both monolingual and cross-language Question Answering (QA) systems that process queries and documents in several European languages, also drawing attention to a number of challenging issues for research in multilingual QA. The paper gives a brief description of how the task has evolved over the years and of the way in which the data sets have been created, presenting also a brief summary of the different types of questions developed. The document collections adopted in the competitions are sketched as well, and some data about the participation are provided. Moreover, the main evaluation measures used to evaluate system performances are explained and an overall analysis of the results achieved is presented.

pdf bib
Hungarian Dependency Treebank
Veronika Vincze | Dóra Szauter | Attila Almási | György Móra | Zoltán Alexin | János Csirik

Herein, we present the process of developing the first Hungarian Dependency TreeBank. First, short references are made to dependency grammars we considered important in the development of our Treebank. Second, mention is made of existing dependency corpora for other languages. Third, we present the steps of converting the Szeged Treebank into dependency-tree format: from the originally phrase-structured treebank, we produced dependency trees by automatic conversion, checked and corrected them thereby creating the first manually annotated dependency corpus for Hungarian. We also go into detail about the two major sets of problems, i.e. coordination and predicative nouns and adjectives. Fourth, we give statistics on the treebank: by now, we have completed the annotation of business news, newspaper articles, legal texts and texts in informatics, at the same time, we are planning to convert the entire corpus into dependency tree format. Finally, we give some hints on the applicability of the system: the present database may be utilized ― among others ― in information extraction and machine translation as well.

pdf bib
Generic Ontology Learners on Application Domains
Francesca Fallucchi | Maria Teresa Pazienza | Fabio Massimo Zanzotto

In ontology learning from texts, we have ontology-rich domains where we have large structured domain knowledge repositories or we have large general corpora with large general structured knowledge repositories such as WordNet (Miller, 1995). Ontology learning methods are more useful in ontology-poor domains. Yet, in these conditions, these methods have not a particularly high performance as training material is not sufficient. In this paper we present an LSP ontology learning method that can exploit models learned from a generic domain to extract new information in a specific domain. In our model, we firstly learn a model from training data and then we use the learned model to discover knowledge in a specific domain. We tested our model adaptation strategy using a background domain that is applied to learn the isa networks in the Earth Observation Domain as a specific domain. We will demonstrate that our method captures domain knowledge better than other generic models: our model better captures what is expected by domain experts than a baseline method based only on WordNet. This latter is better correlated with non-domain annotators asked to produce the ontology for the specific domain.

pdf bib
Senso Comune
Alessandro Oltramari | Guido Vetere | Maurizio Lenzerini | Aldo Gangemi | Nicola Guarino

This paper introduces the general features of Senso Comune, an open knowledge base for the Italian language, focusing on the interplay of lexical and ontological knowledge, and outlining our approach to conceptual knowledge elicitation. Senso Comune consists of a machine-readable lexicon constrained by an ontological infrastructure. The idea at the basis of Senso Comune is that natural languages exist in use, and they belong to their users. In the line of Saussure's linguistics, natural languages are seen as a social product and their main strength relies on the users’ consensus. At the same time, language has specific goals: i.e. referring to entities that belong to the users’ world (be it physical or not) and that are made up in social environments where expressions are produced and understood. This usage leverages the creativity of those who produce words and try to understand them. This is the reason why ontology, i.e. a shared conceptualization of the world, can be regarded to as the soil on which the speakers' consensus may be rooted. Some final remarks concerning future work and applications are also given.

pdf bib
Combining Resources: Taxonomy Extraction from Multiple Dictionaries
Rogelio Nazar | Maarten Janssen

The idea that dictionaries are a good source for (computational) information has been around for a long while, and the extraction of taxonomic information from them is something that has been attempted several times. However, such information extraction was typically based on the systematic analysis of the text of a single dictionary. In this paper, we demonstrate how it is possible to extract taxonomic information without any analysis of the specific text, by comparing the same lexical entry in a number of different dictionaries. Counting word frequencies in the dictionary entry for the same word in different dictionaries leads to a surprisingly good recovery of taxonomic information, without the need for any syntactic analysis of the entries in question nor any kind of language-specific treatment. As a case in point, we will show in this paper an experiment extracting hyperonymy relations from several Spanish dictionaries, measuring the effect that the different number of dictionaries have on the results.

pdf bib
POS Multi-tagging Based on Combined Models
Yan Zhao | Gertjan van Noord

In the POS tagging task, there are two kinds of statistical models: one is generative model, such as the HMM, the others are discriminative models, such as the Maximum Entropy Model (MEM). POS multi-tagging decoding method includes the N-best paths method and forward-backward method. In this paper, we use the forward-backward decoding method based on a combined model of HMM and MEM. If P(t) is the forward-backward probability of each possible tag t, we first calculate P(t) according HMM and MEM separately. For all tags options in a certain position in a sentence, we normalize P(t) in HMM and MEM separately. Probability of the combined model is the sum of normalized forward-backward probabilities P norm(t) in HMM and MEM. For each word w, we select the best tag in which the probability of combined model is the highest. In the experiments, we use combined model and get higher accuracy than any single model on POS tagging tasks of three languages, which are Chinese, English and Dutch. The result indicates that our combined model is effective.

pdf bib
AhoTransf: A Tool for Multiband Excitation Based Speech Analysis and Modification
Ibon Saratxaga | Inmaculada Hernáez | Eva Navas | Iñaki Sainz | Iker Luengo | Jon Sánchez | Igor Odriozola | Daniel Erro

In this paper we present AhoTransf, a tool that enables analysis, visualization, modification and synthesis of speech. AhoTransf integrates a speech signal analysis model with a graphical user interface to allow visualization and modification of the parameters of the model. The synthesis capability allows hearing the modified signal thus providing a quick way to understand the perceptual effect of the changes in the parameters of the model. The speech analysis/synthesis algorithm is based in the Multiband Excitation technique, but uses a novel phase information representation the Relative Phase Shift (RPS’s). With this representation, not only the amplitudes but also the phases of the harmonic components of the speech signal reveal their structured patterns in the visualization tool. AhoTransf is modularly conceived so that it can be used with different harmonic speech models.

pdf bib
Identifying Paraphrases between Technical and Lay Corpora
Louise Deléger | Pierre Zweigenbaum

In previous work, we presented a preliminary study to identify paraphrases between technical and lay discourse types from medical corpora dedicated to the French language. In this paper, we test the hypothesis that the same kinds of paraphrases as for French can be detected between English technical and lay discourse types and report the adaptation of our method from French to English. Starting from the constitution of monolingual comparable corpora, we extract two kinds of paraphrases: paraphrases between nominalizations and verbal constructions and paraphrases between neo-classical compounds and modern-language phrases. We do this relying on morphological resources and a set of extraction rules we adapt from the original approach for French. Results show that paraphrases could be identified with a rather good precision, and that these types of paraphrase are relevant in the context of the opposition between technical and lay discourse types. These observations are consistent with the results obtained for French, which demonstrates the portability of the approach as well as the similarity of the two languages as regards the use of those kinds of expressions in technical and lay discourse types.

pdf bib
Heterogeneous Sensor Database in Support of Human Behaviour Analysis in Unrestricted Environments: The Audio Part
Stavros Ntalampiras | Todor Ganchev | Ilyas Potamitis | Nikos Fakotakis

In the present paper we report on a recent effort that resulted in the establishment of a unique multimodal database, referred to as the PROMETHEUS database. This database was created in support of research and development activities, performed within the European Commission FP7 PROMETHEUS project, aiming at the creation of a framework for monitoring and interpretation of human behaviours in unrestricted indoors and outdoors environments. In the present paper we discuss the design and the implementation of the audio part of the database and offer statistical information about the audio content. Specifically, it contains single-person and multi-person scenarios, but also covers scenarios with interactions between groups of people. The database design was conceived with extended support of research and development activities devoted to detection of typical and atypical events, emergency and crisis situations, which assist for achieving situational awareness and more reliable interpretation of the context in which humans behave. The PROMETHEUS database allows for embracing a wide range of real-world applications, including smart-home and human-robot interaction interfaces, indoors/outdoors public areas surveillance, airport terminals or city park supervision, etc. A major portion of the PROMETHEUS database will be made publically available by the end of year 2010.

pdf bib
A Game-based Approach to Transcribing Images of Text
Khalil Dahab | Anja Belz

Creating language resources is expensive and time-consuming, and this forms a bottleneck in the development of language technology, for less-studied non-European languages in particular. The recent internet phenomenon of crowd-sourcing offers a cost-effective and potentially fast way of overcoming such language resource acquisition bottlenecks. We present a methodology that takes as its input scanned documents of typed or hand-written text, and produces transcriptions of the text as its output. Instead of using Optical Character Recognition (OCR) technology, the methodology is game-based and produces such transcriptions as a by-product. The approach is intended particularly for languages for which language technology and resources are scarce and reliable OCR technology may not exist. It can be used in place of OCR for transcribing individual documents, or to create corpora of paired images and transcriptions required to train OCR tools. We present Minefield, a prototype implementation of the approach which is currently collecting Arabic transcriptions.

pdf bib
The RODRIGO Database
Nicolas Serrano | Francisco Castro | Alfons Juan

Annotation of digitized pages from historical document collections is very important to research on automatic extraction of text blocks, lines, and handwriting recognition. We have recently introduced a new handwritten text database, GERMANA, which is based on a Spanish manuscript from 1891. To our knowledge, GERMANA is the first publicly available database mostly written in Spanish and comparable in size to standard databases. In this paper, we present another handwritten text database, RODRIGO, completely written in Spanish and comparable in size to GERMANA. However, RODRIGO comes from a much older manuscript, from 1545, where the typical difficult characteristics of historical documents are more evident. In particular, the writing style, which has clear Gothic influences, is significantly more complex than that of GERMANA. We also provide baseline results of handwriting recognition for reference in future studies, using standard techniques and tools for preprocessing, feature extraction, HMM-based image modelling, and language modelling.

pdf bib
Building Textual Entailment Specialized Data Sets: a Methodology for Isolating Linguistic Phenomena Relevant to Inference
Luisa Bentivogli | Elena Cabrio | Ido Dagan | Danilo Giampiccolo | Medea Lo Leggio | Bernardo Magnini

This paper proposes a methodology for the creation of specialized data sets for Textual Entailment, made of monothematic Text-Hypothesis pairs (i.e. pairs in which only one linguistic phenomenon relevant to the entailment relation is highlighted and isolated). The expected benefits derive from the intuition that investigating the linguistic phenomena separately, i.e. decomposing the complexity of the TE problem, would yield an improvement in the development of specific strategies to cope with them. The annotation procedure assumes that humans have knowledge about the linguistic phenomena relevant to inference, and a classification of such phenomena both into fine grained and macro categories is suggested. We experimented with the proposed methodology over a sample of pairs taken from the RTE-5 data set, and investigated critical issues arising when entailment, contradiction or unknown pairs are considered. The result is a new resource, which can be profitably used both to advance the comprehension of the linguistic phenomena relevant to entailment judgments and to make a first step towards the creation of large-scale specialized data sets.

pdf bib
The Leeds Arabic Discourse Treebank: Annotating Discourse Connectives for Arabic
Amal Al-Saif | Katja Markert

We present the first effort towards producing an Arabic Discourse Treebank,a news corpus where all discourse connectives are identified and annotated with the discourse relations they convey as well as with the two arguments they relate. We discuss our collection of Arabic discourse connectives as well as principles for identifying and annotating them in context, taking into account properties specific to Arabic. In particular, we deal with the fact that Arabic has a rich morphology: we therefore include clitics as connectives as well as a wide range of nominalizations as potential arguments. We present a dedicated discourse annotation tool for Arabic and a large-scale annotation study. We show that both the human identification of discourse connectives and the determination of the discourse relations they convey is reliable. Our current annotated corpus encompasses a final 5651 annotated discourse connectives in 537 news texts. In future, we will release the annotated corpus to other researchers and use it for training and testing automated methods for discourse connective and relation recognition.

pdf bib
On the Role of Discourse Markers in Interactive Spoken Question Answering Systems
Ioana Vasilescu | Sophie Rosset | Martine Adda-Decker

This paper presents a preliminary analysis of the role of some discourse markers and the vocalic hesitation ""euh"" in a corpus of spoken human utterances collected with the Ritel system, an open domain and spoken dialog system. The frequency and contextual combinatory of classical discourse markers and of the vocalic hesitation have been studied. This analysis pointed out some specificity in terms of combinatory of the analyzed items. The classical discourse markers seem to help initiating larger discursive blocks both at initial and medial positions of the on-going turns. The vocalic hesitation stand also for marking the user's embarrassments and wish to close the dialog.

pdf bib
CINEMO — A French Spoken Language Resource for Complex Emotions: Facts and Baselines
Björn Schuller | Riccardo Zaccarelli | Nicolas Rollet | Laurence Devillers

The CINEMO corpus of French emotional speech provides a richly annotated resource to help overcome the apparent lack of learning and testing speech material for complex, i.e. blended or mixed emotions. The protocol for its collection was dubbing selected emotional scenes from French movies. 51 speakers are contained and the total speech time amounts to 2 hours and 13 minutes and 4k speech chunks after segmentation. Extensive labelling was carried out in 16 categories for major and minor emotions and in 6 continuous dimensions. In this contribution we give insight into the corpus statistics focusing in particular on the topic of complex emotions, and provide benchmark recognition results obtained in exemplary large feature space evaluations. In the result the labelling oft he collected speech clearly demonstrates that a complex handling of emotion seems needed. Further, the automatic recognition experiments provide evidence that the automatic recognition of blended emotions appears to be feasible.

pdf bib
New Features in Spoken Language Search Hawk (SpLaSH): Query Language and Query Sequence
Sara Romano | Francesco Cutugno

In this work we present further development of the SpLaSH (Spoken Language Search Hawk) project. SpLaSH implements a data model for annotated speech corpora integrated with textual markup (i.e. POS tagging, syntax, pragmatics) including a toolkit used to perform complex queries across speech and text labels. The integration of time aligned annotations (TMA), represented making use of Annotation Graphs, with text aligned ones (TXA), stored in generic XML files, are provided by a data structure, the Connector Frame, acting as table-look-up linking temporal data to words in the text. SpLaSH imposes a very limited number of constraints to the data model design, allowing the integration of annotations developed separately within the same dataset and without any relative dependency. It also provides a GUI allowing three types of queries: simple query on TXA or TMA structures, sequence query on TMA structure and cross query on both TXA and TMA integrated structures. In this work new SpLaSH features will be presented: SpLaSH Query Language (SpLaSHQL) and Query Sequence.

pdf bib
FrameNet Translation Using Bilingual Dictionaries with Evaluation on the English-French Pair
Claire Mouton | Gaël de Chalendar | Benoit Richert

Semantic Role Labeling cannot be performed without an associated linguistic resource. A key resource for such a task is the FrameNet resource based on Fillmore’s theory of frame semantics. Like many linguistic resources, FrameNet has been built by English native speakers for the English language. To overcome the lack of such resources in other languages, we propose a new approach to FrameNet translation by using bilingual dictionaries and filtering the wrong translations. We define six scores to filter, based on translation redundancy and FrameNet structure. We also present our work on the enrichment of the obtained resource with nouns. This enrichment uses semantic spaces built on syntactical dependencies and a multi-represented k-NN classifier. We evaluate both the tasks on the French language over a subset of ten frames and show improved results compared to the existing French FrameNet. Our final resource contains 15,132 associations lexical units-frames for an estimated precision of 86%.

pdf bib
Annotation Tool for Extended Textual Coreference and Bridging Anaphora
Jiří Mírovský | Petr Pajas | Anna Nedoluzhko

We present an annotation tool for the extended textual coreference and the bridging anaphora in the Prague Dependency Treebank 2.0 (PDT 2.0). After we very briefly describe the annotation scheme, we focus on details of the annotation process from the technical point of view. We present the way of helping the annotators by several useful features implemented in the annotation tool, such as a possibility to combine surface and deep syntactic representation of sentences during the annotation, an automatic maintaining of the coreferential chain, underlining candidates for antecedents, etc. For studying differences among parallel annotations, the tool offers a simultaneous depicting of several annotations of the same data. The annotation tool can be used for other corpora too, as long as they have been transformed to the PML format. We present modifications of the tool for working with the coreference relations on other layers of language description, namely on the analytical layer and the morphological layer of PDT.

pdf bib
A Resource for Investigating the Impact of Anaphora and Coreference on Inference.
Azad Abad | Luisa Bentivogli | Ido Dagan | Danilo Giampiccolo | Shachar Mirkin | Emanuele Pianta | Asher Stern

Discourse phenomena play a major role in text processing tasks. However, so far relatively little study has been devoted to the relevance of discourse phenomena for inference. Therefore, an experimental study was carried out to assess the relevance of anaphora and coreference for Textual Entailment (TE), a prominent inference framework. First, the annotation of anaphoric and coreferential links in the RTE-5 Search data set was performed according to a specifically designed annotation scheme. As a result, a new data set was created where all anaphora and coreference instances in the entailing sentences which are relevant to the entailment judgment are solved and annotated.. A by-product of the annotation is a new “augmented” data set, where all the referring expressions which need to be resolved in the entailing sentences are replaced by explicit expressions. Starting from the final output of the annotation, the actual impact of discourse phenomena on inference engines was investigated, identifying the kind of operations that the systems need to apply to address discourse phenomena and trying to find direct mappings between these operation and annotation types.

pdf bib
SentiWS - A Publicly Available German-language Resource for Sentiment Analysis
Robert Remus | Uwe Quasthoff | Gerhard Heyer

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative sentiment bearing words weighted within the interval of [-1; 1] plus their part of speech tag, and if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,650 negative and 1,818 positive words, which sum up to 16,406 positive and 16,328 negative word forms, respectively. It not only contains adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one. The present work describes the resource’s structure, the three sources utilised to assemble it and the semi-supervised method incorporated to weight the strength of its entries. Furthermore the resource’s contents are extensively evaluated using a German-language evaluation set we constructed. The evaluation set is verified being reliable and its shown that SentiWS provides a beneficial lexical resource for German-language sentiment analysis related tasks to build on.

pdf bib
Personal Sense and Idiolect: Combining Authorship Attribution and Opinion Analysis
Polina Panicheva | John Cardiff | Paolo Rosso

Subjectivity analysis and authorship attribution are very popular areas of research. However, work in these two areas has been done separately. We believe that by combining information about subjectivity in texts and authorship, the performance of both tasks can be improved. In the paper a personalized approach to opinion mining is presented, in which the notions of personal sense and idiolect are introduced; the approach is applied to the polarity classification task. It is assumed that different authors express their private states in text individually, and opinion mining results could be improved by analyzing texts by different authors separately. The hypothesis is tested on a corpus of movie reviews by ten authors. The results of applying the personalized approach to opinion mining are presented, confirming that the approach increases the performance of the opinion mining task. Automatic authorship attribution is further applied to model the personalized approach, classifying documents by their assumed authorship. Although the automatic authorship classification imposes a number of limitations on the dataset for further experiments, after overcoming these issues the authorship attribution technique modeling the personalized approach confirms the increase over the baseline with no authorship information used.

pdf bib
Automatic Annotation of Co-Occurrence Relations
Dirk Goldhahn | Uwe Quasthoff

We introduce a method for automatically labelling edges of word co-occurrence graphs with semantic relations. Therefore we only make use of training data already contained within the graph. Starting point of this work is a graph based on word co-occurrence of the German language, which is created by applying iterated co-occurrence analysis. The edges of the graph have been partially annotated by hand with semantic relationships. In our approach we make use of the commonly appearing network motif of three words forming a triangular pattern. We assume that the fully annotated occurrences of these structures contain information useful for our purpose. Based on these patterns rules for reasoning are learned. The obtained rules are then combined using Dempster-Shafer theory to infer new semantic relations between words. Iteration of the annotation process is possible to increase the number of obtained relations. By applying the described process the graph can be enriched with semantic information at a high precision.

pdf bib
Mapping between Dependency Structures and Compositional Semantic Representations
Max Jakob | Markéta Lopatková | Valia Kordoni

This paper investigates the mapping between two semantic formalisms, namely the tectogrammatical layer of the Prague Dependency Treebank 2.0 (PDT) and (Robust) Minimal Recursion Semantics ((R)MRS). It is a first attempt to relate the dependency-based annotation scheme of PDT to a compositional semantics approach like (R)MRS. A mapping algorithm that converts PDT trees to (R)MRS structures is developed, associating (R)MRSs to each node on the dependency tree. Furthermore, composition rules are formulated and the relation between dependency in PDT and semantic heads in (R)MRS is analyzed. It turns out that structure and dependencies, morphological categories and some coreferences can be preserved in the target structures. Moreover, valency and free modifications are distinguished using the valency dictionary of PDT as an additional resource. The validation results show that systematically correct underspecified target representations can be obtained by a rule-based mapping approach, which is an indicator that (R)MRS is indeed robust in relation to the formal representation of Czech data. This finding is novel, for Czech, with its free word order and rich morphology, is typologically different than languages analyzed with (R)MRS to date.

pdf bib
Semantic Feature Engineering for Enhancing Disambiguation Performance in Deep Linguistic Processing
Danielle Ben-Gera | Yi Zhang | Valia Kordoni

The task of parse disambiguation has gained in importance over the last decade as the complexity of grammars used in deep linguistic processing has been increasing. In this paper we propose to employ the fine-grained HPSG formalism in order to investigate the contribution of deeper linguistic knowledge to the task of ranking the different trees the parser outputs. In particular, we focus on the incorporation of semantic features in the disambiguation component and the stability of our model cross domains. Our work is carried out within DELPH-IN (, using the LinGo Redwoods and the WeScience corpora, parsed with the English Resource Grammar and the PET parser.

pdf bib
A Tool for Linking Stems and Conceptual Fragments to Enhance word Access
Nuria Gala | Véronique Rey | Michael Zock

Electronic dictionaries offer many possibilities unavailable in paper dictionaries to view, display or access information. However, even these resources fall short when it comes to access words sharing semantic features and certain aspects of form: few applications offer the possibility to access a word via a morphologically or semantically related word. In this paper, we present such an application, Polymots, a lexical database for contemporary French containing 20.000 words grouped in 2.000 families. The purpose of this resource is to group words into families on the basis of shared morpho-phonological and semantic information. Words with a common stem form a family; words in a family also share a set of common conceptual fragments (in some families there is a continuity of meaning, in others meaning is distributed). With this approach, we capitalize on the bidirectional link between semantics and morpho-phonology : the user can thus access words not only on the basis of ideas, but also on the basis of formal characteristics of the word, i.e. its morphological features. The resulting lexical database should help people learn French vocabulary and assist them to find words they are looking for, going thus beyond other existing lexical resources.

pdf bib
Disambiguating Compound Nouns for a Dynamic HPSG Treebank of Wall Street Journal Texts
Valia Kordoni | Yi Zhang

The aim of this paper is twofold. We focus, on the one hand, on the task of dynamically annotating English compound nouns, and on the other hand we propose disambiguation methods and techniques which facilitate the annotation task. Both the aforementioned are part of a larger on-going effort which aims to create HPSG annotation for the texts from theWall Street Journal (henceforward WSJ) sections of the Penn Treebank (henceforward PTB) with the help of a hand-written large-scale and wide-coverage grammar of English, the English Resource Grammar (henceforward ERG; Flickinger (2002)). As we show in this paper, such annotations are very rich linguistically, since apart from syntax they also incorporate semantics, which does not only ensure that the treebank is guaranteed to be a truly sharable, re-usable and multi-functional linguistic resource, but also calls for the necessity of a better disambiguation of the internal (syntactic) structure of larger units of words, such as compound nouns, since this has an impact on the representation of their meaning, which is of utmost interest if the linguistic annotation of a given corpus is to be further understood as the practice of adding interpretative linguistic information of the highest quality in order to give “added value” to the corpus.

pdf bib
Building a Cross-lingual Relatedness Thesaurus using a Graph Similarity Measure
Lukas Michelbacher | Florian Laws | Beate Dorow | Ulrich Heid | Hinrich Schütze

The Internet is an ever growing source of information stored in documents of different languages. Hence, cross-lingual resources are needed for more and more NLP applications. This paper presents (i) a graph-based method for creating one such resource and (ii) a resource created using the method, a cross-lingual relatedness thesaurus. Given a word in one language, the thesaurus suggests words in a second language that are semantically related. The method requires two monolingual corpora and a basic dictionary. Our general approach is to build two monolingual word graphs, with nodes representing words and edges representing linguistic relations between words. A bilingual dictionary containing basic vocabulary provides seed translations relating nodes from both graphs. We then use an inter-graph node-similarity algorithm to discover related words. Evaluation with three human judges revealed that 49% of the English and 57% of the German words discovered by our method are semantically related to the target words. We publish two resources in conjunction with this paper. First, noun coordinations extracted from the German and English Wikipedias. Second, the cross-lingual relatedness thesaurus which can be used in experiments involving interactive cross-lingual query expansion.

pdf bib
Extending BART to Provide a Coreference Resolution System for German
Samuel Broscheit | Simone Paolo Ponzetto | Yannick Versley | Massimo Poesio

We present a flexible toolkit-based approach to automatic coreference resolution on German text. We start with our previous work aimed at reimplementing the system from Soon et al. (2001) for English, and extend it to duplicate a version of the state-of-the-art proposal from Klenner and Ailloud (2009). Evaluation performed on a benchmarking dataset, namely the TueBa-D/Z corpus (Hinrichs et al., 2005b), shows that machine learning based coreference resolution can be robustly performed in a language other than English.

pdf bib
A Corpus Representation Format for Linguistic Web Services: The D-SPIN Text Corpus Format and its Relationship with ISO Standards
Ulrich Heid | Helmut Schmid | Kerstin Eckart | Erhard Hinrichs

In the framework of the preparation of linguistic web services for corpus processing, the need for a representation format was felt, which supports interoperability between different web services in a corpus processing pipeline, but also provides a well-defined interface to both, legacy tools and their data formats and upcoming international standards. We present the D-SPIN text corpus format, TCF, which was designed for this purpose. It is a stand-off XML format, inspired by the philosophy of the emerging standards LAF (Linguistic Annotation Framework) and its ``instances'' MAF for morpho-syntactic annotation and SynAF for syntactic annotation. Tools for the exchange with existing (best practice) formats are available, and a converter from MAF to TCF is being tested in spring 2010. We describe the usage scenario where TCF is embedded and the properties and architecture of TCF. We also give examples of TCF encoded data and describe the aspects of syntactic and semantic interoperability already addressed.

pdf bib
A Dataset for Assessing Machine Translation Evaluation Metrics
Lucia Specia | Nicola Cancedda | Marc Dymetman

We describe a dataset containing 16,000 translations produced by four machine translation systems and manually annotated for quality by professional translators. This dataset can be used in a range of tasks assessing machine translation evaluation metrics, from basic correlation analysis to training and test of machine learning-based metrics. By providing a standard dataset for such tasks, we hope to encourage the development of better MT evaluation metrics.

pdf bib
Quality Indicators of LSP Texts — Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus
Jakob Halskov | Dorte Haltrup Hansen | Anna Braasch | Sussi Olsen

This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and “poor” LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of ‘best practice’ for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.

pdf bib
Towards Investigating Effective Affective Dialogue Strategies
Gregor Bertrand | Florian Nothdurft | Steffen Walter | Andreas Scheck | Henrik Kessler | Wolfgang Minker

We describe an experimentalWizard-of-Oz-setup for the integration of emotional strategies into spoken dialogue management. With this setup we seek to evaluate different approaches to emotional dialogue strategies in human computer interaction with a spoken dialogue system. The study aims to analyse what kinds of emotional strategies work best in spoken dialogue management especially facing the problem that users may not be honest about their emotions. Therefore as well direct (user is asked about his state) as indirect (measurements of psychophysiological features) evidence is considered for the evaluation of our strategies.

pdf bib
Computational Linguistics for Mere Mortals - Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities
Rüdiger Gleim | Alexander Mehler

Delivering linguistic resources and easy-to-use methods to a broad public in the humanities is a challenging task. On the one hand users rightly demand easy to use interfaces but on the other hand want to have access to the full flexibility and power of the functions being offered. Even though a growing number of excellent systems exist which offer convenient means to use linguistic resources and methods, they usually focus on a specific domain, as for example corpus exploration or text categorization. Architectures which address a broad scope of applications are still rare. This article introduces the eHumanities Desktop, an online system for corpus management, processing and analysis which aims at bridging the gap between powerful command line tools and intuitive user interfaces.

pdf bib
Heuristic Word Alignment with Parallel Phrases
Maria Holmqvist

We present a heuristic method for word alignment, which is the task of identifying corresponding words in parallel text. The heuristic method is based on parallel phrases extracted from manually word aligned sentence pairs. Word alignment is performed by matching parallel phrases to new sentence pairs, and adding word links from the parallel phrase to words in the matching sentence segment. Experiments on an English--Swedish parallel corpus showed that the heuristic phrase-based method produced word alignments with high precision but low recall. In order to improve alignment recall, phrases were generalized by replacing words with part-of-speech categories. The generalization improved recall but at the expense of precision. Two filtering strategies were investigated to prune the large set of generalized phrases. Finally, the phrase-based method was compared to statistical word alignment with Giza++ and we found that although statistical alignments based on large datasets will outperform phrase-based word alignment, a combination of phrase-based and statistical word alignment outperformed pure statistical alignment in terms of Alignment Error Rate (AER).

pdf bib
The Demo / Kemo Corpus: A Principled Approach to the Study of Cross-cultural Differences in the Vocal Expression and Perception of Emotion
Martijn Goudbeek | Mirjam Broersma

This paper presents the Demo / Kemo corpus of Dutch and Korean emotional speech. The corpus has been specifically developed for the purpose of cross-linguistic comparison, and is more balanced than any similar corpus available so far: a) it contains expressions by both Dutch and Korean actors as well as judgments by both Dutch and Korean listeners; b) the same elicitation technique and recording procedure was used for recordings of both languages; c) the same nonsense sentence, which was constructed to be permissible in both languages, was used for recordings of both languages; and d) the emotions present in the corpus are balanced in terms of valence, arousal, and dominance. The corpus contains a comparatively large number of emotions (eight) uttered by a large number of speakers (eight Dutch and eight Korean). The counterbalanced nature of the corpus will enable a stricter investigation of language-specific versus universal aspects of emotional expression than was possible so far. Furthermore, given the carefully controlled phonetic content of the expressions, it allows for analysis of the role of specific phonetic features in emotional expression in Dutch and Korean.

pdf bib
Maskkot — An Entity-centric Annotation Platform
Armando Stellato | Heiko Stoermer | Stefano Bortoli | Noemi Scarpato | Andrea Turbati | Paolo Bouquet | Maria Teresa Pazienza

The Semantic Web is facing the important challenge to maintain its promise of a real world-wide graph of interconnected resources. Unfortunately, while URIs almost guarantee a direct reference to entities, the relation between the two is not bijective. Many different URI references to same concepts and entities can arise when -- in such a heterogeneous setting as the WWW -- people independently build new ontologies, or populate shared ones with new arbitrarily identified individuals. The proliferation of URIs is an unwanted, though natural effect strictly bound to the same principles which characterize the Semantic Web; reducing this phenomenon will improve the recall of Semantic Search engines, which could rely on explicit links between heterogeneous information sources. To address this problem, in this paper we present an integrated environment combining the semantic annotation and ontology building features available in the Semantic Turkey web browser extension, with globally unique identifiers for entities provided by the okkam Entity Name System, thus realizing a valuable resource for preventing diffusion of multiple URIs on the (Semantic) Web.

pdf bib
Multi-Channel Database of Spontaneous Czech with Synchronization of Channels Recorded by Independent Devices
Petr Pollák | Josef Rajnoha

This paper describes Czech spontaneous speech database of lectures on digital signal processing topic collected at Czech Technical University in Prague, commonly with the procedure of its recording and annotation. The database contains 21.7 hours of speech material from 22 speakers recorded in 4 channels with 3 principally different microphones. The annotation of the database is composed from basic time segmentation, orthographic transcription including marks for speaker and environmental non-speech events, pronunciation lexicon in SAMPA alphabet, session and speaker information describing recording conditions, and the documentation. The orthographic transcription with time segmentation is saved in XML format supported by frequently used annotation tool Transcriber. In this article, special attention is also paid to the description of time synchronization of signals recorded by two independent devices: computer based recording platform using two external sound cards and commercial audio recorder Edirol R09. This synchronization is based on cross-correlation analysis with simple automated selection of suitable short signal subparts. The collection and annotation of this database is now complete and its availability via ELRA is currently under preparation.

pdf bib
A Question-answer Distance Measure to Investigate QA System Progress
Guillaume Bernard | Sophie Rosset | Martine Adda-Decker | Olivier Galibert

The performance of question answering system is evaluated through successive evaluations campaigns. A set of questions are given to the participating systems which are to find the correct answer in a collection of documents. The creation process of the questions may change from one evaluation to the next. This may entail an uncontroled question difficulty shift. For the QAst 2009 evaluation campaign, a new procedure was adopted to build the questions. Comparing results of QAst 2008 and QAst 2009 evaluations, a strong performance loss could be measured in 2009 for French and English, while the Spanish systems globally made progress. The measured loss might be related to this new way of elaborating questions. The general purpose of this paper is to propose a measure to calibrate the difficulty of a question set. In particular, a reasonable measure should output higher values for 2009 than for 2008. The proposed measure relies on a distance measure between the critical elements of a question and those of the associated correct answer. An increase of the proposed distance measure for French and English 2009 evaluations as compared to 2008 could be established. This increase correlates with the previously observed degraded performances. We conclude on the potential of this evaluation criterion: the importance of such a measure for the elaboration of new question corpora for questions answering systems and a tool to control the level of difficulty for successive evaluation campaigns.

pdf bib
Fine-Grained Geographical Relation Extraction from Wikipedia
Andre Blessing | Hinrich Schütze

In this paper, we present work on enhancing the basic data resource of a context-aware system. Electronic text offers a wealth of information about geospatial data and can be used to improve the completeness and accuracy of geospatial resources (e.g., gazetteers). First, we introduce a supervised approach to extracting geographical relations on a fine-grained level. Second, we present a novel way of using Wikipedia as a corpus based on self-annotation. A self-annotation is an automatically created high-quality annotation that can be used for training and evaluation. Wikipedia contains two types of different context: (i) unstructured text and (ii) structured data: templates (e.g., infoboxes about cities), lists and tables. We use the structured data to annotate the unstructured text. Finally, the extracted fine-grained relations are used to complete gazetteer data. The precision and recall scores of more than 97 percent confirm that a statistical IE pipeline can be used to improve the data quality of community-based resources.

pdf bib
Fine-grained Linguistic Evaluation of Question Answering Systems
Sarra El Ayari | Brigitte Grau | Anne-Laure Ligozat

Question answering systems are complex systems using natural language processing. Some evaluation campaigns are organized to evaluate such systems in order to propose a classification of systems based on final results (number of correct answers). Nevertheless, teams need to evaluate more precisely the results obtained by their systems if they want to do a diagnostic evaluation. There are no tools or methods to do these evaluations systematically. We present REVISE, a tool for glass box evaluation based on diagnostic of question answering system results.

pdf bib
Evaluating Semantic Relations and Distances in the Associative Concept Dictionary using NIRS-imaging
Nao Tatsumi | Jun Okamoto | Shun Ishizaki

In this study, we extracted brain activities related to semantic relations and distances to improve the precision of distance calculation among concepts in the Associated Concept Dictionary (ACD). For the experiments, we used a multi-channel Near-infrared Spectroscopy (NIRS) device to measure the response properties of the changes in hemoglobin concentration during word-concept association tasks. The experiments’ stimuli were selected from pairs of stimulus words and associated words in the ACD and presented in the form of a visual stimulation to the subjects. In our experiments, we obtained subject response data and brain activation data in Broca's area ―a human brain region that is active in linguistic/word-concept decision tasks― and these data imply relations with the length of associative distance. This study showed that it was possible to connect brain activities to the semantic relation among concepts, and that it would improve the method for concept distance calculation in order to build a more human-like ontology model.

pdf bib
Identification of the Question Focus: Combining Syntactic Analysis and Ontology-based Lookup through the User Interaction
Danica Damljanovic | Milan Agatonovic | Hamish Cunningham

Most question-answering systems contain a classifier module which determines a question category, based on which each question is assigned an answer type. However, setting up syntactic patterns for this classification is a big challenge. In addition, in the case of ontology-based systems, the answer type should be aligned to the queried knowledge structure. In this paper, we present an approach for determining the answer type semi-automatically. We first identify the question focus using syntactic parsing, and then try to identify the answer type by combining the head of the focus with the ontology-based lookup. When this combination is not enough to make conclusions automatically, the user is engaged into a dialog in order to resolve the answer type. User selections are saved and used for training the system in order to improve its performance over time. Further on, the answer type is used to show the feedback and the concise answer to the user. Our approach is evaluated using 250 questions from the Mooney Geoquery dataset.

pdf bib
A Corpus for Studying Full Answer Justification
Arnaud Grappy | Brigitte Grau | Olivier Ferret | Cyril Grouin | Véronique Moriceau | Isabelle Robba | Xavier Tannier | Anne Vilnat | Vincent Barbier

Question answering (QA) systems aim at retrieving precise information from a large collection of documents. To be considered as reliable by users, a QA system must provide elements to evaluate the answer. This notion of answer justification can also be useful when developping a QA system in order to give criteria for selecting correct answers. An answer justification can be found in a sentence, a passage made of several consecutive sentences or several passages of a document or several documents. Thus, we are interesting in pinpointing the set of information that allows to verify the correctness of the answer in a candidate passage and the question elements that are missing in this passage. Moreover, the relevant information is often given in texts in a different form from the question form: anaphora, paraphrases, synonyms. In order to have a better idea of the importance of all the phenomena we underlined, and to provide enough examples at the QA developer's disposal to study them, we decided to build an annotated corpus.

pdf bib
Entity Mention Detection using a Combination of Redundancy-Driven Classifiers
Silvana Marianela Bernaola Biggio | Manuela Speranza | Roberto Zanoli

We present an experimental framework for Entity Mention Detection in which two different classifiers are combined to exploit Data Redundancy attained through the annotation of a large text corpus, as well as a number of Patterns extracted automatically from the same corpus. In order to recognize proper name, nominal, and pronominal mentions we not only exploit the information given by mentions recognized within the corpus being annotated, but also given by mentions occurring in an external and unannotated corpus. The system was first evaluated in the Evalita 2009 evaluation campaign obtaining good results. The current version is being used in a number of applications: on the one hand, it is used in the LiveMemories project, which aims at scaling up content extraction techniques towards very large scale extraction from multimedia sources. On the other hand, it is used to annotate corpora, such as Italian Wikipedia, thus providing easy access to syntactic and semantic annotation for both the Natural Language Processing and Information Retrieval communities. Moreover a web service version of the system is available and the system is going to be integrated into the TextPro suite of NLP tools.

pdf bib
NP Alignment in Bilingual Corpora
Gábor Recski | András Rung | Attila Zséder | András Kornai

Aligning the NPs of parallel corpora is logically halfway between the sentence- and word-alignment tasks that occupy much of the MT literature, but has received far less attention. NP alignment is a challenging problem, capable of rapidly exposing flaws both in the word-alignment and in the NP chunking algorithms one may bring to bear. It is also a very rewarding problem in that NPs are semantically natural translation units, which means that (i) word alignments will cross NP boundaries only exceptionally, and (ii) within sentences already aligned, the proportion of 1-1 alignments will be higher for NPs than words. We created a simple gold standard for English-Hungarian, Orwell’s 1984, (since this already exists in manually verified POS-tagged format in many languages thanks to the Multex and MultexEast project) by manually verifying the automaticaly generated NP chunking (we used the yamcha, mallet and hunchunk taggers) and manually aligning the maximal NPs and PPs. The maximum NP chunking problem is much harder than base NP chunking, with F-measure in the .7 range (as opposed to over .94 for base NPs). Since the results are highly impacted by the quality of the NP chunking, we tested our alignment algorithms both with real world (machine obtained) chunkings, where results are in the .35 range for the baseline algorithm which propagates GIZA++ word alignments to the NP level, and on idealized (manually obtained) chunkings, where the baseline reaches .4 and our current system reaches .64.

pdf bib
The GIVE-2 Corpus of Giving Instructions in Virtual Environments
Andrew Gargett | Konstantina Garoufi | Alexander Koller | Kristina Striegnitz

We present the GIVE-2 Corpus, a new corpus of human instruction giving. The corpus was collected by asking one person in each pair of subjects to guide the other person towards completing a task in a virtual 3D environment with typed instructions. This is the same setting as that of the recent GIVE Challenge, and thus the corpus can serve as a source of data and as a point of comparison for NLG systems that participate in the GIVE Challenge. The instruction-giving data we collect is multilingual (45 German and 63 English dialogues), and can easily be extended to further languages by using our software, which we have made available. We analyze the corpus to study the effects of learning by repeated participation in the task and the effects of the participants' spatial navigation abilities. Finally, we present a novel annotation scheme for situated referring expressions and compare the referring expressions in the German and English data.

pdf bib
WAPUSK20 - A Database for Robust Audiovisual Speech Recognition
Alexander Vorwerk | Xiaohui Wang | Dorothea Kolossa | Steffen Zeiler | Reinhold Orglmeister

Audiovisual speech recognition (AVSR) systems have been proven superior over audio-only speech recognizers in noisy environments by incorporating features of the visual modality. In order to develop reliable AVSR systems, appropriate simultaneously recorded speech and video data is needed. In this paper, we will introduce a corpus (WAPUSK20) that consists of audiovisual data of 20 speakers uttering 100 sentences each with four channels of audio and a stereoscopic video. The latter is intended to support more accurate lip tracking and the development of stereo data based normalization techniques for greater robustness of the recognition results. The sentence design has been adopted from the GRID corpus that has been widely used for AVSR experiments. Recordings have been made under acoustically realistic conditions in a usual office room. Affordable hardware equipment has been used, such as a pre-calibrated stereo camera and standard PC components. The software written to create this corpus was designed in MATLAB with help of hardware specific software provided by the hardware manufacturers and freely available open source software.

pdf bib
Exploring Knowledge Bases for Similarity
Eneko Agirre | Montse Cuadros | German Rigau | Aitor Soroa

Graph-based similarity over WordNet has been previously shown to perform very well on word similarity. This paper presents a study of the performance of such a graph-based algorithm when using different relations and versions of Wordnet. The graph algorithm is based on Personalized PageRank, a random-walk based algorithm which computes the probability of a random-walk initiated in the target word to reach any synset following the relations in WordNet (Haveliwala, 2002). Similarity is computed as the cosine of the probability distributions for each word over WordNet. The best combination of relations includes all relations in WordNet 3.0, included disambiguated glosses, and automatically disambiguated topic signatures called KnowNets. All relations are part of the official release of WordNet, except KnowNets, which have been derived automatically. The results over the WordSim 353 dataset show that using the adequate relations the performance improves over previously published WordNet-based results on the WordSim353 dataset (Finkelstein et al., 2002). The similarity software and some graphs used in this paper are publicly available at

pdf bib
Annotation and Representation of a Diachronic Corpus of Spanish
Cristina Sánchez-Marco | Gemma Boleda | Josep Maria Fontana | Judith Domingo

In this article we describe two different strategies for the automatic tagging of a Spanish diachronic corpus involving the adaptation of existing NLP tools developed for modern Spanish. In the initial approach we follow a state-of-the-art strategy, which consists on standardizing the spelling and the lexicon. This approach boosts POS-tagging accuracy to 90, which represents a raw improvement of over 20% with respect to the results obtained without any pre-processing. In order to enable non-expert users in NLP to use this new resource, the corpus has been integrated into IAC (Corpora Interface Access). We discuss the shortcomings of the initial approach and propose a new one, which does not consist in adapting the source texts to the tagger, but rather in modifying the tagger for the direct treatment of the old variants. This second strategy addresses some important shortcomings in the previous approach and is likely to be useful not only in the creation of diachronic linguistic resources but also for the treatment of dialectal or non-standard variants of synchronic languages as well.

pdf bib
Inferring Subcat Frames of Verbs in Urdu
Ghulam Raza

This paper describes an approach for inferring syntactic frames of verbs in Urdu from an untagged corpus. Urdu, like many other South Asian languages, is a free word order and case-rich language. Separable lexical units mark different constituents for case in phrases and clauses and are called case clitics. There is not always a one to one correspondence between case clitic form and case, and case and grammatical function in Urdu. Case clitics, therefore, can not serve as direct clues for extracting the syntactic frames of verbs. So a two-step approach has been implemented. In a first step, all case clitic combinations for a verb are extracted and the unreliable ones are filtered out by applying the inferential statistics. In a second step, the information of occurrences of case clitic forms in different combinations as a whole and on individual level is processed to infer all possible syntactic frames of the verb.

pdf bib
LIMA : A Multilingual Framework for Linguistic Analysis and Linguistic Resources Development and Evaluation
Romaric Besançon | Gaël de Chalendar | Olivier Ferret | Faiza Gara | Olivier Mesnard | Meriama Laïb | Nasredine Semmar

The increasing amount of available textual information makes necessary the use of Natural Language Processing (NLP) tools. These tools have to be used on large collections of documents in different languages. But NLP is a complex task that relies on many processes and resources. As a consequence, NLP tools must be both configurable and efficient: specific software architectures must be designed for this purpose. We present in this paper the LIMA multilingual analysis platform, developed at CEA LIST. This configurable platform has been designed to develop NLP based industrial applications while keeping enough flexibility to integrate various processes and resources. This design makes LIMA a linguistic analyzer that can handle languages as different as French, English, German, Arabic or Chinese. Beyond its architecture principles and its capabilities as a linguistic analyzer, LIMA also offers a set of tools dedicated to the test and the evaluation of linguistic modules and to the production and the management of new linguistic resources.

pdf bib
A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters
Grzegorz Chrupała | Dietrich Klakow

Named Entity Recognition is a relatively well-understood NLP task, with many publicly available training resources and software for processing English data. Other languages tend to be underserved in this area. For German, CoNLL-2003 Shared Task provided training data, but there are no publicly available, ready-to-use tools. We fill this gap and develop a German NER system with state-of-the-art performance. In addition to CoNLL 2003 labeled training data, we use two additional resources: (i) 32 million words of unlabeled news article text and (ii) infobox labels from German Wikipedia articles. From the unlabeled text we derive distributional word clusters. Then we use cluster membership features and Wikipedia infobox label features to train a supervised model on the labeled training data. This approach allows us to deal better with word-types unseen in the training data and achieve good performance on German with little engineering effort.

pdf bib
Text Cluster Trimming for Better Descriptions and Improved Quality
Magnus Rosell

Text clustering is potentially very useful for exploration of text sets that are too large to study manually. The success of such a tool depends on whether the results can be explained to the user. An automatically extracted cluster description usually consists of a few words that are deemed representative for the cluster. It is preferably short in order to be easily grasped. However, text cluster content is often diverse. We introduce a trimming method that removes texts that do not contain any, or a few of the words in the cluster description. The result is clusters that match their descriptions better. In experiments on two quite different text sets we obtain significant improvements in both internal and external clustering quality for the trimmed clustering compared to the original. The trimming thus has two positive effects: it forces the clusters to agree with their descriptions (resulting in better descriptions) and improves the quality of the trimmed clusters.

pdf bib
Saturnalia: A Latin-Catalan Parallel Corpus for Statistical MT
Jesús González-Rubio | Jorge Civera | Alfons Juan | Francisco Casacuberta

Currently, a great effort is being carried out in the digitalisation of large historical document collections for preservation purposes. The documents in these collections are usually written in ancient languages, such as Latin or Greek, which limits the access of the general public to their content due to the language barrier. Therefore, digital libraries aim not only at storing raw images of digitalised documents, but also to annotate them with their corresponding text transcriptions and translations into modern languages. Unfortunately, ancient languages have at their disposal scarce electronic resources to be exploited by natural language processing techniques. This paper describes the compilation process of a novel Latin-Catalan parallel corpus as a new task for statistical machine translation (SMT). Preliminary experimental results are also reported using a state-of-the-art phrase-based SMT system. The results presented in this work reveal the complexity of the task and its challenging, but interesting nature for future development.

pdf bib
Djangology: A Light-weight Web-based Tool for Distributed Collaborative Text Annotation
Emilia Apostolova | Sean Neilan | Gary An | Noriko Tomuro | Steven Lytinen

Manual text annotation is a resource-consuming endeavor necessary for NLP systems when they target new tasks or domains for which there are no existing annotated corpora. Distributing the annotation work across multiple contributors is a natural solution to reduce and manage the effort required. Although there are a few publicly available tools which support distributed collaborative text annotation, most of them have complex user interfaces and require a significant amount of involvement from the annotators/contributors as well as the project developers and administrators. We present a light-weight web application for highly distributed annotation projects - Djangology. The application takes advantage of the recent advances in web framework architecture that allow rapid development and deployment of web applications thus minimizing development time for customization. The application's web-based interface gives project administrators the ability to easily upload data, define project schemas, assign annotators, monitor progress, and review inter-annotator agreement statistics. The intuitive web-based user interface encourages annotator participation as contributors are not burdened by tool manuals, local installation, or configuration. The system has achieved a user response rate of 70% in two annotation projects involving more than 250 medical experts from various geographic locations.

pdf bib
Analysing Temporally Annotated Corpora with CAVaT
Leon Derczynski | Robert Gaizauskas

We present CAVaT, a tool that performs Corpus Analysis and Validation for TimeML. CAVaT is an open source, modular checking utility for statistical analysis of features specific to temporally-annotated natural language corpora. It provides reporting, highlights salient links between a variety of general and time-specific linguistic features, and also validates a temporal annotation to ensure that it is logically consistent and sufficiently annotated. Uniquely, CAVaT provides analysis specific to TimeML-annotated temporal information. TimeML is a standard for annotating temporal information in natural language text. In this paper, we present the reporting part of CAVaT, and then its error-checking ability, including the workings of several novel TimeML document verification methods. This is followed by the execution of some example tasks using the tool to show relations between times, events, signals and links. We also demonstrate inconsistencies in a TimeML corpus (TimeBank) that have been detected with CAVaT.

pdf bib
Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus
Martin Reynaert | Nelleke Oostdijk | Orphée De Clercq | Henk van den Heuvel | Franciska de Jong

In The Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Beside its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts - SMS, chat - the problem is even more complex, issues such as anonimity, recognizability and citation right, all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller - of commissioned size - more privacy compliant SoNaR, IPR-cleared for commercial purposes as well.

pdf bib
MLIF : A Metamodel to Represent and Exchange Multilingual Textual Information
Samuel Cruz-Lara | Gil Francopoulo | Laurent Romary | Nasredine Semmar

The fast evolution of language technology has produced pressing needs in standardization. The multiplicity of language resources representation levels and the specialization of these representations make difficult the interaction between linguistic resources and components manipulating these resources. In this paper, we describe the MultiLingual Information Framework (MLIF ― ISO CD 24616). MLIF is a metamodel which allows the representation and the exchange of multilingual textual information. This generic metamodel is designed to provide a common platform for all the tools developed around the existing multilingual data exchange formats. This platform provides, on the one hand, a set of generic data categories for various application domains, and on the other hand, strategies for the interoperability with existing standards. The objective is to reach a better convergence between heterogeneous standardisation activities that are taking place in the domain of data modeling (XML; W3C), text management (TEI; TEIC), multilingual information (TMX-LISA; XLIFF-OASIS) and multimedia (SMILText; W3C). This is a work in progress within ISO-TC37 in order to define a new ISO standard.

pdf bib
Generating FrameNets of Various Granularities: The FrameNet Transformer
Josef Ruppenhofer | Jonas Sunde | Manfred Pinkal

We present a method and a software tool, the FrameNet Transformer, for deriving customized versions of the FrameNet database based on frame and frame element relations. The FrameNet Transformer allows users to iteratively coarsen the FrameNet sense inventory in two ways. First, the tool can merge entire frames that are related by user-specified relations. Second, it can merge word senses that belong to frames related by specified relations. Both methods can be interleaved. The Transformer automatically outputs format-compliant FrameNet versions, including modified corpus annotation files that can be used for automatic processing. The customized FrameNet versions can be used to determine which granularity is suitable for particular applications. In our evaluation of the tool, we show that our method increases accuracy of statistical semantic parsers by reducing the number of word-senses (frames) per lemma, and increasing the number of annotated sentences per lexical unit and frame. We further show in an experiment on the FATE corpus that by coarsening FrameNet we do not incur a significant loss of information that is relevant to the Recognizing Textual Entailment task.

pdf bib
A Contrastive Approach to Multi-word Extraction from Domain-specific Corpora
Francesca Bonin | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi

In this paper, we present a novel approach to multi-word terminology extraction combining a well-known automatic term recognition approach, the C--NC value method, with a contrastive ranking technique, aimed at refining obtained results either by filtering noise due to common words or by discerning between semantically different types of terms within heterogeneous terminologies. Differently from other contrastive methods proposed in the literature that focus on single terms to overcome the multi-word terms' sparsity problem, the proposed contrastive function is able to handle variation in low frequency events by directly operating on pre-selected multi-word terms. This methodology has been tested in two case studies carried out in the History of Art and Legal domains. Evaluation of achieved results showed that the proposed two--stage approach improves significantly multi--word term extraction results. In particular, for what concerns the legal domain it provides an answer to a well-known problem in the semi--automatic construction of legal ontologies, namely that of singling out law terms from terms of the specific domain being regulated.

pdf bib
Partial Parsing of Spontaneous Spoken French
Olivier Blanc | Matthieu Constant | Anne Dister | Patrick Watrin

This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions in super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing is based on a preprocessing stage of the spoken data that consists in reformatting and tagging utterances that break the syntactic structure of the text, such as disfluencies. Spoken specificities were formalized thanks to a systematic linguistic study of a 40-hour-long speech transcription corpus. The chunker uses large-coverage and fine-grained language resources for general written language that have been augmented with resources specific to spoken French. It consists in iteratively applying finite-state lexical and syntactic resources and outputing a finite automaton representing all possible chunk analyses. The best path is then selected thanks to a hybrid disambiguation stage. We show that our system reaches scores that are comparable with state-of-the-art results in the field.

pdf bib
Sign Language Corpora for Analysis, Processing and Evaluation
Annelies Braffort | Laurence Bolot | Emilie Chételat-Pelé | Annick Choisier | Maxime Delorme | Michael Filhol | Jérémie Segouat | Cyril Verrecchia | Flora Badin | Nadège Devos

Sign Languages (SLs) are the visuo-gestural languages practised by the deaf communities. Research on SLs requires to build, to analyse and to use corpora. The aim of this paper is to present various kinds of new uses of SL corpora. The way data are used take advantage of the new capabilities of annotation software for visualisation, numerical annotation, and processing. The nature of the data can be video-based or motion capture-based. The aims of the studies include language analysis, animation processing, and evaluation. We describe here some LIMSI’s studies, and some studies from other laboratories as examples.

pdf bib
VenPro: A Morphological Analyzer for Venetan
Sara Tonelli | Emanuele Pianta | Rodolfo Delmonte | Michele Brunelli

This document reports the process of extending MorphoPro for Venetan, a lesser-used language spoken in the Nort-Eastern part of Italy. MorphoPro is the morphological component of TextPro, a suite of tools oriented towards a number of NLP tasks. In order to extend this component to Venetan, we developed a declarative representation of the morphological knowledge necessary to analyze and synthesize Venetan words. This task was challenging for several reasons, which are common to a number of lesser-used languages: although Venetan is widely used as an oral language in everyday life, its written usage is very limited; efforts for defining a standard orthography and grammar are very recent and not well established; despite recent attempts to propose a unified orthography, no Venetan standard is widely used. Besides, there are different geographical varieties and it is strongly influenced by Italian.

pdf bib
From Speech to Trees: Applying Treebank Annotation to Arabic Broadcast News
Mohamed Maamouri | Ann Bies | Seth Kulick | Wajdi Zaghouani | Dave Graff | Mike Ciul

The Arabic Treebank (ATB) Project at the Linguistic Data Consortium (LDC) has embarked on a large corpus of Broadcast News (BN) transcriptions, and this has led to a number of new challenges for the data processing and annotation procedures that were originally developed for Arabic newswire text (ATB1, ATB2 and ATB3). The corpus requirements currently posed by the DARPA GALE Program, including English translation of Arabic BN transcripts, word-level alignment of Arabic and English data, and creation of a corresponding English Treebank, place significant new constraints on ATB corpus creation, and require careful coordination among a wide assortment of concurrent activities and participants. Nonetheless, in spite of the new challenges posed by BN data, the ATB’s newly improved pipeline and revised annotation guidelines for newswire have proven to be robust enough that very few changes were necessary to account for the new genre of data. This paper presents the points where some adaptation has been necessary, and the overall pipeline as used in the production of BN ATB data.

pdf bib
The Role of Parallel Corpora in Bilingual Lexicography
Enikő Héja

This paper describes an approach based on word alignment on parallel corpora, which aims at facilitating the lexicographic work of dictionary building. Although this method has been widely used in the MT community for at least 16 years, as far as we know, it has not been applied to facilitate the creation of bilingual dictionaries for human use. The proposed corpus-driven technique, in particular the exploitation of parallel corpora, proved to be helpful in the creation of such dictionaries for several reasons. Most importantly, a parallel corpus of appropriate size guarantees that the most relevant translations are included in the dictionary. Moreover, based on the translational probabilities it is possible to rank translation candidates, which ensures that the most frequently used translation variants go first within an entry. A further advantage is that all the relevant example sentences from the parallel corpora are easily accessible, thus facilitating the selection of the most appropriate translations from possible translation candidates. Due to these properties the method is particularly apt to enable the production of active or encoding dictionaries.

pdf bib
Towards an ISO Standard for Dialogue Act Annotation
Harry Bunt | Jan Alexandersson | Jean Carletta | Jae-Woong Choe | Alex Chengyu Fang | Koiti Hasida | Kiyong Lee | Volha Petukhova | Andrei Popescu-Belis | Laurent Romary | Claudia Soria | David Traum

This paper describes an ISO project which aims at developing a standard for annotating spoken and multimodal dialogue with semantic information concerning the communicative functions of utterances, the kind of semantic content they address, and their relations with what was said and done earlier in the dialogue. The project, ISO 24617-2 ""Semantic annotation framework, Part 2: Dialogue acts"", is currently at DIS stage. The proposed annotation schema distinguishes 9 orthogonal dimensions, allowing each functional segment in dialogue to have a function in each of these dimensions, thus accounting for the multifunctionality that utterances in dialogue often have. A number of core communicative functions is defined in the form of ISO data categories, available at; they are divided into ""dimension-specific"" functions, which can be used only in a particular dimension, such as Turn Accept in the Turn Management dimension, and ""general-purpose"" functions, which can be used in any dimension, such as Inform and Request. An XML-based annotation language, ""DiAML"" is defined, with an abstract syntax, a semantics, and a concrete syntax.

pdf bib
Empty Categories in a Hindi Treebank
Archna Bhatia | Rajesh Bhatt | Bhuvana Narasimhan | Martha Palmer | Owen Rambow | Dipti Misra Sharma | Michael Tepper | Ashwini Vaidya | Fei Xia

We are in the process of creating a multi-representational and multi-layered treebank for Hindi/Urdu (Palmer et al., 2009), which has three main layers: dependency structure, predicate-argument structure (PropBank), and phrase structure. This paper discusses an important issue in treebank design which is often neglected: the use of empty categories (ECs). All three levels of representation make use of ECs. We make a high-level distinction between two types of ECs, trace and silent, on the basis of whether they are postulated to mark displacement or not. Each type is further refined into several subtypes based on the underlying linguistic phenomena which the ECs are introduced to handle. This paper discusses the stages at which we add ECs to the Hindi/Urdu treebank and why. We investigate methodically the different types of ECs and their role in our syntactic and semantic representations. We also examine our decisions whether or not to coindex each type of ECs with other elements in the representation.

pdf bib
Spanish FreeLing Dependency Grammar
Marina Lloberes | Irene Castellón | Lluís Padró

This paper presents the development of an open-source Spanish Dependency Grammar implemented in FreeLing environment. This grammar was designed as a resource for NLP applications that require a step further in natural language automatic analysis, as is the case of Spanish-to-Basque translation. The development of wide-coverage rule-based grammars using linguistic knowledge contributes to extend the existing Spanish deep parsers collection, which sometimes is limited. Spanish FreeLing Dependency Grammar, named EsTxala, provides deep and robust parse trees, solving attachments for any structure and assigning syntactic functions to dependencies. These steps are dealt with hand-written rules based on linguistic knowledge. As a result, FreeLing Dependency Parser gives a unique analysis as a dependency tree for each sentence analyzed. Since it is a resource open to the scientific community, exhaustive grammar evaluation is being done to determine its accuracy as well as strategies for its manteinance and improvement. In this paper, we show the results of an experimental evaluation carried out over EsTxala in order to test our evaluation methodology.

pdf bib
Assigning Wh-Questions to Verbal Arguments: Annotation Tools Evaluation and Corpus Building
Magali Sanches Duran | Marcelo Adriano Amâncio | Sandra Maria Aluísio

This work reports the evaluation and selection of annotation tools to assign wh-question labels to verbal arguments in a sentence. Wh-question assignment discussed herein is a kind of semantic annotation which involves two tasks: making delimitation of verbs and arguments, and linking verbs to its arguments by question labels. As it is a new type of semantic annotation, there is no report about requirements an annotation tool should have to face it. For this reason, we decided to select the most appropriated tool in two phases. In the first phase, we executed the task with an annotation tool we have used before in another task. Such phase helped us to test the task and enabled us to know which features were or not desirable in an annotation tool for our purpose. In the second phase, guided by such requirements, we evaluated several tools and selected a tool for the real task. After corpus annotation conclusion, we report some of the annotation results and some comments on the improvements there should be made in an annotation tool to better support such kind of annotation task.

pdf bib
The Impact of Task and Corpus on Event Extraction Systems
Ralph Grishman

The term “event extraction” covers a wide range of information extraction tasks, and methods developed and evaluated for one task may prove quite unsuitable for another. Understanding these task differences is essential to making broad progress in event extraction. We look back at the MUC and ACE tasks in terms of one characteristic, the breadth of the scenario ― how wide a range of information is subsumed in a single extraction task. We examine how this affects strategies for collecting information and methods for semi-supervised training of new extractors. We also consider the heterogeneity of corpora ― how varied the topics of documents in a corpus are. Extraction systems may be intended in principle for general news but are typically evaluated on topic-focused corpora, and this evaluation context may affect system design. As one case study, we examine the task of identifying physical attack events in news corpora, observing the effect on system performance of shifting from an attack-event-rich corpus to a more varied corpus and considering how the impact of this shift may be mitigated.

pdf bib
Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank
Seth Kulick | Ann Bies | Mohamed Maamouri

Complications arise for standoff annotation when the annotation is not on the source text itself, but on a more abstract representation. This is particularly the case in a language such as Arabic with morphological and orthographic challenges, and we discuss various aspects of these issues in the context of the Arabic Treebank. The Standard Arabic Morphological Analyzer (SAMA) is closely integrated into the annotation workflow, as the basis for the abstraction between the explicit source text and the more abstract token representation. However, this integration with SAMA gives rise to various problems for the annotation workflow and for maintaining the link between the Treebank and SAMA. In this paper we discuss how we have overcome these problems with consistent and more precise categorization of all of the tokens for their relationship with SAMA. We also discuss how we have improved the creation of several distinct alternative forms of the tokens used in the syntactic trees. As a result, the Treebank provides a resource relating the different forms of the same underlying token with varying degrees of vocalization, in terms of how they relate (1) to each other, (2) to the syntactic structure, and (3) to the morphological analyzer.

pdf bib
Creation of Lexical Resources for a Characterisation of Multiword Expressions in Italian
Andrea Zaninello | Malvina Nissim

The theoretical characterisation of multiword expressions (MWEs) is tightly connected to their actual occurrences in data and to their representation in lexical resources. We present three lexical resources for Italian MWEs, namely an electronic lexicon, a series of example corpora and a database of MWEs represented around morphosyntactic patterns. These resources are matched against, and created from, a very large web-derived corpus for Italian that spans across registers and domains. We can thus test expressions coded by lexicographers in a dictionary, thereby discarding unattested expressions, revisiting lexicographers's choices on the basis of frequency information, and at the same time creating an example sub-corpus for each entry. We organise MWEs on the basis of the morphosyntactic information obtained from the data in an electronic, flexible knowledge-base containing structured annotation exploitable for multiple purposes. We also suggest further work directions towards characterising MWEs by analysing the data organised in our database through lexico-semantic information available in WordNet or MultiWordNet-like resources, also in the perspective of expanding their set through the extraction of other similar compact expressions.

pdf bib
Building a Bilingual ValLex Using Treebank Token Alignment: First Observations
Jana Šindlerová | Ondřej Bojar

We explore the potential and limitations of a concept of building a bilingual valency lexicon based on the alignment of nodes in a parallel treebank. Our aim is to build an electronic Czech->English Valency Lexicon by collecting equivalences from bilingual treebank data and storing them in two already existing electronic valency lexicons, PDT-VALLEX and Engvallex. For this task a special annotation interface has been built upon the TrEd editor, allowing quick and easy collecting of frame equivalences in either of the source lexicons. The issues encountered so far include limitations of technical character, theory-dependent limitations and limitations concerning the achievable degree of quality of human annotation. The issues of special interest for both linguists and MT specialists involved in the project include linguistically motivated non-balance between the frame equivalents, either in number or in type of valency participants. The first phases of annotation so far attest the assumption that there is a unique correspondence between the functors of the translation-equivalent frames. Also, hardly any linguistically significant non-balance between the frames has been found, which is partly promising considering the linguistic theory used and partly caused by little stylistic variety of the annotated corpus texts.

pdf bib
Development and Use of an Evaluation Collection for Personalisation of Digital Newspapers
Alberto Díaz | Pablo Gervás | Antonio García | Laura Plaza

This paper presents the process of development and the characteristics of an evaluation collection for a personalisation system for digital newspapers. This system selects, adapts and presents contents according to a user model that define information needs. The collection presented here contains data that are cross-related over four different axes: a set of news items from an electronic newspaper, collected into subsets corresponding to a particular sequence of days, packaged together and cross-indexed with a set of user profiles that represent the particular evolution of interests of a set of real users over the given days, expressed in each case according to four different representation frameworks: newspaper sections, Yahoo categories, keywords, and relevance feedback over the set of news items for the previous day. This information provides a minimum starting material over which one can evaluate for a given system how it addresses the first two observations - adapting to different users and adapting to particular users over time - providing that the particular system implements the representation of information needs according to the four frameworks employed in the collection. This collection has been successfully used to perform some different experiments to determine the effectiveness of the personalization system presented.

pdf bib
LoonyBin: Keeping Language Technologists Sane through Automated Management of Experimental (Hyper)Workflows
Jonathan H. Clark | Alon Lavie

Many contemporary language technology systems are characterized by long pipelines of tools with complex dependencies. Too often, these workflows are implemented by ad hoc scripts; or, worse, tools are run manually, making experiments difficult to reproduce. These practices are difficult to maintain in the face of rapidly evolving workflows while they also fail to expose and record important details about intermediate data. Further complicating these systems are hyperparameters, which often cannot be directly optimized by conventional methods, requiring users to determine which combination of values is best via trial and error. We describe LoonyBin, an open-source tool that addresses these issues by providing: 1) a visual interface for the user to create and modify workflows; 2) a well-defined mechanism for tracking metadata and provenance; 3) a script generator that compiles visual workflows into shell scripts; and 4) a new workflow representation we call a HyperWorkflow, which intuitively and succinctly encodes small experimental variations within a larger workflow.

pdf bib
Improving Personal Name Search in the TIGR System
Keith J. Miller | Sarah McLeod | Elizabeth Schroeder | Mark Arehart | Kenneth Samuel | James Finley | Vanesa Jurica | John Polk

This paper describes the development and evaluation of enhancements to the specialized information retrieval capabilities of a multimodal reporting system. The system enables collection and dissemination of information through a distributed data architecture by allowing users to input free text documents, which are indexed for subsequent search and retrieval by other users. This unstructured data entry method is essential for users of this system, but it requires an intelligent support system for processing queries against the data. The system, known as TIGR (Tactical Ground Reporting), allows keyword searching and geospatial filtering of results, but lacked the ability to efficiently index and search person names and perform approximate name matching. To improve TIGR’s ability to provide accurate, comprehensive results for queries on person names we iteratively updated existing entity extraction and name matching technologies to better align with the TIGR use case. We evaluated each version of the entity extraction and name matching components to find the optimal configuration for the TIGR context, and combined those pieces into a named entity extraction, indexing, and search module that integrates with the current TIGR system. By comparing system-level evaluations of the original and updated TIGR search processes, we show that our enhancements to personal name search significantly improved the performance of the overall information retrieval capabilities of the TIGR system.

pdf bib
A Snack Implementation and Tcl/Tk Interface to the Fundamental Frequency Variation Spectrum Algorithm
Kornel Laskowski | Jens Edlund

Intonation is an important aspect of vocal production, used for a variety of communicative needs. Its modeling is therefore crucial in many speech understanding systems, particularly those requiring inference of speaker intent in real-time. However, the estimation of pitch, traditionally the first step in intonation modeling, is computationally inconvenient in such scenarios. This is because it is often, and most optimally, achieved only after speech segmentation and recognition. A consequence is that earlier speech processing components, in today’s state-of-the-art systems, lack intonation awareness by fiat; it is not known to what extent this circumscribes their performance. In the current work, we present a freely available implementation of an alternative to pitch estimation, namely the computation of the fundamental frequency variation (FFV) spectrum, which can be easily employed at any level within a speech processing system. It is our hope that the implementation we describe aid in the understanding of this novel acoustic feature space, and that it facilitate its inclusion, as desired, in the front-end routines of speech recognition, dialog act recognition, and speaker recognition systems.

pdf bib
Dialogue Reference in a Visual Domain
Jette Viethen | Simon Zwarts | Robert Dale | Markus Guhe

A central purpose of referring expressions is to distinguish intended referents from other entities that are in the context; but how is this context determined? This paper draws a distinction between discourse context ―other entities that have been mentioned in the dialogue― and visual context ―visually available objects near the intended referent. It explores how these two different aspects of context have an impact on subsequent reference in a dialogic situation where the speakers share both discourse and visual context. In addition we take into account the impact of the reference history ―forms of reference used previously in the discourse― on forming what have been called conceptual pacts. By comparing the output of different parameter settings in our model to a data set of human-produced referring expressions, we determine that an approach to subsequent reference based on conceptual pacts provides a better explanation of our data than previously proposed algorithmic approaches which compute a new distinguishing description for the intended referent every time it is mentioned.

pdf bib
Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System
Sunao Hara | Norihide Kitaoka | Kazuya Takeda

In this paper, we propose an estimation method of user satisfaction for a spoken dialog system using an N-gram-based dialog history model. We have collected a large amount of spoken dialog data accompanied by usability evaluation scores by users in real environments. The database is made by a field-test in which naive users used a client-server music retrieval system with a spoken dialog interface on their own PCs. An N-gram model is trained from the sequences that consist of users' dialog acts and/or the system's dialog acts for each one of six user satisfaction levels: from 1 to 5 and φ (task not completed). Then, the satisfaction level is estimated based on the N-gram likelihood. Experiments were conducted on the large real data and the results show that our proposed method achieved good classification performance; the classification accuracy was 94.7% in the experiment on a classification into dialogs with task completion and those without task completion. Even if the classifier detected all of the task incomplete dialog correctly, our proposed method achieved the false detection rate of only 6%.

pdf bib
A Language Approach to Modeling Human Behaviors
Peng-Wen Chen | Snehal Kumar Chennuru | Ying Zhang

The modeling of human behavior becomes more and more important due to the increasing popularity of context-aware computing and people-centric mobile applications. Inspired by the principle of action-as-language, we propose that human ambulatory behavior shares similar properties as natural languages. In addition, by exploiting this similarity, we will be able to index, recognize, cluster, retrieve, and infer high-level semantic meanings of human behaviors via the use of natural language processing techniques. In this paper, we developed a Life Logger system to help build the behavior language corpus which supports our ""Behavior as Language"" research. The constructed behavior corpus shows Zipf's distribution over the frequency of vocabularies which is aligned with our ""Behavior as Language"" assumption. Our preliminary results of using smoothed n-gram language model for activity recognition achieved an average accuracy rate of 94% in distinguishing among human ambulatory behaviors including walking, running, and cycling. This behavior-as-language corpus will enable researchers to study higher level human behavior based on the syntactic and semantic analysis of the corpus data.

pdf bib
Construction of Chunk-Aligned Bilingual Lecture Corpus for Simultaneous Machine Translation
Masaki Murata | Tomohiro Ohno | Shigeki Matsubara | Yasuyoshi Inagaki

With the development of speech and language processing, speech translation systems have been developed. These studies target spoken dialogues, and employ consecutive interpretation, which uses a sentence as the translation unit. On the other hand, there exist a few researches about simultaneous interpreting, and recently, the language resources for promoting simultaneous interpreting research, such as the publication of an analytical large-scale corpus, has been prepared. For the future, it is necessary to make the corpora more practical toward realization of a simultaneous interpreting system. In this paper, we describe the construction of a bilingual corpus which can be used for simultaneous lecture interpreting research. Simultaneous lecture interpreting systems are required to recognize translation units in the middle of a sentence, and generate its translation at the proper timing. We constructed the bilingual lecture corpus by the following steps. First, we segmented sentences in the lecture data into semantically meaningful units for the simultaneous interpreting. And then, we assigned the translations to these units from the viewpoint of the simultaneous interpreting. In addition, we investigated the possibility of automatically detecting the simultaneous interpreting timing from our corpus.

pdf bib
Learning Recursive Segments for Discourse Parsing
Stergos Afantenos | Pascal Denis | Philippe Muller | Laurence Danlos

Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Previous research on discourse segmentation have relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, for some theories of discourse, like the ""Segmented Discourse Representation Theory"" or SDRT, allow for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our approach builds on standard multi-class classification techniques making use of a regularized maximum entropy model, combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1,445 EDUs), our system achieves encouraging performance results with an F-score of 73% for finding EDUs.

pdf bib
Extracting Product Features and Sentiments from Chinese Customer Reviews
Shu Zhang | Wenjie Jia | Yingju Xia | Yao Meng | Hao Yu

With the growing interest in opinion mining from web data, more works are focused on mining in English and Chinese reviews. Probing into the problem of product opinion mining, this paper describes the details of our language resources, and imports them into the task of extracting product feature and sentiment task. Different from the traditional unsupervised methods, a supervised method is utilized to identify product features, combining the domain knowledge and lexical information. Nearest vicinity match and syntactic tree based methods are proposed to identify the opinions regarding the product features. Multi-level analysis module is proposed to determine the sentiment orientation of the opinions. With the experiments on the electronic reviews of COAE 2008, the validities of the product features identified by CRFs and the two opinion words identified methods are testified and compared. The results show the resource is well utilized in this task and our proposed method is valid.

pdf bib
Open Soucre Graph Transducer Interpreter and Grammar Development Environment
Bernd Bohnet | Leo Wanner

Graph and tree transducers have been applied in many NLP areas―among them, machine translation, summarization, parsing, and text generation. In particular, the successful use of tree rewriting transducers for the introduction of syntactic structures in statistical machine translation contributed to their popularity. However, the potential of such transducers is limited because they do not handle graphs and because they ”consume” the source structure in that they rewrite it instead of leaving it intact for intermediate consultations. In this paper, we describe an open source tree and graph transducer interpreter, which combines the advantages of graph transducers and two-tape Finite State Transducers and surpasses the limitations of state-of-the-art tree rewriting transducers. Along with the transducer, we present a graph grammar development environment that supports the compilation and maintenance of graph transducer grammatical and lexical resources. Such an environment is indispensable for any effort to create consistent large coverage NLP-resources by human experts.

pdf bib
Linking Korean Words with an Ontology
Min-Jae Kwon | Hae-Yun Lee | Hee-Rahk Chae

The need for ontologies has increased in computer science or information science recently. Especially, NLP systems such as information retrieval, machine translation, etc. require ontologies whose concepts are connected to natural language words. There are a few Korean wordnets such as U-WIN, KorLex, CoreNet, etc. Most of them, however, stand alone without any link to an ontology. Hence, we need a Korean wordnet which is linked to a language-neutral ontology such as SUMO, OpenCyc, DOLCE, etc. In this paper, we will present a method of linking Korean word senses with the concepts of an ontology, which is part of an ongoing project. We use a Korean-English bilingual dictionary, Princeton WordNet (Fellbaum 1998), and the ontology SmartSUMO (Oberle et al. 2007). The current version of WordNet is mapped into SUMO, which constitutes a major part of SmartSUMO. We focus on mapping Korean word senses with corresponding English word senses by way of Princeton WordNet which is mapped into SUMO. This paper will show that we need to apply different algorithms of linking, depending on the information types that a bilingual dictionary contains.

pdf bib
The Semantic Atlas: an Interactive Model of Lexical Representation
Sabine Ploux | Armelle Boussidan | Hyungsuk Ji

In this paper we describe two geometrical models of meaning representation, the Semantic Atlas (SA) and the Automatic Contexonym Organizing Model (ACOM). The SA provides maps of meaning generated through correspondence factor analysis. The models can handle different types of word relations: synonymy in the SA and co-occurrence in ACOM. Their originality relies on an artifact called 'cliques' - a fine grained infra linguistic sub-unit of meaning. The SA is composed of several dictionaries and thesauri enhanced with a process of symmetrisation. It is currently available for French and English in monolingual versions as well as in a bilingual translation version. Other languages are under development and testing. ACOM deals with unannotated corpora. The models are used by research teams worldwide that investigate synonymy, translation processes, genre comparison, psycholinguistics and polysemy modeling. Both models can be consulted online via a flexible interface allowing for interactive navigation on This site is the most consulted address of the French National Center for Scientific Research’s domain (CNRS), one of the major research bodies in France. The international interest it has triggered led us to initiate the process of going open source. In the meantime, all our databases are freely available on request.

pdf bib
SINotas: the Evaluation of a NLG Application
Roberto P. A. Araujo | Rafael L. de Oliveira | Eder M. de Novais | Thiago D. Tadeu | Daniel B. Pereira | Ivandré Paraboni

SINotas is a data-to-text NLG application intended to produce short textual reports on students’ academic performance from a database conveying their grades, weekly attendance rates and related academic information. Although developed primarily as a testbed for Portuguese Natural Language Generation, SINotas generates reports of interest to both students keen to learn how their professors would describe their efforts, and to the professors themselves, who may benefit from an at-a-glance view of the student’s performance. In a traditional machine learning approach, SINotas uses a data-text aligned corpus as training data for decision-tree induction. The current system comprises a series of classifiers that implement major Document Planning subtasks (namely, data interpretation, content selection, within- and between-sentence structuring), and a small surface realisation grammar of Brazilian Portuguese. In this paper we focus on the evaluation work of the system, applying a number of intrinsic and user-based evaluation metrics to a collection of text reports generated from real application data.

pdf bib
Types of Nods. The Polysemy of a Social Signal
Isabella Poggi | Francesca D’Errico | Laura Vincze

The work analyses the head nod, a down-up movement of the head, as a polysemic social signal, that is, a signal with a number of different meanings which all share some common semantic element. Based on the analysis of 100 nods drawn from the SSPNet corpus of TV political debates, a typology of nods is presented that distinguishes Speaker’s, Interlocutor’s and Third Listener’s nods, with their subtypes (confirmation, agreement, approval, submission and permission, greeting and thanks, backchannel giving and backchannel request, emphasis, ironic agreement, literal and rhetoric question, and others). For each nod the analysis specifies: 1. characteristic features of how it is produced, among which main direction, amplitude, velocity and number of repetitions; 2. cues in other modalities, like direction and duration of gaze; 3. conversational context in which the nod typically occurs. For the Interlocutor’s or Third Listener’s nod, the preceding speech act is relevant: yes/no answer or information for a nod of confirmation, expression of opinion for one of agreement, prosocial action for greetings and thanks; for the Speaker’s nods, instead, their meanings are mainly distinguished by accompanying signals.

pdf bib
A Speech Corpus for Modeling Language Acquisition: CAREGIVER
Toomas Altosaar | Louis ten Bosch | Guillaume Aimetti | Christos Koniaris | Kris Demuynck | Henk van den Heuvel

A multi-lingual speech corpus used for modeling language acquisition called CAREGIVER has been designed and recorded within the framework of the EU funded Acquisition of Communication and Recognition Skills (ACORNS) project. The paper describes the motivation behind the corpus and its design by relying on current knowledge regarding infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured in both infant-directed and adult-directed speech modes over four languages in a read speech manner. The challenges and methods applied to obtain similar prompts in terms of complexity and semantics across different languages, as well as the normalized recording procedures employed at different locations, is covered. The corpus contains nearly 66000 utterance based audio files spoken over a two-year period by 17 male and 17 female native speakers of Dutch, English, Finnish, and Swedish. An orthographical transcription is available for every utterance. Also, time-aligned word and phone annotations for many of the sub-corpora also exist. The CAREGIVER corpus will be published via ELRA.

pdf bib
Corpus Aligner (CorAl) Evaluation on English-Croatian Parallel Corpora
Sanja Seljan | Marko Tadić | Željko Agić | Jan Šnajder | Bojana Dalbelo Bašić | Vjekoslav Osmann

An increasing demand for new language resources of recent EU members and accessing countries has in turn initiated the development of different language tools and resources, such as alignment tools and corresponding translation memories for new languages pairs. The primary goal of this paper is to provide a description of a free sentence alignment tool CorAl (Corpus Aligner), developed at the Faculty of Electrical Engineering and Computing, University of Zagreb. The tool performs paragraph alignment at the first step of the alignment process, which is followed by sentence alignment. Description of the tool is followed by its evaluation. The paper describes an experiment with applying the CorAl aligner to a English-Croatian parallel corpus of legislative domain using metrics of precision, recall and F1-measure. Results are discussed and the concluding sections discuss future directions of CorAl development.

pdf bib
Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance
Siim Orasmaa | Reina Käärik | Jaak Vilo | Tiit Hennoste

An important feature of spoken language corpora is existence of different spelling variants of words in transcription. So there is an important problem for linguist who works with large spoken corpora: how to find all variants of the word without annotating them manually? Our work describes a search engine that enables finding different spelling variants (true positives) from corpus of spoken language, and reduces efficiently the amount of false positives returned during the search. Our search engine uses a generalized variant of the edit distance algorithm that allows defining text-specific string to string transformations in addition to the default edit operations defined in edit distance. We have extended our algorithm with capability to block transformations in specific substrings of search words. User can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes from the Corpus of Spoken Estonian of the University of Tartu which consists of about 2000 dialogues and texts, about 1.4 million running text units in total.

pdf bib
The Spanish Resource Grammar
Montserrat Marimon

This paper describes the Spanish Resource Grammar, an open-source multi-purpose broad-coverage precise grammar for Spanish. The grammar is implemented on the Linguistic Knowledge Builder (LKB) system, it is grounded in the theoretical framework of Head-driven Phrase Structure Grammar (HPSG), and it uses Minimal Recursion Semantics (MRS) for the semantic representation. We have developed a hybrid architecture which integrates shallow processing functionalities -- morphological analysis, and Named Entity recognition and classification -- into the parsing process. The SRG has a full coverage lexicon of closed word classes and it contains 50,852 lexical entries for open word classes. The grammar also has 64 lexical rules to perform valence changing operations on lexical items, and 191 phrase structure rules that combine words and phrases into larger constituents and compositionally build up their semantic representation. The annotation of each parsed sentence in an LKB grammar simultaneously represents a traditional phrase structure tree, and a MRS semantic representation. We provide evaluation results on sentences from newspaper texts and discuss future work.

pdf bib
PASSAGE Syntactic Representation: a Minimal Common Ground for Evaluation
Anne Vilnat | Patrick Paroubek | Eric Villemonte de la Clergerie | Gil Francopoulo | Marie-Laure Guénot

The current PASSAGE syntactic representation is the result of 9 years of constant evolution with the aim of providing a common ground for evaluating parsers of French whatever their type and supporting theory. In this paper we present the latest developments concerning the formalism and show first through a review of basic linguistic phenomena that it is a plausible minimal common ground for representing French syntax in the context of generic black box quantitative objective evaluation. For the phenomena reviewed, which include: the notion of syntactic head, apposition, control and coordination, we explain how PASSAGE representation relates to other syntactic representation schemes for French and English, slightly extending the annotation to address English when needed. Second, we describe the XML format chosen for PASSAGE and show that it is compliant with the latest propositions in terms of linguistic annotation standard. We conclude discussing the influence that corpus-based evaluation has on the characteristics of syntactic representation when willing to assess the performance of any kind of parser.

pdf bib
Linguistically Motivated Unsupervised Segmentation for Machine Translation
Mark Fishel | Harri Kirik

In this paper we use statistical machine translation and morphology information from two different morphological analyzers to try to improve translation quality by linguistically motivated segmentation. The morphological analyzers we use are the unsupervised Morfessor morpheme segmentation and analyzer toolkit and the rule-based morphological analyzer T3. Our translations are done using the Moses statistical machine translation toolkit with training on the JRC-Acquis corpora and translating on Estonian to English and English to Estonian language directions. In our work we model such linguistic phenomena as word lemmas and endings and splitting compound words into simpler parts. Also lemma information was used to introduce new factors to the corpora and to use this information for better word alignment or for alternative path back-off translation. From the results we find that even though these methods have shown previously and keep showing promise of improved translation, their success still largely depends on the corpora and language pairs used.

pdf bib
Predicting Persuasiveness in Political Discourses
Carlo Strapparava | Marco Guerini | Oliviero Stock

In political speeches, the audience tends to react or resonate to signals of persuasive communication, including an expected theme, a name or an expression. Automatically predicting the impact of such discourses is a challenging task. In fact nowadays, with the huge amount of textual material that flows on the Web (news, discourses, blogs, etc.), it can be useful to have a measure for testing the persuasiveness of what we retrieve or possibly of what we want to publish on Web. In this paper we exploit a corpus of political discourses collected from various Web sources, tagged with audience reactions, such as applause, as indicators of persuasive expressions. In particular, we use this data set in a machine learning framework to explore the possibility of classifying the transcript of political discourses, according to their persuasive power, predicting the sentences that possibly trigger applause. We also explore differences between Democratic and Republican speeches, experiment the resulting classifiers in grading some of the discourses in the Obama-McCain presidential campaign available on the Web.

pdf bib
Towards Optimal TTS Corpora
Didier Cadic | Cédric Boidin | Christophe d’Alessandro

Unit selection text-to-speech systems currently produce very natural synthesized phrases by concatenating speech segments from a large database. Recently, increasing demand for designing high quality voices with less data has created need for further optimization of the textual corpus recorded by the speaker. This corpus is traditionally the result of a condensation process: sentences are selected from a reference corpus, using an optimization algorithm (generally greedy) guided by the coverage rate of classic units (diphones, triphones, words…). Such an approach is, however, strongly constrained by the finite content of the reference corpus, providing limited language possibilities. To gain flexibility in the optimization process, in this paper, we introduce a new corpus building procedure based on sentence construction rather than sentence selection. Sentences are generated using Finite State Transducers, assisted by a human operator and guided by a new frequency-weighted coverage criterion based on Vocalic Sandwiches. This semi-automatic process requires time-consuming human intervention but seems to give access to much denser corpora, with a density increase of 30 to 40% for a given coverage rate.

pdf bib
Automatic Grammar Rule Extraction and Ranking for Definitions
Claudia Borg | Mike Rosner | Gordon J. Pace

Plain text corpora contain much information which can only be accessed through human annotation and semantic analysis, which is typically very time consuming to perform. Analysis of such texts at a syntactic or grammatical structure level can however extract some of this information in an automated manner, even if identifying effective rules can be extremely difficult. One such type of implicit information present in texts is that of definitional phrases and sentences. In this paper, we investigate the use of evolutionary algorithms to learn classifiers to discriminate between definitional and non-definitional sentences in non-technical texts, and show how effective grammar-based definition discriminators can be automatically learnt with minor human intervention.

pdf bib
A Multilingual CALL Game Based on Speech Translation
Manny Rayner | Pierrette Bouillon | Nikos Tsourakis | Johanna Gerlach | Maria Georgescul | Yukie Nakao | Claudia Baur

We describe a multilingual Open Source CALL game, CALL-SLT, which reuses speech translation technology developed using the Regulus platform to create an automatic conversation partner that allows intermediate-level language students to improve their fluency. We contrast CALL-SLT with Wang's and Seneff's ``translation game'' system, in particular focussing on three issues. First, we argue that the grammar-based recognition architecture offered by Regulus is more suitable for this type of application; second, that it is preferable to prompt the student in a language-neutral form, rather than in the L1; and third, that we can profitably record successful interactions by native speakers and store them to be reused as online help for students. The current system, which will be demoed at the conference, supports four L2s (English, French, Japanese and Swedish) and two L1s (English and French). We conclude by describing an evaluation exercise, where a version of CALL-SLT configured for English L2 and French L1 was used by several hundred high school students. About half of the subjects reported positive impressions of the system.

pdf bib
Question Answering Biographic Information and Social Network Powered by the Semantic Web
Peter Adolphs | Xiwen Cheng | Tina Klüwer | Hans Uszkoreit | Feiyu Xu

After several years of development, the vision of the Semantic Web is gradually becoming reality. Large data repositories have been created and offer semantic information in a machine-processable form for various domains. Semantic Web data can be published on the Web, gathered automatically, and reasoned about. All these developments open interesting perspectives for building a new class of domain-specific, broad-coverage information systems that overcome a long-standing bottleneck of AI systems, the notoriously incomplete knowledge base. We present a system that shows how the wealth of information in the Semantic Web can be interfaced with humans once again, using natural language for querying and answering rather than technical formalisms. Whereas current Question Answering systems typically select snippets from Web documents retrieved by a search engine, we utilize Semantic Web data, which allows us to provide natural-language answers that are tailored to the current dialog context. Furthermore, we show how to use natural language processing technologies to acquire new data and enrich existing data in a Semantic Web framework. Our system has acquired a rich biographic data resource by combining existing Semantic Web resources, which are discovered from semi-structured textual data in Web pages, with information extracted from free natural language texts.

pdf bib
Metaphor Corpus Annotated for Source - Target Domain Mappings
Ekaterina Shutova | Simone Teufel

Besides making our thoughts more vivid and filling our communication with richer imagery, metaphor also plays an important structural role in our cognition. Although there is a consensus in the linguistics and NLP research communities that the phenomenon of metaphor is not restricted to similarity-based extensions of meanings of isolated words, but rather involves reconceptualization of a whole area of experience (target domain) in terms of another (source domain), there still has been no proposal for a comprehensive procedure for annotation of cross-domain mappings. However, a corpus annotated for conceptual mappings could provide a new starting point for both linguistic and cognitive experiments. The annotation scheme we present in this paper is a step towards filling this gap. We test our procedure in an experimental setting involving multiple annotators and estimate their agreement on the task. The associated corpus annotated for source ― target domain mappings will be publicly available.

pdf bib
Efficiently Extract Rrecurring Tree Fragments from Large Treebanks
Federico Sangati | Willem Zuidema | Rens Bod

In this paper we describe FragmentSeeker, a tool which is capable to identify all those tree constructions which are recurring multiple times in a large Phrase Structure treebank. The tool is based on an efficient kernel-based dynamic algorithm, which compares every pair of trees of a given treebank and computes the list of fragments which they both share. We describe two different notions of fragments we will use, i.e. standard and partial fragments, and provide the implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank for which we present quantitative and qualitative analysis on the obtained recurring structures, as well as provide empirical time performance. Finally we propose possible ways our tool could contribute to different research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.

pdf bib
Vergina: A Modern Greek Speech Database for Speech Synthesis
Alexandros Lazaridis | Theodoros Kostoulas | Todor Ganchev | Iosif Mporas | Nikos Fakotakis

The present paper outlines the Vergina speech database, which was developed in support of research and development of corpus-based unit selection and statistical parametric speech synthesis systems for Modern Greek language. In the following, we describe the design, development and implementation of the recording campaign, as well as the annotation of the database. Specifically, a text corpus of approximately 5 million words, collected from newspaper articles, periodicals, and paragraphs of literature, was processed in order to select the utterances-sentences needed for producing the speech database and to achieve a reasonable phonetic coverage. The broad coverage and contents of the selected utterances-sentences of the database ― text corpus collected from different domains and writing styles ― makes this database appropriate for various application domains. The database, recorded in audio studio, consists of approximately 3,000 phonetically balanced Modern Greek utterances corresponding to approximately four hours of speech. Annotation of the Vergina speech database was performed using task-specific tools, which are based on a hidden Markov model (HMM) segmentation method, and then manual inspection and corrections were performed.

pdf bib
WikiNet: A Very Large Scale Multi-Lingual Concept Network
Vivi Nastase | Michael Strube | Benjamin Boerschinger | Caecilia Zirn | Anas Elghafari

This paper describes a multi-lingual large-scale concept network obtained automatically by mining for concepts and relations and exploiting a variety of sources of knowledge from Wikipedia. Concepts and their lexicalizations are extracted from Wikipedia pages, in particular from article titles, hyperlinks, disambiguation pages and cross-language links. Relations are extracted from the category and page network, from the category names, from infoboxes and the body of the articles. The resulting network has two main components: (i) a central, language independent index of concepts, which serves to keep track of the concepts' lexicalizations both within a language and across languages, and to separate linguistic expressions of concepts from the relations in which they are involved (concepts themselves are represented as numeric IDs); (ii) a large network built on the basis of the relations extracted, represented as relations between concepts (more specifically, the numeric IDs). The various stages of obtaining the network were separately evaluated, and the results show a qualitative resource.

pdf bib
Developing Morphological Analysers for South Asian Languages: Experimenting with the Hindi and Gujarati Languages
Niraj Aswani | Robert Gaizauskas

A considerable amount of work has been put into development of stemmers and morphological analysers. The majority of these approaches use hand-crafted suffix-replacement rules but a few try to discover such rules from corpora. While most of the approaches remove or replace suffixes, there are examples of derivational stemmers which are based on prefixes as well. In this paper we present a rule-based morphological analyser. We propose an approach that takes both prefixes as well as suffixes into account. Given a corpus and a dictionary, our method can be used to obtain a set of suffix-replacement rules for deriving an inflected word’s root form. We developed an approach for the Hindi language but show that the approach is portable, at least to related languages, by adapting it to the Gujarati language. Given that the entire process of developing such a ruleset is simple and fast, our approach can be used for rapid development of morphological analysers and yet it can obtain competitive results with analysers built relying on human authored rules.

pdf bib
A Japanese Particle Corpus Built by Example-Based Annotation
Hiroki Hanaoka | Hideki Mima | Jun’ichi Tsujii

This paper is a report on an on-going project of creating a new corpus focusing on Japanese particles. The corpus will provide deeper syntactic/semantic information than the existing resources. The initial target particle is ``to'' which occurs 22,006 times in 38,400 sentences of the existing corpus: the Kyoto Text Corpus. In this annotation task, an ``example-based'' methodology is adopted for the corpus annotation, which is different from the traditional annotation style. This approach provides the annotators with an example sentence rather than a linguistic category label. By avoiding linguistic technical terms, it is expected that any native speakers, with no special knowledge on linguistic analysis, can be an annotator without long training, and hence it can reduce the annotation cost. So far, 10,475 occurrences have been already annotated, with an inter-annotator agreement of 0.66 calculated by Cohen's kappa. The initial disagreement analyses and future directions are discussed in the paper.

pdf bib
Idioms in Context: The IDIX Corpus
Caroline Sporleder | Linlin Li | Philip Gorinski | Xaver Koch

Idioms and other figuratively used expressions pose considerable problems to natural language processing applications because they are very frequent and often behave idiosyncratically. Consequently, there has been much research on the automatic detection and extraction of idiomatic expressions. Most studies focus on type-based idiom detection, i.e., distinguishing whether a given expression can (potentially) be used idiomatically. However, many expressions such as ""break the ice"" can have both literal and non-literal readings and need to be disambiguated in a given context (token-based detection). So far relatively few approaches have attempted context-based idiom detection. One reason for this may be that few annotated resources are available that disambiguate expressions in context. With the IDIX corpus, we aim to address this. IDIX is available as an add-on to the BNC and disambiguates different usages of a subset of idioms. We believe that this resource will be useful both for linguistic and computational linguistic studies.

pdf bib
The PlayMancer Database: A Multimodal Affect Database in Support of Research and Development Activities in Serious Game Environment
Theodoros Kostoulas | Otilia Kocsis | Todor Ganchev | Fernando Fernández-Aranda | Juan J. Santamaría | Susana Jiménez-Murcia | Maher Ben Moussa | Nadia Magnenat-Thalmann | Nikos Fakotakis

The present paper reports on a recent effort that resulted in the establishment of a unique multimodal affect database, referred to as the PlayMancer database. This database was created in support of the research and development activities, taking place within the PlayMancer project, which aim at the development of a serious game environment in support of treatment of patients with behavioural and addictive disorders, such as eating disorders and gambling addictions. Specifically, for the purpose of data collection, we designed and implemented a pilot trial with healthy test subjects. Speech, video and bio-signals (pulse-rate, SpO2) were captured synchronously, during the interaction of healthy people with a number of video games. The collected data were annotated by the test subjects (self-annotation), targeting proper interpretation of the underlying affective states. The broad-shouldered design of the PlayMancer database allows its use for the needs of research on multimodal affect-emotion recognition and multimodal human-computer interaction in serious games environment.

pdf bib
Development of Linguistic Resources and Tools for Providing Multilingual Solutions in Indian Languages — A Report on National Initiative
Swaran Lata | Somnath Chandra Vijay Kumar

The multilingual diversity of India is one of the most unique in world. Currently there are 22 constitutionally recognized languages with 12 scripts. Apart from these, there are at least 35 different languages and 2000 dialects in 4 major language families. It is thus evident that, development and proliferation of software solutions in the Indic multilingual environment requires continuous and sustained effort to edge out challenges in all core areas namely storage and encoding, input mechanism, browser support and data exchange. Linguistic Resources and Tools are the key building blocks to develop multilingual solutions. In this paper, we shall present an overview of the major national initiative in India for the development and standardization of Linguistic Resources and Tools for developing and deployment of multilingual ICT solutions in India.

pdf bib
C-3: Coherence and Coreference Corpus
Cristina Nicolae | Gabriel Nicolae | Kirk Roberts

The phenomenon of coreference, covering entities, their mentions and their properties, is intricately linked to the phenomenon of coherence, covering the structure of rhetorical relations in a discourse. A text corpus that has both phenomena annotated can be used to test hypotheses about their interrelation or to detect other phenomena. We present the process by which C-3, a new corpus, was obtained by annotating the Discourse GraphBank coherence corpus with entity and mention information. The annotation followed a set of ACE guidelines adapted to favor coreference and to include entities of unknown types in the annotation. Together with the corpus we offer a new annotation tool specifically designed to annotate entity and mention information within a simple and functional graphical interface that combines the “best of all worlds” from available annotation tools. The potential usefulness of C-3 is discussed, as well as an application in which the corpus proved to be a valuable resource.

pdf bib
A Pilot Arabic CCGbank
Stephen A. Boxwell | Chris Brew

We describe a process for converting the Penn Arabic Treebank into the CCG formalism. Previous efforts have yielded CCGbanks in English, German, and Turkish, thus opening these languages to the sophisticated computational tools developed for CCG and enabling further cross-linguistic development. Conversion from a context free grammar treebank to a CCGbank is a four stage process: head finding, argument classification, binarization, and category conversion. In the process of implementing a basic CCGbank conversion algorithm, we reveal properties of Arabic grammar that interfere with conversion, such as subject topicalization, genitive constructions, relative clauses, and optional pronominal subjects. All of these problematic phenomena can be resolved in a variety of ways - we discuss advantages and disadvantages of each in their respective sections. We detail these and describe our categorial analysis of each of these Arabic grammatical phenomena in depth, as well as technical details on their integration into the conversion algorithm.

pdf bib
The DesPho-APaDy Project: Developing an Acoustic-phonetic Characterization of Dysarthric Speech in French
Cécile Fougeron | Lise Crevier-Buchman | Corinne Fredouille | Alain Ghio | Christine Meunier | Claude Chevrie-Muller | Jean-Francois Bonastre | Antonia Colazo Simon | Céline Delooze | Danielle Duez | Cédric Gendrot | Thierry Legou | Nathalie Levèque | Claire Pillot-Loiseau | Serge Pinto | Gilles Pouchoulin | Danièle Robert | Jacqueline Vaissiere | François Viallet | Coralie Vincent

This paper presents the rationale, objectives and advances of an on-going project (the DesPho-APaDy project funded by the French National Agency of Research) which aims to provide a systematic and quantified description of French dysarthric speech, over a large population of patients and three dysarthria types (related to the parkinson's disease, the Amyotrophic Lateral Sclerosis disease, and a pure cerebellar alteration). The two French corpora of dysarthric patients, from which the speech data have been selected for analysis purposes, are firstly described. Secondly, this paper discusses and outlines the requirement of a structured and organized computerized platform in order to store, organize and make accessible (for selected and protected usage) dysarthric speech corpora and associated patients’ clinical information (mostly disseminated in different locations: labs, hospitals, …). The design of both a computer database and a multi-field query interface is proposed for the clinical context. Finally, advances of the project related to the selection of the population used for the dysarthria analysis, the preprocessing of the speech files, their orthographic transcription and their automatic alignment are also presented.

pdf bib
Semi-Automatic Domain Ontology Creation from Text Resources
Mithun Balakrishna | Dan Moldovan | Marta Tatu | Marian Olteanu

Analysts in various domains, especially intelligence and financial, have to constantly extract useful knowledge from large amounts of unstructured or semi-structured data. Keyword-based search, faceted search, question-answering, etc. are some of the automated methodologies that have been used to help analysts in their tasks. General-purpose and domain-specific ontologies have been proposed to help these automated methods in organizing data and providing access to useful information. However, problems in ontology creation and maintenance have resulted in expensive procedures for expanding/maintaining the ontology library available to support the growing and evolving needs of analysts. In this paper, we present a generalized and improved procedure to automatically extract deep semantic information from text resources and rapidly create semantically-rich domain ontologies while keeping the manual intervention to a minimum. We also present evaluation results for the intelligence and financial ontology libraries, semi-automatically created by our proposed methodologies using freely-available textual resources from the Web.

pdf bib
Language Technology Challenges of a ‘Small’ Language (Catalan)
Maite Melero | Gemma Boleda | Montse Cuadros | Cristina España-Bonet | Lluís Padró | Martí Quixal | Carlos Rodríguez | Roser Saurí

In this paper, we present a brief snapshot of the state of affairs in computational processing of Catalan and the initiatives that are starting to take place in an effort to bring the field a step forward, by making a better and more efficient use of the already existing resources and tools, by bridging the gap between research and market, and by establishing periodical meeting points for the community. In particular, we present the results of the First Workshop on the Computational Processing of Catalan, which succeeded in putting together a fair representation of the research in the area, and received attention from both the industry and the administration. Aside from facilitating communication among researchers and between developers and users, the Workshop provided the organizers with valuable information about existing resources, tools, developers and providers. This information has allowed us to go a step further by setting up a “harvesting” procedure which will hopefully build the seed of a portal-catalogue-observatory of language resources and technologies in Catalan.

pdf bib
Porting an Ancient Greek and Latin Treebank
John Lee | Dag Haug

We have recently converted a dependency treebank, consisting of ancient Greek and Latin texts, from one annotation scheme to another that was independently designed. This paper makes two observations about this conversion process. First, we show that, despite significant surface differences between the two treebanks, a number of straightforward transformation rules yield a substantial level of compatibility between them, giving evidence for their sound design and high quality of annotation. Second, we analyze some linguistic annotations that require further disambiguation, proposing some simple yet effective machine learning methods.

pdf bib
Comparing Computational Models of Selectional Preferences - Second-order Co-Occurrence vs. Latent Semantic Clusters
Sabine Schulte im Walde

This paper presents a comparison of three computational approaches to selectional preferences: (i) an intuitive distributional approach that uses second-order co-occurrence of predicates and complement properties; (ii) an EM-based clustering approach that models the strengths of predicate--noun relationships by latent semantic clusters (Rooth et al., 1999); and (iii) an extension of the latent semantic clusters by incorporating the MDL principle into the EM training, thus explicitly modelling the predicate--noun selectional preferences by WordNet classes (Schulte im Walde et al., 2008). Concerning the distributional approach, we were interested not only in how well the model describes selectional preferences, but moreover which second-order properties are most salient. For example, a typical direct object of the verb 'drink' is usually fluid, might be hot or cold, can be bought, might be bottled, etc. The general question we ask is: what characterises the predicate's restrictions to the semantic realisation of its complements? Our second interest lies in the actual comparison of the models: How does a very simple distributional model compare to much more complex approaches, and which representation of selectional preferences is more appropriate, using (i) second-order properties, (ii) an implicit generalisation of nouns (by clusters), or (iii) an explicit generalisation of nouns by WordNet classes within clusters? We describe various experiments on German data and two evaluations, and demonstrate that the simple distributional model outperforms the more complex cluster-based models in most cases, but does itself not always beat the powerful frequency baseline.

pdf bib
An Evaluation of Technologies for Knowledge Base Population
Paul McNamee | Hoa Trang Dang | Heather Simpson | Patrick Schone | Stephanie M. Strassel

Previous content extraction evaluations have neglected to address problems which complicate the incorporation of extracted information into an existing knowledge base. Previous question answering evaluations have likewise avoided tasks such as explicit disambiguation of target entities and handling a fixed set of questions about entities without previous determination of possible answers. In 2009 NIST conducted a Knowledge Base Population track at its Text Analysis Conference to unite the content extraction and question answering communities and jointly explore some of these issues. This exciting new evaluation attracted 13 teams from 6 countries that submitted results in two tasks, Entity Linking and Slot Filling. This paper explains the motivation and design of the tasks, describes the language resources that were developed for this evaluation, offers comparisons to previous community evaluations, and briefly summarizes the performance obtained by systems. We also identify relevant issues pertaining to target selection, challenging queries, and performance measures.

pdf bib
Aligning FrameNet and WordNet based on Semantic Neighborhoods
Óscar Ferrández | Michael Ellsworth | Rafael Muñoz | Collin F. Baker

This paper presents an algorithm for aligning FrameNet lexical units to WordNet synsets. Both, FrameNet and WordNet, are well-known as well as widely-used resources by the entire research community. They help systems in the comprehension of the semantics of texts, and therefore, finding strategies to link FrameNet and WordNet involves challenges related to a better understanding of the human language. Such deep analysis is exploited by researchers to improve the performance of their applications. The alignment is achieved by exploiting the particular characteristics of each lexical-semantic resource, with special emphasis on the explicit, formal semantic relations in each. Semantic neighborhoods are computed for each alignment of lemmas, and the algorithm calculates correlation scores by comparing such neighborhoods. The results suggest that the proposed algorithm is appropriate for aligning the FrameNet and WordNet hierarchies. Furthermore, the algorithm can aid research on increasing the coverage of FrameNet, building FrameNets in other languages, and creating a system for querying a joint FrameNet-WordNet hierarchy.

pdf bib
Al —Khalil : The Arabic Linguistic Ontology Project
Hassina Aliane | Zaia Alimazighi | Ahmed Cherif Mazari

Despite Arabic is the language of hundred millions of people over the world, little has been done in terms of computerized linguistic resources, tools or applications. In this paper we describe a project which aim is to contribute filling this gap. The project consists in building an ontology centered infrastructure for Arabic Language resources and applications. The core of this infrastructure is a linguistic ontology that is founded on Arabic Traditional Grammar. The methodology we have chosen consists in reusing an existing ontology, namely the Gold linguistic ontology. GOLD is the first ontology being designed for linguistic description on the semantic web. We first construct our ontology manually by relating our concepts from Arabic Linguistics to the upper concepts of GOLD, furthermore an information extraction algorithm is implemented to automatically enrich the ontology. We discuss the development of the ontology and present our vision for the whole project which aims at using this ontology for creating tools and resources for both linguists and NLP Researchers. Indeed, the ontology is seen , not only as a domain ontology but also as a resource for different linguistic and NLP applications.

pdf bib
LAT Bridge: Bridging Tools for Annotation and Exploration of Rich Linguistic Data
Marc Kemps-Snijders | Thomas Koller | Han Sloetjes | Huib Verwey

We present a software module, the LAT Bridge, which enables bidirectional communication between the annotation and exploration tools developed at the Max Planck Institute for Psycholinguistics as part of our Language Archiving Technology (LAT) tool suite. These existing annotation and exploration tools enable the annotation, enrichment, exploration and archive management of linguistic resources. The user community has expressed the desire to use different combinations of LAT tools in conjunction with each other. The LAT Bridge is designed to cater for a number of basic data interaction scenarios between the LAT annotation and exploration tools. These interaction scenarios (e.g. bootstrapping a wordlist, searching for annotation examples or lexical entries) have been identified in collaboration with researchers at our institute. We had to take into account that the LAT tools for annotation and exploration represent a heterogeneous application scenario with desktop-installed and web-based tools. Additionally, the LAT Bridge has to work in situations where the Internet is not available or only in an unreliable manner (i.e. with a slow connection or with frequent interruptions). As a result, the LAT Bridge’s architecture supports both online and offline communication between the LAT annotation and exploration tools.

pdf bib
Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9
Ondřej Bojar | Adam Liška | Zdeněk Žabokrtský

CzEng 0.9 is the third release of a large parallel corpus of Czech and English. For the current release, CzEng was extended by significant amount of texts from various types of sources, including parallel web pages, electronically available books and subtitles. This paper describes and evaluates filtering techniques employed in the process in order to avoid misaligned or otherwise damaged parallel sentences in the collection. We estimate the precision and recall of two sets of filters. The first set was used to process the data before their inclusion into CzEng. The filters from the second set were newly created to improve the filtering process for future releases of CzEng. Given the overall amount and variance of sources of the data, our experiments illustrate the utility of parallel data sources with respect to extractable parallel segments. As a similar behaviour can be expected for other language pairs, our results can be interpreted as guidelines indicating which sources should other researchers exploit first.

pdf bib
Corpora for the Conceptualisation and Zoning of Scientific Papers
Maria Liakata | Simone Teufel | Advaith Siddharthan | Colin Batchelor

We present two complementary annotation schemes for sentence based annotation of full scientific papers, CoreSC and AZ-II, applied to primary research articles in chemistry. AZ-II is the extension of AZ for chemistry papers. AZ has been shown to have been reliably annotated by independent human coders and useful for various information access tasks. Like AZ, AZ-II follows the rhetorical structure of a scientific paper and the knowledge claims made by the authors. The CoreSC scheme takes a different view of scientific papers, treating them as the humanly readable representations of scientific investigations. It seeks to retrieve the structure of the investigation from the paper as generic high-level Core Scientific Concepts (CoreSC). CoreSCs have been annotated by 16 chemistry experts over a total of 265 full papers in physical chemistry and biochemistry. We describe the differences and similarities between the two schemes in detail and present the two corpora produced using each scheme. There are 36 shared papers in the corpora, which allows us to quantitatively compare aspects of the annotation schemes. We show the correlation between the two schemes, their strengths and weeknesses and discuss the benefits of combining a rhetorical based analysis of the papers with a content-based one.

pdf bib
Analysis and Presentation of Results for Mobile Local Search
Alberto Tretti | Barbara Di Eugenio

Aggregation of long lists of concepts is important to avoid overwhelming a small display. Focusing on the domain of mobile local search, this paper presents the development of an application to perform filtering and aggregation of results obtained through the Yahoo! Local web service. First, we performed an analysis of the data available through Yahoo! Local by crawling its database with over 170 thousand local listings located in Chicago. Then, we compiled resources and developed algorithms to filter and aggregate local search results. The methods developed exploit Yahoo!’s listings categorization to reduce the result space and pinpoint the category containing the most relevant results. Finally, we evaluated a prototype through a user study, which pitted our system against Yahoo! Local and against a plain list of search results. The results obtained from the study show that our aggregation methods are quite effective, cutting down the number of entries returned to the user by 43% on average, but leaving search efficiency and user satisfaction unaffected.

pdf bib
The EPAC Corpus: Manual and Automatic Annotations of Conversational Speech in French Broadcast News
Yannick Estève | Thierry Bazillon | Jean-Yves Antoine | Frédéric Béchet | Jérôme Farinas

This paper presents the EPAC corpus which is composed by a set of 100 hours of conversational speech manually transcribed and by the outputs of automatic tools (automatic segmentation, transcription, POS tagging, etc.) applied on the entire French ESTER 1 audio corpus: this concerns about 1700 hours of audio recordings from radiophonic shows. This corpus was built during the EPAC project funded by the French Research Agency (ANR) from 2007 to 2010. This corpus increases significantly the amount of French manually transcribed audio recordings easily available and it is now included as a part of the ESTER 1 corpus in the ELRA catalog without additional cost. By providing a large set of automatic outputs of speech processing tools, the EPAC corpus should be useful to researchers who want to work on such data without having to develop and deal with such tools. These automatic annotations are various: segmentation and speaker diarization, one-best hypotheses from the LIUM automatic speech recognition system with confidence measures, but also word-lattices and confusion networks, named entities, part-of-speech tags, chunks, etc. The 100 hours of speech manually transcribed were split into three data sets in order to get an official training corpus, an official development corpus and an official test corpus. These data sets were used to develop and to evaluate some automatic tools which have been used to process the 1700 hours of audio recording. For example, on the EPAC test data set our ASR system yields a word error rate equals to 17.25%.

pdf bib
Annotation Time Stamps — Temporal Metadata from the Linguistic Annotation Process
Katrin Tomanek | Udo Hahn

We describe the re-annotation of selected types of named entities (persons, organizations, locations) from the Muc7 corpus. The focus of this annotation initiative is on recording the time needed for the linguistic process of named entity annotation. Annotation times are measured on two basic annotation units -- sentences vs. complex noun phrases. We gathered evidence that decision times are non-uniformly distributed over the annotation units, while they do not substantially deviate among annotators. This data seems to support the hypothesis that annotation times very much depend on the inherent ""hardness"" of each single annotation decision. We further show how such time-stamped information can be used for empirically grounded studies of selective sampling techniques, such as Active Learning. We directly compare Active Learning costs on the basis of token-based vs. time-based measurements. The data reveals that Active Learning keeps its competitive advantage over random sampling in both scenarios though the difference is less marked for the time metric than for the token metric.

pdf bib
Annotating the Enron Email Corpus with Number Senses
Stuart Moore | Sabine Buchholz | Anna Korhonen

The Enron Email Corpus provides ``Real World'' text in the business email domain, which is a target domain for many speech and language applications. We present a section of this corpus annotated with number senses - labelling each number as a date, time, year, telephone number etc. We show that sense categories and their frequencies are very different in this domain than in newswire text. The annotated corpus can provide valuable material for the development of number sense disambiguation techniques. We have released the annotations into the public domain, to allow other researchers to perform comparisons.

pdf bib
Integration of Linguistic Markup into Semantic Models of Folk Narratives: The Fairy Tale Use Case
Piroska Lendvai | Thierry Declerck | Sándor Darányi | Pablo Gervás | Raquel Hervás | Scott Malec | Federico Peinado

Propp's influential structural analysis of fairy tales created a powerful schema for representing storylines in terms of character functions, which is directly exploitable for computational semantic analysis, and procedural generation of stories of this genre. We tackle two resources that draw on the Proppian model - one formalizes it as a semantic markup scheme and the other as an ontology -, both lacking linguistic phenomena explicitly represented in them. The need for integrating linguistic information into structured semantic resources is motivated by the emergence of suitable standards that facilitate this, as well as the benefits such joint representation would create for transdisciplinary research across Digital Humanities, Computational Linguistics, and Artificial Intelligence.

pdf bib
An LMF-based Web Service for Accessing WordNet-type Semantic Lexicons
Bora Savas | Yoshihiko Hayashi | Monica Monachini | Claudia Soria | Nicoletta Calzolari

This paper describes a Web service for accessing WordNet-type semantic lexicons. The central idea behind the service design is: given a query, the primary functionality of lexicon access is to present a partial lexicon by extracting the relevant part of the target lexicon. Based on this idea, we implemented the system as a RESTful Web service whose input query is specified by the access URI and whose output is presented in a standardized XML data format. LMF, an ISO standard for modeling lexicons, plays the most prominent role: the access URI pattern basically reflects the lexicon structure as defined by LMF; the access results are rendered based on Wordnet-LMF, which is a version of LMF XML-serialization. The Web service currently provides accesses to Princeton WordNet, Japanese WordNet, as well as the EDR Electronic Dictionary as a trial. To accommodate the EDR dictionary within the same framework, we modeled it also as a WordNet-type semantic lexicon. This paper thus argues possible alternatives to model innately bilingual/multilingual lexicons like EDR with LMF, and proposes possible revisions to Wordnet-LMF.

pdf bib
Active Learning for Building a Corpus of Questions for Parsing
Jordi Atserias | Giuseppe Attardi | Maria Simi | Hugo Zaragoza

This paper describes how we built a dependency Treebank for questions. The questions for the Treebank were drawn from questions from the TREC 10 QA task and from Yahoo! Answers. Among the uses for the corpus is to train a dependency parser achieving good accuracy on parsing questions without hurting its overall accuracy. We also explore active learning techniques to determine the suitable size for a corpus of questions in order to achieve adequate accuracy while minimizing the annotation efforts.

pdf bib
Automatically Identifying Changes in the Semantic Orientation of Words
Paul Cook | Suzanne Stevenson

The meanings of words are not fixed but in fact undergo change, with new word senses arising and established senses taking on new aspects of meaning or falling out of usage. Two types of semantic change are amelioration and pejoration; in these processes a word sense changes to become more positive or negative, respectively. In this first computational study of amelioration and pejoration we adapt a web-based method for determining semantic orientation to the task of identifying ameliorations and pejorations in corpora from differing time periods. We evaluate our proposed method on a small dataset of known historical ameliorations and pejorations, and find it to perform better than a random baseline. Since this test dataset is small, we conduct a further evaluation on artificial examples of amelioration and pejoration, and again find evidence that our proposed method is able to identify changes in semantic orientation. Finally, we conduct a preliminary evaluation in which we apply our methods to the task of finding words which have recently undergone amelioration or pejoration.

pdf bib
NPCEditor: A Tool for Building Question-Answering Characters
Anton Leuski | David Traum

NPCEditor is a system for building and deploying virtual characters capable of engaging a user in spoken dialog on a limited domain. The dialogue may take any form as long as the character responses can be specified a priori. For example, NPCEditor has been used for constructing question answering characters where a user asks questions and the character responds, but other scenarios are possible. At the core of the system is a state of the art statistical language classification technology for mapping from user's text input to system responses. NPCEditor combines the classifier with a database that stores the character information and relevant language data, a server that allows the character designer to deploy the completed characters, and a user-friendly editor that helps the designer to accomplish both character design and deployment tasks. In the paper we define the overall system architecture, describe individual NPCEditor components, and guide the reader through the steps of building a virtual character.

pdf bib
Automatic Annotation of Word Emotion in Sentences Based on Ren-CECps
Changqin Quan | Fuji Ren

Textual information is an important communication medium contained rich expression of emotion, and emotion recognition on text has wide applications. Word emotion analysis is fundamental in the problem of textual emotion recognition. Through an analysis of the characteristics of word emotion expression, we use word emotion vector to describe the combined basic emotions in a word, which can be used to distinguish direct and indirect emotion words, express emotion ambiguity in words, and express multiple emotions in words. Based on Ren-CECps (a Chinese emotion corpus), we do an experiment to explore the role of emotion word for sentence emotion recognition and we find that the emotions of a simple sentence (sentence without negative words, conjunctions, or question mark) can be approximated by an addition of the word emotions. Then MaxEnt modeling is used to find which context features are effective for recognizing word emotion in sentences. The features of word, N-words, POS, Pre-N-words emotion, Pre-is-degree-word, Pre-is-negativeword, Pre-is-conjunction and their combination have been experimented. After that, we use the two metrics: Kappa coefficient of agreement and Voting agreement to measure the word annotation agreement of Ren-CECps. The experiments on above context features showed promising results compared with word emotion agreement on people's judgments.

pdf bib
Speech Data Corpus for Verbal Intelligence Estimation
Kseniya Zablotskaya | Steffen Walter | Wolfgang Minker

The goal of our research is the development of algorithms for automatic estimation of a person's verbal intelligence based on the analysis of transcribed spoken utterances. In this paper we present the corpus of German native speakers' monologues and dialogues about the same topics collected at the University of Ulm, Germany. The monologues were descriptions of two short films; the dialogues were discussions about problems of German education. The data corpus contains the verbal intelligence quotients of each speaker, which were measured with the Hamburg Wechsler Intelligence Test for Adults. In this paper we describe our corpus, why we decided to create it, and how it was collected. We also describe some approaches which can be applied to the transcribed spoken utterances for extraction of different features which could have a correlation with a person's verbal intelligence. The data corpus consists of 71 monologues and 30 dialogues (about 10 hours of audio data).

pdf bib
A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project
Yi Liu | Pascale Fung | Yongsheng Yang | Denise DiPersio | Meghan Glenn | Stephanie Strassel | Christopher Cieri

In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC) including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of BC and BN recordings are manually transcribed with standard Chinese characters in UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and first of its kind for Mandarin Chinese Broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST’s acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.

pdf bib
Building a Generative Lexicon for Romanian
Anca Dinu

We present in this paper an on-going research: the construction and annotation of a Romanian Generative Lexicon (RoGL). Our system follows the specifications of CLIPS project for Italian language. It contains a corpus, a type ontology, a graphical interface and a database from which we generate data in XML format.

pdf bib
How Specialized are Specialized Corpora? Behavioral Evaluation of Corpus Representativeness for Maltese.
Jerid Francom | Amy LaCross | Adam Ussishkin

In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then apply statistical methods to evaluate the extent to which familiarity ratings predict corpus frequency for verbs in the Maltese corpus from three angles: 1) token frequency, 2) frequency distributions and 3) morpho-syntactic type (binyan). This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack a wide access to diverse language data.

pdf bib
Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development
Kevin Walker | Christopher Caruso | Denise DiPersio

The development of technologies to address machine translation and distillation of multilingual broadcast data depends heavily on the collection of large volumes of material from modern data providers. To address the needs of GALE researchers, the Linguistic Data Consortium (LDC) developed a system for collecting broadcast news and conversation from a variety of Arabic, Chinese and English broadcasters. The system is highly automated, easily extensible and robust and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. In addition to this extensive system, LDC manages three remote collection sites to maximize the variety of available broadcast data and has designed a portable broadcast collection platform to facilitate remote collection. This paper will present a detailed a description of the design and implementation of LDC’s collection system, the technical challenges and solutions to large scale broadcast data collection efforts and an overview of the system’s operation. This paper will also discuss the challenges of managing remote collections, in particular, the strategies used to normalize data formats, naming conventions and delivery methods to achieve optimal integration of remotely-collected data into LDC’s collection database and downstream tasking workflow.

pdf bib
Like Finding a Needle in a Haystack: Annotating the American National Corpus for Idiomatic Expressions
Laura Street | Nathan Michalov | Rachel Silverstein | Michael Reynolds | Lurdes Ruela | Felicia Flowers | Angela Talucci | Priscilla Pereira | Gabriella Morgon | Samantha Siegel | Marci Barousse | Antequa Anderson | Tashom Carroll | Anna Feldman

Our paper presents the details of a pilot study in which we tagged portions of the American National Corpus (ANC) for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses. The three data sets we analyzed included 1,500-sentence samples from the spoken, the nonfiction, and the fiction portions of the ANC. Our paper provides the details of the tagset we developed, the motivation behind our choices, and the inter-annotator agreement measures we deemed appropriate for this task. In tagging the ANC for idiomatic expressions, our annotators achieved a high level of agreement (> .80) on the tags but a low level of agreement (< .00) on what constituted an idiom. These findings support the claim that identifying idiomatic and metaphorical expressions is a highly difficult and subjective task. In total, 135 idiom types and 154 idiom tokens were identified. Based on the total tokens found for each idiom class, we suggest that future research on idiom detection and idiom annotation include prepositional phrases as this class of idioms occurred frequently in the nonfiction and spoken samples of our corpus

pdf bib
Adapting a resource-light highly multilingual Named Entity Recognition system to Arabic
Wajdi Zaghouani | Bruno Pouliquen | Mohamed Ebrahim | Ralf Steinberger

We present a fully functional Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabic, but - instead - a highly multilingual, almost language-independent NER system was adapted to also cover Arabic. The Semitic language Arabic substantially differs from the Indo-European and Finno-Ugric languages currently covered. This paper thus describes what Arabic language-specific resources had to be developed and what changes needed to be made to the otherwise language-independent rule set in order to be applicable to the Arabic language. The achieved evaluation results are generally satisfactory, but could be improved for certain entity types. The results of the IE tools can be seen on the Arabic pages of the freely accessible Europe Media Monitor (EMM) application NewsExplorer, which can be found at

pdf bib
Enriching Word Alignment with Linguistic Tags
Xuansong Li | Niyu Ge | Stephen Grimes | Stephanie M. Strassel | Kazuaki Maeda

Incorporating linguistic knowledge into word alignment is becoming increasingly important for current approaches in statistical machine translation research. To improve automatic word alignment and ultimately machine translation quality, an annotation framework is jointly proposed by LDC (Linguistic Data Consortium) and IBM. The framework enriches word alignment corpora to capture contextual, syntactic and language-specific features by introducing linguistic tags to the alignment annotation. Two annotation schemes constitute the framework: alignment and tagging. The alignment scheme aims to identify minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. The framework produces a solid ground-level alignment base upon which larger translation unit alignment can be automatically induced. To test the soundness of this work, evaluation is performed on a pilot annotation, resulting in inter- and intra- annotator agreement of above 90%. To date LDC has produced manual word alignment and tagging on 32,823 Chinese-English sentences following this framework.

pdf bib
The Indiana “Cooperative Remote Search Task” (CReST) Corpus
Kathleen Eberhard | Hannele Nicholson | Sandra Kübler | Susan Gundersen | Matthias Scheutz

This paper introduces a novel corpus of natural language dialogues obtained from humans performing a cooperative, remote, search task (CReST) as it occurs naturally in a variety of scenarios (e.g., search and rescue missions in disaster areas). This corpus is unique in that it involves remote collaborations between two interlocutors who each have to perform tasks that require the other's assistance. In addition, one interlocutor's tasks require physical movement through an indoor environment as well as interactions with physical objects within the environment. The multi-modal corpus contains the speech signals as well as transcriptions of the dialogues, which are additionally annotated for dialog structure, disfluencies, and for constituent and dependency syntax. On the dialogue level, the corpus was annotated for separate dialogue moves, based on the classification developed by Carletta et al. (1997) for coding task-oriented dialogues. Disfluencies were annotated using the scheme developed by Lickley (1998). The syntactic annotation comprises POS annotation, Penn Treebank style constituent annotations as well as dependency annotations based on the dependencies of pennconverter.

pdf bib
Principled Construction of Elicited Imitation Tests
Carl Christensen | Ross Hendrickson | Deryle Lonsdale

In this paper we discuss the methodology behind the construction of elicited imitation (EI) test items. First we examine varying uses for EI tests in research and in testing overall oral proficiency. We also mention criticisms of previous test items. Then we identify the factors that contribute to the difficulty of an EI item as shown in previous studies. Based on this discussion, we describe a way of automating the creation of test items in order to better evaluate language learners' oral proficiency while improving item naturalness. We present a new item construction tool and the process that it implements in order to create test items from a corpus, identifying relevant features needed to compile a database of EI test items. We examine results from administration of a new EI test engineered in this manner, illustrating the effect that standard language resources can have on creating an effective EI test item repository. We also sketch ongoing work on test item generation for other languages and an adaptive test that will use this collection of test items.

pdf bib
A High Recall Error Identification Tool for Hindi Treebank Validation
Bharat Ram Ambati | Mridul Gupta | Samar Husain | Dipti Misra Sharma

This paper describes the development of a hybrid tool for a semi-automated process for validation of treebank annotation at various levels. The tool is developed for error detection at the part-of-speech, chunk and dependency levels of a Hindi treebank, currently under development. The tool aims to identify as many errors as possible at these levels to achieve consistency in the task of annotation. Consistency in treebank annotation is a must for making data as error-free as possible and for providing quality assurance. The tool is aimed at ensuring consistency and to make manual validation cost effective. We discuss a rule based and a hybrid approach (statistical methods combined with rule-based methods) by which a high-recall system can be developed and used to identify errors in the treebank. We report some results of using the tool on a sample of data extracted from the Hindi treebank. We also argue how the tool can prove useful in improving the annotation guidelines which would in turn, better the quality of annotation in subsequent iterations.

pdf bib
Dialogues in Context: An Objective User-Oriented Ev