Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, Raquel Silva (Editors)

Anthology ID: L04-1
Month: May
Year: 2004
Address: Lisbon, Portugal
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: https://aclanthology.org/L04-1/

pdf bib

Can We Talk? Prospects for Automatically Training Spoken Dialogue Systems
Marilyn Walker

pdf bib

Strategic Directions of National and International Research Funding
Hans Uszkoreit

pdf bib

Multilingual Content Processing
Gregor Thurmair

pdf bib

Collaborative Commentary: Opening Up Spoken Language Databases
Brian MacWhinney

pdf bib

Getting to the Heart of the Matter; Speech is More than Just the Expression of Text or Language
Nick Campbell

pdf bib

Industrial Needs for Language Resources
Bente Maegaard

pdf bib

Thesaurus or Logical Ontology, Which do we Need for Mining Text?
Junichi Tsujii

pdf bib

Information Extraction from Hindi Texts
Kamlesh Dutta | Saroj Kaushik | Nupur Prakash

pdf bib

The Language Belongs to the People!
Cornelis H.A. Koster | Stefan Gradmann

pdf bib

Multilingual Corpus-based Approach to the Resolution of English –ing
Lee Schwartz | Takako Aikawa

pdf bib

On the Problems of Creating a Golden Standard of Inflected Forms in Portuguese
Diana Santos | Anabela Barreiro

pdf bib

Evaluating Multimodal NLG Using Production Experiments
Ielka van der Sluis | Emiel Krahmer

pdf bib

Concept Creation in Lexical Ontologies
Nuno Seco | Tony Veale | Jer Hayes

pdf bib

Polysemy and Category Structure in WordNet: An Evidential Approach
Tony Veale

pdf bib

Towards a Reference Annotation Framework
Susanne Salmon-Alt | Laurent Romary

pdf bib

A New ITU-T Recommendation on the Evaluation of Telephone-Based Spoken Dialogue Systems
Sebastian Möller

pdf bib

Raising the Bar: Stacked Conservative Error Correction Beyond Boosting
Dekai Wu | Grace Ngai | Marine Carpuat

pdf bib

An Analysis of the Relative Difficulty of Reuters-21578 Subsets
Franca Debole | Fabrizio Sebastiani

pdf bib

Experiences in Collection of Handwriting Data for Online Handwriting Recognition in Indic Scripts
Ajay S. Bhaskarabhatla | Sriganesh Madhvanath

pdf bib

Collocation Extraction Using Web Statistics
Hsin-Hsi Chen | Yi-Cheng Yu | Chih-Long Lin

pdf bib

An XML Representation for Annotated Handwriting Datasets for Online Handwriting Recognition
Ajay S. Bhaskarabhatla | Sriganesh Madhvanath

pdf bib

Reusing Language Resources for Speech Applications involving Emotion
Christina Alexandris | Stavroula-Evita Fotinea

pdf bib

Designing and Recording an Audiovisual Database of Emotional Speech in Basque
Eva Navas | Amaia Castelruiz | Iker Luengo | Jon Sánchez | Inmaculada Hernáez

pdf bib

Evaluation of Different Similarity Measures for the Extraction of Multiword Units in a Reinforcement Learning Environment
Gaël Dias | Sérgio Nunes

pdf bib

Terminal Device Oriented Comparable Corpora and its Alignment- Towards Extracting Paraphrasing Patterns
Hiroshi Nakagawa | Hidetaka Masuda | Dai Sato

pdf bib

Towards Basic Categories for Describing Properties of Texts in a Corpus
Serge Sharoff

pdf bib

Using Weighted Abduction to Align Term Variant Translations in Bilingual Texts
Michael Carl | Ecaterina Rascu | Johann Haller

pdf bib

Investigation on Semantics to Improve the COVAX System
Luciana Bordoni

pdf bib

Incremental Knowledge Acquisition from WordNet and EuroWordNet
Wim Peters

pdf bib

Finding Semantic Associations on Express Lane
Vivi Năstase | Rada Mihalcea

pdf bib

Infrastructure for Collaborative Annotation of Speech
Mickel Grönroos | Manne Miettinen

pdf bib

Automatic Language-Independent Induction of Gazetteer Lists
Diana Maynard | Kalina Bontcheva | Hamish Cunningham

pdf bib

Corpus Design, Recording and Phonetic Analysis of Greek Emotional Database
Nikos Fakotakis

pdf bib

Human Dialogue Modelling Using Annotated Corpora
Yorick Wilks | Nick Webb | Andrea Setzer | Mark Hepple | Roberta Catizone

pdf bib

CrossTowns: Automatically Generated Phonetic Lexicons of Cross-lingual Pronunciation Variants of European City Names
Stefan Schaden

pdf bib

Pattern Discovery in Named Organization Corpus
Hsin-Hsi Chen | Yi-Lin Chu

pdf bib

Connector Usage in the English Essay Writing of Japanese EFL Learners
Masumi Narita | Chieko Sato | Masatoshi Sugiura

pdf bib

Detection of Domain Specific Terminology Using Corpora Comparison
Patrick Drouin

pdf bib

Comparative Evaluation of a Stochastic Parser on Semantic and Syntactic-semantic Labels
Wolfgang Minker

pdf bib

Sinica BOW (Bilingual Ontological Wordnet): Integration of Bilingual WordNet and SUMO
Chu-Ren Huang | Ru-Yng Chang | Hsiang-Pin Lee

pdf bib

How to Disassemble Alphabetical Processions - Morphological Treatment of Unknown Words
Stephan Bopp | Sandro Pedrazzini | Elisabeth Maier

pdf bib

Creating Slovenian Language Resources for Development of Speech-to-speech Translation Components
Darinka Verdonik | Matej Rojc | Zdravko Kačič

pdf bib

Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data
Magnus Sahlgren

pdf bib

The Development and Integration of the LDA-Toolkit Into COST249 SpeechDat(II) SIG Reference Recognizer
Bojan Kotnik | Zdravko Kačič | Bogomir Horvat

pdf bib

Duration Modeling For Turkish Text-to-Speech Synthesis System
Özlem Öztürk | Özgul Salor | Tolga Çiloğlu | Mubeccel Demirekler

pdf bib

Clustering Concept Hierarchies from Text
Philipp Cimiano | Andreas Hotho | Steffen Staab

pdf bib

Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy
Satoshi Sekine | Chikashi Nobata

pdf bib

Sejong Korean Corpora in the Making
Beom-mo Kang | Hunggyu Kim

pdf bib

Automatic Generation of Glosses in the OntoLearn System
Alessandro Cucchiarelli | Roberto Navigli | Francesca Neri | Paola Velardi

pdf bib

A Spoken Afrikaans Language Resource Designed for Research on Pronunciation Variations
Daan Wissing | Jean-Pierre Martens | Ulrike Janke | Wim Goedertier

pdf bib

The BITS Speech Synthesis Corpus for German
Tania Ellbogen | Florian Schiel | Alexander Steffen

pdf bib

MAUS Goes Iterative
Florian Schiel

pdf bib

EuroWordNet as a Resource for Cross-language Information Retrieval
Mark Stevenson | Paul Clough

pdf bib

Finding the Correct Interpretation of Swedish Compounds, a Statistical Approach
Jonas Sjöbergh | Viggo Kann

pdf bib

Automatic Extraction of Hyponyms from Japanese Newspapers. Using Lexico-syntactic Patterns
Maya Ando | Satoshi Sekine | Shun Ishizaki

pdf bib

Extending a Verb-lexicon Using a Semantically Annotated Corpus
Karin Kipper | Benjamin Snyder | Martha Palmer

pdf bib

The Centre for Dutch Language and Speech Technology (TST Centre)
J.C.T. Beeken | P.H.J. van der Kamp

pdf bib

A Global Data Category Registry for Interoperable Language Resources
Sue Ellen Wright

pdf bib

The Integrated Language Database of 8th - 21st-Century Dutch
J. G. Kruyt

pdf bib

From Acts and Topics to Transactions and Dialogue Smoothness
Hans Dybkjær | Laila Dybkjær

pdf bib

Grouping Synonymous Sentences from a Parallel Corpus
Hideki Kashioka

pdf bib

Discovery of (New) Knowledge and the Analysis of Text Corpora
Khurshid Ahmad | Maria Teresa Musacchio

pdf bib

Evaluation of Microphone Array Front-Ends for ASR - an Extension of the AURORA Framework
Harald Höge | Josef G. Bauer | Christian Geißler | Panji Setiawan | Kai Steinert

pdf bib

Development of Slovenian Broadcast News Speech Database
Janez Žibert | France Mihelič

pdf bib

A Named Entity Recognizer for Danish
Eckhard Bick

pdf bib

Portuguese Large-scale Language Resources for NLP Applications
Elisabete Ranchhod | Paula Carvalho | Cristina Mota | Anabela Barreiro

pdf bib

Development of a Corpus Workbench for the METU Turkish Corpus
Umut Özge | Bilge Say

pdf bib

Mercedes, a Term-in-Context Highlighter
Raúl Araya | Jordi Vivaldi

pdf bib

The Bilingual Web Dictionary on Demand
Henrik Selsøe Sørensen

pdf bib

Making an XML-based Japanese-Slovene Learners’ Dictionary
Tomaž Erjavec | Kristina Hmeljak Sangawa | Irena Srdanović | Anton ml. Vahčič

pdf bib

MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
Tomaž Erjavec

pdf bib

A Galician Textual Corpus for Morphosyntactic Tagging with Application to Text-to-Speech Synthesis
Lorena Seijo Pereiro | Ana Martínez Ínsua | Francisco Méndez Pazó | Francisco Campillo Díaz | Eduardo Rodríguez Banga

pdf bib

The SPARTACUS-Database: a Spanish Sentence Database for Offline Handwriting Recognition
Salvador España | María José Castro | José Luis Hidalgo

pdf bib

Exploring Balkanet Shared Ontology for Multilingual Conceptual Indexing
Sofia Stamou | Goran Nenadic | Dimitris Christodoulakis

pdf bib

Building a Paraphrase Corpus for Speech Translation
Mitsuo Shimohata | Eiichiro Sumita | Yuji Matsumoto

pdf bib

Incremental Methods to Select Test Sentences for Evaluating Translation Ability
Yasuhiro Akiba | Eiichiro Sumita | Hiromi Nakaiwa | Seiichi Yamamoto | Hiroshi G. Okuno

pdf bib

Reusable Lexical Representations for Idioms
Jan Odijk

pdf bib

The Design of Czech Language Formal Listening Tests for the Evaluation of TTS Systems
Daniel Tihelka | Jindřich Matoušek

pdf bib

A Data-driven Adaptation of Prosody in a Multilingual TTS
Janez Stergar | Caglayan Erdem | Bogomir Horvat | Zdravko Kačič

pdf bib

Probabilistic Detection of Context-Sensitive Spelling Errors
Johnny Bigert

pdf bib

Utilizing the One-Sense-per-Discourse Constraint for Fully Unsupervised Word Sense Induction and Disambiguation
Reinhard Rapp

pdf bib

A Freely Available Automatically Generated Thesaurus of Related Words
Reinhard Rapp

pdf bib

Using a Parallel Transcript/Subtitle Corpus for Sentence Compression
Vincent Vandeghinste | Erik Tjong Kim Sang

pdf bib

Handling Subtle Sense Distinctions Through Wordnet Semantic Types
Sofia Stamou | Dimitris Christodoulakis

pdf bib

Multi-lingual Evaluation of a Natural Language Generation System
Athanasios Karasimos | Amy Isard

pdf bib

The Tüba-D/Z Treebank: Annotating German with a Context-Free Backbone
Heike Telljohann | Erhard Hinrichs | Sandra Kübler

pdf bib

The NIST Meeting Room Pilot Corpus
John S. Garofolo | Christophe D. Laprun | Martial Michel | Vincent M. Stanford | Elham Tabassi

pdf bib

Securing Interpretability: The Case of Ega Language Documentation
Dafydd Gibbon | Catherine Bow | Steven Bird | Baden Hughes

pdf bib

A Comparative Study on Human Communication Behaviors and Linguistic Characteristics for Speech-to-Speech Translation
Toshiyuki Takezawa | Genichiro Kikui

pdf bib

Cost-effective Cross-lingual Document Classification
Núria Bel | Cornelis H.A. Koster | Marta Villegas

pdf bib

A Powerful and Versatile XML Format for Representing Role-semantic Annotation
Katrin Erk | Sebastian Padó

pdf bib

Putting the Dutch PAROLE Corpus to Work
P. H. J. van der Kamp | J. G. Kruyt

pdf bib

Acquiring Reusable Multilingual Phonotactic Resources
Julie Carson-Berndsen | Robert Kelly

pdf bib

Phonological Treebanks. Issues in Generation and Application
Moritz Neugebauer | Stephen Wilson

pdf bib

Methodology for Rapid Prototyping and Testing of ASR Based User Interfaces
Pedro Concejero Cerezo | Juan José Rodríguez Soler | Daniel Tapias Merino | Alberto J. Sánchez García

pdf bib

Open Resources for Language Technology
Lars Degerstedt | Arne Jönsson

pdf bib

Unsupervised Text Mining for Ontology Extraction: An Evaluation of Statistical Measures
Marie-Laure Reinberger | Walter Daelemans

pdf bib

Benchmarking Ontology Tools. A Case Study for the WebODE Platform.
Oscar Corcho | Raúl García-Castro | Asunción Gómez-Pérez

pdf bib

A Chatbot as a Novel Corpus Visualization Tool
Bayan Abu Shawar | Eric Atwell

pdf bib

Evaluating Variants of the Lesk Approach for Disambiguating Words
Florentina Vasilescu | Philippe Langlais | Guy Lapalme

pdf bib

The Rationale for Building an Ontology Expressly for NLP
Sergei Nirenburg | Marjorie McShane | Stephen Beale

pdf bib

Some Meaning Procedures of Ontological Semantics
Marjorie McShane | Stephen Beale | Sergei Nirenburg

pdf bib

Using the Penn Treebank to Evaluate Non-Treebank Parsers
Eric K. Ringger | Robert C. Moore | Eugene Charniak | Lucy Vanderwende | Hisami Suzuki

pdf bib

Comparison of Some Automatic and Manual Methods for Summary Evaluation Based on the Text Summarization Challenge 2
Hidetsugu Nanba | Manabu Okumura

pdf bib

The Lancaster Corpus of Mandarin Chinese: A Corpus for Monolingual and Contrastive Language Study
Anthony McEnery | Zhonghua Xiao

pdf bib

Word Sense Disambiguation as a Wordnets’ Validation Method in Balkanet
Dan Tufis | Radu Ion | Nancy Ide

pdf bib

Term Translations in Parallel Corpora: Discovery and Consistency Check
Dan Tufis

pdf bib

The Corpógrafo – a Web-based Environment for Corpora Research
Luís Sarmento | Belinda Maia | Diana Santos

pdf bib

Automatic Classification of Geographic Named Entities
Daniel Ferrés | Marc Massot | Muntsa Padró | Horacio Rodríguez | Jordi Turmo

pdf bib

Acquiring Bayesian Networks from Text
Olivia Sanchez-Graillet | Massimo Poesio

pdf bib

Developping Tools and Building Linguistic Resources for Vietnamese Morpho-syntactic Processing
Thanh Bon Nguyen | Thi Minh Huyen Nguyen | Laurent Romary | Xuan Luong Vu

pdf bib

SpeechRecorder - a Universal Platform Independent Multi-Channel Audio Recording Software
Christoph Draxler | Klaus Jänsch

pdf bib

An Evaluation Protocol for Text Mining Tools : ALCESTE, SAS Text Miner, SPAD-CRM and Temis Text Mining Solutions Testing
Yasmina Quatrain | Sylvaine Nugier | Anne Peradotto

pdf bib

Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian
Alessandro Panunzi | Eugenio Picchi | Massimo Moneglia

pdf bib

Exploiting Semantic Web Technologies for Intelligent Access to Historical Documents
Nancy Ide | David Woolner

pdf bib

Using Cooccurrence Statistics and the Web to Discover Synonyms in a Technical Language
Marco Baroni | Sabrina Bisi

pdf bib

Semi-supervised Learning by Fuzzy Clustering and Ensemble Learning
Hiroyuki Shinnou | Minoru Sasaki

pdf bib

Speech & Expression; the Value of a Longitudinal Corpus
Nick Campbell

pdf bib

A Complete Understanding Speech System Based on Semantic Concepts
Salma Jamoussi | Kamel Smaïli | Dominique Fohr | Jean-Paul Haton

pdf bib

The CLaRK System: XML-based Corpora Development System for Rapid Prototyping
Kiril Simov | Alexander Simov | Hristo Ganev | Krasimira Ivanova | Ilko Grigorov

pdf bib

NLP-enhanced Error Checking for Catalan Unrestricted Text
Toni Badia | Àngel Gil | Martí Quixal | Oriol Valentín

pdf bib

Open-source Tools for Creation, Maintenance, and Storage of Lexical Resources for Language Generation from Ontologies
Kalina Bontcheva

pdf bib

User Query Analysis for the Specification and Evaluation of a Dialogue Processing and Retrieval System
Agnes Lisowska | Andrei Popescu-Belis | Susan Armstrong

pdf bib

Creation of Reusable Components and Language Resources for Named Entity Recognition in Russian
Borislav Popov | Angel Kirilov | Diana Maynard | Dimitar Manov

pdf bib

Abstracting a Dialog Act Tagset for Meeting Processing
Andrei Popescu-Belis

pdf bib

Online Evaluation of Coreference Resolution
Andrei Popescu-Belis | Loïs Rigouste | Susanne Salmon-Alt | Laurent Romary

pdf bib

FreeLing: An Open-Source Suite of Language Analyzers
Xavier Carreras | Isaac Chao | Lluís Padró | Muntsa Padró

pdf bib

Phrase-Based Dependency Evaluation of a Japanese Parser
Hisami Suzuki

pdf bib

Functional Requirements for an Interlinear Text Editor
Baden Hughes | Catherine Bow | Steven Bird

pdf bib

The American English SALA-II Data Collection
Peter A. Heeman

pdf bib

How Does Automatic Machine Translation Evaluation Correlate with Human Scoring as the Number of Reference Translations Increases?
Andrew Finch | Yasuhiro Akiba | Eiichiro Sumita

pdf bib

Evaluating the FOKS Error Model
Slaven Bilac | Timothy Baldwin | Hozumi Tanaka

pdf bib

Evaluation of a Speech Cuer: From Motion Capture to a Concatenative Text-to-cued Speech System
Guillaume Gibert | Gérard Bailly | Frédéric Eliséi | Denis Beautemps | Rémi Brun

pdf bib

Beyond TREC’s Filtering Track
Nikolaos Nanas | Victoria Uren | Anne de Roeck | John Domingue

pdf bib abs

A Corpus-based Syntactic Lexicon for Adverbs
Sanni Nimb

A word class often neglected in the field of NLP resources, namely adverbs, has lately been described in a computational lexicon produced at CST as one of the results of a Ph.D. project. The adverb lexicon, which is integrated in the Danish STO lexicon, gives detailed syntactic information on the type of modification and position, as well as on other syntactic properties of approximately 800 Danish adverbs. One of the aims of the lexicon has been to establish a clear distinction between syntactic and semantic information - where other lexicons often generalize over the syntactic behavior of semantic classes of adverbs, every adverb is described with respect to its proper syntactic behavior in a text corpus, revealing very individual syntactic properties. Syntactic information on adverbs is needed in NLP systems generating text to ensure that adverbs are placed correctly in the phrase they modify. In systems analyzing text, this information is also needed in order to attach the adverbs to the right node in the syntactic parse trees. Within the field of linguistic research, several results can be deduced from the lexicon, e.g. knowledge of syntactic classes of Danish adverbs.

pdf bib abs

The Future of Evaluation for Cross-Language Information Retrieval Systems
Carol Peters | Martin Braschler | Khalid Choukri | Julio Gonzalo | Michael Kluck

The objective of the Cross-Language Evaluation Forum (CLEF) is to promote research in the multilingual information access domain. In this short paper, we list the achievements of CLEF during its first four years of activity and describe how the range of tasks has been considerably expanded during this period. The aim of the paper is to demonstrate the importance of evaluation initiatives with respect to system research and development and to show how essential it is for such initiatives to keep abreast of and even anticipate the emerging needs of both system developers and application communities if they are to have a future.

pdf bib abs

The SALA II project comprises mobile telephone recordings according to the SpeechDat (II) paradigm for several languages in North and Latin America. Each database contains the recordings of 1000 speakers, with the exception of US Spanish (2000 speakers) and US English (4000 speakers). A quarter of the recordings of each database are made respectively in a quiet environment (home/office), in the street, in a public place, and in a moving vehicle. This paper presents an evaluation of the project. The paper details experiences with respect to the implementation of design specifications, speaker recruitment, data recordings (on site), data processing, orthographic transcription and lexicon generation. Furthermore, the validation procedure and its results are documented. Finally, the availability and distribution of the databases are addressed.

pdf bib abs

Parallel Corpora for the Galician Language: Building and Processing of the CLUVI (Linguistic Corpus of the University of Vigo)
Xavier Gómez-Guinovart | Elena Sacau Fontenla

In this paper, we present the methodology developed by the SLI (Computational Linguistics Group of the University of Vigo) for the building and processing of the CLUVI Corpus, showing the TMX-based XML specification designed to encode both morphosyntactic features and translation alignments in parallel corpora, and the solutions adopted for making the CLUVI parallel corpora freely available over the WWW (http://sli.uvigo.es/CLUVI/).

pdf bib abs

PBIE: A Data Preparation Toolkit Toward Developing a Parsing-Based Information Extraction System
Junko Hosaka | Igor V. Kurochkin | Akihiko Konagaya

We have developed a toolkit in which an annotation tool, a syntactic tree editor, and an extraction rule editor interact dynamically. Its output can be stored in a database for further use. In the field of biomedicine, there is a critical need for automatic text processing. However, current language processing approaches suffer from insufficient basic data incorporating both human domain expertise and domain-specific language processing capabilities. With the annotation tool presented here, a set of “gold standards” can be collected, representing what should be extracted. At the same time, any change in annotation can be viewed on an associated syntactic tree. These facilities provide a clear picture of the relationship between the extraction target and the syntactic tree. Underlying sentences can be analyzed with a parser which can be plugged in, or a set of parsed sentences can be used to generate the tree. Extraction rules written with the integrated editor can be applied at once, and their validity can immediately be verified both on the syntactic tree and on the sentence string by coloring the corresponding segments. Thus our toolkit enables the user to efficiently construct parse-based extraction rules. PBIE2 works under Windows 2000/XP and requires Microsoft Internet Explorer 6.0 or higher. The data can be stored in Microsoft Access.

pdf bib abs

A Syntactically Annotated Corpus of Tibetan
Andreas Wagner | Bettina Zeisler

This paper describes the creation of a syntactically annotated Tibetan corpus. This corpus forms a part of the TUSNELDA collection of corpora and databases for linguistic research. It will ultimately comprise spoken and written Tibetan texts originating from different regions and historical epochs. These texts are annotated with several kinds of linguistic information, in particular POS tags, phrases, argument structures of verbs, clauses and sentences, as well as several kinds of discourse units and textual segments. The annotation is done in XML. The primary research interest which guides the development of the corpus is the investigation of cross-clausal references, especially the relation between empty arguments (i.e. arguments not overtly realised in a clause) and their antecedents in previous clauses. For this purpose, such references are explicitly encoded so that they can be qualitatively and quantitatively evaluated with the help of standard XML techniques such as XPath search and XSLT transformations. Apart from this primary research interest, we expect that our corpus will be useful for other projects concerning Tibetan and related languages. Like other data in TUSNELDA, it will be made accessible via a WWW query interface.

pdf bib abs

Lexical Entry Templates for Robust Deep Parsing
Montserrat Marimon | Núria Bel

We report on the development and employment of lexical entry templates in a large-coverage unification-based grammar of Spanish. The aim of the work reported in this paper is to provide robust deep linguistic processing in order to make the grammar more adequate for industrial NLP applications.

pdf bib abs

Tiered Tagging Revisited
Dan Tufis | Liviu Dragomirescu

In this paper we describe a new baseline tagset induction algorithm which, unlike the one described in previous work, is fully automatic and produces tagsets with better performance than before. The algorithm is an information-lossless transformation of the MULTEXT-EAST compliant lexical tags (MSD) into a reduced tagset that can be mapped back onto the lexicon tagset fully deterministically. From the baseline tagsets, a corpus linguist who is an expert in the language in question may further reduce the tagsets, taking into account the distributional properties of the language. As any further reduction of the baseline tagsets entails losing information, adequate recovery rules should be designed to ensure the final tagging in terms of the lexicon encoding. The algorithm is described in detail, and the generated baseline tagsets for Czech, English, Estonian, Hungarian, Romanian and Slovene are evaluated. They are much smaller than the corresponding MSDs and systematically ensure better tagging accuracy.

pdf bib abs

A Methodology and Associated Tools for Building Interlingual Wordnets
Dan Tufis | Eduard Barbu

The paper describes the methodology and the tools we developed for the purpose of building a Romanian wordnet. The work is carried out within the BalkaNet European project and is concerned with wordnets for Bulgarian, Czech, Greek, Romanian, Serbian and Turkish, all of them aligned via an interlingual index (ILI) to Princeton Wordnet. The structuring of the wordnets follows the principles adopted in EuroWordNet. In order to ensure maximal cross-lingual lexical coverage, the consortium decided to implement the same concepts, represented by a common set of ILI concepts. We describe the selection of concepts to be implemented in all the monolingual wordnets. The methodologies adopted by each partner were different, and they depended on the language resources and personnel available. For the Romanian wordnet, we decided that it should be based on the reference lexicographic descriptions of Romanian which we had in electronic form: EXPD, a heavily XML-annotated explanatory dictionary (developed in the previous CONCEDE project and based on the standard Explanatory Dictionary of Romanian); SYND, a published dictionary of synonyms which we keyboarded, encoded and completed with more than 4000 new synonymy sets extracted from EXPD; and EnRoD, a Romanian-English dictionary, most of it extracted automatically from parallel corpora and further hand-validated and extended. Besides these resources, like all the other members of the consortium, we had at our disposal the interlingual mapping of the Princeton Wordnet. All the above-mentioned resources have been incorporated into a user-friendly system, WnBuilder, which allows for cooperative work by a large number of lexicographers. When the distributed work is put together, the synsets are validated. Several errors show up, the most frequent and difficult to solve being the case of a literal with the same sense number appearing in different synsets. We discuss reasons for such conflicts as well as their correction, supported by another utility program called WnCorrector. The full paper presents WnBuilder and WnCorrector, as well as the status of the Romanian wordnet development.

pdf bib abs

Construction of a Bilingual Arabic-Spanish Lexicon of Verbs Based on a Parallel Corpus
Doaa Samy | Antonio Moreno-Sandoval | José M. Guirao

Parallel corpora are considered an important resource for the development of linguistic tools. In this paper our main goal is the development of a bilingual lexicon of verbs. The construction of this lexicon is possible using two main resources: I) a parallel corpus (through the alignment); II) the linguistic tools developed for Spanish (which serve as a starting point for developing tools for Arabic). In the end, aligned equivalent verbs are detected automatically from a Spanish-Arabic parallel corpus. To achieve this goal, we had to pass through different preparatory stages: the assessment of the parallel corpus, the monolingual tokenization of each corpus, a preliminary sentence alignment and, finally, the application of the model for automatic extraction of equivalent verbs. Our method is hybrid, since it combines both statistical and linguistic approaches.

pdf bib abs

This project combines linguistic and statistical information to develop a term extraction tool for Basque. Because Basque is an agglutinative and highly inflected language, the treatment of morphosyntactic information is vital. In addition, due to the late unification process of the language, texts present higher term dispersion than in a highly normalized language. The result is a semi-automatic terminology extraction tool based on XML, for use in technical and scientific information management.

pdf bib abs

A Bayesian Model for Shallow Syntactic Parsing of Natural Language Texts
Manolis Maragoudakis | Nikos Fakotakis | George Kokkinakis

In the present work, we introduce and evaluate a novel Bayesian syntactic shallow parser that is able to perform robust detection of pairs of subject-object and subject-direct object-indirect object for a given verb in a natural language sentence. The shallow parser infers the correct subject-object pairs based on knowledge provided by Bayesian network learning from annotated text corpora. The DELOS corpus, a collection of economic domain texts that has been automatically annotated using various morphological and syntactic tools, was used as training material. Our shallow parser makes use of limited linguistic input. More specifically, we consider only part of speech tagging, the voice and the mood of the verb as well as the head word of a noun phrase. For the task of detecting the head word of a phrase we used a sentence boundary detector. Identifying the head word of a noun phrase, i.e. the word that holds the morphological information (case, number) of the whole phrase, also proves to be very helpful for our task as its morphological tag is all the information that is needed regarding the phrase. The evaluation of the proposed method was performed against three other machine learning techniques, namely naive Bayes, k-Nearest Neighbor and Support Vector Machines, methods that have been previously applied to natural language processing tasks with satisfactory results. The experimental outcomes show a satisfactory performance of our proposed shallow parser, which reaches almost 92 per cent in terms of precision.

pdf bib abs

Multifunctional Computational Lexicon of Contemporary Portuguese: An Available Resource for Multitype Applications
Florbela Barreto | Raquel Amaro

This paper presents some aspects of the first Portuguese frequency lexicon extracted from a corpus of large dimensions. The Multifunctional Computational Lexicon of Contemporary Portuguese (henceforth MCL) arose from the need to fill a gap in the study of contemporary Portuguese. Until recently, the frequency lexicons of Portuguese were of very small dimensions, such as Português Fundamental, which consists of 2,217 words extracted from a 700,000-word corpus, and the Frequency Dictionary of Portuguese Words, based on a literary corpus of 500,000 words. We describe here the main steps taken for collecting the lexical and frequency data and some of the major problems that arose in the process. The resulting lexicon is a freely available, reliable resource for several types of applications.

pdf bib abs

Use and Evaluation of Prosodic Annotations in Dutch
Jacques Duchateau | Tim Ceyssens | Hugo Van hamme

In the development of annotations for a spoken database, an important issue is whether the annotations can be generated automatically with sufficient precision, or whether expensive manual annotations are needed. In this paper, the case of prosodic annotations is discussed, which was investigated on the CGN database (Spoken Dutch Corpus). The main conclusions of this work are as follows. First, it was found that the available amount of manual prosodic annotations is sufficient for the development of our (baseline, decision tree based) prosodic models. In other words, more manual annotations do not improve the models. Second, the developed prosodic models for prominence are insufficiently accurate to produce automatic prominence annotations that are as good as the manual ones. But on the other hand the consistency between manual and automatic break annotations is as high as the inter-transcriber consistency for breaks. So given the current amount of manual break annotations, annotations for the remainder of the CGN database can be generated automatically with the same quality as the manual annotations.

pdf bib abs

Resources and Techniques for Multilingual Information Extraction
Stephan Busemann | Hans-Ulrich Krieger

Official travel warnings published regularly on the internet by the ministries for foreign affairs of France, Germany, and the UK provide a useful resource for assessing the risks associated with travelling to some countries. The shallow IE system SProUT has been extended to meet the specific needs of delivering a language-neutral output for English, French, or German input texts. A shared type hierarchy, a feature-enhanced gazetteer resource, and generic techniques of merging chunk analyses into larger results are major reusable results of this work.

pdf bib abs

Evaluating Factors Impacting the Accuracy of Forced Alignments in a Multimodal Corpus
Lei Chen | Yang Liu | Mary Harper | Eduardo Maia | Susan McRoy

People, when processing human-to-human communication, utilize everything they can in order to understand that communication, including speech and information such as the time and location of an interlocutor's gesture and gaze. Speech and gesture are known to exhibit a synchronous relationship in human communication; however, the precise nature of that relationship requires further investigation. The construction of computer models of multimodal human communication would be enabled by the availability of multimodal communication corpora annotated with synchronized gesture and speech features. To investigate the temporal relationships of these knowledge sources, we have collected and are annotating several multimodal corpora with time-aligned features. Forced alignment between a speech file and its transcription is a crucial part of multimodal corpus production. This paper investigates a number of factors that may contribute to highly accurate forced alignments to support the rapid production of these multimodal corpora including the acoustic model, the match between the speech used for training the system and that to be force aligned, the amount of data used to train the ASR system, the availability of speaker adaptation, and the duration of alignment segments.

pdf bib abs

The present study focuses on automatic processing of sibling resources of audio and written documents, such as those available in audio archives or for parliament debates: written texts are close but not exact audio transcripts. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and spoken material and they yield low cost resources for acoustic model training. When automatically transcribing the audio data, regions of agreement between automatic transcripts and written sources allow time-codes to be transferred to the written documents: this may be helpful in an audio archive or audio information retrieval environment. Regions of disagreement can be automatically selected for further correction by human transcribers. This study makes use of 10 hours of French radio interview archives with corresponding press-oriented transcripts. The audio corpus has then been transcribed using the LIMSI speech recognizer resulting in automatic transcripts, exhibiting an average word error rate of 12%. 80% of the text corpus (with word chunks of at least five words) can be exactly aligned with the automatic transcripts of the audio data. The residual word error rate on these 80% is less than 1%.
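
The idea of keeping only exactly matching chunks of at least five words can be sketched as follows. This is a minimal illustration, not the method used in the study: it assumes both documents are already tokenized into word lists, uses Python's standard difflib matcher as a stand-in for the alignment step, and the function names and example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def aligned_chunks(written, automatic, min_len=5):
    """Return (written_start, automatic_start, length) for each exactly
    matching word chunk of at least min_len words."""
    # autojunk=False so frequent words are not discarded by the heuristic
    matcher = SequenceMatcher(a=written, b=automatic, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


def aligned_fraction(written, automatic, min_len=5):
    """Share of the written text covered by chunks kept by aligned_chunks."""
    if not written:
        return 0.0
    chunks = aligned_chunks(written, automatic, min_len)
    return sum(size for _, _, size in chunks) / len(written)


# Hypothetical example: a press-style text and a recognizer output
written = "the minister said that the budget will be voted next week in parliament".split()
automatic = "the minister said that the budget would be voted next week in the parliament".split()

chunks = aligned_chunks(written, automatic)
fraction = aligned_fraction(written, automatic)
```

In this toy example two chunks survive the five-word threshold; shorter matches such as the final isolated word are left for manual checking, which mirrors the split between regions of agreement and regions of disagreement described above.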

pdf bib abs

Evaluation of a Spoken Phonetic Database in Basque Language
V. Guijarrubia | I. Torres | L.J. Rodríguez

In this paper we present the evaluation of a spoken phonetic corpus designed to train acoustic models for Speech Recognition applications in the Basque language. A complete set of acoustic-phonetic decoding experiments was carried out over the proposed database. Context dependent and independent phoneme units were used in these experiments with two different approaches to acoustic modeling, namely discrete and continuous Hidden Markov Models (HMMs). A complete set of HMMs was trained and tested with the database. Experimental results reveal that the database is large and phonetically rich enough to yield high-quality acoustic models for integration into Continuous Speech Recognition Systems.

pdf bib abs

Using Paradigm Tables to Generate New Utterances Similar to those Existing in Linguistic Resources
Yves Lepage | Guilhem Peralta

We examine the possibility of creating new linguistic utterances (small sentences) similar to those already present in an existing linguistic resource. Using paradigm tables ensures that the newly generated sentences resemble previous data while still being different. We report an experiment in which 1,201 new correct sentences were generated starting from only 22 seed sentences.

pdf bib

Collection and Evaluation of Broadcast News Data for Arabic
Mohamed Afify | Ossama Emam

pdf bib abs

A Language Resources Infrastructure for Bulgarian
Kiril Simov | Petya Osenova | Sia Kolkovska | Elisaveta Balabanova | Dimitar Doikoff

This paper describes the infrastructure of a basic language resources set for Bulgarian in the context of the BLARK initiative requirements. We focus on the treebanking task as a trigger for basic language resources compilation. Two strategies have been applied in this respect: (1) implementing the main pre-processing modules before the treebank compilation and (2) creating more elaborate types of resources in parallel to the treebank compilation. The description of language resources within the BulTreeBank project is divided into two parts: language technology, which includes tokenization, morphosyntactic analyzer, morphosyntactic disambiguation, partial grammars, and language data, which includes the layers of the BulTreeBank corpus and the variety of lexicons. The advantages of our approach to a less-spoken language (like Bulgarian) are as follows: it triggers the creation of the basic set of language resources which are lacking for certain languages and it raises the question of how language resources should be created.

pdf bib abs

This paper deals with databases that combine different aspects: children's speech, emotional speech, human-robot communication, cross-linguistics, and read vs. spontaneous speech: in a Wizard-of-Oz scenario, German and English children had to instruct Sony's AIBO robot to fulfil specific tasks. In one experimental condition, strictly parallel for German and English, the AIBO behaved 'disobediently' by following its own script irrespective of the child's commands. In this way, reactions of different children to the same sequence of AIBO's actions could be obtained. In addition, both the German and the English children were recorded reading texts. The data are transliterated orthographically; emotional user states and some other phenomena will be annotated. We report preliminary word recognition rates and classification results.

pdf bib abs

The Role of MultiWord Terminology in Knowledge Management
James Dowdall | Will Lowe | Jeremy Ellman | Fabio Rinaldi | Michael Hess

One of the major obstacles for knowledge management remains MultiWord Terminology (MWT). This paper explores the difficulties that arise and describes real world solutions implemented as part of the Parmenides project. Parmenides is being built as an integrated knowledge management package that combines information, MWT and ontology extraction methods in a semi-automated framework. The focus of this paper is on eliciting ontological fragments based on dedicated MWT processing.

pdf bib abs

The OPUS Corpus - Parallel and Free: http://logos.uio.no/opus
Jörg Tiedemann | Lars Nygaard

The OPUS corpus is a growing collection of translated documents collected from the internet. The current version contains about 30 million words in 60 languages. The entire corpus is sentence aligned and it also contains linguistic markup for certain languages.

pdf bib abs

Selecting the Correct English Synset for a Spanish Sense
Javier Farreres | Horacio Rodríguez

This work tries to enrich the Spanish Wordnet using a Spanish taxonomy as a knowledge source. The Spanish taxonomy is composed of Spanish senses, while Spanish Wordnet is composed of synsets, mostly linked to English WordNet. A set of weighted associations between Spanish words and Wordnet synsets is used for inferring associations between both taxonomies.

pdf bib abs

The goal of this project (LILA) is the collection of a large number of spoken databases for training Automatic Speech Recognition Systems for telephone applications in the Asian Pacific area. Specifications follow those of SpeechDat-like databases. Utterances will be recorded directly from calls made either from fixed or cellular telephones and are composed of read text and answers to specific questions. The project is driven by a consortium composed of a large number of industrial companies. Each company is in charge of the production of two databases. The consortium shares the databases produced in the project. The goal of the project should be reached within the year 2005.

pdf bib abs

Derivational Relations in Flectional Languages - Czech Case
Jaroslava Hlaváčová | Jana Klímová

When a text in any language is submitted to a morphological analysis, some words always remain unrecognized. We can lower their number by adding new words to the dictionary used by the morphological analyzer, but we can never capture the whole of the language. The system described in this paper (we call it "derivation module") deals with the unknown derived words. It aims not only at analyzing but also at synthesizing Czech derived words. Such a system is of particular value for automatic processing of languages where derivational morphology plays an important role in regular word formation.

pdf bib

Standards for Language Codes: developing ISO 639
David Dalby | Lee Gillam | Christopher Cox | Debbie Garside

pdf bib abs

SLR Validation: Current Trends and Developments
Henk van den Heuvel | Dorota Iskra | Eric Sanders | Folkert de Vriend

This paper deals with the quality evaluation (validation) of Spoken Language Resources (SLR). The current situation in terms of relevant validation criteria and procedures is briefly presented. Next, a number of validation issues related to new data formats (XML-based annotations, UTF-16 encoding) are discussed. Further, new validation cycles that were introduced in a series of new projects like SpeeCon and OrienTel are addressed: prompt sheet validation, lexicon validation and pre-release validation. Finally, SPEX's current and future

pdf bib

Identifying Definitions in Text Collections for Question Answering
Horacio Saggion

pdf bib abs

Multiple Sequence Alignment for Characterizing the Lineal Structure of Revision
Laura Alonso | Irene Castellón | Jordi Escribano | Xavier Messeguer | Lluís Padró

We present a first approach to the application of a data mining technique, Multiple Sequence Alignment, to the systematization of a polemic aspect of discourse, namely, the expression of contrast, concession, counterargument and semantically similar discursive relations. The representation of the phenomena under study is carried out by very simple techniques, mostly pattern-matching, but the results allow insightful conclusions to be drawn on the organization of this aspect of discourse: equivalence classes of discourse markers are established, and systematic patterns are discovered, which will be applied in enhancing a discursive parser.

pdf bib abs

Mining the Web for Discourse Markers
Ben Hutchinson

This paper proposes a methodology for obtaining sentences containing discourse markers from the World Wide Web. The proposed methodology is particularly suitable for collecting large numbers of discourse marker tokens. It relies on the automatic identification of discourse markers, and we show that this can be done with an accuracy within 9% of that of human performance. We also show that the distribution of discourse markers on the web correlates highly with that in a conventional balanced corpus.

pdf bib abs

A Pattern Extraction Workbench Combining Multiple Linguistic Levels
Magnus Merkel | Andreas Lange

In this paper an interactive pattern extraction workbench, I*Pex, is presented. The workbench comes in a graphical environment and is designed to be used in an incremental and interactive fashion with the user. Patterns can be constructed to work in combination involving specifications on several linguistic levels simultaneously, from the character level using regular expressions, parts of speech and dependency relations to semantic roles. The input text format is based on XCES XML format.

pdf bib abs

Exploiting Coreference Annotations for Text-to-Hypertext Conversion
Anke Holler | Jan Frederik Maas | Angelika Storrer

The paper describes an annotation scheme for coreference developed within the application context of text-to-hypertext conversion. In this context coreference is used (1) for generating document-internal and cross-document hyperlinks, and (2) for resolving anaphoric expressions in order to achieve cohesive closedness in hypertext nodes. We will argue that for the purpose of cross-document linking it is necessary to separate the annotation of coreference relations from the annotation of anaphoric relations. To account for this requirement, we developed a knowledge-based annotation scheme that relates referential expressions in the text to entities in a knowledge representation, which is modeled using XML Topic Maps.

pdf bib abs

“Why do you Ignore me?” - Proof that not all Direct Speech is Bad
Laura Hasler

In the automatic summarisation of written texts, direct speech is usually deemed unsuitable for inclusion in important sentences. This is due to the fact that humans do not usually include such quotations when they create summaries. In this paper, we argue that despite generally negative attitudes, direct speech can be useful for summarisation and ignoring it can result in the omission of important and relevant information. We present an analysis of a corpus of annotated newswire texts in which a substantial amount of speech is marked by different annotators, and describe when and why direct speech can be included in summaries. In an attempt to make direct speech more appropriate for summaries, we also describe rules currently being developed to transform it into a more summary-acceptable format.

pdf bib abs

“Human Language Technology Elements in a Knowledge Organisation System - The VID Project”
Costanza Navarretta | Bolette Sandford Pedersen | Dorte Haltrup Hansen

This paper describes how Human Language Technologies and linguistic resources are used to support the construction of components of a knowledge organisation system. In particular we focus on methodologies and resources for building a corpus-based domain ontology and extracting relevant metadata information for text chunks from domain-specific corpora.

pdf bib abs

Development of Bilingual Domain-Specific Ontology for Automatic Conceptual Indexing
Natalia V. Loukachevitch | Boris V. Dobrov

In this paper we describe the development, means of evaluation and applications of the Russian-English Sociopolitical Thesaurus, specially developed as a linguistic resource for automatic text processing applications. The Sociopolitical domain is not a domain of social research but a broad domain of social relations including economic, political, military, cultural, sports and other subdomains. Knowledge of this domain is necessary for automatic text processing of such important documents as official documents, legislative acts, and newspaper articles.

pdf bib abs

Development of Ontologies with Minimal Set of Conceptual Relations
Natalia V. Loukachevitch | Boris V. Dobrov

In this paper we describe our approach to the development of ontologies with a small number of relation types. Non-taxonomic relations in our ontologies are based on the conception of ontological dependence described in formal ontology. This minimal set of relations does not depend on a domain or a task and makes it possible to begin the ontology construction at once, as soon as a task is set and a domain is determined, and to obtain the first version of an ontology in a short time. Such an initial ontology can be used for information-retrieval applications and can serve as a structural basis for further development of the ontology.

pdf bib abs

Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons
Maria Fernanda Bacelar do Nascimento | Amália Mendes | Luísa Pereira

Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL's webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at present, more than 300 million words, taken by sampling from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from CRPC for specific CLUL projects and now available for on-line queries; b) a published sample of "Português Fundamental", a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon and the SIMPLE Lexicon.

pdf bib abs

Automatisation of the Activity of Term Collection in Different Languages
Bruno Cartoni | Pierrette Bouillon | Yalina Alphonse | Sabine Lehmann

This article describes the use and development of a tool for grammar and terminology control (FLAG), for the purposes of automating the verification of terminology for a large-scale user of multilingual terminology. It describes the various advantages of the tool and shows a process for transforming a traditional terminology list into a list of inflected forms as well as patterns which can be used to find possible morpho-syntactic derivations of terms.

pdf bib abs

Automatically Selecting Domain Markers for Terminology Extraction
Jorge Vivaldi | Horacio Rodríguez

Some approaches to automatic terminology extraction from corpora imply the use of existing semantic resources for guiding the detection of terms. Most of these systems exploit specialised resources, like UMLS in the medical domain, while a few try to take advantage of general-purpose semantic resources, like EuroWordNet (EWN). As the term extraction task is clearly domain-dependent, when a general-purpose resource without specific domain information is used, we need a way of attaching domain information to the units of the resource. For big resources it is desirable that this semantic enrichment be carried out automatically. Given a specific domain, our proposal aims to detect in EWN those units that can be considered as domain markers (DM). We can define a DM as an EWN entry whose attached strings belong to the domain, as well as the variants of all its descendants through the hyponymy relation. The procedure we propose in this paper is fully automatic and, a priori, domain-independent. The only external knowledge it uses is a set of terms (an external vocabulary), each of which is considered to have at least one sense belonging to the domain.

pdf bib abs

This paper presents EASY (Evaluation of Analyzers of SYntax), an ongoing evaluation campaign of syntactic parsing of French, a subproject of EVALDA in the French TECHNOLANGUE program. After presenting the elaboration of the annotation formalism, we describe the corpus building steps, the annotation tools, the evaluation measures and finally, plans to produce a validated large linguistic resource, syntactically annotated.

pdf bib abs

Annotators’ Agreement: The Case of Topic-Focus Articulation
Kateřina Veselá | Jiří Havelka | Eva Hajičová

The annotation of the Prague Dependency Treebank (PDT) is conceived of as a multilayered scenario that also comprises dependency representations (tectogrammatical tree structures, TGTS's) of the underlying structure of the sentences. TGTS's capture three basic aspects of the underlying structure of sentences: (a) the dependency tree structure, (b) the kinds of dependency syntactic relations, and (c) the basic characteristics of the topic-focus articulation (TFA). Since the PDT is a large collection and the annotations on the deepest layer are to a large extent performed by several human annotators (based on an automatic preprocessing module), it is more than necessary to observe the consistency of annotators and the agreement among them. In the present paper, we summarize the results of the evaluation of parallel annotations of several samples taken from PDT and the measures adopted to improve the consistency of annotations.

pdf bib abs

Evaluating Lexical Resources for a Semantic Tagger
Scott S. L. Piao | Paul Rayson | Dawn Archer | Tony McEnery

Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP community. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and some corpora of Newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them ‘future proof’, we need to evaluate their potential both synchronically and diachronically across genres.
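
As a rough illustration of how a coverage figure of this kind can be read, lexical coverage may be computed as the percentage of running tokens that are found in the lexicon. The sketch below assumes token-level, case-insensitive lookup against a plain set; the function name, the toy lexicon and the matching policy are hypothetical and are not taken from the USAS tagger or from the evaluation procedure of the paper.

```python
def lexical_coverage(tokens, lexicon):
    """Percentage of tokens whose lowercased form is in the lexicon."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in lexicon)
    return 100.0 * known / len(tokens)


# Hypothetical toy data: "sat" is missing from the lexicon
toy_lexicon = {"the", "cat", "on", "mat"}
toy_tokens = ["The", "cat", "sat", "on", "the", "mat"]

coverage = lexical_coverage(toy_tokens, toy_lexicon)
```

Running the same function over corpora from different genres or periods gives one coverage value per corpus, which is the kind of comparison the abstract reports.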

pdf bib abs

Multimodal Meaning Representation for Generic Dialogue Systems Architectures
Frédéric Landragin | Alexandre Denis | Annalisa Ricci | Laurent Romary

A unified language for the communicative acts between agents is essential for the design of multi-agent architectures. Whatever the type of interaction (linguistic, multimodal, including particular aspects such as force feedback), whatever the type of application (command dialogue, request dialogue, database querying), the concepts are common and we need a generic meta-model. In order to tend towards task-independent systems, we need to clarify the module parameterization procedures. In this paper, we focus on the characteristics of a meta-model designed to represent meaning in linguistic and multimodal applications. This meta-model is called MMIL for MultiModal Interface Language, and has first been specified in the framework of the IST MIAMM European project. What we want to test here is how relevant MMIL is for a completely different context (a different task, a different interaction type, a different linguistic domain). We detail the exploitation of MMIL in the framework of the IST OZONE European project, and we draw conclusions on the role of MMIL in the parameterization of task-independent dialogue managers.

pdf bib abs

STO: A Danish Lexicon Resource - Ready for Applications
Anna Braasch | Sussi Olsen

This paper deals with the STO lexicon, the most comprehensive computational lexicon of Danish developed for NLP/HLT applications, which is now ready for use. Danish was one of the 12 EU languages participating in the LE-PAROLE and SIMPLE projects; it was therefore natural to continue this work, building on the experience gained from these projects. The material for Danish produced within these projects, further enriched with language-specific information, is incorporated into the STO lexicon. First, we describe the main characteristics of the lexical coverage and linguistic content of the STO lexicon; second, we present some recent uses and point to some prospective exploitations of the material. Finally, we outline an internet-based user interface, which allows for browsing through the complex information content of the STO lexical database and some other selected WRL’s for Danish.

pdf bib abs

A Domain-Independent Approach to IE Rule Development
Kalliopi Zervanou | John McNaught

A key element for the extraction of information in a natural language document is a set of shallow text analysis rules, which are typically based on pre-defined linguistic patterns. Current Information Extraction research aims at the automatic or semi-automatic acquisition of these rules. Within this research framework, we consider in this paper the potential for acquiring generic extraction patterns. Our research is based on the hypothesis that terms (the linguistic representation of concepts in a specialised domain) and Named Entities (the names of persons, organisations and dates of importance in the text) can together be considered as the basic semantic entities of textual information and can therefore be used as a basis for the conceptual representation of domain specific texts and the definition of what constitutes an information extraction template in linguistic terms. The extraction patterns discovered by this approach involve significant associations of these semantic entities with verbs and they can subsequently be translated into the grammar formalism of choice.

The aim of the MEDIA project is to design and test a methodology for the evaluation of context-dependent and independent spoken dialogue systems. We propose an evaluation paradigm based on the use of test suites from real-world corpora and a common semantic representation and common metrics. This paradigm should allow us to diagnose the context-sensitive understanding capability of dialogue systems. This paradigm will be used within an evaluation campaign involving several sites all of which will carry out the task of querying information from a database.

pdf bib abs

The C-ORAL-ROM project has delivered a multilingual corpus of spontaneous speech for the main Romance languages (Italian, French, Portuguese and Spanish). The collection aims to represent the variety of speech acts performed in everyday language and to enable the description of prosodic and syntactic structures in the four Romance languages. Sampling criteria are defined in a corpus design scheme. C-ORAL-ROM adopts two different sampling strategies, one for the formal and one for the informal part: while a set of typical domains of application is selected to document the formal use of language, the informal part documents speech variation using parameters referring to the event’s structure (dialogue vs. monologue) and the sociological domain of use (family-private vs public). The four Romance corpora are tagged with respect to terminal and non-terminal prosodic breaks. Terminal breaks are assumed to be the most relevant cues for the identification of relevant linguistic domains in spontaneous speech (utterances). Relations with other concurrent criteria are discussed. The multimedia storage of the C-ORAL-ROM corpus is based on this principle; each textual string ending with a terminal break is aligned, through the Win Pitch speech software, to its acoustic counterpart, generating the database of all utterances.

pdf bib abs

Principles of a System for Terminological Concept Modelling
Bodil Nistrup Madsen | Hanne Erdman Thomsen | Carl Vikner

We are working on a project called CAOS - Computer-Aided Ontology Structuring - whose aim is to develop a computer system designed to enable semi-automatic construction of concept systems, or ontologies. The system is intended to be interactive and presupposes an end-user with a terminological background (terminologist or professional translator). CAOS supports terminological concept modelling. The backbone of this concept modelling is constituted by characteristics modelled by formal feature specifications, i.e. attribute-value pairs. Our use of feature specifications is subject to a number of principles and constraints. In this paper we want to demonstrate some of these principles and to show why they are necessary in order to permit the construction of an interactive tool for building terminological ontologies. We will also show how they contribute to determining the structuring of the ontologies in CAOS and to facilitating the work of the terminologist user.

pdf bib abs

On the Usefulness of Large Spoken Language Corpora for Linguistic Research
Christophe Van Bael | Helmer Strik | Henk van den Heuvel

In the past, fundamental linguistic research was typically conducted on small data sets that were handcrafted for the specific research at hand. However, from the eighties onwards, many large spoken language corpora have become available. This study investigates the usefulness of large multi-purpose spoken language corpora for fundamental linguistic research. A research task was designed in which we tried to capture the major pronunciation differences between three speech styles in context-sensitive re-write rules at the phone level. These re-write rules were extracted from the alignments of both a manual phonetic transcription and an automatic phonetic transcription with a canonical reference transcription of the same material.

pdf bib abs

WALA: A Multilingual Resource Repository for West African Languages
Dafydd Gibbon | Firmin Ahoua | Eddi Gbéry | Eno-Abasi Urua | Moses Ekpenyong

The West African Language Archive (WALA) initiative has emerged from a number of concurrent projects, and aims to encourage local scholars to create high quality decentralised repositories documenting West African languages, and to make these repositories available to language communities, language planners, educationalists and scientists via an internet metadata portal such as OLAC (Open Language Archive Community). A wide range of criteria has to be met in designing and implementing this kind of archive. We discuss these criteria with reference to experiences in documentation work in three very different ongoing language documentation projects, on designing an encyclopaedia, on documenting an endangered language, and on creating a speech synthesiser. We pay special attention to the provision of metadata, a formal variety of catalogue or housekeeping information, without which resources are doomed to remain inaccessible.

pdf bib

Annotating a Corpus for Building a Domain-specific Knowledge Base
Sabine Bartsch

pdf bib

A Comparison of Summarisation Methods Based on Term Specificity Estimation
Constantin Orăsan | Viktor Pekar | Laura Hasler

pdf bib

Measurements of Spoken Language Variability in a Multilingual Corpus. Predictable Aspects
Massimo Moneglia

pdf bib

Reliability of Lexical and Prosodic Cues in Two Real-life Spoken Dialog Corpora
L. Devillers | I. Vasilescu

pdf bib

WordNet Affect: an Affective Extension of WordNet
Carlo Strapparava | Alessandro Valitutti

pdf bib

The GENOMA-KB Platform: Queries over Integrated Linguistic Resources
Margarita Hospedales | Manel Rodríguez

pdf bib

Towards the Use of Word Stems and Suffixes for Statistical Machine Translation
Maja Popović | Hermann Ney

pdf bib

Language Model Adaptation for Statistical Machine Translation Based on Information Retrieval
Matthias Eck | Stephan Vogel | Alex Waibel

pdf bib

Evaluating Name-Matching for Coreference Resolution
Olga Uryupina

pdf bib

Design and Implementation of a Semantic Search Engine for Portuguese
Carlos Amaral | Dominique Laurent | André Martins | Afonso Mendes | Cláudia Pinto

pdf bib

Converting Treebank Annotations to Language Neutral Syntax
Richard Campbell | Eric Ringger

pdf bib

Methodology For Building Thematic Indexes In Medicine For French
Yalina Alphonse | Pierrette Bouillon

pdf bib

Transcrigal: A Bilingual System for Automatic Indexing of Broadcast News
Carmen Garcia-Mateo | Javier Dieguez-Tirado | Laura Docio-Fernandez | Antonio Cardenal-Lopez

pdf bib

Abar-Hitz: An Annotation Tool for the Basque Dependency Treebank
Arantza Díaz de Ilarraza | Aitzpea Garmendia | Maite Oronoz

pdf bib

Creating Multi-purpose Linguistic Resources for Modern Greek: a Deep Modern Greek Grammar
Valia Kordoni | Julia Neu

pdf bib

Enriching the Spanish EuroWordNet by Collocations
Leo Wanner | Margarita Alonso Ramos | Antonia Martí

pdf bib

FrameNet as a “Net”
Charles J. Fillmore | Collin F. Baker | Hiroaki Sato

pdf bib

Creation of a Doctor-Patient Dialogue Corpus Using Standardized Patients
Robert S. Belvin | Win May | Shrikanth Narayanan | Panayiotis Georgiou | Shadi Ganjavi

pdf bib

Talkbank: Building an Open Unified Multimodal Database of Communicative Interaction
Brian MacWhinney | Steven Bird | Christopher Cieri | Craig Martell

pdf bib

A Fine-Grained Evaluation Method for Speech-to-Speech Machine Translation Using Concept Annotations
Robert S. Belvin | Susanne Riehemann | Kristin Precoda

pdf bib

Rethinking Reusable Resources
David M. de Matos | Ricardo Ribeiro | Nuno J. Mamede

pdf bib

Concept-based Queries: Combining and Reusing Linguistic Corpus Formats and Query Languages
Felix Sasaki | Andreas Witt | Dafydd Gibbon | Thorsten Trippel

pdf bib

Co-reference in Japanese Task-oriented Dialogues: A Contribution to the Development of Language-specific and Language-general Annotation Schemes and Resources
Felix Sasaki | Andreas Witt

pdf bib abs

Constructing Word-Sense Association Networks from Bilingual Dictionary and Comparable Corpora
Hiroyuki Kaji | Osamu Imaichi

A novel thesaurus named a "word-sense association network" is proposed for the first time. It consists of nodes representing word senses, each of which is defined as a set consisting of a word and its translation equivalents, and edges connecting topically associated word senses. This word-sense association network is produced from a bilingual dictionary and comparable corpora by means of a newly developed fully automatic method. The feasibility and effectiveness of the method were demonstrated experimentally by using the EDR English-Japanese dictionary together with Wall Street Journal and Nihon Keizai Shimbun corpora. The word-sense association networks were applied to word-sense disambiguation as well as to a query interface for information retrieval.

pdf bib abs

Utilization of Multiple Language Resources for Robust Grammar-Based Tense and Aspect Classification
Alexis Palmer | Jonas Kuhn | Carlota Smith

This paper reports on an ongoing project that uses varied language resources and advanced NLP tools for a linguistic classification task in discourse semantics. The system we present is designed to assign a "situation entity" class label to each predicator in English text. The project goal is to achieve the best-possible identification of situation entities in naturally-occurring written texts by implementing a robust system that will deal with real corpus material, rather than just with constructed textbook examples of discourse. In this paper we focus on the combination of multiple information sources, which we see as being vital for a robust classification system. We use a deep syntactic grammar of English to identify morphological, syntactic, and discourse clues, and we use various lexical databases for fine-grained semantic properties of the predicators. Experiments performed to date show that enhancing the output of the grammar with information from lexical resources improves recall but lowers precision in the situation entity classification task.

pdf bib abs

Retrieving Annotated Corpora for Corpus Annotation
Kyôsuke Yoshida | Taiichi Hashimoto | Takenobu Tokunaga | Hozumi Tanaka

This paper introduces a tool, Bonsai, which supports humans in annotating corpora with morphosyntactic information and in retrieving syntactic structures stored in the database. Integrating annotation and retrieval enables users to annotate a new instance while looking back at already annotated sentences that share a similar morphosyntactic structure. We focus on the retrieval part of the system and describe a method that decomposes a large input query into smaller ones to improve retrieval efficiency. The proposed method is evaluated on the Penn Treebank corpus, showing significant improvements.

pdf bib abs

Classification of Japanese Spatial Nouns
Takenobu Tokunaga | Tomofumi Koyama | Suguru Saito | Masayuki Nakajima

We have already proposed a framework to represent a location in terms of both symbolic and numeric aspects. In order to deal with vague linguistic expressions of a location, the representation adopts a potential function mapping a location to its plausibility. This paper proposes a classification of Japanese spatial nouns and potential functions corresponding to each class. We focus on a common Japanese spatial expression, "X no Y" (Y of X), where X is a reference object and Y is a spatial noun. For example, "tukue no migi" (the right of the desk) denotes a location with reference to the desk. Instances of this expression were collected from corpora, and the spatial nouns appearing in the Y position were classified into two major classes: those designating a part of the reference object and those designating a location apart from the reference object. The latter class was further divided into two subclasses: direction-oriented and distance-oriented. For each class, a potential function was designed to provide the meaning of spatial nouns.

pdf bib abs

Meaningful Clusters
Antonio Sanfilippo | Gus Calapristi | Vernon Crow | Beth Hetzler | Alan Turner

We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.

pdf bib abs

Multi-Document Summarization Using Multiple-Sequence Alignment
V. Finley Lacatusu | Steven J. Maiorano | Sanda M. Harabagiu

This paper describes a novel clustering-based text summarization system that uses Multiple Sequence Alignment to improve the alignment of sentences within topic clusters. While most current clustering-based summarization systems base their summaries only on the common information contained in a collection of highly-related sentences, our system constructs more informative summaries that incorporate both the redundant and unique contributions of the sentences in the cluster. When evaluated using ROUGE, the summaries produced by our system represent a substantial improvement over the baseline, which is at 63% of the human performance.

pdf bib abs

RevisionBank: A Resource for Revision-based Multi-document Summarization and Evaluation
Jahna Otterbacher | Dragomir Radev

Multi-document summaries produced via sentence extraction often suffer from a number of cohesion problems, including dangling anaphora, sudden shifts in topic and incorrect or awkward chronological ordering. Therefore, the development of an automated revision process to correct such problems is a research area of current interest. We present the RevisionBank, a corpus of 240 extractive, multi-document summaries that have been manually revised to promote cohesion. The summaries were revised by six linguistics students using a constrained set of revision operations that we previously developed. In the current paper, we describe the process of developing a taxonomy of cohesion problems and corrective revision operators that address such problems, as well as an annotation schema for our corpus. Finally, we discuss how our taxonomy and corpus can be used for the study of revision-based multi-document summarization as well as for summary evaluation.

pdf bib abs

The Lácio-Web: Corpora and Tools to Advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools
Sandra Aluisio | Gisele Montilha Pinheiro | Aline M. P. Manfrin | Leandro H. M. de Oliveira | Luiz C. Genoves, Jr. | Stella E. O. Tagnin

In this paper we discuss the five requirements for building large publicly available corpora that guided the construction of the Lácio-Web corpora and their environments: 1) a comprehensive text typology; 2) text copyright clearance, compilation and annotation scheme; 3) a friendly and didactic interface; 4) the need to serve as support for several types of research; 5) the need to offer an array of associated tools. We also present the features that make Lácio-Web corpora interesting and novel as well as the limitations of this project, such as corpora size and balance, and the non-inclusion of spoken texts in the project’s reference corpus.

pdf bib abs

CST Bank: A Corpus for the Study of Cross-document Structural Relationships
Dragomir Radev | Jahna Otterbacher | Zhu Zhang

Clusters of multiple news stories related to the same topic exhibit a number of interesting properties. For example, when documents have been published at various points in time or by different authors or news agencies, one finds many instances of paraphrasing, information overlap and even contradiction. The current paper presents the Cross-document Structure Theory (CST) Bank, a collection of multi-document clusters in which pairs of sentences from different documents have been annotated for cross-document structure theory relationships. We will describe how we built the corpus, including our method for reducing the number of sentence pairs to be annotated by our hired judges, using lexical similarity measures. Finally, we will describe how CST and the CST Bank can be applied to different research areas such as multi-document summarization.

pdf bib abs

Applying Computational Linguistic Techniques in a Documentary Project for Q’anjob’al (Mayan, Guatemala)
Jonas Kuhn | B’alam Mateo-Toledo

This paper reports on a number of experiments in which we applied standard techniques from NLP in the context of documentation of endangered languages. We concentrated on the use of existing, freely available toolkits. Specifically, we explore the use of Finite-State Morphological Analysis, Maximum Entropy Part-of-Speech Tagging, and N-Gram Language Modeling.

pdf bib abs

Information Retrieval System Using Latent Contextual Relevance
Minoru Sasaki | Hiroyuki Shinnou

When relevance feedback, one of the most popular information retrieval models, is used in an information retrieval system, related words are extracted from the first retrieval result. These words are then added to the original query, and retrieval is performed again using the updated query. In general, retrieval performance with such query expansion falls in comparison with the performance of the original query. The cause is that the thesaurus contains few synonyms and, although some synonyms are added to the query, the same documents are retrieved as a result. In this paper, to solve this problem with related words, we propose latent context relevance, which takes into account the relevance between the query and each index word in the document set.
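The feedback loop described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the term-overlap scoring, the function names `retrieve` and `expand_query`, and the three toy documents are all assumptions made for illustration.

```python
from collections import Counter

def retrieve(query_terms, documents, top_k=2):
    """Rank documents by how many query terms they contain (toy scoring)."""
    scored = [
        (sum(term in doc.split() for term in query_terms), idx)
        for idx, doc in enumerate(documents)
    ]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [idx for score, idx in ranked[:top_k] if score > 0]

def expand_query(query_terms, documents, top_docs, n_terms=1):
    """Add the most frequent non-query words from the first-pass results."""
    counts = Counter(
        word
        for idx in top_docs
        for word in documents[idx].split()
        if word not in query_terms
    )
    return list(query_terms) + [word for word, _ in counts.most_common(n_terms)]

# Invented toy collection, not data from the paper.
documents = [
    "corpus annotation tool",
    "corpus annotation scheme",
    "speech synthesis system",
]
first_pass = retrieve(["corpus"], documents)
expanded = expand_query(["corpus"], documents, first_pass)
second_pass = retrieve(expanded, documents)
```

In this toy collection the expanded query retrieves the same two documents as the original one, which is the kind of outcome the abstract identifies as a weakness of plain query expansion.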

pdf bib abs

Toward Text Understanding: Integrating Relevance-tagged Corpus and Automatically Constructed Case Frames
Daisuke Kawahara | Ryohei Sasano | Sadao Kurohashi

This paper proposes a wide-range anaphora resolution system toward text understanding. This system resolves zero, direct and indirect anaphors in Japanese texts by integrating two sorts of linguistic resources: a hand-annotated corpus with various relations and automatically constructed case frames. The corpus has relevance tags which consist of predicate-argument relations, relations between nouns and coreferences, and is utilized for learning parameters of the system and testing it. The case frames are indispensable knowledge both for detecting zero/indirect anaphors and estimating appropriate antecedents. Our preliminary experiments showed promising results.

pdf bib abs

Lexical Analysis of Agglutinative Languages Using a Dictionary of Lemmas and Lexical Transducers
Sun-Mee Bae | Key-Sun Choi

This paper presents a simple method for performing lexical analysis of agglutinative languages like Korean, which have a heavy morphology. In particular, for nouns and adverbs with regular morphological modifications and/or high productivity, we do not need to artificially construct huge dictionaries of all inflected forms of lemmas. To construct a dictionary of lemmas and lexical transducers, we first automatically construct a dictionary of all inflected forms from the KAIST POS-Tagged Corpus. Secondly, we separate the lemma part from the sequence of inflectional suffixes. Thirdly, we describe their lexical transducers (i.e., morphological rules) to recognize all inflected forms of lemmas for nouns and adverbs according to the combinatorial restrictions between lemmas and their inflectional suffixes. Finally, we evaluate the advantages of this method.

pdf bib abs

Evaluation and Adaptation of a Specialised Language Checking Tool for Non-specialised Machine Translation and Non-expert MT Users for Multi-lingual Telecooperation
Rita Nübel

Style guides or writing recommendations play an important role in the field of technical documentation production, e.g. in industrial contexts. Also, writing recommendations are used in technical contexts together with machine translation (MT) in order to circumvent the MT system's weaknesses. This paper describes the evaluation and adaptation of a language checker deployed in the project int.unity. In this project, both MT and a specialised language checker were adapted to the requirements of non-expert users and a non-technical domain. The language technology was integrated with the groupware platform BSCW to support the multi-lingual communication of geographically distributed teams concerned with trade union work. The users' languages were either German or English, i.e. the users were monolingual. We chose linguatec's server version of the Personal Translator 2004 MT system for the German<->English translations. The language checker CLAT for German and English has been developed at IAI. It is used by technical authors to support the production of high-quality technical documentation. The CLAT core system was adapted and extended in order to match the new requirements imposed by both the user profile and the subsequent MT application. In this paper, the focus will be on the assessment and adaptation of style rules for German.

pdf bib abs

We survey the evaluation methodology adopted in Information Extraction (IE), as defined in the MUC conferences and in later independent efforts applying machine learning to IE. We point out a number of problematic issues that may hamper the comparison between results obtained by different researchers. Some of them are common to other NLP tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Issues specific to IE evaluation include: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an information extraction task, a number of characteristics should be clearly defined. However, in the papers only a few of them are usually explicitly specified. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. The goal is to reach a widespread agreement on such a proposal so that future IE evaluations will adopt the proposed methodology, making comparisons between algorithms fair and reliable. In order to achieve this goal, we will develop and make available to the community a set of tools and resources that incorporate a standardized IE methodology.
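To make the filler-boundary issue concrete, the following Python sketch scores the same predictions under an exact-match and an overlap-based (lenient) criterion. It is only an illustration under stated assumptions: the `(start, end)` span representation, the function names `overlaps` and `score`, and the example spans are hypothetical and do not come from the paper or from the MUC scoring software.

```python
def overlaps(a, b):
    """True if half-open spans a and b, given as (start, end), share a position."""
    return a[0] < b[1] and b[0] < a[1]

def score(predicted, gold, lenient=False):
    """Return (precision, recall) under exact or overlap-based span matching."""
    match = overlaps if lenient else (lambda a, b: a == b)
    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall

# Invented spans: the second prediction misses the gold boundary by one token.
gold = [(0, 3), (10, 14)]
predicted = [(0, 3), (11, 14), (20, 22)]

strict = score(predicted, gold)
relaxed = score(predicted, gold, lenient=True)
```

With these spans, the strict criterion credits one of three predictions and one of two gold fillers, while the lenient criterion credits two of three predictions and both gold fillers. Two papers reporting on the same output could therefore publish different figures unless the matching criterion is stated.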

pdf bib abs

Enriching WordNet Via Generative Metonymy and Creative Polysemy
Jer Hayes | Tony Veale | Nuno Seco

Metonymy is a creative process that establishes relationships based on contiguity or semantic relatedness between concepts. We outline a mechanism for deriving new concepts from WordNet using metonymy. We argue that by exploiting polysemy in WordNet we can take advantage of the metonymic relations between concepts. The focus of our metonymy generation work has been the creation of noun-noun compounds that do not already exist in WordNet and which can be profitably added to WordNet. The mechanism of metonymy generation we outline takes a source compound and creates new compounds by exploiting the polysemy associated with hyponyms of the head of the source compound. We argue that metonymy generation is a sound basis for concept creation as the newly created compounds are semantically related to the source concept. We demonstrate that metonymy generation based on polysemy is superior to a method of metonymy generation that ignores polysemy. These new concepts can be used to augment WordNet.

pdf bib abs

Evaluation and Adaptation of the Celex Dutch Morphological Database
Tom Laureys | Guy De Pauw | Hugo Van hamme | Walter Daelemans | Dirk Van Compernolle

This paper describes some important modifications to the Celex morphological database in the context of the FLaVoR project. FLaVoR aims to develop a novel modular framework for speech recognition, enabling the integration of complex linguistic knowledge sources, such as a morphological model. Morphology is a fairly unexploited linguistic information source speech recognizers could benefit from. This is especially true for languages which allow for a rich set of morphological operations, such as our target language Dutch. In this paper we focus on the exploitation of the Celex Dutch morphological database as the information source underlying two different morphological analyzers being developed within the project. Although the Celex database provides a valuable source of morphological information for Dutch, many modifications were necessary before it could be practically applied. We identify major problems, discuss the implemented solutions and finally experimentally evaluate the effect of our modifications to the database.

pdf bib

A Model of Semantic Representations Analysis for Chinese Sentences
Li Tang | Donghong Ji | Lingpeng Yang | Yu Nie

pdf bib abs

A Comparison of Two Variant Corpora: The Same Content with Different Source
Kyonghee Paik | Kiyonori Ohtake | Kazuhide Yamamoto

In order to investigate the effect of source language on translations, we investigate two variants of a Korean translation corpus. The first variant consists of Korean translations of 162,308 Japanese sentences from the ATR BTEC (Basic Expression Text Corpus). The second variant was made by translating the English translations of the Japanese sentences into Korean. We show that the source language text has a large influence on the target text. Even after normalizing orthographic differences, fewer than 8.3% of the sentences in the two variants were identical. We describe in general which phenomena differ and then discuss how our analysis can be used in natural language processing.

pdf bib abs

Training a Sentence-Level Machine Translation Confidence Measure
Christopher B. Quirk

We present a supervised method for training a sentence-level confidence measure on translation output using a human-annotated corpus. We evaluate a variety of machine learning methods. The resultant measure, while trained on a very small dataset, correlates well with human judgments, and proves to be effective on one task-based evaluation. Although the experiments have only been run on one MT system, we believe the features gathered are general enough that the approach will also work well on other systems.

pdf bib abs

Software Tools for Morphological Tagging of Zulu Corpora and Lexicon Development
Sonja E. Bosch | Laurette Pretorius

The aim of this paper is to discuss aspects of an on-going project on the development of grammatical and lexical resources for Zulu with sufficient coverage for unrestricted text. We explain how the basic software tools of computational morphology are used in linguistic processing, more specifically for automatic word form recognition and morphological tagging of the growing stock of electronic text corpora of a Bantu language such as Zulu. It is also shown how a machine-readable lexicon is in turn enhanced with the information acquired and extracted by means of such corpus analysis.

pdf bib abs

Improving Collocation Extraction for High Frequency Words
David Wible | Chin-Hwa Kuo | Nai-Lung Tsao

The purpose of this paper is to introduce an alternative word association measure aimed at addressing the under-extraction of collocations that contain high frequency words. While measures such as MI provide the important contribution of filtering out the sheer high frequency of words in the detection of collocations in large corpora, one side effect of this filtering is that it becomes correspondingly difficult for such measures to detect true collocations involving high frequency words. As an alternative, we propose normalizing the MI measure by dividing the frequency of a candidate lexeme by the number of senses of that lexeme. We premise this alternative approach on the one sense per collocation assumption of Yarowsky (1992; 1995). Ten verb-noun collocations involving three high frequency verbs (make, take, run) are used to compare the extraction results of traditional MI and the proposed normalized MI. Results show that the ranking of these high-frequency verbs as candidate collocates with the target focal nouns is raised by normalizing MI as proposed. Side effects of these improved rankings are discussed, such as an increase in false positives resulting from higher recall. It is found that overall rank precision remains quite stable even with the increased recall of normalized MI.
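The normalization described in this abstract can be sketched in Python as follows. This is one reading of the abstract, assuming the standard pointwise MI formula log2(P(x, y) / (P(x) P(y))) and applying the sense-count division to the candidate lexeme's frequency; the function names and all counts are invented for illustration and are not taken from the paper.

```python
import math

def mutual_information(pair_count, x_count, y_count, total):
    """Pointwise MI in bits: log2(P(x, y) / (P(x) * P(y)))."""
    return math.log2((pair_count * total) / (x_count * y_count))

def sense_normalized_mi(pair_count, x_count, y_count, total, x_senses):
    """MI with the candidate lexeme's frequency divided by its sense count."""
    return mutual_information(pair_count, x_count / x_senses, y_count, total)

# Hypothetical counts: a high-frequency verb x with 8 senses and a noun y.
plain = mutual_information(pair_count=50, x_count=20000, y_count=400, total=1_000_000)
normalized = sense_normalized_mi(50, 20000, 400, 1_000_000, x_senses=8)
```

Under this reading, the normalized score exceeds plain MI by exactly log2 of the sense count, so highly polysemous, high-frequency verbs gain the most. That is consistent with the raised rankings the abstract reports for make, take, and run.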

pdf bib abs

Annotation of Coreference Relations Among Linguistic Expressions and Images in Biological Articles
Ai Kawazoe | Asanobu Kitamoto | Nigel Collier

In this paper, we propose an annotation scheme which can be used not only for annotating coreference relations between linguistic expressions, but also those among linguistic expressions and images, in scientific texts such as biomedical articles. Images in the biomedical domain often contain important information for analyses and diagnoses, and we consider that linking images to textual descriptions of their semantic contents in terms of coreference relations is useful for multimodal access to the information. We present our annotation scheme and the concept of a "coreference pool," which plays a central role in the scheme. We also introduce a support tool for text annotation named Open Ontology Forge, which we have already developed, and additional functions for the software to cover image annotations (ImageOF), which are now being developed.

pdf bib abs

Evaluation of Cross-Language Information Retrieval Using the Domain-Specific GIRT Data as Parallel German-English Corpus
Michael Kluck

The development of the evaluation of domain-specific cross-language information retrieval (CLIR) is shown in the context of the Cross-Language Evaluation Forum (CLEF) campaigns from 2000 to 2003. The pre-conditions and the usable data and additionally available instruments are described. The main goals of this task of CLEF are to allow the evaluation of Cross-Language Information Retrieval (CLIR) systems in the context of structured data and in a domain-specific area (not in the more general context of floating, journalistic texts), and with the additional possibility to make use of thesauri which had been used for intellectual indexing of the documents and are provided with the data. The parallel German-English GIRT4 corpus is described and some of the results of the CLEF 2004 campaign are discussed.

pdf bib abs

Generating Coreferential Descriptions from a Structured Model of the Context
Hélène Manuélian

This paper shows on the basis of a corpus study how a model of the context should be structured for the generation of coreferring descriptions in French. We show that this way of structuring the context can help to generate more paraphrases and a particular kind of referring expressions used to add information about the referent.

pdf bib

Open Collaborative Development of the Thai Language Resources for Natural Language Processing
Thatsanee Charoenporn | Virach Sornlertlamvanich | Sawit Kasuriya | Chatchawarn Hansakunbuntheung | Hitoshi Isahara

pdf bib abs

Automatic Translation Memory Fuzzy Match Post-Editing: A Step Beyond Traditional TM/MT Integration
Lambros Kranias | Anna Samiotou

An innovative way of integrating Translation Memory (TM) and Machine Translation (MT) processing is presented which goes beyond the traditional "cascade" integration of Translation Memory and Machine Translation. The new method aims to automatically post-edit TM similar matches by the use of an MT module thus enhancing the TM fuzzy (similar) scores as well as enabling the utilisation of low-score TM fuzzy matches. This leads to substantial translation cost reduction. The suggested method, which can be classified as an Example-Based Machine Translation application, is analysed and examples are provided for clarification. It is evaluated through test results that involve human interaction. The method has been implemented within the ESTeam Translator (ET) Language Toolbox and is already in use in the various commercial installations of ET.

pdf bib abs

After the successful completion of the Spoken Dutch Corpus (1998–2003), the time is ripe to sit back and reflect on our achievements and the procedures underlying them in order to learn from our experiences. In this paper we will in particular pay attention to issues affecting the levels of linguistic annotation, but some more general issues deserve to be treated as well (bug reporting, consistency). We will try to come up with solutions, but sometimes we want to invite further discussion from other researchers.

pdf bib abs

Combining Symbolic and Statistical Methods in Morphological Analysis and Unknown Word Guessing
Attila Novák | Viktor Nagy | Csaba Oravecz

Highly inflectional/agglutinative languages like Hungarian typically feature possible word forms in such a magnitude that automatic methods that provide morphosyntactic annotation on the basis of some training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyser, which is able to analyse almost all wordforms alleviating the problem of unseen tokens. However, although in a smaller number, there will still remain forms which are unknown even to the morphological analyzer and should be handled by some guesser mechanism. The paper will describe a hybrid method which combines symbolic and statistical information to provide lemmatization and suffix analyses for unknown word forms. Evaluation is carried out with respect to the induction of possible analyses and their respective lexical probabilities for unknown word forms in a part-of-speech tagging system.

pdf bib abs

Discarding Noise in an Automatically Acquired Lexicon of Support verb Constructions
M. Begoña Villada Moirón

We applied data-driven methods to carry out automatic acquisition of Dutch prepositional support verb constructions (SVCs) in corpora (e.g., iets in de gaten houden ("keep an eye on something")). This paper addresses the question whether linguistic diagnostics help to discard noise from the n-best lists and how to (semi-)automatically apply such linguistic diagnostics to parsed corpora. We show that some of the linguistic diagnostics proposed in Hollebrandse (1993) effectively identify SVCs and contribute a modest error rate decrease.

pdf bib abs

Translation Memories Enrichment by Statistical Bilingual Segmentation
Francisco Nevado | Francisco Casacuberta | Josu Landa

A majority of Machine Aided Translation systems are based on comparisons between a source sentence and reference sentences stored in Translation Memories (TMs). The translation search is done by looking for sentences in a database which are similar to the source sentence. TMs have two basic limitations: the dependency on the repetition of complete sentences and the high cost of building a TM. As human translators do not only remember sentences from their preceding translations, but they also decompose the sentence to be translated and work with smaller units, it would be desirable to enrich the TM database with smaller translation units. This enrichment should also be automatic in order not to increase the cost of building a TM. We propose the application of two automatic bilingual segmentation techniques based on statistical translation methods in order to create new, shorter bilingual segments to be included in a TM database. An evaluation of the two techniques is carried out for a bilingual Basque-Spanish task.

pdf bib abs

The African Speech Technology Project: An Assessment
J. C. Roux | P. H. Louw | T. R. Niesler

This paper reflects on the recently completed African Speech Technology (AST) Project. The AST Project successfully developed eleven annotated telephone speech databases for five languages spoken in South Africa, i.e. Xhosa, Southern Sotho, Zulu, English and Afrikaans. These databases were used to train and test speech recognition systems applied in a multilingual telephone-based prototype hotel booking system. An overview is given of the database design and contents. The acquisition of the data is discussed with regard to the telephony interface, as well as speaker recruitment and briefing. Particular reference is made to some of the practical implications of acquiring appropriate data in under-developed communities. Database management processes such as transcription, quality control and validation are explained. This is followed by information on the development of the prototype. Results of usability tests are discussed, followed by an assessment of the Project as a whole.

pdf bib abs

Automatic Phonemic Labeling and Segmentation of Spoken Dutch
Kris Demuynck | Tom Laureys | Patrick Wambacq | Dirk Van Compernolle

The CGN corpus (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.

pdf bib abs

Using Large Multi-purpose Corpora for Specific Research Questions: Discourse Phenomena Related to Wh-questions in the Spoken Dutch Corpus
Nelleke Oostdijk | Lou Boves

In this paper, we investigate whether a dataset derived from a multi-purpose corpus such as the Spoken Dutch Corpus may be considered appropriate for developing a taxonomy of wh-questions, and a model of the way in which these questions are integrated in spoken discourse. We compare the results obtained from the Spoken Dutch Corpus with a similar analysis of a large random collection of FAQs from the internet. We find substantial differences between the questions in spoken discourse and FAQs. Therefore, it may not be trivial to use a general purpose corpus as a starting point for developing models for human-computer interaction.

pdf bib abs

Methods of Digital Access for Legal Language Documentation
Paola Mariani | Costanza Badii

For many years the Istituto di Teoria e Tecniche dell'Informazione Giuridica (ITTIG) of the Consiglio Nazionale delle Ricerche has studied the evolution of legal language, creating databases for documentation and digital retrieval of law texts. The ITTIG seeks to document legal language through information technology in order to provide the widest possible access to its findings. The Institute has recently created an on-line digital database that includes the full text of the most important Italian laws (Codes and Constitutions) from the 16th to the 20th century. The ITTIG is also preparing another database made up of contexts from the original legal sources of the 10th to the 20th century.

pdf bib abs

Architecture for Distributed Language Resource Management and Archiving
Peter Wittenburg | Heidi Johnson | Markus Buchhorn | Hennie Brugman | Daan Broeder

An architecture is presented that provides an integrated framework for managing, archiving and accessing language resources. This architecture was discussed in the DELAMAN network, a world-wide network of archives holding material about endangered languages. Such a framework will be built upon a metadata infrastructure, a mechanism to resolve unique resource identifiers, and user and access rights management components. These components are closely related and have to be based on redundant and distributed services. Existing middleware seems to be available for all these components; however, it remains to be checked how they can interact with each other.

pdf bib abs

This paper presents specifications and requirements for the creation and validation of large lexica that are needed in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and statistical Speech-to-Speech Translation (SST) systems. The language resources are created and validated within the scope of the EU project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) during the years 2002-2005. Large lexica consisting of phonetic, suprasegmental and morpho-syntactic content will be provided with well-documented specifications for 13 languages. A short summary of the LC-STAR project itself is presented. An overview is given of the specification for corpus collection and word extraction, as well as of the specification and format of the lexica. Particular attention is paid to the validation of the produced lexica and the lessons learnt during pre-validation. The created and validated language resources will be available via ELRA/ELDA.

pdf bib abs

Enlarging the Croatian Morphological Lexicon by Automatic Lexical Acquisition from Raw Corpora
Antoni Oliver | Marko Tadić

This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon, and in our experiments we have used a subset of the Croatian National Corpus. The methodology has proved to be efficient for those languages that, like Croatian, present a rich and mainly concatenative morphology. This method can be applied to the creation of new resources, as well as to the enrichment of existing ones. We also present an extension of the system that uses automatic querying of the Internet to acquire those entries for which we do not have enough information in our corpus.

pdf bib

Learning to Predict Pitch Accents Using Bayesian Belief Networks for Greek Language
Panagiotis Zervas | Manolis Maragoudakis | Nikos Fakotakis | George Kokkinakis

pdf bib

A Grammar and Style Checker Based on Internet Searches
Joaquim Moré | Salvador Climent | Antoni Oliver

pdf bib

Cross-Disciplinary Integration of Metadata Descriptions
Peter Wittenburg | Greg Gulrajani | Daan Broeder | Marcus Uneson

pdf bib

Representing Italian Complex Nominals: A Pilot Study
Valeria Quochi

pdf bib

Text Corpora, Local Grammars and Prediction
Hayssam Traboulsi | David Cheng | Khurshid Ahmad

pdf bib

SMOR: A German Computational Morphology Covering Derivation, Composition and Inflection
Helmut Schmid | Arne Fitschen | Ulrich Heid

pdf bib

The Overview of the SST Speech Corpus of Japanese Learner English and Evaluation Through the Experiment on Automatic Detection of Learners’ Errors
Emi Izumi | Kiyotaka Uchimoto | Hitoshi Isahara

pdf bib

Dynamic Lexicographic Data Modelling. A Diachronic Dictionary Development Report
Paul Gévaudan | Dirk Wiebel

pdf bib

Re-using High-quality Resources for Continued Evaluation of Automated Summarization Systems
Laura Alonso | Maria Fuentes | Marc Massot | Horacio Rodríguez

pdf bib

Corpus-based Learning of Lexical Resources for German Named Entity Recognition
Marc Rössler

pdf bib

Collaborative Annotation of Sign Language Data with Peer-to-Peer Technology
Hennie Brugman | Onno Crasborn | Albert Russel

pdf bib

Semantic Categorization of Spanish Se-constructions
Glòria Vázquez | Ana Fernández Montraveta | Irene Castellón | Laura Alonso

pdf bib

MetaMorpho TM: A Rule-Based Translation Corpus
Tamás Gröbler | Gábor Hodász | Balázs Kis

pdf bib

Annotating Multi-media/Multi-modal Resources with ELAN
Hennie Brugman | Albert Russel

pdf bib

Annotation of Anaphoric Expressions in an Aligned Bilingual Corpus
Agnès Tutin | Meriam Haddara | Ruslan Mitkov | Constantin Orasan

pdf bib

Unexpected Productions May Well be Errors
Tylman Ule | Kiril Simov

pdf bib

A Framework for Evaluating the Suitability of Non-English Corpora for Language Engineering
Avik Sarkar | Anne De Roeck

pdf bib

Intelligent Building of Language Resources for HLT Applications
Anna Samiotou | Lambros Kranias | Dimitrios Kokkinakis

pdf bib

Collecting Spontaneously Spoken Queries for Information Retrieval
Tomoyosi Akiba | Atsushi Fujii | Katunobu Itou

pdf bib

Multilingual Pattern Libraries for Question Answering: a Case Study for Definition Questions
Hristo Tanev | Milen Kouylekov | Matteo Negri | Bonaventura Coppola | Bernardo Magnini

pdf bib

Automatic Transformation of Phrase Treebanks to Dependency Trees
Michael Daum | Kilian A. Foth | Wolfgang Menzel

pdf bib

Computational Lexicography and Carlo Emilio Gadda, Principe dell’Analisi e Duca della Buona Cognizione
Maria Luigia Ceccotti | Manuela Sassi

pdf bib

An Annotation Scheme for a Rhetorical Analysis of Biology Articles
Yoko Mizuta | Nigel Collier

pdf bib

Textual Distraction as a Basis for Evaluating Automatic Summarisers
Antoinette Renouf | Andrew Kehoe

pdf bib

Verb Valency Descriptors for a Syntactic Treebank
Milena Slavcheva

pdf bib

Integrated Language Technologies for Multilingual Information Services in the MEMPHIS Project
Walter Kasper | Jörg Steffen | Jakub Piskorski | Paul Buitelaar

pdf bib

Automatic Generation of Compound Word Lexicon for Hindi Speech Synthesis
S.R. Deepa | Kalika Bali | A.G. Ramakrishnan | Partha Pratim Talukdar

pdf bib

Summarization of Multimodal Information
Saif Ahmad | Paulo C. F. de Oliveira | Khurshid Ahmad

pdf bib

Design of an Interactive Web-based User Interface for Speech Database Query Formation
Toomas Altosaar | Matti Karjalainen

pdf bib

Evaluating Conversation with Hans Christian Andersen
Niels Ole Bernsen | Laila Dybkjær | Svend Kiilerich

pdf bib

The New Dutch-Flemish HLT Programme: a Concerted Effort to Stimulate the HLT Sector
Catia Cucchiarini | Elisabeth D’Halleweyn

pdf bib

Related Word-pairs Extraction Without Dictionaries
Eiko Yamamoto | Kyoji Umemura

pdf bib

What is my Style? Using Stylistic Features of Portuguese Web Texts to Classify Web Pages According to Users’ Needs
Rachel Aires | Aline Manfrin | Sandra Aluísio | Diana Santos

pdf bib

BootCaT: Bootstrapping Corpora and Terms from the Web
Marco Baroni | Silvia Bernardini

pdf bib

N-Gram Language Modeling for Robust Multi-Lingual Document Classification
Jörg Steffen

pdf bib

A Word Alignment System Based on a Translation Equivalence Extractor
Ana-Maria Barbu

pdf bib

Using Profiles for IMDI Metadata Creation
Daan Broeder | Peter Wittenburg | Onno Crasborn

pdf bib

Rethinking Readability of Digital Editions — The Case of the AAC’s “Digital Brenner”
Karlheinz Mörth

pdf bib

Automatic Building Gazetteers of Co-referring Named Entities
Daniel Ferrés | Marc Massot | Muntsa Padró | Horacio Rodríguez | Jordi Turmo

pdf bib

Semi-Automatic Derivation of a French Lexicon from CLIPS
Nilda Ruimy | Pierrette Bouillon | Bruno Cartoni

pdf bib

The American National Corpus First Release
Nancy Ide | Keith Suderman

pdf bib

Identifying Morphosyntactic Preferences in Collocations
Stefan Evert | Ulrich Heid | Kristina Spranger

pdf bib

Towards General-Purpose Annotation Tools – How Far Are We Today?
Laila Dybkjær | Niels Ole Bernsen

pdf bib

Automated Morphological Segmentation and Evaluation
Uwe D. Reichel | Karl Weilhammer

pdf bib

A Registry of Standard Data Categories for Linguistic Annotation
Nancy Ide | Laurent Romary

pdf bib

A Natural Language Approach to Information Management: Tracking Scientific Advances Through the Structure of Words
Andrew Hippisley | Chara Karavasili

pdf bib

Building a Maritime Domain Lexicon: a Few Considerations on the Database Structure and the Semantic Coding
Rita Marinelli | Adriana Roventini | Alessandro Enea

pdf bib

Test Collections for Patent-to-Patent Retrieval and Patent Map Generation in NTCIR-4 Workshop
Atsushi Fujii | Makoto Iwayama | Noriko Kando

pdf bib

Part-of-Speech Annotation of Biology Research Abstracts
Yuka Tateisi | Jun-ichi Tsujii

pdf bib

Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian
Božo Bekavac | Petya Osenova | Kiril Simov | Marko Tadić

pdf bib

Corporate Voice, Tone of Voice and Controlled Language Techniques
Lina Henriksen | Bart Jongejan | Bente Maegaard

pdf bib

Cypriot Speech Database: Data Collection and Greek to Cypriot Dialect Adaptation
Nikos Fakotakis

pdf bib

Automatic Extraction of Syntactic Semantic Patterns for Multilingual Resources
Borja Navarro | Manuel Palomar | Patricio Martínez-Barco

pdf bib

The Integral Dictionary: An Ontological Resource for the Semantic Web: Integration of EuroWordNet, Balkanet, TID, and SUMO
Dominique Dutoit | Pierre Nugues | Patrick de Torcy

pdf bib

Categorizing Web Pages as a Preprocessing Step for Information Extraction
Viktor Pekar | Richard Evans | Ruslan Mitkov

pdf bib

A Framework for Data-driven Video-realistic Audio-visual Speech-synthesis
Christian Weiss

pdf bib

Corpus Based Enrichment of GermaNet Verb Frames
Manuela Kunze | Dietmar Rösner

pdf bib

Semi-automatic Acquisition of Command Grammar
Thierry Poibeau | Bénédicte Goujon

pdf bib

Towards a Language Infrastructure for the Semantic Web
Thierry Declerck | Paul Buitelaar | Nicoletta Calzolari | Alessandro Lenci

pdf bib

Conversational Telephone Speech Corpus Collection for the NIST Speaker Recognition Evaluation 2004
Alvin Martin | David Miller | Mark Przybocki | Joseph Campbell | Hirotaka Nakasone

pdf bib

Augmenting Manual Dictionaries for Statistical Machine Translation Systems
Stephan Vogel | Christian Monson

pdf bib

Linguistic Corpus Search
Christian Biemann | Uwe Quasthoff | Christian Wolff

pdf bib

The Influence of the Labeller’s Regional Background on Phonetic Transcriptions: Implications for the Evaluation of Spoken Language Resources
Evie Coussé | Steven Gillis | Hanne Kloots | Marc Swerts

pdf bib

Automatic Acquisition of Paradigmatic Relations Using Iterated Co-occurrences
Chris Biemann | Stefan Bordag | Uwe Quasthoff

pdf bib

ELRA Validation Methodology and Standard Promotion for Linguistic Resources
Hanne Fersøe | Monica Monachini

pdf bib

The AAC [Austrian Academy Corpus] – An Enterprise to Develop Large Electronic Text Corpora
Hanno Biber | Evelyn Breiteneder

pdf bib

Improving Automatic Phonetic Transcription of Spontaneous Speech Through Variant-Based Pronunciation Variation Modelling
Diana Binnenpoorte | Catia Cucchiarini | Helmer Strik | Lou Boves

pdf bib

A General-Purpose, Off-the-shelf Anaphora Resolution Module: Implementation and Preliminary Evaluation
Massimo Poesio | Mijail A. Kabadjov

pdf bib

Building a Conceptual Graph Bank for Chinese Language
Donghong Ji | Li Tang | Lingpeng Yang

pdf bib

Enriching a French Treebank
Anne Abeillé | Nicolas Barrier

pdf bib

French-English Multi-word Term Alignment Based on Lexical Context Analysis
Béatrice Daille | Samuel Dufour-Kowalski | Emmanuel Morin

pdf bib

An Argumentative Annotation Schema for Meeting Discussions
Vincenzo Pallotta | Hatem Ghorbel | Patrick Ruch | Giovanni Coray

pdf bib

A Morphological Analyzer for Standard Albanian
Jochen Trommer | Dalina Kallulli

pdf bib

Generating an Arabic Full-form Lexicon for Bidirectional Morphology Lookup
Abdelhadi Soudi | Andreas Eisele

pdf bib

Orthographic and Phonetic Annotation of Very Large Czech Corpora with Quality Assessment
Petr Pollák | Jan Černocký

pdf bib

INQUER: A WordNet-based Question-Answering Application
Catarina Ribeiro | Ricardo Santos | João Correia | Rui Pedro Chaves | Palmira Marrafa

pdf bib

Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese
António Branco | João Silva

pdf bib

A High Quality Partial Parser for Annotating German Text Corpora
Stefan Klatt

pdf bib

Bayesian Semantics Incorporation to Web Content for Natural Language Information Retrieval
Manolis Maragoudakis | Nikos Fakotakis

pdf bib

Usability Evaluation of Spoken Dialogue Systems
Lars Bo Larsen

pdf bib

Enriching EWN with Syntagmatic Information by Means of WSD
Iulia Nica | Mª Antònia Martí | Andrés Montoyo | Sonia Vázquez

pdf bib

Proper Names and Polysemy: From a Lexicographic Experience
Rita Marinelli

pdf bib

Tools for Upgrading Printed Dictionaries by Means of Corpus-based Lexical Acquisition
Ulrich Heid | Bettina Säuberlich | Esther Debus-Gregor | Werner Scholze-Stubenrecht

pdf bib

Extraction of Polish Named-Entities
Jakub Piskorski

pdf bib

Automatic Acquisition of Sense Examples Using ExRetriever
Juan Fernández | Mauro Castillo | German Rigau | Jordi Atserias | Jordi Turmo

pdf bib

Combining Heterogeneous Lexical Resources
Cvetana Krstev | Duško Vitas | Ranka Stanković | Ivan Obradović | Gordana Pavlović-Lažetić

pdf bib

Spoken and Written Language Resources for Vietnamese
Viet-Bac Le | Do-Dat Tran | Eric Castelli | Laurent Besacier | Jean-François Serignat

pdf bib

Building and Using a Corpus of Shallow Dialogue Annotated Meetings
Andrei Popescu-Belis | Maria Georgescul | Alexander Clark | Susan Armstrong

pdf bib

XTERM: A Flexible Standard-Compliant XML-Based Termbase Management System
Lorenzo Piccioni | Eros Zanchetta

pdf bib

Word Sense Disambiguation Using Random Indexing
Márton Miháltz

pdf bib

Bypassing Greeklish!
A. Chalamandaris | P. Tsiakoulis | S. Raptis | G. Giannopoulos | G. Carayannis

pdf bib

Semi-Automatic UNL Dictionary Generation Using WordNet.PT
Catarina Ribeiro | Ricardo Santos | Rui Pedro Chaves | Palmira Marrafa

pdf bib

Bootstrapping a Database of German Multi-word Expressions
Alexander Geyken

pdf bib

A Practical Comparison of Different Filters Used in Automatic Term Extraction
Le An Ha

pdf bib

SVMTool: A general POS Tagger Generator Based on Support Vector Machines
Jesús Giménez | Lluís Màrquez

pdf bib

A Multi-Modal Documentation System for Warao
Stefanie Herrmann | Hartmut Keck | Stephan Kepser

pdf bib

The DeepThought Core Architecture Framework
Ulrich Callmeier | Andreas Eisele | Ulrich Schäfer | Melanie Siegel

pdf bib

Towards the Meaning Top Ontology: Sources of Ontological Meaning
Jordi Atserias | Salvador Climent | German Rigau

pdf bib

An Environment for Dialogue Corpora Collection (ENDIACC)
Zygmunt Vetulani

pdf bib

An Acoustic Corpus Contemplating Regional Variation for Studies of European Portuguese Nasals
António Teixeira | Liliana Ferreira | Lurdes Moutinho | Rosa Lídia Coimbra | Raquel Lisboa

pdf bib

Experiments on Building Language Resources for Multi-Modal Dialogue Systems
Laurent Romary | Amalia Todirascu | David Langlois

pdf bib

Callisto: A Configurable Annotation Workbench
David Day | Chad McHenry | Robyn Kozierok | Laurel Riek

pdf bib

The Effect of Text Difficulty on Machine Translation Performance – A Pilot Study with ILR-Rated Texts in Spanish, Farsi, Arabic, Russian and Korean
Ray Clifford | Neil Granoien | Douglas Jones | Wade Shen | Clifford Weinstein

pdf bib

An Annotated German-Language Medical Text Corpus as Language Resource
Joachim Wermter | Udo Hahn

pdf bib

Application of the BLEU Method for Evaluating Free-text Answers in an E-learning Environment
Diana Pérez | Enrique Alfonseca | Pilar Rodríguez

pdf bib

Extraction of Hyperonymy of Adjectives from Large Corpora by Using the Neural Network Model
Kyoko Kanzaki | Qing Ma | Eiko Yamamoto | Masaki Murata | Hitoshi Isahara

pdf bib

The Penn Discourse Treebank
Eleni Miltsakaki | Rashmi Prasad | Aravind Joshi | Bonnie Webber

pdf bib

Using the Web as a Corpus for the Syntactic-Based Collocation Identification
Violeta Seretan | Luka Nerima | Eric Wehrli

pdf bib

Automatic Methods to Supplement Broad-Coverage Subcategorization Lexicons
Michael Schiehlen | Kristina Spranger

pdf bib

Evaluation of a Multimodal Dialogue System for Small-screen Devices
Holmer Hemsen

pdf bib

Web Services for Language Resources and Language Technology Applications
Christian Biemann | Stefan Bordag | Uwe Quasthoff | Christian Wolff

pdf bib

Top Ontology as a Tool for Semantic Role Tagging
Karel Pala | Pavel Smrz

pdf bib

A Suite of Tools for Marking Up Textual Data for Temporal Text Mining Scenarios
Argyrios Vasilakopoulos | Michele Bersani | William J. Black

pdf bib

Frequent Term Distribution Measures for Dataset Profiling
Anne De Roeck | Avik Sarkar | Paul Garthwaite

pdf bib

Ontology Evaluation Functionalities of RDF(S),DAML+OIL, and OWL Parsers and Ontology Platforms
Asunción Gómez-Pérez | M. Carmen Suárez-Figueroa

pdf bib

Word Association Norms as a Unique Supplement of Traditional Language Resources
Anna Sinopalnikova | Pavel Smrz

pdf bib

Towards a Dynamic Lexicon: Predicting the Syntactic Argument Structure of Complex Verbs
Nadine Aldinger

pdf bib

Semantic Annotating of Czech Corpus via WSD
Robert Král

pdf bib

Using the NITE XML Toolkit on the Switchboard Corpus to Study Syntactic Choice: a Case Study
Jean Carletta | Shipra Dingare | Malvina Nissim | Tatiana Nikitina

pdf bib

An Annotation Scheme for Information Status in Dialogue
Malvina Nissim | Shipra Dingare | Jean Carletta | Mark Steedman

pdf bib

Speech Recognition Simulation and its Application for Wizard-of-Oz Experiments
Alex Trutnev | Antoine Rozenknop | Martin Rajman

pdf bib

Language Modeling Using Dynamic Bayesian Networks
Murat Deviren | Khalid Daoudi | Kamel Smaïli

pdf bib

Pumping Documents Through a Domain and Genre Classification Pipeline
Udo Hahn | Joachim Wermter

pdf bib

A Hybrid Strategy For Regular Grammar Parsing
Kiril Simov | Petya Osenova

pdf bib

MED-TYP: A Typological Database for Mediterranean Languages
Andrea Sansò

pdf bib

A Graphical Tool for Handling Rule Grammars in Java Speech Grammar Format
Kallirroi Georgila | Nikos Fakotakis | George Kokkinakis

pdf bib

A Flexible Language Acquisition Tool Kit for Natural Language Processing
Svetlana Sheremetyeva

pdf bib

The Effect of Bias on an Automatically-built Word Sense Corpus
David Martínez | Eneko Agirre

pdf bib

Memory-based Classification of Proper Names in Norwegian
Anders Nøklestad

pdf bib

Comparative Evaluations in the Domain of Automatic Speech Recognition
Alex Trutnev | Martin Rajman

pdf bib

Consistent Storage of Metadata in Inference Lexica: the MetaLex Approach
Thorsten Trippel | Felix Sasaki | Dafydd Gibbon

pdf bib

Applying a Part-of-Speech Tagger to Postal Address Detection on the Web
Nuno Cavalheiro Marques | Sérgio Gonçalves

pdf bib

Unifying Lexicons in view of a Phonological and Morphological Lexical DB
Monica Monachini | Federico Calzolari | Michele Mammini | Sergio Rossi | Marisa Ulivieri

pdf bib

Building Distributed Language Resources By Grid Computing
Fabio Tamburini

pdf bib

Mapping Dependency Structures to Phrase Structures and the Automatic Acquisition of Mapping Rules
Bernd Bohnet | Halyna Seniv

pdf bib

A Framework for Temporal Resolution
Georgiana Puşcaşu

pdf bib

EGRAM – A Grammar Development Environment and its Usage for Language Generation
Stephan Busemann

pdf bib

Exploring Portability of Syntactic Information from English to Basque
Eneko Agirre | Aitziber Atutxa | Koldo Gojenola | Kepa Sarasola

pdf bib

Spanish WordNet 1.6: Porting the Spanish Wordnet Across Princeton Versions
Jordi Atserias | Luís Villarejo | German Rigau

pdf bib

Automatic Keyword Extraction from Spoken Text. A Comparison of Two Lexical Resources: EDR and WordNet
Lonneke van der Plas | Vincenzo Pallotta | Martin Rajman | Hatem Ghorbel

pdf bib

Pronominal Anaphora Resolution for Unrestricted Text
Anna Kupść | Teruko Mitamura | Benjamin Van Durme | Eric Nyberg

pdf bib

Steps Towards Semantically Annotated Language Resources
Manfred Klenner | Fabio Rinaldi | Michael Hess

pdf bib

Semi-Automatic Construction of a Question Treebank
Karin Müller

pdf bib

Calibrating Resource-light Automatic MT Evaluation: a Cheap Approach to Ranking MT Systems by the Usability of Their Output
Bogdan Babych | Debbie Elliott | Anthony Hartley

pdf bib

Perceptual Evaluation of Quality Deterioration Owing to Prosody Modification
Kazuki Adachi | Tomoki Toda | Hiromichi Kawanami | Hiroshi Saruwatari | Kiyohiro Shikano

pdf bib

Integration of Russian Language Resources
Serge A. Yablonsky

pdf bib

A2Q: An Agent-based Architecture for Multilingual Q&A
Roberto Basili | Nicola Lorusso | Maria Teresa Pazienza | Fabio Massimo Zanzotto

pdf bib

OntoTag’s Linguistic Ontologies: Enhancing Higher Level and Semantic Web Annotations
Guadalupe Aguado de Cea | Inmaculada Álvarez-de-Mon | Antonio Pareja-Lora

pdf bib

Exploiting Language Resources for Semantic Web Annotations
Kaarel Kaljurand | Fabio Rinaldi | James Dowdall | Michael Hess

pdf bib

The Translation Correction Tool: English-Spanish User Studies
Ariadna Font Llitjós | Jaime Carbonell

pdf bib

A Labelled Corpus for Prepositional Phrase Attachment
Brian Mitchell | Robert Gaizauskas

pdf bib

Comparing the Ambiguity Reduction Abilities of Probabilistic Context-Free Grammars
Gabriel Infante-Lopez | Maarten de Rijke

pdf bib

NameNet: a Self-Improving Resource for Name Classification
Paul Morarescu | Sanda Harabagiu

pdf bib

Image-Language Multimodal Corpora: Needs, Lacunae and an AI Synergy for Annotation
Katerina Pastra | Yorick Wilks

pdf bib

Detecting Errors in English Article Usage with a Maximum Entropy Classifier Trained on a Large, Diverse Corpus
Na-Rae Han | Martin Chodorow | Claudia Leacock

pdf bib

The Core of the Czech Derivational Dictionary
Radek Sedláček

pdf bib

Automatic Sentence Simplification for Subtitling in Dutch and English
Walter Daelemans | Anja Höthker | Erik Tjong Kim Sang

pdf bib

Enriching a Thai Lexical Database with Selectional Preferences
Canasai Kruengkrai | Thatsanee Charoenporn | Virach Sornlertlamvanich | Hitoshi Isahara

pdf bib

Results of the 2003 Topic Detection and Tracking Evaluation
Jonathan G. Fiscus

pdf bib

Parsing Ungrammatical Input: an Evaluation Procedure
Jennifer Foster

pdf bib

An Automatic Method for Constructing Domain-Specific Ontology Resources
Melania Degeratu | Vasileios Hatzivassiloglou

pdf bib

Modelling Legitimate Translation Variation for Automatic Evaluation of MT Quality
Bogdan Babych | Anthony Hartley

pdf bib

Semantic Mark-up of Italian Legal Texts Through NLP-based Techniques
Roberto Bartolini | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli | Claudia Soria

pdf bib

Morphology Based Automatic Acquisition of Large-coverage Lexica
Lionel Clément | Benoît Sagot | Bernard Lang

pdf bib

Towards Intelligent Written Cultural Heritage Processing - Lexical processing
Kiril Ribarov

pdf bib

Developing Language Resources for a Transnational Digital Government System
Violetta Cavalli-Sforza | Jaime G. Carbonell | Peter J. Jansen

pdf bib

Semi-automatic Syntactic and Semantic Corpus Annotation with a Deep Parser
Mary D. Swift | Myroslava O. Dzikovska | Joel R. Tetreault | James F. Allen

pdf bib

Collecting and Sharing Bilingual Spontaneous Speech Corpora: the ChinFaDial Experiment
Georges Fafiotte | Christian Boitet | Mark Seligman | Chengqing Zong

pdf bib

Can Anaphoric Definite Descriptions be Replaced by Pronouns?
Judita Preiss | Caroline Gasperin | Ted Briscoe

pdf bib

Hybrid Constraints for Robust Parsing: First Experiments and Evaluation
Roberto Bartolini | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli

pdf bib

E-Wiz: a Trapper Protocol for Hunting the Expressive Speech Corpora in Lab
Véronique Aubergé | Nicolas Audibert | Albert Rilliard

pdf bib

Agreement in Human Factoid Annotation for Summarization Evaluation
Simone Teufel | Hans van Halteren

pdf bib

Evaluating an Authentic Audio-Visual Expressive Speech Corpus
Albert Rilliard | Véronique Aubergé | Nicolas Audibert

pdf bib

Linguistic Miner: An Italian Linguistic Knowledge System
Eugenio Picchi | Maria Luigia Ceccotti | Sebastiana Cucurullo | Manuela Sassi | Eva Sassolini

pdf bib

Metaphors in Wordnets: From Theory to Practice
Antonietta Alonge | Birte Lönneker

pdf bib

Standardization in Multimodal Content Representation: Some Methodological Issues
Harry Bunt | Laurent Romary

pdf bib

A Similarity Measure for Unsupervised Semantic Disambiguation
Roberto Basili | Marco Cammisa | Fabio Massimo Zanzotto

pdf bib

Usability Evaluation of Multimodal and Domain-Oriented Spoken Language Dialogue Systems
Laila Dybkjær | Niels Ole Bernsen | Wolfgang Minker

pdf bib

Using WordNet to Measure Semantic Orientations of Adjectives
Jaap Kamps | Maarten Marx | Robert J. Mokken | Maarten de Rijke

pdf bib

MT Goes Farming: Comparing Two Machine Translation Approaches on a New Domain
Per Weijnitz | Eva Forsbom | Ebba Gustavii | Eva Pettersson | Jörg Tiedemann

pdf bib

VOXMEX Speech Database: Design of a Phonetically Balanced Corpus
Esmeralda Uraga | César Gamboa

pdf bib

Data Driven Ontology Evaluation
Christopher Brewster | Harith Alani | Srinandan Dasmahapatra | Yorick Wilks

pdf bib

Embedding IMDI Metadata into a Large Phonetic Corpus
Oliver Schonefeld | Jan-Torsten Milde

pdf bib

Using Semantic Language Resources to Support Textual Inference for Question Answering
Francesca Bertagna

pdf bib

An Information Repository Model for Advanced Question Answering Systems
Vasco Calais Pedro | Jeongwoo Ko | Eric Nyberg | Teruko Mitamura

pdf bib

Content Interoperability of Lexical Resources: Open Issues and “MILE” Perspectives
Francesca Bertagna | Alessandro Lenci | Monica Monachini | Nicoletta Calzolari

pdf bib

Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation
Martin Čmejrek | Jan Cuřín | Jiří Havelka | Jan Hajič | Vladislav Kuboň

pdf bib

An Efficient Word Confidence Measure Using Likelihood Ratio Scores
Arlindo O. Veiga | Fernando S. Perdigão

pdf bib

Adding Syntactic Annotations to Transcripts of Parent-Child Dialogs
Kenji Sagae | Brian MacWhinney | Alon Lavie

pdf bib

Distributional Consistency: As a General Method for Defining a Core Lexicon
Huarui Zhang | Churen Huang | Shiwen Yu

pdf bib

Computing Reliability for Coreference Annotation
Rebecca J. Passonneau

pdf bib

Publicly Available Topic Signatures for all WordNet Nominal Senses
Eneko Agirre | Oier Lopez de Lacalle

pdf bib

Road-testing the English Resource Grammar Over the British National Corpus
Timothy Baldwin | Emily M. Bender | Dan Flickinger | Ara Kim | Stephan Oepen

pdf bib

Interpreting BLEU/NIST Scores: How Much Improvement do We Need to Have a Better System?
Ying Zhang | Stephan Vogel | Alex Waibel

pdf bib

Exploiting Anchor Text as a Lexical Resource
Peter Anick

pdf bib

Current Projects in Languages of Military Interest at the Defense Language Institute
Michael Emonts

pdf bib

A Multilingual Database of Idioms
Aline Villavicencio | Timothy Baldwin | Benjamin Waldron

pdf bib

Annotation Tools for Large-Scale Corpus Development: Using AGTK at the Linguistic Data Consortium
Kazuaki Maeda | Stephanie Strassel

pdf bib

Linguistic Resources for Effective, Affordable, Reusable Speech-to-Text
Stephanie Strassel

pdf bib

Building part-of-speech Corpora Through Histogram Hopping
Marc Vilain

pdf bib

An Emerging Transcontinental Collaborative Research and Education Agenda in Human Language Technologies
Gregory Ernest Monaco | Abdelhadi Soudi

pdf bib

Issues in Corpus Development for Multi-party Multi-modal Task-oriented Dialogue
Susan Robinson | Bilyana Martinovski | Saurabh Garg | Jens Stephan | David Traum

pdf bib

The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text
Christopher Cieri | David Miller | Kevin Walker

pdf bib

Evaluation of Multi-party Virtual Reality Dialogue Interaction
David R. Traum | Susan Robinson | Jens Stephan

pdf bib

The Mixer Corpus of Multilingual, Multichannel Speaker Recognition Data
Christopher Cieri | Joseph P. Campbell | Hirotaka Nakasone | David Miller | Kevin Walker

pdf bib

Building a Large Grammar for Italian
Alessandro Mazzei | Vincenzo Lombardo

pdf bib

Japanese MULTEXT: a Prosodic Corpus
Shigeyoshi Kitazawa | Shinya Kiriyama | Toshihiko Itoh | Nick Campbell

pdf bib

The OLISSIPO and LECTIO Projects
Giuseppe Cappelli | Paulo Alberto

pdf bib

A Public Reference Implementation of the RAP Anaphora Resolution Algorithm
Long Qiu | Min-Yen Kan | Tat-Seng Chua

pdf bib

WinPitch Corpus, a Text to Speech Alignment Tool for Multimodal Corpora
Philippe Martin

pdf bib

The Statistical Analysis of Morphosyntactic Distributions
Stefan Evert

pdf bib

CHeM: A System for the Automatic Analysis of e-mails in the Restoration and Conservation Domain
Luciana Bordoni | Leonardo Pasqualini | Filippo Sciarrone

pdf bib

Resources for Place Name Analysis
Robert Irie | Beth Sundheim

pdf bib

NEMLAR - An Arabic Language Resources Project
Bente Maegaard

pdf bib

Intranet Try To Find Project (ITTF): An Approach for the Search of Relevant Information Inside an Organization
Christophe Jouis | Jean-Marie Ferru

pdf bib

A Progress Report from the Linguistic Data Consortium: Recent Activities in Resource Creation and Distribution and the Development of Tools and Standards
Christopher Cieri | Mark Liberman

pdf bib

Recent Activities within the European Language Resources Association: Issues on Sharing Language Resources and Evaluation
Khalid Choukri

pdf bib

EVALDA-CESART Project: Terminological Resources Acquisition Tools Evaluation Campaign
Widad Mustafa El Hadi | Ismail Timimi | Marianne Dabbadie

pdf bib

From Weaver to the ALPAC Report
Gabriella Pardelli | Manuela Sassi | Sara Goggi

pdf bib

The Verb in the Terminological Collocations. Contribution to the Development of a Morphological Analyser: MorphoCom
Rute Costa | Raquel Silva

pdf bib

Cluster Analysis and Classification of Named Entities
Joaquim F. Ferreira da Silva | Zornitsa Kozareva | José Gabriel Pereira Lopes

pdf bib

Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus
Khalid Choukri | Mahtab Nikkhou | Niklas Paulsson

pdf bib

Technolangue: A Permanent Evaluation and Information Infrastructure
Valérie Mapelli | Maria Nava | Sylvain Surcin | Djamel Mostefa | Khalid Choukri

pdf bib

Extending Wordnets To Implicit Information
Palmira Marrafa

pdf bib

Russian Information Retrieval Evaluation Seminar
Boris Dobrov | Igor Kuralenok | Natalia Loukachevitch | Igor Nekrestyanov | Ilya Segalovich