Constantin Orašan

Also published as: C. Orasan, Constantin Orasan, Constantin Orăsan, Constantin Orǎsan


2021

pdf bib
Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by Machine Translation Systems
Hadeel Saadany | Constantin Orăsan | Emad Mohamed | Ashraf Tantavy
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In translating text where sentiment is the main message, human translators pay particular attention to sentiment-carrying words, since an incorrect translation of such words would miss the fundamental aspect of the source text, i.e. the author’s sentiment. In the online world, MT systems are extensively used to translate User-Generated Content (UGC) such as reviews, tweets and social media posts, where the main message is often the author’s positive or negative attitude towards the topic of the text. In such scenarios it is important to measure accurately how reliably an MT system transfers the correct affect message. This paper tackles an under-recognized problem in machine translation evaluation: judging to what extent automatic metrics concur with the gold standard of human evaluation for a correct translation of sentiment. We evaluate the efficacy of conventional quality metrics in spotting a mistranslation of sentiment, especially when it is the sole error in the MT output. We propose a numerical “sentiment-closeness” measure appropriate for assessing the accuracy of a translated affect message in UGC text by an MT system. We show that incorporating this sentiment-aware measure can significantly improve the correlation of some available quality metrics with human judgements of an accurate translation of sentiment.

pdf bib
An Exploratory Analysis of Multilingual Word-Level Quality Estimation with Cross-Lingual Transformers
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that multilingual QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.

2020

pdf bib
Fake or Real? A Study of Arabic Satirical Fake News
Hadeel Saadany | Constantin Orasan | Emad Mohamed
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)

One very common type of fake news is satire, which comes in the form of a news website or online platform that parodies reputable real news agencies to create a sarcastic version of reality. This type of fake news is often disseminated by individuals on their online platforms because it delivers criticism far more effectively than a straightforward message. However, when a satirical text is disseminated via social media without mention of its source, it can be mistaken for real news. This study conducts several exploratory analyses to identify the linguistic properties of Arabic fake news with satirical content. It shows that although it parodies real news, Arabic satirical news has distinguishing features at the lexico-grammatical level. We exploit these features to build a number of machine learning models capable of identifying satirical fake news with an accuracy of up to 98.6%. The study introduces a new fake news dataset (3185 articles) scraped from two Arabic satirical news websites (‘Al-Hudood’ and ‘Al-Ahram Al-Mexici’). The real news dataset consists of 3710 articles collected from three official news sites: ‘BBC-Arabic’, ‘CNN-Arabic’ and ‘Al-Jazeera news’. Both datasets are concerned with political issues related to the Middle East.

pdf bib
TransQuest at WMT2020: Sentence-Level Direct Assessment
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the Fifth Conference on Machine Translation

This paper presents the team TransQuest’s participation in the Sentence-Level Direct Assessment shared task at WMT 2020. We introduce a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. The proposed methods achieve state-of-the-art results, surpassing those obtained by OpenKiwi, the baseline used in the shared task. We further fine-tune the QE framework using ensembling and data augmentation. According to the WMT 2020 official results, our approach is the winning solution for all of the language pairs.

pdf bib
RGCL at SemEval-2020 Task 6: Neural Approaches to Definition Extraction
Tharindu Ranasinghe | Alistair Plum | Constantin Orasan | Ruslan Mitkov
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.

pdf bib
Proceedings of 1st Workshop on Post-Editing in Modern-Day Translation
John E. Ortega | Marcello Federico | Constantin Orasan | Maja Popovic
Proceedings of 1st Workshop on Post-Editing in Modern-Day Translation

pdf bib
TransQuest: Translation Quality Estimation with Cross-lingual Transformers
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 28th International Conference on Computational Linguistics

Recent years have seen major advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results, outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resource languages, allowing us to obtain very competitive results.

pdf bib
Is it Great or Terrible? Preserving Sentiment in Neural Machine Translation of Arabic Reviews
Hadeel Saadany | Constantin Orasan
Proceedings of the Fifth Arabic Natural Language Processing Workshop

Since the advent of Neural Machine Translation (NMT) approaches, there has been a tremendous improvement in the quality of automatic translation. However, NMT output still lacks accuracy in some low-resource languages and sometimes makes major errors that need extensive post-editing. This is particularly noticeable with texts that do not follow common lexico-grammatical standards, such as user-generated content (UGC). In this paper we investigate the challenges involved in translating book reviews from Arabic into English, with particular focus on the errors that lead to incorrect translation of sentiment polarity. Our study points to the special characteristics of Arabic UGC, examines the sentiment transfer errors made by Google Translate when translating Arabic UGC into English, analyses why the problem occurs, and proposes an error typology specific to the translation of Arabic UGC. Our analysis shows that the output of online translation tools on Arabic UGC either fails to transfer the sentiment at all, producing a neutral target text, or completely flips the sentiment polarity of the target word or phrase, delivering a wrong affect message. We address this problem by fine-tuning an NMT model with respect to sentiment polarity, showing that this approach can significantly help to correct sentiment errors detected in the online translation of Arabic UGC.

pdf bib
Intelligent Translation Memory Matching and Retrieval with Sentence Encoders
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Matching and retrieving previously translated segments from a Translation Memory is a key functionality of Translation Memory systems. However, this matching and retrieval process is still limited to algorithms based on edit distance, which we identify as a major drawback of these systems. In this paper, we introduce sentence encoders to improve the matching and retrieval process in Translation Memory systems, offering an effective and efficient alternative to edit distance-based algorithms.
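The contrast between the two retrieval strategies can be sketched as follows. This is an illustrative toy, not the system described in the paper: the word-level edit distance stands in for the fuzzy-match scores used by classic TM tools, and the bag-of-words count vector stands in for a trained sentence encoder.

```python
# Toy comparison of edit-distance fuzzy matching vs. vector similarity
# for Translation Memory retrieval. All names here are illustrative.
import math
from collections import Counter

def fuzzy_match(a, b):
    """Word-level edit-distance score in [0, 1], as used by classic TM tools."""
    x, y = a.split(), b.split()
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        d[i][0] = i
    for j in range(len(y) + 1):
        d[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # substitution
    return 1 - d[len(x)][len(y)] / max(len(x), len(y))

def encode(sentence):
    """Toy 'sentence encoder': a bag-of-words count vector
    (a stand-in for a trained neural encoder)."""
    return Counter(sentence.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def retrieve(tm, query, scorer):
    """Return the TM source segment that best matches the query."""
    return max(tm, key=lambda segment: scorer(segment, query))
```

For a reordered query such as "restart the computer please", the edit-distance score against the TM entry "please restart the computer" drops to 0.5, while the vector similarity is unaffected by word order, illustrating why meaning-based matching can retrieve useful segments that fuzzy matching misses.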

2019

pdf bib
RGCL-WLV at SemEval-2019 Task 12: Toponym Detection
Alistair Plum | Tharindu Ranasinghe | Pablo Calleja | Constantin Orăsan | Ruslan Mitkov
Proceedings of the 13th International Workshop on Semantic Evaluation

This article describes the system submitted by the RGCL-WLV team to SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved by one of the submissions was 89%, albeit at a relatively low recall of 49%.

pdf bib
Sentence Simplification for Semantic Role Labelling and Information Extraction
Richard Evans | Constantin Orasan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper, we report on the extrinsic evaluation of an automatic sentence simplification method with respect to two NLP tasks: semantic role labelling (SRL) and information extraction (IE). The paper begins with our observation of challenges in the intrinsic evaluation of sentence simplification systems, which motivates the use of extrinsic evaluation of these systems with respect to other NLP tasks. We describe the two NLP systems and the test data used in the extrinsic evaluation, and present arguments and evidence motivating the integration of a sentence simplification step as a means of improving the accuracy of these systems. Our evaluation reveals that their performance is improved by the simplification step: the SRL system is better able to assign semantic roles to the majority of the arguments of verbs and the IE system is better able to identify fillers for all IE template slots.

pdf bib
Toponym Detection in the Bio-Medical Domain: A Hybrid Approach with Deep Learning
Alistair Plum | Tharindu Ranasinghe | Constantin Orasan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.

pdf bib
Enhancing Unsupervised Sentence Similarity Methods with Deep Contextualised Word Representations
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state-of-the-art STS methods rely on word embeddings in one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares the results with those of existing supervised and unsupervised STS methods on datasets in different languages and domains.

pdf bib
Semantic Textual Similarity with Siamese Neural Networks
Tharindu Ranasinghe | Constantin Orasan | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Calculating the Semantic Textual Similarity (STS) is an important research area in natural language processing, playing a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural network, which are used here to measure STS. Several variants of the architecture are compared with existing methods.

pdf bib
A Survey of the Perceived Text Adaptation Needs of Adults with Autism
Victoria Yaneva | Constantin Orasan | Le An Ha | Natalia Ponomareva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

NLP approaches to automatic text adaptation often rely on user-need guidelines which are generic and do not account for the differences between various types of target groups. One such group is adults with high-functioning autism, who are usually able to read long sentences and comprehend difficult words but whose comprehension may be impeded by other linguistic constructions. This is especially challenging for real-world user-generated texts such as product reviews, which cannot be controlled editorially and are thus a particularly good application for automatic text adaptation systems. In this paper we present a mixed-methods survey conducted with 24 adult web-users diagnosed with autism and an age-matched control group of 33 neurotypical participants. The aim of the survey was to identify whether the group with autism experienced any barriers when reading online reviews, what these potential barriers were, and what NLP methods would be best suited to improve the accessibility of online reviews for people with autism. The group with autism consistently reported significantly greater difficulties with understanding online product reviews compared to the control group and identified issues related to text length, poor topic organisation, and the use of irony and sarcasm.

2018

pdf bib
Aggressive Language Identification Using Word Embeddings and Sentiment Features
Constantin Orăsan
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

This paper describes our participation in the First Shared Task on Aggression Identification. The method proposed relies on machine learning to identify social media texts which contain aggression. The main features employed by our method are information extracted from word embeddings and the output of a sentiment analyser. Several machine learning methods and different combinations of features were tried. The official submissions used Support Vector Machines and Random Forests. The official evaluation showed that Random Forests work best for texts similar to the ones in the training dataset, whilst SVMs are a better choice for texts which are different. The evaluation also showed that, despite its simplicity, the method performs well when compared with more elaborate methods.
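The feature construction described above can be illustrated with a small sketch. Everything here is a hypothetical placeholder: the tiny embedding table and the crude lexicon-based polarity score merely stand in for the pre-trained word embeddings and the sentiment analyser used in the paper.

```python
# Illustrative sketch: averaged word embeddings concatenated with a
# sentiment score, producing one feature vector per text for a classifier
# such as an SVM or Random Forest. The resources below are toy placeholders.

EMBEDDINGS = {  # hypothetical 3-dimensional word vectors
    "hate": [0.9, -0.8, 0.1],
    "you":  [0.1, 0.2, 0.0],
    "love": [-0.7, 0.9, 0.2],
}
NEGATIVE_WORDS = {"hate", "stupid"}  # toy sentiment lexicon

def sentiment_score(tokens):
    """Crude polarity score: minus the fraction of negative words."""
    return -sum(t in NEGATIVE_WORDS for t in tokens) / len(tokens)

def features(text):
    """Averaged word embeddings plus a sentiment feature."""
    tokens = text.lower().split()
    vecs = [EMBEDDINGS.get(t, [0.0, 0.0, 0.0]) for t in tokens]
    averaged = [sum(dim) / len(vecs) for dim in zip(*vecs)]
    return averaged + [sentiment_score(tokens)]
```

For instance, `features("hate you")` yields a 4-dimensional vector whose last component is the sentiment feature; such vectors would then be fed to classifiers of the kind mentioned above.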

pdf bib
What Makes You Stressed? Finding Reasons From Tweets
Reshmi Gopalakrishna Pillai | Mike Thelwall | Constantin Orasan
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Detecting stress from social media offers a non-intrusive and inexpensive alternative to traditional tools such as questionnaires or physiological sensors for monitoring the mental state of individuals. This paper introduces a novel framework for finding the reasons for stress from tweets, analysing multiple categories for the first time. Three word-vector based methods are evaluated on collections of tweets about politics or airlines, and are found to be more accurate than standard machine learning algorithms.

pdf bib
Trouble on the Road: Finding Reasons for Commuter Stress from Tweets
Reshmi Gopalakrishna Pillai | Mike Thelwall | Constantin Orasan
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)

2017

pdf bib
Combining Multiple Corpora for Readability Assessment for People with Cognitive Disabilities
Victoria Yaneva | Constantin Orăsan | Richard Evans | Omid Rohanian
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Given the lack of large user-evaluated corpora in disability-related NLP research (e.g. text simplification or readability assessment for people with cognitive disabilities), the question of choosing suitable training data for NLP models is not straightforward. The use of large generic corpora may be problematic because such data may not reflect the needs of the target population. The use of the available user-evaluated corpora may be problematic because these datasets are not large enough to be used as training data. In this paper we explore a third approach, in which a large generic corpus is combined with a smaller population-specific corpus to train a classifier which is evaluated using two sets of unseen user-evaluated data. One of these sets, the ASD Comprehension corpus, is developed for the purposes of this study and made freely available. We explore the effects of the size and type of the training data used on the performance of the classifiers, and the effects of the type of the unseen test datasets on the classification performance.

bib
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology
Irina Temnikova | Constantin Orasan | Gloria Corpas Pastor | Stephan Vogel
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology

2016

pdf bib
Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing
Wolfgang Maier | Sandra Kübler | Constantin Orasan
Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing

pdf bib
Semantic Textual Similarity in Quality Estimation
Hanna Bechara | Carla Parra Escartin | Constantin Orasan | Lucia Specia
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

pdf bib
The EXPERT project: training the future experts in translation technology
Constantin Orašan
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

pdf bib
WOLVESAAR at SemEval-2016 Task 1: Replicating the Success of Monolingual Word Alignment and Neural Embeddings for Semantic Textual Similarity
Hannah Bechara | Rohit Gupta | Liling Tan | Constantin Orăsan | Ruslan Mitkov | Josef van Genabith
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
The EXPERT project: Advancing the state of the art in hybrid translation technologies
Constantin Orasan | Alessandro Cattelan | Gloria Corpas Pastor | Josef van Genabith | Manuel Herranz | Juan José Arevalillo | Qun Liu | Khalil Sima’an | Lucia Specia
Proceedings of Translating and the Computer 37

pdf bib
ReVal: A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks
Rohit Gupta | Constantin Orăsan | Josef van Genabith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
MiniExperts: An SVM Approach for Measuring Semantic Textual Similarity
Hanna Béchara | Hernani Costa | Shiva Taslimipoor | Rohit Gupta | Constantin Orasan | Gloria Corpas Pastor | Ruslan Mitkov
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Barbecued Opakapaka: Using Semantic Preferences for Ontology Population
Ismail El Maarouf | Georgiana Marsic | Constantin Orăsan
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
Can Translation Memories afford not to use paraphrasing?
Rohit Gupta | Constantin Orasan | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Machine Translation Evaluation using Recurrent Neural Networks
Rohit Gupta | Constantin Orăsan | Josef van Genabith
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Proceedings of the Workshop Natural Language Processing for Translation Memories
Constantin Orasan | Rohit Gupta
Proceedings of the Workshop Natural Language Processing for Translation Memories

2014

pdf bib
UoW: NLP techniques developed at the University of Wolverhampton for Semantic Similarity and Textual Entailment
Rohit Gupta | Hanna Béchara | Ismail El Maarouf | Constantin Orăsan
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Intelligent translation memory matching and retrieval metric exploiting linguistic technology
Rohit Gupta | Hanna Bechara | Constantin Orasan
Proceedings of Translating and the Computer 36

pdf bib
Incorporating paraphrasing in translation memory matching and retrieval
Rohit Gupta | Constantin Orǎsan
Proceedings of the 17th Annual conference of the European Association for Machine Translation

pdf bib
An evaluation of syntactic simplification rules for people with autism
Richard Evans | Constantin Orăsan | Iustin Dornescu
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

pdf bib
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)
Constantin Orasan | Petya Osenova | Cristina Vertan
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)

pdf bib
Relative clause extraction for syntactic simplification
Iustin Dornescu | Richard Evans | Constantin Orăsan
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)

2013

pdf bib
A Tagging Approach to Identify Complex Constituents for Text Simplification
Iustin Dornescu | Richard Evans | Constantin Orăsan
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

pdf bib
Annotating Near-Identity from Coreference Disagreements
Marta Recasens | M. Antònia Martí | Constantin Orasan
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an extension of the coreference annotation in the English NP4E and the Catalan AnCora-CA corpora with near-identity relations, which are borderline cases of coreference. The annotated subcorpora have 50K tokens each. Near-identity relations, as presented by Recasens et al. (2010; 2011), build upon the idea that identity is a continuum rather than an either/or relation, thus introducing a middle ground category to explain currently problematic cases. The first annotation effort that we describe shows that it is not possible to annotate near-identity explicitly because subjects are not fully aware of it. Therefore, our second annotation effort used an indirect method, and arrived at near-identity annotations by inference from the disagreements between five annotators who had only a two-alternative choice between coreference and non-coreference. The results show that whereas as little as 2-6% of the relations were explicitly annotated as near-identity in the former effort, up to 12-16% of the relations turned out to be near-identical following the indirect method of the latter effort.

pdf bib
CLCM - A Linguistic Resource for Effective Simplification of Instructions in the Crisis Management Domain and its Evaluations
Irina Temnikova | Constantin Orasan | Ruslan Mitkov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Due to the increasing number of emergency situations, which can have substantial consequences, both financial and fatal, the Crisis Management (CM) domain is developing at an exponential speed. The efficient management of emergency situations relies on clear communication between all of the participants in a crisis situation. For these reasons, the Text Complexity (TC) of the CM domain was investigated, showing that CM domain texts exhibit high TC levels. This article presents a new linguistic resource in the form of Controlled Language (CL) guidelines for manual text simplification in the CM domain, which aims to address the high TC of CM texts and produce clear messages for use in crisis situations. The effectiveness of the resource has been tested via evaluation from several different perspectives important for the domain. The overall results show that the CLCM simplification has a positive impact on TC, reading comprehension, manual translation and machine translation. Additionally, an investigation of the cognitive difficulty of applying manual simplification operations led to interesting discoveries. This article provides details of the evaluation methods, the conducted experiments, their results and indications about future work.

pdf bib
Book Review: Interactive Multi-Modal Question-Answering by Antal van den Bosch and Gosse Bouma
Constantin Orăsan
Computational Linguistics, Volume 38, Issue 2 - June 2012

2009

pdf bib
WLV: A Confidence-based Machine Learning Method for the GREC-NEG’09 Task
Constantin Orăsan | Iustin Dornescu
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

pdf bib
QALL-ME needs AIR: a portability study
Constantin Orăsan | Iustin Dornescu | Natalia Ponomareva
Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains

pdf bib
Proceedings of the Workshop on Events in Emerging Text Types
Constantin Orasan | Laura Hasler | Corina Forăscu
Proceedings of the Workshop on Events in Emerging Text Types

2008

pdf bib
Evaluation of a Cross-lingual Romanian-English Multi-document Summariser
Constantin Orăsan | Oana Andreea Chiorean
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The rapid growth of the Internet means that more information is available than ever before. Multilingual multi-document summarisation offers a way to access this information even when it is not in a language spoken by the reader, by extracting the gist from related documents and translating it automatically. This paper presents an experiment in which Maximal Marginal Relevance (MMR), a well-known multi-document summarisation method, is used to produce summaries from Romanian news articles. A task-based evaluation performed on both the original summaries and on their automatically translated versions reveals that they still contain a significant portion of the important information from the original texts. However, direct evaluation of the automatically translated summaries shows that they are not very readable, and this can put off some readers who want to find out more about a topic.
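The MMR selection procedure named above can itself be sketched in a few lines. The Jaccard word-overlap similarity below is only an illustrative stand-in for the similarity measure used in the actual experiment.

```python
# Minimal sketch of Maximal Marginal Relevance (MMR) sentence selection.
# The Jaccard word-overlap similarity is an illustrative stand-in for
# whatever similarity function a real summariser would use.

def jaccard(a, b):
    """Word-overlap similarity between two sentences, treated as word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mmr_summarise(sentences, query, k=2, lam=0.7):
    """Greedily pick k sentences, balancing relevance to the query (weight
    lam) against redundancy with already-selected sentences (weight 1-lam)."""
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            relevance = jaccard(s, query)
            redundancy = max((jaccard(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The redundancy term is what distinguishes MMR from plain relevance ranking: once a sentence is selected, near-duplicates of it are penalised, so the summary covers more distinct content.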

pdf bib
Development and Alignment of a Domain-Specific Ontology for Question Answering
Shiyan Ou | Viktor Pekar | Constantin Orasan | Christian Spurk | Matteo Negri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

With the appearance of Semantic Web technologies, it has become possible to develop novel, sophisticated question answering systems in which ontologies are usually used as the core knowledge component. In the EU-funded QALL-ME project, a domain-specific ontology was developed and applied to question answering in the tourism domain, with the assistance of two upper ontologies for concept expansion and reasoning. This paper focuses on the development of the QALL-ME ontology for the tourism domain and its alignment with the upper ontologies WordNet and SUMO. The design of the ontology is presented, and a semi-automatic alignment procedure is described, together with some alignment results. Furthermore, the aligned ontology was used to semantically annotate original data obtained from tourism websites and natural language questions. The storage schema of the annotated data and the data access method for retrieving answers from the annotated data are also reported.

pdf bib
The QALL-ME Benchmark: a Multilingual Resource of Annotated Spoken Requests for Question Answering
Elena Cabrio | Milen Kouylekov | Bernardo Magnini | Matteo Negri | Laura Hasler | Constantin Orasan | David Tomás | Jose Luis Vicedo | Guenter Neumann | Corinna Weber
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the QALL-ME benchmark, a multilingual resource of annotated spoken requests in the tourism domain, freely available for research purposes. The languages currently involved in the project are Italian, English, Spanish and German. It introduces a semantic annotation scheme for spoken information access requests, specifically derived from Question Answering (QA) research. In addition to pragmatic and semantic annotations, we propose three QA-based annotation levels: the Expected Answer Type, the Expected Answer Quantifier and the Question Topical Target of a request, to fully capture the content of a request and extract the sought-after information. The QALL-ME benchmark is developed under the EU-FP6 QALL-ME project, which aims at the realization of a shared and distributed infrastructure for Question Answering (QA) systems on mobile devices (e.g. mobile phones). Questions are formulated by users in free natural language, and the system returns the actual sequence of words which constitutes the answer from a collection of information sources (e.g. documents, databases). Within this framework, the benchmark has the twofold purpose of training machine learning-based applications for QA and testing their actual performance with a rapid turnaround in a controlled laboratory setting.

pdf bib
Anaphora Resolution Exercise: an Overview
Constantin Orăsan | Dan Cristea | Ruslan Mitkov | António Branco
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Evaluation campaigns have become an established way to evaluate automatic systems which tackle the same task. This paper presents the first edition of the Anaphora Resolution Exercise (ARE) and the lessons learnt from it. This first edition focused only on English pronominal anaphora and NP coreference, and was organised as an exploratory exercise where various issues were investigated. ARE proposed four different tasks: pronominal anaphora resolution and NP coreference resolution on a predefined set of entities, and pronominal anaphora resolution and NP coreference resolution on raw texts. The paper presents the four tasks together with the input data and evaluation metrics prepared for each. Even though a large number of researchers in the field expressed interest in participating, only three institutions took part in the formal evaluation. The paper briefly presents their results, but does not try to interpret them, because the aim of this edition of ARE was not to determine why certain methods are better, but to prepare the ground for a fully-fledged edition.

pdf bib
Entailment-based Question Answering for Structured Data
Bogdan Sacaleanu | Constantin Orasan | Christian Spurk | Shiyan Ou | Oscar Ferrandez | Milen Kouylekov | Matteo Negri
Coling 2008: Companion volume: Demonstrations

2006

pdf bib
Computer-aided summarisation – what the user really wants
Constantin Orăsan | Laura Hasler
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Computer-aided summarisation is a technology developed at the University of Wolverhampton as a complement to automatic summarisation, intended to produce high-quality summaries with less effort. To achieve this, a user-friendly environment which incorporates several well-known summarisation methods has been developed. This paper presents the main features of the computer-aided summarisation environment and explains the changes introduced to it as a result of user feedback.

pdf bib
Transferring Coreference Chains through Word Alignment
Oana Postolache | Dan Cristea | Constantin Orasan
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper investigates the problem of automatically annotating resources with NP coreference information using an English-Romanian parallel corpus, in order to transfer coreference chains, through word alignment, from the English part to the Romanian part of the corpus. The results show that Romanian referential expressions and coreference chains can be detected with over 80% F-measure, which makes the method worthwhile as a preprocessing step, followed by manual correction, in an annotation effort to create a large Romanian corpus with coreference information.

pdf bib
NPs for Events: Experiments in Coreference Annotation
Laura Hasler | Constantin Orasan | Karin Naumann
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes a pilot project which developed a methodology for NP and event coreference annotation, consisting of detailed annotation schemes and guidelines. To develop this methodology, a small annotated sample corpus in the domain of terrorism/security was built. The methodology can be used as a basis for large-scale annotation to produce much-needed resources. In contrast to related projects, ours focused almost exclusively on the development of annotation guidelines and schemes, to ensure that future annotations based on this methodology capture the phenomena both reliably and in detail. The project also involved extensive discussions in order to redraft the guidelines, as well as major extensions to PALinkA, our existing annotation tool, to accommodate event as well as NP coreference annotation.

2005

pdf bib
Building a WSD module within an MT system to enable interactive resolution in the user’s source language
Constantin Orasan | Ted Marshall | Robert Clark | Le An Ha | Ruslan Mitkov
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

2004

pdf bib
A Comparison of Summarisation Methods Based on Term Specificity Estimation
Constantin Orăsan | Viktor Pekar | Laura Hasler
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Annotation of Anaphoric Expressions in an Aligned Bilingual Corpus
Agnès Tutin | Meriam Haddara | Ruslan Mitkov | Constantin Orasan
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

pdf bib
An Evolutionary Approach for Improving the Quality of Automatic Summaries
Constantin Orasan
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

pdf bib
PALinkA: A highly customisable tool for discourse annotation
Constantin Orăsan
Proceedings of the Fourth SIGdial Workshop of Discourse and Dialogue

pdf bib
How to build a QA system in your back-garden: application for Romanian
Constantin Orăsan | Doina Tatar | Gabriela Şerban | Dana Lupsa | Adrian Oneţ
10th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
CAST: A computer-aided summarisation tool
Constantin Orasan | Ruslan Mitkov | Laura Hasler
10th Conference of the European Chapter of the Association for Computational Linguistics

2002

pdf bib
A corpus-based investigation of junk emails
Constantin Orasan | Ramesh Krishnamurthy
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Building annotated resources for automatic text summarisation
Constantin Orasan
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Bilingual alignment of anaphoric expressions
R. Muñoz | R. Mitkov | M. Palomar | J. Peral | R. Evans | L. Moreno | C. Orasan | M. Saiz-Noeda | A. Ferrández | C. Barbu | P. Martínez-Barco | A. Suárez
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Assessing the difficulty of finding people in texts
Constantin Orăsan | Richard Evans
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

pdf bib
Learning to identify animate references
Constantin Orasan | Richard Evans
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)

2000

pdf bib
An Open Architecture for the Construction and Administration of Corpora
Constantin Orăsan | Ramesh Krishnamurthy
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
CLinkA: A Coreferential Links Annotator
Constantin Orăsan
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)