Luísa Coheur

Also published as: Luisa Coheur


2021

Online Learning Meets Machine Translation Evaluation: Finding the Best Systems with the Least Human Effort
Vânia Mendonça | Ricardo Rei | Luisa Coheur | Alberto Sardinha | Ana Lúcia Santos
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In Machine Translation, assessing the quality of a large amount of automatic translations can be challenging. Automatic metrics are not reliable when it comes to high performing systems. In addition, resorting to human evaluators can be expensive, especially when evaluating multiple systems. To overcome the latter challenge, we propose a novel application of online learning that, given an ensemble of Machine Translation systems, dynamically converges to the best systems, by taking advantage of the human feedback available. Our experiments on WMT’19 datasets show that our online approach quickly converges to the top-3 ranked systems for the language pairs considered, despite the lack of human feedback for many translations.

MT-Telescope: An interactive platform for contrastive evaluation of MT systems
Ricardo Rei | Ana C Farinha | Craig Stewart | Luisa Coheur | Alon Lavie
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

We present MT-Telescope, a visualization platform designed to facilitate comparative analysis of the output quality of two Machine Translation (MT) systems. While automated MT evaluation metrics are commonly used to evaluate MT systems at a corpus-level, our platform supports fine-grained segment-level analysis and interactive visualisations that expose the fundamental differences in the performance of the compared systems. MT-Telescope also supports dynamic corpus filtering to enable focused analysis on specific phenomena such as translation of named entities, handling of terminology, and the impact of input segment length on translation quality. Furthermore, the platform provides a bootstrapped t-test for statistical significance as a means of evaluating the rigor of the resulting system ranking. MT-Telescope is open source, written in Python, and is built around a user-friendly and dynamic web interface. Complementing other existing tools, our platform is designed to facilitate and promote the broader adoption of more rigorous analysis practices in the evaluation of MT quality.

2020

AIA-BDE: A Corpus of FAQs in Portuguese and their Variations
Hugo Gonçalo Oliveira | João Ferreira | José Santos | Pedro Fialho | Ricardo Rodrigues | Luisa Coheur | Ana Alves
Proceedings of the 12th Language Resources and Evaluation Conference

We present AIA-BDE, a corpus of 380 domain-oriented FAQs in Portuguese and their variations, i.e., paraphrases or entailed questions, created manually, by humans, or automatically, with Google Translate. It aims to be used as a benchmark for FAQ retrieval and automatic question-answering, but may be useful in other contexts, such as the development of task-oriented dialogue systems, or models for natural language inference in an interrogative context. We also report on two experiments. Matching variations with their original questions was not trivial with a set of unsupervised baselines, especially for manually created variations. Besides high performances obtained with ELMo and BERT embeddings, an Information Retrieval system was surprisingly competitive when considering only the first hit. In the second experiment, text classifiers were trained with the original questions, and tested when assigning each variation to one of three possible sources, or assigning them as out-of-domain. Here, the difference between manual and automatic variations was not so significant.

HamNoSyS2SiGML: Translating HamNoSys Into SiGML
Carolina Neves | Luísa Coheur | Hugo Nicolau
Proceedings of the 12th Language Resources and Evaluation Conference

Sign Languages are visual languages and the main means of communication used by Deaf people. However, the majority of the information available online is presented in written form. Hence, it is not easily accessible to the Deaf community. Avatars that can animate sign languages have attracted increasing interest in this area due to their flexibility in the process of generation and editing. Synthetic animation of conversational agents can be achieved through the use of notation systems. HamNoSys is one of these systems, which describes movements of the body through symbols. Its XML-compliant counterpart, SiGML, is a machine-readable form of HamNoSys able to animate avatars. Nevertheless, there are currently no freely available open source libraries that allow the conversion from HamNoSys to SiGML. Our goal is to develop an open-access tool, which can perform this conversion independently from other platforms. This system represents a crucial intermediate step in the bigger pipeline of animating signing avatars. Two case studies are described in order to illustrate different applications of our tool.

Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
André Martins | Helena Moniz | Sara Fumega | Bruno Martins | Fernando Batista | Luisa Coheur | Carla Parra | Isabel Trancoso | Marco Turchi | Arianna Bisazza | Joss Moorkens | Ana Guerberof | Mary Nurminen | Lena Marg | Mikel L. Forcada
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

PE2LGP Animator: A Tool To Animate A Portuguese Sign Language Avatar
Pedro Cabral | Matilde Gonçalves | Hugo Nicolau | Luísa Coheur | Ruben Santos
Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives

Software for the production of sign languages is much less common than for spoken languages. Such software usually relies on 3D humanoid avatars to produce signs which, inevitably, necessitates the use of animation. One barrier to the use of popular animation tools is their complexity and steep learning curve, which can be hard to master for inexperienced users. Here, we present PE2LGP, an authoring system that features a 3D avatar that signs Portuguese Sign Language. Our Animator is designed specifically to craft sign language animations using a key frame method, and is meant to be easy to use and learn for users without animation skills. We conducted a preliminary evaluation of the Animator, where we animated seven Portuguese Sign Language sentences and asked four sign language users to evaluate their quality. This evaluation revealed that the system, in spite of its simplicity, is indeed capable of producing comprehensible messages.

2019

BeamSeg: A Joint Model for Multi-Document Segmentation and Topic Identification
Pedro Mota | Maxine Eskenazi | Luísa Coheur
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We propose BeamSeg, a joint model for segmentation and topic identification of documents from the same domain. The model assumes that lexical cohesion can be observed across documents, meaning that segments describing the same topic use a similar lexical distribution over the vocabulary. The model implements lexical cohesion in an unsupervised Bayesian setting by drawing from the same language model segments with the same topic. Contrary to previous approaches, we assume that language models are not independent, since the vocabulary changes in consecutive segments are expected to be smooth and not abrupt. We achieve this by using a dynamic Dirichlet prior that takes into account data contributions from other topics. BeamSeg also models segment length properties of documents based on modality (textbooks, slides, etc.). The evaluation is carried out on three datasets. In two of them, improvements of up to 4.8% and 7.3% are obtained in the segmentation and topic identification tasks, indicating that both tasks should be jointly modeled.

L2F/INESC-ID at SemEval-2019 Task 2: Unsupervised Lexical Semantic Frame Induction using Contextualized Word Representations
Eugénio Ribeiro | Vânia Mendonça | Ricardo Ribeiro | David Martins de Matos | Alberto Sardinha | Ana Lúcia Santos | Luísa Coheur
Proceedings of the 13th International Workshop on Semantic Evaluation

Building large datasets annotated with semantic information, such as FrameNet, is an expensive process. Consequently, such resources are unavailable for many languages and specific domains. This problem can be alleviated by using unsupervised approaches to induce the frames evoked by a collection of documents. That is the objective of the second task of SemEval 2019, which comprises three subtasks: clustering of verbs that evoke the same frame and clustering of arguments into both frame-specific slots and semantic roles. We approach all the subtasks by applying a graph clustering algorithm on contextualized embedding representations of the verbs and arguments. Using such representations is appropriate in the context of this task, since they provide cues for word-sense disambiguation. Thus, they can be used to identify different frames evoked by the same words. Using this approach we were able to outperform all of the baselines reported for the task on the test set in terms of Purity F1, as well as in terms of BCubed F1 in most cases.

2017

L2F/INESC-ID at SemEval-2017 Tasks 1 and 2: Lexical and semantic features in word and textual similarity
Pedro Fialho | Hugo Patinho Rodrigues | Luísa Coheur | Paulo Quaresma
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our approach to the SemEval-2017 “Semantic Textual Similarity” and “Multilingual Word Similarity” tasks. In the former, we test our approach in both English and Spanish, and use a linguistically-rich set of features. These move from lexical to semantic features. In particular, we try to take advantage of the recent Abstract Meaning Representation and SMATCH measure. Although without state of the art results, we introduce semantic structures in textual similarity and analyze their impact. Regarding word similarity, we target the English language and combine WordNet information with Word Embeddings. Without matching the best systems, our approach proved to be simple and effective.

2016

Building a Corpus of Errors and Quality in Machine Translation: Experiments on Error Impact
Ângela Costa | Rui Correia | Luísa Coheur
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we describe a corpus of automatic translations annotated with both error type and quality. The 300 sentences that we have selected were generated by Google Translate, Systran and two in-house Machine Translation systems that use Moses technology. The errors present in the translations were annotated with an error taxonomy that divides errors into five main linguistic categories (Orthography, Lexis, Grammar, Semantics and Discourse), reflecting the language level where the error is located. After the error annotation process, we assessed the translation quality of each sentence using a four point comprehension scale from 1 to 5. Both tasks of error and quality annotation were performed by two different annotators, achieving good levels of inter-annotator agreement. The creation of this corpus allowed us to use it as training data for a translation quality classifier. We drew conclusions on error severity by observing the outputs of two machine learning classifiers: a decision tree and a regression model.

A study on the production of collocations by European Portuguese learners
Ângela Costa | Luísa Coheur | Teresa Lino
Proceedings of the 12th Workshop on Multiword Expressions

QGASP: a Framework for Question Generation Based on Different Levels of Linguistic Information
Hugo Patinho Rodrigues | Luísa Coheur | Eric Nyberg
Proceedings of the 9th International Natural Language Generation conference

2015

Proceedings of the Fourth Workshop on Vision and Language
Anja Belz | Luisa Coheur | Vittorio Ferrari | Marie-Francine Moens | Katerina Pastra | Ivan Vulić
Proceedings of the Fourth Workshop on Vision and Language

Coupling Natural Language Processing and Animation Synthesis in Portuguese Sign Language Translation
Inês Almeida | Luísa Coheur | Sara Candeias
Proceedings of the Fourth Workshop on Vision and Language

From European Portuguese to Portuguese Sign Language
Inês Almeida | Luísa Coheur | Sara Candeias
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

2014

JUST.ASK, a QA system that learns to answer new questions from previous interactions
Sérgio Curto | Ana C. Mendes | Pedro Curto | Luísa Coheur | Ângela Costa
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present JUST.ASK, a freely and publicly available Question Answering system. Its architecture is composed of the usual Question Processing, Passage Retrieval and Answer Extraction components. Several details on the information generated and manipulated by each of these components are also provided to the user when interacting with the demonstration. Since JUST.ASK also learns to answer new questions based on users’ feedback, the user is invited to identify the correct answers. These will then be used to retrieve answers to future questions.

Translation errors from English to Portuguese: an annotated corpus
Angela Costa | Tiago Luís | Luísa Coheur
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Analysing translation errors is a task that can help us find and describe translation problems in greater detail, but can also suggest where the automatic engines should be improved. Having these aims in mind we have created a corpus composed of 150 sentences, 50 from the TAP magazine, 50 from a TED talk and the other 50 from the TREC collection of factoid questions. We have automatically translated these sentences from English into Portuguese using Google Translate and Moses. After we analysed the errors and created the error annotation taxonomy, the corpus was annotated by a linguist who is a native speaker of Portuguese. Although Google’s overall performance was better in the translation task (we have also calculated the BLEU and NIST scores), there are some error types that Moses was better at coping with, especially discourse level errors.

2013

Meet EDGAR, a tutoring agent at MONSERRATE
Pedro Fialho | Luísa Coheur | Sérgio Curto | Pedro Cláudio | Ângela Costa | Alberto Abad | Hugo Meinedo | Isabel Trancoso
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

An English-Portuguese parallel corpus of questions: translation guidelines and application in SMT
Ângela Costa | Tiago Luís | Joana Ribeiro | Ana Cristina Mendes | Luísa Coheur
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The task of Statistical Machine Translation depends on large amounts of training corpora. Despite the availability of several parallel corpora, these are typically composed of declarative sentences, which may not be appropriate when the goal is to translate other types of sentences, e.g., interrogatives. There have been efforts to create corpora of questions, especially in the context of the evaluation of Question-Answering systems. One of those corpora is the UIUC dataset, composed of nearly 6,000 questions, widely used in the task of Question Classification. In this work, we make available the Portuguese version of the UIUC dataset, which we manually translated, as well as the translation guidelines. We show the impact of this corpus on the performance of a state-of-the-art SMT system when translating questions. Finally, we present a taxonomy of translation errors, according to which we analyze the output of the automatic translation before and after using the corpus as training data.

Extending a wordnet framework for simplicity and scalability
Pedro Fialho | Sérgio Curto | Ana Cristina Mendes | Luísa Coheur
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The WordNet knowledge model is currently implemented in multiple software frameworks providing procedural access to language instances of it. Frameworks tend to be focused on structural/design aspects of the model, thus describing low level interfaces for linguistic knowledge retrieval. Typically the only high level feature directly accessible is word lookup, while traversal of semantic relations leads to verbose/complex combinations of data structures, pointers and indexes which are irrelevant in an NLP context. Here we describe an extension to the JWNL framework that hides technical requirements of access to WordNet features with an essentially word/sense based API applying terminology from the official online interface. This high level API is applied to the original English version of WordNet and to an SQL based Portuguese lexicon, translated into a WordNet based representation usable by JWNL.

Dealing with unknown words in statistical machine translation
João Silva | Luísa Coheur | Ângela Costa | Isabel Trancoso
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In Statistical Machine Translation, words that were not seen during training are unknown words, that is, words that the system will not know how to translate. In this paper we contribute to this research problem by profiting from orthographic cues given by words. Thus, we report a study of the impact of word distance metrics on cognate detection and, in addition, on the possibility of obtaining translations of unknown words through Logical Analogy. Our approach is tested in the translation of corpora from Portuguese to English (and vice-versa).

2011

Named entity translation using anchor texts
Wang Ling | Pável Calado | Bruno Martins | Isabel Trancoso | Alan Black | Luísa Coheur
Proceedings of the 8th International Workshop on Spoken Language Translation: Papers

This work describes a process to extract Named Entity (NE) translations from the text available in web links (anchor texts). It translates a NE by retrieving a list of web documents in the target language, extracting the anchor texts from the links to those documents and finding the best translation from the anchor texts, using a combination of features, some of which are specific to anchor texts. Experiments performed on a manually built corpus suggest that over 70% of the NEs, ranging from unpopular to popular entities, can be translated correctly using solely anchor texts. Tests on a Machine Translation task indicate that the system can be used to improve the quality of the translations of state-of-the-art statistical machine translation systems.

Exploring linguistically-rich patterns for question generation
Sérgio Curto | Ana Cristina Mendes | Luísa Coheur
Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop

BP2EP - Adaptation of Brazilian Portuguese texts to European Portuguese
Luis Marujo | Nuno Grazina | Tiago Luis | Wang Ling | Luisa Coheur | Isabel Trancoso
Proceedings of the 15th Annual conference of the European Association for Machine Translation

Reordering Modeling using Weighted Alignment Matrices
Wang Ling | Tiago Luís | João Graça | Isabel Trancoso | Luísa Coheur
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Named Entity Recognition in Questions: Towards a Golden Collection
Ana Cristina Mendes | Luísa Coheur | Paula Vaz Lobo
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Named Entity Recognition (NER) plays a relevant role in several Natural Language Processing tasks. Question-Answering (QA) is an example of such, since answers are frequently named entities in agreement with the semantic category expected by a given question. In this context, the recognition of named entities is usually applied to free text data. NER in natural language questions can also aid QA and, thus, should not be disregarded. Nevertheless, it has not yet been given the necessary importance. In this paper, we approach the identification and classification of named entities in natural language questions. We hypothesize that NER results can benefit from the inclusion of previously labeled questions in the training corpus. We present a broad study addressing that hypothesis, focusing on the balance to be achieved between the amount of free text and questions in order to build a suitable training corpus. This work also contributes by providing a set of nearly 5,500 annotated questions with their named entities, freely available for research purposes.

The INESC-ID machine translation system for the IWSLT 2010
Wang Ling | Tiago Luís | João Graça | Luísa Coheur | Isabel Trancoso
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper we describe the Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento (INESC-ID) system that participated in the IWSLT 2010 evaluation campaign. Our main goal for this evaluation was to employ several state-of-the-art methods applied to phrase-based machine translation in order to improve the translation quality. Aside from the IBM M4 alignment model, two constrained alignment models were tested, which produced better overall results. These results were further improved by using weighted alignment matrices during phrase extraction, rather than the single best alignment. Finally, we tested several filters that ruled out phrase pairs based on punctuation. Our system was evaluated on the BTEC and DIALOG tasks, having achieved a better overall ranking in the DIALOG task.

Towards a general and extensible phrase-extraction algorithm
Wang Ling | Tiago Luís | João Graça | Luísa Coheur | Isabel Trancoso
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

Phrase-based systems deeply depend on the quality of their phrase tables and therefore, the process of phrase extraction is always a fundamental step. In this paper we present a general and extensible phrase extraction algorithm, where we have highlighted several control points. The instantiation of these control points allows the simulation of previous approaches, as in each one of these points different strategies/heuristics can be tested. We show how previous approaches fit in this algorithm, compare several of them and, in addition, we propose alternative heuristics, showing their impact on the final translation results. Considering two different test scenarios from the IWSLT 2010 competition (BTEC, Fr-En and DIALOG, Cn-En), we have obtained an improvement in the results of 2.4 and 2.8 BLEU points, respectively.

2008

Building a Golden Collection of Parallel Multi-Language Word Alignment
João Graça | Joana Paulo Pardal | Luísa Coheur | Diamantino Caseiro
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper reports on an experience of producing manual word alignments over six different language pairs (all combinations between Portuguese, English, French and Spanish) (Graça et al., 2008). Word alignment of each language pair is made over the first 100 sentences of the common test set from the Europarl corpora (Koehn, 2005), corresponding to 600 new annotated sentences. This collection is publicly available at http://www.l2f.inesc-id.pt/resources/translation/. It contains, to our knowledge, the first word alignment gold set for the Portuguese language, with three other languages. It is also, to our knowledge, the first multi-language manual word aligned parallel corpus, where the same sentences are annotated for each language pair. We started by using the guidelines presented in (Mariño, 2005) and performed several refinements: some due to under-specifications in the original guidelines, others because of disagreement on some choices. This led to the development of an extensive new set of guidelines for multi-lingual word alignment annotation that, we believe, makes the alignment process less ambiguous. We evaluate the inter-annotator agreement, obtaining an average of 91.6% agreement between the different language pairs.

2007

The INESC-ID IWSLT07 SMT system
João V. Graça | Diamantino Caseiro | Luísa Coheur
Proceedings of the Fourth International Workshop on Spoken Language Translation

We present the machine translation system used by L2F from INESC-ID in the evaluation campaign of the International Workshop on Spoken Language Translation (2007), in the task of translating spontaneous conversations in the travel domain from Italian to English.

2004

From a Surface Analysis to a Dependency Structure
Luisa Coheur | Nuno Mamede | Gabriel G. Bes
Proceedings of the Workshop on Recent Advances in Dependency Grammar

A step towards incremental generation of logical forms
Luísa Coheur | Nuno Mamede | Gabriel Bès
Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2004)