Helena Moniz - ACL Anthology

Helena Moniz

Also published as: Helena Moniz

2025

The BridgeAI Project
Helena Moniz | António Novais | Joana Lamego | Nuno André
Proceedings of Machine Translation Summit XX: Volume 2

This paper presents an updated overview of the ‘BridgeAI’ project, a science-for-policy initiative funded by the Portuguese Foundation for Science and Technology (FCT) and the Recovery and Resilience Programme. In its second stage of implementation, BridgeAI continues to build upon its original goals, working towards a strategy to align AI research, policy, regulatory frameworks, and practical application. The project provides Portugal with an evidence-based framework to implement the EU Artificial Intelligence (AI) Act (AIA), ensuring responsible AI innovation through multidisciplinary collaboration. BridgeAI connects academia, industry, public administration, and civil society to create actionable insights and regulatory recommendations. This paper details the project’s latest advancements, key recommendations, and future directions.

Proceedings of Machine Translation Summit XX: Volume 1
Pierrette Bouillon | Johanna Gerlach | Sabrina Girletti | Lise Volkart | Raphael Rubino | Rico Sennrich | Ana C. Farinha | Marco Gaido | Joke Daems | Dorothy Kenny | Helena Moniz | Sara Szoc
Proceedings of Machine Translation Summit XX: Volume 1

Proceedings of Machine Translation Summit XX: Volume 2
Pierrette Bouillon | Johanna Gerlach | Sabrina Girletti | Lise Volkart | Raphael Rubino | Rico Sennrich | Samuel Läubli | Martin Volk | Miquel Esplà-Gomis | Vincent Vandeghinste | Helena Moniz | Sara Szoc
Proceedings of Machine Translation Summit XX: Volume 2

Cultural Transcreation in Asian Languages with Prompt-Based LLMs
Helena Wu | Beatriz Silva | Vera Cabarrão | Helena Moniz
Proceedings of Machine Translation Summit XX: Volume 2

This research explores Cultural Transcreation (CT) for East Asian languages, focusing primarily on Mandarin Chinese (ZH) and the customer service (CS) market. We combined Large Language Models (LLMs) with prompt engineering to develop a CT product that, aligned with the Augmented Translation concept, enhances multilingual CS communication, enables professionals to engage with their target audience effortlessly, and improves overall service quality. Through a series of preparatory steps, including guideline establishment, benchmark validation, iterative prompt refinement, and LLM testing, we integrated the CT product into the CS platform, assessed its performance, and refined prompts based on pilot feedback. The results highlight its success in empowering agents, regardless of linguistic or cultural expertise, to bridge communication gaps effectively through AI-assisted cultural rephrasing, thus achieving its market launch. Beyond CS, the study extends the concept of transcreation and prompt-based LLM applications to other fields, discussing its performance in the language conversion of website content and advertising.

2024

Cultural Transcreation with LLMs as a new product
Beatriz Silva | Helena Wu | Yan Jingxuan | Vera Cabarrão | Helena Moniz | Sara Guerreiro de Sousa | João Almeida | Malene Sjørslev Søholm | Ana Farinha | Paulo Dimas
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

We present how, at Unbabel, we have been using Large Language Models to apply a Cultural Transcreation (CT) product to customer support (CS) emails and how we have been testing the quality and potential of this product. We discuss our preliminary evaluation of the performance of different MT models in the task of translating rephrased content and the quality of the translation outputs. Furthermore, we introduce the live pilot programme and the corresponding relevant findings, showing that transcreated content is not only culturally adequate but also of high rephrasing and translation quality.

The BridgeAI Project
Helena Moniz | Joana Lamego | Nuno André | António Novais | Bruno Silva | Maria Henriques | Mariana Dalblon | Paulo Dimas | Pedro Gonçalves
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

This paper describes the project “BridgeAI: Boosting Regulatory Implementation with Data-driven insights, Global expertise, and Ethics for AI”, a one-year science-for-policy research project funded by the Portuguese Foundation for Science and Technology (FCT). The project aims to provide decision-makers in Portugal with the best context to implement the EU Artificial Intelligence (AI) Act and bridge the gap between AI research and policy. Although not exclusively on machine translation, the project pertains to natural language processing in general and ultimately to each of us as citizens.

The Center for Responsible AI Project
Maria Ana Henriques | Ana Farinha | Nuno André | António Novais | Sara Guerreiro de Sousa | Bruno Prezado Silva | Ana Oliveira | Helena Moniz | Andre Martins | Paulo Dimas
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

This paper describes the project “NextGenAI: Center for Responsible AI”, a 39-month Mobilizing and Green Agenda for Business Innovation funded by the Portuguese Recovery and Resilience Plan, under the Recovery and Resilience Facility (RRF). The project aims to create a new Center for Responsible AI in Portugal, capable of delivering more than 20 AI products in crucial areas like “Life Sciences”, many of which use generative AI, particularly NLP models such as those for Machine Translation, contributing to translating into legislation the European Law included in the EU AI Act, and creating a critical mass in the development of responsible AI technologies. To accomplish this mission, the Center for Responsible AI is formed by an ecosystem of startups and research institutions driving research in a virtuous way by addressing real market needs and opportunities in Responsible AI.

ConText at WASSA 2024 Empathy and Personality Shared Task: History-Dependent Embedding Utterance Representations for Empathy and Emotion Prediction in Conversations
Patrícia Pereira | Helena Moniz | Joao Paulo Carvalho
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Empathy and emotion prediction are key components in the development of effective and empathetic agents, amongst several other applications. The WASSA shared task on empathy and emotion prediction in interactions presents an opportunity to benchmark approaches to these tasks. Appropriately selecting and representing the historical context is crucial in the modelling of empathy and emotion in conversations. In our submissions, we model empathy, emotion polarity and emotion intensity of each utterance in a conversation by feeding the utterance to be classified together with its conversational context, i.e., a certain number of previous conversational turns, as input to an encoder Pre-trained Language Model (PLM), to which we append a regression head for prediction. We also model perceived counterparty empathy of each interlocutor by feeding all utterances from the conversation and a token identifying the interlocutor for which we are predicting the empathy. Our system officially ranked 1st at the CONV-turn track and 2nd at the CONV-dialog track.

Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
Carolina Scarton | Charlotte Prescott | Chris Bayliss | Chris Oakley | Joanna Wright | Stuart Wrigley | Xingyi Song | Edward Gow-Smith | Mikel Forcada | Helena Moniz
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

Generating subject-matter expertise assessment questions with GPT-4: a medical translation use-case
Diana Silveira | Marina Sánchez-Torrón | Helena Moniz
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

This paper examines the suitability of a large language model (LLM), GPT-4, for generating multiple choice questions (MCQs) aimed at assessing subject matter expertise (SME) in the domain of medical translation. The main objective of these questions is to model the skills of potential subject matter experts in a human-in-the-loop machine translation (MT) flow, to ensure that tasks are matched to the individuals with the right skill profile. The investigation was conducted at Unbabel, an artificial intelligence-powered human translation platform. Two medical translation experts evaluated the GPT-4-generated questions and answers, one focusing on English–European Portuguese, and the other on English–German. We present a methodology for creating prompts to elicit high-quality GPT-4 outputs for this use case, as well as for designing evaluation scorecards for human review of such output. Our findings suggest that GPT-4 has the potential to generate suitable items for subject matter expertise tests, providing a more efficient approach compared to relying solely on humans. Furthermore, we propose recommendations for future research to build on our approach and refine the quality of the outputs generated by LLMs.

2023

A Context-Aware Annotation Framework for Customer Support Live Chat Machine Translation
Miguel Menezes | M. Amin Farajian | Helena Moniz | João Varelas Graça
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

To measure the quality of context-aware machine translation (MT) systems, existing solutions have recommended that human annotators consider the full context of a document. In our work, we revised a well-known Machine Translation quality assessment framework, Multidimensional Quality Metrics (MQM) (Lommel et al., 2014), by introducing a set of nine annotation categories that allow MT errors to be mapped to contextual phenomena in the source document; for simplicity's sake, we call such phenomena contextual triggers. Our analysis shows that the adapted category set enhanced MQM’s potential for MT error identification, covering up to 61% more errors than the traditional, non-contextual application of core MQM. Subsequently, we analyzed the severity of these MT “contextual errors”, showing that the majority fall under the critical and major levels, further indicating the impact of such errors. Finally, we measured the ability of existing evaluation metrics to detect the proposed MT “contextual errors”. The results show that current state-of-the-art metrics fall short in detecting MT errors that are caused by contextual triggers on the source document side. With the work developed, we hope to understand how impactful context is for enhancing quality within an MT workflow and draw attention to future integration of the proposed contextual annotation framework into current MQM’s core typology.

Context-aware and gender-neutral Translation Memories
Marjolene Paulo | Vera Cabarrão | Helena Moniz | Miguel Menezes | Rachel Grewcock | Eduardo Farah
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This work proposes an approach to use Part-Of-Speech (POS) information to automatically detect context-dependent Translation Units (TUs) from a Translation Memory database pertaining to the customer support domain. In line with our goal to minimize context-dependency in TUs, we show how this mechanism can be deployed to create new gender-neutral and context-independent TUs. Our experiments, conducted across Portuguese (PT), Brazilian Portuguese (PT-BR), Spanish (ES), and Spanish-Latam (ES-LATAM), show that the occurrence of certain POS with specific words is accurate in identifying context dependency. In a cross-client analysis, we found that ~10% of the most frequent 13,200 TUs were context-dependent, with gender determining context-dependency in 98% of all confirmed cases. We used these findings to suggest gender-neutral equivalents for the most frequent TUs with gender constraints. Our approach is in use in the Unbabel translation pipeline, and can be integrated into any other Neural Machine Translation (NMT) pipeline.

Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation
John Mendonça | Patrícia Pereira | Helena Moniz | Joao Paulo Carvalho | Alon Lavie | Isabel Trancoso
Proceedings of the Eleventh Dialog System Technology Challenge

Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first place on both the Robust and Multilingual tasks of the DSTC11 Track 4 “Automatic Evaluation Metrics for Open-Domain Dialogue Systems”, proving the evaluation capabilities of prompted LLMs.
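
As an illustration of the "mean Spearman correlation across several benchmarks" figure mentioned in the abstract above, the following is a minimal sketch of how such a score could be computed. It is not the authors' evaluation code: the function names (`average_ranks`, `spearman`, `mean_spearman`) and the input layout (one pair of metric scores and human scores per benchmark) are assumptions made for this example.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def mean_spearman(
    benchmarks: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average the per-benchmark Spearman scores (hypothetical input layout:
    benchmark name -> (metric scores, human scores))."""
    scores = [spearman(metric, human) for metric, human in benchmarks.values()]
    return sum(scores) / len(scores)
```

Each benchmark contributes equally to the mean in this sketch, regardless of its size; whether the shared task weights benchmarks differently is not stated in the abstract.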

Quality Fit for Purpose: Building Business Critical Errors Test Suites
Mariana Cabeça | Marianna Buchicchio | Madalena Gonçalves | Christine Maroti | João Godinho | Pedro Coelho | Helena Moniz | Alon Lavie
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper illustrates a new methodology based on Test Suites (Avramidis et al., 2018) with focus on Business Critical Errors (BCEs) (Stewart et al., 2022) to evaluate the output of Machine Translation (MT) and Quality Estimation (QE) systems. We demonstrate the value of relying on semi-automatic evaluation done through scalable BCE-focused Test Suites to monitor both MT and QE systems’ performance for 8 language pairs (LPs) and a total of 4 error categories. This approach allows us to not only track the impact of new features and implementations in a real business environment, but also to identify strengths and weaknesses in models regarding different error types, and subsequently know what to improve henceforth.

Context-Dependent Embedding Utterance Representations for Emotion Recognition in Conversations
Patrícia Pereira | Helena Moniz | Isabel Dias | Joao Paulo Carvalho
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Emotion Recognition in Conversations (ERC) has been gaining increasing importance as conversational agents become more and more common. Recognizing emotions is key for effective communication, being a crucial component in the development of effective and empathetic conversational agents. Knowledge and understanding of the conversational context are extremely valuable for identifying the emotions of the interlocutor. We thus approach Emotion Recognition in Conversations leveraging the conversational context, i.e., taking into account previous conversational turns. The usual approach to model the conversational context has been to produce context-independent representations of each utterance and subsequently perform contextual modeling of these. Here we propose context-dependent embedding representations of each utterance by leveraging the contextual representational power of pre-trained transformer language models. In our approach, we feed the conversational context appended to the utterance to be classified as input to the RoBERTa encoder, to which we append a simple classification module, thus discarding the need to deal with context after obtaining the embeddings since these constitute already an efficient representation of such context. We also investigate how the number of introduced conversational turns influences our model performance. The effectiveness of our approach is validated on the open-domain DailyDialog dataset and on the task-oriented EmoWOZ dataset.
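
The input construction described in the abstract above (the utterance to be classified, with a number of previous turns appended) can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the function name `build_context_input`, the `</s>` separator string, and the choice to list context turns in chronological order after the target utterance are all hypothetical. The resulting string would then be tokenized and passed to the encoder, a step that is not shown here.

```python
from typing import List


def build_context_input(
    turns: List[str],
    target_index: int,
    num_context_turns: int,
    sep: str = "</s>",  # assumed separator; the real choice depends on the tokenizer
) -> str:
    """Build one encoder input: target utterance first, then previous turns.

    Up to `num_context_turns` turns preceding `target_index` are appended,
    in chronological order (an assumption made for this sketch).
    """
    if not 0 <= target_index < len(turns):
        raise IndexError("target_index is out of range")
    if num_context_turns < 0:
        raise ValueError("num_context_turns must be non-negative")
    start = max(0, target_index - num_context_turns)
    context = turns[start:target_index]
    return f" {sep} ".join([turns[target_index]] + context)


if __name__ == "__main__":
    dialogue = ["Hi", "Hello, how can I help?", "My order is late."]
    # Classify the last turn using one previous turn as context.
    print(build_context_input(dialogue, target_index=2, num_context_turns=1))
```

Varying `num_context_turns` corresponds to the abstract's investigation of how the number of introduced conversational turns influences performance.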

Dialogue Quality and Emotion Annotations for Customer Support Conversations
John Mendonca | Patrícia Pereira | Miguel Menezes | Vera Cabarrão | Ana C Farinha | Helena Moniz | Alon Lavie | Isabel Trancoso
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Task-oriented conversational datasets often lack topic variability and linguistic diversity. However, with the advent of Large Language Models (LLMs) pretrained on extensive, multilingual and diverse text data, these limitations seem overcome. Nevertheless, their generalisability to different languages and domains in dialogue applications remains uncertain without benchmarking datasets. This paper presents a holistic annotation approach for emotion and conversational quality in the context of bilingual customer support conversations. By performing annotations that take into consideration the complete instances that compose a conversation, one can form a broader perspective of the dialogue as a whole. Furthermore, it provides a unique and valuable resource for the development of text classification models. To this end, we present benchmarks for Emotion Recognition and Dialogue Quality Estimation and show that further research is needed to leverage these models in a production setting.

2022

Agent and User-Generated Content and its Impact on Customer Support MT
Madalena Gonçalves | Marianna Buchicchio | Craig Stewart | Helena Moniz | Alon Lavie
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper illustrates a new evaluation framework developed at Unbabel for measuring the quality of source language text and its effect on both Machine Translation (MT) and Human Post-Edition (PE) performed by non-professional post-editors. We examine both agent and user-generated content from the Customer Support domain and propose that differentiating the two is crucial to obtaining high quality translation output. Furthermore, we present results of initial experimentation with a new evaluation typology based on the Multidimensional Quality Metrics (MQM) Framework (Lommel et al., 2014), specifically tailored toward the evaluation of source language text. We show how the MQM Framework (Lommel et al., 2014) can be adapted to assess errors of monolingual source texts and demonstrate how very specific source errors propagate to the MT and PE targets. Finally, we illustrate how MT systems are not robust enough to handle very specific source noise in the context of Customer Support data.

A Case Study on the Importance of Named Entities in a Machine Translation Pipeline for Customer Support Content
Miguel Menezes | Vera Cabarrão | Pedro Mota | Helena Moniz | Alon Lavie
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper describes the research developed at Unbabel, a Portuguese machine translation start-up that combines MT with human post-edition and focuses strictly on customer service content. We aim to contribute to furthering MT quality and good practices by exposing the importance of having a continuously-in-development robust Named Entity Recognition system compliant with the General Data Protection Regulation (GDPR). Moreover, we have tested semiautomatic strategies that support and enhance the creation of Named Entities gold standards to allow a more seamless implementation of Multilingual Named Entities Recognition Systems. The project described in this paper is the result of shared work between Unbabel's linguists and Unbabel's AI engineering team, matured over a year. The project should also be taken as a statement of multidisciplinarity, proving and validating the much-needed articulation between the different scientific fields that compose and characterize the area of Natural Language Processing (NLP).

Findings of the WMT 2022 Shared Task on Chat Translation
Ana C Farinha | M. Amin Farajian | Marianna Buchicchio | Patrick Fernandes | José G. C. de Souza | Helena Moniz | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper reports the findings of the second edition of the Chat Translation Shared Task. Similarly to the previous WMT 2020 edition, the task consisted of translating bilingual customer support conversational text. However, unlike the previous edition, in which the bilingual data was created from a synthetic monolingual English corpus, this year we used a portion of the newly released Unbabel’s MAIA corpus, which contains genuine bilingual conversations between agents and customers. We also expanded the language pairs to English↔German (en↔de), English↔French (en↔fr), and English↔Brazilian Portuguese (en↔pt-br). Given that the main goal of the shared task is to translate bilingual conversations, participants were encouraged to train and test their models specifically for this environment. In total, we received 18 submissions from 4 different teams. All teams participated in both directions of en↔de. One of the teams also participated in en↔fr and en↔pt-br. We evaluated the submissions with automatic metrics as well as human judgments via Multidimensional Quality Metrics (MQM) on both directions. The official ranking of the systems is based on the overall MQM scores of the participating systems on both directions, i.e. agent and customer.

This paper presents the Multitask, Multilingual, Multimodal Language Generation COST Action – Multi3Generation (CA18231), an interdisciplinary network of research groups working on different aspects of language generation. This “meta-paper” will serve as reference for citations of the Action in future publications. It presents the objectives, challenges and the links for the achieved outcomes.

QUARTZ: Quality-Aware Machine Translation
José G.C. de Souza | Ricardo Rei | Ana C. Farinha | Helena Moniz | André F. T. Martins
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper presents QUARTZ, QUality-AwaRe machine Translation, a project led by Unbabel which aims at developing machine translation systems that are more robust and produce fewer critical errors. With QUARTZ we want to enable machine translation for user-generated conversational content types that do not tolerate critical errors in automatic translations.

2020

Project MAIA: Multilingual AI Agent Assistant
André F. T. Martins | Joao Graca | Paulo Dimas | Helena Moniz | Graham Neubig
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper presents the Multilingual Artificial Intelligence Agent Assistant (MAIA), a project led by Unbabel with the collaboration of CMU, INESC-ID and IT Lisbon. MAIA will employ cutting-edge machine learning and natural language processing technologies to build multilingual AI agent assistants, eliminating language barriers. MAIA’s translation layer will empower human agents to provide customer support in real-time, in any language, with human quality.

Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
André Martins | Helena Moniz | Sara Fumega | Bruno Martins | Fernando Batista | Luisa Coheur | Carla Parra | Isabel Trancoso | Marco Turchi | Arianna Bisazza | Joss Moorkens | Ana Guerberof | Mary Nurminen | Lena Marg | Mikel L. Forcada
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

2018

Cross-domain analysis of discourse markers in European Portuguese
Vera Cabarrão | Helena Moniz | Fernando Batista | Jaime Ferreira | Isabel Trancoso | Ana Isabel Mata
Dialogue Discourse Volume 9

This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese - university lectures and map-task dialogues - and also in a collection of tweets, aiming at contributing to their categorization, scarcely existent for European Portuguese. Our results show that the selection of discourse markers is domain and speaker dependent. We also found that the most frequent discourse markers are similar in all three corpora, despite tweets containing discourse markers not found in the other two corpora. In this multidisciplinary study, comprising both a linguistic perspective and a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and sentence-like units. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, achieved a considerable performance on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation in order to evaluate the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results. 
Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be fairly discriminated from disfluencies and SUs. In order to better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features are the most relevant ones for the distinction between discourse markers and disfluencies, namely pitch slopes. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by total deaccented material or H+L* L* contours, with upstep H tones.

2017

The INTERACT Project and Crisis MT
Sharon O’Brien | Chao-Hong Liu | Andy Way | João Graça | André Martins | Helena Moniz | Ellie Kemp | Rebecca Petras
Proceedings of Machine Translation Summit XVI: Commercial MT Users and Translators Track

2016

SPA: Web-based Platform for easy Access to Speech Processing Modules
Fernando Batista | Pedro Curto | Isabel Trancoso | Alberto Abad | Jaime Ferreira | Eugénio Ribeiro | Helena Moniz | David Martins de Matos | Ricardo Ribeiro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents SPA, a web-based Speech Analytics platform that integrates several speech processing modules and that makes it possible to use them through the web. It was developed with the aim of facilitating the usage of the modules, without the need to know about software dependencies and specific configurations. Apart from being accessed by a web-browser, the platform also provides a REST API for easy integration with other applications. The platform is flexible, scalable, provides authentication for access restrictions, and was developed taking into consideration the time and effort of providing new services. The platform is still being improved, but it already integrates a considerable number of audio and text processing modules, including: automatic transcription, speech disfluency classification, emotion detection, dialog act recognition, age and gender classification, non-nativeness detection, hyper-articulation detection, and two external modules for feature extraction and DTMF detection. This paper describes the SPA architecture, presents the already integrated modules, and provides a detailed description for the ones most recently integrated.

The SpeDial datasets: datasets for Spoken Dialogue Systems analytics
José Lopes | Arodami Chorianopoulou | Elisavet Palogiannidi | Helena Moniz | Alberto Abad | Katerina Louka | Elias Iosif | Alexandros Potamianos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The SpeDial consortium is sharing two datasets that were used during the SpeDial project. By sharing them with the community we are providing a resource to reduce the duration of the development cycle of new Spoken Dialogue Systems (SDSs). The datasets include audio recordings and several manual annotations, i.e., miscommunication, anger, satisfaction, repetition, gender and task success. The datasets were created with data from real users and cover two different languages: English and Greek. Detectors for miscommunication, anger and gender were trained for both systems. The detectors were particularly accurate in tasks where humans have high annotator agreement such as miscommunication and gender. As expected due to the subjectivity of the task, the anger detector had a less satisfactory performance. Nevertheless, we proved that the automatic detection of situations that can lead to problems in SDSs is possible and can be a promising direction to reduce the duration of the SDS development cycle.

2014

OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
Anabela Barreiro | Fernando Batista | Ricardo Ribeiro | Helena Moniz | Isabel Trancoso
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents 3 sets of OpenLogos resources, namely the English-German, the English-French, and the English-Italian bilingual dictionaries. In addition to the usual information on part-of-speech, gender, and number for nouns, offered by most dictionaries currently available, OpenLogos bilingual dictionaries have some distinctive features that make them unique: they contain cross-language morphological information (inflectional and derivational), semantico-syntactic knowledge, indication of the head word in multiword units, information about whether a source word corresponds to a homograph, information about verb auxiliaries, alternate words (i.e., predicate or process nouns), causatives, reflexivity, verb aspect, among others. The focal point of the paper will be the semantico-syntactic knowledge that is important for disambiguation and translation precision. The resources are publicly available at the METANET platform for free use by the research community.

Revising the annotation of a Broadcast News corpus: a linguistic approach
Vera Cabarrão | Helena Moniz | Fernando Batista | Ricardo Ribeiro | Nuno Mamede | Hugo Meinedo | Isabel Trancoso | Ana Isabel Mata | David Martins de Matos
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a linguistic revision process of a speech corpus of Portuguese broadcast news focusing on metadata annotation for rich transcription, and reports on the impact of the new data on the performance of several modules. The main focus of the revision process consisted of annotating and revising structural metadata events, such as disfluencies and punctuation marks. The resultant revised data is now being extensively used, and was of extreme importance for improving the performance of several modules, especially the punctuation and capitalization modules, but also the speech recognition system, and all the subsequent modules. The resultant data has also been recently used in disfluency studies across domains.

Prosodic, syntactic, semantic guidelines for topic structures across domains and corpora
Ana Isabel Mata | Helena Moniz | Telmo Móia | Anabela Gonçalves | Fátima Silva | Fernando Batista | Inês Duarte | Fátima Oliveira | Isabel Falé
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the annotation guidelines applied to naturally occurring speech, aiming at an integrated account of contrast and parallel structures in European Portuguese. These guidelines were defined to allow for the empirical study of interactions among intonation and syntax-discourse patterns in selected sets of different corpora (monologues and dialogues, by adults and teenagers). In this paper we focus on the multilayer annotation process of left periphery structures by using a small sample of highly spontaneous speech in which the distinct types of topic structures are displayed. The analysis of this sample provides fundamental training and testing material for further application in a wider range of domains and corpora. The annotation process comprises the following time-linked levels (manual and automatic): phone, syllable and word level transcriptions (including co-articulation effects); tonal events and break levels; part-of-speech tagging; syntactic-discourse patterns (construction type; construction position; syntactic function; discourse function), and disfluency events as well. Speech corpora with such a multi-level annotation are a valuable resource to look into grammar module relations in language use from an integrated viewpoint. Such a viewpoint is innovative in our language, and has not been often assumed by studies for other languages.

Teenage and adult speech in school context: building and processing a corpus of European Portuguese
Ana Isabel Mata | Helena Moniz | Fernando Batista | Julia Hirschberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a corpus of European Portuguese spoken by teenagers and adults in school context, CPE-FACES, with an overview of the differential characteristics of high school oral presentations and the challenges this data poses to automatic speech processing. The CPE-FACES corpus has been created with two main goals: to provide a resource for the study of prosodic patterns in both spontaneous and prepared unscripted speech, and to capture inter-speaker and speaking style variations common at school, for research on oral presentations. Research on speaking styles is still largely based on adult speech. References to teenagers are sparse and cross-analyses of speech types comparing teenagers and adults are rare. We expect CPE-FACES, currently a unique resource in this domain, will contribute to filling this gap in European Portuguese. Focusing on disfluencies and phrase-final phonetic-phonological processes we show the impact of teenage speech on the automatic segmentation of oral presentations. Analyzing fluent final intonation contours in declarative utterances, we also show that communicative situation specificities, speaker status and cross-gender differences are key factors in speaking style variation at school.

2008

The LECTRA Corpus - Classroom Lecture Transcriptions in European Portuguese
Isabel Trancoso | Rui Martins | Helena Moniz | Ana Isabel Mata | M. Céu Viana
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the corpus of university lectures that has been recorded in European Portuguese, and some of the recognition experiments we have done with it. The highly specific topic domain and the spontaneous speech nature of the lectures are two of the most challenging problems. Lexical and language model adaptation proved difficult given the scarcity of domain material in Portuguese, but improvements can be achieved with unsupervised acoustic model adaptation. From the point of view of the study of spontaneous speech characteristics, namely disfluencies, the LECTRA corpus has also proved a very valuable resource.

Co-authors

Ana Isabel Mata 5

Ana C. Farinha 4

Mikel L. Forcada 4

Miguel Menezes 4

Patrícia Pereira 4

Carolina Scarton 4

Marianna Buchicchio 3

Joao Paulo Carvalho 3

António Novais 3

Mary Nurminen 3

Ricardo Ribeiro 3

Anabela Barreiro 2

Chris Bayliss 2

Pierrette Bouillon 2

José G. C. de Souza 2

M. Amin Farajian 2

Jaime Ferreira 2

Johanna Gerlach 2

Sabrina Girletti 2

Madalena Gonçalves 2

Edward Gow-Smith 2

Sara Guerreiro de Sousa 2

Maarit Koponen 2

David Martins de Matos 2

John Mendonça 2

Carla Parra Escartín 2

Charlotte Prescott 2

Raphael Rubino 2

Rico Sennrich 2

Beatriz Silva 2

Joanna Wright 2

Stuart Wrigley 2

Mirela Alhasani 1

João Almeida 1

Maria Ana Henriques 1

Nora Aranberri 1

Isabelle Augenstein 1

Loic Barrault 1

Rachel Bawden 1

Arianna Bisazza 1

Judith Brenner 1

Mariana Cabeça 1

Patrick Cadwell 1

Iacer Calixto 1

Konstantinos Chatzitheodorou 1

Arodami Chorianopoulou 1

Luísa Coheur 1

Marta R. Costa-jussà 1

Mariana Dalblon 1

Christophe Declercq 1

Miquel Esplà-Gomis 1

Eduardo Farah 1

Patrick Fernandes 1

Margot Fonteyne 1

Dimitra Gkatzia 1

João Godinho 1

Anabela Gonçalves 1

Pedro Gonçalves 1

João Varelas Graça 1

Rachel Grewcock 1

Ana Guerberof 1

Maria Henriques 1

Julia Hirschberg 1

Diptesh Kanojia 1

Dorothy Kenny 1

Ekaterina Lapshinova-Koltunski 1

Sirkku Latomaa 1

Chao-Hong Liu 1

Katerina Louka 1

Samuel Läubli 1

Christine Maroti 1

Bruno Martins 1

Mikhail Mikhailov 1

Joss Moorkens 1

Graham Neubig 1

Mara Nunziatini 1

Fátima Oliveira 1

Sharon O’Brien 1

Elisavet Palogiannidi 1

Marcin Paprzycki 1

Marjolene Paulo 1

Rebecca Petras 1

Spyridon Pilos 1

Maja Popović 1

François Portet 1

Alexandros Potamianos 1

Bruno Prezado Silva 1

Tharindu Ranasinghe 1

Eugénio Ribeiro 1

Andrew Rufener 1

Frederike Schierl 1

Fátima Silva 1

Diana Silveira 1

Malene Sjørslev Søholm 1

Craig Stewart 1

Víctor M. Sánchez-Cartagena 1

Marina Sánchez-Torrón 1

Joachim Van Den Bogaert 1

Vincent Vandeghinste 1

Eva Vanmassenhove 1

M. Céu Viana 1

Sergi Àlvarez Vidal 1

José GC de Souza 1

Venues