Fernando Batista
2026
Retrieval-Augmented Generation with Small Language Models for Fake News Detection
Lucca Baptista Silva Ferraz | Jhúlia de Souza Leal | Anderson Raymundo Avila | Thiago Alexandre Salgueiro Pardo | Fernando Batista | Renato Moraes Silva
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
The spread of online misinformation has made fake news detection an essential tool for mitigating its negative impact, but many studies disregard temporal information, and existing datasets become outdated as news evolves. Some modern solutions using Retrieval-Augmented Generation (RAG) can solve the problem of unseen news events by providing context to the models. However, there are no studies evaluating the feasibility of using web searches to obtain the context needed to decide whether a news article is true. This work addresses this gap by conducting a comparative study of RAG-based solutions, traditional fake news classification methods, and deep learning-based methods. The results show that although RAG is a modern and promising technique, it cannot outperform techniques already adopted in the literature.
2022
Hate Speech Dynamics Against African descent, Roma and LGBTQI Communities in Portugal
Paula Carvalho | Bernardo Cunha | Raquel Santos | Fernando Batista | Ricardo Ribeiro
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper introduces FIGHT, a dataset containing 63,450 tweets posted by online users in Portugal before and after the official declaration of Covid-19 as a pandemic. This resource aims to contribute to the analysis of online hate speech targeting the most representative minorities in Portugal, namely the African descent and the Roma communities, and the LGBTQI community, the most commonly reported target of hate speech in social media in the European context. We present the methods for collecting the data, and provide insightful statistics on the distribution of tweets included in FIGHT, considering both the temporal and spatial dimensions. We also analyze the availability over time of tweets targeting the above-mentioned communities, distinguishing public, private and deleted tweets. We believe this study will contribute to a better understanding of the dynamics of online hate speech in Portugal, particularly in adverse contexts, such as a pandemic outbreak, allowing the development of more informed and accurate hate speech resources for Portuguese.
2020
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
André Martins | Helena Moniz | Sara Fumega | Bruno Martins | Fernando Batista | Luisa Coheur | Carla Parra | Isabel Trancoso | Marco Turchi | Arianna Bisazza | Joss Moorkens | Ana Guerberof | Mary Nurminen | Lena Marg | Mikel L. Forcada
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
2018
Contractions: To Align or Not to Align, That Is the Question
Anabela Barreiro | Fernando Batista
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
This paper performs a detailed analysis of the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually on a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed, and a decision was made as to whether each contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (e.g., [no seio de] [a União Europeia] | [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (e.g., [no que diz respeito a] | [with regard to] or [além disso] | [in addition]). A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
Cross-domain analysis of discourse markers in European Portuguese
Vera Cabarrão | Helena Moniz | Fernando Batista | Jaime Ferreira | Isabel Trancoso | Ana Isabel Mata
Dialogue Discourse Volume 9
This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese - university lectures and map-task dialogues - and also in a collection of tweets, aiming at contributing to their categorization, scarcely existent for European Portuguese. Our results show that the selection of discourse markers is domain and speaker dependent. We also found that the most frequent discourse markers are similar in all three corpora, despite tweets containing discourse markers not found in the other two corpora. In this multidisciplinary study, comprising both a linguistic perspective and a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and sentence-like units. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, achieved a considerable performance on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation in order to evaluate the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results. 
Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be fairly well discriminated from disfluencies and sentence-like units (SUs). In order to better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features, namely pitch slopes, are the most relevant ones for the distinction between discourse markers and disfluencies. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by totally deaccented material or H+L* L* contours, with upstep H tones.
2016
Machine Translation of Non-Contiguous Multiword Units
Anabela Barreiro | Fernando Batista
Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing
SPA: Web-based Platform for easy Access to Speech Processing Modules
Fernando Batista | Pedro Curto | Isabel Trancoso | Alberto Abad | Jaime Ferreira | Eugénio Ribeiro | Helena Moniz | David Martins de Matos | Ricardo Ribeiro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents SPA, a web-based Speech Analytics platform that integrates several speech processing modules and makes it possible to use them through the web. It was developed with the aim of facilitating the usage of the modules, without the need to know about software dependencies and specific configurations. Apart from being accessible through a web browser, the platform also provides a REST API for easy integration with other applications. The platform is flexible, scalable, provides authentication for access restrictions, and was developed taking into consideration the time and effort of providing new services. The platform is still being improved, but it already integrates a considerable number of audio and text processing modules, including: automatic transcription, speech disfluency classification, emotion detection, dialog act recognition, age and gender classification, non-nativeness detection, hyper-articulation detection, and two external modules for feature extraction and DTMF detection. This paper describes the SPA architecture, presents the already integrated modules, and provides a detailed description of the ones most recently integrated.
2014
Prosodic, syntactic, semantic guidelines for topic structures across domains and corpora
Ana Isabel Mata | Helena Moniz | Telmo Móia | Anabela Gonçalves | Fátima Silva | Fernando Batista | Inês Duarte | Fátima Oliveira | Isabel Falé
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents the annotation guidelines applied to naturally occurring speech, aiming at an integrated account of contrast and parallel structures in European Portuguese. These guidelines were defined to allow for the empirical study of interactions among intonation and syntax-discourse patterns in selected sets of different corpora (monologues and dialogues, by adults and teenagers). In this paper we focus on the multilayer annotation process of left periphery structures by using a small sample of highly spontaneous speech in which the distinct types of topic structures are displayed. The analysis of this sample provides fundamental training and testing material for further application in a wider range of domains and corpora. The annotation process comprises the following time-linked levels (manual and automatic): phone, syllable and word level transcriptions (including co-articulation effects); tonal events and break levels; part-of-speech tagging; syntactic-discourse patterns (construction type; construction position; syntactic function; discourse function); and disfluency events. Speech corpora with such a multi-level annotation are a valuable resource to look into grammar module relations in language use from an integrated viewpoint. Such a viewpoint is innovative in our language, and has not often been adopted by studies for other languages.
Revising the annotation of a Broadcast News corpus: a linguistic approach
Vera Cabarrão | Helena Moniz | Fernando Batista | Ricardo Ribeiro | Nuno Mamede | Hugo Meinedo | Isabel Trancoso | Ana Isabel Mata | David Martins de Matos
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents a linguistic revision process of a speech corpus of Portuguese broadcast news focusing on metadata annotation for rich transcription, and reports on the impact of the new data on the performance of several modules. The revision process consisted mainly of annotating and revising structural metadata events, such as disfluencies and punctuation marks. The resulting revised data is now being extensively used, and was of extreme importance for improving the performance of several modules, especially the punctuation and capitalization modules, but also the speech recognition system and all the subsequent modules. The resulting data has also recently been used in disfluency studies across domains.
Teenage and adult speech in school context: building and processing a corpus of European Portuguese
Ana Isabel Mata | Helena Moniz | Fernando Batista | Julia Hirschberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a corpus of European Portuguese spoken by teenagers and adults in school context, CPE-FACES, with an overview of the differential characteristics of high school oral presentations and the challenges this data poses to automatic speech processing. The CPE-FACES corpus has been created with two main goals: to provide a resource for the study of prosodic patterns in both spontaneous and prepared unscripted speech, and to capture inter-speaker and speaking style variations common at school, for research on oral presentations. Research on speaking styles is still largely based on adult speech. References to teenagers are sparse and cross-analyses of speech types comparing teenagers and adults are rare. We expect CPE-FACES, currently a unique resource in this domain, will contribute to filling this gap in European Portuguese. Focusing on disfluencies and phrase-final phonetic-phonological processes we show the impact of teenage speech on the automatic segmentation of oral presentations. Analyzing fluent final intonation contours in declarative utterances, we also show that communicative situation specificities, speaker status and cross-gender differences are key factors in speaking style variation at school.
Linguistic Evaluation of Support Verb Constructions by OpenLogos and Google Translate
Anabela Barreiro | Johanna Monti | Brigitte Orliac | Susanne Preuß | Kutz Arrieta | Wang Ling | Fernando Batista | Isabel Trancoso
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents a systematic human evaluation of translations of English support verb constructions produced by a rule-based machine translation (RBMT) system (OpenLogos) and a statistical machine translation (SMT) system (Google Translate) for five languages: French, German, Italian, Portuguese and Spanish. We classify support verb constructions by means of their syntactic structure and semantic behavior and present a qualitative analysis of their translation errors. The study aims to verify how machine translation (MT) systems translate fine-grained linguistic phenomena, and how well-equipped they are to produce high-quality translation. Another goal of the linguistically motivated quality analysis of SVC raw output is to reinforce the need for better system hybridization, which leverages the strengths of RBMT to the benefit of SMT, especially in improving the translation of multiword units. Taking multiword units into account, we propose an effective method to achieve MT hybridization based on the integration of semantico-syntactic knowledge into SMT.
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
Anabela Barreiro | Fernando Batista | Ricardo Ribeiro | Helena Moniz | Isabel Trancoso
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents three sets of OpenLogos resources, namely the English-German, the English-French, and the English-Italian bilingual dictionaries. In addition to the usual information on part-of-speech, gender, and number for nouns, offered by most dictionaries currently available, OpenLogos bilingual dictionaries have some distinctive features that make them unique: they contain cross-language morphological information (inflectional and derivational), semantico-syntactic knowledge, indication of the head word in multiword units, information about whether a source word corresponds to a homograph, and information about verb auxiliaries, alternate words (i.e., predicate or process nouns), causatives, reflexivity, and verb aspect, among others. The focal point of the paper is the semantico-syntactic knowledge that is important for disambiguation and translation precision. The resources are publicly available at the METANET platform for free use by the research community.
2013
When multiwords go bad in machine translation
Anabela Barreiro | Johanna Monti | Brigitte Orliac | Fernando Batista
Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technologies
2008
Language Dynamics and Capitalization using Maximum Entropy
Fernando Batista | Nuno Mamede | Isabel Trancoso
Proceedings of ACL-08: HLT, Short Papers
2000
Co-authors
- Helena Moniz 7
- Isabel Trancoso 7
- Anabela Barreiro 5
- Ricardo Ribeiro 5
- Ana Isabel Mata 4
- Vera Cabarrão 2
- Jaime Ferreira 2
- Nuno Mamede 2
- David Martins de Matos 2
- Johanna Monti 2
- Brigitte Orliac 2
- Alberto Abad 1
- Kutz Arrieta 1
- Anderson Raymundo Avila 1
- Arianna Bisazza 1
- Paula Carvalho 1
- Luísa Coheur 1
- Bernardo Cunha 1
- Pedro Curto 1
- Inês Duarte 1
- Isabel Falé 1
- Lucca Baptista Silva Ferraz 1
- Mikel L. Forcada 1
- Sara Fumega 1
- Anabela Gonçalves 1
- Ana Guerberof 1
- Julia Hirschberg 1
- Jhúlia de Souza Leal 1
- Wang Ling 1
- Lena Marg 1
- André F. T. Martins 1
- Bruno Martins 1
- Hugo Meinedo 1
- Joss Moorkens 1
- Telmo Móia 1
- Mary Nurminen 1
- Fátima Oliveira 1
- Thiago Alexandre Salgueiro Pardo 1
- Carla Parra Escartín 1
- Susanne Preuß 1
- Tânia Pêgo 1
- Eugénio Ribeiro 1
- Raquel Santos 1
- Fátima Silva 1
- Renato Moraes Silva 1
- Marco Turchi 1
- Luzia Wittmann 1