Thiago Castro Ferreira - ACL Anthology

Thiago Castro Ferreira

Also published as: Thiago Ferreira, Thiago Castro-Ferreira, Thiago Castro Ferreira

2025

Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines
Yunsu Kim | Ahmedelmogtaba Abdelaziz | Thiago Castro Ferreira | Mohamed Al-Badrashiny | Hassan Sawaf
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

As the demand for artificial intelligence (AI) grows to address complex real-world tasks, single models are often insufficient, requiring the integration of multiple models into pipelines. This paper introduces Bel Esprit, a conversational agent designed to construct AI model pipelines based on user requirements. Bel Esprit uses a multi-agent framework where subagents collaborate to clarify requirements, build, validate, and populate pipelines with appropriate models. We demonstrate its effectiveness in generating pipelines from ambiguous user queries, using both human-curated and synthetic data. A detailed error analysis highlights ongoing challenges in pipeline building.Bel Esprit is available for a free trial at https://belesprit.aixplain.com.

A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops
Kamer Ali Yuksel | Thiago Castro Ferreira | Mohamed Al-Badrashiny | Hassan Sawaf
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)

Agentic AI systems use specialized agents to handle tasks within complex workflows, enabling automation and efficiency. However, optimizing these systems often requires labor-intensive, manual adjustments to refine roles, tasks, and interactions. This paper introduces a framework for autonomously optimizing Agentic AI solutions across industries, such as NLG-driven enterprise applications. The system employs agents for Refinement, Execution, Evaluation, Modification, and Documentation, leveraging iterative feedback loops powered by an LLM (Llama 3.2-3B). The framework achieves optimal performance without human input by autonomously generating and testing hypotheses to improve system configurations. This approach enhances scalability and adaptability, offering a robust solution for real-world applications in dynamic environments. Case studies across diverse domains illustrate the transformative impact of this framework, showcasing significant improvements in output quality, relevance, and actionability. All data for these case studies, including original and evolved agent codes, along with their outputs, are here: https://anonymous.4open.science/r/evolver-1D11

Long-context Reference-based MT Quality Estimation
Sami Haq | Chinonso Osuji | Sheila Castilho | Brian Davis | Thiago Castro Ferreira
Proceedings of the Tenth Conference on Machine Translation

In this paper, we present our submission to the Tenth Conference on Machine Translation (WMT25) Shared Task on Automated Translation Quality Evaluation. Our systems are built upon the COMET framework and trained to predict segment-level ESA scores using augmented long-context data. To construct long-context training examples, we concatenate multiple in-domain sentences and compute a weighted average of their scores. We further integrate human judgment datasets MQM, SQM, and DA) through score normalisation and train multilingual models on the source, hypothesis, and reference translations. Experimental results demonstrate that incorporating long-context information yields higher correlations with human judgments compared to models trained exclusively on short segments.

DCU-ADAPT-modPB at the GEM’24 Data-to-Text Task: Analysis of Human Evaluation Results
Rudali Huidrom | Chinonso Cynthia Osuji | Kolawole John Adebayo | Thiago Castro Ferreira | Brian Davis
Proceedings of the 18th International Natural Language Generation Conference: Generation Challenges

Are Multi-Agents the new Pipeline Architecture for Data-to-Text Systems?
Chinonso Cynthia Osuji | Brian Timoney | Mark Andrade | Thiago Castro Ferreira | Brian Davis
Proceedings of the 18th International Natural Language Generation Conference

Large Language Models (LLMs) have achieved remarkable results in natural language generation, yet challenges remain in data-to-text (D2T) tasks, particularly in controlling output, ensuring transparency, and maintaining factual consistency with the input. We introduce the first LLM-based multi-agent framework for D2T generation, coordinating specialized agents to produce high-quality, interpretable outputs. Our system combines the reasoning and acting abilities of ReAct agents, the self-correction of Reflexion agents, and the quality assurance of Guardrail agents, all directed by an Orchestrator agent that assigns tasks to three specialists—content ordering, text structuring, and surface realization—and iteratively refines outputs based on Guardrail feedback. This closed-loop design enables precise control and dynamic optimization, yielding text that is coherent, accurate, and grounded in the input data. On a relatively simple dataset like WebNLG, our framework performs competitively with end-to-end systems, highlighting its promise for more complex D2T scenarios.

Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets
Chinonso Cynthia Osuji | Simon Mille | Ornait O’Connell | Thiago Castro Ferreira | Anya Belz | Brian Davis
Proceedings of the 18th International Natural Language Generation Conference

The ability of LLMs to write coherent, faithful long texts from structured data inputs remains relatively uncharted, in part because nearly all public data-to-text datasets contain only short input-output pairs. To address these gaps, we benchmark six LLMs, a rule‐based system and human-written texts on a new long-input dataset in English and Irish via LLM-based evaluation. We find substantial differences between models and languages.

2024

DCU-ADAPT-modPB at the GEM’24 Data-to-Text Generation Task: Model Hybridisation for Pipeline Data-to-Text Natural Language Generation
Chinonso Cynthia Osuji | Rudali Huidrom | Kolawole John Adebayo | Thiago Castro Ferreira | Brian Davis
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges

In this paper, we present our approach to the GEM Shared Task at the INLG’24 Generation Challenges, which focuses on generating data-to-text in multiple languages, including low-resource languages, from WebNLG triples. We employ a combination of end-to-end and pipeline neural architectures for English text generation. To extend our methodology to Hindi, Korean, Arabic, and Swahili, we leverage a neural machine translation model. Our results demonstrate that our approach achieves competitive performance in the given task.

Pipeline Neural Data-to-text with Large Language Models
Chinonso Cynthia Osuji | Brian Timoney | Thiago Castro Ferreira | Brian Davis
Proceedings of the 17th International Natural Language Generation Conference

Previous studies have highlighted the advantages of pipeline neural architectures over end-to-end models, particularly in reducing text hallucination. In this study, we extend prior research by integrating pretrained language models (PLMs) into a pipeline framework, using both fine-tuning and prompting methods. Our findings show that fine-tuned PLMs consistently generate high quality text, especially within end-to-end architectures and at intermediate stages of the pipeline across various domains. These models also outperform prompt-based ones on automatic evaluation metrics but lag in human evaluations. Compared to the standard five-stage pipeline architecture, a streamlined three-stage pipeline, which only include ordering, structuring, and surface realization, achieves superior performance in fluency and semantic adequacy according to the human evaluation.

A Persona-Based Corpus in the Diabetes Self-Care Domain - Applying a Human-Centered Approach to a Low-Resource Context
Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Alves
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

While Natural Language Processing (NLP) models have gained substantial attention, only in recent years has research opened new paths for tackling Human-Computer Design (HCD) from the perspective of natural language. We focus on developing a human-centered corpus, more specifically, a persona-based corpus in a particular healthcare domain (diabetes mellitus self-care). In order to follow an HCD approach, we created personas to model interpersonal interaction (expert and non-expert users) in that specific domain. We show that an HCD approach benefits language generation from different perspectives, from machines to humans - contributing with new directions for low-resource contexts (languages other than English and sensitive domains) where the need to promote effective communication is essential.

aiXplain SDK: A High-Level and Standardized Toolkit for AI Assets
Shreyas Sharma | Lucas Pavanelli | Thiago Castro Ferreira | Mohamed Al-Badrashiny | Hassan Sawaf
Proceedings of the 17th International Natural Language Generation Conference

The aiXplain SDK is an open-source Python toolkit which aims to simplify the wide and complex ecosystem of AI resources. The toolkit enables access to a wide selection of AI assets, including datasets, models, and metrics, from both academic and commercial sources, which can be selected, executed and evaluated in one place through different services in a standardized format with consistent documentation provided. The study showcases the potential of the proposed toolkit with different code examples and by using it on a user journey where state-of-the-art Large Language Models are fine-tuned on instruction prompt datasets, outperforming their base versions.

Imaginary Numbers! Evaluating Numerical Referring Expressions by Neural End-to-End Surface Realization Systems
Rossana Cunha | Osuji Chinonso | João Campos | Brian Timoney | Brian Davis | Fabio Cozman | Adriana Pagano | Thiago Castro Ferreira
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP

Neural end-to-end surface realizers output more fluent texts than classical architectures. However, they tend to suffer from adequacy problems, in particular hallucinations in numerical referring expression generation. This poses a problem to language generation in sensitive domains, as is the case of robot journalism covering COVID-19 and Amazon deforestation. We propose an approach whereby numerical referring expressions are converted from digits to plain word form descriptions prior to being fed to state-of-the-art Large Language Models. We conduct automatic and human evaluations to report the best strategy to numerical superficial realization. Code and data are publicly available.

2023

Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model
Chris van der Lee | Thiago Castro Ferreira | Chris Emmery | Travis J. Wiltshire | Emiel Krahmer
Computational Linguistics, Volume 49, Issue 3 - September 2023

This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system only supplemented with a language model, to two data-to-text systems that are additionally enriched by a data augmentation or a pseudo-labeling semi-supervised learning approach. Results show that semi-supervised learning results in higher scores on diversity metrics. In terms of output quality, extending the training set of a data-to-text system with a language model using the pseudo-labeling approach did increase text quality scores, but the data augmentation approach yielded similar scores to the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present.

2022

Generating Questions from Wikidata Triples
Kelvin Han | Thiago Castro Ferreira | Claire Gardent
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Question generation from knowledge bases (or knowledge base question generation, KBQG) is the task of generating questions from structured database information, typically in the form of triples representing facts. To handle rare entities and generalize to unseen properties, previous work on KBQG resorted to extensive, often ad-hoc pre- and post-processing of the input triple. We revisit KBQG – using pre training, a new (triple, question) dataset and taking question type into account – and show that our approach outperforms previous work both in a standard and in a zero-shot setting. We also show that the extended KBQG dataset (also helpful for knowledge base question answering) we provide allows not only for better coverage in terms of knowledge base (KB) properties but also for increased output variability in that it permits the generation of multiple questions from the same KB triple.

Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System
Rudali Huidrom | Ondřej Dušek | Zdeněk Kasner | Thiago Castro Ferreira | Anya Belz
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020) in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data and labelling task. Our resources are available and open-sourced.

Proceedings of the 15th International Conference on Natural Language Generation: System Demonstrations
Samira Shaikh | Thiago Ferreira | Amanda Stent
Proceedings of the 15th International Conference on Natural Language Generation: System Demonstrations

Proceedings of the 15th International Conference on Natural Language Generation
Samira Shaikh | Thiago Ferreira | Amanda Stent
Proceedings of the 15th International Conference on Natural Language Generation

Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
Samira Shaikh | Thiago Ferreira | Amanda Stent
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

Anotação de textos não canônicos: um estudo exploratorio de Grande sertão: veredas pelas dependências universais
Andre V. L. Coneglian | Ana Luisa A. R. Guimarães | Thiago Castro Ferreira | Adriana S. Pagano
Proceedings of the Universal Dependencies Brazilian Festival

aiXplain at Arabic Hate Speech 2022: An Ensemble Based Approach to Detecting Offensive Tweets
Salaheddin Alzubi | Thiago Castro Ferreira | Lucas Pavanelli | Mohamed Al-Badrashiny
Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

Abusive speech on online platforms has a detrimental effect on users’ mental health. This warrants the need for innovative solutions that automatically moderate content, especially on online platforms such as Twitter where a user’s anonymity is loosely controlled. This paper outlines aiXplain Inc.’s ensemble based approach to detecting offensive speech in the Arabic language based on OSACT5’s shared sub-task A. Additionally, this paper highlights multiple challenges that may hinder progress on detecting abusive speech and provides potential avenues and techniques that may lead to significant progress.

MTLens: Machine Translation Output Debugging
Shreyas Sharma | Kareem Darwish | Lucas Pavanelli | Thiago Castro Ferreira | Mohamed Al-Badrashiny | Kamer Ali Yuksel | Hassan Sawaf
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The performance of Machine Translation (MT) systems varies significantly with inputs of diverging features such as topics, genres, and surface properties. Though there are many MT evaluation metrics that generally correlate with human judgments, they are not directly useful in identifying specific shortcomings of MT systems. In this demo, we present a benchmarking interface that enables improved evaluation of specific MT systems in isolation or multiple MT systems collectively by quantitatively evaluating their performance on many tasks across multiple domains and evaluation metrics. Further, it facilitates effective debugging and error analysis of MT output via the use of dynamic filters that help users hone in on problem sentences with specific properties, such as genre, topic, sentence length, etc. The interface can be extended to include additional filters such as lexical, morphological, and syntactic features. Aside from helping debug MT output, it can also help in identifying problems in reference translations and evaluation metrics.

2021

Enriching the E2E dataset
Thiago Castro Ferreira | Helena Vaz | Brian Davis | Adriana Pagano
Proceedings of the 14th International Conference on Natural Language Generation

This study introduces an enriched version of the E2E dataset, one of the most popular language resources for data-to-text NLG. We extract intermediate representations for popular pipeline tasks such as discourse ordering, text structuring, lexicalization and referring expression generation, enabling researchers to rapidly develop and evaluate their data-to-text pipeline systems. The intermediate representations are extracted by aligning non-linguistic and text representations through a process called delexicalization, which consists in replacing input referring expressions to entities/attributes with placeholders. The enriched dataset is publicly available.

Sentiment Analysis in Portuguese Texts from Online Health Community Forums: Data, Model and Evaluation
Yohan Gumiel | Isabela Lee | Tayane Soares | Thiago Ferreira | Adriana Pagano
Proceedings of the 13th Brazilian Symposium in Information and Human Language Technology

This study describes the development of a Portuguese Community-Question Answering benchmark in the domain of Diabetes Mellitus using a Recognizing Question Entailment (RQE) approach. Given a premise question, RQE aims to retrieve semantically similar, already answered, archived questions. We build a new Portuguese benchmark corpus with 785 pairs between premise questions and archived answered questions marked with relevance judgments by medical experts. Based on the benchmark corpus, we leveraged and evaluated several RQE approaches ranging from traditional information retrieval methods to novel large pre-trained language models and ensemble techniques using learn-to-rank approaches. Our experimental results show that a supervised transformer-based method trained with multiple languages and for multiple tasks (MUSE) outperforms the alternatives. Our results also show that ensembles of methods (stacking) as well as a traditional (light) information retrieval method (BM25) can produce competitive results. Finally, among the tested strategies, those that exploit only the question (not the answer), provide the best effectiveness-efficiency trade-off. Code is publicly available.

Another PASS: A Reproduction Study of the Human Evaluation of a Football Report Generation System
Simon Mille | Thiago Castro Ferreira | Anya Belz | Brian Davis
Proceedings of the 14th International Conference on Natural Language Generation

This paper reports results from a reproduction study in which we repeated the human evaluation of the PASS Dutch-language football report generation system (van der Lee et al., 2017). The work was carried out as part of the ReproGen Shared Task on Reproducibility of Human Evaluations in NLG, in Track A (Paper 1). We aimed to repeat the original study exactly, with the main difference that a different set of evaluators was used. We describe the study design, present the results from the original and the reproduction study, and then compare and analyse the differences between the two sets of results. For the two ‘headline’ results of average Fluency and Clarity, we find that in both studies, the system was rated more highly for Clarity than for Fluency, and Clarity had higher standard deviation. Clarity and Fluency ratings were higher, and their standard deviations lower, in the reproduction study than in the original study by substantial margins. Clarity had a higher degree of reproducibility than Fluency, as measured by the coefficient of variation. Data and code are publicly available.

Comparing Supervised Machine Learning Techniques for Genre Analysis in Software Engineering Research Articles
Felipe Araújo de Britto | Thiago Castro Ferreira | Leonardo Pereira Nunes | Fernando Silva Parreiras
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Written communication is of utmost importance to the progress of scientific research. The speed of such development, however, may be affected by the scarcity of reviewers to referee the quality of research articles. In this context, automatic approaches that are able to query linguistic segments in written contributions by detecting the presence or absence of common rhetorical patterns have become a necessity. This paper aims to compare supervised machine learning techniques tested to accomplish genre analysis in Introduction sections of software engineering articles. A semi-supervised approach was carried out to augment the number of annotated sentences in SciSents (Avaliable on: ANONYMOUS). Two supervised approaches using SVM and logistic regression were undertaken to assess the F-score for genre analysis in the corpus. A technique based on logistic regression and BERT has been found to perform genre analysis highly satisfactorily with an average of 88.25 on F-score when retrieving patterns at an overall level.

2020

DaMata: A Robot-Journalist Covering the Brazilian Amazon Deforestation
André Luiz Rosa Teixeira | João Campos | Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Cozman
Proceedings of the 13th International Conference on Natural Language Generation

This demo paper introduces DaMata, a robot-journalist covering deforestation in the Brazilian Amazon. The robot-journalist is based on a pipeline architecture of Natural Language Generation, which yields multilingual daily and monthly reports based on the public data provided by DETER, a real-time deforestation satellite monitor developed and maintained by the Brazilian National Institute for Space Research (INPE). DaMata automatically generates reports in Brazilian Portuguese and English and publishes them on the Twitter platform. Corpus and code are publicly available.

Building The First English-Brazilian Portuguese Corpus for Automatic Post-Editing
Felipe Almeida Costa | Thiago Castro Ferreira | Adriana Pagano | Wagner Meira
Proceedings of the 28th International Conference on Computational Linguistics

This paper introduces the first corpus for Automatic Post-Editing of English and a low-resource language, Brazilian Portuguese. The source English texts were extracted from the WebNLG corpus and automatically translated into Portuguese using a state-of-the-art industrial neural machine translator. Post-edits were then obtained in an experiment with native speakers of Brazilian Portuguese. To assess the quality of the corpus, we performed error analysis and computed complexity indicators measuring how difficult the APE task would be. We report preliminary results of Phrase-Based and Neural Machine Translation Models on this new corpus. Data and code publicly available in our repository.

Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)
Thiago Castro Ferreira | Claire Gardent | Nikolai Ilinykh | Chris van der Lee | Simon Mille | Diego Moussallem | Anastasia Shimorina
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

Proceedings of the Third Workshop on Multilingual Surface Realisation
Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Simon Mille | Leo Wanner
Proceedings of the Third Workshop on Multilingual Surface Realisation

The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results
Simon Mille | Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Leo Wanner
Proceedings of the Third Workshop on Multilingual Surface Realisation

This paper presents results from the Third Shared Task on Multilingual Surface Realisation (SR’20) which was organised as part of the COLING’20 Workshop on Multilingual Surface Realisation. As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed. Moreover, each track had two subtracks: (a) restricted-resource, where only the data provided or approved as part of a track could be used for training models, and (b) open-resource, where any data could be used. The Shallow Track was offered in 11 languages, whereas the Deep Track in 3 ones. Systems were evaluated using both automatic metrics and direct assessment by human evaluators in terms of Readability and Meaning Similarity to reference outputs. We present the evaluation results, along with descriptions of the SR’19 tracks, data and evaluation methods, as well as brief summaries of the participating systems. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

A General Benchmarking Framework for Text Generation
Diego Moussallem | Paramjot Kaur | Thiago Ferreira | Chris van der Lee | Anastasia Shimorina | Felix Conrads | Michael Röder | René Speck | Claire Gardent | Simon Mille | Nikolai Ilinykh | Axel-Cyrille Ngonga Ngomo
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

The RDF-to-text task has recently gained substantial attention due to the continuous growth of RDF knowledge graphs in number and size. Recent studies have focused on systematically comparing RDF-to-text approaches on benchmarking datasets such as WebNLG. Although some evaluation tools have already been proposed for text generation, none of the existing solutions abides by the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles and involves RDF data for the knowledge extraction task. In this paper, we present BENG, a FAIR benchmarking platform for Natural Language Generation (NLG) and Knowledge Extraction systems with focus on RDF data. BENG builds upon the successful benchmarking platform GERBIL, is opensource and is publicly available along with the data it contains.

Referring to what you know and do not know: Making Referring Expression Generation Models Generalize To Unseen Entities
Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Alves
Proceedings of the 28th International Conference on Computational Linguistics

Data-to-text Natural Language Generation (NLG) is the computational process of generating natural language in the form of text or voice from non-linguistic data. A core micro-planning task within NLG is referring expression generation (REG), which aims to automatically generate noun phrases to refer to entities mentioned as discourse unfolds. A limitation of novel REG models is not being able to generate referring expressions to entities not encountered during the training process. To solve this problem, we propose two extensions to NeuralREG, a state-of-the-art encoder-decoder REG model. The first is a copy mechanism, whereas the second consists of representing the gender and type of the referent as inputs to the model. Drawing on the results of automatic and human evaluation as well as an ablation study using the WebNLG corpus, we contend that our proposal contributes to the generation of more meaningful referring expressions to unseen entities than the original system and related work. Code and all produced data are publicly available.

The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
Thiago Castro Ferreira | Claire Gardent | Nikolai Ilinykh | Chris van der Lee | Simon Mille | Diego Moussallem | Anastasia Shimorina
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

WebNLG+ offers two challenges: (i) mapping sets of RDF triples to English or Russian text (generation) and (ii) converting English or Russian text to sets of RDF triples (semantic parsing). Compared to the eponymous WebNLG challenge, WebNLG+ provides an extended dataset that enable the training, evaluation, and comparison of microplanners and semantic parsers. In this paper, we present the results of the generation and semantic parsing task for both English and Russian and provide a brief description of the participating systems.

Evaluation rules! On the use of grammars and rule-based systems for NLG evaluation
Emiel van Miltenburg | Chris van der Lee | Thiago Castro-Ferreira | Emiel Krahmer
Proceedings of the 1st Workshop on Evaluating NLG Evaluation

NLG researchers often use uncontrolled corpora to train and evaluate their systems, using textual similarity metrics, such as BLEU. This position paper argues in favour of two alternative evaluation strategies, using grammars or rule-based systems. These strategies are particularly useful to identify the strengths and weaknesses of different systems. We contrast our proposals with the (extended) WebNLG dataset, which is revealed to have a skewed distribution of predicates. We predict that this distribution affects the quality of the predictions for systems trained on this data. However, this hypothesis can only be thoroughly tested (without any confounds) once we are able to systematically manipulate the skewness of the data, using a rule-based approach.

2019

Surface Realization Shared Task 2019 (MSR19): The Team 6 Approach
Thiago Castro Ferreira | Emiel Krahmer
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

This study describes the approach developed by the Tilburg University team to the shallow track of the Multilingual Surface Realization Shared Task 2019 (SR’19) (Mille et al., 2019). Based on Ferreira et al. (2017) and on our 2018 submission Ferreira et al. (2018), the approach generates texts by first preprocessing an input dependency tree into an ordered linearized string, which is then realized using a rule-based and a statistical machine translation (SMT) model. This year our submission is able to realize texts in the 11 languages proposed for the task, different from our last year submission, which covered only 6 Indo-European languages. The model is publicly available.

Neural data-to-text generation: A comparison between pipeline and end-to-end architectures
Thiago Castro Ferreira | Chris van der Lee | Emiel van Miltenburg | Emiel Krahmer
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Traditionally, most data-to-text applications have been designed using a modular pipeline architecture, in which non-linguistic input data is converted into natural language through several intermediate transformations. By contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with much less explicit intermediate representations in between. This study introduces a systematic comparison between neural pipeline and end-to-end data-to-text approaches for the generation of text from RDF triples. Both architectures were implemented making use of the encoder-decoder Gated-Recurrent Units (GRU) and Transformer, two state-of-the art deep learning methods. Automatic and human evaluations together with a qualitative analysis suggest that having explicit intermediate steps in the generation process results in better texts than the ones generated by end-to-end approaches. Moreover, the pipeline models generalize better to unseen inputs. Data and code are publicly available.

Question Similarity in Community Question Answering: A Systematic Exploration of Preprocessing Methods and Models
Florian Kunneman | Thiago Castro Ferreira | Emiel Krahmer | Antal van den Bosch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Community Question Answering forums are popular among Internet users, and a basic problem they encounter is trying to find out if their question has already been posed before. To address this issue, NLP researchers have developed methods to automatically detect question-similarity, which was one of the shared tasks in SemEval. The best performing systems for this task made use of Syntactic Tree Kernels or the SoftCosine metric. However, it remains unclear why these methods seem to work, whether their performance can be improved by better preprocessing methods and what kinds of errors they (and other methods) make. In this paper, we therefore systematically combine and compare these two approaches with the more traditional BM25 and translation-based models. Moreover, we analyze the impact of preprocessing steps (lowercasing, suppression of punctuation and stop words removal) and word meaning similarity based on different distributions (word translation probability, Word2Vec, fastText and ELMo) on the performance of the task. We conduct an error analysis to gain insight into the differences in performance between the system set-ups. The implementation is made publicly available from https://github.com/fkunneman/DiscoSumo/tree/master/ranlp.

2018

Surface Realization Shared Task 2018 (SR18): The Tilburg University Approach
Thiago Castro Ferreira | Sander Wubben | Emiel Krahmer
Proceedings of the First Workshop on Multilingual Surface Realisation

This study describes the approach developed by the Tilburg University team to the shallow task of the Multilingual Surface Realization Shared Task 2018 (SR18). Based on (Castro Ferreira et al., 2017), the approach works by first preprocessing an input dependency tree into an ordered linearized string, which is then realized using a statistical machine translation model. Our approach shows promising results, with BLEU scores above 50 for 5 different languages (English, French, Italian, Portuguese and Spanish) and above 35 for the Dutch language.

Enriching the WebNLG corpus
Thiago Castro Ferreira | Diego Moussallem | Emiel Krahmer | Sander Wubben
Proceedings of the 11th International Conference on Natural Language Generation

This paper describes the enrichment of WebNLG corpus (Gardent et al., 2017a,b), with the aim to further extend its usefulness as a resource for evaluating common NLG tasks, including Discourse Ordering, Lexicalization and Referring Expression Generation. We also produce a silver-standard German translation of the corpus to enable the exploitation of NLG approaches to other languages than English. The enriched corpus is publicly available.

RDF2PT: Generating Brazilian Portuguese Texts from RDF Data
Diego Moussallem | Thiago Ferreira | Marcos Zampieri | Maria Claudia Cavalcanti | Geraldo Xexéo | Mariana Neves | Axel-Cyrille Ngonga Ngomo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

NeuralREG: An end-to-end approach to referring expression generation
Thiago Castro Ferreira | Diego Moussallem | Ákos Kádár | Sander Wubben | Emiel Krahmer
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Traditionally, Referring Expression Generation (REG) models first decide on the form and then on the content of references to discourse entities in text, typically relying on features such as salience and grammatical function. In this paper, we present a new approach (NeuralREG), relying on deep neural networks, which makes decisions about form and content in one go without explicit feature extraction. Using a delexicalized version of the WebNLG corpus, we show that the neural model substantially improves over two strong baselines.

2017

Linguistic realisation as machine translation: Comparing different MT models for AMR-to-text generation
Thiago Castro Ferreira | Iacer Calixto | Sander Wubben | Emiel Krahmer
Proceedings of the 10th International Conference on Natural Language Generation

In this paper, we study AMR-to-text generation, framing it as a translation task and comparing two different MT approaches (Phrase-based and Neural MT). We systematically study the effects of 3 AMR preprocessing steps (Delexicalisation, Compression, and Linearisation) applied before the MT phase. Our results show that preprocessing indeed helps, although the benefits differ for the two MT models.

Improving the generation of personalised descriptions
Thiago Castro Ferreira | Ivandré Paraboni
Proceedings of the 10th International Conference on Natural Language Generation

Referring expression generation (REG) models that use speaker-dependent information require a considerable amount of training data produced by every individual speaker, or may otherwise perform poorly. In this work we propose a simple personalised method for this task, in which speakers are grouped into profiles according to their referential behaviour. Intrinsic evaluation shows that the use of speaker’s profiles generally outperforms the personalised method found in previous work.

Generating flexible proper name references in text: Data, models and evaluation
Thiago Castro Ferreira | Emiel Krahmer | Sander Wubben
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This study introduces a statistical model able to generate variations of a proper name by taking into account the person to be mentioned, the discourse context and variation. The model relies on the REGnames corpus, a dataset with 53,102 proper name references to 1,000 people in different discourse contexts. We evaluate the versions of our model from the perspective of how human writers produce proper names, and also how human readers process them. The corpus and the model are publicly available.

2016

Task demands and individual variation in referring expressions
Adriana Baltaretu | Thiago Castro Ferreira
Proceedings of the 9th International Natural Language Generation conference

Towards proper name generation: a corpus analysis
Thiago Castro Ferreira | Sander Wubben | Emiel Krahmer
Proceedings of the 9th International Natural Language Generation conference

Individual Variation in the Choice of Referential Form
Thiago Castro Ferreira | Emiel Krahmer | Sander Wubben
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Towards more variation in text generation: Developing and evaluating variation models for choice of referential form
Thiago Castro Ferreira | Emiel Krahmer | Sander Wubben
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Zoom: a corpus of natural language descriptions of map locations
Romina Altamirano | Thiago Ferreira | Ivandré Paraboni | Luciana Benotti
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Co-authors

Diego Moussallem 6

Chris van der Lee 6

Mohamed Al-Badrashiny 5

Rossana Cunha 5

Chinonso Cynthia Osuji 5

Claire Gardent 4

Rudali Huidrom 3

Nikolai Ilinykh 3

Lucas Pavanelli 3

Samira Shaikh 3

Anastasia Shimorina 3

Brian Timoney 3

Kolawole John Adebayo 2

Yvette Graham 2

Ana Luisa A. R. Guimarães 2

Axel-Cyrille Ngonga Ngomo 2

Ivandré Paraboni 2

Shreyas Sharma 2

Emiel Van Miltenburg 2

Kamer Ali Yuksel 2

Ahmedelmogtaba Abdelaziz 1

Felipe Almeida Costa 1

Romina Altamirano 1

Salaheddin Alzubi 1

Felipe Araújo de Britto 1

Adriana Baltaretu 1

Rodrigo Bastos Fóscolo 1

Luciana Benotti 1

Iacer Calixto 1

Sheila Castilho 1

Maria Claudia Cavalcanti 1

Osuji Chinonso 1

Andre V. L. Coneglian 1

Felix Conrads 1

Kareem Darwish 1

Ondřej Dušek 1

Rivaney F. Oliveira 1

Celso França 1

Gabriel Frota 1

Marcos André Gonçalves 1

Daniel Hasan Dalip 1

Zdeněk Kasner 1

Paramjot Kaur 1

Florian Kunneman 1

Ákos Kádár 1

Abisague Langbehn 1

Mariana Neves 1

Leonardo Pereira Nunes 1

Chinonso Osuji 1

Ornait O’Connell 1

Adriana S. Pagano 1

Adalberto Penna 1

Vitoria Portella 1

Isabela Rigotto 1

André Luiz Rosa Teixeira 1

Michael Röder 1

Fernando Silva Parreiras 1

Tayane Soares 1

Tayane A. Soares 1

João Victor de Pinho Costa 1

Travis J. Wiltshire 1

Geraldo Bonorino Xexéo 1

Marcos Zampieri 1

Antal van den Bosch 1

Venues