Horacio Saggion

2025

pdf bib abs
Democracy Made Easy: Simplifying Complex Topics to Enable Democratic Participation
Nouran Khallaf | Stefan Bott | Carlo Eugeni | John O’Flaherty | Serge Sharoff | Horacio Saggion
Proceedings of the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts (AI & EL/PL)

Several people are excluded from democratic deliberation because the language which is used in this context may be too difficult to understand for them. Our iDEM project aims at lowering existing linguistic barriers in deliberative processes by developing technology to facilitate the translation of complicated text into easy to read formats which are more suitable for may people. In this paper we describe classification experiments for detecting different types of difficulties which should be amended in order to make texts easier to understand. We focus on a lexical simplification system which can achieve state-of-the-art results with the use of a free and open-weight Large Language Model for the Romance Languages in the iDEM project. Moreover, a sentence segmentation system is introduced that can create text segmentation for long sentences based on training data. We describe the iDEM mobile app, which will make our technology available as a service for end-users of our target populations.

pdf bib abs
Spontaneous Catalan Sign Language Recognition: Data Acquisition and Classification
Naiara Garmendia | Horacio Saggion | Euan McGill
Proceedings of the Third International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)

This work presents the first investigation into Spontaneous Isolated Sign Language Recognition for Catalan Sign Language (LSC). Our work is grounded on the derivation of a dataset of signs and their glosses from a corpus of spontaneous dialogues and monologues. The recognition model is based on a Multi-Scale Graph Convolutional network fitted to our data. Results are promising since several signs are recognized with a high level of accuracy, and an average accuracy of 71% on the top 5 predicted classes from a total of 105 available. An interactive interface with experimental results is also presented. The data and software are made available to the research community.

pdf bib
Segmenting Italian Sentences for Easy Reading
Marta Cozzini | Horacio Saggion
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.

pdf bib abs
Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models
Piotr Przybyła | Euan McGill | Horacio Saggion
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms in social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through beam search procedure, until the victim classifier changes its decision. We perform (1) quantitative evaluation using various prompts, models and query limits, (2) targeted manual assessment of the generated text and (3) qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in case of long input text (news articles), where exhaustive search is not feasible.

pdf bib abs
Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs
Akio Hayakawa | Stefan Bott | Horacio Saggion
Proceedings of the 18th International Natural Language Generation Conference

Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model’s output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.

pdf bib abs
The Impact of Named Entity Recognition on Transformer-Based Multi-Label Dietary Recipe Classification
Kemalcan Bora | Horacio Saggion
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

This research explores the impact of Named Entity Recognition (NER) on transformer-based models for multi-label recipe classification by dietary preference. To support this task, we introduce the NutriCuisine Index: a collection of 23,932 recipes annotated across six dietary categories (Healthy, Vegan, Gluten-Free, Low-Carb, High-Protein, Low-Sugar). Using BERT-base-uncased, RoBERTa-base, and DistilBERT-base-uncased, we evaluate how NER-based preprocessing affects the performance (F1-score, Precision, Recall, and Hamming Loss) of Transformer-based multi-label classification models. RoBERTa-base shows significant improvements with NER in F1-score (∆F1 = +0.0147, p < 0.001), Precision, and Recall, while BERT and DistilBERT show no such gains. NER also leads to a slight but statistically significant increase in Hamming Loss across all models. These findings highlight the model dependent impact of NER on classification performance.

pdf bib abs
Investigating Large Language Models’ (LLMs) Capabilities for Sexism Detection on a Low-Resource Language
Lutfiye Seda Mut Altin | Horacio Saggion
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Automatic detection of sexist language on social media is gaining attention due to its harmful societal impact and technical challenges it presents. The limited availability of data resources in some languages restricts the development of effective tools to fight the spread of such content. In this work, we investigated various methods to improve the efficiency of automatic detection of sexism and its subtypes in a low-resource language, Turkish. We first experimented with various LLM prompting strategies for classification and then investigated the impact of different data augmentation strategies, including both synthetic data generation with LLMs (GPT, DeepSeek) and translationbased augmentation using English and Spanish data. Finally, we examined whether these augmentation methods would improve model performance of a trained neural network (BERT). Our benchmarking results show that fine-tuned LLM (GPT-4o-mini) achieved the best performance compared to zero-shot, few-shot, Chain-of-Thought prompt classification and training a neural network (BERT) including the data augmented in different ways (synthetic generation, translation). Our results also indicated that, for the classification of more granular classes, in other words, more specific tasks, training a neural network generally performed better than prompt-based classification using an LLM.

pdf bib
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
Matthew Shardlow | Fernando Alva-Manchego | Kai North | Regina Stodden | Horacio Saggion | Nouran Khallaf | Akio Hayakawa
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

pdf bib abs
UoL-UPF at TSAR 2025 Shared Task A Generate-and-Select Approach for Readability-Controlled Text Simplification
Akio Hayakawa | Nouran Khallaf | Horacio Saggion | Serge Sharoff
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

The TSAR 2025 Shared Task on Readability-Controlled Text Simplification focuses on simplifying English paragraphs written at an advanced level (B2 or higher) and rewriting them to target CEFR levels (A2 or B1). The challenge is to reduce linguistic complexity without sacrificing coherence or meaning. We developed three complementary approaches based on large language models (LLMs). The first approach (Run 1) generates a diverse set of paragraph-level simplifications. It then applies filters to enforce CEFR alignment, preserve meaning, and encourage diversity, and finally selects the candidates with the lowest perceived risk. The second (Run 2) performs simplification at the sentence level, combining structured prompting, coreference resolution, and explainable AI techniques to highlight influential phrases, with candidate selection guided by automatic and LLM-based judges. The third hybrid approach (Run 3) integrates both strategies by pooling paragraph- and sentence-level simplifications, and subsequently applying the identical filtering and selection architecture used in Run 1. In the official TSAR evaluation, the hybrid system ranked 2nd overall, while its component systems also achieved competitive results.

2024

We report the findings of the 2024 Multilingual Lexical Simplification Pipeline shared task. We released a new dataset comprising 5,927 instances of lexical complexity prediction and lexical simplification on common contexts across 10 languages, split into trial (300) and test (5,627). 10 teams participated across 2 tracks and 10 languages with 233 runs evaluated across all systems. Five teams participated in all languages for the lexical complexity prediction task and 4 teams participated in all languages for the lexical simplification task. Teams employed a range of strategies, making use of open and closed source large language models for lexical simplification, as well as feature-based approaches for lexical complexity prediction. The highest scoring team on the combined multilingual data was able to obtain a Pearson’s correlation of 0.6241 and an ACC@1@Top1 of 0.3772, both demonstrating that there is still room for improvement on two difficult sub-tasks of the lexical simplification pipeline.

pdf bib abs
Bootstrapping Pre-trained Word Embedding Models for Sign Language Gloss Translation
Euan McGill | Luis Chiruzzo | Horacio Saggion
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

This paper explores a novel method to modify existing pre-trained word embedding models of spoken languages for Sign Language glosses. These newly-generated embeddings are described, visualised, and then used in the encoder and/or decoder of models for the Text2Gloss and Gloss2Text task of machine translation. In two translation settings (one including data augmentation-based pre-training and a baseline), we find that bootstrapped word embeddings for glosses improve translation across four Signed/spoken language pairs. Many improvements are statistically significant, including those where the bootstrapped gloss embedding models are used.Languages included: American Sign Language, Finnish Sign Language, Spanish Sign Language, Sign Language of The Netherlands.

SignON, a 3-year Horizon 20202 project addressing the lack of technology and services for MT between sign languages (SLs) and spoken languages (SpLs) ended in December 2023. SignON was unprecedented. Not only it addressed the wider complexity of the aforementioned problem – from research and development of recognition, translation and synthesis, through development of easy-to-use mobile applications and a cloud-based framework to do the “heavy lifting” as well as to establishing ethical, privacy and inclusivenesspolicies and operation guidelines – but also engaged with the deaf and hard of hearing communities in an effective co-creation approach where these main stakeholders drove the development in the right direction and had the final say.Currently we are witnessing advances in natural language processing for SLs, including MT. SignON was one of the largest projects that contributed to this surge with 17 partners and more than 60 consortium members, working in parallel with other international and European initiatives, such as project EASIER and others.

pdf bib
Can True Zero-shot Methods with Large Language Models be Adopted for Sign Language Machine Translation?
Euan McGill | Horacio Saggion
Proceedings of the First International Workshop on Knowledge-Enhanced Machine Translation

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises of 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.

pdf bib abs
AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis
Nicolau Duran-Silva | Pablo Accuosto | Piotr Przybyła | Horacio Saggion
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

The accurate attribution of scientific works to research organizations is hindered by the lack of openly available manually annotated data–in particular when multilingual and complex affiliation strings are considered. The AffilGood framework introduced in this paper addresses this gap. We identify three sub-tasks relevant for institution name disambiguation and make available annotated datasets and tools aimed at each of them, including i) a dataset annotated with affiliation spans in noisy automatically-extracted strings; ii) a dataset annotated with named entities for the identification of organizations and their locations; iii) seven datasets annotated with the Research Organization Registry (ROR) identifiers for the evaluation of entity-linking systems. In addition, we describe, evaluate and make available newly developed tools that use these datasets to provide solutions for each of the identified sub-tasks. Our results confirm the value of the developed resources and methods in addressing key challenges in institution name disambiguation.

pdf bib abs
A Novel Corpus for Automated Sexism Identification on Social Media
Lutfiye Seda Mut Altin | Horacio Saggion
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

In this paper, we present a novel dataset for the study of automated sexism identification and categorization on social media in Turkish. For this purpose, we have collected, following a well established methodology, a set of Tweets and YouTube comments. Relying on expert organizations in the area of gender equality, each text has been annotated based on a two-level labelling schema derived from previous research. Our resulting dataset consists of around 7,000 annotated instances useful for the study of expressions of sexism and misogyny on the Web. To the best of our knowledge, this is the first two-level manually annotated comprehensive Turkish dataset for sexism identification. In order to fuel research in this relevant area, we also present the result of our benchmarking experiments in the area of sexism identification in Turkish.

pdf bib
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
Matthew Shardlow | Horacio Saggion | Fernando Alva-Manchego | Marcos Zampieri | Kai North | Sanja Štajner | Regina Stodden
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

pdf bib abs
Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations
Horacio Saggion | Stefan Bott | Sandra Szasz | Nelson Pérez | Saúl Calderón | Martín Solís
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

Automatic lexical simplification is a task to substitute lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification in Spanish and Catalan. This dataset represents the first of its kind in Catalan and a substantial addition to the sparse data on automatic lexical simplification which is available for Spanish. Specifically, it is the first dataset for Spanish which includes scalar ratings of the understanding difficulty of lexical items. In addition, we present a detailed analysis aiming at assessing the appropriateness and ethical dimensions of the data for the lexical simplification task.

pdf bib abs
Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning
Piotr Przybyła | Euan McGill | Horacio Saggion
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

We present XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. Our solution is adaptive, it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. This reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We evaluate our approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs. We also perform a qualitative analysis to understand the language patterns in the misinformation text that play a role in the attacks.

pdf bib abs
TRIBBLE - TRanslating IBerian languages Based on Limited E-resources
Igor Kuzmin | Piotr Przybyła | Euan Mcgill | Horacio Saggion
Proceedings of the Ninth Conference on Machine Translation

In this short overview paper, we describe our system submission for the language pairs Spanish to Aragonese (spa-arg), Spanish to Aranese (spa-arn), and Spanish to Asturian (spa-ast). We train a unified model for all language pairs in the constrained scenario. In addition, we add two language control tokens for Aragonese and Aranese Occitan, as there is already one present for Asturian. We take the distilled NLLB-200 model with 600M parameters and extend special tokens with 2 tokens that denote target languages (arn_Latn, arg_Latn) because Asturian was already presented in NLLB-200 model. We adapt the model by training on a special regime of data augmentation with both monolingual and bilingual training data for the language pairs in this challenge.

2023

pdf bib abs
BSL-Hansard: A parallel, multimodal corpus of English and interpreted British Sign Language data from parliamentary proceedings
Euan McGill | Horacio Saggion
Proceedings of the Second International Workshop on Automatic Translation for Signed and Spoken Languages

BSL-Hansard is a novel open source and multimodal resource composed by combining Sign Language video data in BSL and English text from the official transcription of British parliamentary sessions. This paper describes the method followed to compile BSL-Hansard including time alignment of text using the MAUS (Schiel, 2015) segmentation system, gives some statistics about this dataset, and suggests experiments. These primarily include end-to-end Sign Language-to-text translation, but is also relevant for broader machine translation, and speech and language processing tasks.

SignON (https://signon-project.eu/) is a Horizon 2020 project, running from 2021 until the end of 2023, which addresses the lack of technology and services for the automatic translation between sign languages (SLs) and spoken languages, through an inclusive, human-centric solution, hence contributing to the repertoire of communication media for deaf, hard of hearing (DHH) and hearing individuals. In this paper, we present an update of the status of the project, describing the approaches developed to address the challenges and peculiarities of SL machine translation (SLMT).

pdf bib abs
LeSS: A Computationally-Light Lexical Simplifier for Spanish
Sanja Stajner | Daniel Ibanez | Horacio Saggion
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Due to having knowledge of only basic vocabulary, many people cannot understand up-to-date written information and thus make informed decisions and fully participate in the society. We propose LeSS, a modular lexical simplification architecture that outperforms state-of-the-art lexical simplification systems for Spanish. In addition to its state-of-the-art performance, LeSS is computationally light, using much less disk space, CPU and GPU, and having faster loading and execution time than the transformer-based lexical simplification models which are predominant in the field.

pdf bib abs
Part-of-Speech tagging Spanish Sign Language data and its applications in Sign Language machine translation
Euan McGill | Luis Chiruzzo | Santiago Egea Gómez | Horacio Saggion
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

This paper examines the use of manually part-of-speech tagged sign language gloss data in the Text2Gloss and Gloss2Text translation tasks, as well as running an LSTM-based sequence labelling model on the same glosses for automatic part-of-speech tagging. We find that a combination of tag-enhanced glosses and pretraining the neural model positively impacts performance in the translation tasks. The results of the tagging task are limited, but provide a methodological framework for further research into tagging sign language gloss data.

pdf bib abs
LSLlama: Fine-Tuned LLaMA for Lexical Simplification
Anthony Baez | Horacio Saggion
Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability

Generative Large Language Models (LLMs), such as GPT-3, have become increasingly effective and versatile in natural language processing (NLP) tasks. One such task is Lexical Simplification, where state-of-the-art methods involve complex, multi-step processes which can use both deep learning and non-deep learning processes. LLaMA, an LLM with full research access, holds unique potential for the adaption of the entire LS pipeline. This paper details the process of fine-tuning LLaMA to create LSLlama, which performs comparably to previous LS baseline models LSBert and UniHD.

2022

pdf bib
Proceedings of the Fifth International Workshop on Emoji Understanding and Applications in Social Media
Sanjaya Wijeratne | Jennifer Lee | Horacio Saggion | Amit Sheth
Proceedings of the Fifth International Workshop on Emoji Understanding and Applications in Social Media

pdf bib abs
What’s in a (dataset’s) name? The case of BigPatent
Silvia Casola | Alberto Lavelli | Horacio Saggion
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Sharing datasets and benchmarks has been crucial for rapidly improving Natural Language Processing models and systems. Documenting datasets’ characteristics (and any modification introduced over time) is equally important to avoid confusion and make comparisons reliable. Here, we describe the case of BigPatent, a dataset for patent summarization that exists in at least two rather different versions under the same name. While previous literature has not clearly distinguished among versions, their differences do not only lay on a surface level but also modify the dataset’s core nature and, thus, the complexity of the summarization task. While this paper describes a specific case, we aim to shed light on new challenges that might emerge in resource sharing and advocate for comprehensive documentation of datasets and models.

pdf bib abs
Identification of complex words and passages in medical documents in French
Kim Cheng Sheang | Anaïs Koptient | Natalia Grabar | Horacio Saggion
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Identification de mots et passages difficiles dans les documents médicaux en français. L’objectif de la simplification automatique des textes consiste à fournir une nouvelle version de documents qui devient plus facile à comprendre pour une population donnée ou plus facile à traiter par d’autres applications du TAL. Cependant, avant d’effectuer la simplification, il est important de savoir ce qu’il faut simplifier exactement dans les documents. En effet, même dans les documents techniques et spécialisés, il n’est pas nécessaire de tout simplifier mais juste les segments qui présentent des difficultés de compréhension. Il s’agit typiquement de la tâche d’identification de mots complexes : effectuer le diagnostic de difficulté d’un document donné pour y détecter les mots et passages complexes. Nous proposons de travail sur l’identification de mots et passages complexes dans les documents biomédicaux en français.

pdf bib abs
Evaluation of Automatic Text Simplification: Where are we now, where should we go from here
Natalia Grabar | Horacio Saggion
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Évaluation de la simplification automatique de textes : où nous en sommes et vers où devonsnous aller. L’objectif de la simplification automatique de textes consiste à adapter le contenu de documents afin de les rendre plus faciles à comprendre par une population donnée ou bien pour améliorer les performances d’autres tâches TAL, comme le résumé automatique ou extraction d’information. Les étapes principales de la simplification automatique de textes sont plutôt bien définies et étudiées dans les travaux existants, alors que l’évaluation de la simplification reste sous-étudiée. En effet, contrairement à d’autres tâches de TAL, comme la recherche et extraction d’information, la structuration de terminologie ou les questions-réponses, qui s’attendent à avoir des résultats factuels et consensuels, il est difficile de définir un résultat standard de la simplification. Le processus de simplification est très subjectif et souvent non consensuel parce qu’il est lourdement basé sur les connaissances propres des personnes. Ainsi, plusieurs facteurs sont impliqués dans le processus de simplification et son évaluation. Dans ce papier, nous présentons et discutons quelques uns de ces facteurs : le rôle de l’utilisateur final, les données de référence, le domaine des documents source et les mesures d’évaluation.

pdf bib abs
Translating Spanish into Spanish Sign Language: Combining Rules and Data-driven Approaches
Luis Chiruzzo | Euan McGill | Santiago Egea-Gómez | Horacio Saggion
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

This paper presents a series of experiments on translating between spoken Spanish and Spanish Sign Language glosses (LSE), including enriching Neural Machine Translation (NMT) systems with linguistic features, and creating synthetic data to pretrain and later on finetune a neural translation model. We found evidence that pretraining over a large corpus of LSE synthetic data aligned to Spanish sentences could markedly improve the performance of the translation models.

pdf bib abs
Challenges with Sign Language Datasets for Sign Language Recognition and Translation
Mirella De Sisto | Vincent Vandeghinste | Santiago Egea Gómez | Mathieu De Coster | Dimitar Shterionov | Horacio Saggion
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Sign Languages (SLs) are the primary means of communication for at least half a million people in Europe alone. However, the development of SL recognition and translation tools is slowed down by a series of obstacles concerning resource scarcity and standardization issues in the available data. The former challenge relates to the volume of data available for machine learning as well as the time required to collect and process new data. The latter obstacle is linked to the variety of the data, i.e., annotation formats are not unified and vary amongst different resources. The available data formats are often not suitable for machine learning, obstructing the provision of automatic tools based on neural models. In the present paper, we give an overview of these challenges by comparing various SL corpora and SL machine learning datasets. Furthermore, we propose a framework to address the lack of standardization at format level, unify the available resources and facilitate SL research for different languages. Our framework takes ELAN files as inputs and returns textual and visual data ready to train SL recognition and translation models. We present a proof of concept, training neural translation models on the data produced by the proposed framework.

pdf bib abs
ALEXSIS: A Dataset for Lexical Simplification in Spanish
Daniel Ferrés | Horacio Saggion
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Lexical Simplification is the process of reducing the lexical complexity of a text by replacing difficult words with easier to read (or understand) expressions while preserving the original information and meaning. In this paper we introduce ALEXSIS, a new dataset for this task, and we use ALEXSIS to benchmark Lexical Simplification systems in Spanish. The paper describes the evaluation of three kind of approaches to Lexical Simplification, a thesaurus-based approach, a single transformers-based approach, and a combination of transformers. We also report state of the art results on a previous Lexical Simplification dataset for Spanish.

pdf bib abs
Exploring the limits of a base BART for multi-document summarization in the medical domain
Ishmael Obonyo | Silvia Casola | Horacio Saggion
Proceedings of the Third Workshop on Scholarly Document Processing

This paper is a description of our participation in the Multi-document Summarization for Literature Review (MSLR) Shared Task, in which we explore summarization models to create an automatic review of scientific results. Rather than maximizing the metrics using expensive computational models, we placed ourselves in a situation of scarce computational resources and explore the limits of a base sequence to sequence models (thus with a limited input length) to the task. Although we explore methods to feed the abstractive model with salient sentences only (using a first extractive step), we find the results still need some improvements.

pdf bib abs
Lexical Simplification in Foreign Language Learning: Creating Pedagogically Suitable Simplified Example Sentences
Jasper Degraeuwe | Horacio Saggion
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

This study presents a lexical simplification (LS) methodology for foreign language (FL) learning purposes, a barely explored area of automatic text simplification (TS). The method, targeted at Spanish as a foreign language (SFL), includes a customised complex word identification (CWI) classifier and generates substitutions based on masked language modelling. Performance is calculated on a custom dataset by means of a new, pedagogically-oriented evaluation. With 43% of the top simplifications being found suitable, the method shows potential for simplifying sentences to be used in FL learning activities. The evaluation also suggests that, though still crucial, meaning preservation is not always a prerequisite for successful LS. To arrive at grammatically correct and more idiomatic simplifications, future research could study the integration of association measures based on co-occurrence data.

pdf bib abs
Controllable Lexical Simplification for English
Kim Cheng Sheang | Daniel Ferrés | Horacio Saggion
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

Fine-tuning Transformer-based approaches have recently shown exciting results on sentence simplification task. However, so far, no research has applied similar approaches to the Lexical Simplification (LS) task. In this paper, we present ConLS, a Controllable Lexical Simplification system fine-tuned with T5 (a Transformer-based model pre-trained with a BERT-style approach and several other tasks). The evaluation results on three datasets (LexMTurk, BenchLS, and NNSeval) have shown that our model performs comparable to LSBert (the current state-of-the-art) and even outperforms it in some cases. We also conducted a detailed comparison on the effectiveness of control tokens to give a clear view of how each token contributes to the model.

We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability TSAR-2022 held in conjunction with EMNLP 2022. The task called the Natural Language Processing research community to contribute with methods to advance the state of the art in multilingual lexical simplification for English, Portuguese, and Spanish. A total of 14 teams submitted the results of their lexical simplification systems for the provided test data. Results of the shared task indicate new benchmarks in Lexical Simplification with English lexical simplification quantitative results noticeably higher than those obtained for Spanish and (Brazilian) Portuguese.

2021

pdf bib abs
Syntax-aware Transformers for Neural Machine Translation: The Case of Text to Sign Gloss Translation
Santiago Egea Gómez | Euan McGill | Horacio Saggion
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

It is well-established that the preferred mode of communication of the deaf and hard of hearing (DHH) community are Sign Languages (SLs), but they are considered low resource languages where natural language processing technologies are of concern. In this paper we study the problem of text to SL gloss Machine Translation (MT) using Transformer-based architectures. Despite the significant advances of MT for spoken languages in the recent couple of decades, MT is in its infancy when it comes to SLs. We enrich a Transformer-based architecture aggregating syntactic information extracted from a dependency parser to word-embeddings. We test our model on a well-known dataset showing that the syntax-aware model obtains performance gains in terms of MT evaluation metrics.

pdf bib abs
Controllable Sentence Simplification with a Unified Text-to-Text Transfer Transformer
Kim Cheng Sheang | Horacio Saggion
Proceedings of the 14th International Conference on Natural Language Generation

Recently, a large pre-trained language model called T5 (A Unified Text-to-Text Transfer Transformer) has achieved state-of-the-art performance in many NLP tasks. However, no study has been found using this pre-trained model on Text Simplification. Therefore in this paper, we explore the use of T5 fine-tuning on Text Simplification combining with a controllable mechanism to regulate the system outputs that can help generate adapted text for different target audiences. Our experiments show that our model achieves remarkable results with gains of between +0.69 and +1.41 over the current state-of-the-art (BART+ACCESS). We argue that using a pre-trained model such as T5, trained on several tasks with large amounts of data, can help improve Text Simplification.

2020

pdf bib abs
A Multi-level Annotated Corpus of Scientific Papers for Scientific Document Summarization and Cross-document Relation Discovery
Ahmed AbuRa’ed | Horacio Saggion | Luis Chiruzzo
Proceedings of the Twelfth Language Resources and Evaluation Conference

Related work sections or literature reviews are an essential part of every scientific article being crucial for paper reviewing and assessment. The automatic generation of related work sections can be considered an instance of the multi-document summarization problem. In order to allow the study of this specific problem, we have developed a manually annotated, machine readable data-set of related work sections, cited papers (e.g. references) and sentences, together with an additional layer of papers citing the references. We additionally present experiments on the identification of cited sentences, using as input citation contexts. The corpus alongside the gold standard are made available for use by the scientific community.

pdf bib abs
LaSTUS/TALN at TRAC - 2020 Trolling, Aggression and Cyberbullying
Lütfiye Seda Mut Altın | Alex Bravo | Horacio Saggion
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

This paper presents the participation of the LaSTUS/TALN team at TRAC-2020 Trolling, Aggression and Cyberbullying shared task. The aim of the task is to determine whether a given text is aggressive and contains misogynistic content. Our approach is based on a bidirectional Long Short Term Memory network (bi-LSTM). Our system performed well at sub-task A, aggression detection; however underachieved at sub-task B, misogyny detection.

2019

pdf bib abs
LaSTUS/TALN at SemEval-2019 Task 6: Identification and Categorization of Offensive Language in Social Media with Attention-based Bi-LSTM model
Lutfiye Seda Mut Altin | Àlex Bravo Serrano | Horacio Saggion
Proceedings of the 13th International Workshop on Semantic Evaluation

We present a bidirectional Long-Short Term Memory network for identifying offensive language in Twitter. Our system has been developed in the context of the SemEval 2019 Task 6 which comprises three different sub-tasks, namely A: Offensive Language Detection, B: Categorization of Offensive Language, C: Offensive Language Target Identification. We used a pre-trained Word Embeddings in tweet data, including information about emojis and hashtags. Our approach achieves good performance in the three sub-tasks.

pdf bib abs
Transferring Knowledge from Discourse to Arguments: A Case Study with Scientific Abstracts
Pablo Accuosto | Horacio Saggion
Proceedings of the 6th Workshop on Argument Mining

In this work we propose to leverage resources available with discourse-level annotations to facilitate the identification of argumentative components and relations in scientific texts, which has been recognized as a particularly challenging task. In particular, we implement and evaluate a transfer learning approach in which contextualized representations learned from discourse parsing tasks are used as input of argument mining models. As a pilot application, we explore the feasibility of using automatically identified argumentative components and relations to predict the acceptance of papers in computer science venues. In order to conduct our experiments, we propose an annotation scheme for argumentative units and relations and use it to enrich an existing corpus with an argumentation layer.

2018

pdf bib
Data-Driven Text Simplification
Sanja Štajner | Horacio Saggion
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts

pdf bib abs
Interpretable Emoji Prediction via Label-Wise Attention LSTMs
Francesco Barbieri | Luis Espinosa-Anke | Jose Camacho-Collados | Steven Schockaert | Horacio Saggion
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Human language has evolved towards newer forms of communication such as social media, where emojis (i.e., ideograms bearing a visual meaning) play a key role. While there is an increasing body of work aimed at the computational modeling of emoji semantics, there is currently little understanding about what makes a computational model represent or predict a given emoji in a certain way. In this paper we propose a label-wise attention mechanism with which we attempt to better understand the nuances underlying emoji prediction. In addition to advantages in terms of interpretability, we show that our proposed architecture improves over standard baselines in emoji prediction, and does particularly well when predicting infrequent emojis.

pdf bib
PDFdigest: an Adaptable Layout-Aware PDF-to-XML Textual Content Extractor for Scientific Articles
Daniel Ferrés | Horacio Saggion | Francesco Ronzano | Àlex Bravo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Multimodal Emoji Prediction
Francesco Barbieri | Miguel Ballesteros | Francesco Ronzano | Horacio Saggion
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Emojis are small images that are commonly included in social media text messages. The combination of visual and textual content in the same message builds up a modern way of communication, that automatic systems are not used to deal with. In this paper we extend recent advances in emoji prediction by putting forward a multimodal approach that is able to predict emojis in Instagram posts. Instagram posts are composed of pictures together with texts which sometimes include emojis. We show that these emojis can be predicted by using the text, but also using the picture. Our main finding is that incorporating the two synergistic modalities, in a combined model, improves accuracy in an emoji prediction task. This result demonstrates that these two modalities (text and images) encode different information on the use of emojis and therefore can complement each other.

This paper describes the results of the first Shared Task on Multilingual Emoji Prediction, organized as part of SemEval 2018. Given the text of a tweet, the task consists of predicting the most likely emoji to be used along such tweet. Two subtasks were proposed, one for English and one for Spanish, and participants were allowed to submit a system run to one or both subtasks. In total, 49 teams participated to the English subtask and 22 teams submitted a system run to the Spanish subtask. Evaluation was carried out emoji-wise, and the final ranking was based on macro F-Score. Data and further information about this task can be found at https://competitions.codalab.org/competitions/17344.

This paper describes the SemEval 2018 Shared Task on Hypernym Discovery. We put forward this task as a complementary benchmark for modeling hypernymy, a problem which has traditionally been cast as a binary classification task, taking a pair of candidate words as input. Instead, our reformulated task is defined as follows: given an input term, retrieve (or discover) its suitable hypernyms from a target corpus. We proposed five different subtasks covering three languages (English, Spanish, and Italian), and two specific domains of knowledge in English (Medical and Music). Participants were allowed to compete in any or all of the subtasks. Overall, a total of 11 teams participated, with a total of 39 different systems submitted through all subtasks. Data, results and further information about the task can be found at https://competitions.codalab.org/competitions/17119.

pdf bib abs
LaSTUS/TALN at Complex Word Identification (CWI) 2018 Shared Task
Ahmed AbuRa’ed | Horacio Saggion
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents the participation of the LaSTUS/TALN team in the Complex Word Identification (CWI) Shared Task 2018 in the English monolingual track . The purpose of the task was to determine if a word in a given sentence can be judged as complex or not by a certain target audience. For the English track, task organizers provided a training and a development datasets of 27,299 and 3,328 words respectively together with the sentence in which each word occurs. The words were judged as complex or not by 20 human evaluators; ten of whom are natives. We submitted two systems: one system modeled each word to evaluate as a numeric vector populated with a set of lexical, semantic and contextual features while the other system relies on a word embedding representation and a distance metric. We trained two separate classifiers to automatically decide if each word is complex or not. We submitted six runs, two for each of the three subsets of the English monolingual CWI track.

bib
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)
Arne Jönsson | Evelina Rennes | Horacio Saggion | Sanja Stajner | Victoria Yaneva
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

2017

pdf bib abs
Are Emojis Predictable?
Francesco Barbieri | Miguel Ballesteros | Horacio Saggion
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Emojis are ideograms which are naturally combined with plain text to visually complement or condense the meaning of a message. Despite being widely used in social media, their underlying semantics have received little attention from a Natural Language Processing standpoint. In this paper, we investigate the relation between words and emojis, studying the novel task of predicting which emojis are evoked by text-based tweet messages. We train several models based on Long Short-Term Memory networks (LSTMs) in this task. Our experimental results show that our neural model outperforms a baseline as well as humans solving the same task, suggesting that computational models are able to better capture the underlying semantics of emojis.

pdf bib abs
What Sentence are you Referring to and Why? Identifying Cited Sentences in Scientific Literature
Ahmed AbuRa’ed | Luis Chiruzzo | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In the current context of scientific information overload, text mining tools are of paramount importance for researchers who have to read scientific papers and assess their value. Current citation networks, which link papers by citation relationships (reference and citing paper), are useful to quantitatively understand the value of a piece of scientific work, however they are limited in that they do not provide information about what specific part of the reference paper the citing paper is referring to. This qualitative information is very important, for example, in the context of current community-based scientific summarization activities. In this paper, and relying on an annotated dataset of co-citation sentences, we carry out a number of experiments aimed at, given a citation sentence, automatically identify a part of a reference paper being cited. Additionally our algorithm predicts the specific reason why such reference sentence has been cited out of five possible reasons.

pdf bib abs
Towards the Understanding of Gaming Audiences by Modeling Twitch Emotes
Francesco Barbieri | Luis Espinosa-Anke | Miguel Ballesteros | Juan Soler-Company | Horacio Saggion
Proceedings of the 3rd Workshop on Noisy User-generated Text

Videogame streaming platforms have become a paramount example of noisy user-generated text. These are websites where gaming is broadcasted, and allows interaction with viewers via integrated chatrooms. Probably the best known platform of this kind is Twitch, which has more than 100 million monthly viewers. Despite these numbers, and unlike other platforms featuring short messages (e.g. Twitter), Twitch has not received much attention from the Natural Language Processing community. In this paper we aim at bridging this gap by proposing two important tasks specific to the Twitch platform, namely (1) Emote prediction; and (2) Trolling detection. In our experiments, we evaluate three models: a BOW baseline, a logistic supervised classifiers based on word embeddings, and a bidirectional long short-term memory recurrent neural network (LSTM). Our results show that the LSTM model outperforms the other two models, where explicit features with proven effectiveness for similar tasks were encoded.

pdf bib abs
An Adaptable Lexical Simplification Architecture for Major Ibero-Romance Languages
Daniel Ferrés | Horacio Saggion | Xavier Gómez Guinovart
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

Lexical Simplification is the task of reducing the lexical complexity of textual documents by replacing difficult words with easier to read (or understand) expressions while preserving the original meaning. The development of robust pipelined multilingual architectures able to adapt to new languages is of paramount importance in lexical simplification. This paper describes and evaluates a modular hybrid linguistic-statistical Lexical Simplifier that deals with the four major Ibero-Romance Languages: Spanish, Portuguese, Catalan, and Galician. The architecture of the system is the same for the four languages addressed, only the language resources used during simplification are language specific.

2016

pdf bib abs
Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning
Luis Espinosa-Anke | Jose Camacho-Collados | Sara Rodríguez-Fernández | Horacio Saggion | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

WordNet is probably the best known lexical resource in Natural Language Processing. While it is widely regarded as a high quality repository of concepts and semantic relations, updating and extending it manually is costly. One important type of relation which could potentially add enormous value to WordNet is the inclusion of collocational information, which is paramount in tasks such as Machine Translation, Natural Language Generation and Second Language Learning. In this paper, we present ColWordNet (CWN), an extended WordNet version with fine-grained collocational information, automatically introduced thanks to a method exploiting linear relations between analogous sense-level embeddings spaces. We perform both intrinsic and extrinsic evaluations, and release CWN for the use and scrutiny of the community.

pdf bib abs
Natural Language Processing for Intelligent Access to Scientific Information
Horacio Saggion | Francesco Ronzano
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Tutorial Abstracts

During the last decade the amount of scientific information available on-line increased at an unprecedented rate. As a consequence, nowadays researchers are overwhelmed by an enormous and continuously growing number of articles to consider when they perform research activities like the exploration of advances in specific topics, peer reviewing, writing and evaluation of proposals. Natural Language Processing Technology represents a key enabling factor in providing scientists with intelligent patterns to access to scientific information. Extracting information from scientific papers, for example, can contribute to the development of rich scientific knowledge bases which can be leveraged to support intelligent knowledge access and question answering. Summarization techniques can reduce the size of long papers to their essential content or automatically generate state-of-the-art-reviews. Paraphrase or textual entailment techniques can contribute to the identification of relations across different scientific textual sources. This tutorial provides an overview of the most relevant tasks related to the processing of scientific documents, including but not limited to the in-depth analysis of the structure of the scientific articles, their semantic interpretation, content extraction and summarization.

pdf bib
Supervised Distributional Hypernym Discovery via Domain Adaptation
Luis Espinosa-Anke | Jose Camacho-Collados | Claudio Delli Bovi | Horacio Saggion
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib abs
A Multi-Layered Annotated Corpus of Scientific Papers
Beatriz Fisas | Francesco Ronzano | Horacio Saggion
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Scientific literature records the research process with a standardized structure and provides the clues to track the progress in a scientific field. Understanding its internal structure and content is of paramount importance for natural language processing (NLP) technologies. To meet this requirement, we have developed a multi-layered annotated corpus of scientific papers in the domain of Computer Graphics. Sentences are annotated with respect to their role in the argumentative structure of the discourse. The purpose of each citation is specified. Special features of the scientific discourse such as advantages and disadvantages are identified. In addition, a grade is allocated to each sentence according to its relevance for being included in a summary. To the best of our knowledge, this complex, multi-layered collection of annotations and metadata characterizing a set of research papers had never been grouped together before in one corpus and therefore constitutes a newer, richer resource with respect to those currently available in the field.

pdf bib abs
ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain
Sergio Oramas | Luis Espinosa Anke | Mohamed Sordo | Horacio Saggion | Xavier Serra
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a gold standard dataset for Entity Linking (EL) in the Music Domain. It contains thousands of musical named entities such as Artist, Song or Record Label, which have been automatically annotated on a set of artist biographies coming from the Music website and social network Last.fm. The annotation process relies on the analysis of the hyperlinks present in the source texts and in a voting-based algorithm for EL, which considers, for each entity mention in text, the degree of agreement across three state-of-the-art EL systems. Manual evaluation shows that EL Precision is at least 94%, and due to its tunable nature, it is possible to derive annotations favouring higher Precision or Recall, at will. We make available the annotated dataset along with evaluation data and the code.

pdf bib abs
What does this Emoji Mean? A Vector Space Skip-Gram Model for Twitter Emojis
Francesco Barbieri | Francesco Ronzano | Horacio Saggion
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Emojis allow us to describe objects, situations and even feelings with small images, providing a visual and quick way to communicate. In this paper, we analyse emojis used in Twitter with distributional semantic models. We retrieve 10 millions tweets posted by USA users, and we build several skip gram word embedding models by mapping in the same vectorial space both words and emojis. We test our models with semantic similarity experiments, comparing the output of our models with human assessment. We also carry out an exhaustive qualitative evaluation, showing interesting results.

pdf bib
TALN at SemEval-2016 Task 11: Modelling Complex Words by Contextual, Lexical and Semantic Features
Francesco Ronzano | Ahmed Abura’ed | Luis Espinosa-Anke | Horacio Saggion
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
TALN at SemEval-2016 Task 14: Semantic Taxonomy Enrichment Via Sense-Based Embeddings
Luis Espinosa-Anke | Francesco Ronzano | Horacio Saggion
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
Making Sense of Massive Amounts of Scientific Publications: the Scientific Knowledge Miner Project
Francesco Ronzano | Ana Freire | Diego Saez-Trumper | Horacio Saggion
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

pdf bib
Trainable Citation-enhanced Summarization of Scientific Articles
Horacio Saggion | Ahmed AbuRa’ed | Francesco Ronzano
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

2015

pdf bib
A Deeper Exploration of the Standard PB-SMT Approach to Text Simplification and its Evaluation
Sanja Štajner | Hannah Béchara | Horacio Saggion
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
How Topic Biases Your Results? A Case Study of Sentiment Analysis and Irony Detection in Italian
Francesco Barbieri | Francesco Ronzano | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
Weakly Supervised Definition Extraction
Luis Espinosa-Anke | Horacio Saggion | Francesco Ronzano
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
Translating from Original to Simplified Sentences using Moses: When does it Actually Work?
Sanja Štajner | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
Automatic Text Simplification for Spanish: Comparative Evaluation of Various Simplification Strategies
Sanja Štajner | Iacer Calixto | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
UPF-taln: SemEval 2015 Tasks 10 and 11. Sentiment Analysis of Literal and Figurative Language in Twitter
Francesco Barbieri | Francesco Ronzano | Horacio Saggion
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
TALN-UPF: Taxonomy Learning Exploiting CRF-Based Hypernym Extraction on Encyclopedic Definitions
Luis Espinosa-Anke | Horacio Saggion | Francesco Ronzano
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
On the Discoursive Structure of Computer Graphics Research Papers
Beatriz Fisas | Horacio Saggion | Francesco Ronzano
Proceedings of the 9th Linguistic Annotation Workshop

2014

pdf bib
Modelling Irony in Twitter
Francesco Barbieri | Horacio Saggion
Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib abs
Creating Summarization Systems with SUMMA
Horacio Saggion
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Automatic text summarization, the reduction of a text to its essential content is fundamental for an on-line information society. Although many summarization algorithms exist, there are few tools or infrastructures providing capabilities for developing summarization applications. This paper presents a new version of SUMMA, a text summarization toolkit for the development of adaptive summarization applications. SUMMA includes algorithms for computation of various sentence relevance features and functionality for single and multidocument summarization in various languages. It also offers methods for content-based evaluation of summaries.

pdf bib abs
Modelling Irony in Twitter: Feature Analysis and Evaluation
Francesco Barbieri | Horacio Saggion
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Irony, a creative use of language, has received scarce attention from the computational linguistics research point of view. We propose an automatic system capable of detecting irony with good accuracy in the social network Twitter. Twitter allows users to post short messages (140 characters) which usually do not follow the expected rules of the grammar, users tend to truncate words and use particular punctuation. For these reason automatic detection of Irony in Twitter is not trivial and requires specific linguistic tools. We propose in this paper a new set of experiments to assess the relevance of the features included in our model. Our model does not include words or sequences of words as features, aiming to detect inner characteristic of Irony.

pdf bib abs
Can Numerical Expressions Be Simpler? Implementation and Demostration of a Numerical Simplification System for Spanish
Susana Bautista | Horacio Saggion
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Information in newspapers is often showed in the form of numerical expressions which present comprehension problems for many people, including people with disabilities, illiteracy or lack of access to advanced technology. The purpose of this paper is to motivate, describe, and demonstrate a rule-based lexical component that simplifies numerical expressions in Spanish texts. We propose an approach that makes news articles more accessible to certain readers by rewriting difficult numerical expressions in a simpler way. We will showcase the numerical simplification system with a live demo based on the execution of our components over different texts, and which will consider both successful and unsuccessful simplification cases.

pdf bib
One Step Closer to Automatic Evaluation of Text Simplification Systems
Sanja Štajner | Ruslan Mitkov | Horacio Saggion
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

pdf bib
Keyword Highlighting Improves Comprehension for People with Dyslexia
Luz Rello | Horacio Saggion | Ricardo Baeza-Yates
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

pdf bib
Modelling Sarcasm in Twitter, a Novel Approach
Francesco Barbieri | Horacio Saggion | Francesco Ronzano
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2013

pdf bib
Readability Indices for Automatic Evaluation of Text Simplification Systems: A Feasibility Study for Spanish
Sanja Štajner | Horacio Saggion
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Unsupervised Learning Summarization Templates from Concise Summaries
Horacio Saggion
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility
Luz Rello | Horacio Saggion | Ricardo Baeza-Yates
Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility

pdf bib
Proceedings of the 14th European Workshop on Natural Language Generation
Albert Gatt | Horacio Saggion
Proceedings of the 14th European Workshop on Natural Language Generation

2012

pdf bib
Can Spanish Be Simpler? LexSiS: Lexical Simplification for Spanish
Stefan Bott | Luz Rello | Biljana Drndarevic | Horacio Saggion
Proceedings of COLING 2012

pdf bib abs
The CONCISUS Corpus of Event Summaries
Horacio Saggion | Sandra Szasz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Text summarization and information extraction systems require adaptation to new domains and languages. This adaptation usually depends on the availability of language resources such as corpora. In this paper we present a comparable corpus in Spanish and English for the study of cross-lingual information extraction and summarization: the CONCISUS Corpus. It is a rich human-annotated dataset composed of comparable event summaries in Spanish and English covering four different domains: aviation accidents, rail accidents, earthquakes, and terrorist attacks. In addition to the monolingual summaries in English and Spanish, we provide automatic translations and ``comparable'' full event reports of the events. The human annotations are concepts marked in the textual sources representing the key event information associated to the event type. The dataset has also been annotated using text processing pipelines. It is being made freely available to the research community for research purposes.

pdf bib abs
Text Simplification Tools for Spanish
Stefan Bott | Horacio Saggion | Simon Mille
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we describe the development of a text simplification system for Spanish. Text simplification is the adaptation of a text to the special needs of certain groups of readers, such as language learners, people with cognitive difficulties and elderly people, among others. There is a clear need for simplified texts, but manual production and adaptation of existing texts is labour intensive and costly. Automatic simplification is a field which attracts growing attention in Natural Language Processing, but, to the best of our knowledge, there are no simplification tools for Spanish. We present a prototype for automatic simplification, which shows that the most important structural simplification operations can be successfully treated with an approach based on rules which can potentially be improved by statistical methods. For the development of this prototype we carried out a corpus study which aims at identifying the operations a text simplification system needs to carry out in order to produce an output similar to what human editors produce when they simplify texts.

pdf bib
Towards Automatic Lexical Simplification in Spanish: An Empirical Study
Biljana Drndarević | Horacio Saggion
Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

pdf bib
Graphical Schemes May Improve Readability but Not Understandability for People with Dyslexia
Luz Rello | Horacio Saggion | Ricardo Baeza-Yates | Eduardo Graells
Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

pdf bib
A Hybrid System for Spanish Text Simplification
Stefan Bott | Horacio Saggion | David Figueroa
Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies

pdf bib
Unsupervised Content Discovery from Concise Summaries
Horacio Saggion
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

2011

pdf bib
An Unsupervised Alignment Algorithm for Text Simplification Corpus Construction
Stefan Bott | Horacio Saggion
Proceedings of the Workshop on Monolingual Text-To-Text Generation

pdf bib
Multi-domain Cross-lingual Information Extraction from Clean and Noisy Texts
Horacio Saggion | Sandra Szasz
Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology

2010

pdf bib abs
Évaluation automatique de résumés avec et sans référence
Juan-Manuel Torres-Moreno | Horacio Saggion | Iria da Cunha | Patricia Velázquez-Morales | Eric Sanjuan
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Nous étudions différentes méthodes d’évaluation de résumé de documents basées sur le contenu. Nous nous intéressons en particulier à la corrélation entre les mesures d’évaluation avec et sans référence humaine. Nous avons développé FRESA, un nouveau système d’évaluation fondé sur le contenu qui calcule les divergences entre les distributions de probabilité. Nous appliquons notre système de comparaison aux diverses mesures d’évaluation bien connues en résumé de texte telles que la Couverture, Responsiveness, Pyramids et Rouge en étudiant leurs associations dans les tâches du résumé multi-document générique (francais/anglais), focalisé (anglais) et résumé mono-document générique (français/espagnol).

pdf bib
Multilingual Summarization Evaluation without Human Models
Horacio Saggion | Juan-Manuel Torres-Moreno | Iria da Cunha | Eric SanJuan | Patricia Velázquez-Morales
Coling 2010: Posters

pdf bib abs
NLP Resources for the Analysis of Patient/Therapist Interviews
Horacio Saggion | Elena Stein-Sparvieri | David Maldavsky | Sandra Szasz
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a set of tools and resources for the analysis of interviews during psychotherapy sessions. One of the main components of the work is a dictionary-based text interpretation tool for the Spanish language. The tool is designed to identify a subset of Freudian drives in patient and therapist discourse.

pdf bib abs
Interpreting SentiWordNet for Opinion Classification
Horacio Saggion | Adam Funk
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a set of tools, resources, and experiments for opinion classification in business-related datasources in two languages. In particular we concentrate on SentiWordNet text interpretation to produce word, sentence, and text-based sentiment features for opinion classification. We achieve good results in experiments using supervised learning machine over syntactic and sentiment-based features. We also show preliminary experiments where the use of summaries before opinion classification provides competitive advantage over the use of full documents.

pdf bib
Experiments on Summary-based Opinion Classification
Elena Lloret | Horacio Saggion | Manuel Palomar
Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text

pdf bib
Human Language Technology for Text-based Analysis of Psychotherapy Sessions in the Spanish Language
Horacio Saggion | Elena Stein-Sparvieri | David Maldavsky | Sandra Szasz
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2009

pdf bib
A Classification Algorithm for Predicting the Structure of Summaries
Horacio Saggion
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

2008

pdf bib
SUMMA. A Robust and Adaptable Summarization Tool
Horacio Saggion
Traitement Automatique des Langues, Volume 49, Numéro 2 : Plate-formes pour le traitement automatique des langues [Platforms for Natural Language Processing]

pdf bib
Experiments on Semantic-based Clustering for Cross-document Coreference
Horacio Saggion
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Introduction to Text Summarization and Other Information Access Technologies
Horacio Saggion
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib abs
A Framework for Identity Resolution and Merging for Multi-source Information Extraction
Milena Yankova | Horacio Saggion | Hamish Cunningham
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In the context of ontology-based information extraction, identity resolution is the process of deciding whether an instance extracted from text refers to a known entity in the target domain (e.g. the ontology). We present an ontology-based framework for identity resolution which can be customized to different application domains and extraction tasks. Rules for identify resolution, which compute similarities between target and source entities based on class information and instance properties and values, can be defined for each class in the ontology. We present a case study of the application of the framework to the problem of multi-source job vacancy extraction

pdf bib
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
Sivaji Bandyopadhyay | Thierry Poibeau | Horacio Saggion | Roman Yangarber
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization

2007

pdf bib
SHEF: Semantic Tagging and Summarization Techniques Applied to Cross-document Coreference
Horacio Saggion
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib abs
Multilingual Multidocument Summarization Tools and Evaluation
Horacio Saggion
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a number of experiments carried out to address the problem of creating summaries from multiple sources in multiple languages. A centroid-based sentence extraction system has been developed which decides the content of the summary using texts in different languages and uses sentences from English sources alone to create the final output. We describe the evaluation of the system in the recent Multilingual Summarization Evaluation MSE 2005 using the pyramids and ROUGE methods.

pdf bib abs
Language Resources for Background Gathering
Horacio Saggion | Robert Gaizauskas
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe the Cubreporter information access system which allows access to news archives through the use of natural language technology. The system includes advanced text search, question answering, summarization, and entity profiling capabilities. It has been designed taking into account the characteristics of the background gathering task.