Stefan Bott


2024

Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations
Horacio Saggion | Stefan Bott | Sandra Szasz | Nelson Pérez | Saúl Calderón | Martín Solís
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification in Spanish and Catalan. The Catalan dataset is the first of its kind, and the Spanish dataset is a substantial addition to the sparse data available on automatic lexical simplification for Spanish. Specifically, it is the first dataset for Spanish which includes scalar ratings of the understanding difficulty of lexical items. In addition, we present a detailed analysis aiming at assessing the appropriateness and ethical dimensions of the data for the lexical simplification task.

An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework
Matthew Shardlow | Fernando Alva-Manchego | Riza Batista-Navarro | Stefan Bott | Saul Calderon Ramirez | Rémi Cardon | Thomas François | Akio Hayakawa | Andrea Horbach | Anna Hülsing | Yusuke Ide | Joseph Marvin Imperial | Adam Nohejl | Kai North | Laura Occhipinti | Nelson Pérez Rojas | Nishat Raihan | Tharindu Ranasinghe | Martin Solis Salazar | Marcos Zampieri | Horacio Saggion
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.

The BEA 2024 Shared Task on the Multilingual Lexical Simplification Pipeline
Matthew Shardlow | Fernando Alva-Manchego | Riza Batista-Navarro | Stefan Bott | Saul Calderon Ramirez | Rémi Cardon | Thomas François | Akio Hayakawa | Andrea Horbach | Anna Hülsing | Yusuke Ide | Joseph Marvin Imperial | Adam Nohejl | Kai North | Laura Occhipinti | Nelson Pérez Rojas | Nishat Raihan | Tharindu Ranasinghe | Martin Solis Salazar | Sanja Štajner | Marcos Zampieri | Horacio Saggion
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

We report the findings of the 2024 Multilingual Lexical Simplification Pipeline shared task. We released a new dataset comprising 5,927 instances of lexical complexity prediction and lexical simplification on common contexts across 10 languages, split into trial (300) and test (5,627). Ten teams participated across 2 tracks and 10 languages, with 233 runs evaluated across all systems. Five teams participated in all languages for the lexical complexity prediction task and four teams participated in all languages for the lexical simplification task. Teams employed a range of strategies, making use of open and closed source large language models for lexical simplification, as well as feature-based approaches for lexical complexity prediction. The highest scoring team on the combined multilingual data was able to obtain a Pearson’s correlation of 0.6241 and an ACC@1@Top1 of 0.3772, both demonstrating that there is still room for improvement on two difficult sub-tasks of the lexical simplification pipeline.

2017

Factoring Ambiguity out of the Prediction of Compositionality for German Multi-Word Expressions
Stefan Bott | Sabine Schulte im Walde
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Ambiguity represents an obstacle for distributional semantic models (DSMs), which typically subsume the contexts of all word senses within one vector. While individual vector space approaches have been concerned with sense discrimination (e.g., Schütze 1998, Erk 2009, Erk and Pado 2010), such discrimination has rarely been integrated into DSMs across semantic tasks. This paper presents a soft-clustering approach to sense discrimination that filters sense-irrelevant features when predicting the degrees of compositionality for German noun-noun compounds and German particle verbs.

2016

The Role of Modifier and Head Properties in Predicting the Compositionality of English and German Noun-Noun Compounds: A Vector-Space Perspective
Sabine Schulte im Walde | Anna Hätty | Stefan Bott
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

GhoSt-PV: A Representative Gold Standard of German Particle Verbs
Stefan Bott | Nana Khvtisavrishvili | Max Kisselew | Sabine Schulte im Walde
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

German particle verbs represent a frequent type of multi-word expression that forms a highly productive paradigm in the lexicon. Similarly to other multi-word expressions, particle verbs exhibit various levels of compositionality. One of the major obstacles for the study of compositionality is the lack of representative gold standards of human ratings. In order to address this bottleneck, this paper presents such a gold standard data set containing 400 randomly selected German particle verbs. It is balanced across several particle types and three frequency bands, and accompanied by human ratings on the degree of semantic compositionality.

GhoSt-NN: A Representative Gold Standard of German Noun-Noun Compounds
Sabine Schulte im Walde | Anna Hätty | Stefan Bott | Nana Khvtisavrishvili
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a novel gold standard of German noun-noun compounds (GhoSt-NN) including 868 compounds annotated with corpus frequencies of the compounds and their constituents, productivity and ambiguity of the constituents, semantic relations between the constituents, and compositionality ratings of compound-constituent pairs. Moreover, a subset of 180 compounds is balanced for the productivity of the modifiers (distinguishing low/mid/high productivity) and the ambiguity of the heads (distinguishing between heads with 1, 2 and >2 senses).

2015

Exploiting Fine-grained Syntactic Transfer Features to Predict the Compositionality of German Particle Verbs
Stefan Bott | Sabine Schulte im Walde
Proceedings of the 11th International Conference on Computational Semantics

2014

Optimizing a Distributional Semantic Model for the Prediction of German Particle Verb Compositionality
Stefan Bott | Sabine Schulte im Walde
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In the work presented here we assess the degree of compositionality of German Particle Verbs with a Distributional Semantics Model which only relies on word window information and has no access to syntactic information as such. Our method only takes the lexical distributional distance between the Particle Verb and its Base Verb as a predictor for compositionality. We show that the ranking of distributional similarity correlates significantly with the ranking of human judgements on semantic compositionality for a series of Particle Verbs and the Base Verbs they are derived from. We also investigate the influence of further linguistic factors, such as the ambiguity and the overall frequency of the verbs, and the syntactically separate occurrences of verbs and particles, which cause difficulties for the correct lemmatization of Particle Verbs. We analyse to what extent these factors may influence the success with which the compositionality of the Particle Verbs may be predicted.

Syntactic Transfer Patterns of German Particle Verbs and their Impact on Lexical Semantics
Stefan Bott | Sabine Schulte im Walde
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

Modelling Regular Subcategorization Changes in German Particle Verbs
Stefan Bott | Sabine Schulte im Walde
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

2012

A Hybrid System for Spanish Text Simplification
Stefan Bott | Horacio Saggion | David Figueroa
Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies

Can Spanish Be Simpler? LexSiS: Lexical Simplification for Spanish
Stefan Bott | Luz Rello | Biljana Drndarevic | Horacio Saggion
Proceedings of COLING 2012

Text Simplification Tools for Spanish
Stefan Bott | Horacio Saggion | Simon Mille
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we describe the development of a text simplification system for Spanish. Text simplification is the adaptation of a text to the special needs of certain groups of readers, such as language learners, people with cognitive difficulties and elderly people, among others. There is a clear need for simplified texts, but manual production and adaptation of existing texts is labour intensive and costly. Automatic simplification is a field which attracts growing attention in Natural Language Processing, but, to the best of our knowledge, there are no simplification tools for Spanish. We present a prototype for automatic simplification, which shows that the most important structural simplification operations can be successfully treated with an approach based on rules which can potentially be improved by statistical methods. For the development of this prototype we carried out a corpus study which aims at identifying the operations a text simplification system needs to carry out in order to produce an output similar to what human editors produce when they simplify texts.

2011

An Unsupervised Alignment Algorithm for Text Simplification Corpus Construction
Stefan Bott | Horacio Saggion
Proceedings of the Workshop on Monolingual Text-To-Text Generation

2009

A Second-Order Joint Eisner Model for Syntactic and Semantic Dependency Parsing
Xavier Lluís | Stefan Bott | Lluís Màrquez
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

2006

CUCWeb: A Catalan corpus built from the Web
Gemma Boleda | Stefan Bott | Rodrigo Meza | Carlos Castillo | Toni Badia | Vicente López
Proceedings of the 2nd International Workshop on Web as Corpus

2002

CATCG: a general purpose parsing tool applied
Alex Alsina | Toni Badia | Gemma Boleda | Stefan Bott | Àngel Gil | Martí Quixal | Oriol Valentín
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)