Leonardo Zilio

2026

Exploring automatic terminology extraction from historical medical data
Leonardo Zilio | Maria José Bocorny Finatto
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2

This paper analyzes the performance of several terminology extraction methods when confronted with historical specialized texts that do not conform with modern orthographical norms. We tested two extraction methods based on linguistic patterns, four prompt-based generative artificial intelligence (GenAI) models, and one BERT-like model. Some of these models went through fine-tuning for terminology extraction, and one of these is specialized in the extraction of medical terms from documents written in Portuguese. For the GenAI models, we tested four different prompting strategies. As test set, we used chapter fifteen of the second part of the book Aviso ’a Gente do Mar sobre a sua Saude [Advice to Sea People about their Health], originally written in French by G. Mauran at the end of the 18th century, and translated and adapted to Portuguese in 1794. The chapter was annotated with terminology, and the evaluation was conducted independently both in terms of f-measure, as well as in terms of pure precision, to observe if the automatic extraction methods could complement the manual token-based annotation. Results show that using automatic extraction methods to complement the manual annotation can improve coverage, and that individual models do not achieve high extraction quality, but, by combining two or more models, a recall of more than 90% could be achieved in the test data.

2024

pdf bib abs

Automatic Identification of COVID-19-Related Conspiracy Narratives in German Telegram Channels and Chats
Philipp Heinrich | Andreas Blombach | Bao Minh Doan Dang | Leonardo Zilio | Linda Havenstein | Nathan Dykes | Stephanie Evert | Fabian Schäfer
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We are concerned with mapping the discursive landscape of conspiracy narratives surrounding the COVID-19 pandemic. In the present study, we analyse a corpus of more than 1,000 German Telegram posts tagged with 14 fine-grained conspiracy narrative labels by three independent annotators. Since emerging narratives on social media are short-lived and notoriously hard to track, we experiment with different state-of-the-art approaches to few-shot and zero-shot text classification. We report performance in terms of ROC-AUC and in terms of optimal F1, and compare fine-tuned methods with off-the-shelf approaches and human performance.

pdf bib

NLP for historical Portuguese: Analysing 18th-century medical texts
Leonardo Zilio | Rafaela Radünz Lazzari | Maria Jose Bocorny Finatto
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

pdf bib abs

Using character-level models for efficient abbreviation and long-form detection
Leonardo Zilio | Shenbin Qian | Diptesh Kanojia | Constantin Orasan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Abbreviations and their associated long forms are important textual elements that are present in almost every scientific communication, and having information about these forms can help improve several NLP tasks. In this paper, our aim is to fine-tune language models for automatically identifying abbreviations and long forms. We used existing datasets which are annotated with abbreviations and long forms to train and test several language models, including transformer models, character-level language models, stacking of different embeddings, and ensemble methods. Our experiments showed that it was possible to achieve state-of-the-art results by stacking RoBERTa embeddings with domain-specific embeddings. However, the analysis of our first run showed that one of the datasets had issues in the BIO annotation, which led us to propose a revised dataset. After re-training selected models on the revised dataset, results show that character-level models achieve comparable results, especially when detecting abbreviations, but both RoBERTa large and the stacking of embeddings presented better results on biomedical data. When tested on a different subdomain (segments extracted from computer science texts), an ensemble method proved to yield the best results for the detection of long forms, and a character-level model had the best performance in detecting abbreviations.

pdf bib

Can rules still beat neural networks? The case of automatic normalisation for 18th-century Portuguese texts
Leonardo Zilio | Rafaela R. Lazzari | Maria José B. Finatto
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2

2023

pdf bib abs

Analysing Mistranslation of Emotions in Multilingual Tweets by Online MT Tools
Hadeel Saadany | Constantin Orasan | Rocio Caro Quintana | Felix Do Carmo | Leonardo Zilio
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

It is common for websites that contain User-Generated Text (UGT) to provide an automatic translation option to reach out to their linguistically diverse users. In such scenarios, the process of translating the users’ emotions is entirely automatic with no human intervention, neither for post-editing, nor for accuracy checking. In this paper, we assess whether automatic translation tools can be a successful real-life utility in transferring emotion in multilingual tweets. Our analysis shows that the mistranslation of the source tweet can lead to critical errors where the emotion is either completely lost or flipped to an opposite sentiment. We identify linguistic phenomena specific to Twitter data which pose a challenge in translation of emotions and show how frequent these features are in different language pairs. We also show that commonly-used quality metrics can lend false confidence in the performance of online MT tools specifically when the source emotion is distorted in telegraphic messages such as tweets.

2022

pdf bib abs

PLOD: An Abbreviation Detection Dataset for Scientific Documents
Leonardo Zilio | Hadeel Saadany | Prashant Sharma | Diptesh Kanojia | Constantin Orăsan
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The detection and extraction of abbreviations from unstructured texts can help to improve the performance of Natural Language Processing tasks, such as machine translation and information retrieval. However, in terms of publicly available datasets, there is not enough data for training deep-neural-networks-based models to the point of generalising well over data. This paper presents PLOD, a large-scale dataset for abbreviation detection and extraction that contains 160k+ segments automatically annotated with abbreviations and their long forms. We performed manual validation over a set of instances and a complete automatic validation for this dataset. We then used it to generate several baseline models for detecting abbreviations and long forms. The best models achieved an F1-score of 0.92 for abbreviations and 0.89 for detecting their corresponding long forms. We release this dataset along with our code and all the models publicly at https://github.com/surrey-nlp/PLOD-AbbreviationDetection

2020

pdf bib abs

A Lexical Simplification Tool for Promoting Health Literacy
Leonardo Zilio | Liana Braga Paraguassu | Luis Antonio Leiva Hercules | Gabriel Ponomarenko | Laura Berwanger | Maria José Bocorny Finatto
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

This paper presents MedSimples, an authoring tool that combines Natural Language Processing, Corpus Linguistics and Terminology to help writers to convert health-related information into a more accessible version for people with low literacy skills. MedSimples applies parsing methods associated with lexical resources to automatically evaluate a text and present simplification suggestions that are more suitable for the target audience. Using the suggestions provided by the tool, the author can adapt the original text and make it more accessible. The focus of MedSimples lies on texts for special purposes, so that it not only deals with general vocabulary, but also with specialized terms. The tool is currently under development, but an online working prototype exists and can be tested freely. An assessment of MedSimples was carried out aiming at evaluating its current performance with some promising results, especially for informing the future developments that are planned for the tool.

2018

pdf bib

SW4ALL: a CEFR Classified and Aligned Corpus for Language Learning
Rodrigo Wilkens | Leonardo Zilio | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib

An SLA Corpus Annotated with Pedagogically Relevant Grammatical Structures
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs

Investigating Productive and Receptive Knowledge: A Profile for Second Language Learning
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the 27th International Conference on Computational Linguistics

The literature frequently addresses the differences in receptive and productive vocabulary, but grammar is often left unacknowledged in second language acquisition studies. In this paper, we used two corpora to investigate the divergences in the behavior of pedagogically relevant grammatical structures in reception and production texts. We further improved the divergence scores observed in this investigation by setting a polarity to them that indicates whether there is overuse or underuse of a grammatical structure by language learners. This led to the compilation of a language profile that was later combined with vocabulary and readability features for classifying reception and production texts in three classes: beginner, intermediate, and advanced. The results of the automatic classification task in both production (0.872 of F-measure) and reception (0.942 of F-measure) were comparable to the current state of the art. We also attempted to automatically attribute a score to texts produced by learners, and the correlation results were encouraging, but there is still a good amount of room for improvement in this task. The developed language profile will serve as input for a system that helps language learners to activate more of their passive knowledge in writing texts.

2017

pdf bib

pdf bib abs

Using NLP for Enhancing Second Language Acquisition
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This study presents SMILLE, a system that draws on the Noticing Hypothesis and on input enhancements, addressing the lack of salience of grammatical infor mation in online documents chosen by a given user. By means of input enhancements, the system can draw the user’s attention to grammar, which could possibly lead to a higher intake per input ratio for metalinguistic information. The system receives as input an online document and submits it to a combined processing of parser and hand-written rules for detecting its grammatical structures. The input text can be freely chosen by the user, providing a more engaging experience and reflecting the user’s interests. The system can enhance a total of 107 fine-grained types of grammatical structures that are based on the CEFR. An evaluation of some of those structures resulted in an overall precision of 87%.

2016

pdf bib abs

B2SG: a TOEFL-like Task for Portuguese
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Resources such as WordNet are useful for NLP applications, but their manual construction consumes time and personnel, and frequently results in low coverage. One alternative is the automatic construction of large resources from corpora like distributional thesauri, containing semantically associated words. However, as they may contain noise, there is a strong need for automatic ways of evaluating the quality of the resulting resource. This paper introduces a gold standard that can aid in this task. The BabelNet-Based Semantic Gold Standard (B2SG) was automatically constructed based on BabelNet and partly evaluated by human judges. It consists of sets of tests that present one target word, one related word and three unrelated words. B2SG contains 2,875 validated relations: 800 for verbs and 2,075 for nouns; these relations are divided among synonymy, antonymy and hypernymy. They can be used as the basis for evaluating the accuracy of the similarity relations on distributional thesauri by comparing the proximity of the target word with the related and unrelated options and observing if the related word has the highest similarity value among them. As a case study two distributional thesauri were also developed: one using surface forms from a large (1.5 billion word) corpus and the other using lemmatized forms from a smaller (409 million word) corpus. Both distributional thesauri were then evaluated against B2SG, and the one using lemmatized forms performed slightly better.

pdf bib

How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality
Carlos Ramisch | Silvio Cordeiro | Leonardo Zilio | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib abs

VerbLexPor: a lexical resource with semantic roles for Portuguese
Leonardo Zilio | Maria José Bocorny Finatto | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a lexical resource developed for Portuguese. The resource contains sentences annotated with semantic roles. The sentences were extracted from two domains: Cardiology research papers and newspaper articles. Both corpora were analyzed with the PALAVRAS parser and subsequently processed with a subcategorization frames extractor, so that each sentence that contained at least one main verb was stored in a database together with its syntactic organization. The annotation was manually carried out by a linguist using an annotation interface. Both the annotated and non-annotated data were exported to an XML format, which is readily available for download. The reason behind exporting non-annotated data is that there is syntactic information collected from the parser annotation in the non-annotated data, and this could be useful for other researchers. The sentences from both corpora were annotated separately, so that it is possible to access sentences either from the Cardiology or from the newspaper corpus. The full resource presents more than seven thousand semantically annotated sentences, containing 192 different verbs and more than 15 thousand individual arguments and adjuncts.