Giulia Venturi


2024

pdf bib
Evaluating Large Language Models via Linguistic Profiling
Alessio Miaschi | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) undergo extensive evaluation against various benchmarks collected in established leaderboards to assess their performance across multiple tasks. However, to the best of our knowledge, there is a lack of comprehensive studies evaluating these models’ linguistic abilities independent of specific tasks. In this paper, we introduce a novel evaluation methodology designed to test LLMs’ sentence generation abilities under specific linguistic constraints. Drawing on the ‘linguistic profiling’ approach, we rigorously investigate the extent to which five LLMs of varying sizes, tested in both zero- and few-shot scenarios, effectively adhere to (morpho)syntactic constraints. Our findings shed light on the linguistic proficiency of LLMs, revealing both their capabilities and limitations in generating linguistically-constrained sentences.
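As a rough illustration of the constrained-generation check described in this abstract (not the authors' actual pipeline), the sketch below verifies with spaCy whether a generated sentence respects one example (morpho)syntactic constraint; the prompt wording, the hard-coded candidate output and the chosen dependency labels are all assumptions.

```python
# Hedged sketch: verify that a model-generated sentence satisfies a syntactic constraint.
# The candidate string stands in for an actual LLM completion.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def satisfies_constraint(sentence: str, n_subordinate: int) -> bool:
    """Example constraint: the sentence contains exactly n subordinate clauses."""
    doc = nlp(sentence)
    count = sum(tok.dep_ in {"advcl", "ccomp", "xcomp", "acl", "relcl"} for tok in doc)
    return count == n_subordinate

prompt = "Write a sentence containing exactly 2 subordinate clauses."
candidate = "I think that she left because it was late."  # stand-in for the LLM's output
print(satisfies_constraint(candidate, 2))
```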

pdf bib
Linguistic Knowledge Can Enhance Encoder-Decoder Models (If You Let It)
Alessio Miaschi | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we explore the impact of augmenting pre-trained Encoder-Decoder models, specifically T5, with linguistic knowledge for the prediction of a target task. In particular, we investigate whether fine-tuning a T5 model on an intermediate task that predicts structural linguistic properties of sentences modifies its performance on the target task of predicting sentence-level complexity. Our study encompasses diverse experiments conducted on Italian and English datasets, employing both monolingual and multilingual T5 models of various sizes. Results obtained for both languages and in cross-lingual configurations show that linguistically motivated intermediate fine-tuning generally has a positive impact on target task performance, especially when applied to smaller models and in scenarios with limited data availability.
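A minimal sketch of the two-stage idea, assuming a Hugging Face T5 checkpoint and toy data; the "parse depth" intermediate property and the task prefixes are placeholders, not the paper's exact setup.

```python
# Illustrative two-stage fine-tuning: intermediate linguistic task, then target complexity task.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def step(prefix: str, sentence: str, target: str) -> float:
    """One text-to-text training step: both tasks are cast as string prediction."""
    enc = tok(f"{prefix}: {sentence}", return_tensors="pt", truncation=True)
    labels = tok(target, return_tensors="pt").input_ids
    loss = model(**enc, labels=labels).loss
    loss.backward(); opt.step(); opt.zero_grad()
    return loss.item()

# Stage 1 (intermediate): predict a structural linguistic property, e.g. parse tree depth.
for sent, depth in [("The cat sleeps .", 2), ("The cat the dog chased sleeps .", 4)]:
    step("parse depth", sent, str(depth))

# Stage 2 (target): predict sentence-level complexity with the same backbone.
for sent, score in [("The cat sleeps .", 1.2), ("The cat the dog chased sleeps .", 4.7)]:
    step("complexity", sent, str(score))
```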

2022

pdf bib
Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus
Tommaso Agnoloni | Roberto Bartolini | Francesca Frontini | Simonetta Montemagni | Carlo Marchetti | Valeria Quochi | Manuela Ruisi | Giulia Venturi
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

This paper describes the process of acquisition, cleaning, interpretation, coding and linguistic annotation of a collection of parliamentary debates from the Senate of the Italian Republic, covering the COVID-19 period and an earlier period for reference and comparison, according to the CLARIN ParlaMint guidelines and prescriptions. The corpus contains 1199 sessions and 79,373 speeches, for a total of about 31 million words, and was encoded according to the ParlaCLARIN TEI XML format as well as in the CoNLL-U format. It includes extensive metadata about the speakers, the sessions, the political parties and the Parliamentary groups. As required by the ParlaMint initiative, the corpus was also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the Universal Dependencies guidelines. Named entity classification was also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology, with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus deposited and archived in the CLARIN repository together with all the other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used for both research and educational purposes.
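Since the corpus is also released in CoNLL-U, a tiny sketch of how such a distribution can be consumed programmatically is given below (assuming the `conllu` Python library; the sentence and its annotation are invented).

```python
# Reading a CoNLL-U fragment of the kind distributed with the corpus (toy example).
from conllu import parse

rows = [
    "1\tIl\til\tDET\tRD\tDefinite=Def|Gender=Masc|Number=Sing\t2\tdet\t_\t_",
    "2\tSenato\tSenato\tPROPN\tSP\t_\t3\tnsubj\t_\t_",
    "3\tapprova\tapprovare\tVERB\tV\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\t_",
]
sample = "# sent_id = demo-1\n" + "\n".join(rows) + "\n\n"

for sentence in parse(sample):
    for token in sentence:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```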

pdf bib
SemEval-2022 Task 3: PreTENS-Evaluating Neural Networks on Presuppositional Semantic Knowledge
Roberto Zamparelli | Shammur Chowdhury | Dominique Brunato | Cristiano Chesi | Felice Dell’Orletta | Md. Arid Hasan | Giulia Venturi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We report the results of SemEval-2022 Task 3 (PreTENS), on evaluating the acceptability of simple sentences containing constructions whose two arguments are presupposed to be, or not to be, in an ordered taxonomic relation. The task featured two sub-tasks: (i) a binary prediction task and (ii) a regression task predicting acceptability on a continuous scale. The sentences were artificially generated in three languages (English, Italian and French). 21 systems, with 8 system papers, were submitted for the task, all based on various types of fine-tuned transformer systems, often with ensemble methods and various data augmentation techniques. The best systems reached an F1-macro score of 94.49 (sub-task 1) and a Spearman correlation coefficient of 0.80 (sub-task 2), with interesting variations across specific constructions and/or languages.
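For reference, the two official metrics can be computed as in the hedged sketch below (labels and scores are invented; scikit-learn and SciPy are assumed).

```python
# Sub-task 1 uses macro-averaged F1; sub-task 2 uses Spearman rank correlation.
from sklearn.metrics import f1_score
from scipy.stats import spearmanr

gold_labels = [1, 0, 1, 1, 0]            # binary acceptability (toy data)
pred_labels = [1, 0, 0, 1, 0]
print("F1-macro:", f1_score(gold_labels, pred_labels, average="macro"))

gold_scores = [6.5, 1.2, 5.8, 6.9, 2.1]  # continuous acceptability (toy data)
pred_scores = [6.0, 1.5, 4.9, 6.4, 2.8]
rho, _ = spearmanr(gold_scores, pred_scores)
print("Spearman:", rho)
```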

pdf bib
On the Nature of BERT: Correlating Fine-Tuning and Linguistic Competence
Federica Merendi | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 29th International Conference on Computational Linguistics

Several studies in the literature on the interpretation of Neural Language Models (NLMs) focus on the linguistic generalization abilities of pre-trained models. However, little attention is paid to how the linguistic knowledge of the models changes during fine-tuning. In this paper, we contribute to this line of research by showing to what extent a wide range of linguistic phenomena are forgotten across 50 epochs of fine-tuning, and how the preserved linguistic knowledge correlates with the resolution of the fine-tuning task. To this end, we considered a rather understudied task in which linguistic information plays the main role, i.e. predicting the evolution of the written language competence of native language learners. In addition, we investigate whether it is possible to predict the fine-tuned NLM's accuracy across the 50 epochs relying solely on the assessed linguistic competence. Our results are encouraging and show a strong relationship between the model's linguistic competence and its ability to solve a linguistically-based downstream task.

2021

pdf bib
What Makes My Model Perplexed? A Linguistic Investigation on Neural Language Models Perplexity
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

This paper presents an investigation aimed at studying how the linguistic structure of a sentence affects the perplexity of two of the most popular Neural Language Models (NLMs), BERT and GPT-2. We first compare the sentence-level likelihood computed with BERT and the perplexity of GPT-2, showing that the two metrics are correlated. In addition, we exploit linguistic features capturing a wide set of morpho-syntactic and syntactic phenomena, showing how they contribute to predicting the perplexity of the two NLMs.
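A minimal sketch of the per-sentence perplexity computation, assuming the Hugging Face GPT-2 checkpoint; sentence length stands in for the much richer set of linguistic features used in the paper.

```python
# Per-sentence GPT-2 perplexity correlated with a toy linguistic feature.
import math
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(sentence: str) -> float:
    """Perplexity = exp of the mean token-level negative log-likelihood."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over the sentence's tokens
    return math.exp(loss.item())

sents = ["The cat is on the mat.",
         "Colorless green ideas sleep furiously.",
         "The report the committee the chair appointed wrote was long."]
ppl = [gpt2_perplexity(s) for s in sents]
lengths = [len(s.split()) for s in sents]  # toy stand-in for a linguistic feature
rho, _ = spearmanr(ppl, lengths)
print(rho)
```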

2020

pdf bib
Linguistic Profiling of a Neural Language Model
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 28th International Conference on Computational Linguistics

In this paper we investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after fine-tuning, and how this knowledge affects its predictions on several classification problems. We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that BERT is able to encode a wide range of linguistic characteristics, but tends to lose this information when trained on specific downstream tasks. We also find that BERT's capacity to encode different kinds of linguistic properties has a positive influence on its predictions: the more readable linguistic information about a sentence it stores, the better it predicts the expected label assigned to that sentence.
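The probing setup can be approximated as in the sketch below (not the paper's exact configuration): a simple regressor is trained to recover one sentence-level feature from frozen BERT representations; the pooling strategy and the toy feature are assumptions.

```python
# Linear probe over frozen BERT sentence representations (toy data).
import torch
from sklearn.linear_model import LinearRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(sentence: str):
    """Mean-pooled last-layer representation of the sentence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc).last_hidden_state  # (1, seq_len, 768)
    return out.mean(dim=1).squeeze(0).numpy()

sents = ["Dogs bark.",
         "The old dog barked loudly.",
         "A dog that barks all night annoys the neighbours."]
X = [embed(s) for s in sents]
y = [len(s.split()) for s in sents]  # probed feature: sentence length in tokens

probe = LinearRegression().fit(X, y)
print(probe.score(X, y))  # how well the representation encodes the feature
```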

pdf bib
Tracking the Evolution of Written Language Competence in L2 Spanish Learners
Alessio Miaschi | Sam Davidson | Dominique Brunato | Felice Dell’Orletta | Kenji Sagae | Claudia Helena Sanchez-Gutierrez | Giulia Venturi
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper we present an NLP-based approach for tracking the evolution of written language competence in L2 Spanish learners using a wide range of linguistic features automatically extracted from students' written productions. Beyond reporting classification results for different scenarios, we explore the connection between the most predictive features and the teaching curriculum, finding that our set of linguistic features often reflects the explicit instruction that students receive during each course.

pdf bib
“Voices of the Great War”: A Richly Annotated Corpus of Italian Texts on the First World War
Federico Boschetti | Irene De Felice | Stefano Dei Rossi | Felice Dell’Orletta | Michele Di Giorgio | Martina Miliani | Lucia C. Passaro | Angelica Puddu | Giulia Venturi | Nicola Labanca | Alessandro Lenci | Simonetta Montemagni
Proceedings of the Twelfth Language Resources and Evaluation Conference

“Voices of the Great War” is the first large corpus of Italian historical texts dating back to the period of the First World War. This corpus differs from other existing resources in several respects. First, from the linguistic point of view, it documents the wide range of varieties in which Italian was articulated in that period, from the diastratic (educated vs. uneducated writers), diaphasic (low/informal vs. high/formal registers) and diatopic (regional varieties, dialects) points of view. From the historical perspective, through a collection of texts belonging to different genres, it represents different views on the war and the various styles of narrating war events and experiences. The final corpus is balanced along various dimensions, corresponding to textual genre, language variety, author type and the typology of conveyed contents. The corpus is fully annotated with lemmas, part-of-speech tags, terminology and named entities. Significant corpus samples representative of the different “voices” have also been enriched with meta-linguistic and syntactic information. The layer of syntactic annotation forms the first nucleus of an Italian historical treebank complying with the Universal Dependencies standard. The paper illustrates the final resource, the methodology and tools used to build it, and the web interface for navigating it.

pdf bib
Profiling-UD: a Tool for Linguistic Profiling of Texts
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we introduce Profiling-UD, a new text analysis tool inspired by the principles of linguistic profiling that can support language variation research from different perspectives. It allows the extraction of more than 130 features spanning different levels of linguistic description. Beyond the large number of features that can be monitored, a main novelty of Profiling-UD is that it has been specifically devised to be multilingual, since it is based on the Universal Dependencies framework. In the second part of the paper, we demonstrate the effectiveness of these features in a number of theoretical and applied studies in which they were successfully used for text and author profiling.
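Profiling-UD itself is a web tool, but the flavour of UD-based feature extraction it performs can be sketched as follows (using the `stanza` pipeline; the feature names and the example sentence are illustrative, not the tool's actual feature set).

```python
# A few linguistic-profiling style features computed from a UD parse (illustrative only).
import stanza

# assumes the English models have been downloaded with stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse", verbose=False)
doc = nlp("The committee that reviewed the proposal rejected it because it lacked detail.")

for sent in doc.sentences:
    words = sent.words
    link_lengths = [abs(w.id - w.head) for w in words if w.head > 0]
    features = {
        "n_tokens": len(words),
        "lexical_density": sum(w.upos in {"NOUN", "VERB", "ADJ", "ADV"} for w in words) / len(words),
        "avg_link_length": sum(link_lengths) / len(link_lengths),  # mean head-dependent distance
        "subordinate_clauses": sum(w.deprel.startswith(("advcl", "ccomp", "acl")) for w in words),
    }
    print(features)
```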

2018

pdf bib
Universal Dependencies and Quantitative Typological Trends. A Case Study on Word Order
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Maria Simi | Giulia Venturi
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Detection and correction of errors and inconsistencies in “gold treebanks” are becoming increasingly central topics in corpus annotation. The paper illustrates a new incremental method for enhancing treebanks, with particular emphasis on the extension of error patterns across different textual genres and registers. The impact and role of the corrections were assessed in a dependency parsing experiment carried out with four different parsers, whose results are promising. For both evaluation datasets, parser performance increases in terms of the standard LAS and UAS measures, in terms of a more focused measure taking into account only the relations involved in the error patterns, and at the level of individual dependencies.
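For readers unfamiliar with the evaluation measures, the sketch below shows how UAS and LAS are computed on a toy sentence (the gold and predicted analyses are invented).

```python
# UAS = share of tokens with the correct head; LAS also requires the correct relation label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]   # (head, deprel) per token
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]

uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"UAS={uas:.2f}  LAS={las:.2f}")
```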

pdf bib
Is this Sentence Difficult? Do you Agree?
Dominique Brunato | Lorenzo De Mattei | Felice Dell’Orletta | Benedetta Iavarone | Giulia Venturi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we present a crowdsourcing-based approach to modelling the human perception of sentence complexity. We collect a large corpus of sentences rated with complexity judgments for two typologically different languages, Italian and English. We test our approach in two experimental scenarios aimed at investigating the contribution of a wide set of lexical, morpho-syntactic and syntactic phenomena in predicting i) the degree of agreement among annotators, independently of the assigned judgment, and ii) the perception of sentence complexity.

2017

pdf bib
Dangerous Relations in Dependency Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2016

pdf bib
PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Thomas François | Philippe Blache
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

pdf bib
CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence
Alessia Barbagli | Pietro Lucisano | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present the CItA corpus (Corpus Italiano di Apprendenti L1), a collection of essays written by Italian L1 learners during the first and second years of lower secondary school. The corpus was built in the framework of an interdisciplinary study carried out jointly by computational linguists and experimental pedagogists, aimed at tracking the development of written language competence over the years, also taking into account students' background information.

2015

pdf bib
Design and Annotation of the First Italian Corpus for Text Simplification
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 9th Linguistic Annotation Workshop

pdf bib
NLP–Based Readability Assessment of Health–Related Texts: a Case Study on Italian Informed Consent Forms
Giulia Venturi | Tommaso Bellandi | Felice Dell’Orletta | Simonetta Montemagni
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

pdf bib
T2K^2: a System for Automatically Extracting and Organizing Knowledge from Texts
Felice Dell’Orletta | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present T2K^2, a suite of tools for automatically extracting domain-specific knowledge from collections of Italian and English texts. T2K^2 (Text-To-Knowledge v2) relies on a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine learning, which are dynamically integrated to provide an accurate and incremental representation of the content of vast repositories of unstructured documents. The extracted knowledge ranges from domain-specific entities and named entities to the relations connecting them, and can be used for indexing document collections with respect to different information types. T2K^2 also includes “linguistic profiling” functionalities aimed at supporting the user in constructing the acquisition corpus, e.g. in selecting texts belonging to the same genre or characterized by the same degree of specialization, or in monitoring the “added value” of newly inserted documents. T2K^2 is a web application that can be accessed from any browser through a personal account and has been tested in a wide range of domains.

pdf bib
Assessing the Readability of Sentences: Which Corpora and Features?
Felice Dell’Orletta | Martijn Wieling | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

2013

pdf bib
Linguistic Profiling based on General–purpose Features and Native Language Identification
Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Unsupervised Linguistically-Driven Reliable Dependency Parses Detection and Self-Training for Adaptation to the Biomedical Domain
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

pdf bib
Linguistic Profiling of Texts Across Textual Genres and Readability Levels. An Exploratory Study on Italian Fictional Prose
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

pdf bib
Genre-oriented Readability Assessment: a Case Study
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Workshop on Speech and Language Processing Tools in Education

pdf bib
Enriching the ISST-TANL Corpus with Semantic Frames
Alessandro Lenci | Simonetta Montemagni | Giulia Venturi | Maria Grazia Cutrullà
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper describes the design and the results of a manual annotation methodology devoted to enriching the ISST-TANL Corpus, derived from the Italian Syntactic-Semantic Treebank (ISST), with Semantic Frame information. The main issues encountered in applying the English FrameNet annotation criteria to a corpus of Italian are discussed, together with the choice of anchoring the semantic annotation layer to the underlying dependency syntactic structure. The results of a case study aimed at extending and specialising this methodology for the annotation of a corpus of legislative texts are also discussed.

2011

pdf bib
ULISSE: an Unsupervised Algorithm for Detecting Reliable Dependency Parses
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

pdf bib
READIT: Assessing Readability of Italian Texts with a View to Text Simplification
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

2010

pdf bib
Contrastive Filtering of Domain-Specific Multi-Word Terms from Different Types of Corpora
Francesca Bonin | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf bib
A Contrastive Approach to Multi-word Extraction from Domain-specific Corpora
Francesca Bonin | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a novel approach to multi-word terminology extraction that combines a well-known automatic term recognition approach, the C-NC value method, with a contrastive ranking technique aimed at refining the obtained results, either by filtering out noise due to common words or by discriminating between semantically different types of terms within heterogeneous terminologies. Unlike other contrastive methods proposed in the literature, which focus on single terms to overcome the sparsity problem of multi-word terms, the proposed contrastive function is able to handle variation in low-frequency events by operating directly on pre-selected multi-word terms. This methodology has been tested in two case studies carried out in the History of Art and Legal domains. Evaluation of the results showed that the proposed two-stage approach significantly improves multi-word term extraction. In particular, for the legal domain it provides an answer to a well-known problem in the semi-automatic construction of legal ontologies, namely that of singling out law terms from terms of the specific domain being regulated.
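As background, the C-value component of the C-NC method can be sketched as below (frequencies are invented, and the contrastive re-ranking step described in the paper is not shown).

```python
# C-value (Frantzi et al.): rewards longer, frequent candidates and penalises strings
# that mostly occur nested inside longer candidate terms.
import math

def c_value(term: str, freq: dict, longer_terms: list) -> float:
    """freq: candidate-term frequencies; longer_terms: candidates that contain `term`."""
    length = len(term.split())
    if not longer_terms:
        return math.log2(length) * freq[term]
    nested_penalty = sum(freq[t] for t in longer_terms) / len(longer_terms)
    return math.log2(length) * (freq[term] - nested_penalty)

freq = {"informed consent": 40, "informed consent form": 25, "consent form": 30}
print(c_value("informed consent", freq, ["informed consent form"]))  # nested candidate
print(c_value("informed consent form", freq, []))                    # non-nested candidate
```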

2008

pdf bib
Building a Bio-Event Annotated Corpus for the Acquisition of Semantic Frames from Biomedical Corpora
Paul Thompson | Philip Cotter | John McNaught | Sophia Ananiadou | Simonetta Montemagni | Andrea Trabucco | Giulia Venturi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper reports on the design and construction of a bio-event annotated corpus which was developed with a specific view to the acquisition of semantic frames from biomedical corpora. We describe the adopted annotation scheme and the annotation process, which is supported by a dedicated annotation tool. The annotated corpus contains 677 abstracts of biomedical research articles.