Felice Dell’Orletta - ACL Anthology

Felice Dell’Orletta

Also published as: Felice Dell'Orletta, Felice Dell’orletta

2025

Evaluating Lexical Proficiency in Neural Language Models
Cristiano Ciaccio | Alessio Miaschi | Felice Dell’Orletta
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a novel evaluation framework designed to assess the lexical proficiency and linguistic creativity of Transformer-based Language Models (LMs). We validate the framework by analyzing the performance of a set of LMs of different sizes, in both mono- and multilingual configurations, across tasks involving the generation, definition, and contextual usage of lexicalized words, neologisms, and nonce words. To support these evaluations, we developed a novel dataset of lexical entries for the Italian language, including curated definitions and usage examples sourced from various online platforms. The results highlight the robustness and effectiveness of our framework in evaluating multiple dimensions of LMs’ linguistic understanding and offer insight, through the assessment of their linguistic creativity, into the lexical generalization abilities of LMs.

Crossword Space: Latent Manifold Learning for Italian Crosswords and beyond
Cristiano Ciaccio | Gabriele Sarti | Alessio Miaschi | Felice Dell’Orletta
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Doctor, Is That You? Evaluating Large Language Models on Italy’s Medical School Entrance Exams
Ruben Piperno | Agnese Bonfigli | Felice Dell’Orletta | Leandro Pecchia | Mario Merone | Luca Bacco
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

The Role of Eye-Tracking Data in Encoder-Based Models: An In-depth Linguistic Analysis
Lucia Domenichelli | Luca Dini | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

TEXT-CAKE: Challenging Language Models on Local Text Coherence
Luca Dini | Dominique Brunato | Felice Dell’Orletta | Tommaso Caselli
Proceedings of the 31st International Conference on Computational Linguistics

We present a deep investigation of encoder-based Language Models (LMs) on their abilities to detect text coherence across four languages and four text genres using a new evaluation benchmark, TEXT-CAKE. We analyze both multilingual and monolingual LMs with varying architectures and parameters in different finetuning settings. Our findings demonstrate that identifying subtle perturbations that disrupt local coherence is still a challenging task. Furthermore, our results underline the importance of using diverse text genres during pre-training and of an optimal pre-training objective and large vocabulary size. When controlling for other parameters, deep LMs (i.e., higher number of layers) have an advantage over shallow ones, even when the total number of parameters is smaller.

Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Andrea Pedrotti | Michele Papucci | Cristiano Ciaccio | Alessio Miaschi | Giovanni Puccetti | Felice Dell’Orletta | Andrea Esuli
Findings of the Association for Computational Linguistics: ACL 2025

Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we evaluate the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. We develop a pipeline that fine-tunes language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT), obtaining generations more challenging to detect by current models. Additionally, we analyze the linguistic shifts induced by the alignment and how detectors rely on “linguistic shortcuts” to detect texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks.

Generating and Evaluating Multi-Level Text Simplification: A Case Study on Italian
Michele Papucci | Giulia Venturi | Felice Dell’Orletta
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

A Novel Real-World Dataset of Italian Clinical Notes for NLP-based Decision Support in Low Back Pain Treatment
Agnese Bonfigli | Ruben Piperno | Luca Bacco | Felice Dell’Orletta | Dominique Brunato | Filippo Crispino | Giuseppe Francesco Papalia | Fabrizio Russo | Gianluca Vadalà | Rocco Papalia | Mario Merone | Leandro Pecchia
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models
Cristiano Ciaccio | Marta Sartor | Alessio Miaschi | Felice Dell’Orletta
Findings of the Association for Computational Linguistics: ACL 2025

Correctly identifying characters and substrings of words should be a basic but essential ability of any Language Model that aims to proficiently understand and produce language. Despite this, the majority of Pre-trained Language Models (PLMs) are “character-blind” and struggle in spelling tasks, although they still seem to acquire some character knowledge during pre-training, a phenomenon dubbed Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs with different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we propose the first comprehensive investigation on where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.

The OuLiBench Benchmark: Formal Constraints as a Lens into LLM Linguistic Competence
Silvio Calderaro | Alessio Miaschi | Felice Dell’Orletta
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
Luca Moroni | Giovanni Puccetti | Pere-Lluís Huguet Cabot | Andrei Stefan Bejgu | Alessio Miaschi | Edoardo Barba | Felice Dell’Orletta | Andrea Esuli | Roberto Navigli
Findings of the Association for Computational Linguistics: NAACL 2025

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token “fertility”) and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
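
As a rough illustration of the token "fertility" mentioned in the abstract above, the sketch below computes the average number of tokens produced per whitespace-separated word. This is one common way to define the measure and may not match the paper's exact definition; `token_fertility` and `toy_chunk_tokenizer` are hypothetical names, and the toy tokenizer is only a stand-in for a real subword tokenizer.

```python
from typing import Callable, List


def token_fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Assumption: words are delimited by whitespace; other definitions exist.
    """
    total_words = sum(len(text.split()) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def toy_chunk_tokenizer(text: str, size: int = 3) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Splits every word into consecutive chunks of at most `size` characters.
    """
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


if __name__ == "__main__":
    sample = ["ciao mondo"]
    # A word-level tokenizer yields exactly one token per word.
    print(token_fertility(sample, str.split))
    # The chunking tokenizer splits each word into several pieces.
    print(token_fertility(sample, toy_chunk_tokenizer))
```

A lower value means fewer tokens are needed to encode the same text, which is the direction a vocabulary adapted to the target language aims for.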

From Human Reading to NLM Understanding: Evaluating the Role of Eye-Tracking Data in Encoder-Based Models
Luca Dini | Lucia Domenichelli | Dominique Brunato | Felice Dell’Orletta
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cognitive signals, particularly eye-tracking data, offer valuable insights into human language processing. Leveraging eye-gaze data from the Ghent Eye-Tracking Corpus, we conducted a series of experiments to examine how integrating knowledge of human reading behavior impacts Neural Language Models (NLMs) across multiple dimensions: task performance, attention mechanisms, and the geometry of their embedding space. We explored several fine-tuning methodologies to inject eye-tracking features into the models. Our results reveal that incorporating these features does not degrade downstream task performance, enhances alignment between model attention and human attention patterns, and compresses the geometry of the embedding space.

2024

Linguistic Knowledge Can Enhance Encoder-Decoder Models (If You Let It)
Alessio Miaschi | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we explore the impact of augmenting pre-trained Encoder-Decoder models, specifically T5, with linguistic knowledge for the prediction of a target task. In particular, we investigate whether fine-tuning a T5 model on an intermediate task that predicts structural linguistic properties of sentences modifies its performance in the target task of predicting sentence-level complexity. Our study encompasses diverse experiments conducted on Italian and English datasets, employing both monolingual and multilingual T5 models at various sizes. Results obtained for both languages and in cross-lingual configurations show that linguistically motivated intermediate fine-tuning has generally a positive impact on target task performance, especially when applied to smaller models and in scenarios with limited data availability.

Preface to the CLiC-it 2024 Proceedings
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Rachele Sprugnoli
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Evaluating Large Language Models via Linguistic Profiling
Alessio Miaschi | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) undergo extensive evaluation against various benchmarks collected in established leaderboards to assess their performance across multiple tasks. However, to the best of our knowledge, there is a lack of comprehensive studies evaluating these models’ linguistic abilities independent of specific tasks. In this paper, we introduce a novel evaluation methodology designed to test LLMs’ sentence generation abilities under specific linguistic constraints. Drawing on the ‘linguistic profiling’ approach, we rigorously investigate the extent to which five LLMs of varying sizes, tested in both zero- and few-shot scenarios, effectively adhere to (morpho)syntactic constraints. Our findings shed light on the linguistic proficiency of LLMs, revealing both their capabilities and limitations in generating linguistically-constrained sentences.

AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Giovanni Puccetti | Anna Rogers | Chiara Alzetta | Felice Dell’Orletta | Andrea Esuli
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) are increasingly used as ‘content farm’ models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real ‘content farm’. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts ‘in the wild’, while generating them is too easy. We highlight the urgency of more NLP research on this problem.

Hits or Misses? A Linguistically Explainable Formula for Fanfiction Success
Giulio Leonardi | Dominique Brunato | Felice Dell’orletta
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

This study presents a computational analysis of Italian fanfiction, aiming to construct an interpretable model of successful writing within this emerging literary domain. Leveraging explicit features that capture both linguistic style and semantic content, we demonstrate the feasibility of automatically predicting successful writing in fanfiction and we identify a set of robust linguistic predictors that maintain their predictive power across diverse topics and time periods, offering insights into the universal aspects of engaging storytelling. This approach not only enhances our understanding of fanfiction as a genre but also offers potential applications in broader literary analysis and content creation.

SimilEx: The First Italian Dataset for Sentence Similarity with Natural Language Explanations
Chiara Alzetta | Felice Dell’orletta | Chiara Fazzone | Giulia Venturi
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Large language models (LLMs) demonstrate great performance in natural language processing and understanding tasks. However, much work remains to enhance their interpretability. Annotated datasets with explanations could be key to addressing this issue, as they enable the development of models that provide human-like explanations for their decisions. In this paper, we introduce the SimilEx dataset, the first Italian dataset reporting human evaluations of similarity between pairs of sentences. For a subset of these pairs, the annotators also provided explanations in natural language for the scores assigned. The SimilEx dataset is valuable for exploring the variability in similarity perception between sentences and for training LLMs to offer human-like explanations for their predictions.

Controllable Text Generation to Evaluate Linguistic Abilities of Italian LLMs
Cristiano Ciaccio | Felice Dell’orletta | Alessio Miaschi | Giulia Venturi
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

State-of-the-art Large Language Models (LLMs) demonstrate exceptional proficiency across diverse tasks, yet systematic evaluations of their linguistic abilities remain limited. This paper addresses this gap by proposing a new evaluation framework leveraging the potentialities of Controllable Text Generation. Our approach evaluates the models’ capacity to generate sentences that adhere to specific linguistic constraints and their ability to recognize the linguistic properties of their own generated sentences, also in terms of consistency with the specified constraints. We tested our approach on six Italian LLMs using various linguistic constraints.

Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)
Felice Dell'Orletta | Alessandro Lenci | Simonetta Montemagni | Rachele Sprugnoli
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Fine-tuning with HED-IT: The impact of human post-editing for dialogical language models
Daniela Occhipinti | Michele Marchi | Irene Mondella | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim | Marco Guerini
Findings of the Association for Computational Linguistics: ACL 2024

Automatic methods for generating and gathering linguistic data have proven effective for fine-tuning Language Models (LMs) in languages less resourced than English. Still, while there has been emphasis on data quantity, less attention has been given to its quality. In this work, we investigate the impact of human intervention on machine-generated data when fine-tuning dialogical models. In particular, we study (1) whether post-edited dialogues exhibit higher perceived quality compared to the originals that were automatically generated; (2) whether fine-tuning with post-edited dialogues results in noticeable differences in the generated outputs; and (3) whether post-edited dialogues influence the outcomes when considering the parameter size of the LMs. To this end we created HED-IT, a large-scale dataset where machine-generated dialogues are paired with the version post-edited by humans. Using both the edited and unedited portions of HED-IT, we fine-tuned three different sizes of an LM. Results from both human and automatic evaluation show that the different quality of training data is clearly perceived and it has an impact also on the models trained on such data. Additionally, our findings indicate that larger models are less sensitive to data quality, whereas this has a crucial impact on smaller models. These results enhance our comprehension of the impact of human intervention on training data in the development of high-quality LMs.

2023

Unmasking the Wordsmith: Revealing Author Identity through Reader Reviews
Chiara Alzetta | Felice Dell’Orletta | Chiara Fazzone | Alessio Miaschi | Giulia Venturi
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Lost in Labels: An Ongoing Quest to Optimize Text-to-Text Label Selection for Classification
Michele Papucci | Alessio Miaschi | Felice Dell’Orletta
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Coherent or Not? Stressing a Neural Language Model for Discourse Coherence in Multiple Languages
Dominique Brunato | Felice Dell’Orletta | Irene Dini | Andrea Amelio Ravelli
Findings of the Association for Computational Linguistics: ACL 2023

In this study, we investigate the capability of a Neural Language Model (NLM) to distinguish between coherent and incoherent text, where the latter has been artificially created to gradually undermine local coherence within text. While previous research on coherence assessment using NLMs has primarily focused on English, we extend our investigation to multiple languages. We employ a consistent evaluation framework to compare the performance of monolingual and multilingual models in both in-domain and out-domain settings. Additionally, we explore the model’s performance in a cross-language scenario.

Unraveling Text Coherence from the Human Perspective: a Novel Dataset for Italian
Federica Papa | Luca Dini | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

2022

On the Nature of BERT: Correlating Fine-Tuning and Linguistic Competence
Federica Merendi | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 29th International Conference on Computational Linguistics

Several studies in the literature on the interpretation of Neural Language Models (NLM) focus on the linguistic generalization abilities of pre-trained models. However, little attention is paid to how the linguistic knowledge of the models changes during the fine-tuning steps. In this paper, we contribute to this line of research by showing to what extent a wide range of linguistic phenomena are forgotten across 50 epochs of fine-tuning, and how the preserved linguistic knowledge is correlated with the resolution of the fine-tuning task. To this end, we considered a quite understudied task where linguistic information plays the main role, i.e. the prediction of the evolution of written language competence of native language learners. In addition, we investigate whether it is possible to predict the fine-tuned NLM accuracy across the 50 epochs solely relying on the assessed linguistic competence. Our results are encouraging and show a high relationship between the model’s linguistic competence and its ability to solve a linguistically-based downstream task.

SemEval-2022 Task 3: PreTENS-Evaluating Neural Networks on Presuppositional Semantic Knowledge
Roberto Zamparelli | Shammur Chowdhury | Dominique Brunato | Cristiano Chesi | Felice Dell’Orletta | Md. Arid Hasan | Giulia Venturi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We report the results of the SemEval 2022 Task 3, PreTENS, on evaluating the acceptability of simple sentences containing constructions whose two arguments are presupposed to be or not to be in an ordered taxonomic relation. The task featured two sub-tasks articulated as: (i) binary prediction task and (ii) regression task, predicting the acceptability on a continuous scale. The sentences were artificially generated in three languages (English, Italian and French). 21 systems, with 8 system papers, were submitted for the task, all based on various types of fine-tuned transformer systems, often with ensemble methods and various data augmentation techniques. The best systems reached an F1-macro score of 94.49 (sub-task1) and a Spearman correlation coefficient of 0.80 (sub-task2), with interesting variations in specific constructions and/or languages.

How about Time? Probing a Multilingual Language Model for Temporal Relations
Tommaso Caselli | Irene Dini | Felice Dell’Orletta
Proceedings of the 29th International Conference on Computational Linguistics

This paper presents a comprehensive set of probing experiments using a multilingual language model, XLM-R, for temporal relation classification between events in four languages. Results show an advantage of contextualized embeddings over static ones and a detrimental role of sentence level embeddings. While obtaining competitive results against state-of-the-art systems, our probes indicate a lack of suitable encoded information to properly address this task.

Outlier Dimensions that Disrupt Transformers are Driven by Frequency
Giovanni Puccetti | Anna Rogers | Aleksandr Drozd | Felice Dell’Orletta
Findings of the Association for Computational Linguistics: EMNLP 2022

While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequencies of encoded tokens in pre-training data, and they also contribute to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.

2021

Quale testo è scritto meglio? A Study on Italian Native Speakers’ Perception of Writing Quality
Aldo Cerulli | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

Sentence Complexity in Context
Benedetta Iavarone | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

We study the influence of context on how humans evaluate the complexity of a sentence in English. We collect a new dataset of sentences, where each sentence is rated for perceived complexity within different contextual windows. We carry out an in-depth analysis to detect which linguistic features correlate more with complexity judgments and with the degree of agreement among annotators. We train several regression models, using either explicit linguistic features or contextualized word embeddings, to predict the mean complexity values assigned to sentences in the different contextual windows, as well as their standard deviation. Results show that models leveraging explicit features capturing morphosyntactic and syntactic phenomena perform always better, especially when they have access to features extracted from all contextual sentences.

What Makes My Model Perplexed? A Linguistic Investigation on Neural Language Models Perplexity
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

This paper presents an investigation aimed at studying how the linguistic structure of a sentence affects the perplexity of two of the most popular Neural Language Models (NLMs), BERT and GPT-2. We first compare the sentence-level likelihood computed with BERT and GPT-2’s perplexity, showing that the two metrics are correlated. In addition, we exploit linguistic features capturing a wide set of morpho-syntactic and syntactic phenomena, showing how they contribute to predicting the perplexity of the two NLMs.
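
For readers unfamiliar with the metric, the sketch below shows the standard definition of perplexity for an autoregressive model: the exponential of the negative mean log-probability of the tokens in a sentence. It assumes natural-log probabilities are already available per token; the function name `perplexity` is illustrative, and the abstract does not specify how the BERT sentence-level likelihood is computed, so that part is not covered here.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum_i log p(token_i | preceding tokens))
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


if __name__ == "__main__":
    # Hypothetical log-probabilities: each of four tokens predicted with p = 0.5.
    log_probs = [math.log(0.5)] * 4
    print(perplexity(log_probs))
```

When every token is predicted with probability 0.5, the perplexity is approximately 2, which matches the intuition of the model choosing between two equally likely options at each step.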

Human Perception in Natural Language Generation
Lorenzo De Mattei | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim
Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We ask subjects whether they perceive as human-produced a bunch of texts, some of which are actually human-written, while others are automatically generated. We use this data to fine-tune a GPT-2 model to push it to generate more human-like texts, and observe that this fine-tuned model produces texts that are indeed perceived as more human-like than those of the original model. Contextually, we show that our automatic evaluation strategy correlates well with human judgements. We also run a linguistic analysis to unveil the characteristics of human- vs machine-perceived language.

Trattamento automatico della lingua a supporto dell’editoria: primi esperimenti con il Devoto-Oli Junior (Automatic Language Treatment to Support Publishing: First Experiments with the Devoto-Oli Junior)
Irene Dini | Felice Dell’Orletta | Fabio Ferri | Biancamaria Gismondi | Simonetta Montemagni
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

Probing Tasks Under Pressure
Alessio Miaschi | Chiara Alzetta | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

How Do BERT Embeddings Organize Linguistic Knowledge?
Giovanni Puccetti | Alessio Miaschi | Felice Dell’Orletta
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Several studies investigated the linguistic information implicitly encoded in Neural Language Models. Most of these works focused on quantifying the amount and type of information available within their internal representations and across their layers. In line with this scenario, we proposed a different study, based on Lasso regression, aimed at understanding how the information encoded by BERT sentence-level representations is arranged within its hidden units. Using a suite of several probing tasks, we showed the existence of a relationship between the implicit knowledge learned by the model and the number of individual units involved in the encodings of this competence. Moreover, we found that it is possible to identify groups of hidden units more relevant for specific linguistic properties.

Audience Engagement Prediction in Guided Tours through Multimodal Features
Andrea Amelio Ravelli | Andrea Cimino | Felice Dell’Orletta
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models
Gabriele Sarti | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

This paper investigates the relationship between two complementary perspectives in the human assessment of sentence complexity and how they are modeled in a neural language model (NLM). The first perspective takes into account multiple online behavioral metrics obtained from eye-tracking recordings. The second one concerns the offline perception of complexity measured by explicit human judgments. Using a broad spectrum of linguistic features modeling lexical, morpho-syntactic, and syntactic properties of sentences, we perform a comprehensive analysis of linguistic phenomena associated with the two complexity viewpoints and report similarities and differences. We then show the effectiveness of linguistic features when explicitly leveraged by a regression model for predicting sentence complexity and compare its results with the ones obtained by a fine-tuned neural language model. We finally probe the NLM’s linguistic competence before and after fine-tuning, highlighting how linguistic information encoded in representations changes when the model learns to predict complexity.

2020

A Machine Learning approach for Sentiment Analysis for Italian Reviews in Healthcare
Luca Bacco | Andrea Cimino | Luca Paulon | Mario Merone | Felice Dell’Orletta
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)
Johanna Monti | Felice Dell'Orletta | Fabio Tamburini
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Profiling-UD: a Tool for Linguistic Profiling of Texts
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we introduce Profiling-UD, a new text analysis tool inspired by the principles of linguistic profiling that can support language variation research from different perspectives. It allows the extraction of more than 130 features, spanning across different levels of linguistic description. Beyond the large number of features that can be monitored, a main novelty of Profiling-UD is that it has been specifically devised to be multilingual since it is based on the Universal Dependencies framework. In the second part of the paper, we demonstrate the effectiveness of these features in a number of theoretical and applicative studies in which they were successfully used for text and author profiling.

Italian Transformers Under the Linguistic Lens
Alessio Miaschi | Gabriele Sarti | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Quantitative Linguistic Investigations across Universal Dependencies Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Petya Osenova | Kiril Simov | Giulia Venturi
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

On the interaction of automatic evaluation and task framing in headline style transfer
Lorenzo De Mattei | Michele Cafagna | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim | Albert Gatt
Proceedings of the 1st Workshop on Evaluating NLG Evaluation

An ongoing debate in the NLG community concerns the best way to evaluate systems, with human evaluation often being considered the most reliable method, compared to corpus-based metrics. However, tasks involving subtle textual differences, such as style transfer, tend to be hard for humans to perform. In this paper, we propose an evaluation method for this task based on purposely-trained classifiers, showing that it better reflects system differences than traditional metrics such as BLEU.

Linguistic Profiling of a Neural Language Model
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 28th International Conference on Computational Linguistics

In this paper we investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after a fine-tuning process and how this knowledge affects its predictions during several classification problems. We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks. We also find that BERT’s capacity to encode different kinds of linguistic properties has a positive influence on its predictions: the more readable linguistic information it stores about a sentence, the higher its capacity to predict the expected label assigned to that sentence.

“Voices of the Great War”: A Richly Annotated Corpus of Italian Texts on the First World War
Federico Boschetti | Irene De Felice | Stefano Dei Rossi | Felice Dell’Orletta | Michele Di Giorgio | Martina Miliani | Lucia C. Passaro | Angelica Puddu | Giulia Venturi | Nicola Labanca | Alessandro Lenci | Simonetta Montemagni
Proceedings of the Twelfth Language Resources and Evaluation Conference

“Voices of the Great War” is the first large corpus of Italian historical texts dating back to the period of the First World War. This corpus differs from other existing resources in several respects. First, from the linguistic point of view it gives an account of the wide range of varieties in which Italian was articulated in that period, namely from diastratic (educated vs. uneducated writers), diaphasic (low/informal vs. high/formal registers) and diatopic (regional varieties, dialects) points of view. From the historical perspective, through a collection of texts belonging to different genres it represents different views on the war and the various styles of narrating war events and experiences. The final corpus is balanced along various dimensions, corresponding to the textual genre, the language variety used, the author type and the typology of conveyed contents. The corpus is fully annotated with lemmas, part-of-speech, terminology, and named entities. Significant corpus samples representative of the different “voices” have also been enriched with meta-linguistic and syntactic information. The layer of syntactic annotation forms the first nucleus of an Italian historical treebank complying with the Universal Dependencies standard. The paper illustrates the final resource, the methodology and tools used to build it, and the Web Interface for navigating it.

Is Neural Language Model Perplexity Related to Readability?
Alessio Miaschi | Chiara Alzetta | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Predicting Movie-elicited Emotions from Dialogue in Screenplay Text: A Study on “Forrest Gump”
Benedetta Iavarone | Felice Dell’Orletta
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Contextual and Non-Contextual Word Embeddings: an in-depth Linguistic Investigation
Alessio Miaschi | Felice Dell’Orletta
Proceedings of the 5th Workshop on Representation Learning for NLP

In this paper we present a comparison between the linguistic knowledge encoded in the internal representations of a contextual Language Model (BERT) and a contextual-independent one (Word2vec). We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that, although BERT is capable of understanding the full context of each word in an input sequence, the implicit knowledge encoded in its aggregated sentence representations is still comparable to that of a contextual-independent model. We also find that BERT is able to encode sentence-level properties even within single-word embeddings, obtaining comparable or even superior results than those obtained with sentence representations.

The Style of a Successful Story: a Computational Study on the Fanfiction Genre
Andrea Mattei | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

GePpeTto Carves Italian into a Language Model
Lorenzo De Mattei | Michele Cafagna | Felice Dell’Orletta | Malvina Nissim | Marco Guerini
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Preface
Johanna Monti | Felice Dell’Orletta | Fabio Tamburini
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Tracking the Evolution of Written Language Competence in L2 Spanish Learners
Alessio Miaschi | Sam Davidson | Dominique Brunato | Felice Dell’Orletta | Kenji Sagae | Claudia Helena Sanchez-Gutierrez | Giulia Venturi
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper we present an NLP-based approach for tracking the evolution of written language competence in L2 Spanish learners using a wide range of linguistic features automatically extracted from students’ written productions. Beyond reporting classification results for different scenarios, we explore the connection between the most predictive features and the teaching curriculum, finding that our set of linguistic features often reflects the explicit instructions that students receive during each course.

Exploring Attention in a Multimodal Corpus of Guided Tours
Andrea Amelio Ravelli | Antonio Origlia | Felice Dell’Orletta
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment
Lorenzo De Mattei | Michele Cafagna | Felice Dell’Orletta | Malvina Nissim
Proceedings of the Twelfth Language Resources and Evaluation Conference

We automatically generate headlines that are expected to comply with the specific styles of two different Italian newspapers. Through a data alignment strategy and different training/testing settings, we aim at decoupling content from style and preserve the latter in generation. In order to evaluate the generated headlines’ quality in terms of their specific newspaper-compliance, we devise a fine-grained evaluation strategy based on automatic classification. We observe that our models do indeed learn newspaper-specific style. Importantly, we also observe that humans aren’t reliable judges for this task, since although familiar with the newspapers, they are not able to discern their specific styles even in the original human-written headlines. The utility of automatic evaluation goes therefore beyond saving the costs and hurdles of manual annotation, and deserves particular care in its design.

2019

Linguistically-Driven Strategy for Concept Prerequisites Learning on Italian
Alessio Miaschi | Chiara Alzetta | Franco Alberto Cardillo | Felice Dell’Orletta
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present a new concept prerequisite learning method for Learning Object (LO) ordering that exploits only linguistic features extracted from textual educational resources. The method was tested in a cross- and in-domain scenario both for Italian and English. Additionally, we performed experiments based on an incremental training strategy to study the impact of the training set size on the classifier performances. The paper also introduces ITA-PREREQ, to the best of our knowledge the first Italian dataset annotated with prerequisite relations between pairs of educational concepts, and describes the automatic strategy devised to build it.

Building an Italian Written-Spoken Parallel Corpus: a Pilot Study
Elisa Dominutti | Lucia Pifferi | Felice Dell’Orletta | Simonetta Montemagni | Valeria Quochi
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Lost in Text. A Cross-Genre Analysis of Linguistic Phenomena within Text
Chiara Buongiovanni | Francesco Gracci | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Italian and English Sentence Simplification: How Many Differences?
Martina Fieromonte | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

What Makes a Review helpful? Predicting the Helpfulness of Italian TripAdvisor Reviews
Giulia Chiriatti | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Prerequisite or Not Prerequisite? That’s the Problem! An NLP-based Approach for Concept Prerequisite Learning
Chiara Alzetta | Alessio Miaschi | Giovanni Adorni | Felice Dell’Orletta | Frosina Koceva | Samuele Passalacqua | Ilaria Torre
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Quanti anni hai? Age Identification for Italian
Aleksandra Maslennikova | Paolo Labruna | Andrea Cimino | Felice Dell’Orletta
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

2018

A NLP-based Analysis of Reflective Writings by Italian Teachers
Giulia Chiriatti | Valentina Della Gala | Felice Dell’Orletta | Simonetta Montemagni | Maria Chiara Pettenati | Maria Teresa Sagri | Giulia Venturi
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

The CHROME Manifesto: Integrating Multimodal Data into Cultural Heritage Resources
Francesco Cutugno | Felice Dell’Orletta | Isabella Poggi | Renata Savy | Antonio Sorgente
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Sentences and Documents in Native Language Identification
Andrea Cimino | Felice Dell’Orletta | Dominique Brunato | Giulia Venturi
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Is this Sentence Difficult? Do you Agree?
Dominique Brunato | Lorenzo De Mattei | Felice Dell’Orletta | Benedetta Iavarone | Giulia Venturi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we present a crowdsourcing-based approach to model the human perception of sentence complexity. We collect a large corpus of sentences rated with judgments of complexity for two typologically-different languages, Italian and English. We test our approach in two experimental scenarios aimed to investigate the contribution of a wide set of lexical, morpho-syntactic and syntactic phenomena in predicting i) the degree of agreement among annotators independently from the assigned judgment and ii) the perception of sentence complexity.

Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Maria Simi | Giulia Venturi
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Detection and correction of errors and inconsistencies in “gold treebanks” are becoming more and more central topics of corpus annotation. The paper illustrates a new incremental method for enhancing treebanks, with particular emphasis on the extension of error patterns across different textual genres and registers. Impact and role of corrections have been assessed in a dependency parsing experiment carried out with four different parsers, whose results are promising. For both evaluation datasets, the performance of parsers increases, in terms of the standard LAS and UAS measures and of a more focused measure taking into account only relations involved in error patterns, and at the level of individual dependencies.

Gender and Genre Linguistic Profiling: A Case Study on Female and Male Journalistic and Diary Prose
Eleonora Cocciu | Dominique Brunato | Giulia Venturi | Felice Dell’Orletta
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Italian in the Trenches: Linguistic Annotation and Analysis of Texts of the Great War
Irene De Felice | Felice Dell’Orletta | Giulia Venturi | Alessandro Lenci | Simonetta Montemagni
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

DARC-IT: a DAtaset for Reading Comprehension in ITalian
Dominique Brunato | Martina Valeriani | Felice Dell’Orletta
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Word Embeddings in Sentiment Analysis
Ruggero Petrolito | Felice Dell’Orletta
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Lexicon and Syntax: Complexity across Genres and Language Varieties
Pietro Dell’Oglio | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

Universal Dependencies and Quantitative Typological Trends. A Case Study on Word Order
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Dangerous Relations in Dependency Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

On the order of Words in Italian: a Study on Genre vs Complexity
Dominique Brunato | Felice Dell’Orletta
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

Stacked Sentence-Document Classifier Approach for Improving Native Language Identification
Andrea Cimino | Felice Dell’Orletta
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we describe the approach of the ItaliaNLP Lab team to native language identification and discuss the results we submitted as participants to the essay track of NLI Shared Task 2017. We introduce for the first time a 2-stacked sentence-document architecture for native language identification that is able to exploit both local sentence information and a wide set of general-purpose features qualifying the lexical and grammatical structure of the whole document. When evaluated on the official test set, our sentence-document stacked architecture obtained the best result among all the participants of the essay track with an F1 score of 0.8818.

Identifying Predictive Features for Textual Genre Classification: the Key Role of Syntax
Andrea Cimino | Martijn Wieling | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

Stylometry in Computer-Assisted Translation: Experiments on the Babylonian Talmud
Emiliano Giovannetti | Davide Albanesi | Andrea Bellandi | David Dattilo | Felice Dell’Orletta
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

2016

PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Thomas François | Philippe Blache
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence
Alessia Barbagli | Pietro Lucisano | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present the CItA corpus (Corpus Italiano di Apprendenti L1), a collection of essays written by Italian L1 learners collected during the first and second year of lower secondary school. The corpus was built in the framework of an interdisciplinary study jointly carried out by computational linguists and experimental pedagogists and aimed at tracking the development of written language competence over the years and students’ background information.

2015

NLP–Based Readability Assessment of Health–Related Texts: a Case Study on Italian Informed Consent Forms
Giulia Venturi | Tommaso Bellandi | Felice Dell’Orletta | Simonetta Montemagni
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

Design and Annotation of the First Italian Corpus for Text Simplification
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 9th Linguistic Annotation Workshop

2014

The PAISÀ Corpus of Italian Web Texts
Verena Lyding | Egon Stemle | Claudia Borghetti | Marco Brunello | Sara Castagnoli | Felice Dell’Orletta | Henrik Dittmann | Alessandro Lenci | Vito Pirrelli
Proceedings of the 9th Web as Corpus Workshop (WaC-9)

T2K^2: a System for Automatically Extracting and Organizing Knowledge from Texts
Felice Dell’Orletta | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present T2K^2, a suite of tools for automatically extracting domain-specific knowledge from collections of Italian and English texts. T2K^2 (Text-To-Knowledge v2) relies on a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine learning which are dynamically integrated to provide an accurate and incremental representation of the content of vast repositories of unstructured documents. Extracted knowledge ranges from domain-specific entities and named entities to the relations connecting them and can be used for indexing document collections with respect to different information types. T2K^2 also includes linguistic profiling functionalities aimed at supporting the user in constructing the acquisition corpus, e.g. in selecting texts belonging to the same genre or characterized by the same degree of specialization or in monitoring the added value of newly inserted documents. T2K^2 is a web application which can be accessed from any browser through a personal account which has been tested in a wide range of domains.

Assessing the Readability of Sentences: Which Corpora and Features?
Felice Dell’Orletta | Martijn Wieling | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

2013

Linguistic Profiling based on General–purpose Features and Native Language Identification
Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Linguistic Profiling of Texts Across Textual Genres and Readability Levels. An Exploratory Study on Italian Fictional Prose
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Unsupervised Linguistically-Driven Reliable Dependency Parses Detection and Self-Training for Adaptation to the Biomedical Domain
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

2012

Genre-oriented Readability Assessment: a Case Study
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Workshop on Speech and Language Processing Tools in Education

2011

ULISSE: an Unsupervised Algorithm for Detecting Reliable Dependency Parses
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

READ–IT: Assessing Readability of Italian Texts with a View to Text Simplification
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

2010

A Contrastive Approach to Multi-word Extraction from Domain-specific Corpora
Francesca Bonin | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a novel approach to multi-word terminology extraction combining a well-known automatic term recognition approach, the C-NC value method, with a contrastive ranking technique, aimed at refining obtained results either by filtering noise due to common words or by discerning between semantically different types of terms within heterogeneous terminologies. Differently from other contrastive methods proposed in the literature that focus on single terms to overcome the multi-word terms' sparsity problem, the proposed contrastive function is able to handle variation in low frequency events by directly operating on pre-selected multi-word terms. This methodology has been tested in two case studies carried out in the History of Art and Legal domains. Evaluation of achieved results showed that the proposed two-stage approach improves significantly multi-word term extraction results. In particular, for what concerns the legal domain it provides an answer to a well-known problem in the semi-automatic construction of legal ontologies, namely that of singling out law terms from terms of the specific domain being regulated.

Contrastive Filtering of Domain-Specific Multi-Word Terms from Different Types of Corpora
Francesca Bonin | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

Comparing the Influence of Different Treebank Annotations on Dependency Parsing
Cristina Bosco | Simonetta Montemagni | Alessandro Mazzei | Vincenzo Lombardo | Felice Dell’Orletta | Alessandro Lenci | Leonardo Lesmo | Giuseppe Attardi | Maria Simi | Alberto Lavelli | Johan Hall | Jens Nilsson | Joakim Nivre
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

As the interest of the NLP community grows to develop several treebanks also for languages other than English, we observe efforts towards evaluating the impact of different annotation strategies used to represent particular languages or with reference to particular tasks. This paper contributes to the debate on the influence of resources used for the training and development on the performance of parsing systems. It presents a comparative analysis of the results achieved by three different dependency parsers developed and tested with respect to two treebanks for the Italian language, namely TUT and ISST-TANL, which differ significantly at the level of both corpus composition and adopted dependency representations.

Improvements in Parsing the Index Thomisticus Treebank. Revision, Combination and a Feature Model for Medieval Latin
Marco Passarotti | Felice Dell’Orletta
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The creation of language resources for less-resourced languages like the historical ones benefits from the exploitation of language-independent tools and methods developed over the years by many projects for modern languages. Along these lines, a number of treebanks for historical languages started recently to arise, including treebanks for Latin. Among the Latin treebanks, the Index Thomisticus Treebank is a 68,000 token dependency treebank based on the Index Thomisticus by Roberto Busa SJ, which contains the opera omnia of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million tokens. In this paper, we describe a number of modifications that we applied to the dependency parser DeSR, in order to improve the parsing accuracy rates on the Index Thomisticus Treebank. First, we adapted the parser to the specific processing of Medieval Latin, defining an ad-hoc configuration of its features. Then, in order to improve the accuracy rates provided by DeSR, we applied a revision parsing method and we combined the outputs produced by different algorithms. This allowed us to improve accuracy rates substantially, reaching results that are well beyond the state of the art of parsing for Latin.

2009

Reverse Revision and Linear Tree Combination for Dependency Parsing
Giuseppe Attardi | Felice Dell’Orletta
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

DeSRL: A Linear-Time Semantic Role Labeling System
Massimiliano Ciaramita | Giuseppe Attardi | Felice Dell’Orletta | Mihai Surdeanu
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

Multilingual Dependency Parsing and Domain Adaptation using DeSR
Giuseppe Attardi | Felice Dell’Orletta | Maria Simi | Atanas Chanev | Massimiliano Ciaramita
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Probing the Space of Grammatical Variation: Induction of Cross-Lingual Grammatical Constraints from Treebanks
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006

Searching treebanks for functional constraints: cross-lingual experiments in grammatical relation assignment
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper reports on a detailed quantitative analysis of distributional language data of both Italian and Czech, highlighting the relative contribution of a number of distributed grammatical factors to sentence-based identification of subjects and direct objects. The work is based on a Maximum Entropy model of stochastic resolution of grammatical conflicting constraints, and is demonstrably capable of putting explanatory theoretical accounts to the challenging test of an extensive, usage-based empirical verification.

2005

Climbing the Path to Grammar: A Maximum Entropy Model of Subject/Object Learning
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Workshop on Psychocomputational Models of Human Language Acquisition

Co-authors

Andrea Cimino 11

Alessandro Lenci 9

Cristiano Ciaccio 5

Lorenzo De Mattei 5

Malvina Nissim 5

Giovanni Puccetti 5

Giuseppe Attardi 4

Vito Pirrelli 4

Michele Cafagna 3

Benedetta Iavarone 3

Michele Papucci 3

Andrea Amelio Ravelli 3

Gabriele Sarti 3

Agnese Bonfigli 2

Francesca Bonin 2

Tommaso Caselli 2

Giulia Chiriatti 2

Massimiliano Ciaramita 2

Irene De Felice 2

Lucia Domenichelli 2

Chiara Fazzone 2

Marco Guerini 2

Johanna Monti 2

Leandro Pecchia 2

Ruben Piperno 2

Rachele Sprugnoli 2

Fabio Tamburini 2

Martijn Wieling 2

Giovanni Adorni 1

Davide Albanesi 1

Edoardo Barba 1

Alessia Barbagli 1

Andrei Stefan Bejgu 1

Tommaso Bellandi 1

Andrea Bellandi 1

Philippe Blache 1

Claudia Borghetti 1

Federico Boschetti 1

Cristina Bosco 1

Marco Brunello 1

Chiara Buongiovanni 1

Silvio Calderaro 1

Franco Alberto Cardillo 1

Sara Castagnoli 1

Atanas Chanev 1

Cristiano Chesi 1

Shammur Absar Chowdhury 1

Eleonora Cocciu 1

Filippo Crispino 1

Francesco Cutugno 1

David Dattilo 1

Valentina Della Gala 1

Pietro Dell’Oglio 1

Michele Di Giorgio 1

Henrik Dittmann 1

Elisa Dominutti 1

Aleksandr Drozd 1

Martina Fieromonte 1

Thomas François 1

Emiliano Giovannetti 1

Biancamaria Gismondi 1

Francesco Gracci 1

Md. Arid Hasan 1

Pere-Lluís Huguet Cabot 1

Frosina Koceva 1

Nicola Labanca 1

Paolo Labruna 1

Alberto Lavelli 1

Giulio Leonardi 1

Leonardo Lesmo 1

Vincenzo Lombardo 1

Pietro Lucisano 1

Verena Lyding 1

Michele Marchi 1

Aleksandra Maslennikova 1

Andrea Mattei 1

Alessandro Mazzei 1

Federica Merendi 1

Martina Miliani 1

Irene Mondella 1

Roberto Navigli 1

Daniela Occhipinti 1

Antonio Origlia 1

Petya Osenova 1

Federica Papa 1

Giuseppe Francesco Papalia 1

Rocco Papalia 1

Samuele Passalacqua 1

Lucia C. Passaro 1

Marco Passarotti 1

Andrea Pedrotti 1

Ruggero Petrolito 1

Maria Chiara Pettenati 1

Lucia Pifferi 1

Isabella Poggi 1

Angelica Puddu 1

Valeria Quochi 1

Stefano Dei Rossi 1

Fabrizio Russo 1

Maria Teresa Sagri 1

Claudia Helena Sanchez-Gutierrez 1

Antonio Sorgente 1

Mihai Surdeanu 1

Gianluca Vadalà 1

Martina Valeriani 1

Roberto Zamparelli 1

Venues