Giuseppe Samo

2026

Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Giuseppe Samo | Paola Merlo
Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)

In this paper, we investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish—with its transparent morphological markers—both monolingual and multilingual models succeed either when tokenization is highly atomic or breaking words into small subword units. For Hebrew, however, a multilingual model using character-level tokenization fails to capture its non-concatenative morphology, while a monolingual model with unified morpheme-aware segmentation excels. Performance improves on more synthetic datasets, in all models.

2025

pdf bib

Multilingual vs. Monolingual Transformer Models in Encoding Linguistic Structure and Lexical Abstraction
Vivi Nastase | Giuseppe Samo | Chunyang Jiang | Paola Merlo
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

2024

pdf bib abs

Exploring Italian Sentence Embeddings Properties through Multi-tasking
Vivi Nastase | Giuseppe Samo | Chunyang Jiang | Paola Merlo
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale – several Blackbird Language Matrices (BLMs) problems in Italian – and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure – in terms of sequence of phrases/chunks – and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles does not seem to be present in the pretrained sentence embeddings.

pdf bib abs

Exploring Syntactic Information in Sentence Embeddings through Multilingual Subject-verb Agreement
Vivi Nastase | Giuseppe Samo | Chunyang Jiang | Paola Merlo
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.

pdf bib abs

BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge
Chunyang Jiang | Giuseppe Samo | Vivi Nastase | Paola Merlo
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles to learn language-related problems and delve into deeper formal and semantic properties of language, through a process of paradigm understanding. A BLM matrix consists of a context set and an answer set. The context is a sequence of sentences that encode implicitly an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples following corrupted generating rules. We propose three subtasks —agreement concord, causative and object-drop alternation detection— each in two variants of increasing lexical complexity.The datasets comprise a few prompts for few-shot learning and a large test set.

2023

pdf bib abs

Blackbird Language Matrices Tasks for Generalization
Paola Merlo | Chunyang Jiang | Giuseppe Samo | Vivi Nastase
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

To develop a system with near-human language capabilities, we need to understand current systems’ generalisation and compositional abilities. We approach this by generating compositional, structured data, inspired from visual intelligence tests, that depend on the problem-solvers being able to disentangle objects and their absolute and relative properties in a sequence of images. We design an analogous task and develop the corresponding datasets that capture specific linguistic phenomena and their properties. Solving each problem instance depends on detecting the relevant linguistic objects and generative rules of the problem. We propose two datasets modelling two linguistic phenomena – subject-verb agreement in French, and verb alternations in English. The datasets can be used to investigate how LLMs encode linguistic objects, such as phrases, their grammatical and semantic properties, such as number or semantic role, and how such information is combined to correctly solve each problem. Specifically generated error types help investigate the behaviour of the system, which important information it is able to detect, and which structures mislead it.

pdf bib

Building Structured Synthetic Datasets: The Case of Blackbird Language Matrices (BLMs)
Paola Merlo | Giuseppe Samo | Vivi Nastase | Chunyang Jiang
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

pdf bib abs

BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs
Giuseppe Samo | Vivi Nastase | Chunyang Jiang | Paola Merlo
Findings of the Association for Computational Linguistics: EMNLP 2023

Current NLP models appear to be achieving performance comparable to human capabilities on well-established benchmarks. New benchmarks are now necessary to test deeper layers of understanding of natural languages by these models. Blackbird’s Language Matrices are a recently developed framework that draws inspiration from tests of human analytic intelligence. The BLM task has revealed that successful performances in previously studied linguistic problems do not yet stem from a deep understanding of the generative factors that define these problems. In this study, we define a new BLM task for predicate-argument structure, and develop a structured dataset for its investigation, concentrating on the spray-load verb alternations in English, as a case study. The context sentences include one alternant from the spray-load alternation and the target sentence is the other alternant, to be chosen among a minimally contrastive and adversarial set of answers. We describe the generation process of the dataset and the reasoning behind the generating rules. The dataset aims to facilitate investigations into how verb information is encoded in sentence embeddings and how models generalize to the complex properties of argument structures. Benchmarking experiments conducted on the dataset and qualitative error analysis on the answer set reveal the inherent challenges associated with the problem even for current high-performing representations.