Exploring Syntactic Information in Sentence Embeddings through Multilingual Subject-verb Agreement

Vivi Nastase, Giuseppe Samo, Chunyang Jiang, Paola Merlo


Abstract
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system that detects complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detecting syntactic objects and their properties in individual sentences, and finding patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.
Anthology ID:
2024.clicit-1.71
Volume:
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Month:
December
Year:
2024
Address:
Pisa, Italy
Editors:
Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue:
CLiC-it
Publisher:
CEUR Workshop Proceedings
Pages:
631–643
URL:
https://aclanthology.org/2024.clicit-1.71/
Cite (ACL):
Vivi Nastase, Giuseppe Samo, Chunyang Jiang, and Paola Merlo. 2024. Exploring Syntactic Information in Sentence Embeddings through Multilingual Subject-verb Agreement. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 631–643, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal):
Exploring Syntactic Information in Sentence Embeddings through Multilingual Subject-verb Agreement (Nastase et al., CLiC-it 2024)
PDF:
https://aclanthology.org/2024.clicit-1.71.pdf