Multilingual Nonce Dependency Treebanks: Understanding how Language Models Represent and Process Syntactic Structure

David Arps; Laura Kallmeyer; Younes Samih; Hassan Sajjad

doi:10.18653/v1/2024.naacl-long.433

Abstract

We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by perplexity scores of autoregressive (ALM) and masked language models (MLM). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of Müller-Eberstein et al. (2022) on nonce test data and show that performance declines for both MLMs and ALMs relative to the original test data. However, most of the performance is retained, suggesting that the probe indeed learns syntax independently from semantics.
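To make the idea of a nonce treebank concrete, the following is a minimal sketch, not the SPUD implementation. It assumes that a content word may be swapped for another word form from the same corpus that shares its UPOS tag and morphological features, while heads and dependency relations stay untouched. The `Token` class, the `CONTENT_UPOS` set, and the functions `build_pool` and `make_nonce` are hypothetical names introduced for illustration; the argument-structure constraints and language-specific rules mentioned in the abstract are omitted here.

```python
import random
from dataclasses import dataclass, replace

# Assumption: only these UPOS tags count as replaceable content words.
CONTENT_UPOS = {"NOUN", "VERB", "ADJ", "ADV"}


@dataclass(frozen=True)
class Token:
    form: str    # surface word form
    upos: str    # universal part-of-speech tag
    feats: str   # morphological features, as in the CoNLL-U FEATS column
    head: int    # 1-based index of the syntactic head (0 = root)
    deprel: str  # dependency relation to the head


def build_pool(sentences):
    """Group content word forms by (UPOS, FEATS) so swaps keep morphology."""
    pool = {}
    for sent in sentences:
        for tok in sent:
            if tok.upos in CONTENT_UPOS:
                pool.setdefault((tok.upos, tok.feats), []).append(tok.form)
    return pool


def make_nonce(sentence, pool, rng):
    """Replace each content word with a different form that has the same tags.

    Function words, heads, and dependency relations are left unchanged,
    so the syntactic annotation of the original sentence still applies.
    """
    out = []
    for tok in sentence:
        candidates = [f for f in pool.get((tok.upos, tok.feats), [])
                      if f != tok.form]
        if tok.upos in CONTENT_UPOS and candidates:
            out.append(replace(tok, form=rng.choice(candidates)))
        else:
            out.append(tok)
    return out


# Toy two-sentence "treebank" (invented example data).
corpus = [
    [Token("The", "DET", "Definite=Def", 2, "det"),
     Token("dog", "NOUN", "Number=Sing", 3, "nsubj"),
     Token("sleeps", "VERB", "Number=Sing", 0, "root")],
    [Token("A", "DET", "Definite=Ind", 2, "det"),
     Token("theorem", "NOUN", "Number=Sing", 3, "nsubj"),
     Token("melts", "VERB", "Number=Sing", 0, "root")],
]

pool = build_pool(corpus)
nonce = make_nonce(corpus[0], pool, random.Random(0))
print(" ".join(tok.form for tok in nonce))  # prints "The theorem melts"
```

Because only the `form` field changes, the nonce sentence can reuse the gold dependency annotation of its source sentence, which is what makes such data usable as a test set for syntactic probes.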

Anthology ID: 2024.naacl-long.433
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 7822–7844
URL: https://aclanthology.org/2024.naacl-long.433/
DOI: 10.18653/v1/2024.naacl-long.433
Cite (ACL): David Arps, Laura Kallmeyer, Younes Samih, and Hassan Sajjad. 2024. Multilingual Nonce Dependency Treebanks: Understanding how Language Models Represent and Process Syntactic Structure. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7822–7844, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Multilingual Nonce Dependency Treebanks: Understanding how Language Models Represent and Process Syntactic Structure (Arps et al., NAACL 2024)
PDF: https://aclanthology.org/2024.naacl-long.433.pdf
Video: https://aclanthology.org/2024.naacl-long.433.mp4
