The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English

Tom S Juzek

Abstract

We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure that contemporary discourse is represented. Each entry is labeled with its grammatical status (“well-formedness” according to syntactic formalisms), extracted from the literature, and with its acceptability status (“intuitive goodness” as determined by native speakers), obtained through crowdsourcing to the highest experimental standards. Even in its preliminary form, this dataset is the largest publicly accessible dataset of its kind. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics. We observe that grammaticality and acceptability judgments converge in about 83% of cases and that “in-betweenness” occurs frequently, which corroborates existing research. We also find that while machine learning models struggle to predict grammaticality, they perform considerably better at predicting acceptability, which is a novel finding. Future work will focus on expanding the dataset.

Anthology ID: 2024.lrec-main.1401
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 16113–16120
URL: https://aclanthology.org/2024.lrec-main.1401/
Cite (ACL): Tom S Juzek. 2024. The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16113–16120, Torino, Italia. ELRA and ICCL.
Cite (Informal): The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English (Juzek, LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.1401.pdf
