Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks

Santiago Herrera, Caio Corro, Sylvain Kahane


Abstract
Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information with each rule, and we compare the ranking of the model’s results with the rankings produced by other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.
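The idea of using a sparse linear classifier to surface salient features can be illustrated with a minimal sketch. It assumes that sparsity comes from an L1 penalty, which the abstract does not state, and it fits the model with plain proximal gradient descent rather than whatever solver the authors used. The feature names, the toy data, and the helper functions `fit_sparse_logreg` and `rank_features` are hypothetical and only serve the illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_sparse_logreg(X, y, l1=0.1, lr=0.5, epochs=2000):
    """L1-penalised logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        b -= lr * grad_b  # the bias is not penalised
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b

def rank_features(names, w):
    # Keep features with a non-zero weight, largest absolute weight first.
    kept = [(name, wj) for name, wj in zip(names, w) if wj != 0.0]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical toy data: one row per dependency, binary indicator features.
# Label 1 means "dependent precedes its head" (a word-order question).
names = ["dep_upos=PRON", "head_tense=Past"]
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 2
y = [row[0] for row in X]  # in this toy set, order depends only on the first feature

w, b = fit_sparse_logreg(X, y)
ranked = rank_features(names, w)
print(ranked)
```

In this toy set the second feature carries no information about the label, so the L1 penalty leaves its weight at exactly zero and only the first feature survives as a candidate rule. A real treebank would yield many more features, including conjunctions of several properties, and the surviving weights would give the ranking.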
Anthology ID:
2024.lrec-main.1314
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
15114–15125
URL:
https://aclanthology.org/2024.lrec-main.1314
Cite (ACL):
Santiago Herrera, Caio Corro, and Sylvain Kahane. 2024. Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15114–15125, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks (Herrera et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1314.pdf