Advancing Linguistic Features and Insights by Label-informed Feature Grouping: An Exploration in the Context of Native Language Identification

Serhiy Bykh, Detmar Meurers


Abstract
We propose a hierarchical clustering approach designed to group linguistic features for supervised machine learning that is inspired by variationist linguistics. The method makes it possible to abstract away from the individual feature occurrences by grouping features together that behave alike with respect to the target class, thus providing a new, more general perspective on the data. On the one hand, it reduces data sparsity, leading to quantitative performance gains. On the other, it supports the formation and evaluation of hypotheses about individual choices of linguistic structures. We explore the method using features based on verb subcategorization information and evaluate the approach in the context of the Native Language Identification (NLI) task.
Anthology ID:
C16-1071
Volume:
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Yuji Matsumoto, Rashmi Prasad
Venue:
COLING
SIG:
Publisher:
The COLING 2016 Organizing Committee
Note:
Pages:
739–749
Language:
URL:
https://aclanthology.org/C16-1071
DOI:
Bibkey:
Cite (ACL):
Serhiy Bykh and Detmar Meurers. 2016. Advancing Linguistic Features and Insights by Label-informed Feature Grouping: An Exploration in the Context of Native Language Identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 739–749, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Advancing Linguistic Features and Insights by Label-informed Feature Grouping: An Exploration in the Context of Native Language Identification (Bykh & Meurers, COLING 2016)
Copy Citation:
PDF:
https://aclanthology.org/C16-1071.pdf