The Tembusu Treebank: An English Learner Treebank

Luís Morgado da Costa, Francis Bond, Roger V. P. Winder


Abstract
This paper reports on the creation and development of the Tembusu Learner Treebank, an open treebank created from the NTU Corpus of Learner English and unique for incorporating mal-rules in the annotation of ungrammatical sentences. It describes the motivation for the treebank and its development, as well as its use to build a new parse-ranking model for the English Resource Grammar, designed to improve parse selection for ungrammatical sentences and to diagnose these sentences through mal-rules. The corpus contains 25,000 sentences, of which 4,900 are treebanked. The paper concludes with an evaluation experiment that shows the usefulness of this new treebank in the tasks of grammatical error detection and diagnosis.
Anthology ID:
2022.lrec-1.515
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4817–4826
URL:
https://aclanthology.org/2022.lrec-1.515
Cite (ACL):
Luís Morgado da Costa, Francis Bond, and Roger V. P. Winder. 2022. The Tembusu Treebank: An English Learner Treebank. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4817–4826, Marseille, France. European Language Resources Association.
Cite (Informal):
The Tembusu Treebank: An English Learner Treebank (Morgado da Costa et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.515.pdf
Code:
lmorgadodacosta/the-tembusu-treebank