Dagmar Divjak
2022
Abstraction not Memory: BERT and the English Article System
Harish Tayyar Madabushi | Dagmar Divjak | Petar Milin
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Article prediction is a task that has long defied accurate linguistic description. As such, this task is ideally suited to evaluate models on their ability to emulate native-speaker intuition. To this end, we compare the performance of native English speakers and pre-trained models on the task of article prediction set up as a three-way choice (a/an, the, zero). Our experiments with BERT show that BERT outperforms humans on this task across all articles. In particular, BERT is far superior to humans at detecting the zero article, possibly because we insert them using rules that the deep neural model can easily pick up. More interestingly, we find that BERT tends to agree more with annotators than with the corpus when inter-annotator agreement is high but switches to agreeing more with the corpus as inter-annotator agreement drops. We contend that this alignment with annotators, despite being trained on the corpus, suggests that BERT is not memorising article use, but captures a high-level generalisation of article use akin to human intuition.
2020
CxGBERT: BERT meets Construction Grammar
Harish Tayyar Madabushi | Laurence Romain | Dagmar Divjak | Petar Milin
Proceedings of the 28th International Conference on Computational Linguistics
While lexico-semantic elements no doubt capture a large amount of linguistic information, it has been argued that they do not capture all information contained in text. This assumption is central to constructionist approaches to language which argue that language consists of constructions, learned pairings of a form and a function or meaning that are either frequent or have a meaning that cannot be predicted from its component parts. BERT’s training objectives give it access to a tremendous amount of lexico-semantic information, and while BERTology has shown that BERT captures certain important linguistic dimensions, there have been no studies exploring the extent to which BERT might have access to constructional information. In this work we design several probes and conduct extensive experiments to answer this question. Our results allow us to conclude that BERT does indeed have access to a significant amount of information, much of which linguists typically call constructional information. The impact of this observation is potentially far-reaching as it provides insights into what deep learning methods learn from text, while also showing that information contained in constructions is redundantly encoded in lexico-semantics.
2008
Designing and Evaluating a Russian Tagset
Serge Sharoff | Mikhail Kopotev | Tomaž Erjavec | Anna Feldman | Dagmar Divjak
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset is based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 500 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set that can be shared with other researchers.
Co-authors
- Harish Tayyar Madabushi 2
- Petar Milin 2
- Tomaž Erjavec 1
- Anna Feldman 1
- Mikhail Kopotev 1