Sameer Pradhan

Also published as: Sameer S. Pradhan, S. Pradhan

2025

Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Michal Novak | Massimo Poesio | Sameer Pradhan | Vincent Ng
Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference

2024

SPLICE: A Singleton-Enhanced PipeLIne for Coreference REsolution
Yilun Zhu | Siyao Peng | Sameer Pradhan | Amir Zeldes
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Singleton mentions, i.e. entities mentioned only once in a text, are important to how humans understand discourse from a theoretical perspective. However, previous attempts to incorporate their detection in end-to-end neural coreference resolution for English have been hampered by the lack of singleton mention spans in the OntoNotes benchmark. This paper addresses this limitation by combining predicted mentions from existing nested NER systems and features derived from OntoNotes syntax trees. With this approach, we create a near approximation of the OntoNotes dataset with all singleton mentions, achieving ~94% recall on a sample of gold singletons. We then propose a two-step neural mention and coreference resolution system, named SPLICE, and compare its performance to the end-to-end approach in two scenarios: the OntoNotes test set and the out-of-domain (OOD) OntoGUM corpus. Results indicate that reconstructed singleton training yields results comparable to end-to-end systems for OntoNotes, while improving OOD stability (+1.1 avg. F1). We conduct error analysis for mention detection and delve into its impact on coreference clustering, revealing that precision improvements deliver more substantial benefits than increases in recall for resolving coreference chains.

My Science Tutor (MyST)–a Large Corpus of Children’s Conversational Speech
Sameer Pradhan | Ronald A. Cole | Wayne H. Ward
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This article describes the [corpus-name] corpus developed as part of the [project-name] project. To the best of our knowledge, this is one of the largest collections of children’s conversational speech that is freely available for non-commercial use under the Creative Commons license (CC BY-NC-SA 4.0). It comprises approximately 400 hours of speech, spanning some 230K utterances spread across about 10,500 virtual tutor sessions. Roughly 1,300 third, fourth and fifth grade students contributed to this corpus. The current release contains roughly 100K transcribed utterances. It is our hope that the corpus can be used to improve automatic speech recognition models and algorithms. We report the word error rate achieved on the test set using a model trained on the training and development portions of the corpus. The git repository of the corpus contains the complete training and evaluation setup in order to facilitate a fair and consistent evaluation. It is our hope that this corpus will contribute to the creation and evaluation of conversational AI agents having a better understanding of children’s speech, potentially opening doors to novel, effective learning and therapeutic interventions.
