Kapil Thadani


2020

Effective Few-Shot Classification with Transfer Learning
Aakriti Gupta | Kapil Thadani | Neil O’Hare
Proceedings of the 28th International Conference on Computational Linguistics

Few-shot learning addresses the problem of learning based on a small amount of training data. Although more well-studied in the domain of computer vision, recent work has adapted the Amazon Review Sentiment Classification (ARSC) text dataset for use in the few-shot setting. In this work, we use the ARSC dataset to study a simple application of transfer learning approaches to few-shot classification. We train a single binary classifier to learn all few-shot classes jointly by prefixing class identifiers to the input text. Given the text and class, the model then makes a binary prediction for that text/class pair. Our results show that this simple approach can outperform most published results on this dataset. Surprisingly, we also show that including domain information as part of the task definition only leads to a modest improvement in model accuracy, and zero-shot classification, without further fine-tuning on few-shot domains, performs equivalently to few-shot classification. These results suggest that the classes in the ARSC few-shot task, which are defined by the intersection of domain and rating, are actually very similar to each other, and that a more suitable dataset is needed for the study of few-shot text classification.

2019

Unsupervised Neologism Normalization Using Embedding Space Mapping
Nasser Zalmout | Kapil Thadani | Aasish Pappu
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

This paper presents an approach for detecting and normalizing neologisms in social media content. Neologisms refer to recent expressions that are specific to certain entities or events and are being increasingly used by the public, but have not yet been accepted in mainstream language. Automated methods for handling neologisms are important for natural language understanding and normalization, especially for informal genres with user-generated content. We present an unsupervised approach for detecting neologisms and then normalizing them to canonical words without relying on parallel training data. Our approach builds on the text normalization literature and introduces adaptations to fit the specificities of this task, including phonetic and etymological considerations. We evaluate the proposed techniques on a dataset of Reddit comments, with detected neologisms and corresponding normalizations.

2016

Extractive Summarization under Strict Length Constraints
Yashar Mehdad | Amanda Stent | Kapil Thadani | Dragomir Radev | Youssef Billawala | Karolina Buchner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we report a comparison of various techniques for single-document extractive summarization under strict length budgets, which is a common commercial use case (e.g. summarization of news articles by news aggregators). We show that, evaluated using ROUGE, numerous algorithms from the literature fail to beat a simple lead-based baseline for this task. However, a supervised approach with lightweight and efficient features improves over the lead-based baseline. Additional human evaluation demonstrates that the supervised approach also performs competitively with a commercial system that uses more sophisticated features.

The Role of Discourse Units in Near-Extractive Summarization
Junyi Jessy Li | Kapil Thadani | Amanda Stent
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

Approximation Strategies for Multi-Structure Sentence Compression
Kapil Thadani
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

Cluster-based Web Summarization
Yves Petinot | Kathleen McKeown | Kapil Thadani
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Supervised Sentence Fusion with Single-Stage Inference
Kapil Thadani | Kathleen McKeown
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Sentence Compression with Joint Structural Inference
Kapil Thadani | Kathleen McKeown
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

2012

A Joint Phrasal and Dependency Model for Paraphrase Alignment
Kapil Thadani | Scott Martin | Michael White
Proceedings of COLING 2012: Posters

2011

Identifying Event Descriptions using Co-training with Online News Summaries
William Yang Wang | Kapil Thadani | Kathleen McKeown
Proceedings of 5th International Joint Conference on Natural Language Processing

Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
Kapil Thadani | Kathleen McKeown
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

A Hierarchical Model of Web Summaries
Yves Petinot | Kathleen McKeown | Kapil Thadani
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Towards Strict Sentence Intersection: Decoding and Evaluation Strategies
Kapil Thadani | Kathleen McKeown
Proceedings of the Workshop on Monolingual Text-To-Text Generation

2010

Towards Semi-Automated Annotation for Prepositional Phrase Attachment
Sara Rosenthal | William Lipovsky | Kathleen McKeown | Kapil Thadani | Jacob Andreas
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper investigates whether high-quality annotations for tasks involving semantic disambiguation can be obtained without a major investment in time or expense. We examine the use of untrained human volunteers from Amazon's Mechanical Turk in disambiguating prepositional phrase (PP) attachment over sentences drawn from the Wall Street Journal corpus. Our goal is to compare the performance of these crowdsourced judgments to the annotations supplied by trained linguists for the Penn Treebank project in order to indicate the viability of this approach for annotation projects that involve contextual disambiguation. The results of our experiments on a sample of the Wall Street Journal corpus show that invoking majority agreement between multiple human workers can yield PP attachments with fairly high precision. This confirms that a crowdsourcing approach to syntactic annotation holds promise for the generation of training corpora in new domains and genres where high-quality annotations are not available and difficult to obtain.

Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment
Mukund Jha | Jacob Andreas | Kapil Thadani | Sara Rosenthal | Kathleen McKeown
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

Time-Efficient Creation of an Accurate Sentence Fusion Corpus
Kathleen McKeown | Sara Rosenthal | Kapil Thadani | Coleman Moore
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

A Framework for Identifying Textual Redundancy
Kapil Thadani | Kathleen McKeown
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)