On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Stephen Mussmann, Robin Jia, Percy Liang


Abstract
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
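The core of the approach described above is uncertainty sampling: from a large pool of unlabeled pairs, label the ones the current model is least sure about. A minimal sketch of that selection step (the function name and the toy probability scores are illustrative stand-ins for a BERT pair-classifier's outputs, not the authors' code):

```python
import numpy as np

def uncertainty_sample(scores, k):
    """Return indices of the k pool points whose predicted
    positive-class probability is closest to 0.5 (most uncertain)."""
    scores = np.asarray(scores, dtype=float)
    return np.argsort(np.abs(scores - 0.5))[:k]

# Toy pool of predicted probabilities standing in for model
# outputs over candidate utterance pairs.
pool_scores = [0.99, 0.51, 0.03, 0.48, 0.90, 0.55]
picked = uncertainty_sample(pool_scores, k=3)  # indices 1, 3, 5
```

In the paper's setting the pool is far too large to score exhaustively, so the BERT-based embedding model is used to retrieve candidate uncertain pairs efficiently before selection.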
Anthology ID:
2020.findings-emnlp.305
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | Findings
Publisher:
Association for Computational Linguistics
Pages:
3400–3413
URL:
https://aclanthology.org/2020.findings-emnlp.305
DOI:
10.18653/v1/2020.findings-emnlp.305
PDF:
https://aclanthology.org/2020.findings-emnlp.305.pdf
Code
 worksheets/0x39ba5559
Data
GLUE | WikiQA