Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles

Christopher Clark, Mark Yatskar, Luke Zettlemoyer


Abstract
Many datasets have been shown to contain incidental correlations created by idiosyncrasies in the data collection process. For example, sentence entailment datasets can have spurious word-class correlations if nearly all contradiction sentences contain the word “not”, and image recognition datasets can have tell-tale object-background correlations if dogs are always indoors. In this paper, we propose a method that can automatically detect and ignore these kinds of dataset-specific patterns, which we call dataset biases. Our method trains a lower capacity model in an ensemble with a higher capacity model. During training, the lower capacity model learns to capture relatively shallow correlations, which we hypothesize are likely to reflect dataset bias. This frees the higher capacity model to focus on patterns that should generalize better. We ensure the models learn non-overlapping approaches by introducing a novel method to make them conditionally independent. Importantly, our approach does not require the bias to be known in advance. We evaluate performance on synthetic datasets, and four datasets built to penalize models that exploit known biases on textual entailment, visual question answering, and image recognition tasks. We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
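The abstract describes training a low-capacity model jointly with a high-capacity model so that the ensemble's combined prediction is used during training, while only the high-capacity model is used at test time. The abstract does not spell out the combination rule, so the following is a minimal sketch under one common assumption from the debiasing literature: a product-of-experts style ensemble that sums the two models' log-probabilities. The function names (`train_loss`, `predict`) are illustrative, not from the paper.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def train_loss(main_logits, bias_logits, labels):
    """Cross-entropy on the combined (product-of-experts) prediction.

    Summing log-probabilities multiplies the two models' distributions,
    so gradients through this loss let the low-capacity bias model absorb
    shallow correlations, freeing the main model to learn the rest.
    """
    combined = log_softmax(log_softmax(main_logits) + log_softmax(bias_logits))
    return -combined[np.arange(len(labels)), labels].mean()

def predict(main_logits):
    # At test time the low-capacity model is discarded: predictions come
    # from the high-capacity model alone.
    return main_logits.argmax(axis=-1)
```

With a uniform (all-zero-logit) bias model, the combined loss reduces to the main model's ordinary cross-entropy, which is a useful sanity check when implementing this kind of ensemble.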
Anthology ID:
2020.findings-emnlp.272
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3031–3045
URL:
https://aclanthology.org/2020.findings-emnlp.272
DOI:
10.18653/v1/2020.findings-emnlp.272
Cite (ACL):
Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2020. Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3031–3045, Online. Association for Computational Linguistics.
Cite (Informal):
Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles (Clark et al., Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.272.pdf
Code:
chrisc36/autobias
Data:
ImageNet, MNIST, Visual Question Answering v2.0