Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have been proposed, their practical effectiveness is still unknown because existing evaluations consider very few models and do not adequately account for overhead costs. We perform an extensive evaluation of ER across eight different models (17 to 900 million parameters) and fourteen tasks in English. We show how a simple ER technique that caches activations from an intermediate layer of a pretrained model, and learns task-specific adapters on the later layers, is broadly effective. For the best-performing baseline in our experiments (DeBERTa-v2 XL), adding a precomputed cache results in a 90% speedup during training and 87-91% speedup for inference, with negligible impact on accuracy. Our analysis reveals important areas of future work.
CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense
Michael Chen | Mike D’Arcy | Alisa Liu | Jared Fernandez | Doug Downey
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP
Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation dataset for testing common sense. CODAH forms a challenging extension to the recently-proposed SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video. To produce a more difficult dataset, we introduce a novel procedure for question acquisition in which workers author questions designed to target weaknesses of state-of-the-art neural question answering systems. Workers are rewarded for submissions that models fail to answer correctly both before and after fine-tuning (in cross-validation). We create 2.8k questions via this procedure and evaluate the performance of multiple state-of-the-art question answering systems on our dataset. We observe a significant gap between human performance, which is 95.3%, and the performance of the best baseline accuracy of 65.3% by the OpenAI GPT model.