2023
Evaluation for Change
Rishi Bommasani
Findings of the Association for Computational Linguistics: ACL 2023
Evaluation is the central means for assessing, understanding, and communicating about NLP models. In this position paper, we argue evaluation should be more than that: it is a force for driving change, carrying a sociological and political character beyond its technical dimensions. As a force, evaluation’s power arises from its adoption: under our view, evaluation succeeds when it achieves the desired change in the field. Further, by framing evaluation as a force, we consider how it competes with other forces. Under our analysis, we conjecture that the current trajectory of NLP suggests evaluation’s power is waning, in spite of its potential for realizing more pluralistic ambitions in the field. We conclude by discussing the legitimacy of this power, who acquires it, and how it is distributed. Ultimately, we hope the research community will more aggressively harness evaluation to drive change.
2020
Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings
Rishi Bommasani | Kelly Davis | Claire Cardie
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Contextualized representations (e.g. ELMo, BERT) have become the default pretrained representations for downstream NLP applications. In some settings, this transition has rendered their static embedding predecessors (e.g. Word2Vec, GloVe) obsolete. As a side effect, we observe that older interpretability methods for static embeddings — while more diverse and mature than those available for their dynamic counterparts — are underutilized in studying newer contextualized representations. Consequently, we introduce simple and fully general methods for converting from contextualized representations to static lookup-table embeddings, which we apply to 5 popular pretrained models and 9 sets of pretrained weights. Our analysis of the resulting static embeddings notably reveals that pooling over many contexts significantly improves representational quality under intrinsic evaluation. Complementary to analyzing representational quality, we consider social biases encoded in pretrained representations with respect to gender, race/ethnicity, and religion, and find that bias is encoded disparately across pretrained models and internal layers, even for models with the same training data. Concerningly, we find dramatic inconsistencies between social bias estimators for word embeddings.
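As a sketch of the core idea, the snippet below derives a static vector for a word by mean-pooling its subword representations within each context and then mean-pooling across contexts; the abstract reports that pooling over many contexts improves representational quality. The HuggingFace interface, the model checkpoint, and the naive subword-matching logic are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: distill a static embedding from a contextualized model by
# pooling over contexts. Model choice and pooling (mean/mean) are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def static_embedding(word: str, contexts: list[str]) -> torch.Tensor:
    """Mean-pool a word's subword vectors per context, then across contexts."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    pooled = []
    for sentence in contexts:
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        ids = enc["input_ids"][0].tolist()
        # Naive subword match to locate the word's span in this context.
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i : i + len(word_ids)] == word_ids:
                pooled.append(hidden[i : i + len(word_ids)].mean(dim=0))
                break
    return torch.stack(pooled).mean(dim=0)  # context pooling

vec = static_embedding("bank", ["She sat by the bank.", "The bank approved it."])
```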
Intrinsic Evaluation of Summarization Datasets
Rishi Bommasani | Claire Cardie
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
High-quality data forms the bedrock for building meaningful statistical models in NLP. Consequently, data quality must be evaluated either during dataset construction or post hoc. Almost all popular summarization datasets are drawn from natural sources and do not come with inherent quality assurance guarantees. In spite of this, data quality has gone largely unquestioned for many of these recent datasets. We perform the first large-scale evaluation of summarization datasets by introducing 5 intrinsic metrics and applying them to 10 popular datasets. We find that data usage in recent summarization research is sometimes inconsistent with the underlying properties of the data. Further, we discover that our metrics can serve the additional purpose of being inexpensive heuristics for detecting generically low-quality examples.
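The abstract does not enumerate the 5 metrics, so the snippet below should be read only as a hypothetical illustration of what an inexpensive intrinsic metric can look like: compression, the word-level length ratio between a document and its summary.

```python
# Hypothetical illustration of an inexpensive intrinsic metric (not
# necessarily one of the paper's five): word-level compression.
def compression(document: str, summary: str) -> float:
    """Ratio of document length to summary length; higher = more condensed."""
    return len(document.split()) / max(len(summary.split()), 1)

def mean_compression(pairs: list[tuple[str, str]]) -> float:
    """Average compression over (document, summary) pairs in a dataset."""
    return sum(compression(d, s) for d, s in pairs) / len(pairs)

# A pair with compression near 1.0 is a candidate low-quality example:
# the "summary" is barely shorter than its source document.
```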
2019
Long-Distance Dependencies Don’t Have to Be Long: Simplifying through Provably (Approximately) Optimal Permutations
Rishi Bommasani
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Neural models at the sentence level often operate on the constituent words/tokens in a way that encodes the inductive bias of processing the input much as humans do. However, there is no guarantee that the standard ordering of words is computationally efficient or optimal. To help mitigate this, we consider a dependency parse as a proxy for the inter-word dependencies in a sentence and simplify the sentence with respect to combinatorial objectives imposed on the sentence-parse pair. The associated optimization results in permuted sentences that are provably (approximately) optimal with respect to minimizing dependency parse lengths and that are demonstrably simpler. We evaluate our general-purpose permutations within a fine-tuning schema for the downstream task of subjectivity analysis. Our fine-tuned baselines reflect a new state of the art for the SUBJ dataset, and the permutations we introduce lead to further improvements, with a 2.0% increase in classification accuracy (absolute) and a 45% reduction in classification error (relative) over the previous state of the art.
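To make the combinatorial objective concrete, here is a minimal sketch, under an assumed formulation rather than the paper's code, of the quantity such permutations minimize: total dependency length, the summed distance between each dependent and its head under a candidate ordering.

```python
# Minimal sketch (assumed formulation, not the paper's code): total
# dependency length of a parse under a candidate permutation of the tokens.
def total_dependency_length(heads: list[int], perm: list[int]) -> int:
    """heads[d] is the head position of token d (-1 for the root);
    perm lists the original positions in their new order."""
    pos = {orig: new for new, orig in enumerate(perm)}
    return sum(
        abs(pos[h] - pos[d])
        for d, h in enumerate(heads)
        if h != -1  # the root has no incoming arc
    )

# "the cat sat": heads = [1, 2, -1]; the identity order gives length 1 + 1 = 2.
# A search over permutations keeps orderings that lower this total.
print(total_dependency_length([1, 2, -1], [0, 1, 2]))  # 2
```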
SPARSE: Structured Prediction using Argument-Relative Structured Encoding
Rishi Bommasani | Arzoo Katiyar | Claire Cardie
Proceedings of the Third Workshop on Structured Prediction for NLP
We propose structured encoding as a novel approach to learning representations for relations and events in neural structured prediction. Our approach explicitly leverages the structure of available relation and event metadata to generate these representations, which are parameterized by both the attribute structure of the metadata and the learned representations of the arguments of the relations and events. We consider affine, biaffine, and recurrent operators for building hierarchical representations and modeling underlying features. We apply our approach to the second-order structured prediction task studied in the 2016/2017 Belief and Sentiment analysis evaluations (BeSt): given a document and its entities, relations, and events (including metadata and mentions), determine the sentiment of each entity towards every relation and event in the document. Without task-specific knowledge sources or domain engineering, we significantly improve over systems and baselines that neglect the available metadata or its hierarchical structure. We observe across-the-board improvements on the BeSt 2016/2017 sentiment analysis task of at least 2.3 points (absolute) and 10.6% (relative) in F-measure over the previous state of the art.
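As an illustration of one of the operator families named above, the sketch below builds a biaffine encoding of two argument vectors: a bilinear term per output channel plus an affine term over the concatenation. Dimensions, initialization, and names are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a biaffine operator over two argument representations;
# shapes and initialization are illustrative assumptions.
import torch
import torch.nn as nn

class Biaffine(nn.Module):
    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_dim, dim, dim) * 0.01)  # bilinear
        self.W = nn.Linear(2 * dim, out_dim)  # affine term over [a; b]

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        bilinear = torch.einsum("d,kde,e->k", a, self.U, b)  # a^T U_k b per k
        return bilinear + self.W(torch.cat([a, b], dim=-1))

# Usage: encode a relation between two 128-dim argument vectors.
encoder = Biaffine(dim=128, out_dim=64)
relation_vec = encoder(torch.randn(128), torch.randn(128))  # shape (64,)
```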