Brihi Joshi


2023

Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-text Rationales
Brihi Joshi | Ziyi Liu | Sahana Ramnath | Aaron Chan | Zhewei Tong | Shaoliang Nie | Qifan Wang | Yejin Choi | Xiang Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Among the remarkable emergent capabilities of large language models (LMs) is free-text rationalization; beyond a certain scale, large LMs are capable of generating seemingly useful rationalizations, which, in turn, can dramatically enhance their performance on leaderboards. This phenomenon raises a question: can machine-generated rationales also be useful to humans, especially when lay humans try to answer questions based on those machine rationales? We observe that the human utility of existing rationales is far from satisfactory and is expensive to estimate with human studies. Existing metrics, such as the task performance of the LM generating the rationales or the similarity between generated and gold rationales, are not good indicators of human utility. While we observe that certain properties of rationales, such as conciseness and novelty, correlate with their human utility, estimating them without human involvement is challenging. We show that, by estimating a rationale’s helpfulness in answering similar unseen instances, we can measure its human utility to a better extent. We translate this finding into Gen-U, a proposed automated score that can help improve LMs’ ability to generate rationales with better human utility while maintaining most of their task performance. Lastly, we release all code and data collected in this project.
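The abstract does not spell out how such a utility estimate is computed, but the core idea (score a rationale by whether it helps answer similar unseen instances, rather than the instance it was written for) can be illustrated with a minimal, hypothetical sketch. The `Instance` class and the `answer_fn` callable below are placeholders, not the paper's actual implementation of Gen-U.

```python
# Hypothetical sketch of a rationale-utility estimate in the spirit of Gen-U:
# a rationale is credited when it flips a simulated answerer from wrong to
# right on *similar unseen* instances. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    question: str
    choices: List[str]
    gold: str  # gold answer string


def rationale_utility(
    rationale: str,
    similar_unseen: List[Instance],
    answer_fn: Callable[[str, List[str]], str],
) -> float:
    """Fraction of similar unseen instances answered correctly only when the
    rationale is prepended to the question."""
    helped = 0
    for inst in similar_unseen:
        plain = answer_fn(inst.question, inst.choices)
        with_rationale = answer_fn(f"{rationale}\n{inst.question}", inst.choices)
        if with_rationale == inst.gold and plain != inst.gold:
            helped += 1
    return helped / max(len(similar_unseen), 1)
```

In practice `answer_fn` would wrap a lay-human study or a proxy question-answering model; the sketch only fixes the counting logic.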

XMD: An End-to-End Framework for Interactive Explanation-Based Debugging of NLP Models
Dong-Ho Lee | Akshen Kadakia | Brihi Joshi | Aaron Chan | Ziyi Liu | Kiran Narahari | Takashi Shibuya | Ryosuke Mitani | Toshiyuki Sekiya | Jay Pujara | Xiang Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

NLP models are susceptible to learning spurious biases (i.e., bugs) that work on some datasets but do not properly reflect the underlying task. Explanation-based model debugging aims to resolve spurious biases by showing human users explanations of model behavior, asking users to give feedback on the behavior, then using the feedback to update the model. While existing model debugging methods have shown promise, their prototype-level implementations provide limited practical utility. Thus, we propose XMD: the first open-source, end-to-end framework for explanation-based model debugging. Given task- or instance-level explanations, users can flexibly provide various forms of feedback via an intuitive, web-based UI. After receiving user feedback, XMD automatically updates the model in real time by regularizing the model so that its explanations align with the user feedback. The new model can then be easily deployed into real-world applications via Hugging Face. Using XMD, we can improve the model’s OOD performance on text classification tasks by up to 18%.
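The update step described above (regularize the model so its explanation agrees with user feedback) can be sketched as a loss term that penalizes saliency on tokens a user marked as spurious. The sketch below uses gradient-times-input saliency and an HF-style `inputs_embeds`/`logits` interface as assumptions; it is an illustration of the general technique, not XMD's actual code.

```python
# Illustrative explanation-regularization step (not XMD's implementation):
# the usual task loss plus a penalty on saliency over user-flagged tokens.
import torch
import torch.nn.functional as F


def explanation_debug_loss(model, input_embeds, labels, irrelevant_mask, lam=1.0):
    """Combined loss for one debugging update.
    irrelevant_mask: (batch, seq) tensor, 1 where the user flagged a token as spurious."""
    input_embeds = input_embeds.clone().requires_grad_(True)
    logits = model(inputs_embeds=input_embeds).logits  # assumes an HF-style classifier
    task_loss = F.cross_entropy(logits, labels)
    # Gradient-x-input saliency as a simple stand-in for the model's explanation.
    grads = torch.autograd.grad(task_loss, input_embeds, create_graph=True)[0]
    saliency = (grads * input_embeds).sum(-1).abs()
    # Penalize attention to tokens the user said should not matter.
    expl_loss = (saliency * irrelevant_mask).sum() / irrelevant_mask.sum().clamp(min=1)
    return task_loss + lam * expl_loss
```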

2022

ER-Test: Evaluating Explanation Regularization Methods for Language Models
Brihi Joshi | Aaron Chan | Ziyi Liu | Shaoliang Nie | Maziar Sanjabi | Hamed Firooz | Xiang Ren
Findings of the Association for Computational Linguistics: EMNLP 2022

By explaining how humans would solve a given task, human rationales can provide a strong learning signal for neural language models (NLMs). Explanation regularization (ER) aims to improve NLM generalization by pushing the NLM’s machine rationales (Which input tokens did the NLM focus on?) to align with human rationales (Which input tokens would humans focus on?). Though prior works primarily study ER via in-distribution (ID) evaluation, out-of-distribution (OOD) generalization is often more critical in real-world scenarios, yet ER’s effect on OOD generalization has been underexplored. In this paper, we introduce ER-Test, a framework for evaluating ER models’ OOD generalization along three dimensions: unseen datasets, contrast set tests, and functional tests. Using ER-Test, we comprehensively analyze how ER models’ OOD generalization varies with the rationale alignment criterion (loss function), human rationale type (instance-level vs. task-level), the number and choice of rationale-annotated instances, and the time budget for rationale annotation. Across two tasks and six datasets, we show that ER has little impact on ID performance but yields large OOD performance gains, with the best ER criterion being task-dependent. Also, ER can improve OOD performance even with task-level or few human rationales. Finally, we find that rationale annotation is more time-efficient than label annotation for improving OOD performance. Our results with ER-Test help demonstrate ER’s utility and establish best practices for using ER effectively.
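To make the "rationale alignment criterion (loss function)" concrete, here are two hypothetical examples of the kind of criterion such a framework might compare; both push a model's token-importance scores toward a binary human rationale mask. The exact losses studied in the paper may differ, and the normalization choices below are assumptions.

```python
# Two illustrative rationale-alignment criteria: each takes machine token
# saliency (batch, seq) and a binary human rationale mask (batch, seq) and
# returns a scalar regularization loss.
import torch
import torch.nn.functional as F


def mse_criterion(machine_saliency, human_mask):
    """Mean squared error between the normalized saliency and the normalized mask."""
    sal = machine_saliency / machine_saliency.sum(-1, keepdim=True).clamp(min=1e-8)
    tgt = human_mask / human_mask.sum(-1, keepdim=True).clamp(min=1e-8)
    return F.mse_loss(sal, tgt)


def kl_criterion(machine_saliency, human_mask):
    """KL divergence from the human rationale distribution to the model's saliency."""
    log_sal = F.log_softmax(machine_saliency, dim=-1)
    tgt = human_mask / human_mask.sum(-1, keepdim=True).clamp(min=1e-8)
    return F.kl_div(log_sal, tgt, reduction="batchmean")
```

Either criterion would be added, with a weighting coefficient, to the usual task loss during training; the abstract's finding that the best criterion is task-dependent is exactly a comparison across choices like these.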

2020

Did You “Read” the Next Episode? Using Textual Cues for Predicting Podcast Popularity
Brihi Joshi | Shravika Mittal | Aditya Chetan
Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA)

The Devil is in the Details: Evaluating Limitations of Transformer-based Methods for Granular Tasks
Brihi Joshi | Neil Shah | Francesco Barbieri | Leonardo Neves
Proceedings of the 28th International Conference on Computational Linguistics

Contextual embeddings derived from transformer-based neural language models have shown state-of-the-art performance for various tasks such as question answering, sentiment analysis, and textual similarity in recent years. Extensive work shows how accurately such models can represent abstract, semantic information present in text. In this expository work, we explore a tangent direction and analyze such models’ performance on tasks that require a more granular level of representation. We focus on the problem of textual similarity from two perspectives: matching documents on a granular level (requiring embeddings to capture fine-grained attributes in the text), and an abstract level (requiring embeddings to capture overall textual semantics). We empirically demonstrate, across two datasets from different domains, that despite high performance in abstract document matching as expected, contextual embeddings are consistently (and at times, vastly) outperformed by simple baselines like TF-IDF for more granular tasks. We then propose a simple but effective method to incorporate TF-IDF into models that use contextual embeddings, achieving relative improvements of up to 36% on granular tasks.
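The abstract does not specify how TF-IDF is incorporated, so the sketch below shows one simple, assumed way to combine the two signals for document matching: blending cosine similarities computed in the TF-IDF space and in the contextual-embedding space. The blending weight `alpha` and the score-level combination are illustrative choices, not necessarily the paper's method.

```python
# Minimal sketch: blend lexical (TF-IDF) and semantic (contextual-embedding)
# similarity for document matching. The combination strategy is an assumption.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def combined_similarity(docs, contextual_embeds, alpha=0.5):
    """docs: list of raw document strings.
    contextual_embeds: (n_docs, dim) array from any sentence encoder.
    Returns a blended (n_docs, n_docs) pairwise similarity matrix."""
    tfidf = TfidfVectorizer().fit_transform(docs)
    sim_tfidf = cosine_similarity(tfidf)            # fine-grained lexical signal
    sim_ctx = cosine_similarity(contextual_embeds)  # abstract semantic signal
    return alpha * sim_tfidf + (1 - alpha) * sim_ctx
```

A higher `alpha` weights the granular lexical evidence more heavily, which is where the abstract reports simple TF-IDF baselines outperforming contextual embeddings alone.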