Beyza Ermis


2023

pdf bib
On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research
Luiza Pozzobon | Beyza Ermis | Patrick Lewis | Sara Hooker
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Perception of toxicity evolves over time and often differs between geographies and cultural backgrounds. Similarly, black-box commercially available APIs for detecting toxicity, such as the Perspective API, are not static, but frequently retrained to address any unattended weaknesses and biases. We evaluate the implications of these changes on the reproducibility of findings that compare the relative merits of models and methods that aim to curb toxicity. Our findings suggest that research that relied on inherited automatic toxicity scores to compare models and techniques may have resulted in inaccurate findings. Rescoring all models from HELM, a widely respected living benchmark, for toxicity with the recent version of the API led to a different ranking of widely used foundation models. We suggest caution in applying apples-to-apples comparisons between studies and call for a more structured approach to evaluating toxicity over time.

pdf bib
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation
Meriem Boubdir | Edward Kim | Beyza Ermis | Sara Hooker | Marzieh Fadaee
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

In Natural Language Processing (NLP), the Elo rating system, well-established for ranking dynamic competitors in games like chess, has seen increasing adoption for evaluating Large Language Models (LLMs) through “A vs B” paired comparisons. However, while popular, the system’s suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. Our study investigates the sensitivity and reproducibility of Elo scores for LLMs, integrating both synthetic and human feedback. We show that Elo ratings for LLMs stabilize with 100 or more comparison permutations. A lower K-factor is preferable for closely matched models, whereas a higher K-factor better distinguishes models with clear performance differences. We also report that transitivity (A B and B C implies A C) does not consistently hold, particularly when models demonstrate similar performance. Our empirical findings provide guidelines for more reliable LLM evaluation.

pdf bib
Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models
Luiza Pozzobon | Beyza Ermis | Patrick Lewis | Sara Hooker
Findings of the Association for Computational Linguistics: EMNLP 2023

Considerable effort has been dedicated to mitigating toxicity, but existing methods often require drastic modifications to model parameters or the use of computationally intensive auxiliary models. Furthermore, previous approaches have often neglected the crucial factor of language’s evolving nature over time. In this work, we present a comprehensive perspective on toxicity mitigation that takes into account its changing nature. We introduce Goodtriever, a flexible methodology that matches the current state-of-the-art toxicity mitigation while achieving 43% relative latency reduction during inference and being more computationally efficient. By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation. Our research advocates for an increased focus on adaptable mitigation techniques, which better reflect the data drift models face when deployed in the wild.