Evelina Bakhturina

2025

pdf bib abs
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
Grigor Nalbandyan | Rima Shahbazyan | Evelina Bakhturina
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Typical evaluations of Large Language Models (LLMs) report a single metric per dataset, often representing the model’s best-case performance under carefully selected settings. Unfortunately, this approach overlooks model robustness and reliability in real-world applications. For instance, simple paraphrasing of prompts on the MMLU-Pro dataset causes accuracy fluctuations of up to 10%, while reordering answer choices in the AGIEval dataset results in accuracy differences of up to 6.1%. While some studies discuss issues with LLM robustness, there is no unified or centralized framework for evaluating the robustness of language models. To address this gap and consolidate existing research on model robustness, we present SCORE (Systematic COnsistency and Robustness Evaluation), a comprehensive framework for non-adversarial evaluation of LLMs. The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency. We will make the code publicly available to facilitate further development and research.

2023

pdf bib abs
NVIDIA NeMo Offline Speech Translation Systems for IWSLT 2023
Oleksii Hrinchuk | Vladimir Bataev | Evelina Bakhturina | Boris Ginsburg
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper provides an overview of NVIDIA NeMo’s speech translation systems for the IWSLT 2023 Offline Speech Translation Task. This year, we focused on end-to-end system which capitalizes on pre-trained models and synthetic data to mitigate the problem of direct speech translation data scarcity. When trained on IWSLT 2022 constrained data, our best En->De end-to-end model achieves the average score of 31 BLEU on 7 test sets from IWSLT 2010-2020 which improves over our last year cascade (28.4) and end-to-end (25.7) submissions. When trained on IWSLT 2023 constrained data, the average score drops to 29.5 BLEU.

2020

There has been an influx of biomedical domain-specific language models, showing language models pre-trained on biomedical text perform better on biomedical domain benchmarks than those trained on general domain text corpora such as Wikipedia and Books. Yet, most works do not study the factors affecting each domain language application deeply. Additionally, the study of model size on domain-specific models has been mostly missing. We empirically study and evaluate several factors that can affect performance on domain language applications, such as the sub-word vocabulary set, model size, pre-training corpus, and domain transfer. We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus, contributing to our understanding of domain language model applications. We demonstrate noticeable improvements over the previous state-of-the-art (SOTA) on standard biomedical NLP benchmarks of question answering, named entity recognition, and relation extraction. Code and checkpoints to reproduce our experiments are available at [github.com/NVIDIA/NeMo].

Co-authors

Venues

Fix data