Oleg Somov

2025

When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
Mikhail Seleznyov | Mikhail Chaichuk | Gleb Ershov | Alexander Panchenko | Elena Tutubalina | Oleg Somov
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 4 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models’ current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: tthttps://github.com/AIRI-Institute/when-punctuation-matters.

pdf bib abs

Bridging the Gap with RedSQL: A Russian Text-to-SQL Benchmark for Domain-Specific Applications
Irina Brodskaya | Elena Tutubalina | Oleg Somov
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

We present the first domain-specific text-to-SQL benchmark in Russian, targeting fields with high operational load where rapid decision-making is critical. The benchmark spans across 9 domains, including healthcare, aviation, and others, and comprises 409 curated query pairs. It is designed to test model generalization under domain shift, introducing challenges such as specialized terminology and complex schema structures. Evaluation of state-of-the-art large language models (LLM) reveals significant performance drop in comparison to open-domain academic benchmarks, highlighting the need for domain-aware approaches in text-to-SQL. The benchmark is available at: https://github.com/BrodskaiaIrina/functional-text2sql-subsets

2024

pdf bib abs

AIRI NLP Team at EHRSQL 2024 Shared Task: T5 and Logistic Regression to the Rescue
Oleg Somov | Alexey Dontsov | Elena Tutubalina
Proceedings of the 6th Clinical Natural Language Processing Workshop

This paper presents a system developed for the Clinical NLP 2024 Shared Task, focusing on reliable text-to-SQL modeling on Electronic Health Records (EHRs). The goal is to create a model that accurately generates SQL queries for answerable questions while avoiding incorrect responses and handling unanswerable queries. Our approach comprises three main components: a query correspondence model, a text-to-SQL model, and an SQL verifier.For the query correspondence model, we trained a logistic regression model using hand-crafted features to distinguish between answerable and unanswerable queries. As for the text-to-SQL model, we utilized T5-3B as a pretrained language model, further fine-tuned on pairs of natural language questions and corresponding SQL queries. Finally, we applied the SQL verifier to inspect the resulting SQL queries.During the evaluation stage of the shared task, our system achieved an accuracy of 68.9 % (metric version without penalty), positioning it at the fifth-place ranking. While our approach did not surpass solutions based on large language models (LMMs) like ChatGPT, it demonstrates the promising potential of domain-specific specialized models that are more resource-efficient. The code is publicly available at https://github.com/runnerup96/EHRSQL-text2sql-solution.

2023

pdf bib abs

Shifted PAUQ: Distribution shift in text-to-SQL
Oleg Somov | Elena Tutubalina
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

Semantic parsing plays a pivotal role in advancing the accessibility of human-computer interaction on a large scale. Spider, a widely recognized dataset for text2SQL, contains a wide range of natural language (NL) questions in English and corresponding SQL queries. Original splits of Spider and its adapted to Russian language and improved version, PAUQ, assume independence and identical distribution of training and testing data (i.i.d split). In this work, we propose a target length split and multilingual i.i.d split to measure compositionality and cross-language generalization. We present experimental results of popular text2SQL models on original, multilingual, and target length splits. We also construct a context-free grammar for the evaluation of compositionality in text2SQL in an out-of-distribution setting. We make the splits publicly available on HuggingFace hub via https://huggingface.co/datasets/composite/pauq

2022

pdf bib abs

PAUQ: Text-to-SQL in Russian
Daria Bakshandaeva | Oleg Somov | Ekaterina Dmitrieva | Vera Davydova | Elena Tutubalina
Findings of the Association for Computational Linguistics: EMNLP 2022

Semantic parsing is an important task that allows to democratize human-computer interaction. One of the most popular text-to-SQL datasets with complex and diverse natural language (NL) questions and SQL queries is Spider. We construct and complement a Spider dataset for Russian, thus creating the first publicly available text-to-SQL dataset for this language. While examining its components - NL questions, SQL queries and databases content - we identify limitations of the existing database structure, fill out missing values for tables and add new requests for underrepresented categories. We select thirty functional test sets with different features that can be used for the evaluation of neural models’ abilities. To conduct the experiments, we adapt baseline architectures RAT-SQL and BRIDGE and provide in-depth query component analysis. On the target language, both models demonstrate strong results with monolingual training and improved accuracy in multilingual scenario. In this paper, we also study trade-offs between machine-translated and manually-created NL queries. At present, Russian text-to-SQL is lacking in datasets as well as trained models, and we view this work as an important step towards filling this gap.

Co-authors

Ekaterina Dmitrieva 1

Alexey Dontsov 1

Gleb Ershov 1

Alexander Panchenko 1

Mikhail Seleznyov 1

Venues

Fix author