Sowmya Vajjala

2025

We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.

Test Set Quality in Multilingual LLM Evaluation
Chalamalasetti Kranti | Gabriel Bernier-Colborne | Yvan Gauthier | Sowmya Vajjala
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems

Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models (LLM). However, there is not a lot of attention paid to the quality of the datasets themselves, despite the existence of previous work in identifying errors in even fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages – French and Telugu, identifying several errors in the datasets during the process. We compare the performance difference across several LLMs with the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both the dataset creators as well as consumers on addressing the dataset quality issues.

Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?
Gaurav Kamath | Sowmya Vajjala
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

We explore whether synthetic datasets generated by large language models using a few high quality seed samples are useful for low-resource named entity recognition, considering 11 languages from three language families. Our results suggest that synthetic data created with such seed data is a reasonable choice when there is no available labeled data, and is better than using entirely automatically labeled data. However, a small amount of high-quality data, coupled with cross-lingual transfer from a related language, always offers better performance. Data and code available at: https://github.com/grvkamath/low-resource-syn-ner.

OneNRC@TSAR2025 Shared Task Small Models for Readability Controlled Text Simplification
Sowmya Vajjala
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)

In this system description paper, we describe the team OneNRC’s experiments on readability controlled text simplification, focused on using smaller, quantized language models (<20B). We compare these with one large proprietary model and show that the smaller models offer comparable or even better results in some experimental settings. The approach primarily comprises prompt optimization, agentic workflow, and tool calling. The best results were achieved while using a CEFR proficiency classifier as a verification tool for the language model agent. In terms of comparison with other systems, our submission that used a quantized Gemma3:12B model that ran on a laptop achieved a rank of 9.88 among the submitted systems as per the AUTORANK framework used by the organizers. We hope these results will lead to further exploration of the usefulness of smaller models for text simplification.

2024

Improving Absent Keyphrase Generation with Diversity Heads
Edwin Thomas | Sowmya Vajjala
Findings of the Association for Computational Linguistics: NAACL 2024

Keyphrase Generation (KPG) is the task of automatically generating appropriate keyphrases for a given text, with a wide range of real-world applications such as document indexing and tagging, information retrieval, and text summarization. NLP research makes a distinction between present and absent keyphrases based on whether a keyphrase is directly present as a sequence of words in the document during evaluation. However, present and absent keyphrases are treated together in a text-to-text generation framework during training. We treat present keyphrase extraction as a sequence labeling problem and propose a new absent keyphrase generation model that uses a modified cross-attention layer with additional heads to capture diverse views for the same context encoding in this paper. Our experiments show improvements over the state-of-the-art for four datasets for present keyphrase extraction and five datasets for absent keyphrase generation among the six English datasets we explored, covering long and short documents.

Methods, Applications, and Directions of Learning-to-Rank in NLP Research
Justin Lee | Gabriel Bernier-Colborne | Tegan Maharaj | Sowmya Vajjala
Findings of the Association for Computational Linguistics: NAACL 2024

Learning-to-rank (LTR) algorithms aim to order a set of items according to some criteria. They are at the core of applications such as web search and social media recommendations, and are an area of rapidly increasing interest, with the rise of large language models (LLMs) and the widespread impact of these technologies on society. In this paper, we survey the diverse use cases of LTR methods in natural language processing (NLP) research, looking at previously under-studied aspects such as multilingualism in LTR applications and statistical significance testing for LTR problems. We also consider how large language models are changing the LTR landscape. This survey is aimed at NLP researchers and practitioners interested in understanding the formalisms and best practices regarding the application of LTR approaches in their research.

Keyphrase Generation: Lessons from a Reproducibility Study
Edwin Thomas | Sowmya Vajjala
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Reproducibility studies are treated as means to verify the validity of a scientific method, but what else can we learn from such experiments? We addressed this question taking Keyphrase Generation (KPG) as the use case in this paper, by studying three state-of-the-art KPG models in terms of reproducibility under either the same (same data/model/code) or varied (different training data/model, but same code) conditions, and exploring different ways of comparing KPG models beyond the most commonly used evaluation measures. We drew some conclusions on the state of the art in KPG based on these experiments, and provided guidelines for researchers working on the topic about reporting experimental results in a more comprehensive manner.

Scope Ambiguities in Large Language Models
Gaurav Kamath | Sebastian Schuster | Sowmya Vajjala | Siva Reddy
Transactions of the Association for Computational Linguistics, Volume 12

Sentences containing multiple semantic operators with overlapping scope often create ambiguities in interpretation, known as scope ambiguities. These ambiguities offer rich insights into the interaction between semantic structure and world knowledge in language processing. Despite this, there has been little research into how modern large language models treat them. In this paper, we investigate how different versions of certain autoregressive language models (GPT-2, GPT-3/3.5, Llama 2, and GPT-4) treat scope ambiguous sentences, and compare this with human judgments. We introduce novel datasets that contain a joint total of almost 1,000 unique scope-ambiguous sentences, containing interactions between a range of semantic operators, and annotated for human judgments. Using these datasets, we find evidence that several models (i) are sensitive to the meaning ambiguity in these sentences, in a way that patterns well with human judgments, and (ii) can successfully identify human-preferred readings at a high level of accuracy (over 90% in some cases).

2023

A Multilingual Evaluation of NER Robustness to Adversarial Inputs
Akshay Srinivasan | Sowmya Vajjala
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

Adversarial evaluations of language models typically focus on English alone. In this paper, we performed a multilingual evaluation of Named Entity Recognition (NER) in terms of its robustness to small perturbations in the input. Our results showed the NER models we explored across three languages (English, German and Hindi) are not very robust to such changes, as indicated by the fluctuations in the overall F1 score as well as in a more fine-grained evaluation. With that knowledge, we further explored whether it is possible to improve the existing NER models using a part of the generated adversarial data sets as augmented training data to train a new NER model or as fine-tuning data to adapt an existing NER model. Our results showed that both these approaches improve performance on the original as well as adversarial test sets. While there is no significant difference between the two approaches for English, re-training is significantly better than fine-tuning for German and Hindi.

2022

A Neural Pairwise Ranking Model for Readability Assessment
Justin Lee | Sowmya Vajjala
Findings of the Association for Computational Linguistics: ACL 2022

Automatic Readability Assessment (ARA), the task of assigning a reading level to a text, is traditionally treated as a classification problem in NLP research. In this paper, we propose the first neural, pairwise ranking approach to ARA and compare it with existing classification, regression, and (non-neural) ranking methods. We establish the performance of our approach by conducting experiments with three English, one French and one Spanish datasets. We demonstrate that our approach performs well in monolingual single/cross corpus testing scenarios and achieves a zero-shot cross-lingual ranking accuracy of over 80% for both French and Spanish when trained on English data. Additionally, we also release a new parallel bilingual readability dataset, that could be useful for future research. To our knowledge, this paper proposes the first neural pairwise ranking model for ARA, and shows the first results of cross-lingual, zero-shot evaluation of ARA with neural models.

Trends, Limitations and Open Challenges in Automatic Readability Assessment Research
Sowmya Vajjala
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Readability assessment is the task of evaluating the reading difficulty of a given piece of text. This article takes a closer look at contemporary NLP research on developing computational models for readability assessment, identifying the common approaches used for this task, their shortcomings, and some challenges for the future. Where possible, the survey also connects computational research with insights from related work in other disciplines such as education and psychology.

What do we really know about State of the Art NER?
Sowmya Vajjala | Ramya Balasubramaniam
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Named Entity Recognition (NER) is a well researched NLP task and is widely used in real world NLP scenarios. NER research typically focuses on the creation of new ways of training NER, with relatively less emphasis on resources and evaluation. Further, state of the art (SOTA) NER models, trained on standard datasets, typically report only a single performance measure (F-score) and we don’t really know how well they do for different entity types and genres of text, or how robust they are to new, unseen entities. In this paper, we perform a broad evaluation of NER using a popular dataset, that takes into consideration various text genres and sources constituting the dataset at hand. Additionally, we generate six new adversarial test sets through small perturbations in the original test set, replacing select entities while retaining the context. We also train and test our models on randomly generated train/dev/test splits followed by an experiment where the models are trained on a select set of genres but tested on genres not seen in training. These comprehensive evaluation strategies were performed using three SOTA NER models. Based on our results, we recommend some useful reporting practices for NER researchers, that could help in providing a better understanding of a SOTA model’s performance in future.

2021

Teaching NLP outside Linguistics and Computer Science classrooms: Some challenges and some opportunities
Sowmya Vajjala
Proceedings of the Fifth Workshop on Teaching NLP

NLP’s sphere of influence went much beyond computer science research and the development of software applications in the past decade. We see people using NLP methods in a range of academic disciplines from Asian Studies to Clinical Oncology. We also notice the presence of NLP as a module in most of the data science curricula within and outside of regular university setups. These courses are taken by students from very diverse backgrounds. This paper takes a closer look at some issues related to teaching NLP to these diverse audiences based on my classroom experiences, and identifies some challenges the instructors face, particularly when there is no ecosystem of related courses for the students. In this process, it also identifies a few challenge areas for both NLP researchers and tool developers.

2019

On Understanding the Relation between Expert Annotations of Text Readability and Target Reader Comprehension
Sowmya Vajjala | Ivana Lucic
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Automatic readability assessment aims to ensure that readers read texts that they can comprehend. However, computational models are typically trained on texts created from the perspective of the text writer, not the target reader. There is little experimental research on the relationship between expert annotations of readability, reader’s language proficiency, and different levels of reading comprehension. To address this gap, we conducted a user study in which over 100 participants read texts of different reading levels and answered questions created to test three forms of comprehension. Our results indicate that more than readability annotation or reader proficiency, it is the type of comprehension question asked that shows differences between reader responses: inferential questions were difficult for users of all levels of proficiency across reading levels. The data collected from this study will be released with this paper, which will, for the first time, provide a collection of 45 reader benchmarked texts to evaluate readability assessment systems developed for adult learners of English. It can also potentially be useful for the development of question generation approaches in intelligent tutoring systems research.

Experiments on Non-native Speech Assessment and its Consistency
Ziwei Zhou | Sowmya Vajjala | Seyed Vahid Mirnezami
Proceedings of the 8th Workshop on NLP for Computer Assisted Language Learning

2018

Experiments with Universal CEFR Classification
Sowmya Vajjala | Taraka Rama
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

The Common European Framework of Reference (CEFR) guidelines describe language proficiency of learners on a scale of 6 levels. While the description of CEFR guidelines is generic across languages, the development of automated proficiency classification systems for different languages follows different approaches. In this paper, we explore universal CEFR classification using domain-specific and domain-agnostic, theory-guided as well as data-driven features. We report the results of our preliminary experiments in monolingual, cross-lingual, and multilingual classification with three languages: German, Czech, and Italian. Our results show that both monolingual and multilingual models achieve similar performance, and cross-lingual classification yields lower, but comparable results to monolingual classification.

OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification
Sowmya Vajjala | Ivana Lučić
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper describes the collection and compilation of the OneStopEnglish corpus of texts written at three reading levels, and demonstrates its usefulness through two applications: automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total). The corpus is now freely available under a CC BY-SA 4.0 license and we hope that it will foster further research on the topics of readability assessment and text simplification.

2017

A study of N-gram and Embedding Representations for Native Language Identification
Sowmya Vajjala | Sagnik Banerjee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We report on our experiments with N-gram and embedding based feature representations for Native Language Identification (NLI) as a part of the NLI Shared Task 2017 (team name: NLI-ISU). Our best performing system on the test set for written essays had a macro F1 of 0.8264 and was based on word uni, bi and trigram features. We explored n-grams covering word, character, POS and word-POS mixed representations for this task. For embedding based feature representations, we employed both word and document embeddings. We had a relatively poor performance with all embedding representations compared to n-grams, which could be because of the fact that embeddings capture semantic similarities whereas L1 differences are more stylistic in nature.

A Telugu treebank based on a grammar book
Taraka Rama | Sowmya Vajjala
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2016

Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts
Sowmya Vajjala | Detmar Meurers | Alexander Eitel | Katharina Scheiter
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

Computational approaches to readability assessment are generally built and evaluated using gold standard corpora labeled by publishers or teachers rather than being grounded in observations about human performance. Considering that both the reading process and the outcome can be observed, there is an empirical wealth that could be used to ground computational analysis of text readability. This will also support explicit readability models connecting text complexity and the reader’s language proficiency to the reading process and outcomes. This paper takes a step in this direction by reporting on an experiment to study how the relation between text complexity and reader’s language proficiency affects the reading process and performance outcomes of readers after reading. We modeled the reading process using three eye tracking variables: fixation count, average fixation count, and second pass reading duration. Our models for these variables explained 78.9%, 74% and 67.4% variance, respectively. Performance outcome was modeled through recall and comprehension questions, and these models explained 58.9% and 27.6% of the variance, respectively. While the online models give us a better understanding of the cognitive correlates of reading with text complexity and language proficiency, modeling of the offline measures can be particularly relevant for incorporating user aspects into readability models.
