Nitin Madnani - ACL Anthology

Nitin Madnani

2025

Span Labeling with Large Language Models: Shell vs. Meat
Phoebe Mulcaire | Nitin Madnani
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

We present a method for labeling spans of text with large language models (LLMs) and apply it to the task of identifying shell language: language that plays a structural or connective role without constituting the main content of a text. We compare several recent LLMs by evaluating their “annotations” against a small human-curated test set, and train a smaller supervised model on thousands of LLM-annotated examples. The described method enables workflows that can learn complex or nuanced linguistic phenomena without tedious, large-scale hand-annotation of training data or specialized feature engineering.

2023

Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch

Beyond the Repo: A Case Study on Open Source Integration with GECToR
Sanjna Kashyap | Zhaoyang Xie | Kenneth Steimel | Nitin Madnani
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

We present a case study describing our efforts to integrate the open source GECToR code and models into our production NLP pipeline that powers many of Educational Testing Service’s products and prototypes. The paper’s contributions include a discussion of the issues we encountered during integration and our solutions, the overarching lessons we learned about integrating open source projects, and, last but not least, the open source contributions we made as part of the journey.

2022

Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch

2021

Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Andrea Horbach | Ekaterina Kochmar | Ronja Laarmann-Quante | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch

2020

Automated Evaluation of Writing – 50 Years and Counting
Beata Beigman Klebanov | Nitin Madnani
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this theme paper, we focus on Automated Writing Evaluation (AWE), using Ellis Page’s seminal 1966 paper to frame the presentation. We discuss some of the current frontiers in the field and offer some thoughts on the emergent uses of this technology.

Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch

Using PRMSE to evaluate automated scoring systems in the presence of label noise
Anastassia Loukina | Nitin Madnani | Aoife Cahill | Lili Yao | Matthew S. Johnson | Brian Riordan | Daniel F. McCaffrey
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

The effect of noisy labels on the performance of NLP systems has been studied extensively for system training. In this paper, we focus on the effect that noisy labels have on system evaluation. Using automated scoring as an example, we demonstrate that the quality of human ratings used for system evaluation has a substantial impact on traditional performance metrics, making it impossible to compare system evaluations on labels of different quality. We propose that a new metric, PRMSE, developed within the educational measurement community, can help address this issue, and provide practical guidelines on using PRMSE.

User-centered & Robust NLP OSS: Lessons Learned from Developing & Maintaining RSMTool
Nitin Madnani | Anastassia Loukina
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

For the last 5 years, we have developed and maintained RSMTool – an open-source tool for evaluating NLP systems that automatically score written and spoken responses. RSMTool is designed to be cross-disciplinary, borrowing heavily from NLP, machine learning, and educational measurement. Its cross-disciplinary nature has required us to learn a user-centered development approach in terms of both design and implementation. We share some of these lessons in this paper.

2019

My Turn To Read: An Interleaved E-book Reading Tool for Developing and Struggling Readers
Nitin Madnani | Beata Beigman Klebanov | Anastassia Loukina | Binod Gyawali | Patrick Lange | John Sabatini | Michael Flor
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Literacy is crucial for functioning in modern society. It underpins everything from educational attainment and employment opportunities to health outcomes. We describe My Turn To Read, an app that uses interleaved reading to help developing and struggling readers improve reading skills while reading for meaning and pleasure. We hypothesize that the longer-term impact of the app will be to help users become better, more confident readers with an increased stamina for extended reading. We describe the technology and present preliminary evidence in support of this hypothesis.

Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Helen Yannakoudakis | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Torsten Zesch

The many dimensions of algorithmic fairness in educational applications
Anastassia Loukina | Nitin Madnani | Klaus Zechner
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

The issues of algorithmic fairness and bias have recently featured prominently in many publications, highlighting the fact that training algorithms for maximum performance may often result in predictions that are biased against various groups. Educational applications based on NLP and speech processing technologies often combine multiple complex machine learning algorithms and are thus vulnerable to the same sources of bias as other machine learning systems. Yet such systems can have a high impact on people’s lives, especially when deployed as part of high-stakes tests. In this paper we discuss different definitions of fairness and possible ways to apply them to educational applications. We then use simulated and real data to consider how test-takers’ native language backgrounds can affect their automated scores on an English language proficiency assessment. We illustrate that total fairness may not be achievable and that different definitions of fairness may require different solutions.

Toward Automated Content Feedback Generation for Non-native Spontaneous Speech
Su-Youn Yoon | Ching-Ni Hsieh | Klaus Zechner | Matthew Mulholland | Yuan Wang | Nitin Madnani
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this study, we developed an automated algorithm to provide feedback about the specific content of non-native English speakers’ spoken responses. The responses were spontaneous speech, elicited using integrated tasks where the language learners listened to and/or read passages and integrated the core content in their spoken responses. Our models detected the absence of key points considered to be important in a spoken response to a particular test question, based on two different models: (a) a model using word-embedding based content features and (b) a state-of-the-art short response scoring engine using traditional n-gram based features. Both models achieved a substantially improved performance over the majority baseline, and the combination of the two models achieved a significant further improvement. In particular, the models were robust to automated speech recognition (ASR) errors, and performance based on the ASR word hypotheses was comparable to that based on manual transcriptions. The accuracy and F-score of the best model for the questions included in the train set were 0.80 and 0.68, respectively. Finally, we discussed possible approaches to generating targeted feedback about the content of a language learner’s response, based on automatically detected missing key points.

2018

Automated Scoring: Beyond Natural Language Processing
Nitin Madnani | Aoife Cahill
Proceedings of the 27th International Conference on Computational Linguistics

In this position paper, we argue that building operational automated scoring systems is a task that has disciplinary complexity above and beyond standard competitive shared tasks, which usually involve applying the latest machine learning techniques to publicly available data in order to obtain the best accuracy. Automated scoring systems warrant significant cross-discipline collaboration of which natural language processing and machine learning are just two of many important components. Such systems have multiple stakeholders with different but valid perspectives that can often be at odds with each other. Our position is that it is essential for us as NLP researchers to understand and incorporate these perspectives in our research and work towards a mutually satisfactory solution in order to build automated scoring systems that are accurate, fair, unbiased, and useful.

Writing Mentor: Self-Regulated Writing Feedback for Struggling Writers
Nitin Madnani | Jill Burstein | Norbert Elliot | Beata Beigman Klebanov | Diane Napolitano | Slava Andreyev | Maxwell Schwartz
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

Writing Mentor is a free Google Docs add-on designed to provide feedback to struggling writers and help them improve their writing in a self-paced and self-regulated fashion. Writing Mentor uses natural language processing (NLP) methods and resources to generate feedback in terms of features that research into post-secondary struggling writers has classified as developmental (Burstein et al., 2016b). These features span many writing sub-constructs (use of sources, claims, and evidence; topic development; coherence; and knowledge of English conventions). Preliminary analysis indicates that users have a largely positive impression of Writing Mentor in terms of usability and potential impact on their writing.

Atypical Inputs in Educational Applications
Su-Youn Yoon | Aoife Cahill | Anastassia Loukina | Klaus Zechner | Brian Riordan | Nitin Madnani
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

In large-scale educational assessments, the use of automated scoring has recently become quite common. While the majority of student responses can be processed and scored without difficulty, there are a small number of responses that have atypical characteristics that make it difficult for an automated scoring system to assign a correct score. We describe a pipeline that detects and processes these kinds of responses at run-time. We present the most frequent kinds of what are called non-scorable responses along with effective filtering models based on various NLP and speech processing technologies. We give an overview of two operational automated scoring systems (one for essay scoring and one for speech scoring) and describe the filtering models they use. Finally, we present an evaluation and analysis of filtering models used for spoken responses in an assessment of language proficiency.

Second Language Acquisition Modeling
Burr Settles | Chris Brust | Erin Gustafson | Masato Hagiwara | Nitin Madnani
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present the task of second language acquisition (SLA) modeling. Given a history of errors made by learners of a second language, the task is to predict errors that they are likely to make at arbitrary points in the future. We describe a large corpus of more than 7M words produced by more than 6k learners of English, Spanish, and French using Duolingo, a popular online language-learning app. Then we report on the results of a shared task challenge aimed at studying the SLA task via this corpus, which attracted 15 teams and synthesized work from various fields including cognitive science, linguistics, and machine learning.

The ACL Anthology: Current State and Future Directions
Daniel Gildea | Min-Yen Kan | Nitin Madnani | Christoph Teichmann | Martín Villalba
Proceedings of Workshop for NLP Open Source Software (NLP-OSS)

The Association for Computational Linguistics’ Anthology is the open source archive of, and the main source for, the scientific literature of computational linguistics and natural language processing. The ACL Anthology is currently maintained exclusively by community volunteers and has to be available and up-to-date at all times. We first discuss the current, open source approach used to achieve this, and then discuss how the planned use of Docker images will improve the Anthology’s long-term stability. This change will make it easier for researchers to utilize Anthology data for experimentation. We believe the ACL community can directly benefit from the extension-friendly architecture of the Anthology. We end by issuing an open challenge on reviewer matching that we encourage the community to rally around.

2017

Building Better Open-Source Tools to Support Fairness in Automated Scoring
Nitin Madnani | Anastassia Loukina | Alina von Davier | Jill Burstein | Aoife Cahill
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Automated scoring of written and spoken responses is an NLP application that can significantly impact lives especially when deployed as part of high-stakes tests such as the GRE® and the TOEFL®. Ethical considerations require that automated scoring algorithms treat all test-takers fairly. The educational measurement community has done significant research on fairness in assessments and automated scoring systems must incorporate their recommendations. The best way to do that is by making available automated, non-proprietary tools to NLP researchers that directly incorporate these recommendations and generate the analyses needed to help identify and resolve biases in their scoring systems. In this paper, we attempt to provide such a solution.

Speech- and Text-driven Features for Automated Scoring of English Speaking Tasks
Anastassia Loukina | Nitin Madnani | Aoife Cahill
Proceedings of the Workshop on Speech-Centric Natural Language Processing

We consider the automatic scoring of a task for which both the content of the response as well as its spoken fluency are important. We combine features from a text-only content scoring system originally designed for written responses with several categories of acoustic features. Although adding any single category of acoustic features to the text-only system on its own does not significantly improve performance, adding all acoustic features together does yield a small but significant improvement. These results are consistent for responses to open-ended questions and to questions focused on some given source material.

A Large Scale Quantitative Exploration of Modeling Strategies for Content Scoring
Nitin Madnani | Anastassia Loukina | Aoife Cahill
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We explore various supervised learning strategies for automated scoring of content knowledge for a large corpus of 130 different content-based questions spanning four subject areas (Science, Math, English Language Arts, and Social Studies) and containing over 230,000 responses scored by human raters. Based on our analyses, we provide specific recommendations for content scoring. These are based on patterns observed across multiple questions and assessments and are, therefore, likely to generalize to other scenarios and prove useful to the community as automated content scoring becomes more popular in schools and classrooms.

2016

Language Muse: Automated Linguistic Activity Generation for English Language Learners
Nitin Madnani | Jill Burstein | John Sabatini | Kietha Biggers | Slava Andreyev
Proceedings of ACL-2016 System Demonstrations

The Effect of Multiple Grammatical Errors on Processing Non-Native Writing
Courtney Napoles | Aoife Cahill | Nitin Madnani
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

Model Combination for Correcting Preposition Selection Errors
Nitin Madnani | Michael Heilman | Aoife Cahill
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

Automatically Scoring Tests of Proficiency in Music Instruction
Nitin Madnani | Aoife Cahill | Brian Riordan
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

2015

Effective Feature Integration for Automated Short Answer Scoring
Keisuke Sakaguchi | Michael Heilman | Nitin Madnani
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The Impact of Training Data on Automated Short Answer Scoring Performance
Michael Heilman | Nitin Madnani
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

Preliminary Experiments on Crowdsourced Evaluation of Feedback Granularity
Nitin Madnani | Martin Chodorow | Aoife Cahill | Melissa Lopez | Yoko Futagi | Yigal Attali
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

2014

Predicting Grammaticality on an Ordinal Scale
Michael Heilman | Aoife Cahill | Nitin Madnani | Melissa Lopez | Matthew Mulholland | Joel Tetreault
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Content Importance Models for Scoring Writing From Sources
Beata Beigman Klebanov | Nitin Madnani | Jill Burstein | Swapna Somasundaran
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

An Explicit Feedback System for Preposition Errors based on Wikipedia Revisions
Nitin Madnani | Aoife Cahill
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

2013

Robust Systems for Preposition Error Correction Using Wikipedia Revisions
Aoife Cahill | Nitin Madnani | Joel Tetreault | Diane Napolitano
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

ParaQuery: Making Sense of Paraphrase Collections
Lili Kotlerman | Nitin Madnani | Aoife Cahill
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Using Pivot-Based Paraphrasing and Sentiment Profiles to Improve a Subjectivity Lexicon for Essay Data
Beata Beigman Klebanov | Nitin Madnani | Jill Burstein
Transactions of the Association for Computational Linguistics, Volume 1

We demonstrate a method of improving a seed sentiment lexicon developed on essay data by using a pivot-based paraphrasing system for lexical expansion coupled with sentiment profile enrichment using crowdsourcing. Profile enrichment alone yields up to 15% improvement in the accuracy of the seed lexicon on 3-way sentence-level sentiment polarity classification of essay data. Using lexical expansion in addition to sentiment profiles provides a further 7% improvement in performance. Additional experiments show that the proposed method is also effective with other subjectivity lexicons and in a different domain of application (product reviews).

HENRY-CORE: Domain Adaptation and Stacking for Text Similarity
Michael Heilman | Nitin Madnani
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

ETS: Domain Adaptation and Stacking for Short Answer Scoring
Michael Heilman | Nitin Madnani
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

Automated Scoring of a Summary-Writing Task Designed to Measure Reading Comprehension
Nitin Madnani | Jill Burstein | John Sabatini | Tenaha O’Reilly
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Detecting Missing Hyphens in Learner Text
Aoife Cahill | Martin Chodorow | Susanne Wolff | Nitin Madnani
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

2012

Identifying High-Level Organizational Elements in Argumentative Discourse
Nitin Madnani | Michael Heilman | Joel Tetreault | Martin Chodorow
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Re-examining Machine Translation Metrics for Paraphrase Identification
Nitin Madnani | Joel Tetreault | Martin Chodorow
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

ETS: Discriminative Edit Models for Paraphrase Scoring
Michael Heilman | Nitin Madnani
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Exploring Grammatical Error Correction with Not-So-Crummy Machine Translation
Nitin Madnani | Joel Tetreault | Martin Chodorow
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

2011

They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
Nitin Madnani | Martin Chodorow | Joel Tetreault | Alla Rozovskaya
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

The Web is not a PERSON, Berners-Lee is not an ORGANIZATION, and African-Americans are not LOCATIONS: An Analysis of the Performance of Named-Entity Recognition
Robert Krovetz | Paul Deane | Nitin Madnani
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

E-rating Machine Translation
Kristen Parton | Joel Tetreault | Nitin Madnani | Martin Chodorow
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods
Nitin Madnani | Bonnie J. Dorr
Computational Linguistics, Volume 36, Issue 3 - September 2010

Putting the User in the Loop: Interactive Maximal Marginal Relevance for Query-Focused Summarization
Jimmy Lin | Nitin Madnani | Bonnie Dorr
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Measuring Transitivity Using Untrained Annotators
Nitin Madnani | Jordan Boyd-Graber | Philip Resnik
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

2009

Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric
Matthew Snover | Nitin Madnani | Bonnie Dorr | Richard Schwartz
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

Are Multiple Reference Translations Necessary? Investigating the Value of Paraphrased Reference Translations in Parameter Optimization
Nitin Madnani | Philip Resnik | Bonnie J. Dorr | Richard Schwartz
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their corresponding reference translations. However, obtaining reference translations is expensive. In our earlier work (Madnani et al., 2007), we introduced a new full-sentence paraphrase technique, based on English-to-English decoding with an MT system, and demonstrated that the resulting paraphrases can be used to cut the number of human reference translations needed in half. In this paper, we take the idea a step further, asking how far it is possible to get with just a single good reference translation for each item in the development set. Our analysis suggests that it is necessary to invest in four or more human translations in order to significantly improve on a single translation augmented by monolingual paraphrases.

Combining Open-Source with Research to Re-engineer a Hands-on Introductory NLP Course
Nitin Madnani | Bonnie J. Dorr
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

2007

Using Paraphrases for Parameter Tuning in Statistical Machine Translation
Nitin Madnani | Necip Fazil Ayan | Philip Resnik | Bonnie Dorr
Proceedings of the Second Workshop on Statistical Machine Translation

Measuring Variability in Sentence Ordering for News Summarization
Nitin Madnani | Rebecca Passonneau | Necip Fazil Ayan | John Conroy | Bonnie Dorr | Judith Klavans | Dianne O’Leary | Judith Schlesinger
Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 07)

2005

The Hiero Machine Translation System: Extensions, Evaluation, and Analysis
David Chiang | Adam Lopez | Nitin Madnani | Christof Monz | Philip Resnik | Michael Subotin
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Co-authors

Joel Tetreault 7

Beata Beigman Klebanov 5

Ekaterina Kochmar 5

Torsten Zesch 5

Philip Resnik 4

Andrea Horbach 3

Ronja Laarmann-Quante 3

Claudia Leacock 3

Ildikó Pilán 3

Brian Riordan 3

John Sabatini 3

Helen Yannakoudakis 3

Klaus Zechner 3

Slava Andreyev 2

Necip Fazil Ayan 2

Melissa Lopez 2

Matthew Mulholland 2

Diane Napolitano 2

Richard Schwartz 2

Victoria Yaneva 2

Kietha Biggers 1

Jordan Lee Boyd-Graber 1

Norbert Elliot 1

Daniel Gildea 1

Erin Gustafson 1

Binod Gyawali 1

Masato Hagiwara 1

Ching-Ni Hsieh 1

Matthew S. Johnson 1

Sanjna Kashyap 1

Judith L. Klavans 1

Lili Kotlerman 1

Robert Krovetz 1

Patrick L. Lange 1

Daniel F. McCaffrey 1

Christof Monz 1

Phoebe Mulcaire 1

Courtney Napoles 1

Dianne P. O’Leary 1

Tenaha O’Reilly 1

Kristen Parton 1

Rebecca J. Passonneau 1

Alla Rozovskaya 1

Keisuke Sakaguchi 1

Judith D. Schlesinger 1

Maxwell Schwartz 1

Matthew Snover 1

Swapna Somasundaran 1

Kenneth Steimel 1

Michael Subotin 1

Christoph Teichmann 1

Martín Villalba 1

Susanne Wolff 1

Alina von Davier 1

Venues