Paul Kantor

Also published as: Paul B. Kantor


2024

MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
Wentian Wang | Sarthak Jain | Paul Kantor | Jacob Feldman | Lazaros Gallos | Hao Wang
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP

We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that “truly” understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The replaced term could appear in the question, in the answers, or in both. Despite the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. MMLU-SR provides a rigorous test of true model comprehension and poses a challenge to the broader scientific community.
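The term replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `replace_term`, the "Suppose ... means ..." wording, and the sample question are all assumptions made up for illustration, and the sample is not drawn from the dataset.

```python
import re


def replace_term(text: str, term: str, dummy: str, definition: str) -> str:
    """Swap whole-word occurrences of `term` for `dummy` and prepend a definition.

    Hypothetical helper: the prompt wording is an assumption, not the
    format used in MMLU-SR.
    """
    # Match the key term as a whole word, ignoring case.
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    # Use a function as the replacement so `dummy` is inserted literally.
    swapped = pattern.sub(lambda _: dummy, text)
    return f'Suppose "{dummy}" means {definition}. {swapped}'


# Made-up example question (not taken from the benchmark).
question = "Which organelle carries out photosynthesis?"
modified = replace_term(
    question,
    term="photosynthesis",
    dummy="Dummy",
    definition="the process by which plants convert light into chemical energy",
)
print(modified)
```

The same helper could be applied to the answer options, or to both question and answers, to mirror the three settings the abstract mentions.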

2011

Automatic Assessment of Coverage Quality in Intelligence Reports
Samuel Brody | Paul Kantor
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2009

Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts
Ciprian Chelba | Paul Kantor | Brian Roark
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

2006

User-Centered Evaluation of Interactive Question Answering Systems
Diane Kelly | Paul Kantor | Emile Morse | Jean Scholtz | Ying Sun
Proceedings of the Interactive Question Answering Workshop at HLT-NAACL 2006

2004

HITIQA: Scenario Based Question Answering
Sharon Small | Tomek Strzalkowski | Ting Liu | Sean Ryan | Robert Salkin | Nobuyuki Shimizu | Paul Kantor | Diane Kelly | Robert Rittman | Nina Wacholder | Boris Yamrom
Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004

HITIQA: Towards Analytical Question Answering
Sharon Small | Tomek Strzalkowski | Ting Liu | Sean Ryan | Robert Salkin | Nobuyuki Shimizu | Paul Kantor | Diane Kelly | Robert Rittman | Nina Wacholder
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Designing a Realistic Evaluation of an End-to-end Interactive Question Answering System
Nina Wacholder | Sharon Small | Bing Bai | Diane Kelly | Robert Rittman | Sean Ryan | Robert Salkin | Peng Song | Ying Sun | Ting Liu | Paul Kantor | Tomek Strzalkowski
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

Automatically Predicting Information Quality in News Documents
Rong Tang | Kwong Bor Ng | Tomek Strzalkowski | Paul B. Kantor
Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers