2024
MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
Li Lucy | Tal August | Rose E Wang | Luca Soldaini | Courtney Allison | Kyle Lo
Findings of the Association for Computational Linguistics: EMNLP 2024
To ensure that math curriculum is grade-appropriate and aligns with critical skills or concepts in accordance with educational standards, pedagogical experts can spend months carefully reviewing published math problems. Drawing inspiration from this process, our work presents a novel angle for evaluating language models’ (LMs) mathematical abilities, by investigating whether they can discern skills and concepts enabled by math content. We contribute two datasets: one consisting of 385 fine-grained descriptions of K-12 math skills and concepts, or *standards*, from Achieve the Core (*ATC*), and another of 9.9K math problems labeled with these standards (*MathFish*). We develop two tasks for evaluating LMs’ abilities to assess math problems: (1) verifying whether a problem aligns with a given standard, and (2) tagging a problem with all aligned standards. Working with experienced teachers, we find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways. We also show that LMs often generate problems that do not fully align with standards described in prompts, suggesting the need for careful scrutiny on use cases involving LMs for generating curricular materials. Finally, we categorize problems in GSM8k using math standards, allowing us to better understand why some problems are more difficult to solve for models than others.
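As a rough illustration of the two evaluation tasks described in the abstract (verifying alignment and tagging all aligned standards), the Python sketch below frames them as prompt construction plus a simple set-based F1 for tagging. The function names, prompt wording, and the example standard code are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (not the MathFish code): the two tasks as prompts,
# plus a toy set-based F1 for scoring the tagging task.

def verify_prompt(problem: str, standard: str) -> str:
    """Task 1: verification -- does this problem align with the given standard?"""
    return (
        f"Standard: {standard}\n"
        f"Problem: {problem}\n"
        "Does the problem align with the standard? Answer yes or no."
    )

def tag_prompt(problem: str, standards: list[str]) -> str:
    """Task 2: tagging -- list every standard the problem aligns with."""
    listed = "\n".join(f"- {s}" for s in standards)
    return (
        f"Problem: {problem}\n"
        f"Candidate standards:\n{listed}\n"
        "List all standards the problem aligns with."
    )

def tagging_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 between predicted and gold standard labels."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: a perfect tag set scores 1.0 (standard code is just a placeholder).
print(tagging_f1({"3.OA.A.1"}, {"3.OA.A.1"}))  # 1.0
```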
Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations
Rose E Wang | Pawan Wirawarn | Kenny Lam | Omar Khattab | Dorottya Demszky
Findings of the Association for Computational Linguistics: EMNLP 2024
Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, like worksheets or meeting bullets. To provide a framework for studying such conversation structure, we introduce *Problem-Oriented Segmentation & Retrieval (POSR)*, the task of jointly breaking down conversations into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education, where effectively structuring lessons around problems is critical yet difficult. We present *LessonLink*, the first dataset of real-world tutoring lessons, featuring 3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT Math problems. We define and evaluate several joint and independent approaches for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language model (LLM) methods. Our results highlight that modeling POSR as one joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR’s practical impact on downstream education applications, deriving new insights on the language and time use in real-world lesson structures.
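To make the POSR output described above concrete, here is a minimal Python sketch that represents a lesson as per-utterance problem links and collapses contiguous runs into segments; the data structure and names are illustrative assumptions, not the LessonLink code or the paper's methods.

```python
# Illustrative sketch (not the LessonLink code): POSR output as a per-utterance
# problem link, where contiguous runs of the same ID form one segment.
from dataclasses import dataclass

@dataclass
class Segment:
    start: int        # index of first utterance in the segment
    end: int          # index of last utterance (inclusive)
    problem_id: str   # reference item (e.g., an SAT Math problem) linked to the segment

def segments_from_links(links: list[str]) -> list[Segment]:
    """Collapse per-utterance problem links into contiguous segments."""
    segments: list[Segment] = []
    start = 0
    for i in range(1, len(links) + 1):
        if i == len(links) or links[i] != links[start]:
            segments.append(Segment(start, i - 1, links[start]))
            start = i
    return segments

# Example: a six-utterance lesson that covers two problems and returns to the first.
links = ["P1", "P1", "P1", "P2", "P2", "P1"]
for seg in segments_from_links(links):
    print(seg)
```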