Nico Daheim


2024

pdf bib
Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors
Nico Daheim | Jakub Macina | Manu Kapur | Iryna Gurevych | Mrinmaya Sachan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) offer many opportunities to scale high-quality personalized tutoring. A promising approach is to build dialog tutoring models to scaffold students’ problem-solving. However, even though existing models perform well in solving reasoning questions, they can struggle to precisely detect student’s errors and tailor their feedback to these errors. Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions and show how grounding to such verification improves the overall quality of tutor response generation. We collect a dataset of 1,002 stepwise math reasoning chains with the first error step annotated by teachers. We show empirically that finding the mistake in a student solution is challenging for current models. We propose and evaluate several verifiers for detecting these errors. Using both automatic and human evaluation we show that the student solution verifiers steer the generation model towards highly targeted responses to student error which are more often correct with less hallucinations compared to existing baselines. The benchmark dataset and code will be released openly.

pdf bib
Book2Dial: Generating Teacher Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots
Junling Wang | Jakub Macina | Nico Daheim | Sankalan Pal Chowdhury | Mrinmaya Sachan
Findings of the Association for Computational Linguistics: ACL 2024

Educational chatbots are a promising tool for assisting student learning. However, the development of effective chatbots in education has been challenging, as high-quality data is seldom available in this domain. In this paper, we propose a framework for generating synthetic teacher-student interactions grounded in a set of textbooks. Our approaches capture a key aspect of learning interactions where curious students with partial knowledge interactively ask teachers questions about the material in the textbook. We highlight various quality criteria that such dialogues must fulfill and compare several approaches relying on either prompting or finetuning large language models according to these criteria. We use the synthetic dialogues to train educational chatbots and show the benefits of further fine-tuning in educational domains. However, careful human evaluation shows that our best data synthesis method still suffers from hallucinations and tends to reiterate information from previous conversations. Our findings offer insights for future efforts in synthesizing conversational data that strikes a balance between size and quality. We will open-source our data and code.

pdf bib
Elastic Weight Removal for Faithful and Abstractive Dialogue Generation
Nico Daheim | Nouha Dziri | Mrinmaya Sachan | Iryna Gurevych | Edoardo Ponti
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Generating factual responses is a crucial requirement for dialogue systems. To promotemore factual responses, a common strategyis to ground their responses in relevant documents that inform response generation. However, common dialogue models still often hallucinate information that was not containedin these documents and is therefore unfaithful. In this work, we propose to alleviate suchhallucinations by ‘subtracting’ the parametersof a model trained to hallucinate from a dialogue response generation model in order to‘negate’ the contribution of such hallucinatedexamples from it. Extensive automatic and human evaluation shows favourable results whencompared to state-of-the-art methods that combine the distributions of multiple models, suchas DExperts (Liu et al., 2021), and others thatchange the training procedure, such as Quark(Lu et al., 2022a). Finally, we show how wecan not only reduce hallucinations but also discourage extractive responses, which are oftena consequence of reducing hallucinations byencouraging copy-pasting of document spans.We publicly release our code for reproducibilityand facilitating further research.

2023

pdf bib
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems
Jakub Macina | Nico Daheim | Sankalan Chowdhury | Tanmay Sinha | Manu Kapur | Iryna Gurevych | Mrinmaya Sachan
Findings of the Association for Computational Linguistics: EMNLP 2023

While automatic dialogue tutors hold great potential in making education personalized and more accessible, research on such systems has been hampered by a lack of sufficiently large and high-quality datasets. Collecting such datasets remains challenging, as recording tutoring sessions raises privacy concerns and crowdsourcing leads to insufficient data quality. To address this, we propose a framework to generate such dialogues by pairing human teachers with a Large Language Model (LLM) prompted to represent common student errors. We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues grounded in multi-step math reasoning problems. While models like GPT-3 are good problem solvers, they fail at tutoring because they generate factually incorrect feedback or are prone to revealing solutions to students too early. To overcome this, we let teachers provide learning opportunities to students by guiding them using various scaffolding questions according to a taxonomy of teacher moves. We demonstrate MathDial and its extensive annotations can be used to finetune models to be more effective tutors (and not just solvers). We confirm this by automatic and human evaluation, notably in an interactive setting that measures the trade-off between student solving success and telling solutions. The dataset is released publicly.

pdf bib
Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Vilém Zouhar | Shehzaad Dhuliawala | Wangchunshu Zhou | Nico Daheim | Tom Kocmi | Yuchen Eleanor Jiang | Mrinmaya Sachan
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME) where one predicts the automated metric scores also without the reference. We show that even without access to the reference, our model can estimate automated metrics (ρ = 60% for BLEU, ρ = 51% for other metrics) at the sentence-level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better (ρ = 23%) than training for scratch (ρ = 20%).

pdf bib
Opportunities and Challenges in Neural Dialog Tutoring
Jakub Macina | Nico Daheim | Lingzhi Wang | Tanmay Sinha | Manu Kapur | Iryna Gurevych | Mrinmaya Sachan
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Designing dialog tutors has been challenging as it involves modeling the diverse and complex pedagogical strategies employed by human tutors. Although there have been significant recent advances in neural conversational systems using large language models and growth in available dialog corpora, dialog tutoring has largely remained unaffected by these advances. In this paper, we rigorously analyze various generative language models on two dialog tutoring datasets for language learning using automatic and human evaluations to understand the new opportunities brought by these advances as well as the challenges we must overcome to build models that would be usable in real educational settings. We find that although current approaches can model tutoring in constrained learning scenarios when the number of concepts to be taught and possible teacher strategies are small, they perform poorly in less constrained scenarios. Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring, which measures learning opportunities for students and how engaging the dialog is. To understand the behavior of our models in a real tutoring setting, we conduct a user study using expert annotators and find a significantly large number of model reasoning errors in 45% of conversations. Finally, we connect our findings to outline future work.

2022

pdf bib
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina Mcmillan-major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez Beltrachini | Leonardo F . R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondrej Dusek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.

pdf bib
Controllable Factuality in Document-Grounded Dialog Systems Using a Noisy Channel Model
Nico Daheim | David Thulke | Christian Dugast | Hermann Ney
Findings of the Association for Computational Linguistics: EMNLP 2022

In this work, we present a model for document-grounded response generation in dialog that is decomposed into two components according to Bayes’ theorem.One component is a traditional ungrounded response generation model and the other component models the reconstruction of the grounding document based on the dialog context and generated response.We propose different approximate decoding schemes and evaluate our approach on multiple open-domain and task-oriented document-grounded dialog datasets.Our experiments show that the model is more factual in terms of automatic factuality metrics than the baseline model.Furthermore, we outline how introducing scaling factors between the components allows for controlling the tradeoff between factuality and fluency in the model output.Finally, we compare our approach to a recently proposed method to control factuality in grounded dialog, CTRL (Rashkin et al., 2021), and show that both approaches can be combined to achieve additional improvements.

2021

pdf bib
Cascaded Span Extraction and Response Generation for Document-Grounded Dialog
Nico Daheim | David Thulke | Christian Dugast | Hermann Ney
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

This paper summarizes our entries to both subtasks of the first DialDoc shared task which focuses on the agent response prediction task in goal-oriented document-grounded dialogs. The task is split into two subtasks: predicting a span in a document that grounds an agent turn and generating an agent response based on a dialog and grounding document. In the first subtask, we restrict the set of valid spans to the ones defined in the dataset, use a biaffine classifier to model spans, and finally use an ensemble of different models. For the second sub-task, we use a cascaded model which grounds the response prediction on the predicted span instead of the full document. With these approaches, we obtain significant improvements in both subtasks compared to the baseline.
Search
Co-authors