Mohsen Mesgar - ACL Anthology

Mohsen Mesgar

2026

Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
Lei Tang | Wei Zhou | Mohsen Mesgar
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA) remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.

2025

Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
Wei Zhou | Mohsen Mesgar | Annemarie Friedrich | Heike Adel
Findings of the Association for Computational Linguistics: NAACL 2025

Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrate notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain. The use of closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi Agent Collaboration with Tool use (MACT), a framework that requires neither fine-tuning nor closed-source models. In MACT, a planning agent and a coding agent that also make use of tools collaborate for TQA. MACT outperforms previous SoTA systems on three out of four benchmarks and performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. Our extensive analyses prove the effectiveness of MACT’s multi-agent collaboration in TQA. We release our code publicly.

Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Findings of the Association for Computational Linguistics: ACL 2025

In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and model types from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.

p²-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Table question answering (TQA) focuses on answering questions based on tabular data. Developing TQA systems targets effective interaction with tabular data for tasks such as cell retrieval and data analysis. While recent work has leveraged fine-tuning to improve TQA systems, existing approaches often under-utilize available data and neglect the potential of post-training for further gains. In this work, we introduce p²-TQA, a process-based preference learning framework for TQA post-training. p²-TQA automatically constructs process-based preference data via a table-specific pipeline, eliminating the need for manual or costly data collection. It then optimizes models through contrastive learning on the collected data. Experiments show that p²-TQA effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets with only 8,000 training instances. Furthermore, models enhanced with p²-TQA achieve competitive results against larger, more complex state-of-the-art TQA systems, while maintaining up to five times higher efficiency.

G-MACT at SemEval-2025 Task 8: Exploring Planning and Tool Use in Question Answering over Tabular Data
Wei Zhou | Mohsen Mesgar | Annemarie Friedrich | Heike Adel
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper describes our system submitted to SemEval-2024 Task 8 “Question Answering over Tabular Data.”The shared task focuses on tackling real-life table question answering (TQA) involving extremely large tables with the additional challenges of interpreting complex questions. To address these issues, we leverage a framework of Multi-Agent Collaboration with Tool use (MACT), a method that combines planning and tool use. The planning module breaks down a complex question by designing a step-by-step plan. This plan is translated into Python code by a coding model, and a Python interpreter executes the code to generate an answer. Our system demonstrates competitive performance in the shared task and is ranked 5th out of 38 in the open-source model category. We provide a detailed analysis of our model, evaluating the effectiveness and the efficiency of each component, and identify common error patterns. Our paper offers essential insights and recommendations for future advancements in developing TQA systems.

RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Proceedings of the 4th Table Representation Learning Workshop

Tables can be represented either as text or as images. Previous works on table question answering (TQA) typically rely on only one representation, neglecting the potential benefits of combining both. In this work, we explore integrating textual and visual table representations using multi-modal large language models (MLLMs) for TQA. Specifically, we propose RITT, a retrieval-assisted framework that first identifies the most relevant part of a table for a given question, then dynamically selects the optimal table representations based on the question type. Experiments demonstrate that our framework significantly outperforms the baseline MLLMs by an average of 13 Exact Match and surpasses two text-only state-of-the-art TQA methods on four TQA benchmarks, highlighting the benefits of leveraging both textual and visual table representations.

2024

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Mohsen Mesgar | Sharid Loáiciga
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites
Anna Hätty | Dragan Milchevski | Kersten Döring | Marko Putnikovic | Mohsen Mesgar | Filip Novović | Maximilian Braun | Karina Leoni Borimann | Igor Stranjanac
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Extracting product information is crucial for informed business decisions and strategic planning across multiple industries. However, recent methods relying only on large language models (LLMs) are resource-intensive and computationally prohibitive due to website structure differences and numerous non-product pages. To address these challenges, we propose a novel modular method that leverages low-cost classification models to filter out company web pages, significantly reducing computational costs. Our approach consists of three modules: web page crawling, product page classification using efficient machine learning models, and product information extraction using LLMs on classified product pages. We evaluate our method on a new dataset of about 7000 product and non-product web pages, achieving a 6-point improvement in F1-score, 95% reduction in computational time, and 87.5% reduction in cost compared to end-to-end LLMs. Our research demonstrates the effectiveness of our proposed low-cost classification module to identify web pages containing product information, making product information extraction more effective and cost-efficient.

LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback
Wen Lai | Mohsen Mesgar | Alexander Fraser
Findings of the Association for Computational Linguistics: ACL 2024

To democratize large language models (LLMs) to most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low resource languages. Moreover, these LLMs are not yet aligned with human preference for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset including 100 languages, which represents the largest language coverage to date, and a cross-lingual human feedback dataset encompassing 30 languages. We perform multilingual instruction tuning on the constructed instruction data and further align the LLMs with human feedback using the DPO algorithm on our cross-lingual human feedback dataset. We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks. Experimental results show that xLLMs-100 consistently outperforms its peers across the benchmarks by considerable margins, defining a new state-of-the-art multilingual LLM that supports 100 languages.

FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, understanding the underlying cause and nature of this issue remains predominantly unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publish a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in these three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly.

2023

A Dataset of Argumentative Dialogues on Scientific Papers
Federico Ruggeri | Mohsen Mesgar | Iryna Gurevych
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With recent advances in question-answering models, various datasets have been collected to improve and study the effectiveness of these models on scientific texts. Questions and answers in these datasets explore a scientific paper by seeking factual information from the paper’s content. However, these datasets do not tackle the argumentative content of scientific papers, which is of huge importance in persuasiveness of a scientific discussion. We introduce ArgSciChat, a dataset of 41 argumentative dialogues between scientists on 20 NLP papers. The unique property of our dataset is that it includes both exploratory and argumentative questions and answers in a dialogue discourse on a scientific paper. Moreover, the size of ArgSciChat demonstrates the difficulties in collecting dialogues for specialized domains. Thus, our dataset is a challenging resource to evaluate dialogue agents in low-resource domains, in which collecting training data is costly. We annotate all sentences of dialogues in ArgSciChat and analyze them extensively. The results confirm that dialogues in ArgSciChat include exploratory and argumentative interactions. Furthermore, we use our dataset to fine-tune and evaluate a pre-trained document-grounded dialogue agent. The agent achieves a low performance on our dataset, motivating a need for dialogue agents with a capability to reason and argue about their answers. We publicly release ArgSciChat.

Python Code Generation by Asking Clarification Questions
Haau-Sing (Xiaocheng) Li | Mohsen Mesgar | André Martins | Iryna Gurevych
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Code generation from text requires understanding the user’s intent from a natural languagedescription and generating an executable code snippet that satisfies this intent. While recent pretrained language models demonstrate remarkable performance for this task, these models fail when the given natural language description is under-specified. In this work, we introduce a novel and more realistic setup for this task. We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions. Therefore, we collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers. The empirical results of our evaluation of pretrained language model performance on code generation show that clarifications result in more precisely generated code, as shown by the substantial improvement of model performance in all evaluation metrics. Alongside this, our task and dataset introduce new challenges to the community, including when and what clarification questions should be asked. Our code and dataset are available on GitHub.

The Devil is in the Details: On Models and Training Regimes for Few-Shot Intent Classification
Mohsen Mesgar | Thy Thy Tran | Goran Glavaš | Iryna Gurevych
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

In task-oriented dialog (ToD) new intents emerge on regular basis, with a handful of available utterances at best. This renders effective Few-Shot Intent Classification (FSIC) a central challenge for modular ToD systems. Recent FSIC methods appear to be similar: they use pretrained language models (PLMs) to encode utterances and predominantly resort to nearest-neighbor-based inference. However, they also differ in major components: they start from different PLMs, use different encoding architectures and utterance similarity functions, and adopt different training regimes. Coupling of these vital components together with the lack of informative ablations prevents the identification of factors that drive the (reported) FSIC performance. We propose a unified framework to evaluate these components along the following key dimensions:(1) Encoding architectures: Cross-Encoder vs Bi-Encoders;(2) Similarity function: Parameterized (i.e., trainable) vs non-parameterized; (3) Training regimes: Episodic meta-learning vs conventional (i.e., non-episodic) training. Our experimental results on seven FSIC benchmarks reveal three new important findings. First, the unexplored combination of cross-encoder architecture and episodic meta-learning consistently yields the best FSIC performance. Second, episodic training substantially outperforms its non-episodic counterpart. Finally, we show that splitting episodes into support and query sets has a limited and inconsistent effect on performance. Our findings show the importance of ablations and fair comparisons in FSIC. We publicly release our code and data.

Is the Answer in the Text? Challenging ChatGPT with Evidence Retrieval from Instructive Text
Sophie Henning | Talita Anthonio | Wei Zhou | Heike Adel | Mohsen Mesgar | Annemarie Friedrich
Findings of the Association for Computational Linguistics: EMNLP 2023

Generative language models have recently shown remarkable success in generating answers to questions in a given textual context. However, these answers may suffer from hallucination, wrongly cite evidence, and spread misleading information. In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system. We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts. We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark of a set of WikiHow articles exhaustively annotated with evidence sentences to questions that comes with a special challenge: All questions are about the article’s topic, but not all can be answered using the provided context. We interestingly find that when using a regular question-answering prompt, ChatGPT neglects to detect the unanswerable cases. When provided with a few examples, it learns to better judge whether a text provides answer evidence or not. Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.

2021

Improving Factual Consistency Between a Response and Persona Facts
Mohsen Mesgar | Edwin Simpson | Iryna Gurevych
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Neural models for response generation produce responses that are semantically plausible but not necessarily factually consistent with facts describing the speaker’s persona. These models are trained with fully supervised learning where the objective function barely captures factual consistency. We propose to fine-tune these models by reinforcement learning and an efficient reward function that explicitly captures the consistency between a response and persona facts as well as semantic plausibility. Our automatic and human evaluations on the PersonaChat corpus confirm that our approach increases the rate of responses that are factually consistent with persona facts over its supervised counterpart while retains the language quality of responses.

A Neural Graph-based Local Coherence Model
Mohsen Mesgar | Leonardo F. R. Ribeiro | Iryna Gurevych
Findings of the Association for Computational Linguistics: EMNLP 2021

Entity grids and entity graphs are two frameworks for modeling local coherence. These frameworks represent entity relations between sentences and then extract features from such representations to encode coherence. The benefits of convolutional neural models for extracting informative features from entity grids have been recently studied. In this work, we study the benefits of Relational Graph Convolutional Networks (RGCN) to encode entity graphs for measuring local coherence. We evaluate our neural graph-based model for two benchmark coherence evaluation tasks: sentence ordering (SO) and summary coherence rating (SCR). The results show that our neural graph-based model consistently outperforms the neural grid-based model for both tasks. Our model performs competitively with a strong baseline coherence model, while our model uses 50% fewer parameters. Our work defines a new, efficient, and effective baseline for local coherence modeling.

2020

Dialogue Coherence Assessment Without Explicit Dialogue Act Labels
Mohsen Mesgar | Sebastian Bücker | Iryna Gurevych
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent dialogue coherence models use the coherence features designed for monologue texts, e.g. nominal entities, to represent utterances and then explicitly augment them with dialogue-relevant features, e.g., dialogue act labels. It indicates two drawbacks, (a) semantics of utterances are limited to entity mentions, and (b) the performance of coherence models strongly relies on the quality of the input dialogue act labels. We address these issues by introducing a novel approach to dialogue coherence assessment. We use dialogue act prediction as an auxiliary task in a multi-task learning scenario to obtain informative utterance representations for coherence assessment. Our approach alleviates the need for explicit dialogue act labels during evaluation. The results of our experiments show that our model substantially (more than 20 accuracy points) outperforms its strong competitors on the DailyDialogue corpus, and performs on par with them on the SwitchBoard corpus for ranking dialogues concerning their coherence. We release our source code.

2019

Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems
Steffen Eger | Gözde Gül Şahin | Andreas Rücklé | Ji-Ung Lee | Claudia Schulz | Mohsen Mesgar | Krishnkant Swarnkar | Edwin Simpson | Iryna Gurevych
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Visual modifications to text are often used to obfuscate offensive comments in social media (e.g., “!d10t”) or as a writing style (“1337” in “leet speak”), among other scenarios. We consider this as a new type of adversarial attack in NLP, a setting to which humans are very robust, as our experiments with both simple and more difficult visual perturbations demonstrate. We investigate the impact of visual adversarial attacks on current NLP systems on character-, word-, and sentence-level tasks, showing that both neural and non-neural models are, in contrast to humans, extremely sensitive to such attacks, suffering performance decreases of up to 82%. We then explore three shielding methods—visual character embeddings, adversarial training, and rule-based recovery—which substantially improve the robustness of the models. However, the shielding methods still fall behind performances achieved in non-attack scenarios, which demonstrates the difficulty of dealing with visual attacks.

2018

A Neural Local Coherence Model for Text Quality Assessment
Mohsen Mesgar | Michael Strube
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose a local coherence model that captures the flow of what semantically connects adjacent sentences in a text. We represent the semantics of a sentence by a vector and capture its state at each word of the sentence. We model what relates two adjacent sentences based on the two most similar semantic states, each of which is in one of the sentences. We encode the perceived coherence of a text by a vector, which represents patterns of changes in salient information that relates adjacent sentences. Our experiments demonstrate that our approach is beneficial for two downstream tasks: Readability assessment, in which our model achieves new state-of-the-art results; and essay scoring, in which the combination of our coherence vectors and other task-dependent features significantly improves the performance of a strong essay scorer.

2017

Using a Graph-based Coherence Model in Document-Level Machine Translation
Leo Born | Mohsen Mesgar | Michael Strube
Proceedings of the Third Workshop on Discourse in Machine Translation

Although coherence is an important aspect of any text generation system, it has received little attention in the context of machine translation (MT) so far. We hypothesize that the quality of document-level translation can be improved if MT models take into account the semantic relations among sentences during translation. We integrate the graph-based coherence model proposed by Mesgar and Strube, (2016) with Docent (Hardmeier et al., 2012, Hardmeier, 2014) a document-level machine translation system. The application of this graph-based coherence modeling approach is novel in the context of machine translation. We evaluate the coherence model and its effects on the quality of the machine translation. The result of our experiments shows that our coherence model slightly improves the quality of translation in terms of the average Meteor score.

2016

Generating Coherent Summaries of Scientific Articles Using Coherence Patterns
Daraksha Parveen | Mohsen Mesgar | Michael Strube
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Lexical Coherence Graph Modeling Using Word Embeddings
Mohsen Mesgar | Michael Strube
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Feature-Rich Error Detection in Scientific Writing Using Logistic Regression
Madeline Remse | Mohsen Mesgar | Michael Strube
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

2015

Graph-based Coherence Modeling For Assessing Readability
Mohsen Mesgar | Michael Strube
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

2014

Normalized Entity Graph for Computing Local Coherence
Mohsen Mesgar | Michael Strube
Proceedings of TextGraphs-9: the workshop on Graph-based Methods for Natural Language Processing

2013

History Based Unsupervised Data Oriented Parsing
Mohsen Mesgar | Gholamreza Ghasem-Sani
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013