Ben Bogin - ACL Anthology

Ben Bogin

2025

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura | Mayank Mishra | Simone Tedeschi | Yekun Chai | Jason T. Stillerman | Felix Friedrich | Prateek Yadav | Tanmay Laud | Vu Minh Chien | Terry Yue Zhuo | Diganta Misra | Ben Bogin | Xuan-Son Vu | Marzena Karpinska | Arnav Varma Dantuluri | Wojciech Kusa | Tommaso Furlanello | Rio Yokota | Niklas Muennighoff | Suhas Pai | Tosin Adewumi | Veronika Laippala | Xiaozhe Yao | Adalberto Barbosa Junior | Aleksandr Drozd | Jordan Clive | Kshitij Gupta | Liangyu Chen | Qi Sun | Ken Tsui | Nour Moustafa-Fahmy | Nicolo Monti | Tai Dang | Ziyang Luo | Tien-Tung Bui | Roberto Navigli | Virendra Mehta | Matthew Blumberg | Victor May | Hiep Nguyen | Sampo Pyysalo
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Pretrained language models are integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

Fine-Tune on the Format: First Improving Multiple-Choice Evaluation for Intermediate LLM Checkpoints
Alec Bunn | Sarah Wiegreffe | Ben Bogin
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Evaluation of intermediate language model checkpoints during training is critical for effective model development and selection. How-ever, reliable evaluation using the popular multiple-choice question (MCQ) format is challenging, as small and non instruction-tunedmodels often lack the symbolic reasoning required for the task. This is despite the fact that MCQ evaluation is often used and needed todistinguish between the performance of different training runs. In particular, when prompted with a question and a set of labeled answerchoices (e.g., “A. . . . , B. . . . , C. . . . ”), many models struggle to emit the correct label (e.g., “C”), even when they can select the correct string answer choice. We propose an alternative evaluation method: fine-tuning the model on an auxiliary MCQ dataset prior to outputting labels. We validate this approach empirically by showing that training on auxiliary data improves MCQ ability on all our test datasets except 1. This approach provides a more accurate signal of model capability at intermediate checkpoints, as it disentangles the evaluation of core knowledge from the model’s emerging ability to follow formatting instructions.

2024

Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Ori Yoran | Samuel Joseph Amouyal | Chaitanya Malaviya | Ben Bogin | Ofir Press | Jonathan Berant
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well in terms of accuracy, they exhibit low precision and tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that open web navigation remains a major challenge.

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
Ben Bogin | Kejuan Yang | Shashank Gupta | Kyle Richardson | Erin Bransom | Peter Clark | Ashish Sabharwal | Tushar Khot
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.

Leveraging Code to Improve In-Context Learning for Semantic Parsing
Ben Bogin | Shivanshu Gupta | Peter Clark | Ashish Sabharwal
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs.In this work, we show how pre-existing coding abilities of LLMs can be leveraged for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets; combined, they lead to dramatic improvements (e.g., 7.9% to 66.5% on SMCalFlow compositional split) and can substantially improve compositional generalization, nearly closing the performance gap between easier i.i.d. and harder compositional splits. Finally, comparisons across multiple PLs and DSL variations suggest that the similarity of a target language to general-purpose code is more important than prevalence in pretraining corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs.

2023

Diverse Demonstrations Improve In-context Compositional Generalization
Itay Levy | Ben Bogin | Jonathan Berant
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In-context learning has shown great success in i.i.d semantic parsing splits, where the training and test sets are drawn from the same distribution. In this setup, models are typically prompted with demonstrations that are similar to the input utterance. However, in the setup of compositional generalization, where models are tested on outputs with structures that are absent from the training set, selecting similar demonstrations is insufficient, as often no example will be similar enough to the input. In this work, we propose a method to select diverse demonstrations that aims to collectively cover all of the structures required in the output program, in order to encourage the model to generalize to new structures from these demonstrations. We empirically show that combining diverse demonstrations with in-context learning substantially improves performance across three compositional generalization semantic parsing datasets in the pure in-context learning setup and when combined with finetuning.

Answering Questions by Meta-Reasoning over Multiple Chains of Thought
Ori Yoran | Tomer Wolfson | Ben Bogin | Uri Katz | Daniel Deutch | Jonathan Berant
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider the relations between intermediate steps across chains and do not provide a unified explanation for the predicted answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts large language models to meta-reason over multiple chains of thought, rather than aggregate their answers. MCR examines different reasoning chains, mixes information between them and selects the most relevant facts in generating an explanation and predicting the answer. MCR outperforms strong baselines on 7 multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations exhibit high quality, enabling humans to verify its answers.

2022

Unobserved Local Structures Make Compositional Generalization Hard
Ben Bogin | Shivanshu Gupta | Jonathan Berant
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

While recent work has shown that sequence-to-sequence models struggle to generalize to new compositions (termed compositional generalization), little is known on what makes compositional generalization hard on a particular test instance. In this work, we investigate the factors that make generalization to certain test instances challenging. We first substantiate that some examples are more difficult than others by showing that different models consistently fail or succeed on the same test instances. Then, we propose a criterion for the difficulty of an example: a test instance is hard if it contains a local structure that was not observed at training time. We formulate a simple decision rule based on this criterion and empirically show it predicts instance-level generalization well across 5 different semantic parsing datasets, substantially better than alternative decision rules. Last, we show local structures can be leveraged for creating difficult adversarial compositional splits and also to improve compositional generalization under limited training budgets by strategically selecting examples for the training set.

2021

COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images
Ben Bogin | Shivanshu Gupta | Matt Gardner | Jonathan Berant
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In this work, we propose COVR, a new test-bed for visually-grounded compositional generalization with real images. To create COVR, we use real images annotated with scene graphs, and propose an almost fully automatic procedure for generating question-answer pairs along with a set of context images. COVR focuses on questions that require complex reasoning, including higher-order operations such as quantification and aggregation. Due to the automatic generation process, COVR facilitates the creation of compositional splits, where models at test time need to generalize to new concepts and compositions in a zero- or few-shot setting. We construct compositional splits using COVR and demonstrate a myriad of cases where state-of-the-art pre-trained language-and-vision models struggle to compositionally generalize.

Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data
Moshe Hazoom | Vibhor Malik | Ben Bogin
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)

Most available semantic parsing datasets, comprising of pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluation of natural language understanding systems. As a result, they do not contain any of the richness and variety of natural-occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between the performance on SEDE compared to other common datasets.

Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering
Ben Bogin | Sanjay Subramanian | Matt Gardner | Jonathan Berant
Transactions of the Association for Computational Linguistics, Volume 9

Answering questions that involve multi-step reasoning requires decomposing them and using the answers of intermediate steps to reach the final answer. However, state-of-the-art models in grounded question answering often do not explicitly perform decomposition, leading to difficulties in generalization to out-of-distribution examples. In this work, we propose a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner using a CKY-style parser. Our model induces latent trees, driven by end-to-end (the answer) supervision only. We show that this inductive bias towards tree structures dramatically improves systematic generalization to out-of- distribution examples, compared to strong baselines on an arithmetic expressions benchmark as well as on C losure, a dataset that focuses on systematic generalization for grounded question answering. On this challenging dataset, our model reaches an accuracy of 96.1%, significantly higher than prior models that almost perfectly solve the task on a random, in-distribution split.

2020

Obtaining Faithful Interpretations from Compositional Neural Networks
Sanjay Subramanian | Ben Bogin | Nitish Gupta | Tomer Wolfson | Sameer Singh | Jonathan Berant | Matt Gardner
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explanation of the model’s reasoning; that is, that all modules perform their intended behaviour. In this work, we propose and conduct a systematic evaluation of the intermediate outputs of NMNs on NLVR2 and DROP, two datasets which require composing multiple reasoning steps. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour. To remedy that, we train the model with auxiliary supervision and propose particular choices for module architecture that yield much better faithfulness, at a minimal cost to accuracy.

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

MedICaT: A Dataset of Medical Images, Captions, and Textual References
Sanjay Subramanian | Lucy Lu Wang | Ben Bogin | Sachin Mehta | Madeleine van Zuylen | Sravanthi Parasa | Sameer Singh | Matt Gardner | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: EMNLP 2020

Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat.

Proceedings of the First Workshop on Interactive and Executable Semantic Parsing
Ben Bogin | Srinivasan Iyer | Xi Victoria Lin | Dragomir Radev | Alane Suhr | Panupong | Caiming Xiong | Pengcheng Yin | Tao Yu | Rui Zhang | Victor Zhong
Proceedings of the First Workshop on Interactive and Executable Semantic Parsing

2019

Global Reasoning over Database Structures for Text-to-SQL Parsing
Ben Bogin | Matt Gardner | Jonathan Berant
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

State-of-the-art semantic parsers rely on auto-regressive decoding, emitting one symbol at a time. When tested against complex databases that are unobserved at training time (zero-shot), the parser often struggles to select the correct set of database constants in the new database, due to the local nature of decoding. %since their decisions are based on weak, local information only. In this work, we propose a semantic parser that globally reasons about the structure of the output query to make a more contextually-informed selection of database constants. We use message-passing through a graph neural network to softly select a subset of database constants for the output query, conditioned on the question. Moreover, we train a model to rank queries based on the global alignment of database constants to question words. We apply our techniques to the current state-of-the-art model for Spider, a zero-shot semantic parsing dataset with complex databases, increasing accuracy from 39.4% to 47.4%.

Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing
Ben Bogin | Jonathan Berant | Matt Gardner
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Research on parsing language to SQL has largely ignored the structure of the database (DB) schema, either because the DB was very simple, or because it was observed at both training and test time. In spider, a recently-released text-to-SQL dataset, new and complex DBs are given at test time, and so the structure of the DB schema can inform the predicted SQL query. In this paper, we present an encoder-decoder semantic parser, where the structure of the DB schema is encoded with a graph neural network, and this representation is later used at both encoding and decoding time. Evaluation shows that encoding the schema structure improves our parser accuracy from 33.8% to 39.4%, dramatically above the current state of the art, which is at 19.7%.

2018

Towards an argumentative content search engine using weak supervision
Ran Levy | Ben Bogin | Shai Gretz | Ranit Aharonov | Noam Slonim
Proceedings of the 27th International Conference on Computational Linguistics

Searching for sentences containing claims in a large text corpus is a key component in developing an argumentative content search engine. Previous works focused on detecting claims in a small set of documents or within documents enriched with argumentative content. However, pinpointing relevant claims in massive unstructured corpora, received little attention. A step in this direction was taken in (Levy et al. 2017), where the authors suggested using a weak signal to develop a relatively strict query for claim–sentence detection. Here, we leverage this work to define weak signals for training DNNs to obtain significantly greater performance. This approach allows to relax the query and increase the potential coverage. Our results clearly indicate that the system is able to successfully generalize from the weak signal, outperforming previously reported results in terms of both precision and coverage. Finally, we adapt our system to solve a recent argument mining task of identifying argumentative sentences in Web texts retrieved from heterogeneous sources, and obtain F1 scores comparable to the supervised baseline.

Co-authors

Niklas Muennighoff 2

Kyle Richardson 2

Ashish Sabharwal 2

Noah A. Smith 2

Tomer Wolfson 2

Tosin Adewumi 1

Ranit Aharonov 1

Samuel Joseph Amouyal 1

David Atkinson 1

Russell Authur 1

Victoria Basmov 1

Akshita Bhagia 1

Matthew Blumberg 1

Tien-Tung Bui 1

Khyathi Chandu 1

Liang-Yu Chen 1

Vu Minh Chien 1

Arnav Varma Dantuluri 1

Pradeep Dasigi 1

Daniel Deutch 1

Aleksandr Drozd 1

Jennifer Dumas 1

Felix Friedrich 1

Tommaso Furlanello 1

Ananth Gottumukkala 1

Dirk Groeneveld 1

Shashank Gupta 1

Kshitij Gupta 1

Valentin Hofmann 1

Gabriel Ilharco 1

Srinivasan Iyer 1

Adalberto Barbosa Junior 1

Marzena Karpinska 1

Daniel Khashabi 1

Rodney Kinney 1

Wojciech Kusa 1

Veronika Laippala 1

Nathan Lambert 1

Xi Victoria Lin 1

Jiangming Liu 1

Nelson F. Liu 1

Ian Magnusson 1

Chaitanya Malaviya 1

Virendra Mehta 1

Mayank Mishra 1

Diganta Misra 1

Jacob Morrison 1

Nour Moustafa-Fahmy 1

Phoebe Mulcaire 1

Aakanksha Naik 1

Taishi Nakamura 1

Roberto Navigli 1

Sravanthi Parasa 1

Matthew E. Peters 1

Sampo Pyysalo 1

Dragomir Radev 1

Abhilasha Ravichander 1

Dustin Schwenk 1

Luca Soldaini 1

Jason T. Stillerman 1

Emma Strubell 1

Nishant Subramani 1

Oyvind Tafjord 1

Simone Tedeschi 1

Reut Tsarfaty 1

Sarah Wiegreffe 1

Caiming Xiong 1

Prateek Yadav 1

Pengcheng Yin 1

Luke Zettlemoyer 1

Terry Yue Zhuo 1

Madeleine van Zuylen 1

Venues