2025
MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools
Nishant Subramani | Jason Eisner | Justin Svegliato | Benjamin Van Durme | Yu Su | Sam Thomson
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logit lens and then computes similarity scores between each layer’s generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
2024
Language-to-Code Translation with a Single Labeled Example
Kaj Bostrom | Harsh Jhamtani | Hao Fang | Sam Thomson | Richard Shin | Patrick Xia | Benjamin Van Durme | Jason Eisner | Jacob Andreas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Tools for translating natural language into code promise natural, open-ended interaction with databases, web APIs, and other software systems. However, this promise is complicated by the diversity and continual development of these systems, each with its own interface and distinct set of features. Building a new language-to-code translator, even starting with a large language model (LM), typically requires annotating a large set of natural language commands with their associated programs. In this paper, we describe ICIP (In-Context Inverse Programming), a method for bootstrapping a language-to-code system using mostly (or entirely) unlabeled programs written using a potentially unfamiliar (but human-readable) library or API. ICIP uses a pre-trained LM to assign candidate natural language descriptions to these programs, then iteratively refines the descriptions to ensure global consistency. Across nine different application domains from the Overnight and Spider benchmarks and text-davinci-003 and CodeLlama-7b-Instruct models, ICIP outperforms a number of prompting baselines. Indeed, in a “nearly unsupervised” setting with only a single annotated program and 100 unlabeled examples, it achieves up to 85% of the performance of a fully supervised system.
2023
Toward Interactive Dictation
Belinda Z. Li | Jason Eisner | Adam Pauls | Sam Thomson
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Voice dictation is an increasingly important text input modality. Existing systems that allow both dictation and editing-by-voice restrict their command language to flat templates invoked by trigger words. In this work, we study the feasibility of allowing users to interrupt their dictation with spoken editing commands in open-ended natural language. We introduce a new task and dataset, TERTiUS, to experiment with such systems. To support this flexibility in real-time, a system must incrementally segment and classify spans of speech as either dictation or command, and interpret the spans that are commands. We experiment with using large pre-trained language models to predict the edited text, or alternatively, to predict a small text-editing program. Experiments show a natural trade-off between model accuracy and latency: a smaller model achieves 30% end-state accuracy with 1.3 seconds of latency, while a larger model achieves 55% end-state accuracy with 7 seconds of latency.
2022
Online Semantic Parsing for Latency Reduction in Task-Oriented Dialogue
Jiawei Zhou | Jason Eisner | Michael Newman | Emmanouil Antonios Platanios | Sam Thomson
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Standard conversational semantic parsing maps a complete user utterance into an executable program, after which the program is executed to respond to the user. This could be slow when the program contains expensive function calls. We investigate the opportunity to reduce latency by predicting and executing function calls while the user is still speaking. We introduce the task of online semantic parsing for this purpose, with a formal latency reduction metric inspired by simultaneous machine translation. We propose a general framework with first a learned prefix-to-program prediction module, and then a simple yet effective thresholding heuristic for subprogram selection for early execution. Experiments on the SMCalFlow and TreeDST datasets show our approach achieves large latency reduction with good parsing quality, with a 30%–65% latency reduction depending on function execution time and allowed cost.
Guided K-best Selection for Semantic Parsing Annotation
Anton Belyy | Chieh-yang Huang | Jacob Andreas | Emmanouil Antonios Platanios | Sam Thomson | Richard Shin | Subhro Roy | Aleksandr Nisnevich | Charles Chen | Benjamin Van Durme
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Collecting data for conversational semantic parsing is a time-consuming and demanding process. In this paper we consider, given an incomplete dataset with only a small amount of data, how to build an AI-powered human-in-the-loop process to enable efficient data collection. A guided K-best selection process is proposed, which (i) generates a set of possible valid candidates; (ii) allows users to quickly traverse the set and filter incorrect parses; and (iii) asks users to select the correct parse, with minimal modification when necessary. We investigate how to best support users in efficiently traversing the candidate set and locating the correct parse, in terms of speed and accuracy. In our user study, consisting of five annotators labeling 300 instances each, we find that combining keyword searching, where keywords can be used to query relevant candidates, and keyword suggestion, where representative keywords are automatically generated, enables fast and accurate annotation.
When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems
Elias Stengel-Eskin | Emmanouil Antonios Platanios | Adam Pauls | Sam Thomson | Hao Fang | Benjamin Van Durme | Jason Eisner | Yu Su
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
In natural language understanding (NLU) production systems, users’ evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation into this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on a small set of new symbols often decreases. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues.
2021
Value-Agnostic Conversational Semantic Parsing
Emmanouil Antonios Platanios | Adam Pauls | Subhro Roy | Yuchen Zhang | Alexander Kyte | Alan Guo | Sam Thomson | Jayant Krishnamurthy | Jason Wolfe | Jacob Andreas | Dan Klein
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Conversational semantic parsers map user utterances to executable programs given dialogue histories composed of previous utterances, programs, and system responses. Existing parsers typically condition on rich representations of history that include the complete set of values and computations previously discussed. We propose a model that abstracts over values to focus prediction on type- and function-level context. This approach provides a compact encoding of dialogue histories and predicted programs, improving generalization and computational efficiency. Our model incorporates several other components, including an atomic span copy operation and structural enforcement of well-formedness constraints on predicted programs, that are particularly advantageous in the low-data regime. Trained on the SMCalFlow and TreeDST datasets, our model outperforms prior work by 7.3% and 10.6% respectively in terms of absolute accuracy. Trained on only a thousand examples from each dataset, it outperforms strong baselines by 12.4% and 6.4%. These results indicate that simple representations are key to effective generalization in conversational semantic parsing.
Constrained Language Models Yield Few-Shot Semantic Parsers
Richard Shin | Christopher Lin | Sam Thomson | Charles Chen | Subhro Roy | Emmanouil Antonios Platanios | Adam Pauls | Dan Klein | Jason Eisner | Benjamin Van Durme
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
We explore the use of large pretrained language models as few-shot semantic parsers. The goal in semantic parsing is to generate a structured meaning representation given a natural language input. However, language models are trained to generate natural language. To bridge the gap, we use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation. Our results demonstrate that with only a small amount of data and very little code to convert into English-like representations, our blueprint for rapidly bootstrapping semantic parsers leads to surprisingly effective performance on multiple community tasks, greatly exceeding baseline methods also trained on the same limited data.
Compositional Generalization for Neural Semantic Parsing via Span-level Supervised Attention
Pengcheng Yin | Hao Fang | Graham Neubig | Adam Pauls | Emmanouil Antonios Platanios | Yu Su | Sam Thomson | Jacob Andreas
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
We describe a span-level supervised attention loss that improves compositional generalization in semantic parsers. Our approach builds on existing losses that encourage attention maps in neural sequence-to-sequence models to imitate the output of classical word alignment algorithms. Where past work has used word-level alignments, we focus on spans; borrowing ideas from phrase-based machine translation, we align subtrees in semantic parses to spans of input sentences, and encourage neural attention mechanisms to mimic these alignments. This method improves the performance of transformers, RNNs, and structured decoders on three benchmarks of compositional generalization.
2020
Task-Oriented Dialogue as Dataflow Synthesis
Jacob Andreas | John Bufe | David Burkett | Charles Chen | Josh Clausman | Jean Crawford | Kate Crim | Jordan DeLoach | Leah Dorner | Jason Eisner | Hao Fang | Alan Guo | David Hall | Kristin Hayes | Kellie Hill | Diana Ho | Wendy Iwaszuk | Smriti Jha | Dan Klein | Jayant Krishnamurthy | Theo Lanman | Percy Liang | Christopher H. Lin | Ilya Lintsbakh | Andy McGovern | Aleksandr Nisnevich | Adam Pauls | Dmitrij Petters | Brent Read | Dan Roth | Subhro Roy | Jesse Rusak | Beth Short | Div Slomin | Ben Snyder | Stephon Striplin | Yu Su | Zachary Tellman | Sam Thomson | Andrei Vorobev | Izabela Witoszko | Jason Wolfe | Abby Wray | Yuchen Zhang | Alexander Zotov
Transactions of the Association for Computational Linguistics, Volume 8
We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset, code for replicating experiments, and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.
2018
Rational Recurrences
Hao Peng | Roy Schwartz | Sam Thomson | Noah A. Smith
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Despite the tremendous empirical success of neural models in natural language processing, many of them lack the strong intuitions that accompany classical machine learning approaches. Recently, connections have been shown between convolutional neural networks (CNNs) and weighted finite state automata (WFSAs), leading to new interpretations and insights. In this work, we show that some recurrent neural networks also share this connection to WFSAs. We characterize this connection formally, defining rational recurrences to be recurrent hidden state update functions that can be written as the Forward calculation of a finite set of WFSAs. We show that several recent neural models use rational recurrences. Our analysis provides a fresh view of these models and facilitates devising new neural architectures that draw inspiration from WFSAs. We present one such model, which performs better than two recent baselines on language modeling and text classification. Our results demonstrate that transferring intuitions from classical models like WFSAs can be an effective approach to designing and understanding neural models.
Syntactic Scaffolds for Semantic Structures
Swabha Swayamdipta | Sam Thomson | Kenton Lee | Luke Zettlemoyer | Chris Dyer | Noah A. Smith
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
We introduce the syntactic scaffold, an approach to incorporating syntactic information into semantic tasks. Syntactic scaffolds avoid expensive syntactic processing at runtime, only making use of a treebank during training, through a multitask objective. We improve over strong baselines on PropBank semantics, frame semantics, and coreference resolution, achieving competitive performance on all three tasks.
Learning Joint Semantic Parsers from Disjoint Data
Hao Peng | Sam Thomson | Swabha Swayamdipta | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
We present a new approach to learning a semantic parser from multiple datasets, even when the target semantic formalisms are drastically different and the underlying corpora do not overlap. We handle such “disjoint” data by treating annotations for unobserved formalisms as latent structured variables. Building on state-of-the-art baselines, we show improvements both in frame-semantic parsing and semantic dependency parsing by modeling them jointly.
Bridging CNNs, RNNs, and Weighted Finite-State Machines
Roy Schwartz | Sam Thomson | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recurrent and convolutional neural networks comprise two distinct families of models that have proven to be useful for encoding natural language utterances. In this paper we present SoPa, a new model that aims to bridge these two approaches. SoPa combines neural representation learning with weighted finite-state automata (WFSAs) to learn a soft version of traditional surface patterns. We show that SoPa is an extension of a one-layer CNN, and that such CNNs are equivalent to a restricted version of SoPa, and accordingly, to a restricted form of WFSA. Empirically, on three text classification tasks, SoPa is comparable to or better than both a BiLSTM (RNN) baseline and a CNN baseline, and is particularly useful in small data settings.
Backpropagating through Structured Argmax using a SPIGOT
Hao Peng | Sam Thomson | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce structured projection of intermediate gradients (SPIGOT), a new method for backpropagating through neural networks that include hard-decision structured predictions (e.g., parsing) in intermediate layers. SPIGOT requires no marginal inference, unlike structured attention networks and reinforcement learning-inspired solutions. Like so-called straight-through estimators, SPIGOT defines gradient-like quantities associated with intermediate nondifferentiable operations, allowing backpropagation before and after them; SPIGOT’s proxy aims to ensure that, after a parameter update, the intermediate structure will remain well-formed. We experiment on two structured NLP pipelines: syntactic-then-semantic dependency parsing, and semantic parsing followed by sentiment classification. We show that training with SPIGOT leads to a larger improvement on the downstream task than a modularly-trained pipeline, the straight-through estimator, and structured attention, reaching a new state of the art on semantic dependency parsing.
2017
Deep Multitask Learning for Semantic Dependency Parsing
Hao Peng | Sam Thomson | Noah A. Smith
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present a deep neural architecture that parses sentences into three semantic dependency graph formalisms. By using efficient, nearly arc-factored inference and a bidirectional-LSTM composed with a multi-layer perceptron, our base system is able to significantly improve the state of the art for semantic dependency parsing, without using hand-engineered features or syntax. We then explore two multitask learning approaches—one that shares parameters across formalisms, and one that uses higher-order structures to predict the graphs jointly. We find that both approaches improve performance across formalisms on average, achieving a new state of the art. Our code is open-source and available at https://github.com/Noahs-ARK/NeurboParser.
2015
Toward Abstractive Summarization Using Semantic Representations
Fei Liu | Jeffrey Flanigan | Sam Thomson | Norman Sadeh | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Frame-Semantic Role Labeling with Heterogeneous Annotations
Meghana Kshirsagar | Sam Thomson | Nathan Schneider | Jaime Carbonell | Noah A. Smith | Chris Dyer
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
2014
A Discriminative Graph-Based Parser for the Abstract Meaning Representation
Jeffrey Flanigan | Sam Thomson | Jaime Carbonell | Chris Dyer | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
CMU: Arc-Factored, Discriminative Semantic Dependency Parsing
Sam Thomson | Brendan O’Connor | Jeffrey Flanigan | David Bamman | Jesse Dodge | Swabha Swayamdipta | Nathan Schneider | Chris Dyer | Noah A. Smith
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)