Karl Stratos


2024

Model Editing by Standard Fine-Tuning
Govind Krishnan Gangadhar | Karl Stratos
Findings of the Association for Computational Linguistics: ACL 2024

Standard fine-tuning is considered less effective than specialized methods for model editing because of its comparatively poor performance. However, it is simple, agnostic to the architectural details of the model being edited, and able to leverage advances in standard training techniques with no additional work (e.g., black-box PEFT for computational efficiency), making it an appealing choice for a model editor. In this work, we show that standard fine-tuning alone can yield competitive model editing performance with two minor modifications. First, we optimize the conditional likelihood rather than the full likelihood. Second, in addition to the typical practice of training on randomly paraphrased edit prompts to encourage generalization, we also train on random or similar unedited facts to encourage locality. Our experiments on the ZsRE and CounterFact datasets demonstrate that these simple modifications allow standard fine-tuning to match or outperform highly specialized editors in terms of edit score.
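
As a rough illustration of the first modification, below is a minimal PyTorch-style sketch of conditional-likelihood fine-tuning, in which the loss is computed only on the target (completion) tokens rather than on the full prompt-plus-target sequence. The model interface, tokenization, and masking details are illustrative assumptions, not the paper's exact setup.

# Hedged sketch: conditional-likelihood loss for a single edit example.
# `model` is assumed to be any causal LM returning next-token logits of shape
# (batch, seq_len, vocab); tokenization is a placeholder.
import torch
import torch.nn.functional as F

def conditional_lm_loss(model, input_ids, prompt_len):
    """input_ids : (1, seq_len), prompt tokens followed by target tokens.
    prompt_len: number of tokens belonging to the edit prompt."""
    logits = model(input_ids)                      # (1, seq_len, vocab)
    shift_logits = logits[:, :-1, :]               # position t predicts token t+1
    shift_labels = input_ids[:, 1:].clone()
    # Mask prompt positions so only p(target | prompt) is optimized.
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

The second modification would then simply add paraphrased edit prompts and unedited (random or similar) facts as extra training sequences under the same loss.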

2023

Seq2seq is All You Need for Coreference Resolution
Wenzheng Zhang | Sam Wiseman | Karl Stratos
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Existing works on coreference resolution suggest that task-specific models are necessary to achieve state-of-the-art performance. In this work, we present compelling evidence that such models are not necessary. We finetune a pretrained seq2seq transformer to map an input document to a tagged sequence encoding the coreference annotation. Despite the extreme simplicity, our model outperforms or closely matches the best coreference systems in the literature on an array of datasets. We consider an even simpler version of seq2seq that generates only the tagged spans and find it highly performant. Our analysis shows that the model size, the amount of supervision, and the choice of sequence representations are key factors in performance.

Improving Multitask Retrieval by Promoting Task Specialization
Wenzheng Zhang | Chenyan Xiong | Karl Stratos | Arnold Overwijk
Transactions of the Association for Computational Linguistics, Volume 11

In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval, in which a separate retriever is trained for each task. We show that it is possible to train a multitask retriever that outperforms task-specific retrievers by promoting task specialization. The main ingredients are: (1) a better choice of pretrained model—one that is explicitly optimized for multitasking—along with compatible prompting, and (2) a novel adaptive learning method that encourages each parameter to specialize in a particular task. The resulting multitask retriever is highly performant on the KILT benchmark. Upon analysis, we find that the model indeed learns parameters that are more task-specialized compared to naive multitasking without prompting or adaptive learning.

2021

Understanding Hard Negatives in Noise Contrastive Estimation
Wenzheng Zhang | Karl Stratos
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The choice of negative examples is important in noise contrastive estimation. Recent works find that hard negatives—highest-scoring incorrect examples under the model—are effective in practice, but they are used without a formal justification. We develop analytical tools to understand the role of hard negatives. Specifically, we view the contrastive loss as a biased estimator of the gradient of the cross-entropy loss, and show both theoretically and empirically that setting the negative distribution to be the model distribution results in bias reduction. We also derive a general form of the score function that unifies various architectures used in text retrieval. By combining hard negatives with appropriate score functions, we obtain strong results on the challenging task of zero-shot entity linking.
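
To make the setup concrete, here is a hedged sketch of a contrastive (cross-entropy over candidates) loss in which the negatives are the highest-scoring incorrect candidates under the current model and the score function is a simple dot product, one instance of the general form. Encoders, candidate handling, and the value of k are illustrative assumptions.

# Hypothetical sketch: contrastive loss with hard negatives mined from the model.
import torch
import torch.nn.functional as F

def nce_loss_with_hard_negatives(query_vec, pos_vec, candidate_vecs, k=15):
    """query_vec: (d,); pos_vec: (d,); candidate_vecs: (N, d) incorrect candidates."""
    neg_scores = candidate_vecs @ query_vec                    # (N,) dot-product scores
    # Hard negatives = top-k highest-scoring incorrect candidates under the model.
    hard_idx = torch.topk(neg_scores, k).indices
    scores = torch.cat([(pos_vec @ query_vec).unsqueeze(0),
                        neg_scores[hard_idx]])                 # (k + 1,)
    # Cross-entropy with the positive candidate in position 0.
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))

In practice the negatives would typically be mined from a large candidate pool with a periodically refreshed index rather than rescored on the fly, but the loss has this shape.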

Corrected CBOW Performs as well as Skip-gram
Ozan İrsoy | Adrian Benton | Karl Stratos
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Mikolov et al. (2013a) observed that continuous bag-of-words (CBOW) word embeddings tend to underperform Skip-gram (SG) embeddings, and this finding has been reported in subsequent works. We find that these observations are driven not by fundamental differences in their training objectives, but more likely on faulty negative sampling CBOW implementations in popular libraries such as the official implementation, word2vec.c, and Gensim. We show that after correcting a bug in the CBOW gradient update, one can learn CBOW word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks, while being many times faster to train.
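
For intuition, below is a hedged numpy sketch of one negative-sampling CBOW update in which the gradient flowing back to the context vectors is scaled by 1/|context|, consistent with the averaging done in the forward pass; this scaling is our reading of the reported fix, and all sampling and indexing details are illustrative assumptions.

# Hedged sketch of a corrected CBOW update (negative sampling, mean context).
# W_in, W_out: (vocab, dim) input/output embedding matrices.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_update(W_in, W_out, context_ids, center_id, negative_ids, lr=0.025):
    # Forward pass averages the (assumed distinct) context vectors.
    h = W_in[context_ids].mean(axis=0)
    grad_h = np.zeros_like(h)
    for wid, label in [(center_id, 1.0)] + [(n, 0.0) for n in negative_ids]:
        score = sigmoid(h @ W_out[wid])
        g = score - label                      # d loss / d (h . w_out)
        grad_h += g * W_out[wid]
        W_out[wid] -= lr * g * h
    # Corrected step: divide by the context size, matching the mean in the forward pass.
    W_in[context_ids] -= lr * grad_h / len(context_ids)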

Fast and Effective Biomedical Entity Linking Using a Dual Encoder
Rajarshi Bhowmik | Karl Stratos | Gerard de Melo
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

Biomedical entity linking is the task of identifying mentions of biomedical concepts in text documents and mapping them to canonical entities in a target thesaurus. Recent advancements in entity linking using BERT-based models follow a retrieve-and-rerank paradigm, where candidate entities are first selected by a retriever model and the retrieved candidates are then ranked by a reranker model. While this paradigm produces state-of-the-art results, such models are slow at both training and test time because they can process only one mention at a time. To mitigate these issues, we propose a BERT-based dual encoder model that resolves multiple mentions in a document in one shot. We show that our proposed model is multiple times faster than existing BERT-based models while being competitive in accuracy for biomedical entity linking. Additionally, we modify our dual encoder model for end-to-end biomedical entity linking that performs both mention span detection and entity disambiguation, and it outperforms two recently proposed models.
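
A hypothetical sketch of the one-shot linking step: the document is encoded once, mention-span representations are pooled from the token states, and all mentions are scored against precomputed entity embeddings in a single matrix product. The encoder interface and mean pooling are assumptions, not the paper's exact architecture.

# Hedged sketch: linking all mentions in a document with one encoder pass.
import torch

def link_mentions(doc_encoder, entity_embs, token_ids, mention_spans):
    """entity_embs: (num_entities, d); mention_spans: list of (start, end) spans."""
    token_states = doc_encoder(token_ids)                    # (seq_len, d)
    mention_vecs = torch.stack([
        token_states[s:e].mean(dim=0) for (s, e) in mention_spans
    ])                                                        # (num_mentions, d)
    scores = mention_vecs @ entity_embs.T                     # (num_mentions, num_entities)
    return scores.argmax(dim=-1)                              # predicted entity per mention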

Unsupervised Label Refinement Improves Dataless Text Classification
Zewei Chu | Karl Stratos | Kevin Gimpel
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Data-to-text Generation by Splicing Together Nearest Neighbors
Sam Wiseman | Arturs Backurs | Karl Stratos
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We propose to tackle data-to-text generation tasks by directly splicing together retrieved segments of text from “neighbor” source-target pairs. Unlike recent work that conditions on retrieved neighbors but generates text token-by-token, left-to-right, we learn a policy that directly manipulates segments of neighbor text, by inserting or replacing them in partially constructed generations. Standard techniques for training such a policy require an oracle derivation for each generation, and we prove that finding the shortest such derivation can be reduced to parsing under a particular weighted context-free grammar. We find that policies learned in this way perform on par with strong baselines in terms of automatic and human evaluation, but allow for more interpretable and controllable generation.

2020

Discrete Latent Variable Representations for Low-Resource Text Classification
Shuning Jin | Sam Wiseman | Karl Stratos | Karen Livescu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

While much work on deep latent variable models of text uses continuous latent variables, discrete latent variables are interesting because they are more interpretable and typically more space efficient. We consider several approaches to learning discrete latent variable models for text in the case where exact marginalization over these variables is intractable. We compare the performance of the learned representations as features for low-resource document and sentence classification. Our best models outperform the previous best reported results with continuous representations in these low-resource settings, while learning significantly more compressed representations. Interestingly, we find that an amortized variant of Hard EM performs particularly well in the lowest-resource regimes.

Mining Knowledge for Natural Language Inference from Wikipedia Categories
Mingda Chen | Zewei Chu | Karl Stratos | Kevin Gimpel
Findings of the Association for Computational Linguistics: EMNLP 2020

Accurate lexical entailment (LE) and natural language inference (NLI) often require large quantities of costly annotations. To alleviate the need for labeled data, we introduce WikiNLI: a resource for improving model performance on NLI and LE tasks. It contains 428,899 pairs of phrases constructed from naturally annotated category hierarchies in Wikipedia. We show that we can improve strong baselines such as BERT and RoBERTa by pretraining them on WikiNLI and transferring the models to downstream tasks. We conduct systematic comparisons with phrases extracted from other knowledge bases such as WordNet and Wikidata and find that pretraining on WikiNLI gives the best performance. In addition, we construct WikiNLI in other languages and show that pretraining on them improves performance on NLI tasks in the corresponding languages.

2019

Label-Agnostic Sequence Labeling by Copying Nearest Neighbors
Sam Wiseman | Karl Stratos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Retrieve-and-edit based approaches to structured prediction, where structures associated with retrieved neighbors are edited to form new structures, have recently attracted increased interest. However, much recent work merely conditions on retrieved structures (e.g., in a sequence-to-sequence framework), rather than explicitly manipulating them. We show we can perform accurate sequence labeling by explicitly (and only) copying labels from retrieved neighbors. Moreover, because this copying is label-agnostic, we can achieve impressive performance in zero-shot sequence-labeling tasks. We additionally consider a dynamic programming approach to sequence labeling in the presence of retrieved neighbors, which allows for controlling the number of distinct (copied) segments used to form a prediction, and leads to both more interpretable and accurate predictions.

Mutual Information Maximization for Simple and Accurate Part-Of-Speech Induction
Karl Stratos
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We address part-of-speech (POS) induction by maximizing the mutual information between the induced label and its context. We focus on two training objectives that are amenable to stochastic gradient descent (SGD): a novel generalization of the classical Brown clustering objective and a recently proposed variational lower bound. While both objectives are subject to noise in gradient updates, we show through analysis and experiments that the variational lower bound is robust whereas the generalized Brown objective is vulnerable. We obtain strong performance on a multitude of datasets and languages with a simple architecture that encodes morphology and context.
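
For reference, one standard variational lower bound on mutual information of the kind the abstract alludes to (shown in generic form; the exact parameterization in the paper may differ): for a word's induced label $Y$ and its context $X$, and any variational distribution $q(y \mid x)$,

\[
I(X;Y) \;=\; H(Y) - H(Y \mid X) \;\ge\; H(Y) + \mathbb{E}_{(x,y)}\!\left[\log q(y \mid x)\right],
\]

since the cross entropy $\mathbb{E}\!\left[-\log q(y \mid x)\right]$ upper-bounds $H(Y \mid X)$ for any $q$. Both terms can be estimated on minibatches, which is what makes such an objective amenable to SGD.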

EntEval: A Holistic Evaluation Benchmark for Entity Representations
Mingda Chen | Zewei Chu | Yang Chen | Karl Stratos | Kevin Gimpel
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Rich entity representations are useful for a wide class of problems involving entities. Despite their importance, there is no standardized benchmark that evaluates the overall quality of entity representations. In this work, we propose EntEval: a test suite of diverse tasks that require nontrivial understanding of entities including entity typing, entity similarity, entity relation prediction, and entity disambiguation. In addition, we develop training techniques for learning better entity representations by using natural hyperlink annotations in Wikipedia. We identify effective objectives for incorporating the contextual information in hyperlinks into state-of-the-art pretrained language models (Peters et al., 2018) and show that they improve strong baselines on multiple EntEval tasks.

2018

Compositional Morpheme Embeddings with Affixes as Functions and Stems as Arguments
Daniel Edmiston | Karl Stratos
Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP

This work introduces a novel, linguistically motivated architecture for composing morphemes to derive word embeddings. The principal novelty of the work is to treat stems as vectors and affixes as functions over vectors. In this way, our model’s architecture more closely resembles the compositionality of morphemes in natural language. Such a model stands in opposition to models which treat morphemes uniformly, making no distinction between stem and affix. We run this new architecture on a dependency parsing task in Korean—a language rich in derivational morphology—and compare it against a lexical baseline, along with other sub-word architectures. Our architecture, StAffNet, shows competitive performance with the state of the art on this task.
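
A hedged sketch of the "affixes as functions, stems as arguments" idea: each stem is a learned vector and each affix a learned map applied to it. The specific parameterization below (a linear map per affix followed by a nonlinearity) is an illustrative assumption, not necessarily StAffNet's exact formulation.

# Hypothetical composer: stems are vectors, affixes act as functions on them.
import torch
import torch.nn as nn

class StemAffixComposer(nn.Module):
    def __init__(self, num_stems, num_affixes, dim):
        super().__init__()
        self.stem_emb = nn.Embedding(num_stems, dim)                     # stems = vectors
        self.affix_maps = nn.Parameter(torch.randn(num_affixes, dim, dim) * 0.01)

    def forward(self, stem_id, affix_ids):
        vec = self.stem_emb(stem_id)                                     # (dim,)
        for a in affix_ids:                                              # apply affixes in order
            vec = torch.tanh(self.affix_maps[a] @ vec)                   # affix as a function
        return vec                                                       # composed word embedding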

2017

A Sub-Character Architecture for Korean Language Processing
Karl Stratos
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.
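
The jamo decomposition itself follows standard Unicode arithmetic: the Hangul Syllables block starts at U+AC00 and is laid out as 19 initials x 21 vowels x 28 finals. The sketch below recovers the three jamo indices of a precomposed syllable; mapping indices back to actual jamo characters would additionally need small lookup tables, which are omitted here.

# Illustrative jamo decomposition of a precomposed Hangul syllable.
def decompose_syllable(ch):
    code = ord(ch) - 0xAC00
    if not 0 <= code <= 11171:                 # not in the Hangul Syllables block
        return None
    lead  = code // (21 * 28)                  # initial consonant index (0-18)
    vowel = (code % (21 * 28)) // 28           # vowel index (0-20)
    tail  = code % 28                          # final consonant index (0 = none)
    return lead, vowel, tail

# e.g. decompose_syllable("한") -> (18, 0, 4), i.e. ㅎ + ㅏ + ㄴ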

Reconstruction of Word Embeddings from Sub-Word Parameters
Karl Stratos
Proceedings of the First Workshop on Subword and Character Level Models in NLP

Pre-trained word embeddings improve the performance of a neural model at the cost of increasing the model size. We propose to benefit from this resource without paying the cost by operating strictly at the sub-lexical level. Our approach is quite simple: before task-specific training, we first optimize sub-word parameters to reconstruct pre-trained word embeddings using various distance measures. We report interesting results on a variety of tasks: word similarity, word analogy, and part-of-speech tagging.
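
A hedged sketch of the reconstruction step: sub-word (here, character n-gram) vectors are optimized so that their sum approximates each pre-trained word embedding under a chosen distance (squared L2 below). The n-gram featurizer, the additive composition, and the distance are illustrative assumptions rather than the paper's exact choices.

# Hypothetical reconstruction of pre-trained embeddings from sub-word vectors.
import torch

def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def reconstruct(pretrained, ngram_vecs, vocab, steps=1000, lr=0.1):
    """pretrained: dict word -> (d,) tensor; ngram_vecs: dict ngram -> nn.Parameter
    (assumed to already cover all n-grams of the words in `vocab`)."""
    opt = torch.optim.SGD(list(ngram_vecs.values()), lr=lr)
    for _ in range(steps):
        loss = 0.0
        for word in vocab:
            pred = torch.stack([ngram_vecs[g] for g in char_ngrams(word)]).sum(dim=0)
            loss = loss + ((pred - pretrained[word]) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()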

Entity Identification as Multitasking
Karl Stratos
Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing

Standard approaches in entity identification hard-code boundary detection and type prediction into labels and perform Viterbi decoding. This has two disadvantages: 1. the runtime complexity grows quadratically in the number of types, and 2. there is no natural segment-level representation. In this paper, we propose a neural architecture that addresses these disadvantages. We frame the problem as multitasking, separating boundary detection and type prediction but optimizing them jointly. Despite its simplicity, this architecture performs competitively with fully structured models such as BiLSTM-CRFs while scaling linearly in the number of types. Furthermore, by construction, the model induces type-disambiguating embeddings of predicted mentions.
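
A hypothetical sketch of the multitask decomposition: one head detects mention boundaries per token, a separate head types each predicted segment, and the two losses are optimized jointly. The encoder, boundary label set, and segment pooling are assumptions for illustration.

# Hedged sketch: boundary detection + segment typing as two jointly trained heads.
import torch
import torch.nn as nn

class BoundaryTypeTagger(nn.Module):
    def __init__(self, dim, num_types):
        super().__init__()
        self.boundary_head = nn.Linear(dim, 3)        # e.g. {B, I, O} boundary labels
        self.type_head = nn.Linear(dim, num_types)    # cost grows linearly in #types

    def forward(self, token_states, segments):
        """token_states: (seq_len, dim); segments: list of (start, end) mention spans."""
        boundary_logits = self.boundary_head(token_states)              # per token
        segment_vecs = (torch.stack([token_states[s:e].mean(dim=0)
                                     for (s, e) in segments])
                        if segments else None)                          # per segment
        type_logits = self.type_head(segment_vecs) if segments else None
        return boundary_logits, type_logits

# Training would sum a boundary cross-entropy and a type cross-entropy so that
# both tasks are optimized jointly over the shared encoder.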

Domain Attention with an Ensemble of Experts
Young-Bum Kim | Karl Stratos | Dongchan Kim
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

An important problem in domain adaptation is to quickly generalize to a new domain with limited supervision given K existing domains. One approach is to retrain a global model across all K + 1 domains using standard techniques, for instance Daumé III (2009). However, it is desirable to adapt without having to re-estimate a global model from scratch each time a new domain with potentially new intents and slots is added. We describe a solution based on attending over an ensemble of domain experts. We assume K domain-specific intent and slot models trained on their respective domains. When given domain K + 1, our model uses a weighted combination of the K domain experts’ feedback along with its own opinion to make predictions on the new domain. In experiments, the model significantly outperforms baselines that do not use domain adaptation and also performs better than the full retraining approach.
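
A hedged sketch of the attention step: the new domain's own hidden state queries the K experts' feedback vectors, and the attention-weighted combination is merged with its own representation before prediction. Shapes, the additive combination, and the output layer are illustrative assumptions rather than the paper's exact architecture.

# Hypothetical domain attention over K expert feedback vectors.
import torch
import torch.nn.functional as F

def domain_attention(own_state, expert_states, output_layer):
    """own_state: (d,); expert_states: (K, d) feedback from K domain experts."""
    attn = F.softmax(expert_states @ own_state, dim=0)       # (K,) expert weights
    expert_summary = attn @ expert_states                     # (d,) weighted feedback
    combined = own_state + expert_summary                     # own opinion + experts
    return output_layer(combined)                             # predictions for domain K + 1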

Adversarial Adaptation of Synthetic or Stale Data
Young-Bum Kim | Karl Stratos | Dongchan Kim
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Two types of data shift common in practice are 1. transferring from synthetic data to live user data (a deployment shift), and 2. transferring from stale data to current data (a temporal shift). Both cause a distribution mismatch between training and evaluation, leading to a model that overfits the flawed training data and performs poorly on the test data. We propose a solution to this mismatch problem by framing it as domain adaptation, treating the flawed training dataset as a source domain and the evaluation dataset as a target domain. To this end, we use and build on several recent advances in neural domain adaptation such as adversarial training (Ganin et al., 2016) and the domain separation network (Bousmalis et al., 2016), proposing a new effective adversarial training scheme. In both supervised and unsupervised adaptation scenarios, our approach yields clear improvement over strong baselines.
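
For background, the gradient reversal trick underlying adversarial training of this kind (Ganin et al., 2016) can be sketched as follows: the forward pass is the identity, while the backward pass negates (and scales) the gradient, pushing the feature extractor toward domain-invariant representations. This shows the generic mechanism only, not the paper's full adversarial training scheme.

# Sketch of a gradient reversal layer for adversarial domain adaptation.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: domain_logits = domain_classifier(grad_reverse(features))
# The domain classifier learns to tell source from target, while the reversed
# gradient trains the shared features to confuse it.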

2016

Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models
Karl Stratos | Michael Collins | Daniel Hsu
Transactions of the Association for Computational Linguistics, Volume 4

We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., “the” is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words.

Frustratingly Easy Neural Domain Adaptation
Young-Bum Kim | Karl Stratos | Ruhi Sarikaya
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Popular techniques for domain adaptation such as the feature augmentation method of Daumé III (2009) have mostly been considered for sparse binary-valued features, but not for dense real-valued features such as those used in neural networks. In this paper, we describe simple neural extensions of these techniques. First, we propose a natural generalization of the feature augmentation method that uses K + 1 LSTMs where one model captures global patterns across all K domains and the remaining K models capture domain-specific information. Second, we propose a novel application of the framework for learning shared structures by Ando and Zhang (2005) to domain adaptation, and also provide a neural extension of their approach. In experiments on slot tagging over 17 domains, our methods give clear performance improvement over Daumé III (2009) applied on feature-rich CRFs.
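
A hedged sketch of the K + 1 LSTM generalization of feature augmentation: one global LSTM shared across all domains plus one LSTM per domain, with the two outputs concatenated before the tagging layer. Dimensions and the tagging head are illustrative assumptions.

# Hypothetical K + 1 LSTM feature augmentation for slot tagging.
import torch
import torch.nn as nn

class AugmentedTagger(nn.Module):
    def __init__(self, num_domains, emb_dim, hid_dim, num_labels):
        super().__init__()
        self.global_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.domain_lstms = nn.ModuleList(
            nn.LSTM(emb_dim, hid_dim, batch_first=True) for _ in range(num_domains))
        self.out = nn.Linear(2 * hid_dim, num_labels)

    def forward(self, embs, domain_id):
        """embs: (batch, seq_len, emb_dim); domain_id: index of the input's domain."""
        g, _ = self.global_lstm(embs)                # patterns shared across all K domains
        d, _ = self.domain_lstms[domain_id](embs)    # domain-specific patterns
        return self.out(torch.cat([g, d], dim=-1))   # per-token label logits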

Domainless Adaptation by Constrained Decoding on a Schema Lattice
Young-Bum Kim | Karl Stratos | Ruhi Sarikaya
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In many applications such as personal digital assistants, there is a constant need for new domains to increase the system’s coverage of user queries. A conventional approach is to learn a separate model every time a new domain is introduced. This approach is slow, inefficient, and a bottleneck for scaling to a large number of domains. In this paper, we introduce a framework that allows us to have a single model that can handle all domains, including unknown domains that may be created in the future, as long as they are covered in the master schema. The key idea is to remove the need for distinguishing domains by explicitly predicting the schema of queries. Given the permitted schema of a query, we perform constrained decoding on a lattice of slot sequences allowed under the schema. The proposed model achieves competitive and often superior performance over the conventional model trained separately per domain.
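
A hypothetical sketch of the constrained decoding step: slot labels outside the predicted schema are masked to -inf before running Viterbi, so only label sequences permitted by the schema can be produced. The emission and transition scores are placeholders for whatever tagger produces them.

# Hedged sketch: Viterbi decoding restricted to schema-permitted labels.
import numpy as np

def constrained_viterbi(emissions, transitions, allowed_labels):
    """emissions: (T, L); transitions: (L, L); allowed_labels: set of permitted label ids."""
    T, L = emissions.shape
    mask = np.full(L, -np.inf)
    mask[list(allowed_labels)] = 0.0                # disallowed labels get -inf
    score = emissions[0] + mask
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t] + mask    # (prev, curr)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                               # best schema-consistent label sequence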

Scalable Semi-Supervised Query Classification Using Matrix Sketching
Young-Bum Kim | Karl Stratos | Ruhi Sarikaya
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

Weakly Supervised Slot Tagging with Partially Labeled Sequences from Web Search Click Logs
Young-Bum Kim | Minwoo Jeong | Karl Stratos | Ruhi Sarikaya
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Simple Semi-Supervised POS Tagging
Karl Stratos | Michael Collins
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

New Transfer Learning Techniques for Disparate Label Sets
Young-Bum Kim | Karl Stratos | Ruhi Sarikaya | Minwoo Jeong
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Model-based Word Embeddings from Decompositions of Count Matrices
Karl Stratos | Michael Collins | Daniel Hsu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Pre-training of Hidden-Unit CRFs
Young-Bum Kim | Karl Stratos | Ruhi Sarikaya
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Compact Lexicon Selection with Spectral Methods
Young-Bum Kim | Karl Stratos | Xiaohu Liu | Ruhi Sarikaya
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2013

Experiments with Spectral Learning of Latent-Variable PCFGs
Shay B. Cohen | Karl Stratos | Michael Collins | Dean P. Foster | Lyle Ungar
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Spectral Learning Algorithms for Natural Language Processing
Shay Cohen | Michael Collins | Dean Foster | Karl Stratos | Lyle Ungar
NAACL HLT 2013 Tutorial Abstracts

Spectral Learning of Refinement HMMs
Karl Stratos | Alexander Rush | Shay B. Cohen | Michael Collins
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

2012

Detecting Visual Text
Jesse Dodge | Amit Goyal | Xufeng Han | Alyssa Mensch | Margaret Mitchell | Karl Stratos | Kota Yamaguchi | Yejin Choi | Hal Daumé III | Alex Berg | Tamara Berg
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Spectral Learning of Latent-Variable PCFGs
Shay B. Cohen | Karl Stratos | Michael Collins | Dean P. Foster | Lyle Ungar
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Midge: Generating Image Descriptions From Computer Vision Detections
Margaret Mitchell | Jesse Dodge | Amit Goyal | Kota Yamaguchi | Karl Stratos | Xufeng Han | Alyssa Mensch | Alex Berg | Tamara Berg | Hal Daumé III
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics