Vlad Niculae

2025

kNN For Whisper And Its Effect On Bias And Speaker Adaptation
Maya K. Nachesa | Vlad Niculae
Findings of the Association for Computational Linguistics: NAACL 2025

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level k nearest neighbor search (kNN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from kNN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.

Angular Dispersion Accelerates k-Nearest Neighbors Machine Translation
Evgeniia Tokarchuk | Sergey Troshin | Vlad Niculae
Findings of the Association for Computational Linguistics: EMNLP 2025

Augmenting neural machine translation with external memory at decoding time, in the form of k-nearest neighbors machine translation (k-NN MT), is a well-established strategy for increasing translation performance. k-NN MT retrieves a set of tokens that occurred in the most similar contexts recorded in a prepared data store, using hidden state representations of translation contexts as vector lookup keys. One of the main disadvantages of this method is the high computational cost and memory requirements. Since an exhaustive search is not feasible in large data stores, practitioners commonly use approximate k-NN lookup, yet even such algorithms are a bottleneck. In contrast to research directions seeking to accelerate k-NN MT by reducing data store size or the number of lookup calls, we pursue an orthogonal direction based on the performance properties of approximate k-NN lookup data structures. In particular, we propose encouraging angular dispersion of the neural hidden representations of contexts. We show that improving dispersion leads to better balance in the retrieval data structures, accelerating retrieval and slightly improving translations.

Context-Aware or Context-Insensitive? Assessing LLMs’ Performance in Document-Level Translation
Wafaa Mohammed | Vlad Niculae
Proceedings of Machine Translation Summit XX: Volume 1

Large language models (LLMs) are increasingly strong contenders in machine translation. In this work, we focus on document-level translation, where some words cannot be translated without context from outside the sentence. Specifically, we investigate the ability of prominent LLMs to utilize the document context during translation through a perturbation analysis (analyzing models’ robustness to perturbed and randomized document context) and an attribution analysis (examining the contribution of relevant context to the translation). We conduct an extensive evaluation across nine LLMs from diverse model families and training paradigms, including translation-specialized LLMs, alongside two encoder-decoder transformer baselines. We find that LLMs’ improved document-translation performance compared to encoder-decoder models is not reflected in pronoun translation performance. Our analysis highlights the need for context-aware finetuning of LLMs with a focus on relevant parts of the context to improve their reliability for document-level translation.

Asking a Language Model for Diverse Responses
Sergey Troshin | Irina Saparina | Antske Fokkens | Vlad Niculae
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)

Large language models increasingly rely on explicit reasoning chains and can produce multiple plausible responses for a given context. We study the candidate sampler that produces the set of plausible responses, contrasting ancestral (parallel) sampling against two alternatives: enumeration, which asks the model to produce n candidates in one pass, and iterative sampling, which proposes candidates sequentially while conditioning on the currently generated response set. Under matched budgets, we compare these samplers on quality, lexical and computation flow diversity, and efficiency. Our empirical results demonstrate that enumeration and iterative strategies result in higher diversity at comparable quality. Our findings highlight the potential of simple non-independent sampling strategies to improve response diversity without sacrificing generation quality.

2024

Traitement Automatique des Langues, Volume 64, Numéro 2 : Robustesse et limites des modèles de traitement automatique des langues [Robustness and limits of NLP models]
Caio Corro | Gaël Lejeune | Vlad Niculae
Traitement Automatique des Langues, Volume 64, Numéro 2 : Robustesse et limites des modèles de traitement automatique des langues [Robustness and limits of NLP models]

Introduction au numéro spécial - Robustesse et limites des modèles de traitement automatique des langues [Introduction to the special issue on robustness and limits of NLP models]
Caio Corro | Gaël Lejeune | Vlad Niculae
Traitement Automatique des Langues, Volume 64, Numéro 2 : Robustesse et limites des modèles de traitement automatique des langues [Robustness and limits of NLP models]

Entropy– and Distance-Regularized Attention Improves Low-Resource Neural Machine Translation
Ali Araabi | Vlad Niculae | Christof Monz
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Transformer-based models in Neural Machine Translation (NMT) rely heavily on multi-head attention for capturing dependencies within and across source and target sequences. In Transformers, attention mechanisms dynamically determine which parts of the sentence to focus on in the encoder and decoder through self-attention and cross-attention. Our experiments show that high-resource NMT systems often exhibit a specific peaked attention distribution, indicating a focus on key elements. However, in low-resource NMT, attention tends to be dispersed throughout the sentence, lacking the focus demonstrated by high-resource models. To tackle this issue, we present EaDRA (Entropy– and Distance-Regularized Attention), which introduces an inductive bias to prioritize essential elements and guide the attention mechanism accordingly. Extensive experiments using EaDRA on diverse low-resource language pairs demonstrate significant improvements in translation quality, while incurring negligible computational cost.

On Measuring Context Utilization in Document-Level MT Systems
Wafaa Mohammed | Vlad Niculae
Findings of the Association for Computational Linguistics: EACL 2024

Document-level translation models are usually evaluated using general metrics such as BLEU, which are not informative about the benefits of context. Current work on context-aware evaluation, such as contrastive methods, only measures translation accuracy on words that need context for disambiguation. Such measures cannot reveal whether the translation model uses the correct supporting context. We propose to complement accuracy-based evaluation with measures of context utilization. We find that perturbation-based analysis (comparing models’ performance when provided with correct versus random context) is an effective measure of overall context utilization. For a finer-grained phenomenon-specific evaluation, we propose to measure how much the supporting context contributes to handling context-dependent discourse phenomena. We show that automatically-annotated supporting context gives similar conclusions to human-annotated context and can be used as an alternative for cases where human annotations are not available. Finally, we highlight the importance of using discourse-rich datasets when assessing context utilization.

The Unreasonable Effectiveness of Random Target Embeddings for Continuous-Output Neural Machine Translation
Evgeniia Tokarchuk | Vlad Niculae
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Continuous-output neural machine translation (CoNMT) replaces the discrete next-word prediction problem with an embedding prediction. The semantic structure of the target embedding space (i.e., closeness of related words) is intuitively believed to be crucial. We challenge this assumption and show that completely random output embeddings can outperform laboriously pre-trained ones, especially on larger datasets. Further investigation shows this surprising effect is strongest for rare words, due to the geometry of their embeddings. We shed further light on this finding by designing a mixed strategy that combines random and pre-trained embeddings, and that performs best overall.

2023

Viewing Knowledge Transfer in Multilingual Machine Translation Through a Representational Lens
David Stap | Vlad Niculae | Christof Monz
Findings of the Association for Computational Linguistics: EMNLP 2023

We argue that translation quality alone is not a sufficient metric for measuring knowledge transfer in multilingual neural machine translation. To support this claim, we introduce Representational Transfer Potential (RTP), which measures representational similarities between languages. We show that RTP can measure both positive and negative transfer (interference), and find that RTP is strongly correlated with changes in translation quality, indicating that transfer does occur. Furthermore, we investigate data and language characteristics that are relevant for transfer, and find that multi-parallel overlap is an important yet under-explored feature. Based on this, we develop a novel training scheme, which uses an auxiliary similarity loss that encourages representations to be more invariant across languages by taking advantage of multi-parallel data. We show that our method yields increased translation quality for low- and mid-resource languages across multiple data and model setups.

Joint Dropout: Improving Generalizability in Low-Resource Neural Machine Translation through Phrase Pair Variables
Ali Araabi | Vlad Niculae | Christof Monz
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Despite the tremendous success of Neural Machine Translation (NMT), its performance on low-resource language pairs still remains subpar, partly due to the limited ability to handle previously unseen inputs, i.e., generalization. In this paper, we propose a method called Joint Dropout that addresses the challenge of low-resource neural machine translation by substituting phrases with variables, resulting in significant enhancement of compositionality, which is a key aspect of generalization. We observe a substantial improvement in translation quality for language pairs with minimal resources, as seen in BLEU and Direct Assessment scores. Furthermore, we conduct an error analysis, and find Joint Dropout to also enhance generalizability of low-resource NMT in terms of robustness and adaptability across different domains.

2022

How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?
Ali Araabi | Christof Monz | Vlad Niculae
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE), which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word-level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words for special cases, such as named-entities and when the languages involved are linguistically close to each other.

On Target Representation in Continuous-output Neural Machine Translation
Evgeniia Tokarchuk | Vlad Niculae
Proceedings of the 7th Workshop on Representation Learning for NLP

Continuous generative models proved their usefulness in high-dimensional data, such as image and audio generation. However, continuous models for text generation have received limited attention from the community. In this work, we study continuous text generation using Transformers for neural machine translation (NMT). We argue that the choice of embeddings is crucial for such models, so we aim to focus on one particular aspect: target representation via embeddings. We explore pretrained embeddings and also introduce knowledge transfer from the discrete Transformer model using embeddings in Euclidean and non-Euclidean spaces. Our results on the WMT Romanian-English and English-Turkish benchmarks show such transfer leads to the best-performing continuous model.

2020

Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning
Tsvetomila Mihaylova | Vlad Niculae | André F. T. Martins
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Latent structure models are a powerful tool for modeling language data: they can mitigate the error propagation and annotation bottleneck in pipeline systems, while simultaneously uncovering linguistic insights about the data. One challenge with end-to-end training of these models is the argmax operation, which has null gradient. In this paper, we focus on surrogate gradients, a popular strategy to deal with this problem. We explore latent structure learning through the angle of pulling back the downstream learning objective. In this paradigm, we discover a principled motivation for both the straight-through estimator (STE) as well as the recently-proposed SPIGOT – a variant of STE for structured models. Our perspective leads to new algorithms in the same family. We empirically compare the known and the novel pulled-back estimators against the popular alternatives, yielding new insight for practitioners and revealing intriguing failure cases.

2019

Adaptively Sparse Transformers
Gonçalo M. Correia | Vlad Niculae | André F. T. Martins
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with alpha-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the alpha parameter – which controls the shape and sparsity of alpha-entmax – allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets. Findings of the quantitative and qualitative analysis of our approach include that heads in different layers learn different sparsity preferences and tend to be more diverse in their attention distributions than softmax Transformers. Furthermore, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.

Sparse Sequence-to-Sequence Models
Ben Peters | Vlad Niculae | André F. T. Martins
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Sequence-to-sequence models are a powerful workhorse of NLP. Most variants employ a softmax transformation in both their attention mechanism and output layer, leading to dense alignments and strictly positive output probabilities. This density is wasteful, making models less interpretable and assigning probability mass to many implausible outputs. In this paper, we propose sparse sequence-to-sequence models, rooted in a new family of 𝛼-entmax transformations, which includes softmax and sparsemax as particular cases, and is sparse for any 𝛼 > 1. We provide fast algorithms to evaluate these transformations and their gradients, which scale well for large vocabulary sizes. Our models are able to produce sparse alignments and to assign nonzero probability to a short list of plausible outputs, sometimes rendering beam search exact. Experiments on morphological inflection and machine translation reveal consistent gains over dense models.

Latent Structure Models for Natural Language Processing
André F. T. Martins | Tsvetomila Mihaylova | Nikita Nangia | Vlad Niculae
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Latent structure models are a powerful tool for modeling compositional data, discovering linguistic structure, and building NLP pipelines. They are appealing for two main reasons: they allow incorporating structural bias during training, leading to more accurate models; and they allow discovering hidden linguistic structure, which provides better interpretability. This tutorial will cover recent advances in discrete latent structure models. We discuss their motivation, potential, and limitations, then explore in detail three strategies for designing such models: gradient approximation, reinforcement learning, and end-to-end differentiable methods. We highlight connections among all these methods, enumerating their strengths and weaknesses. The models we present and analyze have been applied to a wide variety of NLP tasks, including sentiment analysis, natural language inference, language modeling, machine translation, and semantic parsing. Examples and evaluation will be covered throughout. After attending the tutorial, a practitioner will be better informed about which method is best suited for their problem.

2018

Towards Dynamic Computation Graphs via Sparse Latent Structure
Vlad Niculae | André F. T. Martins | Claire Cardie
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Deep NLP models benefit from underlying structures in the data—e.g., parse trees—typically extracted using off-the-shelf parsers. Recent attempts to jointly learn the latent structure encounter a tradeoff: either make factorization assumptions that limit expressiveness, or sacrifice end-to-end differentiability. Using the recently proposed SparseMAP inference, which retrieves a sparse distribution over latent structures, we propose a novel approach for end-to-end learning of latent structure predictors jointly with a downstream predictor. To the best of our knowledge, our method is the first to enable unrestricted dynamic computation graph construction from the global latent structure, while maintaining differentiability.

Interpretable Structure Induction via Sparse Attention
Ben Peters | Vlad Niculae | André F. T. Martins
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Neural network methods are experiencing wide adoption in NLP, thanks to their empirical performance on many tasks. Modern neural architectures go way beyond simple feedforward and recurrent models: they are complex pipelines that perform soft, differentiable computation instead of discrete logic. The price of such soft computing is the introduction of dense dependencies, which make it hard to disentangle the patterns that trigger a prediction. Our recent work on sparse and structured latent computation presents a promising avenue for enhancing interpretability of such neural pipelines. Through this extended abstract, we aim to discuss and explore the potential and impact of our methods.

2017

Argument Mining with Structured SVMs and RNNs
Vlad Niculae | Joonsuk Park | Claire Cardie
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a novel factor graph model for argument mining, designed for settings in which the argumentative relations in a document do not necessarily form a tree structure. (This is the case in over 20% of the web comments dataset we release.) Our model jointly learns elementary unit type classification and argumentative relation prediction. Moreover, our model supports SVM and RNN parametrizations, can enforce structure constraints (e.g., transitivity), and can express dependencies between adjacent relations and propositions. Our approaches outperform unstructured baselines in both web comments and argumentative essay datasets.

We address the task of stress prediction as a sequence tagging problem. We present sequential models with averaged perceptron training for learning primary stress in Romanian words. We use character n-grams and syllable n-grams as features and we account for the consonant-vowel structure of the words. We show in this paper that Romanian stress is predictable, though not deterministic, by using data-driven machine learning techniques.

2013

Computational considerations of comparisons and similes
Vlad Niculae | Victoria Yaneva
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

Temporal Text Classification for Romanian Novels set in the Past
Alina Maria Ciobanu | Liviu P. Dinu | Octavia-Maria Şulea | Anca Dinu | Vlad Niculae
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Sequence Tagging for Verb Conjugation in Romanian
Liviu Dinu | Octavia-Maria Şulea | Vlad Niculae
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Temporal classification for historical Romanian texts
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu | Vlad Niculae | Octavia-Maria Şulea
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Comparison pattern matching and creative simile recognition
Vlad Niculae
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

Determining is-a relationships for Textual Entailment
Vlad Niculae | Octavian Popescu
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

2012

Dealing with the Grey Sheep of the Romanian Gender System, the Neuter
Liviu P. Dinu | Vlad Niculae | Maria Sulea
Proceedings of COLING 2012: Demonstration Papers

Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs
Liviu P. Dinu | Vlad Niculae | Octavia-Maria Sulea
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

The Romanian Neuter Examined Through A Two-Gender N-Gram Classification System
Liviu P. Dinu | Vlad Niculae | Octavia-Maria Şulea
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Romanian has been traditionally seen as bearing three lexical genders: masculine, feminine and neuter, although it has always been known to have only two agreement patterns (for masculine and feminine). A recent analysis of the Romanian gender system described in (Bateman and Polinsky, 2010), based on older observations, argues that there are two lexically unspecified noun classes in the singular and two different ones in the plural, and that what is generally called neuter in Romanian shares the class in the singular with masculines and the class in the plural with feminines, based not only on agreement features but also on form. Previous machine learning classifiers that have attempted to discriminate Romanian nouns according to gender have so far taken as input only the singular form, presupposing the traditional tripartite analysis. We propose a classifier based on two parallel support vector machines using n-gram features from the singular and from the plural, which outperforms previous classifiers in its high ability to distinguish the neuter. The performance of our system suggests that the two-gender analysis of Romanian, on which it is based, is on the right track.

Pastiche Detection Based on Stopword Rankings. Exposing Impersonators of a Romanian Writer
Liviu P. Dinu | Vlad Niculae | Maria-Octavia Sulea
Proceedings of the Workshop on Computational Approaches to Deception Detection