Shuoyang Ding

2025

This paper describes Nvidia-Nemo’s WMT 2025 Metrics Shared Task submission. We investigated two strategies for extending Machine Translation (MT) evaluation to unsegmented documents: 1) first segmenting into sentences and then applying regression-based metrics and 2) directly utilizing the long-context capabilities of LLMs. The base comparison of the segmentation-based and LLM-based metrics on the WMT 2023-24 evaluation sets indicated that the former performs more robustly across language pairs. Thus, we sought to improve the LLM-based approach by incorporating relative evaluation, a setting that jointly evaluates all candidate translations at once and relative to each other, rather than evaluating each separately. Our experiments using the open-source Qwen3 LLM show that relative evaluation improves score correlations with human judgment, but only if the task is structured as a 2-stage evaluate-then-refine problem.

Beyond instruction-conditioning, MoTE: Mixture of Task Experts for Multi-task Embedding Models
Miguel Romero Calvo | Shuoyang Ding | Corey D Barrett | Georgiana Dinu | George Karypis
Findings of the Association for Computational Linguistics: ACL 2025

Dense embeddings are fundamental to modern machine learning systems, powering Retrieval-Augmented Generation (RAG), information retrieval, and representation learning. While instruction-conditioning has become the dominant approach for embedding specialization, its direct application to low-capacity models imposes fundamental representational constraints that limit the performance gains derived from specialization. In this paper, we analyze these limitations and introduce the Mixture of Task Experts (MoTE) transformer block, which leverages task-specialized parameters trained with Task-Aware Contrastive Learning to enhance the model’s ability to generate specialized embeddings. Empirical results show that MoTE achieves 64% higher performance gains in retrieval datasets (+3.27 → +5.21) and 43% higher performance gains across all datasets (+1.81 → +2.60). Critically, these gains are achieved without altering instructions, training data, inference time, or number of active parameters.

Despite Large Language Models (LLMs) demonstrating superior translation performance and long-context capabilities, evaluation methodologies remain constrained to sentence-level assessment due to dataset limitations, token number restrictions in metrics, and rigid sentence boundary requirements. We introduce SEGALE, an evaluation scheme that extends existing automatic metrics to long-document translation by treating documents as continuous text and applying sentence segmentation and alignment methods. Our approach enables previously unattainable document-level evaluation, handling translations of arbitrary length generated with document-level prompts while accounting for under-/over-translations and varied sentence boundaries. Experiments show our scheme significantly outperforms existing long-form document evaluation schemes, while being comparable to evaluations performed with ground-truth sentence alignments. Additionally, we apply our scheme to book-length texts and newly demonstrate that many open-weight LLMs fail to effectively translate documents at their reported maximum context lengths.

2024

Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
Vilém Zouhar | Shuoyang Ding | Anna Currey | Tatyana Badeka | Jenyuan Wang | Brian Thompson
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics that are fine-tuned on human-generated MT quality judgments are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen domain scenario relative to both metrics that rely on the surface form and pre-trained metrics that are not fine-tuned on MT quality judgments.

2022

Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation
Weiting Tan | Shuoyang Ding | Huda Khayrallah | Philipp Koehn
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Neural Machine Translation (NMT) models are known to suffer from noisy inputs. To make models robust, we generate adversarial augmentation samples that attack the model and preserve the source-side meaning at the same time. To generate such samples, we propose a doubly-trained architecture that pairs two NMT models of opposite translation directions with a joint loss function, which combines the target-side attack and the source-side semantic similarity constraint. The results from our experiments across three different language pairs and two evaluation metrics show that these adversarial samples improve model robustness.

2021

Evaluating Saliency Methods for Neural Language Models
Shuoyang Ding | Philipp Koehn
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Saliency methods are widely used to interpret neural network predictions, but different variants of saliency methods often disagree even on the interpretations of the same prediction made by the same model. In these cases, how do we identify when these interpretations are trustworthy enough to be used in analyses? To address this question, we conduct a comprehensive and quantitative evaluation of saliency methods on a fundamental category of NLP models: neural language models. We evaluate the quality of prediction interpretations from two perspectives that each represent a desirable property of these interpretations: plausibility and faithfulness. Our evaluation is conducted on four different datasets constructed from the existing human annotation of syntactic and semantic agreements, at both the sentence level and the document level. Through our evaluation, we identified various ways saliency methods could yield interpretations of low quality. We recommend that future work deploying such methods to neural language models should carefully validate their interpretations before drawing insights.

The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task
Shuoyang Ding | Marcin Junczys-Dowmunt | Matt Post | Christian Federmann | Philipp Koehn
Proceedings of the Sixth Conference on Machine Translation

This paper presents the JHU-Microsoft joint submission for the WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip translation, and pseudo post-editing of the MT output. We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline. Our system is also the top-ranking system on the MT MCC metric for the English-German language pair.

Levenshtein Training for Word-level Quality Estimation
Shuoyang Ding | Marcin Junczys-Dowmunt | Matt Post | Philipp Koehn
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we propose a two-stage transfer learning procedure on both augmented data and human post-editing data. We also propose heuristics to construct reference labels that are compatible with subword-level finetuning and inference. Results on the WMT 2020 QE shared task dataset show that our proposed method has superior data efficiency under the data-constrained setting and competitive performance under the unconstrained setting.

2019

Parallelizable Stack Long Short-Term Memory
Shuoyang Ding | Philipp Koehn
Proceedings of the Third Workshop on Structured Prediction for NLP

Stack Long Short-Term Memory (StackLSTM) is useful for various applications such as parsing and string-to-tree neural machine translation, but it is also known to be notoriously difficult to parallelize for GPU training because its computations depend on discrete operations. In this paper, we tackle this problem by utilizing state access patterns of StackLSTM to homogenize computations with regard to different discrete operations. Our parsing experiments show that the method scales up almost linearly with increasing batch size, and our parallelized PyTorch implementation trains significantly faster compared to the DyNet C++ implementation.

An Exploration of Placeholding in Neural Machine Translation
Matt Post | Shuoyang Ding | Marianna Martindale | Winston Wu
Proceedings of Machine Translation Summit XVII: Research Track

A Call for Prudent Choice of Subword Merge Operations in Neural Machine Translation
Shuoyang Ding | Adithya Renduchintala | Kevin Duh
Proceedings of Machine Translation Summit XVII: Research Track

Saliency-driven Word Alignment Interpretation for Neural Machine Translation
Shuoyang Ding | Hainan Xu | Philipp Koehn
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

Despite their original goal to jointly learn to align and translate, Neural Machine Translation (NMT) models, especially Transformer, are often perceived as not learning interpretable word alignments. In this paper, we show that NMT models do learn interpretable word alignments, which could only be revealed with proper interpretation methods. We propose a series of such methods that are model-agnostic, are able to be applied either offline or online, and do not require parameter update or architectural change. We show that under the force decoding setup, the alignments induced by our interpretation method are of better quality than fast-align for some systems, and when performing free decoding, they agree well with the alignments induced by automatic alignment tools.
