Jia Xu

2025

This paper studies the relevance modeling problem by integrating world knowledge stored in the parameters of LLMs with specialized domain knowledge represented by user behavior data for achieving promising performance. The novel framework ProRBP is proposed, which innovatively develops user-driven behavior neighbor retrieval module to learn domain-specific knowledge in time and introduces progressive prompting and aggregation module for considering diverse aspects of the relevance and prediction stability. We explore an industrial implementation to deploy LLMs to handle full-scale search traffics of Alipay with acceptable cost and latency. The comprehensive experiments on real-world industry data and online A/B testing validate the superiority of our proposal and the effectiveness of its main modules.

pdf bib abs

ResFormer: All-Time Reservoir Memory for Long Sequence Classification
Hongbo Liu | Jia Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art performance, have inherent limitations due to quadratic time and memory complexity, restricting their input length. Although extensive efforts have aimed at reducing computational demands, processing extensive contexts remains challenging. To overcome these limitations, we propose ResFormer, a novel neural network architecture designed to model varying context lengths efficiently through a cascaded methodology. ResFormer integrates an reservoir computing network featuring a nonlinear readout to effectively capture long-term contextual dependencies in linear time. Concurrently, short-term dependencies within sentences are modeled using a conventional Transformer architecture with fixed-length inputs. Experiments demonstrate that ResFormer significantly outperforms baseline models of DeepSeek-Qwen and ModernBERT, delivering an accuracy improvement of up to +22.3% on the EmoryNLP dataset and consistent gains on MultiWOZ, MELD, and IEMOCAP. In addition, ResFormer exhibits reduced memory consumption, underscoring its effectiveness and efficiency in modeling extensive contextual information.

pdf bib abs

Reinforcement Learning with Supervised Alignment
João Luís Lins | Jia Xu
Findings of the Association for Computational Linguistics: EMNLP 2025

Supervised fine-tuning (SFT) is a widely used and highly effective method for adapting Large Language Models (LLMs) to specific tasks. However, it often suffers from overfitting, causing models to excel on fine-tuned data but struggle with unseen or rare real-world inputs. While recent methods like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) aim to align LLMs with human values and tasks, they face challenges such as the high cost of human labeling or instabilities and biases inherent in using LLMs as judges. To address these issues, we propose a novel approach called Reinforcement Learning from supervised Alignment (RLA), which constructs a supervised alignment to train the reward model for reinforcement learning. Using only 100,000 MS MARCO samples, our method outperforms RLAIF by a relative margin ranging from +5.38% to +131.8%. It also significantly enhances the baseline Llama3 LLM, achieving up to +55% improvement on in-domain tasks and up to +16% on out-of-domain tasks. While RLA slightly underperforms supervised fine-tuning (SFT) on in-domain benchmarks, it surpasses SFT by up to 50 times on out-of-domain and cross-task evaluations, demonstrating strong generalization capabilities.

pdf bib abs

EmByte: Decomposition and Compression Learning for Small yet Private NLP
Shenglan Li | Jia Xu | Mengjiao Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent breakthroughs in natural language processing (NLP) have come with escalating model sizes and computational costs, posing significant challenges for deployment in real-time and resource-constrained environments. We introduce EMBYTE, a novel byte-level tokenization model that achieves substantial embedding compression while preserving NLP accuracy and enhancing privacy. At the core of EMBYTE is a new Decompose-and-Compress (DeComp) learning strategy that decomposes subwords into fine-grained byte embeddings and then compresses them via neural projection. DeComp enables EMBYTE to be shrunk down to any vocabulary size (e.g., 128 or 256), drastically reducing embedding parameter count by up to 94% compared to subword-based models without increasing sequence length or degrading performance. Moreover, EMBYTE is resilient to privacy threats such as gradient inversion attacks, due to its byte-level many-to-one mapping structure. Empirical results on GLUE, machine translation, sentiment analysis, and language modeling tasks show that EMBYTE matches or surpasses the performance of significantly larger models, while offering improved efficiency. This makes EMBYTE a lightweight and generalizable NLP solution, well-suited for deployment in privacy-sensitive or low-resource environments.

2023

pdf bib abs

Probabilistic Robustness for Data Filtering
Yu Yu | Abdul Rafae Khan | Shahram Khadivi | Jia Xu
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We introduce our probabilistic robustness rewarded data optimization (PRoDO) approach as a framework to enhance the model’s generalization power by selecting training data that optimizes our probabilistic robustness metrics. We use proximal policy optimization (PPO) reinforcement learning to approximately solve the computationally intractable training subset selection problem. The PPO’s reward is defined as our (𝛼,𝜖, 𝛾)-Robustness that measures performance consistency over multiple domains by simulating unknown test sets in real-world scenarios using a leaving-one-out strategy. We demonstrate that our PRoDO effectively filters data that lead to significantly higher prediction accuracy and robustness on unknown-domain test sets. Our experiments achieve up to +17.2% increase of accuracy (+25.5% relatively) in sentiment analysis, and -28.05 decrease of perplexity (-32.1% relatively) in language modeling.In addition, our probabilistic (𝛼,𝜖, 𝛾)-Robustness definition serves as an evaluation metric with higher levels of agreement with human annotations than typical performance-based metrics.

2022

pdf bib abs

SIT at MixMT 2022: Fluent Translation Built on Giant Pre-trained Models
Abdul Khan | Hrishikesh Kanade | Girish Budhrani | Preet Jhanglani | Jia Xu
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the Stevens Institute of Technology’s submission for the WMT 2022 Shared Task: Code-mixed Machine Translation (MixMT). The task consisted of two subtasks, subtask 1 Hindi/English to Hinglish and subtask 2 Hinglish to English translation. Our findings lie in the improvements made through the use of large pre-trained multilingual NMT models and in-domain datasets, as well as back-translation and ensemble techniques. The translation output is automatically evaluated against the reference translations using ROUGE-L and WER. Our system achieves the 1st position on subtask 2 according to ROUGE-L, WER, and human evaluation, 1st position on subtask 1 according to WER and human evaluation, and 3rd position on subtask 1 with respect to ROUGE-L metric.

pdf bib abs

Byte-based Multilingual NMT for Endangered Languages
Mengjiao Zhang | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

Multilingual neural machine translation (MNMT) jointly trains a shared model for translation with multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and representation bottleneck, which often degrades translation performance on certain language pairs. While byte tokenization is used to tackle the OOV problems in neural machine translation (NMT), until now its capability has not been validated in MNMT. Additionally, existing work has not studied how byte encoding can benefit endangered language translation to our knowledge. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance in endangered languages. Furthermore, we design a random byte mapping method with an ensemble prediction to enhance our model robustness. Experimental results show that our BMNMT consistently and significantly outperforms subword/word-based baselines on twelve language pairs up to +18.5 BLEU points, an 840% relative improvement.

pdf bib abs

Can Data Diversity Enhance Learning Generalization?
Yu Yu | Shahram Khadivi | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

This paper introduces our Diversity Advanced Actor-Critic reinforcement learning (A2C) framework (DAAC) to improve the generalization and accuracy of Natural Language Processing (NLP). We show that the diversification of training samples alleviates overfitting and improves model generalization and accuracy. We quantify diversity on a set of samples using the max dispersion, convex hull volume, and graph entropy based on sentence embeddings in high-dimensional metric space. We also introduce A2C to select such a diversified training subset efficiently. Our experiments achieve up to +23.8 accuracy increase (38.0% relatively) in sentiment analysis, -44.7 perplexity decrease (37.9% relatively) in language modeling, and consistent improvements in named entity recognition over various domains. In particular, our method outperforms both domain adaptation and generalization baselines without using any target domain knowledge.

pdf bib abs

Analyzing Encoded Concepts in Transformer Language Models
Hassan Sajjad | Nadir Durrani | Fahim Dalvi | Firoj Alam | Abdul Khan | Jia Xu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a novel framework ConceptX, to analyze how latent concepts are encoded in representations learned within pre-trained lan-guage models. It uses clustering to discover the encoded concepts and explains them by aligning with a large set of human-defined concepts. Our analysis on seven transformer language models reveal interesting insights: i) the latent space within the learned representations overlap with different linguistic concepts to a varying degree, ii) the lower layers in the model are dominated by lexical concepts (e.g., affixation) and linguistic ontologies (e.g. Word-Net), whereas the core-linguistic concepts (e.g., morphology, syntactic relations) are better represented in the middle and higher layers, iii) some encoded concepts are multi-faceted and cannot be adequately explained using the existing human-defined concepts.

pdf bib abs

Measuring Robustness for NLP
Yu Yu | Abdul Rafae Khan | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

The quality of Natural Language Processing (NLP) models is typically measured by the accuracy or error rate of a predefined test set. Because the evaluation and optimization of these measures are narrowed down to a specific domain like news and cannot be generalized to other domains like Twitter, we often observe that a system reported with human parity results generates surprising errors in real-life use scenarios. We address this weakness with a new approach that uses an NLP quality measure based on robustness. Unlike previous work that has defined robustness using Minimax to bound worst cases, we measure robustness based on the consistency of cross-domain accuracy and introduce the coefficient of variation and (epsilon, gamma)-Robustness. Our measures demonstrate higher agreements with human evaluation than accuracy scores like BLEU on ranking Machine Translation (MT) systems. Our experiments of sentiment analysis and MT tasks show that incorporating our robustness measures into learning objectives significantly enhances the final NLP prediction accuracy over various domains, such as biomedical and social media.

2021

pdf bib abs

Grouping Words with Semantic Diversity
Karine Chubarian | Abdul Rafae Khan | Anastasios Sidiropoulos | Jia Xu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Deep Learning-based NLP systems can be sensitive to unseen tokens and hard to learn with high-dimensional inputs, which critically hinder learning generalization. We introduce an approach by grouping input words based on their semantic diversity to simplify input language representation with low ambiguity. Since the semantically diverse words reside in different contexts, we are able to substitute words with their groups and still distinguish word meanings relying on their contexts. We design several algorithms that compute diverse groupings based on random sampling, geometric distances, and entropy maximization, and we prove formal guarantees for the entropy-based algorithms. Experimental results show that our methods generalize NLP models and demonstrate enhanced accuracy on POS tagging and LM tasks and significant improvements on medium-scale machine translation tasks, up to +6.5 BLEU points. Our source code is available at https://github.com/abdulrafae/dg.

2020

pdf bib abs

Coding Textual Inputs Boosts the Accuracy of Neural Networks
Abdul Rafae Khan | Jia Xu | Weiwei Sun
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Natural Language Processing (NLP) tasks are usually performed word by word on textual inputs. We can use arbitrary symbols to represent the linguistic meaning of a word and use these symbols as inputs. As “alternatives” to a text representation, we introduce Soundex, MetaPhone, NYSIIS, logogram to NLP, and develop fixed-output-length coding and its extension using Huffman coding. Each of those codings combines different character/digital sequences and constructs a new vocabulary based on codewords. We find that the integration of those codewords with text provides more reliable inputs to Neural-Network-based NLP systems through redundancy than text-alone inputs. Experiments demonstrate that our approach outperforms the state-of-the-art models on the application of machine translation, language modeling, and part-of-speech tagging. The source code is available at https://github.com/abdulrafae/coding_nmt.

2019

pdf bib abs

This paper describes the systems of the CUNY-PKU team in SemEval 2019 Task 1: Cross-lingual Semantic Parsing with UCCA. We introduce a novel model by applying a cascaded MLP and BiLSTM model. Then, we ensemble multiple system-outputs by reparsing. In particular, we introduce a new decoding algorithm for building the UCCA representation. Our system won the first place in one track (French-20K-Open), second places in four tracks (English-Wiki-Open, English-20K-Open, German-20K-Open, and German-20K-Closed), and third place in one track (English-20K-Closed), among all seven tracks.

2018

pdf bib abs

Assessing Quality Estimation Models for Sentence-Level Prediction
Hoang Cuong | Jia Xu
Proceedings of the 27th International Conference on Computational Linguistics

This paper provides an evaluation of a wide range of advanced sentence-level Quality Estimation models, including Support Vector Regression, Ride Regression, Neural Networks, Gaussian Processes, Bayesian Neural Networks, Deep Kernel Learning and Deep Gaussian Processes. Beside the accurateness, our main concerns are also the robustness of Quality Estimation models. Our work raises the difficulty in building strong models. Specifically, we show that Quality Estimation models often behave differently in Quality Estimation feature space, depending on whether the scale of feature space is small, medium or large. We also show that Quality Estimation models often behave differently in evaluation settings, depending on whether test data come from the same domain as the training data or not. Our work suggests several strong candidates to use in different circumstances.

pdf bib abs

Hunter NMT System for WMT18 Biomedical Translation Task: Transfer Learning in Neural Machine Translation
Abdul Khan | Subhadarshi Panda | Jia Xu | Lampros Flokas
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission of Hunter Neural Machine Translation (NMT) to the WMT’18 Biomedical translation task from English to French. The discrepancy between training and test data distribution brings a challenge to translate text in new domains. Beyond the previous work of combining in-domain with out-of-domain models, we found accuracy and efficiency gain in combining different in-domain models. We conduct extensive experiments on NMT with transfer learning. We train on different in-domain Biomedical datasets one after another. That means parameters of the previous training serve as the initialization of the next one. Together with a pre-trained out-of-domain News model, we enhanced translation quality with 3.73 BLEU points over the baseline. Furthermore, we applied ensemble learning on training models of intermediate epochs and achieved an improvement of 4.02 BLEU points over the baseline. Overall, our system is 11.29 BLEU points above the best system of last year on the EDP 2017 test set.

Generating Virtual Parallel Corpus: A Compatibility Centric Method
Jia Xu | Weiwei Sun
Proceedings of Machine Translation Summit XIII: Papers

pdf bib

Enhancing Chinese Word Segmentation Using Unlabeled Data
Weiwei Sun | Jia Xu
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib

Parallel Corpus Refinement as an Outlier Detection Algorithm
Kaveh Taghipour | Shahram Khadivi | Jia Xu
Proceedings of Machine Translation Summit XIII: Papers

pdf bib

DFKI Hybrid Machine Translation System for WMT 2011 - On the Integration of SMT and RBMT
Jia Xu | Hans Uszkoreit | Casey Kennington | David Vilar | Xiaojun Zhang
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf bib

2008

pdf bib

Phrase Table Training for Precision and Recall: What Makes a Good Phrase and a Good Phrase Pair?
Yonggang Deng | Jia Xu | Yuqing Gao
Proceedings of ACL-08: HLT

pdf bib

Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation
Jia Xu | Jianfeng Gao | Kristina Toutanova | Hermann Ney
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

pdf bib

Domain dependent statistical machine translation
Jia Xu | Yonggang Deng | Yuqing Gao | Hermann Ney
Proceedings of Machine Translation Summit XI: Papers

2006

pdf bib

Partitioning Parallel Documents Using Binary Segmentation
Jia Xu | Richard Zens | Hermann Ney
Proceedings on the Workshop on Statistical Machine Translation

pdf bib abs

Error Analysis of Statistical Machine Translation Output
David Vilar | Jia Xu | Luis Fernando D’Haro | Hermann Ney
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Evaluation of automatic translation output is a difficult task. Several performance measures like Word Error Rate, Position Independent Word Error Rate and the BLEU and NIST scores are widely use and provide a useful tool for comparing different systems and to evaluate improvements within a system. However the interpretation of all of these measures is not at all clear, and the identification of the most prominent source of errors in a given system using these measures alone is not possible. Therefore some analysis of the generated translations is needed in order to identify the main problems and to focus the research efforts. This area is however mostly unexplored and few works have dealt with it until now. In this paper we will present a framework for classification of the errors of a machine translation system and we will carry out an error analysis of the system used by the RWTH in the first TC-STAR evaluation.