Jiahao Xu

2024

pdf bib abs
TransAgents: Build Your Translation Company with Language Agents
Minghao Wu | Jiahao Xu | Longyue Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications. In this work, we introduce TransAgents, a novel multi-agent translation system inspired by human translation companies. TransAgents employs specialized agents—Senior Editor, Junior Editor, Translator, Localization Specialist, and Proofreader—to collaboratively produce translations that are accurate, culturally sensitive, and of high quality. Our system is flexible, allowing users to configure their translation company based on specific needs, and universal, with empirical evidence showing superior performance across various domains compared to state-of-the-art methods. Additionally, TransAgents features a user-friendly interface and offers translations at a cost approximately 80× cheaper than professional human translation services. Evaluations on literary, legal, and financial test sets demonstrate that TransAgents produces translations preferred by human evaluators, even surpassing human-written references in literary contexts. Our live demo website is available at https://www.transagents.ai/. Our demonstration video is available at https://www.youtube.com/watch?v=p7jIAtF-WKc.

Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2023, the second edition of the Discourse-Level Literary Translation. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted and document-level Chinese-English web novel corpus. Furthermore, we put forth an industry-endorsed criteria to guide human evaluation process. This year, we totally received 10 submissions from 5 academia and industry teams. We employ both automatic and human evaluations to measure the performance of the submitted systems. The official ranking of the systems is based on the overall human judgments. In addition, our extensive analysis reveals a series of interesting findings on literary and discourse-aware MT. We release data, system outputs, and leaderboard at https://www2.statmt.org/wmt24/literary-translation-task.html.

2023

pdf bib abs
SimCSE++: Improving Contrastive Learning for Sentence Embeddings from Two Perspectives
Jiahao Xu | Wei Shao | Lihui Chen | Lemao Liu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This paper improves contrastive learning for sentence embeddings from two perspectives: handling dropout noise and addressing feature corruption. Specifically, for the first perspective, we identify that the dropout noise from negative pairs affects the model’s performance. Therefore, we propose a simple yet effective method to deal with such type of noise. Secondly, we pinpoint the rank bottleneck of current solutions to feature corruption and propose a dimension-wise contrastive learning objective to address this issue. Both proposed methods are generic and can be applied to any contrastive learning based models for sentence embeddings. Experimental results on standard benchmarks demonstrate that combining both proposed methods leads to a gain of 1.8 points compared to the strong baseline SimCSE configured with BERT base. Furthermore, applying the proposed method to DiffCSE, another strong contrastive learning based baseline, results in a gain of 1.4 points.

pdf bib abs
DistillCSE: Distilled Contrastive Learning for Sentence Embeddings
Jiahao Xu | Wei Shao | Lihui Chen | Lemao Liu
Findings of the Association for Computational Linguistics: EMNLP 2023

This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation. The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation. However, the vanilla DistillCSE through the standard implementation of knowledge distillation only achieves marginal improvements. The quantitative analyses demonstrate its reason that the standard knowledge distillation exhibits a relatively large variance of the teacher model’s logits due to the essence of contrastive learning. To mitigate the issue induced by high variance, this paper accordingly proposed two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization and the averaging logits from multiple teacher components. Experiments on standard benchmarks demonstrate that the proposed DistillCSE outperforms many strong baseline methods and yields a new state-of-the-art performance.

2022

Back translation (BT) is one of the most significant technologies in NMT research fields. Existing attempts on BT share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model but seldom work studies the role of synthetic data in the performance of BT. This motivates us to ask a fundamental question: what kind of synthetic data contributes to BT performance?Through both theoretical and empirical studies, we identify two key factors on synthetic data controlling the back-translation NMT performance, which are quality and importance. Furthermore, based on our findings, we propose a simple yet effective method to generate synthetic data to better trade off both factors so as to yield the better performance for BT. We run extensive experiments on WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. By employing our proposed method to generate synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam and sampling based methods for data generation), which proves the effectiveness of our proposed methods.