Fan Jiang

2026

Tokenizer-Aware Cross-Lingual Adaptation of Decoder-Only LLMs through Embedding Relearning and Swapping
Fan Jiang | Honglin Yu | Grace Y Chung | Trevor Cohn
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Extending Large Language Models (LLMs) to new languages is challenging, with most methods proposed suffering from high computational cost and catastrophic forgetting of original model capabilities. Embedding relearning (CITATION), a technique that creates new tokenizers and tunes embeddings on fixed model weights for target language adaptation, is both light-weight and performant. However, it has only been shown to work for older generation encoder-only models and for high resource languages. In this paper, we extend this framework to decoder-only LLMs focusing on joint adaptation to many languages, including low-resource ones. We experiment in three language groups over 100 languages each. We adapt a pre-trained LLM via switching to a customized tokenizer, and relearning the embedding layer. Across 96 diverse languages spanning both classification and generation tasks, we show embedding relearning improves models by up to 20%, being highly competitive with full-weight updating baselines while vastly more computationally efficient and mitigating catastrophic forgetting. This translates into better results in transferring the improved multilingual performance to tasks that build on core English abilities (e.g., multilingual math reasoning), compared to various baselines. Further analysis reveals the critical role of customizing tokenizers in achieving effective language transfer, particularly for non-Latin script languages.

2025

pdf bib abs

Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce **H**igh-rank **D**istributed **PiSSA (HD-PiSSA)**, a distributed PEFT approach that initializes **orthogonal adapters** across different devices and aggregates their delta updates collectively on (W) for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16× higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, HD-PiSSA benefits from this extra optimization flexibility and outperforms both LoRA and PiSSA across a variety of challenging downstream tasks, including mathematics, code, and multi-task learning.

pdf bib abs

Few-Shot Multilingual Open-Domain QA from Five Examples
Fan Jiang | Tom Drummond | Trevor Cohn
Transactions of the Association for Computational Linguistics, Volume 13

Recent approaches to multilingual open- domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a few-shot learning approach to synthesize large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a cross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.

2024

pdf bib

Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision
Fan Jiang | Tom Drummond | Trevor Cohn
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

pdf bib abs

Language Bias in Multilingual Information Retrieval: The Nature of the Beast and Mitigation Methods
Jinrui Yang | Fan Jiang | Timothy Baldwin
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

Language fairness in multilingual information retrieval (MLIR) systems is crucial for ensuring equitable access to information across diverse languages. This paper sheds light on the issue, based on the assumption that queries in different languages, but with identical semantics, should yield equivalent ranking lists when retrieving on the same multilingual documents. We evaluate the degree of fairness using both traditional retrieval methods, and a DPR neural ranker based on mBERT and XLM-R. Additionally, we introduce ‘LaKDA’, a novel loss designed to mitigate language biases in neural MLIR approaches. Our analysis exposes intrinsic language biases in current MLIR technologies, with notable disparities across the retrieval methods, and the effectiveness of LaKDA in enhancing language fairness.

2023

pdf bib abs

Don’t Mess with Mister-in-Between: Improved Negative Search for Knowledge Graph Completion
Fan Jiang | Tom Drummond | Trevor Cohn
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

The best methods for knowledge graph completion use a ‘dual-encoding’ framework, a form of neural model with a bottleneck that facilitates fast approximate search over a vast collection of candidates. These approaches are trained using contrastive learning to differentiate between known positive examples and sampled negative instances. The mechanism for sampling negatives to date has been very simple, driven by pragmatic engineering considerations (e.g., using mismatched instances from the same batch). We propose several novel means of finding more informative negatives, based on searching for candidates with high lexical overlaps, from the dual-encoder model and according to knowledge graph structures. Experimental results on four benchmarks show that our best single model improves consistently over previous methods and obtains new state-of-the-art performance, including the challenging large-scale Wikidata5M dataset. Combing different kinds of strategies through model ensembling results in a further performance boost.

pdf bib abs

Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval
Fan Jiang | Qiongkai Xu | Tom Drummond | Trevor Cohn
Findings of the Association for Computational Linguistics: EMNLP 2023

Neural ‘dense’ retrieval models are state of the art for many datasets, however these models often exhibit limited domain transfer ability. Existing approaches to adaptation are unwieldy, such as requiring explicit supervision, complex model architectures, or massive external models. We present ABEL, a simple but effective unsupervised method to enhance passage retrieval in zero-shot settings. Our technique follows a straightforward loop: a dense retriever learns from supervision signals provided by a reranker, and subsequently, the reranker is updated based on feedback from the improved retriever. By iterating this loop, the two components mutually enhance one another’s performance. Experimental results demonstrate that our unsupervised ABEL model outperforms both leading supervised and unsupervised retrievers on the BEIR benchmark. Meanwhile, it exhibits strong adaptation abilities to tasks and domains that were unseen during training. By either fine-tuning ABEL on labelled data or integrating it with existing supervised dense retrievers, we achieve state-of-the-art results.

pdf bib abs

Noisy Self-Training with Synthetic Queries for Dense Retrieval
Fan Jiang | Tom Drummond | Trevor Cohn
Findings of the Association for Computational Linguistics: EMNLP 2023

Although existing neural retrieval models reveal promising results when training data is abundant and the performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolution manner with no reliance on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Extra analysis on low-resource settings reveals that our method is data efficient and outperforms competitive baselines, with as little as 30% of labelled training data. Further extending the framework for reranker training demonstrates that the proposed method is general and yields additional gains on tasks of diverse domains.

2021

pdf bib abs

Incorporating Syntax and Semantics in Coreference Resolution with Heterogeneous Graph Attention Network
Fan Jiang | Trevor Cohn
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

External syntactic and semantic information has been largely ignored by existing neural coreference resolution models. In this paper, we present a heterogeneous graph-based model to incorporate syntactic and semantic structures of sentences. The proposed graph contains a syntactic sub-graph where tokens are connected based on a dependency tree, and a semantic sub-graph that contains arguments and predicates as nodes and semantic role labels as edges. By applying a graph attention network, we can obtain syntactically and semantically augmented word representation, which can be integrated using an attentive integration layer and gating mechanism. Experiments on the OntoNotes 5.0 benchmark show the effectiveness of our proposed model.

Co-authors

Venues

TACL1

WS1

Fix author