Understanding LLMs’ Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From

Changjiang Gao (长江 高); Hankun Lin; Xin Huang; Xue Han; Junlan Feng; Chao Deng; Jiajun Chen; Shujian Huang (书剑 黄)

doi:10.18653/v1/2025.emnlp-main.1161

Abstract

Cross-lingual context retrieval (extracting contextual information in one language based on requests in another) is a fundamental aspect of cross-lingual alignment, but its performance and mechanism in large language models (LLMs) remain unclear. In this paper, we evaluate the cross-lingual context retrieval of over 40 LLMs across 12 languages, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that post-trained open LLMs exhibit strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and that their estimated oracle performance improves greatly after post-training. Our mechanism analysis shows that the cross-lingual context retrieval process can be divided into two main phases, question encoding and answer retrieval, which are formed in pre-training and post-training respectively. The stability of this phasing correlates with xMRC performance, and the xMRC bottleneck lies in the last model layers of the second phase, where the effect of post-training is clearly observable. Our results also indicate that larger-scale pre-training cannot improve xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential.

Anthology ID: 2025.emnlp-main.1161
Original: 2025.emnlp-main.1161v1
Version 2: 2025.emnlp-main.1161v2
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 22797–22826
URL: https://aclanthology.org/2025.emnlp-main.1161/
DOI: 10.18653/v1/2025.emnlp-main.1161
Cite (ACL): Changjiang Gao, Hankun Lin, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen, and Shujian Huang. 2025. Understanding LLMs’ Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22797–22826, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Understanding LLMs’ Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From (Gao et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1161.pdf
Checklist: 2025.emnlp-main.1161.checklist.pdf
