MuSeCLIR: A Multiple Senses and Cross-lingual Information Retrieval Dataset

Wing Yan Li, Julie Weeds, David Weir


Abstract
This paper addresses a deficiency in existing cross-lingual information retrieval (CLIR) datasets and provides a robust evaluation of CLIR systems’ disambiguation ability. CLIR is commonly tackled by combining translation and traditional IR. Because translation introduces its own ambiguity, ambiguity is a greater problem in CLIR than in monolingual IR. However, existing auto-generated CLIR datasets are dominated by searches for named entity mentions, which do not provide a good measure of disambiguation performance, as named entity mentions can often be transliterated across languages and tend not to have multiple translations. We therefore introduce a new evaluation dataset (MuSeCLIR) to address this inadequacy. The dataset focusses on polysemous common nouns with multiple possible translations. MuSeCLIR is constructed from multilingual Wikipedia and supports searches on documents written in European (French, German, Italian) and Asian (Chinese, Japanese) languages. We provide baseline results for statistical and neural models on MuSeCLIR, which show that MuSeCLIR places greater demands on systems’ ability to disambiguate query terms.
Anthology ID:
2022.coling-1.96
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1128–1135
URL:
https://aclanthology.org/2022.coling-1.96
Cite (ACL):
Wing Yan Li, Julie Weeds, and David Weir. 2022. MuSeCLIR: A Multiple Senses and Cross-lingual Information Retrieval Dataset. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1128–1135, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
MuSeCLIR: A Multiple Senses and Cross-lingual Information Retrieval Dataset (Li et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.96.pdf
Code
justinal/museclir
Data
CLIRMatrix