Ivan Korotkov
2024
OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement
Yang Gao | Ji Ma | Ivan Korotkov | Keith Hall | Dana Alon | Donald Metzler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We develop and evaluate multilingual scientific documents similarity measurement (SDSM) models in this work. Such models can be used to find related papers in different languages, which can help multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we develop multilingual SDSM models by adjusting and extending the state-of-the-art methods designed for English SDSM tasks. We find that: (i) some highly successful methods in English SDSM yield significantly worse performance in multilingual SDSM; (ii) our best model, which enriches the non-English papers with English summaries, outperforms strong baselines by 7% (in mean average precision) on multilingual SDSM tasks, without compromising the performance on English SDSM tasks.
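The abstract reports its gain in mean average precision (MAP). As a reminder of how that metric is usually computed, here is a minimal sketch, assuming each query paper has a ranked list of candidate paper IDs and a set of relevant IDs. The function names and the toy IDs are hypothetical and are not taken from the OpenMSD paper or its evaluation code.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], relevance: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all queries in `rankings`."""
    if not rankings:
        return 0.0
    scores = [
        average_precision(ranked, relevance.get(query, set()))
        for query, ranked in rankings.items()
    ]
    return sum(scores) / len(scores)


# Toy example with hypothetical paper IDs.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
relevance = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(rankings, relevance)
```

In this toy case the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.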
2021
Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation
Ji Ma | Ivan Korotkov | Yinfei Yang | Keith Hall | Ryan McDonald
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
A major obstacle to the widespread adoption of neural retrieval models is that they require large supervised training sets to surpass traditional term-based techniques, which can be constructed directly from raw corpora. In this paper, we propose an approach to zero-shot learning for passage retrieval that uses synthetic question generation to close this gap. The question generation system is trained on general domain data, but is applied to documents in the targeted domain. This allows us to create arbitrarily large, yet noisy, sets of question-passage relevance pairs that are domain specific. Furthermore, when this is coupled with a simple hybrid term-neural model, first-stage retrieval performance can be improved further. Empirically, we show that this is an effective strategy for building neural passage retrieval models in the absence of large training corpora. Depending on the domain, this technique can even approach the accuracy of supervised models.
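The abstract does not say how the hybrid term-neural model combines its two signals. One common way to build such a hybrid, shown here purely as an illustrative assumption and not as the paper's method, is to normalize the term-based and neural scores per query and interpolate them linearly. The function names, the `weight` parameter, and the toy scores are all hypothetical.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    term_scores: Dict[str, float],
    neural_scores: Dict[str, float],
    weight: float = 0.5,
) -> List[str]:
    """Rank passages by weight * term score + (1 - weight) * neural score.

    Assumption: linear interpolation of normalized scores. A passage that
    one retriever did not return gets 0.0 from that retriever.
    """
    term = min_max(term_scores)
    neural = min_max(neural_scores)
    combined = {
        doc_id: weight * term.get(doc_id, 0.0)
        + (1.0 - weight) * neural.get(doc_id, 0.0)
        for doc_id in set(term) | set(neural)
    }
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


# Toy scores for three hypothetical passages.
term_scores = {"p1": 10.0, "p2": 5.0, "p3": 0.0}
neural_scores = {"p1": 0.2, "p2": 0.9, "p3": 0.1}
ranking = hybrid_rank(term_scores, neural_scores, weight=0.5)
```

Setting `weight` to 1.0 reduces this to the term-based ranking and 0.0 to the neural ranking, which makes it easy to compare the hybrid against each component on its own.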