Isabelle Mohr
2024
Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Rohan Jha
|
Bo Wang
|
Michael Günther
|
Georgios Mastrapas
|
Saba Sturua
|
Isabelle Mohr
|
Andreas Koukounas
|
Mohammad Kalim Wang
|
Nan Wang
|
Han Xiao
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that the reducing the embedding dimensionality from 128 to 64 has insignificant impact on the model’s retrieval performance and cut storage requirements by up to 50%. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks,
2022
CoVERT: A Corpus of Fact-checked Biomedical COVID-19 Tweets
Isabelle Mohr
|
Amelie Wührl
|
Roman Klinger
Proceedings of the Thirteenth Language Resources and Evaluation Conference
During the first two years of the COVID-19 pandemic, large volumes of biomedical information concerning this new disease have been published on social media. Some of this information can pose a real danger, particularly when false information is shared, for instance recommendations how to treat diseases without professional medical advice. Therefore, automatic fact-checking resources and systems developed specifically for medical domain are crucial. While existing fact-checking resources cover COVID-19 related information in news or quantify the amount of misinformation in tweets, there is no dataset providing fact-checked COVID-19 related Twitter posts with detailed annotations for biomedical entities, relations and relevant evidence. We contribute CoVERT, a fact-checked corpus of tweets with a focus on the domain of biomedicine and COVID-19 related (mis)information. The corpus consists of 300 tweets, each annotated with named entities and relations. We employ a novel crowdsourcing methodology to annotate all tweets with fact-checking labels and supporting evidence, which crowdworkers search for online. This methodology results in substantial inter-annotator agreement. Furthermore, we use the retrieved evidence extracts as part of a fact-checking pipeline, finding that the real-world evidence is more useful than the knowledge directly available in pretrained language models.
Search
Co-authors
- Amelie Wührl 1
- Roman Klinger 1
- Rohan Jha 1
- Bo Wang 1
- Michael Günther 1
- show all...