Rhitabrat Pokharel
2024
Evaluating Multilingual Long-Context Models for Retrieval and Reasoning
Ameeta Agrawal | Andy Dang | Sina Bagheri Nezhad | Rhitabrat Pokharel | Russell Scheinberg
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset, mLongRR, to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. With a single target sentence, the best-performing models, such as Gemini-1.5 and GPT-4o, range from around 96% accuracy in English to around 36% in Somali. With three target sentences, accuracy drops to 40% in English and 0% in Somali. Our findings highlight the challenges long-context LLMs face when processing longer contexts, a larger number of target sentences, or lower-resource languages.
2023
Generating Continuations in Multilingual Idiomatic Contexts
Rhitabrat Pokharel | Ameeta Agrawal
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)
Estimating Semantic Similarity between In-Domain and Out-of-Domain Samples
Rhitabrat Pokharel | Ameeta Agrawal
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)
Prior work typically describes out-of-domain (OOD) or out-of-distribution (OODist) samples as those that originate from dataset(s) or source(s) different from the training set but for the same task. Models are known to usually perform worse on OOD samples than on in-domain (ID) samples, although this observation is not consistent. Another thread of research has focused on OOD detection, albeit mostly using supervised approaches. In this work, we first consolidate and present a systematic analysis of multiple definitions of OOD and OODist as discussed in prior literature. Then, we analyze the performance of a model under ID and OOD/OODist settings in a principled way. Finally, we seek to identify an unsupervised method for reliably identifying OOD/OODist samples without using a trained model. The results of our extensive evaluation using 12 datasets from 4 different tasks suggest the promising potential of unsupervised metrics in this task.