Alamgir Munir Qazi
2025
Where Patients Slow Down: Surprisal, Uncertainty, and Simplification in French Clinical Reading
Oksana Ivchenko | Alamgir Munir Qazi | Jamal Abdul Nasir
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
This eye-tracking study links language-model surprisal and contextual entropy to how 23 non-expert adults read French health texts. Participants read seven texts (clinical case, medical, general), each available in an Original and Simplified version. Surprisal and entropy were computed with eight autoregressive models (82M–8B parameters), and four complementary eye-tracking measures were analyzed. Surprisal correlates positively with early reading measures, peaking in the smallest GPT-2 models (r ≈ 0.26) and weakening with model size. Entropy shows the opposite pattern, with negative correlations strongest in the 7B–8B models (r ≈ −0.13), consistent with a skim-when-uncertain strategy. Surprisal effects are largest in Clinical Original passages and drop by ∼20% after simplification, whereas entropy effects are stable across domain and version. These findings expose a scaling paradox, with different model sizes optimal for different cognitive signals, and suggest that French plain-language editing should focus on rewriting high-surprisal passages to reduce processing difficulty, and on avoiding high-entropy contexts for critical information.
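As a rough illustration of the per-token signals the study correlates with reading times, the sketch below computes surprisal and contextual entropy from a small autoregressive model. The model name ("gpt2") and the example sentence are stand-ins, not the paper's actual eight-model setup.

```python
# Minimal sketch (not the paper's pipeline): per-token surprisal and
# next-token entropy from an autoregressive LM via Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # assumed stand-in model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Le patient présente une douleur thoracique aiguë."
ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    logits = model(ids).logits          # shape: (1, seq_len, vocab)
log_probs = torch.log_softmax(logits, dim=-1)
ln2 = torch.log(torch.tensor(2.0))      # convert nats to bits

# Surprisal of token t is -log2 P(token_t | context); entropy is the
# uncertainty of the model's distribution over the next token.
for t in range(1, ids.size(1)):
    lp = log_probs[0, t - 1]            # distribution predicting token t
    surprisal = -lp[ids[0, t]] / ln2
    entropy = -(lp.exp() * lp).sum() / ln2
    token = tokenizer.decode([int(ids[0, t])])
    print(f"{token:>15s}  surprisal={surprisal.item():5.2f} bits  "
          f"entropy={entropy.item():5.2f} bits")
```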
Cuaċ: Fast and Small Universal Representations of Corpora
John P. McCrae | Bernardo Stearns | Alamgir Munir Qazi | Shubhanker Banerjee | Atul Kr. Ojha
Proceedings of the 5th Conference on Language, Data and Knowledge
The increasing size and diversity of corpora in natural language processing require highly efficient processing frameworks. Building on the universal corpus format Teanga, we present Cuaċ, a format for the compact representation of corpora. We describe a methodology based on short-string compression and indexing techniques and show that the resulting files are comparable in size to compressed human-readable serializations and can be further reduced with lossless compression. We also show that the format introduces no computational penalty on file-processing time. This methodology aims to speed up natural language processing pipelines and is the basis for a fast database system for corpora.
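To make the indexing idea concrete, here is a toy sketch of one common building block of compact corpus formats: interning each distinct token string once and storing documents as fixed-width integer arrays. This is illustrative only and is not the actual Cuaċ format, which additionally uses short-string compression.

```python
# Toy sketch of string interning + integer indexing (not the Cuaċ format).
import array

class TinyCorpus:
    def __init__(self):
        self.index = {}     # token string -> integer ID
        self.strings = []   # integer ID -> token string
        self.docs = []      # each doc: compact array of token IDs (uint32)

    def add(self, tokens):
        ids = array.array("I")
        for tok in tokens:
            if tok not in self.index:
                self.index[tok] = len(self.strings)
                self.strings.append(tok)
            ids.append(self.index[tok])
        self.docs.append(ids)

    def get(self, doc_id):
        # Reconstruct the original token sequence losslessly.
        return [self.strings[i] for i in self.docs[doc_id]]

corpus = TinyCorpus()
corpus.add(["the", "cat", "sat", "on", "the", "mat"])
assert corpus.get(0) == ["the", "cat", "sat", "on", "the", "mat"]
```

Because repeated strings are stored once and token IDs are fixed-width integers, such a representation stays compact while remaining directly indexable, and the serialized arrays compress well with standard lossless compressors.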
When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
Alamgir Munir Qazi | John P. McCrae | Jamal Nasir
Proceedings of the 5th Conference on Language, Data and Knowledge
The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
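The retrieval-then-classify idea can be sketched briefly: embed the claim and an evidence pool with a general-purpose encoder, rank evidence by cosine similarity, and feed the claim plus retrieved evidence to a lightweight classifier instead of generating rationales with an LLM. The embedding model, example data, and feature construction below are assumptions for illustration, not DeReC's exact configuration.

```python
# Hedged sketch of dense evidence retrieval feeding a classifier (not DeReC's
# exact setup; model choice and feature pooling are illustrative assumptions).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

claim = "Drinking coffee cures the common cold."
evidence_pool = [
    "No clinical trials support coffee as a treatment for colds.",
    "Caffeine is a mild stimulant found in coffee and tea.",
    "The common cold is caused by viruses such as rhinoviruses.",
]

# Dense retrieval: rank evidence by cosine similarity to the claim embedding
# (normalized embeddings make the dot product equal cosine similarity).
claim_vec = encoder.encode([claim], normalize_embeddings=True)
evid_vecs = encoder.encode(evidence_pool, normalize_embeddings=True)
scores = (evid_vecs @ claim_vec.T).squeeze()
top_k = np.argsort(-scores)[:2]

# Claim and pooled evidence embeddings would then feed a small classification
# head, replacing autoregressive rationale generation entirely.
features = np.concatenate([claim_vec.squeeze(), evid_vecs[top_k].mean(axis=0)])
print("top evidence:", [evidence_pool[i] for i in top_k])
print("feature vector size:", features.shape[0])
```

Since the encoder runs one forward pass per text rather than decoding token by token, this design is what yields the large runtime reductions reported above.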