Sadegh Jafari


2025

data2lang2vec: Data Driven Typological Features Completion
Hamidreza Amirzadeh | Sadegh Jafari | Anika Harju | Rob van der Goot
Proceedings of the 31st International Conference on Computational Linguistics

Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely-used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage predicts missing values based on features from other languages or focuses on single features; we instead propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.

2024

DRAGON at FIGNEWS 2024 Shared Task: a Dedicated RAG for October 7th conflict News
Sadegh Jafari | Mohsen Mahmoodzadeh | Vanooshe Nazari | Razieh Bahmanyar | Kathryn Burrows
Proceedings of The Second Arabic Natural Language Processing Conference

In this study, we present a novel approach to annotating bias and propaganda in social media data by leveraging topic modeling techniques. Utilizing the BERTopic tool, we performed topic modeling on the FIGNEWS Shared-task dataset, which initially comprised 13,500 samples. From this dataset, we identified 35 distinct topics and selected approximately 50 representative samples from each topic, resulting in a subset of 1,812 samples. These selected samples were meticulously annotated for bias and propaganda labels. Subsequently, we employed multiple methods like KNN, SVC, XGBoost, and RAG to develop a classifier capable of detecting bias and propaganda within social media content. Our approach demonstrates the efficacy of using topic modeling for efficient data subset selection and provides a robust foundation for improving the accuracy of bias and propaganda detection in large-scale social media datasets.

2023

A longitudinal study about gradual changes in the Iranian Online Public Sphere pre and post of ‘Mahsa Moment’: Focusing on Twitter
Sadegh Jafari | Amin Fathi | Abolfazl Hajizadegan | Amirmohammad Kazemeini | Sauleh Eetemadi
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

Mahsa Amini’s death shocked Iranian society. The effects of this event and the subsequent tragedies in Iran were tremendous and unimaginable, not only in real space but also in cyberspace, including Twitter. We explore how Twitter changed after Mahsa Amini’s death by analyzing the sentiments of Iranian users in the 90 days after this event. Additionally, we track changes in word meaning and in each word’s neighboring words. Finally, we use word clustering methods for topic modeling.