Ebrahim Ansari

2025

Last Minute at the GermEval-2025 LLMs4Subjects Task: Few-Shot Contrastive Learning for Multilingual Multi-Label Classification
Parisa Shirali | Zahra Sarlak | Ebrahim Ansari
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops

PerSpaCor: Correcting Space and ZWNJ Errors in Persian Text with Transformer Models
Matin Ebrahimkhani | Ebrahim Ansari
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Precision and clarity are essential qualities of written texts; however, Persian script, rooted in Arabic script, presents unique challenges that can compromise readability and correctness. In particular, the use of space and half-space—specifically the Zero Width Non-Joiner (ZWNJ)—is essential for proper character separation in Persian typography. This research introduces four models for correcting spacing and ZWNJ errors at the character level, thereby improving both readability and textual accuracy. By fine-tuning BERT-based transformer models on Bijankhan and Peykare corpora—comprising over 12.7 million preprocessed and annotated words—and formulating the task as sequence labeling, the best model achieves a macro-average F1-score of 97.26%. An interactive corrector that incorporates user input further improves performance to a macro-average F1-score of 98.38%. These results demonstrate the effectiveness of advanced language models in enhancing Persian text quality and highlight their applicability to real-world natural language processing tasks.

IASBS at SemEval-2025 Task 11: Ensembling Transformers for Bridging the Gap in Text-Based Emotion Detection
Mehrzad Tareh | Erfan Mohammadzadeh | Aydin Mohandesi | Ebrahim Ansari
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

In this paper, we address the challenges of text-based emotion detection, focusing on multi-label classification, emotion intensity prediction, and cross-lingual emotion detection across various languages. We explore the use of advanced machine learning models, particularly transformers, in three tracks: emotion detection, emotion intensity prediction, and cross-lingual emotion detection. Our approach utilizes pre-trained transformer models, such as Gemini, DeBERTa, M-BERT, and M-DistilBERT, combined with techniques like majority voting and average ensemble voting (AEV) to enhance performance. We also incorporate multilingual strategies and prompt engineering to effectively handle the complexities of emotion detection across diverse linguistic and cultural contexts. Our findings demonstrate the success of ensemble methods and multilingual models in improving the accuracy and generalization of emotion detection, particularly for low-resource languages.

Last Minute at SemEval-2025 Task 5: RAG System for Subject Tagging
Zahra Sarlak | Ebrahim Ansari
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

In this study, we explore the LLMs4Subjects shared task, which focuses on leveraging retrieval-augmented generation (RAG) to enhance subject classification in technical records from the Leibniz University’s Technical Library (TIBKAT). The challenge requires participants to recommend appropriate subject headings from the GND taxonomy while processing bibliographic data in both German and English.

2024

Enhancing Turkish Word Segmentation: A Focus on Borrowed Words and Invalid Morpheme
Soheila Behrooznia | Ebrahim Ansari | Zdenek Zabokrtsky
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

This study addresses a challenge in morphological segmentation: accurately segmenting words in languages with rich morphology. Current probabilistic methods, such as Morfessor, often produce results that lack consistency with human-segmented words. Our study adds some steps to the Morfessor segmentation process to consider invalid morphemes and borrowed words from other languages to improve morphological segmentation significantly. Comparing our idea to the results obtained from Morfessor demonstrates its efficiency, leading to more accurate morphology segmentation. This is particularly evident in the case of Turkish, highlighting the potential for further advancements in morpheme segmentation for morphologically rich languages.

IASBS at SemEval-2024 Task 10: Delving into Emotion Discovery and Reasoning in Code-Mixed Conversations
Mehrzad Tareh | Aydin Mohandesi | Ebrahim Ansari
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In this paper, we detail the IASBS team’s approach and findings from participating in SemEval-2024 Task 10, “Emotion Discovery and Reasoning in Hindi-English Code-mixed Conversations (EDiReF).” This task encompasses three critical subtasks: Emotion Recognition in Conversation (ERC) and Emotion Flip Reasoning (EFR) in both Hindi-English code-mixed and English dialogues. Our methodology integrates advanced NLP and machine learning techniques, focusing on the unique challenges of code-mixing, such as linguistic diversity and shifts in emotional context. By implementing a robust framework that includes data preprocessing and feature engineering using models like GPT-4 and DistilBERT, we extend our analysis beyond mere emotion identification to explore the triggers behind emotion flips. This endeavor not only achieved third place on the leaderboard, demonstrating a high proficiency in emotion and flip detection with an F1-Score of 0.70, but also significantly contributed to the advancement of emotional AI. Our findings offer valuable insights into the complex interplay of emotions in communication, showcasing the potential for enhancing applications across various domains, from social media analytics to healthcare, and underscore the importance of understanding emotional dynamics in code-mixed conversations for future research and practical applications.

2021

SLTEV: Comprehensive Evaluation of Spoken Language Translation
Ebrahim Ansari | Ondřej Bojar | Barry Haddow | Mohammad Mahmoudi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Automatic evaluation of Machine Translation (MT) quality has been investigated over several decades. Spoken Language Translation (SLT), especially when simultaneous, needs to consider additional criteria and does not have a standard evaluation procedure or a widely used toolkit. To fill the gap, we develop SLTev, an open-source tool for assessing SLT in a comprehensive way. SLTev reports the quality, latency, and stability of an SLT candidate output based on the time-stamped transcript and reference translation into a target language. For quality, we rely on sacreBLEU, which provides MT evaluation measures such as chrF or BLEU. For latency, we propose two new scoring techniques. For stability, we extend the previously defined measures with a normalized Flicker in our work. We also propose a new averaging of older measures. A preliminary version of SLTev was used in the IWSLT 2020 shared task. Moreover, a growing collection of test datasets directly accessible by SLTev is provided for system evaluation comparable across papers.

This paper presents an automatic speech translation system aimed at live subtitling of conference presentations. We describe the overall architecture and key processing components. More importantly, we explain our strategy for building a complex system for end-users from numerous individual components, each of which has been tested only in laboratory conditions. The system is a working prototype that is routinely tested in recognizing English, Czech, and German speech and presenting it translated simultaneously into 42 target languages.

2020

The ELITR (European Live Translator) project aims to create a speech translation system for simultaneous subtitling of conferences and online meetings, targeting up to 43 languages. The technology is tested by the Supreme Audit Office of the Czech Republic and by alfaview®, a German online conferencing system. Other project goals are to advance document-level and multilingual machine translation, automatic speech recognition, and automatic minuting.

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2020) featured this year six challenge tracks: (i) Simultaneous speech translation, (ii) Video speech translation, (iii) Offline speech translation, (iv) Conversational speech translation, (v) Open domain translation, and (vi) Non-native speech translation. A total of teams participated in at least one of the tracks. This paper introduces each track’s goal, data and evaluation metrics, and reports the results of the received submissions.

LSCP: Enhanced Large Scale Colloquial Persian Language Understanding
Hadi Abdi Khojasteh | Ebrahim Ansari | Mahdi Bohlouli
Proceedings of the Twelfth Language Resources and Evaluation Conference

Language recognition has been significantly advanced in recent years by means of modern machine learning methods such as deep learning and benchmarks with rich annotations. However, research is still limited in low-resource formal languages. This constitutes a significant gap in describing colloquial language, especially for low-resource languages such as Persian. In order to target this gap for low-resource languages, we propose a “Large Scale Colloquial Persian Dataset” (LSCP). LSCP is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects in human-level sentences, which are naturally captured from real-world sentences. We believe that further investigations and processing, as well as the application of novel algorithms and methods, can strengthen and enrich computerized understanding and processing of low-resource languages. The proposed corpus consists of 120M sentences derived from 27M tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations into five different languages.

2019

Supervised Morphological Segmentation Using Rich Annotated Lexicon
Ebrahim Ansari | Zdeněk Žabokrtský | Mohammad Mahmoudi | Hamid Haghdoost | Jonáš Vidra
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Morphological segmentation of words is the process of dividing a word into smaller units called morphemes; it is tricky especially when a morphologically rich or polysynthetic language is under question. In this work, we designed and evaluated several Recurrent Neural Network (RNN) based models as well as various other machine learning based approaches for the morphological segmentation task. We trained our models using annotated segmentation lexicons. To evaluate the effect of the training data size on our models, we decided to create a large hand-annotated morphologically segmented corpus of Persian words, which is, to the best of our knowledge, the first and the only segmentation lexicon for the Persian language. In the experimental phase, using the hand-annotated Persian lexicon and two smaller similar lexicons for Czech and Finnish languages, we evaluated the effect of the training data size, different hyper-parameters settings as well as different RNN-based models.

Building a Morphological Network for Persian on Top of a Morpheme-Segmented Lexicon
Hamid Haghdoost | Ebrahim Ansari | Zdeněk Žabokrtský | Mahshid Nikravesh
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology