2025
Trust but Verify: A Comprehensive Survey of Faithfulness Evaluation Methods in Abstractive Text Summarization
Salima Lamsiyah | Aria Nourbakhsh | Christoph Schommer
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Abstractive text summarization systems have advanced significantly with the rise of neural language models. However, they frequently suffer from issues of unfaithfulness or factual inconsistency, generating content that is not verifiably supported by the source text. This survey provides a comprehensive review of over 40 studies published between 2020 and 2025 on methods for evaluating faithfulness in abstractive summarization. We present a unified taxonomy that covers human evaluation techniques and a variety of automatic metrics, including question answering (QA)-based methods, natural language inference (NLI)-based methods, graph-based approaches, and large language model (LLM)-based evaluation. We also discuss meta-evaluation protocols that assess the quality of these metrics. In addition, we analyze a wide range of benchmark datasets, highlighting their design, scope, and relevance to emerging challenges such as long-document and domain-specific summarization. Furthermore, we identify critical limitations in current evaluation practices, including poor alignment with human judgment, limited robustness, and inefficiencies in handling complex summaries. We conclude by outlining future directions to support the development of more reliable, interpretable, and scalable evaluation methods. This work aims to support researchers in navigating the rapidly evolving landscape of faithfulness evaluation in summarization.
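The NLI-based metrics surveyed above typically share one aggregation pattern: each summary sentence is scored against the source, and a single unsupported sentence should flag the whole summary. The sketch below illustrates that aggregation logic only; the sentence scorer is a hypothetical lexical-overlap stand-in for a real NLI model, which would instead return an entailment probability.

```python
# Sketch of the aggregation common to NLI-based faithfulness metrics:
# score each summary sentence against the source, then take the minimum.

def lexical_entailment_stub(source: str, claim: str) -> float:
    """Hypothetical stand-in for an NLI model: fraction of claim
    tokens that also appear in the source."""
    src_tokens = set(source.lower().split())
    claim_tokens = claim.lower().split()
    if not claim_tokens:
        return 0.0
    return sum(t in src_tokens for t in claim_tokens) / len(claim_tokens)

def faithfulness_score(source: str, summary: str,
                       scorer=lexical_entailment_stub) -> float:
    """Score each summary sentence independently and return the minimum,
    so a single unsupported sentence penalizes the whole summary."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    if not sentences:
        return 0.0
    return min(scorer(source, s) for s in sentences)

source = "The model was trained on news articles from 2020."
print(faithfulness_score(source, "The model was trained on news articles."))
print(faithfulness_score(source, "The model was trained on medical records."))
```

Swapping the stub for a real entailment scorer preserves the same min-aggregation design, which is why many of the surveyed metrics differ mainly in the sentence-level scorer.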
Quantifying the Overlap: Attribution Maps and Linguistic Heuristics in Encoder-Decoder Machine Translation Models
Aria Nourbakhsh | Salima Lamsiyah | Christoph Schommer
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Explainable AI (XAI) attribution methods seek to illuminate the decision-making process of generative models by quantifying the contribution of each input token to the generated output. Different attribution algorithms, often rooted in distinct methodological frameworks, can produce varied interpretations of feature importance. In this study, we use attribution maps derived from three distinct methods as weighting signals during the training of encoder-decoder models. Our findings demonstrate that Attention and Value Zeroing attribution weights consistently lead to improved model performance. To better understand the linguistic information these maps capture, we extract part-of-speech (POS), dependency, and named entity recognition (NER) tags from the input-output pairs and compare them with the XAI attribution maps. Although the Saliency method aligns more closely with POS and dependency annotations than Value Zeroing does, its attributions diverge more strongly than those of the other two methods where they do not conform to these linguistic tags, and it contributes less to model performance.
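One plausible way to use attribution maps as training-time weighting signals, as described above, is to reweight per-token loss terms by normalized attribution scores. The sketch below is an assumption-laden illustration of that idea, not the paper's exact formulation; the function name and the normalize-then-average scheme are mine.

```python
import numpy as np

# Hypothetical sketch: per-token negative log-likelihoods are reweighted
# by normalized attribution scores, so tokens the attribution method
# deems important contribute more to the training loss.

def attribution_weighted_loss(token_nll: np.ndarray,
                              attributions: np.ndarray) -> float:
    """token_nll: per-token negative log-likelihoods, shape (T,).
    attributions: raw non-negative attribution scores, shape (T,)."""
    weights = attributions / attributions.sum()   # normalize to sum to 1
    return float((weights * token_nll).sum())     # attribution-weighted loss

nll = np.array([0.5, 2.0, 0.1])
attr = np.array([1.0, 3.0, 1.0])   # middle token deemed most important
print(attribution_weighted_loss(nll, attr))
```

With uniform attributions this reduces to the ordinary mean token loss, which makes the uniform case a natural baseline for comparing the three attribution methods.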
2023
The Dark Side of the Language: Pre-trained Transformers in the DarkNet
Leonardo Ranaldi | Aria Nourbakhsh | Elena Sofia Ruzzetti | Arianna Patrizi | Dario Onorati | Michele Mastromattei | Francesca Fallucchi | Fabio Massimo Zanzotto
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Pre-trained Transformers are challenging human performance in many Natural Language Processing tasks. The massive datasets used for pre-training seem to be the key to their success on existing tasks. In this paper, we explore how a range of pre-trained natural language understanding models perform on genuinely unseen sentences provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks perform on par with pre-trained Transformers even after fine-tuning. Only after what we call extreme domain adaptation, that is, retraining with the masked language model task on the entire novel corpus, do pre-trained Transformers reach their usual high results. This suggests that huge pre-training corpora may give Transformers unexpected help, since they are exposed to many of the possible sentences.
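The "extreme domain adaptation" step above amounts to continuing masked-language-model training on the in-domain corpus. The core of that procedure is the masking itself, sketched below under standard BERT-style assumptions (a 15% masking rate and a `[MASK]` placeholder); this illustrates the data preparation only, not the paper's exact pipeline.

```python
import random

# Minimal sketch of BERT-style MLM masking: hide a fraction of tokens in
# each in-domain sentence and keep the originals as prediction targets.

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    rng = random.Random(seed)   # seeded here only for reproducibility
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok            # original token becomes the label
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "buyers rate vendors on hidden marketplace forums".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Running this masking pass over the whole novel corpus and retraining the model to recover the hidden tokens is what lets the Transformer internalize the domain's unfamiliar vocabulary and phrasing.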
2019
sthruggle at SemEval-2019 Task 5: An Ensemble Approach to Hate Speech Detection
Aria Nourbakhsh | Frida Vermeer | Gijs Wiltvank | Rob van der Goot
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we present our approach to detecting hate speech against women and immigrants in tweets, developed for our participation in SemEval-2019 Task 5. We trained an SVM and an RF classifier using character bi- and trigram features, and a BiLSTM pre-initialized with external word embeddings. We combined the predictions of the SVM, RF, and BiLSTM in two different ensemble models: the first took a majority vote of the binary predictions, and the second used the average of the confidence scores. During development, we obtained the highest accuracy (75%) with the final ensemble model using majority voting. On the test data, all models scored substantially lower and the scores varied more between classifiers. We believe these large differences between the higher accuracies in the development phase and the lower accuracies in the testing phase are partly due to differences between the training, development, and testing data.
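The two ensembling strategies described above can be sketched in a few lines. The classifier outputs below are illustrative stand-ins for the SVM, RF, and BiLSTM predictions; the threshold of 0.5 for the averaged-confidence variant is an assumption.

```python
# Sketch of the two ensembles: (1) majority vote over binary predictions,
# (2) averaging per-classifier confidence scores and thresholding.

def majority_vote(predictions):
    """predictions: binary labels (0/1), one per classifier."""
    return int(sum(predictions) > len(predictions) / 2)

def average_confidence(scores, threshold=0.5):
    """scores: per-classifier probabilities of the positive class."""
    return int(sum(scores) / len(scores) >= threshold)

svm, rf, bilstm = 1, 0, 1                  # stand-in binary predictions
print(majority_vote([svm, rf, bilstm]))    # two of three say hate speech

probs = [0.9, 0.3, 0.6]                    # stand-in confidence scores
print(average_confidence(probs))
```

The two variants can disagree: a classifier that is barely confident counts fully in the vote but is diluted in the average, which is one reason ensembles built on the same base models can score differently.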
Toward Dialogue Modeling: A Semantic Annotation Scheme for Questions and Answers
María Andrea Cruz Blandón | Gosse Minnema | Aria Nourbakhsh | Maria Boritchev | Maxime Amblard
Proceedings of the 13th Linguistic Annotation Workshop
The present study proposes an annotation scheme for classifying the content and discourse contribution of question-answer pairs. We propose detailed guidelines for using the scheme and apply them to dialogues in English, Spanish, and Dutch. Finally, we report on initial machine learning experiments for automatic annotation.