Mohamed Nadif


2025

Graphs, NER and LLMs for Unsupervised Document Classification
Imed Keraghel | Mohamed Nadif
Proceedings of the 32nd Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Volume 2: Translations of Published Articles

Recent advances in machine learning, notably large language models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often overlook the deep relationships between named entities as well as the potential of LLM-based representations. This paper proposes a novel approach that integrates named entity recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph whose nodes represent documents and whose edges are weighted by named-entity similarity, optimized with a graph convolutional network (GCN). This enables more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms traditional co-occurrence-based methods, particularly for documents rich in named entities.
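As a rough illustration of the pipeline described above, the sketch below builds a document graph weighted by named-entity overlap and propagates LLM embeddings over it before clustering. It is a minimal approximation, not the paper's method: spaCy (with the en_core_web_sm model) and sentence-transformers are assumed stand-ins for the NER and embedding components, and a parameter-free GCN-style propagation replaces the trained GCN.

```python
# Hedged sketch, not the paper's exact pipeline: entity-overlap graph
# plus GCN-style feature propagation, then k-means on the smoothed features.
import numpy as np
import spacy                                             # assumed NER backend
from sentence_transformers import SentenceTransformer   # assumed embedding backend
from sklearn.cluster import KMeans

def cluster_documents(docs, n_clusters, hops=2):
    nlp = spacy.load("en_core_web_sm")
    ents = [{e.text.lower() for e in nlp(d).ents} for d in docs]

    # Edge weights: Jaccard similarity between the entity sets of two documents.
    n = len(docs)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            union = ents[i] | ents[j]
            if union:
                A[i, j] = A[j, i] = len(ents[i] & ents[j]) / len(union)

    # LLM embeddings serve as node features.
    X = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

    # Parameter-free GCN-style propagation: X <- D^{-1/2} (A + I) D^{-1/2} X.
    A_hat = A + np.eye(n)
    d = A_hat.sum(axis=1)
    S = A_hat / np.sqrt(np.outer(d, d))
    for _ in range(hops):
        X = S @ X

    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```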

2024

More Discriminative Sentence Embeddings via Semantic Graph Smoothing
Chakib Fettal | Lazhar Labiod | Mohamed Nadif
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper explores an empirical approach to learning more discriminative sentence representations in an unsupervised fashion. Leveraging semantic graph smoothing, we enhance sentence embeddings obtained from pretrained models to improve results on text clustering and classification tasks. Our method, validated on eight benchmarks, demonstrates consistent improvements, showcasing the potential of semantic graph smoothing for both supervised and unsupervised document categorization.
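A minimal sketch of what such smoothing can look like, assuming a k-NN similarity graph built over the embeddings themselves; the paper's exact graph construction and filter may differ:

```python
# Hedged sketch: low-pass filtering of sentence embeddings over a semantic
# k-NN graph (an assumed construction, not necessarily the paper's).
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import normalize

def smooth_embeddings(X, n_neighbors=10, steps=2):
    """Smooth pretrained sentence embeddings X (n_samples x dim)."""
    X = normalize(X)                                 # work in cosine geometry
    A = kneighbors_graph(X, n_neighbors).toarray()   # binary k-NN adjacency
    A = np.maximum(A, A.T) + np.eye(len(X))          # symmetrize, add self-loops
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))                  # normalized adjacency
    for _ in range(steps):                           # X <- (I - L/2) X, L = I - S
        X = 0.5 * (X + S @ X)
    return X
```

The smoothed embeddings can then be fed to k-means or a linear classifier, mirroring the clustering and classification evaluations reported in the paper.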

2023

Is Anisotropy Truly Harmful? A Case Study on Text Clustering
Mira Ait-Saada | Mohamed Nadif
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In the last few years, several studies have been devoted to dissecting dense text representations in order to understand their effectiveness and further improve their quality. In particular, the anisotropy of such representations has been observed: the directions of the word vectors are not evenly distributed across the space but rather concentrated in a narrow cone. This has led to several attempts to counteract this phenomenon on both static and contextualized text representations. However, despite this effort, there is no established relationship between anisotropy and performance. In this paper, we aim to bridge this gap by investigating the impact of different transformations on both isotropy and performance in order to assess the true impact of anisotropy. To this end, we rely on the clustering task as a means of evaluating the ability of text representations to produce meaningful groups. In doing so, we empirically show a limited impact of anisotropy on the expressiveness of sentence representations, both in terms of directions and L2 closeness.
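For concreteness, one widely used isotropy-enhancing transformation and a simple isotropy estimate are sketched below. Both are common choices from the literature (mean-centering plus removal of dominant principal directions, and a partition-function ratio in the spirit of Mu and Viswanath, 2018), not necessarily the exact transformations studied in the paper:

```python
# Hedged sketch: an "all-but-the-top"-style transform and an isotropy score.
import numpy as np

def remove_top_components(X, n_components=2):
    """Center the embeddings and project out the top principal directions."""
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    top = Vt[:n_components]                    # dominant directions
    return X - (X @ top.T) @ top

def isotropy_score(X, n_probes=1000, seed=0):
    """min/max of the partition function Z(u) over random unit probes u;
    values near 1.0 indicate near-perfect isotropy."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # keep exp() stable
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_probes, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    Z = np.exp(U @ X.T).sum(axis=1)
    return Z.min() / Z.max()
```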

Unsupervised Anomaly Detection in Multi-Topic Short-Text Corpora
Mira Ait-Saada | Mohamed Nadif
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Unsupervised anomaly detection seeks to identify deviant data samples in a dataset without using labels and constitutes a challenging task, particularly when the majority class is heterogeneous. This paper addresses this topic for textual data and aims to determine whether a text sample is an outlier within a potentially multi-topic corpus. To this end, it is crucial to grasp the semantic aspects of words, particularly when dealing with short texts, since it is difficult to syntactically discriminate data samples based on only a few words. We therefore make use of word embeddings to represent each sample by a dense vector, efficiently capturing the underlying semantics. We then rely on the mixture model approach to detect which samples deviate the most from the underlying distributions of the corpus. Experiments carried out on real datasets show the effectiveness of the proposed approach in comparison to state-of-the-art techniques, both in terms of performance and time efficiency, especially when more than one topic is present in the corpus.
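The general recipe (dense embeddings scored by likelihood under a mixture model) can be sketched as follows; the embedding backend, the Gaussian mixture, and the 5% threshold in the usage note are illustrative assumptions, not the paper's exact configuration:

```python
# Hedged sketch: embed short texts densely, fit a mixture model, and flag
# the samples with the lowest likelihood as outliers.
import numpy as np
from sentence_transformers import SentenceTransformer   # assumed embedding backend
from sklearn.mixture import GaussianMixture

def outlier_scores(texts, n_topics=3, seed=0):
    """Return anomaly scores: higher means more deviant from the corpus."""
    X = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    gmm = GaussianMixture(n_components=n_topics,
                          covariance_type="diag",        # cheap in high dimension
                          random_state=seed).fit(X)
    return -gmm.score_samples(X)                         # negative log-likelihood

# Usage: flag the 5% most deviant samples of a corpus.
# scores = outlier_scores(corpus)
# outliers = scores > np.quantile(scores, 0.95)
```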