Iustin Sirbu

2025

MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification
Iustin Sirbu | Robert-Adrian Popovici | Cornelia Caragea | Stefan Trausan-Matu | Traian Rebedea
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce **MultiMatch**, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a three-fold pseudo-label weighting module designed for selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques - heads agreement from **Multi**head Co-training, self-adaptive thresholds from Free**Match**, and Average Pseudo-Margins from Margin**Match** - resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, i.e., MultiMatch achieves state-of-the-art results on 8 out of 10 setups from 5 natural language processing datasets and ranks first according to the Friedman test among 21 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26%, a critical advantage for real-world text classification tasks. Our code is available on GitHub.

2024

The debate surrounding gun control and gun regulation in the United States has intensified in the wake of numerous mass shooting events. As perspectives on this matter vary, it becomes increasingly important to comprehend individuals’ positions. Stance detection, the task of determining an author’s position towards a proposition or target, has gained attention for its potential use in understanding public perceptions towards controversial topics and identifying the best strategies to address public concerns. In this paper, we present GunStance, a dataset of tweets pertaining to shooting events, focusing specifically on the controversial topics of “banning guns” versus “regulating guns.” The tweets in the dataset are sourced from discussions on Twitter following various shooting incidents in the United States. Amazon Mechanical Turk was used to manually annotate a subset of the tweets relevant to the targets of interest (“banning guns” and “regulating guns”) into three classes: In-Favor, Against, and Neutral. The remaining unlabeled tweets are included in the dataset to facilitate studies on semi-supervised learning (SSL) approaches that can help address the scarcity of the labeled data in stance detection tasks. Furthermore, we propose a hybrid approach that combines curriculum-based SSL and Large Language Models (LLM), and show that the proposed approach outperforms supervised, semi-supervised, and LLM-based zero-shot models in most experiments on our assembled dataset.

2022

Multimodal Semi-supervised Learning for Disaster Tweet Classification
Iustin Sirbu | Tiberiu Sosea | Cornelia Caragea | Doina Caragea | Traian Rebedea
Proceedings of the 29th International Conference on Computational Linguistics

During natural disasters, people often use social media platforms, such as Twitter, to post information about casualties and damage produced by disasters. This information can help relief authorities gain situational awareness in nearly real time, and enable them to quickly distribute resources where most needed. However, annotating data for this purpose can be burdensome, subjective and expensive. In this paper, we investigate how to leverage the copious amounts of unlabeled data generated on social media by disaster eyewitnesses and affected individuals during disaster events. To this end, we propose a semi-supervised learning approach to improve the performance of neural models on several multimodal disaster tweet classification tasks. Our approach shows significant improvements, obtaining up to 7.7% improvements in F-1 in low-data regimes and 1.9% when using the entire training data. We make our code and data publicly available at https://github.com/iustinsirbu13/multimodal-ssl-for-disaster-tweet-classification.

Co-authors

Sarthak Khanal 1

Robert-Adrian Popovici 1

Stefan Trausan-Matu 1
