Maximilian Schmidhuber
2024
LLM-Based Synthetic Datasets: Applications and Limitations in Toxicity Detection
Udo Kruschwitz | Maximilian Schmidhuber
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024
Large Language Model (LLM)-based synthetic data is becoming an increasingly important field of research. One of its promising applications is in training classifiers to detect online toxicity, which is of increasing concern in today’s digital landscape. In this work, we assess the feasibility of using generative models to generate synthetic data for toxic speech detection. Our experiments are conducted on six different toxicity datasets, four of which are hateful and two of which are toxic in the broader sense. We then employ a classifier trained on the original data for filtering. To explore the potential of this data, we conduct experiments using combinations of original and synthetic data, synthetic oversampling of the minority class, and a comparison of original vs. synthetic-only training. Results indicate that while our generative models offer benefits in certain scenarios, they do not improve hateful dataset classification. However, they do boost patronizing and condescending language detection. We find that synthetic data generated by LLMs is a promising avenue of research, but further research is needed to improve the quality of the generated data and develop better filtering methods. Code is available on GitHub; the generated dataset will be available on Zenodo in the final submission.
2022
MS@IW at SemEval-2022 Task 4: Patronising and Condescending Language Detection with Synthetically Generated Data
Selina Meyer | Maximilian Schmidhuber | Udo Kruschwitz
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
In this description paper we outline the system architecture submitted to Task 4, Subtask 1 at SemEval-2022. We leverage the generative power of state-of-the-art generative pretrained transformer models to increase training set size and remedy class imbalance issues. Our best submitted system is trained on a synthetically enhanced dataset with 10.3 times as many positive samples as the original dataset and reaches an F1 score of 50.62%, which is 10 percentage points higher than our initial system trained on an undersampled version of the original dataset. We explore possible reasons for the comparatively low score in the overall task ranking and report on experiments conducted during the post-evaluation phase.
2021
Universität Regensburg MaxS at GermEval 2021 Task 1: Synthetic Data in Toxic Comment Classification
Maximilian Schmidhuber
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments
We report on our submission to Task 1 of the GermEval 2021 challenge – toxic comment classification. We investigate different ways of bolstering scarce training data to improve off-the-shelf model performance on a toxic comment classification task. To help address the limitations of a small dataset, we use data synthetically generated by a German GPT-2 model. The use of synthetic data has only recently taken off as a possible solution to training data sparseness in NLP, and initial results are promising. However, our model did not see measurable improvement through the use of synthetic data. We discuss possible reasons for this finding and explore future work in the field.