Generating Artificial Texts as Substitution or Complement of Training Data

Vincent Claveau, Antoine Chaffin, Ewa Kijak


Abstract
The quality of artificially generated texts has considerably improved with the advent of transformers. The question of using these models to generate training data for supervised learning tasks naturally arises, especially when the original language resource cannot be distributed or when it is small. In this article, this question is explored from three angles: (i) are artificial data an efficient complement? (ii) can they replace the original data when those are not available or cannot be distributed for confidentiality reasons? (iii) can they improve the explainability of classifiers? Different experiments are carried out on classification tasks, namely sentiment analysis on product reviews and fake news detection, using data artificially generated by fine-tuned GPT-2 models. The results show that such artificial data can be used to a certain extent but require pre-processing to significantly improve performance. We also show that bag-of-words approaches benefit the most from such data augmentation.
Anthology ID:
2022.lrec-1.453
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4260–4269
URL:
https://aclanthology.org/2022.lrec-1.453
Cite (ACL):
Vincent Claveau, Antoine Chaffin, and Ewa Kijak. 2022. Generating Artificial Texts as Substitution or Complement of Training Data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4260–4269, Marseille, France. European Language Resources Association.
Cite (Informal):
Generating Artificial Texts as Substitution or Complement of Training Data (Claveau et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.453.pdf
Data
FLUE