Improving Long-Text Authorship Verification via Model Selection and Data Tuning
Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Authorship verification is used to link texts written by the same author without needing a model per author, making it useful for deanonymizing users who spread text with malicious intent. In this work, we evaluated our Cross-Encoder system with four Transformers using differently tuned variants of fanfiction data and found that our BigBird pipeline outperformed Longformer, RoBERTa, and ELECTRA and performed competitively against the official top-ranked system from the PAN evaluation. We also examined the effect of authors and fandoms not seen in training on model performance. Through this, we found that fandom has the greatest influence on true trials, and that a training dataset balanced in terms of class and fandom performed the most consistently.
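One way to picture a training set "balanced in terms of class and fandom" is to downsample every (class, fandom) group of text pairs to the size of the smallest group. The sketch below is only an illustration under stated assumptions, not the tuning procedure used in the paper: the function `balance_pairs` and the dictionary keys `same_author` and `fandom` are hypothetical, and each pair is assumed to carry a single fandom label.

```python
import random
from collections import defaultdict


def balance_pairs(pairs, seed=0):
    """Downsample text pairs so every observed (class, fandom) group has equal size.

    `pairs` is a list of dicts with hypothetical keys:
      - "same_author": bool, the verification label (true or false trial)
      - "fandom": str, the fandom the pair is drawn from (assumed single-valued)
    Only groups that actually occur in `pairs` are balanced against each other.
    """
    rng = random.Random(seed)

    # Group the pairs by (label, fandom).
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair["same_author"], pair["fandom"])].append(pair)
    if not groups:
        return []

    # Every group is reduced to the size of the smallest one.
    target = min(len(group) for group in groups.values())

    balanced = []
    for key in sorted(groups):  # sorted for reproducibility
        balanced.extend(rng.sample(groups[key], target))
    rng.shuffle(balanced)
    return balanced
```

Downsampling discards data from the larger groups; upsampling or weighting the loss are alternative designs that keep all pairs at the cost of repeated or reweighted examples.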
Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
We present a new method to bootstrap filter Twitter language ID labels in our dataset for automatic language identification (LID). Our method combines geo-location, original Twitter LID labels, and Amazon Mechanical Turk to resolve missing and unreliable labels. We are the first to compare LID classification performance using the MIRA algorithm and langid.py. We show classifier performance on different versions of our dataset with high accuracy using only Twitter data, without ground truth, and very few training examples. We also show how Platt scaling can be used to calibrate MIRA classifier output values into a probability distribution over candidate classes, making the output more intuitive. Our method allows for fine-grained distinctions between similar languages and dialects and allows us to rediscover the language composition of our Twitter dataset.
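To make the calibration step concrete, the following is a minimal sketch of Platt scaling: a sigmoid is fitted to raw classifier scores for each class (one-vs-rest), and the per-class outputs are then normalized into a distribution. It is an illustration under assumptions, not the paper's implementation: the function names, the plain gradient-descent fit, the one-vs-rest normalization, and the toy scores are all hypothetical, and Platt's smoothed targets are omitted for brevity.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=500):
    """Fit p(y=1 | s) = sigmoid(a * s + b) by gradient descent on log loss.

    Platt's original form is 1 / (1 + exp(A * s + B)), which is the same
    model with A = -a and B = -b.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(class_scores, params):
    """Map raw per-class scores to a probability distribution.

    `class_scores` maps class name -> raw score; `params` maps class name
    -> (a, b) fitted one-vs-rest. Normalizing the sigmoid outputs is one
    simple way to obtain a distribution (an assumption of this sketch).
    """
    raw = {c: sigmoid(params[c][0] * s + params[c][1])
           for c, s in class_scores.items()}
    total = sum(raw.values())
    return {c: p / total for c, p in raw.items()}


# Toy usage with made-up scores: positive scores mostly belong to the class.
toy_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
toy_labels = [0, 0, 0, 1, 1, 1]
a, b = fit_platt(toy_scores, toy_labels)
probs = calibrate({"lang_x": 1.5, "lang_y": -0.5},
                  {"lang_x": (a, b), "lang_y": (a, b)})
```

In practice the sigmoid parameters would be fitted on held-out scores rather than on the data used to train the classifier, so that the calibration does not inherit the classifier's overconfidence.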