DATAMAFIA at WNUT-2020 Task 2: A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks

Ayan Sengupta


Abstract
This paper describes the system developed by team datamafia for WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. It presents a thorough study of pre-trained language models on a downstream binary classification task over noisy, user-generated Twitter data. The solution submitted to the final test leaderboard is a fine-tuned RoBERTa model that achieves F1 scores of 90.8% and 89.4% on the dev and test data, respectively. In the later part, we explore several techniques for injecting regularization explicitly into language models to generalize predictions over noisy data. Our experiments show that adding explicit regularization to the pre-trained RoBERTa model makes it robust to data and annotation noise and improves overall performance by more than 1.2%.
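The sketch below illustrates the kind of setup the abstract describes: a pre-trained RoBERTa encoder fine-tuned with a binary classification head, with dropout as one example of explicit regularization. The dropout rates, learning rate, and the toy tweets and labels are illustrative assumptions, not the author's exact configuration (see the released code repository for the actual implementation).

    # Minimal sketch: fine-tuning RoBERTa for binary tweet classification
    # with extra dropout as an explicit regularizer (assumed values).
    import torch
    from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    model = RobertaForSequenceClassification.from_pretrained(
        "roberta-base",
        num_labels=2,                      # INFORMATIVE vs. UNINFORMATIVE
        hidden_dropout_prob=0.2,           # assumed dropout rate, not the paper's value
        attention_probs_dropout_prob=0.2,  # assumed dropout rate, not the paper's value
    )

    # Toy batch standing in for preprocessed tweets from the shared-task data.
    tweets = ["New confirmed cases reported in the city today.", "Stay safe everyone!"]
    labels = torch.tensor([1, 0])

    batch = tokenizer(tweets, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")

    # Weight decay is another common form of explicit regularization.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

    model.train()
    outputs = model(**batch, labels=labels)  # cross-entropy loss for the two classes
    outputs.loss.backward()
    optimizer.step()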
Anthology ID:
2020.wnut-1.51
Volume:
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Month:
November
Year:
2020
Address:
Online
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
371–377
URL:
https://aclanthology.org/2020.wnut-1.51
DOI:
10.18653/v1/2020.wnut-1.51
Cite (ACL):
Ayan Sengupta. 2020. DATAMAFIA at WNUT-2020 Task 2: A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 371–377, Online. Association for Computational Linguistics.
Cite (Informal):
DATAMAFIA at WNUT-2020 Task 2: A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks (Sengupta, WNUT 2020)
PDF:
https://aclanthology.org/2020.wnut-1.51.pdf
Optional supplementary material:
 2020.wnut-1.51.OptionalSupplementaryMaterial.zip
Code
 victor7246/wnut-2020-task-2
Data
WNUT-2020 Task 2