UNT Linguistics at SemEval-2020 Task 12: Linear SVC with Pre-trained Word Embeddings as Document Vectors and Targeted Linguistic Features

Jared Fromknecht, Alexis Palmer


Abstract
This paper outlines our approach to Tasks A and B for the English-language track of SemEval-2020 Task 12: OffensEval 2: Multilingual Offensive Language Identification in Social Media. We use a Linear SVM with document vectors computed from pre-trained word embeddings, and we explore the effectiveness of lexical, part-of-speech, dependency, and named entity (NE) features. We manually annotate a subset of the training data, which we use for error analysis and to tune a threshold for mapping training confidence values to labels. While document vectors are consistently the most informative features for both tasks, testing on the development set suggests that dependency features are an effective addition for Task A, and NE features for Task B.
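The two mechanical steps named in the abstract, building a document vector from word embeddings and mapping a confidence value to a label, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how word vectors are pooled, so mean pooling is an assumption here, as are the toy two-dimensional embeddings, the function names, and the 0.5 threshold (the paper tunes its threshold on manually annotated data). Only the OFF/NOT label names come from the Task A label set.

```python
from statistics import mean

# Hypothetical toy embeddings for illustration only; a real system would
# load pre-trained vectors with many more dimensions.
TOY_EMBEDDINGS = {
    "you": [0.1, 0.3],
    "are": [0.0, 0.2],
    "awful": [0.9, -0.4],
    "great": [-0.7, 0.5],
}


def document_vector(tokens, embeddings, dim=2):
    """Average the embeddings of in-vocabulary tokens.

    Mean pooling is an assumption; out-of-vocabulary tokens are skipped,
    and a document with no known tokens maps to the zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [mean(component) for component in zip(*vectors)]


def confidence_to_label(confidence, threshold=0.5):
    """Map a confidence value to a Task A label.

    The default threshold is a placeholder, not the value tuned in the paper.
    """
    return "OFF" if confidence >= threshold else "NOT"


doc = document_vector(["you", "are", "awful"], TOY_EMBEDDINGS)
label = confidence_to_label(0.7)
```

Vectors produced this way could then be passed, alone or concatenated with other feature vectors, to a linear classifier such as a linear SVM.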
Anthology ID:
2020.semeval-1.294
Volume:
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Month:
December
Year:
2020
Address:
Barcelona (online)
Editors:
Aurelie Herbelot, Xiaodan Zhu, Alexis Palmer, Nathan Schneider, Jonathan May, Ekaterina Shutova
Venue:
SemEval
SIG:
SIGLEX
Publisher:
International Committee for Computational Linguistics
Pages:
2209–2215
URL:
https://aclanthology.org/2020.semeval-1.294
DOI:
10.18653/v1/2020.semeval-1.294
Cite (ACL):
Jared Fromknecht and Alexis Palmer. 2020. UNT Linguistics at SemEval-2020 Task 12: Linear SVC with Pre-trained Word Embeddings as Document Vectors and Targeted Linguistic Features. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 2209–2215, Barcelona (online). International Committee for Computational Linguistics.
Cite (Informal):
UNT Linguistics at SemEval-2020 Task 12: Linear SVC with Pre-trained Word Embeddings as Document Vectors and Targeted Linguistic Features (Fromknecht & Palmer, SemEval 2020)
PDF:
https://aclanthology.org/2020.semeval-1.294.pdf
Data:
OLID