2021
pdf
bib
abs
LZ1904 at SemEval-2021 Task 5: Bi-LSTM-CRF for Toxic Span Detection using Pretrained Word Embedding
Liang Zou
|
Wen Li
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
Recurrent Neural Networks (RNN) have been widely used in various Natural Language Processing (NLP) tasks such as text classification, sequence tagging, and machine translation. Long Short Term Memory (LSTM), a special unit of RNN, has the benefit of memorizing past and even future information in a sentence (especially for bidirectional LSTM). In the shared task of detecting spans which make texts toxic, we first apply pretrained word embedding (GloVe) to generate the word vectors after tokenization. And then we construct Bidirectional Long Short Term Memory-Conditional Random Field (Bi-LSTM-CRF) model by Baidu research to predict whether each word in the sentence is toxic or not. We tune hyperparameters of dropout rate, number of LSTM units, embedding size with 10 epochs and choose the best epoch with validation recall. Our model achieves an F1 score of 66.99 percent in test dataset.
2019
pdf
bib
abs
NULI at SemEval-2019 Task 6: Transfer Learning for Offensive Language Detection using Bidirectional Transformers
Ping Liu
|
Wen Li
|
Liang Zou
Proceedings of the 13th International Workshop on Semantic Evaluation
Transfer learning and domain adaptive learning have been applied to various fields including computer vision (e.g., image recognition) and natural language processing (e.g., text classification). One of the benefits of transfer learning is to learn effectively and efficiently from limited labeled data with a pre-trained model. In the shared task of identifying and categorizing offensive language in social media, we preprocess the dataset according to the language behaviors on social media, and then adapt and fine-tune the Bidirectional Encoder Representation from Transformer (BERT) pre-trained by Google AI Language team. Our team NULI wins the first place (1st) in Sub-task A - Offensive Language Identification and is ranked 4th and 18th in Sub-task B - Automatic Categorization of Offense Types and Sub-task C - Offense Target Identification respectively.
pdf
bib
abs
Ensemble Methods to Distinguish Mainland and Taiwan Chinese
Hai Hu
|
Wen Li
|
He Zhou
|
Zuoyu Tian
|
Yiwen Zhang
|
Liang Zou
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
This paper describes the IUCL system at VarDial 2019 evaluation campaign for the task of discriminating between Mainland and Taiwan variation of mandarin Chinese. We first build several base classifiers, including a Naive Bayes classifier with word n-gram as features, SVMs with both character and syntactic features, and neural networks with pre-trained character/word embeddings. Then we adopt ensemble methods to combine output from base classifiers to make final predictions. Our ensemble models achieve the highest F1 score (0.893) in simplified Chinese track and the second highest (0.901) in traditional Chinese track. Our results demonstrate the effectiveness and robustness of the ensemble methods.
2017
pdf
bib
abs
Classifier Stacking for Native Language Identification
Wen Li
|
Liang Zou
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
This paper reports our contribution (team WLZ) to the NLI Shared Task 2017 (essay track). We first extract lexical and syntactic features from the essays, perform feature weighting and selection, and train linear support vector machine (SVM) classifiers each on an individual feature type. The output of base classifiers, as probabilities for each class, are then fed into a multilayer perceptron to predict the native language of the author. We also report the performance of each feature type, as well as the best features of a type. Our system achieves an accuracy of 86.55%, which is among the best performing systems of this shared task.