We provide a summary of the fifth edition of the CASE workshop that is held in the scope of EMNLP 2022. The workshop consists of regular papers, two keynotes, working papers of shared task participants, and task overview papers. This workshop has been bringing together all aspects of event information collection across technical and social science fields. In addition to the progress in depth, the submission and acceptance of multimodal approaches show the widening of this interdisciplinary research topic.
Social media posts containing hate speech are reproduced and redistributed at an accelerated pace, reaching greater audiences at a higher speed. We present a machine learning system for automatic detection of hate speech in Turkish, along with a hate speech dataset consisting of tweets collected in two separate domains. We first adopted a definition for hate speech that is in line with our goals and amenable to easy annotation; then designed the annotation schema for annotating the collected tweets. The Istanbul Convention dataset consists of tweets posted following the withdrawal of Turkey from the Istanbul Convention. The Refugees dataset was created by collecting tweets about immigrants by filtering based on commonly used keywords related to immigrants. Finally, we have developed a hate speech detection system using the transformer architecture (BERTurk), to be used as a baseline for the collected dataset. The binary classification accuracy is 77% when the system is evaluated using 5-fold cross-validation on the Istanbul Convention dataset and 71% for the Refugee dataset. We also tested a regression model with 0.66 and 0.83 RMSE on a scale of [0-4], for the Istanbul Convention and Refugees datasets.
This paper introduces a new Turkish Twitter Named Entity Recognition dataset. The dataset, which consists of 5000 tweets from a year-long period, was labeled by multiple annotators with a high agreement score. The dataset is also diverse in terms of the named entity types as it contains not only person, organization, and location but also time, money, product, and tv-show categories. Our initial experiments with pretrained language models (like BertTurk) over this dataset returned F1 scores of around 80%. We share this dataset publicly.
This paper describes the system proposed by Sabancı University Natural Language Processing Group in the SemEval-2022 MultiCoNER task. We developed an unsupervised entity linking pipeline that detects potential entity mentions with the help of Wikipedia and also uses the corresponding Wikipedia context to help the classifier in finding the named entity type of that mention. The proposed pipeline significantly improved the performance, especially for complex entities in low-context settings.
This paper aims to present WordNet and Wikipedia connection by linking synsets from Turkish WordNet KeNet with Wikipedia and thus, provide a better machine-readable dictionary to create an NLP model with rich data. For this purpose, manual mapping between two resources is realized and 11,478 synsets are linked to Wikipedia. In addition to this, automatic linking approaches are utilized to analyze possible connection suggestions. Baseline Approach and ElasticSearch Based Approach help identify the potential human annotation errors and analyze the effectiveness of these approaches in linking. Adopting both manual and automatic mapping provides us with an encompassing resource of WordNet and Wikipedia connections.
This workshop is the fourth issue of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of workshops is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars and armed conflicts, in text streams. This year workshop contributors make use of the state-of-the-art NLP technologies, such as Deep Learning, Word Embeddings and Transformers and cover a wide range of topics from text classification to news bias detection. Around 40 teams have registered and 15 teams contributed to three tasks that are i) multilingual protest news detection detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also highlights two keynote and four invited talks about various aspects of creating event data sets and multi- and cross-lingual machine learning in few- and zero-shot settings.
This paper summarizes our group’s efforts in the multilingual protest news detection shared task, which is organized as a part of the Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE) Workshop. We participated in all four subtasks in English. Especially in the identification of event containing sentences task, our proposed ensemble approach using RoBERTa and multichannel CNN-LexStem model yields higher performance. Similarly in the event extraction task, our transformer-LSTM-CRF architecture outperforms regular transformers significantly.
This paper summarizes our group’s efforts in the event sentence coreference identification shared task, which is organized as part of the Automated Extraction of Socio-Political Events from News (AESPEN) Workshop. Our main approach consists of three steps. We initially use a transformer based model to predict whether a pair of sentences refer to the same event or not. Later, we use these predictions as the initial scores and recalculate the pair scores by considering the relation of sentences in a pair with respect to other sentences. As the last step, final scores between these sentences are used to construct the clusters, starting with the pairs with the highest scores. Our proposed approach outperforms the baseline approach across all evaluation metrics.
This paper summarizes our group’s efforts in the offensive language identification shared task, which is organized as part of the International Workshop on Semantic Evaluation (Sem-Eval2020). Our final submission system is an ensemble of three different models, (1) CNN-LSTM, (2) BiLSTM-Attention and (3) BERT. Word embeddings, which were pre-trained on tweets, are used while training the first two models. BERTurk, which is the first BERT model for Turkish, is also explored. Our final submitted approach ranked as the second best model in the Turkish sub-task.
In this paper, we address the problem of identifying informative tweets related to COVID-19 in the form of a binary classification task as part of our submission for W-NUT 2020 Task 2. Specifically, we focus on ensembling methods to boost the classification performance of classification models such as BERT and CNN. We show that ensembling can reduce the variance in performance, specifically for BERT base models.