Event Extraction (EE) is a fundamental task in information extraction that aims to identify events and their associated arguments in text. It is important in many applications and supports the development of related tasks. Although numerous datasets and methods for event extraction exist in various languages, no dedicated dataset has been available for Vietnamese. To address this limitation, we propose BKEE, a novel event extraction dataset for Vietnamese. BKEE encompasses over 33 distinct event types and 28 different event argument roles, with annotations for entity mentions, event mentions, and event arguments across 1066 documents. Additionally, we establish robust baselines for potential downstream tasks on this dataset, facilitating the analysis of challenges and future development prospects in Vietnamese event extraction.
Our work addresses the problem of unsupervised Aspect Category Detection using a small set of seed words. Recent works have focused on learning embedding spaces for seed words and sentences to establish similarities between sentences and aspects. However, aspect representations are limited by the quality of the initial seed words, and model performance is compromised by noise. To mitigate these limitations, we propose a simple framework that automatically enhances the quality of the initial seed words and selects high-quality sentences for training instead of using the entire dataset. Our main ideas are to add a number of seed words to the initial set and to treat noise resolution as a data augmentation problem for a low-resource task. In addition, we jointly train Aspect Category Detection with Aspect Term Extraction and Aspect Term Polarity to further enhance performance. This approach facilitates shared representation learning, allowing Aspect Category Detection to benefit from the additional guidance offered by the other tasks. Extensive experiments demonstrate that our framework surpasses strong baselines on standard datasets.
Aspect detection is a fundamental task in opinion mining. Previous works use seed words either as priors of topic models, as anchors to guide the learning of aspects, or as features of aspect classifiers. This paper presents a novel weakly-supervised method that exploits seed words for aspect detection based on an encoder architecture. The encoder maps segments and aspects into a low-dimensional embedding space. The goal is to make the similarity between segments and aspects in the embedding space approximate their ground-truth similarity, which is generated from seed words. An objective function is proposed to capture the uncertainty of this ground-truth similarity. Our method outperforms previous works on several benchmarks in various domains.
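To make the idea of a seed-word-derived similarity target concrete, the following is a minimal sketch, not the method of the paper. It assumes that the "ground-truth" similarity of a segment is the normalized count of seed-word matches per aspect, that the model similarity is a softmax over cosine similarities, and that a plain cross-entropy stands in for the proposed objective. The seed words, vectors, and function names are all hypothetical.

```python
import math

# Hypothetical seed words per aspect (illustrative only, not from the paper).
SEED_WORDS = {
    "food": {"pizza", "tasty", "menu"},
    "service": {"waiter", "staff", "rude"},
    "price": {"cheap", "expensive", "bill"},
}


def seed_similarity(tokens, seed_words=SEED_WORDS):
    """Assumed target: share of seed-word matches per aspect in a segment."""
    counts = {
        aspect: sum(1 for tok in tokens if tok in words)
        for aspect, words in seed_words.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No seed word matched: fall back to a uniform (maximally uncertain) target.
        return {aspect: 1.0 / len(seed_words) for aspect in seed_words}
    return {aspect: c / total for aspect, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predicted_similarity(segment_vec, aspect_vecs):
    """Softmax over cosine similarities between a segment and each aspect."""
    scores = {a: math.exp(cosine(segment_vec, v)) for a, v in aspect_vecs.items()}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}


def cross_entropy(target, predicted):
    """Stand-in loss (an assumption): cross-entropy between the two distributions."""
    return -sum(target[a] * math.log(predicted[a]) for a in target)


# Toy usage with made-up 2-d embeddings standing in for encoder outputs.
tokens = ["the", "waiter", "was", "rude", "but", "pizza", "tasty"]
target = seed_similarity(tokens)
aspect_vecs = {"food": [1.0, 0.0], "service": [0.0, 1.0], "price": [-1.0, 0.0]}
predicted = predicted_similarity([1.0, 0.0], aspect_vecs)
loss = cross_entropy(target, predicted)
```

In a trainable version, the segment and aspect vectors would come from the encoder, and the loss would be minimized over its parameters. The uniform fallback is one simple way to express uncertainty when no seed word matches; the paper's actual objective may treat this differently.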
This article presents a corpus for the development and testing of event schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts with no supervision, and of grouping together entities that correspond to the same role in a template. Most previous work on this subject relies on the MUC-4 corpus. We describe the limits of this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially annotated corpus in English that remedies some of these shortcomings. We use Wikinews to select data from the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only the Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used for building the corpus and evaluate some existing systems on this new data.
This article presents a generative model for unsupervised event induction. Previous methods in the literature use only the heads of phrases to represent entities. Yet the full phrase (for example, "an armed man") carries more discriminative information than the head alone ("man"). Our model takes this information into account and represents it in the distribution of event schemas. We show that these relations play an important role in parameter estimation and lead to more coherent and more discriminative distributions. Experimental results on the MUC-4 corpus confirm these improvements.