Generating new events given context with correlated ones plays a crucial role in many event-centric reasoning tasks. Existing works either limit their scope to specific scenarios or overlook event-level correlations. In this paper, we propose to pre-train a general Correlation-aware context-to-Event Transformer (ClarET) for event-centric reasoning. To achieve this, we propose three novel event-centric objectives, i.e., whole event recovering, contrastive event-correlation encoding and prompt-based event locating, which highlight event-level correlations with effective training. The proposed ClarET is applicable to a wide range of event-centric reasoning scenarios, considering its versatility of (i) event-correlation types (e.g., causal, temporal, contrast), (ii) application formulations (i.e., generation and classification), and (iii) reasoning types (e.g., abductive, counterfactual and ending reasoning). Empirical fine-tuning results, as well as zero- and few-shot learning, on 9 benchmarks (5 generation and 4 classification tasks covering 4 reasoning types with diverse event correlations), verify its effectiveness and generalization ability.
Distant supervision uses triple facts in knowledge graphs to label a corpus for relation extraction, leading to wrong labeling and long-tail problems. Some works use the hierarchy of relations for knowledge transfer to long-tail relations. However, a coarse-grained relation often implies only an attribute (e.g., domain or topic) of the distant fact, making it hard to discriminate relations based solely on sentence semantics. One solution is resorting to entity types, but open questions remain about how to fully leverage the information of entity types and how to align multi-granular entity types with sentences. In this work, we propose a novel model to enrich distantly-supervised sentences with entity types. It consists of (1) a pairwise type-enriched sentence encoding module injecting both context-free and -related backgrounds to alleviate sentence-level wrong labeling, and (2) a hierarchical type-sentence alignment module enriching a sentence with the triple fact’s basic attributes to support long-tail relations. Our model achieves new state-of-the-art results in overall and long-tail performance on benchmarks.
Knowledge distillation has been proven effective when customizing small language models for specific tasks. Here, the corpus serves as an indispensable ‘textbook’: only through it can the teacher teach the student. Prevailing methods adopt a two-stage distillation paradigm: general distillation first with a task-agnostic general corpus, and task-specific distillation next with an augmented task-specific corpus. We argue that such a paradigm may not be optimal. In general distillation, it is extravagant to let the diverse but desultory general knowledge overwhelm the limited model capacity of the student, while in task-specific distillation, the task corpus is usually limited and narrow, preventing the student from learning enough knowledge. To mitigate the issues in the two gapped corpora, we present a better textbook for the student to learn: a contextualized corpus that contextualizes the task corpus with a large-scale general corpus through relevance-based text retrieval. Experimental results on the GLUE benchmark demonstrate that the contextualized corpus is a better textbook than jointly using the general corpus and the augmented task-specific corpus. Surprisingly, it enables task-specific distillation from scratch without general distillation while maintaining comparable performance, making it more flexible to customize the student model with the desired model size under various computation constraints.
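The idea of contextualizing a task corpus through relevance-based retrieval can be illustrated with a minimal sketch. The abstract does not specify the retriever, so Jaccard token overlap is used here purely as a stand-in relevance score; the function names and the `top_k` parameter are hypothetical and not taken from the paper.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (an assumption, not the paper's tokenizer)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def contextualize(task_corpus: List[str],
                  general_corpus: List[str],
                  top_k: int = 1) -> List[str]:
    """Return each task sentence followed by its top_k most relevant
    general-corpus sentences (hypothetical stand-in for the retrieval step)."""
    result: List[str] = []
    for sentence in task_corpus:
        query = tokenize(sentence)
        # Rank general sentences by relevance to the task sentence.
        ranked = sorted(general_corpus,
                        key=lambda g: jaccard(query, tokenize(g)),
                        reverse=True)
        result.append(sentence)
        result.extend(ranked[:top_k])
    return result


if __name__ == "__main__":
    task = ["the movie was great"]
    general = ["stock prices fell sharply", "a great movie with a great cast"]
    print(contextualize(task, general))
```

In a realistic setting the overlap score would presumably be replaced by a stronger sparse or dense retriever; the structure (task sentence as query, general corpus as index) is the only part this sketch is meant to convey.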
Learning sentence embeddings in an unsupervised manner is fundamental in natural language processing. Recent common practice is to couple pre-trained language models with unsupervised contrastive learning, whose success relies on augmenting a sentence with a semantically close positive instance to construct contrastive pairs. Nonetheless, existing approaches usually depend on a mono-augmenting strategy, which causes learning shortcuts towards the augmenting biases and thus corrupts the quality of sentence embeddings. A straightforward solution is to resort to more diverse positives from a multi-augmenting strategy, but an open question remains about how to learn, without supervision, from diverse positives of uneven augmenting quality in the text field. As one answer, we propose a novel Peer-Contrastive Learning (PCL) with diverse augmentations. PCL constructs diverse contrastive positives and negatives at the group level for unsupervised sentence embeddings. PCL performs peer-positive contrast as well as peer-network cooperation, which offers an inherent anti-bias ability and an effective way to learn from diverse augmentations. Experiments on STS benchmarks verify the effectiveness of PCL against its competitors in unsupervised sentence embeddings.
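To make the notion of contrasting against a group of positives concrete, the following is a minimal sketch of a generic multi-positive InfoNCE-style loss. It is not PCL's actual objective (the abstract does not give it, and peer-network cooperation is not modeled); the function names and the temperature value are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_contrastive_loss(anchor: Sequence[float],
                           positives: Sequence[Sequence[float]],
                           negatives: Sequence[Sequence[float]],
                           temperature: float = 0.05) -> float:
    """Average InfoNCE loss over a group of positives (illustrative only).

    Each positive is contrasted against the shared pool of all positives
    and negatives, so no single augmentation dominates the objective.
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    close = [[1.0, 0.1], [0.9, 0.0]]   # stand-ins for diverse augmentations
    far = [[0.0, 1.0]]
    print(group_contrastive_loss(anchor, close, far))
```

The loss is small when the positives lie near the anchor and the negatives lie far from it, and grows when that arrangement is reversed.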
Grounding dialogue agents with knowledge documents has sparked increased attention in both academia and industry. Recently, a growing body of work has tried to build retrieval-based knowledge-grounded dialogue systems. While promising, these approaches require collecting pairs of dialogue context and the corresponding ground-truth knowledge sentences that contain the information regarding the dialogue context. Unfortunately, hand-labeling data to that end is time-consuming, and many datasets and applications lack such knowledge annotations. In this paper, we propose a reciprocal learning approach to jointly optimize a knowledge retriever and a response ranker for knowledge-grounded response retrieval without ground-truth knowledge labels. Specifically, the knowledge retriever uses the feedback from the response ranker as pseudo-supervision signals of knowledge retrieval for updating its parameters, while the response ranker also receives the top-ranked knowledge sentences from the knowledge retriever for optimization. Evaluation results on two public benchmarks show that our model can significantly outperform previous state-of-the-art methods.
Ethical judgment aims to determine whether a person in a narrative situation acts in accordance with the social norms of a culture, so it is crucial for understanding actions in narratives and achieving machine ethics. Recent works depend on data-driven methods to directly judge the ethics of complex real-world narratives but face two major challenges. First, they cannot handle dilemma situations well due to a lack of basic knowledge about social norms. Second, they focus merely on sparse situation-level judgment regardless of the social norms involved during the judgment, leading to a black box. In this work, inspired by previous knowledge-grounded and -augmented paradigms, we propose to complement a complex situation with grounded social norms. Besides a norm-grounding knowledge model, we present a novel norm-supported ethical judgment model in line with neural module networks to alleviate dilemma situations and improve norm-level explainability. Empirically, our model improves state-of-the-art performance on two narrative judgment benchmarks.
Aspect-level sentiment classification (ALSC) aims at identifying the sentiment polarity of a specified aspect in a sentence. ALSC is a practical setting in aspect-based sentiment analysis because it needs no opinion term labeling, but it fails to interpret why a sentiment polarity is derived for the aspect. To address this problem, recent works fine-tune pre-trained Transformer encoders for ALSC to extract an aspect-centric dependency tree that can locate the opinion words. However, the induced opinion words only provide an intuitive cue far below human-level interpretability. Besides, the pre-trained encoder tends to internalize an aspect’s intrinsic sentiment, causing sentiment bias and thus affecting model performance. In this paper, we propose a span-based anti-bias aspect representation learning framework. It first eliminates the sentiment bias in the aspect embedding by adversarial learning against aspects’ prior sentiment. Then, it aligns the distilled opinion candidates with the aspect by span-based dependency modeling to highlight the interpretable opinion terms. Our method achieves new state-of-the-art performance on five benchmarks, with the capability of unsupervised opinion extraction.
Multilingual question answering over knowledge graph (KGQA) aims to derive answers from a knowledge graph (KG) for questions in multiple languages. To be widely applicable, we focus on its zero-shot transfer setting. That is, we can only access training data in a high-resource language, but must answer multilingual questions without any labeled data in the target languages. A straightforward approach is to resort to pre-trained multilingual models (e.g., mBERT) for cross-lingual transfer, but there is still a significant gap in KGQA performance between source and target languages. In this paper, we exploit unsupervised bilingual lexicon induction (BLI) to map training questions in the source language into those in the target language as augmented training data, which circumvents language inconsistency between training and inference. Furthermore, we propose an adversarial learning strategy to alleviate the syntax disorder of the augmented data, inclining the model toward both language and syntax independence. Consequently, our model narrows the gap in zero-shot cross-lingual transfer. Experiments on two multilingual KGQA datasets with 11 zero-resource languages verify its effectiveness.
Many graph embedding approaches have been proposed for knowledge graph completion via link prediction. Among those, translating embedding approaches enjoy the advantages of light-weight structure, high efficiency and great interpretability. Especially when extended to complex vector space, they show the capability of handling various relation patterns including symmetry, antisymmetry, inversion and composition. However, previous translating embedding approaches defined in complex vector space suffer from two main issues: 1) the representing and modeling capacities of the model are limited by the translation function with a rigorous multiplication of two complex numbers; and 2) the embedding ambiguity caused by one-to-many relations is not explicitly alleviated. In this paper, we propose a relation-adaptive translation function built upon a novel weighted product in complex space, where the weights are learnable, relation-specific and independent of embedding size. The translation function only requires eight more scalar parameters per relation, but improves expressive power and alleviates the embedding ambiguity problem. Based on the function, we then present our Relation-adaptive translating Embedding (RatE) approach to score each graph triple. Moreover, a novel negative sampling method is proposed to utilize both prior knowledge and self-adversarial learning for effective optimization. Experiments verify that RatE achieves state-of-the-art performance on four link prediction benchmarks.
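A weighted product in complex space with eight scalar weights per relation can be sketched as follows. The abstract does not give the exact formula, so this is one plausible reading: each of the four real-valued cross terms of a complex multiplication receives one weight for the real part and one for the imaginary part. The names `weighted_product`, `STANDARD` and `score`, and the use of a summed modulus as the distance, are assumptions for illustration only.

```python
from typing import Sequence

# Weights that recover the ordinary complex product:
# (a + bi)(c + di) = (ac - bd) + (ad + bc)i
STANDARD = (1.0, 0.0, 0.0, -1.0,   # real part: ac, ad, bc, bd
            0.0, 1.0, 1.0, 0.0)    # imaginary part: ac, ad, bc, bd


def weighted_product(x: complex, r: complex, w: Sequence[float]) -> complex:
    """Weighted complex product with eight relation-specific scalars (hypothetical form)."""
    a, b = x.real, x.imag
    c, d = r.real, r.imag
    real = w[0] * a * c + w[1] * a * d + w[2] * b * c + w[3] * b * d
    imag = w[4] * a * c + w[5] * a * d + w[6] * b * c + w[7] * b * d
    return complex(real, imag)


def score(head: Sequence[complex],
          rel: Sequence[complex],
          tail: Sequence[complex],
          w: Sequence[float]) -> float:
    """Negative distance between the translated head and the tail.

    The same eight weights are shared across all embedding dimensions,
    so the parameter count does not depend on the embedding size.
    """
    return -sum(abs(weighted_product(h, r, w) - t)
                for h, r, t in zip(head, rel, tail))


if __name__ == "__main__":
    print(weighted_product(1 + 2j, 3 + 4j, STANDARD))  # equals (1+2j)*(3+4j)
```

With the `STANDARD` weights the function reduces to the usual complex multiplication, which makes the rigid product a special case that learnable weights can move away from.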
The wrong labeling problem and long-tail relations are two main challenges caused by distant supervision in relation extraction. Recent works alleviate wrong labeling by selective attention via multi-instance learning, but cannot handle long-tail relations well even if hierarchies of the relations are introduced to share knowledge. In this work, we propose a novel neural network, Collaborating Relation-augmented Attention (CoRA), to handle both wrong labeling and long-tail relations. In particular, we first propose a relation-augmented attention network as the base model. It operates on a sentence bag with a sentence-to-relation attention to minimize the effect of wrong labeling. Then, facilitated by the proposed base model, we introduce collaborating relation features shared among relations in the hierarchies to promote the relation-augmenting process and balance the training data for long-tail relations. Besides the main training objective of predicting the relation of a sentence bag, an auxiliary objective is utilized to guide the relation-augmenting process for a more accurate bag-level representation. In experiments on the popular benchmark dataset NYT, the proposed CoRA improves the prior state-of-the-art performance by a large margin in terms of Precision@N, AUC and Hits@K. Further analyses verify its superior capability in handling long-tail relations in contrast to the competitors.
In this work, we aim at equipping pre-trained language models with structured knowledge. We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs. Building upon entity-level masked language models, our first contribution is an entity masking scheme that exploits relational knowledge underlying the text. This is fulfilled by using a linked knowledge graph to select informative entities and then masking their mentions. In addition, we use knowledge graphs to obtain distractors for the masked entities, and propose a novel distractor-suppressed ranking objective that is optimized jointly with the masked language model. In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training, to inject structured knowledge into language models via learning from raw text. It is more efficient than retrieval-based methods that perform entity linking and integration during finetuning and inference, and generalizes more effectively than methods that directly learn from concatenated graph triples. Experiments show that our proposed model achieves improved performance on five benchmarks, including question answering and knowledge base completion.
Neural networks equipped with self-attention have parallelizable computation, light-weight structure, and the ability to capture both long-range and local dependencies. Further, their expressive power and performance can be boosted by using a vector to measure pairwise dependency, but this requires expanding the alignment matrix into a tensor, which results in memory and computation bottlenecks. In this paper, we propose a novel attention mechanism called “Multi-mask Tensorized Self-Attention” (MTSA), which is as fast and as memory-efficient as a CNN, but significantly outperforms previous CNN-/RNN-/attention-based models. MTSA 1) captures both pairwise (token2token) and global (source2token) dependencies by a novel compatibility function composed of dot-product and additive attentions, 2) uses a tensor to represent the feature-wise alignment scores for better expressive power but only requires parallelizable matrix multiplications, and 3) combines multi-head with multi-dimensional attentions, and applies a distinct positional mask to each head (subspace), so the memory and computation can be distributed to multiple heads, each with sequential information encoded independently. The experiments show that a CNN/RNN-free model based on MTSA achieves state-of-the-art or competitive performance on nine NLP benchmarks with compelling memory- and time-efficiency.
We consider the problem of conversational question answering over a large-scale knowledge base. To handle the huge entity vocabulary of a large-scale knowledge base, recent neural semantic-parsing-based approaches usually decompose the task into several subtasks and then solve them sequentially, which leads to the following issues: 1) errors in earlier subtasks will be propagated and negatively affect downstream ones; and 2) each subtask cannot naturally share supervision signals with others. To tackle these issues, we propose an innovative multi-task learning framework where a pointer-equipped semantic parsing model is designed to resolve coreference in conversations, and naturally empowers joint learning with a novel type-aware entity detection model. The proposed framework thus enables shared supervision and alleviates the effect of error propagation. Experiments on a large-scale conversational question answering dataset containing 1.6M question answering pairs over 12.8M entities show that the proposed framework improves overall F1 score from 67% to 79% compared with previous state-of-the-art work.