While there is increasing concern about the interpretability of neural models, the evaluation of interpretability remains an open problem, due to the lack of proper evaluation datasets and metrics. In this paper, we present a novel benchmark to evaluate the interpretability of both neural models and saliency methods. This benchmark covers three representative NLP tasks: sentiment analysis, textual similarity and reading comprehension, each provided with both English and Chinese annotated data. In order to precisely evaluate the interpretability, we provide token-level rationales that are carefully annotated to be sufficient, compact and comprehensive. We also design a new metric, i.e., the consistency between the rationales before and after perturbations, to uniformly evaluate the interpretability on different types of tasks. Based on this benchmark, we conduct experiments on three typical models with three saliency methods, and unveil their strengths and weakness in terms of interpretability. We will release this benchmark (https://www.luge.ai/#/luge/task/taskDetail?taskId=15) and hope it can facilitate the research in building trustworthy systems.
In this paper, we present DuReader-retrieval, a large-scale Chinese dataset for passage retrieval. DuReader-retrieval contains more than 90K queries and over 8M unique passages from a commercial search engine. To alleviate the shortcomings of other datasets and ensure the quality of our benchmark, we (1) reduce the false negatives in development and test sets by manually annotating results pooled from multiple retrievers, and (2) remove the training queries that are semantically similar to the development and testing queries. Additionally, we provide two out-of-domain testing sets for cross-domain evaluation, as well as a set of human translated queries for for cross-lingual retrieval evaluation. The experiments demonstrate that DuReader-retrieval is challenging and a number of problems remain unsolved, such as the salient phrase mismatch and the syntactic mismatch between queries and paragraphs. These experiments also show that dense retrievers do not generalize well across domains, and cross-lingual retrieval is essentially challenging. DuReader-retrieval is publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval.
In this paper, we focus on the robustness evaluation of Chinese Question Matching (QM) models. Most of the previous work on analyzing robustness issues focus on just one or a few types of artificial adversarial examples. Instead, we argue that a comprehensive evaluation should be conducted on natural texts, which takes into account the fine-grained linguistic capabilities of QM models. For this purpose, we create a Chinese dataset namely DuQM which contains natural questions with linguistic perturbations to evaluate the robustness of QM models. DuQM contains 3 categories and 13 subcategories with 32 linguistic perturbations. The extensive experiments demonstrate that DuQM has a better ability to distinguish different models. Importantly, the detailed breakdown of evaluation by the linguistic phenomena in DuQM helps us easily diagnose the strength and weakness of different models. Additionally, our experiment results show that the effect of artificial adversarial examples does not work on natural texts. Our baseline codes and a leaderboard are now publicly available.
This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. This dataset is constructed based on a collection of licensed videos of talks or lectures, including about 68 hours of Mandarin data, their manual transcripts and translations into English, as well as automated transcripts by an automatic speech recognition (ASR) model. We have further asked three experienced interpreters to simultaneously interpret the testing talks in a mock conference setting. This corpus is expected to promote the research of automatic simultaneous translation as well as the development of practical systems. We have organized simultaneous translation tasks and used this corpus to evaluate automatic simultaneous translation systems.
Emotion-cause pair extraction (ECPE), which aims at simultaneously extracting emotion-cause pairs that express emotions and their corresponding causes in a document, plays a vital role in understanding natural languages. Considering that most emotions usually have few causes mentioned in their contexts, we present a novel end-to-end Pair Graph Convolutional Network (PairGCN) to model pair-level contexts so that to capture the dependency information among local neighborhood candidate pairs. Moreover, in the graphical network, contexts are grouped into three types and each type of contexts is propagated by its own way. Experiments on a benchmark Chinese emotion-cause pair extraction corpus demonstrate the effectiveness of the proposed model.
The goal of fine-grained propaganda detection is to determine whether a given sentence uses propaganda techniques (sentence-level) or to recognize which techniques are used (fragment-level). This paper presents the sys- tem of our participation in the sentence-level subtask of the propaganda detection shared task. In order to better utilize the document information, we construct context-dependent input pairs (sentence-title pair and sentence- context pair) to fine-tune the pretrained BERT, and we also use the undersampling method to tackle the problem of imbalanced data.
We present a neural network-based joint approach for emotion classification and emotion cause detection, which attempts to capture mutual benefits across the two sub-tasks of emotion analysis. Considering that emotion classification and emotion cause detection need different kinds of features (affective and event-based separately), we propose a joint encoder which uses a unified framework to extract features for both sub-tasks and a joint model trainer which simultaneously learns two models for the two sub-tasks separately. Our experiments on Chinese microblogs show that the joint approach is very promising.
Human generates responses relying on semantic and functional dependencies, including coreference relation, among dialogue elements and their context. In this paper, we investigate matching a response with its multi-turn context using dependency information based entirely on attention. Our solution is inspired by the recently proposed Transformer in machine translation (Vaswani et al., 2017) and we extend the attention mechanism in two ways. First, we construct representations of text segments at different granularities solely with stacked self-attention. Second, we try to extract the truly matched segment pairs with attention across the context and response. We jointly introduce those two kinds of attention in one uniform neural network. Experiments on two large-scale multi-turn response selection tasks show that our proposed model significantly outperforms the state-of-the-art models.
Machine learning-based methods have obtained great progress on emotion classification. However, in most previous studies, the models are learned based on a single corpus which often suffers from insufficient labeled data. In this paper, we propose a corpus fusion approach to address emotion classification across two corpora which use different emotion taxonomies. The objective of this approach is to utilize the annotated data from one corpus to help the emotion classification on another corpus. An Integer Linear Programming (ILP) optimization is proposed to refine the classification results. Empirical studies show the effectiveness of the proposed approach to corpus fusion for emotion classification.
Emotion processing has always been a great challenge. Given the fact that an emotion is triggered by cause events and that cause events are an integral part of emotion, this paper constructs a Chinese emotion cause corpus as a first step towards automatic inference of cause-emotion correlation. The corpus focuses on five primary emotions, namely happiness, sadness, fear, anger, and surprise. It is annotated with emotion cause events based on our proposed annotation scheme. Corpus data shows that most emotions are expressed with causes, and that causes mostly occur before the corresponding emotion verbs. We also examine the correlations between emotions and cause events in terms of linguistic cues: causative verbs, perception verbs, epistemic markers, conjunctions, prepositions, and others. Results show that each group of linguistic cues serves as an indicator marking the cause events in different structures of emotional constructions. We believe that the emotion cause corpus will be the useful resource for automatic emotion cause detection as well as emotion detection and classification.