Story video-text alignment, a core task in computational story understanding, aims to align video clips with corresponding sentences in their descriptions. However, progress on the task has been held back by the scarcity of manually annotated video-text correspondence and the heavy concentration on English narrations of Hollywood movies. To address these issues, in this paper, we construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SyMoN), containing 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video. Training on the human annotated data from SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively, demonstrating the effectiveness of the annotations. As benchmarks for future research, we create 6 baseline approaches with different multilingual training strategies, compare their performance in both intra-lingual and cross-lingual setups, exemplifying the challenges of multilingual video-text alignment. The dataset is released at:https://github.com/insundaycathy/M-SyMoN
Most existing work on aspect-based sentiment analysis (ABSA) focuses on the sentence level, while research at the document level has not received enough attention. Compared to sentence-level ABSA, the document-level ABSA is not only more practical but also requires holistic document-level understanding capabilities such as coreference resolution. To investigate the impact of coreference information on document-level ABSA, we conduct a three-stage research for the document-level target sentiment analysis (DTSA) task: 1) exploring the effectiveness of coreference information for the DTSA task; 2) reducing the reliance on manually annotated coreference information; 3) alleviating the evaluation bias caused by missing the coreference information of opinion targets. Specifically, we first manually annotate the coreferential opinion targets and propose a multi-task learning framework to jointly model the DTSA task and the coreference resolution task. Then we annotate the coreference information with ChatGPT for joint training. Finally, to address the issue of missing coreference targets, we modify the metrics from strict matching to a loose matching method based on the clusters of targets. The experimental results not only demonstrate the effectiveness of our framework but also reflect the feasibility of using ChatGPT-annotated coreferential entities and the applicability of the modified metrics. Our source code is publicly released at https://github.com/NUSTM/DTSA-Coref.
The ability to understand emotions is an essential component of human-like artificial intelligence, as emotions greatly influence human cognition, decision making, and social interactions. In addition to emotion recognition in conversations, the task of identifying the potential causes behind an individual’s emotional state in conversations, is of great importance in many application scenarios. We organize SemEval-2024 Task 3, named Multimodal Emotion Cause Analysis in Conversations, which aims at extracting all pairs of emotions and their corresponding causes from conversations. Under different modality settings, it consists of two subtasks: Textual Emotion-Cause Pair Extraction in Conversations (TECPE) and Multimodal Emotion-Cause Pair Extraction in Conversations (MECPE). The shared task has attracted 143 registrations and 216 successful submissions.In this paper, we introduce the task, dataset and evaluation settings, summarize the systems of the top teams, and discuss the findings of the participants.
Cross-domain Aspect-Based Sentiment Analysis (ABSA) aims to leverage the useful knowledge from a source domain to identify aspect-sentiment pairs in sentences from a target domain. To tackle the task, several recent works explore a new unsupervised domain adaptation framework, i.e., Cross-Domain Data Augmentation (CDDA), aiming to directly generate much labeled target-domain data based on the labeled source-domain data. However, these CDDA methods still suffer from several issues: 1) preserving many source-specific attributes such as syntactic structures; 2) lack of fluency and coherence; 3) limiting the diversity of generated data. To address these issues, we propose a new cross-domain Data Augmentation approach based on Domain-Adaptive Language Modeling named DA2LM, which contains three stages: 1) assigning pseudo labels to unlabeled target-domain data; 2) unifying the process of token generation and labeling with a Domain-Adaptive Language Model (DALM) to learn the shared context and annotation across domains; 3) using the trained DALM to generate labeled target-domain data. Experiments show that DA2LM consistently outperforms previous feature adaptation and CDDA methods on both ABSA and Aspect Extraction tasks. The source code is publicly released at https://github.com/NUSTM/DALM.
In recent years, Multimodal Named Entity Recognition (MNER) on social media has attracted considerable attention. However, existing MNER studies only extract entity-type pairs in text, which is useless for multimodal knowledge graph construction and insufficient for entity disambiguation. To solve these issues, in this work, we introduce a Grounded Multimodal Named Entity Recognition (GMNER) task. Given a text-image social post, GMNER aims to identify the named entities in text, their entity types, and their bounding box groundings in image (i.e. visual regions). To tackle the GMNER task, we construct a Twitter dataset based on two existing MNER datasets. Moreover, we extend four well-known MNER methods to establish a number of baseline systems and further propose a Hierarchical Index generation framework named H-Index, which generates the entity-type-region triples in a hierarchical manner with a sequence-to-sequence model. Experiment results on our annotated dataset demonstrate the superiority of our H-Index framework over baseline systems on the GMNER task.
Multimodal Emotion Recognition in Multiparty Conversations (MERMC) has recently attracted considerable attention. Due to the complexity of visual scenes in multi-party conversations, most previous MERMC studies mainly focus on text and audio modalities while ignoring visual information. Recently, several works proposed to extract face sequences as visual features and have shown the importance of visual information in MERMC. However, given an utterance, the face sequence extracted by previous methods may contain multiple people’s faces, which will inevitably introduce noise to the emotion prediction of the real speaker. To tackle this issue, we propose a two-stage framework named Facial expressionaware Multimodal Multi-Task learning (FacialMMT). Specifically, a pipeline method is first designed to extract the face sequence of the real speaker of each utterance, which consists of multimodal face recognition, unsupervised face clustering, and face matching. With the extracted face sequences, we propose a multimodal facial expression-aware emotion recognition model, which leverages the frame-level facial emotion distributions to help improve utterance-level emotion recognition based on multi-task learning. Experiments demonstrate the effectiveness of the proposed FacialMMT framework on the benchmark MELD dataset. The source code is publicly released at https://github.com/NUSTM/FacialMMT.
In social media, there is a vast amount of information pertaining to people’s emotions and the corresponding causes. The emotion cause extraction (ECE) from social media data is an important research area that has not been thoroughly explored due to the lack of fine-grained annotations. Early studies referred to either unsupervised rule-based methods or supervised machine learning methods using a number of manually annotated data in specific domains. However, the former suffers from limitations in extraction performance, while the latter is constrained by the availability of fine-grained annotations and struggles to generalize to diverse domains. To address these issues, this paper proposes a new ECE framework on Chinese social media that achieves high extraction performance and generalizability without relying on human annotation. Specifically, we design a more dedicated rule-based system based on constituency parsing tree to discover causal patterns in social media. This system enables us to acquire large amounts of fine-grained annotated data. Next, we train a neural model on the rule-annotated dataset with a specific training strategy to further improve the model’s generalizability. Extensive experiments demonstrate the superiority of our approach over other methods in unsupervised and weakly-supervised settings.
Comparative Opinion Quintuple Extraction (COQE) aims to identify comparative opinion sentences in product reviews, extract comparative opinion elements in the sentences, and then incorporate them into quintuples. Existing methods decompose the COQE task into multiple primary subtasks and then solve them in a pipeline manner. However, these approaches ignore the intrinsic connection between subtasks and the error propagation among stages. This paper proposes a unified generative model, UniCOQE, to solve the COQE task in one shot. We design a generative template where all the comparative tuples are concatenated as the target output sequence. However, the multiple tuples are inherently not an ordered sequence but an unordered set. The pre-defined order will force the generative model to learn a false order bias and hinge the model’s training. To alleviate this bias, we introduce a new “predict-and-assign” training paradigm that models the golden tuples as a set. Specifically, we utilize a set-matching strategy to find the optimal order of tuples. The experimental results on multiple benchmarks show that our unified generative model significantly outperforms the SOTA method, and ablation experiments prove the effectiveness of the set-matching strategy.
Emotion Cause Triplet Extraction in Conversations (ECTEC) aims to simultaneously extract emotion utterances, emotion categories, and cause utterances from conversations. However, existing studies mainly decompose the ECTEC task into multiple subtasks and solve them in a pipeline manner. Moreover, since conversations tend to contain many informal and implicit expressions, it often requires external knowledge and reasoning-based inference to accurately identify emotional and causal clues implicitly mentioned in the context, which are ignored by previous work. To address these limitations, in this paper, we propose a commonSense knowledge-enHanced generAtive fRameworK named SHARK, which formulates the ECTEC task as an index generation problem and generates the emotion-cause-category triplets in an end-to-end manner with a sequence-to-sequence model. Furthermore, we propose to incorporate both retrieved and generated commonsense knowledge into the generative model via a dual-view gate mechanism and a graph attention layer. Experimental results show that our SHARK model consistently outperforms several competitive systems on two benchmark datasets. Our source codes are publicly released at https://github.com/NUSTM/SHARK.
Most previous studies on aspect-based sentiment analysis (ABSA) were carried out at the sentence level, while the research of document-level ABSA has not received enough attention. In this work, we focus on the document-level targeted sentiment analysis task, which aims to extract the opinion targets consisting of multi-level entities from a review document and predict their sentiments. We propose a Sequence-to-Structure (Seq2Struct) approach to address the task, which is able to explicitly model the hierarchical structure among multiple opinion targets in a document, and capture the long-distance dependencies among affiliated entities across sentences. In addition to the existing Seq2Seq approach, we further construct four strong baselines with different pretrained models. Experimental results on six domains show that our Seq2Struct approach outperforms all the baselines significantly. Aside from the performance advantage in outputting the multi-level target-sentiment pairs, our approach has another significant advantage - it can explicitly display the hierarchical structure of the opinion targets within a document. Our source code is publicly released at https://github.com/NUSTM/Doc-TSA-Seq2Struct.
As a fundamental task in opinion mining, aspect and opinion co-extraction aims to identify the aspect terms and opinion terms in reviews. However, due to the lack of fine-grained annotated resources, it is hard to train a robust model for many domains. To alleviate this issue, unsupervised domain adaptation is proposed to transfer knowledge from a labeled source domain to an unlabeled target domain. In this paper, we propose a new Generative Cross-Domain Data Augmentation framework for unsupervised domain adaptation. The proposed framework is aimed to generate target-domain data with fine-grained annotation by exploiting the labeled data in the source domain. Specifically, we remove the domain-specific segments in a source-domain labeled sentence, and then use this as input to a pre-trained sequence-to-sequence model BART to simultaneously generate a target-domain sentence and predict the corresponding label for each word. Experimental results on three datasets demonstrate that our approach is more effective than previous domain adaptation methods.
As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention inrecent years. However, previous approaches either (i) use separately pre-trained visual and textual models, which ignore the crossmodalalignment or (ii) use vision-language models pre-trained with general pre-training tasks, which are inadequate to identify fine-grainedaspects, opinions, and their alignments across modalities. To tackle these limitations, we propose a task-specific Vision-LanguagePre-training framework for MABSA (VLP-MABSA), which is a unified multimodal encoder-decoder architecture for all the pretrainingand downstream tasks. We further design three types of task-specific pre-training tasks from the language, vision, and multimodalmodalities, respectively. Experimental results show that our approach generally outperforms the state-of-the-art approaches on three MABSA subtasks. Further analysis demonstrates the effectiveness of each pre-training task. The source code is publicly released at https://github.com/NUSTM/VLP-MABSA.
Product reviews contain a large number of implicit aspects and implicit opinions. However, most of the existing studies in aspect-based sentiment analysis ignored this problem. In this work, we introduce a new task, named Aspect-Category-Opinion-Sentiment (ACOS) Quadruple Extraction, with the goal to extract all aspect-category-opinion-sentiment quadruples in a review sentence and provide full support for aspect-based sentiment analysis with implicit aspects and opinions. We furthermore construct two new datasets, Restaurant-ACOS and Laptop-ACOS, for this new task, both of which contain the annotations of not only aspect-category-opinion-sentiment quadruples but also implicit aspects and opinions. The former is an extension of the SemEval Restaurant dataset; the latter is a newly collected and annotated Laptop dataset, twice the size of the SemEval Laptop dataset. We finally benchmark the task with four baseline systems. Experiments demonstrate the feasibility of the new task and its effectiveness in extracting and describing implicit aspects and implicit opinions. The two datasets and source code of four systems are publicly released at https://github.com/NUSTM/ACOS.
Data augmentation and adversarial perturbation approaches have recently achieved promising results in solving the over-fitting problem in many natural language processing (NLP) tasks including sentiment classification. However, existing studies aimed to improve the generalization ability by augmenting the training data with synonymous examples or adding random noises to word embeddings, which cannot address the spurious association problem. In this work, we propose an end-to-end reinforcement learning framework, which jointly performs counterfactual data generation and dual sentiment classification. Our approach has three characteristics:1) the generator automatically generates massive and diverse antonymous sentences; 2) the discriminator contains a original-side sentiment predictor and an antonymous-side sentiment predictor, which jointly evaluate the quality of the generated sample and help the generator iteratively generate higher-quality antonymous samples; 3) the discriminator is directly used as the final sentiment classifier without the need to build an extra one. Extensive experiments show that our approach outperforms strong data augmentation baselines on several benchmark sentiment classification datasets. Further analysis confirms our approach’s advantages in generating more diverse training samples and solving the spurious association problem in sentiment classification.
As an important task in opinion mining, comparative opinion mining aims to identify comparative sentences from product reviews, extract the comparative elements, and obtain the corresponding comparative opinion tuples. However, most previous studies simply regarded comparative tuple extraction as comparative element extraction, but ignored the fact that many comparative sentences may contain multiple comparisons. The comparative opinion tuples defined in these studies also failed to explicitly provide comparative preferences. To address these limitations, in this work we first introduce a new Comparative Opinion Quintuple Extraction (COQE) task, to identify comparative sentences from product reviews and extract all comparative opinion quintuples (Subject, Object, Comparative Aspect, Comparative Opinion, Comparative Preference). Secondly, based on the existing comparative opinion mining corpora, we make supplementary annotations and construct three datasets for the COQE task. Finally, we benchmark the COQE task by proposing a new BERT-based multi-stage approach as well as three baseline systems extended from previous methods. %The new approach significantly outperforms three baseline systems on three datasets and represents a strong benchmark for COQE. Experimental results show that the new approach significantly outperforms three baseline systems on three datasets for the COQE task.
Most of the aspect based sentiment analysis research aims at identifying the sentiment polarities toward some explicit aspect terms while ignores implicit aspects in text. To capture both explicit and implicit aspects, we focus on aspect-category based sentiment analysis, which involves joint aspect category detection and category-oriented sentiment classification. However, currently only a few simple studies have focused on this problem. The shortcomings in the way they defined the task make their approaches difficult to effectively learn the inner-relations between categories and the inter-relations between categories and sentiments. In this work, we re-formalize the task as a category-sentiment hierarchy prediction problem, which contains a hierarchy output structure to first identify multiple aspect categories in a piece of text, and then predict the sentiment for each of the identified categories. Specifically, we propose a Hierarchical Graph Convolutional Network (Hier-GCN), where a lower-level GCN is to model the inner-relations among multiple categories, and the higher-level GCN is to capture the inter-relations between aspect categories and sentiments. Extensive evaluations demonstrate that our hierarchy output structure is superior over existing ones, and the Hier-GCN model can consistently achieve the best results on four benchmarks.
In recent years, a new interesting task, called emotion-cause pair extraction (ECPE), has emerged in the area of text emotion analysis. It aims at extracting the potential pairs of emotions and their corresponding causes in a document. To solve this task, the existing research employed a two-step framework, which first extracts individual emotion set and cause set, and then pair the corresponding emotions and causes. However, such a pipeline of two steps contains some inherent flaws: 1) the modeling does not aim at extracting the final emotion-cause pair directly; 2) the errors from the first step will affect the performance of the second step. To address these shortcomings, in this paper we propose a new end-to-end approach, called ECPE-Two-Dimensional (ECPE-2D), to represent the emotion-cause pairs by a 2D representation scheme. A 2D transformer module and two variants, window-constrained and cross-road 2D transformers, are further proposed to model the interactions of different emotion-cause pairs. The 2D representation, interaction, and prediction are integrated into a joint framework. In addition to the advantages of joint modeling, the experimental results on the benchmark emotion cause corpus show that our approach improves the F1 score of the state-of-the-art from 61.28% to 68.89%.
In this paper, we study Multimodal Named Entity Recognition (MNER) for social media posts. Existing approaches for MNER mainly suffer from two drawbacks: (1) despite generating word-aware visual representations, their word representations are insensitive to the visual context; (2) most of them ignore the bias brought by the visual context. To tackle the first issue, we propose a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations. To alleviate the visual bias, we further propose to leverage purely text-based entity span detection as an auxiliary module, and design a Unified Multimodal Transformer to guide the final predictions with the entity span predictions. Experiments show that our unified approach achieves the new state-of-the-art performance on two benchmark datasets.
The prevalent use of social media enables rapid spread of rumors on a massive scale, which leads to the emerging need of automatic rumor verification (RV). A number of previous studies focus on leveraging stance classification to enhance RV with multi-task learning (MTL) methods. However, most of these methods failed to employ pre-trained contextualized embeddings such as BERT, and did not exploit inter-task dependencies by using predicted stance labels to improve the RV task. Therefore, in this paper, to extend BERT to obtain thread representations, we first propose a Hierarchical Transformer, which divides each long thread into shorter subthreads, and employs BERT to separately represent each subthread, followed by a global Transformer layer to encode all the subthreads. We further propose a Coupled Transformer Module to capture the inter-task interactions and a Post-Level Attention layer to use the predicted stance labels for RV, respectively. Experiments on two benchmark datasets show the superiority of our Coupled Hierarchical Transformer model over existing MTL approaches.
Emotion-cause pair extraction (ECPE) is a new task that aims to extract the potential pairs of emotions and their corresponding causes in a document. The existing methods first perform emotion extraction and cause extraction independently, and then perform emotion-cause pairing and filtering. However, the above methods ignore the fact that the cause and the emotion it triggers are inseparable, and the extraction of the cause without specifying the emotion is pathological, which greatly limits the performance of the above methods in the first step. To tackle these shortcomings, we propose two joint frameworks for ECPE: 1) multi-label learning for the extraction of the cause clauses corresponding to the specified emotion clause (CMLL) and 2) multi-label learning for the extraction of the emotion clauses corresponding to the specified cause clause (EMLL). The window of multi-label learning is centered on the specified emotion clause or cause clause and slides as their positions move. Finally, CMLL and EMLL are integrated to obtain the final result. We evaluate our model on a benchmark emotion cause corpus, the results show that our approach achieves the best performance among all compared systems on the ECPE task.
The supervised models for aspect-based sentiment analysis (ABSA) rely heavily on labeled data. However, fine-grained labeled data are scarce for the ABSA task. To alleviate the dependence on labeled data, prior works mainly focused on feature-based adaptation, which used the domain-shared knowledge to construct auxiliary tasks or domain adversarial learning to bridge the gap between domains, while ignored the attribute of instance-based adaptation. To resolve this limitation, we propose an end-to-end framework to jointly perform feature and instance based adaptation for the ABSA task in this paper. Based on BERT, we learn domain-invariant feature representations by using part-of-speech features and syntactic dependency relations to construct auxiliary tasks, and jointly perform word-level instance weighting in the framework of sequence labeling. Experiment results on four benchmarks show that the proposed method can achieve significant improvements in comparison with the state-of-the-arts in both tasks of cross-domain End2End ABSA and cross-domain aspect extraction.
In this paper, we study automatic rumor detection for in social media at the event level where an event consists of a sequence of posts organized according to the posting time. It is common that the state of an event is dynamically evolving. However, most of the existing methods to this task ignored this problem, and established a global representation based on all the posts in the event’s life cycle. Such coarse-grained methods failed to capture the event’s unique features in different states. To address this limitation, we propose a state-independent and time-evolving Network (STN) for rumor detection based on fine-grained event state detection and segmentation. Given an event composed of a sequence of posts, STN first predicts the corresponding sequence of states and segments the event into several state-independent sub-events. For each sub-event, STN independently trains an encoder to learn the feature representation for that sub-event and incrementally fuses the representation of the current sub-event with previous ones for rumor prediction. This framework can more accurately learn the representation of an event in the initial stage and enable early rumor detection. Experiments on two benchmark datasets show that STN can significantly improve the rumor detection accuracy in comparison with some strong baseline systems. We also design a new evaluation metric to measure the performance of early rumor detection, under which STN shows a higher advantage in comparison.
In this paper, we target at improving the performance of multi-label emotion classification with the help of sentiment classification. Specifically, we propose a new transfer learning architecture to divide the sentence representation into two different feature spaces, which are expected to respectively capture the general sentiment words and the other important emotion-specific words via a dual attention mechanism. Experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method.
In this paper, we study domain adaptation with a state-of-the-art hierarchical neural network for document-level sentiment classification. We first design a new auxiliary task based on sentiment scores of domain-independent words. We then propose two neural network architectures to respectively induce document embeddings and sentence embeddings that work well for different domains. When these document and sentence embeddings are used for sentiment classification, we find that with both pseudo and external sentiment lexicons, our proposed methods can perform similarly to or better than several highly competitive domain adaptation methods on a benchmark dataset of product reviews.
Relation classification is the task of classifying the semantic relations between entity pairs in text. Observing that existing work has not fully explored using different representations for relation instances, especially in order to better handle the asymmetry of relation types, in this paper, we propose a neural network based method for relation classification that combines the raw sequence and the shortest dependency path representations of relation instances and uses mirror instances to perform pairwise relation classification. We evaluate our proposed models on the SemEval-2010 Task 8 dataset. The empirical results show that with two additional features, our model achieves the state-of-the-art result of F1 score of 85.7.