2024
Look before You Leap: Dual Logical Verification for Knowledge-based Visual Question Generation
Xumeng Liu | Wenya Guo | Ying Zhang | Xubo Liu | Yu Zhao | Shenglong Yu | Xiaojie Yuan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Knowledge-based Visual Question Generation (KB-VQG) aims to generate visual questions that draw on outside knowledge beyond the image itself. Existing approaches are answer-aware, incorporating answers into the question-generation process. However, these methods focus only on the semantics of the inputs when proposing questions, ignoring the logical coherence among the generated questions (Q), images (V), answers (A), and the acquired outside knowledge (K). As a result, they produce many unexpected, low-quality questions that lack insight and diversity, some of which have no corresponding answer at all. To address this issue, we inject logical verification into both knowledge acquisition and question generation, yielding LV²-Net. By checking the logical structure among V, A, K, and the ground-truth and generated Q twice over the whole KB-VQG procedure, LV²-Net can propose diverse and insightful knowledge-based visual questions. Experimental results on two commonly used datasets demonstrate the superiority of LV²-Net. Our code will be released to the public soon.
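The abstract does not describe how the logical check is implemented, so the following is only a minimal, hypothetical sketch of the "verify before keeping" idea: a learned scorer rates the coherence of (V, A, K, Q) embeddings, and low-scoring candidates are discarded, once during knowledge acquisition and once after question generation. The MLP verifier, names, and dimensions are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical coherence verifier: all module names and sizes are illustrative.
import torch
import torch.nn as nn


class CoherenceVerifier(nn.Module):
    """Scores the logical coherence of a (V, A, K, Q) tuple in [0, 1]."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, v, a, k, q):
        # v, a, k, q: (batch, dim) embeddings of image, answer, knowledge, question.
        return torch.sigmoid(self.mlp(torch.cat([v, a, k, q], dim=-1))).squeeze(-1)


def filter_by_coherence(verifier, v, a, k, candidates, threshold=0.5):
    """Keep only candidates whose coherence with (v, a, k) passes the threshold."""
    return [q for q in candidates if verifier(v, a, k, q).mean().item() >= threshold]


if __name__ == "__main__":
    dim = 256
    verifier = CoherenceVerifier(dim)
    v, a, k = (torch.randn(1, dim) for _ in range(3))
    candidate_questions = [torch.randn(1, dim) for _ in range(5)]
    kept = filter_by_coherence(verifier, v, a, k, candidate_questions)
    print(f"kept {len(kept)} of {len(candidate_questions)} candidate questions")
```

The same filtering step could in principle be applied to knowledge candidates K before generation and to generated questions Q afterwards, which is the "twice" in the dual-verification framing.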
2023
AoM: Detecting Aspect-oriented Information for Multimodal Aspect-Based Sentiment Analysis
Ru Zhou | Wenya Guo | Xumeng Liu | Shenglong Yu | Ying Zhang | Xiaojie Yuan
Findings of the Association for Computational Linguistics: ACL 2023
Multimodal aspect-based sentiment analysis (MABSA) aims to extract aspects from text-image pairs and recognize their sentiments. Existing methods devote great effort to aligning the whole image with the corresponding aspects. However, different regions of the image may relate to different aspects in the same sentence, and coarse image-aspect alignment introduces noise into aspect-based sentiment analysis (i.e., visual noise). Moreover, the sentiment of a specific aspect can be interfered with by descriptions of other aspects (i.e., textual noise). To handle both kinds of noise, this paper proposes an Aspect-oriented Method (AoM) that detects aspect-relevant semantic and sentiment information. Specifically, an aspect-aware attention module is designed to simultaneously select textual tokens and image blocks that are semantically related to the aspects. To accurately aggregate sentiment information, we explicitly introduce sentiment embeddings into AoM and use a graph convolutional network to model vision-text and text-text interactions. Extensive experiments demonstrate the superiority of AoM over existing methods.
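As a rough illustration of the aspect-aware attention step described above, the sketch below lets an aspect query attend jointly over textual tokens and image-patch features, so that only aspect-relevant units carry weight downstream. The dimensions, projections, and module name are assumptions, not the paper's exact design.

```python
# Illustrative aspect-aware attention: an aspect query re-weights text tokens
# and image blocks by their relevance to that aspect.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AspectAwareAttention(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, aspect, tokens, patches):
        # aspect: (B, D); tokens: (B, T, D) text features; patches: (B, P, D) image features.
        units = torch.cat([tokens, patches], dim=1)                # (B, T+P, D)
        q = self.q_proj(aspect).unsqueeze(1)                       # (B, 1, D)
        k = self.k_proj(units)                                     # (B, T+P, D)
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, 1, T+P)
        # Re-weight every textual token and image block by its relevance to the aspect.
        return attn.transpose(1, 2) * units                        # (B, T+P, D)


if __name__ == "__main__":
    B, T, P, D = 2, 12, 49, 768
    module = AspectAwareAttention(D)
    out = module(torch.randn(B, D), torch.randn(B, T, D), torch.randn(B, P, D))
    print(out.shape)  # torch.Size([2, 61, 768])
```

The re-weighted units could then feed a graph convolutional network over vision-text and text-text edges, as the abstract describes, though that part is not sketched here.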
HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks
Zhengkun Zhang | Wenya Guo | Xiaojun Meng | Yasheng Wang | Yadao Wang | Xin Jiang | Qun Liu | Zhenglu Yang
Findings of the Association for Computational Linguistics: ACL 2023
With the scale and capacity of pretrained models growing rapidly, parameter-efficient language model tuning has emerged as a popular paradigm for solving various NLP and Vision-and-Language (V&L) tasks. In this paper, we design a unified parameter-efficient multitask learning framework that works effectively on both NLP and V&L tasks. In particular, we use a shared hypernetwork that takes trainable hyper-embeddings and the visual modality as input and outputs weights for different modules in a pretrained language model, such as the parameters inserted into multi-head attention blocks (i.e., prefix-tuning) and feed-forward blocks (i.e., adapter-tuning). Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performance and transfer ability compared to state-of-the-art methods. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our framework.
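To make the hypernetwork idea concrete, here is a minimal sketch in which a small generator maps a trainable task-level hyper-embedding to the weights of a bottleneck adapter applied inside a frozen transformer layer. The sizes, names, and single-adapter setup are illustrative assumptions, not the paper's exact configuration (which also conditions on visual input and covers prefix parameters).

```python
# Minimal hypernetwork-for-adapters sketch: only the hyper-embedding and the
# generator are trainable; the backbone activations are treated as frozen.
import torch
import torch.nn as nn


class AdapterHyperNet(nn.Module):
    def __init__(self, embed_dim=64, hidden=768, bottleneck=24):
        super().__init__()
        self.hidden, self.bottleneck = hidden, bottleneck
        # One linear head emits all adapter parameters (down- and up-projection) at once.
        self.generator = nn.Linear(embed_dim, 2 * hidden * bottleneck)

    def forward(self, hyper_embedding):
        flat = self.generator(hyper_embedding)
        down, up = flat.split(self.hidden * self.bottleneck)
        return down.view(self.bottleneck, self.hidden), up.view(self.hidden, self.bottleneck)


def apply_adapter(x, down_w, up_w):
    # x: (batch, seq, hidden) activations of a frozen transformer layer.
    return x + torch.relu(x @ down_w.t()) @ up_w.t()  # residual bottleneck adapter


if __name__ == "__main__":
    hypernet = AdapterHyperNet()
    task_embedding = nn.Parameter(torch.randn(64))   # the only task-specific trainable input
    down_w, up_w = hypernet(task_embedding)
    out = apply_adapter(torch.randn(2, 16, 768), down_w, up_w)
    print(out.shape)  # torch.Size([2, 16, 768])
```

Because new tasks only add a hyper-embedding rather than a full set of adapter weights, the trainable-parameter count grows slowly as tasks are added, which is the core of the parameter-efficiency argument.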
Licon: A Diverse, Controllable and Challenging Linguistic Concept Learning Benchmark
Shenglong Yu | Ying Zhang | Wenya Guo | Zhengkun Zhang | Ru Zhou | Xiaojie Yuan
Findings of the Association for Computational Linguistics: EMNLP 2023
Concept Learning requires learning the definition of a general category from given training examples. Most existing methods focus on learning concepts from images. However, visual information cannot represent abstract concepts precisely, which hinders the introduction of novel concepts related to known concepts (e.g., ‘Plant’→‘Asteroids’). In this paper, inspired by the fact that humans learn most concepts through linguistic description, we introduce the Linguistic Concept Learning benchmark (Licon), where concepts in diverse forms (e.g., plain attributes, images, and text) are defined by linguistic descriptions. The difficulty of learning novel concepts can be controlled by the number of attributes or the hierarchical relationships between concepts. The diverse and controllable concepts are used to support challenging evaluation tasks, including concept classification, attribute prediction, and concept relationship recognition. In addition, we design an entailment-based concept learning method (EnC) to model the relationships among concepts. Extensive experiments demonstrate the effectiveness of EnC. The benchmark will be released to the public soon.
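As a loose analogy to the entailment-based framing (not the paper's EnC model), one can treat a linguistic concept definition as a premise and ask an off-the-shelf NLI model whether an instance description is entailed by it. The model name, definitions, and prompts below are assumptions for illustration only.

```python
# Entailment-style concept classification via zero-shot NLI (illustrative analogy).
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

concept_definitions = {
    "asteroid": "a small rocky body orbiting the sun, mostly between Mars and Jupiter",
    "plant": "a living organism that typically grows in soil and photosynthesizes",
}

instance = "Ceres is a rocky object that circles the sun in the belt between Mars and Jupiter."

# Score each linguistic definition as a candidate label for the instance description.
result = nli(instance, candidate_labels=list(concept_definitions.values()))
best_definition = result["labels"][0]
print([name for name, desc in concept_definitions.items() if desc == best_definition][0])
```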
2022
A Span-based Multimodal Variational Autoencoder for Semi-supervised Multimodal Named Entity Recognition
Baohang Zhou | Ying Zhang | Kehui Song | Wenya Guo | Guoqing Zhao | Hongbin Wang | Xiaojie Yuan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Multimodal named entity recognition (MNER) on social media is a challenging task that aims to extract named entities from free text and incorporate images to classify them into user-defined types. However, annotating named entities on social media demands a large amount of human effort. Existing semi-supervised named entity recognition methods focus on the text modality and are used to reduce labeling costs in traditional NER, but they are not effective for semi-supervised MNER, because the MNER task combines text information with image information and must account for mismatches between the posted text and image. To fuse text and image features for MNER effectively in a semi-supervised setting, we propose a novel span-based multimodal variational autoencoder (SMVAE) model for semi-supervised MNER. The proposed method exploits modality-specific VAEs to model text and image latent features and uses a product-of-experts to acquire multimodal features. In our approach, the implicit relations between labels and multimodal features are modeled by the multimodal VAE, so useful information in unlabeled data can be exploited under the semi-supervised setting. Experimental results on two benchmark datasets demonstrate that our approach not only outperforms baselines in the supervised setting, but also improves MNER performance with less labeled data than existing semi-supervised methods.
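The product-of-experts fusion mentioned above has a standard closed form for diagonal Gaussians: the product of the modality-specific posteriors is again Gaussian, with precision-weighted mean. The sketch below shows that fusion step only, under the assumption of two Gaussian experts (text and image); variable names and dimensions are illustrative, not the paper's implementation.

```python
# Product-of-experts fusion of modality-specific Gaussian posteriors.
import torch


def product_of_experts(mus, logvars):
    """Fuse per-modality Gaussian posteriors N(mu_i, var_i) into one joint Gaussian.

    mus, logvars: lists of (batch, latent_dim) tensors, one entry per modality.
    """
    precisions = [torch.exp(-lv) for lv in logvars]          # 1 / var_i
    total_precision = torch.stack(precisions).sum(dim=0)
    joint_var = 1.0 / total_precision
    joint_mu = joint_var * torch.stack(
        [p * m for p, m in zip(precisions, mus)]
    ).sum(dim=0)
    return joint_mu, torch.log(joint_var)


if __name__ == "__main__":
    text_mu, text_logvar = torch.randn(4, 32), torch.randn(4, 32)
    image_mu, image_logvar = torch.randn(4, 32), torch.randn(4, 32)
    mu, logvar = product_of_experts([text_mu, image_mu], [text_logvar, image_logvar])
    # Reparameterization trick: sample a joint multimodal latent for downstream decoding.
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    print(z.shape)  # torch.Size([4, 32])
```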