Qiankun Zheng


2023

Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

LCT-1 at SemEval-2023 Task 10: Pre-training and Multi-task Learning for Sexism Detection and Classification
Konstantin Chernyshev | Ekaterina Garanina | Duygu Bayram | Qiankun Zheng | Lukas Edman
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Misogyny and sexism are growing problems in social media. Advances have been made in online sexism detection, but the resulting systems are often uninterpretable. SemEval-2023 Task 10 on Explainable Detection of Online Sexism aims at increasing the explainability of sexism detection, and our team participated in all the proposed subtasks. Our system is based on further domain-adaptive pre-training. Building on Transformer-based models with domain adaptation, we compare fine-tuning with multi-task learning and show that each subtask requires a different system configuration. In our experiments, multi-task learning performs on par with standard fine-tuning for sexism detection and noticeably better for coarse-grained sexism classification, while fine-tuning is preferable for fine-grained classification.
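
The abstract does not include code, so the following is only a minimal sketch of what a multi-task configuration of this kind could look like: a shared Transformer encoder with one classification head per subtask. The encoder checkpoint (roberta-base), CLS pooling, and the label counts per head are illustrative assumptions, not the authors' exact system.

    # Minimal multi-task sketch: shared encoder, one head per subtask.
    # Assumptions (not the authors' exact system): roberta-base encoder,
    # CLS pooling, and 2/4/11 labels for subtasks A/B/C.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class MultiTaskSexismClassifier(nn.Module):
        def __init__(self, encoder_name="roberta-base"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            hidden = self.encoder.config.hidden_size
            # One linear head per subtask; label counts are assumptions.
            self.heads = nn.ModuleDict({
                "a_binary": nn.Linear(hidden, 2),
                "b_category": nn.Linear(hidden, 4),
                "c_vector": nn.Linear(hidden, 11),
            })

        def forward(self, input_ids, attention_mask, task):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]  # CLS-token pooling
            return self.heads[task](cls)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = MultiTaskSexismClassifier()
    loss_fn = nn.CrossEntropyLoss()

    # Toy forward/backward pass for one subtask with a dummy label.
    batch = tokenizer(["example post"], return_tensors="pt", padding=True)
    logits = model(batch["input_ids"], batch["attention_mask"], task="a_binary")
    loss = loss_fn(logits, torch.tensor([0]))
    loss.backward()

In a multi-task training loop, batches from the different subtasks would share the encoder while each subtask updates only its own head, which is one common way such systems are set up.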

Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Findings of the Association for Computational Linguistics: ACL 2023

Local coherence is essential for long-form text generation models. We identify two important aspects of local coherence within the visual storytelling task: (1) the model needs to represent re-occurrences of characters within the image sequence in order to mention them correctly in the story; (2) character representations should enable us to find instances of the same characters and distinguish different characters. In this paper, we propose a loss function inspired by a linguistic theory of coherence for self-supervised learning of image sequence representations. We further propose combining features from an object detector and a face detector to construct stronger character features. To evaluate input-output relevance, which current reference-based metrics do not measure, we propose a character matching metric that checks whether the models generate referring expressions correctly for characters in input image sequences. Experiments on a visual story generation dataset show that our proposed features and loss function are effective for generating more coherent and visually grounded stories.
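
The abstract only describes the visual coherence loss at a high level. The sketch below shows one plausible contrastive formulation over character features: embeddings of the same character across images are pulled together and those of different characters pushed apart. The feature source, (pseudo-)identity labels, temperature, and normalization are illustrative assumptions, not the paper's exact loss.

    # Illustrative contrastive-style coherence loss over character features.
    # Assumption: `feats` are character embeddings pooled from object/face
    # detectors across an image sequence, and `ids` are (pseudo-)identity
    # labels; the paper's actual visual coherence loss may differ.
    import torch
    import torch.nn.functional as F

    def character_coherence_loss(feats: torch.Tensor, ids: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
        """Pull together embeddings of the same character across images,
        push apart embeddings of different characters (SupCon-style)."""
        z = F.normalize(feats, dim=-1)                 # (N, D) unit vectors
        sim = z @ z.t() / temperature                  # pairwise similarities
        n = z.size(0)
        eye = torch.eye(n, dtype=torch.bool, device=z.device)
        pos = (ids.unsqueeze(0) == ids.unsqueeze(1)) & ~eye  # same-character pairs
        logits = sim.masked_fill(eye, float("-inf"))   # exclude self-pairs
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        pos_counts = pos.sum(dim=1).clamp(min=1)
        # Average log-probability over positive pairs for each anchor.
        loss = -(log_prob.masked_fill(~pos, 0.0)).sum(dim=1) / pos_counts
        return loss[pos.any(dim=1)].mean()             # anchors with positives

    # Toy usage: 6 detected character instances with 3 identities.
    feats = torch.randn(6, 256, requires_grad=True)
    ids = torch.tensor([0, 0, 1, 1, 2, 2])
    character_coherence_loss(feats, ids).backward()

A loss of this shape would reward image sequence representations in which re-occurring characters are close in embedding space, which is the property the two coherence aspects above call for.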