Yue Li


2023

pdf bib
Classifying COVID-19 Vaccine Narratives
Yue Li | Carolina Scarton | Xingyi Song | Kalina Bontcheva
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Vaccine hesitancy is widespread, despite the government’s information campaigns and the efforts of the World Health Organisation (WHO). Categorising the topics within vaccine-related narratives is crucial to understand the concerns expressed in discussions and identify the specific issues that contribute to vaccine hesitancy. This paper addresses the need for monitoring and analysing vaccine narratives online by introducing a novel vaccine narrative classification task, which categorises COVID-19 vaccine claims into one of seven categories. Following a data augmentation approach, we first construct a novel dataset for this new classification task, focusing on the minority classes. We also make use of fact-checker annotated data. The paper also presents a neural vaccine narrative classifier that achieves an accuracy of 84% under cross-validation. The classifier is publicly available for researchers and journalists.

pdf bib
DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading
Hao Wang | Qingxuan Wang | Yue Li | Changqing Wang | Chenhui Chu | Rui Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

The use of visually-rich documents in various fields has created a demand for Document AI models that can read and comprehend documents like humans, which requires the overcoming of technical, linguistic, and cognitive barriers. Unfortunately, the lack of appropriate datasets has significantly hindered advancements in the field. To address this issue, we introduce DocTrack, a visually-rich document dataset really aligned with human eye-movement information using eye-tracking technology. This dataset can be used to investigate the challenges mentioned above. Additionally, we explore the impact of human reading order on document understanding tasks and examine what would happen if a machine reads in the same order as a human. Our results suggest that although Document AI models have made significant progresses, they still have a long way to go before they can read visually richer documents as accurately, continuously, and flexibly as humans do. These findings have potential implications for future research and development of document intelligence.

pdf bib
Don’t waste a single annotation: improving single-label classifiers through soft labels
Ben Wu | Yue Li | Yida Mu | Carolina Scarton | Kalina Bontcheva | Xingyi Song
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we address the limitations of the common data annotation and training methods for objective single-label classification tasks. Typically, when annotating such tasks annotators are only asked to provide a single label for each sample and annotator disagreement is discarded when a final hard label is decided through majority voting. We challenge this traditional approach, acknowledging that determining the appropriate label can be difficult due to the ambiguity and lack of context in the data samples. Rather than discarding the information from such ambiguous annotations, our soft label method makes use of them for training. Our findings indicate that additional annotator information, such as confidence, secondary label and disagreement, can be used to effectively generate soft labels. Training classifiers with these soft labels then leads to improved performance and calibration on the hard label test set.

2022

pdf bib
GateNLP-UShef at SemEval-2022 Task 8: Entity-Enriched Siamese Transformer for Multilingual News Article Similarity
Iknoor Singh | Yue Li | Melissa Thong | Carolina Scarton
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes the second-placed system on the leaderboard of SemEval-2022 Task 8: Multilingual News Article Similarity. We propose an entity-enriched Siamese Transformer which computes news article similarity based on different sub-dimensions, such as the shared narrative, entities, location and time of the event discussed in the news article. Our system exploits a Siamese network architecture using a Transformer encoder to learn document-level representations for the purpose of capturing the narrative together with the auxiliary entity-based features extracted from the news articles. The intuition behind using all these features together is to capture the similarity between news articles at different granularity levels and to assess the extent to which different news outlets write about “the same events”. Our experimental results and detailed ablation study demonstrate the effectiveness and the validity of our proposed method.

2021

pdf bib
Semi-supervised Meta-learning for Cross-domain Few-shot Intent Classification
Yue Li | Jiong Zhang
Proceedings of the 1st Workshop on Meta Learning and Its Applications to Natural Language Processing

Meta learning aims to optimize the model’s capability to generalize to new tasks and domains. Lacking a data-efficient way to create meta training tasks has prevented the application of meta-learning to the real-world few shot learning scenarios. Recent studies have proposed unsupervised approaches to create meta-training tasks from unlabeled data for free, e.g., the SMLMT method (Bansal et al., 2020a) constructs unsupervised multi-class classification tasks from the unlabeled text by randomly masking words in the sentence and let the meta learner choose which word to fill in the blank. This study proposes a semi-supervised meta-learning approach that incorporates both the representation power of large pre-trained language models and the generalization capability of prototypical networks enhanced by SMLMT. The semi-supervised meta training approach avoids overfitting prototypical networks on a small number of labeled training examples and quickly learns cross-domain task-specific representation only from a few supporting examples. By incorporating SMLMT with prototypical networks, the meta learner generalizes better to unseen domains and gains higher accuracy on out-of-scope examples without the heavy lifting of pre-training. We observe significant improvement in few-shot generalization after training only a few epochs on the intent classification tasks evaluated in a multi-domain setting.

2020

pdf bib
Revisiting Rumour Stance Classification: Dealing with Imbalanced Data
Yue Li | Carolina Scarton
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)

Correctly classifying stances of replies can be significantly helpful for the automatic detection and classification of online rumours. One major challenge is that there are considerably more non-relevant replies (comments) than informative ones (supports and denies), making the task highly imbalanced. In this paper we revisit the task of rumour stance classification, aiming to improve the performance over the informative minority classes. We experiment with traditional methods for imbalanced data treatment with feature- and BERT-based classifiers. Our models outperform all systems in RumourEval 2017 shared task and rank second in RumourEval 2019.

2010

pdf bib
Information Extraction of Multiple Categories from Pathology Reports
Yue Li | David Martinez
Proceedings of the Australasian Language Technology Association Workshop 2010