Zubair Afzal


2024

pdf bib
Scalable Patent Classification with Aggregated Multi-View Ranking
Dan Li | Vikrant Yadav | Zi Long Zhu | Maziar Moradi Fard | Zubair Afzal | George Tsatsaronis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Automated patent classification typically involves assigning labels to a patent from a taxonomy, using multi-class multi-label classification models. However, classification-based models face challenges in scaling to large numbers of labels, struggle with generalizing to new labels, and fail to effectively utilize the rich information and multiple views of patents and labels. In this work, we propose a multi-view ranking-based method to address these limitations. Our method consists of four ranking-based models that incorporate different views of patents and a meta-model that aggregates and re-ranks the candidate labels given by the four ranking models. We compared our approach against the state-of-the-art baselines on two publicly available patent classification datasets, USPTO-2M and CLEF-IP-2011. We demonstrate that our approach can alleviate the aforementioned limitations and achieve a new state-of-the-art performance by a significant margin.

2023

pdf bib
Enhancing Extreme Multi-Label Text Classification: Addressing Challenges in Model, Data, and Evaluation
Dan Li | Zi Long Zhu | Janneke van de Loo | Agnes Masip Gomez | Vikrant Yadav | Georgios Tsatsaronis | Zubair Afzal
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Extreme multi-label text classification is a prevalent task in industry, but it frequently encounters challenges in terms of machine learning perspectives, including model limitations, data scarcity, and time-consuming evaluation. This paper aims to mitigate these issues by introducing novel approaches. Firstly, we propose a label ranking model as an alternative to the conventional SciBERT-based classification model, enabling efficient handling of large-scale labels and accommodating new labels. Secondly, we present an active learning-based pipeline that addresses the data scarcity of new labels during the update of a classification system. Finally, we introduce ChatGPT to assist with model evaluation. Our experiments demonstrate the effectiveness of these techniques in enhancing the extreme multi-label text classification task.

2022

pdf bib
Unsupervised Dense Retrieval for Scientific Articles
Dan Li | Vikrant Yadav | Zubair Afzal | George Tsatsaronis
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

In this work, we build a dense retrieval based semantic search engine on scientific articles from Elsevier. The major challenge is that there is no labeled data for training and testing. We apply a state-of-the-art unsupervised dense retrieval model called Generative Pseudo Labeling that generates high-quality pseudo training labels. Furthermore, since the articles are unbalanced across different domains, we select passages from multiple domains to form balanced training data. For the evaluation, we create two test sets: one manually annotated and one automatically created from the meta-information of our data. We compare the semantic search engine with the currently deployed lexical search engine on the two test sets. The results of the experiment show that the semantic search engine trained with pseudo training labels can significantly improve search performance.

pdf bib
Few-shot initializing of Active Learner via Meta-Learning
Zi Long Zhu | Vikrant Yadav | Zubair Afzal | George Tsatsaronis
Findings of the Association for Computational Linguistics: EMNLP 2022

Despite the important evolutions in few-shot and zero-shot learning techniques, domain specific applications still require expert knowledge and significant effort in annotating and labeling a large volume of unstructured textual data. To mitigate this problem, active learning, and meta-learning attempt to reach a high performance with the least amount of labeled data. In this paper, we introduce a novel approach to combine both lines of work by initializing an active learner with meta-learned parameters obtained through meta-training on tasks similar to the target task during active learning. In this approach we use the pre-trained BERT as our text-encoder and meta-learn its parameters with LEOPARD, which extends the model-agnostic meta-learning method by generating task dependent softmax weights to enable learning across tasks with different number of classes. We demonstrate the effectiveness of our method by performing active learning on five natural language understanding tasks and six datasets with five different acquisition functions. We train two different meta-initializations, and we use the pre-trained BERT base initialization as baseline. We observe that our approach performs better than the baseline at low budget, especially when closely related tasks were present during meta-learning. Moreover, our results show that better performance in the initial phase, i.e., with fewer labeled samples, leads to better performance when larger acquisition batches are used. We also perform an ablation study of the proposed method, showing that active learning with only the meta-learned weights is beneficial and adding the meta-learned learning rates and generating the softmax have negative consequences for the performance.

2020

pdf bib
CORA: A Deep Active Learning Covid-19 Relevancy Algorithm to Identify Core Scientific Articles
Zubair Afzal | Vikrant Yadav | Olga Fedorova | Vaishnavi Kandala | Janneke van de Loo | Saber A. Akhondi | Pascal Coupet | George Tsatsaronis
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

Ever since the COVID-19 pandemic broke out, the academic and scientific research community, as well as industry and governments around the world have joined forces in an unprecedented manner to fight the threat. Clinicians, biologists, chemists, bioinformaticians, nurses, data scientists, and all of the affiliated relevant disciplines have been mobilized to help discover efficient treatments for the infected population, as well as a vaccine solution to prevent further the virus spread. In this combat against the virus responsible for the pandemic, key for any advancements is the timely, accurate, peer-reviewed, and efficient communication of any novel research findings. In this paper we present a novel framework to address the information need of filtering efficiently the scientific bibliography for relevant literature around COVID-19. The contributions of the paper are summarized in the following: we define and describe the information need that encompasses the major requirements for COVID-19 articles relevancy, we present and release an expert-curated benchmark set for the task, and we analyze the performance of several state-of-the-art machine learning classifiers that may distinguish the relevant from the non-relevant COVID-19 literature.

2017

pdf bib
Tagging Funding Agencies and Grants in Scientific Articles using Sequential Learning Models
Subhradeep Kayal | Zubair Afzal | George Tsatsaronis | Sophia Katrenko | Pascal Coupet | Marius Doornenbal | Michelle Gregory
BioNLP 2017

In this paper we present a solution for tagging funding bodies and grants in scientific articles using a combination of trained sequential learning models, namely conditional random fields (CRF), hidden markov models (HMM) and maximum entropy models (MaxEnt), on a benchmark set created in-house. We apply the trained models to address the BioASQ challenge 5c, which is a newly introduced task that aims to solve the problem of funding information extraction from scientific articles. Results in the dry-run data set of BioASQ task 5c show that the suggested approach can achieve a micro-recall of more than 85% in tagging both funding bodies and grants.