2023
MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines
Vincent Nguyen | Sarvnaz Karimi | Maciej Rybinski | Zhenchang Xing
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
2021
Combining Shallow and Deep Representations for Text-Pair Classification
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
Text-pair classification is the task of determining the class relationship between two sentences. It is embedded in several tasks such as paraphrase identification and duplicate question detection. Contemporary methods fine-tune a transformer encoder and predict the class from the semantic representation of the classification token in the transformer’s final layer. However, research has shown that earlier parts of the network learn shallow features, such as syntax and structure, which existing methods do not directly exploit. We propose a novel convolution-based decoder for transformer-based architectures that maximizes the use of encoder hidden features for text-pair classification. Our model exploits the hidden representations across the layers of the transformer. It outperforms a transformer encoder baseline on average by 50% (relative F1-score) on six datasets from the medical, software engineering, and open domains. Our work shows that transformer-based models can improve text-pair classification by modifying the fine-tuning step to exploit shallow features while improving model generalization, with only a slight reduction in efficiency.
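The layer-mixing idea can be sketched as follows, assuming a BERT-base-like encoder exposing 13 hidden states (embeddings plus 12 layers); the shapes, kernel size, and max-pool/linear head here are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: the [CLS] vector from each of 13 hidden states
# (embedding layer + 12 transformer layers) of a BERT-base-like encoder.
num_layers, hidden = 13, 768
cls_per_layer = rng.standard_normal((num_layers, hidden))

# 1D convolution across the layer axis with kernel size 3: each output
# position mixes adjacent layers, so shallow (early) and deep (late)
# features both contribute to the final representation.
kernel = rng.standard_normal((3, hidden)) * 0.01
conv_out = np.stack(
    [(cls_per_layer[i:i + 3] * kernel).sum(axis=0)
     for i in range(num_layers - 2)]
)  # shape: (11, 768)

# Max-pool over layer positions, then a linear head for the text-pair
# classes (2 classes here, e.g. duplicate / not duplicate).
pooled = conv_out.max(axis=0)               # (768,)
W = rng.standard_normal((hidden, 2)) * 0.01
logits = pooled @ W
probs = np.exp(logits) / np.exp(logits).sum()
```

In a real model the kernel and head would be trained jointly with fine-tuning; the sketch only shows how a convolutional decoder can consume all encoder layers rather than the final layer alone.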
Cross-Domain Language Modeling: An Empirical Investigation
Vincent Nguyen | Sarvnaz Karimi | Maciej Rybinski | Zhenchang Xing
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
Transformer encoder models exhibit strong performance in single-domain applications. In a cross-domain setting, however, a sub-word vocabulary leads to sub-word overlap, which is problematic when the overlapping sub-words share no semantic similarity across domains. We hypothesize that alleviating this overlap allows for more effective modeling of multi-domain tasks; we consider the biomedical and general domains in this paper. We present a study on reducing sub-word overlap by scaling the vocabulary size of a Transformer encoder model while pretraining on multiple domains. We observe a significant increase in downstream performance on the general-biomedical cross-domain from a reduction in sub-word overlap.
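The overlap problem can be illustrated with a toy greedy longest-match segmenter (WordPiece-style); the vocabularies and words below are invented for illustration, not the paper's learned tokenizers:

```python
def tokenize(word, vocab):
    """Greedy longest-match sub-word segmentation (toy sketch)."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

small_vocab = {"ion", "fash", "cat", "c", "a", "t"}
# A small vocabulary forces semantically unrelated words from different
# domains onto the same sub-word ("ion"):
assert tokenize("fashion", small_vocab) == ["fash", "ion"]
assert tokenize("cation", small_vocab) == ["cat", "ion"]

# Scaling the vocabulary keeps whole domain terms intact, removing the
# spurious overlap between the two domains:
large_vocab = small_vocab | {"fashion", "cation"}
assert tokenize("fashion", large_vocab) == ["fashion"]
assert tokenize("cation", large_vocab) == ["cation"]
```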
2020
Pandemic Literature Search: Finding Information on COVID-19
Vincent Nguyen | Maciek Rybinski | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association
Finding information related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as new information becomes available gradually. We investigate how to better rank information for pandemic information retrieval. We experiment with different ranking algorithms, propose a novel end-to-end method for neural retrieval, and demonstrate its effectiveness on the TREC-COVID search task. This work could lead to a search system that aids scientists, clinicians, policymakers, and others in finding reliable answers from the scientific literature.
2019
Investigating the Effect of Lexical Segmentation in Transformer-based Models on Medical Datasets
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association
ANU-CSIRO at MEDIQA 2019: Question Answering Using Deep Contextual Knowledge
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 18th BioNLP Workshop and Shared Task
We report on our system for textual inference and question entailment in the medical domain for the ACL BioNLP 2019 Shared Task, MEDIQA. Textual inference is the task of finding the semantic relationships between pairs of text, while question entailment involves identifying pairs of questions with similar semantic content. To further medical question answering, we propose a system that incorporates both open-domain and biomedical-domain approaches to improve semantic understanding and ambiguity resolution. Our models achieve 80% accuracy on medical natural language inference (a 6.5% absolute improvement over the original baseline), 48.9% accuracy on recognising medical question entailment, 0.248 Spearman’s rho for question answering ranking, and 68.6% accuracy for question answering classification.