Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification

Edmond Menya, Mathieu Roche, Roberto Interdonato, Dickson Owuor


Abstract
We present EpidBioBERT, a biosurveillance epidemiological document tagger for disease surveillance over the PADI-Web system. Our model is trained on the PADI-Web corpus, which contains news articles on animal disease outbreaks extracted from the web. We train a classifier to discriminate between relevant and irrelevant documents based on their epidemiological thematic feature content, in preparation for further epidemiological information extraction. Our approach proposes a new way to perform epidemiological document classification by enriching four epidemiological thematic features, namely disease, host, location and date, which are used as inputs to our epidemiological document classifier. We adopt a pre-trained biomedical language model with a novel fine-tuning approach that enriches these epidemiological thematic features. We find these thematic features rich enough to improve epidemiological document classification with a smaller data set than the one initially used for the PADI-Web classifier. This improves the classifier's ability to avoid false positive alerts in disease surveillance systems. To further understand the information encoded in EpidBioBERT, we examine the impact of each epidemiological thematic feature on the classifier through ablation studies. We also compare our biomedical pre-trained approach with a model based on a general language model, and find that thematic feature embeddings pre-trained on general English documents are not rich enough for the epidemiological classification task. Our model achieves an F1-score of 95.5% on an unseen test set, an improvement of 5.5 F1-score points over the PADI-Web classifier, using nearly half the training data set.
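The abstract reports results as an F1-score and ties the metric to avoiding false positive alerts. As a reminder of how the metric relates to false positives, the following is a minimal sketch of precision, recall and F1 for a binary relevant/irrelevant classifier. The function name `precision_recall_f1` and the toy labels are assumptions made for illustration only; they are not the paper's code or data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: relevant documents tagged as relevant.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: irrelevant documents tagged as relevant (false alerts).
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: relevant documents that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (invented): 1 = relevant, 0 = irrelevant.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, every false positive lowers precision and therefore F1, which is why fewer false alerts show up as a higher score.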
Anthology ID:
2022.lrec-1.399
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3741–3750
URL:
https://aclanthology.org/2022.lrec-1.399
Cite (ACL):
Edmond Menya, Mathieu Roche, Roberto Interdonato, and Dickson Owuor. 2022. Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3741–3750, Marseille, France. European Language Resources Association.
Cite (Informal):
Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification (Menya et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.399.pdf