SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Samuel Cahyawijaya, Tiezheng Yu, Zihan Liu, Xiaopu Zhou, Tze Wing Tiffany Mak, Yuk Yu Nancy Ip, Pascale Fung


Abstract
Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics have also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability towards understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which is crucial for genome-wide association studies. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNPs. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely with haploid sequences.
Anthology ID:
2022.bionlp-1.14
Volume:
Proceedings of the 21st Workshop on Biomedical Language Processing
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Venue:
BioNLP
Publisher:
Association for Computational Linguistics
Pages:
140–154
URL:
https://aclanthology.org/2022.bionlp-1.14
DOI:
10.18653/v1/2022.bionlp-1.14
Cite (ACL):
Samuel Cahyawijaya, Tiezheng Yu, Zihan Liu, Xiaopu Zhou, Tze Wing Tiffany Mak, Yuk Yu Nancy Ip, and Pascale Fung. 2022. SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 140–154, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study (Cahyawijaya et al., BioNLP 2022)
PDF:
https://aclanthology.org/2022.bionlp-1.14.pdf
Video:
https://aclanthology.org/2022.bionlp-1.14.mp4
Code:
hltchkust/snp2vec