Using generative adversarial networks for genome variant calling from low depth ONT sequencing data

Overview

As illustrated in Figs. 5, the LDV-Caller consists of three sub-models, ie, generative model Gvariant caller Cand adversarial discriminator D. Image The and (I_ {gt} ) are a pair of training sample extracted from low depth and high depth data. G is first used to generate the projection image G(The). crack C takes concatenation of G(The) and The as input and predicts four types of variant attributes to make final decision as16. D is a binary classifier which can provide adversarial learning to facilitate optimizing the LDV-Caller. It’s notable that D is no longer needed at testing time.

Table 4 reports the detailed structures of our LDV-Caller. In the following, we will delve into each submodule and loss function to clarify the entire method.

The LDV-Caller

Pileup image features

There are two methods for generating input features from the binary aligned map (BAM) data for deep learning models. One is proposed in DeepVariant17. As shown in Figs. 2a, they pile up all reads covered the candidate variation position into a 2D color picture. Different types of feature are included in different channel. The features include read base, strand of the read, base quality, and mapping quality, etc. Although this method includes almost all useful information, the input size is too huge to restrict its processing speed15 and16 leverage another counting strategy to generate the input features like Fig. 2b. The size of the input is respectively (33 times 4 times 4 ) and (33 times 8 times 4 ), which is more convenient to process. In this paper, we use the same input features as16which added useful strand information compared with15.

Figure 2b shows the input features of our LDV-Caller model. It has three dimensions, ie, base position, read base and counting strategy. The position dimension centers at the candidate positionc and ranges in ([c-a, c+a])where or is the flanking window. The read base dimension consists of A, C, G, T, A-, C-, G- and T-, where A, C, G, T are from forward strand while A-, C-, G-, T- are from reverse strand. The third dimension includes various statistic ways. In this paper, we use (1) the allelic count of the reference allele, (2) insertions, (3) deletions, and (4) single nucleotide alternative alleles. These three-dimension tensors with size of (33 times 8 times 4 ) compose our input for the LDV-Caller.

Caller variant

In this paper, we choose model C proposed in16 as the variant attribute extractor for two reasons, ie, effectiveness and efficiency. At first16, achieves the state-of-the-art results for germline variant calling on Oxford Nanopore Technology (ONT) data. At the same time, it is at least 7 times faster than both traditional method Longshot11 on ONT data and deep learning based method DeepVariant17 on NGS data.

The model C has four types of variant attributes to predict. They are: (1) genotype with 21 classes, (2) zygosity with 3 classes, (3) length of allele 1 with 33 classes, and (4) length of allele 2 with 33 classes. Based on these predictions16, also proposed an algorithm to determine the most probable genotype and corresponding alternative alleles. We use focal loss22 as our loss function, and parameter ( gamma ) is set to 2. Furthermore, we treat each task equally. Therefore, model C is supervised by ( mathscr {L} _C )as:

begin {aligned} { mathscr {L} _C} = { mathscr {L} _ {21}} + { mathscr {L} _ {3}} + { mathscr {L} _ {33 { } + { mathscr {L} _ {33} ^ {‘}}, end {aligned}

(1)

where, ( mathscr {L} _ {21} ), ( mathscr {L} _ {3} ), ( mathscr {L} _ {33} ), ( mathscr {L} _ {33} ^ {‘} ) are respectively losses for four variant attribute prediction tasks.

Generative model

Similar to image quality enhancement23the generative model G is proposed to build relation between two data distribution domains, ie, from low depth domain to high depth domain. It’s an encoder-decoder network of which input and output have the same size. We also introduce skip connections and dilated convolutional layers into model G. Table 4 shows the detailed layers. For each variant candidate, we first respectively extract 3D pileup images with size of (33 times 8 times 4 ) from low depth and corresponding high depth data. Whereas, the last two dimensions of these 3D pileup images, especially the third dimension, have too small size to make model G very deep. Deep model with more pooling operations is the key to extract high level semantic features. For these reasons, 3D pileup images are then flattened along the third dimension to get 2D images The and (I_ {gt} ) with size of (33 times 32 ). As shown in Figs. 5, G takes image The as input and generates a predicted high depth image, ie, the projection image G(The). We use mean square error as the loss function to optimize model G. It is calculated as follow:

begin {aligned} { mathscr {L} _ {G}} = { parallel {G (I) – I_ {gt}} parallel} _2 end {aligned}

(2)

In order to recover more complete information from low depth data, we also introduce adversarial learning into our method, which has proven its effectiveness for better image transformation. Specifically, the adversarial discriminator D is a binary classifier. Details are given in Table 4. The adversarial discriminator D and the generative model G are like playing a minimax two-player game24. pattern D aims at correctly distinguishing predicted high depth image from truth high depth image, while the generative model G wants to generate enough realistic high depth image to confuse model D. Therefore, adversarial learning has been used to promote model G. For each training step, model D is inferred by two times. At first, when training model Git provides extra adversarial loss ( mathscr {L} _ {adver} )as:

begin {aligned {{ mathscr {L}} _ {adver} = – { mathbb {E}[log (D(G(I),I))]. end {aligned}

(3)

Then, when training model Dcorresponding loss is ( mathscr {L} _ {D} )as:

begin {aligned} { mathscr {L}} _ {D} = – { mathbb {E} [log {D(I_{gt}}, I) + log (1-D(G(I), I))] end {aligned}

(4)

Loss function

All components in the LDV-Caller are trained end-to-end. However, due to the existing of adversarial learning, there are two optimizers during training. For each iteration, parameters of model G, C are first updated at the same time. They are supervised by loss ( mathscr {L} )as:

begin {aligned} mathscr {L} = lambda _1 mathscr {L} _ {G} + lambda _2 mathscr {L} _ {C} + lambda _3 mathscr {L} _ {adver }, end {aligned}

(5)

where ( lambda _1, lambda _2 ) are set to 1, while ( lambda _3 ) is set to 0.1 in this work. After that, the model D is supervised by loss ( mathscr {L} _ {D} ).

Implementation details

Datasets

In this work, we used dataset from Genome In A Bottle consortium19.21 (GIAB) because of the integration of multiple variant calling program on a bunch of datasets, and the special process steps on disconcordant variant sites among the programs. It provided all kinds of necessary data for genome analysis, eg, reference human genome sequence (FASTA file), binary alignment map (BAM file) built from sequenced reads, high confidence regions (BED file), truth genome variants (VCF file), etc. We used three samples, ie, HG001, HG002 and HG003, of which Oxford Nanopore (ONT) files are available. All these samples were generated by the same chemistry kits, ie, SQK-LSK109 library prep kits on the FLO-PRO002 flow cell. And GRCh38 version was used as the reference genome sequence. The original sequencing depths of HG001, Hg002, and HG003 are 44.3x, 52.2x, and 51.6x. Low depth data was then generated by down-sampling original BAM file. Furthermore, 157 public Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) samples from NCBI website20 were used for further validation.

Evaluation metrics

The evaluation is performed on all variants by the predicted label and the ground truth label. Recall, precision and F1 score are used as the evaluation metrics. (t_p ), (f_p ), (f_n ) are used to respectively represent true positive, false positive, and false negative samples. The recall rate is calculated by (Recall = {t_p} / {(t_p + f_n)} ). The precision is calculated by (Precision = {t_p} / {(t_p + f_p)} ). The F1 score is calculated by (F1 = 2 times (Recall times Precision) / (Recall + Precision) ). And we used the submcule vcfeval in RTG Tools version 3.925 to generate these three metrics.

Hyper-parameters selection

The size of the flanking window is 16. The algorithms are implemented using Pytorch26 with an NVIDIA Tesla P100 GPU. We use Adam optimizer27 with an initial learning rate of 0.0003. Each mini-batch contains 5000 samples. For each training period, we train the LDV-Caller model up to 30 epochs which takes 36 hours. When inference, analyzing whole human genome needs 380 minutes.

Training and testing

We first extract all genome sites with minimum of 0.2 allele frequency. These potential sites and truth variant sites are used to extract pileup images. For each site, low depth image The and corresponding high depth image (I_ {gt} ) are generated. Labels for variant caller and variant discriminator are calculated from truth variant VCF file provided by GIAB consortium. Image The, (I_ {gt} ) and labels compose the training data. Our LDV-Caller is trained in an end-to-end style. However, parameters of three sub-models in LDV-Caller are supervised by two optimizers. One is for adversarial discriminator. The other optimizes generative model and variant caller at the same time. As suggested in24, these two optimizers alternately work in the training process. Moreover, in this work, we give them the same training iterations.

In the testing stage, the adversarial discriminator is not required. At first, we need to extract all potential variant candidates, ie, all genome sites with minimum of 0.2 allele frequency. Then pileup images are generated from low depth sequencing data at these sites. And 2D low depth images are obtained by reshaping 3D pileup images. The generative model takes these 2D low depth images as inputs and recovers high depth images. Both predicted high depth images and corresponding low depth images are passed to the variant caller. Lastly, based on predicted variant attributes from the variant caller, variants are yielded via using post-processing algorithm in16.

Ethical approval

For the whole experiments, we (1) identify the institutional and / or licensing committee approving the experiments, including any relevant details, (2) confirm that all experiments were performed in accordance with relevant guidelines and regulations.

Informed consent

Informed consent was obtained from all participants and / or their legal guardians.