The Phosphorus SMN CNV Algorithm
A breakdown of our algorithm for identifying the SMN1 and SMN2 counts computationally from FASTQ files
Spinal Muscular Atrophy is an autosomal recessive muscular disease that causes progressive muscular degradation and weakness. SMA often leads to an early death, most commonly in infants with the most severe form. It has an estimated incidence of 1/10,000 and a carrier frequency of 1 in 40 to 1 in 60.
SMA is caused by the deficiency of the motor neuron protein SMN (Survival Motor Neuron). The SMN protein is encoded by the SMN1 and SMN2 genes. The number of copies of these genes that a human carries can greatly vary. Normally, humans have two copies of SMN1 and zero to four copies of SMN2. SMN1 produces adequate levels of SMN protein, while SMN2 is only able to produce low levels of protein. Therefore, the loss of either one or both copies of SMN1 will result in SMA, while the loss of SMN2 has no consequence.
Although SMN1 and SMN2 are 99% similar, the two genes have one functional difference in exon 7 — specifically, a C > T change — that results in splicing of the SMN2 gene such that stable SMN protein is not produced. SMN2 occasionally produces full length SMN mRNA, which means that the more copies of SMN2 someone has, the more they can compensate for a loss of the SMN1 gene. In essence, SMA is caused by the loss of the SMN1 gene, but the severity is determined by the number of copies of the SMN2 gene.
Twenty batches of FASTQ files comprising over 500 samples were used for training and testing, and the algorithm results were compared against the actual SMN1 count for each sample as determined by droplet PCR. BWA and Samtools were used for alignment and indexing against the SMN1 gene and Edico was used for alignment to the whole genome.
The Phosphorus SMN CNV algorithm consists of two main steps. First, the ratio of the number of copies of SMN1 to the number of copies of SMN2 is determined by aligning the FASTQ to just the SMN1 sequence. Because SMN1 and SMN2 are so similar, both SMN1 and SMN2 reads will align to the reference. Then, the overall ratio is determined by examining the ratios of base counts at the five nucleotides where SMN1 and SMN2 differ.
The second step is to determine the total count of SMN1 and SMN2. The total count of SMN1 and SMN2 is proportional to the depth of the alignment to the SMN1 sequence. Therefore, the total count is determined by looking at the depth coverage of alignment to SMN1 relative to the average depth coverage of a control region spanning the whole genome. Three samples are selected within each batch to be control samples, and these controls are used to normalize the total count by batch. Once the ratio of SMN1 count to SMN2 count and the total SMN1 and SMN2 count are determined, simple algebra gives predictions for the SMN1 count and the SMN2 count.
Below is a table of the results of our algorithm. An error means that the algorithm reported a problem, or, in other words, identified that the predicted number of SMN1 genes was impossible. A miss signifies that the algorithm did not report an error, yet predicted the incorrect number of SMN1 genes as based on droplet PCR. As shown in the final row, errors occurred about two percent of the time, misses occurred less than 0.5 percent of the time, and overall the algorithm worked correctly on ~97.5% of samples.
In conclusion, our algorithm achieved a 97.35 percent accuracy as well as a 98 percent statistically-predicted accuracy. Most importantly, the algorithm successfully identified all 20 test samples that have only one SMN1 and therefore were carriers for the disease.