Phosphorus Research: CtsCNV, A Copy Number Variant Detection Method for Clinical Targeted Sequencing Data

Last month, Phosphorus attended the American Society of Human Genetics (ASHG) 2018 Annual Meeting in San Diego, California. The ASHG Meeting is an important event for Phosphorus, allowing us to see what is new in the wider world of genetics and to get a better sense of those whom we are helping with our products and services. Phosphorus was proud to present its poster, entitled “CtsCNV: A Copy Number Variant Detection Method for Clinical Targeted Sequencing Data.” The poster provided details on our attempt to develop an algorithm using clinical NGS data to better detect copy number variants (CNVs). Select text from the poster is below:
Methods
- Data
We downloaded low-coverage WES data for 90 samples from the 1000 Genomes Project. Targeted panels were customized on the Roche/NimbleGen platform. Genes relevant to disease phenotypes were included based on relationships described in Online Mendelian Inheritance in Man (OMIM), Human Phenotype Ontology (HPO), and GeneReviews; variants reported in the NIH ClinVar database were also included. Affymetrix microarrays were used for validation. CNVs were detected with XHMM, CODEX, EXCAVATOR2, CONVADING, and ctsCNV.
- ctsCNV algorithm
Before calling CNVs, sequencing data were first processed by our BioQC pipeline to ensure sufficient sequencing quality. In general, we require ~100x mean sample coverage to call CNV events confidently. ctsCNV takes input in either BAM or genomecov.bed format and bins the targeted regions into 100-bp sub-intervals, using the mean sequencing depth of each sub-interval as input. A key feature of ctsCNV is a set of normalization procedures that remove depth variation due to non-biological noise, so that the method can be readily applied to clinical targeted sequencing (CTS) data. These procedures correct for sample variability, batch effects, GC-content bias in the sequences, and other technical biases. Loci with extreme read depth (high or low) or high variance among samples are removed from the analysis. After normalization (spline normalization, z-score, PCA), copy number estimates per interval are computed by comparing each sample’s normalized depth per interval to the median normalized depth within the batch. For PCA, we remove the top components, with the number removed depending on the number of samples, because the majority of the variation they capture is not biological. A z-score is computed on the difference between the sub-interval depth and the median depth of that locus across samples. CNVs are called by running the widely used Circular Binary Segmentation algorithm [2] on the interval estimates, together with a permutation-based significance test. All bioinformatics algorithms were implemented within the Elements™ platform and will be freely available as an API.
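The binning and normalization steps described above can be illustrated with a minimal NumPy sketch. This is not the ctsCNV implementation: the function names (`bin_depth`, `normalize_batch`), the `n_remove` and `eps` parameters, the use of log2 ratios, and the order of the steps are all assumptions made for illustration. The spline-based GC correction, the filtering of extreme loci, and the Circular Binary Segmentation step are left out.

```python
import numpy as np

def bin_depth(per_base_depth, bin_size=100):
    """Mean depth per fixed-size bin (hypothetical helper; the last bin may be shorter)."""
    depth = np.asarray(per_base_depth, dtype=float)
    n_bins = int(np.ceil(depth.size / bin_size))
    return np.array([depth[i * bin_size:(i + 1) * bin_size].mean()
                     for i in range(n_bins)])

def normalize_batch(depth, n_remove=1, eps=1e-9):
    """Normalize a samples x bins matrix of mean bin depths.

    Returns (log_ratio, z): the denoised log2 ratio against the batch
    median, and a per-bin z-score across samples.
    """
    depth = np.asarray(depth, dtype=float)

    # 1. Scale each sample by its own median depth (sample variability).
    scaled = depth / (np.median(depth, axis=1, keepdims=True) + eps)

    # 2. Compare each sample to the batch median of the same bin.
    batch_median = np.median(scaled, axis=0, keepdims=True)
    log_ratio = np.log2((scaled + eps) / (batch_median + eps))

    # 3. PCA via SVD on the bin-centered matrix; zero out the top
    #    components, assumed here to capture mostly technical variation.
    centered = log_ratio - log_ratio.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s[:n_remove] = 0.0
    denoised = (u * s) @ vt

    # 4. z-score of each bin across the samples in the batch.
    sd = denoised.std(axis=0, ddof=1, keepdims=True)
    z = denoised / np.where(sd > 0, sd, 1.0)
    return denoised, z
```

In a sketch like this, a heterozygous deletion shows up as a run of adjacent bins with a log2 ratio near -1 and strongly negative z-scores in one sample; a segmentation step would then merge such runs into a single call.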
Results
- ctsCNV outperforms other callers on targeted sequencing data
First, using a custom Roche NimbleGen targeted panel, 372 previously characterized DNA samples were sequenced at different depths (median ~400x). These samples included 81 known CNVs. The performance of ctsCNV was compared to that of several commonly used callers, including CODEX, XHMM, EXCAVATOR2, and CONVADING. CtsCNV achieved the highest sensitivity with good precision (Table 1).
- ctsCNV shows high accuracy on 1000 Genome WES data
To ensure the analysis was not biased toward internal lab conditions, all callers were also tested on lower-depth whole exome sequencing data for 90 samples from the 1000 Genomes Project. We used three high-resolution CNV datasets (from HapMap, McCarroll et al., and Conrad et al.) to measure accuracy. CNV results are classified as ‘ALL’, ‘COMMON’, or ‘RARE’ based on frequency. ctsCNV again achieved the highest recall and precision (Figure 1).
- ctsCNV robustly adapts to clinical CNV testing
We have optimized the algorithm for real-world clinical data. We are able to accurately call 1) long CNVs (up to a 3.6 Mbp Y chromosome micro-deletion, median depth ~250x); 2) CNVs in low-coverage sequencing data (median depth ~120x); and 3) CNVs in small batches (Table 2, Figure 2).
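The sensitivity (recall) and precision comparisons in this section depend on matching each caller's CNV calls against a truth set. The poster text does not state the matching rule that was used, so the sketch below assumes a common convention, a 50% reciprocal overlap between a call and a known CNV. The names `reciprocal_overlap`, `precision_recall`, and `min_overlap` are hypothetical, and CNV type (deletion vs. duplication) is ignored for brevity.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; an assumed representation.
Interval = Tuple[str, int, int]

def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Smaller of the two fractions of each interval covered by the other."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))

def precision_recall(calls: List[Interval], truth: List[Interval],
                     min_overlap: float = 0.5) -> Tuple[float, float]:
    """Precision = matched calls / all calls; recall = matched truth / all truth."""
    matched_calls = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for t in truth) for c in calls)
    matched_truth = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for c in calls) for t in truth)
    precision = matched_calls / len(calls) if calls else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall
```

A stricter or looser overlap threshold changes both numbers, so callers can only be compared fairly when the same rule is applied to all of them.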
Discussion
Clinical genetic diagnosis of CNVs from NGS data is currently challenging because of uneven coverage and GC content, non-biological biases such as batch effects, and the small batch sizes and low read depths that result from cost considerations. Our ctsCNV method detects exon-level CNVs with excellent sensitivity and specificity compared with four other CNV callers, in clinical panel sequencing as well as in WES data. Furthermore, ctsCNV performs well on clinical data with small batches and low coverage. In addition, our method can be readily adapted to the challenging task of detecting Y chromosome micro-deletions.
Conclusion
We have developed a robust CNV calling method for clinical NGS data, which outperforms many CNV callers in accuracy and is applicable to small-batch and low-coverage data.