Precise breakpoint localization of large genomic deletions using PacBio and Illumina next-generation sequencers
 
Michal J Okoniewski*1,2, Janine Meienberg*3,4, Andrea Patrignani1, Alicja Szabelska1,5, Gabor Matyas#3,4, and Ralph Schlapbach#1
1Functional Genomics Center Zurich, Zurich, Switzerland
2Department of Neuroimmunology and Multiple Sclerosis Research, Neurology Clinic, University Hospital, Zurich, Switzerland
3Center for Cardiovascular Genetics and Gene Diagnostics, Zurich, Switzerland
4Zurich Center of Integrative Human Physiology, University of Zurich, Zurich, Switzerland
5Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, Poznan, Poland


#These authors jointly directed this work

*These authors contributed equally to this work
BioTechniques, Vol. 54, No. 2, February 2013, pp. 98–100
Abstract

Herein we present the applicability of single-molecule (PacBio RS) and second-generation sequencing technology (Illumina) to the characterization of large genomic deletions. By testing samples previously characterized using a Sanger approach, our methods determined that both next-generation sequencing platforms were able to identify the position of deletion breakpoints. Our results point out various advantages of next-generation sequencing platforms when characterizing genomic deletions; however, special attention must be dedicated to identical sequences flanking the breakpoints, such as poly(N) motifs.

The PacBio RS next-generation sequencing (NGS) technology (Pacific Biosciences, Menlo Park, CA, USA) not only has the potential to identify modified bases and thereby characterize methylation patterns (1, 2), but also provides unprecedented sequencing read lengths (>2 kb), making it useful for quickly improving existing genome assemblies (3). In this study, we took advantage of such long reads to characterize large deletions previously identified by multiplex ligation-dependent probe amplification (MLPA) and microarray analyses. Using traditional Sanger sequencing to characterize large deletions is time-consuming and labor-intensive (4, 5), increasing the need for effective breakpoint localization. Indeed, for Sanger sequencing a large fragment (2–10 kb) containing the breakpoints has to be amplified by long-range PCR (LR-PCR) and subsequently sequenced in order to identify the exact breakpoint positions. Furthermore, since Sanger sequencing permits only ~600 bp to be sequenced using one primer, several sets of internal primers are required for large LR-PCR products.
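To make the primer burden concrete, the following Python sketch estimates how many single-primer Sanger reads would be needed to tile an LR-PCR product once on one strand, using the ~600 bp per primer and the 2–10 kb fragment range stated above. The function name `sanger_reads_needed` and the `overlap_bp` parameter are illustrative assumptions and not part of the original protocol.

```python
import math

def sanger_reads_needed(amplicon_bp, read_bp=600, overlap_bp=0):
    """Estimate the number of single-primer reads needed to tile an amplicon once.

    read_bp    -- usable bases per primer (~600 bp, as stated in the text)
    overlap_bp -- assumed overlap between neighboring reads (hypothetical parameter)
    """
    if overlap_bp >= read_bp:
        raise ValueError("overlap must be smaller than the read length")
    if amplicon_bp <= read_bp:
        return 1
    # The first read covers read_bp bases; each further read adds read_bp - overlap_bp.
    remaining = amplicon_bp - read_bp
    return 1 + math.ceil(remaining / (read_bp - overlap_bp))

# Range of LR-PCR fragment sizes mentioned in the text (2-10 kb).
for size in (2000, 10000):
    print(size, sanger_reads_needed(size))
```

Under these assumptions, a 10 kb product needs well over a dozen primers for single-strand coverage, and more if neighboring reads are required to overlap.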

In contrast, NGS may offer simplified sequencing in such cases. Herein, we tested this possibility by using not only the long reads of the PacBio platform, but also the short reads of a second-generation sequencing technology (HiSeq 2000, Illumina, San Diego, CA, USA). Illumina offers stable lengths of short reads (100 bp in this case) with errors most likely to be grouped at the ends of reads (6, 7); the PacBio RS reads from this study had a mean length of 2459 bp and random distribution of errors affecting 10–15% of nucleotides. In addition, only a few dedicated computational techniques are available for the characterization of large deletions by NGS (8), making data analysis a challenge.

The three DNA samples used in this study harbor previously characterized large hemizygous deletions. Deletions in sample 44 and sample 70 (of length 26,887 bp and 302,580 bp, respectively) affect the FBN1 gene in patients with Marfan syndrome (4); a deletion in sample 53B has a size of 3,408,306 bp and comprises the entire COL3A1 gene in a patient with Ehlers-Danlos syndrome vascular type (5). Accordingly, ~6.5–8.5 kb LR-PCR products were amplified using the Expand Long Template PCR System (Roche Diagnostics, Rotkreuz, Switzerland) as described previously (4, 5) and purified using the QIAquick PCR Purification Kit (Qiagen, Hilden, Germany).

Method summary

We present the applicability of single-molecule (PacBio RS) and second-generation sequencing technology (Illumina) to the characterization of large genomic deletions. Both next-generation sequencing platforms were able to identify the position of deletion breakpoints previously identified by Sanger sequencing.

SMRTbell libraries were prepared using the PacBio C2 chemistry (3–10 kb) DNA preparation kit (Part no. 001–540–726, Pacific Biosciences) and 5 µg of purified amplicons without fragmentation. Libraries were subsequently sequenced on the PacBio RS using one SMRT cell per sample and taking two movies of 45 min each. The reads were mapped with the BLASR mapper (9), which is supplied in the SMRT Portal software suite (Pacific Biosciences) and therefore serves as the standard mapper for PacBio reads. The same amplicons were sequenced on the HiSeq 2000 sequencer using Illumina's TruSeq DNA Sample Preparation v2 protocol with 1 µg input material and 100+100 bp paired-end reads. The reads were mapped using the standard mapper, Bowtie (10). For both NGS platforms, the mappers were used with default parameters. The respective sequences are available in the NCBI Sequence Read Archive (object ID: ERP002092).

For the PacBio data, the read coverage displayed in the SMRT Portal software suite showed a clear drop in read depth in the deleted region (see Supplementary Figure S5), which was subsequently confirmed by zooming in on the breakpoint regions in the Integrative Genomics Viewer (IGV) (11) (Figure 1 and Tables 1 and 2 for sample 70; see Supplementary Figures S1–S2 as well as Supplementary Tables S1–S4 for samples 44 and 53B, respectively). The respective Illumina data displayed in IGV showed a more gradual decline in read depth at the expected deletion ends; the breakpoint sites in these data were identified by an increase in mismatches (Figure 1, Tables 1 and 2, Supplementary Figures S1–S2, and Supplementary Tables S1–S4). This may be because mappers typically allow several mismatches, so that many of the short Illumina reads could be mapped across the breakpoints. In contrast, the PacBio RS data contain a number of reads spanning the deletion that were not mapped by the SMRT Portal aligner to the standard reference due to the high number of mismatches. The read depth of both platforms is more than sufficient to find the breakpoint; tests using 1/2 or 1/3 of the reads per sample also produced satisfactory results (data not shown).
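The read-depth signal described above can be illustrated with a minimal Python sketch. It is not the SMRT Portal or IGV procedure used in this study: it assumes that alignments are already available as half-open (start, end) reference intervals, and the function names `coverage` and `largest_steps` are hypothetical. The sketch computes per-base depth and reports where depth falls and rises most sharply, which for long reads would be expected to bracket the deleted region.

```python
def coverage(intervals, length):
    """Per-base read depth from half-open (start, end) alignment intervals."""
    diff = [0] * (length + 1)
    for start, end in intervals:
        # Clip each interval to the reference, then mark where depth changes.
        start = min(max(start, 0), length)
        end = min(max(end, 0), length)
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for change in diff[:length]:
        running += change
        depth.append(running)
    return depth

def largest_steps(depth):
    """Return (drop_pos, rise_pos): positions of the sharpest fall and rise in depth."""
    drop_pos = rise_pos = 0
    drop = rise = 0
    for i in range(1, len(depth)):
        step = depth[i] - depth[i - 1]
        if -step > drop:
            drop_pos, drop = i, -step
        if step > rise:
            rise_pos, rise = i, step
    return drop_pos, rise_pos

# Toy data (invented for illustration): reads cover both flanks but not positions 50-59.
toy = [(0, 50)] * 10 + [(0, 30)] * 2 + [(60, 100)] * 10
depth = coverage(toy, 100)
print(largest_steps(depth))
```

For short reads that are mapped across the breakpoint with mismatches, the depth profile declines more gradually, so a mismatch-based signal would be needed in addition to such a step detector.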















An additional difficulty may be identical sequences on both sides of the deletion, a common phenomenon that has already been described for different genes (12–14). In particular, this could be observed in all three deletions presented in this study (“CC” in samples 53B and 70 and “GC” in sample 44). In order to find the precise sequence of poly(N) motifs (tandemly repeated nucleotides) at the sites of break and rejoining, we developed an AWK script to count matches at the sites of suspected deletion breakpoints (see Supplementary Materials). This counting was performed with perfect matches only, resulting in the data depicted in Figure 2 (sample 70) and Supplementary Figures S3 and S4 (samples 44 and 53B, respectively). When a single nucleotide (or pair, in the case of GC) has a fixed probability of being misinterpreted, it can be assumed without loss of generality that the distribution of the occurrences of specific motifs follows the Poisson distribution. The hypothesis that the maximum number of counts represents the appropriate motif was tested. For PacBio RS reads in sample 70, the probabilities of wrongly accepting the null hypothesis are far below the 0.01 level of significance (p = 1.5 × 10⁻²³, p = 1.06 × 10⁻⁴⁶, and p = 3.2 × 10⁻¹⁴¹ in the cases of 20, 10, and 5 flanking bases, respectively) (Figure 2). In the case of Illumina, due to the high number of reads, error levels are so low that they fall below the numerical precision available in the R language. For details on the calculations, see the R script in the Supplementary Materials. The script can be used on any FASTA or FASTQ data and checks the statistical power at a given significance level regardless of the platform.
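The counting and testing idea can be sketched in Python as follows. This is not the AWK or R script from the Supplementary Materials. It assumes that reads are plain strings on the same strand as the reference flanks, that each candidate junction is built from k flanking bases on either side of a candidate motif, and that the null rate for the Poisson test is set by the runner-up count. All function names are hypothetical, and the exact test used in the study may differ.

```python
import math

def count_junction_motifs(reads, left_flank, right_flank, motifs, k):
    """Count reads with a perfect match to left_flank[-k:] + motif + right_flank[:k]."""
    counts = {}
    for motif in motifs:
        probe = left_flank[-k:] + motif + right_flank[:k]
        counts[motif] = sum(1 for read in reads if probe in read)
    return counts

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); the tail is summed directly to keep tiny values."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0

    def pmf(i):
        return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))

    if k <= lam:
        return 1.0 - sum(pmf(i) for i in range(k))
    total, i = 0.0, k
    while True:
        term = pmf(i)
        total += term
        if term <= total * 1e-16:  # further terms no longer change the sum
            break
        i += 1
    return total

def top_motif_p_value(counts):
    """Return (best motif, tail probability) with the runner-up count as assumed null rate."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    lam = max(second_n, 1)  # assumption: avoid a zero rate when the runner-up has no hits
    return best, poisson_upper_tail(best_n, lam)

# Toy reads (invented for illustration): three support "CC", one supports "C".
left, right = "ACGTACGTAG", "TTGACCATGA"
reads = ["GG" + left + "CC" + right + "AA"] * 3 + [left + "C" + right, "ACGT" * 10]
counts = count_junction_motifs(reads, left, right, ["C", "CC", "CCC"], k=5)
print(counts, top_motif_p_value(counts))
```

Summing the upper tail term by term, rather than computing one minus the cumulative probability, is one way to avoid losing very small probabilities to floating-point rounding. Extremely small tails can still underflow to zero in double precision. At least two candidate motifs are required for the comparison.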





As shown by this study, the determination of deletion breakpoints can be done with data obtained from both NGS platforms. However, whereas the long reads of PacBio RS showed a sharp decrease in read depth, Illumina short reads exhibited an increase in mismatches related to the position of the breakpoints. Sample preparation costs are comparable for PacBio and Illumina platforms. However, sequencing using PacBio RS can be done within a working day, while Illumina's system, even the smaller MiSeq version, requires more time. Both platforms are suitable for precise breakpoint localization, provide an alternative procedure for the characterization of large deletions, and require fewer resources and less time than traditional Sanger sequencing.

Acknowledgments

We are grateful to Yu-Chih Tsai, Jonas Korlach, and Stephen W. Turner for discussions on PacBio technology and data analysis. This work was supported by the FGCZ, as well as grants from the COFRA Foundation (to G.M.), Gottfried & Julia Bangerter-Rhyner-Stiftung (to G.M.), Jubiläumsstiftung Swiss Life (to G.M.), Foundation for People with Rare Diseases (to J.M. and G.M.), Clinical Research Priority Program (CRPP/KFSP-MS) of University of Zurich (to M.O.), and Sciex.ch (no. 11.182 to A.S. and M.O.).

Competing interests

The authors declare no competing interests.

Correspondence
Address correspondence to Michal J. Okoniewski, FGCZ, Winterthurerstrasse 190, 8057 Zurich, Switzerland. Email: [email protected]

References
1.) Clark, T.A., I.A. Murray, R.D. Morgan, A.O. Kislyuk, K.E. Spittle, M. Boitano, A. Fomenkov, R.J. Roberts, and J. Korlach. 2012. Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing. Nucleic Acids Res. 40:e29.

2.) Murray, I.A., T.A. Clark, R.D. Morgan, M. Boitano, B.P. Anton, K. Luong, A. Fomenkov, S.W. Turner. 2012. The methylomes of six bacteria. Nucleic Acids Res. 40:11450-11462.

3.) Zhang, X., K.W. Davenport, W. Gu, H.E. Daligault, A.C. Munk, H. Tashima, K. Reitenga, L.D. Green, and C.S. Han. 2012. Improving genome assemblies by sequencing PCR products with PacBio. BioTechniques 53:61-62.

4.) Mátyás, G., S. Alonso, A. Patrignani, M. Marti, E. Arnold, I. Magyar, C. Henggeler, T. Carrel. 2007. Large genomic fibrillin-1 (FBN1) gene deletions provide evidence for true haploinsufficiency in Marfan syndrome. Hum. Genet. 122:23-32.

5.) Meienberg, J., M. Rohrbach, S. Neuenschwander, K. Spanaus, C. Giunta, S. Alonso, E. Arnold, C. Henggeler. 2010. Hemizygous deletion of COL3A1, COL5A2, and MSTN causes a complex phenotype with aortic dissection: a lesson for and from true haploinsufficiency. Eur. J. Hum. Genet. 18:1315-1321.

6.) Kozarewa, I., Z. Ning, M.A. Quail, M.J. Sanders, M. Berriman, and D.J. Turner. 2009. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods 6:291-295.

7.) McElroy, K.E., F. Luciani, and T. Thomas. 2012. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13:74.

8.) Ye, K., M.H. Schulz, Q. Long, R. Apweiler, and Z. Ning. 2009. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25:2865-2871.

9.) Chaisson, M.J., and G. Tesler. 2012. Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Theory and application. BMC Bioinformatics 13:238.

10.) Langmead, B. 2010. Aligning short sequencing reads with Bowtie. Curr. Protoc. Bioinformatics Unit 11.7 Chapter 11.

11.) Thorvaldsdottir, H., J.T. Robinson, and J.P. Mesirov. 2012. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. [Epub ahead of print].

12.) Giacalone, J.P., and U. Francke. 1992. Common sequence motifs at the rearrangement sites of a constitutional X/autosome translocation and associated deletion. Am. J. Hum. Genet. 50:725-741.

13.) Otto, E., R. Betz, C. Rensing, S. Schatzle, T. Kuntzen, T. Vetsi, A. Imm, and F. Hildebrandt. 2000. A deletion distinct from the classical homologous recombination of juvenile nephronophthisis type 1 (NPH1) allows exact molecular definition of deletion breakpoints. Hum. Mutat. 16:211-223.

14.) Liu, H.X., L. Cartegni, M.Q. Zhang, and A.R. Krainer. 2001. A mechanism for exon skipping caused by nonsense or missense mutations in BRCA1 and other genes. Nat. Genet. 27:55-58.