2Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA
3Laboratory of Pharmaceutical Biotechnology, Faculty of Pharmaceutical Sciences, Ghent University, Ghent, Belgium
High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionized genomic research. In recent years, NGS technology has steadily improved, with costs dropping and the number and range of sequencing applications increasing exponentially. Here, we examine the critical role of sequencing library quality and consider important challenges when preparing NGS libraries from DNA and RNA sources. Factors such as the quantity and physical characteristics of the RNA or DNA source material as well as the desired application (e.g., genome sequencing, targeted sequencing, RNA-seq, ChIP-seq, RIP-seq, and methylation analysis) are addressed in the context of preparing high-quality sequencing libraries. Current methods for preparing NGS libraries from single cells are also discussed.
Over the past five years, next-generation sequencing (NGS) technology has become widely available to life scientists. During this time, as sequencing technologies have improved and evolved, so too have methods for preparing nucleic acids for sequencing and constructing NGS libraries (1, 2). For example, NGS library preparation has now been successfully demonstrated for sequencing RNA and DNA from single cells (3-11).
Fundamental to NGS library construction is the preparation of the nucleic acid target, RNA or DNA, into a form that is compatible with the sequencing system to be used (Figure 1). Here, we compare and contrast various library preparation strategies and NGS applications, focusing primarily on those compatible with Illumina sequencing technology. However, it should be noted that almost all of the principles discussed in this review can be applied with minimal modification to NGS platforms developed by Life Technologies, Roche, and Pacific Biosciences.
In general, the core steps in preparing RNA or DNA for NGS analysis are: (i) fragmenting and/or sizing the target sequences to a desired length, (ii) converting the target to double-stranded DNA, (iii) attaching oligonucleotide adapters to the ends of the target fragments, and (iv) quantitating the final library product for sequencing.
The size of the target DNA fragments in the final library is a key parameter for NGS library construction. Three approaches are available to fragment nucleic acid chains: physical, enzymatic, and chemical. DNA fragmentation is typically done by physical methods (e.g., acoustic shearing and sonication) or enzymatic methods (e.g., non-specific endonuclease cocktails and transposase tagmentation reactions) (12). In our laboratory, acoustic shearing with a Covaris instrument (Covaris, Woburn, MA) is typically used to obtain DNA fragments in the 100–5000 bp range, while Covaris g-TUBEs are employed for the 6–20 kbp range necessary for mate-pair libraries. Enzymatic methods include digestion by DNase I or Fragmentase, a two-enzyme mix (New England Biolabs, Ipswich, MA). Comparisons of NGS libraries constructed with acoustic shearing/sonication versus Fragmentase found both to be effective (13). However, Fragmentase produced a greater number of artifactual indels than the physical methods. An alternative enzymatic method for fragmenting DNA is Illumina's Nextera tagmentation technology (Illumina, San Diego, CA), in which a transposase enzyme simultaneously fragments dsDNA and inserts adapter sequences into it. This method has several advantages, including reduced sample handling and preparation time (12).
Desired library size is determined by the desired insert size (the portion of the library between the adapter sequences), because the length of the adapter sequences is constant. In turn, optimal insert size is determined by the limitations of the NGS instrumentation and by the specific sequencing application. For example, when using Illumina technology, optimal insert size is affected by the process of cluster generation, in which libraries are denatured, diluted, and distributed on the two-dimensional surface of the flow cell and then amplified. Shorter products amplify more efficiently than longer products, and longer library inserts generate larger, more diffuse clusters than short inserts. We have successfully sequenced libraries of up to 1500 bases in length on Illumina instruments.
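The relationship between insert size and library size described above is simple addition, and can be sketched as follows. This is a minimal illustration only: the function name and the combined adapter length of 120 bases are placeholder assumptions, not values taken from this review, and the actual adapter length depends on the kit and indexing scheme used.

```python
def library_size(insert_bp: int, adapter_total_bp: int) -> int:
    """Return the expected library fragment length in bases.

    insert_bp        -- length of the target fragment between the adapters
    adapter_total_bp -- combined length of both adapters (assumed constant
                        for a given kit; the caller must supply it)
    """
    if insert_bp <= 0:
        raise ValueError("insert_bp must be positive")
    if adapter_total_bp < 0:
        raise ValueError("adapter_total_bp must not be negative")
    return insert_bp + adapter_total_bp


# Hypothetical example: 120 bases of adapter in total is an assumed value.
ASSUMED_ADAPTER_TOTAL_BP = 120
expected = library_size(250, ASSUMED_ADAPTER_TOTAL_BP)
print(f"250-base insert -> {expected}-base library fragment")
```

Because the adapter contribution is fixed, a size distribution measured on the final library can be converted back to an insert-size distribution by subtracting the same constant.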
Optimal library size is also dictated by the sequencing application. For exome sequencing, more than 80% of human exons are under 200 bases in length (14). We run 2 × 100 paired-end reads, and our exome sequencing libraries typically contain inserts of approximately 250 bases as a compromise that matches the size of most exons while avoiding overlapping read pairs. The size of an RNA-seq library is likewise determined by the application. We typically do basic gene expression analysis using single-end 100 base reads. However, for analysis of alternative splicing or determination of transcription start and stop sites, we employ 2 × 100 base paired-end reads. In most instances, the RNA will be fragmented before conversion into cDNA. This is typically done by controlled, heated digestion of the RNA in the presence of a divalent metal cation (magnesium or zinc). The length of the library insert can be adjusted with good reproducibility by increasing or decreasing the digestion time.
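The trade-off between insert size and paired-end read length mentioned for exome libraries comes down to simple arithmetic, sketched below. The function names are illustrative assumptions, and the sketch ignores adapter read-through and base-quality trimming.

```python
def read_pair_overlap(insert_bp: int, read_len: int) -> int:
    """Bases covered by both reads of a pair.

    A positive value is the number of bases sequenced twice; a negative
    value is the size of the unsequenced gap in the middle of the insert.
    """
    return 2 * read_len - insert_bp


def sequenced_fraction(insert_bp: int, read_len: int) -> float:
    """Fraction of the insert covered by at least one read of the pair."""
    if insert_bp <= 0 or read_len <= 0:
        raise ValueError("insert_bp and read_len must be positive")
    covered = min(insert_bp, 2 * read_len)
    return covered / insert_bp


# 2 x 100 paired-end reads on a 250-base insert, as described in the text:
# the reads do not overlap and leave a 50-base gap in the middle.
print(read_pair_overlap(250, 100), sequenced_fraction(250, 100))

# The same reads on a 150-base insert would sequence 50 bases twice.
print(read_pair_overlap(150, 100), sequenced_fraction(150, 100))
```

With 2 × 100 reads, any insert shorter than 200 bases yields overlapping pairs and therefore redundant sequence, which is why an insert somewhat longer than twice the read length avoids overlap at the cost of a small unsequenced gap.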