2Shandong University School of Medicine, Jinan, China
3Department of Surgery, NorthShore University HealthSystem Research Institute, Evanston, IL, USA
*Y.-C.J. and J.X. contributed equally to this work.
PROTOCOL FOR: Simplified DGS procedure for large-scale genome structural study
Ditag genome scanning (DGS) uses next-generation DNA sequencing to sequence the ends of ditag fragments produced by restriction enzymes. These sequences are compared to known genome sequences to determine their structure. In order to use DGS for large-scale genome structural studies, we have substantially revised the original protocol by replacing the in vivo genomic DNA cloning with in vitro adaptor ligation, eliminating the ditag concatemerization steps, and replacing the 454 sequencer with Solexa or SOLiD sequencers for ditag sequence collection. This revised protocol further increases genome coverage and resolution and allows DGS to be used to analyze multiple genomes simultaneously.
Increasing evidence suggests that genome structure in normal human populations is highly variable (1) and can be substantially altered in pathological conditions (2). Identifying normal genome structural variation will provide fundamental knowledge for understanding biology, while identifying pathological structural alterations can provide clues to the genetic factors contributing to disease and yield specific markers for clinical applications.
Recent development of genome technologies is making it increasingly practical to study genome structure, not only for individual genomes but also for a large number of genomes [(3), International Cancer Genome Consortium (http://www.icgc.org)]. Among the new approaches for studying genome structure is the paired-end sequencing approach. Its basic principle is to sequence the two ends of a DNA fragment to determine the structural nature of the detected fragment by comparison with reference genome sequences. If the two end sequences map to their proper locations in the reference genome sequences, this implies that the original fragment has normal structure, and structural changes can be detected when the two end sequences do not map correctly to the reference genome sequences.
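The paired-end principle described above can be illustrated with a minimal sketch. It assumes each end sequence has already been mapped to a chromosome and coordinate; the `Hit` type, the `classify_pair` function, the category names, and the default `max_span` are hypothetical choices made for illustration and are not part of the DGS protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Mapped location of one fragment end (hypothetical representation)."""
    chrom: str
    pos: int  # coordinate of the end on the reference sequence


def classify_pair(left: Optional[Hit], right: Optional[Hit],
                  min_span: int = 0, max_span: int = 10_000) -> str:
    """Label a fragment by how its two ends map to the reference.

    max_span is an assumed upper bound on the expected fragment length;
    a real analysis would derive it from the library being sequenced.
    """
    if left is None or right is None:
        return "unmapped_end"           # at least one end has no reference match
    if left.chrom != right.chrom:
        return "different_chromosomes"  # ends map to separate chromosomes
    span = abs(right.pos - left.pos)
    if span > max_span:
        return "span_too_large"         # ends are farther apart than expected
    if span < min_span:
        return "span_too_small"         # ends are closer together than expected
    return "consistent"                 # both ends map where a normal fragment would


if __name__ == "__main__":
    print(classify_pair(Hit("chr1", 1000), Hit("chr1", 4000)))
    print(classify_pair(Hit("chr1", 1000), Hit("chr2", 4000)))
```

Any label other than `consistent` marks a candidate for follow-up rather than a confirmed structural change, since mapping artifacts can produce the same pattern.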
Different Sanger sequencer–based paired-end techniques have been developed, including BAC clone-end sequencing (4), fosmid clone-end sequencing (5), and Chip-CHIP DNA paired-end tag sequencing (6). With the availability of next-generation DNA sequencers, new paired-end sequencing techniques have been developed, including paired-end mapping (PEM) (7), massively parallel paired-end sequencing (MPPS) (8), and ditag genome scanning (DGS) (9), developed by our laboratory. These new paired-end sequencing techniques provide advantages over the Sanger sequencer–based methods including higher genome coverage, higher resolution, and lower sequencing cost. The major differences between DGS and PEM or MPPS are that (i) DGS targets restriction DNA fragments whereas PEM and MPPS target random DNA fragments, and (ii) DGS collects the two end tags (ditag) from the same DNA fragment in a single sequence whereas PEM and MPPS collect a single end tag of a DNA fragment as a single sequence, or two end tags as two separate sequences.
Because a single sequence contains two restriction sites and two end tags for a given fragment, the DGS ditag provides much higher specificity for the detected DNA fragment than PEM or MPPS tags. In addition, mapping DGS ditags to reference genome sequences is simple because it uses a reference ditag database pre-constructed from the reference genome sequences. In contrast, each PEM/MPPS end sequence needs to be searched against the entire reference genome sequence. Mapping millions of short random sequences to the reference genome sequence requires substantial computational resources, which restricts its use in regular laboratories. However, the original DGS protocol requires multiple molecular cloning steps to collect ditag DNA templates from a DNA sample for sequence collection. These labor-intensive steps prevent its use for analyzing large numbers of samples. In comparison, PEM and MPPS collect DNA templates by simple gel purification of sonication-sheared genomic DNA, which allows for analysis of multiple samples simultaneously. This manuscript presents a simplified DGS method that can be readily applied to the study of multiple genomes.
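To make the idea of a pre-constructed reference ditag database concrete, the following is a minimal sketch of an in silico digestion. It is not the authors' pipeline: the function names, the 18-base tag length per end, the choice to exclude the restriction site from the tags, and the omission of reverse-complement handling are all assumptions made for illustration.

```python
import re
from collections import defaultdict

PSTI_SITE = "CTGCAG"  # PstI recognition sequence
TAG_LEN = 18          # assumed number of bases kept from each fragment end


def build_reference_ditags(chrom, seq, site=PSTI_SITE, tag_len=TAG_LEN):
    """Map each virtual ditag to the reference fragment(s) that produce it."""
    seq = seq.upper()
    # A lookahead is used so that overlapping occurrences of a site are also found.
    site_starts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]
    db = defaultdict(list)
    for left, right in zip(site_starts, site_starts[1:]):
        frag_start = left + len(site)  # first base after the left site
        frag_end = right               # first base of the right site
        if frag_end - frag_start < 2 * tag_len:
            continue                   # too short to yield two non-overlapping tags
        ditag = seq[frag_start:frag_start + tag_len] + seq[frag_end - tag_len:frag_end]
        db[ditag].append((chrom, frag_start, frag_end))
    return dict(db)


def lookup(db, observed_ditag):
    """Return reference fragments matching an observed ditag (empty list if none)."""
    return db.get(observed_ditag.upper(), [])


if __name__ == "__main__":
    toy = "AA" + PSTI_SITE + "A" * 18 + "CCCC" + "G" * 18 + PSTI_SITE + "TT"
    db = build_reference_ditags("chrT", toy)
    print(lookup(db, "A" * 18 + "G" * 18))
```

Because the database is keyed by the ditag itself, matching an experimental ditag is a dictionary lookup rather than a search across the whole genome, which is the computational advantage the paragraph above describes. An observed ditag with no entry would be a candidate for further inspection.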
We have made two major changes to simplify the original DGS protocol. The updated protocol is included as Supplementary Material, online.
1. Alternative sequencers for ditag sequence collection
The original DGS protocol was designed for use with the 454 sequencer since it was the only next-generation sequencer available at that time (10). Recently, new next-generation sequencers such as the Solexa (Illumina, San Diego, CA, USA) and the SOLiD (Applied Biosystems, Foster City, CA, USA) sequencers have been developed. While the sequencing costs for the 454, Solexa, and SOLiD sequencers are similar, these two new sequencers provide Gb-level sequence productivity per run, compared with the 100 Mb available with the 454 sequencer. The increased sequencing productivity will lead to higher genome coverage, higher resolution of genome scanning by collecting ditags from higher frequency cutting restriction fragments, and will avoid the “homopolymer” problem inherent to the 454 sequencer.
2. Eliminating ditag concatemerization
The original DGS protocol includes a process for ditag concatemerization. In order to obtain more ditags (~36 bp) per 454 sequence read (100–250 bp), the ditag templates needed to be concatemerized to create longer templates for sequencing. The concatemerization process was technically challenging, time-consuming, and the major limiting factor for preparing ditag DNA templates. Although the sequence length from Solexa and SOLiD sequencers is shorter than that of the 454 sequencer, the length of 35–75 bp from these two sequencers is perfectly suitable to cover the 36-bp DGS ditags. More importantly, it also allows sequencing ditag templates directly without ditag concatemerization. This will substantially simplify the overall DGS process and allow the DGS method to be used to analyze larger numbers of samples.
The major steps of the revised DGS protocol include restriction digestion of genomic DNA, adaptor ligation, circularization, tag releasing, ditag formation, and PCR ditag releasing. Figure 1 outlines the revised DGS process; Table 1 compares the steps between the original and the revised DGS protocols. In the revised protocol, all steps involving concatemerization are eliminated, including ditag concatemerization, concatemer gel purification, concatemer cloning, concatemer library transformation, concatemer library plasmid preparation, concatemer release, and gel purification. Furthermore, the three-step genomic DNA library construction is also replaced by a one-step in vitro adaptor ligation. The simplified protocol decreases the time required for sample preparation from 1–2 weeks to 1–2 days and allows the preparation of multiple samples simultaneously.
We tested the revised DGS protocol using genomic DNA from the leukemic Kasumi-1 cell line. After obtaining the purified ditag templates from the PstI fragments, we cloned the ditag DNA into the TA vector and sequenced randomly selected clones using big-dye reagents with M13 primers. Of the 96 clones sequenced, 81 generated qualified ditag sequences with PstI sites at both ends and a ditag in between. Ditags of 36–37 bp were the most common. A few ditags contained longer or shorter sequences due to the inaccuracy of MmeI digestion (Supplementary Table S1). The remaining 15 clones (approximately 16%) yielded unqualified sequences, which may be due to artifacts from restriction digestion, ligation, or sequencing. These unqualified sequences must be eliminated to generate high-quality ditags for downstream genome mapping studies.
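The qualification criterion described above (a PstI site at each end with a ditag in between) can be sketched as a simple pattern filter. This is an illustrative sketch only: the accepted insert length of 32–40 bp is an assumed tolerance around the common 36–37-bp ditags, and the function names are hypothetical.

```python
import re

PSTI_SITE = "CTGCAG"       # PstI recognition sequence
MIN_LEN, MAX_LEN = 32, 40  # assumed tolerance for imprecise MmeI cutting

# A PstI site, a ditag of acceptable length, then another PstI site.
_QUALIFIED = re.compile(
    f"{PSTI_SITE}([ACGT]{{{MIN_LEN},{MAX_LEN}}}){PSTI_SITE}"
)


def extract_ditag(read):
    """Return the ditag between two PstI sites, or None if the read is unqualified."""
    match = _QUALIFIED.search(read.upper())
    return match.group(1) if match else None


def qualified_fraction(n_qualified, n_total):
    """Fraction of sequenced clones that yielded a qualified ditag."""
    return n_qualified / n_total


if __name__ == "__main__":
    good = "GGATCC" + PSTI_SITE + "ACGT" * 9 + PSTI_SITE + "TTAA"
    short = PSTI_SITE + "ACGT" * 3 + PSTI_SITE
    print(extract_ditag(good))
    print(extract_ditag(short))
    print(round(qualified_fraction(81, 96), 3))
```

With the counts reported above (81 qualified clones out of 96), `qualified_fraction` gives about 0.84. Reads containing ambiguous bases such as N are rejected by this pattern; whether that is desirable depends on the sequencing platform and is left open here.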
Although up to 20% of the raw sequences were unusable, the increased sequencing productivity of the Solexa and SOLiD systems over the 454 system will provide a sufficient quantity of high-quality sequences for ditag extraction. In this experiment, PstI was used for ditag release. Among the 6-base restriction enzymes, PstI has the highest restriction frequency and therefore yields the largest number of total bases to be sequenced (40 Mb, or ~1.1 million reads at 36 bp per read) (9). This number is easily within the capacity of Solexa and SOLiD sequencers. For example, the Solexa sequencer provides 100 million reads per run. Eight PstI ditag samples can be sequenced in individual lanes per run to generate 12.5 million reads per lane. This results in about 11 times more ditag genome coverage (12.5 million reads per lane versus the ~1.1 million required). When sequencing productivity is further increased, the 5-base or 4-base-recognizing restriction enzymes can be used for ditag release to further increase genome coverage and resolution. In summary, the revised DGS protocol substantially simplifies the DGS process and allows the DGS technique to be used for large-scale genome structural studies.
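The throughput arithmetic in this paragraph can be reproduced with a short calculation. The input figures (40 Mb, 36 bp per read, 100 million reads per run, eight lanes) are taken from the text; the function and variable names are illustrative.

```python
TOTAL_BASES = 40_000_000       # ~40 Mb of PstI ditag sequence to collect
READ_LEN = 36                  # bp per read
READS_PER_RUN = 100_000_000    # Solexa reads per run, as stated in the text
LANES = 8                      # one PstI ditag sample per lane


def reads_needed(total_bases, read_len):
    """Reads required to cover the PstI ditag set once."""
    return total_bases / read_len


def reads_per_lane(reads_per_run, lanes):
    """Reads available to each sample when one sample occupies one lane."""
    return reads_per_run / lanes


def fold_coverage(lane_reads, needed):
    """How many times over a lane covers the required ditag set."""
    return lane_reads / needed


if __name__ == "__main__":
    needed = reads_needed(TOTAL_BASES, READ_LEN)
    per_lane = reads_per_lane(READS_PER_RUN, LANES)
    print(f"reads needed:   {needed:,.0f}")
    print(f"reads per lane: {per_lane:,.0f}")
    print(f"fold coverage:  {fold_coverage(per_lane, needed):.2f}")
```

Dividing 12.5 million reads per lane by the required reads gives about 11-fold with the ~1.1 million figure, or 12.5-fold if the requirement is rounded down to 1 million reads. The calculation does not discount unqualified reads, so the usable coverage would be somewhat lower.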
This work is supported by a grant from the Guglielmi Fidelity Charitable Fund (S.M.W.), a career development award (S.M.W.), and a Clinical Collaborative Research Program (CCRP) Award (D.J.W.) from NorthShore University HealthSystem.
The authors declare no competing interests.
Address correspondence to San Ming Wang, Center for Functional Genomics, ENH Research Institute, 1001 University Place, Evanston, IL, 60201, USA.