2Department of Biology, The University of York, Heslington, UK
3Hollings Marine Laboratory, College of Charleston, Charleston, SC, USA
DNA hybridization capture combined with next generation sequencing can be used to determine the sequences of hundreds of target genes across hundreds of individuals in a single experiment. However, the approach has thus far only been successfully applied to capture targets that are highly similar in sequence to the bait molecules. Here we introduce modifications that extend the reach of the method to allow efficient capture of highly divergent homologous target sequences using a single set of baits. These modifications have important implications for comparative biology.
Next generation sequencing is revolutionizing our understanding of the links between genotype and phenotype at scales that were unimaginable a few years ago. Genome-wide association studies are now routinely conducted to explore the genomic basis for phenotypic variation (1) or disease predisposition (2), as well as to identify the genomic elements that underlie adaptive evolutionary responses (3).
Although such genome level association studies are powerful, they are labor-intensive to execute, as they require not only sequencing of entire genomes, but also that the collected genomic data be carefully parsed, assembled, and accurately annotated. This is most easily done when closely related and well-annotated reference genomes are available for comparison. However, genome assembly and annotation become substantially more difficult with increasing evolutionary distance to the reference genome. Indeed, it is often easier to assemble the genomes of divergent species de novo, than it is to compare them to a reference that differs in genome organization. This said, most studies do not require whole genome comparisons. Comparative medical and physiological studies, for example, usually center on a specific set of genes relevant to a pathway of interest (4). In the same vein, most molecular evolutionary and phylogenetic studies focus on the comparison of small subsets of homologous genes.
Unfortunately, no technology yet exists that allows researchers to efficiently explore genetic variation across evolutionarily divergent organisms for large sets of pre-specified target genes, such as those that determine pelage color in vertebrates (5), or those involved in particular physiological adaptations (6, 7). PCR amplification, the main and for a long time only technology for targeting specific DNA regions for sequencing, is generally too labor intensive, costly, and inconsistent to generate such multigene data sets. In principle, DNA hybridization capture technology, hereafter referred to as gene capture (8, 9), should be up to the task as it allows for hundreds of pre-specified genes to be targeted and isolated for sequencing in a single experiment. Indeed, gene capture techniques do work well when baits are used to interrogate libraries with limited sequence divergence (10-13), such as those for the same species or for closely related species, or when baits are designed to target ultraconserved elements among species that otherwise have highly divergent genomic sequences (14, 15). Unfortunately, existing gene capture protocols do not work well when targeting genes whose sequences are highly divergent from those of the baits that are used to capture them, although we know from first principles and experience with Southern blotting (16), that they should.
Modified DNA hybridization conditions allow gene capture across widely divergent homologous sequences.
Here we describe a gene capture method that is effective for capturing a pre-specified set of protein coding genes across species that have been evolving independently for hundreds of millions of years, using a single bait array. We have achieved this by tuning the stringency of the hybridization and washing steps of the procedure to optimize retention of target versus non-target DNA fragments, and by conducting two rounds of gene capture whereby the captured products from the first round are used as templates to augment a second round of gene capture (see detailed Supplementary Protocol). Materials and methods Proof of concept
To demonstrate the efficacy of gene capture across highly divergent taxa we selected a set of target genes that would be (i) present in all taxa tested and (ii) unambiguously identifiable (unique) within each genome. We targeted coding DNA sequences (CDS) from single-copy protein-coding genes that are shared across gnathostome vertebrates. We compared the published and annotated genomes of human (Homo sapiens), chicken (Gallus gallus), western clawed toad (Xenopus tropicalis), green anole (Anolis carolinensis), zebrafish (Danio rerio), and elephant shark (Callorhinchus milii), and identified the putatively orthologous genes within each respective genome. A within-species BLAST search (http://blast.ncbi.nlm.nih.gov/Blast.cgi) (17) was used to identify gene regions determined to be single-copy according to a 60% dissimilarity criterion (i.e., there were no genomic regions present that were more than 40% similar to the potential target according to the within-species BLAST search criterion). We then used among-species BLAST searches to verify that the identified targets occurred in all six vertebrates. The EvolMarkers software for selecting such sequence sets for any combination of species is freely available at http://bioinformatics.unl.edu/cli/evolmarkers/(18). We identified a total of 1449 candidate single-copy CDS markers that were shared across the 6 vertebrate species using this approach. The size of the target CDS varied from 151 bp to 2390 bp. Bait development
We developed separate custom biotinylated RNA bait libraries based on the identified CDS for the H. sapiens, G. gallus, X. tropicalis, A. carolinensis, D. rerio, and C. milii genomes using the MYBaits target enrichment system (MYcroarray, Ann Arbor, Michigan). Each resulting bait library comprised a pooled series of 120 bp baits designed for each target gene. We included a 60 bp overlap between baits, allowing for 2× complete coverage of each target gene (2 × tiling). If the length of the target gene was less than 120 bp, the sequence was extended in length to 120 bp by adding thymine (T) nucleotides. Experimental design
We carried out two experiments. First, in an effort to evaluate the effectiveness of the method across broadly different classes of gnathostome vertebrates, we used baits to capture the selected set of 1449 target single-copy genes from (i) genomic libraries of the species from which the baits were derived (as a positive (+ve) control) and (ii) distantly related species belonging to the same vertebrate class as the species from which the bait sequence was derived. Thus, for fish we used baits designed from the D. rerio genome to capture target DNA sequences from D. rerio (+ve control) and the stickleback (Gasterosteus aculeatus); for amphibians we used baits from the X. tropicalis genome to capture targets in X. tropicalis (+ve control) and the axolotl (Ambystoma mexicanum); for reptiles we used baits from the A. carolinensis genome to capture targets in A. carolinensis (+ve control) and the painted turtle (Chrysemys picta); for birds we used baits from the G. gallus genome to capture targets in G. gallus (+ve control) and the zebra finch (Taeniopygia guttata); and for mammals we used baits from the H. sapiens genome to capture target genes in H. sapiens (+ve control) and the gray short-tailed opossum (Monodelphis domestica). The similarities of the target CDS ranged from 89% to 61% among the 6 vertebrates (ultraconserved element approaches target highly conserved sequences with similarities typically greater than 90% (14, 15).
We compared the effects of two different hybridization and washing schemes on the capture results. Standard conditions comprised 65°C for both hybridization and washing temperature, following the manual for the MYBaits Target Enrichment System (Mycroarray, Ann Arbor, Michigan). The relaxed conditions comprised hybridization under a touchdown gene capture scheme similar to Mason et al. (10) but with a much lower final temperature: 65°C for 11 h, followed by 60°C for 11 h, 55°C for 11 h, and 50°C for 11 h, using 45°C instead of 65°C for the last three washing steps (see detailed protocol). We also tested whether capturing twice (i.e., using the captured products after re-amplification as templates to perform a second round of capture) increases the number of captured target genes.
Secondly, in an effort to explore the effectiveness of the method within a class of gnathostomes, we applied the optimized protocol to a diverse suite of chondrichthyan fishes (sharks, skates, rays, and chimaeras), the group that represents the most basal gnathostome class and whose origins date back to the Devonian period, more than 400 million years ago (19). We used baits based on the elephant shark (C. milii) genome to capture the 1449 target genes from C. milii (+ve control), five skates and rays (Aetobatos narinari, Leucoraja erinacea, Neotrygon kuhlii, Rhinobatos schlegelii, Torpedo formosa), and seven sharks (Carcharhinus amblyrhynchos, Chlamydoselachus anguineus, Etmopterus joungi, Heterodontus portusjacksoni, Isurus oxyrinchus, Orectolobus halei, Squatina nebulosa).
For detailed experimental design information, please refer to the protocols for library preparation using the with-bead method and gene capture that are provided as Supplementary Material. Sequencing and analysis
We pooled 20 indexed samples in equimolar ratios after gene capture. We then measured the pooled product using the CFX Connect Real-Time PCR system (Bio-Rad, Hercules, CA) and used 600 µL of 8 pM sample for paired-end 150 bp sequencing on a MiSeq sequencer (Illumina, Inc, San Diego, CA). Sequence reads associated with each sample were identified by their respective indices. The adapter sequences and sites with lower qualities were trimmed using the Cutadapt application (version 1.1; http://code.google.com/p/cutadapt/)(20). Contigs were assembled de novo for each species using ABySS (version 1.3.4; www.bcgsc.ca/platform/bioinfo/software/abyss)(21). A custom Perl script (Supplementary Material) was used to retrieve the sequences for each gene for each species studied from the assembled contigs. The aligned sequences for each gene were checked by eye to ensure accuracy using Geneious Pro v5.6.2 (Biomatters, Auckland, New Zealand; available at www.geneious.com/). To ensure that only orthologous sequences were used in our final analyses, we only considered cases where the target returned a single hit. We were able to do this because our bait arrays were explicitly designed to target single copy genes in the 6 model organisms (based on a conservative 60% dissimilarity criterion used in the initial selection of target sequences). As a further test of orthology, we used a Hidden Markov Model (HMM), trained on a curated set of alignments of the 1449 single copy targets across the 6 model organisms (22) to assign captured sequences to their corresponding orthologous targets. In all cases, more than 90% of the captured sequences were assigned as orthologs by the HMM. We excluded all genes whose orthology was questionable from further analysis. We caution that paralogs resulting from gene duplication and subsequent asymmetric loss cannot always be ruled out with our approach. The average identity between the captured sequences and the baits was calculated using custom Perl scripts (Supplementary Material). Results and discussion Comparison across different classes of gnathostome vertebrates
The results we obtained varied depending on hybridization and washing conditions (Table 1). Under standard conditions, very few target sequences were successfully retrieved, except for the positive controls, where both bait and target library were derived from the same species. The only exception to this was the result obtained for birds where 554 target sequences were captured for T. guttata (zebra finch) using G. gallus (chicken) baits. We speculate that this is a consequence of the unusually high level of sequence similarity seen among divergent species of birds relative to their non-avian vertebrate counterparts (23). By contrast, under the relaxed hybridization conditions, there was an eight-fold increase in the number of target sequences captured relative to experiments executed under standard conditions. The difference in the effectiveness of gene capture was especially obvious where target species were highly divergent from bait species (Table 1).