2Biomathematics and Statistics Scotland, Dundee, UK
3Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Germany
4Australian Centre for Plant Functional Genomics, University of Adelaide, Glen Osmond, Australia
5Leibniz Institute for Age Research, Fritz Lipmann Institute (FLI), Jena, Germany
Second-generation sequencing now provides the potential for low-cost generation of whole-genome sequences. However, for large-genome organisms with high repetitive DNA content, genome-wide short read sequence assembly is currently impossible, with accurate ordering and localization of genes still relying heavily on integration with physical and genetic maps. To facilitate this process, we have used Agilent microarrays to simultaneously address thousands of gene sequences to individual BAC clones and contiguous sequences that form part of an emerging physical map of the large and currently unsequenced 5.3-Gb barley genome. The approach represents a cost-effective, highly parallel alternative to traditional addressing methods. By coupling the gene-to-BAC address data with gene-based molecular markers, thousands of BACs can be anchored directly to the genetic map, thereby generating a framework for orientating and ordering genes, and providing direct links to phenotypic traits.
Physical map–based genome sequencing strategies have recently been superseded by relatively low-cost but ‘quick and dirty’ second-generation sequencing (2GS) based on whole-genome shotgun approaches (1,2). However, for genomes that comprise a high proportion of repetitive DNA and gene families that have evolved by duplication, de novo assembly of 2GS data remains problematic, at best providing information restricted to low copy regions (3). While this type of output has considerable and immediate utility, it remains intrinsically limited through a lack of context, encompassing little information about gene location or order. Current 2GS draft assemblies of large and complex genomes are therefore a temporary substitute for a gold standard reference sequence which remains the primary goal of most research communities.
Based on current technology, such a reference sequence can still realistically only be delivered by robust but expensive physical map–based genome sequencing (4). This is typically centered on high information content fingerprinting (HICF) of bacterial artificial chromosome (BAC) clones that are assembled into contiguous sequences (contigs) prior to sequencing the Minimal Tiling Path (MTP) of clones (5,6). The physical map overcomes issues associated with large recombinationally inert regions typical of large and repetitive genomes, and provides the necessary context for exploring genome structure and organization, both within and among species. A key step in the physical map–based sequencing strategy is anchoring individual BAC clones and contigs to high-resolution genetic maps. This provides both a framework for orientating and ordering contigs, as well as a direct connection to mapped phenotypic traits (4). However, currently established anchoring techniques tend to be low-throughput or involve many processing steps (7).
We describe a novel application of DNA microarray technology to potentially anchor thousands of genes to thousands of BACs. The work was conducted on the large [5.3 Gb (8)] unsequenced barley genome but is generally applicable to any organism for which BAC resources and genetic maps exist.
Materials and methods
Barley BAC library pooling strategy
Genomic DNA from barley cultivar Morex was used to construct a BAC library (HVVMRXALLeA) in pIndigoBac536.Genomic DNA from barley cultivar Morex was used to construct a BAC library (HVVMRXALLeA) in pIndigoBac536.
The original library contained 147,840 BAC clones (in 385 384-well plates) with an average insert size of 125 kb, giving a total of ~3.7× barley genome coverage. The BAC library was divided by the commercial vendor Amplicon Express (Pullman, WA, USA) into 55 sequential Super Pools (SPs), each consisting of seven 384-well plates. For each SP, BAC DNA was initially pooled into seven sets corresponding to each of the 384-well plates, 16 sets corresponding to each of the 384-well plate rows of all seven plates, and 24 sets corresponding to each of the 384-well plate columns of all seven plates. The seven plate, 16 row, and 24 column sets were then further pooled to create five Matrix Plate Pools (MPPs), eight Matrix Row Pools (MRPs), and 10 Matrix Column Pools (MCPs), respectively, for a total of 23 Matrix Pools (MPs). This Matrix pooling strategy results in each BAC being contained in two different MPs for each type of MP (Plate, Row, or Column) (i.e., in two of the five MPPs, two of the eight MRPs, and two of the 10 MCPs). The strategy for pooling, hybridization and deconvolution of the BACs is summarized in Figure 1.
Amplification of pooled BAC DNA
SP and MP BAC DNAs were supplied as above. As the concentration of pooled DNA supplied was very low (~5 ng/µL), high-concentration stocks were produced from each pool using the phi29 DNA polymerase rolling circle amplification method (GenomiPhi V2 DNA Polymerase Kit, GE Healthcare, Piscataway, NJ, USA). Amplification conditions were as recommended by the manufacturer: 2 µL each DNA pool was denatured in 9 µL sample buffer at 95°C for 5 min and subsequently placed on ice. A mixture of 1 µL GenomiPhi phi29 DNA polymerase and 9 µL reaction buffer, containing dNTPs and random hexamers, were added to the denatured samples. After 2 h at 30°C, the reaction was terminated by heating for 10 min at 65°C. DNA yield and quality were checked on 0.7% agarose.
Microarray labeling and hybridization
A custom microarray SCRI_Hv35_44k_v1 was designed (design 020599; Agilent, Santa Clara, CA, USA) representing 42,302 barley expressed sequence tag (EST) contig sequences. Sequences for this design were selected from a total of 50,938 putative unigenes from HarvEST assembly 35 (w w w.harvest-web.org) representing ~450,000 ESTs. Selection was based upon contig orientation derived from (i) homolog y to the nonredundant protein database (NCBI nr), (ii) homology to ESTs from directional cDNA libraries, and (iii) presence of a polyA tract. The microarray was designed with one 60-mer probe per selected unigene in 4x44k format using default parameters in the Agilent eArray software (https://earray.chem.agilent.com/earray). All of the probes were designed to be within or close to the 3′ untranslated regions of the ESTs to maximize selection of unique probe sequences. In addition, the entire EST collection was utilized for cross reference within eArray software to minimize chances of nonspecific hybridization. Full details can be found at ArrayExpress (www.ebi.ac.uk/arrayexpress; accession no. A-MEXP-1728).
Two-channel processing of the microarrays was used, with the entire set of MP DNAs from one SP labeled entirely with Cy3 and the entire set of MP DNAs from another SP labeled with Cy5. In total, 23 arrays were utilized for the two complete sets of MPs (equivalent to two SPs) described in the text. Briefly, amplified MP DNA (200 ng) was labeled using a modified BioPrime Array CGH Genomic DNA Labeling System (Invitrogen, Carlsbad, CA, USA): amplified MP DNA in 11 µL was added to 10 µL 2.5× random primer reaction buffer mix and denatured at 95°C for 5 min prior to cooling on ice. To this was added 2.5 µL modified 10× dNTPs buffer (1.2 mM each of dATP, dGTP, dTTP; 0.6 mM dCTP; 10 mM Tris pH8.0; 1 mM EDTA), 1 µL either Cy3 or Cy5 dCTP (1 nM) and 0.5 µL Klenow enzyme (20 U) and incubated for 16 h at 37°C. Labeled samples for each array were combined and unincorporated dyes removed using Qiaquick PCR Purification Kit (Qiagen, Valencia, CA, USA) as recommended, eluting with 20 µL EB buffer (Qiagen). Hybridizations and washing were performed as recommended (Agilent Protocol v5.5). Scanning was performed with an Agilent G2505B scanner using default settings and data extracted using Agilent FE software (v9.5.3). The experimental design and all microarray data have been submitted to ArrayExpress (www.ebi.ac.uk/arrayexpress; accession no. E-TABM-1017).
Data preprocessing and deconvolution
Data were preprocessed and normalized using GeneSpring software (v7.3; Agilent). Extracted data from each of the two sets of MP arrays were imported separately and normalization performed as for single-channel arrays (data transformation of individual probe measurements to a minimum of 0.01; per chip normalization to the 50th percentile). Data were exported from GeneSpring for downstream analysis.
Deconvolution of plate, row, and column pools was performed using scripting (Supplementary Material) in GenStat (VSN International, Hemel Hempstead, UK) by identifying maximum signal above background in two out of five MPPs, two out of eight MRPs, and two out of 10 MCPs (see strategy in Figure 1). Identification of a unigene in two MPPs, for example, was measured by the log ratio of the difference between the second- and third-largest MPP signals and the difference between the third largest and the median probe signal of the remaining MPPs. As we are expecting two significant outlying signals from each MP type (for a single copy gene), this approach identifies those pools with the first- and second-highest significant probe signals above background. This ratio was then weighted using a Gompertz function and identification confirmed when the weighted ratio (gScore) exceeded an empirically assigned threshold (0.9). A match with the BAC clone was confirmed when the combined identified unigene plates, rows and columns corresponded to the known BAC pooling addresses.
Illumina sequencing of SP1 DNA
Separate samples (five in total) of amplified SP1 DNA were sequenced on an Illumina GAII sequencer (San Diego, CA, USA), with a single lane used for each sample. This resulted in a total of 9,269,872 reads of 54-nucleotide length and 57,650,261 reads of 70-nucleotide length (4.5 Gb total). The sequence reads were quality-trimmed with a quality score cutoff of 20 to remove poor-quality sequence at the start and end of reads. This left a total of 66,827,380 reads of average length of 62 nucleotides. The combined trimmed reads from all samples were mapped to the Harvest 35 unigene reference sequences using both MOSAIK and SOAPAligner (9) read-mapping tools, in order to be able to compare different outcomes and to eliminate tool-based bias. For each of the mapping programs, two mismatches were allowed.
Validation using quantitative PCR
Assays were designed using the online software of the Universal Probe Library (www.roche-applied-science.com/sis/rtpcr/upl/index.jsp; Roche, Basel, Switzerland) (Supplementary Table S4). Quantitative PCR reactions of BAC-pooled DNA with SYBR Green I were performed on 7500 Fast equipment (Applied Biosystems, Foster City, CA, USA). A total reaction volume of 10 µL included 2 µL unamplified BAC pool DNA (10 ng), 5 µL SYBR Green I PCR Master mix (Sigma-Aldrich, Dorset, UK) and gene-specific primers (500 nM each). The PCR program was performed as follows: 1 cycle of 95°C for 10 min, followed by 40 cycles of 95°C for 15 s and 60°C for 60 s. Following amplification, a reaction product melting curve analysis was performed (95°C for 15 s, 58°C for 1 min and 95°C for 15 s) to provide evidence for a single reaction product.
Results and discussion
We designed and used an Agilent custom barley array (ArrayExpress: A-MEXP-1728), representing 42,302 expressed genes, to screen whole genome–amplified DNA derived from multidimensional pools of BAC clones that were also incorporated into the HICF physical map (data not shown). The barley BAC library consists of 147,840 clones which were sequentially assigned to 55 Super Pools (SPs) with seven 384-well plates per SP. Plate, row, and column pools from each SP were further pooled, respectively, into five Matrix Plate Pools (MPPs), eight Matrix Row Pools (MRPs), and 10 Matrix Column Pools (MCPs), for a total of 23 Matrix Pools (MPs) (see “Barley BAC library pooling strategy” in the Materials and methods). Each SP is estimated to represent ~1/16th of the whole barley genome, a feature that reduces the problem of multiple addressing when deconvoluting gene to BAC associations. To date, we have screened a 1× equivalent of the barley genome (i.e., the MPs from 16 SPs; data not shown) to couple gene-to-BAC addresses as a contribution to the International Barley Genome Sequencing Consortium effort (10). We describe here the proof-of-principle involving two SPs: SP1 and SP2.
We independently labeled each MP of SP1 with Cy3 and each MP of SP2 with Cy5, and then co-hybridized each of the 23 pairs of labeled MPs (one MP from each SP) to one of our custom barley microarrays. Therefore, 23 microarrays in total were used to screen all the MPs of both SPs. Following signal extraction from scanned images, data were imported into GeneSpring software (Agilent) to normalize overall signal between each MP microarray. Downstream data processing scripts were written in GenStat software (VSN International) (Supplementary Material) to create a two-stage semi-automated procedure for the identification of BAC clones containing genes corresponding to probes on the microarray. Both stages exploit features of the MP strategy. The first identifies, separately for the plate, row, and column MPs, those probes which have high signals only two times in a single dimension (Figure 2A). The second combines this information across dimensions (plates, rows, and columns) and matches the output with the known BAC-pooling patterns, to allow deconvolution and identification of the individual BAC clones that contain genes corresponding to probes on the microarray (Figure 2, B and C). The scripts can simultaneously deconvolute single and double BAC hits (the latter have two addresses in one of the three dimensions due to high signals in three MPs for that dimension), with the following criteria: (i) the minimum gScore (weighted log-ratio) of plate, row, and column MPs must be >0.9 for both single and double BAC hits; (ii) for a single BAC hit, the third-highest signal in plate, row, and column MPs must be <100; for double BAC hits, the fourth highest signal in plate, row, and column MPs must be <100. If any of these criteria was not met, the probe was filtered out. After filtering, a total of 1364 BACs (531 from SP1, 833 from SP2) hybridized to 3089 microarray probes (Table 1). This included 738 BACs that hybridized to two or more probes and 626 BACs with a unique hit. Of 74 probes that had double BAC hits, 26 were adjacent in the stock plates, suggesting contamination during processing, and were therefore removed. From the annotation of the positive probes, genetic map locations were available for 25% of the corresponding unigenes, providing genetic anchoring information for 352 BACs from SP1 and SP2 (Table 1; Supplementary Table S1).