Structural biologists have their work cut out for them. There are some 2.5 million nonredundant protein sequences in GenBank®, but only about 24,000 protein structures in the Protein Data Bank (PDB). Clearly, increasing structure determination throughput is a top priority. Improved bioinformatics will play a big role in accelerating the pace of protein structure discovery, and with this in mind, Canaves et al. (p. 1040) developed a primer selection tool for amplification of full-length open reading frames (ORFs). Appropriately enough, the result is not small-scale. An unlimited number of sequences of any length can be uploaded, and the tool can generate 1000 primer pairs per minute. The program is accessible on the web, and, for those not content with working from afar, the underlying Perl module engine is also freely accessible. Of course, any primer picker is only as good as its amplification success rate. Canaves et al. tested theirs on targets from one eukaryote, four Archaea, and nine bacteria, obtaining success rates of 60% to 94%. Researchers interested in chipping away at some of the 2.476 million unsolved structures will find this primer selection tool an invaluable first step in their discovery pipeline.
The increasing availability of genomic data is providing the scientific community with an unprecedented amount of information. Consequently, the elucidation of the roles of genes and proteins requires the implementation of novel high-throughput approaches. One of the goals of the Joint Center for Structural Genomics (JCSG) (1), which is one of nine pilot centers funded by the Protein Structure Initiative (2,3,4), is the development of high-throughput methods for large-scale protein structure production. The Primer Selection Tool presented here is a JCSG bioinformatics tool (5) capable of designing large sets of oligonucleotide primers for the simultaneous PCR amplification of full-length open reading frames (ORFs) under the same experimental conditions. To the best of our knowledge, only two similar systems are currently available: (i) the Java®-based Express Primer Tool designed by the Midwest Center for Structural Genomics (6), and (ii) Xpression Primer™ (PREMIER Biosoft International, Palo Alto, CA, USA), a commercial software product for the Windows® and Mac® platforms.
The JCSG Primer Selection Tool has been implemented as a web-based application, providing multi-user capability and easy access. The current implementation of the Tool can generate 1000 primer pairs in <1 min, including input file upload through a broadband Internet connection. There is no limit on the number or length of sequences that can be uploaded. The Tool is accessible through the JCSG web site (http://www.jcsg.org) under Links > JCSG Tools > Primer Selection Tool. The engine underlying the Primer Selection Tool is a Perl module freely available on the BioTechniques web site at http://www.BioTechniques.com/June2004/CanavesSoftware.html.
The required input to the Primer Selection Tool is an uploaded ASCII file containing single or multiple nucleic acid sequences in FASTA format (Figure 1A). The tool also can generate multiple nested primers from a parent target without having to prepare multiple DNA subsequences. This is accomplished by uploading a second tab-delimited file containing the accession code of the parent target and start-end coordinates of the subsequences to be amplified.
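The two input files described above can be illustrated with a minimal Python sketch. It is not the Tool's Perl code. The function names, the column order of the tab-delimited file (accession, start, end), and the use of 1-based inclusive coordinates are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {accession: sequence} from FASTA text.

    The accession is taken to be the first whitespace-delimited token
    after '>' (an assumption about the header layout).
    """
    records: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {acc: "".join(parts) for acc, parts in records.items()}


def parse_coordinates(text: str) -> List[Tuple[str, int, int]]:
    """Parse lines of 'accession<TAB>start<TAB>end' (assumed column order)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        acc, start, end = line.split("\t")[:3]
        rows.append((acc.strip(), int(start), int(end)))
    return rows


def extract_subsequences(
    seqs: Dict[str, str], coords: List[Tuple[str, int, int]]
) -> Dict[str, str]:
    """Slice each parent sequence; coordinates are assumed 1-based, inclusive."""
    out = {}
    for acc, start, end in coords:
        parent = seqs[acc]
        if not 1 <= start <= end <= len(parent):
            raise ValueError(f"coordinates {start}-{end} outside {acc}")
        out[f"{acc}_{start}_{end}"] = parent[start - 1:end]
    return out
```

Each extracted subsequence can then be treated as an independent target, so nested constructs do not require hand-prepared DNA files.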
The melting temperature (Tm) of an oligonucleotide depends on sequence length, GC content, and concentration and type of cation present. Although numerous equations can estimate the theoretical Tm of oligonucleotides, the Tm calculations performed by the Primer Selection Tool are based on the Meinkoth and Wahl Long Probe method (7). The formula used is:
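The exact constants used by the Tool are not reproduced here. For orientation only, long-probe estimates of this kind are commonly written as Tm = 81.5 + 16.6 log10[Na+] + 0.41(%GC) - c/N, where N is the primer length in bases. The Python sketch below implements that general form. The default salt concentration (50 mM) and the length-correction constant c = 600 are assumptions, not values taken from the Tool, and published variants of the equation use different constants.

```python
import math


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def long_probe_tm(seq: str, na_molar: float = 0.05,
                  length_term: float = 600.0) -> float:
    """Estimate Tm (degrees C) with a long-probe style equation.

    Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - length_term/N

    na_molar and length_term are illustrative defaults (assumptions).
    """
    n = len(seq)
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * gc_percent(seq)
            - length_term / n)
```

Under these assumed defaults, longer primers and primers with a higher GC content give higher estimates, which is the behavior the text describes.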
Although the tool is designed to provide primers for a large number of targets under the same amplification conditions, it is usually not possible to generate multiple primers with exactly the same Tm. Therefore, the Primer Selection Tool allows users to specify how far above and below the optimal Tm a primer's Tm may fall and still be accepted (Tm tolerances). Tolerances of ±2.5°C are generally adequate, although users can widen that range if too many targets fail to yield a primer.
Specificity, Tm, and time of annealing are at least partly dependent on primer length, making this parameter critical for successful PCR. In general, oligonucleotides with 18 to 24 bases are very sequence-specific. Longer primers give even higher specificity, but the economic cost increases with length, and many suppliers have substantially higher prices for oligonucleotides longer than 35 bases. Therefore, the Tool's minimal and maximal default lengths for primers are 18 and 35 bases, respectively (Figure 1A). Optional parameters include adding restriction sites, prepending an ATG codon (methionine) to the input sequence, and selecting among multiple output formats (see the user manual supplied with the Primer Selection Tool Perl module on the BioTechniques web site at http://www.BioTechniques.com/June2004/CanavesSoftware.html). The Primer Selection Tool writes the oligonucleotide primer sequences in 5′ to 3′ orientation.
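Writing both primers 5′ to 3′ means that the reverse primer is the reverse complement of the 3′ end of the ORF. The Python sketch below illustrates this, together with the optional ATG and restriction-site additions. The function names are hypothetical, and placing each restriction site at the 5′ end of its primer is an assumption (the usual convention), not a documented behavior of the Tool.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def primer_pair(orf: str, fwd_len: int, rev_len: int,
                prepend_atg: bool = False,
                fwd_site: str = "", rev_site: str = "") -> Tuple[str, str]:
    """Return (forward, reverse) primers, both written 5' to 3'.

    fwd_site and rev_site are optional restriction sites placed at the
    5' end of each primer (assumed placement).
    """
    orf = orf.upper()
    if prepend_atg:
        orf = "ATG" + orf
    if not (1 <= fwd_len <= len(orf) and 1 <= rev_len <= len(orf)):
        raise ValueError("primer length must be between 1 and the ORF length")
    forward = fwd_site + orf[:fwd_len]
    reverse = rev_site + reverse_complement(orf[-rev_len:])
    return forward, reverse
```

For example, `primer_pair("ATGAAACCCGGGTTTTAA", 6, 6)` pairs the first six bases of the ORF with the reverse complement of its last six bases.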
After the user uploads the DNA sequence data and enters any optional parameters, the Tool calculates, for each target sequence, a collection of primers with Tm's within the user-defined tolerance ranges. Primers that fall outside the user-defined length constraints are discarded. Because the program is fast, making minor changes to the Tm tolerances or primer lengths and recalculating is a quick and effective way to obtain acceptable primers for targets that initially fail.
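The filtering step can be sketched in Python as follows. The sketch assumes that a primer for a full-length ORF is anchored at the ORF terminus, so that only its length varies. It enumerates the allowed lengths and keeps candidates whose estimated Tm lies within the tolerance window. The Tm function uses an assumed long-probe form with illustrative constants, and all names are hypothetical.

```python
import math
from typing import List, Tuple


def estimate_tm(seq: str, na_molar: float = 0.05,
                length_term: float = 600.0) -> float:
    """Long-probe style Tm estimate; the constants are assumptions."""
    seq = seq.upper()
    gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc - length_term / len(seq)


def candidate_primers(template: str, target_tm: float,
                      tol_below: float = 2.5, tol_above: float = 2.5,
                      min_len: int = 18,
                      max_len: int = 35) -> List[Tuple[str, float]]:
    """Return (primer, Tm) pairs anchored at the 5' end of `template`.

    `template` must already be in primer sense, 5' to 3'. Candidates
    outside the length limits or the Tm window are not returned.
    """
    low, high = target_tm - tol_below, target_tm + tol_above
    kept = []
    for n in range(min_len, min(max_len, len(template)) + 1):
        primer = template[:n]
        tm = estimate_tm(primer)
        if low <= tm <= high:
            kept.append((primer, tm))
    return kept
```

An empty result corresponds to a failure in primer generation for that target. Calling the function again with wider tolerances or different length limits mirrors the recalculation strategy described above.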
The 3′ terminal position in primers is essential for the control of mispriming. The presence of G or C bases at the 3′ end of primers (GC clamp) helps to promote correct binding due to the stronger hydrogen bonding of G and C bases. GC clamp scores are calculated from the last three bases of each primer according to the following scheme: [GC][GC][GC] = 0; [ATGC][ATGC][AT] = 1; [ATGC][AT][GC] = 2; and [AT][GC][GC] = 3, with 0 corresponding to the worst GC clamp and 3 to the best. For each target, the primer selected is the one within the chosen Tm range that has the best GC clamp score.
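The four patterns in the scoring scheme are mutually exclusive and cover every combination of the last three bases, so they can be expressed as a short decision chain. The Python sketch below is illustrative. The function names are hypothetical, and the tie-breaking rule (the first candidate with the top score wins) is an assumption, because the text does not say how ties are resolved.

```python
from typing import Iterable


def gc_clamp_score(primer: str) -> int:
    """Score the 3' end of a primer from 0 (worst) to 3 (best).

    [GC][GC][GC] = 0; [ATGC][ATGC][AT] = 1;
    [ATGC][AT][GC] = 2; [AT][GC][GC] = 3
    """
    tail = primer.upper()[-3:]
    if len(tail) < 3:
        raise ValueError("primer must have at least three bases")
    third_last, second_last, last = (base in "GC" for base in tail)
    if not last:
        return 1          # 3' terminal base is A or T
    if not second_last:
        return 2          # single G/C at the 3' end
    if not third_last:
        return 3          # exactly two G/C bases at the 3' end
    return 0              # three G/C bases in a row


def pick_best(candidates: Iterable[str]) -> str:
    """Return the candidate with the highest GC clamp score.

    Ties go to the first candidate encountered (an assumption).
    """
    return max(candidates, key=gc_clamp_score)
```

In a full pipeline, `pick_best` would be applied only to the candidates that have already passed the Tm and length filters.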
The program not only calculates and outputs the Tm's for the selected primers, but also their sequences, lengths, GC clamp scores, and GC content in percentage value (Figure 1B). Ideally, the GC content of primers should be between 45% and 55%, although we have not experienced PCR failures due to lower GC contents as long as the primers have optimal GC clamps. Therefore, GC content is not used as a filtering criterion and acceptance or rejection of primers based on their GC content is left to the user.
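Because GC content is reported but not used as a filter, it can be computed as an advisory value and left to the user to act on. The Python sketch below shows one way to do this. The function names and the record layout are hypothetical and do not reproduce the Tool's output format.

```python
from typing import Dict, Union


def gc_content_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    if not primer:
        raise ValueError("empty primer")
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def gc_report(primer: str, low: float = 45.0,
              high: float = 55.0) -> Dict[str, Union[str, int, float, bool]]:
    """Describe a primer without rejecting it.

    `in_ideal_range` is advisory only; the 45%-55% window is the ideal
    range given in the text.
    """
    gc = gc_content_percent(primer)
    return {
        "sequence": primer,
        "length": len(primer),
        "gc_percent": round(gc, 1),
        "in_ideal_range": low <= gc <= high,
    }
```

A primer flagged as outside the ideal range is still returned, consistent with leaving acceptance or rejection to the user.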
The Primer Selection Tool has been successfully used for large-scale design of both prokaryotic and eukaryotic primers. Table 1 reports the success rates for a representative set of 874 eukaryotic and prokaryotic targets processed in 96-sample batches in a plate-based automated high-throughput setup. In two independent experiments (93% and 98% success rates), 362 of 380 eukaryotic targets were successfully amplified. Comparable high-throughput experiments with primers designed using the Midwest Center for Structural Genomics’ Express Primer Tool (6) show amplification success rates for single-organism plates between 72% and 88%. In another three independent experiments, we amplified 494 prokaryotic targets from bacterial and archaeal genomes, with success rates ranging from 60% to 94%. The differences in success correlated with the number of template sources used in each experiment (Table 1). Larger pools of species result in primers with wider ranges of GC content, Tm, and primer lengths, possibly causing a drop in PCR efficiency with respect to experiments with more homogeneous PCR templates and primers. Even in our complex plates containing targets from six species, where efficiencies are 85%-94%, the JCSG Primer Selection Tool still surpasses the performance of the Express Primer Tool. Only for a highly complex plate (11 bacterial and archaeal species per 96-well plate) does our efficiency drop below that achieved by the Express Primer Tool.
In summary, we present a new primer design tool specifically designed for high-throughput proteomics or genomics pipelines. The performance of the tool compares very favorably to the other web-based program currently available, namely Express Primer Tool (6), both in success rate and speed. Our tool also compares favorably to Xpression Primer regarding speed (300 sequences in 2 min for Xpression Primer), cost, and platform independence. We show that, based on well-known and sound primer design concepts (8,9), it is possible to implement a simple yet highly efficient tool for the design of a large set of primers for plate-based high-throughput experiments. This tool, coupled with robotics, is important for improving the productivity of high-throughput genomics and helps further our knowledge of protein structure and function by accelerating the exploration of the increasingly vast amounts of genomic data.
This work was supported by the National Institutes of Health (NIH) Protein Structure Initiative grant no. P50-GM 62411 from the National Institute of General Medical Sciences (