MDC-Analyzer: A novel degenerate primer design tool for the construction of intelligent mutagenesis libraries with contiguous sites
 
Lixia Tang1, Xiong Wang1, Beibei Ru1, Hengfei Sun1, Jian Huang1, and Hui Gao2
1School of Life Science and Technology
2School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
BioTechniques, Vol. 56, No. 6, June 2014, pp. 301–310
Abstract

Recent computational and bioinformatics advances have enabled the efficient creation of novel biocatalysts by reducing amino acid variability at hot spot regions. To further expand the utility of this strategy, we present here a tool called Multi-site Degenerate Codon Analyzer (MDC-Analyzer) for the automated design of intelligent mutagenesis libraries that can completely cover user-defined randomized sequences, especially when multiple contiguous and/or adjacent sites are targeted. By initially defining an objective function, the possible optimal degenerate PCR primer profiles could be automatically explored using the heuristic approach of Greedy Best-First-Search. Compared to the previously developed DC-Analyzer, MDC-Analyzer allows for the existence of a small amount of undesired sequences as a tradeoff between the number of degenerate primers and the encoded library size while still providing all the benefits of DC-Analyzer with the ability to randomize multiple contiguous sites. MDC-Analyzer was validated using a series of randomly generated mutation schemes and experimental case studies on the evolution of halohydrin dehalogenase, which proved that the MDC methodology is more efficient than other methods and is particularly well-suited to exploring the sequence space of proteins using data-driven protein engineering strategies.

Directed evolution has become a powerful approach for creating proteins with desired properties for industrial applications since the concept was first introduced in the field of protein engineering in 1993 (1). The advantage of this method is that it requires no prior knowledge of enzyme structure or mechanism, and in particular, it can identify critical residues that are located far from the active site of the target protein. Although directed evolution has led to important advances in biological research, the methodology typically generates very large libraries, and laborious screening efforts are often needed (2-4). Furthermore, the vast majority of variants in such libraries have proved to be non-functional. These drawbacks motivated the development of a number of combinatorial and computational strategies for constructing small libraries (5-8). Reetz et al. reported an iterative CASTing method (9), which has proven to be a valuable means of accelerating enzyme evolution by constructing small focused saturation mutagenesis libraries. This approach has been successfully applied in the evolution of several enzymes for industrial applications (8-10).

METHOD SUMMARY

MDC-Analyzer is an extended version of DC-Analyzer allowing the automated design of intelligent mutagenesis libraries that can completely cover user-defined randomized sequences with less redundancy in cases where multiple contiguous and/or adjacent sites are targeted.

To implement these strategies in practice, the construction of high-quality focused mutagenesis libraries is another issue of great concern. In addition to chemical synthesis methodology (11, 12), our previously developed DC-Analyzer shows great potential for constructing a mutagenesis library with one gene per protein (13), where stop codons, rare codons of E. coli, and codon bias are all eliminated in the constructed “small-intelligent” libraries. Meanwhile, a similar approach called the 22c-trick, which uses a mixture of three oligonucleotides (two degenerate primers, NDT [12 codons] and VHG [9 codons], plus one TGG primer), was also reported (14). In the latter case, two redundant codons, for valine and leucine, are produced. Both strategies have proved to be quite efficient in the construction of high-quality saturation mutagenesis libraries (14, 15). However, when the number of mutated sites increases, especially when these sites are located contiguously or in adjacent positions, a large number of degenerate primers are needed. Moreover, the number of generated variants rises exponentially (Table 1). Thus, protein sequence space cannot be efficiently explored.

Table 1. 






With developments in computational tools and bioinformatics, data-driven protein engineering strategies could provide multiple ways to further reduce library size by finding suitable subsets of amino acids at each mutation site, rather than saturated mutagenesis (16, 17). SCHEMA and FamClash were developed with the aim to reduce library redundancy by in silico prescreening (18, 19). SCHEMA is a structure-guided computational algorithm for assessing the fitness of generated recombinant proteins, while the FamClash algorithm uses protein family sequence alignment data to predict incompatibilities in chimeras by analyzing changes in the properties of amino acid pairs in terms of charge, volume, and hydrophobicity. These strategies have been successfully applied in the recombination of several classes of enzymes aimed at improving enzyme thermostability or substrate specificity (20-22). HotSpot Wizard and ConSurf-HSSP are online software applications that can predict the existence of amino acids with certain desired properties in hot spot regions in order to produce functional variants (23, 24). These “rational random” approaches significantly reduce library size and enrich the fraction of functional sequences in libraries. Moreover, simultaneous randomization of the sites of interest allows for the exploration of possible synergistic effects among these sites, which might get lost in an iterative approach.

To bridge the gap between these approaches and their applications, an appropriate degenerate codon strategy is generally adopted to construct “small but smart” libraries (25). In this strategy, appropriate codons are deduced from the sequence alignment of related proteins, and libraries are subsequently created by a standard PCR-based mutagenesis process. The resulting libraries contain a much higher frequency of functional variants than those obtained by a saturation mutagenesis approach. Although the strategy is easy to implement in standard molecular biology laboratories and has been successfully applied in the evolution of esterases (26, 27), the quality of the “small but smart” libraries could still be improved with respect to stop codons, rare codons, and codon bias. More recently, Nov et al. used integer programming methodology to model and improve the quality of the libraries created (28). However, their approach requires finding an efficient optimal integer solution to an integer programming problem with 3375 variables. The methodology becomes even more complicated when the desired amino acid sequence contains more than one mutation site.

Hine and coworkers presented an elegant ProxiMAX randomization strategy as an extended version of MAX methodology for constructing non-degenerate mutagenesis libraries (29, 30). In principle, ProxiMAX can be used to construct such high quality libraries with any of those user-defined mutation schemes, irrespective of the site numbers and their locations (30). The approach has been successfully applied in the saturation mutagenesis of 11 contiguous codons simultaneously. However, its working efficiency is still a main challenge, especially when multiple sites are targeted, since randomization is based on iterative cycles of a series of tandem reactions of blunt-ended ligation, amplification, purification, and digestion with MlyI.

MDC-Analyzer was designed as an extended version of DC-Analyzer aimed at the automatic design of degenerate primer sets that can encode the entire set of desired amino acid sequences with less redundancy in cases where multiple contiguous and/or adjacent sites are targeted. The program was not only theoretically validated with a series of randomly generated mutation schemes but also experimentally evaluated with a case study. Moreover, in the case study of randomization of loop residues 411–414 in phenyl acetone monooxygenase (PAMO) (31), applying MDC-Analyzer significantly decreased the redundancy of the constructed library compared with the “small but smart” strategy. Thus, MDC-Analyzer provides an alternative means to efficiently evolve important proteins.

Materials and methods

Mutagenesis

In the case study of library construction using the MDC-Analyzer methodology, three contiguous residues (T134/P135/F136) of halohydrin dehalogenase from A. radiobacter AD1 (HheC) were randomized according to mutation scheme Sample No. 3 in Supplementary Table S2. For this, 5 pairs of oligonucleotides designed using MDC-Analyzer (5 forward primers carrying mutation sites 134–136: 5′-ATTACCTCTGCAXXXXXXXXXGGGCCT-3′, where XXXXXXXXX represents KGCWYGWYG, ATCWYGWYG, KGCKWCWYG, DKCCAGWYG, and ATCKWCWYG) were synthesized independently and mixed at a ratio of 32:16:32:24:16, according to the number of amino acid sequences encoded by each primer. The primer mixture was used for the subsequent PCR process. The recombinant expression vector pBADHheC containing the wild type hheC gene (GenBank accession no.: AF397296) was used as a template. PCR reactions were performed according to the QuikChange protocol (32). The reaction system (20 µl) contained 1× HF buffer, 200 µM of each dNTP, 1 mM Mg²⁺, 100 ng template, 2 µM of each mixed primer and 0.01 U/µl Phusion high-fidelity DNA polymerase (New England Biolabs, Ipswich, MA). The temperature program used was: 98°C for 3 min, followed by 30 cycles of 10 s at 98°C, 45 s at 50°C, and 2 min at 72°C, with a final incubation at 72°C for 10 min. In the case study of library construction using the DC-Analyzer methodology, two active-site residues (P135/F136) of HheC were fully randomized according to the procedure previously described (13).

Mutagenesis library construction

After amplification, 20 µl reaction mixture was digested with DpnI (New England Biolabs) at 37°C for 2 h to remove the parental template. Five µl of DpnI-digested mixture was used to transform E. coli MC1061 chemically competent cells in cases where 3 contiguous residues (T134/P135/F136) of HheC were partially randomized, while 10 µl of DpnI-digested mixture was used to transform competent cells in cases where 2 residues were fully randomized. The transformed culture was plated on LB agar plates supplemented with 100 µg/mL ampicillin, resulting in ∼400 colonies for MDC-Analyzer randomization and >1000 colonies for DC-Analyzer randomization. Sequence analysis was performed by Invitrogen (Shanghai, China).

MDC-Analyzer

MDC-Analyzer was developed for the construction of intelligent mutagenesis libraries in cases where subsets of desired amino acids on contiguous target sites have been predetermined. Such functions could not be well executed with DC-Analyzer.

Let A = {Ala, Arg, … Val, Stp} be the set of 20 standard amino acids and stop codons, and B = {A, C, G, T} be the set of 4 bases that make up a DNA sequence. There are 15 non-empty subsets of B, and we denote as D the set of all conventional degenerate bases (Supplementary Table S1), where Y represents C or T, R represents G or A, N represents G, A, T or C, etc.

Let X, Y, Z ∈ D each represent 1 of the 15 degenerate bases (Supplementary Table S1). A degenerate PCR primer designed for an amino acid sequence with s sites is a primer with a degenerate codon sequence denoted $X_1Y_1Z_1 - X_2Y_2Z_2 - \dots - X_sY_sZ_s$, where $X_iY_iZ_i$ corresponds to the degenerate codon for site i. There are $(15^3)^s = 3375^s$ possible codon sequence variants. After eliminating the 3 stop codons and 8 rare codons of E. coli (CGA, CGG, AGA, AGG for Arg; CUA for Leu; AUA for Ile; GGA for Gly; CCC for Pro), the remaining $1279^s$ possible variants can be used for the design of optimal degenerate primer profiles.

Let $A_s = \{A_{11} - A_{12} - \dots - A_{1s},\ A_{21} - A_{22} - \dots - A_{2s},\ \dots,\ A_{n1} - A_{n2} - \dots - A_{ns}\}$ represent a set of n desired amino acid sequences with the same number s of sites, where $A_{ij}$ stands for the amino acid at the jth site of the ith desired amino acid sequence. For encoding the set $A_s$ of n desired amino acid sequences with s sites, m degenerate PCR primers $P_s = \{X_{11}Y_{11}Z_{11} - X_{12}Y_{12}Z_{12} - \dots - X_{1s}Y_{1s}Z_{1s},\ \dots,\ X_{m1}Y_{m1}Z_{m1} - X_{m2}Y_{m2}Z_{m2} - \dots - X_{ms}Y_{ms}Z_{ms}\}$ need to be designed, where $X_{ij}Y_{ij}Z_{ij}$ corresponds to the degenerate codon at site j of the ith degenerate PCR primer. Our objective is to find the optimal scheme of m degenerate PCR primers that can encode the entire set of desired amino acid sequences with less redundancy, where m is a number specified by the user.

For a degenerate primer p and a set $A_s$ of desired amino acid sequences, let $\pi(p, A_s)$ denote the subset of desired amino acid sequences in $A_s$ that can be encoded by the degenerate PCR primer p, and $C(p)$ denote the set of codon sequences belonging to p. The cardinality of a set is represented by $|\cdot|$. For example, the degenerate primer AGV (where V = A/C/G) includes the non-degenerate codons AGA, AGG (both encoding Arg), and AGC (encoding Ser); hence $\pi(\mathrm{AGV}, \{\mathrm{Arg}\}) = \{\mathrm{Arg}\}$, $|\pi(\mathrm{AGV}, \{\mathrm{Arg}\})| = 1$, $C(\mathrm{AGV}) = \{\mathrm{AGA}, \mathrm{AGG}, \mathrm{AGC}\}$, and $|C(\mathrm{AGV})| = 3$.
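The two definitions above can be checked on the AGV example with a few lines of Python. This is only an illustrative sketch: the function names `codons`, `translate`, and `pi` are hypothetical, amino acids are written in one-letter code, and the standard genetic code is assumed.

```python
from itertools import product

# IUPAC degenerate base codes (each is a non-empty subset of A, C, G, T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, bases ordered T, C, A, G at each position; "*" = stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(primer):
    """C(p): all non-degenerate DNA sequences the degenerate primer expands to."""
    return {"".join(seq) for seq in product(*(IUPAC[b] for b in primer))}


def translate(dna):
    """Translate a DNA string whose length is a multiple of 3."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def pi(primer, desired):
    """pi(p, A_s): the desired amino acid sequences that the primer can encode."""
    return {translate(seq) for seq in codons(primer)} & set(desired)


print(sorted(codons("AGV")))  # ['AGA', 'AGC', 'AGG']
print(pi("AGV", {"R"}))       # {'R'}  (R = Arg)
```

Because `codons` and `translate` work on whole primers, the same helpers apply unchanged to multi-site primers such as a 9-base degenerate sequence.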

For a scheme of m degenerate PCR primers with s sites, there are $1279^{ms}$ possible good combinations. This number is so large that an exhaustive search for optimal results is practically impossible. A possible optimal scheme is therefore explored using the strategy of Greedy Best-First-Search. For exploring a possible optimal scheme $P_s = \{d_1, d_2, \dots, d_m\}$ of m degenerate PCR primers, we define an objective function (Equation 1), which is maximized in turn for i from 1 to m. In Equation 1, d is a variable ranging over degenerate PCR primers and α is an exponential adjustment. The expression

$$\bigcup_{j=1}^{i-1} \pi(d_j, A_s)$$

gives the union of desired amino acid sequences already encoded by the first i − 1 degenerate PCR primers [i.e., $\{d_1, d_2, \dots, d_{i-1}\}$], and

$$A_s \setminus \bigcup_{j=1}^{i-1} \pi(d_j, A_s)$$

gives the set of desired amino acid sequences that remain uncoded for the ith degenerate PCR primer $d_i$. The ith degenerate PCR primer $d_i$ in the optimal scheme $P_s$ is mainly determined by whether it can encode most of the remaining uncoded amino acid sequences. This is the main rationale behind the objective function in Equation 1. In the equation, α is an experimentally adjustable number that can be set to a value larger than 1. The larger the value of α, the greater the weight given to encoding the remaining uncoded amino acid sequences. It is used to make a trade-off between the encoding library size and the coverage (percentage) of the encoded desired amino acid sequences in the library. Since our method is a local greedy approximation, we set α to 2.5 on the basis of a number of program tests, which showed that the encoding library size increases dramatically when α is set above 3 and that the coverage of encoded desired amino acid sequences becomes very low when α is set below 2.
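The greedy idea can be illustrated for a single mutation site. The sketch below is not the authors' implementation: the score $|\pi(d, \text{remaining})|^{\alpha} / |C(d)|$ is an assumed form chosen only to match the description of α above (Equation 1 itself is not reproduced here), it keeps one scheme rather than a set of best schemes, and all function names are hypothetical.

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Stop codons and the E. coli rare codons listed in the text (DNA alphabet).
FORBIDDEN = {"TAA", "TAG", "TGA", "CGA", "CGG", "AGA", "AGG",
             "CTA", "ATA", "GGA", "CCC"}
ALPHA = 2.5  # value reported in the text


def expand(codon):
    """All non-degenerate codons covered by one degenerate codon."""
    return {"".join(c) for c in product(*(IUPAC[b] for b in codon))}


def encoded(codon):
    """Amino acids (one-letter code) encoded by a degenerate codon."""
    return {CODON_TABLE[c] for c in expand(codon)}


# Candidate degenerate codons that expand to no stop or rare codon.
CANDIDATES = ["".join(c) for c in product(IUPAC, repeat=3)
              if not expand("".join(c)) & FORBIDDEN]


def score(codon, remaining):
    """ASSUMED objective: newly covered amino acids ** alpha / codons used."""
    return len(encoded(codon) & remaining) ** ALPHA / len(expand(codon))


def greedy_primers(desired, m):
    """Pick up to m degenerate codons, each maximizing the assumed score."""
    remaining, chosen = set(desired), []
    for _ in range(m):
        if not remaining:
            break
        best = max(CANDIDATES, key=lambda d: score(d, remaining))
        chosen.append(best)
        remaining -= encoded(best)
    return chosen, remaining


chosen, remaining = greedy_primers({"C", "G", "I"}, m=2)
print(chosen, remaining)
```

With a larger exponent the score favors codons that cover many uncoded amino acids even at the cost of extra codons, which is the trade-off between coverage and library size described above.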

In our program, the outermost loop of the exploration runs over the number of degenerate primers. In each outer loop, say loop i, an ith degenerate PCR primer $d_i$ is searched and added to all of the former schemes $\{d_1, d_2, \dots, d_{i-1}\}$ produced in the previous round, without repetition, according to the Greedy Best-First-Search strategy, which generates all possible current schemes $\{d_1, d_2, \dots, d_i\}$ for this round. The strategy employs the heuristic function defined in Equation 1 to guide the search. As a trade-off between search power and the limitation of computer resources, we prune unnecessary search tree branches by preserving only the top 500 current schemes $\{d_1, d_2, \dots, d_i\}$ for the next round of exploration.

For each previous scheme $\{d_1, d_2, \dots, d_{i-1}\}$, the remaining desired amino acid sequences can be represented by

$$A_s \setminus \bigcup_{j=1}^{i-1} \pi(d_j, A_s).$$

To search the ith degenerate PCR primer $d_i$ for each scheme, we iterated from the first site to the last site of the desired amino acid sequences. For each iteration, say on site k (from 1 to s), the prefixes of the ith degenerate PCR primer $d_i$ can be represented by $X_1Y_1Z_1 - X_2Y_2Z_2 - \dots - X_{k-1}Y_{k-1}Z_{k-1}$, where $X_jY_jZ_j$ corresponds to the degenerate codon that encodes the jth site of the remaining desired amino acid sequences. With prefix $X_1Y_1Z_1 - X_2Y_2Z_2 - \dots - X_{k-1}Y_{k-1}Z_{k-1}$ of the ith degenerate PCR primer $d_i$, the set $A_{k-1}$ of the remaining desired amino acid sequences that can possibly be encoded is represented by Equation 2, where proj(a, k−1) converts amino acid sequence a into the subsequence that contains only the first k−1 sites of a; the corresponding operation on a set of amino acid sequences applies the same conversion to each member of the set. To conduct an exhaustive search for the best degenerate codon for encoding the kth site of the amino acid sequences given in $A_{k-1}$, we iteratively add one of the 1279 variants of XYZ to all of the former prefixes $X_1Y_1Z_1 - X_2Y_2Z_2 - \dots - X_{k-1}Y_{k-1}Z_{k-1}$ produced in the previous round and thus generate all possible prefixes $X_1Y_1Z_1 - X_2Y_2Z_2 - \dots - X_kY_kZ_k$ of one additional site for this round. For evaluation, each generated prefix $X_1Y_1Z_1 - X_2Y_2Z_2 - \dots - X_kY_kZ_k$ is mapped to a tuple $(out_k, fq_k, ra_k, na_k)$, where $out_k$ is 0 when $X_kY_kZ_k$ encodes some undesired amino acids (i.e., amino acids not at site k of the amino acid sequences in $A_{k-1}$), and is 1 otherwise; $fq_k$ is 0 when the amino acids encoded by $X_kY_kZ_k$ do not all occur with the same frequency, and is 1 otherwise; $na_k$ is the number of distinct amino acids that can be encoded by $X_kY_kZ_k$; and $ra_k$ is the ratio of $na_k$ to the number of codons covered by $X_kY_kZ_k$. All of the generated prefixes are then ranked according to their corresponding tuples in lexicographical order, with larger tuples ranking higher. After the ranking, we preserve the top 2000 prefixes $X_1Y_1Z_1 - X_2Y_2Z_2 - \dots - X_kY_kZ_k$ for the next round of exploration until the length of the prefix reaches s.
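The ranking tuple described above can be sketched for a single degenerate codon as follows. This is a minimal illustration under stated assumptions: the function name `rank_tuple` and the four example candidates are hypothetical, the desired set is taken from site 1 of the scheme (CGI,FMQTSD,LST), and Python's built-in tuple comparison is used because it is lexicographic.

```python
from collections import Counter
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rank_tuple(codon, desired):
    """Return (out_k, fq_k, ra_k, na_k) for one degenerate codon at one site."""
    # How many non-degenerate codons encode each amino acid.
    counts = Counter(CODON_TABLE["".join(c)]
                     for c in product(*(IUPAC[b] for b in codon)))
    n_codons = sum(counts.values())
    out_k = 1 if set(counts) <= set(desired) else 0     # 1: no undesired residue
    fq_k = 1 if len(set(counts.values())) == 1 else 0   # 1: equal frequencies
    na_k = len(counts)                                   # distinct amino acids
    ra_k = na_k / n_codons                               # amino acids per codon
    return (out_k, fq_k, ra_k, na_k)


desired_site1 = {"C", "G", "I"}
candidates = ["WYG", "KGY", "DKC", "KGC"]
ranked = sorted(candidates, key=lambda c: rank_tuple(c, desired_site1),
                reverse=True)
for c in ranked:
    print(c, rank_tuple(c, desired_site1))
# KGC (1, 1, 1.0, 2)
# KGY (1, 1, 0.5, 2)
# DKC (0, 1, 1.0, 6)
# WYG (0, 1, 1.0, 4)
```

Placing `out_k` first means that any codon free of undesired residues outranks every codon that encodes one, regardless of how many desired residues the latter covers.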








In MDC-Analyzer, users can either input a mutation scheme manually or load one from a text file. The output data will be sent to a user-defined e-mail address. The running time of the software tool normally ranges from a few seconds to a number of hours, depending on the complexity of the input data. MDC-Analyzer is available at http://immunet.cn/DC/cgi-bin/MDC.pl.

Results and discussion

In addition to removing codon redundancy, mutagenesis library size could be greatly decreased by reducing amino acid alphabets at each targeted position. Such mutagenesis profiles can possibly be predetermined by multiple sequence alignment or computer-based tools (Hotspot Wizard, SCHEMA, etc.). MDC-Analyzer was mainly developed to facilitate the above strategies in mutagenesis library construction, especially when multiple mutation sites are located contiguously or in adjacent positions. The procedure for using MDC-Analyzer in library construction is shown in Figure 1. Once the mutation schemes for the target gene of interest are determined, MDC-Analyzer can output a set of optimal degenerate primers that can be used for the construction of an intelligent library. The criteria for an intelligent library are defined as follows: all desired amino acid sequences are encoded with equal probability; library redundancy is as small as possible; stop codons and the rare codons of E. coli (CGA, CGG, AGA, AGG for Arg; CUA for Leu; AUA for Ile; GGA for Gly; CCC for Pro) are eliminated.




Figure 1. Schematic representation of the construction of an intelligent mutagenesis library.




To demonstrate the use of MDC-Analyzer in the design of degenerate primers, 15 mutation schemes were randomly generated and analyzed (Supplementary Table S2). In each scheme, all mutation sites were partially randomized and produced libraries containing fewer than 1000 amino acid sequences to keep library size in a manageable range. The results showed that in most cases the designed degenerate primer set contains fewer than 10 degenerate primers, and the encoding library size is about 1.2 to 5-fold larger than the number of input amino acid sequences (Supplementary Table S2). The mutation scheme (CGI,FMQTSD,LST) (i.e., Sample No. 3 in Supplementary Table S2) contains three mutation sites that are separated by commas. Amino acids C/G/I, F/M/Q/T/S/D, and L/S/T are expected to be encoded at position 1, position 2, and position 3, respectively. In total, the mutation scheme generates 3 × 6 × 3 = 54 desired amino acid sequences. To encode all 54 desired amino acid sequences, 3 profiles of designed degenerate primer sets were generated by using MDC-Analyzer, marked 2*, 4*, and 5, respectively, according to the number of designed degenerate primers (Supplementary Table S2). The numbers marked with an asterisk indicate that in such cases the designed degenerate primer set does not encode all desired amino acid sequences with an equal probability (Figure 2A). An equal occurrence of desired amino acid sequences can be obtained using the third profile of the designed degenerate primer set (Figure 2B). To construct intelligent libraries, a standard QuikChange mutagenesis protocol (32) could be used when these mutation sites are located contiguously or in close proximity, while megaprimer-based PCR (33) could be applied in cases where the mutation sites are located far from each other.
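The expansion of a mutation scheme into its desired amino acid sequences can be sketched as a Cartesian product over the sites. The function name `desired_sequences` is hypothetical, and the second scheme string is simply the PAMO positions 411 to 414 from the text (S/A, A/V/G/L, L/F/G/Y, S/A/C/T) written in the same comma-separated notation.

```python
from itertools import product


def desired_sequences(scheme):
    """Expand a scheme such as '(CGI,FMQTSD,LST)' into all desired sequences."""
    sites = scheme.strip("()").split(",")
    return ["".join(residues) for residues in product(*sites)]


hhec = desired_sequences("(CGI,FMQTSD,LST)")
pamo = desired_sequences("(SA,AVGL,LFGY,SACT)")

print(len(hhec))   # 54  (3 x 6 x 3)
print(len(pamo))   # 128 (2 x 4 x 4 x 4)
print(hhec[:3])    # ['CFL', 'CFS', 'CFT']
```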




Figure 2. Theoretical and experimental distribution of desired amino acid residues in variants encoded by MDC-Analyzer designed primer sets. One of the 15 randomly generated mutation schemes (Sample No. 3, CGI,FMQTSD,LST) was used as an example. To encode all 54 desired amino acid sequences, 3 profiles of designed degenerate primer sets were generated (Supplementary Table S2), marked 2*, 4*, and 5, respectively. The theoretical distributions of desired amino acid residues in variants encoded by the degenerate primer profiles 4* and 5 are shown in panels (A) and (B), respectively. (C) The experimental distribution of desired amino acid residues in variants encoded by the degenerate primer profile 5. (D) The experimental distribution of desired amino acid residues in variants encoded by DC-Analyzer designed degenerate primers. Three contiguous residues (T134/P135/F136) of halohydrin dehalogenase (HheC) from A. radiobacter


To further evaluate the feasibility of MDC-Analyzer, three contiguous active-site residues (T134/P135/F136) of halohydrin dehalogenase from A. radiobacter AD1 (HheC) were randomized according to the above mutation scheme (CGI,FMQTSD,LST). Using the third profile of the designed degenerate primer set (Supplementary Table S2), the resulting intelligent mutagenesis library theoretically contains 120 amino acid sequences, which include the 54 desired amino acid sequences with an equal occurrence. For 95% completeness of the above library, a library size of about 395 variants is required, according to a calculation using the online program GLUE (34). However, for 95% completeness of a library with 32 × 32 × 32 = 32768 variants (NNS randomization), the required library size increases to ∼9.8 × 10⁴, which is far beyond the manageable range.
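The composition of this library can be checked by expanding the five degenerate sequences listed in the Mutagenesis section. The sketch below assumes the standard genetic code and one-letter amino acid codes; the helper name `encoded_proteins` is hypothetical.

```python
from collections import Counter
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Degenerate regions of the five forward primers (sites 134-136).
PRIMERS = ["KGCWYGWYG", "ATCWYGWYG", "KGCKWCWYG", "DKCCAGWYG", "ATCKWCWYG"]
# Desired sequences of the scheme (CGI,FMQTSD,LST).
DESIRED = {"".join(p) for p in product("CGI", "FMQTSD", "LST")}


def encoded_proteins(primer):
    """Translate every DNA sequence that a degenerate primer expands to."""
    proteins = []
    for bases in product(*(IUPAC[b] for b in primer)):
        dna = "".join(bases)
        proteins.append("".join(CODON_TABLE[dna[i:i + 3]]
                                for i in range(0, len(dna), 3)))
    return proteins


sizes = [len(encoded_proteins(p)) for p in PRIMERS]
library = Counter(aa for p in PRIMERS for aa in encoded_proteins(p))

print(sizes)                     # [32, 16, 32, 24, 16] -> the mixing ratio
print(len(library))              # 120 distinct amino acid sequences
print(DESIRED <= set(library))   # True: all 54 desired sequences are covered
print(set(library.values()))     # {1}: each sequence is encoded exactly once
```

Since every amino acid sequence is encoded by exactly one DNA sequence, mixing the primers in proportion to their sizes gives each of the 120 sequences the same expected frequency.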

To evaluate the quality of the above constructed intelligent mutagenesis library, 60 colonies were randomly picked and sequenced, of which 58 colonies were successfully sequenced. The sequencing results showed that all 58 DNA sequences contain no rare codons or stop codons. Twenty-four of 58 DNA sequences can be used to encode the desired amino acid sequences defined by the mutation scheme (CGI,FMQTSD,LST), and only 5 of these are identical. These results suggest that the designed primers were properly primed. The experimental results for the distribution of amino acids at three target positions are shown in Figure 2C. All expected amino acids occurred at the target positions. By taking into account 5 identical amino acid sequences, the obtained 24 amino acid sequences represent 35% of 54 desired amino acid sequences, which is very close to the theoretical completeness value of 38% calculated using GLUE. With NNS methodology, to obtain 35% completeness of the 54 desired amino acid sequences, more than 14,000 colonies need to be sampled. The sample size is reduced to 3400 when using the strategy of DC-Analyzer, but the size is still too large. For comparison, two active-site residues of HheC (P135/F136) instead of three (T134/P135/F136) were fully randomized using the DC-Analyzer method. The sequencing results of 60 randomly picked colonies showed that all colonies carried mutations at the target sites, and no rare codons or stop codons were obtained in this case. The experimental results for the distribution of expected amino acids at two target positions are shown in Figure 2D. Apparently, the occurrence frequency of all expected amino acids obtained using the DC-Analyzer method is lower than that obtained using the MDC-Analyzer method. Thus, only 4 amino acid sequences defined by the mutation scheme (FMQTSD,LST; generates 6 × 3 = 18 desired amino acid sequences) were obtained, representing 22% of the target sequences. 
In summary, the results demonstrate that randomization using MDC-Analyzer strategy could greatly decrease library size and inherent amino acid bias, which would in turn facilitate screening efficiency.
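The sampling figures quoted above can be approximated with the usual Poisson model for equiprobable variants, in which the expected completeness after sampling L clones from V variants is $1 - e^{-L/V}$. This is a sketch under that assumption, not a reimplementation of GLUE; the function names are hypothetical, and the library size of 20³ = 8000 for three fully randomized sites with one codon per amino acid is likewise an assumption.

```python
import math


def expected_completeness(n_sampled, n_variants):
    """Expected fraction of variants seen after sampling n_sampled clones."""
    return 1.0 - math.exp(-n_sampled / n_variants)


def required_sample_size(completeness, n_variants):
    """Clones to sample so that the expected completeness is reached."""
    return -n_variants * math.log(1.0 - completeness)


# 58 sequenced clones from the 120-member MDC-Analyzer library.
print(round(expected_completeness(58, 120), 2))    # 0.38

# NNS randomization of three sites (32^3 = 32768 variants).
print(round(required_sample_size(0.95, 32 ** 3)))  # about 9.8e4
print(round(required_sample_size(0.35, 32 ** 3)))  # a little over 14,000

# One codon per amino acid at three sites (assumed 20^3 = 8000 variants).
print(round(required_sample_size(0.35, 20 ** 3)))  # about 3400
```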

In addition to decreasing library size, the second notable advantage of MDC-Analyzer over DC-Analyzer and the MAX methodology is that MDC-Analyzer can handle difficult tasks in which multiple contiguous or adjacent mutation sites are randomized simultaneously. For this reason, the performance of MDC-Analyzer was analyzed using the data set collected from the studies of phenyl acetone monooxygenase (PAMO) by Reetz et al. (31). In this case, loop residues 411–414 of PAMO were targeted and partially randomized based on the sequence alignment of the wild-type PAMO and seven other Baeyer-Villigerases. The alignment result showed that at these 4 positions only a limited number of amino acids appear (411: S/A, 412: A/V/G/L, 413: L/F/G/Y and 414: S/A/C/T; generates 2 × 4 × 4 × 4 = 128 desired amino acid sequences). Thus, the four positions were randomized with the amino acids occurring at the same positions in the alignment by adopting appropriate codon degeneracies. For this, one degenerate primer that encoded a total of 864 variants was used to cover the 128 desired amino acid sequences, and 2587 colonies needed to be screened for 95% coverage of the constructed library. The appropriate codon degeneracy strategy was originally used for automated design of degenerate codon libraries (LibDesign) (35), and subsequently used for the construction of “small but smart” libraries (26, 27, 31). By using MDC-Analyzer, 2 degenerate PCR primers that encode a total of 360 variants were generated to cover the above 128 desired amino acid sequences. In the latter case, only 1078 colonies need to be sampled for 95% coverage of the constructed library (Table 2). The screening effort is thus reduced by a factor of 2.4 as compared with the “small but smart” approach. In addition to the elimination of stop codons and rare codons of E. coli, the occurrence of each desired amino acid sequence is equal in the constructed library. None of these properties are considered in the “small but smart” libraries.

Table 2. 






The MDC-Analyzer software tool is presented for the automatic design of optimal degenerate PCR primer profiles in the construction of intelligent mutagenesis libraries when multiple contiguous or adjacent mutation sites are partially randomized simultaneously. The generated intelligent libraries have all the benefits of the libraries constructed by applying DC-Analyzer but with small library redundancy. MDC-Analyzer utilizes the advantages of data-driven protein engineering approaches in finding subsets of desired amino acids in hot spot regions of proteins and exhibits a great ability in randomizing multiple contiguous amino acids simultaneously, which will greatly enhance the power of directed evolution in generating enzyme variants with novel catalytic properties. Moreover, the redundancy of the generated intelligent library represents only a tiny fraction of the redundancy of a saturation mutagenesis library. In conjunction with a data-driven protein engineering strategy, MDC-Analyzer can thus greatly relieve the bottleneck of the saturation mutagenesis approach when multiple positions are targeted. With the availability of several well optimized megaprimer-based PCR strategies (36-39), the intelligent mutagenesis libraries generated by MDC-Analyzer can be constructed as simply as with NNS randomization. In principle, MDC-Analyzer could be more efficient in library construction than ProxiMAX randomization because the latter approach requires several cycles of tandem reactions of blunt-ended ligation, amplification, purification, and digestion. Analysis of data from the literature showed that the libraries constructed using the MDC-Analyzer methodology are more intelligent than the so-called “small but smart” libraries. In summary, MDC-Analyzer represents a valuable alternative tool in protein directed evolution and is particularly relevant to data-driven protein engineering.

Author contributions

L.T. conceived the study, participated in the experimental design, and drafted the manuscript. X.W. carried out part of the computer program design and drafted the manuscript. B.R. carried out the web design. H.S. carried out the experimental work. J.H. carried out the web design and maintenance. H.G. conceived the study, participated in the computer program design, and drafted the manuscript.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (No. 21342005); sub-project of National Science and Technology Major Project on Water Pollution Prevention and Control (No. 2012ZX07203-003). We acknowledge Dennis Wise for polishing the English language of the manuscript.

Competing interests

The authors declare no competing interests.

Correspondence
Address correspondence to Lixia Tang or Hui Gao, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. E-mail: [email protected] or huigao@uestc.edu.cn


References
1.) Chen, K., and F.H. Arnold. 1993. Tuning the activity of an enzyme for unusual environments: sequential random mutagenesis of subtilisin E for catalysis in dimethylformamide. Proc. Natl. Acad. Sci. USA 90:5618-5622.

2.) Liao, H., T. McKenzie, and R. Hageman. 1986. Isolation of a thermostable enzyme variant by cloning and selection in a thermophile. Proc. Natl. Acad. Sci. USA 83:576-580.

3.) Arnold, F.H., and A.A. Volkov. 1999. Directed evolution of biocatalysts. Curr. Opin. Chem. Biol. 3:54-59.

4.) Turner, N.J. 2009. Directed evolution drives the next generation of biocatalysts. Nat. Chem. Biol. 5:567-573.

5.) Lutz, S. 2010. Beyond directed evolution--semi-rational protein engineering and design. Curr. Opin. Biotechnol. 21:734-743.

6.) Chica, R.A., N. Doucet, and J.N. Pelletier. 2005. Semi-rational approaches to engineering enzyme activity: combining the benefits of directed evolution and rational design. Curr. Opin. Biotechnol. 16:378-384.

7.) Fox, R.J., and G.W. Huisman. 2008. Enzyme optimization: moving from blind evolution to statistical exploration of sequence-function space. Trends Biotechnol. 26:132-138.

8.) Reetz, M.T., L.W. Wang, and M. Bocola. 2006. Directed evolution of enantioselective enzymes: iterative cycles of CASTing for probing protein-sequence space. Angew. Chem. Int. Ed. Engl. 45:1236-1241.

9.) Reetz, M.T., and J.D. Carballeira. 2007. Iterative saturation mutagenesis (ISM) for rapid directed evolution of functional enzymes. Nat. Protoc. 2:891-903.

10.) Reetz, M.T., M. Bocola, L.W. Wang, J. Sanchis, A. Cronin, M. Arand, J. Zou, A. Archelas. 2009. Directed evolution of an enantioselective epoxide hydrolase: uncovering the source of enantioselectivity at each evolutionary stage. J. Am. Chem. Soc. 131:7334-7343.

11.) Neuner, P., R. Cortese, and P. Monaci. 1998. Codon-based mutagenesis using dimer-phosphoramidites. Nucleic Acids Res. 26:1223-1227.

12.) Virnekäs, B., L. Ge, A. Plückthun, K.C. Schneider, G. Wellnhofer, and S.E. Moroney. 1994. Trinucleotide phosphoramidites: ideal reagents for the synthesis of mixed oligonucleotides for random mutagenesis. Nucleic Acids Res. 22:5600-5607.

13.) Tang, L., H. Gao, X. Zhu, X. Wang, M. Zhou, and R. Jiang. 2012. Construction of “small-intelligent” focused mutagenesis libraries using well-designed combinatorial degenerate primers. Biotechniques 52:149-158.

14.) Kille, S., C.G. Acevedo-Rocha, L.P. Parra, Z.G. Zhang, D.J. Opperman, M.T. Reetz, and J.P. Acevedo. 2013. Reducing codon redundancy and screening effort of combinatorial protein libraries created by saturation mutagenesis. ACS Synth. Biol. 2:83-92.

15.) Tang, L., X. Zhu, H. Zheng, R. Jiang, and M. Majeric Elenkov. 2012. Key residues for controlling enantioselectivity of Halohydrin dehalogenase from Arthrobacter sp. strain AD2, revealed by structure-guided directed evolution. Appl. Environ. Microbiol. 78:2631-2637.

16.) Chaparro-Riggers, J.F., K.M. Polizzi, and A.S. Bommarius. 2007. Better library design: data-driven protein engineering. Biotechnol. J. 2:180-191.

17.) Davids, T., M. Schmidt, D. Böttcher, and U.T. Bornscheuer. 2013. Strategies for the discovery and engineering of enzymes for biocatalysis. Curr. Opin. Chem. Biol. 17:215-220.

18.) Voigt, C.A., C. Martinez, Z.G. Wang, S.L. Mayo, and F.H. Arnold. 2002. Protein building blocks preserved by recombination. Nat. Struct. Biol. 9:553-558.

19.) Saraf, M.C., A.R. Horswill, S.J. Benkovic, and C.D. Maranas. 2004. FamClash: A method for ranking the activity of engineered enzymes. Proc. Natl. Acad. Sci. USA 101:4142-4147.

20.) Meyer, M.M., L. Hochrein, and F.H. Arnold. 2006. Structure-guided SCHEMA recombination of distantly related β-lactamases. Protein Eng. Des. Sel. 19:563-570.

21.) Heinzelman, P., R. Komor, A. Kanaan, P. Romero, X. Yu, S. Mohler, C. Snow, and F.H. Arnold. 2010. Efficient screening of fungal cellobiohydrolase class I enzymes for thermostabilizing sequence blocks by SCHEMA structure-guided recombination. Protein Eng. Des. Sel. 23:871-880.

22.) Romero, P.A., E. Stone, C. Lamb, L. Chantranupong, A. Krause, A.E. Miklos, R.A. Hughes, B. Fechtel. 2012. SCHEMA-designed variants of human Arginase I and II reveal sequence elements important to stability and catalysis. ACS Synth. Biol. 1:221-228.

23.) Pavelka, A., E. Chovancova, and J. Damborsky. 2009. HotSpot Wizard: a web server for identification of hot spots in protein engineering. Nucleic Acids Res. 37:W376-W383.

24.) Glaser, F., Y. Rosenberg, A. Kessel, T. Pupko, and N. Ben-Tal. 2005. The ConSurf-HSSP database: the mapping of evolutionary conservation among homologs onto PDB structures. Proteins 58:610-617.

25.) Jochens, H., and U.T. Bornscheuer. 2010. Natural diversity to guide focused directed evolution. ChemBioChem 11:1861-1866.

26.) Jochens, H., D. Aerts, and U.T. Bornscheuer. 2010. Thermostabilization of an esterase by alignment-guided focussed directed evolution. Protein Eng. Des. Sel. 23:903-909.

27.) Nobili, A., M.G. Gall, I.V. Pavlidis, M.L. Thompson, M. Schmidt, and U.T. Bornscheuer. 2013. Use of ‘small but smart’ libraries to enhance the enantioselectivity of an esterase from Bacillus stearothermophilus towards tetrahydrofuran-3-yl acetate. FEBS J. 280:3084-3093.

28.) Nov, Y., and D. Segev. 2013. Optimal codon randomization via mathematical programming. J. Theor. Biol. 335:147-152.

29.) Hughes, M.D., D.A. Nagel, A.F. Santos, A.J. Sutherland, and A.V. Hine. 2003. Removing the redundancy from randomised gene libraries. J. Mol. Biol. 331:973-979.

30.) Ashraf, M., L. Frigotto, M.E. Smith, S. Patel, M.D. Hughes, A.J. Poole, H.R. Hebaishi, C.G. Ullman, and A.V. Hine. 2013. ProxiMAX randomization: a new technology for non-degenerate saturation mutagenesis of contiguous codons. Biochem. Soc. Trans. 41:1189-1194.

31.) Reetz, M.T., and S. Wu. 2008. Greatly reduced amino acid alphabets in directed evolution: making the right choice for saturation mutagenesis at homologous enzyme positions. Chem. Commun. (Camb.) 43:5499-5501.

32.) Hogrefe, H.H., J. Cline, G.L. Youngblood, and R.M. Allen. 2002. Creating randomized amino acid libraries with the QuikChange™ multi site-directed mutagenesis kit. Biotechniques 33:1158-1160.

33.) Miyazaki, K., and M. Takenouchi. 2002. Creating random mutagenesis libraries using megaprimer PCR of whole plasmid. Biotechniques 33:1033-1034.

34.) Patrick, W.M., A.E. Firth, and J.M. Blackburn. 2003. User friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries. Protein Eng. 16:451-457.

35.) Mena, M.A., and P.S. Daugherty. 2005. Automated design of degenerate codon libraries. Protein Eng. Des. Sel. 18:559-561.

36.) Tseng, W.C., J.W. Lin, T.Y. Wei, and T.Y. Fang. 2008. A novel megaprimed and ligase-free, PCR-based, site-directed mutagenesis method. Anal. Biochem. 375:376-378.

37.) Sanchis, J., L. Fernández, J.D. Carballeira, J. Drone, Y. Gumulya, H. Höbenreich, D. Kahakeaw, S. Kille. 2008. Improved PCR method for the creation of saturation mutagenesis libraries in directed evolution: application to difficult-to-amplify templates. Appl. Microbiol. Biotechnol. 81:387-397.

38.) Pai, J.C., K.C. Entzminger, and J.A. Maynard. 2012. Restriction enzyme-free construction of random gene mutagenesis libraries in Escherichia coli. Anal. Biochem. 421:640-648.

39.) Tang, L., K. Zheng, Y. Liu, Z. Zheng, H. Wang, C. Song, and H. Zhou. 2013. Exploring the potential of megaprimer PCR in coupling with orthogonal array design for mutagenesis library. Biotechnol. Appl. Biochem. 60:190-195.