2Division of Chemistry and Chemical Engineering
3Howard Hughes Medical Institute, Division of Biology and Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
Natural evolution has produced an astounding array of proteins that perform the physical and chemical functions required for life on Earth. Although proteins can be reengineered to provide altered or novel functions, the utility of this approach is limited by the difficulty of identifying protein sequences that display the desired properties. Recently, advances in the field of computational protein design (CPD) have shown that molecular simulation can help to predict sequences with new and improved functions. In the past few years, CPD has been used to design protein variants with optimized specificity of binding to DNA, small molecules, peptides, and other proteins. Initial successes in enzyme design highlight CPD's unique ability to design function do novo. The use of CPD for the engineering of potential therapeutic agents demonstrated its strength in real-life applications.
Proteins are linear heteropoloymers, synthesized from 20 different amino acids. They participate in nearly all of the structural, catalytic, sensory, and regulatory functions that are associated with living systems; these myriad roles are made possible by their ability to self-assemble into well-defined structures specified by their sequences. Molecular evolution has shown that varying a protein sequence through mutation and recombination can generate new structures and functions. Drexler firs noted that existing biological systems for protein fabrication could be harnessed to produce nanoscale molecular machines with designed functions (1). However, the potential of this approach can only be fully realized with reliable methods to predict sequences that perform the desired functions. Over the last decade, computational protein design (CPD) has clearly shown its potential as a solution to this problem.
CPD was first conceived as the inverse of the protein-folding problem, since its goal is to generate amino acid sequences that adopt a specific three-dimensional fold (2). CPD utilizes the main-chain coordinates of a known protein structure as a fixed scaffold. Various amino acid types are modeled at each designed position, and potential mutations are suggested based on their interactions with the scaffold and with each other (Figure 1A).
Although the main chain is held fixed during a CPD calculation, various conformations of each amino acid type at each position are sampled to find sequences expected to stabilize the fold and satisfy any additional functional requirements. The distribution of energetically accessible conformations available to each amino acid side-chain is approximated using a set of discrete, low-energy conformations called rotamers (3,4) (Figure 1, B–D). At the beginning of a typical CPD calculation, rotamers of the user-specified amino acid types are assigned to residue positions from a predefined library. The problem is thus to find a choice of rotamer at each position such that the fold is stabilized and the desired function is achieved.Energy Functions
Amino acid sequences and conformations are scored using a set of energy functions designed to reproduce the features of stable proteins. Although the specific energy functions used and their parameters vary between different CPD implementations, most implementations include a function that prevents atomic overlap and favors van der Waals interactions (5) and a function that benefits the formation of hydrogen bonds (6,7). Although interactions between a protein and its aqueous environment are crucial for stability, it would be prohibitively expensive to model water molecules explicitly in a CPD calculation. Therefore, solvation potentials are used to reward the burial of hydrophobic groups and penalize the burial of polar groups; energies are computed using surface area (8,9) or occluded volume models (10,11). Electrostatic interactions may be modeled using Coulomb's law with a constant dielectric (6,12) a statistical pair potential (13), or more sophisticated methods (14). There energy functions were designed to simulate different conformations of a single sequence and can give spurious results when used to choose between different sequences. Therefore, the scoring functions are typically supplemented with heuristic, statistical, and negative design terms to compensate for the limitations of the inverse folding model. These terms include heuristic estimates of side-chain entropy (15), penalties for non-polar exposure (5), statistical rotamer probabilities (13), and composition-based unfolded state energies (13,15). Proteins have been successfully designed with multiple combinations of these functions, but no consensus has yet been reached on the ideal set of functions or the proper weight for each term.Sequence Optimization
The energy functions in CPD are used to compute the interaction energies between each pair of rotamers at different designed positions (pair energies), as well as the interaction energy between each rotamer and the fixed scaffold (singles energies). Given a rotamer at each position, the total energy can be computed by summing the singles for each rotamer and pairs for each pair of rotamers. Typically, all singles and pair energies are precomputed, yielding a combinatorial optimization problem for which the minimum energy solution corresponds to the amino acid sequence and conformation expected to best stabilize the fold. Several optimization algorithms have been used to solve the design problem. Monte Carlo with simulated annealing (16,17,18), genetic algorithms (18,19), and fast and accurate side-chain topology and energy refinement (FASTER) (20,21) are stochastic optimization routines that sequentially improve an existing solution by making perturbations and accepting or rejecting them based on their energies. These algorithms are able to generate reasonable solutions quickly and will always give an answer, regardless of the difficulty of the design problem. Their running times can be extended indefinitely in an attempt to improve the quality of the production. However, they can never guarantee that the optimal sequence has been found. Alternatively, strategies based on the dead-end elimination (DEE) theorem remove singles and pairs from the calculation that can be mathematically proven not to exist in the optimal solution (22,23). Although DEE can find optimal solutions to smaller problems, it does not scale well, and one must resort to the inexact algorithms described above when it fails to yield a definitive answer.