Hi. I'm David Baker of the University of Washington, and today I'm going to give you an introduction to protein design. Proteins function by folding to unique native structures, and some representative native structures are shown on this slide. Proteins are encoded in genes in our genomes. Each gene encodes one protein, and the proteins fold up to these unique native structures in order to carry out their biological function. Native structures of proteins are likely the lowest energy states for the protein sequence, so for each amino acid sequence of a protein there corresponds an energy landscape, of which I've shown a cartoon here, and there are many different possible conformations a protein can have. The native state of a protein is the lowest energy state, which I've shown here. There are two research problems I'm going to describe today. The first problem is the problem of predicting protein structure. In our genomes, we have on the order of 30,000 different genes. Each encodes a unique protein, and each organism that exists on Earth has a different genome with a different complement of genes, and hence proteins. So, there's a general problem of predicting what the structures and functions of these proteins are. So, the top arrow shows going from an amino acid sequence to a 3-dimensional structure. So, in this case we have a fixed amino acid sequence and we have to find the lowest-energy structure. The inverse problem is the protein design problem, which I'm going to focus on today. In this case, we don't start with a naturally occurring amino acid sequence or a naturally occurring structure. Rather, we start with a brand new structure that we'd like to make and we go backwards to find an amino acid sequence which will fold up to that structure. Both of these problems, the protein structure prediction problem and the protein design problem, are very hard problems, and I'm going to tell you why in the next few slides. 
The first reason they're hard is that a polypeptide chain can have a very large number of different possible conformations. For each amino acid in a protein chain, there are many rotatable bonds, as shown in this schematic, so each amino acid can have on the order of 3 different conformations. So, if you have a 100 residue protein, that means you have 3 conformations for the first one, 3 for the second one, and the total number of possible conformations you get by multiplying together all of these possibilities. So, it's 3 times 3 times 3... 100 times over. So, more generally, if Nres is the number of amino acids in the protein, the number of different conformations is 3 to that power, so 3^Nres. And this is an astronomical number. The second reason that these problems are hard, in particular the design problem, is that there's also an astronomical number of protein sequences. So again, the first residue can be any 1 of the 20 different amino acids. The second position can also be any 1 of the 20 amino acids, so the number of possible sequences is 20 times 20 times 20... that is, 20 to the Nres power, which is again a very, very large number. The third reason that these are hard problems is that we need to find the lowest energy structure for a sequence, for example, in the protein structure prediction problem. It's hard because calculating energies is difficult to do accurately, because proteins have many, many atoms and they're surrounded by water molecules, which also have many atoms. Each water only has three atoms, but there are many, many water molecules. So, we need to calculate energies accurately for systems that have many thousands of atoms. And now what I'm going to do is tell you about how we go about solving these problems. 
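To make the size of these numbers concrete, here is a small worked calculation in Python. It is only an illustration of the arithmetic in the talk: the function names are made up for this example, and the figure of 3 conformations per residue is the rough estimate quoted above, not a measured value.

```python
def count_conformations(n_res, states_per_residue=3):
    """Number of chain conformations if each residue has a few states.

    states_per_residue=3 is the rough estimate from the talk.
    """
    return states_per_residue ** n_res


def count_sequences(n_res, alphabet_size=20):
    """Number of possible sequences with 20 amino acids per position."""
    return alphabet_size ** n_res


if __name__ == "__main__":
    n_res = 100  # the 100-residue protein used as the example in the talk
    conformations = count_conformations(n_res)
    sequences = count_sequences(n_res)
    # Python integers are arbitrary precision, so these are exact values.
    print("conformations:", conformations)
    print("digits in conformation count:", len(str(conformations)))
    print("digits in sequence count:", len(str(sequences)))
```

For a 100 residue protein this gives a 48-digit number of conformations (about 5 x 10^47) and a 131-digit number of sequences, which is why neither space can be searched exhaustively.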
So, to search through the possible conformations for a protein, we try and mimic the actual folding process, and here you see a movie depicting the computer calculation -- this is using the Rosetta methodology which my group and others have been developing for the last 15 years or so -- we try and simulate the actual process of folding so we can sample through and find the lowest energy structures much more quickly than we could if we were sampling all possible configurations, which is essentially impossible. So, this calculation that you see here takes not much longer to actually carry out on a computer than it takes you to watch it. The challenge is that every folding calculation like this, or nearly every one, will end up in a different final structure, so what we need to do is many, many of these independent calculations to build up a picture of what that energy landscape looks like and where the lowest energy structure is. The second problem that I mentioned -- searching through the space of sequences -- we handle as shown in this animation. Starting with a protein backbone for which we want to find a very low-energy sequence, we carry out a calculation in which, at each step, we randomly substitute in a different amino acid identity, and a different side chain conformation for that amino acid, at a randomly selected position. We can do these substitutions very rapidly, we evaluate the energy, and we accept the change if the energy got lower. So, in this way, we can scan through a very large number of possible sequences and quite rapidly identify the lowest energy sequence for a structure. The third problem, the necessity to calculate energies accurately, we solve in the following way. 
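Before turning to the energy model, the sequence search just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Rosetta implementation: `toy_energy`, the buried/exposed pattern, the hydrophobic residue set, and the choice of 3 side chain conformations per residue are all hypothetical stand-ins. The acceptance rule is the one stated in the talk (keep a substitution only if the energy got lower).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AFILMVWY")  # assumed grouping, for illustration only
N_ROTAMERS = 3  # assumed number of side chain conformations per residue


def toy_energy(state, buried):
    """Hypothetical score (lower is better); NOT the Rosetta energy function.

    Penalizes polar residues at buried positions and hydrophobic residues
    at exposed positions, plus a small made-up cost per rotamer index.
    """
    energy = 0.0
    for (aa, rotamer), is_buried in zip(state, buried):
        if (aa in HYDROPHOBIC) != is_buried:
            energy += 1.0
        energy += 0.1 * rotamer
    return energy


def design_sequence(buried, n_steps=20000, seed=0):
    """Random substitutions on a fixed backbone, accepted only if downhill."""
    rng = random.Random(seed)
    state = [(rng.choice(AMINO_ACIDS), rng.randrange(N_ROTAMERS))
             for _ in buried]
    energy = toy_energy(state, buried)
    for _ in range(n_steps):
        pos = rng.randrange(len(state))          # randomly selected position
        old = state[pos]
        state[pos] = (rng.choice(AMINO_ACIDS),   # new amino acid identity
                      rng.randrange(N_ROTAMERS))  # new side chain conformation
        new_energy = toy_energy(state, buried)
        if new_energy < energy:
            energy = new_energy                  # accept: energy got lower
        else:
            state[pos] = old                     # reject: restore old residue
    return "".join(aa for aa, _ in state), energy


if __name__ == "__main__":
    pattern = [True, False] * 5  # hypothetical buried/exposed pattern
    sequence, final_energy = design_sequence(pattern)
    print(sequence, final_energy)
```

With this toy score every position can be optimized independently, so the strictly downhill rule is enough to reach the minimum. A realistic energy function couples positions to each other, and a search would then typically need some way to escape local minima.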
We use a model in which we try and capture the detailed interactions between atoms as accurately as we can, so there are terms in the energy function that favor close atomic packing, but the atoms can't be overlapping, they penalize the burial away from water of polar atoms that would like to interact with solvent, they favor the formation of hydrogen bonding interactions between polar atoms, we model the electrostatic interactions, the favorability of positive and negative charges to be close together, and we also model the bending preferences of the polypeptide chain. So, given what I've told you, the algorithm for searching for the lowest-energy structure for a given amino acid sequence, that was in the movie where the protein structure was moving around, and the algorithm for searching for the lowest-energy sequence for a fixed structure, there are again two problems which we can approach. The first problem is the structure prediction problem where, again, we start from genome sequences and predict the structures and functions of the proteins that are encoded by those genes. The second problem is the design problem, where we start with something completely new that we would like to make and work backwards to identify a sequence which is predicted to fold up to that structure. And, for the remainder of this talk, I'm going to describe some examples of the second type of calculation, the design calculation. First I want to give you an overview of the different types of protein structures found in nature. There in the top left is a depiction of a globular protein, where the secondary structure elements, the alpha-helices and the beta-sheets, come together and form a roughly spherical protein with hydrophobic residues buried in the interior, and it's the burial of those hydrophobic residues away from solvent which stabilizes the protein. 
On the right is a protein that consists of long helices packed together to make, for example in the case of what's shown, a channel protein. In the lower left is a repeat protein in which a very simple module is repeated over and over and over again to make a long filament. And then finally, on the bottom right is a small protein which is held together with disulfide bonds, which are shown in yellow. And, nature accomplishes all the great diversity of biological functions, in our bodies and in all living things, by utilizing these different types of proteins in different circumstances where each one is most appropriate. So, what I'm going to describe now is our efforts to design ideal versions of these classes of proteins, not a protein that exists in nature, but sort of like the Platonic ideal of a globular protein or a repeat protein. This is in contrast to what has come through evolution, which has been the result of natural selection, so random amino acid substitutions, then selection... and so the proteins you actually get have a lot of history in them, and they may have initially functioned in one way and then they were co-opted for something else, so each protein has a lot of idiosyncrasies because of its history. What I'm going to now describe to you is taking what we've learned about these classes of proteins and the algorithms I've described to make, again, sort of idealized protein structures which are free of those types of idiosyncrasies. And, the way this works is I've outlined how we calculate a sequence which is predicted to fold up to a given structure, but that's just the first step. The next step is, since we've designed the protein, we know what its amino acid sequence is because we came up with that amino acid sequence... from the amino acid sequence we can work back to the DNA sequence, that's using the genetic code which was worked out in the 1960s... 
once we know the DNA sequence we can write down... we can essentially buy, or make very easily in the lab, a synthetic piece of DNA that encodes this protein. So, the protein we've designed on the computer will have never existed in nature, it's something completely new, and the real miracle of this is that it's so easy to manufacture DNA these days that we can, for any crazy protein we design on the computer, we can very, very easily make a gene that encodes that protein and once we have that gene we can make the protein in the laboratory by putting the gene into bacteria, growing up the bacteria, we can extract the protein out, and then we can determine whether that protein folds up to the structure that we designed, and we can also measure other properties of the protein. So, what I'm going to tell you about are several design calculations. We set out to make a brand new protein that was an idealized version of what exists in nature. We carried out the design calculation, we designed a gene encoding the designed protein, we put it into bacteria, purified the protein, and then solved the structure. So, I'm going to be showing you the designed models and then the crystal structures of those designs that we determined experimentally. So, the first example is of the class of globular proteins, which are composed of regular secondary structure elements surrounding a hydrophobic core. After we do the design calculation, where we come up with a sequence that's predicted to adopt the structure, and the two structures I'm talking about here are the ones that are shown under the design column on this slide, again they're idealized so all the helices are perfect helices, the strands are perfect strands, and the loops are very regular, there's one more step. We take advantage of the protein structure prediction calculation I described. 
So, we take those sequences and we send them out to volunteers all around the world who participate in a project called Rosetta@home, and these volunteers predict what the structure is of that sequence; they search for the lowest-energy state of that sequence. And, in the plots on the left, you see many, many red dots. Each red dot is the result of a different Rosetta@home volunteer. On the y-axis is the energy that's calculated by the Rosetta program that's running on their computer, and on the x-axis is how far away that low-energy structure they found was from the structure we're trying to make, the one that's in the design column. And, you can see, first of all, how big and complicated the space is by the fact that many of these lowest-energy structures that are found are very far away from the structure that we're targeting. So, the x-axis is root-mean-squared deviation in the atomic coordinates. So, these structures on the right of these plots are 10 Ångstroms... each atom is on average 10 Ångstroms away from where it was supposed to be in the designed model. So, you can see that different people land in different local minima on the landscape, so different ones of those bumps or those wells that I showed in that schematic near the beginning. But, what you can see is true for both of these sequences is that the lower the energy, that's again on the y-axis... the lower the energy the more the structure tends toward the designed model, and so there's almost a funnel shape to these plots where, as you go to lower and lower RMSD, going left, the energy gets lower and lower. So, the lowest-energy structures found by our Rosetta@home volunteers, who really play a critical role in our research, the lowest-energy structures are almost identical to the designed model. 
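The x-axis quantity in these plots, root-mean-squared deviation in the atomic coordinates, can be written out as a short Python function. This is a minimal sketch for illustration: the function name and the plain list-of-tuples coordinate format are assumptions, and it assumes the two structures list the same atoms in the same order and have already been superimposed (it does no alignment of its own).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-squared deviation between two sets of (x, y, z) coordinates.

    Assumes matching atom order and pre-superimposed structures.
    The result is in the same units as the input (e.g. Angstroms).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared distance between corresponding atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Made-up coordinates: every atom displaced by 10 units along z.
    designed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    predicted = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0)]
    print(rmsd(designed, predicted))
```

In the made-up example every atom is displaced by 10 units, so the RMSD is 10, which corresponds to the points on the far right of the plots. A prediction that matches the designed model atom for atom would have an RMSD of 0.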
When we see this property, which is the one that we are looking for, we then manufacture a gene, a synthetic piece of DNA that encodes the design, we make it in the lab, and then we solve the structure, in this case by nuclear magnetic resonance, with colleagues in the NESG structural genomics consortium. And, on the right, the column marked NMR shows the experimentally determined structure, and you can see it's very similar to the designed models in the second column. And, then on the far right are blow-up superpositions of the designed model and the experimental structure, and they show that the side chains in these designs are, in actuality, where we designed them to be. So, we've been able to make such structures pretty routinely now, so we can make brand new globular protein structures like this quite effectively. In fact, a new student coming to my laboratory typically is assigned the project of making up a brand new protein structure, designing it and then characterizing the design in the laboratory. Now, we can get to larger structures in this way... we can make these Platonic ideals of globular proteins and we can put them together to make larger and more complex structures. So, this shows an example of taking two idealized building blocks we've solved the structure of and fusing them together, and in the lower panel, on the left is the designed model and on the right is the crystal structure. So again, this is a completely made up protein, but when we solve its structure experimentally it comes out exactly as we designed it. Now, the second class of proteins I described are not globular, they're not spherical, they can be long and elongated, and this is actually a protein that's very close to my heart because I designed it myself. This protein... a schematic of it is shown on the top right. 
This is composed of 80 residue helices, and I made it taking advantage of the equations that Francis Crick worked out whereby a backbone structure can be described by a small number of parameters, and I can make many, many different such structures by sampling through different possibilities for these parameters. I do that and then I design each possibility and choose the lowest-energy structures. When this protein was manufactured in the lab, I did some initial tests and found it was very stable, and then Joe Rogers, a graduate student in England, was asking me for a protein to do experiments on so I sent him this protein and he sent back this result, which is really quite remarkable. In order to unfold this protein, you have to add extremely high amounts of a chemical denaturant called guanidine, that's on this plot on the left, and the unfolding curves, which you can see as these lines, are pretty flat as you add more guanidine, and then at very high concentrations, over 7 molar, the protein starts to unfold, but only really does this at very high temperature. So, this is something that's simply not seen for naturally occurring proteins. Because these designed proteins are more ideal, they can be much more stable. And, when the crystal structure was solved of this protein, it was found to be nearly identical to the designed model. So, we can make this class of proteins also. I mentioned repeat proteins, that was a third class, and we've also been able to make idealized versions of these types of proteins. So, in the second column here, you see a repeat protein that goes on indefinitely, and on the left is a comparison of the designed model in red to the crystal structure in grey. You can see they're nearly identical. And, on the right you see another example of an infinitely extending repeat protein where we've made one subsegment of it in the lab, and you again see that the crystal structure is nearly identical to the designed model. 
So, we're very excited about these as the basis for new types of nanomaterials. We can make rods, straight rods and curved rods, and start building things out of them. And the final class of proteins, those small disulfide-bonded proteins, are very interesting because they could form the basis of new types of therapeutics because they're very small and easy to make. And, this is work by Vikram Mulligan, a postdoc in the lab, where he's designed very short peptides that are predicted to fold up to unique structures, and there are three examples in the top row of this slide of designs he made, then below that are NMR structures of these peptides when they're actually made in the lab. And again, these peptides come out with very, very similar structures to the designed models. So, I hope I've explained something about the protein structure prediction problem and the protein design problem. I've told you how we go about approaching these problems, and then I've shown you that we can start to design sort of idealized versions of the different classes of proteins that are found in nature, and these proteins will likely be the basis for designing a whole new world of functional proteins to solve modern day problems, and I'll talk about that in another iBio seminar. And, I want to acknowledge the fantastic people who have actually done most of this work. So, Nobu and Rie Koga developed these rules for making idealized protein structures, and I took you through the design of two of their structures. Vikram Mulligan, as I mentioned, did the designed cyclic peptide work. TJ Brunette, Po-Ssu Huang, and Fabio did the work on the repeat proteins. And thank you for your attention.