2020 User Workshop – 2.6 – B Cell Epitope Prediction

Captions
So I'll talk today about B cell epitope predictions. It's my third talk of the day. The outline of the talk is to do some recap of the biology: what makes B cell epitopes, or antibody epitopes, different from T cell epitopes. Then I'll give an overview of the tools available on the IEDB. (Let me turn on my pointer... okay, now I should have a pointer.) Specifically, I'll talk about the linear, sequence-based epitope prediction methods and the discontinuous, 3D-structure-based prediction methods, and I'll briefly talk about computational antibody design in terms of antigen-antibody structure modeling and antibody-protein docking.

Okay, so here's an example of an antibody binding an epitope. The antibody is here; it consists of two chains, the heavy chain in purple and the light chain in yellow, and this is the binding region where it binds to the epitope on the target protein. The target protein in this case is hemagglutinin, the HA protein, which consists of two chains, HA1 and HA2. You see HA2 is here at the bottom. They form a complex, but all of the residues bound by this antibody are on the HA1 chain. So the epitope is the set of residues on the protein that are bound by the antibody, all of these on the HA1 chain. In contrast, the paratope is the set of residues on the antibody that make contact directly with the antigen. Both of these are important. The region of the antibody that makes these contacts is where the V(D)J recombination happens, where there's lots of sequence variability, so this is going to be highly variable to allow antibodies to bind different antigens.

Zooming in here, when you look at the residues that are actually bound, they are colored in red. You see that there are some that are in contact and others that are not. The typical cutoff we use is a 6 angstrom distance.
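The contact definition just mentioned can be sketched in a few lines. This is a minimal illustration, assuming atoms are given as `(residue_id, x, y, z)` tuples; the function name `contact_residues` and the toy coordinates are made up for this example and are not part of any IEDB tool.

```python
import math

def contact_residues(antigen_atoms, antibody_atoms, cutoff=6.0):
    """Return antigen residue ids that have at least one atom within
    `cutoff` angstroms of any antibody atom (a simple epitope definition)."""
    contacts = set()
    for res_id, *xyz in antigen_atoms:
        if res_id in contacts:
            continue  # residue already counted through another atom
        for _, *ab_xyz in antibody_atoms:
            if math.dist(xyz, ab_xyz) <= cutoff:
                contacts.add(res_id)
                break
    return sorted(contacts)

# Toy coordinates, invented for illustration (not from a real structure).
antigen = [(10, 0.0, 0.0, 0.0), (11, 4.0, 0.0, 0.0), (12, 0.0, 2.0, 0.0)]
antibody = [("H52", 0.0, 5.0, 0.0)]
epitope = contact_residues(antigen, antibody)
print(epitope)
```

With these toy values, residues 10 and 12 are in contact while the residue between them, 11, is not, which is a small-scale version of the discontinuity discussed next.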
So you see here that the protein chain continues, and there are stretches that are not in contact and stretches that are, and that makes it a discontinuous epitope. Pretty much every epitope in reality, when you look at it in detail, is discontinuous, in that there are going to be individual amino acids in contact and neighboring ones that are not, and the total binding site is made up of these discontinuous residues in the overall protein sequence.

At the same time, here is an example of a B cell epitope from sperm whale myoglobin. Here you have an alpha helix which is considered to be the epitope, and you can imagine that the antibody probably binds only the residues facing towards the antibody itself. But you can make that peptide, it will actually fold in this conformation, and you will have recognition of the peptide. So unless you have the structural information that tells you which specific residues in a peptide are bound, you might as well consider it linear. So this becomes a bit of a semantic issue. Pretty much every epitope, when you have a crystal structure, will be discontinuous. Linear really typically means that you can make a peptide, synthesize it, and it will bind the antibody. That doesn't mean that every residue in the peptide is bound to the antibody; it means that you can synthesize it and it binds. And if you have the additional information, you'll realize that it's still discontinuous. The most extreme discontinuous case would be an epitope spanning this helix and that helix, with residues here and here: you cannot synthesize a single peptide of reasonable length that includes all of these and would be recognized by an antibody.

So to get to the B cell epitope prediction tools on the IEDB, you again start with tools.iedb.org, and then you click on this B Cell Tools tab, and here are the main tools I'll be talking about. The first question when you do predictions is always: should you be
doing predictions in the first place? As in the T cell case, and probably even more so for antibodies, you should first make sure that there isn't already information available in the IEDB on the antigen that you're interested in. So if you have an antigen of interest, go to the IEDB and search. If there are B cell epitopes in it that have been characterized, that's always going to be better than relying on predictions. You might find some and say, well, maybe there are more. That would be another use case for the prediction tools, where you would use the tools to identify additional candidates beyond what is known and curated in the IEDB.

So, sequence-based epitope prediction. The main and original way epitopes were predicted was linear epitope prediction based on amino acid physicochemical properties. Essentially this means scanning the amino acids in a protein to identify segments that are likely to be potential B cell epitopes based on having certain physicochemical properties. The ones that were originally identified were beta turns, surface accessibility, flexibility, antigenic propensity, and hydrophilicity; those are the scales that we found in the literature that were concrete, that we could identify. One of the issues here is that many of these scales don't necessarily predict something to be an epitope. They are predicting which residues on the protein are going to be on the surface of the protein, because being on the surface is a necessary requirement for being accessible for antibody binding.

The other approach is machine learning methods, which are not scale based but are based on an algorithm that tries to learn from positive examples and distinguish them from negative examples. There's a number of these available. The one that's implemented on the IEDB site is BepiPred, but there's a number of other methods that have been published as well, and these
are all used to optimize the prediction accuracy based on training and testing data sets.

So again, here's the home page of the IEDB where you can get to these prediction tools. When you click on the first one, you get this kind of page that summarizes the tools that are available. You should already be familiar with this kind of interface; it looks similar to how the IEDB T cell tools look. You're asked first to input a sequence here. You could also enter an ID; it says Swiss-Prot here, and actually any UniProt ID would be acceptable, and that would just grab the sequence. Then you can select the prediction method. BepiPred is the recommended method, and it has just been changed so that BepiPred 2.0 is the default, which is an advancement with a larger training set over the original BepiPred method. Besides that, here you see the other scale-based methods that I mentioned before. If you're wondering what these methods are, you can click on that link and get to the help tab, which tells you in more detail what the different methods are, what the parameters that you can enter mean, and any other additional information. For the specific references, you can also click on the reference tab, which lists the papers where these methods have been published. That's always going to be more complete if you want comprehensive information on how each of these methods was generated.

So going back to this example, now entering the sequence for sperm whale myoglobin, putting the sequence in here, hitting submit, and using the BepiPred prediction as the default, you get this kind of output. This output is essentially scanning the sequence of the protein and calculating a score for a window: it's going to be a window of seven, the center is going to be position four, and the score is calculated around that, and that's going to then
calculate, and you see here that you have a low score in this case, which means a lower likelihood of being an epitope, and you have these yellow regions of higher score that go up. The threshold is adjustable: 0.35 is currently set, but you can change it, and that will essentially change this graph. The default threshold that can be chosen for a given protein is just to use the average. You could enter something like minus 1.5 here, which would be somewhere down here, and that would give you some more potentially exposed regions.

Then there are two table summaries of this. The first one is essentially the predicted peptides, and take "peptides" with a grain of salt: these are linear stretches, like the single L here, or these residues here, which is ADVAGH I think, and so on. These are predicted segments that are potential epitope candidates. And then, for each of the individual residues, there is the score that is being output; the L in position two here, for example, is a potential epitope residue.

So that is the one method. As I said, there are multiple, and you can see when you look at them that there are certain commonalities but also differences between what different prediction methods give you. You see here as an example three regions that are called by three different methods as being potential epitopes: different scales in the top two cases, and the machine learning approach of BepiPred in the third. There is always the thought that using a consensus of multiple methods could be a good way of filtering out erroneous predictions and should be more reliable than a single method.

Now overall, when we implemented this and tested all these methods on IEDB data, we found that no method gets AUC values above 0.6, and 0.6 is not a good method. So these linear predictions as represented here didn't really provide us with a meaningful, high-confidence prediction. They are still informative because they
are telling you something about how epitope recognition works. One of the drawbacks, obviously, when you look at these linear predictions is that you're completely ignoring that the antibody recognizes a 3D folded protein, which is essentially not included in this prediction information.

So how do you get at real antibody-antigen complexes? The reason why structure-based predictions are less frequent is that it's much more complicated to get at these. X-ray crystallography has long been the gold standard for antibody-antigen binding site identification, and it probably still is the gold standard. NMR has been a different approach, which adds an idea of the dynamics of the antibody-antigen interaction, forming and leaving, and the different states that this might have. It gives a lot of additional information, but that has in some cases been hard to incorporate into prediction algorithms. What's currently going through the roof is electron microscopy, and I'm very much looking forward to getting more data from it. It has gone down to resolution levels that actually reach amino acid level. The problem with EM used to be that you just got rough ideas: it binds kind of on the top or on the bottom of a protein. Now you're getting down to really good resolutions, and the turnaround time is turning out to be something like 48 hours for something like the coronavirus epitopes right now. So I'm hopeful that we are actually going to have a renaissance of antibody epitope mapping data using this new technique, but that's for the future.

Currently, how do you get at 3D antibody-antigen complexes? Paulo just showed you the SCEptRe tool, which is the most convenient way to get annotated 3D complexes that also get
redundancy removal and then allow for other filtering. Otherwise you can look at the IEDB 3D export. But the coordinates, how the protein actually looks and how the atoms are organized, are always going to be based on 3D structural data that's deposited in the Protein Data Bank, the PDB.

So structure-based epitope predictions are driven by these kinds of data sets. In the IEDB we have two methods implemented, DiscoTope and ElliPro, that are based on geometrical properties combined with amino acid scales. There are other tools like this that include some additional approaches. We are always trying to implement different methods, and there are often issues with licensing, and also with the computational complexity or even the ability to implement something in a web server format at all, but we are always on the lookout for new methods that are being developed. We have also tested a bunch of protein-protein docking algorithms quite extensively, and have found that general-purpose protein-protein docking algorithms do not perform very well at antibody-antigen docking.

So again, this is the tools page where you can get to your predictions. DiscoTope is the second one here. The original DiscoTope version was trained on a set of 75 non-redundant antibody-protein complexes. DiscoTope 2 takes many additional epitopes into account and also considers the biological conformation of the dimers of the proteins that are often in there. Each residue in this data set is assigned a score calculated as a linear combination of normalized values of the different scales, like Parker hydrophilicity; amino acid occurrence, which is normalized to occurrence inside and outside of paratope-epitope complexes; the number of
contacts within 10 angstroms; and the area of relative solvent accessibility. When these features are fed into a machine learning algorithm, they can provide AUC values of 0.7 to 0.73 for the different DiscoTope versions, and these are the papers describing the methods. Compared to 0.6 that is quite a massive step up, but obviously still far from perfect.

This is how you use the IEDB DiscoTope implementation. It's a structure-based antibody epitope prediction, so you need to have a structure. That means you have to give the tool, as a starting point, the structure of the antigen that you're looking for epitopes in. Ideally you have a PDB ID of your structure, but alternatively you can also select a file. So if you have done homology modeling of your antigen of interest, you can save that in PDB format and use it here. One important thing for PDB files is that they often contain proteins with multiple chains, like the HA1/HA2 complex that I showed you on the very first slide, so you have to specify which chain you are interested in for the algorithm to work.

When you go to the PDB, for example, you can search by sequence, and there's a bunch of advanced options. However you get there, you're ultimately going to be on a page like this, where you have the PDB ID, which is in the four-character format, a combination of numbers and letters, in this case for the AMA1 protein from Plasmodium falciparum. And here you see the data that you have, and you see there are two chains, A and E, in this complex, and that tells you which one you should probably choose, depending obviously again on your application, when you want to feed this into DiscoTope. If you don't have a structure available, then you have to go through homology modeling steps for your protein, as I mentioned, and there's a number of different sets of tools out
there that are providing that for you, that are doing comparative modeling or homology modeling. We were looking into implementing those on the IEDB as well, but they're extremely computationally expensive and they're not our area of expertise, so we are providing a set of links where you can go and do this, but it is not something we have implemented on the IEDB itself.

So once you have your structure data, in this case just putting in the PDB four-letter code, you can run the tool. We would now recommend using the 2.0 version of DiscoTope, which should be the default. The one important difference here is that the scores are output on a quite different scale. For DiscoTope 2.0, scores greater than minus 3.7 are considered a good threshold, which gives around 50 percent sensitivity and 75 percent specificity, and the same can be reached with a threshold of minus 7.7 for DiscoTope 1.1. As you see, the scores are quite different.

This is an example of the output you're getting from DiscoTope. You have a similar kind of linear look at the scores across the protein sequence: scores that are lower, highlighted in red here, are less likely to be epitope regions, and the ones that are above are more likely to be. But the whole point of DiscoTope is that it is no longer limited to the linear amino acid sequence; it's structure based. You can adjust the threshold, download the data, and do the table view, which is where you can look at this data exactly the same way as for the sequence-based predictions. But then, more importantly, you can do the 3D view here, where you can look at the residues that are identified. These are all on chain A; these are the positions that have the highest propensity score, and you can highlight them here in your model structure.

So I think that was it for DiscoTope. The other tool is ElliPro, which predicts linear and discontinuous antibody epitopes
based on the geometrical properties of protein structure. This was originally an algorithm proposed by Janet Thornton, published in 1986 in EMBO. It essentially takes a protein, approximates it as an ellipsoid, and then calculates for each residue in the protein how far it protrudes out of this ellipsoid. The idea is that the residues that protrude the most are most likely to be antibody binding sites. That was validated very early on, but there was no computational implementation. So we re-implemented this approach, not really adding anything to it but just trying to make it available for public use on the IEDB website, and it actually has been performing quite well, specifically given how old it is and how small the data set was that it was originally built upon.

So, the same thing: you enter a PDB ID and you give some score thresholds, which are essentially how much the residues protrude. You have a maximum distance for grouping residues together into discontinuous epitopes. So you're going to have your high-scoring residues, and then essentially a circle around them, if you want, at this maximum distance, which identifies the patches where the epitopes are located.

Doing this here for one example and hitting submit, this is the output that you get. First of all you get both of the chains here, A and B, in this protein structure. Say you just care about the first one and submit that. Then you get your output back. Similar to before, you get the individual stretches of linear epitopes that were found, and you get the scores for individual epitopes in here that form a patch in the 3D structure. You can view each of these: you click on each of these view buttons (I'm not showing this right now) and then you see the 3D structure view with the residues highlighted. Or you can click here to view the residue scores, meaning you're
specifically then showing, in this example, the scores of the individual residues in that protein.

So to summarize the B cell epitope predictions: linear and discontinuous (conformational) epitopes can be overlapping, depending on the method of discovery. I said that at the very start, and I should stress it again. It always seems like there is a big discrepancy, but ultimately the main practical difference is this: if you can make a peptide and you have binding to it, people often refer to it as a linear epitope. But in that linear peptide there are still going to be some residues that are in close contact and others that are not, definitely if it's a longer peptide. And the discontinuous, conformational epitopes are those where you cannot make a peptide of 15 or so residues in length that is bound; that typically means that you have different stretches of the protein coming together, and there really is no way to incorporate the binding site in a single linear stretch.

Traditional epitope prediction methods for B cells largely predict surface accessibility. That has been an issue: many of the original performance assessments said, oh, these methods work really well, but what you're really doing is distinguishing the hydrophobic core of a protein from the surface area. Once you adjust for that and really ask which part of the surface your antibody binds to, those performance values drop. If a 3D structure of the antigen is available, either as a reliable model or directly from an experiment, predictions can be further improved using methods like ElliPro or DiscoTope.

So there's a practice exercise here, and similar to before, I'm not going to ask you to go through it; I'm just going to walk you through it as we have it here. You can download the crystal structure of the dengue 2 virus envelope glycoprotein and use the two methods that we have, BepiPred and
DiscoTope to make predictions on it. There's a PDB structure for the glycoprotein, actually bound by an antibody, and that serves both as the input and as a test set, so you can check how well it works in one go.

This is the output from BepiPred, where you get a number of different regions that are likely epitope candidate regions. There's actually a lot of them; if you look at this, it says that nearly half or so of the protein is predicted. You can apply a more stringent threshold, something like 90 percent specificity, which gives you fewer of the residues being identified as above the threshold. So that is BepiPred, and now you can do the same for DiscoTope. You get another set of residues back, and this is the same approach as before. But one thing that is important: when you look at the number of residues included here, you'll notice that this ends at 390, whereas BepiPred ends at 420, even though you started off with the same protein, or presumably you did. One of the issues that is important to remember is that when we're looking at PDB structures, there's the sequence that was put into the crystallization, which is the top one of these here, the SEQRES sequence, and then there's the sequence that is actually resolved in the crystal. What happens quite often is that there are going to be missing residues in the sequence of a crystal structure, because there are floppy ends or floppy loops that are not in one stable conformation, which makes it impossible to resolve them, and they're going to be missing in the 3D structure file. That's actually a quite frequent problem in converting from the 3D structure world to the sequence world and back: these floppy loops are often ignored.

So focusing then on the shared region, the epitopes that were predicted are all highlighted in yellow, and comparing to the 3D
structure that we specifically had for that PDB entry, we find that every one of the contacts that was there was also predicted. So in that sense it's a success. But you also see that a lot of things are predicted that are not found as positive. Now keep in mind that when we're looking at one crystal structure, we have a single antibody in it, and there might be other antibodies binding to other sites. The way to get at those is to look at all the epitope residues from dengue in the IEDB. If you run those through the Immunome Browser, you get a number of regions back where there's plenty of immune reactivity from antibodies against different stretches of the protein, and if we map that, you see there's actually a large amount of overlap between the predicted and the actually recognized regions in the protein. So theoretically the whole exposed surface area of an antigen can be targeted by different antibodies; that's kind of the take-home message that we also have. The question of which residues on a protein are the epitope is kind of misplaced, in that depending on the antibody, different residues in a protein can be epitopes.

One method that has been developed to overcome that was actually a collaboration with Yanay Ofran, who has been hosting this tool on his web server; we have not been able to implement it on the IEDB side, even though we've certainly tried. It allows you to enter an antibody and then gives the specific epitope recognized by it in a protein target. This is based on the complementarity of the specific residues in the CDR3 of the antibody and the specific residues on the protein, so for given sets of antibodies it identifies the most likely matching site.

There's a number of benchmarks that have been done for different protein epitope prediction methods, and this
was one done by Julia when she was in our lab, Julia Ponomarenko, who's now in Madrid. Back then, comparing on a set of 42 crystal structures that had not been published before these methods were frozen, we saw that ElliPro actually performed best. With a new version in 2012, DiscoTope 2 outperformed the others, and definitely outperformed DiscoTope 1, which is why it is currently our best prediction method. There are other methods out there, but as you can see, the range of performance is actually quite similar across different methods; there doesn't seem to be a standout method at this point.

One of the issues that keeps coming up as a reason for the relatively poor performance is the quality of the benchmark data sets. One issue is that the biological unit is often not available, meaning that we are looking at proteins in isolation. Say you're looking at a viral envelope protein in isolation: you're saying, oh, all these awesome binding sites are potentially there. But in the virus as a whole they're not, in that in the biological unit, when that protein is incorporated into a membrane, only some parts of it are actually going to be accessible to antibodies. That might be driving a lot of the selectivity that we're seeing, and it is just not reflected when we're looking at a protein in isolation. The other thing that we're consistently seeing is that some antigens are extremely well characterized: influenza HA, lots of HIV proteins, and then lysozyme. There's a number of model antigens that are extremely well characterized. From a prediction perspective it would be awesome to instead have a thousand different proteins as targets in well-resolved 3D structures, and that's not the case.

That was my last slide, and with that I want to stop here and catch my breath.

Excellent. Well, you will have
a little bit of extra time to catch your breath. Thank you so much, Bjorn, for that.
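The sliding-window scoring described in the talk (a window of seven residues, with the score assigned to the center, so the first scored position is four) and the "use the average as threshold" default can be sketched roughly as follows. This is only an illustration under stated assumptions: `TOY_SCALE` contains invented values rather than any published scale, and `window_scores` and `predicted_segments` are hypothetical helper names, not the IEDB implementation.

```python
# Invented per-residue values for illustration only; a real run would use a
# published scale (for example a hydrophilicity scale) covering all 20 residues.
TOY_SCALE = {"K": 1.0, "L": -1.0}

def window_scores(sequence, scale, window=7):
    """Average the scale over a sliding window and assign it to the center
    residue. Returns (1-based position, residue, score) tuples."""
    half = window // 2
    scores = []
    for center in range(half, len(sequence) - half):
        segment = sequence[center - half:center + half + 1]
        avg = sum(scale[aa] for aa in segment) / window
        scores.append((center + 1, sequence[center], avg))
    return scores

def predicted_segments(scores, threshold):
    """Group consecutive positions scoring at or above the threshold
    into (start, end) stretches."""
    segments = []
    start = prev = None
    for pos, _, score in scores:
        if score >= threshold:
            if start is None:
                start = pos
            prev = pos
        elif start is not None:
            segments.append((start, prev))
            start = None
    if start is not None:
        segments.append((start, prev))
    return segments

toy_sequence = "LLLLKKKKKKKLLLL"
scores = window_scores(toy_sequence, TOY_SCALE)
# Default threshold as mentioned in the talk: the average score of the protein.
threshold = sum(s for _, _, s in scores) / len(scores)
segments = predicted_segments(scores, threshold)
print(segments)
```

Raising the threshold shrinks the predicted stretches and lowering it extends them, which corresponds to adjusting the threshold on the results page.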
Info
Channel: Immune Epitope Database
Views: 1,374
Rating: 5 out of 5
Id: Rq21e-ou3Os
Length: 32min 54sec (1974 seconds)
Published: Thu Nov 12 2020