MIA: Haitham Elmarakeby, Biologically informed ML for cancer discovery; Primer by Felix Dietlein

Captions
So what I wanted to talk to you about today is genomics tools for interpreting patterns of somatic driver and passenger mutations in cancer. If you are even a little bit into cancer biology, you have probably seen this slide over and over again. Why are driver mutations important? Because we believe that cancer evolves through a sequence of driver mutation events, and each of those driver mutations gives the cancer some survival or proliferation advantage over its normal tissue. But you should also know that this is just half the story, because alongside every driver mutation event there are plenty of passenger mutations; actually, most of the mutations that we see in cancer genomes are passenger mutations. So if you want to pinpoint the mutations in cancer genomes that are functionally important, it becomes like searching for a needle in a haystack: you have your few driver mutations, but most mutations are passengers. You can approach this problem from two angles: either you first try to characterize the passenger mutations, or you first try to characterize the driver mutations. Either way you arrive at a classification of drivers versus passengers, but I think it is important to keep in mind that these two components always go hand in hand if you want a reasonable detection of driver versus passenger mutations.

So why are we interested in driver mutations at all? I think there are three major reasons why we care about cancer genomics in the clinic. The first is that we know which cancer genes are important in tumor formation based on their driver mutations. Many of these genes were discovered not experimentally but largely by computational analysis, by seeing where mutations have accumulated, and they have told us a lot about how cancer actually works. The second reason is that driver mutations are diagnostically and therapeutically important: knowing exactly which genes are involved in which tumor types gives you therapeutic opportunities for designing small-molecule drugs tailored to the individual patient's cancer genome. And the third reason is that at this point we have robust profiling platforms to sequence up to the entire genome of a cancer patient, so we can leverage this knowledge in clinical diagnostics. I think that is pretty remarkable given that twenty years ago the Human Genome Project had just been completed for a single genome, and now we can sequence an entire patient genome as part of routine clinical diagnostics.

It is really important to keep in mind that it is always this interplay of population-level variant discovery and optimizing therapy for the individual patient that makes cancer genomics so characteristic and interesting. On the one hand you have these large patient cohorts, these biobank-scale databases, from which you want to extract the mutations that are functionally important for cancer formation; as I said, searching for the needle in a haystack. On the other hand you want to take these biobank-scale data and focus them on the individual patient, and that is how you identify the individual target lesion in a patient's cancer genome that you might leverage for a small-molecule inhibitor therapy. This is where computational tools in cancer genomics come into play, and why I think they are so important, not only for interpreting biobank-scale data but also for making clinical decisions for the individual patient.

And usually we don't see a patient at only a single time point; we see the patient at multiple time points over the course of the disease. There is typically some type of sequencing involved at the initial diagnosis, to characterize the patient's initial genomic profile and make the initial therapeutic decision. Unfortunately, after the initial clinical response most patients will eventually acquire some form of resistance, and that is where you usually obtain a second profile from the cancer genomics platform to design the second-line therapy. So it is really a sequential follow-up over time, in which cancer genomics analysts work together with pathologists and clinicians to make therapeutic decisions.

So how do we identify driver mutations in cancer? This problem has been in the cancer genomics field for more than fifteen years at this point, so it is far from trivial. Probably one of the first studies that tried to systematically identify driver mutations in cancer was a Science paper in 2006.
They had a cohort that is relatively limited from today's point of view, only eleven breast cancers, eleven colorectal cancers, and a validation cohort of 24 samples; at the time, that was a lot of whole-exome sequencing data. They basically tried to come up with an atlas of driver mutations. What you can see is that the number of genes they came up with was huge, a lot more than people were expecting, and when you looked at those genes, a lot of them didn't really smell right. There were many genes in this catalog that were not really involved in cancer at all; they were more like structural genes, not expected to be mutated in cancer. But amid all those genes you already saw the genes that have now turned out to be real driver events; they were just buried in a large number of false-positive events.

So early on, people figured out that something was probably wrong with the way we were trying to identify cancer genes. It is probably not okay to just pick the genes with the highest number of mutations, just by mutation counts; you need to calibrate your model to the regional background mutation rate. What people came to appreciate over time is that the background mutation rate fluctuates substantially across the cancer genome, and different signals in the epigenomic structure of a cancer genome modulate its mutation rate. So you need to intelligently integrate these modulators of the epigenomic mutation rate into your models to decide whether the number of mutations exceeds what you expect. If you have, for instance, a region where you already expect a large number of mutations just by chance, the sheer fact that there are a lot of mutations won't tell you that this gene is involved in cancer. But if the same number of mutations appears in a region that is not expected to have many mutations, that might actually tell you something about its relevance for the cancer genome.

So that is basically the first criterion, mutational excess, but it is really important to calibrate it against the background mutation rate. One of the first algorithms that did this was the Lawrence et al. paper in 2013, which came up with a model of the background mutation rate and integrated it into a statistical framework. There are basically two ways to estimate the background mutation rate. One is the epigenomic markers I mentioned; that model takes three such markers into account to capture how the mutation rate varies over the genome. The other criterion that turned out to be important is synonymous mutations. The reason is that synonymous mutations are by definition all passenger mutations: they don't change the protein sequence, so they will never have a driver effect. Non-synonymous mutations, on the other hand, can be either drivers or passengers; although they change the amino acid sequence of a gene, most non-synonymous mutations will still be passengers. What you can do is compare the number of non-synonymous mutations with the number of synonymous mutations to get another estimate of whether your mutational excess exceeds the expectation from the background mutation rate. And there is a third criterion, which we'll talk about a little later, concerning which mutation types you see in which genes: for instance, a C→A mutation in lung cancer might make you less excited than a T→G.
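The synonymous-versus-non-synonymous comparison can be sketched in a few lines. This is a toy illustration only, not the actual published model (which also integrates epigenomic covariates and mutation-type spectra); the counts and site numbers below are made up, and I am using a plain Poisson test for the excess.

```python
import math

def poisson_sf(k, mu):
    """Upper tail P(X >= k) for X ~ Poisson(mu), via the complement."""
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

def excess_pvalue(n_nonsyn, n_syn, nonsyn_sites, syn_sites):
    """Is the non-synonymous count of a gene higher than expected from its
    synonymous (passenger-only) mutation rate?"""
    bg_rate = n_syn / syn_sites        # per-site background rate from silent mutations
    expected = bg_rate * nonsyn_sites  # expected non-synonymous count under that rate
    return poisson_sf(n_nonsyn, expected)

# A gene with a clear excess of non-synonymous mutations scores low:
print(excess_pvalue(n_nonsyn=30, n_syn=5, nonsyn_sites=3000, syn_sites=1000))
# A gene whose non-synonymous count matches the background scores high:
print(excess_pvalue(n_nonsyn=15, n_syn=5, nonsyn_sites=3000, syn_sites=1000))
```

The key design choice is that the expectation comes only from mutations that cannot be drivers, so driver mutations inflate the numerator but never the baseline.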
In a follow-up paper, they integrated this tool into a larger analysis and came up with the first atlas of driver genes across 21 tumor types. There were a few other papers that made important contributions to this field that I quickly want to mention. One of them is the OncodriveFML method, which leverages the same concept of non-synonymous versus synonymous mutations but adds another intelligent idea. As I told you, many of the non-synonymous mutations turn out to be passengers, yet our usual statistical models treat them all the same: we just count the number of non-synonymous mutations and the number of synonymous mutations. As you can imagine, if a lot of the non-synonymous mutations are passengers, your model gets diluted. A better way is to prioritize the non-synonymous mutations that are more likely to be drivers. Obviously you cannot do this exactly; you cannot pinpoint that this particular non-synonymous mutation is a driver and that one is a passenger, because otherwise we would basically be done. But it turns out there are bioinformatics scores, and there are a ton of them, that let you predict on more of a gray scale whether a mutation is more likely to be a driver or a passenger. By integrating these functional scores into a statistical model, and that is basically an entire field in itself, you can additionally boost the power of the original model.
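The core of this functional-impact bias idea can be sketched as a permutation test. Everything here is illustrative: the scores are invented stand-ins for a real per-mutation functional score, and the actual OncodriveFML method is considerably more careful about how the null is constructed.

```python
import random

def fm_bias_pvalue(observed_scores, all_possible_scores, n_perm=10000, seed=0):
    """Permutation test: is the mean functional-impact score of the observed
    mutations higher than that of the same number of mutations placed at
    random possible positions in the gene?"""
    rng = random.Random(seed)
    n = len(observed_scores)
    obs_mean = sum(observed_scores) / n
    hits = 0
    for _ in range(n_perm):
        null_mean = sum(rng.choices(all_possible_scores, k=n)) / n
        if null_mean >= obs_mean:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one so the p-value is never exactly zero

# Invented impact scores for every possible mutation in a toy gene:
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
# The observed mutations cluster at the high-impact positions:
observed = [0.7, 0.8, 0.8, 0.7]
print(fm_bias_pvalue(observed, background))
```

The point is exactly the gray-scale idea from above: no single score proves a driver, but a systematic bias toward high-impact positions across a gene's mutations is a signal counts alone would miss.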
In addition to what I have been telling you about characterizing driver versus passenger mutations, there was another development in cancer genomics happening basically in parallel, and that was mutational signatures. I have been talking a lot about how we characterize driver mutations, but as I said at the beginning, another important aspect is to understand passenger mutations, because the two always go hand in hand: the more you understand about passenger mutations, the better you can characterize the outlier events that turn out to be drivers. On the passenger side, people came up with a model called mutational signatures, where you take the trinucleotide context around each mutation, count the number of mutations in each trinucleotide context, and decompose the resulting matrix using a non-negative matrix factorization algorithm. You come up with these so-called mutational signatures, which correspond to different forms of DNA damage or mutational processes in the human cancer genome. It turns out that driver mutations tend to differ from the mutagenic processes described by these mutational signatures. There were actually four papers in the field hinting at the fact that mutational signatures might also be important for characterizing driver mutations. I'll just mention one of them, in which they compared missense mutations, nonsense mutations, and silent mutations, the last being by definition always passengers, and they found that the cumulative probability distribution of the non-silent mutations differed significantly from that of the silent mutations. The signal is not very strong, obviously, but it gives you an additional power boost if you combine it with the mutational excess criterion I just talked about. So that was basically the incentive for our method: we knew there was an additional signal out there.
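The first step of the signature pipeline, collapsing each mutation into one of the standard 96 trinucleotide categories, looks roughly like this. The mutation list is made up for illustration, and the subsequent factorization step (running NMF over a samples-by-96 count matrix) is omitted here.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PYRIMIDINE_SUBS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def context_96(ref_trinuc, alt):
    """Collapse a mutation into one of the 96 trinucleotide categories,
    reverse-complementing so the mutated base is always a pyrimidine (C/T)."""
    ref = ref_trinuc[1]
    if ref in "AG":  # purine reference: switch to the reverse-complement strand
        ref_trinuc = "".join(COMPLEMENT[b] for b in reversed(ref_trinuc))
        alt = COMPLEMENT[alt]
        ref = ref_trinuc[1]
    return f"{ref_trinuc[0]}[{ref}>{alt}]{ref_trinuc[2]}"

# All 96 categories: 6 substitution types x 4 5' bases x 4 3' bases
all_bins = {f"{f}[{s}]{t}" for s in PYRIMIDINE_SUBS
            for f, t in product("ACGT", repeat=2)}
print(len(all_bins))  # 96

# Tally a small, made-up list of (trinucleotide, alternate-base) mutations:
mutations = [("TCA", "T"), ("TGA", "A"), ("ACG", "T")]
counts = Counter(context_96(tri, alt) for tri, alt in mutations)
```

Note that ("TGA", "A") and ("TCA", "T") land in the same bin, T[C>T]A, because the first is just the second seen from the opposite strand; that strand-collapsing is what keeps the model at 96 categories rather than 192.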
But it hadn't really been integrated into these models before. It is basically the same principle by which previous methods came about, for instance the Lawrence paper: people had long known that mutation rates differ across the genome, but someone had to come up with a statistical model for it. That was where I was standing when I started my postdoc. I knew there was something out there, but we needed to capture it and integrate it with all the other statistical criteria the field already had. That was the goal of building this tool, called MutPanning, which integrates nucleotide context for driver gene identification.

So why did we think an additional criterion was even needed? Why not just use the old algorithms, generate more data, and call it a day? It turns out the previous tools had been working well in some cancer types but not so well in others. In cancer types with low mutation rates, such as thyroid cancer or sarcoma, just counting the number of mutations actually works pretty well. You can see this in the upper diagram, showing a non-cancer gene versus a cancer gene: the cancer gene has more mutations, boosted above the expectation by its driver mutations. If the expectation is already very low, this gives you a good signal to discriminate non-cancer genes from cancer genes. However, if your background signal is extraordinarily high, with melanoma being the other extreme of the spectrum, then the additional boost you get from driver mutations is small relative to that background, and it is much harder to use it to distinguish the two classes. You would need a lot of samples to distinguish these two scenarios.

It turns out that passenger mutations all follow the same distribution depending on their surrounding nucleotide context. Driver mutations also tend to occur in these nucleotide contexts, but in addition they are guided by where the functional positions in cancer genes are. Obviously we don't know exactly where those functional positions are, because otherwise we would again be done with classifying drivers versus passengers. But simply knowing that there is an additional signal, a deviation from the expected nucleotide context for driver mutations, gives us a boost in distinguishing the two scenarios. Look at the top of this plot: it is a low-count scenario, and with the additional red dots you could easily distinguish the upper from the lower gene just by counting mutations. Now look at the other scenario: there are already a lot of mutations, and if I just gave you the mutation counts you would have a hard time telling the two apart. But if I told you where those mutations are located, and that the red dots sit somewhere I would not expect them based on the passenger nucleotide context, that would give you information your model would otherwise not have captured. So I'm not saying our model is better for all cancer types; it is particularly better for cancer types with a high background mutation rate, and I think it is very important to consider that how much you can improve on the prior models really depends on the cancer type.
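The power argument above can be made concrete with a tiny simulation. This is a sketch under invented parameters (rates, gene counts, driver boost), not anything from the actual paper: the same five extra driver mutations stand out against a low background but drown in a high one.

```python
import math
import random

def fraction_reaching_driver(bg_mu, n_driver=5, n_genes=2000, seed=1):
    """Simulate passenger-only mutation counts for n_genes genes and return
    the fraction that reach the count of a gene carrying n_driver extra
    driver mutations on top of the expected background."""
    rng = random.Random(seed)

    def poisson(mu):  # Knuth's product-of-uniforms Poisson sampler
        limit, k, p = math.exp(-mu), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    driver_count = int(bg_mu) + n_driver   # expected passengers plus the driver boost
    passengers = [poisson(bg_mu) for _ in range(n_genes)]
    return sum(c >= driver_count for c in passengers) / n_genes

low = fraction_reaching_driver(bg_mu=1.0)    # thyroid-like: low background
high = fraction_reaching_driver(bg_mu=30.0)  # melanoma-like: high background
print(low, high)
```

At the low rate, essentially no passenger-only gene reaches the driver gene's count; at the high rate, a substantial fraction do just by chance, which is exactly why count-based methods need either huge cohorts or an extra signal there.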
All right, so I have been telling you there are two different criteria that went into our model: mutational excess, which people had used before, and the positions in which the mutations are located. Now obviously the big question was...

Question: Sure, absolutely, this is James. Maybe you're going to get into this more, but I'm curious whether you can comment a bit on how, in general, the field benchmarks predictions of driver versus passenger mutations.

Absolutely. I would probably have covered that a little in the discussion, but let's talk about this question now, because I think it is important. First of all, the reason it is very hard in our field to benchmark the performance of your model is that you never know the ground truth. If you want to compute something like a classical AUC, as you would in machine learning, you need to know the ground truth, and the problem here is that you don't. If your model comes up with an additional gene, that could be garbage and noise, or it could be a new discovery. So that was a bit of a problem for us.

One thing we used for benchmarking our model was traditional lists of known cancer genes. There are a bunch of them, including the Cancer Gene Census list, and you look at whether those genes tend to be among the top hits of your significance ranking. The reason we think this is a good criterion is that we probably already know the most common drivers, like TP53 and KRAS, so the top significant hits of your model should be these canonical drivers. Your new discoveries, if there are any, should not be the top significant hits; they should be somewhere more in the tail of what your model sees. If your top significant hit is a gene you have never heard of, and it is more significant than TP53, that is probably not good. So you can come up with, I wouldn't call it an AUC, because that always suggests you know the ground truth, but something AUC-like: you check whether your known cancer genes rank among the top significant hits, and you treat every new discovery as if it were a false positive, although it may not be, and thereby you can approximate the performance of your model.

That is number one. Number two, which we also did for this model, is to look up whether other models had discovered the same gene, or whether the literature gives any hint that the gene might be important. The problem with these literature searches has always been that the literature is contaminated: if you search PubMed for a random gene together with cancer, chances are pretty good that somebody has reported something. So you have to be very careful about how you curate these things. But if you do it in a systematic manner, and we did this both for our findings and for completely random genes, you can check whether your new findings come up more frequently than random genes searched together with cancer. That was another criterion we used.

The third criterion was looking at crystal structures. Obviously you don't have crystal structures for every gene, but for a bunch of genes you do, and you can check whether the mutations your algorithm put the most weight on fall in functionally important positions that it nominated without knowing anything about the crystal structure of the gene. That gave us a third hint.
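That AUC-like criterion can be sketched as follows. The ranking is a made-up toy, and the `census` set is a stand-in for a curated list like the Cancer Gene Census; as described above, every gene outside the list is treated as a false positive, so this lower-bounds the true performance.

```python
def known_gene_auc(ranked_genes, known_drivers):
    """AUC-like score: probability that a randomly chosen known driver ranks
    above a randomly chosen other gene in the significance ordering."""
    pos = [i for i, g in enumerate(ranked_genes) if g in known_drivers]
    neg = [i for i, g in enumerate(ranked_genes) if g not in known_drivers]
    wins = sum(1 for p in pos for n in neg if p < n)  # lower index = more significant
    return wins / (len(pos) * len(neg))

# Toy ranking, most significant first; GENE_X/Y/Z are hypothetical novel hits:
ranking = ["TP53", "KRAS", "GENE_X", "PIK3CA", "GENE_Y", "GENE_Z"]
census = {"TP53", "KRAS", "PIK3CA"}
print(known_gene_auc(ranking, census))  # 8/9: one novel hit outranks PIK3CA
```

A score near 1.0 means the canonical drivers dominate the top of the list, which is the sanity check described above: novel discoveries should live in the tail, not above TP53.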
The fourth, ideal scenario, which we haven't done but which I think the cancer field has to pursue long term, is functional experiments, and that is easier said than done, because the readout assay will not be the same for all of these experiments. The most common readout assay we currently use is proliferation and cell survival, but I would guess a lot of these driver genes have more nuanced effects than that, for instance compensating for the oncogenic stress of other driver mutations, and so on, and the readout assays we currently use might not capture all the effects of those driver mutations. So this fourth criterion is one we haven't done, and I know the cancer field is not doing it to the degree we probably should. For a lot of genes we just have to add a question mark: we know the gene is statistically significant, it kind of smells right, but we don't really know what it exactly does. That is where we are standing with driver versus passenger mutations in the field. Does that answer the question a little bit?

Question: Yeah, got it. It sounds like a real challenge to clearly benchmark methods.

Yeah, it is a real challenge to benchmark methods, and I think it is also a real challenge to prioritize the genes that would actually make new drug targets. Eventually, we don't just want to understand genes for the sake of understanding genes. Of course it is always nice to learn something new about biology, I get it, but eventually we want to help a cancer patient. So I'm asking: how do I treat this patient based on their cancer genome? The problem is that if you just look at the patient's genome, you will find a lot of things based on these algorithms, but you have no idea which of them you should prioritize for drug response. That is where I think there is a significant gap, and all these things eventually have to work hand in hand, experimental people together with structural people, functional people, and more statistical people like me, to actually solve cancer.

All right, coming back to the last criterion I mentioned. Obviously the biggest challenge for this model is to predict how many mutations I would expect just by chance at an individual position in the cancer genome. I have told you a little about the mutational signature model, so why don't we just use that model for this prediction? It is not that the model is bad per se, but it was not developed primarily for predicting the number of mutations at an individual position. Its primary purpose was to detect different forms of DNA damage, not to predict exactly how many mutations fall at a given position. One thing we thought was missing from that model, and a way to boost its power, was to consider the extended nucleotide context around a mutation. That model considers only the trinucleotide context, which is probably fine for characterizing DNA damage, but you have to understand that in our case we don't pool all mutations across the genome; we face a data sparsity issue, because per position you might have only one, two, or three mutations, and you need to build a score from those. The more you can learn from each individual mutation, the better your scores will be. That is why we thought an important part of our model would be to expand beyond the trinucleotide context.
And actually, if you just look at mutational signatures by eye, for instance in melanoma, bladder cancer, or endometrial cancer, it turns out that the context sensitivity of those signatures does not end at the trinucleotide context. The trinucleotide context was more of a mathematical simplification; it is not biology that only the immediately surrounding nucleotides affect context sensitivity. To account for the full biology, we expanded the model into a composite likelihood model, so the way we characterize our mutational signatures is no longer necessarily the 96-bin model; we can build models that capture the entire context sensitivity. For some signatures, as you see for instance with aging or smoking, the sensitivity really does end after the trinucleotide context: the signature is relatively flat beyond it, and there is not much more to learn from the extended context. But for other signatures, the extended context does turn out to be important. I'll skip through this a bit, but what this diagram shows is the classical mutational signatures and the observed mutation frequencies on one axis, and on the other our expanded model reduced back to the trinucleotide context; the point is that our model is not different from the previous model if you restrict it to the trinucleotide context, it just generalizes it to the extended nucleotide context.

So I have talked about why we thought this was important; a few other thoughts on the advantages of our model. Why can't you just expand the bin model? One issue is that if you add more possible nucleotide contexts and just count mutations, the number of bins grows exponentially, so you run into data sparsity and it won't work. The other issue is that at some point you start to overfit to certain nucleotide contexts. You might have, say, a KRAS hotspot surrounded by exactly the same nucleotide context, so you would get one bin with an extraordinarily high mutation count, and this overfitting would kill your entire signal; that is exactly what you don't want, overfitting to the nucleotide context and wiping out the distinction between driver and passenger mutations. And another important concept to bear in mind is that the classical mutational signature model was looking at the most frequent groups: it wanted to find the nucleotide contexts most commonly targeted by passenger mutations. We wanted to look at the other end of the spectrum: the contexts least commonly targeted by passenger mutations. That is why we needed a log-scale model instead of a linear-scale model; if a context is commonly targeted we are not really interested in it, we are interested in the mutations that deviate from it, and that requires a different model. To benchmark the model a little, we applied it to a set of known cancer genes and non-cancer genes and looked at how many mutations deviate from the expected nucleotide context, and it turns out that, yes, cancer genes do have more mutations that deviate from the expected nucleotide context.
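The data-sparsity point is easy to see by just counting bins: with k flanking bases on each side of the mutated position there are 6 × 4^(2k) categories. The cohort size below is an assumed order of magnitude for a large sequencing cohort, not a number from the paper.

```python
# 6 substitution types x 4^(2k) flanking-base combinations for k bases per side
cohort_mutations = 1_000_000  # assumed order of magnitude, for illustration only
for k in range(1, 6):
    bins = 6 * 4 ** (2 * k)
    print(f"k={k}: {bins:>9} bins, ~{cohort_mutations / bins:,.1f} mutations per bin")
```

At k=1 you have the familiar 96 bins with thousands of mutations each; by k=5 there are already several million bins, far more than mutations, which is why the extended context has to be modeled with a composite likelihood rather than raw counts.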
So that's good. The other benchmark validation I talked about was going back to crystal structures and mapping the mutations whose nucleotide context deviated from our expectation. It turns out this model could predict a lot of the positions that might be functionally important, for instance based on protein-DNA interaction, without knowing anything about the protein structure. We think that is why nucleotide contexts are so important: they enable us to predict those positions without knowing anything about what the protein actually does, given that we don't have crystal structures for all proteins.

To summarize how our algorithm works, without getting too deep into the technical details: we faced a problem very similar to where people were standing initially. Initially, people had to quantify the mutational excess above the background mutation rate, but that was on a megabase scale; our problem was to quantify excess, or deviations, on a single-nucleotide scale. You see the problems are related, but a little different. For the megabase scale you need to account for things like epigenomic factors and silent mutations; for the single-nucleotide scale you need to account for the specific context around each mutation. So what you do first is derive the signatures that you already know create passenger mutations in certain contexts, and then you try to associate every individual mutation with one of those processes. To later decide whether mutations are drivers or passengers, you assign them to one of those mutagenic processes and look in particular at the mutations that deviate from them, i.e., the mutations for which you did not find a matching background process. Those mutations we termed mutations in an "unusual nucleotide context".

It is very important to keep in mind, because I get asked this a lot: is every mutation in an unusual nucleotide context a driver mutation? The clear answer is no. It is a bit like the non-synonymous versus synonymous mutations we talked about at the beginning: all of these criteria just give you a boost, telling you on a gray scale whether a mutation tends more toward driver or passenger, but none of them can tell you exactly whether an individual mutation is a driver or a passenger. You need to iterate these criteria over many mutations until you get more and more of a sense of whether a gene contains driver or passenger mutations. So please do not treat nucleotide context as a black-and-white criterion; it is always a gray scale that tells you a bit more about whether a mutation tends toward passenger or driver.

So we applied this model eventually...

Question: Another, more technical question. Is there evidence that these mutational processes themselves vary at a megabase scale, maybe related to DNA accessibility, for the mutagenic processes or for DNA repair? I could imagine that being another layer of complexity.

Yeah, that's a terrific question. The answer is probably yes, but the technical answer is that we don't have enough data to account for it. We were aware that there might be certain regions where the entire process deviates from our model, and obviously we would not want those regions to contaminate our predictions.
We recalibrated the model locally, everywhere, using synonymous mutations, and checked whether the background process deviated. There were certain genes where the entire process deviated too much, and for those we simply said the model is not applicable. So yes, it is an important criterion to take into account. The caveat is that for an individual gene, not a well-known driver like TP53 but something less frequently mutated, you might have only ten mutations, and that is obviously not enough to recalibrate the entire model. The best you can do is avoid false positives: if it deviates, you accept that the criterion cannot be applied in that case. Does that answer the question? It is a great question, and important to bear in mind, because otherwise you may end up with false positives.

Another thing is sequencing artifacts and the careful curation of those artifacts; we can talk more about that in the discussion or offline. It is especially important if you want to do these aggregated, harmonized analyses of large-scale whole-exome sequencing data: you have to keep in mind what goes into the cohort and what does not, because one badly curated cohort can contaminate a lot of things.

Cool. So we ran this across a large number of samples and different cancer types, and came up with an atlas of those genes for every individual cancer type. If you want to check it out, it is in our paper, and we also made it available as a platform at cancergenes.org.
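The local recalibration check described in that answer can be sketched roughly like this. A simple Poisson tail stands in for the real calibration procedure, and all counts below are made up:

```python
# Sketch of the local sanity check: treat synonymous mutations as neutral and
# test whether a gene's observed synonymous burden is compatible with the
# background model; if it deviates too far, the model is "not applicable".
# A simple Poisson tail stands in for the real procedure; counts are made up.

from math import exp

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term, cdf = exp(-lam), 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf

def model_applicable(observed_syn, expected_syn, alpha=0.01):
    """Reject genes whose synonymous burden is implausibly high."""
    return poisson_sf(observed_syn, expected_syn) > alpha

print(model_applicable(observed_syn=4, expected_syn=3.0))   # in line: keep
print(model_applicable(observed_syn=20, expected_syn=3.0))  # deviates: skip
```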
This connects to what we talked about initially regarding benchmarking: it is important to keep in mind what was known before and what new things we come up with. Ideally, the genes at the top of these lists are already known; it is very unlikely that we would discover a new TP53. And indeed, if you look at this atlas in general, most of the top-ranked genes have already been covered by previous databases, and as you go further down the list to the less frequently mutated genes, more novel candidates come up, which is what you would hope to see. For each of those novel genes we assigned one of four confidence levels: the highest being that the gene is a canonical cancer gene that had not been discovered by previous studies, and the lowest being that we could not find even a single paper implicating the gene, so be careful with those.

On the platform side: if you want to use our tool, MutPanning, we also made it available on genepattern.org. It is an interactive online tool where you can add your files by drag and drop and run the analysis on your own dataset, and I highly encourage you to check that out.

Let me quickly come to one other aspect I felt was important to mention. I have been talking a lot about expanding the gene lists for each mutation type, and I do think that is important, but eventually cells and tumors do not behave like linear Excel spreadsheets. They are complex cellular mechanisms, and all of these genes interact with each other in pathways. If you just focus on discovering more individual genes, you may keep coming up with things that are mutated in one percent of patients, but it is hard to evaluate how much these things have in common and how they interact with each other. If your gene is mutated in one percent but sits in a completely different pathway from any gene discovered before, it may not spark that much interest; but a pathway in which each gene is individually mutated in just one percent may, taken together, be mutated as often as a canonical driver gene. The chromatin modification pathway is one example: those genes are notoriously hard to detect because each is mutated relatively rarely. For the cancer, it may not play much of a role which basket it puts its eggs in; the mutations are spread across different genes, and combined, this pathway may be as important as a driver gene like KRAS. We just have not detected it as readily because the individual genes look as if they were mutated randomly. Our study was not the first to do this, but we combined these pathway criteria with the robust statistical criteria from our MutPanning model to arrive at a pathway-based classification together with statistical significance criteria. By combining them, we could quantify not only the contribution of individual genes to the mutational landscape but also the contribution of entire pathways.
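A minimal sketch of the pathway-aggregation arithmetic above, with invented patients and an illustrative member list for the chromatin pathway:

```python
# Toy illustration of pathway aggregation: genes mutated in only a few
# patients each can, combined, rival a canonical driver gene. The patient
# data and the pathway member list below are invented.

patients = {
    "p1": {"KMT2C"}, "p2": {"ARID1A"}, "p3": {"KRAS"},
    "p4": {"KDM6A"}, "p5": {"KRAS"},   "p6": {"ARID1A", "KRAS"},
    "p7": set(),     "p8": {"KMT2C"},
}
chromatin_pathway = {"KMT2C", "ARID1A", "KDM6A"}

def mutated_fraction(genes):
    """Fraction of patients with at least one mutation in any of the genes."""
    hit = sum(1 for muts in patients.values() if muts & genes)
    return hit / len(patients)

print(mutated_fraction({"KRAS"}))           # 0.375: the canonical driver
print(mutated_fraction(chromatin_pathway))  # 0.625: rare genes, combined
```

Each chromatin gene is rare on its own, but the pathway as a whole is hit in more patients than the canonical driver in this toy cohort.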
It turns out that some of those pathways are essentially focused on one individual gene, like KRAS or TP53, while other pathways are not focused on any single gene but are spread across many different genes. It is exactly those spread-out pathways that deserve attention, because I think that is where there is the most room for potential discoveries.

Before I sum up, one more question: this idea of aggregating evidence of functional effect from the gene level to the pathway level, to what degree is that analogous to the aggregation of evidence from the variant level to the gene level? In one case you know precisely which variants belong to which gene, whereas the pathway adds another layer of uncertainty. First of all, for lack of time I have obviously been oversimplifying a lot, and I apologize if I have not given credit to the many people who made important contributions. We are definitely not the first to do this: there were papers before us with these network and pathway ideas, and there were papers that used crystal structure or gene structure for driver discovery. All of these ideas motivated us to include them in our tool, MutPanning. One thing is to look at all these criteria individually; another important thing is to combine them, ideally in one integrated framework for driver gene discovery. What I wanted to emphasize is that if you do that, you may come up with more genes, because knowing that a certain pathway sparks more interest can be an additional criterion to distinguish those scenarios. It is a highly complex field, and we could fill an entire talk with what people did before, how our model differs a little, and how I think all these things should eventually come together. But I do think it is important to mention that the field is moving beyond discovering new linear lists of genes, toward characterizing things at the pathway level.

Cool. Any other questions before I wrap up? There was a question earlier about the extent to which sequencing artifacts are accounted for; is there an easy way to go into a little more detail, or should we come back to that at the end? No, sure, absolutely. First of all, sequencing artifacts are always important to account for, and other models can be completely thrown off by them as well. For this particular tool there were a few things we felt were important: we did a more rigorous filtering of the published cohorts beforehand and benchmarked them against known filters, both at the sample level and at the cohort level. That matters not only for our tool but for driver discovery in general, because if you come up with genes mutated in one percent of patients, it is all the more important to take a step back and ask whether that one percent might just be a sequencing artifact that exists in one individual cohort, or something consistent across cohorts.

Another thing is the publication of synonymous mutations. I hinted earlier that synonymous mutations are important for calibrating our models, but unfortunately not all papers publish them; some publish only the non-synonymous mutations. That was another technical challenge we had to overcome. These points are technical rather than big-picture, but I wanted to bring them to the audience's attention: we have thought about this, and it is important to think about when you want to apply our tool to your own dataset, where looking at the filters really matters.

Cool, so to sum up: do we know all driver mutations today? There are really different opinions in the field. Some people say we keep running additional analyses with more intelligent tools for diminishing returns, and that is a fair answer, but I do think we are not necessarily done. We are definitely done with the genes mutated in more than ten percent of patients; we will not find another KRAS, I would be surprised, unless perhaps in a cancer type that we have not studied at all, where there can always be surprises. In the five-to-ten-percent range we probably have most, but not all, driver genes, so there may be some surprises there. But it really gets down to the two-to-five-percent range: we do see things coming up there, but if you do a power analysis of our models, you will see they are not powered, with the currently available data, to identify all of these, particularly in those tumor types that have high background mutation rates.
Then there is the field below two percent, and I do think those genes can be important, because if you can make a therapy for just two percent of your lung cancer patients, that actually matters: a gene that leads to a drug, even one present in only one or two percent of patients, can make a significant difference. That below-two-percent field is basically still the Wild West, largely uncharted and unexplored. We do come up with some findings, but our models are not powered enough; we definitely need more data, or better models. So it really depends on what level you are looking at, but for making new drugs, even genes below two percent can be very important, especially in combination with other driver genes.

To sum up what we talked about today: most cancer genomes contain only very few driver mutations amid many passenger mutations, so it is like looking for a needle in a haystack. Another thing to keep in mind is that about ten years ago we really did not know much about the landscape of driver mutations; it has taken more than a decade to come up with intelligent models to figure this out, and I do think that, besides data generation, model development will remain a key contribution to figuring out the landscape of driver versus passenger mutations. Most of the driver genes we know today have been identified by statistical algorithms, and the lack of experimental validation is, I think, a key limitation of the field: experimental follow-up is needed for many driver mutations to actually confirm their "driverness." It is not enough to say that they are significantly mutated; eventually you need to know what they actually do in cancer signaling to really understand what a driver mutation is, and it always depends on your definition of driver versus passenger. Largely, the field of driver genes has been fueled by the combination of large-scale data generation and the gradual improvement of statistical algorithms, and I do think we need to bring these concepts closer together, as I showed with the pathway analyses, to eventually come up with a holistic map of the landscape of driver and passenger mutations in cancer genomes.

I would like to thank the many people who worked on this study, foremost of course my mentor's lab, the Van Allen lab, and the wonderful collaboration we had with the Sunyaev lab at Brigham and Women's Hospital; also the people at the CHIP program and at BCH, the Broad Institute, and my funding sources. Thanks so much again for having me; it was a great pleasure giving this talk, and I am looking very much forward to our discussion section later. Thank you.

I'll continue talking today about cancer genomics as well. I am interested in customizing neural networks to better understand progression in prostate cancer. As Felix illustrated earlier, the current understanding of cancer is that it is a genetic disease that arises through the accumulation of different somatic mutations in normal cells. Imagine a certain normal cell that acquires a mutation, probably randomly or through external factors like radiation or exposure to chemicals. This cell may eventually divide and pass the mutation to its offspring, and some of the newly developed cells will acquire more and more mutations in turn.
Through the accumulation of such mutations, these cancer cells acquire new characteristics that enable them to grow and divide in an uncontrolled way, probably forming a lesion somewhere in the body. So I am interested in studying the landscape of mutations in patients. While Felix listed, in the first talk, different methods to discover driver mutations in the genome, I am more interested in taking all of these mutations and finding out how they are associated with clinical outcomes: whether a patient's cancer is progressing over time, whether they are responding to treatment, or even predicting their survival.

One problem that arises when you start associating individual mutations with clinical outcomes is that the frequency of such mutations is very low in the population. The heterogeneity in a cancer population is so high that different patients may have totally different mutations while sharing a similar phenotype, for example the same cancer type. So how should we aggregate such mutations and gain power from different locations in the genome? This is an open question. One natural way is to aggregate them on some higher-level biological entity: for example, aggregating mutations on the level of genes may gain you some power. But the problem still exists. Here is a representative study from our own lab, published in Nature Genetics in 2018: we collected a large cohort of prostate cancer patients and looked at which genes are highly mutated there. There are a lot of mutated genes, but with the exception of TP53, which is highly mutated in many patients, most of these genes are mutated in only one, two, or three percent of the population. Frankly, we did not know exactly what to do with all of these genes, what the best way is to prioritize them, or whether there is any way to come up with a reasonable ranking to understand what is going on. We ended up with a very long tail of genes and needed some way to better understand their function.

If we continue the theme of aggregating signals from different locations using higher-level entities, a further step is to aggregate these genes on the pathway level, as Felix indicated in his first talk. A representative study here is from the TCGA consortium, which proposed using ten canonical pathways to study different cancer types; you can imagine that there are many other pathways you could incorporate as well, but these are among the most representative across cancer types. Frankly, pathway enrichment analysis has been around for a while, but it has been used as a post-hoc analysis: after you do your statistical testing on the gene level, you check whether your genes cluster in a certain pathway. Using pathways to inform predictive models, by contrast, is not something done routinely in cancer genomics. In our project we are interested in using deep learning as a predictive model to associate the genomic profiles of different patients with clinical outcomes. As you know, deep learning has achieved state-of-the-art performance in many fields, including computer vision, speech processing, and natural language processing. This can be attributed to three major things. First, the advances in computational elements, including CPUs, GPUs, and even customized hardware accelerators that can speed up the training of deep learning models.
Second, the availability of huge datasets, in structured and unstructured formats, from many different sources. Third, the development of new methods and techniques for deep learning over the years. If you project these three things onto the cancer genomics field, we can start with computation: I do not think we have a big problem there. With the availability of HIPAA-certified service providers you can now run your models on the cloud without much trouble, as long as you have enough money to pay for the service. So we are probably in good shape regarding computation; there are a lot of logistics, but at least we are on a good track in the cancer genomics field.

When it comes to the data, however, we have a very big problem. We do not have the privilege of millions and millions of data points to train our models on; rather, we have a very small number of samples, and even those samples are fragmented everywhere. You can imagine all of these hospitals and healthcare facilities taking care of patients: they are sequencing and storing many patients, but sharing these data with each other is very problematic for many reasons. So collecting a good dataset that can be fed into a machine learning model is a very challenging task, as is putting everything into a harmonized form and running a unified analysis across datasets, taking into account the different technologies used to measure them and the different computational pipelines through which they passed. This is why one major effort in our lab is to try to solve this problem for the community. In the same paper mentioned before, published in Nature Genetics in 2018, we were able to collect more than a thousand patients, all with whole-exome sequencing, and this data is now available on cBioPortal: you can go and run your own analysis, download the data, check the highly mutated genes, or do whatever analysis you would like. This facilitates downstream analysis for many people, and we have multiple similar efforts in the lab across different cancer types.

When it comes to the models, if you have a predictive task to solve you have many options to choose from; the machine learning field is very rich with models. However, each comes with pros and cons, and every machine learning model carries a trade-off between interpretability and accuracy. Something simple like logistic regression trains very fast and is very interpretable, but its accuracy is usually lower than the other models; if you go to deep neural networks you can probably get good predictive performance, but the interpretation of such models is very painful. Our goal in this project is not only to use deep learning but also to enhance the interpretability of such networks and to get some insight into how the model behaves. For researchers in cancer genomics it is not enough to collect some data, do some preprocessing, feed it into some magic model, and get some performance metrics at the end; it is important to open the black box and see what is going on inside the model itself, so that we can get some insight into how the model works.
We can also get explanations from the model itself, and this field has been very hot in recent years. You can think about different interpretability techniques applied to deep learning and to other machine learning models. For example, in image classification problems it is thought that a deep neural network learns an increasing level of features: the first layers learn something like edges, the intermediate layers learn something like parts of the input image, and the last layers learn higher-level features of the input. Visualizing channels and filters from the neural network may give you an idea of how the model is learning and what exactly it is detecting in your image, which is helpful both for understanding the model and for troubleshooting it. Another technique is the saliency map: for an image where we are trying to detect, say, a sheep, the map shows a high concentration on the sheep itself. Attributing the outcome to the input space, or projecting it back onto the input, can give you an idea of what the model is focusing on. There are many other techniques based on gradient propagation, difference propagation, various attribution methods, or attention mechanisms; the field is moving very fast, so we can touch on this later.

In the next few minutes I will talk about P-NET, a customized neural network model that we developed and published recently; the problem we are trying to solve with it; and hopefully some computational results, some interpretation-driven discoveries from interrogating the model, and even some experimental validation, if we have time. Just keep in mind that this is not a straightforward process: during development there are a lot of spirals and a lot of feedback between the different stages, with back-and-forth discussions among different stakeholders, to make sure we come up with plausible biological insights as well as strong computational performance.

Let me start with P-NET itself. In deep learning language we have computational nodes: given some inputs x1 to x6, you can combine them using some weights and get an outcome y that is a function of the inputs. Depending on the function f, this simple model may be linear regression or logistic regression. It is very interpretable: you can go and inspect the weights, check which weight is higher, and the corresponding feature will probably be more important than the others. If you would like more modeling power, you add more layers and end up with a non-linear model, a deep learning architecture with multiple nodes per layer and multiple layers per model, with the layers densely connected; this is an example of a feed-forward model. However, for a given prediction problem there is no real guideline for the number of nodes per layer, the number of layers per model, or even the kind of connections between the layers. The typical practice is to start with an arbitrarily over-parameterized model and hope that you can fit the data.
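The single computational node described above can be written in a few lines; with a sigmoid for f it is exactly logistic regression (the weights here are invented):

```python
# The single node described above: y = f(sum_i w_i * x_i + b). With a
# sigmoid f this is exactly logistic regression. Weights are invented.

from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def node(x, w, b):
    """One computational node: weighted sum of inputs through a nonlinearity."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

print(node([1.0, 0.0, 2.0], w=[0.3, -0.5, 0.1], b=-0.2))  # ~0.574
print(node([0.0, 0.0, 0.0], w=[0.3, -0.5, 0.1], b=0.0))   # 0.5 at z = 0
```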
Dropout, regularization, or even DropConnect can help with that, but as you can see, we are doing many arbitrary things here: an arbitrary number of layers, an arbitrary number of nodes per layer, arbitrary connections, and arbitrary dropping of nodes and connections. In P-NET, however, we add restrictions on the architecture itself and associate meaning with the nodes and edges in the model. Instead of an arbitrary model, we have a very structured model that encodes our prior knowledge about the biology itself. In the first layer we have a number of nodes corresponding to the features you are interested in modeling; in the second layer, a set of nodes representing the genes you are interested in; in the next layer, a set of pathways, where the connections between genes and pathways are based on our prior knowledge that a given gene is part of a given pathway. The layer after that includes higher-level biological processes composed of multiple pathways, and so on and so forth, until you come up with the outcomes you are interested in. This seems like a very simple idea, but it is powerful. We now have meaning for the layers: instead of arbitrary layers, we have a definite number of layers corresponding to the hierarchy we encode into the network. The input layer represents the features we measure; the second layer the genes we think are important for the process we are modeling; and then the pathways and biological processes. The connections have meaning as well: they control the flow of information through the network.
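A toy sketch of how such prior knowledge becomes connectivity: a binary gene-to-pathway adjacency built from hypothetical membership lists, which can later constrain which weights a layer is allowed to use. This is an illustration of the idea, not the real Reactome hierarchy:

```python
# Toy sketch: turn (hypothetical) gene-pathway membership lists into a binary
# adjacency, which is the prior knowledge that constrains the allowed
# connections between the gene layer and the pathway layer.

genes = ["TP53", "KRAS", "PTEN", "AR"]
pathway_members = {                 # invented membership, not real Reactome
    "p53_signaling": {"TP53"},
    "MAPK":          {"KRAS"},
    "PI3K_AKT":      {"PTEN", "AR"},
}
pathways = list(pathway_members)

# mask[g][p] = 1 iff gene g is annotated to pathway p
mask = [[1 if g in pathway_members[p] else 0 for p in pathways] for g in genes]
for g, row in zip(genes, mask):
    print(g, row)
```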
From the input to the outcome, all of these edges represent relationships that are known in biology. While this looks simple, in the real scenario we are trying to model something like this: we have access to all of the pathways in the Reactome dataset, and we model all of them. I am not sure the picture is clear, but you can imagine that each point in this graph represents a pathway and each connection represents a child-parent relationship between a pathway and a super-pathway; as you go up the hierarchy you have higher-level pathways, and so on, until you reach the biological processes those pathways are part of. In total, more than 3,000 pathways are involved in this modeling. You then translate all of these nodes and connections into neural network language and hope to capture something. And of course, the leaf pathways in this graph are connected to genes, and those genes are connected to the different features measured for each gene. So it is not an easy task; it is quite an involved job to translate all of this into a neural network and transfer this abstract hierarchy into a computational model that can learn something and produce results.

I will skip this, but as you may know, training a sparse model is more challenging than training a dense model, and we found we needed to do some engineering to make sure we got good performance with the sparse model. For example, adding a classification layer after each layer in our model helps boost the predictive performance a little. These are the numbers of nodes per layer: as you can see, the number of nodes decreases as you go deeper into the model, down to 26 in the last hidden layer, and the first number is determined by the number of genes you are interested in and the number of features you feed in for each gene. If you sum all of this up, it is almost 70,000 parameters. Technically this is a very small number of parameters, yet the model is very wide in terms of the number of nodes. So the sparse model we developed has a very small number of parameters but a lot of nodes, and we found this very helpful. From the biological perspective, we had struggled before with how to combine different features on the level of the genes in a systematic way; instead of figuring this out ourselves, or manually weighting the different features, the model comes up with weights for them that minimize the loss function during training. We also found it very helpful that the outputs, and even the intermediate layers, have meaning: you can go back with attribution methods and find which features are important for the outcome, which genes affect the outcome most, and which pathways as well, and hopefully come up with a path from the input to the outcome along which the nodes in the middle are the most important.

A clarification question: is it correct that you are using these pathway and process definitions to predefine the allowed connectivity within the network? Yes, exactly. That is what we are doing: we add constraints on the architecture itself based on the pathways, genes, and processes involved in the model we are interested in.
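A back-of-envelope illustration of why the sparse model stays small in parameters while wide in nodes: the parameter count is the number of allowed edges rather than n_in times n_out. The layer sizes and edge count below are toy numbers, not the actual P-NET figures:

```python
# Back-of-envelope: a sparse, mask-constrained layer has one parameter per
# allowed edge, not n_in * n_out. All layer sizes and edge counts here are
# toy numbers, not the actual P-NET figures.

def dense_params(n_in, n_out):
    return n_in * n_out + n_out        # full weight matrix + biases

def sparse_params(edges, n_out):
    return edges + n_out               # one weight per allowed edge + biases

# e.g. a 9,000-gene layer feeding 1,400 pathway nodes, where each gene
# belongs to only a handful of pathways
print(dense_params(9000, 1400))                # fully connected: millions
print(sparse_params(edges=30000, n_out=1400))  # mask-constrained: tens of thousands
```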
All of these nodes and edges are taken from Reactome, so we are literally converting Reactome into a functional, active computational model that can learn something. The next slides explain this a little. Here is an example of a dense layer, where a weight matrix W connects two layers of nodes: given an input vector x, you get the output vector y by multiplying the input by this weight matrix, and usually W is a dense matrix. If you would like to model an arbitrary sparse layer instead, you can use another matrix M that represents the adjacency between the inputs and the outputs; by multiplying M elementwise with W you get a masked weight matrix that you can train with your usual training algorithm. Sometimes, for example in the first layer, the mask has a regular pattern, and that can be used to accelerate training a little: instead of multiplying the full matrix by the input, you can use some clever tricks to speed up the computation. So that is a high-level overview of the model; the details were published in our Nature paper last month, so you can look there for details, and I am happy to answer questions in the discussion time.

Next I'll move to the problem we are trying to solve. In general we are interested in understanding cancer progression in patients. We would like to know whether there are any genomic differences between patients who harbor a primary cancer and those whose cancer has progressed and developed into metastatic cancer somewhere else, and what those genomic drivers are, if there are any.
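The masked-layer construction described a moment ago, y = x (M ∘ W), can be sketched in a few lines of numpy. This is a minimal illustration of the idea, not the actual P-NET implementation:

```python
import numpy as np

def sparse_forward(x, w, mask):
    """y = x @ (mask * w): only weights where mask == 1 participate;
    during training the same mask zeroes out the masked weights' updates."""
    return x @ (mask * w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                       # 4 samples, 3 gene-level inputs
w = rng.normal(size=(3, 2))                       # dense trainable weight matrix
mask = np.array([[1., 0.], [1., 0.], [0., 1.]])   # adjacency from the hierarchy
y = sparse_forward(x, w, mask)                    # shape (4, 2)
```

Frameworks differ in how they keep masked weights at zero during training (masking gradients vs. re-masking after each update), but the forward pass is exactly this elementwise product.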
If there are any differences between these two populations, can we capture them automatically using our predictive model? This can be measured by predictive performance if we pose it as a classification problem. And the next question is: can we then go back and ask which features differ between these two populations?

We used data that we had put out for the community and ran a secondary analysis on the same data we published before, using our P-NET model. The data cover more than a thousand patients; for some of them the sample was taken from the primary site, and for the others the sample was taken from a metastatic location. Using the genomic profiles of these patients, where we have the mutations, the deletions, and the copy-number amplifications, we tried to model this process and solve it as a binary classification problem using P-NET, and to compare it with other models as well. I'll skip this part, as it is not very relevant here, and move quickly to the computational results.

A question before the results: did you use molecular features as well, like RNA expression, or not? That is a very good question. In one of the experiments we did use gene expression, but we did not have gene expression for all of the patients. In the clinic, gene expression is not something you get routinely; rather, you get a lot of whole-exome sequencing with mutations and copy-number variations. So for the main experiment, and for the results I'll be showing soon, we used only the mutations and the copy-number deletions and amplifications as input to the model. We did try different data modalities.
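The input described here, binary mutation plus copy-number deletion and amplification calls per gene, can be assembled into a patient-by-feature matrix. A toy sketch with made-up random data, assuming one binary column per gene per feature type:

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_genes = 8, 4

# Hypothetical binary alteration calls, one matrix per feature type.
mutation = rng.integers(0, 2, (n_patients, n_genes))
cn_deletion = rng.integers(0, 2, (n_patients, n_genes))
cn_amplification = rng.integers(0, 2, (n_patients, n_genes))

# Interleave so each gene owns a contiguous block of 3 input features,
# matching a first-layer mask that connects each block to one gene node.
X = np.stack([mutation, cn_deletion, cn_amplification], axis=2)
X = X.reshape(n_patients, n_genes * 3)
y = rng.integers(0, 2, n_patients)  # toy labels: 1 = metastatic, 0 = primary
```

The contiguous per-gene blocks are what make the first layer's mask pattern regular, which is the acceleration trick mentioned earlier.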
Fusions, for example, and things like that did not change the performance very much. As for gene expression itself, I think it was very challenging to use in the model, probably because of batch effects and similar issues, but frankly that was not our main focus, since we did not have enough samples with gene expression. It is, of course, part of our future work to focus on data modalities other than pure genomics data. (Comment: I can imagine some tricky confounding factors there too, with expression from contaminating normal cells in the metastatic site and things like that.) Yes, I would say gene expression is not easy; a lot of nuances come with processing gene expression, and you should be very careful when considering expression as an input to your model.

So, using a standard training/testing/validation split, we achieved good performance: more than 93% of the primary patients were classified correctly. When we checked the area under the ROC curve and under the precision-recall curve, we found that P-NET achieves slightly higher performance than the other models, including support vector machines, adaptive boosting, and random forests. I would say the difference is not a very big one, but at the end of the day the performance is at least comparable. When we did standard five-fold cross-validation, we found that P-NET achieves higher performance in terms of the area under the ROC curve, the precision-recall curve, and even the F1 score, and its variance is probably lower than the other models' as well. We also tried to compare P-NET to the equivalent dense model, and things get a little tricky there.
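For reference, the area-under-the-ROC-curve numbers quoted throughout can be computed directly from predicted scores. A small self-contained sketch using the Mann-Whitney rank-sum identity (my own helper, not from the P-NET code; it assumes no tied scores):

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the rank-sum identity (assumes untied scores)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

In practice a library routine such as scikit-learn's `roc_auc_score` handles ties and edge cases; this is just the underlying arithmetic.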
You need to be careful about defining what the equivalent dense model is. If you mean a model with similar modeling capacity, that is, comparing P-NET with a dense model that has the same number of parameters, then yes, we can do that. We compared P-NET with a dense model with the same number of parameters while increasing the number of samples, and we found that when you train both models on a large number of samples you get similar performance, but as the number of samples decreases, P-NET achieves higher performance than the dense model: you can get up to a 20% increase in performance when you have a very small number of samples, around a hundred or so. This was interesting; it seems that P-NET may be especially helpful in small-sample settings, which is the case in most of the genomic studies we have.

If you instead compare P-NET to a dense network with the same number of nodes, things are even trickier. P-NET has a great many nodes but is very sparse, with only about 70,000 parameters; a dense network with the same number of nodes would end up with something like half a billion parameters, which is not practical. So what we did for this comparison was to force the first layer of the network to be sparse and all of the following layers to be dense, and this hybrid model is what we call "dense" here. The performance comparison comes out very similar: P-NET at 0.93 and the dense network at about 0.90. Just keep in mind that P-NET has only about 70,000 parameters while this dense network has something like 40 million. We did not run this comparison for all of the sample sizes.
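The gap between the two parameter counts discussed here follows directly from the layer widths. A back-of-the-envelope sketch (the layer sizes below are made up for illustration, not P-NET's actual widths):

```python
import numpy as np

def dense_weight_count(layer_sizes):
    """Weights (biases ignored) in a fully connected stack of these widths."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def sparse_weight_count(masks):
    """A masked layer only trains the weights its mask keeps."""
    return sum(int(m.sum()) for m in masks)

# Illustrative only: a wide-but-sparse stack keeps far fewer weights than
# a fully connected counterpart with the same node counts.
sizes = [9000, 3000, 1000, 26, 1]          # invented widths for this sketch
dense_total = dense_weight_count(sizes)

masks = [np.tril(np.ones((4, 4)))]          # toy mask keeping 10 of 16 weights
sparse_total = sparse_weight_count(masks)   # 10
```

This is why "same number of nodes" and "same number of parameters" give such different dense baselines.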
However, we think the comparison holds across sample sizes. For the sake of time, I'll just say that we also tried to validate the network externally on two cohorts on which the model was not trained: a localized dataset and a metastatic dataset. Our model was able to classify about eighty percent of the samples correctly, with some errors of course. So it seems the model can generalize to other datasets to some extent; more experiments need to be done here, but at least this is a good indication that it can generalize.

What is interesting is that we then looked at whether the model's prediction errors were something more than pure mistakes. We inspected the errors and looked for anything interesting in them, and it seemed there was something. For some of these patients we had information about biochemical recurrence, meaning the cancer comes back later on, which can be detected through elevated PSA levels in the patient. We found that patients with a high P-NET score, meaning the model calls them metastatic even though they have primary cancer, seemed to have a worse prognosis: they had a higher probability of developing biochemical recurrence later in life. On the other hand, patients with a lower P-NET score had better progression-free survival. I think it is interesting that P-NET, trained only on the genomic profiles of these patients, may be learning something that helps stratify patients.
It stratifies them by their probability of developing biochemical recurrence later in life, and that is something we got purely from inspecting the predictions of the model.

Next I'll talk about the interpretability aspect of the model and whether we can get anything from inspecting the model itself. As you know, many techniques have been published recently: many attribution methods and explanation techniques you can use to get insight into your model. I am not in the business of comparing all of them; there are studies showing that some of these methods can be expressed in terms of the others, and the techniques have a lot in common. We ended up using DeepLIFT, from Stanford, for several reasons, including how easy it was to use, the availability of the code at that point, and a subjective evaluation of whether the resulting rankings make biological sense. These techniques assign scores to the different features for each patient, and if you aggregate those scores across patients you get a score for each feature at the population level.

For example, we did this for the inputs, aggregating all of the importance scores at the population level. This is for the first layer, and the bigger a rectangle is, the more important it is. We found that P-NET's predictions are driven by a lot of copy-number events, especially copy-number amplifications, which is consistent with our understanding of cancer biology. When we did the same thing for the next layer, we found a lot of interesting genes at the top of the ranking, things like AR, TP53, PTEN, and RB1.
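The aggregation step described here, from per-patient attribution scores to a population-level feature ranking, can be sketched as follows. The scores are random placeholders, not real DeepLIFT output, and averaging absolute values is one reasonable choice of aggregation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-patient attribution scores (e.g. from DeepLIFT):
# rows = patients, columns = input features.
features = ["AR_amp", "TP53_mut", "PTEN_del", "RB1_del", "MDM4_amp"]
scores = rng.normal(size=(100, len(features)))

# Population-level importance: mean absolute attribution per feature,
# then rank features from most to least important.
importance = np.abs(scores).mean(axis=0)
ranking = [features[i] for i in np.argsort(importance)[::-1]]
```

The same aggregation applies at any layer, since every node in the model corresponds to a gene, pathway, or process.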
All of these genes are known to be important in the literature, so it was a very interesting point for us that the model is capturing known biology. Using some factorization and normalization you can even assign weights to the edges connecting the first layer to the second, so, for example, you can see right away that AR amplification is important, along with TP53 mutation and PTEN deletion. This gives you a very nice visualization of the importance of the nodes and how they connect to each other. If you do this for all of the layers, you arrive at a multi-level visualization of the system, where each layer shows the top-ranked nodes and their connections, and the remaining nodes are collapsed into a single residual node. You can imagine inspecting such a visualization of the model and coming up with your own biological hypotheses, observing the flow of information from the input to the outcome through the different layers. From interacting with various oncologists and scientists, we heard that this visualization is very appealing and may be informative for developing biological hypotheses.

When we went back and checked the ranking of the genes in the layer just after the inputs, we found the expected genes like AR, TP53, PTEN, and RB1 at the top. But going down the list, we found some genes that are not expected at that position, things like MDM4 and FGFR1. That is where it starts getting interesting: we wondered why a gene would rank so high on the list while not being very well known in the prostate cancer field.
It is not well established in that field that these genes are deeply involved in the progression of the cancer. When we inspected the individual genes, we confirmed that AR amplification is enriched in the metastatic population, that TP53 mutation is enriched in the metastatic population as well, and likewise PTEN deletion. We also found a lot of amplification events in MDM4 in the metastatic population, and we thought this was probably because MDM4 is somehow related to the TP53 pathway; indeed, many of these pathways came up high in the rankings of the different pathway layers.

We know that TP53 is a major tumor suppressor that regulates a lot of biological processes inside the cell: it regulates many genes, including cell-cycle genes, cell-death genes, and DNA-repair genes. Interestingly, TP53 is itself a highly regulated gene, controlled through several mechanisms. So if you are a cancer cell trying to stop TP53 from functioning, you can do so in many ways: by mutating the coding region of TP53, through epigenetic or post-translational processes such as methylation or phosphorylation, or by increasing the activity of MDM2 or MDM4, which are known negative regulators of TP53. There are many options, and it seems the route through MDM4 is one that cancer cells take. When we inspected the mutation distribution of MDM4 and other genes, including AR and TP53, we found, for example, 19 patients who have an MDM4 amplification without any events in AR or TP53, and also some patients who have MDM4 and AR events but nothing in TP53. So it seems that stopping TP53 is achieved not only through mutations in TP53 itself: you can amplify the other genes that negatively regulate it. This was an interesting finding for us, and we thought we might have some interesting biology here.
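Counts like the "19 patients with MDM4 amplification but no AR or TP53 events" can be read off a binary patient-by-event matrix. A toy sketch with invented data (five hypothetical patients, not the real cohort):

```python
import numpy as np

# Toy binary event matrix: columns are MDM4 amplification,
# AR amplification, TP53 mutation (1 = event present).
events = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
])
mdm4_amp, ar_amp, tp53_mut = events.T

# Patients whose only TP53-suppressing route among these three
# event types is MDM4 amplification.
mdm4_only = int(((mdm4_amp == 1) & (ar_amp == 0) & (tp53_mut == 0)).sum())

# Patients with MDM4 and AR events but an intact TP53 coding sequence.
mdm4_and_ar_no_tp53 = int(((mdm4_amp == 1) & (ar_amp == 1) & (tp53_mut == 0)).sum())
```

Boolean masking like this is also how mutual-exclusivity patterns between driver events are usually tabulated.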
We wanted to follow this up experimentally and see whether our computational nominations had any biological meaning. So we went to our friends in the Hahn lab, and our friend Justin helped us with one of his recently published screens, an ORF screen in a prostate cancer cell line. We found that many of our highly ranked genes are present in this screen; for example, MDM4 itself, our most highly ranked nomination, ranked high there too, which was very interesting. On the y-axis is the z-score, and the more positive the score, the more resistant the cell line is to the treatment. So now we had two sources of evidence: the first coming from the computational model and the second coming from this screen.

We followed up with more experiments on MDM4 individually. For example, when we targeted MDM4 using CRISPR, the survival of different cancer cell lines went down, and when we targeted MDM4 using a drug, the cancer cell lines became more sensitive, especially if the cell line is TP53 wild type. So it seems we found something highly interesting here, something that can be targeted with drugs as well; we did the targeting in cell lines using both CRISPR and drugs. It is very exciting that, through computational modeling of data aggregated over time, we were able to nominate targets computationally and then validate them later in the lab, and perhaps the next step is to move toward clinical trials.
The idea would be to see whether this can help make the cancer more sensitive to the treatment.

To summarize, for the sake of time: we developed P-NET, a sparse model, and we were able to encode all of this biological information in the hierarchy of the model itself, and it seems to work. Computationally, it was able to identify advanced cancer patients based on their genomic profiles, and we found that some of the patients with high P-NET scores had a higher probability of developing biochemical recurrence after a while. We found that P-NET recovers known biology about prostate cancer progression, and it also nominated a new targetable gene that we have so far been able to validate experimentally in cancer cell lines.

Looking forward, we would of course like to include more data modalities in the model, with the hope that the more data we feed in, the more accurate the model and its rankings will be. We are also exploring different options for encoding hierarchical prior knowledge: we did this successfully using the Reactome dataset, but we are considering other hierarchical and graph-based knowledge we could include, such as protein-protein interaction networks and other graphs. We are also trying to apply P-NET to other problems, including drug response, especially to immunotherapy, and we are trying to better understand the differences between attribution methods and interpretation techniques as applied to sparse networks rather than dense ones.
Info
Channel: Broad Institute
Views: 955
Keywords: Broad Institute, Broad, Science, Institute, of, MIT, and, Harvard, Genomics, Sequencing
Id: Gs8faSB0wBg
Length: 88min 20sec (5300 seconds)
Published: Tue Nov 16 2021