Machine Learning at GSK with Kim Branson - #356

Captions
All right everyone, I am here with Kim Branson. Kim is Senior Vice President and Global Head of Artificial Intelligence and Machine Learning at GSK. Kim, welcome to the TWIML AI Podcast.

Thanks Sam, good to be here, looking forward to our conversation.

Let's jump right in. You come to GSK by way of Genentech, this historical Silicon Valley challenger. Before that you spent a lot of time in Silicon Valley, and now you're at a large pharma. Tell us about that journey and how it came to be.

Sure. My background: I did my PhD in machine learning, really looking at applying machine learning to drug discovery. Small-molecule drug discovery and design is where I started, and I'm dating myself, but this was around 2003, when the cutting-edge techniques were things like support vector machines and random forests. I was also doing a lot of work on physics simulation of small molecules binding proteins and everything around that. From there I did the very academic thing of spending time at Stanford, where some of the earlier work applying machine learning, graph convolutional networks and that sort of thing, was happening. I spent some time at Vertex Pharmaceuticals, but then a fair bit of time in startups. I was involved in an early search startup that was acquired by Twitter, and another company doing large-scale medical-records aggregation and differential-privacy machine learning on that kind of data. Although, like all such things, 90% of your time is building the infrastructure to do ETLs, extract tables out of PDFs, and all those wonderful things. That company was acquired by Apple. Then a lot of large-scale machine learning on claims and other types
of data. It's all about predicting: what's the probability of disease X at time t, given someone's past medical history, so you could intervene early, things like that. From there, I had many friends at Genentech and I've always been involved in computational chemistry, so I joined Genentech, and then was really recruited from Genentech to come to GSK. It's one of those things that wasn't really on my radar as a place to be. I'd known of the company for a while and had some friends there, but they convinced me to come in, and I realized that something very different was happening. They had brought in a new head of R&D: Hal Barron had joined, who had been at Calico with Daphne Koller and people like that. I thought, here's a guy who actually understands what machine learning looks like and what it requires to be done properly. He explained to me that it was going to be such a core part of GSK; they were really serious about it. It wasn't "we'll build a ten-person team and dabble and half-resource you." It was going to be a fundamental thing. That's what led me here, and it's let me create the group and scale it as we do now.

Give us some context for GSK. What's the core business, and where do ML and AI fit in?

Sure. We're obviously a pharmaceutical company: we make medicines and vaccines across a wide range of areas. GSK is an old company; it's been around for about 300 years, and all these companies every now and then go through revolutions where they turn themselves inside out and reposition themselves. What was really apparent is that we have this increasing body of genetic
data, these large genetic databases. The cost of sequencing has gone down. Remember, 20 years ago we had the human genome, one genome; now we can sequence lots and lots of people and know about their medical histories. So you can ask a question like: here are a bunch of people who got a disease, and a bunch of people who don't get the disease. What's different in the genetics? The idea is that the genetics points at a clue as to what you might want to make a medicine for, what's involved in that disease. So there's a rapidly increasing amount of data being generated on the genetics side. The other thing happening is these technologies for functional genomics. The first wave you can think of as molecular biology: restriction enzymes, plasmids, doing genetic engineering to make a cell make a protein. The continuation of the evolution of those tools is now CRISPR and TALENs and those sorts of things, where you can actually perturb a specific gene, turn it up or turn it down, in a particular cell type, even at a single-cell level. So now you've got this other set of technologies where you can generate huge amounts of data at scale. It turns out biology can measure so much more now. There's this massive amount of measurement and it's multimodal: RNA-seq at the single-cell level, looking at the messenger RNA made as you do these edits; cellular imaging; proteomics; all the omics, as we call them. There's an explosion of data, and so you really need machine learning in the middle of it to make sense of the data, but also
to help you make sense of all the literature and plan the next experiments. So in the discovery phase it's really core. We have this three-prong strategy: genetics on one side, functional genomics on the other, and AI/ML in the middle to integrate them, to help us find the targets and design better medicines.

So is it accurate to think of genetics as the data source, genomics as the control point, the way you influence what's happening, and AI as telling you what influence to exert based on the patterns you see?

The way I think about it is that these genetic databases give us a clue of what to start looking at. And we know it's really important, because we've done studies, and others have as well, showing that if you've got a medicine that works on a target... Remember, genes encode messenger RNA, messenger RNA encodes proteins, and proteins are the things that do the work in our cells. The protein is typically what we'd call the target, the thing we want to modulate. That might be with a small molecule, classic things like penicillin or aspirin, or it could be something like an antibody, which is usually extracellular rather than inside cells, and blocks something. So the genetics is giving us a hint of what to look for, but it's not the whole story. It gives you a clue, but you need to go and do further experiments to understand: is this the correct target, or is it something else in the pathway it's operating in? That's where functional genomics comes in. It allows us to ask: what happens when we turn this particular gene off, lower the level of that protein? Does that look
like it has the right effect? It's mimicking the effect of a medicine: we don't have to make the medicine to work out whether the target works; we can use that kind of experiment to inform it. And all these experiments, because the cost of measurement has gone down, because we can make a change in a single cell and then do RNA sequencing and look at all the changes in messenger RNA at the single-cell level, generate a whole bunch of data at large scale. That's where we bring machine learning in.

Maybe I can illustrate the point. Probably one of the key problems we work on is that in these large genetic databases, we really only know what to do with about 15 to 20 percent of what we call the variants. A variant is where your DNA sequence differs from mine: maybe I get the disease and you don't, because I've got a particular mutation. If that mutation falls into what we call the open reading frame of the gene, the part that encodes the protein and gets translated into amino acids, we know it affects the protein, and we can go and look at that protein and work out what's happening: does it not fold, or does it fold but it's not as active, that kind of thing. But a whole bunch of variants fall outside of that, into the regulatory regions of DNA. You can imagine there's a whole set of control structures in DNA specifying which proteins to turn on under which conditions. So we've spent a lot of time building machine learning models to understand which genes those variants, the ones in the control regions, are regulating. It's not always the closest gene; they can have quite long-range effects, it might be a different protein. So if we can
understand what they're regulating, it gives us a whole lot of potential targets to go and look at. So that's the sort of thing we use machine learning on: discovering things to make medicines against, better targets that are genetically validated, so we know they're more likely to work. And then things closer to the clinical side of the world: I've got a medicine, how best to use it, who's going to respond to it, how do we measure the response. That's things like computational pathology and other work in the clinical domain.

One of the things I heard in your explanation, and this is maybe a little bit of a tangent: I think of variants as one of the bases getting flipped, say from an A to a G. You've got these four letters and one of them gets flipped. But I heard in your description that that's an oversimplification, and it's not just about which gene is encoded. Am I hearing that correctly?

Yes. Our cells are a network: lots of different proteins get made, communicate with other proteins, and perform various functions. Sometimes it's literally the amount of the protein that's important. There are classic diseases, what we call rare genetic disorders, where you've got a single mutation and it makes a protein function less. Canonical examples might be hemoglobin in sickle cell disease, or factor X deficiency, where you don't make effective factor X and you need to put that protein back. Some things are about the level of the protein, which can influence behavior. Some mutations mean the protein doesn't get made at all.
Sometimes the protein gets made but it's not as stable, so it gets turned over rapidly and you just don't have as much. And sometimes you might have a mutation that makes a protein always on, constitutively active, so it's not regulated anymore; unregulated, it drives the behavior of other pathways, which leads to aberrant function, which leads to pathology and disease. So it's not as simple as "I've got a mutant, and this mutant tells me I need to make a drug against that." It's actually: what's the effect of the mutant? This is what I was describing before, the variant-to-gene problem, as we phrase it. And the other missing piece is the gene-to-function problem. That's the other key thing.

So you've got these data sources coming from genetics and genomics, and you're applying AI, ultimately trying to develop new drugs, new interventions. Can you talk about some of the specific use cases or problems?

Yes. First of all, there's taking your clues from genomics to come up with those targets. Then you have to think about the effect of that mutation on that target, as we were just discussing: is it more stable or not? And then: how do I take the target, in that cellular context, and know that I've made something better or not? How do I know that I've actually found a good target, one that's going to become a good medicine in people? All the models we build actually have a large experimental feedback loop. We know, for
example, that the variant-to-gene model has an experimental feedback loop where we're doing what we call experiments as code. We're asking the model what it needs, and it's really adaptive sampling under uncertainty constraints. Rather than having data generated by some other process at GSK and trying to build a model of that, where I might get 900 examples a week of something I'm already very good at when I'd really like more examples of the things I'm not good at, we use a lot of automated biology, biology with robotic automation, to generate data that feeds back into these models. The model becomes the tool that helps solve the problem: we can use that model to help us map more of those variants to genes. But then we still need to understand the gene-to-function part, and again there we use automated experimentation, in this case with various cellular models. These can be what we call induced pluripotent stem cells, iPSCs, stem-like cells from patients that have a disease and patients that don't. We basically want to take the disease tissue and make it look more like the normal tissue, where "look like" can be measured in a bunch of different ways: imaging data, gene expression patterns, protein levels, or some functional consequence. And typically these assays are complicated. With CRISPR, sometimes you can just do what we call a genome-wide screen, just do all 20,000 genes, but these assays are so complex that you can't do that, so you need an adaptive experimentation approach.
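As an aside, the adaptive-sampling idea Kim describes, spending the next experiment on whatever the model is most uncertain about rather than on whatever an external pipeline happens to produce, can be sketched in a few lines. This is only an illustration: the gene names, effect sizes, noise level, and the standard-error heuristic below are all invented, and a simulated function stands in for the robotic wet lab.

```python
import random
import statistics

def run_experiment(gene, true_effect, noise=0.1):
    # Stand-in for the robotic wet-lab "oracle": one noisy readout of a
    # perturbation's effect. All values here are invented for illustration.
    return random.gauss(true_effect[gene], noise)

def adaptive_sampling(genes, true_effect, budget):
    # Seed round: one measurement per candidate so everything has an estimate.
    observations = {g: [run_experiment(g, true_effect)] for g in genes}

    def std_err(g):
        # Uncertainty heuristic: standard error of the mean for each gene.
        obs = observations[g]
        return float("inf") if len(obs) < 2 else statistics.stdev(obs) / len(obs) ** 0.5

    for _ in range(budget):
        # Spend the next experiment on the gene we are least sure about.
        target = max(genes, key=std_err)
        observations[target].append(run_experiment(target, true_effect))
    return {g: statistics.mean(obs) for g, obs in observations.items()}

random.seed(0)
estimates = adaptive_sampling(
    ["GENE_A", "GENE_B", "GENE_C"],
    {"GENE_A": 0.1, "GENE_B": 0.8, "GENE_C": 0.4},
    budget=30,
)
```

The real system presumably trades off far more than measurement variance (cost, batch effects, prior knowledge), but the shape of the loop, measure, re-estimate uncertainty, pick the next perturbation, is the same.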
You take the clues you start with, from genetics and from the literature for example, and you seed a model with them, and then the model makes an experiment. We do everything as a sequential learning problem: we do an experiment, we perturb a few genes, we look at how it moves things, we feed that data back and ask, based on what we've learned, what's my next best experiment? Then I make another mutation, do another round of interventions on that cellular system, and keep iterating. If you think about it, it's an optimization problem: I'm trying to find the best target, and I want to get there with the fewest number of steps. And at the same time as you're trying to make the system look like you're affecting the disease, where your ground truth is those cellular models, you also do other things, like making sure you avoid certain proteins in the body that you can't touch because they have toxicity associated with them. A classic one is cardiotoxicity: it's no good making an RA drug if it's going to be cardiotoxic. So there are certain things we know we can't hit. There's toxicity from what we call on-target tox, where I hit this protein and something bad happens, and there are other sorts of toxicology. We can, at the same time, have an AI system that's learning which targets are toxic, which targets not to touch, based on prior data, and which targets are easier or harder to make a small molecule against. It's a multi-objective optimization, because I can come up with targets that are really great, and then we sit around saying, well, we have no idea how to make a selective medicine against that. An example of one of those is a protein involved in
cancer called KRAS. It took people years to come up with a selective KRAS inhibitor. It was always a great target, but it was what we call intractable. So it's this optimization of finding something that moves your model of the disease in the right direction, is tractable, and is not toxic, and then we can put it forward. Again, we use machine learning in that sense to help design those experiments and carry them out.

Can we take a step back? Talk through a specific, concrete example of a project, whether it's the cancer one you just mentioned or another: what the data sources are, what the evaluation criteria are, and how this sequential learning idea plays out in a concrete context.

Sure, let's talk about the variant-to-gene one, because I think everybody has more of a sense of that now. In that model we have some genetic variants from standard GWAS analyses, genome-wide association studies. These say: these variants in this region of DNA are important in this disease, but we don't know which gene it is. It could be gene A, gene B, or gene C. The system that looks at that treats the whole thing a bit as a ranking problem. The top-level model is a ranking model, and there are a whole bunch of different features that feed into it. Think of it as equivalent to web search: for this disease and this variant, what is the gene? You come up with your ranked list of genes, and you want the causal gene to be first on the page. That system has a bunch of different models feeding into it. One class of these models is
stacked encoder models that featurize raw DNA. They look at the raw DNA sequence and predict where a transcription factor of a given type may bind, or whether that sequence of DNA is what we call open or closed, that is, whether it's packed up in open or closed chromatin. There are other features describing whether particular genes are expressed or not in a particular tissue type, because not all genes are turned on in all tissue types: cardiac cells are different from neuronal cells are different from skin cells. There are other features that come out of knowledge graphs, node embeddings; I can talk later about the pretty large knowledge graph we use behind things. Some of these are neural networks in their own right, some are other types of models, and they all produce features that go into this top-level model, again a neural network, which is supervised: we say, here's a variant, and here's what we think is the gold-standard gene for that variant, and it learns how to weight those features, much like you'd train anything else. The challenge is that there isn't a massive amount of gold-standard variant-to-gene data, because, as I said, we only know what to do with about 15 percent of the variants. So then we have to do the experiment part, and the experiment is where we bring in the functional genomics. Depending on what we're doing it could be different cell types, but say it's a primary T cell from a human donor: we do the edit, then we sequence those cells and look at the mRNA levels. We know what it was before and what it was afterwards. What's the differential gene expression?
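The ranking setup Kim describes, per-candidate features combined into a score and then sorted, can be sketched compactly. The feature names, weights, and gene labels below are invented placeholders, and a hand-set linear scorer stands in for the learned neural ranking model:

```python
# Invented feature names; each candidate gene gets a value per feature.
FEATURES = ["inv_distance", "open_chromatin", "tissue_expression", "kg_similarity"]

def score_gene(features, weights):
    # Linear scorer standing in for the neural ranking model.
    return sum(weights[f] * features[f] for f in FEATURES)

def rank_candidates(candidates, weights):
    # candidates: {gene: {feature: value}}; highest score ranks first,
    # i.e. is predicted most likely to be the causal gene.
    return sorted(candidates, key=lambda g: score_gene(candidates[g], weights),
                  reverse=True)

weights = {"inv_distance": 0.2, "open_chromatin": 0.3,
           "tissue_expression": 0.3, "kg_similarity": 0.2}
candidates = {
    # The gene closest to the variant, but barely expressed in the tissue...
    "GENE_A": {"inv_distance": 0.9, "open_chromatin": 0.2,
               "tissue_expression": 0.1, "kg_similarity": 0.3},
    # ...versus a distal gene in open chromatin, active in that tissue.
    "GENE_B": {"inv_distance": 0.3, "open_chromatin": 0.9,
               "tissue_expression": 0.8, "kg_similarity": 0.7},
}
ranked = rank_candidates(candidates, weights)  # GENE_B outranks the closest gene
```

In the real system the weights are learned from gold-standard variant-gene pairs and from the functional-genomics experiments described next, rather than set by hand.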
You say: I think this variant affects this gene. So we go and look at that gene and the change in its expression, and if we got it right, the expression of that gene falls, and only that particular gene. That becomes the training data, which feeds back in; we rebuild the model and off we go again. What it means is that the different teams that run those different sub-models also have different data sources they bring in, some generated from external data, some internal, to build their feature factories that feed into this thing. But that's how we train the whole model. And it's a really interesting scenario, because probably 45% of the time a simple model, "the variant affects the closest gene," will get you there. The problem is that the rest of the time that model doesn't work: it's not the closest gene, it can be something quite far away, and one third of the time, from doing this experiment, it's really, really far away, not what we expected. So we've been running this learning loop that builds the model and generates the training data at the same time, and as a result the overall model that maps variants to genes gets better, and we track that over time. When we started off we could map about 15 percent of the unexplained variants in UK Biobank; then we moved to 24, and now we're at 40. We know what to do now with 40 percent of the genetic variants in this database, and that gives us a whole bunch of new potential targets to explore with some of those other systems I talked about.

Got it. So the sequential learning loop: in some ways you describe it and it sounds like this automated thing that keeps iterating over time, performing an
optimization across all of the experiments you're doing. In other ways it sounds like you're applying machine learning, it's giving you some features, some signals, and then it goes into a scientist's brain that determines what the next step is. How closed is that loop?

With the tools I'm talking about, the models we build, there are obviously scientists involved in running the experiments as code, tending to the robots, depending on how complicated the experiment is and the throughput. Where human scientists really get involved is at the output. Another effort we have is involved in the discovery of anti-cancer drugs, around a concept called synthetic lethality. All that means is that in most of biology, cells have redundant pathways for the really important things. Tumor cells grow really rapidly, and because they divide so rapidly they tend to break things and can end up with only one functional copy of something. So if we can identify which things are likely to break, I can make a drug against the remaining pathway: because the tumor no longer has a backup, the drug selectively kills the tumor cell over your normal tissue. A GSK drug like niraparib is a classic example. It's what's called a PARP inhibitor, and PARP is involved in DNA repair, so if you stop the DNA repair, it acts to kill the cancer cell. So we have another system that tries to come up with what we call synthetic lethal pairs: if you see this mutation, then you can target this other particular gene.
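The pairing logic can be illustrated with a toy viability matrix. Everything below, cell-line names, gene names, and readouts, is invented: knock out a candidate gene across mutant and wild-type lines, then score how selectively the knockout kills the mutant lines.

```python
def synthetic_lethal_score(viability, mutant_lines, wild_type_lines, knockout):
    # Selectivity: how much more the knockout hurts mutant lines than wild-type.
    mut = sum(viability[(line, knockout)] for line in mutant_lines) / len(mutant_lines)
    wt = sum(viability[(line, knockout)] for line in wild_type_lines) / len(wild_type_lines)
    return wt - mut  # large positive value = candidate synthetic-lethal partner

# Toy viability readouts after knockout (1.0 = unaffected, 0.0 = dead).
viability = {
    ("mut1", "GENE_X"): 0.2, ("mut2", "GENE_X"): 0.3,
    ("wt1", "GENE_X"): 0.9,  ("wt2", "GENE_X"): 0.8,
    ("mut1", "GENE_Y"): 0.8, ("mut2", "GENE_Y"): 0.9,
    ("wt1", "GENE_Y"): 0.9,  ("wt2", "GENE_Y"): 0.8,
}
scores = {g: synthetic_lethal_score(viability, ["mut1", "mut2"], ["wt1", "wt2"], g)
          for g in ["GENE_X", "GENE_Y"]}
best = max(scores, key=scores.get)  # GENE_X: lethal only in the mutant background
```

As Kim notes next, a high score here is a hypothesis to hand to experimental colleagues, not an automatic target nomination.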
We have a common set of mutations we know tumor cells acquire, and we ask what pairs with them. Again we do experiments: we knock a gene out, turn it up, turn it down, and look at its effect on viability in a whole bunch of different tumor cell lines. But the output of that doesn't automatically become "this is the target, go away and make an antibody or a small molecule against it." That's where we interface with our experimental colleagues. These are all narrow-purpose ML systems we're building, and there's a whole lot of data and knowledge the scientists bring to bear, and different experiments they will then go and design. What comes out is really a hypothesis suggested by the machine learning algorithm. So everything we do to surface the information and work with our colleagues matters. Sometimes the production of the ML model is quite automated, but the use of the model is where the human scientists come in. These are tools for scientists: we cannot encode all the background knowledge of biology the way we'd want, and the scientific literature is really messy; not everything in it is correct. So there's certainly a role for domain expertise in that.

And did I hear you earlier reference work that you've done to apply machine learning to mine the scientific literature itself?

That's right. One way to think about it: if you're doing sequential learning and running all these big learning loops, well, humans have been doing medicine for a long time. We know a lot about biology and medicine, and it would be foolish to start from a clean slate and have to relearn all of that; it would take many more samples. So there are ways to think about how to have structured priors,
if you're Bayesian, and things like that. So we have another group for this, and one of the great advances you've seen recently has obviously been NLP. We have all these journal articles, whether open access like bioRxiv, or Elsevier, PubMed, et cetera, lots of sources, and there are also datasets published as well. That group builds an NLP model based on BERT-type architectures (again, we're seeing encoders appear everywhere), and what it does is entity and relationship extraction: it pulls out what we call semantic triples. A triple is a thing, a type of relation, and another thing, or subject-predicate-object. The predicates we care about are a limited set, and luckily the scientific literature isn't as free-form as most writing. You see "A has function in B," "X does Y," "X does not do Y," where the X's and Y's can be genes, diseases, small molecules, you name it: a set of things we're interested in. We run this over the literature, pull out all those semantic triples, and stick them into a really big store. So we have a graph that's something like 500 billion nodes. It's huge.

The knowledge graph you were referring to earlier?

Yeah. We don't use the whole thing; we pull out pieces. Maybe I'm interested in just genes and diseases, and proteins, the protein products of genes, and I can pull out those subgraphs. We might do some link prediction on that: this edge isn't known, but we're pretty sure that X does do Y, or maybe there's weak evidence for it. So you're applying ML algorithms on top of that to build the knowledge base. And then we actually use,
usually, node embeddings. There are different ways to think about how you represent that data to your algorithm: it doesn't have to learn that gene X affects gene Y, because we've already told it that. So how do you represent structured knowledge that you're confident in? This means we can be more sample-efficient. We view all our ML algorithms as information engines: we want to be sample-efficient with the data, so we learn the things we don't know rather than relearn the things we do know. That's key. So that's another big area. And once you've built these knowledge graphs, there are lots of other uses beyond the machine learning group. Any biologist can ask: what's new about my protein? Here are all the facts about my protein. Does X do Y? I've got a hypothesis that X is involved in Y. No one holds the scientific literature in their head anymore; it's too complicated. So you can go and query it. The interesting thing is, as I said before, in a big 300-year-old company there are things that have come and gone that people have forgotten. You can mine your own data and go: oh, we did an experiment about that; or, that's an interesting protein, and by the way, 20 years ago someone worked on this, or on a related thing, and found a molecule that affects it. It wasn't what they were looking for at the time, but we've got it on a shelf somewhere. So that kind of thing becomes the brain of GSK. And it's not just the papers; it's also the datasets behind the papers. You can put whole datasets, where people are doing these big experiments at industrial scale, into these knowledge graphs as well.
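A toy version of such a triple store and the queries on it might look like the following. The entities and predicates are invented examples, and a shared-neighbor count stands in for the learned embeddings and link-prediction models Kim mentions:

```python
# Toy subject-predicate-object store; entities and predicates are invented.
triples = [
    ("GENE_A", "upregulates", "GENE_B"),
    ("GENE_B", "associated_with", "DISEASE_1"),
    ("DRUG_X", "inhibits", "GENE_A"),
]

def facts_about(entity):
    # Everything the graph asserts with the entity as subject or object,
    # i.e. the "what do we know about my protein?" query.
    return [t for t in triples if entity in (t[0], t[2])]

def neighbors(entity):
    return {t[2] if t[0] == entity else t[0] for t in facts_about(entity)}

def shared_neighbor_score(a, b):
    # Crude link-prediction heuristic: entities with common neighbors may be
    # related; learned node embeddings play this role in the real system.
    return len(neighbors(a) & neighbors(b))
```

For example, `shared_neighbor_score("DRUG_X", "GENE_B")` is nonzero because both connect to `GENE_A`, weakly suggesting an unobserved relationship worth checking.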
are lots of experiments where people are doing screens for a particular function or things like that and they come up with lists of genes that are known to be involved in things you can import that knowledge as well into the knowledge graph so it becomes a sort of growing reference space to use one thing i'm curious about is you think of kind of how your team operates against a quadrant of kind of innovating on the biology and innovating on the machine learning i'm curious where you find them and where you want them to be you know in this space it's moving so quickly oftentimes you may have to innovate on the machine learning to make it work for your application um is that the case here yeah so i think that um in general we don't work on very many things as a group so we're about a you know 120 person research group and we're quite globally distributed i'm in san francisco and we have team members in you know boston philly london uh tel aviv heidelberg switzerland right we're kind of everywhere um but a lot of things we work on like there isn't a solution right there isn't a variant-to-gene off-the-shelf piece of software algorithm right because the data's not there you got to build the whole thing but there are cases where we can borrow things and there are cases where we have to do research so we do a lot of research into causal machine learning because obviously we want to come up with things that are causal for the disease so like if i you know what i would say have a small level of clinical hysteresis and all that means is basically a small change in this particular like you're drugging against this thing like i don't have to knock it down 100 percent if knocking it down 10 percent has a large effect on the course of disease that's the easiest medicine to make and something like well if you have to take this down by 99 percent of its level in the body before you might see a clinical effect that's probably not a good
candidate for medicine so we do a lot of work on like reconstruction of causal structure from network data and things like that and there are some areas um where you know you might start off with just taking things that have worked really effectively so computational pathology is a great example um so pathology is when you have you might have seen those pretty purple slides of a tumor and things like that or biopsy or things like that and so typically there were the hematoxylin and eosin stained slides and they're looked at by a human pathologist who does things like oh stage one two or three cancer for example there's obviously some information in the image the higher the number the worse it is right um now what happened was you know we got digitization happening so people started to scan these slides at high resolution right and these are big images right these are like four terabyte images gigapixel images and then on the other side we've got convnets and u-nets and things like that sort of sitting around so the natural thing was like well i'll just take a u-net i'll take a convnet and i'll see what i can do and i might want to segment things i might count the number of types of cells in the slide rather than having a pathologist go through and do that right i might want to say well what's the tumor stroma ratio like so that when you take a tissue block you might have the tumor growing here and there's normal tissue around it right how big is that area for example what are the characteristics and so it started off with people doing those types of things right and those technologies work but then almost everywhere you go we start to ask more advanced questions you innovate on the methodology right so some of the things we do now are well can i predict the genetic status of a tumor just from the image alone right and you ask a pathologist they're like oh could you tell me whether this tumor has
this particular mutation from looking at it they're like that's crazy there's no way i can do that i'm a human right and you think okay well but you can actually build models and we've done this that can actually predict like the genetic status of the tumor so there are subtle micro changes in environment and stain density and things like that due to like the changes in biological processes right a human eye can't obviously be sensitive to that but you can sometimes be retrospective and see what features the model has learned but suddenly you're taking the convnets and these types of things and resnets frequently and then you're starting to push them into different areas you start to tinker with them again and then you're finding different architectures so again we see that also in you know cellular imaging as well where we use the same types of things where we're looking at cells and how their phenotypes change right what they look like pictorially as they change when we give them a drug or don't treat them with the drug for example so it doesn't take much before you've taken some off-the-shelf technology but typically you're starting with an architecture or something else like that that you will then adapt to your use case right so that big um you know variant to gene algorithm right we cast that as a ranking problem there's been lots of machine learning research into ranking problems for a long time right there's off the shelf tooling and things like that and ways to think about things so those are things we start with right that we bring in and you know but there are some things where it's wholly new algorithms and architectures and things that we're sort of having to invent uh as well and does your team publish in those areas yeah so yeah i mean we publish all our code and our work and it's kind of really important you do that
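The tile-then-pool pattern behind this kind of slide-level prediction can be sketched as follows; the patch scorer here is a stand-in for a trained convnet, and the slide-level "mutation" score is purely illustrative:

```python
import numpy as np

def tile(image, patch=64):
    """Split an (H, W) image into non-overlapping patch x patch tiles."""
    h, w = image.shape
    image = image[: h - h % patch, : w - w % patch]   # drop ragged edges
    tiles = image.reshape(h // patch, patch, -1, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch, patch)

def patch_score(tiles):
    """Stand-in for a CNN: maps mean stain intensity to a (0, 1) score."""
    m = tiles.mean(axis=(1, 2))
    return 1.0 / (1.0 + np.exp(-(m - 0.5) * 10.0))

def slide_prediction(image, patch=64):
    """Pool patch scores into one slide-level score (mean pooling here;
    attention-based pooling is also common in pathology models)."""
    return float(patch_score(tile(image, patch)).mean())

rng = np.random.default_rng(0)
slide = rng.random((512, 768))   # toy "slide"; real ones are gigapixel
print(slide_prediction(slide))
```

The tiling step is why gigapixel slides are tractable at all: no single forward pass ever sees the whole image, only patches plus an aggregation rule.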
because you were talking before about like how do you find people right so i think what's happened is um there's a lot of people that have realized that you know there's now lots of this data appearing in biology right i mean even since post-covid everyone's really interested in human health right and you want to find an environment where you have the computation resources and the people to do that um but nobody wants to join somewhere and then like vanish into a black hole right and we use the models you just talked about like you know i would apply resnets to computational pathology and start with that because it's out there in the literature in the domain and that's why ai has moved so fast is that free exchange of ideas and test sets and state of the art benchmarks and we love all those things as well so you know we publish our code right because it's the data that's really important right so you know if it's a model that we've built we've got the code and there's a public data set and we can build a model we would also publish that model built on public data right because the data gsk is generating allows us to build a model at much higher quality right that's our kind of strategic advantage right so it's again it's all about the data and that's similar to other industries right you know facebook publishes lots of really cool graph algorithms and things like that they don't give you their social graph data right similar for us but it also means that we can contribute to the community um we're running a challenge i think at iclr this year as well on gene discovery and causal discovery from networks and things like that um i think we've got two or three papers in neurips this year out of the group as well um so yeah we publish you know both in the conference proceedings and in other scientific literature as well it's very important for us nice nice do you
think much about uh tooling and infrastructure platforms i have read yeah uh i've read about your ai hub um a bit so yeah the answer is yes absolutely um so you know there's a few so the infrastructure i mean there's one thing i learned from doing startups it's like the best time to start building infrastructure is now and the next best time is tomorrow because it allows you to scale and if you suddenly have infrastructure problems it's really difficult to solve once you're in that phase in the ai team we have a whole group that's really our ai ml platform organization um and they build all the kind of tooling and infrastructure for us to kind of um you know deal with data containers and running things and scaling algorithms and those kinds of aspects and it's about not only just um you know gpus computing things you know we use pytorch extensively uh it's our preferred platform of choice but it also comes down to the things we're looking at we end up needing kind of like novel compute right so we've had a strategic partnership with a company called cerebras which some of you may be familiar with right so cerebras is one of these companies that built like you know a really really amazing piece of hardware um and so we use cerebras for a particular type of problem where we're building these encoder models on dna now what's interesting with these encoder models is we want to have a really really large window size right and so you get into this it's really challenging to build model parallel and data parallel kind of algorithms at this sort of scale and the data sets we're passing over are really really large as well right these big genomic data sets so that was a really interesting problem and the cerebras system is like it's got massive throughput it can handle a really big model because of the scale of the chip they produced and the bandwidth between the chip and the memory was
really large so you know we started working with them we have a cs1 system we'll have our cs2 soon and you know that becomes a strategic thing that we can start to build new algorithms and play with new things on and actually build models for and then you know so the compute is really important for us it's uh it's key to be unconstrained by compute right to be like you know to think of like what's the best way to solve the problem we're usually constrained by data like that's the data i'd love to have right and so this is also why you know we also work um with nvidia where we're pushing the bounds of cuda so we have a strategic arrangement with nvidia where we have people on site in london where you know we're making changes to low-level you know cudnn and things like that or working with them on those types of things so you know how do we make it easy for us to focus on like you know only one problem which is the science problem rather than two problems like the science problem and the engineering problem right because the challenges we face are hard enough right so that's a key component for us and you know the other thing is like we want to think about how many iteration cycles a machine learning person can do per day right i don't want someone sitting there like i've got an idea i've kicked it off well i guess that'll be done in two days time i'll sit here and wait no i want people to like look at something have an idea or have a bunch of ideas kick them off and then actually get the results back that afternoon and think about it and then you know run another set so that's also really key for us so as we grow we've needed to add you know every new ml hire requires you know a bunch of a100s or whatever we need like added to the stack right there's a cost yeah you mentioned earlier you reference a feature factory so you're yeah developing these features and you
reference a feature factory is that a concept or an idea or is that a physical thing like a feature store so you can imagine for you know if i've got uh you know that section of dna and i've got my variant in it right there are different models that can like tell you different things about it so in the analogy with the web page i would have the title of the web page the links to it the text in those links you know the content of the web page the word count the author the date those are all features about the website right and you know you have tools and code that could pull those things out and represent that in a featurization to some kind of ml model similarly in this case we look at that whole stretch of dna and the disease and the cellular context and there are algorithms in this case they're models themselves that work out how to featurize that to represent that to that whole ranking algorithm so we term those sorts of things a feature factory right so one that looks at the raw dna sequence will say well this is open or closed chromatin the other one will say well this gene is on in this cell type right you know and you might imagine the model when it's trying to learn how to rank the importance of those things to give you you know its candidate list of genes might say well this gene isn't even on in this cell type that's involved in this disease so this is probably a low you know unlikely thing right this one's in closed chromatin that's unlikely this one's in open chromatin it's involved in this disease and things like that so it becomes a good candidate so it learns how to rank all these different things and how to combine them and so it's not a sort of physical thing but it is a sort of featurization type aspect right so it's basically featurization by other sub models themselves when you think about all of the various uh algorithms and tools and things
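The feature-factory idea can be sketched as a set of sub-models that each emit one feature for a (variant, gene) candidate, feeding a downstream ranker; the features, weights, and gene names below are all invented for illustration:

```python
import numpy as np

# each "sub-model" is a stand-in for a real predictive model
def chromatin_feature(candidate):
    """1.0 if the variant sits in open chromatin, else 0.0."""
    return 1.0 if candidate["chromatin"] == "open" else 0.0

def expression_feature(candidate):
    """1.0 if the gene is on in the disease-relevant cell type."""
    return 1.0 if candidate["expressed_in_cell_type"] else 0.0

def distance_feature(candidate):
    """Closer variant-gene distance -> larger feature value."""
    return 1.0 / (1.0 + candidate["distance_kb"])

FEATURE_FACTORY = [chromatin_feature, expression_feature, distance_feature]

def featurize(candidate):
    return np.array([f(candidate) for f in FEATURE_FACTORY])

def rank_genes(candidates, weights):
    """Linear scorer standing in for a learned ranking model."""
    scored = [(c["gene"], float(weights @ featurize(c))) for c in candidates]
    return sorted(scored, key=lambda gs: gs[1], reverse=True)

candidates = [
    {"gene": "GENE1", "chromatin": "open", "expressed_in_cell_type": True, "distance_kb": 10},
    {"gene": "GENE2", "chromatin": "closed", "expressed_in_cell_type": False, "distance_kb": 2},
]
ranking = rank_genes(candidates, weights=np.array([1.0, 1.0, 0.5]))
print(ranking)
```

The point of the pattern is the separation: each sub-model can be retrained or replaced independently, and the ranker only ever sees the feature vector, not the raw DNA.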
that kind of factor into and enable what you're doing and look forward what are the areas that um either you need the most innovation to happen in or you're excited because you see the innovation happening you know whether you know we're talking about algorithms or tooling infrastructure that kind of thing yeah i think one of the things that i'm deeply interested in is robustness and reliability constraints right and uh this plays into you know a debate in regulation and things like that is that as we start to build sort of probabilistic reasoning systems and imagine let's take the pathology example where i have images and things like that um and maybe you know we've designed it well so we've made sure we've got like you know a bunch of people from different backgrounds you know we've got a large training set that includes underrepresented groups we've done the best case to do that that's very important um and we built a model and the model has good performance characteristics you know maybe 80 percent of the time it's correct when it predicts someone has a particular genetic status right so we know then who to sequence or not for example i'm just using a hypothetical scenario um how do we know how that model behaves you know what's the adversarial image for like you know a pathology thing like it's like maybe an area is out of focus right or it's got a pen mark or there isn't enough tumor in the section how do we know that this model fails gracefully how can we define its bounds of operation things like that and you know for some other methods and things it's easier to define and construct that you know for a neural network it's harder to know how you know some of those changes result in you know the decision boundary actually moving and so knowing that you've trained a model that's um for some of these scenarios you would trade off some you know some performance for robustness characteristics right but it is hard to know how
those robustness and reliability characteristics happen so we do a lot of research in that area and we have various groups we interact with and phd students we sponsor but that's a really active area you know it's not just our organization particularly it's across the industry where people want to know about those sorts of aspects and that's where you get into monitoring those sorts of things but for us it's all about knowing how i can measure that i found a good robust solution and it's not brittle right that small changes in the input don't lead to large changes in the activation um that's one key area uh i think for us that you know looking at simpler transformer architectures that could lead to the same kind of performance is another really key thing because we can train them faster and things like that so understanding a little bit of that you know um model parameter space the performance trade-offs and those sorts of things you know um where you think well maybe there is a simpler architecture that can do just as well uh that's a common area of research and you know and then a lot of it actually comes down to really uh biology is all about um low n high dimensionality right and sort of biases and time series right so a lot of our data if you think about it that idea where i've edited my variant in and i'm looking to see which genes change right well i can measure that six hours after i've made my edit 12 hours or 24 hours and the whole thing is changing over time right and so we're actually beginning to incorporate those temporal dimensions so this is where we have a lot of time series data and we can generate that in biology and this is where i think that's another key area of research and probably the final one is really sort of multimodal and end-to-end learning so the classic
example is i have some cells in a dish i'm taking images of them and i might pull them out and do rna-seq and i've also got my time series let's plug it in for good measure and i want to start to look at that and i want to build a model that can classify when a perturbation has made a cell look like the wild-type cell the healthy tissue and made its gene expression look like that right and i'd like to be able to end-to-end learn that right now we typically learn the gene expression model and the cellular imaging convnet separately we maybe take the top two layers of those sorts of things and you throw that into another model that learns to integrate them we don't propagate the error back down through all of that just because of the complexity of the thing yeah but that's an area where i think that you know could be really useful and you know typically you know i mean the convnets have a fixed convolution size right you could use attention on that you can have these dilated and flexible convolutions where you could adapt that because you know rather than picking the one that looked good it could be a very specific one that could work better for that problem right so that's a massive area of research for us and because we have that in medicine right we have um you know your pathology biopsy might be done once right when you're diagnosed but we can do clinical imaging every seven or eight weeks so it's like a ct scan or mri for example but i might do your blood work right i can sample you know take a vial of blood and i might look for circulating tumor dna right so things like grail or freenome or those sorts of things um guardant health right those sorts of assays we're looking at literally dna in the blood that's come from the tumor cells right and sequencing that and that could be done at different time scales but they're all multimodal things about that particular patient and
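The late-fusion setup described here (frozen per-modality embeddings, a small trainable fusion head, no end-to-end backpropagation through the branches) can be sketched as follows; all dimensions, branch functions, and weights are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def image_branch(image):
    """Stand-in for a frozen, separately trained imaging convnet."""
    return np.tanh(image.reshape(-1)[:32])

def expression_branch(expr):
    """Stand-in for a frozen, separately trained expression model."""
    return np.tanh(expr[:16])

# the fusion head is the only part that would actually be trained
W_head = rng.normal(size=(1, 32 + 16)) * 0.1

def fused_prediction(image, expr):
    """Score in (0, 1): did the perturbation make the cell look wild-type?"""
    z = np.concatenate([image_branch(image), expression_branch(expr)])
    logit = (W_head @ z).item()
    return 1.0 / (1.0 + np.exp(-logit))

image = rng.random((8, 8))   # toy cell image
expr = rng.random(100)       # toy gene-expression vector
print(fused_prediction(image, expr))
```

End-to-end multimodal learning, as discussed above, would instead backpropagate through both branches jointly; the sketch shows the simpler regime that is typically used today.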
you're trying to integrate all those together to say what's your particular outcome going to be are you responding to this therapy what therapy should we give you next for example and to some degree that brings us back to compute because the scale required to integrate all this together is significant yeah it's compute and data right because we can measure so many things right but often what you see in biology is we can measure more data on things that are less rare than people so i can learn lots of data on cells in a dish right but cells in a dish aren't going to give me the effect on like you know a whole person right i can't ever get information about whole organ failure from a simple culture of hepatocytes but i mean we're starting to see more complex innovations in biology so things like organoids and things like that where they start to have more of the complexity and where we end up finding machine learning useful is actually building a bridging model from the thing that we can perturb and measure at scale to how well does that correlate to you know to humans where we can't perturb humans we can treat humans if we've got a really good thing and we can measure things about them right so um that's sort of another area um that's one of the things that we were actually doing with this king's college collaboration that was recently in the press where what we'll be doing is we're taking um tumor samples from patients right and we can culture that tumor and it's not an organoid it's the tumor but it's plus their immune system and critically it's the immune system components from that particular patient and then we can start to see how that responds right with various drugs or interventions and we can measure various things about that and look at that over time and the idea there is to sort of build a model of
you know how best to treat that patient what are the characteristics or even what is the risk in lung cancer for example you might resect it you're hoping that you've got it all when there's no secondary metastasis but there are some people that see a higher rate of secondary metastasis than others right so how can you identify that for example so it's this really interesting interplay between the development of experimental biological techniques the ability to generate data at scale right and the ability to build models to kind of connect that back to humans you mentioned robustness as being important and before that you talked a little bit about explainability i'm curious how you think about that and how you approach uh machine learning problems with those concerns in mind do you you know drive for performance and then back off to the explainability requirements is it the other way around is it some kind of hybrid it's a really interesting debate because a lot of the time um you know i think people kind of use interpretability and all these types of things as a proxy for i don't trust or understand sufficiently the engineering validation right and so you know i mean the question is like okay i can give you a very simple model i'm like oh i'd like a logistic regression with like six parameters okay if i give you a logistic regression with six parameters right and maybe only allow like you know positive non-zero coefficients right a huge number of functional forms can be in there and most people can't look at it in their head and really understand how it makes this decision right or even set the threshold and we use technology and systems everywhere day-to-day without knowing how they work right and where it comes down to is actually whether you have reliability constraints and how it performs right um and that's the sort of trade-off but there's also a
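The six-parameter logistic-regression thought experiment is easy to make concrete: even with only six non-negative coefficients, you still have to compute the decision rather than read it off the weights. The coefficients, bias, and inputs below are arbitrary:

```python
import numpy as np

w = np.array([0.3, 1.2, 0.1, 0.8, 0.05, 0.6])   # six positive coefficients
b = -1.5                                         # bias / threshold shift

def predict(x, threshold=0.5):
    """Return (probability, decision) for a binary feature vector."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return p, p >= threshold

# two superficially similar inputs land on opposite sides of the boundary
p1, y1 = predict(np.array([1, 1, 0, 0, 1, 0]))
p2, y2 = predict(np.array([1, 0, 0, 1, 1, 0]))
print(p1, y1, p2, y2)
```

Both inputs share two of their three active features, yet one clears the threshold and the other does not, which is the point being made: "interpretable" functional forms still demand computation, not intuition, to predict their behavior.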
trade-off between when do we really need to have secondary checks and things like that when we're making really big decisions right like you know avionics for flying my plane or i'm about to diagnose someone with something you know i really need to how do i have a proven source of data how do i make sure it's robust and there are sort of things in the discovery phase where do we need to have something that explains to a human scientist these are the key features why we think this cell type is more like that cell type and this target's pushing it in the right direction versus and there's a trade-off between like well we're asking the machine to do tasks a human can't we're pulling in so much data and we're going to take those results and then check them in other ways right there i would rather harness the full power of machine learning and not hamstring the system by saying well okay here's a saliency map and asking do i agree with that or not or the functional form so depending on what we're doing it's a trade-off but what we always care about is making sure we build a robust reliable model so like understanding how you're assessing it right how you measure the performance how do you understand that like you know you haven't somehow got information leakage and those sorts of things coming through so it's a tension but where you do find tension is where you're replacing what someone would be doing manually right so there's a whole thing of like well is this thing better than me how do i know it's working what's the quality aspect and usually what you want to say is like look i'm here to automate the boring so you can actually go do higher level science and spend your time you know looking and analyzing this right and also giving them sort of an audit trail so they can go back and look at the data that went into the system and maybe look at it for themselves or
diagnostic tools right on the model's performance as well you know so things like is the input vector within a vector space that's well bounded by the training and test set what does the error manifold look like over that thing is it uniform are there spiking regions you know because a lot of our performance measures are global measures of model performance but you could easily take the input vectors you can tile them on a 2d plane you can work out the error function over those things right and you can say oh wow overall it's a pretty good model but like you know for small molecules that look like this this thing's lousier right so having uncertainty bounds those sorts of things they go a long way to actually putting these things in production and you know another thing is that it's really important not to just have any model ever spit out a number anything we put in production has to give you both a number and a confidence bound and also has to be able to refuse to return a value saying i don't have that this is so far out of what i've seen boss i have no idea we can collect those we can log those and maybe once we've got enough of them we can build another model and over time almost every model becomes a cascade function it's like well this is the global model this is the model for this maybe eventually we can unify them again but that's really important those are all the sort of functional things because honestly it's quick to make a model and once you give people a tool it's very quick for them to use it but it's the opportunity cost of the downstream decisions that they can make with it right they decide to do experiment a and not experiment b for example so we think a lot about those sorts of things um and you know when you start off it's always i want to know how it works what's the model thinking and things like that yeah but you know a lot of those theories of mind are not really truly how the model is thinking
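The production rule described here (return a value plus a confidence bound, or refuse when the input is far outside training support) can be sketched with a toy nearest-neighbour model; the distance threshold and the model itself are placeholders for whatever the real system uses:

```python
import numpy as np

class AbstainingModel:
    """Wraps training data; refuses to answer far outside its support."""
    def __init__(self, X_train, y_train, max_dist=1.0):
        self.X, self.y, self.max_dist = X_train, y_train, max_dist

    def predict(self, x):
        """Return (value, confidence_bound), or None to refuse."""
        d = np.linalg.norm(self.X - x, axis=1)
        if d.min() > self.max_dist:
            return None   # "this is so far out of what i've seen"
        # nearest-neighbour mean as value; neighbour spread as crude bound
        k = np.argsort(d)[:3]
        return float(self.y[k].mean()), float(self.y[k].std())

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X.sum(axis=1)
model = AbstainingModel(X, y)

print(model.predict(X[0] + 0.01))        # in-distribution: (value, bound)
print(model.predict(np.full(4, 100.0)))  # far out of distribution: None
```

The refusals are exactly what you would log, as described above, to decide where the next model in the cascade is needed.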
even if we do distillation right and things like that that's not really it it's sort of a hack on those things getting models to predict or describe their confidence is an active research area itself does that requirement that your models spit that out does that put a limit on the types of architectures you use or is it itself kind of an area where you have to question reliability and trust of that confidence element you know we do a lot of work on that internally i think um i guess francesco farin's got some really good papers coming out well he's published some that we've done on that sort of aspect we try and have some generic methods we can bolt on to any architecture so you can kind of separate the problem a little bit some architectures can also give you at the same time estimates of that depending on what you're doing but we also um depending where we are in development and where we're using it you might have greater or lesser requirements for that kind of aspect of things right so certainly once you're making big important decisions and you're putting things in production and it's going into other processes you definitely need to have those aspects but i wouldn't say every model built at gsk by every group will always have those characteristics for us where we're doing these big things that have you know big downstream consequences and decisions you need to have that sort of thing in place and i think that you know just in the community globally people are realizing that right you know and that's also about sort of monitoring things in production right you know uh checking things and you know there's plenty of examples of models going rogue and that kind of thing so for us it's a key thing um but depending on the stage that's how stringent we are on those requirements for it maybe kind of one more direction to briefly explore kind of zooming
out um can you speak a little bit about you know kind of building an organization like yours in the context of a large uh large pharmaceutical company um you know transformation implications uh organizational receptivity to probabilistic models that kind of thing is it um you know it's a research organization or a scientific organization at its core so do you you know not have the resistance that you know all places have or yeah look i mean i wouldn't have come here if it wasn't for having like you know hal as head of r&d because you've got someone who really gets it and knows what you've got to do hal himself like he's an engineer as well um it's really important to have an organization where uh it's core to the strategy like people like oh we're gonna have ai ml we're gonna do it and so when i came in i'm like look the normal way you hire people doesn't work anymore right i wanna interview people and i'm gonna give an offer in a few days like within like three days we're gonna do it we're gonna use hackerrank we have all these different things we bring people in and we're not gonna do the usual process where they get an offer in three months time or you know a month because they've got other places to be right um we want to show them that we're not this giant ossified organization that we can move fast and do things right um you know even to the way we're working you know we can use slack we're going to be distributed everywhere we're going to go where talent is you know we use a laptop and a thing we can work that way you know post covid i think the rest of the organization's caught up to that so we came in and like you know did things in a very very different way right from the way we interview hr hiring but also you know we all use macs the hpc requirements how we're going to do things and so we built like a whole new process to do things right that's the way we
work right we work in two week sprints we have these different types of things what's interesting is actually seeing um the wider organization kind of being given permission to think and innovate and like someone picking up some of those tactics and things like that and the methods of managing science right because when i first talked to my biology colleagues they're like you understand it doesn't work that way you can't plan things out like this things could take longer or shorter i'm like well you know computer science doesn't work that way either right we'd love to think that like we know the thing and we write down all the steps and we just do it that's not how it works right so it's always a garden of forking paths so doing that stopping every two weeks going what didn't work okay now we'll do this lets you have a really good way of planning and working across really complex teams as we do so you know we've broadened a lot of that culture um you know there are always people that are true believers right they're like oh my gosh ai can do everything you're like whoa slow down you know let's talk about what we're doing here and then there are people that are super skeptical right that are like well you know how is this anything different from before you know sort of stuff or at the other end no you'll never replace a human scientist human ingenuity you know i've got a better way of picking i can synthesize this in my head and maybe they really do have an alpha value maybe they really do have that but sometimes a lot of them i'm like yeah i think you've got a little bit of recall bias there as well so it's a tension and trade-off um and like all these things we'll work out where some of these tools based on the technology maturity stack are best used and where they're too early right um or where we don't have enough data or the right data that sort of thing but it's been fun and it's
been fun to sort of you know as the organization's grown and more people have come in and i think what people have realized is that um [Music] if you're interested in doing machine learning in bio and these sorts of things that um companies like gsk actually have lots of compute so if people that you know like jeremy with mit is like now i've got more compute and more data than i ever had right and i don't have to write grants and spend all my time doing these sorts of things and if you get it right you've got a whole machine that will kind of translate to an impact right to patients you know to really do those sorts of things and that's that's really important a lot of people as well is that is that connection to the part of the whole thing rather than okay i'm an academic i built something i've got a good idea now i have to make a startup it's a whole thing to like and it's going to take like so long for my work to get out there and actually influence the world so those are all good things um you know there are certain pluses and minuses to doing things in large corporations the smaller ones right a small start-up we can raise capital you know we're all in the same room or you know place we know what we're doing we're all in charge just go we don't have to have all those overhead big things large corporation takes some time to turn the ship but once you can focus that whole thing on something man you can really drive drive on it so different skills um you know certainly the largest cooperation sort of thing i've sort of worked in and uh but the organization has to want to do it i think is is the lesson i've learned like if they're not really into it or the senior leadership aren't really into it or a large fraction it's not a core strategy it's a very difficult thing right um and different companies are in different stages right some of them are externalizing it but we decided to build a really large in-house team what are the what are the couple three top three 
things that keep you up at night like what do you most worry about in your role the first thing is being able to generate um the data at the right kind of cadence right so um one of the great you know if you're in different domains right you can depending what you're doing you can get you know high frequency lots of new data generated quickly for what we have to do you know for example if you think of like reinforcement learning right you know i've got a simulator of the game or things like that i get many many samples i can run lots of experiments right i'm in bio land right like i don't get you know 12 million data points i get like 300 data points every four to six weeks and they cost and by the way it costs us a lot of money to generate those data points so then you know you start to ask this question of like you know what's my information gain what's my model performance gain per data point for time where am i my linear my plateauing like you know how many cycles do i need to run i can run 12 cycles a year i get so many data points is that enough right um so it's all about for me a lot of it is about a can i'm you know am i ever going to generate enough data to solve this problem or generate the right data um and another thing is that the cost of the experiment but you know people will ask like well how much data do you need to build this model kit and i'm like i don't really know yet you know i need more but then we'll start to see a trend but you know or am i collecting the right data so it's really about um those learning cycles those sorts of aspects i'm i'm really the other thing that sort of keeps me up enough is thinking about um the best ways and the ways that we have other data sources and things that we generate data there's lots of historical data in gsk that we um pull it together and use it in the right fashion right and you know that we remove uh it's really important to you know there's a patient data about individual people right but when we 
run a trial and we do things for people they're contributing to medical research and if you talk to a lot of people within trials they're like you say well we run your trial we put it into a box and we the message is successful or not right and like no one else can touch that data right i'm like that's that's criminal right they contributed medicine there's other things we can learn about it so what i'm concerned about is when we generate data and we're doing things as an organization how do we make stackable data sources to build this like longitudinal corpus right an individual medicine for a particular for say room site arthritis it may fail right which is terrible for patients and terrible for us our drug didn't work it didn't work as well as we hoped but the question is what do we learn from it and how do we build data sets that have the same common longitudinal characteristics you know maybe a common cause i can join them up together and i can build this longitudinal corpus of data and so a lot of time when people are doing an experiment in the lab you know they do a data they do an experiment and then you know they'll analyze it for that particular use case and that may then we lose the metadata or it's lost and things like that so i like to tell people like you know do you know build data for future you so you can use it again and also like you know collect those other data points at additional marginal cost right like that are really useful so those are the things that really keep me up um you know you know the the final thing is really about like from building models in pathology we're starting to do things that are really like you know doing patient prognosis doing prediction doing sorts of things saying this person's like the benefit from this medicine or not as we started going to learn that it's like we have to be right and the other thing is we also we make medicines for everybody um the challenge we have is that we're using lots of prior information 
or i'm not relying on on a system where you know i can get digitized medical records or pathology can be uploaded um we can raise the risk of building really great ai advances but only work for people that are in like you know countries that have the the data infrastructure and things like for that so we think a lot about you know what are the data like what's the data cold chain we have a cold chain for vaccines and medicine what's the equivalent of that you know we don't want to build these great advances and say oh i'm sorry you know that's you know it's going to take five to ten years for another country to be able to have access to it right those are my top three things well kim thanks so much for joining us and taking the time to share a bit about what you're working on and how you think about the the problems in your space thanks sam it's been a lot of fun i'm a big fan of the pod so cheers thank you so much thanks so much
Info
Channel: The TWIML AI Podcast with Sam Charrington
Views: 312
Keywords: TWiML & AI, Podcast, Tech, Technology, ML, AI, Machine Learning, Artificial Intelligence, Sam Charrington, data, science, computer science, deep learning, Kim Branson, GSK, pharmaceuticals, machine learning infrastructure, genetics, genomics, human genome, King's College, hardware, engineering, Cerebras, NeurIPS, 23andMe, machine learning platform, feature factory, BERT, knowledge graph, AWS, CUDA
Id: v6WOeOHVer0
Length: 64min 48sec (3888 seconds)
Published: Mon Nov 15 2021