Computational Drug Discovery: Machine Learning for Making Sense of Big Data in Drug Discovery

Captions
And so we do a lot of data science and bioinformatics analysis in the context of health promotion, and we also do a lot of drug design, so I think that is why I am here today. We're going to share with you how we use machine learning, data science and data mining to model the properties of drug molecules.

So that we get to know each other: I graduated with a bachelor's in biomedical science from Mahidol University International College, and then I did my PhD in medical technology, also at Mahidol University. For my PhD thesis I did a lot of QSAR, using data mining to create prediction models: to predict DNA splice junctions, to predict whether a drug will be toxic, and whether a drug will have the potency to inhibit the target protein. I've been doing that research for more than 15 years now, first as a PhD student, and then after graduation I continued my research into data science in the context of drug discovery, or what you could call computational drug discovery, for another 12 years. So altogether I've been here for almost 20 years: four years as an undergraduate, four years of PhD and 12 years as a faculty member, so now it's my 20th year. Time really flies.

So I'm going to talk about lead discovery and development, and my name is Chanin. Back then there were only a handful of people working in bioinformatics, so it was a pretty rough journey, because it was hard to find funding, to write a research grant and convince the grantor that bioinformatics was something worthwhile to explore. But now the situation has changed. Everything is data intensive and data driven: we have AI coming, we have the Internet of Things, we have a lot of new technology coming, and all of it is driven by data.

So far we have also contributed to the scientific literature in the form of research articles, review articles and book chapters. All of these are in the context of data
mining and drug discovery. If you would like more details about our research group, you can go to codes.bio for more information. We share all of our code, whether we code it in R or Python, and also the data sets that we use to build the models. We share everything on GitHub, so you can go to GitHub with the URL shown here and download all the code that we have, and you can pretty much reproduce what we have published. Our figures, tables and plots are in the code that we share, so you could reproduce our work, update the data and recreate it. Everything is provided.

Maybe you're wondering why I have a camera here. I also just started my own YouTube channel, so I'm a YouTuber now. It's a pretty awesome journey. I've already released four videos about data science in general terms, so today we're going to have data science in the context of drug discovery, and probably we will upload it to YouTube. Okay, that's the advertisement.

Now let's go into the details. If we look up the definition of the word "disease" in the dictionary, we can see that it refers to an illness of people, animals, plants, etc., caused by infection or a failure of health rather than by an accident. So a disease is essentially anything that is harming us, for example some abnormality in the proteins within our body, and we rely on bioactive compounds, or drugs, to remedy that. As you will soon find out in this lecture, there are a lot of details happening under the hood. The drug, the pill that we take, is essentially a small molecule, and it will interact with proteins within our body. We will cover that in the next slides.

A drug is a biological or chemical entity. If you have a biological entity, like an antibody, you could call it a biologic, but if you have a small molecule which is chemical based, you could call it a small
molecule or synthetic drug, or sometimes you can call it a natural product if you isolate it from plants. These molecules modulate the disease state, so they are called modulators. If the modulator molecule inhibits the protein, it is called an inhibitor, and if it activates the function of the protein, it is called an activator.

If we step back a little and look at the big picture, everything is interconnected. We have the drugs, we have the proteins, we have proteins interacting with proteins, and we have the genes governing transcription into the resulting protein. So everything is a big, complicated network, and it is quite interlinked. As you can see here, if we map them and look at the connections, we start to see some patterns. Some proteins are clustered together, performing similar functions as part of a common metabolic pathway, like energy metabolism or drug metabolism, so there are several proteins belonging to the same metabolic pathway.

So we have four major components here. We have the drug and we have the target, which could be a protein target, the druggable target. The drug interacts with the target to modulate it, either to inhibit or to activate the protein. And here we have the genes. Genes can be controlled by methylation or other so-called epigenetic markers; methylation or demethylation will regulate the expression from gene to protein. Once we have the protein, you use the molecule, a drug or a biologic, to modulate the disease state. So that is the drug-target interaction in a nutshell.

If we look at the big picture, the drug discovery process costs approximately two billion US dollars, and according to the Tufts Center I think the number was just updated to 2.6 billion dollars. It takes roughly 10 to 15 years, and the failure rate is more than 90%, so anything can go
wrong during the drug discovery process. Even if a drug passes the drug discovery process and gets FDA approval, people could still develop side effects or even die, so those drugs could eventually be taken off the market. So it is a very time-consuming and painful process. The question is how we can improve that, so let's find out.

Let's have a look at the particular steps in the drug discovery process. We have six steps here. Step number one is target identification. In order to develop a drug, we first have to find out what the target protein is, the target macromolecule that we want to modulate or disrupt. That starts from identifying the potential druggable target. The druggable target could be an enzyme, a receptor, or a protein that sits at a critical part of the metabolic pathway, for example the rate-limiting step. If you have a molecule that inhibits that particular protein, then the subsequent pathway will be inhibited. Or, if up-regulating that protein has a positive effect on the disease state, then we want to develop an activator to up-regulate that function. But before we do that, we have to identify which protein, out of the 30,000 proteins in the human body, is worth exploring in order to target a particular disease. This step is called target identification or target discovery. It could involve in vivo assays like knockouts, where you knock out some genes and see whether that gives the desired phenotype, or it could come from in vitro tests such as cell culture, or be done using expression analysis or even proteomic analysis.

The next step comes once we have identified a target. For example, there is a target protein called aromatase, which is implicated in breast cancer. Once we have decided that we want to target aromatase, what do we do? We
will start to screen for potential hit compounds. We want to know whether, in a big chemical library, there is any molecule that could bind to the aromatase protein, and if there is, we go to the next step, lead optimization. For the second step, say we screen a big library of 1 million compounds, and from those 1 million compounds we identify five potential hit compounds. From the five hit compounds, let's say we take one or two for further investigation. We call this lead optimization: we take the hit compound, make it the lead compound, and then modify it. This is the field of medicinal chemistry. We take the compound and substitute its functional groups. For example, we might replace a hydrogen with a halogen. If you look at the size, halogens and hydrogen are both quite small, so the bulkiness will not differ much from the original hydrogen atom, but functionally it will be significantly different because you have more electrons there. Halogen-based drugs are a very hot topic.

You could also use something called a bioisostere, meaning you retain the same functional activity of the compound while changing the chemical constituent, the chemical functional group. "Bio" means the biological function remains the same, and "iso", as in isomer, means something that looks equivalent, hence isostere. So a bioisostere is a compound that is chemically different but biologically functions in the same way. That also involves a field called scaffold hopping, which is another area of research for people in medicinal chemistry. If you would like to find more details, you can google "bioisostere" or "scaffold hopping".
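The functional-group substitution described above can be sketched in a few lines of Python. This is only an illustrative toy, not the speaker's code: the scaffold (a benzene ring with two para positions), the substituent list and the names `SCAFFOLD`, `SUBSTITUENTS` and `enumerate_analogs` are all assumptions made for the example, and the SMILES strings are built by plain text substitution without any chemical validation.

```python
from itertools import product

# Hypothetical scaffold: a benzene ring with two substituent positions (para).
# {r1} and {r2} are placeholders that get filled with SMILES fragments.
SCAFFOLD = "c1cc({r1})ccc1{r2}"

# Hypothetical substituents, written as SMILES fragments.
SUBSTITUENTS = {
    "H": "[H]",            # the unmodified hydrogen
    "F": "F",              # halogens
    "Cl": "Cl",
    "OH": "O",             # hydroxy group
    "CH3": "C",            # methyl group
    "OCH3": "OC",          # methoxy group
    "NO2": "[N+](=O)[O-]", # nitro group
}


def enumerate_analogs(scaffold, substituents):
    """Return {(name1, name2): smiles} for every pair of substituents."""
    analogs = {}
    for (n1, s1), (n2, s2) in product(substituents.items(), repeat=2):
        analogs[(n1, n2)] = scaffold.format(r1=s1, r2=s2)
    return analogs


analogs = enumerate_analogs(SCAFFOLD, SUBSTITUENTS)
print(len(analogs), "analogs, for example:")
for key in [("H", "H"), ("F", "OH"), ("Cl", "NO2")]:
    print(key, "->", analogs[key])
```

In a real lead-optimization campaign each enumerated analog would be synthesized and assayed (for example for IC50), and a cheminformatics toolkit would be used to check and standardize the structures; the sketch only shows how quickly the number of analogs grows with the number of positions and substituents.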
Scaffold hopping is a very lucrative concept. Let's say that someone has already patented a compound. If we want to get around the patent, we can employ scaffold hopping to hop from one scaffold to a totally different scaffold. Think of a molecule as a human body: the scaffold is like the body, and the functional groups are like the hands and the legs, the fingers which interact with objects. With my hand I can pick up the bottle, so think of my hand as the functional group; these are the halogen atoms, the hydroxy groups. The body is the scaffold. If I change the scaffold, it becomes like a chimera. Once we change the scaffold, we can patent a different scaffold, so we bypass the IP, the intellectual property. That is a strategy used by pharmaceutical companies to patent their own version of a drug. Structurally it looks almost the same, but chemically it is different, so you can develop your own patents.

In lead optimization you do a lot of substitution. You take one functional group, replace it with a different functional group, and in doing so you build what is called a structure-activity relationship. I'm going to cover that in more detail in the subsequent slides. We take a molecule and use synthetic chemistry to modify the functional group, from OH to a halogen to a nitro group to other functional groups, and for each modified compound we test the bioactivity in the form of IC50, Ki or EC50.

Once we have identified the potent compound, we do some ADMET optimization. ADMET stands for absorption, distribution, metabolism, excretion and toxicity; these are the pharmacokinetic properties of the drug molecule. The thing is, we want the drug to be potent against the target protein, to bind tightly, but we don't want it to be toxic, so therefore we have to
balance between potency and safety, that is, the pharmacokinetic properties. That is a very challenging thing: if we improve the potency, the safety is affected. If you think of this as a computational problem, it is a multi-objective optimization problem, because you have multiple objectives. You want absorption to be good, you want distribution to be good, but you also want potency to be very good. But if you increase the potency, maybe the toxicity also increases: the molecule binds more tightly to the target protein, but it becomes more toxic. So the question is how to find a sweet spot where we have relatively good binding, maybe not a hundred percent, but also a safe profile.

We start from a million compounds and then we try to funnel down to one compound. That is why it takes so many years. And imagine that you have identified one compound and then it turns out to be toxic during the preclinical stages; then you have to go back and start again.

Here is the same thing as the previous slide, but with a little more detail. You have biological evidence that this target protein is a modifier of the disease state, then you screen for hit compounds, and then you take a hit compound and modify it, and this is lead optimization. These are details that you can read up on.

As I mentioned, it is a multi-objective optimization. We want the compound to bind tightly to the target protein, but also to have optimal absorption and permeation to the target site, to be metabolically stable, to be non-toxic, and, most importantly, it has to be something that can be synthesized. Nowadays there is also a lot of emphasis on green chemistry: how can we make compounds that are easily synthesized and also not detrimental to the environment? So this is a very challenging issue. That's why we need computers to help us.
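The potency-versus-toxicity trade-off described above can be made concrete with a small Pareto-front sketch in Python. This is a minimal illustration under stated assumptions, not a method from the talk: the compound names and scores are invented, potency is treated as "higher is better" and toxicity as "lower is better", and `Candidate`, `dominates` and `pareto_front` are hypothetical helper names.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    potency: float   # higher is better (made-up score, e.g. a pIC50-like value)
    toxicity: float  # lower is better (made-up score between 0 and 1)


def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    at_least_as_good = a.potency >= b.potency and a.toxicity <= b.toxicity
    strictly_better = a.potency > b.potency or a.toxicity < b.toxicity
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Invented example data: five candidate compounds.
candidates = [
    Candidate("cpd-1", potency=8.5, toxicity=0.9),  # very potent, quite toxic
    Candidate("cpd-2", potency=7.8, toxicity=0.3),  # a possible sweet spot
    Candidate("cpd-3", potency=6.0, toxicity=0.1),  # very safe, less potent
    Candidate("cpd-4", potency=7.0, toxicity=0.6),  # worse than cpd-2 on both
    Candidate("cpd-5", potency=5.5, toxicity=0.4),  # worse than cpd-2 on both
]

front = pareto_front(candidates)
for c in front:
    print(c.name, c.potency, c.toxicity)
```

The compounds left on the front are the ones where improving one objective necessarily costs something on the other; choosing among them is the "sweet spot" decision. With more objectives (absorption, metabolic stability, synthesizability) the same dominance test simply compares more fields.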
Back in the day, drug discovery was more or less based on serendipity, on luck: how Alexander Fleming discovered over the weekend that in one of the petri dishes there was an inhibition zone, and that is how he discovered penicillin. But nowadays there is so much data being generated in the omics era that it is simply impossible to analyze the data manually. That's why we need computational approaches, programming and molecular simulation, in order to make sense of the data.

So how do we come up with new compounds? Obviously we look to nature for inspiration, because nature already has so many bioactive compounds. In the old days we had compounds from plants, herbal compounds. I think different countries have different remedies; it's the field of ethnopharmacology. There are roots to how such compounds came about, like traditional Chinese medicine and Ayurvedic medicine from India. So that is the number one source that we go to: we look to nature.

Another way would be to rationally create a new compound, "rationally" meaning from theory. How can we do that? We have to look at the protein structure, at what the protein's active site looks like. We have a target protein, and if we go back to Biochemistry 101, the protein has a pocket called the active site. Sometimes you call it the binding cavity, the binding pocket, the binding site, or the catalytic site, where catalysis takes place. A substrate comes into the binding pocket, the residues interact with the compound, and the structure is converted into a product because the functional group is modified. And if you look at the thermodynamics, you have the substrate, then you have the transition state, so the energy will increase, and then once
you produce the product, it will drop again. So if you look at the thermodynamics, it looks like that.

Another way would be to use computers to rationally create new compounds. There is a group from Switzerland called the Reymond group, R-E-Y-M-O-N-D; the links are in the next slides, don't worry. What did they do? They trained the computer to know organic chemistry, so now the computer can generate compounds. They took 13 atoms and used the organic atoms, carbon, hydrogen, nitrogen, oxygen, phosphorus, sulfur and halogens, to combinatorially combine the atoms in different combinations. Think of it like Lego blocks: you combine the Lego blocks in all possible combinations, and however many combinations you get, that is the number of compounds you will get. If you have ten different potential atoms and thirteen positions, it's 10 to the power of 13. And that is if you just link them into a linear chain, but compounds are not like that. Maybe you have carbon, another carbon, another carbon, another nine or ten, and then you could do it differently: maybe put a nitrogen here, or put it over here; you have carbon, carbon, and then you put the nitrogen in another position. So it's more or less a computational problem. The computer does this and generates millions of compounds, billions of compounds.

This group developed the database called GDB, which is also in your slides: GDB-13, GDB-15, GDB-17. With GDB-17 they were able to enumerate 166 billion compounds. And did you know that aspirin, coincidentally, is found among those 166 billion compounds? So imagine: if aspirin is already in there, there are potentially other possible drugs already inside that artificial compound library, waiting to be discovered. According to theory, they're in there.

And what about our known chemical space? If we go to the database
called PubChem or ChEMBL, they hold the known compounds that organic chemists have been able to create synthetically. There are only a handful of them, maybe 10 million or 100 million, but the GDB database has 166 billion. So think of what percentage of the potential number we have started to explore: we are exploring less than one percent, and there is 99 percent more that we need to explore. How can we do that? We need the computer, we need molecular simulation. Your data is so big. 1 million compounds, I think, takes roughly 400 megabytes on your hard drive; 166 billion would take about, I think, 200,000 terabytes. You know the small hard drives that we have, probably 1 to 2 terabytes; imagine 200,000 terabytes. We would need a lot of these hard drives stacked together, and one laptop cannot analyze the 166 billion. If you want to build a structure-activity relationship model, we need a very powerful computer.

So these are the actual details I was talking about. With up to 17 atoms you have 166 billion possible molecules that could be created theoretically. If you would like to find out more, this field is called compound enumeration, and you can read up on it in this excellent review article in Accounts of Chemical Research from 2015. This group did a lot of chemical space exploration. We also did some chemical space exploration in the context of breast cancer drugs, and we published that in the journal Molecular Diversity.

So what is chemical space? Think of it as the universe, and think of the stars as the compounds; think of shining stars as potent compounds. In the galaxy you have so many stars, and some are potent and some are toxic. So how do you navigate this galaxy in order to identify a potentially promising drug? We could do it empirically, meaning we could try each one of them and systematically test the bioactivity, but that would simply take
forever, and the cost would be enormous. But we could instead test it theoretically on a computer: build a structure-activity relationship, run a molecular docking experiment, and see whether the compound really interacts with our target protein. If it does, try it experimentally; if it doesn't, skip it and go to the next compound.

So this is the drug discovery toolbox, the tools that the drug discoverer will use. What kind of tools do we need? In the field of chemistry we use combinatorial chemistry. This is kind of like Lego. You have different functional groups and different scaffolds; the scaffold is like the body and the functional groups are like the arms, the legs and the fingers, and you put them together in all possible combinations. Think of Lego: you put the blocks together in one possible combination and you get one unique compound, and if you shuffle the ordering, the connectivity, you get another compound. That is combinatorial chemistry.

Each compound you then test experimentally. That's why there is high-throughput screening. You have 144 wells, each well has a different compound, and all of them have the same target protein. Then you run the bioassay for IC50 and look at the fluorescence: does it bind or not? Out of a million compounds you might identify a hundred compounds that can bind, but that still takes time and it's pretty expensive. So if we are on a budget, we can do the computational screening first and only try the compounds that are very promising. Instead of investing a hundred million baht, you could invest maybe 1 million baht in your bioassay.

Drug discovery then becomes more feasible for smaller companies or even academia. Most pharmaceutical drug discovery work is done by big pharma, but there is an initiative that academic institutes can also join. It is called Open Source Drug Discovery, OSDD, and the PI is based in
India, and he developed so many software tools and databases that allow small academic institutes and small research groups such as ours to take part in discovering new drugs, especially drugs that are not so lucrative for big pharma. If big pharma is not making any money out of it, or only a small amount, then they don't want to bother investing a lot of equipment and money in the reagents. That's why the academic sector has to take part.

So you have the target protein, and you want to measure whether your compound has the desirable bioactivity. That bioactivity could be IC50, EC50, Ki, percent inhibition or percent binding, and such data is compiled in databases called ChEMBL, BindingDB or PubChem.

Normally, do we have a break? We are about 25% into the slides, so we have about 50 more slides left.

I think we already covered this, and these are the databases. After the presentation I can show you how each of the databases works.

So what can computers do? You might be aware that computers can play Jeopardy and play chess: Kasparov and IBM's Deep Blue. You are aware that Google has developed a self-driving car. You are also aware that NASA uses computers to run simulations that help prepare their astronauts to go into outer space, and that Boeing and other aircraft companies use computers to run their simulations. Supermarkets give us a loyalty card in order to collect data about us, so for every transaction we make they know what we are buying, and then they offer promotions to us. Sometimes they understand us better than we understand ourselves, so they can predict when you are going to buy and what you are going to buy. Computers can also be used to predict epidemic outbreaks as well, and I
believe they have done that already in the US, for yellow fever I think. So the question is, why don't we use it for drug discovery? And we do.

Did you know that computers, particularly deep learning, can now paint like van Gogh and Picasso? We also now have this thing called fake news, where you take a transcript or an audio clip and generate the motion of a person using deep learning. It was actually a Thai researcher in the US who developed a program that could convert audio of Obama into a realistic-looking virtual Obama talking, with the lips synchronized to the audio. So in the future we might have more fake news coming up.

Computers can also be used to code music programmatically. There's a program called Sonic Pi. And imagine that you train deep learning, an algorithm that relies on neural networks, which mimic the human brain: you could train it to create its own music. You train it with Mozart, you train it with Bach, and then it might create its own music, which they have done. In such an experiment they trained the computer with a lot of Mozart's music, and then, based on that training set, it created new music that sounded kind of like what Mozart would make, but it was not made by Mozart. It's pretty interesting.

And computers can dream, visually dream. This is like van Gogh and Picasso: this is the image, and the deep learning algorithm creates something that looks like that, and this is the original painting. So they converted this image into this one, based on the original painting. And this is the neural network dreaming; it looks like a lot of mountains and forests. This was developed by Google, DeepDream.

Now let's go into the details of why we need computational models in drug discovery. So the thing is, we want
to understand the relationship between the chemical structure and the bioactivity. We have a chemical library, a library of many compounds, and the compounds look almost identical but some functional group is different. Maybe they are ninety percent similar, and at this "finger" one compound has a hydroxy group, another compound has a methyl group, and another has a methoxy group; a methoxy group looks like a methyl group but with an O attached to it. Then it could change to chlorine, a nitro group, etc. You can have a hundred different variations, and for each you test the IC50 or the Ki. Then you use that to train the computer, and the computer learns that if the molecule looks like this it will have high activity, and if the molecule looks like that it will have low activity. It then comes up with a rule: what should a drug with good activity look like, and what should a drug with bad activity look like? We can learn from such rules and use them as a guideline to design drugs that are potentially potent and have good pharmacokinetic profiles.

So what questions can be answered by computational models? Number one, they help in target identification. They help to answer: what target protein could my compound bind to and modulate? A lot of chemists have this question. They develop a lot of compounds in their lab, let's say curcumin or a derivative of curcumin, and they want to know what protein their compound could inhibit. Then, if we put ourselves in the shoes of a chemist: would my compound bind unspecifically to other proteins and cause side effects? In the field they call this off-target binding, and they also have the term polypharmacology. A compound that is supposed to bind to protein A could unspecifically bind to protein B, and actually that is not a bad thing, because if it
binds to protein A but it also binds to protein B, and protein B happens to be involved in another disease, we could repurpose that compound to treat the other disease. This is called drug repurposing or drug repositioning. There are a lot of cases: we could use antifungal drugs to treat breast cancer, and, as Dr. Malik pointed out, we could use antimicrobial drugs to treat cancer. So that's a game changer. We don't have to wait 10 years and invest 2.6 billion dollars. We could just take an existing FDA-approved drug and build a computational model to see whether it can bind and inhibit a target protein different from the one the drug was originally developed for. If it works, we could publish it and say that this drug could be repurposed to treat another disease, and a medical doctor reading that paper could then justify using that drug to treat the other disease, especially if there is no existing drug for it. So we're using knowledge that's already out there, but putting it to another use, kind of like teaching an old drug a new trick. It's essentially that, and it's very good. If you google "drug repositioning", you will see a lot of people are looking into that area, and that is where computational drug discovery is a key player.

The third thing: what type of compound can bind to the target protein of my interest? Before, I was talking about chemists: they have a compound and they want to know what protein it can bind to. Now let's move to the role of a biologist: I have a target protein, but what compounds can bind to it? Maybe I've been working all my life on this target protein, this particular kinase; which compound can bind and inhibit it? The computer can be used to do that, if you screen a big library of compounds
against your target protein and see which ones rank high, and then test those.

Number four: are there compounds similar to my query compound that may potentially exert similar binding behavior? This is the essence of drug repositioning: a compound that looks like another drug. If drug A can bind to target protein A and drug B looks similar to A, then drug B could also bind to target protein A. And if target protein B has no drug, you could call it an orphan receptor, because there is no drug that can bind to it. If target protein B looks like target protein A, then drugs available for target protein A could potentially be repurposed for target protein B, because target protein B looks like target protein A. So we can use the concept of molecular similarity to draw a linkage between compounds and proteins, whether a compound looks like another compound or a protein looks like another protein.

The fifth question that computers can help with is: how does my compound bind to the target protein of my interest? We can use molecular docking for that and visualize the docking poses, how the compound binds three-dimensionally to the target protein structure.

Number six: how can I modify the structure of my compound to enhance its pharmacokinetic properties or its bioactivity? This is a very common question asked by medicinal chemists.

So these are the key questions that a lot of people might be pondering about how they can apply computers to drug discovery research.

Now maybe a little bit of background history. Back in 1998, two researchers who contributed to quantum chemistry won the Nobel Prize. They looked at the electronic properties of a compound, and this gave birth to the field called computational chemistry: how you can visualize the electron distribution in a molecule and describe it in a quantitative way, so that
is in a way very essential for doing ligand-based drug discovery. There's a lot of terminology, because I'm cramming several courses into one lecture, so in the next slide we're going to have some new terminology for you to be aware of. Fast forward 15 years to the Nobel Prize in Chemistry in 2013. This is in the field of computational biochemistry: how can we use computers to simulate the catalytic mechanism of proteins and enzymes, how they actually convert substrate to product? There were three researchers: Martin Karplus, Michael Levitt, and Arieh Warshel. Martin Karplus is from Harvard; he developed a program called CHARMM for doing molecular dynamics. Michael Levitt, from Stanford University, is one of the pioneers of molecular dynamics. And Arieh Warshel is from USC, the University of Southern California; he uses a lot of quantum mechanics/molecular mechanics in order to simulate the biochemistry of proteins. This figure is based on a review article that we wrote in 2015, entitled "Maximizing computational tools for successful drug discovery". In a nutshell, we discuss the tools that are available out there, free or commercial, and how you can make use of them to plan your own drug discovery project. Think of it as cooking: you have the raw materials and you have the equipment to make the food. Your raw materials could be the chemical databases of small molecules, the databases of proteins, and also the databases of protein-protein interactions and pathways. Then you have the equipment to make your food, like high-throughput screening and other approaches. Your approaches could also include bioinformatics, how you use computers to visualize protein structures and gene sequences, and how you use computers to make sense of chemical structures, which is the field of cheminformatics. You use informatics in some field, then you combine the terms together and
then you get a new term like cheminformatics: it's informatics in the context of chemistry. Bioinformatics is informatics in the context of biology, and you could also have, you know, pharmacoinformatics. So you could create your own field like that. Chemogenomics is chemistry in the context of genomics, and then there is the field of chemical biology and interactomics: how does this protein interact with another protein, how does this protein interact with a compound? All of this will be used in the middle here, so you could design or discover a new drug along four major pathways. Number one, you could look at the drug in the big picture, as a network, where each unit is called a node and you look at how they interact, like a spiderweb. Each node is a different protein, and the connection is called an edge. If you zoom in to one of the nodes, a particular protein, then you could employ a structure-based approach: you take that protein, crystallize it, do X-ray crystallography or NMR, and you get the protein structure. With the structure-based approach you can use molecular dynamics or molecular docking to look at how the protein binds to a compound. Then there is the ligand-based approach: you look at the compounds and work with a chemical library, so in this area you will use a lot of cheminformatics, and based on the chemical library you develop a structure-activity relationship. Further along the same line, instead of looking at the whole compound you look at fragments: you take a compound, chop it up into small fragments, and study each individual fragment. The fragment is sometimes called a substructure, because the structure is the chemical structure and a substructure is a fragment of it. You could also call it a privileged substructure if that particular substructure is important for drug discovery. If you google privileged substructures for anti-breast-cancer drugs, you will see a lot of people saying that in order to be a good anti-breast-cancer drug you need a methyl group at a particular position so that it can interact with aromatase at a particular amino acid. So they're looking into the details of which functional groups you should have in order to develop a drug; that's called a privileged substructure. So we'll have a small break and then we can continue.

Okay, so we are back for part two. We stopped right before we talk about bioinformatics. As I mentioned, you apply informatics to biology and you call it bioinformatics; you apply it to chemistry and you call it cheminformatics. Has anyone here ever used any bioinformatics tools? If yes, which ones? Oh, you did a master's, okay. So what did you learn? Did you visualize some proteins? PyMOL, yeah. KEGG, the KEGG database, okay, so that's biochemical pathways. BLAST, okay, so looking at some motifs, some genes, BLASTing them. Sequence alignment, okay, so you're familiar with the FASTA format, doing multiple sequence alignment, building some phylogenetic trees, looking at clustering, hierarchical clustering. Looking at protein structure: PyMOL, Chimera, okay, so that's very crucial. I see, so mostly gene-related, phylogenetics, looking at evolution. If you look at evolution, you have convergence and you have divergence. Imagine you have two organisms cohabiting in the same ecosystem: they're from different origins, but they're living in the same area and competing within the same population, so they could co-evolve and undergo something like parallel evolution. But if you have species from the same family and then you put them
into different environments, and each of them will then adapt to its environment and evolve further, so then you have divergence. And the interesting thing is that genes that diverged at some point could give us ideas that we could use for drug repurposing. Maybe some genes are methylated and inhibited from functioning, and when the organisms are in different environments they will evolve their own metabolic pathways; some will be inhibited, some will be activated differently. So think of it as activating and inhibiting some of the genes at the genetic level, and each organism will develop its own protein expression profile. But the thing is, proteins that diverged at one point still look similar, so a compound that can inhibit protein A, as I mentioned, could also inhibit protein B. That's why an antimicrobial drug could be an anti-cancer drug: the cancer drug target looks similar to the bacterial drug target, but they're in different contexts. So that's where we could apply phylogenetics. Okay, back to bioinformatics. You're using informatics to analyze and compare genes: you want to know how gene A differs from gene B, and that's why you do sequence alignment. Or even proteins: you do protein structure alignment, you take two protein structures and superimpose them to see how they differ. You take two gene sequences, align them, and look at the consensus, denoted by the color or by the symbols you get from Clustal Omega, ClustalW, or ClustalX, depending on which environment you're using: on Windows it would be ClustalX, on the web it would be ClustalW, and the new version is Clustal Omega. C-L-U-S-T-A-L, it's a program for doing sequence alignments. The asterisk at the bottom tells you that all of the sequences have the same amino acid at the
same position. Some positions will be, say, 50% similar, some will be totally different, and some positions will be all the same; we call that the consensus. So aside from analyzing and comparing whether two genes look alike or two proteins look alike, and looking at the phylogenetics of that, you could also explore what the structure looks like: you zoom in, zoom out, rotate left, rotate right. Then you get to think: the protein structure is like this, the ligand structure is like this, they interact like that, and there's some room, so maybe we change a methyl group to an ethyl group to make it a little bit longer and it will bind better to the target protein. That is also possible with a protein structure visualization tool; we call this molecular modeling: you visualize the molecule, rotate it, and move it about. Actually, there's a company that has developed a kind of augmented reality or virtual reality where you put on special glasses and navigate through the protein structure with a special motion-control tool, kind of like in the movie Minority Report, where you use your hands to move objects around. I think in the future we're pretty much moving in that direction: augmented reality, a lot of VR games, or even in Iron Man, where the special glasses recognize a person by facial recognition and the computer suggests the name even though he forgot it: this one is called John, or Jane, or whatever. Okay, so aside from looking at the structure and trying to discern the function, you can also use bioinformatics to analyze network biology. The KEGG database is a database of biochemical pathways: it tells you that this protein is involved in energy metabolism, in the Krebs cycle, and all that. So that's network biology and metabolic pathways. Cheminformatics: this is
using informatics for chemistry. You do a lot of things here: you describe the molecule in terms of its electronic properties, you describe it in terms of its descriptors. In order to do a structure-activity relationship, we have to find a way to quantitate the chemical structure. What is the molecular weight of the molecule? What is its solubility? Different molecules will have different solubility, different size, and different composition of their atomic constituents. One molecule could have 11 carbons, two nitrogens, and one oxygen, and if you modify the compound you get different atom counts. But that's the superficial layer; you could describe the compound in so many ways. There is the connectivity, like a graph, of the atoms, because the atoms are connected differently: they're not all linear, they branch in secondary and tertiary ways in three dimensions. One carbon could branch out to four atoms, according to the rules of organic chemistry: carbon forms four bonds in a tetrahedral arrangement, while oxygen has a bond with a carbon or with a hydrogen, so you could have OH or OH-, plus some lone-pair electrons. So you quantitate the compound, and it is represented by numbers. It's the same with us when we go for a health check-up: we measure the diastolic blood pressure, the lipid profile, and other blood parameters like the CBC, hemoglobin levels, and many more, and these quantitate our health at a given state. That's why we cannot have any food overnight beforehand, so that we do not perturb the glucose level in our body. The health parameters of each of us are unique, depending on the biological state of our body, because our body is a function of the proteins inside it, and the proteins are working to sustain life: they are
metabolizing substrates, converting them to products, and the metabolites are created in our body. There are so many ways that you could quantitate the health parameters of a human being. Right now there are a lot of researchers looking into metabolomics, where you look at metabolites in different disease states, or at the proteomic level, or even at the genomic level. All of these are about the data: you're comparing different population groups, disease state versus normal state. In the disease state maybe there is upregulation or downregulation of some protein, and if there is upregulation or downregulation of a protein, then there will be differences in the metabolites that are formed. It's like a domino effect: to have a disease, maybe you have downregulation of some protein and upregulation of another, and as a result the resulting metabolites will be different. All of these are data that could be used to create a computational model. Okay, so here's the terminology; it might come in handy for your exam as well. These are the terms for the drug and its precursors. When I say exam, I know there's a lot of pen and pencil writing on the paper, but don't worry, it's an open-book exam and you can bring this in. Maybe you're drawing a star: notes, read more on this. So the drug is the ultimate outcome of a drug discovery project, but before we get a drug we have the prototype. The prototype would be the lead compound, and then we have the optimal lead. Before we reach the optimal lead, we have to identify an earlier prototype, which is the hit compound, and the hit compound is derived from your screening assay. It could be high-throughput screening or low-throughput screening: if you have big, sophisticated, automated equipment you could do
high-throughput screening, HTS, but if in your lab you can only do 10 or 20 compounds, then it's low-throughput screening. Then, if you look at the functional groups, which is the perspective of the organic or synthetic chemist, you look at the privileged substructures, and you also look at the fragments. I will show you in the next slide why you have to care about fragments. Why are fragments so important? Because a drug is a function of its fragments; therefore, if you modify the fragments you get a different drug. Very simple. Think of it as an equation: drug equals a function of fragments. It's like Lego blocks: you change a fragment component and you get a new drug. If your drug is toxic, change the toxic fragment to a less toxic one. It sounds easy, but it's difficult because it's multi-objective: you make it more potent and you pay for it with more toxicity; you make it less toxic and the solubility gets bad. So it's a very difficult game to play, and you have to figure out where the sweet spot is. We cannot have it all. Very safe, low toxicity, good solubility: that's the ideal world, but we have to make some sacrifices, for example more potency with toxicity maybe a little bit higher. Okay, so identifying the hits. How do you do that? You can use high-throughput screening: you have several microtiter plates, each well has a different compound with the same protein, and you screen until you find a potential hit compound. Once you have the hit compound, you select it as your prototype to work on and you start to modify it, and when you modify it, that becomes lead optimization. Next slide: lead optimization. Here you have the fragment; it's rather small.
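
The potency, toxicity, and solubility trade-off described above can be pictured as a search for compounds that no other compound beats on every objective at once (a Pareto front). What follows is only a minimal sketch under stated assumptions: the compound names and scores are hypothetical and are not taken from the lecture or its slides, and `Compound`, `dominates`, and `pareto_front` are illustrative helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """Hypothetical compound with made-up scores (illustration only)."""
    name: str
    potency: float     # higher is better
    toxicity: float    # lower is better
    solubility: float  # higher is better


def dominates(a: Compound, b: Compound) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    at_least_as_good = (
        a.potency >= b.potency
        and a.toxicity <= b.toxicity
        and a.solubility >= b.solubility
    )
    strictly_better = (
        a.potency > b.potency
        or a.toxicity < b.toxicity
        or a.solubility > b.solubility
    )
    return at_least_as_good and strictly_better


def pareto_front(compounds: list[Compound]) -> list[Compound]:
    """Keep only the compounds that no other compound dominates."""
    return [c for c in compounds if not any(dominates(o, c) for o in compounds)]


# Hypothetical candidates: none of these values are real measurements.
candidates = [
    Compound("cpd_A", potency=9.0, toxicity=8.0, solubility=3.0),  # potent but toxic
    Compound("cpd_B", potency=6.0, toxicity=2.0, solubility=7.0),  # balanced
    Compound("cpd_C", potency=5.0, toxicity=3.0, solubility=6.0),  # worse than B everywhere
    Compound("cpd_D", potency=3.0, toxicity=1.0, solubility=9.0),  # safe and soluble, weak
]

front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy set, cpd_C drops out because cpd_B is better on all three scores, while the other three each represent a different compromise; choosing among them is the "sweet spot" decision, which the code does not make for you.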
You take this fragment, you add another fragment, and you get a lead compound. It's like Lego: you join two fragments together to get a new drug. Each fragment is small, but when you add them together the molecule becomes bigger. These are just examples; if you want to use them in your answers, feel free, just make sure you put the context properly: combining appropriate fragments together to get a new lead compound, and then optimizing it a bit further to get a desirable drug once it passes the preclinical and clinical trials and FDA approval. When you optimize it, you modify the ADMET properties. What is ADMET? Absorption, distribution, metabolism, excretion, toxicity. And what do you optimize aside from ADMET? You optimize the potency: how well it binds to the target protein. So it's a game of multi-objective optimization, and a potential exam question could be: what is multi-objective optimization of drug properties? Describe, elaborate, give some examples. Okay, fragment-based drug design. This is a very hot field. As I mentioned, you have many fragments, you piece them together, and you get a new compound, and these are the examples. This is like a funnel. If you look at the bottom part, the fragments are small, with a lower number of atoms. This is the atom count: 13 means 13 heavy atoms, that is carbon, nitrogen, oxygen, phosphorus, sulfur, not including hydrogen. So you have 13 of those atoms, combined in different combinations. Between 13 and 17 heavy atoms you get a fragment, between 17 and 22 you get a lead, and between 22 and 30 you get a drug. So you see the molecule grow from a fragment to become a drug: you start out with the fragment and work your way up. At the fragment level you have 10 to the 9th power possibilities, so it's quite limited, but when you start to piece
together the fragments, the possibilities increase from 10 to the 9th power up to a total of 10 to the 63rd power for a drug. So you can see that a small alteration at the fragment level gives rise to big differences at the drug level. Take some time to digest this figure; it's very important. It's kind of like this: you have 20 canonical amino acids. Don't worry about the D-amino acids, just the 20 L-amino acids, and forget about carnitine and the other amino acid derivatives. Twenty of them, and you have five positions: how many possible peptides? Okay, let me make it simpler: a dipeptide, two positions. How many possible peptides can I get? Can I have a volunteer? You have a dipeptide, two positions, and at each position you have twenty possible amino acids. Would you like to answer? 400, very good. So it's 20 times 20, or as an exponent, 20 to the second power. If you have a tripeptide, how many possible peptides can you get? 8,000: 20 times 20 times 20. And you continue like that: for a peptide 8 amino acids in length it's 20 to the 8th power, and for 16 amino acids it's 20 to the 16th power. And it gets very complicated, because the longer the peptide becomes, the more it can adopt secondary structure. So from 20 amino acids you could get millions of possible peptides. You start with only 20 building blocks, and as you build your way up, linking one amino acid after another so the peptide becomes longer and longer, the possibilities increase exponentially. It's the same with compounds, only more complicated. With only 20 amino acids and 20 positions you have 20 to the 20th power, which is already a lot, but imagine you have 1 billion possible fragments and you have
2 fragments together, so it's 1 billion times 1 billion, and if you add 3 fragments together it's exponentially larger. That is challenging even for our existing computational ability, which is why we need to develop more effective evolutionary sampling. In a lot of the analyses that we do, we don't analyze the full data set; we analyze a subset, maybe 10% or 20%, to create a model and use the rest as the external test set, because it's simply too big. Okay, Lipinski's rule of five. Have you ever heard of this? When people develop a new drug, they always talk about Lipinski's rule of five. Let's say a chemist at a conference presents a new drug and says, "Our compound passes Lipinski's rule of five." So the question is, what is it? This makes a good exam question: what is Lipinski's rule of five? The answer is on the slide. At Pfizer there was a researcher called Christopher Lipinski. He collected a large data set of more than 2,000 compounds that are FDA-approved drugs administered orally, not injected drugs. They analyzed those compounds and developed the rule of five, so called because every cutoff is a multiple of five: the molecular weight is less than 500 daltons; the lipophilicity, as log P, is less than five, where log P is the octanol-water partition ratio; hydrogen bond donors are fewer than five; and hydrogen bond acceptors are fewer than ten. They are all multiples of five, so it is called Lipinski's rule of five for orally administered drugs. A drug doesn't have to follow it exactly; a drug approved by the FDA could violate one or two or three of the rules. And then they become very toxic, right? Yes, some very effective antibiotics, second or third generation, are very toxic and violate the whole rule of five, and we try to avoid that. So the rules describe a safe profile. It is based on statistical analysis of FDA-approved drugs that are administered orally: they follow
this general pattern, this general guideline. Some drugs could violate one of the four rules: some could be bigger than 500 daltons, and that's okay as long as the lipophilicity is good; some could be under 500 daltons but with not-so-good lipophilicity, and that's okay as long as the other parameters are within the criteria. Molecular weight should be less than 500 and lipophilicity should be less than five; then there will be no issue with it reaching the target site. Oh, no: to be a drug, generally it would pass, but it doesn't have to pass all of them; it could violate one or two. However, it needs to pass the preclinical and clinical trials on a big population of people. Once it passes that it should be okay, unless there are reports of adverse drug effects, and then some drugs could lose approval. So it's a very long journey: you discover the drug, and then you discover that some people are very allergic or respond pretty fatally to it, and people could ask for the withdrawal of that drug from the market entirely. There are many cases like that. Yes, that's a good point. So, to be a hydrogen bond donor means giving the hydrogen atom to an acceptor. Hydrogen bond donor and acceptor are like partners, and they share a hydrogen bond. What is a hydrogen bond? You have an electronegative atom covalently attached to a hydrogen atom, and that hydrogen atom engages in a hydrogen bond with another electronegative atom. Sounds complicated? The electronegative atom could be oxygen or nitrogen, covalently attached to a hydrogen atom: OH, NH2, NH. In OH, the H will interact with another O. It's like a dipole: one face is negative, one face is positive. So you have an oxygen, and the oxygen is attached to... okay, better here
for the camera. Okay, so you have an O or N, and it's covalently attached to H, and that H interacts with another O or another N. That, right here, is the hydrogen bond interaction. So the question is, why not five? This is not based on the opinion of one person; it's based on analyzing the acceptable range across FDA-approved drugs, and from their exploratory data analysis they discovered that the hydrogen bond donors and acceptors should be within a threshold level. So the fundamental point is the threshold, five or ten: the count should be no more than the cutoff, but it could be less. If it's more, what happens? There could be toxic effects, it may not be cleared properly, it may accumulate in the kidney; there are a lot of such adverse effects. So it's not the opinion of Christopher Lipinski, he's not the judge; he analyzed the data, and based on the statistics, probably something like a 90% or 95% level, the cutoff should be at this position. The original paper, I believe, was published in 1997. And this figure was created by one of our PhD students, who has already graduated and now works at [inaudible] and also collaborates with [inaudible]; maybe you've heard of him. Okay, back to the PowerPoint. We also have the lead-like rule of three. This rule is for the lead compound: it should be less than 300 daltons. Make a note of the previous slide here: the fragment is approximately 300 daltons and the drug should be less than 500 daltons, so the lead should be right here: fragment, lead, drug. I did not create this figure, so maybe the lead could be shown in a different color, maybe pink. So you have the fragment here, you have the lead compound, and you have the drug. No more than 300 daltons is the recommended size for a fragment; if it's bigger, then it's probably already a drug, right?
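
The rule-of-five cutoffs described above (molecular weight 500, log P 5, five hydrogen bond donors, ten hydrogen bond acceptors) and the 300-dalton lead-size guideline can be written as a simple check. This is only a minimal sketch under stated assumptions: the descriptor values are hypothetical and typed in by hand rather than computed from real structures, the names `Descriptors`, `rule_of_five_violations`, `passes_rule_of_five`, and `is_lead_sized` are illustrative, and tolerating one violation by default is an assumption that reflects the lecture's remark that a drug may break a rule and still be acceptable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Hand-entered, hypothetical descriptor values for one compound."""
    mol_weight: float      # daltons
    log_p: float           # octanol-water partition, log scale
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> list[str]:
    """Return the names of the rule-of-five cutoffs that the compound exceeds."""
    checks = {
        "molecular weight > 500": d.mol_weight > 500,
        "log P > 5": d.log_p > 5,
        "H-bond donors > 5": d.h_bond_donors > 5,
        "H-bond acceptors > 10": d.h_bond_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumption: up to max_violations exceeded cutoffs are tolerated."""
    return len(rule_of_five_violations(d)) <= max_violations


def is_lead_sized(d: Descriptors, limit: float = 300.0) -> bool:
    """Lead-size guideline from the lecture: stay under about 300 daltons."""
    return d.mol_weight < limit


# Hypothetical compounds, not real measurements.
lead = Descriptors(mol_weight=280.3, log_p=1.8, h_bond_donors=1, h_bond_acceptors=3)
slightly_heavy = Descriptors(mol_weight=501.0, log_p=3.0, h_bond_donors=1, h_bond_acceptors=6)
large = Descriptors(mol_weight=820.0, log_p=6.3, h_bond_donors=7, h_bond_acceptors=14)

for label, d in [("lead", lead), ("slightly_heavy", slightly_heavy), ("large", large)]:
    print(label, rule_of_five_violations(d), passes_rule_of_five(d), is_lead_sized(d))
```

The 501-dalton compound is flagged for one violation but still passes under the default tolerance, which is the "rough cutoff" idea; in practice the descriptor values would come from a cheminformatics toolkit rather than being typed in.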
No more than 300. Why? Because if you have a lead that is bigger than 300 daltons and you want to do lead optimization, you have to attach additional fragments to the lead, so it becomes bigger, and then it becomes bigger than 500. That's why the lead should be as small as possible, so that it can accommodate lead optimization where you attach a new fragment to it. From 300, if you attach another 300 it becomes 600, which is already above the limit. So ideally you want it to be less than 300, say 250 or 280, so that after attaching another fragment it is maybe still less than 500. However, it doesn't have to be exactly 500, because 500 is a statistical cutoff. What if it's 501? It's still okay; it's just a rough cutoff, so 501 daltons does not mean it's a bad compound. Roughly 500 is like an imaginary threshold where you can say that 90% of the drugs that pass are good or safe, but if a compound exceeds it a little bit, that's okay. A good drug could violate one or even two rules. We even did this analysis in a paper that we published; if you want to read about it, search for RSC Advances and then my last name, and you will see four papers there; look for the one on estrogen receptor inhibition, where we do a rule-of-five analysis. Rule of three: here they propose other parameters, like fewer than three hydrogen bond donors, fewer than three hydrogen bond acceptors, and fewer than three rotatable bonds. If you have two atoms connected via a single bond and they can rotate, it's called a rotatable bond. If you have a double bond, it's rigid and cannot rotate. So a covalent single bond can be rotated, and the rotation is described by a dihedral angle. Look at amino acids: you have two amino acids like this. Can I have two pieces of paper? Thank you. I guess it's too big to show. This is amino acid one, this is amino acid two, and the rotatable bond is right here; it's the dihedral angle. So this is the rotatable bond, right
here in the middle, so it can rotate; we call that the dihedral angle. If you google dihedral angle peptide, you will see two planes, and they call the angles phi and psi. Psi, like Gangnam Style? No, psi. Phi and psi are rotatable angles. So, chemical space. As I mentioned already, you are able to see the diversity of the compounds, and with diversity comes the concept of likeness and difference. If compounds are located closer together they are similar, and if they are located further apart they are different. Each dot here represents one compound, or one peptide, as you can see faintly here. So this is a chemical universe: peptides are shown in red, then compounds that pass the rule of three and compounds that pass the rule of five, and oligosaccharides are shown here, DNAs are shown here, graphene right here, and diamondoids, which we don't really hear a lot about. So that's the chemical universe, including the macromolecules. You see that the rule-of-three compounds are a subset of the rule-of-five compounds, so the diversity of the rule-of-three set will be less than that of the rule-of-five set. In the previous slide you have the chemical space; here we have the biological space. They're similar in concept: two compounds that look alike are located close together, and two compounds that are different are located further apart. This is a graphical visualization, but if you visualize proteins you get the same concept. In a phylogenetic tree, proteins located close together in the branching share similar sequences, while those located further apart have strikingly different sequences. And if you visualize them in a three-dimensional way, proteins located closer to one another are similar, and others could be located very far away. So maybe you have proteins that are
globular in shape; they will be located closer together. Proteins that have a disordered fold will look different from proteins that have an ordered, globular fold. So the protein fold dictates the structure, and the structure of the protein dictates the function. The protein structure comes from the alpha helices, the beta sheets, and also the loops; how they are arranged gives rise to the actual fold of the protein, the fold gives rise to the function, and the fold is dictated by the amino acid sequence. So the amino acids that we have in proteins are not random. To be a member of the cytochrome P450 family, a protein should have a particular motif. If you're studying bioinformatics, you see the term motif a lot. What is a motif? It's a common sequence found in a particular family or subfamily. Let's say that you want a peptide that is capable of binding metal ions; then you have a metal-binding motif, for example cysteine, then any three amino acids, then cysteine. You call it CXXXC: C is cysteine, XXX is any three amino acids, and then another cysteine. Why do you need three spacers here? Because it will adopt a helical fold. Let me get back to the board. Let me try again: this is the alpha helix, and you have the amino acids here. This could be the SH, the sulfhydryl group with the sulfur atom from the cysteine, and this could be another cysteine, with three amino acids in between. Number one is here, number two, number three, and then number four is right here, and then you repeat again. So it could be CXXXC and then another XXXC. That's a motif: it's a pattern, and the pattern is cysteine followed by any three amino acids and
another cysteine, and it can repeat as many times as you need. If you read a paper, they might write it as CXXX in brackets with a number, two, three, four, five; that's the number of repeats of the motif. There are so many motifs out there, like the leucine zipper and the zinc finger. So that is the motif for a metal-binding peptide. And there are so many protein families out there, like the aminergic and lipophilic GPCRs (G-protein-coupled receptors), kinases, [inaudible], cytochrome P450s, and nuclear receptors. Fragment space: I've already mentioned it a lot of times, so let me skip ahead to a very beautiful visualization. Each one of these is a fragment, and if you take this fragment and combine it with this fragment, you get this compound. Beautiful, right? They call this the structural classification of natural products. You have this scaffold, these are all heterocycles, and you combine one with another benzene ring and it becomes this one. So it's like a phylogenetic tree of chemical compounds, of natural compounds. And to give you the same information using a different figure, here again are the fragment, lead, and drug spaces. You see that the fragment space is small, right here: the molecular weight is approximately less than 200, at a radius of about 200. Then you have the lead, which is roughly less than 300, and then the drug, roughly less than 500. But there are some that fall outside the general rule. This figure shows only the lipophilicity, the C log P, and the molecular weight; you also have the hydrogen bond donors and acceptors, but here it shows only those two, the log P and the molecular weight. So I've shown you the fragment space, lead space, and drug space in three different figures; please have a look, compare, and contrast. [Applause] Ultimately you want a drug that is within this area, log P less than 5 and molecular weight less than 500. But before you get the compound here, you need a lead: you have to come
up with a lead compound and then before you get a lead compound you come up with the fragment okay so the fragment if you want to read the literature in the medicinal chemistry area to find some idea what kind of compound you should start with search for privileged substructure so that's a fancy term for fragments that have therapeutic value so they're not normal fragments okay they are therapeutic fragment and they call that privileged that they are privileged okay they are blessed with medicinal property so they call it privileged substructure so these substructure are mostly found in drugs right maybe they have a review article about pyridine as a privileged substructure for anti-cancer drug development so one person could theoretically write a review article about that or another person could write a review article about coumarin as a privileged substructure or curcumin as a privileged substructure for some drug developments okay let's hop on to polypharmacology so before I mentioned briefly right that if we develop a drug to bind to target protein A but if it happens to bind to target protein B off-target binding it could cause side effects but it could also be repurposed for a new drug indication right so you call this drug repositioning however this is coming from polypharmacology concept polypharmacology meaning compound could bind to poly many protein okay one compound many protein and the same concept could work in reverse let me show you staurosporine could bind to multiple kinase target so this compound can bind to multiple member of the kinase family okay so therefore we will have a paradigm shift shift from one drug one target to one drug multiple target and there's even a field where we want to develop a multi-target drug meaning we want to have one drug that could bind to multiple target so how do you do that let's say that you want to inhibit protein A you also want to inhibit protein B at the same time so how can you do that with one
compound you want to inhibit A and B you take a fragment of A a fragment that you know can inhibit A and you take a fragment that you know can inhibit B and then what you do you fuse it together you link it together and then you get a compound that have component that can bind to A component that can bind to B and the tricky part is the linker how can you make it such that it's not too big can parts of A and B do they form a common component maybe it has the same thing and it could pretty much overlap so you save a couple of atoms here so it doesn't become too big and then you synthesize the two fragments linked together let's say that okay I have two hands both hands quite similar let's say it shares a thumb so if it's superimposed at the thumb so I conserve the space here but if it's not superimposable then I need a linker to link A and B then it becomes pretty big but if it can superimpose with a thumb and a finger then the molecule becomes much smaller exposing only the fingers that are able to bind to A and the fingers that are able to bind to B right and I kind of merge it together so I have only the three finger that can okay only these are like the epitope that can bind to target protein A and this three fingers can bind to B epitope is kind of like this if you wanna can you pick up this box can you show to us how many are you using your whole hand no a couple of fingers so that's the epitope right the point of contact of your finger the tip of your finger I mean you can even use two finger that's pretty heavy right so it means that these two point of interaction must be strong in order to pick it up or you can rely on multiple weak binding multiple weak interaction same concept you could rely on multiple weak van der Waals interaction or you could rely on very strong electrostatic or even covalent interaction right so that's the same concept that you apply for drug discovery and the thing is if you could find a way to
look at the structure of protein A and protein B and see how are they similar you could rationally combine two fragments together to get a small drug that could bind to A and B but without being too big so that's an art okay and that requires a lot of computational eyeballing the data set well and so in order to do drug repositioning you will also need to use systems pharmacology right this department has a laboratory of systems pharmacology where you have a network of proteins where you have to identify which proteins are involved in which pathway and how can you strategically target two proteins at the same time or even more right however it's a double-edged sword it's not always good if a compound could bind to multiple kinase it would definitely have side effects right so you have to figure out if you want to inhibit this protein but this drug also inhibit the other kinase what will happen what is the subsequent effect of that are you okay with that if you inhibit one protein you cure the disease but you might cause another abnormality are you okay with that so it's a trade-off okay so here as I mentioned there's a lot of network biology involved in doing drug repositioning okay it's just making sense of the data a and B cliques should come together to develop a drug okay A and B looks alike therefore a drug that binds to A can bind to B right B and A looks alike therefore a drug that binds A could also bind B right so we could go both way compounds that look alike protein that looks alike and use the molecular similarity for doing drug repositioning okay so I think the take-home message here is all about how you can use informatics in order to draw relationships between the chemical structure of the compound so structure will cause function cause activity right and how you can use the computer to correlate find a relationship between to be a good drug it must have the structure in order to give this activity to be a bad drug it will have this
characteristic this features in order to be a bad drug so such rule will guide you in designing a drug where you have to optimize for multiple objective right a drug may have good potency to the target protein but might have poor toxicity then you have to kind of reduce the potency change the functional group and increase the safety profile maybe reduce toxicity better absorption better clearance from the body okay so actually I mean there's a lot of okay there's some more we have some time okay structure activity relationship I mentioned this a couple of times right so it's about finding the relationship between chemical structure and activities so how I mean what does the chemical structure look like okay it looks like this it's rather small alright but if you open up any organic chemistry book they are just a bunch of atoms connected to one another okay and the atoms connected to one another they are meaningful okay they're not random a carbon connected to another carbon to form a ring the benzene ring and there's electron delocalization and then that benzene ring connected to a nitrogen or oxygen or sulphur group and has meaning because they will interact with the residue inside the binding pocket okay so this drug will be here and this is the correlation between the actual activity and the prediction okay so ideally it should fall on the line okay so some might have an actual value of six and the prediction model will predict it to have six point five so the prediction will be a bit optimistic but some prediction will be a bit pessimistic okay so like if we have a drug that should have an activity of say 70% but then we predict that it will have activity of 100% then we are overly optimistic of that compound so that's the variation okay so there is variability there's an error component when we do the prediction our prediction are not 100% accurate okay but they are having some degree of error maybe they are 80% accurate and there's 20%
error but the question is how confident are we in the twenty percent error or eighty percent accurate okay it's hard to say but we can output those parameter when we build the model output the probability for each prediction we can do that okay so I think in the exam we have something about this QSAR and you have to explain what is QSAR how can it be used okay so what's QSAR the definition is here right quantitative structure-activity relationship how can it be used it could be used to observe relationship between the chemical structure and the activity right a bioactivity it could be used to help you make sense of your chemical library right you have a library of compound and you want to make sense what functional group are good for activity what functional group are bad for activity and based on such knowledge you will use it as a guideline to develop your own drug or use it as a guideline as a threshold to filter the good drug from the bad drug right so think of it as like a filtering threshold hmm so I'm sure that a lot of the human disease we have cut-offs right we have threshold same thing here to be a good activity drug we have some threshold which we could elucidate from the QSAR model so the model can tell you similar to the Lipinski rule of five that in order to be a good drug you should have an electrostatic charge greater than five it should have lipophilicity less than ten I'm making this up so in order to be 5 or in order to be 10 depends on the context of the protein that you are designing a drug to inhibit some protein that have binding cavity involving a lot of electrostatic interaction then you need your compound to have electrostatic interaction with the protein but some protein they have a lot of lipophilic interaction you need compounds that have favorable hydrophobic lipophilic interaction but if you have electrostatic it won't work to interact with the protein if the protein relies on lipophilic interaction you have
electrostatic ligand it won't interact with your protein okay so the context is different for different protein so you might have a question what is a good drug it depends on the protein target protein target depends on the binding cavity is it electrostatic or is it hydrophobic if it's hydrophobic it require complementary ligand that will have hydrophobic feature if the protein is electrostatic it require complementary electrostatic interaction so if we ask you how do you define or what is the ideal feature of a drug of course generally you look to the rule of five right that will tell you the optimal ADMET optimal pharmacokinetic property but in order to be potent you depend on the unique feature of the protein hydrophobic electrostatic okay so in order to know that you have to visualize the protein structure okay there are tools like PyMOL P Y M O L you use that to visualize rotate the molecule the protein structure okay so that's in a nutshell I already talked about that in this slide already and to show you graphically right there are three major component of a quantitative structure-activity relationship model three component number one what is the biological activity okay it could be a IC50 if you want to do inhibition it will be IC50 if you want to look at the activation it could be a EC50 right inhibition could also be a Ki those are your bioactivity they are your if you think of it in terms of statistic they are your dependent variables the IC50 the MIC the Ki they are your dependent variable what do you want to predict if you compare to human health we want to predict whether a person will develop cancer or not okay so that's the dependent variable what do you use to build a prediction model are the independent variables what are the independent variables they are the variable explaining about the person does the person smoke does the person have any family history of someone having cancer what is the blood
profile of the person what is the lipid profile of the person if you analogously compare that to a compound what is the molecular weight of the compound does it contain any electrostatic functional group does it contain any lipophilic functional group does it contain any halogen atom does it contain any hydroxy functional group how many benzene ring does it have similarly they are the quantitative feature of the compound shown here each column are the feature of the compound they are the independent variable right so this two compound they are different because they have different variable values and we quantitate them according to the charge energy dipole moment you don't have to know it you can think of it as X1 X2 X3 X4 X5 X6 and we want to predict Y so in a simple linear equation MLR multiple linear regression you have y equals to mx plus b right if you have an equation like y equals to 5x plus 5 if your x is one right 5x means 5 times 1 you get 5 plus 5 you get 10 so y equals to 10 if you have an equation IC50 equal to 5 times the energy value plus the value of five if your energy value is 1000 1000 times five is five thousand five thousand plus five is five thousand and five okay and you add it up and you get the IC50 value okay so it's simple equation like that so therefore we can see that IC50 equals to function of X so X will be your feature okay so if you want a potent drug you want a favorable pharmacokinetic drug what do you have to do you have to modify the chemical structure so that the feature will be different you want to increase electrostatic you have to add electrostatic atoms to it you want to make it more lipophilic you add more hydrophobic atoms to it hydrophobic functional group so in a nutshell the concept is structure-activity relationship right you want to get the desirable activity you have to modify the structure in order to get your desirable activity so the
field of QSAR is growing very rapidly as seen here so if you were looking at back in 1974 not so many papers so 1974 was a year where most QSAR started to grow right it was around the time of Corwin Hansch he's a professor at Pomona College and he developed he coined the term QSAR actually the first paper I think was in 1874 the researcher with the surname of Cros he published his PhD thesis in France on the relationship between the structure of alcohol and the I think it's toxicity and then after that in 1970s Corwin Hansch partnered with a plant biologist in order to correlate the precursors of plant hormone with the auxin bioactivity okay and then over time people started to expand the concept to correlate the structure with other biological activity right and the biological activity could be inhibition activation and others okay even in the field of chemical sciences you could try to predict the melting point boiling point of the compound based on the structure so this is a typical workflow of a QSAR paper of a QSAR project so if you zoom in you will see that you have to compile a data set from a database or from the literature you get your initial data set and then you have to clean your data right you do a series of cleaning of the data you remove the redundant overlapping molecule take it out any missing value fill it in if it's not available take it out and then you get a final data set that you have already cleaned so all of this step is called data pre-processing or you can call it data curation so you make the quality of the data higher you increase the quality by taking away the missing values and you're annotating the data set making it more high value and then from there you will do a series of preparation in order to create your model split the data set once you have created the model then you get the resulting features which feature are important which feature are less important and then in
order to figure out whether your QSAR model is reliable or not you want to reshuffle your model and build the model again okay so this is too much detail for an introductory lecture so we're gonna skip this out and so I think that the importance lies at the feature importance step here right here so I mentioned about deriving a set of rule in order to design a good and bad drug right so that set of rule comes from this step feature interpretation when you create a multiple linear regression equation right you get an equation like y equals to 5x plus 5 the 5 in 5x is the regression coefficient they are the weight value they tell you the importance of that variable so imagine you have y equals to 5x plus 5 if you change 5 to become 0.5 does it influence the resulting y more or less this chart here y equals to 5x plus 5 or y equals to 0.5x plus 5 if x equals to 1 in this scenario x equals to 1 5 times 1 is 5 plus 5 you get 10 in this scenario x is 1 right 0.5 times 1 you get 0.5 plus five you get five point five so 0.5x will affect the resulting y value to a lesser extent therefore the weight of 5x is more and the weight of 0.5x is less so this feature is important whereas in another model this feature is less important okay so this is the regression coefficient to tell you the relative weight of that variable so whether the variable is important or not depends on the magnitude of this value if it's 0.005 it barely modifies the base value of five right so same concept when you have a more sophisticated equation like IC50 equals to 5 times the energy plus 0.01 times the solubility okay and then whenever you know the energy value you plug it in you know the log P value you plug it in and then you get the IC50 value okay so QSAR right here is very simple we're going back to elementary or maybe high school mathematics right y equals to mx plus b right but we're using it in the context of
biology so the impact is they're using this simple approach to make sense of big data okay that's the important part so this is the equation and looking at the value of the regression coefficient they tell you the feature importance yeah right they are the feature importance if the value is high this feature is important if the value is low this feature is less important so here in this equation energy is more important than log P because it influences the IC50 more so if you want to have a quick win what do you do you modify the energy and in order to do that what do you do you modify the atoms in the functional group of the compound okay so what are the application of QSAR you can predict the activity and in the context of material science you can predict the property like melting point freezing point boiling point you could design novel compounds that could be electricity generating molecules OLED O L E D you know the one in the TV monitors OLED you could use QSAR to develop a new OLED molecule or a novel fluorescent compound you want to develop a new colored molecule make it brighter make it red make it blue make it yellow modify the structure right so the color is a function of the energy you want to have it blue you have to have a molecule with more energy you want to have it red color you need to have the energy be lesser so it becomes red shifted okay so why do I know that because we have studied that before in modeling the color of the green fluorescent protein so we have tried using QSAR for a wide variety of bioactivity prediction chemical activity chemical property and biological activities so here are the list of biological activity that you could apply QSAR to and the list of chemical property that you could apply QSAR to what is QSAR compare contrast with proteochemometric sound like a simple question right but it lies in how do you explain those two concept so let me explain it like this figure QSAR that we have
already talked about quantitative structure-activity relationship is trying to correlate the structure of many compounds which bind against a single target protein let's say that I have aromatase enzyme I'm testing a hundred compound against one protein which is the aromatase enzyme and this is QSAR I will have a hundred compound I will compute the features of the hundred compound I will create a model out of that and this is QSAR okay a hundred compound tested against whether it inhibit one protein in proteochemometric this is kind of like meta-analysis I could create a proteochemometric model so it's using informatics in the context of protein and chemical data so what do you have you have what is similar to the QSAR you have a compound library you have many compound what is the difference the number of target protein here we have one here we have many so this is kind of like meta-analysis right if you're publishing one paper you're testing one protein you publish one paper right you test ten compound against one protein you publish a paper let's say that next year you use the same compound you test against a different protein so now you have this and this and then you could combine the data together like in a meta-analysis where you could combine the results together and then you make a proteochemometric model and do you wonder why you want to build a proteochemometric model you could do drug repurposing okay let's say that in 2018 you have ten compounds or a hundred compound you test against protein A and then next year 2019 you already have protein A from last year then in 2019 you test with protein B and in the year 2020 you test with protein C now you have three proteins you combine the data together you have the same compound with three different protein for compound A then you have the IC50 when it binds with protein A you have the IC50 when
it binds to protein B and C so you have this data and this data together you create a proteochemometric model so the beautiful thing is you can study the selectivity and the specificity of the compound to another relevant protein yes so imagine that the multiple target protein could be the members of the kinase family you have a hundred compounds here and you have a hundred protein family members so you have a data matrix of 100 compound by 100 protein so what's the possible combination theoretically 100 multiplied 100 you get 10,000 possibilities meaning for each of the hundred compound you would test against all of the hundred protein and you do the same thing for the second compound do the same thing for the third and until the hundredth compound so theoretically you will test 10,000 times you have 10,000 bioactivity points because you have 10,000 pairs 100 protein 100 compound 10,000 pairs however do you need to test all 10,000 pairs maybe not maybe you ran out of protein 50 and maybe you ran out of compound hundred you synthesized or you purchase a fixed amount you ran out nothing is left do you have to abandon your project no because you could fill in the blank if you have missing protein you have missing compound you can fill in the blank you can even combine data from other papers you don't have to do it yourself but make sure that they are using similar assay as yourself so instead of relying only on your data in-house lab data you could combine your data with other people's data you have one big data and the beauty of proteochemometric you combine it together you build one unified model so think of it as like a meta-analysis of the QSAR so they call it proteochemometric okay so that's the exam question okay compare contrast QSAR and proteochemometric okay so let me reiterate this QSAR you're looking at the structure activity relationship between a set of compound against a single target protein in proteochemometric
you're investigating the structure activity relationship of a set of compound against a set of proteins okay that's the definition how can you use QSAR and proteochemometric you can use the identified feature from the coefficient the regression coefficient as a guideline to decide which feature are important for good activity and which feature are required for bad activity and you set that as a guideline to design your drug how can you apply proteochemometric model you can use it to understand the specificity and the selectivity of the compound against different target protein right selectivity and the specificity of the compound against different kind of protein when it's interacting and how you can make use of that you could use it to do drug repositioning right so the core of drug repositioning is relying on the similarity between the protein and the similarity between the compound so proteins that already have some drug that are capable of interacting with that protein we could also harness this similar data to inhibit another protein that is similar to it okay so therefore we can solve the problem of orphan receptor okay so orphan receptor is a protein that doesn't have a known FDA-approved drug for it so they call it orphan no drug against it but because we use the concept of molecular similarity by doing some cheminformatics analysis you could draw a link you draw an inferential link that compounds that can inhibit A and because protein B is similar to protein A therefore this compound could also interact with protein B because A and B are alike and if compound inhibit A then logically it could inhibit B as well okay so you could use that for drug repositioning okay I think that's the core of this talk so let me do a summary right okay summary proteochemometric okay this technology was developed by Maris Lapinsh and Jarl Wikberg of Uppsala back in 2001 and in about 2012 we had a collaboration with Professor Jarl Wikberg and then
we together we apply for a grant from the Swedish Research Council and then we got the grant for a three year project we did several joint projects together to expand proteochemometric to other system like to apply proteochemometric to look at the green fluorescent protein diversity predict the color of many yellow blue red fluorescent protein we also expand that to investigate the effect of chemical library against cytochrome P450 family 5 isoform from human and this approach has been useful for drug repositioning okay which we have not yet done but it has high potential which could potentially be used and that could be used for personalized medicine and nowadays precision medicine right in 2016 in January together with Uppsala University we invited eminent scientists the professor who developed the KEGG database also came the project leader of ChEMBL database also came Anne Hersey and several other leading scientists like Andreas Bender who is working a lot in the field of chemogenomic also attended so it was roughly about 200 researchers from all over the world coming to Thailand in Pattaya where we organized the conference okay so in conclusion there are a lot of doubts about QSAR right that okay this number this prediction should we believe in it well here are the shortcoming right the high dimensionality of the input space too many features too many variables what is the interpretability can we interpret the meaning from the model and what happens when there are outliers outlier meaning compounds that don't really fit the rest of the group so sometimes you have a prediction that doesn't fit a prediction that has high error because it is an outlier so an outlier meaning that the compound is not in the domain that is suitable to be predicted so it shouldn't be predicted so before building the model you should assess the applicability domain which is an advanced topic which we're not gonna cover in this
lecture so just imagine that you have a data that covers people from Nakhon Pathom and then you want to generalize it to people from other places like from Los Angeles or from Stockholm which you cannot right because people in different regions of the world have different contexts different health parameters different profiles of the people involved right so in order to make a universally accessible prediction model then you need to recruit people from all over the world to create a unified model but if your model is locally based then it will only be used for people in that area same thing with applicability domain if your prediction model is based only on curcumin you cannot use it to predict the activity of other chemical structures that are not curcumin so if you want to do that then you have to increase your data set okay so even though there are a lot of flaws and issues to be addressed QSAR is still a force to be reckoned with it has a lot of potential although there are weaknesses but if we are able to be aware of those weaknesses and being aware of that try to minimize the potential flaws then we can fully make use of it and recently if you follow the news there is this paper published that mentions that using deep learning a group of researcher was able to discover totally new compound in 46 days using AI deep learning to discover a new drug from the computer and synthesize the drug okay everything in 46 days 21 day was taken up by finding the compound and another 20 days for synthesizing and assay the bioactivity so the whole thing took less than two months okay from using AI deep learning so it has a lot of potential and imagine what you can do in the future as we improve the technology will evolve so myself I've been doing this research for more than 10 years now I think it's 15 years 16 years and so QSAR it's a very exciting field and it is continuously evolving and we are also a part of that revolution and
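Two ideas from the lecture can be sketched in a few lines of Python: reading feature importance off the regression coefficients of a linear QSAR equation, and checking the applicability domain before trusting a prediction. This is only an illustrative sketch, not the speaker's software. The function names, the fragment labels used as stand-in structural features, and the 0.5 similarity threshold are all assumptions made for the example; a real project would use fingerprints from a cheminformatics toolkit and a threshold justified on its own data.

```python
# Illustrative sketch only. All names, coefficients, fragment labels and the
# 0.5 threshold are hypothetical and are not taken from the lecture's slides.

def predict_activity(features, coefficients, intercept):
    """Multiple linear regression: y = intercept + sum(coefficient_i * x_i)."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())


def rank_features(coefficients):
    """Order features by absolute coefficient, largest first.

    This mirrors the lecture's reading of coefficients as weights; it is only
    meaningful when the features are on comparable scales.
    """
    return sorted(coefficients, key=lambda name: abs(coefficients[name]),
                  reverse=True)


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def in_domain(query, training_set, threshold=0.5):
    """Return (flag, best_similarity) for a query compound.

    The query counts as inside the applicability domain if it is at least
    `threshold` similar to some training compound (threshold is an assumption).
    """
    best = max(tanimoto(query, train) for train in training_set)
    return best >= threshold, best


# The lecture's toy equations: y = 5x + 5 and y = 0.5x + 5, both with x = 1.
Y_STRONG = predict_activity({"x": 1}, {"x": 5}, 5)
Y_WEAK = predict_activity({"x": 1}, {"x": 0.5}, 5)

# IC50 = 5 * energy + 0.01 * logP, as in the lecture's second equation.
RANKING = rank_features({"energy": 5.0, "logP": 0.01})

# Hypothetical fragment sets standing in for a curcumin-like training set.
TRAINING = [
    {"phenol", "methoxy", "enone", "diketone"},
    {"phenol", "methoxy", "enone"},
]
DERIVATIVE = {"phenol", "enone", "diketone"}   # resembles the training set
COUMARIN_LIKE = {"benzene", "lactone"}         # shares no fragment with it

DERIVATIVE_CHECK = in_domain(DERIVATIVE, TRAINING)
COUMARIN_CHECK = in_domain(COUMARIN_LIKE, TRAINING)
```

With these made-up inputs the derivative passes the similarity check and the coumarin-like query does not, which is the point at which a tool could warn the user that a prediction falls outside the training domain.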
so we are expanding to multi-target QSAR proteochemometric okay so in order to take advantage of the explosion of the omics data okay so that pretty much wraps the core content of this presentation and if you have your spare time feel free to look around the internet google for some of these software that we have developed from our group BioCurator osFP HemoPred CryoProtect and here are some recommendation if you want to get started in computational drug discovery here are what you need you just need a laptop you just need some free software and make some use of R or Python as the basis for your starting up and a lot of the data we already provide on our GitHub so you can just download it and give it a try so computational drug discovery a lot of them are relying on open source free software okay and all of this would not have been possible without this wonderful group okay so thank you for your attention okay so if you have any questions please let me know okay right right okay that's a very good question so I think that's one of the common question is how reliable is the model right how confident can you be in the model so there is a field called assessing the probability the confidence of QSAR model they use this thing called conformal prediction so the thing is when you're making a prediction for each prediction what is the probability of it being correct so we can output that alongside the prediction so if you get the IC50 we also get the probability that this IC50 has a probability of 0.99 meaning that we're 99 percent confident that the prediction will be accurate however some compound will have a probability that's maybe lower sixty percent let's say that we predict IC50 has a value of this number and the probability is sixty percent so therefore we have low confidence in this prediction and the confidence depends on the applicability domain meaning that the query compound that you want to predict
how similar are they to the training compound what are the training compound compound that are used to create the data that are used to make the prediction model so let's say you have a set of curcumin and your query compound is a coumarin it's not curcumin okay it's totally different and the probability will be very low but if you have a curcumin derivative which looks like the training set data then the probability let's say it's 80% or 85% depending on the similarity of your query compound the one that you want to predict if it is similar to the training the probability will be higher and the more confident you can be in the prediction so you have to do the applicability domain that's very important and we're planning to adopt that we're planning to have a two-step QSAR model prediction first step compare the compound with the training compound how similar are they if they're too different then tell the user this compound is not workable with our data set so we can say like this compound is outside the domain of our prediction or because it is outside the domain of our prediction the probability will be lower therefore use the prediction at your own risk okay so it's like kind of like a disclaimer warning the user okay so you can tell that based on the likeness of your query compound with the training data right right thank you it has to go hand in hand so you have to experiment and you have the data so the thing is in order to create the data in order to create the prediction model we rely on the experiment to generate the initial data once we have the initial data we use the computer model to fill in the blank and the thing is once we identify potential hits we would test it so then we need to go back to use the experiment again and also using the computer is kind of data-driven meaning that it will help you to save time and effort in order to preliminarily assess the feasibility of using the compound and it's computationally it
says that this compound will might work where eighty percent sure that's gonna work then we could try it but if the computer say okay this compound won't work then we could skip it but there there is some chance that the model could have some false positive or false negative so maybe the computer would say skip it but in reality it might work but but that chance is much smaller but there's also possibility other person it can be any question I mean as I mentioned in the early slide so I need to make this field more accessible to general public so that's why I created the YouTube channel so I mean if you would like to learn more then you can go ahead to the YouTube channel it's called beta professor you could enter that URL or you can go directly to YouTube search for data professor and you'll likely find it if you find it useful I mean we're not really smart more content so we're gonna have like tutorial step by step how these all are these Python and it's week up so it will be most programming or even DUI point-and-click data might be something to analyze the patient so it's purely educational and is for the love of science for the little bit of science so we've been publishing papers but the thing is the fun is kind of subsiding but also the fun is also in educating so trying to make it more accessible to why population so that's why we want to make good music and that's why we brought our cat here I think it's a very exciting field signal it's everywhere you have the Internet of Things right you have your mobile phone even there's an application called ifttt it's better than that theoretically if you could program your phone to be able to recognize some events like if the GPS location changes add that to your Google sheets if you take a new photo automatically upload it to Dropbox automatically send it via line chat to your friend if if it's going to rain tomorrow email you that is going to rain tomorrow so that you can prepare to bring your umbrella right so 
you could have a trigger like a threshold yeah if the precipitation rate is gonna rain more than 60% email you give you a reminder bring your umbrella you take a new photo email to your friend okay so you could use this to do so many things a lot of threshold a lot of trigger yeah I wonder maybe in the future if we can have something like if the cell density reaches this level automatically email you so that you can awake you from your sleep and run rush to the lab and continue with your experiment right if the cell density reaches a certain level right maybe in the future that will happen yeah we can only imagine that right yeah so that the PhD thesis will finish much quicker so everything is about data and data driven all the transactions that we make are the purchase that we make at what time what objects did we buy together right Association analysis we buy apples and oranges we buy peanut butter and bread and they have a promotion they put it closer together so that we can buy this and we also buy that right or maybe when you do shopping you you will buy one object and then they say that ninety ninety-five percent of people buying this object is also buying this object so they're monetizing on that fact based on the Association that people buying this will buy this so once you buy this they say do you want this to like when you go to 7-eleven right the athlete can Antibes start up Ajo right if you buy this do you want this too and then you might as well say okay sure right maybe you buy a laptop and then the store say do you want a mouse with that do you want an adapter and then you'll be like okay sure you want to back along with that let's just have a bag and then you'll be like okay so the benefit will be demonetised right the store will make more money so everything that's about data driven right even refrigerator are becoming more smart right if it run out of egg or milk it will automatically have a sensor till the top supermarket to automatically send 
the milk and egg to your front door so everything is problematic so in the in this thing called Internet of Things everything is connected by the internet so your microwave your television everything right now you could send videos from your phone to the TV right you have and it's on a go you have Google home right you just ask Alexa what's the weather like today should I bring an umbrella tomorrow Alexa can you open this YouTube music on YouTube so everything will be voice-activated right well maybe it will learn from you so every time you go home it'll open the music that you like right it alerts from your lifestyle everything will be kind of like coming out of the sci-fi movie right yeah imagine how you could make use of that for your research yeah maybe your ms mass spec machine will learn from your previous experiment and propose to you why don't you try this compound I don't know maybe AI will be that smart in the future right it will help you to design your own experiment and then you'll be like oh okay thank you mass spec and then you do the experiment to verify that I mean possibility it's pretty much endless thank you for your attention
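The applicability-domain check described in the Q&A (compare the query compound with the training compounds first, then warn the user if it is too different) could be sketched roughly as below. This is only an illustration under stated assumptions, not the speaker's implementation: the talk does not say which similarity measure or cutoff is used, so the Tanimoto coefficient, the `threshold=0.5` cutoff, the representation of each compound as a set of integer fingerprint bits, and all function names here are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple

# Assumption: each compound is represented by the set of "on" bits of some
# structural fingerprint. How the fingerprint is computed is out of scope here.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits present in either set."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_training_similarity(query: Fingerprint,
                                training: Iterable[Fingerprint]) -> float:
    """Similarity of the query to its most similar training compound."""
    return max((tanimoto(query, fp) for fp in training), default=0.0)


def check_domain(query: Fingerprint,
                 training: Iterable[Fingerprint],
                 threshold: float = 0.5) -> Tuple[bool, float, str]:
    """Step one of the two-step idea: flag queries unlike the training set.

    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    nearest = nearest_training_similarity(query, training)
    if nearest >= threshold:
        return True, nearest, "inside the applicability domain"
    return (False, nearest,
            "outside the applicability domain: use the prediction at your own risk")


# Toy fingerprints (made-up bit sets, not real molecules).
training_set = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5})]
similar_query = frozenset({1, 2, 3, 5})     # shares most bits with the training set
different_query = frozenset({1, 7, 8, 9})   # shares almost nothing

inside, score_in, message_in = check_domain(similar_query, training_set)
outside, score_out, message_out = check_domain(different_query, training_set)
```

Only when the check passes would the second step, the actual model prediction, be reported without the warning; for a failing query the message plays the role of the disclaimer mentioned in the talk.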
Info
Channel: Data Professor
Views: 15,174
Rating: 4.983707 out of 5
Keywords: data science, big data, bioinformatics, bioinformatic, cheminformatics, cheminformatic, chemoinformatics, chemoinformatic, QSAR, machine learning, drug discovery, drug design, proteochemometrics, structure-activity relationship, quantitative structure-activity relationship, lecture, dataprofessor, data professor, omic, omics, chemical space, #QSPR, pharmacy, pharmacology, systems biology, biology, drug, drugs, artificial intelligence, data science bioinformatics, #datascience, data science project
Id: uoVAd_zd-90
Length: 142min 7sec (8527 seconds)
Published: Wed Oct 02 2019