Machine Learning (ML) in Drug Discovery and QSAR 1/3

Captions
So, as a note: we are not completely ready yet with AI or ML in drug discovery. Of course there are a lot of studies, everything is getting implemented and success stories are coming in, but we are still working to reach that level — it takes time, like the Human Genome Project. I am talking specifically about drug discovery; in many other areas AI and ML are already implemented and in action, but in drug discovery we are still trying to get there. Many of the contents, pictures and videos on these slides are taken from web articles and lectures, and the copyrights remain with their owners. I have seen that many participants were asking for the material, so this slide deck will be available on my SlideShare (the link is on the slide), and some of the hands-on sessions — including Jupyter notebooks, KNIME, PaDEL, QSAR, AutoDock, MODELLER and molecular dynamics — are all on my YouTube channel. Whatever I use today in KNIME, Orange and the Jupyter notebook — the code for QSAR modeling and for the analysis — is available on my GitHub. It is all open, so you can find it once you are done with the presentation; I think you already have all the material relating to this talk.

We are trying to bridge a gap. I do not want to frame it as academia versus industry; it is more about what tomorrow needs and what the market actually requires. As Professor Shastri rightly mentioned, we were mostly focusing on publications, but tomorrow we want our students and scholars to be more productive and employable. There has been a good increase in the percentage of employability: compared with 2019, a Times of India survey says that in 2020 upskilling improved students' employability, but things are still missing, so we need to upskill beyond the curriculum — we cannot expect everything to be added to the curriculum. We have to look at the expectations of the market, industry and commercialization, and even ask why we should patent a particular technology and whether there is a need in the market. A patent means money; it is not only something to put on a CV — tomorrow we want to make money with that patent.

But in pharmacy, or in drug discovery, we still have a lot of unanswered questions: there is no medicine for this, there is no drug for that, efficacy is poor, there are toxicity issues. An ordinary citizen always thinks, "what I would really like is something to cure the high cost of medicine." What can we do as researchers to reduce the high cost of medicine? Of course I cannot come up with a drug tomorrow — it is not a one-person job; thousands of people, researchers and regulatory mechanisms are involved — but research still carries a large share of the cost of a medicine. That is why we want to understand failure. This pie chart was made from two papers, from 1988 and 1997, which analysed why oral drugs were failing between the 1960s and 1980s: the failures were mainly due to PK/PD or ADME properties and efficacy, while strategy, toxicity and adverse effects were comparatively minor causes.
Strategy was a minor cause back then because we relied on human synergy and expertise rather than on tools — at that time we did not have many tools, or AI or ML, whatever we want to call them; it was mostly statistics. Later, for the period roughly from 2000 to 2010 (I do not have data for 2010 to 2020, so I did not include it), the trend has completely changed: PK/ADME and efficacy are now among the least common reasons for failure, and the biggest causes are toxicity and strategy. So even when new molecules were designed with tools and sophisticated algorithms, there were still issues with strategy. We need good synergy between human expertise and computational modeling; relying completely on computational models is not great, so we must bring in our own expertise as well. The "science of relevance" is very important: we have to understand what is right and what is wrong, what the significance of the parameterization is, and how to tune the parameters of a model. Without that we cannot simply reuse a model — something that worked for someone else's project might not work for yours, because your problem statement, your biological system and your compound configurations are completely different. As Dr. Sunil discussed earlier about bioisosteric replacement: when you make a replacement you focus on a specific parameter; if you try to optimize multiple parameters at once you end up with nothing, so you go one by one and arrive at an optimal balance. Estimation of ADME at early stages became possible only because we could generate many models to predict ADME and PK properties in the later period, around 2010, but ultimately we need experimental data to build all those models — and similarly for toxicity, we will need a tremendous amount of experimental data in the coming years to reduce its contribution to the failure of oral medicines.

Today I am dividing the talk into four topics: first, machine learning — what, why and how; second, machine learning in chemistry and how chemical data is used in data science; third, drug discovery and how to avoid failures; and finally what to do next — are we really ready with AI and ML in drug discovery? Starting with the first part: learning denotes changes in a system that enable it to do the same task more efficiently the next time. That is what our parents used to tell us — "you made a mistake, let it be the first and last, don't repeat it" — we want to be more efficient next time so that we do not repeat the mistake. Machine learning is the same: you keep adding data to improve and enrich the quality of the data and of the model, so that the system can do the same task more efficiently the next time. Why do we need learning? Because we want to understand and improve, discover new things (data mining, scientific discovery), and fill in skeletons and incomplete specifications. Many people are working on that; even my counterparts in the UK are working on data imputation. Data imputation means that many compounds have missing data, for example assay data, and there are deep learning algorithms that can help you fill in that missing data.
Some of those algorithms already exist, but the applicability domain and the standard uncertainty used to be huge; now we are able to overcome that. We also want to build software agents that can adapt to their users and to other software agents, and reproduce important aspects of intelligent behaviour. The three parameters of a learning system are the task T, the performance P and the experience E: what exactly is the knowledge to be learned, how is this knowledge represented, and how is it learned? When you look at it, this is very much related to human learning — exactly the same, except that we are transforming it into binaries. As Dr. Karthikeyan mentioned in his presentation, binary is how machines learn from the data you give them; that conversion is all that is happening. Artificial intelligence means enabling machines to think like humans, with its own caveats to understand; machine learning is training machines to get better at a task without explicit programming. Keep in mind that in statistics we take a dataset and produce a model, whereas in machine learning we can keep adding data and the model learns and improves its accuracy by itself. Deep learning uses multi-layered networks; there are hidden layers, so it can also be a black box. All of this has to be considered when choosing between these variants. Currently, AI impacts our smartphones, detectors such as speed cameras on the roads, social media platforms and e-commerce (they understand what you search for and pop up ads), banking, finance, autonomous vehicles and smart farms. Machine learning is a branch of computer science that deals with system programming so that systems automatically learn and improve with experience. Don't worry, you do not need to learn programming as such — most of us are medicinal chemists or drug discovery people, and we do not want to do too much algorithm building. There are plenty of algorithms and libraries already built; we just have to make use of them. For example, we did not build AutoDock; someone built the scripts, and we simply use AutoDock to predict and examine binding affinity. The same applies here. You do have to understand the architecture: knowledge base, performer, critic and learner. The critic is like someone giving feedback on your work, and the learner improves from it; the problem generator tests the performance, so performance should be checked at each stage — during my demo today I will show how performance is scored, along with other parameters. The paradigms of machine learning are rote learning, interactive learning, induction, analogy, clustering, discovery and genetic algorithms; I am not going deeper into them. Previous speakers have already covered the different families of algorithms — supervised, unsupervised and reinforcement learning — and based on the data category and data types we decide which kind of algorithm to apply. The main techniques in ML are supervised, unsupervised, semi-supervised, reinforcement, transduction and learning-to-learn, and five popular algorithms are the decision tree, which I will use today; the neural network; the probabilistic network, which I will show in the multi-parameter optimization part;
the nearest-neighbour method, where you can correlate chemical structure with activity data — how far the activity deviates when a functional group changes within your structure — and the support vector machine, which I have also used today for QSAR studies and will show in my demo. The stages of machine learning are model building, model testing and applying the model; all three will be shown in the demo today. What is not machine learning: artificial intelligence as a whole, and rule-based inference. I will show an example of rule-based inference later in the presentation; I will not say anything or give any demo about artificial intelligence — machine learning is what we are concentrating on.

Now let us see how machine learning has influenced or helped chemistry, especially chemical data. There was a survey by C&EN in 2018 asking people in different areas of chemistry whether machine learning is really overhyped. Around 31.8 percent agreed that it is overhyped, from 151 responses of which roughly 23 percent were organic chemists, 16 percent analytical chemists, 10 percent biochemists, 17 percent physical chemists and 12 percent inorganic chemists, but only 7.3 percent computational chemists. With more computational chemists the picture might change, but the 2018 snapshot is still worth understanding; as of today it could be completely different, and the lockdown gave many people an opportunity to upskill with these techniques and approaches.

One typical workflow in chemistry looks like this: chemists analyse and synthesize, so there is a lot of annotated data — sensors, screening, synthesis and simulation. Then we pre-process, which is one of the important challenges: before building machine learning models you have to prepare your data and data types properly. It is not just clicking a button; you have to define the data types explicitly, so pre-processing is very important, and for that you need to understand what kind of data you have. For example, when you apply statistics you normally check whether normalization is required; in the same way, ML has its own pre-processing requirements. Then we apply machine learning models — which algorithm depends on the data type, which could be categorical, integers or strings — we generate the model, and we decide between classification and regression. The advantage is that each time we add new data, the model gets retrained and improves, but we still have to check the applicability domain, the performance, everything. The other scenario is an objective-oriented dataset: we already do design of experiments, with many experiments and optimization rounds, and in between we build a machine learning model to minimize the number of experiments and improve the yield or reproducibility, so we basically learn from the existing reactions and try to optimize them, as the sketch below illustrates for the pre-processing step.
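As a concrete illustration of that pre-processing step — this is only a minimal sketch with made-up column names (`temperature_C`, `solvent`, `yield_percent`), not the speaker's actual data — numeric features are typically scaled and categorical (string) features encoded before a model is fitted:

```python
# Minimal pre-processing sketch (hypothetical reaction data, illustrative only)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Hypothetical annotated data: numeric and categorical columns mixed
data = pd.DataFrame({
    "temperature_C": [25, 60, 80, 100],
    "time_h":        [1.0, 2.5, 4.0, 6.0],
    "solvent":       ["DMF", "THF", "DMF", "toluene"],  # categorical (string) type
    "yield_percent": [35, 52, 71, 64],                   # response variable
})

X = data.drop(columns="yield_percent")
y = data["yield_percent"]

# Scale the numeric columns, one-hot encode the categorical column
pre = ColumnTransformer([
    ("num", StandardScaler(), ["temperature_C", "time_h"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["solvent"]),
])

# Chain pre-processing and model, so any new data is transformed the same way
model = Pipeline([("prep", pre),
                  ("rf", RandomForestRegressor(n_estimators=100, random_state=0))])
model.fit(X, y)
print(model.predict(X))
```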
Based on that history we come up with proposals — for assays or for whatever property we are optimizing. As an earlier speaker mentioned about reaction databases, these systems also learn from failures, and that is very important: a framework for working with an in-house database of historical reactions needs to understand failures as well as successes. So, to the students who are working — if you have negative results, please include them in your literature, because that helps others understand what did not work, and it also helps machine learning algorithms learn which reactions succeeded and which failed. Such a system can then keep recommending reactions based on different models. We also have data from simulations; for example there are plenty of DFT calculations of energies. Running DFT may take a long time, whereas an ML prediction can be much faster, although the model generation itself takes time up front. Here you can see the difference between the energy values predicted by machine learning models and those from DFT, that is, quantum chemical calculations. The same goes for public databases: neural networks are used to predict reaction conditions — for a given reaction, what would be the best catalyst, solvent, reagent or even temperature to achieve a better yield. All of these algorithms and models are being developed, and many of them are openly available; nothing is overly proprietary, because people are still learning and want contributions from others to improve the models. There are deep learning algorithms for different chemical challenges: understanding a molecule, designing a new molecule, synthesizing it (that is, finding a retrosynthetic pathway), and even predicting binding affinities or energy values — they learn and then produce predicted data. Similarly, there is a schematic of the main components of atomistic ML: when you get the output you need to understand what it represents — a scatter plot, a clustering, a chemical space, or a decision tree — because each representation offers different interpretation capabilities. And recently there was a Kaggle competition, also published, where different datasets and two families of models were compared for accuracy using AUC and R²; some models did not give a great R², but for some you can see values even around 0.94.
So it all depends on the data and the method; both DNN and non-DNN models were used, and the reference comes from well-known authors in this area — the right people to follow, since many AI companies have already been mentioned by previous speakers. Now, descriptors for chemistry: how do we define chemical structures so that a CPU can understand them? For us it is easy to look at a structure and recognize an aromatic ring, a pyrrole, a keto group, an oxygen or a hydrogen. The difficulty for machine learning is that molecules have arbitrary size and arbitrary order, whereas ideal features should be general, compact, unique, invariant, smooth and fast. So a single structure is represented by bits — binaries — and if I go further, local or atomic descriptors encode each region of the compound separately; these can also be called fingerprints. ML models and algorithms handle fingerprints and numerical data far better than a drawn structure.

Now let us move into drug discovery, in line with the theme of this conference: how to use AI within drug discovery. Drug discovery approaches depend completely on the availability of data — and that means the right data. When I say a 3D protein structure, I mean a good-resolution X-ray crystallographic structure, not a homology model; a homology model has its own drawbacks, and you have to do a lot of validation and molecular dynamics to be sure you have a reliable structure. When I talk about 3D ligands, I mean X-ray crystal ligands or at least the best conformer to start with, because a synthesized ligand has a crystal form when it solidifies, but when it binds to a biological complex the conformation changes — the bound conformation of a drug molecule is different from its stable solid-state form. Our job in docking studies is to find the conformation that is stable within the complex. But in molecular docking we usually keep the receptor rigid, and that is where molecular dynamics becomes important: we run molecular dynamics to see how stable the ligand is within the pocket over a certain time, around 200 nanoseconds, or at least 100 nanoseconds minimum — not 20.
That is a requirement, and that is what we call structure-based drug design. In fragment-based design we have a 3D protein structure, but when no existing ligands have worked we go for de novo design or scaffold hopping. The challenge there is synthetic feasibility and accessibility, so I always say: if you are not a chemist doing this exercise, have an organic or medicinal chemist sitting next to you, so that he or she can tell you which fragment is actually synthesizable — unless you confirm it can be made, there is no point building it in the computer and predicting it virtually. We have to aim for a chance of success rather than missed opportunities. Then there is the well-known ligand-based drug design approach: when you do not have a 3D protein but you do have activity data or a reference active conformation, you do pharmacophore studies, similarity studies (pharmacophore-, structure- or fingerprint-based), and QSAR. Personally I prefer 2D QSAR over 3D — 3D has many things that must be handled carefully; simply generating 3D structures is not enough and you will get a lot of false positives, so you have to be very careful which 3D structures you use for 3D QSAR calculations. And when none of this data is available, you go for virtual screening, that is, in silico high-throughput screening.

That is where we have to understand the role of docking. Back in 1894, and even in 2000, we talked about the lock-and-key hypothesis: to exert a chemical action, a ligand interacting with a protein must fit into the binding site like a key into a keyhole. That is what we were taught, but in 2020 this is only half the truth — multiple keys can open a lock — so the hypothesis and our level of understanding have changed, and I will try to justify that. In molecular docking we want to determine the optimal binding pose of a ligand to a receptor and quantify the strength of the ligand–receptor interaction: where will the ligand bind, how will it bind, how strongly will it bind, what is the role of solvation and desolvation, and what about the translation, rotation and torsions of the ligand. Did I mention the docking score anywhere? No — the docking score is only meaningful when you are doing virtual screening and can report an enrichment factor; if you cannot report an enrichment factor, virtual screening data is of no use because it cannot be validated. When you dock only a handful of molecules, the docking score does not make sense as a way to rank compounds; it is the energy terms you have to look at. The key points to check are: the catalytic function and regulatory mechanism of the enzyme you have chosen; whether the site is an active site or an allosteric site; whether cleft detection really identified the functional site and the key amino acids in the binding pocket; geometrically, whether the conformation is open or closed and whether it changes upon ligand binding; flexibility, whether it is the apo state or a substrate-bound state; conservation; selectivity, sensitivity and specificity; and the surface characteristics — the solvent-accessible area and the hydrophobicity or hydrophilicity.
And finally we talk about the interaction energy. This picture was taken from an enzyme methodology book, and it is really interesting — this is the level of understanding you need when discussing molecular docking results. This is what I meant about the solvation terms; we also look at the torsions of the compounds in the complex, and at hydrogen bonds and their contributions. We are trying to understand which of the conformations generated by docking could also exist in a crystal database: here the green, yellow and orange colours indicate whether a given dihedral or torsion is observed in the Cambridge Structural Database (CSD). If it is green, that torsion is observed, which means this bound conformation is one that can actually crystallize. So where does the docking score come in? As I said, the docking score matters when you do virtual screening, but ultimately, as a medicinal chemist, you do not just want to do virtual screening — you want to synthesize new molecules and come up with lead molecules or derivatives. For that you look at things like this: a particular carbon atom has a negative contribution to the binding affinity, so I should consider a new derivative or a replacement of that atom to see whether it improves the affinity. That is the kind of per-atom contribution you examine when you do structure-based lead optimization using your protein structure data. Finally, we also look at chemical spaces: there are now ultra-large chemical databases — last week there was an NIH workshop on them, discussing databases with billions of compounds, with speakers from Merck, AstraZeneca, Pfizer and even Google talking about the expansion of this chemical space. This will really help medicinal chemists find the best building blocks for novel compounds suited to the pocket of a protein; that is the revolution. The mapping is done not simply on structural similarity but on pharmacophore similarity: a particular region is matched on a pharmacophore basis, with feature similarities around 0.61. This was the query molecule, and a new virtual hit was found from the building blocks of the chemical space on pharmacophoric criteria. The global similarity may be around 0.8, but a plain structural similarity between the two compounds would be much lower — that is how the mapping actually works. There are also many applications of machine learning to retrosynthetic pathways: you give the system a new compound, it searches the database, looks at the literature and predicts similar routes, the expected yield, the reagents and so on. Finally, the objective of drug discovery is to arrive at an optimal balance of properties. When you do molecular docking or a bioassay study, you are only finding a hit molecule — that's it.
You understand the potency, but you cannot yet call it a drug-like molecule. Even to call it a lead molecule you should find that it is safe, at least to some extent at our level of research; that it has a good absorption rate (human intestinal absorption); metabolic stability — the molecule does not break down when it passes through cytochrome P450; and reasonable solubility and flexibility. All of these must be considered before calling it an optimal drug candidate rather than just a molecule. The three mantras are: fail fast, fail cheap, and move forward with a compound only when you are confident — and avoid missed opportunities: do not use unnecessary filters that are not really required for your objective, or you may lose good compounds. This matters because there is a lot of data, and data overload is itself a big challenge. Forget what is written on the slide; here is a very simple example: imagine you have 10 different compounds and need to identify the two best candidates, and you have to carry out 10 different experiments. If I were doing a single experiment across the 10 compounds it would be easy to pick the best two based on thresholds or some criterion, but here I have 10 independent experiments to consider in order to choose the best two of ten. How do we do it? Most of the time the audience, participants and researchers say "take an average" — but I emphasized that these are independent experiments and all of them matter, so we have to set priorities. Ultimately — and this is the point — we have a lot of data from in silico, in vitro and in vivo studies; to reach a conclusion we have to prioritize, giving more weight to the properties or experiments that matter more for the activity or property we are targeting. Second, you have to consider uncertainty, that is, error. If you carried out an experiment only once, there is uncertainty: you have no standard deviation and no mean; you have to run it at least three times. If the experiment itself has uncertainty, a prediction certainly does too, so we need the standard deviation of the prediction as well. If the deviation is large we give it less weight because it is unreliable — which does not mean the software or the model is bad; it may simply be that the chemical class of the compound is not within the applicability domain of the training set. I know this may sound a little vague, but once you understand the applicability domain you will understand QSAR much better. Finally, when you do the selection, the selection process is not a tool — you have to sit down and analyse the data manually. In my view, when you initially select compounds from a big set you should give more importance to diversity and then to quality: do not pick a single molecule and build up only its derivatives; take different scaffolds and make multiple derivatives of each, so that you have diversity first and quality later. That is how the process should go — a toy illustration of this kind of prioritization follows.
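To make the prioritization idea concrete — this is only a toy sketch with invented property names, thresholds and weights, not the published probabilistic scoring scheme referred to later — one simple way to combine several independent measurements is to score each property against a threshold and weight it by its importance:

```python
# Toy multi-parameter prioritization sketch (invented thresholds/weights, illustrative only)
compounds = {
    "cmpd_1": {"potency_pIC50": 7.8, "solubility_uM": 120, "cyp_stability": 0.4},
    "cmpd_2": {"potency_pIC50": 6.9, "solubility_uM": 350, "cyp_stability": 0.8},
    "cmpd_3": {"potency_pIC50": 8.2, "solubility_uM": 15,  "cyp_stability": 0.9},
}

# (threshold, weight, higher_is_better) per property — the weights express priority
criteria = {
    "potency_pIC50": (7.0, 1.0, True),
    "solubility_uM": (100, 0.7, True),
    "cyp_stability": (0.5, 0.5, True),
}

def score(props):
    """Weighted count of criteria passed; real MPO schemes use smooth/probabilistic scores."""
    total = 0.0
    for name, (threshold, weight, higher) in criteria.items():
        passed = props[name] >= threshold if higher else props[name] <= threshold
        total += weight * (1.0 if passed else 0.0)
    return total

ranked = sorted(compounds, key=lambda c: score(compounds[c]), reverse=True)
for c in ranked:
    print(c, round(score(compounds[c]), 2))
```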
In the initial, conventional filtering we have a set of molecules; we run a potency test and remove the compounds that do not appear active; then we carry out absorption-rate (intestinal absorption) studies and again remove compounds that do not work; then metabolic stability studies, removing more; and finally we expect to be left with a very good compound. Unfortunately, in most reported case studies that is not what happens: your best compound may go and hide somewhere, because it did not have a good absorption rate even though it was very potent — or, as you may have experienced, a very active compound later turns out to be toxic — and you may even end up keeping a non-active compound. When you filter step by step like this, there is a real chance of missing opportunities; that is what I call missed opportunities. So we have to think about applying different methods, and this is where rule-based inference comes in. This is not machine learning; it is a rule-based method in which we consider different properties, set thresholds for them, and assign each property a level of importance relative to the others — something like Lipinski's rule of five, where one rule may be violated as long as the others are satisfied. Similar rules are set up here and combined with probabilistic scoring. This has already been tried and published: filtering one property at a time found only about 60 percent of the feasible candidates, whereas multi-parameter optimization found around 74 percent, so the 14 percent difference represents the missed opportunities of the one-by-one approach.

Then there is ADME — absorption, distribution, metabolism and excretion. I will not go deep into the theory, but all ADME models are based on QSAR: we need the right starting structure (mostly 2D, not 3D), we extract descriptors, we take the experimental assay data for these ADME properties as the response variable, and we apply different statistical or machine learning models. I remember, around 2013 or 2014, Dr. Sunil was one of the guest editors for a Bentham medicinal chemistry journal and invited articles; my professor and I wrote a review on machine learning in drug discovery that was published there — this slide was originally prepared for my PhD and I am still using it, which is why I remember. Finally, after generating all these models, it is not only about R² or a small error value; it is also about physical meaning, significance and relevance — whether the descriptors you have chosen are genuinely relevant to the activity or property you are modeling. That is very important, even though the statistical parameters matter equally. Then you do validation, and finally you go for prediction: ultimately you are generating these QSAR models so that others can use your model to predict the activity of their own compounds.
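As a minimal sketch of that descriptor-to-model pipeline — the SMILES and activity values below are invented for illustration, and the descriptor set is just a handful of RDKit properties, not the curated set one would use in practice — the flow from 2D structure to descriptors to a fitted model looks roughly like this:

```python
# Minimal QSAR regression sketch: 2D structures -> descriptors -> model (illustrative data)
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical training data: SMILES and a measured response (e.g. pIC50)
smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC",
          "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
activity = np.array([4.2, 5.1, 6.3, 4.8, 6.9])

def descriptors(smi):
    """A few simple whole-molecule descriptors; real models use larger, curated sets."""
    mol = Chem.MolFromSmiles(smi)
    return [Descriptors.MolWt(mol),
            Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol),
            Descriptors.NumHDonors(mol),
            Descriptors.NumHAcceptors(mol)]

X = np.array([descriptors(s) for s in smiles])

model = RandomForestRegressor(n_estimators=200, random_state=0)
# Cross-validated R2 (meaningless on 5 toy compounds, but it shows the call)
print(cross_val_score(model, X, activity, cv=2, scoring="r2"))

model.fit(X, activity)
# Predict for a new structure (aspirin here, purely as an example)
print(model.predict([descriptors("CC(=O)Oc1ccccc1C(=O)O")]))
```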
For example, that is exactly what online prediction servers do. So whenever you publish a QSAR paper, make sure you deposit all the raw data somewhere public, so that people can use your model to predict activities for their own compounds; otherwise it is of little use. Today we will look at a demonstration of building random forest and SVM models. I will not set aside a separate validation set today — we will only split into training and test sets — but ideally you should also validate with an independent set in order to understand the applicability domain of your model. This is what is called the domain of applicability: the diversity of the training set defines the domain. For example, imagine the orange dots are structures clustered by similarity, and I bring in a new compound whose activity I want to predict with the model built from that training set. If my compound falls within the applicability domain, the standard deviation of the prediction will be small and I have more confidence in its reliability. If the structure is slightly outside the applicability domain, it is an okay situation; but if it is far outside the applicability domain of my training set, the prediction is unreliable because the standard deviation is large, and I cannot treat it as validated. Unless a QSAR model or an ADME prediction comes with a standard deviation, the reliability of the value is questionable — how else do you justify the prediction? I will not even say "accurate", because you cannot predict truly accurate values with models anyway, but you at least want reliable data with more confidence.

I already mentioned some of the challenges that deep learning addresses, such as imputation: traditional QSAR modeling offers little help there, and compound bioactivity and property data are very sparse, so we want to fill in the missing data and make better predictions of compound bioactivities. Imagine you have descriptors and you make predictions, but your assay matrix has many missing entries: we can use deep learning algorithms to impute the assay data. I am not going into the details — many publications and case studies have appeared, and that alone would be almost a two-hour talk — but the idea is that when descriptors and assay data are combined, the missing values can be imputed with deep learning, and the imputation is checked via the probabilities attached to the predicted values. There are case-study applications with several pharma companies and with the Open Source Malaria project, where generative models were built. Ultimately, what we are trying to convey in this drug discovery section is the need to bridge the gap between the chemist and the modeler: whatever you model, if it says something should work, a chemist has to synthesize it and get it tested, so there must be good synergy and ongoing collaboration between the modeler and the chemist in any area — pharmacologist or pharmaceutical chemist, whoever it is.
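Coming back to the applicability-domain point above: one simple, commonly used proxy — a sketch only, with invented SMILES and an arbitrary 0.3 cut-off, not the specific method used in the talk — is to flag a query compound whose maximum Tanimoto similarity to the training set is low:

```python
# Sketch of a simple applicability-domain check via Tanimoto similarity (illustrative)
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

training_smiles = ["CCO", "CCN", "CCOC", "CCCO", "c1ccccc1O"]    # hypothetical training set
query_smiles = "ClC(Cl)(Cl)C(c1ccc(Cl)cc1)c1ccc(Cl)cc1"          # hypothetical query compound

def fp(smi):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi),
                                                 radius=2, nBits=2048)

train_fps = [fp(s) for s in training_smiles]
query_fp = fp(query_smiles)

# Maximum similarity of the query to any training compound
max_sim = max(DataStructs.TanimotoSimilarity(query_fp, t) for t in train_fps)
print(f"max Tanimoto similarity to training set: {max_sim:.2f}")

# Arbitrary threshold for illustration: low similarity -> outside the domain
if max_sim < 0.3:
    print("Query lies outside the applicability domain; treat the prediction as unreliable.")
else:
    print("Query lies within (or near) the applicability domain.")
```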
If you look at the success stories, there is a list of them — I think one of the reviews was written by Dr. Sunil, I would have to check — and these are the main reviews from which the list was prepared. There are good success stories of CADD and ML that have reached phase I, II and III clinical trials, so we have to wait for more news to come. Now, how can we apply machine learning ourselves? For that we can rely on a single package: Anaconda. Anaconda includes Jupyter notebooks, SciPy, NumPy, Numba, TensorFlow, Matplotlib, pandas, scikit-learn and many more Python libraries — you just install Anaconda and that does the job. So, on to the demo (I think I have time — yes, I do). We will build a QSAR model using machine learning methods, trying SVM and random forest. For that we use RDKit, which Dr. Sunil has already mentioned — RDKit is a collection of cheminformatics and machine learning software — together with scikit-learn and pandas. scikit-learn is a free machine learning library for Python featuring various classification, regression and clustering algorithms, including support vector machines; the library is already there, you only need to know how to use it correctly, since there are specific syntaxes to follow — you do not need to write lengthy code, and that is not required at all. I am not a programmer, yet I am able to make use of scikit-learn and pandas. pandas is a library for data manipulation and analysis; it offers data structures and operations for manipulating numerical tables and time series. As I said earlier, before any machine learning analysis you have to treat or pre-process your data, and pandas can be used for that.

And if you really hate coding altogether — "I don't want that much flexibility, I am a graphical-user-interface person" — then you have two options: Orange and KNIME. I have already published many KNIME workflows on my GitHub, and there are a few YouTube videos I have made on cheminformatics; today I will go through things quickly because of the time limitation. So let me start the first demo, in KNIME — I hope you can see my screen. KNIME can be downloaded from knime.com (or knime.org, I think); the desktop environment is free and open source, while the server environment is commercial. You have a lot of nodes available here. How are they installed? Go to the Help menu, Install New Software, and you can install many free node collections: BioSolveIT nodes for docking; ChemAxon nodes for MarvinSketch and MarvinView (not ChemSketch, sorry); and RDKit, Indigo, Enalos, Erl Wood, PaDEL, Weka, 3D-e-Chem, CDK and many more. What is the point of KNIME? Automating the process — but be careful with automation: test everything properly, check it, and only then automate.
What I mean by automating is this: imagine you want KNIME itself to download a ChEMBL dataset, pre-process it, select compounds, neutralize and desalt them (that is, remove the salts), convert 2D to 3D, generate conformations, run docking, do clustering, run pre- and post-analysis with conditional constraints, and then derive a pharmacophore. Normally each of those steps would use a different tool and you would have to wait for each piece of software to finish before starting the next; in KNIME you define each step as a node and connect them, so when the first node — say, downloading the compounds — finishes, the next one starts automatically. Once you have optimized and tested your whole workflow, you just click a button, it runs, and at the end of the day you come back and see all the results. That is the advantage of KNIME. I have been using it since my PhD days, since 2011; back then there were not many nodes, but now you can see there are plenty.

Let me explain what this workflow does. I read a dataset with an SDF Reader node and I draw a structure with MarvinSketch; what I am doing is a similarity study — a substructure search. I already have an SDF: double-clicking the node lets me load the SDF file, and if I want to see how many molecules it contains I can scan the file — you can see there are 191 molecules, with three columns: compound ID, the pKi values, and the series. Then, double-clicking the MarvinSketch node pops up the drawing tool where you can draw the substructure — say just an aromatic ring, or something else — and click OK (since I did not make any changes it simply says so). The workflow then generates the fingerprints of the whole dataset; I loaded only 191 molecules, but you can load millions depending on your computer's memory — in KNIME it is less about CPU and more about RAM and cache, so you need a good amount of RAM. Then we do a fingerprint similarity search, one substructure against the 191 molecules, and build a distance matrix in order to produce the final similarity viewer. To run it, click "Execute all nodes" at the top — I hope you can see my cursor — or select a single node and run just that one with the play button. Each node has a traffic light: red means the data is missing or insufficient, yellow means the data is given but not yet executed, and green means the node has finished and the result is ready. These are the output nodes and these are the input nodes; you right-click and choose the output, for example "Read molecules", and it shows a table, something like MarvinView, with the 2D structures and the content read from the SDF. Now that the files are read, I click Execute, and you can see everything ran automatically.
The final node is still running; once it finishes I will show the output. It looks really nice because it compares the substructure query against this set of 191 compounds (I will not call it a database). I right-click and choose "View heat map". The heat map is not just pretty colours — the information appears when you mouse over it: you can see which compound in the dataset is most similar to which other compound. Along the diagonal the compounds are compared with themselves; off the diagonal you can see the similarity changing, for example 0.5357, and if you click, you can identify the row numbers, which tell you which structure is similar to which. When you do QSAR it is always better to understand the homogeneity of your dataset, and you can check it this way. Now let us look at the other output, similarity to the target: my aim was a substructure search against the compound I drew in MarvinSketch. Here I have row zero — I need to check where my query structure went; I think I have not actually added it, so instead I am selecting a structure within the dataset and asking how similar that row's structure is to the other compounds. That is something I often want when doing QSAR or similar studies.

If I go to the second workflow (I will skip the details because of time), here I am generating 3D structures: I read the 2D structures from the SDF files, remove small fragments or salts, add hydrogens, and then optimize using a force field such as UFF or MMFF94; if I want, I can regenerate 2D coordinates, or use these structures for docking. Alternatively, I can take the SDF Reader and simply use Open Babel: if I pass the optional parameter --gen3d, Open Babel will generate 3D structures, and all the input formats supported by Open Babel can be used. The only catch is that on macOS the node does not come with the Open Babel binaries — you have to install them separately and specify their location — whereas on Windows and Linux it works by default. Another workflow, which I adapted from publicly available example workflows and extended, does QSAR modeling with a random forest regression method: an SDF Reader, fingerprint generation with RDKit, partitioning into training and test sets with sampling, model generation, prediction with the model, and then scatter plots. All of this can be done within KNIME — there are many such applications. This particular workflow generates descriptors: I can compute descriptors with CDK (XLogP), RDKit, Enalos molecular properties and also PaDEL — I know many of you use the PaDEL descriptor tool, and it can be used here too. The advantage is that I load a single SDF Reader and all the descriptor nodes run together; I can merge the outputs and finally write everything to one CSV file.
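The same descriptor-table idea can be sketched outside KNIME in a few lines of Python — a minimal, illustrative version using RDKit and pandas with a hypothetical input file name (`compounds.sdf`), not a replacement for the CDK/Enalos/PaDEL nodes in the workflow:

```python
# Sketch: read an SDF, compute a few RDKit descriptors, write a CSV (illustrative)
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

rows = []
# 'compounds.sdf' is a placeholder path for your own 2D structure file
for mol in Chem.SDMolSupplier("compounds.sdf"):
    if mol is None:          # skip molecules RDKit could not parse
        continue
    rows.append({
        "name":  mol.GetProp("_Name") if mol.HasProp("_Name") else "",
        "MolWt": Descriptors.MolWt(mol),
        "LogP":  Descriptors.MolLogP(mol),
        "TPSA":  Descriptors.TPSA(mol),
        "HBD":   Descriptors.NumHDonors(mol),
        "HBA":   Descriptors.NumHAcceptors(mol),
    })

pd.DataFrame(rows).to_csv("descriptors.csv", index=False)
print(f"wrote {len(rows)} rows to descriptors.csv")
```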
That is one of the best advantages of KNIME. Next I am moving to Orange. Orange is somewhat similar to KNIME but not in every respect: in Orange you can mainly do machine learning and statistics; you cannot do much chemistry there. What I have done here is take logBB data — blood–brain barrier data — with around 56 structures, and load it with the File widget. This is what I meant about knowing your data types: you can see each column's role and type — the descriptors are all set as features, the activity is my target, and each column is marked as categorical or continuous/numerical. At the end there is an endpoint column that I have already marked as the target. If you want to look at the data you just double-click and you can inspect everything. This file is also available on my GitHub under the qsar folder — download Orange, load that file, and you are free to change the CSV: generate your PaDEL descriptors, include your activity data, load it here, and it does all the modeling automatically, because the workflow is already built. Then I select specific descriptors: basically I do not want all of them — too many descriptors lead to false positives and overfitting — so I deselected some while testing; I will come back to that. Then I build models using a neural network, a support vector machine, a random forest and a tree, and finally score them under Test and Score. When I double-click it, it shows the results — I hope you can see the screen. It reports the R² values for the different learners, roughly 0.87 to 0.99; in fact I would not take the neural network anyway, because the dataset is very small, only about 56 molecules. I can also do cross-validation: when I select it, a 10-fold cross-validation starts running — this window is covering the progress bar, otherwise it would show the percentage — let me check... yes, it is calculating, you can see the progress going, so let it finish and we will wait.

After that, if I want to make predictions — to see how these models predict my experimental values — I add a Predictions widget and it generates the predictions. Where do you find all these widgets? Under Data there are File, CSV, Select Rows and so on, and under Model there are SVM, Neural Network and the rest. Now imagine I have built a model and want to share it with a friend or collaborator who wants to use it on their own compounds: I can save the model here — for example I am saving the random forest model — and it is written to a file. The collaborator loads that model file again, then loads only the descriptor file of his or her compounds (just the descriptors, without any activity data), and using the model and those descriptors the workflow predicts the activity of the new set of molecules.
In my presentation I told you about building the model, testing the model, and deploying or applying the model — all three are covered in this workflow. In KNIME I showed you building and deploying as well, but here I am also showing the save-model option (KNIME has a model-writer option too); I thought this was more straightforward, which is why I focused on it here. Let the cross-validation finish and then I will show you the output data and predictions — it is doing a 10-fold cross-validation; with 5 folds it would have been much quicker, so make sure you have some RAM and CPU available. That is why I said model generation takes time: this is only about 56 molecules, so imagine thousands. But here you are not doing any coding. Let me double-click and see — yes, the results are not great, but that is fine; I mainly want to show them. Now let us look at the scatter plot: here I can choose which predicted model to plot — random forest, SVM or neural network — against the actual experimental endpoint; I can change the colours, save the picture with the save button at the bottom, increase the symbol size, and so on, all within the widget. Okay, that is it for Orange.

Now I am quickly moving to a few more things. To open a Jupyter notebook, after installing Anaconda you go to the command line and run jupyter notebook; the notebook then loads in your browser. The slides I am showing are already on SlideShare, and everything I showed today is on my GitHub: under the qsar folder you will find the Orange file I used — download it, load it into Orange, and it will all be there. If you want to work with more Jupyter notebooks, I have notebooks for GROMACS, which does molecular dynamics; I am working on notebooks for AMBER and OpenMM, as well as umbrella sampling, which will be uploaded soon; and for protein modeling, basic and advanced notebooks using the MODELLER tool are already there, with instructions — you only need your initial protein sequence and the rest is largely automated, so please read the instructions.

Now let us go to the application; I will switch to presentation mode so it is easier to see. This is a Jupyter notebook with just a fun experiment — a voice recognition, speech-to-text script — because I do not want to start with too much science or it gets boring. What I did is import PyAudio, a Google translation module, SpeechRecognition and pyttsx3 into Python; I initialize the speech recognizer, ask the Python code to identify all the microphones I have, and select the mic I have connected right now. And this is the beauty of it — I mentioned it earlier, but now I am going to execute it again.
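For reference, here is a minimal sketch of what such a script can look like — my own reconstruction with the SpeechRecognition and pyttsx3 packages, not the speaker's exact notebook; the googletrans translation step is omitted because its API has changed between versions:

```python
# Minimal speech-to-text sketch using the SpeechRecognition package (illustrative)
import speech_recognition as sr
import pyttsx3

# List the available microphones, then use the default one (adjust for your setup)
print(sr.Microphone.list_microphone_names())
recognizer = sr.Recognizer()

with sr.Microphone() as source:                 # uses PyAudio under the hood
    recognizer.adjust_for_ambient_noise(source)
    print("Speak now...")
    audio = recognizer.listen(source)

# Google's free web speech API does the recognition
text = recognizer.recognize_google(audio)
print("You said:", text)

# Optionally read the recognized text back out loud
engine = pyttsx3.init()
engine.say(text)
engine.runAndWait()
```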
Okay, let us try the experiment: "Hi everyone, I am very happy to be presenting my slides in this particular session, wishing you all the best of luck." The listening time is up, and it prints "I am very happy to presenting my slides in this particular section" — I just spoke, the Python code captured the audio, and the Google recognizer converted the speech to text. We assume this needs big science or a lot of coding — no, things are much simpler now: with about six lines of code I can turn my audio into text. That is the advantage, and you will start liking it, so begin with fun experiments; do not start with the science straight away, because first you need to get attached and interested in Python programming. Later you can do translations too. So much for the fun experiment.

Now a bit more science. This is again a Jupyter notebook, a QSAR model of blood–brain barrier permeability, adapted from an openly available published notebook by Maria and Pavel. I import the libraries — the important ones I already showed in the presentation: RDKit, NumPy, scikit-learn, and joblib, which used to be part of scikit-learn but now has to be imported separately. Then I read the molecules and activity data from an SDF file — it can be any SDF file, 2D especially, not 3D — stored as logBB.sdf in my data folder. I read them one by one, define the logBB class, and check how many there are: around 321 molecules have activity data, meaning the data features are present. If I want to visualize the data, this is the 24th molecule; here is the 45th, or the 4th. But do not think that just asking for the fourth molecule will work on its own — you have to run the previous steps I showed, otherwise it will not work. That is how you visualize the molecules. Next, something very interesting: I remember a previous speaker trying to show this and sometimes it does not work — there are a lot of dependencies — but NGLView is a very nice visualizer within Python and Jupyter, where you can view proteins and even molecular dynamics trajectories. This is a GROMACS file, a generated conformation, which we can view: you can zoom in and out, make trajectory movies, anything. I just wanted to show it; now back to our work.

Now we calculate the descriptors — specifically fingerprints — using RDKit: we use Morgan fingerprints with a radius of 2. After generating them, we convert the fingerprints into a 2D NumPy array — I do not want to confuse you with too many details, but we convert to an array so that the machine learning models can handle the data properly — and we end up with 321 rows by 2048 features. Now we want to split the whole set into a training set and a test set; that is the standard exercise for QSAR anyway.
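A condensed sketch of those steps — reading an SDF, computing radius-2 Morgan fingerprints, converting them to a NumPy array and splitting 80/20 — is shown below; the file name `logBB.sdf` and the activity property name `logBB_class` are placeholders standing in for the notebook's actual data:

```python
# Sketch: SDF -> Morgan fingerprints (radius 2, 2048 bits) -> numpy array -> train/test split
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
from sklearn.model_selection import train_test_split

mols, y = [], []
for mol in Chem.SDMolSupplier("logBB.sdf"):            # placeholder file name
    if mol is None or not mol.HasProp("logBB_class"):  # placeholder property name
        continue
    mols.append(mol)
    y.append(int(mol.GetProp("logBB_class")))

# Morgan (circular) fingerprints, radius 2, 2048 bits, converted to a 2D numpy array
fps = []
for mol in mols:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    arr = np.zeros((2048,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    fps.append(arr)
X = np.array(fps)

# 80 % training / 20 % test, as in the demo (test_size=0.2)
X_train, X_test, y_train, y_test = train_test_split(
    X, np.array(y), test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```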
So we split them. Here I have given test_size equal to 0.2, which means 80 percent of the data set goes to training and 20 percent to the test set, and then I display how many training compounds I have, around 256. Then I scale them; this step is very important. For binary fingerprints it may be less useful, but for SVM it is crucial, so we apply standard scaling. Here we can also dump the model: the fitted scaler can be saved as a .pkl file, and later we can load that .pkl file back.

Now we are doing cross-validation, exactly what we did in Orange. There I used 10 folds, but here only 5, which is what n_splits = 5 and random_state = 42 define, and then I print out the folds. If I scroll, let me see if I can do that... okay, I am not able to display that. Then I generate the random forest model by giving the number of estimators. This is very important, because we do not know the best value in advance and want to fine-tune it; optimal tuning of parameters is essential for building good models. We want to understand which n_estimators will be optimal for the most accurate model, so those things we have to define, and we have to understand RF modelling, SVM and so on. Model building was done with iterations, and finally we did a five-fold fitting to understand the model.

Now let us save the model, which, as I told you, we can save as a .pkl, and then analyze it. We want to use the model to predict the test compounds, which we have already split off, and calculate its performance. So we load the model; as I already told you, scale.pkl is the saved scaler we can load, and we predict for the test set of molecules we had set aside, the 20 percent, predicting the logBB values. After the prediction we check how well the test prediction did. You can see the accuracy score is about 75 percent, 75.38 percent to be exact, so a 25 percent compromise, and we still have to look at tuning the parameters optimally. The correlation coefficient is around 0.4995 and the kappa score is around 0.4985; these are all statistical parameters.

Next we look at estimation of the applicability domain. As I told you, if you are doing QSAR properly, you have to report the applicability domain, so we estimate the applicability domain of the training set. Then we build the SVM model again to check the process: earlier we built with random forest, now with SVM, so the accuracy and performance figures will change completely. Here you can see the accuracy has improved to around 77.7 percent, the correlation is around 0.5487 and kappa is about 0.542. There are more models, but I am not going through them because of the time limitation.

So let me finish my remaining slides, three or four more, and then I will wind up the session. We are done with the demo: we covered Orange, similar things in KNIME, and similar things in the Jupyter notebook. Now it is your choice which IDE or interface you want to start with. Do not think that only a programmer can do all these things; never. Please spend some time on YouTube; there are plenty of videos for Jupyter Notebook, KNIME and Orange, and you have to put in some time.
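Putting those steps together, a condensed sketch of the scaling, cross-validation, training and evaluation stages might look like this; it continues from the split in the previous sketch, uses assumed variable and file names, and the exact folds, estimator counts and metrics in the original notebook may differ:

```python
# Condensed sketch: scaling, cross-validation, random forest, metrics, SVM comparison
# (X_train, X_test, y_train, y_test come from the split in the earlier sketch).
from joblib import dump
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, matthews_corrcoef, cohen_kappa_score

# Scaling matters little for binary fingerprints but is crucial for SVM
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
dump(scaler, "scale.pkl")                       # save the fitted scaler for later reuse

# 5-fold cross-validation of a random forest (n_estimators is worth tuning)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=500, random_state=42)
print("CV accuracy:", cross_val_score(rf, X_train_s, y_train, cv=cv).mean())

# Fit on the full training set, save the model, then score the held-out 20 %
rf.fit(X_train_s, y_train)
dump(rf, "rf_logbb.pkl")
y_pred = rf.predict(X_test_s)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("MCC:     ", matthews_corrcoef(y_test, y_pred))
print("Kappa:   ", cohen_kappa_score(y_test, y_pred))

# The same evaluation with an SVM will usually give different numbers
svm = SVC(kernel="rbf", C=1.0).fit(X_train_s, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test_s)))
```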
During your free hours, if you are really enthusiastic about getting into machine learning (I do not call it AI; we are not there yet), what should you do next? Are we ready for AI and ML? Since I am talking to NIPER and I hope many of the faculty are related to pharmacy, there is something called Pharma 4.0, where drug discovery is more of cloud-based simulation and model outcome testing; that is what we did, we tested the model's performance, which is quite routine nowadays, with better accuracy as the goal. It includes social collaboration with stakeholders; mobility-driven, real-time electronic research data exchange, that is, electronic lab notebooks, which are not much used in academia but very widely used in industry; clinical and pre-clinical trials on electronic data capture systems; pharmacovigilance; clinical trial data and much more; and the supply chain, where RFID, manufacturing, marketing, sales, blockchain and many other pieces come into Pharma 4.0. All of this sits under Industry 4.0, where today we are talking about advanced robotics, 3D printing and the Internet of Things. For example, Dr Karthikeyan works on the Internet of Things for biological problems, as he was discussing about cancer disease and his applications. I have tested his system: I just emailed the PDB ID and the molecule name, and after a few minutes I got the result back with the scores and everything, and he then called me to confirm whether I had received it, which was really great. That is how the Internet of Things works: nobody is sitting there doing it, it is all automated in the machine. Artificial intelligence, big data, intelligent flexible distributed production: that is the era we are looking at.

But digital transformation has its own drawbacks and limitations, so we need to know where we cannot replace humans; in many places they still cannot be replaced. Many times when I present this slide people ask me: is it a danger to humanity, or are our jobs at stake? No, never. Maybe for a short time our jobs might be at stake, but later they will need us to work, because data generation cannot be done by computers alone, only by humans. This slide is from one of the consortiums; I do not know whether it is official, but they have a website and everything, and it is taken from the internet, not my claim. Artificial narrow intelligence, which is where we are today, is where jobs are enhanced; I could see that there are more than 93,000 job openings in India, especially in data science. Tomorrow that number may go down, because once all the models are up and running efficiently we might see a dip in some jobs, not all, but then it will go up again, because people will still be needed. Ups and downs will be there whenever new technologies are implemented. We were all worried when Y2K came, but did anything happen? We worried only in 1999 and 2000; from 2001 onwards there were again lots of job openings. That is how it is: these technology implementations will cause a small dip in job openings, but later things have to pick up because a lot of manpower is required. Whether the perspective being projected for 2040 and beyond will hold, I make no claim and take no stake; we will have to wait and see.

So that is all from me for today. To conclude:
When you are doing any kind of virtual high-throughput screening or large data set screening, initially give more importance to diversity (I am talking especially to medicinal chemists) and roughly 25 percent of the weight to the score or quality; as the size of the lead set shrinks and the funnel narrows, give more importance to the quality of the score and less to diversity, because at that stage you will have more derivatives rather than a diverse set of scaffolds. Please consider enrichment factors when you are doing virtual screening; that is very much mandatory. Good synergy between human expertise and computational tools is essential: it is not about which computational tool you use, it is about whether you obtain valid results with whatever tool you use. When I published my papers nobody asked me why I did not use a commercial tool; I have used AutoDock for most of my publications, and for QSAR I have used PaDEL and other QSAR tools, which are absolutely free. Nobody asked whether I needed anything else, but we do have to justify ourselves, whether the model's performance is good or bad; once you come up with these models you should be able to show how reproducible they are.

Avoid missed opportunities: try to put the right method in the right place and do not misuse it. Understand the significance of parameters and properties. For example, as I told you, when my test set is 20 percent and the n_estimators in the random forest is 100, 250 or 500, the accuracy is only about 74 percent; if I fine-tune those parameters the accuracy might improve, and I have to check it, so we need to understand how changing n_estimators in a random forest influences accuracy (a small tuning sketch is given below). Evaluate and then decide on the tool or library: do not assume that pandas or scikit-learn or RDKit can be used everywhere, each has its own limitations, so we have to test, evaluate and decide which one is the right choice. And check the reliability of the data you use; that is very important in machine learning, as in any science. In drug discovery, unless your initial data is well validated and from a reliable source, it cannot be modelled properly. One more thing: when you are doing QSAR, I have heard from many students and scholars that they try to merge data from different papers, maybe from the same group or from different groups. Do not do that; do it only if you are sure that the assay used to measure that biological property or activity is exactly the same across all the papers you have collected. If there is a difference in the assay, do not combine them; there will be outliers, more deviation in your model and more false positives. In the earlier days QSAR was mostly used for local data sets, not for very diverse data sets.
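As a hedged illustration of what fine-tuning n_estimators could look like in scikit-learn (this is not code shown in the talk, and it reuses the assumed X_train/y_train arrays from the earlier sketches), a small grid search over the values mentioned above might be:

```python
# Illustration of tuning n_estimators with a grid search
# (assumes the X_train, y_train arrays from the earlier sketches).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {"n_estimators": [100, 250, 500]}     # the values mentioned in the talk
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy",
    n_jobs=-1,          # use all available cores
)
search.fit(X_train, y_train)

print("Best n_estimators:", search.best_params_["n_estimators"])
print("Best CV accuracy: ", round(search.best_score_, 3))
```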
Info
Channel: Girinath Pillai
Views: 2,699
Rating: 5 out of 5
Keywords: machine learning, orange, statistics, qsar, ai, ml, cadd, drug discovery
Id: R7FYypCUasc
Length: 74min 4sec (4444 seconds)
Published: Thu Dec 10 2020