Python for Bioinformatics - Drug Discovery Using Machine Learning and Data Analysis

Captions
If you are looking for a way to apply Python and machine learning to a real-world application, this is the course for you. You will learn all about bioinformatics for drug discovery using Python. Chanin is a professor of bioinformatics, and he knows how to break things down in a way that is simple to understand.

Welcome to the Bioinformatics from Scratch course. My name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. In this course you'll be learning about bioinformatics through the lens of drug discovery. No prior knowledge of bioinformatics or biology is needed, although if you have some it will be helpful. We're starting from the basics, which means we'll go from collecting data sets, to pre-processing the data, to performing exploratory data analysis (EDA), to building machine learning models in order to make predictions and obtain data-driven insights that are useful for drug discovery. You'll also learn how to compare machine learning models and select a suitable one for your use case, and finally we'll deploy the model as a web application. I've created an infographic that summarizes the content of this video, so let's have a look. This is the infographic of the Bioinformatics from Scratch series from my YouTube channel, Data Professor; in this collaboration with freeCodeCamp we're combining the six-part series into one video course. Let's have a quick overview of what you'll be learning. In part 1 you'll learn about target protein search, which means selecting a target protein of interest to focus on. For example, if you want a machine learning model for discovering breast cancer drugs, you might select aromatase as your target protein, or if you're looking into Alzheimer's you might search for acetylcholinesterase, which is what we'll be using in this series. We then collect the data set from the ChEMBL database using its Python library, obtain the bioactivity data, and pre-process it — dropping missing data, dropping duplicates, and labeling compounds according to bioactivity thresholds — in order to obtain a curated data set. In part 2 we perform exploratory data analysis: we first clean the SMILES notation, which represents the chemical structure of each compound in the data set, and export the results as files number four and five, which are available in the GitHub repo linked in the video description; these contain the two-class or three-class labels, which serve as the Y variable. Once the SMILES notation is cleaned, we calculate descriptors in order to perform EDA, using visual plots as well as statistical analysis. In part 3 we calculate additional descriptors, which we use to build machine learning models in the subsequent part, part 4. In part 4 we build a random forest model for making predictions on quantitative data, namely the pIC50, which makes it a regression model.
Finally, we make a scatter plot to see the distribution, or the goodness of fit, between the actual and predicted values. In part 5 we compare several machine learning models and make a performance comparison plot, and in the last part, part 6, we deploy the model as a web application, meaning the user can input a molecule of interest and the web app will make the prediction. We have a lot to cover in this six-part series, but no worries, because I'm going to provide a step-by-step guide from the basics. So without further ado, let's get started.

Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this channel we cover data science concepts and practical tutorials, so if you're into this type of content please consider subscribing. In a previous video I showed you how to apply machine learning to a computational drug discovery project: we downloaded a data set derived from the study of Delaney, which is a collection of compounds along with their molecular solubility values — an important physicochemical property describing to what extent a compound can be solubilized in water. Some of you might be wondering how to collect original data. Let's say you want to create a new data science project for your portfolio, something new and original that has never been done before — then this video is for you, because in biology there is a lot of unknown territory waiting to be researched. In this video I'm going to show you how to retrieve and download biological activity data of compounds from the ChEMBL database, which you can subsequently use to construct machine learning models, technically known as quantitative structure-activity relationship (QSAR) models. The development of such QSAR models holds great value for drug discovery: it allows us to understand the origins of biological activity, and interpreting the model helps us understand how to design a better drug. The data you collect and download by following along with this video will not only let you build your data science portfolio, but may also scratch the surface of developing novel therapeutic agents. So without further ado, let's get started. The first thing you want to do is head over to the Data Professor GitHub, click on the code repository, scroll down and click on python, then click on CDD-ML-Part-1-bioactivity-data, right-click on the Raw link, and save it to your computer. Or, if you'd like to follow along in Google Colab, you're more than welcome to: go to Colab, click File, Open notebook, click the GitHub tab, type in "data professor", and it should be the first file you see — CDD ML Part 1. Click on that and it will open a new notebook for you; I already have it open, so I'll follow the one I have here. The exciting part of this video is that you're going to collect original data.
It's going to be the same data that researchers in the field are collecting and publishing, so today you have the opportunity to contribute to computational drug discovery. The database we're going to use is the ChEMBL database, which comprises more than 2 million compounds compiled from more than 76,000 documents; as of March 25, 2020, the current release is ChEMBL version 26. The first thing to do is install the ChEMBL web resource client using pip. This library allows you to download biological activity data directly from the ChEMBL database. But before we do that, let me show you what the ChEMBL database looks like. Search Google for ChEMBL (C-H-E-M-B-L). Let's say we search for coronavirus, choosing "search for coronavirus in all targets". Targets here refers to the target proteins or target organisms that a drug will act on: biologically, a compound comes into contact with the protein or organism and induces a modulatory activity toward it — either activating it or inhibiting it. This gives us seven targets, and scrolling down we see the target types comprise organisms and single proteins. The single proteins are the SARS coronavirus 3C-like proteinase and the replicase polyprotein 1ab, both from SARS coronavirus 1. As you can see, SARS coronavirus 2 had not yet been deposited in this database, so we'll work with what we have. Heading back to the notebook: the ChEMBL web resource client should already be installed, so let's import the libraries — we import pandas as pd, and from chembl_webresource_client.new_client we import new_client. In this section we search for the target protein, which is essentially the same process as typing coronavirus into the website's search bar, but in code: we assign new_client.target to the target variable, create a variable target_query equal to target.search with the search keyword as the argument, build a target data frame by passing the query to the from_dict function, and display its contents. Running this gives seven results — the same seven targets we saw on the website, with two single proteins and the rest organisms. In this tutorial we'll use a single protein for further investigation. In the next section we select and retrieve the bioactivity data for the SARS coronavirus 3C-like proteinase, which is the fifth entry (it has index number 4). Let's run the cell, and notice that its ChEMBL ID is CHEMBL3927 — the unique identifier of this target.
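A minimal sketch of the target search just described, assuming the client has been installed with pip install chembl_webresource_client:

```python
# Search ChEMBL for targets matching a keyword and pick the fifth hit.
import pandas as pd
from chembl_webresource_client.new_client import new_client

target = new_client.target
target_query = target.search('coronavirus')      # same keyword as the website search
targets = pd.DataFrame.from_dict(target_query)   # seven hits: organisms and single proteins
print(targets[['target_chembl_id', 'pref_name', 'target_type']])

selected_target = targets.target_chembl_id[4]    # fifth entry: SARS coronavirus 3C-like proteinase
print(selected_target)                           # CHEMBL3927
```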
Next we define a variable called activity using new_client.activity, then define a variable res and assign it activity.filter with target_chembl_id equal to the selected target, followed by another filter that keeps only rows whose standard_type column contains IC50. Let's show the contents of the data frame — there are many columns, so I'll show only the first three rows, since the font is rather big and we'd otherwise need to scroll. Find the column I was talking about, standard_type: it contains IC50, so we select only the IC50 entries. Let me show you the unique values in the standard_type column — there are only IC50 values here, so for this particular data set the filter wouldn't matter, because all the values are the same. But other data sets might contain a combination of bioactivity types — IC50, EC50, or percent activity — so defining a particular standard type makes the data set uniform and avoids mixing different bioactivity units. The standard_value is the potency of the drug: the lower the number, the better the potency, and likewise the higher the number, the worse the potency. Ideally we want the standard value to be as low as possible, meaning the inhibitory concentration at 50 percent is low — to elicit 50 percent inhibition of the target protein, you need a lower concentration of the drug. Think of it this way: the number reflects the concentration of drug required, and a lower required concentration is better, because a higher number means you need a larger amount of the drug to produce the same inhibition. Analogously, would you rather take five milliliters of a medication or five liters to produce the same effect? Something to think about. Finally, we write the data frame out to a CSV file called bioactivity_data.csv, with index set to False because we don't want the index numbers in the resulting file. Let's write that out, and let me mount my Google Drive into the notebook: click the link, paste in the authorization code, and it's mounted. I think I may have already run this, because the data folder has already been created — yes, it's right here. But for you, creating a new folder called data in your Colab Notebooks folder for the first time should work. Let's continue.
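A sketch of the bioactivity retrieval and CSV export described above, continuing from the target search:

```python
# Keep only IC50 measurements for the selected target, then write them to CSV.
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

df = pd.DataFrame.from_dict(res)
print(df.standard_type.unique())                # confirms only 'IC50' remains

df.to_csv('bioactivity_data.csv', index=False)  # index=False leaves row numbers out of the file
```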
Now we copy the bioactivity data into the folder, and ls shows the file; let me also add the -l flag so we can see the creation time — April 29th, which is right now. Let's look at the contents of the CSV: list the current working directory again and take a glimpse of the bioactivity data — it's CSV data, as expected. In the next step we handle missing data, if there is any, by dropping compounds with a missing standard_value. Apparently this data set has no missing data, but this code may come in handy for other data sets that do. Next we do some pre-processing of the bioactivity data. For the benefit of building machine learning models, we classify compounds into three categories: active, inactive, or intermediate. An active compound is defined as a drug with an IC50 of less than 1 micromolar, and 1 micromolar equals 1,000 nanomolar — so a drug with an IC50 below 1,000 nM is classified as active, a drug with an IC50 greater than 10,000 nM is classified as inactive, and drugs with values between 1,000 and 10,000 nM are called intermediate. In our own research projects we normally use either the two-class or the three-class scheme. We run this block of code using the conditions just described, iterating over the molecule_chembl_id column. This data set is comprised of many compounds; a compound is a drug, a molecule — a chemical structure that produces a modulatory activity, in other words it exerts some effect on the target protein. It's like when you take medication: you might feel drowsy or thirsty, which are side effects, but the drug acts directly on the target protein to produce the desired biological effect, which ultimately cures your symptoms — and that is why you take the medication in the first place. Each compound is described by a molecule ChEMBL ID, each row represents one compound, and multiple rows could in principle contain the same molecule ChEMBL ID; if that's the case, for simplicity we keep only one of them, because we don't want redundancy in the data set. Before we iterate, df2.molecule_chembl_id shows that the column contains the ChEMBL IDs — the unique identification numbers of each molecule. We first create an empty list called mol_cid, and in each iteration of the for loop we append the molecule ChEMBL ID to it; running that, mol_cid now contains the molecule ChEMBL IDs. We do the same for the canonical SMILES and for the standard value (the IC50), each appended to its own empty list.
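A sketch of the missing-value handling and threshold labeling just described, assuming df is the data frame retrieved from ChEMBL:

```python
# Drop rows with a missing standard_value, then label each compound by its IC50.
df2 = df[df.standard_value.notna()]

bioactivity_class = []
for value in df2.standard_value:
    v = float(value)
    if v >= 10000:
        bioactivity_class.append('inactive')      # IC50 > 10,000 nM
    elif v <= 1000:
        bioactivity_class.append('active')        # IC50 < 1,000 nM (1 micromolar)
    else:
        bioactivity_class.append('intermediate')  # between 1,000 and 10,000 nM
```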
Building the columns this way is only one approach, and it's actually a bit complicated. A simpler alternative is to select the columns directly: define selection as a list containing molecule_chembl_id, canonical_smiles, and standard_value, subset the data frame with df2[selection], and assign the result to df3 — we get a data frame containing exactly the three columns we need. We also want the bioactivity class, so we append it using pd.concat, passing df3 together with the bioactivity class and axis=1. At first this raises an error because we're mixing Series and DataFrame objects with a plain list — the list needs to be made into a pandas Series or DataFrame first — and once we do that, it works and looks the same as before. Since this is the easier way, I'll copy it into the notebook as an alternative method and move it below the original; use either one. Now we create a CSV file of the pre-processed bioactivity data: df3.to_csv with the file name, and index set to False. Let's check with ls — here it is, the pre-processed data — and copy it into Google Drive; we now have both files there, so let me annotate the notebook a bit. Congratulations — you have successfully downloaded biological activity data from the ChEMBL database, and we can now use it for machine learning model building, which I'll cover in a future video, so please stay tuned. In the meantime, you could use this data set for a data science project of your own, or modify the search query at the beginning. Instead of coronavirus you could use another keyword, say aromatase. Aromatase is an enzyme, part of the cytochrome P450 family, that is implicated in breast cancer, and the goal of the drug discovery effort is to find a compound that can inhibit the function of the aromatase enzyme — here is the human aromatase enzyme. Try out different keywords, see what proteins you get, and use the resulting novel data set in your own data science project. The possibilities are endless, and now you have original data to play around with that no one else in the world might have, because you'll each be using different keywords — that's a novelty in itself. If this video was helpful, please give it a thumbs up, and if you haven't yet subscribed, please subscribe to the channel for more awesome content on data science. As always, the best way to learn data science is to do data science — please enjoy the journey.

Today is part 2 of the bioinformatics project series,
where I show you how to apply data science to drug discovery. In the previous video I covered how to collect data directly from the ChEMBL bioactivity database, and in today's video we take a step further by computing molecular descriptors and then performing exploratory data analysis on them. So without further ado, let's get started. First, head over to the Data Professor GitHub, click on the code repository, then scroll down and find python. Before we proceed with part 2, the exploratory data analysis, we'll do a recap of part 1 — not an ordinary recap, but a concise version in which the code has been trimmed down to be more lightweight; I'll show you that in a moment. Scroll down, right-click on the Raw link, choose Save link as, and save it to your computer; you can then follow along locally using Jupyter Notebook. If you want to use Google Colab, we can do that as well: click File, Open notebook, click the GitHub tab, and type in data professor (make sure it is data professor slash code), then click on CDD-ML-Part-1-bioactivity-data-concise. I'm going to use the copy already in my Colab. Besides part 1 we'll also do part 2, which is in the same python folder under code: click on CDD-ML-Part-2-exploratory-data-analysis, right-click the Raw link, choose Save link as, and save it to your computer — or do the same thing inside Colab via File, Open notebook, the GitHub tab, data professor slash code, and find CDD ML Part 2 Exploratory Data Analysis. Since I already have it locally, I'll use that. Let's start with part 1: click Connect and give it some time to load. Here's what I changed in this concise version: first, redundant code cells were deleted, and second, the code cells for saving files to Google Drive were deleted — at the end of the notebook we can simply download a zip file of the curated data from the Files panel on the left-hand side; I'll show you that in just a moment. The notebook is loaded, so install the ChEMBL web resource client by running the first cell. Once it's installed, import the libraries and run the code cell that searches for coronavirus — this is the result, and we select the fifth entry, which has index number 4. A detailed explanation of all of this was given in the previous video, so I'm just going to run the code cells one by one; if you want more detail, please check out part 1. Here is the bioactivity class labeling, combining the data frames, and writing out the output file. Now let's download
the pre-processed file — and there you go. A quick recap: to download this file, hover your mouse over the three-dot menu at the far right of the file name, click on it, and choose Download. We're done with part 1, so let's continue with part 2. Note again that the explanation for all of the part 1 code cells was given in the previous video of the bioinformatics project series. Let's close this notebook, go to part 2, click Connect, and run the code cell that installs conda and RDKit. What RDKit essentially allows you to do is compute molecular descriptors for the compounds in the data set we compiled in part 1. To explain again: in part 1 we downloaded the biological activity data set from the ChEMBL database, comprising the molecule names and the corresponding SMILES notation — information about the chemical structure — which we use in this part to compute the molecular descriptors. The data from part 1 also contains the IC50 values, which we have already binned into the bioactivity classes active, inactive, and intermediate. In this part 2 we select only two bioactivity classes, active and inactive, so that we can easily compare the active and inactive compounds. So without further ado, let's look at the code. Conda and RDKit are now installed, so load up the pandas library, click the Files button on the left-hand panel, choose Upload, and select the bioactivity data we prepared in part 1. It's now uploaded, so we can close the panel and load the CSV file. In the following blocks of code we compute the Lipinski rule-of-five descriptors, or simply Lipinski descriptors. You might be wondering what those are. Christopher Lipinski, a scientist at Pfizer, came up with a set of rules called the rule of five for evaluating the druglikeness of compounds. Druglikeness is based on key pharmacokinetic properties — absorption, distribution, metabolism, and excretion, abbreviated ADME and also known as the pharmacokinetic profile. ADME tells us the relative druglikeness of a compound: whether it can be absorbed into the body, distributed to the proper tissues and organs, metabolized, and eventually excreted. To derive the rule of five, Lipinski collected a set of FDA-approved drugs that are normally administered orally, and from his analysis he observed that the four descriptors he used had cutoff values in multiples of five: molecular weight less than 500 daltons, the octanol-water partition coefficient (LogP) less than 5, hydrogen bond donors fewer than 5, and hydrogen bond acceptors fewer than 10 — as you can see, all of the values are multiples of five.
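A sketch of the Lipinski descriptor calculation described here, assuming RDKit is installed; it mirrors the notebook's custom lipinski() helper in spirit:

```python
# Compute the four rule-of-five descriptors for a list of SMILES strings.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def lipinski(smiles):
    rows = []
    for s in smiles:
        mol = Chem.MolFromSmiles(s)
        rows.append([
            Descriptors.MolWt(mol),        # molecular weight (< 500 Da)
            Descriptors.MolLogP(mol),      # octanol-water partition coefficient (< 5)
            Lipinski.NumHDonors(mol),      # hydrogen bond donors (< 5)
            Lipinski.NumHAcceptors(mol),   # hydrogen bond acceptors (< 10)
        ])
    return pd.DataFrame(rows, columns=['MW', 'LogP', 'NumHDonors', 'NumHAcceptors'])

df_lipinski = lipinski(df.canonical_smiles)
```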
Let's proceed with computing the descriptors: load up the library and run the custom function, which was inspired by the linked code and modified to include the descriptors for this analysis. We now have the Lipinski descriptors in a data frame, obtained by applying the custom function called lipinski, which takes the SMILES notation as input. The SMILES notation contains the chemical information — the exact atomic details of the molecule — and the function uses it to compute the molecular descriptors. Looking at the data frame, we see the four descriptors covered earlier: molecular weight, which tells us about size; LogP, which tells us about solubility; and the relative numbers of hydrogen bond donors and acceptors. There are a total of 133 rows and 4 columns. Recall that the data frame read directly from the curated file from part 1 is the df data frame; we combine df and df_lipinski because we want the standard_value and bioactivity_class columns alongside the descriptors. We use the pd.concat function to combine the two data frames into the df_combined variable, then look at the new data frame: the four descriptor columns have been appended to df, and the dimensions are correct — 133 rows, with the number of columns expanded to 8. Now we convert the standard_value, which is the IC50, to the pIC50 scale. The reason for this transformation — essentially a negative logarithmic transformation of the IC50 value — is that the original IC50 values have a very uneven distribution, and applying the negative logarithm makes the distribution more even. Let me give you a challenge: compare how the distribution of the original IC50 looks versus the pIC50 after the transformation, and let me know in the comments after you've tried it. A hint: a simple scatter plot will do. Tell me whether you see any difference between the IC50 and pIC50 distributions. One point worth noting before we run the transformation: the IC50 values in the standard_value column include some very large numbers, and a large number would become a negative value after taking the negative logarithm. To prevent that, we cap the maximum value at 100 million so that the resulting pIC50 is never less than 1.0; otherwise we would get negative values, which would make interpretation more difficult. We do the capping with a custom function called norm_value, which reads through the individual values in the standard_value column and, if a value is greater than 100 million, caps it at 100 million.
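A sketch of the capping and pIC50 conversion just described; standard_value is in nM, and capping at 1e8 nM (0.1 M) means the negative log10 never drops below 1.0:

```python
import numpy as np

def norm_value(input_df):
    # cap standard_value at 100 million nM
    capped = [min(float(v), 1e8) for v in input_df['standard_value']]
    out = input_df.copy()
    out['standard_value_norm'] = capped
    return out.drop('standard_value', axis=1)

def pIC50(input_df):
    # convert nM to M (multiply by 1e-9), then take the negative log10
    values = [-np.log10(float(v) * 1e-9) for v in input_df['standard_value_norm']]
    out = input_df.copy()
    out['pIC50'] = values
    return out.drop('standard_value_norm', axis=1)

df_norm = norm_value(df_combined)
df_final = pIC50(df_norm)
print(df_final.pIC50.describe())   # max around 7.3, min 1.0 in the video
```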
Let's apply norm_value and describe the values again: the maximum is now 1 × 10⁸, i.e. 100 million, whereas previously it was much larger. Then we apply the pIC50 function to the normalized data frame and call the result df_final. Notice that we now have a new column called pIC50, and the original standard_value column has been deleted — it has been converted into pIC50, the negative logarithmic form of the IC50. Describing the data frame, the maximum value is now 7.3 and the minimum is 1.0. Next, to allow a simple comparison between the two bioactivity classes, we delete the intermediate class and call the new data frame df_2class; we now have 119 rows by 8 columns. Now let's perform exploratory data analysis using the Lipinski descriptors. In cheminformatics and drug discovery this kind of EDA is called chemical space analysis, because it lets us look at the chemical space — which is kind of like a chemical universe. As Jose Medina-Franco proposed, each chemical compound can be thought of as a star, and the active molecules as a constellation; he developed an approach he termed the constellation plot, a chemical space analysis in which the active molecules are drawn with a larger size than the less active ones. We'll apply a similar concept in our plots here, as I'll show you in the next few moments. First we import the seaborn library and matplotlib.pyplot as plt, then create a simple frequency plot of the two bioactivity classes. Using this block of code we make a frequency plot comparing the inactive and active molecules, and we also save it as a PDF file. The x and y labels are set with two lines of code, and the frequency plot itself uses the countplot function with the x variable set to bioactivity_class — there's no need to define the y variable, because the y axis is simply the frequency. The edge color is black, meaning the bars get a black outline. Being able to save the plot as a PDF lets you use the resulting files in your reports, publications, and projects, and as I mentioned in part 1, these notebooks are crafted from the actual research protocol we use in our own research group. Next we make a scatter plot of molecular weight versus LogP, the solubility of the molecules — see the sketch below and the details that follow.
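A sketch of the frequency plot above and the scatter plot detailed next, assuming df_2class holds only the active and inactive compounds:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# frequency plot of the two bioactivity classes
plt.figure(figsize=(5.5, 5.5))
sns.countplot(x='bioactivity_class', data=df_2class, edgecolor='black')
plt.xlabel('Bioactivity class', fontsize=14, fontweight='bold')
plt.ylabel('Frequency', fontsize=14, fontweight='bold')
plt.savefig('plot_bioactivity_class.pdf')

# chemical space: MW versus LogP, colored by class and sized by pIC50
plt.figure(figsize=(5.5, 5.5))
sns.scatterplot(x='MW', y='LogP', data=df_2class, hue='bioactivity_class',
                size='pIC50', edgecolor='black', alpha=0.7)
plt.xlabel('MW', fontsize=14, fontweight='bold')
plt.ylabel('LogP', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)  # legend outside the plot
plt.savefig('plot_MW_vs_LogP.pdf', bbox_inches='tight')
```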
We start by setting the figure size to 5.5 by 5.5 and use the scatterplot function with the x variable set to MW (molecular weight), the y variable set to LogP, and the data set to df_2class. The hue refers to the color, which is defined by the bioactivity class; since there are two classes, the colors are blue and orange, where blue is the inactive molecules and orange the active ones. The size of the data points follows the pIC50 values, the edge color of the circles is black, and the alpha transparency is 0.7. The x and y labels are custom (MW and LogP) with a font size of 14 and bold font weight, and one line places the figure legend outside the plot — otherwise it would be embedded inside, which makes it very difficult to read. Finally we save it to a PDF file; let's run this block of code. Now let's do the same for the pIC50 values — the same concept applies, just changing the variable names. Here we see the distributions of the inactive and active classes, which is to be expected because we used thresholds to define active and inactive: if the pIC50 value is greater than 6 a compound is active, and if it is less than 5 it is inactive. You can see that the distribution of the inactives is rather broad compared to that of the actives, which sits between 6 and 7, whereas the inactives span from 1 to 5. Next we perform the Mann-Whitney U test to examine the difference between the two bioactivity classes, testing whether the difference between active and inactive is statistically significant. The code for performing the Mann-Whitney U test was adapted from machinelearningmastery.com and made into a function. Let's run it and apply the mannwhitney function to pIC50: it compares the active and inactive classes for a statistically significant difference in the pIC50 variable. Based on this analysis the p-value is very low, so we reject the null hypothesis and conclude that the active and inactive classes have different distributions.
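A sketch of the Mann-Whitney U helper just described (adapted, as in the video, from a machinelearningmastery.com example); passing the data frame as an argument, rather than reading a global, is a small liberty taken here:

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def mannwhitney(descriptor, df, alpha=0.05):
    active = df[df.bioactivity_class == 'active'][descriptor]
    inactive = df[df.bioactivity_class == 'inactive'][descriptor]
    stat, p = mannwhitneyu(active, inactive)
    verdict = ('Different distribution (reject H0)' if p <= alpha
               else 'Same distribution (fail to reject H0)')
    return pd.DataFrame({'Descriptor': [descriptor], 'Statistics': [stat],
                         'p': [p], 'alpha': [alpha], 'Interpretation': [verdict]})

print(mannwhitney('pIC50', df_2class))   # low p-value: the classes differ
```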
We then apply the same box plots and Mann-Whitney tests to the other four Lipinski descriptors, breezing through a box plot and a Mann-Whitney test for each in turn. Note that all of the Mann-Whitney results and box plots are saved as files — each Mann-Whitney test gets its own CSV file and each box plot its own PDF — and we can download all of them at the end to use in your own project and research. Before we do that, let's interpret the results. Starting with the pIC50 values: the actives and inactives show a statistically significant difference, which is to be expected since the threshold values were already defined at 6 and 5. Of the four Lipinski descriptors, only LogP exhibited no difference between the actives and inactives, while the other three — molecular weight and the numbers of hydrogen bond donors and acceptors — show statistically significant differences between the two classes. Let's continue. Finally, we zip up all of the CSV and PDF files generated in this notebook — all of the Mann-Whitney tests and box plots — so we can conveniently download them. Zip up the files, click the Files button on the left panel, hover over the three-dot menu, click it, and choose Download. The archive downloads to your computer, and in it you'll find the plots we generated in the notebook and the resulting Mann-Whitney U test results as CSV files. If you found value in this video, please give it a thumbs up, and if you haven't yet subscribed, please subscribe to the channel. As always, the best way to learn data science is to do data science — please enjoy the journey.

Welcome back. If you're new here, my name is Chanin Nantasenamat, I'm an associate professor of bioinformatics, and this is the Data Professor YouTube channel. In this video I'm continuing with part 3 of the bioinformatics project series, where I go through how you can implement a bioinformatics project from scratch. A short recap: in part 1 I showed you how to retrieve bioactivity data directly from the ChEMBL database, followed by quick data pre-processing, and in part 2 I showed you how to calculate the Lipinski descriptors and perform exploratory data analysis. This video is part 3, where I show you how to calculate molecular descriptors and prepare the data set we'll use in the next part, part 4, where we do some model building. So without further ado, let's get started. First, head over to the Data Professor GitHub, click on the code repository, scroll down and click on python, and notice that I've created three additional files prefixed with acetylcholinesterase — the name of the target protein that our research group has previously published on. The great thing about this target protein is the abundance of bioactivity data, which makes it a great starting point for model building. Essentially, I changed the name of the target protein in part 1, ran all of the code cells, did the same with part 2 using the output from part 1, and finally exported the files from part 2 for use in part 3, which is today. The exported data from parts 1 and 2 are provided in the data directory on the Data Professor GitHub — you'll notice six additional files named acetylcholinesterase_01 through _06.
Let's download the part 3 acetylcholinesterase notebook: click on it, right-click the Raw link, and save it to your computer. Because I already have it in my Google Colab, I'm going to use that one. Let's begin. In this part 3 we calculate the molecular descriptors and prepare the data set that will be used in the next part, part 4, where we perform model building. We need to download the PaDEL-Descriptor software, which is provided on the Data Professor GitHub; I'll also provide links to the developers' original website and to the original research paper. The padel.zip file is downloaded along with padel.sh, a shell script containing the instructions for running the PaDEL calculation, because we'll use PaDEL to calculate the molecular descriptors. We unzip the folder, then download the acetylcholinesterase file containing the pIC50 values along with the three bioactivity classes. We import pandas as pd, read in the CSV file, and assign it to the df3 variable — let's have a look at this data frame. We select only the canonical_smiles and molecule_chembl_id columns, put those names in a selection variable, subset the data with df3[selection], assign the result to df3_selection, and save it as molecule.smi. Let's look at the file using bash: it contains the SMILES notation and the name of each molecule. A quick recap — the SMILES notation represents the chemical information that defines the chemical structure: C represents a carbon atom, O oxygen, N nitrogen, and so on. Next we count how many molecules we have: 4,695 lines, which matches the number of rows, confirming that all rows made it into the molecule.smi file. We then run the descriptor calculation with bash padel.sh. Maybe you're wondering what's inside the padel.sh file, so let's have a look. It uses Java with one gigabyte of memory, and because there's no display on Google Colab we set the java.awt.headless option to true; we specify the -jar option because we're running the PaDEL-Descriptor.jar file. One option removes salts — for example the sodium and chloride ions in a chemical structure — and the program automatically removes all salts and small organic acids from the structures. If that sounded like gibberish: it essentially means we're cleaning the chemical structures so there are no impurities. There are further options governing how the structures are cleaned, and another option tells the program to compute molecular fingerprints, with the fingerprint type set to the PubChem fingerprint. Finally, it outputs the descriptors to a file called descriptors_output.csv.
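A sketch of the PaDEL input preparation and launch just described; the exact input file name is an assumption here:

```python
import subprocess
import pandas as pd

df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')  # assumed name

selection = ['canonical_smiles', 'molecule_chembl_id']
df3_selection = df3[selection]
# one "SMILES<TAB>name" pair per line, with no header or index
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

# run the shell script that wraps the PaDEL-Descriptor.jar call
subprocess.run(['bash', 'padel.sh'], check=True)
```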
Let's run it. Some molecules take 5 seconds to process, some 0.6, some as little as 0.3 — and since we have 4,695 molecules it would take about 18 to 19 minutes to complete. So rather than wait, let's stop this and directly download the pre-computed file, descriptors_output.csv: head over to the Data Professor GitHub, click on the data repository, scroll down, find descriptors_output.csv, click on it, right-click the Download button, choose Save link as, and save it to your computer — I'm saving it to the desktop, changing the Save as type to All Files and the extension from .txt to .csv. Going back to the notebook, the calculation has barely reached 200 molecules, so I stop it, check the partially generated output file, and delete it, because I'm going to upload the completed version instead: click Upload, go to the desktop, and select descriptors_output.csv. It's uploading — listing the files shows the file size increasing, which is a good sign; it's a fairly big file, about 8.3 megabytes when finished. We then read the descriptors output into df3_X. Here we prepare the X and Y data matrices; the X data matrix comprises the molecular descriptors, which are the PubChem fingerprints. We delete the first column, Name, because we want only the molecular features: drop it using the drop function with the name of the column to be dropped, confirm that the Name column is gone, and reassign the result back to df3_X. Now we create the Y data matrix by taking the pIC50 column directly from the df3 data frame — the initially loaded data frame — and assigning it to df3_Y. Then we combine X and Y together; this is optional, but for portability we combine them and output the result to a CSV file, which we'll upload to GitHub and use in part 4. We output this dataset3 data frame to a CSV file with a fairly long name — the purpose of the long name is to reduce confusion and let us see at a glance what the file is for. The first segment is the name of the target protein, acetylcholinesterase; 06 is just the sequential order number; bioactivity_data_3class_pIC50 tells us it contains the bioactivity data with the three categorical classes (active, inactive, and intermediate) along with the pIC50 values; and the last segment, pubchem_fp, signifies that it contains the PubChem fingerprints. This becomes handy when we have more than one fingerprint type, and PaDEL can compute more than 10 different fingerprint types.
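A sketch of the X/Y matrix preparation described above; the output file name follows the naming scheme just explained:

```python
import pandas as pd

df3_X = pd.read_csv('descriptors_output.csv').drop(columns=['Name'])  # fingerprints only
df3_Y = df3['pIC50']                                                  # target variable

dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3.to_csv(
    'acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv',
    index=False)
```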
So here's some homework: try computing other fingerprint types — play around with the options, see what molecular fingerprints are available, and rename the output file accordingly. Let's check whether we have already written out the file — it's right here — then run the code cell to write it out and save it to our computer. Opening it up, we see the PubChem fingerprints, with the pIC50 in the last column. We'll use this file for model building in the next episode, so please stay tuned. Support this channel by smashing the like button, subscribe if you haven't yet done so, and click the notification bell to be notified of the next video. If you've come this far in the video, give yourself a big clap and comment down below that you watched until the end — big kudos to you. As always, the best way to learn data science is to do data science — please enjoy the journey.

Welcome back to part 4 of the Bioinformatics from Scratch series, where I show you how to do a bioinformatics project using machine learning in a step-by-step manner. In this video we build a simple regression model based on the random forest algorithm, and the data set we're using is the acetylcholinesterase inhibitors derived from the previous tutorial videos. So without further ado, let's get started. First, head over to the Data Professor GitHub, click on the code link, click on python, and find CDD-ML-Part-4. If you haven't gone through the previous three episodes, please go through them via the playlist linked up and below. Click on part 4, right-click the Raw link, choose Save link as, and save it to your computer. Let's connect to Google Colab. For those of you new here, you can open the notebook directly: click Open notebook, click GitHub, type in data professor, find CDD-ML-Part-4, and click on it. I'm going to use the one I have here, so let's begin. The first block of code imports the necessary libraries — simply run it — and then we load in the data set prepared in the prior videos. The data set is based on the PubChem fingerprints and contains the bioactivity data for the acetylcholinesterase inhibitors. One of you asked a great question on the previous video: in part 3 we prepared PubChem fingerprints, and in part 2 we prepared Lipinski descriptors — what's the difference between the two? Firstly, the Lipinski descriptors are a set of simple molecular descriptors giving a quick overview of the drug-like properties of a molecule. Historically, Christopher Lipinski identified four descriptors in his research that relate to drug-like properties: he analyzed a set of orally active drugs and came up with the rule of five, whereby compounds that pass the rule make good oral drug candidates.
The PubChem fingerprints, which we use for model building today, describe the local features of a molecule, whereas the Lipinski descriptors describe its global features — particularly the molecular size, the solubility, and the numbers of hydrogen bond donors and acceptors (the propensity to accept and donate hydrogen bonds). By local features I mean that each molecule is described by its unique building blocks. Think of molecules as Lego structures: each molecule is comprised of several Lego building blocks, and the way the blocks are connected creates the unique properties of the drug. That is the essence of drug discovery and drug design — the connectivity of the building blocks gives rise to the unique structure and unique molecular properties of the molecule, so we have to find a way to arrange the blocks such that the molecule has the highest potency toward the target protein it should interact with, while also being safe and not toxic, because a toxic molecule means side effects. We have already downloaded the data set, so let's look at the input features. The PubChem fingerprint has 881 input features. Think of them, as the name implies, as fingerprints: each molecule gets a unique fingerprint, just as each of us humans has one, and these unique molecular fingerprints allow the machine learning algorithm to learn the unique properties of each compound and create a model that can distinguish active compounds from inactive ones. That is the goal of our model building — to see which functional groups, or fingerprints, are essential for designing a good, potent drug. The target variable we're predicting is the pIC50, which is the negative logarithm of the IC50 value — the inhibition concentration at 50 percent. Let's look further at section 3.1, the input features — and let me increase the font size, as it might be a bit too small for those of you watching on mobile phones. The input features are built with X = df.drop('pIC50', axis=1): the df data frame holds the downloaded data set, comprised of the fingerprints and the pIC50 values, and we drop the pIC50 column to create the X variable matrix, because pIC50 will be used as the Y variable. Upon dropping it we have only the PubChem fingerprints, which we call X; for Y we use df.pIC50 — see the sketch below.
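A sketch of the part 4 pipeline described around here: remove low-variance fingerprint bits, split 80/20, and fit a seeded random forest regressor. The variance threshold value is a common choice for near-constant binary features and is an assumption here, as is seeding via random_state (the video seeds NumPy's global RNG instead):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv')
X = df.drop('pIC50', axis=1)   # 881 PubChem fingerprint bits
Y = df.pIC50

X = VarianceThreshold(threshold=0.8 * (1 - 0.8)).fit_transform(X)  # 881 -> ~137 features

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=100)

model = RandomForestRegressor(n_estimators=100, random_state=100)  # fixed seed, reproducible
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))   # R² around 0.5 in the video
Y_pred = model.predict(X_test)       # for the experimental-vs-predicted scatter plot
```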
Let's run these blocks of code — oh, I have to run the top one first — then run X, then run Y. With X and Y loaded, we look at the shape of the data: we have 4,695 rows (compounds) and 881 PubChem fingerprints. Next we remove the low-variance features, and we're left with 137 of the original 881 fingerprints — variables with low variance are discarded. Then we split the data in an 80/20 fashion and look at the data dimensions again. Now let's build a simple regression model using random forest, with the number of estimators set to 100; upon building the model we get an R² of about 0.50. Because we did not set the seed number, the score varies from run to run, owing to the random features the algorithm picks when building the model. So let's set the seed: import numpy as np, set the seed, and build the model — 0.512; run it again — 0.512. You can see that if we don't set the seed it is randomized and we get different results, but with the seed set to 100 we get the same result every time. Let's make the predictions, and in this block of code we make a scatter plot of the experimental versus predicted pIC50 values — and there you go, a scatter plot of experimental against predicted. If you're finding value in this video, please give it a thumbs up, subscribe if you haven't yet done so, and hit the notification bell to be notified of the next video. As always, the best way to learn data science is to do data science — please enjoy the journey.

Welcome back to part 5 of the bioinformatics project from scratch series, where I show you how to build your own computational drug discovery model using machine learning algorithms. In today's episode I show you how to compare several machine learning algorithms for building regression models of the acetylcholinesterase inhibitors. We'll use a lazy and efficient way of building several machine learning models that I showed in a recent video: the lazypredict Python library. Before proceeding further, let's do a quick recap. In part 1 I showed you how to collect an original biological data set that you can use in your own data science projects — particularly, how to download and pre-process biological activity data from the ChEMBL database, where the data set comprises compounds and molecules that have been biologically tested for their activity toward a target organism or protein of interest. In part 2 I showed you how to calculate the Lipinski descriptors, which are used for evaluating the likelihood of a molecule being drug-like, and how to perform some basic exploratory data analysis on them — simple box plots and scatter plots visualizing the differences between the active and inactive subsets of the compounds. In part 3 I changed the target protein to acetylcholinesterase, as it provides a larger data set to work with.
All right, so if you're finding value in this video please give it a thumbs up, subscribe if you haven't yet done so, and hit the notification bell in order to be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey.

Welcome back to part five of the bioinformatics project from scratch series, where I show you how you could build your own computational drug discovery model using machine learning algorithms. In today's episode I will be showing you how you could compare several machine learning algorithms for building regression models of the acetylcholinesterase inhibitors, and we're going to be using a lazy and efficient way of building several machine learning models, shown in a recent video, via the lazypredict Python library.

Before proceeding further, let's do a quick recap. In part one I showed you how you could collect an original biological data set to use in your own data science project; particularly, I demonstrated how you could download and pre-process bioactivity data from the ChEMBL database, where the data set is comprised of compounds and molecules that have been biologically tested for their activity toward the target organism or protein of interest. In part two I showed you how you could calculate the Lipinski descriptors, which are used for evaluating the likelihood of a compound being a drug-like molecule, and how to perform some basic exploratory data analysis on these descriptors, particularly simple box plots and scatter plots that visualize the differences between the active and inactive subsets of the compounds. In part three I made a change to the target protein and switched to acetylcholinesterase, as it provides a larger data set to work with; in that part we computed the molecular descriptors using the PaDEL-Descriptor software and prepared the data set comprising the X and Y data frames, which we then used in part four to build a regression model with the random forest algorithm. And now, on to today's episode; let's get started.

Here we're going to be comparing several machine learning algorithms using the lazypredict library, so the first thing you need to do is install lazypredict; in a prior video I showed how you could use it for quick and rapid building of classification and regression models in just a few lines of code. So let's start by installing the library. With it installed, we import the necessary libraries: pandas, seaborn, and scikit-learn, specifically the train_test_split function, and then we import lazypredict along with its LazyRegressor function. Now we load up the data set, downloading it directly from the Data Professor GitHub (the link is here) with wget, read the file in, and assign it to the df data frame; then we split it into the X and Y variables. Let's take a look at the dimension of the X variable: it has a total of 4,695 rows, the number of compounds in the data set, and a total of 881 descriptors, the features or number of columns. The first thing we do is remove the low-variance features, and looking at the dimension of the data set again we have a reduced subset of 137 variables from the original 881. Then we perform a data split using the 80/20 ratio.

Now comes the fun part: as you can see here, we're going to be building more than 20 machine learning models using only two lines of code. The first line, like any other scikit-learn workflow, assigns the machine learning algorithm to a regressor variable, and the second assigns the results of the fitted models' predictions to the train and test variables, which will contain the performance of each model. So let's build the models; lazypredict fits 39 models, 39 machine learning algorithms, so this might take some time because the data is relatively big at almost 5,000 rows. It should be noted that the model building uses default parameters for all 39 algorithms, so if you want to perform hyperparameter optimization, that will be a topic for another video. And so the models have been built. Let's have a look at the training set results: LGBM is the best model here, slightly ahead of the random forest that we used for model building in our prior tutorials. Let's have a look at the test set: the LGBMRegressor comes first and random forest is in third place, but they're roughly the same, 0.57 and 0.56.
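A minimal sketch of the lazypredict workflow just described; the LazyRegressor call follows the library's documented API, while the data-loading details repeat the earlier steps (placeholder file name, and the low-variance filtering from the earlier sketch would precede the split):

```python
import pandas as pd
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split

# Same fingerprint + pIC50 data set as before (placeholder file name)
df = pd.read_csv('acetylcholinesterase_pubchem_fp_pIC50.csv')
X = df.drop('pIC50', axis=1)
y = df['pIC50']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Two lines to fit 39 regressors with default parameters
reg = LazyRegressor(verbose=0, ignore_warnings=True)
models_train, predictions_train = reg.fit(X_train, X_test, y_train, y_test)

print(models_train)  # leaderboard with R-Squared, RMSE, and Time Taken
```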
Let's have a look at the data visualization of the model performance: a bar plot of the R-squared values is provided here, we have a look at the RMSE values here, and we also have a look at the calculation time provided here; the longer the bar, the longer it took to build the model. And so congratulations, we have already built several machine learning models for comparison.

In prior videos of the bioinformatics from scratch series you have learned how to compile your very own bioactivity data set directly from the ChEMBL database, how to perform exploratory data analysis on the computed Lipinski descriptors, how to build a random forest model, and also how to build several machine learning models for comparing model performance using the lazypredict library. In this video we will be taking a look at how we can take that machine learning model of the bioactivity data set and convert it into a web application that you could deploy on the cloud, allowing users to make predictions with your machine learning model for the target protein of your interest. And so, without further ado, we're starting right now.

Okay, so the first thing that you want to do is go to the bioactivity prediction app folder, which is provided in the GitHub link in the video description. Before we start, let me show you how the app looks. I'm going to activate my conda environment, and please make sure to activate your own conda environment as well; on my computer I'm using the dataprofessor environment, so I activate it by typing conda activate dataprofessor. I then go to the Desktop, because that is where the Streamlit folder resides, and then into the bioactivity folder. Let's have a look at the contents: app.py is the application, so we type streamlit run app.py in order to launch this bioactivity prediction app.

And so this is the bioactivity prediction app that I'm going to show you how to build today. Let's have a look at the example input file. In order to proceed with using this app, we have to upload the file, by drag and drop right here or by browsing for the input file, and while waiting for an input file to be uploaded you can see that the blue box gives us a waiting message saying 'Upload input data in the sidebar to start'. Essentially the input file contains the SMILES notation and the ChEMBL ID. You can think of the ChEMBL ID as kind of like the name of the molecule; in particular, it is the unique identification number that the ChEMBL database has assigned to that molecule. The SMILES notation is a one-dimensional representation of the chemical structure, and it will be used by the PaDEL-Descriptor software that we're using here in the app in order to generate the molecular fingerprints, which describe the unique chemical features of the molecule; those molecular fingerprints will then be used by the machine learning model to make a prediction. The prediction is the pIC50 value that you see here, which is the bioactivity against the target protein of interest; in this application the target protein is acetylcholinesterase, a drug target for Alzheimer's disease. This app is built in Python using the Streamlit library, and the molecular fingerprints are calculated using PaDEL-Descriptor.
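To make the launch flow concrete, here is a minimal Streamlit skeleton of the waiting state described above, run with streamlit run app.py; the title text is abridged and the details are a sketch, not the full app:

```python
# app.py -- minimal skeleton of the app's waiting state
import streamlit as st

st.title('Bioactivity Prediction App (Acetylcholinesterase)')

# Sidebar uploader for the SMILES + ChEMBL ID text file
uploaded_file = st.sidebar.file_uploader('Upload your input file', type=['txt'])

if uploaded_file is None:
    # The blue waiting box shown before any input is uploaded
    st.info('Upload input data in the sidebar to start!')
```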
Back in 2016 we published a paper describing the development of a QSAR model for predicting the bioactivity of acetylcholinesterase inhibitors, so if you're interested in this article please feel free to read it; I'm going to provide the link in the video description as well.

Okay, so let's drag and drop the input file, the example acetylcholinesterase file, and then, in order to initiate the prediction, I press the Predict button. As you see here, the input file gives you this data frame, then the app calculates the descriptors, and the calculated descriptors are provided here in this data frame: there are a total of five input molecules and 882 columns. The first column is the ChEMBL ID, so in reality you have a total of 881 molecular fingerprints, and the molecular fingerprints we're using today are the PubChem fingerprints. Because we have previously built a machine learning model, which I will show you using the Jupyter notebook file, we had reduced the number of descriptors from 881 to 217 — no, actually 218, because we have already deleted the first column, the ChEMBL ID column — so we have reduced from 881 columns to 218 columns. In the code we're going to select the same 218 columns that you see here, which correspond to the descriptor subset of the initial full set of 881, and we're going to use these 218 as the X variables in order to predict the pIC50. Finally we have the prediction output in the last data frame, with the corresponding ChEMBL IDs, and we can also download the predictions by pressing this link, which provides them here in a CSV file.

All right, so let's get started, shall we? We first have to build our prediction model using the Jupyter notebook and then save the model as a pickle file right here. Let me show you; it will take just a moment. I open up a new terminal and activate the conda environment, conda activate dataprofessor (it's the same environment), and then type jupyter notebook. There you go; I open up the Jupyter notebook, and here we go. This was actually adapted from one of the prior tutorials in this bioinformatics from scratch series. Essentially, we just download the calculated fingerprints from the Data Professor GitHub using this URL link: we import pandas as pd, then download and read the file in using pandas, and the resulting data frame looks like this. You can see that the last column is pIC50 and that we have 881 columns for the PubChem fingerprints. In the next cell we drop the last column, the pIC50 column, in order to assign the result to the X variable, and then we select just the last column, denoted here by -1, and assign it to the Y variable. Now that we have X and Y separated, we next remove the low-variance features from the X variable: initially we have 881 columns, and applying a variance threshold of 0.1 results in 218 columns. We then save these into a descriptor_list.csv file; let me show you that descriptor_list.csv file.
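A minimal sketch of this part of the notebook; the file name is a placeholder, and saving only the header row is an assumption about how descriptor_list.csv is produced:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Pre-computed PubChem fingerprints + pIC50 (placeholder file name)
df = pd.read_csv('acetylcholinesterase_pubchem_fp_pIC50.csv')

X = df.drop(df.columns[-1], axis=1)  # everything except the last column
y = df.iloc[:, -1]                   # last column (-1) is pIC50

# Remove low-variance fingerprints: 881 -> 218 columns at threshold 0.1
selector = VarianceThreshold(threshold=0.1)
X = X.loc[:, selector.fit(X).get_support()]

# Save the retained fingerprint names as the first row of descriptor_list.csv
X.head(0).to_csv('descriptor_list.csv', index=False)
```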
You can see that the first row contains the names of the fingerprints that are retained, in other words the names of the descriptors of the 218 columns. Here we can see that PubChem fingerprints 0, 1, and 2 have been removed while fingerprint 3 remains, fingerprints 4 through 11 have been removed, fingerprint 14 has been removed, and fingerprint 17 has also been removed; in total, more than 600 fingerprints have been deleted from the X variable, and this removal of excessively redundant features will allow us to build the model much quicker. In just a few moments I will show you how we make use of this descriptor list in order to select the subset from the computed descriptors that we obtain from the input query right here: out of the SMILES notation we generate 881 columns, and then we select a subset of 218 from the initial 881 by using this particular list of descriptors.

Okay, let's go back to the Jupyter notebook. Let's save it, and then we're going to build the model — the random forest model — setting the random state here to 42 and the number of estimators to 500, and using the RandomForestRegressor. We fit the model in order to train it, then calculate the score, which is the R² score, and assign it to the r2 variable; finally we apply the trained model to make a prediction on the X variable, which is also the training set, and assign the result to the y_pred variable. We see here that the R-squared value is 0.86, and printing out the performance we get a mean squared error of 0.34. Then we make the scatter plot of the actual versus predicted values, and we get this plot here. Finally, we save the model by dumping it with the pickle function, pickle.dump, with the model as input argument, saving it as acetylcholinesterase_model.pkl. And there you go, we have already saved the model.
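A minimal sketch of the training and serialization steps just described, continuing from the X and y of the notebook sketch above:

```python
import pickle
from sklearn.ensemble import RandomForestRegressor

# Settings from the video: 500 trees, random_state fixed at 42
model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X, y)

r2 = model.score(X, y)      # R² on the training set, ~0.86 in the video
y_pred = model.predict(X)   # predictions on the training set

# Serialize the trained model for the Streamlit app to load later
pickle.dump(model, open('acetylcholinesterase_model.pkl', 'wb'))
```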
Okay, so I'm going to go ahead and close the Jupyter notebook, head back over, and take a look at the app.py file. Let's have a brief look: you can see that app.py is less than 90 lines of code, about 87 to be exact, and there are some white spaces, so if we deleted all of the white space it might be even less, maybe 80 lines of code. The first seven lines of code import the necessary libraries: we're making use of Streamlit as the web framework; we're using pandas in order to display the data frames; the Image function from the PIL library is used to display the illustration; the descriptor calculation is made possible by the subprocess library, which allows us to compute the PaDEL descriptors via Java; we're using the os library to perform file handling, and here you can see that we use os.remove to remove the molecule.smi file, which I'm going to explain in just a moment; base64 is used for encoding and decoding when we make the prediction results available as a file for download; and the pickle library is used for loading the pickled file of the model.

You're also going to see that we define three custom functions. In lines 10 through 15, the first custom function is our molecular descriptor calculator: we define a function called desc_calc, and the statement inside it is the bash command, the same command we would normally type into the command line. The option here allows us to run the code on the command line without launching the GUI version of PaDEL-Descriptor; without this option the GUI version would launch, and since we don't want that to happen, we use it. We use the jar file to perform the calculation of the fingerprints, and you can see additional options here such as removing salts and standardizing the nitro groups of the molecules; we set the fingerprint type to the PubChem fingerprint using the XML file here, and finally we generate the molecular descriptor file by saving it to the descriptors_output.csv file. This bash command serves as the input right here to the subprocess.Popen function, and after the descriptors have been calculated we remove the molecule.smi file; the molecule.smi file is generated in another function, which I will discuss in just a moment.

The second custom function that we create here is filedownload: after making the prediction, we encode the results with base64 so that the output is available as a file for downloading via this link. The third function we create is called build_model. It accepts an input argument, the input data, and loads the pickle file of the built model into a load_model variable; the loaded model is then used for making a prediction on the input data specified here, and after the prediction has been made we assign it to the prediction variable. Then we print out the header called 'Prediction output', which is right here, and underneath it we create a variable called prediction_output as a pd.Series, essentially a single column in pandas, holding the prediction and named pIC50, which is here; then we create another variable called molecule_name, whose column is the ChEMBL ID, or molecule name, which is right here in the first column. We then combine these two columns, given by the individual variables prediction_output and molecule_name, using the pd.concat function: in the bracket we pass molecule_name as the first column and prediction_output as the second, and we use axis=1 in order to tell it to combine the two variables, or columns, in a side-by-side manner; axis=1 gives us the two columns side by side, otherwise, with axis=0, the pIC50 column would be stacked underneath the molecule_name column. Finally we write out the data frame, which is here, and let it generate the download link, which is right here, making use of the filedownload function described earlier.

Then on line 38 we display this image of the web app, and lines 43 until 51 or 52 give the header here — the bioactivity prediction app title — along with the description of the app and the credits of the app, written in Markdown.
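Here is a sketch of the three custom functions as they are described above; the exact PaDEL flags, file paths, and the way the molecule names are passed in are assumptions consistent with the narration, not a verbatim copy of the app:

```python
import os
import base64
import pickle
import subprocess
import pandas as pd
import streamlit as st

def desc_calc():
    # Run PaDEL-Descriptor headlessly (no GUI) via Java: remove salts,
    # standardize nitro groups, compute PubChem fingerprints, and write
    # the result to descriptors_output.csv
    bash_command = (
        "java -Xms2G -Xmx2G -Djava.awt.headless=true "
        "-jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar "
        "-removesalt -standardizenitro -fingerprints "
        "-descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml "
        "-dir ./ -file descriptors_output.csv"
    )
    process = subprocess.Popen(bash_command.split(), stdout=subprocess.PIPE)
    process.communicate()
    os.remove('molecule.smi')  # clean up the input file written by the app

def filedownload(df):
    # Base64-encode the prediction table so it can be offered as a link
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode()).decode()
    return f'<a href="data:file/csv;base64,{b64}" download="prediction.csv">Download Predictions</a>'

def build_model(input_data, molecule_names):
    # Load the pickled random forest and predict pIC50 for the input descriptors
    load_model = pickle.load(open('acetylcholinesterase_model.pkl', 'rb'))
    prediction = load_model.predict(input_data)
    st.header('**Prediction output**')
    prediction_output = pd.Series(prediction, name='pIC50')
    molecule_name = pd.Series(molecule_names, name='molecule_name')
    df = pd.concat([molecule_name, prediction_output], axis=1)  # side by side
    st.write(df)
    st.markdown(filedownload(df), unsafe_allow_html=True)
```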
All right, so let's have a look further. Lines 55 until 59 display the sidebar right here: line 55 displays the header, '1. Upload your CSV data', and then we create a variable called uploaded_file using the st.sidebar.file_uploader function; as input arguments we display the text 'Upload your input file', which is also right here, and set the type of the file to txt, so right here. Then we create a link, using Markdown, to the example file provided here, the example acetylcholinesterase file, which is the exact same file that we selected as input; so that's the sidebar that you see here.

And so let's have a look further. Here you can see that from line 61 until 87 we have the if/else condition: if we click on the Predict button, which is right here, created using the st.sidebar.button function with the input argument 'Predict', then the app performs the descriptor calculation, applies the machine learning model to make a prediction, and finally displays the results of the prediction right here and allows the user to download the predictions. However, if we didn't click anything — for instance when we load up the web page from the beginning, as I will show you right now — you will see a blue box displaying the message 'Upload input data in the sidebar to start'. So there are two conditions: if the Predict button is clicked it makes a prediction; otherwise it just displays the text here saying that it is waiting for you to upload the input data.

Okay, so let's have a look under the if condition. Upon clicking the Predict button, as you may have guessed, the app loads the data that you have just dragged and dropped and saves it as a molecule.smi file, and this very same file, molecule.smi, is used by the desc_calc function that we discussed earlier; in particular, the molecule.smi file is used by the PaDEL-Descriptor software for the molecular descriptor calculation, and after the descriptors have been calculated we assign them to the X variable, right here, which I'll come back to in just a moment. Line 65 prints out the header right here, so let me make a prediction first so that we can see it: let's drag and drop the input file and press the Predict button; there it is, 'Original input data', on line 65.
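A sketch of the sidebar and the Predict branch up to this point; the space-separated two-column input format is an assumption based on the example file described above:

```python
# Sidebar: upload widget (a Markdown link to the example file would follow)
st.sidebar.header('1. Upload your CSV data')
uploaded_file = st.sidebar.file_uploader('Upload your input file', type=['txt'])

if st.sidebar.button('Predict'):
    # Save the uploaded SMILES + ChEMBL ID pairs for PaDEL to consume
    load_data = pd.read_table(uploaded_file, sep=' ', header=None)
    load_data.to_csv('molecule.smi', sep='\t', header=False, index=False)

    st.header('**Original input data**')
    st.write(load_data)

    with st.spinner('Calculating descriptors...'):
        desc_calc()
    # reading, subsetting, and build_model(...) follow -- see the next sketch
else:
    st.info('Upload input data in the sidebar to start!')
```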
Line 66 prints out the data frame of the input file, so you see two columns here: the SMILES notation, which represents the chemical structure information, and the ChEMBL ID column. Line 68 displays a spinner, so upon loading up these results by pressing the Predict button, you saw earlier that there was a yellow message box saying 'Calculating descriptors'; underneath it we have the desc_calc function, and after the calculation it displays the following content. The calculated molecular descriptors follow on line 72, 'Calculated molecular descriptors' right here: we read in the calculated descriptors from the descriptors_output.csv file, assign them to the desc variable, write them out right here, showing the data frame of the descriptors that have been calculated, and then print out the shape of the descriptors; we see here that it has five rows, or five molecules, along with the 881 molecular fingerprints. Then lines 78 until 82 take the subset of descriptors used by the previously built model, read from the descriptor_list.csv file: you can see that we create a variable called Xlist and read in the column names, then from the initial 881 descriptors we select the subset provided in Xlist, assign that subset of 218 descriptors, selected from the initial set of 881, to the desc_subset variable, and finally print it out as a data frame and print out its dimension as well; we see here that there are five molecules and 218 columns, or 218 fingerprints. Finally, we make use of this calculated molecular descriptor subset as an input argument to the build_model function, which, as I mentioned earlier, loads the model, makes the prediction, and displays the model prediction results right here so users can download them to their own computer; a sketch of this final stretch of the Predict branch is shown just below.

Thank you for watching until the end of this video, and if you enjoy bioinformatics tutorials then you might also want to check out my YouTube channel, where I have several other bioinformatics tutorials and content in which I show you how you could use Python or R to make sense of biological data sets. And I like to end my videos by saying: the best way to learn data science is to do data science, and please enjoy the journey.
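For completeness, here is the sketch of the descriptor-subset step referenced above, continuing inside the Predict branch of the previous sketch; the 'Name' column holding the ChEMBL IDs in PaDEL's output is an assumption:

```python
    # Read PaDEL's output and show it (5 molecules x fingerprint columns)
    desc = pd.read_csv('descriptors_output.csv')
    st.header('**Calculated molecular descriptors**')
    st.write(desc)
    st.write(desc.shape)

    # Select the same 218 fingerprints the model was trained on
    Xlist = list(pd.read_csv('descriptor_list.csv').columns)
    desc_subset = desc[Xlist]
    st.header('**Subset of descriptors from previously built model**')
    st.write(desc_subset)
    st.write(desc_subset.shape)  # (5, 218)

    # Predict pIC50 and render the downloadable prediction table
    build_model(desc_subset, desc['Name'])
```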
Info
Channel: freeCodeCamp.org
Views: 178,267
Rating: 4.9786735 out of 5
Id: jBlTQjcKuaY
Length: 102min 54sec (6174 seconds)
Published: Wed Jun 02 2021