How to Build Bioinformatics Tools

Captions
In this video I'm sharing a plenary lecture that I gave at the International e-Workshop on Machine Learning Applications in Drug Discovery: Basic to Advanced, held from the 17th to the 31st of May 2021 and organized by the Department of Biotechnology at Vignan University. My plenary lecture was on May 17 and the practical tutorial workshop is on May 18. In the plenary lecture I talked about how I develop bioinformatics tools, and in the practical tutorial workshop I talk about how you can develop your own bioinformatics tool using Python. The tool will be developed using the Streamlit library, scikit-learn, and pandas: it will take in the SMILES notation of a molecule of your interest, and scikit-learn will be used to make a prediction. We will also need the RDKit library, which will be used for molecular descriptor calculation, and finally the Streamlit library will be used to tie all of the components I've mentioned together into the form of a web application (a minimal sketch of this stack appears at the end of this section). You can then also deploy this web application to the internet. I'm going to provide the timeline in the video description, so feel free to hop to different parts of the video. So let's get started.

The presentation slides are provided at this link: bit.ly/dataprofessor-bioinfo-talk. Today we're going to talk about the development of bioinformatics tools. In this one-hour lecture I'm going to give a high-level overview of what we're doing at our university, in our research center, and then tomorrow we're going to have a practical session showing you how to develop a bioinformatics tool using the Python programming language, similar to what I normally do on my YouTube channel.

A little bit about myself: I'm currently at the Center of Data Mining at the Faculty of Medical Technology, Mahidol University. Mahidol University is one of the oldest universities in Thailand and has consistently been among the top-ranking universities in the biomedical sciences. As some of you may know, I also run the YouTube channel called Data Professor, where I have tutorials and concept videos about data science and, occasionally, bioinformatics — the interface of applying data science to biological data. I'm also a part-time blogger on a platform called Medium, where I normally publish in Towards Data Science; a typical article of mine is a tutorial showing how to analyze data using Python or R, or how to develop bioinformatics applications.

I really like a quote from an article published in the PLOS Biology journal titled "All biology is computational biology." Whether you are an experimental biologist or a computational biologist, fundamentally all of you could be considered computational biologists, because at the end of the day what you're doing is generating data. You're performing experiments — essentially you're testing different parameters. In computer science you set parameters for your machine learning algorithms: the learning rate, the momentum, the number of epochs you want to use. Similarly, in biology you try out different buffers, different pH values, so essentially you're tuning parameters.
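Before going on, here is a minimal sketch of the app stack described above — Streamlit for the interface, RDKit for descriptors, and a scikit-learn model for prediction. The feature set and the model file name here are illustrative assumptions, not the actual app built in the practical session:

    # Minimal sketch of the stack described above; names are illustrative.
    import pickle
    import pandas as pd
    import streamlit as st
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    st.title("Bioactivity predictor (sketch)")
    smiles = st.text_input("SMILES notation", "CCO")

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        st.error("Invalid SMILES")
    else:
        # Describe the molecule numerically with two RDKit descriptors
        X = pd.DataFrame([{"MolLogP": Descriptors.MolLogP(mol),
                           "MolWt": Descriptors.MolWt(mol)}])
        with open("model.pkl", "rb") as f:   # hypothetical pre-trained model
            model = pickle.load(f)
        st.write("Predicted value:", model.predict(X)[0])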
Then you perform the experiment, you collect the data, and you use the collected data for analysis. Nowadays it's very difficult to rely only on statistical analysis, so there is a growing shift toward using more and more machine learning algorithms to make sense of the data. The quote says: "I will argue that computational thinking and computational methods are so central to the quest of understanding life that today all biology is computational biology."

Our quest for understanding life has been ongoing ever since the dawn of humankind. We seek to understand the world around us — think of the expeditions to discover the different parts of the world — and nowadays that quest is extending toward the galaxy: we want to discover what other galaxies and planets there are, through expeditions from NASA and Elon Musk's SpaceX company.

In biology, in the life sciences, we are generating exponential amounts of data. This data is called omics data, and it comes from genomics, proteomics, glycomics, lipidomics, metabolomics, and also interactomics — you're essentially adding the suffix "-omics" at the end. You're collecting data on the genes, the proteins, the sugars, the lipids, the metabolites (the breakdown products of small molecules), and also the interactions among the molecules. All of this information collectively is known as omics data.

I have summarized this in the form of an infographic. In biology or biochemistry 101 — think back to high school — you probably learned about the four macromolecules of life: proteins, lipids, and nucleic acids (where you have DNA and RNA), and, oh, you also learned about the carbohydrates, or sugars, which you now know as glycomics. Aside from that, to mediate a lot of the biological processes we also need metabolites, and the important thing is how the metabolites, proteins, nucleic acids, and lipids interact with one another — that is an essential area. Collectively, all of this is known as omics data. Therefore you can see that biology is data-intensive: a lot of data is being generated, and it is inevitable that you will have to use more sophisticated computational approaches to analyze and handle it.

There has also been a paradigm shift. Before, we had this conception that we could develop a drug and hope it would be a silver bullet that cures all disease. That's not the truth: in recent years, via the advance of precision medicine, we've come to understand that in order to specifically target a particular disease we need to target a particular protein. We also know that drugs have side effects, and the reason is that a drug may not be specific toward a single target protein. For example, if you develop a drug that you think will target protein A, but in reality it also has off-target binding to protein B, then that binding to protein B can cause a side effect.
In Thailand there's a professor, Dr. Trairak Pisitkun from Chulalongkorn University, who developed a sophisticated bioinformatics approach in combination with experimental work using neoantigens, and with it was able to develop cancer immunotherapy. In his research group, bioinformatics is used hand-in-hand with experimentation.

You can see that omics data are rather large, complex, and heterogeneous. There's a term called the curse of dimensionality: the data are often characterized by thousands or tens of thousands of variables, which renders the statistical approaches traditionally used for analysis inadequate — there needs to be a better way to handle all of this massive data. And as you might know, data can come in many forms: not only tabular data but also images, sound, video, and real-time streaming data.

So what is bioinformatics? You're probably going to find out more in the series of lectures and workshops throughout this two-week international event, but let me kick this off by telling you that bioinformatics is a field that applies statistics, information theory, and machine learning to make sense of biological data. It helps you understand the molecular basis of how disease occurs — for example, how mutated genes work. As an example, let's say you want to identify which gene is responsible for a disease. How can you do that? One way would be to compare gene frequency via a proteomic analysis: you extract the proteins expressed in two different populations — one with the disease and one without — and then you simply compare the frequencies, and from the proteomic analysis you can see which genes are more prevalent in one group versus the other. This is only a simple example.

You can see that bioinformatics lies at the interface of biology and computer science: in biology a lot of data is being generated, and from computer science we apply methods to make sense of that data — a good symbiosis between biology and computational science. As I mentioned, bioinformatics is an area where you apply computers to make sense of big biological data, which helps you understand the intricate, complex biological mechanisms governing many biological phenomena. As you saw in the earlier infographic, the post-genomic era has brought about various omics such as the genome, proteome, metabolome, microbiome, metagenome, and interactome — and I'm sure there are several others.

Here are some of the common tasks you could encounter in bioinformatics. For those of you coming from computational science, you can look for tasks where you could apply computational science or machine learning; for those coming from biology, these are tasks where you might utilize computational tools to help you analyze, plan, design, and discover novel genes and molecules that could help you understand biological mechanisms. A common task for a bioinformatics researcher or graduate student would be to search public databases.
For example GenBank, PubChem, the MINT database, and UniProt — there are several databases for you to search for pertinent information on genes, proteins, small molecules, and also pathways, like the KEGG database. After searching, you want to compare. A comparison could be, for example, performing a simple sequence alignment to see the similarity and difference between genes, proteins, RNA, and also small molecules (see the alignment sketch at the end of this section). Aside from that, you would also like to visualize the structure, if one is available from X-ray crystallography — you can retrieve it from databases like the Protein Data Bank. And if there is no available experimental structure, you can use computational approaches to predict the protein structure or the molecular structure of the entity of your interest. You can also apply machine learning and deep learning to build predictive models that make predictions and help you understand the contribution of the various features in the data. Then there is integration and curation — in machine learning terms, in the field of data science, this is data curation. A lot of people find this part of data science a bit tedious, because data cleaning and data curation take up about 80% of your time: to produce high-quality data you have to remove or handle missing data, clean the data, and subset it to be relevant to your particular analysis. I have summarized the aforementioned explanation in a one-page infographic: search, compare, model, and integrate.

There are two terminologies — computational biology and bioinformatics — and some might wonder what the difference is. Here's my interpretation. In my own understanding, computational biology focuses more on the utilization of computational tools to understand biology, whereas in bioinformatics the emphasis is on the development of algorithms and tools that bioinformatics researchers, computational biologists, or biologists can use to analyze biological data and solve biological problems. So to speak, bioinformatics focuses on tool development, and the developed tools are used by biologists to help them understand biology. If you think of it as a loop: bioinformatics focuses on tool development, and biologists use the developed tools in order to understand biology. But I'm sure these two terminologies are somewhat loosely defined, and you will see the two terms used interchangeably.

As mentioned, "bioinformatics tool" is used broadly to refer to databases (for example GenBank, UniProt, the Protein Data Bank), software (there are several, like PyMOL), and also web servers, which our research group is also developing and publishing. In the practical session I'm going to show you how you can develop your own bioinformatics web application. It only requires minimal knowledge of Python, it's fairly easy to learn, and as you will see, it's not that difficult. And if you're able to develop a bioinformatics web application, it will greatly boost the impact of your research paper or research project.
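As a concrete example of the "compare" task mentioned above, here is a small pairwise sequence alignment sketch. It assumes the Biopython library (pip install biopython), which the talk does not prescribe — any alignment tool would do:

    # Toy global alignment of two short protein sequences with Biopython
    from Bio import Align

    aligner = Align.PairwiseAligner()
    aligner.mode = "global"   # Needleman-Wunsch-style alignment

    seq1 = "MKTAYIAKQR"       # made-up example sequences
    seq2 = "MKTAYIAKQK"

    alignments = aligner.align(seq1, seq2)
    print("Alignment score:", alignments.score)
    print(alignments[0])      # best-scoring alignment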
For example, if you publish a model, the model would just sit in the paper, right? But if you make it into a web application, your model becomes accessible to millions of people around the world who may be interested in using it to make predictions. By deploying your model you're making a bigger impact, because other scientists can use your model's predictive ability. I believe that if you convert your model into a usable form — a web application — you will greatly boost the impact of your work, and we're going to find out how tomorrow in the practical session.

You might know that bioinformatics tools are either commercially available — for example from companies like Schrödinger or OpenEye, which develop a lot of great bioinformatics tools — or freely available. So what's the difference between the two?

Here is a comparison between commercially available tools and free tools. For commercial tools there is the cost: you pay a subscription (monthly or annually), or a perpetual license, meaning you pay a one-time fee and keep that particular version — but if you want an upgrade in the future, you pay again. For the free or academic version, some are free and open source, but some require you to apply for a waiver; for example, a tool might be free only for academic institutes, so if you're from a university you may be able to use it for free, but if you're in an industry setting, the company has to pay.

Let's look at features. For commercial tools there will be periodic additions of features — you can expect new features to roll out annually or quarterly. For free tools it really depends on the funding of the research group developing the tool, and on the team working on it, because they are often working part-time: they're not paid to work on the tool, so they can work on it only when they're free. So free tools may be less reliable in terms of periodic feature updates. As for support, a commercial company has better support because you're paying for it, whereas for academic software most of the support comes from the community — for example, you can ask on Stack Overflow or look to community blog posts for help — so support is not guaranteed, and that's one thing to consider. For ease of use, commercial vendors try to make their tools as intuitive and easy to use as possible, while free tools might have some bugs; whenever you encounter a bug, you can notify the developers and they can correct it. It really depends on which one is suitable for you, but in either case both provide solid bioinformatics tools for use in your research.

Now let's compare bioinformatics versus data science. You're going to see that they're quite similar; the difference is the domain knowledge. In bioinformatics the domain knowledge is of course biology, but in data science the domain knowledge could be anything else: finance, other areas of business, and several other domains you could apply it to — economics, for example.
So here are some of the questions you might be wondering about: why do we need computational models in drug discovery? Consider some examples: IBM's Deep Blue was able to defeat the world chess champion (and IBM's Watson won at Jeopardy), Google released a self-driving car, NASA uses computers to simulate space missions, computers are used to simulate and design aircraft and cars, and supermarkets and shopping malls use them to analyze customer spending behavior. So the question is: why not use computers to discover, design, and develop new drugs?

For example, in quantitative structure-activity relationship (QSAR) modeling, my research field of interest, you try to understand the relationship between chemical structure and biological activity. You normally use machine learning to find the correlation between structure and activity, because if you understand the structure of the compound and of the protein, you can make predictions about the activity of the compound and the protein. However, a lot of experimental data is needed to derive the biological data set used for model building, and making new measurements can be time-consuming: it takes time to synthesize a compound, to acquire the target protein and express it from a living organism, and to perform the biological assay. But if you build a prediction model on existing data, you can bypass the synthesis of the compound and the bioassay experiment, and make a prediction in only a few minutes.

Here is an example of the development of an encoder-decoder system to generate new molecules with the computer. Essentially, you're training the computer to understand chemical structure: you encode the information, the computer learns, and then it can create a new molecule by decoding. This was published in 2018, and the full bibliographic information should already be available, so if you google the author and the journal you'll find the paper the image was taken from.

This is another example, from one of our own research works: computational models can be used to quickly predict the pharmacokinetics and biological activity of any compound of interest — again, the field known as quantitative structure-activity relationship, or QSAR. Such models can also be applied toward personalized medicine, because every human being has slight mutations in target proteins, and we never know which particular alteration in a protein you have. For a drug developed for the general population, if your protein carries a slight mutation, the effect the drug has on you might be slightly different than on others, and that can cause side effects.

In one of our research papers we observed the influence of functional groups in a chemical structure on biological activity, and we found that some small alterations in the chemical structure can give rise to drastic changes in
biological activity. Seemingly, two structures might look quite similar, but if you add or replace some of the functional groups — shown here in green, red, or pink — the chemical structure is altered, and the resulting activity of the compound changes as well. That is the essence of performing QSAR, and on my YouTube channel I provide several tutorials on how to develop such QSAR models using machine learning.

Some specific questions that can be answered by computational models are as follows. Let's say you're wondering which target proteins your compound could bind to. Or: what type of compound can bind to and modulate the activity of the target protein of your interest — modulating meaning you could activate the target protein or inhibit it. For example, when you take a drug, the drug molecule binds a target protein in your body — so which target protein does it bind to? Another question, highly relevant in this day and age: are there compounds similar to my query compound that could potentially have similar binding behavior? This is the context of drug repositioning. In drug repositioning, the concept is: if you have known drugs for a particular protein, and that protein is similar in structure to another protein, then theoretically, applying that logic, compounds that bind protein A could also bind target protein B, because proteins A and B look quite similar — what works for A could also work for B. If you can find this linkage in your computational analysis, you can develop new drugs by simply applying an existing drug to treat a new disease.

There is also the field called data mining, where the focus is on applying machine learning algorithms to build models on retrospective data — the biological data you have collected. The objective is to produce knowledge, and knowledge can be compared to a gold nugget if you compare data mining to gold mining: the gold nugget is the knowledge you discover from your analysis. You go from raw, unstructured data to more structured data in order to uncover patterns, gain knowledge from those patterns, and apply the knowledge in applications that help you achieve wisdom. So you can see the hierarchy of going from data to knowledge to wisdom, and machine learning can help you make those hierarchical transitions.

As I mentioned, in QSAR modeling the focus is on finding the relationship between chemical structure and biological activity using machine learning, because the relationship you find allows you to understand which features are responsible for the biological activity. This particular workflow is a five-step process that was published about 12 years ago in one of my early review articles, where I discussed the concept of QSAR modeling. You start from a chemical structure and describe it in numerical terms — you're essentially converting the chemical structure into tabular numerical data, and these numerical data are the molecular descriptors.
Descriptors describe the features of the molecule. You then perform data cleaning and preprocessing, and finally you use the curated data to build a model with a machine learning algorithm. The models can be used to make predictions, and to understand your features by performing feature or model interpretation. And finally, as I've already talked about, you can develop bioinformatics tools that other people can use — because the model you develop will be alive, living on the internet if you deploy it there, and other users will have access to it. You're making a big impact by deploying your model as a bioinformatics tool.

In one of the Medium articles I've written, I provided an infographic giving an overview of QSAR — essentially, applying machine learning to drug discovery. In this infographic, molecule one and molecule two are described in terms of molecular fingerprints of ones and zeros: a value of one means the molecule has a particular feature, and a value of zero means it does not. This tabulated data frame is then assigned a Y variable indicating whether the compound is active (has biological activity) or inactive. We use this tabular data set to build a prediction model with machine learning, and the essence is that you're correlating X and Y: in this example there are 16 X variables, and based on those 16 X variables we predict Y — the equation is Y as a function of X. We can then make a prediction for a new molecule, molecule three: if you describe it by its molecular features or descriptors, you can predict its Y. So users can make a prediction: given this molecule, the model predicts molecule three to have a value of one, which is active. The model can also be used to understand the contribution of each molecular feature to the biological activity (a toy code sketch follows below). More detail is in the blog post I wrote and published in Towards Data Science.
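To make the fingerprint example above concrete, here is a small sketch: 16 binary X variables per molecule and an active/inactive Y, with a random forest standing in for whatever learner you prefer. The data are randomly generated purely for illustration:

    # QSAR-style classification on 16-bit fingerprints (random toy data)
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(50, 16))   # 50 molecules, 16 binary features
    y = rng.integers(0, 2, size=50)         # 1 = active, 0 = inactive

    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    molecule3 = rng.integers(0, 2, size=(1, 16))   # a new query molecule
    print("Predicted class:", model.predict(molecule3)[0])
    print("Feature importances:", model.feature_importances_)

The feature importances give the contribution of each fingerprint bit to the predicted activity, which is the model-interpretation step mentioned above.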
There is also the terminology "proteochemometrics," which extends the concept of QSAR from the prior example. By extending it, I mean that aside from the molecular data set we also integrate protein information into the model; therefore, in a proteochemometric model you have chemical data and protein data inside the same model. For a typical QSAR model you would have a chemical library data set — visualized in tabular form, each row representing a molecule — so you have one data frame of molecules. Now imagine adding another data frame: you have two data frames, one for proteins and one for compounds (small molecules). That is proteochemometrics: integrating chemicals and proteins in the same model. And when they're in the same model, what can you do? You can perform drug repositioning: you can understand the similarities between proteins A and B, as mentioned already, and if compounds are able to bind to protein A, they may also be able to bind to protein B, given that A and B are quite similar — though if B is different from A, it won't work. This is a very powerful approach for understanding, and also for repositioning, existing drugs. You can think of it as teaching an old drug a new trick: you apply an existing drug as a new therapeutic treatment for a new disease. In practical terms, you might have developed drug A to treat disease A, but you never tested whether it could bind to protein B, because in the original experiments the sole emphasis was on targeting protein A — perhaps protein B had not even been discovered yet. Then, say, ten years after the drug's release, protein B is discovered and found to look like protein A, and so, in retrospect, you can use computers to identify drugs that are already FDA-approved and essentially reuse them — adding value to existing drugs so they can treat new diseases. On proteochemometric modeling we have worked together with the pioneer of the field: the scientist who coined the term proteochemometrics, Professor Jarl Wikberg, with whom we have several research publications. I can provide the links in the video description as well.
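Here is a minimal sketch of the proteochemometric idea — one row per compound-protein pair, with the compound descriptor block and the protein descriptor block concatenated side by side. All descriptor names and values are made up for illustration:

    # Proteochemometrics: chemical and protein data in the same model
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    compound_block = pd.DataFrame({"MolWt":   [180.2, 300.4, 250.3],
                                   "MolLogP": [1.2, 3.5, 2.1]})
    protein_block  = pd.DataFrame({"helix_frac": [0.40, 0.20, 0.35],
                                   "sheet_frac": [0.10, 0.30, 0.15]})

    X = pd.concat([compound_block, protein_block], axis=1)  # one joint table
    y = [1, 0, 1]   # 1 = pair binds, 0 = pair does not bind

    model = LogisticRegression().fit(X, y)
    print(model.predict(X))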
Back in 2015 I created a workflow visualizing the various biological resources that are available. The different colors in the middle represent the different hierarchical levels of systems in biology. At the holistic level we have the systems-based view: intricate biochemical pathways; databases that tell you which proteins are implicated in diabetes or cardiovascular disease; databases that tell you which protein binds to which protein, like the KEGG pathway resource; databases that tell you which protein binds to which ligand, like PubChem and the Protein Data Bank; databases containing information on protein structures (again, the Protein Data Bank); databases with information on compounds (PubChem); and resources with chemical fragments — several tools, for example from OpenEye, provide chemical fragments. The question is: how do you tie all of these resources together to develop your own bioinformatics project? There is so much data available — how do you piece the jigsaw together to complete the full picture? There are so many possibilities. In an article I wrote in collaboration with my former PhD advisor, we developed a high-level view intended to enhance the success rate of your bioinformatics project; the title was "Maximizing computational tools for successful drug discovery," and I can provide the link in the video description as well. The question is how you make use of the available databases, how you apply the experimental approaches, and how you perform the biological or computational analyses — essentially, you're trying to find a connection between the available databases, the available experimental assays, and the computational tools you should use at each particular level of the hierarchy, and how you can use all of that to craft your own bioinformatics project.

In a typical QSAR study you would use data from small-molecule databases, which come from medicinal chemistry: scientists have performed, for example, high-throughput screening or synthesis to make new molecules, and the newly generated molecules and databases are used on the ligand-based side. You can then apply cheminformatics approaches: you can calculate molecular descriptors describing the molecules, and you can perform Lipinski rule-of-five filtering (a small sketch of this follows at the end of this section) in order to perform your QSAR analysis — and perhaps use the results to drive the synthesis of new compounds as well. More information is provided in the article.

A typical overview of the procedures involved in developing a QSAR model is summarized in a table, and it's essentially the same as the typical data science life cycle. You start from data collection: say you have access to a database containing a great deal of data, and you have to narrow down which particular data subset you want to analyze. For example, out of the data available on the whole proteome, let's say you're interested in the aromatase protein, which is involved in breast cancer: out of some 30,000 proteins you select a single one, aromatase. Then you go to the ChEMBL database and download the available data sets on the biological activity of compounds against the aromatase protein. After that you preprocess and clean the data, build a prediction model using machine learning, evaluate whether the performance is acceptable, and perform model interpretation to make sense of the data. So this is a typical data science life cycle, and you can apply it to QSAR model development too. More information is provided in a review article published in 2010 in Expert Opinion on Drug Discovery.

I think I can skip ahead here: these are just a list of databases that provide access to chemical structure information, and these are some of the molecular descriptor software packages you can use to convert a molecule into numerical form — the descriptors can be quantitative or qualitative. A popular package that we like is called PaDEL-Descriptor; it's actually not on this list because it was published after 2010, and it uses the CDK library underneath. You can see the use of PaDEL in some of my YouTube videos as well. There is also a list of computational chemistry software: these apply computational chemistry — quantum mechanics — to understand molecular structure, for example the energies of the molecular orbitals and the electrostatic properties of the molecule.
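As promised above, here is a small sketch of Lipinski rule-of-five filtering using RDKit, with aspirin as the example molecule:

    # Rule of five: MW <= 500, LogP <= 5, H-bond donors <= 5, acceptors <= 10
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski

    mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")   # aspirin

    passes = (Descriptors.MolWt(mol) <= 500
              and Descriptors.MolLogP(mol) <= 5
              and Lipinski.NumHDonors(mol) <= 5
              and Lipinski.NumHAcceptors(mol) <= 10)
    print("Passes Lipinski's rule of five:", passes)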
This is the research team who helped develop all of the tools shown here. We published a book chapter with Springer called AutoWeka: we developed a wrapper, in Python, around the Weka program and made it automatic. This was published back in 2008 or 2009 or so, before the term existed — it was part of the early days of AutoML, automated machine learning. Back then, this software let you develop neural network models and support vector machines by just loading in the data and clicking the start button; you wait, perhaps a day, and the model has been trained with its parameters optimized. We developed this more than ten years ago and published about it.

The first web application our research group developed is called osFP, built to predict the oligomeric states of fluorescent proteins — proteins can exist as monomers, dimers, trimers, tetramers, or other oligomeric forms, meaning proteins can have intramolecular or intermolecular interactions with one another. Based on an input sequence in FASTA format, the web server predicts whether the protein will be monomeric or oligomeric (a toy sketch of turning a FASTA sequence into model features appears at the end of this section). If you're interested, you can search for it on Google; we published it in the Journal of Cheminformatics. We also have several others mentioned here, like HemoPred and CryoProtect, developed with our co-worker Dr. Watshara, whose team is involved in the development of other software as well. We also performed an analysis of metabolic syndrome to understand the predisposing factors that give rise to diabetes, based on health data sets collected at the Faculty of Medical Technology, which we used to build a prediction model.

As you can see, we have published several models and papers, and our research group aims to be reproducible, meaning that for all of the models we publish, we also share the data and the code. This was before I started my YouTube channel, and we shared all data and all code in the hope that everyone who reads a paper can reproduce the published work. Back in the day we found it troublesome to reinvent the wheel: you read a paper and want to reproduce it, but you have to decipher the materials and methods section, which might take a couple of weeks. But what if you share the entire script, the entire code, and the accompanying data? Then anyone who reads your paper can go to your GitHub, download the entire folder, maybe change some parameters, change the input data, or even update the data — so the model stays relevant with the newest data. You're adding value and impact to your model by sharing it, and when you share it, others are more likely to use your data and code, which also increases the likelihood that they cite your research article. That can be another strategy for increasing the impact of your research work. We have also published several invited book chapters (with Academic Press from Elsevier and with Springer) and several invited review articles — do check these out; you can google them if you're interested in these articles and book chapters.
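As promised, here is a toy sketch of turning a protein sequence into simple numerical features (amino acid composition) — the kind of input a sequence-based predictor like osFP might start from. This is purely illustrative, not the actual osFP implementation:

    # Amino acid composition features from a protein sequence
    sequence = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDAT"  # toy GFP-like fragment

    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    composition = {aa: sequence.count(aa) / len(sequence) for aa in amino_acids}
    print(composition)   # 20 fractional features, ready for a classifier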
Back in January 2016 we co-hosted the First International Conference on Pharmaceutical Bioinformatics in Pattaya, Thailand. The conference had more than 200 participants from more than 30 countries, and several eminent scientists attended: the director of the ChEMBL initiative from the EBI; Professor Gleeson, who is currently in Thailand and runs a biomedical and pharmaceutical engineering research group; Professor Bender from the University of Cambridge, who has extensive research in proteochemometrics, chemogenomics, and QSAR modeling; and a professor from Kyoto, Japan, the originator of the KEGG database. Here is myself, and here is Professor Jarl Wikberg, who founded the field of proteochemometrics, as I have mentioned. This was the brochure, or poster, of the conference we hosted back in 2016.

Let me summarize the content of this presentation. The question is: why develop our own bioinformatics tools? You might know that several thousand bioinformatics tools already exist. Perhaps you're thinking that all possible tools must already have been developed — true or false? You might also think that existing bioinformatics tools will be available forever — true or false? Another issue is that existing tools may lack certain features that we need in our own project. So what do we do? Do we proceed without the feature and hope that someday someone will develop it, or do we take the matter into our own hands and develop the tool ourselves? If you see a gap in the field, you can develop a tool that addresses that gap. From my own personal experience: back when I was a bioinformatics tool user, we relied on the existing tools, and we found that there were gaps — some bioinformatics features or tools were simply not available. We hoped some company would develop them, but sometimes we waited for years and no one did. And then we realized: why don't we develop our own? If you have knowledge of R, you can use Shiny to build your bioinformatics tools; if you're using Python, you can use Streamlit, which is quite easy to use for deploying your machine learning model. In tomorrow's practical session I'm going to show you how to develop your own tool using Python. I've also shown how to do this using R, and using Python, on my YouTube channel — so if you're interested in more of that content, check out the Data Professor channel, but also tune in to tomorrow's practical session, where I'll show you how to develop a predictive-model web application that you can use to make predictions on biological activity. Stay tuned for that.

The thing is, when you want to develop your own bioinformatics tool, which path will you take: hire a programmer to develop it for you, or learn how to program and create it yourself? Personally, I learned it myself, and I created it myself.
That journey eventually helped me create this YouTube channel, and creating the channel has given me the opportunity to keep learning about the latest tools: subscribers make good suggestions, requesting "can you make a video about this?", and often I find tools I've never used before. So I always get good suggestions from subscribers, and I'm learning all the time.

Here is my roadmap for developing your own bioinformatics tools. Step one: come up with the concept of your tool — what do you want to develop? Step two: make a wish list of features — compile a list of what you want the tool to do. Step three: create a logical, high-level overview of how the data connect with the methods that will transform the input into the output, and what steps are required. Step four: connect the dots and implement the features and the logical workflow you designed. Step five: actually code it. In step four you can see that it's essentially a collection of tasks, but the logical order of the tasks also matters: you want to create custom functions, each performing a particular feature in a modular way, and then connect the functions together so that the entire workflow runs from input, through processing, to output. If you deploy this workflow on the internet using a framework like Streamlit or Shiny, you create a web application through which other people can access your model, and so you create more impact for your research work.

If you're interested in learning more about data science or bioinformatics, please check out my YouTube channel — I release one to two videos every week — and I also write blog posts on Medium. For example, I've shared how you can learn data science in 10 steps, and how you can build your own regression model; I try to combine infographics with step-by-step, actionable tutorials you can follow along with. It's not as difficult as you might think. If you're a biologist: I am also a biologist — my undergraduate degree is in biology, or biomedical science — and I picked up Python programming only after I finished my bachelor's degree. So if I could do it, anyone can. Follow along with my journey, connect with me on the various social platforms, and I'll be happy to entertain any questions you may have. Thank you.

[Moderator] If any of the participants have questions regarding the presentation, please feel free to unmute yourself or drop them in the chat box.

[Participant] Hello Professor, sir. — [Speaker] Hello, yes. — [Participant] So you have developed a particular model called ThalPred, correct? — [Speaker] Right, that was with my collaborator, another professor. — [Participant] It's very interesting, because in the local population in India we also have alpha thalassemia.
[Participant] It would be good to have such models distributed so that, from biomarkers, we can easily predict the level of thalassemia, and beyond that do some prognosis and therapeutics. — [Speaker] Oh, that's awesome. Actually it wasn't me who developed it; I'm only a co-author of the article. Another professor at my faculty is the corresponding author, and if you're interested I could connect you with her — email me; maybe Dr. Abraham could share my email with you. — [Participant] Right. In one of your workshops, are you also demonstrating a QSAR analysis? — [Speaker] In the workshop I will be showing how to develop a bioactivity prediction web application, so it's going to be like QSAR: you have activity data of compounds against a particular protein, and we're going to predict that. — [Participant] The last aspect: I am quite unhappy that, although G.N. Ramachandran developed the Ramachandran plot in India — any student could submit a sequence of amino acids and get the favorable and most favorable regions — unfortunately such a tool is now hard to find via a Google search. Please develop a tool for the Ramachandran plot, so that it is open to students to understand what the favorable regions are and what the phi and psi values are. — [Speaker] Right, okay, thank you for your suggestion. — [Participant] Thank you very much, and best of success.

[Moderator] We have a question from Bharat: how do you develop a model to retrieve data using data mining? — [Speaker] Okay, I think the sequence should be like this: you start by retrieving the data — you collect the data for a particular topic of your interest; for example, if you're interested in predicting a particular disease, you retrieve that data. After you retrieve the data, you clean the data set, and then you build the model. That is essentially the workflow. — [Moderator] That's all the questions we have. Could you give the concluding words? Thanks to Dr. Chanin for your great time and for accepting our kind invitation. — [Speaker] My pleasure; thank you very much for making this session happen. — [Moderator] We'll meet again at tomorrow's workshop, sir. — [Speaker] Yes, thank you so much. My pleasure.

[Practical session] I've shared the Notion link — I guess everyone can see this, right? So today we're going to be building a simple bioinformatics web application in Python, using the water solubility of molecules as the topic of our web application. The prerequisites are a basic understanding of biology and basic proficiency in Python. As for resources: first, all of you should install conda on your own computer. Conda will allow you to manage the various Python libraries and also manage your Python environments. I've actually made a video about how to install and use conda — if you hover on the link here and click "open link", it will bring you to YouTube — but today I'm going to show you live how to get conda. You want to go here and download Miniconda.
Right, click on it... let me share my screen. Now you're on the Miniconda website, and I recommend downloading the Python 3.8 version. A rule of thumb: whenever there's a brand-new version, I usually ignore it, because many libraries might not yet support it. So if there's a 3.9, I'll just go for 3.8, and if 3.8 is the latest, I'd go for 3.7. For this case you want to go for 3.8, and depending on whether you're on Windows or Mac, click the corresponding link to download it. I'm on a Mac, so I would normally click the .pkg one, which lets you install it in a click-click-click fashion; if you're on 64-bit Windows, download this one, and for 32-bit, download this one.

Let's go back to the web page. The application we're building today is based on a video I previously made on YouTube, so you can check that out as well. The repo of the code we're using today is provided in the link column — click on the GitHub repo link. This is the repo we're going to be using today: I recommend you click the green button that says "Code", click "Download ZIP", and then unzip the file.

When you unzip it, these are the contents. We have the Jupyter notebook, which will allow us to build the machine learning model; the Python file solubility-app.py, which is the actual web application we're going to be building; and the .pkl file produced by the Jupyter notebook, which is the pickled file of the trained model. Then, in order to deploy the web app — normally this file will work on your local computer if you have the Streamlit library installed, but if you'd like to deploy it to the internet — you need additional files. You need requirements.txt, so let's have a look at that. The requirements.txt file essentially lists each Python library and its version number: here it tells us we're using streamlit version 0.71 and pandas 1.1.3 (a reconstruction of this file appears at the end of this section). The great thing about using requirements.txt is that you're pinning the Python libraries to particular versions that are known to work with your web application, so even when pandas or numpy updates its version in the future, your web application will still work — you're not relying on the latest version. More or less, you're putting the app into a time capsule: using this particular requirements.txt file one year from now, your web application will still work.

Let me share the terminal — I hope all of you can see it. I saved the download into my Downloads folder; the folder is named "solubility". If you're on a Mac it will look like this; on Ubuntu or other versions of Linux it will look something like this. You change your directory to the location where you unzipped the ZIP file from the GitHub repo. If you're on Windows, you can use PowerShell and do the same thing: cd, change directory, to the location where you previously unzipped the content.
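Based on the versions read out in this walkthrough, the requirements.txt file looks roughly like this (the Pillow and scikit-learn versions aren't stated in the video, so those lines are left unpinned here):

    streamlit==0.71.0
    pandas==1.1.3
    numpy==1.19.2
    pillow
    scikit-learn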
The contents you see here are the same ones I've mentioned already. Let me open up the application. Normally I would first activate my conda environment — I'll show you in just a moment how to create a new one, but for now let's just load it up: conda activate dataprofessor. Then, to run the app, you just type streamlit run followed by the name of the app — streamlit run solubility-app.py — wait a moment, and your application will load up in a web browser. You can copy this link; let me show you — the web browser with the web application pops up.

On the left-hand side is the input panel, where you put your SMILES input notation. Here we have three example molecules: number one, number two, number three. Then this is the logo of the web application, the title of the web app, a description, and links to the data and the paper it comes from. Here we just reiterate the input — molecules one, two, three — and here we print out the molecular descriptors we have computed: MolLogP, molecular weight, number of rotatable bonds, and aromatic proportion. We compute these descriptors using the RDKit library in Python (a sketch of these descriptor calculations appears at the end of this section), and the app uses a simple linear regression to predict the LogS value. So we have the predicted LogS value here for the first molecule, the second value for the second molecule, and the third for the third. This is the web application we're going to be building today.

Now let me show you how to set up conda on your computer. Let me close this and share the terminal again. While the app is still running, I press Ctrl-C to abort it. Now let's create the conda environment — but first, on the GitHub repo you see here (github.com/dataprofessor/solubility-app), scroll down to the readme; you want to follow the steps there, as all of the instructions are provided. Right here we're going to first create a conda environment — actually, this should be named "solubility". You can click this button to copy the command, and then head over to the terminal.

Back in the terminal, let me deactivate my original conda environment. Notice that whenever I am in a conda environment, its name appears in parentheses before the dollar sign; now that I have deactivated the dataprofessor environment, I am in the base environment. I imagine that by now you have probably installed conda on your computer; to check whether you already have it, just type conda. If you see the help output with the available commands, it means you have conda. Normally I use only a few commands — conda followed by the name of the command I want. For example, to list the installed packages I can just type conda list, and I will see all of the libraries that I have on my computer.
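As promised, here is a sketch of how the four descriptors listed above can be computed with RDKit. MolLogP, molecular weight, and rotatable bonds are built-in descriptors; aromatic proportion isn't, so it's derived here as the fraction of heavy atoms that are aromatic (an assumption about how the app defines it):

    # Compute the app's four descriptors for one SMILES string
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def aromatic_proportion(mol):
        # Fraction of heavy atoms flagged as aromatic
        aromatic = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
        return aromatic / mol.GetNumHeavyAtoms()

    mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol, as an example
    features = {"MolLogP": Descriptors.MolLogP(mol),
                "MolWt": Descriptors.MolWt(mol),
                "NumRotatableBonds": Descriptors.NumRotatableBonds(mol),
                "AromaticProportion": aromatic_proportion(mol)}
    print(features)   # these feed the linear regression that predicts LogS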
Now I'm going to create a new environment: type conda create -n (n for name) followed by the environment name, which I'll call solubility, and specify the Python version with python=3.7.9, then press Enter. It asks whether you'd like to proceed to install these packages, so type y and hit Enter. The environment has now been created successfully, and you activate it by typing conda activate solubility, as shown here; whenever you're finished using an environment, you can leave it with conda deactivate. So let's activate solubility and list the packages: this is a fresh conda environment, so you see only a few packages.

Next we install the Python libraries listed in requirements.txt, which is already in the repo, so there's no need to create it yourself. Before installing, let's look at it: I use cat to see the contents of the requirements.txt file. Here we're going to install streamlit (version 0.7), pandas 1.1.3, numpy 1.19.2, pillow for the images, and scikit-learn. We're installing only a few libraries, and instead of typing them one by one, a single line installs everything in the file: type pip install -r requirements.txt and press Enter. That should take a few moments, and the nice thing is that it also installs all of the prerequisite libraries: notice that scipy was not in the list, but scipy is a dependency of one of the libraries, so it is being installed as well.
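For reference, the requirements.txt being read out with cat would look something like this sketch; only the pandas and numpy pins are stated explicitly in the talk, so the other entries are left unpinned here rather than guessing exact versions:

```
# requirements.txt (sketch; pandas/numpy pins are from the talk, the rest are placeholders)
streamlit
pandas==1.1.3
numpy==1.19.2
pillow
scikit-learn
```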
After that we're going to install RDKit. RDKit is a Python library that lets you do a wide range of cheminformatics tasks: you can display chemical structure images; read in SMILES notation and convert it into other formats such as MOL, MOL2, or even SDF; convert a molecule into a three-dimensional structure and perform molecular mechanics optimization so that you get a reasonably good 3D structure, a low-energy conformer; compute various types of molecular descriptors and molecular fingerprints; and even search by substructure. So RDKit lets you convert molecules into numerical values, and then you can use scikit-learn to build the model; in this tutorial we'll use streamlit to build the web application and pandas to create data frames of the data set. The installation will take a while, so in the meantime let me show you the RDKit website. Have a look at the documentation: it provides all of the essential information, for example how to import the Chem module and use it to read in a molecule from SMILES notation, or from a MOL file (.mol). Whenever you read in a molecule, whether from SMILES notation or from a molecular file (MOL or SDF), RDKit converts it into an RDKit molecule object, and afterward you can use that object to compute molecular descriptors, which is what we'll be doing today. The documentation is very comprehensive, and there's also a cookbook integrated into it: you can draw a molecule inside a Jupyter notebook and display its structure, iterate through your entire data set and display a particular compound, show atomic properties alongside the structure, annotate it, highlight a substructure in the molecule, and color the atoms. There's a lot you can do with RDKit, and if you're into bioinformatics or cheminformatics, you need this library.
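To make that documentation pattern concrete, here is a minimal sketch; the SMILES string is an arbitrary example and the file name is hypothetical, not from the talk:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Read a molecule from SMILES notation; the result is an RDKit Mol object
mol = Chem.MolFromSmiles('CCO')            # ethanol, an arbitrary example
# mol = Chem.MolFromMolFile('input.mol')   # or read from a (hypothetical) MOL file

if mol is not None:                        # MolFromSmiles returns None if parsing fails
    print(Descriptors.MolLogP(mol))        # descriptor calculation on the Mol object
    print(Descriptors.MolWt(mol))
```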
Let me head back to the terminal. The pip libraries have finished installing, so now we install RDKit with conda. RDKit is installed from the conda-forge channel (or the rdkit channel, either one). This might look confusing, so let me show you something first. Say I'd like to install a library: I'd type, for example, conda install jupyter. But if I type conda install rdkit, it doesn't work; conda cannot find rdkit in its default repository, so we have to specify a channel with the -c option. We type conda install -c conda-forge rdkit to look for it in conda-forge; RDKit is available from two channels, conda-forge and rdkit, so you can use either one. Let me share the terminal screen; in case you didn't see it, I typed conda install -c conda-forge rdkit, where -c conda-forge names the channel in which conda should look for the rdkit package, then Enter to install it. If you're not using conda, you can install RDKit manually, but that's quite difficult, because you also have to build its dependencies, which might take a couple of hours if you build from source. In the documentation, under installation, there are many ways to install RDKit, which can look confusing, but the approach I'm showing you is simple: one line and RDKit is installed. For those of you who use Jupyter notebooks, it can be a bit tricky to use conda inside the notebook, but I have a video on my YouTube channel showing how to install conda in a Jupyter notebook and then install RDKit there. This should take a short moment; most of this tutorial involves a fair amount of library installation, and in my normal tutorial videos I just fast-forward through it. There we go, it's installed. Let's check with conda list: you can see we now have a lot of libraries and dependencies in our conda environment. Let's try it out: type python, then import rdkit. It works; if it didn't, it would print an error message. Note also that we're on Python 3.7.9, the version we specified. Let me quit that.
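That same sanity check can be scripted; a small sketch, assuming the solubility environment is active:

```python
import sys
import rdkit

print(sys.version)        # should report 3.7.9, the version we pinned
print(rdkit.__version__)  # reaching this line means the conda install worked
```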
Now we're almost ready to run the web application, but before that, let me show you the Jupyter notebook. We have to install Jupyter first, so conda install jupyter, because we're going to run the notebook file you see here in order to build the model.

One moment. So, what time zones are you all in? Mostly India? So it's night time for everyone; it's 10:30 p.m. for me, so probably around nine p.m. for you. At the moment we have one question. "Can I go ahead?" Sure. "One participant says: I have 26 compounds, and for these compounds there is a data set in CSV format that includes descriptors I calculated with an external program, around 5,000 of them, along with the compounds' IC50 values. I would like to perform QSAR analysis with deep learning (artificial neural networks) and predict new IC50 values. I did not calculate any descriptors in your workflow; I just numbered the molecules one, two, three, four. Is there sample code for this, and how can it be done with the existing program?" Okay, so the concept is much the same. For deep learning you could use the Keras library or PyTorch; in today's tutorial we're using scikit-learn, but you could swap in PyTorch, Keras, or TensorFlow instead (nowadays Keras is already integrated into TensorFlow, so effectively TensorFlow or PyTorch). Since you already have 5,000 descriptors, you don't have to calculate the descriptors we're calculating here. In this tutorial we use four descriptors computed with RDKit (MolLogP, molecular weight, number of rotatable bonds, and aromatic proportion), so it's a deliberately simple example of a QSAR or machine learning model, whereas you already have 5,000; and where we have logS as the target, you have IC50. The approach is the same. I would, however, recommend converting the IC50 values to pIC50, because if you look at the distribution of your IC50 values you'll likely see that it's clustered, and taking the negative logarithm of the IC50 makes the distribution more uniform. Then you can apply the same model-building approach I'm about to show in the Jupyter notebook. I hope that answers the question. "Yes, thank you."
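To make that pIC50 recommendation concrete, here is a minimal sketch with made-up IC50 values in molar units (if your values are in nM, convert to mol/L first):

```python
import numpy as np
import pandas as pd

# Hypothetical IC50 values in mol/L
df = pd.DataFrame({'IC50': [1e-6, 5e-8, 3e-9]})

# pIC50 = -log10(IC50); the negative log compresses a clustered
# distribution into a more uniform one, as recommended above
df['pIC50'] = -np.log10(df['IC50'])
print(df)
```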
Now that Jupyter is installed, let's run it: type jupyter notebook in the terminal and hit Enter, and the Jupyter interface pops up. We're going to run solubility-web-app.ipynb; ipynb stands for IPython Notebook, the original name before it was renamed Jupyter Notebook. We simply run it sequentially. Here we import pandas: click on the cell, then click Run, and notice the asterisk symbol, which means the cell is running; it's a bit slow, wait a second, and now it's done. Instead of clicking Run you can use the keyboard shortcut Shift+Enter, which does the same thing; I like shortcuts, so Shift+Enter, Shift+Enter, and each cell runs sequentially from top to bottom.

Now let me explain. The great thing about a Jupyter notebook is that it lets you put your code in documented form. Its two essential components are code cells and text cells. This is a text cell: double-click it and you see the underlying Markdown. In Markdown, a hash represents an h1 heading, double asterisks before and after make text bold, and a single asterisk makes it italic; Shift+Enter, and notice this one renders in italics and this one in bold. Double-click this other one: it has only the h1 hash; adding double asterisks would make it bold, but a heading already looks much the same, so take them out. You can google a Markdown cheat sheet to see all of the syntax. If you click the plus you get a new code cell, and from the dropdown you can change it into a Markdown cell and put in text. One hash is h1, two hashes h2, three h3, like section, subsection, and sub-subsection, so the heading gets smaller as you go from h1 to h3. There's also a nice feature in Google Colab: there you can collapse an entire h1 level; click the button and everything underneath that section folds away. That seems to work only in Colab, and it's a cool feature to have.

In the second cell we read in the data set, which is provided on the GitHub repo: the Delaney solubility data with descriptors. These descriptors are pre-computed, so at this point we're not actually using RDKit. You can copy the link and have a look: the file is in CSV format, with commas as the column separator, and the first line gives the descriptor names (MolLogP, molecular weight, number of rotatable bonds, aromatic proportion) plus the logS value. We computed these earlier with RDKit; let me find the YouTube video for that. It's in the bioinformatics playlist on the Data Professor channel, in two parts. In part one I walk through the process step by step; it takes about 25 minutes, which would be too long for this session, so I'll put the two links, part one and part two, on the Notion page I showed you. In that video I explain what each descriptor means and how to compute it, and it also comes with the code: click "code" and it brings you to a notebook that additionally shows how to install RDKit on Google Colab, so you can run everything there as well; the other linked notebook covers installing conda.

Here is the underlying data set: the first column is the name of each compound, then the measured (experimental) solubility value, and the last column is the SMILES notation, which encodes the chemical structure. We read the SMILES in with RDKit's Chem module: Chem.MolFromSmiles takes the SMILES notation and returns a molecule object, so each molecule is read in and becomes a molecule object, and then we compute the descriptors for each one: molecule one, molecule two, three, four, five, and so on; this data set has 1,144 molecules.
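A sketch of that loading step; the file name and column label here are placeholders, since the exact URL and headers aren't spelled out in the talk:

```python
import pandas as pd
from rdkit import Chem

raw = pd.read_csv('delaney.csv')   # placeholder path: name, measured logS, SMILES
print(raw.shape)                   # the talk mentions 1,144 molecules

# Turn the SMILES column into RDKit molecule objects, one per compound
mols = [Chem.MolFromSmiles(smi) for smi in raw['SMILES']]
```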
We iterate through the entire molecule list with a for loop, using a custom function we defined ourselves to calculate the MolLogP value, the molecular weight, and the number of rotatable bonds, and then we build a pandas data frame. The custom function takes the SMILES notation as its input argument and returns a data frame of the computed molecular descriptors. Beyond the descriptors you see here, you can modify the function: copy a line, add more lines, and cherry-pick the descriptors you'd like calculated; you could add 10, 20, or 100 more lines so that it computes additional descriptors. That's really material for another tutorial (maybe I'll make a video on it), but RDKit offers a great many descriptors; let me show you, I hope time allows. I'll try to finish this part in 15 minutes and then show you the actual web app, so we'll have about half an hour for that.

Back in the RDKit documentation, I press Ctrl-F and search for "descriptor", and here is descriptor calculation. Besides MolLogP, used in today's tutorial, you can calculate others such as TPSA, and for the full options click through to the list of available descriptors. There are a great many: partial charges, number of amide bonds, ring count, TPSA, and more. We use only a few; note that some descriptors can be somewhat redundant with one another, so you'll want to compare which ones overlap. There are also 3D descriptors, for which you'll probably need to generate a 3D conformation of your molecule, which requires optimization with a molecular mechanics force field; for more accuracy you can use quantum mechanics (computational chemistry) to optimize the geometry and then compute quantum mechanical descriptors such as molecular orbital energies. You can also compute molecular fingerprints; there are several types, each with several hundred bits. So there are a lot of descriptors available, and you should feel free to add them to the custom function I've shown you.
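Here is one plausible shape for such a custom function; this is a sketch, not the talk's exact code, and in particular the aromatic-proportion implementation is my assumption:

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def aromatic_proportion(mol):
    # Assumed definition: fraction of heavy atoms that are aromatic
    aromatic = sum(1 for atom in mol.GetAtoms() if atom.GetIsAromatic())
    return aromatic / mol.GetNumHeavyAtoms()

def generate_descriptors(smiles_list):
    rows = []
    for smi in smiles_list:                       # iterate over the molecule list
        mol = Chem.MolFromSmiles(smi)
        rows.append({
            'MolLogP': Descriptors.MolLogP(mol),
            'MolWt': Descriptors.MolWt(mol),
            'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
            'AromaticProportion': aromatic_proportion(mol),
        })
    # Extra descriptors = extra lines here, e.g. 'TPSA': Descriptors.TPSA(mol)
    return pd.DataFrame(rows)

print(generate_descriptors(['CCO', 'c1ccccc1']))
```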
Now that the descriptors have been generated, the notebook reads in the data frame of calculated descriptors and splits it into X and y variables, which I'll show you in a moment. First, the part one and part two links: let me add them to Notion and share that with you. Notion, in case you haven't used it, is a very good note-taking application, very powerful, with a lot of templates; I'll make a video about it soon. Let me add a new entry, Cheminformatics Part 1 with its link, and Cheminformatics Part 2. Part two essentially uses PyCaret to build the model; PyCaret is an AutoML library for Python developed by Moez Ali, an awesome library that lets you build machine learning models with many algorithms in a single run using only a few lines of code. There we go, cheminformatics part one and part two are in there, and I'll also add the GitHub links for your convenience in case you want to try this out in your spare time; the links to the repos are provided here. Now let's start creating the web app.

Back to the Jupyter notebook; so many windows open, one moment, let me share the screen. We have loaded the data frame, called dataset, and now I want to split it into X and y: X will contain the first four columns, and y only the logS column. I can create a new cell, type dataset, Shift+Enter, and it shows the data frame. For X I wrote X = dataset.drop('logS', axis=1), dropping the last column, the logS column, and you can see that logS is now gone. For y we take the last column: one way is by position, selecting the last column; another is y = dataset['logS'], using the column name; and a third is y = dataset.logS. All three give the same output, so there are three ways of doing the same thing (see the sketch below); pick whichever you like. Now we have X and y, so we're ready to use scikit-learn to build the model.
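In code, the split and the three equivalent ways of selecting y look roughly like this, assuming the logS column name used in the talk:

```python
X = dataset.drop('logS', axis=1)   # everything except the logS column

y = dataset.iloc[:, -1]            # way 1: position of the last column
y = dataset['logS']                # way 2: bracket notation
y = dataset.logS                   # way 3: attribute notation
```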
In this tutorial we're not going to perform any train/test split; we'll use the full data set, X and y, to build the model. In your own workflow, for your own research project, I would recommend performing data splitting as well. Here we use a basic linear model, the linear_model module from sklearn. To evaluate the performance (this is a regression model, because the y variable is quantitative, a floating-point number; for classification we would use different metrics), we'll use the R-squared value and the mean squared error; you could also use the mean absolute error, that's up to you, but in this tutorial we'll use the mean squared error. Run the cell; then we create a variable called model and assign linear_model.LinearRegression() to it, and we train the model with model.fit, passing in X and y, which we created earlier. It takes a moment on my local computer; Shift+Enter, and the model is trained fairly quickly. Now let's make the prediction with model.predict.

As you can see, the notebook is documented so you can read it, and you can add more explanatory text, for example: "We will now perform the prediction by applying the trained model. This will be performed using the model.predict function." You can highlight a function name with backticks, which gives it a gray background and a slightly different font. You can also typeset math between dollar signs: for example, to write IC50 with a proper subscript you put the 50 in braces after the underscore, because without the braces only the 5 ends up in the subscript. That's just to show the syntax; it doesn't apply here, since in this example we're not using IC50 but R-squared and the MSE, the mean squared error.

So we apply model.predict to the X variable and assign the result to y_pred, and the predicted values appear here. If you want to turn this array into a pandas structure, wrap it in a pandas Series (watch the capitalization, or you'll get a "not defined" error), and now it displays as a column rather than an array. What if I want to combine it with the X data frame? Use pd.concat with brackets around the elements to combine, [X, y_pred], and axis=1; Shift+Enter, and the predicted values are shown alongside. But say we want the prediction next to the actual value y: put y first and then y_pred. Notice the Series has no name yet, so let's name it logS_pred; combine again, and the column name updates. Finally I assign the result to a data frame so that I can save it.
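Put together, the fit/predict/combine steps just narrated look roughly like this sketch, reusing the X and y from above:

```python
import pandas as pd
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X, y)                       # full data set, no split, as in the talk

y_pred = pd.Series(model.predict(X), name='logS_pred')
df = pd.concat([y, y_pred], axis=1)   # actual logS next to predicted logS
print(df.head())
```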
Now I have it as a data frame, and to save it to a file you can just call df.to_csv and name the file, say 'results.csv'; hit Enter and you get the CSV file. It's right here, the resulting file, with logS and logS_pred side by side, the actual value next to the predicted one. That was a bonus, in case you were wondering how to export results. Let's head back to the notebook to build the web application; we have 25 minutes left.

Now we print out the model performance, and here you can see the regression coefficients. Let me add some explanation: in a typical linear equation, this model expresses logS in terms of the four parameters (MolLogP, molecular weight, number of rotatable bonds, and aromatic proportion) plus the intercept. The coefficients print in order, one per descriptor, with their signs as shown, and the intercept is this value here; that's our linear regression equation, the equation generated by this particular model. To get the intercept you use model.intercept_; for the coefficients, model.coef_; for the mean squared error, the mean_squared_error function; and for R-squared, the r2_score function, whose input arguments are the actual values y and the predicted values y_pred. We get an R-squared of 0.77 and a mean squared error of 1.01. I had also already added a cell that prints this equation programmatically, so the equation is generated automatically; even if you add more data to the data set, it will regenerate the equation for you.

Next, a visualization of the predictions: this is the plot of experimental versus predicted logS values, and it looks pretty good. Now we save the model out using the pickle module, dumping it into a .pkl file. One important note: if you want to save the model out, make sure you do it locally, because Google Colab may be running different versions of scikit-learn or numpy than your machine, and a .pkl file generated there may not be compatible with your web application. So train and pickle the model locally, on the computer where the web application will run.
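The performance read-out and the pickling step, as a sketch continuing from the code above:

```python
import pickle
from sklearn.metrics import mean_squared_error, r2_score

print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)                 # one weight per descriptor
print('MSE: %.2f' % mean_squared_error(y, y_pred))  # ~1.01 in the talk
print('R2:  %.2f' % r2_score(y, y_pred))            # ~0.77 in the talk

# Save the trained model locally (not on Colab; see the version caveat above)
with open('solubility_model.pkl', 'wb') as f:
    pickle.dump(model, f)
```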
Now let me show you the contents of the web application; take a look at solubility-app.py, and let me share the screen. Looking at it generally, this web application is only 110 lines of code, which is not much for the fully functional app you saw. The first few lines import all of the necessary libraries; you can see pickle, used to load the trained model we already built. The point is that we don't want to retrain the model over and over, since the data set stays the same: we train once, deploy the trained model, and the web application takes the query molecule (the input molecule), uses RDKit to compute the molecular descriptors, and makes the prediction. A moment ago I described the custom function for calculating the molecular descriptors; that custom function has to live inside this code, and it's documented here: lines 15 to 57 are the custom descriptor-calculation function, and line 63 simply displays the logo of the web app. Let me run the web app now: I close the Jupyter notebook, go back to the terminal, and press Ctrl-C to exit (it asks whether I want to shut down the notebook server, and it resumed because I took longer than five seconds to answer, so let me do it again), and now I run streamlit run solubility-app.py.

Here is the web app. The image variable I mentioned holds this image, the logo I drew for the application. Back in the code, we use the st.write function to display the heading of the web application; as before, the hash acts as an h1 tag, so the font is large, and then we describe the web application. Next, the side panel: on the left-hand side you have the text box that takes SMILES notation as input, one molecule per line. For example, I can modify it and press Ctrl+Enter (Cmd+Enter on a Mac) to apply it, and the prediction updates immediately: the third molecule's notation here matches the third prediction there. In the code, this is the SMILES input, st.sidebar.text_area: the text_area function gives us the text box, and the st.sidebar prefix places it in the sidebar; if you removed the sidebar part, the text box would appear in the main panel instead, so we leave it as is. Here we take in the SMILES notation, and this is the default input, the three molecules you saw; \n is the newline character, so there is an Enter at the end of the first line and another at the end of the second.
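A stripped-down sketch of that app structure; the default SMILES strings are placeholders, and generate_descriptors stands in for the custom RDKit function discussed earlier:

```python
import pickle
import streamlit as st

st.write('# Molecular Solubility Prediction Web App')   # '#' renders as an h1 heading

# Sidebar text box; '\n' separates the default example molecules (placeholders)
SMILES_input = 'NCCCC\nCCC\nCN'
SMILES = st.sidebar.text_area('SMILES input', SMILES_input)
smiles_list = SMILES.split('\n')

# Compute descriptors for the query molecules (custom function from earlier)
X = generate_descriptors(smiles_list)
st.header('Computed molecular descriptors')
st.write(X)

# Load the pickled model once and predict in real time
load_model = pickle.load(open('solubility_model.pkl', 'rb'))
st.header('Predicted logS values')
st.write(load_model.predict(X))
```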
After that, the app calculates the molecular descriptors and assigns them to the X variable. Then, on line 103, we load the trained model (remember solubility_model.pkl from the Jupyter notebook), assign it to the load_model variable, and call load_model.predict on X, the molecular descriptors just calculated from the user's input in the side panel. The prediction is assigned to the prediction variable and displayed underneath the st.header. And that's essentially the web app; let me show it once more. It loads the necessary libraries (streamlit, numpy, scikit-learn, RDKit); the user enters the molecules they want, presses Ctrl+Enter, and the app runs the prediction in real time: it takes the input molecules as SMILES notation, displays your input, computes and shows the molecular descriptors, applies the model.predict function to those descriptors, and displays the resulting prediction values. That is the entire application contained in the .py file, and I think we've done it. We have 10 minutes left, so I'm happy to take any questions you may have.

"I have a question: with this model, can we predict the descriptors for a new molecule?" We don't predict the descriptors; we calculate the descriptors and then predict the solubility value. "Right, so RDKit is used for calculating the descriptors?" Yes.

"Hello, and thank you for a wonderful explanation. For any QSAR model, an applicability domain is asked for; how can we calculate that in machine learning?" We've actually published a couple of research articles where we calculate the applicability domain, and normally we use principal component analysis: you perform PCA on your data (the training set, the test set, and any external sets if you've done a data split) and visualize the distribution of the molecules in the scores plot. That distribution essentially is your applicability domain: if a new or test compound falls outside it, not overlapping the training distribution, the compound is not reliably predictable; it is outside the applicability domain. I've actually never built a web app for that; maybe that's a good idea for a future video, and I'll probably make one showing how to do it. "Thank you."
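A sketch of that PCA idea, assuming hypothetical X_train and X_test descriptor tables from a data split:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
scores_train = pca.fit_transform(X_train)   # fit the PC space on the training set
scores_test = pca.transform(X_test)         # project test compounds into the same space

# Test compounds falling outside the training cloud are outside the applicability domain
plt.scatter(scores_train[:, 0], scores_train[:, 1], alpha=0.5, label='train')
plt.scatter(scores_test[:, 0], scores_test[:, 1], alpha=0.5, label='test')
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.legend(); plt.show()
```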
"Sir, I have one question; a nice presentation, thank you. As you said, if we save a pickle file under one scikit-learn version and then try to use it under a different Python or scikit-learn version, it may not work. Say I saved a pickle file from one scikit-learn release and put my code on GitHub; if someone downloads the code, can they use the pickle file, or is there another way to save it so that anyone can use it with any scikit-learn version?" That's a great question. In one case I trained the model on Google Colab and downloaded the pickle file to my local computer, and when I tried to run the web application it gave me an error saying I was using a different version of scikit-learn; but when I trained it locally and used that pickle file, there were no issues. So there can be compatibility problems, and I'd recommend sharing the code so the user can download it and train the model locally on their own machine. "Okay, thank you, sir."

"Hello. Suppose I want to do this training without the descriptors; I just want to feed the SMILES into the training model, like a regression model. How can I do that, and which model would be good for such a case?" You mean you want to use the SMILES but not calculate descriptors? "Yes, just take the SMILES, read it, and put it into the training model; does that work?" It is possible, and we're actually doing that as well. You're bypassing the descriptor calculation, so the SMILES notation itself becomes the descriptor. We're currently experimenting with taking the SMILES notation and performing tokenization, meaning we split the SMILES string into fragments so that each character acts, in effect, as a descriptor in itself. For example, in the side panel you saw that NCCC is a molecule; tokenizing it splits it into N, C, C, C, and each character is treated as a feature you can compare across molecules: if one molecule has N in the first position and another also has N in the first position, they match there. You can represent this numerically, for instance as a binary encoding: does the molecule contain N in the first position? If so, that feature is one; if not, zero. So you tokenize the SMILES notation and that becomes your descriptor set, and then you can use a deep network such as an LSTM, or try scikit-learn with, say, a random forest. We're experimenting with this now, and hopefully we'll publish a paper and maybe make a YouTube video about it.
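As a toy illustration of that character-level idea (my sketch, not the group's actual pipeline):

```python
import numpy as np

smiles = ['NCCC', 'CCO']                        # toy molecules
vocab = sorted(set(''.join(smiles)))            # character vocabulary, e.g. ['C', 'N', 'O']
max_len = max(len(s) for s in smiles)

def one_hot(s):
    # 1 where character ch occupies position i, 0 elsewhere
    mat = np.zeros((max_len, len(vocab)))
    for i, ch in enumerate(s):
        mat[i, vocab.index(ch)] = 1
    return mat

X_tok = np.stack([one_hot(s) for s in smiles])  # (n_molecules, max_len, vocab_size)
print(X_tok.shape)                              # feed to an LSTM, or flatten for a random forest
```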
"Okay, thank you. This was an excellent lecture, and everything was crystal clear; I hope all the participants understood everything. If they want to enhance their knowledge, they should visit the Data Professor YouTube channel, where they can watch the videos and ask questions of Professor Chanin. He has shown, hands-on, how you can develop a regression model, and this is very relevant to ADMET prediction: when people use tools in drug design for ADMET prediction, most users don't read the paper describing how the model works, how the input compound is given, how the descriptors are calculated, and how the values are produced from those descriptors, so now they can relate that to the regression model, and they can also use such regression models in their own research when they want to develop a QSAR model. I hope everything was crystal clear; thank you, Professor Chanin, for accepting the invitation." Thank you, it's my great pleasure. Let me make one more note: if you're interested, you can also visit my academic GitHub; let me show it on the screen. The academic GitHub hosts repositories for the research articles we've published. For example, HCVpred: we published that article in the Journal of Computational Chemistry, and you can read the full paper, and the full code is provided in the repo, including the curated data set, the descriptors, and the R code you can use to reproduce our work (that project used R, so you can perform the model building from the R code). We've shared the code for several of our other papers as well, so check out the GitHub; let me put the links in the chat. There are three accounts in all: the dataprofessor GitHub, which contains the files and code used in the YouTube tutorials; my personal GitHub; and the lab GitHub, which contains the code we use for our research publications. "Right, and by sharing the code and the notebooks you are genuinely contributing to science; I think GitHub is one of the most useful resources for learning data science. I think we're having some microphone problems, so I am going to end the session. Thank you so much, Dr. Chanin, for accepting our invitation; we look forward to collaborating with you more. I'm going to end this meeting now." Okay, thank you so much, everyone, for attending. Thank you.

And so I hope this video was helpful to you. Please support the channel by smashing the like button, subscribing if you haven't already, and hitting the notification bell so that you'll be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey.
Info
Channel: Data Professor
Views: 24,218
Rating: 4.9685864 out of 5
Keywords: how to build bioinformatics tools, how to build bioinformatic tools, bioinformatics, bioinformatic, bioinformatics tools, bioinformatic tools, bioinformatics tool, how to develop bioinformatics tools, bioinformatics lecture, bioinformatics 101, bioinformatics workshop, learn bioinformatics, bioinformatics python, streamlit python, python web app, bioinformatics web app, bioinformatics app, build bioinformatics tools, build bioinformatics app, data science, computational biology
Id: LHM0Couv0w4
Length: 122min 35sec (7355 seconds)
Published: Fri May 21 2021