Data Science Full Course - 12 Hours | Data Science For Beginners [2023] | Edureka

Captions
[Music] Data science is undoubtedly one of the best careers to get started with. It is an interesting field with a high demand for skilled professionals. In this video on the data science full course, we'll see everything about data science end to end. But before we get started, make sure to hit the like button and subscribe to our channel; you can also hit the bell icon to receive regular updates from here. We also have a lot of training programs and certification courses on our site, so if you are interested, do check out the links given in the description. So let's get started by looking at the topics that we'll be discussing in this data science full course video. We'll begin this video by understanding what data science is; this section will cover all the data science fundamentals that you would need to know. Followed by this, we have the topic: who is a data scientist? Once we understand who a data scientist is, we'll then see the roadmap to becoming a data scientist, and we'll also see some salary statistics. After that, we'll discuss the data science core concepts. In this, we'll begin by seeing the data life cycle; this topic discusses the stages in the acquisition and processing of data. After that we have statistics and probability. Now this is a very important concept that sets your foundations, so take your time understanding this part. Once we're done with statistics and probability, we'll cover what machine learning is. Followed by this topic, we have some core algorithms in data science, which include linear regression, logistic regression, the decision tree algorithm, random forest, the KNN algorithm, the Naive Bayes classifier, support vector machines, the K-means clustering algorithm, the Apriori algorithm, reinforcement learning, Q-learning, etc. Now I know this list looks intimidating, but trust me, we'll take it easy. And guess what, we're still not done: after learning all these algorithms, we'll discuss what deep learning is. Next we have data science for non-programmers; this answers most of the questions that non-programmers have about data science. Now after all this, we really do hope that you build a successful career in this domain, and for that reason we have included two bonus sections: content on how data scientists should create their resume, and data science interview questions and answers. So let's not waste any more time and get started with the first section, which is: what is data science? With the development of new technologies, there has been a rapid increase in the amount of data. This has created an opportunity to analyze and derive meaningful information from all this data, and this is where data science comes into the picture. Technically, data science is defined as the process of extracting knowledge and insights from complex and large sets of data by using processes like data cleaning and data visualization. Almost all of us use Google Maps, but have you ever wondered how Google knows the traffic conditions between where you are and where you're trying to go, or how it determines the fastest way to your destination? The answer is data science. Google Maps collects data every day from a multitude of reliable sources, primarily smartphones. It continuously combines the data from drivers, passengers, and pedestrians, and then, making use of machine learning algorithms, Google Maps sends real-time traffic updates by way of colored lines on the graphic layers. This helps you find your optimal route and even determine which areas should be avoided due to road work or accidents. Isn't that amazing?
As data science continues to evolve, the demand for skilled professionals in this domain is also increasing drastically, and in order to uncover useful intelligence for their organizations, data scientists must master all the aspects of data science. Do you guys remember the times when we had telephones and we had to go to PCO booths in order to make a phone call? Now, those times were very simple, because we didn't generate a lot of data. But these days we have smartphones which store a lot of data, so there's everything about us in our mobile phones. Similarly, the PCs that we used in earlier times processed very little data; there was not a lot of data processing needed because technology wasn't that evolved. So if you guys remember, we used floppy disks back then, and floppy disks were used to store small amounts of data, but later on hard disks were created and those could store GBs of data. But now if you look around, there is data everywhere around us: we have data stored in the cloud, we have data in each and every appliance in our houses. Similarly, if you look at smart cars these days, they're connected to the internet, they're connected to our mobile phones, and this also generates a lot of data. In order to process this much data we need more complex algorithms, we need a better process, and this is where data science comes in. Now, IoT, or the Internet of Things, is just a fancy term that we use for a network of tools or devices that communicate and transfer data through the internet. Guys, IoT data is measured in zettabytes, and one zettabyte is equal to a trillion gigabytes. According to a recent survey by Cisco, it's estimated that by the end of 2019, which is almost here, IoT will generate more than 500 zettabytes of data per year. Moving on, let's see how social media is adding to the generation of data. The fact that we are all in love with social media is actually generating a lot of data for us; it's certainly one of the fuels for data creation. Now, all these numbers that you see on the screen are generated every minute of the day. Guys, apart from social media and IoT, there are other factors as well which contribute to data generation. These days all our transactions are done online: we pay bills online, we shop online, we even buy homes online; these days you can even sell your pets on OLX. Not only that, when we stream music and watch videos on YouTube, all of this is generating a lot of data. So with the emergence of the internet we now perform all our activities online. Obviously this is helping us, but we are unaware of how much data we are generating and what can be done with all of this data. What if we could use the data that we generate to our benefit? Well, that's exactly what data science does: data science is all about extracting useful insights from data and using them to grow your business. Now before we get into the details of data science, let's see how Walmart uses data science to grow their business. So guys, Walmart is the world's biggest retailer, with over 20,000 stores in just 28 countries, and it's currently building the world's biggest private cloud, which will be able to process 2.5 petabytes of data every hour. Now, the reason behind Walmart's success is how they use their customer data to get useful insights about customers' shopping patterns. The data analysts and the data scientists at Walmart know every detail about their customers; they know that if a customer buys Pop-Tarts, they might also buy cookies. How do they know all of this?
How do they generate information like this? Well, they use the data that they get from their customers and they analyze it to see what a particular customer is looking for. Now let's look at a few cases where Walmart actually analyzed the data and figured out the customer needs. Let's consider the Halloween and cookie sales example. During Halloween, a sales analyst at Walmart took a look at the data and found that a specific cookie was popular across all Walmart stores, so every Walmart store was selling these cookies very well, but there were two stores which were not selling them at all. The situation was immediately investigated, and it was found that there was a simple stocking oversight because of which the cookies were not put on the shelves for sale. Because this issue was immediately identified, they prevented any further loss of sales. Now, another such example is that through association rule mining, Walmart found out that strawberry Pop-Tart sales increased by seven times before a hurricane; a data analyst at Walmart identified the association between hurricanes and strawberry Pop-Tarts through data mining. Now guys, don't ask me the relationship between Pop-Tarts and hurricanes, but for some reason, whenever there was a hurricane approaching, people really wanted to eat strawberry Pop-Tarts. So what Walmart did was place all the strawberry Pop-Tarts at the checkouts before a hurricane would occur, and this way they increased the sales of their Pop-Tarts. Guys, this is an actual thing, I'm not making it up; you can look it up on the internet. Not only that, Walmart is analyzing the data generated by social media to find out all the trending products. Through social media you can find out the likes and dislikes of a person, so what Walmart did, quite smartly, is use the data generated by social media to find out what products are trending or liked by customers. An example of this is that Walmart analyzed social media data to find out that Facebook users were crazy about cake pops, so Walmart immediately took a decision and introduced cake pops into their stores. So guys, the only reason Walmart is so successful is the huge amount of data that they get: they don't see it as a burden; instead they process this data, analyze it, and then try to draw useful insights from it. They invest a lot of money, effort, and time in data analysis; they spend a lot of time analyzing data in order to find any hidden patterns. As soon as they find a hidden pattern or an association between any two products, they start giving out offers or discounts or something along those lines. So basically Walmart uses data in a very effective manner: they analyze it very well, they process it very well, and they find the useful insights that they need in order to get more customers or to improve their business. So guys, this was all about how Walmart uses data science. Now let's move ahead and look at what data science is. Guys, data science is all about uncovering findings from data; it's all about surfacing the hidden insights that can help companies make smart business decisions. All these hidden insights or hidden patterns can be used to make better decisions in a business. Another example of this is Netflix: Netflix basically analyzes the movie viewing patterns of users to understand what drives user interest and to see what users want to watch, and then, once they find out, they give people what they want.
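To give a feel for the association rule mining mentioned in the Walmart example above, here is a minimal Python sketch. It is not from the video: the mlxtend library and the tiny made-up basket data are assumptions used purely to show the shape of the technique.

```python
# A minimal, illustrative sketch of association rule mining (the idea behind the
# Pop-Tarts/hurricane example). The basket data below is invented.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded "transactions": each row is a shopping basket
baskets = pd.DataFrame(
    {
        "pop_tarts":  [1, 1, 1, 0, 1],
        "flashlight": [1, 1, 0, 0, 1],
        "cookies":    [0, 1, 1, 1, 0],
    },
    dtype=bool,
)

# Find itemsets that appear in at least 40% of baskets
frequent = apriori(baskets, min_support=0.4, use_colnames=True)

# Turn frequent itemsets into "if X then Y" rules, ranked by lift
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```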
So guys, data actually has a lot of power; you just need to know how to process this data and how to extract useful information from it, and that's what data science is all about. A big question here is: how do data scientists get useful insights from data? It all starts with data exploration. Whenever data scientists come across a challenging question or a challenging situation, they become detectives: they investigate leads and try to understand the different patterns and characteristics of the data. They try to get all the information that they can from the data and then use it for the betterment of the organization or the business. Now let's get started and understand who a data scientist is. Well, when the term scientist comes to mind, the first thing is experiments. Scientists are associated with experiments, and a data scientist is associated with data and experimenting with that data. A data scientist is someone who experiments with data and has certain tools and technologies at their disposal to draw insights and meaningful patterns from this data. These tools and technologies have been listed here, and we will cover machine learning first. Machine learning is a very important tool for a data scientist to build and test models. Another tool is a programming language, to extract, analyze, and manipulate the data. A data scientist also uses yet another important tool, and that is statistics: statistics is used for predictive analysis and to understand and interpret the results from the data, so statistics is a really important tool for a data scientist. Another tool is databases; a database is the means to access the data. Yet another tool is big data: to work with complex data sets, big data technologies are really required by a data scientist. We will cover all these technologies in detail in the upcoming slides, so be patient for a little while; this is just a brief gist of the tools and technologies. Another technique is deep learning: a data scientist is also required to have knowledge of neural networks and NLP to draw meaningful insights from unstructured data and also to process such data. Now let's move on to discussing the role of a data scientist: what exactly does a data scientist do? This will be clear once we go through a few job responsibilities of a data scientist. This data has been taken from Glassdoor, so let us look at the role that a data scientist needs to perform; as you can see on the right hand side, the gist of the role has been listed. A data scientist is required to build, test, and iterate machine learning models. We've already covered that machine learning is the indispensable tool of a data scientist. Iteration requires a lot of repetition, repetitive experiments with the data, and it is quite a complex process; repetition is done to improve the results of the model. So a data scientist is really required to iterate again and again over the machine learning models, although the building part is not as heavy, because companies are often already working with predefined, pre-designed, coded software; so testing and iteration become the indispensable part of a data scientist's work. Now another role, if we look here, is working in a team with the engineering team to deploy, validate, and monitor the machine learning models.
This is very important, because once a model goes to a production state, you have to monitor it, and you also have to undertake its deployment: how to really connect it with the front end, see how it is working, and what accuracy it is giving. Then, most importantly, a data scientist is required to research data and identify opportunities. This means drawing useful insights from the data by using various techniques, as we have seen earlier; a data scientist applies all these techniques together to draw meaningful insights from the data. Now let us understand what all to learn to become a data scientist, so that one is suited for this role and can perform it very well. The technologies to learn for a data scientist are: a programming language, which could be either Python or R; machine learning, deep learning, and NLP concepts, which need to be really clear with a solid foundation; statistics, both descriptive and inferential; databases, that is, knowledge of SQL and NoSQL; and big data, where knowledge of Hadoop or Spark, any one of them, is required. So now let's cover these technologies in detail, starting with the programming language. Why is a programming language really required by a data scientist? To analyze huge volumes of data and to perform analysis of the data; a programming language really aids in better analysis and manipulation of the data. The programming languages generally used are Python, R, Scala, or Java, but the generally preferred one is Python, and it's the most popular language in the data science community. R is also there, but on the production side it becomes a little tough. So now we will cover both languages and the important libraries to study for each. Moving on to the programming language Python, what all needs to be covered? The first library is pandas. This library is very important because it is used for data analysis and manipulation, and pandas also provides very fast, flexible data structures. This means that numerous functions and methods are available in pandas which allow you to perform data analysis and manipulation very quickly. Time series analysis is also possible with the help of pandas, so for examples like forecasting weather, stock prices, or house prices, pandas really helps you with data analysis and manipulation. Now coming to NumPy: this is another library, numerical Python, and it is used for scientific computations. Numerical calculation in machine learning is done with the help of NumPy, and it is the foundation, I would say the building block, of scientific computing in Python; mathematical calculations and scientific computations are possible because of NumPy. NumPy contains multi-dimensional array and matrix data structures, and with these powerful data structures one can perform very efficient calculations in Python. All right, moving on, we have the visualization libraries like Seaborn and matplotlib. These two libraries are used for visualizing plots, looking at the spread of the data, and seeing how the data looks: whether there are any outliers, what the spread of the data is, whether it is normally distributed or unevenly distributed. So these visualization libraries really help us to visualize the data.
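As a small illustration of the libraries just mentioned, here is a minimal, self-contained Python sketch using pandas, NumPy, and matplotlib; the tiny sales series is invented purely for demonstration.

```python
# A minimal sketch: pandas for data manipulation and time series, NumPy for
# numerical work, matplotlib for plotting. The data here is made up.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# pandas: build a small time-indexed DataFrame and manipulate it
dates = pd.date_range("2023-01-01", periods=6, freq="D")
df = pd.DataFrame({"sales": [120, 135, np.nan, 150, 160, 155]}, index=dates)
df["sales"] = df["sales"].fillna(df["sales"].mean())        # simple cleaning step
df["rolling_mean"] = df["sales"].rolling(window=3).mean()   # time series smoothing

# NumPy: fast array math underneath
growth = np.diff(df["sales"].to_numpy()) / df["sales"].to_numpy()[:-1]
print("Average day-over-day growth:", growth.mean())

# matplotlib: visualize the series and its rolling mean
df[["sales", "rolling_mean"]].plot(title="Daily sales (toy data)")
plt.show()
```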
Another thing is that Seaborn is built on top of matplotlib, which means it derives its functionality and methods from matplotlib but offers more flexibility in plotting graphs and better representation in terms of color; aesthetically beautiful plots can be produced with the help of Seaborn. Now we move on to another language, which is R. R is also preferred in the data science community, and data scientists do work with R. R has very strong statistical packages, as it was developed by statisticians, and R is really good for data analysis, but when you take the next step into machine learning, R has its limitations, as we discussed earlier. Now, if we are using R, then we can use the dplyr package. This is used for data analysis, manipulation, and exploration of data; dplyr is a very strong package, and it is a grammar of data manipulation in R. This means that dplyr provides us with a consistent set of functions, and these functions are called verbs: the grammar has a defined vocabulary that has to be preserved, so the functions are called verbs, and they help to perform manipulation on the data. Moving on, we have another package, and that is janitor; this is used for data cleaning, like removing empty values or removing duplicates from your data and so on. One thing to note here is that data scientists really spend most of their time on data cleaning, on transforming the data, on removing missing or empty values; cleaning is the biggest part of what data scientists do, so these libraries can be really helpful when one is building models or working with various complex kinds of data. Now moving on to another library, lubridate: lubridate is for date and time manipulation in R (for Python we have pandas for date and time analysis). This really helps because it already has predefined functions and methods which understand dates and times, so date and time analysis and manipulation can be done very easily with the help of lubridate. Another library that we will cover is ggplot2, for data visualization in R. The gg stands for grammar of graphics. In R, ggplot2 is used to make graphs and plots, and we work with layers in ggplot2; with the help of these layers we can build very complex plots. We can say that ggplot2 produces very aesthetically appealing plots, and it gives us the flexibility to show our data in various structures and formats, so we can always see what the data is saying: we visualize the data with the help of ggplot2 in R. Now I want to tell you something really interesting: why is data visualization important? Let us uncover this. The significance of data visualization can be understood with the help of the image here on the left hand side, known as Anscombe's quartet. Anscombe was an English statistician, and he developed this quartet to demonstrate the importance of data visualization. It comprises four data sets, and these data sets have the same, nearly identical, simple descriptive statistics, but when they are plotted we get a different graph for each data set. This is the importance of visualization: the numbers and figures appear the same, but when they are plotted they are different. So to get the real picture and the meaning of your data, your data needs to be plotted.
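Although this part of the course discusses R and ggplot2, the same point can be shown in Python, since seaborn happens to ship the Anscombe dataset; this is just a sketch and not part of the original course material.

```python
# A quick look at Anscombe's quartet: four datasets with nearly identical summary
# statistics that look completely different once plotted.
import seaborn as sns
import matplotlib.pyplot as plt

anscombe = sns.load_dataset("anscombe")

# The four datasets share nearly identical means and standard deviations...
print(anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))

# ...but the plots tell four very different stories
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()
```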
It's always been said that numbers will not reveal the true picture, but the visualization of that data definitely will. So this was Anscombe's quartet, a very important illustration of why data visualization really matters. Moving on, linked with this are the data visualization tools that a data scientist uses, and one can learn these tools; they can be very handy. A few visualization tools or software packages are Power BI and Tableau. These tools really help you visualize large data sets, and these powerful software packages can connect to multiple data sources in no time; you can build interactive dashboards and powerful visualizations, and you can even go on to create applications with these tools. So knowledge of business intelligence tools is one of the requirements, but it is not that hard: it's easy to learn and it's fun too. Now coming to another concept after programming languages, and that is machine learning. Machine learning here means the ability to understand algorithms, both supervised and unsupervised machine learning algorithms. So now we will discuss supervised and unsupervised machine learning. Supervised is when we work with labeled data, which means the output is known, and unsupervised is when we don't know the output and the algorithm has to figure out which category, class, or cluster the data falls into. Primarily we deal with supervised machine learning, and there are two types of supervised machine learning: classification and regression. In unsupervised machine learning we will see dimensionality reduction and various clustering algorithms, and in clustering there are many algorithms like hierarchical clustering, k-means, and so on. A data scientist has to know about the different kinds of algorithms that can be applied to data to give good results; that's why it's a little vast, but it is really very interesting. So a good command of supervised and unsupervised machine learning algorithms like SVM, Naive Bayes, random forest, and K-means is required of a data scientist. Then there is hyperparameter tuning. This is again important, because hyperparameters are nothing but the settings that control the behavior of an algorithm. You can say that hyperparameter tuning is performed on an algorithm to improve its results, so it is required for model optimization: improving the predictions by tweaking a few parameters via hyperparameter tuning is something a data scientist also does. Now, to learn machine learning in Python, which libraries are required? scikit-learn is the machine learning library in Python. It supports both supervised and unsupervised learning, and it also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and other utilities. So if you're building any model or importing any of the algorithms, the scikit-learn package would be used. Now moving on to R: if one wants to build machine learning models in R, then mlr is the package, or caret is the package. Here mlr has all the important and useful algorithms to perform machine learning tasks, and caret stands for classification and regression training. With caret one can streamline the model building and evaluation process; feature selection and other techniques can also be applied with the help of these libraries.
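Here is a minimal scikit-learn sketch of the ideas above: a supervised model trained and then tuned with a hyperparameter search. The dataset, model, and grid values are arbitrary choices for illustration, not prescriptions from the course.

```python
# A supervised classifier plus hyperparameter tuning with GridSearchCV,
# using a built-in toy dataset so the example is self-contained.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameter grid: these values are arbitrary, just to show the mechanics
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```

GridSearchCV simply retrains the model for every parameter combination and keeps the best one, which is exactly the "iterate and tune" loop described earlier.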
Libraries like scikit-learn, mlr, and caret are very strong and powerful, with consolidated packages; you just need to import them into your project and use them, but knowing how and where to use them is what the data scientist brings. Now moving on to another technique and technology, that is deep learning. When we want to work with complex data sets, when the data set size is huge, traditional machine learning algorithms are not preferred and generally we move on to deep learning; it's a gradual shift from machine learning to deep learning. When we talk about deep learning, what comes to mind is neural networks. When you're working with various kinds of layers and neural networks, deep learning is preferred, and neural networks are a set of algorithms, modeled loosely after the human brain, that recognize patterns, making the system more intelligent by understanding patterns and predicting. So deep learning is more advanced than machine learning, I can say. A few deep learning algorithms are convolutional neural networks (CNNs), RNNs, that is recurrent neural networks, and LSTMs, which are long short-term memory networks, and there are many other deep learning techniques and algorithms. These algorithms help to solve complex problems like image classification, text to speech, or language translation; they are very powerful algorithms which are deployed to solve real-world challenges and make life simpler. Now moving on to natural language processing: what if I say that we have all, at some time or another, had an experience with NLP and we know the end result? Remember Siri and Alexa? Yes, I'm talking about natural language processing. This is a subfield of artificial intelligence, concerned with the interaction between machines and humans. It is used for speech recognition, reading text, voicemail, and virtual assistants like Alexa and Siri, which are also examples of NLP. For R, the deep learning libraries are MXNet, deepnet, deepr, and H2O: MXNet in R is used for feed-forward neural networks and convolutional neural networks, whereas H2O can also be used for the same but additionally supports deep autoencoders. When we move on to deep learning, there are more sophisticated terms and technologies that need to be learned, and to improve model performance one needs to understand how the algorithms work, what their limitations are, and how one can improve the model by iterating over these models. Now moving on from machine learning and deep learning to another tool and technique, and that is statistics. This is yet another important tool. How does a data scientist use statistics? It is used to identify the right questions: data scientists apply statistics to the data to ask the right questions, and once they know what the question is, they can use statistics to find answers. Statistics also helps them understand how reliable their results are and how likely it is that the findings could change. Moving on, statistics helps data scientists do predictive modeling; this means using past data to create models that can be used to predict future events. Another importance of statistics is that it helps a data scientist design and interpret experiments to make informed decisions.
Now let us understand this with the help of an example. An observation was made that advertisement X has a seven percent higher click-through rate than advertisement B. A data scientist would determine whether or not this difference is significant enough to warrant any increased attention, focus, and investment. Here statistics helps us design the experiment, using frequentist statistics, hypothesis testing, and confidence intervals: data scientists work with these tools to understand whether the difference is really important or not, whether advertisement X or advertisement B requires more attention, focus, and investment. So statistics helps to determine these important and crucial decisions. Another use of statistics is to estimate intelligently. Data scientists estimate intelligently, for example by using Bayes' theorem, which takes the results from past experiences and observations and then makes predictions; this is estimating intelligently, and they can also summarize what those estimates mean with the help of Bayes' theorem. So statistics equips a data scientist to solve problems and make data-driven decisions. Now moving on to the most common statistical methods, starting with descriptive statistics. We mentioned descriptive statistics; now let us understand what it is. Descriptive means summary statistics that quantitatively summarize features, and the measures include the central tendency measures listed on the right hand side, like mean, median, and mode, and on the left hand side skewness and variability, which are measured by standard deviation and variance. Variance is nothing but a measure of how much spread there is in the data; skewness tells us where the data is inclined, whether towards the right or the left; and by determining the distribution of the data we understand whether there are any outliers and how the data is really spread. So through the numbers that descriptive statistics give, we understand what the data is. Now moving on to understanding inferential statistics. A data scientist must also be able to use inferential statistics, because inferential means we are taking a small sample from the entire data and making generalizations about the whole: inferring something from a small sample, assuming that it will also apply to the larger population. So we make inferences from a sample about the population. The methods for this include hypothesis testing; probability, which is used to predict the likelihood of an event; regression analysis, where we model the relationship between variables; ANOVA, which is analysis of variance, a test to compare the means of different groups and see how different they are, with decisions taken based on the results; and chi-square, which is used to determine any relationship between variables or to compare expected and observed values. So this is inferential statistics.
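To connect this back to the advertisement X versus advertisement B example above, here is a small sketch of a chi-square test in Python; the click counts are invented, and a 5% significance level is assumed.

```python
# Is ad X's higher click-through rate statistically significant, or could it
# just be noise? The counts below are made up for illustration.
from scipy.stats import chi2_contingency

# rows: ad X, ad B; columns: clicked, did not click
observed = [
    [214, 1786],   # ad X: 2000 impressions, 10.7% CTR
    [200, 1800],   # ad B: 2000 impressions, 10.0% CTR
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")

if p_value < 0.05:
    print("Difference looks significant: worth shifting attention to ad X")
else:
    print("Not enough evidence: the difference could be due to chance")
```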
Now we will move on from statistics to another technology, and that is databases. So let's understand what databases are. Data has to be stored before analyzing it or making predictions, and to store the data we have databases; thus the knowledge of databases again becomes important for a data scientist. Otherwise, every time a data scientist has to retrieve data, he has to look for a person who could help him retrieve it, so it is better that a data scientist is equipped with this knowledge, so that whenever the need arises to access or store data, the data scientist can do it himself. That is why database knowledge is required. Now coming to the types of databases: first we'll talk about relational databases. As the name says, relational means the data has some relation within itself, or between records; the data is related, and it is stored in a tabular format with a predefined schema in the database. What happens is that when the data is in a tabular, clean format, retrieval becomes quite easy and smooth, it is not complicated, and this data is related somehow. Examples of relational databases are MySQL, Microsoft SQL Server, and Oracle databases. Another type of database is the non-relational database. Non-relational means it does not follow any preset schema, and it is not in a tabular format; that is not required here, and we can store any kind of complex or unstructured data we have in a non-relational database. A few examples of non-relational databases, as we can see here, are MongoDB and Apache Giraph. Let me tell you something about Apache Giraph: it is optimized for storing relationships between nodes. Why is that important? Because this captures the connections between entities in a social graph, and this software really makes social network data more informative and provides real-time graph processing. So when you are analyzing huge volumes of data from social network sites, these non-relational databases really help you understand, analyze, and store the data and find relations within it; that is why it is really important to have knowledge of these NoSQL databases. MongoDB, on the other hand, is optimized for storing documents. There are various non-relational databases; I've just listed a few here.
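As a quick illustration of querying a relational database from Python, here is a sketch using the built-in sqlite3 module; the table and the customer rows are made up, and sqlite3 simply stands in for MySQL, SQL Server, or Oracle.

```python
# Working with a relational database: a predefined tabular schema, inserts,
# and a SQL query. Everything here is a toy, in-memory example.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()

# Create: the predefined schema that makes the data "relational"
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert some rows
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Bangalore"), ("Rahul", "Mumbai"), ("Meera", "Bangalore")],
)

# Read: retrieve the data with a query
cur.execute("SELECT city, COUNT(*) FROM customers GROUP BY city")
print(cur.fetchall())   # e.g. [('Bangalore', 2), ('Mumbai', 1)]

conn.close()
```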
Now let's move on from databases to another technology, and that is big data. The ability to analyze unstructured data is relevant in the context of big data. With big data, we are dealing with larger and more complex data sets day by day, and companies are dealing with huge volumes of data, especially from new data sources. These data sets are so voluminous that traditional data processing software cannot manage them; that's why big data technology is really helpful. Big data is a collection of data from diverse data sources, and it has certain characteristics: volume, variety, velocity, and veracity. Now moving on to big data technologies: why are they needed? Because to manage big data we have to use big data technologies, and they are classified into data mining, data analysis, data visualization, and data storage. Not all of these are required by a data scientist; we will particularly deal with data storage, which is the part a data scientist really needs to know. For data storage we have Hadoop, so we will cover Hadoop first. What is Hadoop? The Hadoop framework is there to store and process data in a distributed environment, which means that it can store and analyze data that is spread across different machines, and it does so with very high speed and very low cost. Companies like Microsoft, Intel, and IBM use Hadoop to store and process data. When you're working with huge volumes of data, you need to store and analyze it at very high speed and at low cost, and Hadoop gives you that flexibility, power, and the features you really need. Now coming to another big data storage option, MongoDB. This is again a NoSQL database, and it offers a direct alternative to the rigid schemas used in relational databases. Another strength of MongoDB is that it handles a wide variety of data types at large volumes, and it can also be used in distributed architectures; many companies use MongoDB to handle such data. Another big data option is Apache Spark, which is used for large-scale data processing. Both Apache Spark and Apache Hadoop are open source frameworks for big data processing, but there are some key differences: Hadoop uses MapReduce to process data, while Spark uses resilient distributed datasets, or RDDs. Now moving on to other technologies recommended for a data scientist: Excel. Excel has been used for decades for analysis, and although more advanced and simpler languages like Python and R are out there now, Excel is here to stay. Data scientists can use Excel for basic analysis like formulas, pivot tables, VLOOKUPs, or VBA. VBA is nothing but Visual Basic for Applications, and it can help automate repetitive work and data processing functions, and it can also help you generate custom graphs, forms, and reports. So Excel is a really powerful tool that is still being used even today. Now we move on to web scraping. Scraping libraries like Beautiful Soup in Python, also known as bs4, or Scrapy, are popular Python libraries, but if one is using R, then the rvest package is useful for extracting information from the web. Web scraping does not appear in most of the job responsibilities of a data scientist, but if one is equipped with it and knows how to use various APIs and get data from various resources over the internet, then it can be a real help.
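Here is a tiny sketch of the web scraping idea mentioned above, using requests and Beautiful Soup; the URL is a placeholder, and in practice you should respect a site's robots.txt and terms of use.

```python
# A minimal web scraping sketch with requests and Beautiful Soup (bs4).
import requests
from bs4 import BeautifulSoup

url = "https://example.com"                 # placeholder page
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pull out every heading and link on the page
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

print("Headings:", headings)
print("Links:", links[:5])
```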
Now moving on to Linux. Why is Linux required? Because companies operate on various operating systems, not necessarily Windows, so Linux is the most recommended operating system, I would say, and its commands need to be learned by a data scientist, because when you join a company you'll have to learn those commands anyway. Basic knowledge, not much, just the basic concepts of how to work with Linux commands, will be really helpful, as Linux is considered more secure compared to other operating systems, and companies really want their data and information secured, so a different operating system is used. That is the entire idea behind learning Linux; it is not essential, but it is an option. Now moving on to another technology, that is Git. Git is for code management, or source code management: when you're working in teams and sharing code, various changes can be made to that code by various people, and Git provides one source into which all the changes can be merged, so you get a single, consolidated version. This is very handy, it saves a lot of time, and collaboration becomes very easy; you can allow multiple changes to be merged into a single source, and it is basically a version control system, very useful again. So Git knowledge, not necessarily every command, but at least knowing how to work with Git for code management, is needed. Moving on, another option is cloud technology. A few companies are now really looking for data scientists who also have knowledge of cloud technologies like Azure or AWS, so that you can gain access to storage, software, and servers over the internet: how do you access and retrieve information from the cloud using AWS, Azure, or whichever technologies the company is using? The benefit for a company moving to the cloud is that it provides unlimited storage space and very high speed, the backup and restore capabilities are good, it is cost effective, and it saves much of the infrastructure cost for companies too. That's why companies are really moving to the cloud, and they need people who can work with the cloud as well as deploy machine learning models; so this is preferred, and sometimes it becomes important too. Now moving on to how to structure the learning process. The above list of subjects can be really overwhelming and can demotivate a serious learner in the long run, so how should one structure the learning process to scale up the ladder of a data scientist? All these topics need not be learned in one go; parallel learning is recommended here. By parallel learning I mean learning Python first and then moving on to data analysis and visualization with the help of various projects, building machine learning models with one programming language, which will give you the idea of how algorithms really work on data, understanding the answers or outputs from the data, and then tweaking certain hyperparameters and improving those results. Once you are done with machine learning, move on to deep learning, and apply statistical analysis, both inferential and descriptive statistics, because we draw the statistical description from the data and understand how it works through various plots as well; statistics is really helpful in understanding the results. These learnings are used in unison, not individually, so knowledge of all these things is really required. [Music] So there are several reasons why somebody should become a data scientist, but we're going to take a look at the most compelling ones, which are growing demand, high salary, low competition, and diverse domains. Let's go over them one by one. First up, growing demand: the U.S. Bureau of Labor Statistics estimates that there will be around 11.5 million data science jobs by the year 2026.
That says a lot about the field itself, and the reason behind such a humongous growth rate is that organizations all across the globe have realized that data science is pertinent and very important in crucial decision making, and so all the big organizations across the world are willing to pay hefty salaries to data scientists. That brings us to the second reason why somebody should become a data scientist, and that is high salary. In India alone, a data scientist can easily make, on average, over nine lakhs; in the United States that number is 120,000 US dollars. Data scientists with more than five years of experience can make 20 to 30 lakhs or even more, and in the United States they can make 200,000 to 500,000 dollars per annum. So those are some attractive figures, and keep in mind they don't include bonuses, incentives, benefits, and allowances; those are some crazy good figures. The third reason is low competition. Organizations all across the globe have realized the importance of data science, and so they're hiring data scientists left, right, and center, but the problem for them is that it's hard to find good data scientists. In other words, the demand for data scientists is very high and the supply of good data scientists is considerably low. The fourth reason is diverse domains. If you choose to become a data scientist, you can actually end up in any domain, for example product manufacturing, energy, retail, marketing, healthcare, and a dozen more, and the implementation of data science is increasing day by day. As each day passes, people are discovering more ways to use data science in different domains, so a data scientist's job is never going to be boring, because they can move between domains and help different sorts of industries; it's always going to be challenging and it's always going to be fun. So now that we have seen that it's a good idea to become a data scientist if you're data oriented, let's find out what a data scientist does, as in, what are their roles and responsibilities. The first thing that they do is ask the right questions to begin the discovery process. What I mean by that is that data scientists are hired to solve problems for a company or to help them make crucial decisions; in order to solve any big problem, they have to ask the right questions, which will lead them to the right data to collect and the processes and models they need to use or build to eventually pull out the right answers that will help a business grow or get rid of its problems. After figuring out what data to collect, they actually have to clean and prep the data for analysis. This is referred to as ETL, which stands for extract, transform, and load. Extraction is the process of acquiring the data from various sources, which could be different databases like MySQL, MongoDB, and others. Transformation is the process where missing values in the data are filled in and undesirable values are replaced with correct ones; then all that data from different sources is merged to make clean, uniform data that can be integrated and stored in the organization's existing database systems. So what would be next? Loading: loading refers to the process where the data is integrated and stored in the organization's databases. So after ETL, they have data that is ready and stored to perform analysis, investigation, and exploration on.
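As a rough sketch of the ETL flow just described, here is what it might look like with pandas and SQLite; the file name, column names, and target database are all hypothetical.

```python
# Extract raw data, transform it (fill missing values, fix types, drop
# duplicates), and load the clean result into a database.
import pandas as pd
import sqlite3

# Extract: read raw data from a source (a hypothetical CSV export)
raw = pd.read_csv("orders_export.csv")

# Transform: clean missing and undesirable values, standardize columns
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
raw["amount"] = raw["amount"].fillna(raw["amount"].median())
raw["country"] = raw["country"].str.strip().str.upper()
clean = raw.drop_duplicates(subset="order_id")

# Load: store the cleaned, uniform data in the organization's database
conn = sqlite3.connect("warehouse.db")
clean.to_sql("orders", conn, if_exists="replace", index=False)
conn.close()
```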
So what do they do next? Well, of course, they perform initial data investigation and exploratory data analysis to make sure that everything they collected is useful in the context of solving that problem, and this also gives them an idea as to which models and algorithms they need to use in order to extract meaningful insights. After they perform this investigative and exploratory data analysis, they choose the appropriate models and algorithms that will identify patterns and trends in the data. They use different data science techniques such as machine learning, statistical modeling, and sometimes deep learning as well, in order to build models that give the results they're looking for. After that, they move on to checking the accuracy of the models and improving them if needed. Once they are satisfied with the models, their accuracy, and the answers they're getting, they move on to making reports, so that they can distill all of these insights into a report that can be presented to the stakeholders. This is usually the last step, where a decision is made using the insights and recommendations from the data scientists and their reports. It is also at this stage that data scientists may make some adjustments based on the feedback they get from stakeholders. Even once the decision is made and the problem is solved, they continue to adjust and retrain the models if needed, so they keep providing the desired results that will help guide the business. So I hope this was clear; why don't we move on to the next section, which is the skills required for a data scientist. After going through a lot of job descriptions on sites like Indeed, Glassdoor, and more, we've found that these are the basic skills required if you want to make it as a data scientist. First up, mathematics. Mathematics forms the foundation for data science; in mathematics you should know statistics and probability, and if you want to become the best data scientist possible, you'll also learn linear algebra and calculus. We're going to dive into the details of what you should learn in mathematics, but for now let's move on. The second skill that you need is programming. You don't have to be a master at programming; you need to know enough to use the libraries, handle files, and process data. The third skill that you should have is data science and machine learning: you should know all the concepts of data science and machine learning, and you should know which concepts to apply in which scenarios. Moving on, the fourth skill is deep learning. It is very useful for an aspiring data scientist to not only know the concepts of data science and machine learning but also deep learning, so they are able to build neural networks to produce predictive models. The fifth skill that a data scientist needs to have is data visualization: they should be able to take all the insights from the models and systems they've built and put them in the form of a report where they can explain the insights in the simplest way possible, so that even a layman who doesn't know technical terms is able to understand what the data is trying to show. And the last skill that data scientists should know is big data. Data is all around us and we're producing more and more of it every day, data so huge that it's hard for normal people to comprehend, but data scientists need to know how to process big data. Processing big data and drawing insights from it requires a different set of tools than processing normal data sets, so it is essential for data scientists to know how to work with big data. I hope you got a picture of the kind of skills required to fulfill the roles and responsibilities we went over.
Apart from these technical skills, there are some soft skills that you need to be very good at as well, for example good communication: being able to abstract complex ideas and explain them to stakeholders and business clients. Now that we have a good picture of the roles and responsibilities of a data scientist and the kind of skills required to fulfill them, let's quickly find out how to become a data scientist by showing you a roadmap. The first point on the journey to becoming a data scientist is to make sure that you're well versed in the mathematics for data science. What do I mean by mathematics? Well, not all mathematics is essential in order to become a data scientist, but it's very useful to know four distinct areas of math, namely linear algebra, calculus, probability, and statistics. You can probably get away with just learning probability and statistics, but if you want to become a really good data scientist, you will have to incorporate machine learning math, which entails linear algebra and calculus. So what do we need to learn in these topics? In linear algebra you should know what a scalar, a vector, and a matrix are, and how to perform different types of operations on each of them. Then you should also learn the application of linear algebra in machine learning: how is linear algebra actually used in machine learning? There would be no point in learning linear algebra without knowing its application in machine learning; it's like having a tool and not knowing what that tool does. The next subject you should learn is calculus. In calculus, please learn differentiation rules, partial differentiation, and how all of these apply in machine learning. But those two topics we just talked about, linear algebra and calculus, are relatively optional if you just want to become an entry-level data scientist; the two most important topics are probability and statistics. Let's take a look at probability: you should know the rules of probability, dependent and independent events, conditional, marginal, and joint probability using Bayes' theorem, probability distributions, and the central limit theorem. And in statistics you should know all the terminology, numerical parameters like mean, mode, and median, sensitivity, entropy, sampling techniques, types of statistics, hypothesis testing, data clustering, and regression modeling. The concepts of probability and statistics that we just talked about will form the basis that you use to process data and gain insights. As you're learning mathematics for data science, it's also a good idea to start acquiring programming skills. So for programming skills, what do you need to know? Two of the most famous programming languages are R and Python, but Python especially is used all over the world; in India both Python and R are used heavily, so go with either one of them and you shouldn't have any problems. Remember, I mentioned earlier that you don't have to be a programming wizard in order to get good at data science; that is absolutely true. Let's say you're learning Python programming: you start with the basic syntax and what data structures are, after that you should start learning file operations, functions, and object-oriented programming, and how to use different modules in order to process data; you should also know how to handle exceptions. And as you're learning the math, it's very good to start learning the libraries like NumPy, pandas, and matplotlib.
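For a concrete taste of the linear algebra terms above, here is a small NumPy sketch; the arrays and the toy "linear model" at the end are made up purely for illustration.

```python
# Scalars, vectors, matrices, and a couple of the operations that show up
# again inside machine learning.
import numpy as np

scalar = 3.0
vector = np.array([1.0, 2.0, 3.0])
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])          # shape (3, 2)

print(scalar * vector)                    # scalar multiplication
print(np.dot(vector, vector))             # dot product -> a scalar
print(matrix.T @ matrix)                  # matrix multiplication, shape (2, 2)

# This is essentially what a linear model computes: predictions = X @ weights
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
weights = np.array([0.4, 1.2])
print(X @ weights)                        # one prediction per row of X
```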
If you're learning the R programming language, learn the same concepts, and for libraries it is essential to learn e1071 and rpart for R programming. Apart from Python or R, you should also know a bit of SQL, as it forms the basis for understanding any type of database. You don't have to master SQL; just try to understand CRUD, which is create, read, update, and delete data in relational database systems. This is going to give you a good idea of how databases work and how to query them, how to modify, transform, and extract data. So let's say you've been learning math and programming skills and you've done some practical examples or real-world scenarios; now would be a good time to start learning different concepts in data science and machine learning. Why don't we take a look at data science first: you should obviously understand what data science is, you should know what the data analysis pipeline is, then data extraction, types of data, and techniques like data wrangling, exploratory data analysis, and data visualization. And when it comes to machine learning, you should definitely know what types of machine learning models there are, namely supervised learning, unsupervised learning, and reinforcement learning, and there are various algorithms within each of them, so you should know those as well. Apart from these, you should also learn dimensionality reduction concepts and time series analysis, and then, depending on the kind of situation and problem at hand, you should also learn how to do model selection and boosting. At this point you're getting quite good and deep into data science, and as you learn all of these topics one by one, you should do practical examples using Python or R and make sure that you really have a grasp of all these concepts, because these are the core concepts that will help you get your foot in the door and help you become a data scientist. So what is next? The fourth skill is deep learning concepts and knowing how to implement them. In deep learning you should know the single layer perceptron, TensorFlow 2.0, convolutional neural networks, region-based CNNs, Boltzmann machines and autoencoders, generative adversarial networks, emotion and gender detection, RNNs, and GRU and LSTM networks. If you haven't heard about these concepts, then you must be feeling overwhelmed; let me put your mind at ease. These are not too difficult; especially after learning machine learning and data science concepts you should be able to pick them up quite easily. A big part of becoming a data scientist is to make sure that whatever you're learning conceptually, you are able to implement in real-life scenarios, so just learn the basics and know how to implement them. Moving on, the next skill that you should have is big data and its tools, like PySpark, Hadoop, and so on. Now we are entering the realm of big data; for a data scientist it is very essential, especially these days, to know how to handle big data. So you might ask, what are the different big data tools and concepts? Here they are: big data, Hadoop and Spark, the Apache Spark framework, Spark RDDs, DataFrames and Spark SQL, machine learning using Spark MLlib, understanding Apache Kafka and Apache Flume, Apache Spark streaming, and processing multiple batches and data sources. Again, don't dive too deep into these topics; you should just know the basics, don't be overwhelmed, and move on to the next thing. The last skill to acquire is data visualization, and for this you should know tools like Tableau.
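Here is a minimal PySpark sketch touching a few of the Spark concepts listed above (SparkSession, DataFrames, Spark SQL); the CSV path and column names are placeholders.

```python
# A SparkSession, a distributed DataFrame, and the same query written both
# with the DataFrame API and with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-science-demo").getOrCreate()

# Read a (hypothetical) large CSV into a distributed DataFrame
df = spark.read.csv("sales_data.csv", header=True, inferSchema=True)

# DataFrame API
df.groupBy("region").count().show()

# The same aggregation through Spark SQL
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, COUNT(*) AS orders FROM sales GROUP BY region").show()

spark.stop()
```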
In data visualization you should learn basic visual analytics, advanced visual analytics, calculations, level of detail (LOD) expressions, geographical visualizations, advanced charts, and dashboards and stories. Obviously you can use different tools, but two of the tools used most widely across the world are Tableau and Power BI. These are very powerful tools, they're very easy to use and very user friendly, and learning them will enable you to represent data in a very spectacular fashion. So, wrapping up: you should start your journey with mathematics for data science, and along with that you can start learning programming skills as well; then learn data science and machine learning fundamentals and concepts and know how to implement them using Python or R; then move on to learning deep learning concepts and know how to implement those as well; after that you should know how big data works and which tools are the industry standard, like we discussed; and finally you should learn how to visualize the different insights, patterns, and trends that you have gathered by implementing different models and systems. And I can guarantee that if you follow this roadmap judiciously, then you will become a data science wizard. [Music] Moving ahead, let's look at data scientist job trends. This is a graph of the number of job openings for data scientists per one million postings. What we can infer from the graph is that data scientist jobs have a growing trend, and this trend is expected to continue in the coming years. Apart from this, statistics also say that 97 percent of data scientist jobs are on a full-time basis, whereas only three percent are part-time. Also, one of the important things for getting a job is mastering the required skills. The most in-demand skills that every company looks for in a data scientist are listed here, along with the corresponding percentage of job openings. This doesn't necessarily mean you need to master all these skills in order to find a job as a data scientist; in fact, mastering just three core skills, which are Python, R, and SQL, can provide a solid foundation for 7 out of 10 data scientist job openings today. Now let's look at the average data scientist salary based on different criteria. First, let's look at the salary based on degree. As we can see, both in India and the United States there is a clear correlation between degree level and the associated salary: the higher the degree, the better the salary, hence PhD holders earn the most as data scientists compared to others. The next criterion is experience. Your experience plays a crucial role in deciding your salary, be it in India or the United States; typically, more experience in the relevant field results in higher pay, but only up to a certain point. In case of a stagnant salary even with great experience, consider learning new skills to stand out in the job market. The next criterion we are going to discuss is location. Not surprisingly, geographic location is one of the biggest factors when it comes to how much you can earn in a given profession. As a rule of thumb, salaries are going to be higher in larger cities, such as Mumbai in India or New York in the United States, than they are in rural locations; however, it's important to remember that the cost of living in rural areas is much lower than in big cities. Considering these factors, we have found that cities like Bangalore in India and San Francisco in the United States still top the list for providing a high salary. The last criterion we are going to look at is companies. There are a large number of companies that are now looking for data scientists,
but it's essential to find companies that commonly hire data scientists and also provide a good pay hence we have made a list of companies that are ranked similar in India and the United States for data scientists salary the companies include Oracle JP Morgan Intel Amazon IBM and Accenture similarly you can see the salary in each company for a data scientist in the United States now let's quickly discuss the future scope for data scientists data scientists are high in demand in Industries like healthcare transport e-commerce cyber security along with Aviation and Airlines the healthcare industry is extensively making use of data scientists to develop a system that can predict health risks they are also being hired for the process of drug Discovery to provide insights into optimizing and increasing the success rate of predictions the Travel and Transport industry always had to handle large amount of data conventional ways of handling data is no more enough hence using data scientist efficient data analysis and prediction can be done to improve business process and customer service in the e-commerce industry to improve user experience and personalized service and suggestion they use customer data for handling this user data companies need data science professionals to understand customer behavior and then Market them the correct product the next industry is cyber security due to an increase in online transactions and internet usage fraudulent activities have also increased organizations are now adopting data scientists who apply techniques to detect such fraudulent activities and to prevent losses the last industry that we are going to discuss is Aviation and Airlines in the aviation and Airlines industry companies use data for putting up their prices optimizing routes and carrying out preemptive maintenance to develop each of this system requires knowledge in the field of data science this leads to substantial increase in the need of data scientists [Music] now let's move ahead and look at the data life cycle so guys are basically six steps in the data life cycle it starts with a business requirement next is the data acquisition after that you'll process the data which is called data processing then there is data exploration modeling and finally deployment so guys before you even start on a data science project it is important that you understand the problem you're trying to solve so in this stage you're just going to focus on identifying the central objectives of the project and you'll do this by identifying the variables that need to be predicted next up we have data acquisition okay so now that you have your objectives defined it's time for you to start Gathering the data so data mining is the process of gathering your data from different sources at this stage some of the questions you can ask yourself is what data do I need for my project where does it live how can I obtain it and what is the most efficient way to store and access all of it next up there is data processing now usually all the data that you collected is a huge mess okay it's not formatted it's not structured it's not cleaned so if you find any data set that is cleaned and it's packaged well for you then you've actually won the lottery because finding the right data takes a lot of time and it takes a lot of effort and one of the major time consuming tasks in the data science process is data cleaning okay this requires a lot of time it requires a lot of effort because you have to go through the entire data set to find out 
any missing values or if there are any inconsistent values or corrupted data and you also find the unnecessary data over here and you remove that data so this was all about data processing next we have data exploration so now that you have sparkling clean set of data you are finally ready to get started with your analysis okay the data exploration stage is basically the brainstorming of data analysis so in order to understand the patterns in your data you can use histograms you can just pull up a random subset of data and plot a histogram you can even create interactive visualizations this is the point where you dive deep into the data and you try to explore the different models that can be applied to your data next up we have data modeling so after processing the data what you're going to do is you're going to carry out model training okay now model training is basically about finding a model that answers the questions more accurately so the process of model training involves a lot of steps so firstly you'll start by splitting the input data into the training data set and the testing data set okay you're going to take the entire data set and you're going to separate it into two parts one is the training and one is the testing data after that you'll build the model by using the training data set and once you're done with that you'll evaluate the training and the test data set now to evaluate the training and testing data set you'll be using series of machine learning algorithms after that you'll find out the model which is the most suitable for your business requirement so this was mainly data modeling okay this is where you build a model out of your training data set and then you evaluate this model by using the testing data set next we have deployment so guys the goal of this stage is to deploy the model into a production or maybe a production like environment so this is basically done for final use acceptance and the users have to validate the performance of the models and if there are any issues with a model or any issues with the algorithm then they have to be fixed in this stage [Music] let's move ahead and take a look at what is data now this is a quite simple question if I ask any of you what is data you'll see that it's a set of numbers or some sort of documents that I've stored in my computer now data is actually everything all right look around you there is data everywhere each click on your phone generates more data than you know now this generated data provides insights for analysis and helps us make Better Business decisions this is why data is so important to give you a formal definition data refers to facts and statistics collected together for reference or analysis all right this is the definition of data in terms of statistics and probability so as we know data can be collected it can be measured and analyzed it can be visualized by using statistical models and graphs now data is divided into two major subcategories all right so first we have qualitative data and quantitative data these are the two different types of data under qualitative data we have nominal and ordinal data and under quantitative data we have discrete and continuous data now let's focus on qualitative data now this type of data deals with characteristics and descriptors that can't be easily measured but can be observed subjectively now qualitative data is further divided into nominal and ordinal data so nominal data is any sort of data that doesn't have any order or ranking okay an example of nominal 
data is gender now there is no ranking in gender there's only male female or other right there is no one two three four or any sort of ordering in gender race is another example of nominal data now ordinal data is basically an ordered series of information okay let's say that you went to a restaurant okay your information is stored in the form of customer ID all right so basically you are represented with a customer ID now you would have rated uh their uh Service as either good or average all right that's how ordinal data is and similarly they'll have a record of other customers who visit the restaurant along with their ratings all right so any data which has some sort of sequence or some sort of order to it is known as ordinal data all right so guys this is pretty simple to understand now let's move on and look at quantitative data so quantitative data basically deals with numbers and things okay you can understand that by the word quantitative itself quantitative is basically quantity right so it deals with numbers it deals with anything that you can measure objectively all right so there are two types of quantitative data that is discrete and continuous data now discrete data is also known as categorical data and it can hold a finite number of possible values now the number of students in a class is a finite number all right you can't have infinite number of students in a class let's say in your fifth grade there were 100 students in your class all right there weren't infinite number but there was a definite finite number of students in your class okay that's discrete data next we have continuous data now this type of data can hold infinite number of possible values okay so when you say weight of a person is an example of continue is data what I mean to see is my weight can be 50 kgs or it can be 50.1 kgs or it can be 50.001 kgs or 50.0001 or is 50.023 and so on right there are infinite number of possible values right so this is what I mean by continuous data all right this is the difference between discrete and continuous data and also I'd like to mention a few other things over here now uh there are a couple of types of variables as well all right we have a discrete variable and we have a continuous variable discrete variable is also known as a categorical variable all right it can hold values of different categories let's say that you have a variable called a message and there are two types of values that this variable can hold let's say that your message can either be a Spam message or a non-spam message okay that's when you call a variable as discrete or categorical variable all right because it can hold values that represent different categories of data now continuous variables are basically variables that can store infinite number of values so the weight of a person can be denoted as a continuous variable all right let's say there is a variable called weight and it can store in finite number of possible values that's why we'll call it a continuous variable so guys basically variable is anything that can store a value right so if you associate any sort of data with a variable then it will become either discrete variable or continuous variable that is also dependent and independent type of variables now we won't discuss all of that in depth because that's pretty understandable I'm sure all of you know what is independent variable and dependent variable right dependent variable is any variable whose value depends on any other independent variable so guys that much knowledge I expect 
all of you to have. All right, so now let's move on and look at our next topic, which is what is statistics. Coming to the formal definition, statistics is an area of applied mathematics which is concerned with data collection, analysis, interpretation and presentation. Now usually when I speak about statistics, people think statistics is all about analysis, but statistics has other parts to it: data collection is also a part of statistics, and data interpretation, presentation and visualization all come into statistics as well. You're going to use statistical methods to visualize data, to collect data and to interpret data. So this area of mathematics deals with understanding how data can be used to solve complex problems. Now I'll give you a couple of examples that can be solved by using statistics. Let's say that your company has created a new drug that may cure cancer — how would you conduct a test to confirm the drug's effectiveness? Now even though this sounds like a biology problem, it can be solved with statistics: you will have to design a test which can confirm the effectiveness of the drug. This is a common problem that can be solved using statistics. Let me give you another example: you and a friend are at a baseball game, and out of the blue he offers you a bet that neither team will hit a home run in that game — should you take the bet? Here you'll just work out the probability of whether you'll win or lose, and this is another problem that comes under statistics. Let's look at another example: the latest sales data has just come in, and your boss wants you to prepare a report for management on places where the company could improve its business — what should you look for, and what should you not look for? Now this problem involves a lot of data analysis. You'll have to look at the different variables that are causing your business to go down, or you'll have to look at a few variables that are increasing the performance of your models and thus growing your business. So this involves a lot of data analysis, and the basic idea behind data analysis is to use statistical techniques in order to figure out the relationship between different variables or different components in your business. Okay, so now let's move on and look at our next topic, which is basic terminologies in statistics. Now before you dive deep into statistics, it is important that you understand the basic terminologies used in statistics, and the two most important terminologies are population and sample. Throughout the statistics course, or throughout any problem that you're trying to solve with statistics, you will come across these two words, population and sample. Now a population is a collection or a set of individuals or objects or events whose properties are to be analyzed — so basically you can refer to the population as the subject that you're trying to analyze. A sample, just like the word suggests, is a subset of the population. You have to make sure that you choose the sample in such a way that it represents the entire population; it shouldn't focus on one part of the population, instead it should represent the entire population — that's how your sample should be chosen. A well-chosen sample will contain most of the information about a particular population parameter. Now you must be wondering, how can one choose a sample that best represents the entire population? Sampling is a statistical method that deals with the selection
of individual observations within a population so sampling is performed in order to infer statistical knowledge about a population all right if you want to understand the different statistics of a population like the mean the median the mode or the standard deviation or the variance of a population then you're going to perform sampling all right because it's not reasonable for you to study a large population and find out the mean median and everything else so why is sampling performed you might ask what is the point of sampling we can just study the entire population now guys think of a scenario wherein you're asked to perform a survey about the eating habits of teenagers in the U.S so at present there are over 42 million teens in the U.S and this number is growing as we are speaking right now correct is it possible to survey each of these 42 million individuals about their health is it possible well it might be possible but this will take forever to do now obviously it's not it's not reasonable to go around knocking each door and asking for what does your teenage son eat and all of that all right this is not very reasonable that's why sampling is used it's a method wherein a sample of the population is studied in order to draw inference about the entire population so it's basically a shortcut to starting the entire population instead of taking the entire population and finding out all the solutions you're just going to take a part of the population that represents the entire population and you're going to perform all your statistical analysis your inferential statistics on that small sample all right and that sample basically represents the entire population all right so I'm sure I've made this clear to y'all what is sample and what is population now there are two main types of sampling techniques that I'll discuss today we have probability sampling and non-probability sampling now in this video we'll only be focusing on probability sampling techniques because non-probability sampling is not within the scope of this video all right we'll only discuss the probability part because we're focusing on statistics and probability correct now again under probability sampling we have three different types we have random sampling systematic and stratified sampling all right and just to mention the different types of non-probability samplings we have snowball quota judgment and convenient sampling all right now guys in this session I'll only be focusing on probability so let's move on and look at the different types of probability sampling so what is probability sampling it is a sampling technique in which samples from a large population are chosen by using the theory of probability all right so there are three types of probability sampling all right first we have the random sampling now in this method each member of the population has an equal chance of being selected in the sample all right so each and every individual or each and every object in the population has an equal chance of being a part of the sample that's what random sampling is all about okay you are randomly going to select any individual or any object so this way each individual has an equal chance of being selected correct next we have systematic sampling now in systematic sampling every nth record is chosen from the population to be a part of the sample all right now uh refer this image that I've shown over here out of these six groups every second group is chosen as a sample okay so every second record is chosen here and this is 
our systematic sampling works okay you're randomly selecting the nth record and you're going to add that to your sample next we have stratified sampling now in this type of technique a stratum is used to form samples from a large population so what is a stratum a stratum is basically a subset of the population that shares at least one common characteristics so let's say that your population has a mix of both male and female so you can create two stratums out of this one will have only the male subset and the other will have the female subset all right this is what freedom is it is basically a subset of the population that shares at least one common characteristics all right in our example it is gender so after you've created a stratum you're going to use random sampling on these stratums and you're going to choose a final sample so random sampling meaning that all of the individuals in each of the stratum will have an equal chance of being selected in the sample correct so Guys these were the three different types of sampling techniques now let's move on and look at our next topic which is the different types of statistics so after this we'll be looking at the more advanced concepts of Statistics all right so far we discussed the basics of Statistics which is basically what is statistics the different sampling techniques and the terminologies and statistics all right now we look at the different types of statistics so there are two major types of Statistics descriptive statistics and inferential statistics in today's session we'll be discussing both of these types of Statistics in depth right we'll also be looking at a demo which I'll be running in the r language in order to make you understand what exactly descriptive and inferential statistics is so guys we're just going to look at the basics so don't worry if you don't have much knowledge I'm explaining everything from the basic level all right so guys descriptive statistics is a method which is used to describe and understand the features of specific data set by giving a short summary of the data okay so it is mainly focused upon the characteristics of data it also provides a graphical summary of the data now in order to make you understand what descriptive statistics is let's suppose that you want to gift all your classmates a t-shirt so to study the average shirt size of a student in a classroom so if you were to use descriptive statistics to study the average shirt size of students in your classroom then what you would do is you would record the shirt size of all students in the class and then you would find out the maximum minimum and average shirt size of the club okay so coming to uh inferential statistics inferential statistics makes inferences and predictions about a population based on the sample of data taken from the population okay so in simple words it generalizes a large data set and it applies probability to draw a conclusion okay so it allows you to infer data parameters based on a statistical model by using sample data so if we consider the same example of finding the average short size of students in a class in inferential statistics you will take a sample set of the class which is basically a few people from the entire class all right you already have had grouped the class into large medium and small all right in this method you basically build a statistical model and expand it for the entire population in the class so guys that was a brief understanding of descriptive and inferential statistics so that's the 
difference between descriptive and inferential. Now in the next section we'll go in depth about descriptive statistics, all right, so let's discuss more about descriptive statistics. Like I mentioned earlier, descriptive statistics is a method that is used to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. There are two important measures in descriptive statistics: we have measures of central tendency, which are also known as measures of center, and we have measures of variability, which are also known as measures of spread. Measures of center include the mean, median and mode. Now what are measures of center? Measures of center are statistical measures that represent the summary of a data set, and the three main measures of center are the mean, the median and the mode. Coming to measures of variability or measures of spread, we have range, interquartile range, variance and standard deviation. So now let's discuss each of these measures in a little more depth, starting with the measures of center. Now I'm sure all of you know what the mean is — the mean is basically the average of all the values in a sample. How do you measure the mean? I hope all of you know how the mean is measured: if there are 10 numbers and you want to find the mean of these 10 numbers, all you have to do is add up all the 10 numbers and divide by 10. The n in the formula represents the number of samples in your data set; since we have 10 numbers, we're going to divide by 10, and this will give us the average or the mean. So to better understand the measures of central tendency, let's look at an example. The data set over here is basically the cars data set and it contains a few variables: it has a column called cars, it has mileage per gallon, cylinder type, displacement, horsepower and rear axle ratio — all of these measures are related to cars. So what you're going to do is use descriptive analysis and analyze each of the variables in the sample data set for the mean, standard deviation, median, mode and so on. Let's say that you want to find out the mean or the average horsepower of the cars among this population of cars. Like I mentioned earlier, what you'll do is take the average of all the values, so in this case we'll take the sum of the horsepower of each car and we'll divide that by the total number of cars. That's exactly what I've done here in the calculation part: this 110 basically represents the horsepower for the first car, and similarly I've just added up all the values of horsepower for each of the cars and I've divided it by 8.
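Just to make that arithmetic concrete, here is a minimal sketch of the calculation in Python. The full horsepower column from the slide isn't reproduced in this section, so the eight values below are placeholders chosen only so that the average works out to the quoted 103.625 — treat them as illustrative, not as the actual data.

```python
# Mean = sum of all values divided by the number of values (n).
# Hypothetical horsepower values for 8 cars, chosen so the mean
# matches the 103.625 quoted on the slide.
horsepower = [110, 110, 93, 96, 90, 109, 110, 111]

mean_hp = sum(horsepower) / len(horsepower)
print(mean_hp)  # 103.625
```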
Now 8 is basically the number of cars in our data set, so 103.625 is our mean, or the average horsepower. All right, now let's understand what the median is with an example. To define it, the median is the measure of the central value of the sample set — you can say that it is the middle value. So if we want to find out the center value of the mileage per gallon among the population of cars, first what we'll do is arrange the mpg values in ascending or descending order and choose the middle value. In this case we have eight values, which is an even number of entries, so whenever you have an even number of data points or samples in your data set, you're going to take the average of the two middle values. If we had nine values over here, we could easily figure out the middle value and choose that as the median, but since there is an even number of values, we're going to take the average of the two middle values. So 22.8 and 23 are my two middle values, and taking the mean of those two I get 22.9, which is my median. Lastly, let's look at how the mode is calculated. What is the mode? The value that is most recurrent in the sample set is known as the mode, or basically the value that occurs most often. So let's say that we want to find out the most common type of cylinder among the population of cars — all we have to do is check the value which is repeated the most number of times. Here we can see that the cylinders come in two types: we have cylinders of type 4 and cylinders of type 6. Take a look at the data set and you can see that the most recurring value is 6 — if you count, we have three cars with type 4 cylinders and five cars with type 6 cylinders, so our mode is going to be 6, since 6 is more recurrent than 4. So guys, those were the measures of center or the measures of central tendency. Now let's move on and look at the measures of spread. What is a measure of spread? A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population — you can think of it as some sort of deviation in the sample. You measure this with the help of the different measures of spread: we have range, interquartile range, variance and standard deviation. Now range is pretty self-explanatory — it is a measure of how spread apart the values in a data set are. The range can be calculated as shown in this formula: you're basically going to subtract the minimum value in your data set from the maximum value in your data set; that's how you calculate the range of the data. Next we have the interquartile range. Before we discuss the interquartile range, let's understand what a quartile is. Quartiles basically tell us about the spread of a data set by breaking the data set into different quarters — just like how the median breaks the data into two parts, the quartiles break it into quarters. To better understand how the quartiles and the interquartile range are calculated, let's look at a small example. This data set basically represents the marks of 100 students ordered from the lowest to the highest scores, so the quartiles lie in the following ranges. Now the first quartile, which is
also known as q1 it lies between the 25th and the 26th observation all right so if you look at this I've highlighted the 25th and the 26th observation so how you can calculate q1 or first quartile is by taking the average of these two values all right since both the values are 45 when you add them up and divide them by 2 you'll still get 45. now the second quartile or Q2 is between the 50th and the 51st observation so you're going to take the average of 58 and 59 and you'll get a value of 58.5 now this is my second quarter the third quartile or Q3 is between the 75th and the 76th observation here again you will take the average of the two values which is the 75th value and the 76th value all right and you'll get a value of 71 all right so guys this is exactly how you calculate the different quarters now let's look at what is interquartile range so IQR or the interquartile range is a measure of variability based on dividing a data set into quartiles now the interquartile range is calculated by subtracting the q1 from Q3 so basically Q3 minus q1 is your IQR so your IQR is your Q3 minus q1 all right now this is how each of the quartiles are each quartile represents a quarter which is 25 percent all right so guys I hope all of you are clear with interquartile range and what are quartiles now let's look at variance now variance is basically a measure that shows how much a random variable differs from its expected value okay it's basically the variance in any variable now variance can be calculated by using this formula right here x basically represents any data point in your data set n is the total number of data points in your data set and X bar is basically the mean of data points all right this is how you calculate variance variance is basically uh Computing the squares of deviations okay that's why it says s squared there now let's look at what is deviation deviation is just the difference between each element from the mean okay so it can be calculated by using this simple formula where X I basically represents a data point and mu is the mean of the population all right this is exactly how you calculate deviation Now population variance and Sample variance are very specific to whether you're calculating the variance in your population data set or in your sample data set that's the only difference between population and Sample variance so the formula for population variance is pretty explanatory so x i is basically each data point mu is the mean of the population n is the number of samples in your data set all right now let's look at sample variance Now sample variance is the average of square differences from the mean all right here x i is any data point or any sample in your data set X bar is the mean of your sample all right it's not the mean of your population it's the mean of your sample and if you notice n here is a smaller n is the number of data points in your sample and this is basically the difference between sample and population variance I hope that is clear coming to standard deviation is the measure of dispersion of a set of data from its mean all right so it's basically the deviation from your mean that's what standard deviation is now to better understand how the measures of spread are calculated let's look at a small use case so let's say Daenerys has 20 dragons they have the numbers 9 2 5 4 and so on as shown on the screen what you have to do is you have to work out the standard deviation all right in order to calculate the standard deviation you need to know the mean right 
So first you're going to find out the mean of your sample set. How do you calculate the mean? You add all the numbers in your data set and divide by the total number of samples, so you get a value of 7 here. Then you calculate the right-hand side of your standard deviation formula: from each data point you're going to subtract the mean and you're going to square that. When you do that you'll get the following result — you'll basically get 4, 25, 4, 9, 25 and so on. Finally, you'll find the mean of these squared differences, and your standard deviation will come up to 2.983 once you take the square root. So guys, it's pretty simple — it's a simple mathematical technique, all you have to do is substitute the values in the formula. I hope this was clear to all of you.
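Here is a small sketch of that same calculation in Python. Only the first few numbers (9, 2, 5, 4, ...) are read out, so the list below is an assumption: it is the commonly used 20-value version of this example, chosen because it reproduces the quoted mean of 7, the squared differences 4, 25, 4, 9, 25, ... and the standard deviation of roughly 2.983.

```python
# Population standard deviation, worked the same way as on the slide.
# The full list isn't quoted in the video, so these 20 values are an
# assumption consistent with the stated results (mean 7, std ~2.983).
values = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3,
          7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

mean = sum(values) / len(values)                   # 7.0
squared_diffs = [(x - mean) ** 2 for x in values]  # 4, 25, 4, 9, 25, ...
variance = sum(squared_diffs) / len(values)        # population variance = 8.9
std_dev = variance ** 0.5                          # ~2.983

print(mean, variance, round(std_dev, 3))
```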
Now let's move on and discuss the next topic, which is information gain and entropy. This is one of my favorite topics in statistics — it's very interesting, and it is mainly involved in machine learning algorithms like decision trees and random forests. It's very important for you to know how information gain and entropy really work and why they're so essential in building machine learning models. We'll focus on the statistical part of information gain and entropy, and after that we'll discuss a use case and see how information gain and entropy are used in decision trees. For those of you who don't know what a decision tree is, it is basically a machine learning algorithm — you don't have to know anything about it yet, I'll explain everything in depth, so don't worry. Now let's look at what exactly entropy and information gain are. Entropy is basically the measure of any sort of uncertainty that is present in the data, and it can be measured by using this formula: here S is the set of all instances in the data set, or all the data items in the data set, N is the number of different classes in your data set, and pi is the event probability. This might seem a little confusing, but when we go through the use case you'll understand all of these terms even better. Coming to information gain: as the name suggests, information gain indicates how much information a particular feature or a particular variable gives us about the final outcome. It can be measured by using this formula: again, H of S is the entropy of the whole data set S, Sj is the number of instances with the j-th value of an attribute A, S is the total number of instances in the data set, V is the set of distinct values of an attribute A, H of Sj is the entropy of the subset of instances, and H of A comma S is the entropy of the attribute A. Even though this seems confusing, I'll clear out the confusion. Let's discuss a small problem statement where we'll understand how information gain and entropy are used to study the significance of a model. Like I said, information gain and entropy are very important statistical measures that let us understand the significance of a predictive model. To get a clearer understanding, let's look at a use case. Suppose we're given a problem statement: you have to predict whether a match can be played or not by studying the weather conditions. The predictor variables here are outlook, humidity and wind — day is also a predictor variable — and the target variable is basically play. The target variable is the variable that you're trying to predict. Now the value of the target variable will decide whether or not a game can be played, and that's why play has two values, no and yes: no meaning that the weather conditions are not good and therefore you cannot play the game, yes meaning that the weather conditions are good and suitable for you to play the game. So that was the problem statement; I hope it is clear to all of you. Now to solve such a problem we make use of something known as decision trees. Think of an inverted tree, where each branch of the tree denotes some decision. Each branch is known as a branch node, and at each branch node you're going to take a decision in such a manner that you will get an outcome at the end of the branch. Now this figure here basically shows that out of 14 observations, nine observations result in a yes, meaning that out of 14 days the match can be played only on nine days. So here, if you see, on day one, day two, day eight, day nine and day eleven the outlook has been sunny, so basically we're trying to cluster our data set depending on the outlook: when the outlook is sunny this is our data set, when the outlook is overcast this is what we have, and when the outlook is rain this is what we have. So when it is sunny we have two yeses and three nos, when the outlook is overcast we have all four as yeses — meaning that on the four days when the outlook was overcast we can play the game — and when it comes to rain we have three yeses and two nos. If you notice here, the decision is being made by choosing the outlook variable as the root node, and the root node is basically the topmost node in a decision tree. What we've done here is we've created a decision tree that starts with the outlook node, and then you're splitting the decision tree further depending on other parameters like sunny, overcast and rain. Now, like we know, outlook has three values: sunny, overcast and rain. So let me explain this in a more in-depth manner. What you're doing here is making the decision tree by choosing the outlook variable at the root node — the root node is basically the topmost node in a decision tree — and the outlook node has three branches coming out from it, which are sunny, overcast and rain. So basically outlook can have three values: either it can be sunny, it can be overcast, or it can be rainy. Now these three values are assigned to the immediate branch nodes, and for each of these values the possibility of play equal to yes is calculated. The sunny and the rain branches will give you an impure output, meaning that there is a mix of yes and no: there are two yeses and three nos here, and there are three yeses and two nos over here. But when it comes to the overcast value, it results in a hundred percent pure subset. This shows that the overcast value will result in a definite and certain output, and this is exactly what entropy is used to measure: it calculates the impurity or the uncertainty. So the lesser the uncertainty or the entropy of a variable, the more significant that variable is. When it comes to overcast, there's literally no impurity in the data set — it is a hundred percent pure subset — and we want variables like these in order to build a model. Now we don't always get lucky, and we don't always find variables that result in pure subsets; that's why we have the measure entropy. So the lesser the entropy of a particular variable, the more significant that variable will be.
So in a decision tree, the root node is assigned the best attribute so that the decision tree can predict the most precise outcome — meaning that on the root node you should have the most significant variable. That's why we've chosen outlook. Now some of you might ask me, why haven't you chosen overcast? Guys, overcast is not a variable, it is a value of the outlook variable, and that's why we've chosen outlook here: because it has a hundred percent pure subset, which is overcast. Now the question in your head is, how do I decide which variable or attribute best splits the data? Right now I looked at the data and told you that here we have a hundred percent pure subset, but what if it's a more complex problem and you're not able to tell which variable will best split the data? When it comes to decision trees, information gain and entropy will help you understand which variable will best split the data set, or which variable you have to assign to the root node — because whichever variable is assigned to the root node should best split the data set, and it has to be the most significant variable. So how can we do this? We need to use information gain and entropy. From the total of the 14 instances that we saw, nine of them said yes and five of the instances said no, that you cannot play on that particular day. So how do you calculate the entropy? This is the formula — you just substitute the values in the formula, and when you do, you'll get a value of 0.940. This is the entropy, or the uncertainty, of the data present in our sample. Now in order to ensure that we choose the best variable for the root node, let us look at all the possible combinations that you can use on the root node. These are all the possible combinations: you can either have outlook, windy, humidity or temperature. These are our four variables, and you can have any one of them as your root node. But how do you select which variable best fits the root node? That's what we're going to see by using information gain and entropy. So guys, now the task at hand is to find the information gain for each of these attributes — for outlook, for windy, for humidity and for temperature. A point to remember is that the variable that results in the highest information gain must be chosen, because it will give us the most precise output information. So we'll calculate the information gain for the attribute windy first: here we have six instances of true and eight instances of false, and when you substitute all the values in the formula you'll get a value of 0.048. Now this is a very low value for information gain, so the information that you're going to get from the windy attribute is pretty low. Next, let's calculate the information gain of the attribute outlook. From the total of 14 instances we have five instances which say sunny, four instances which are overcast and five instances which are rainy. For sunny we have two yeses and three nos, for overcast we have all four as yes, and for rain we have three yeses and two nos. So when you calculate the information gain of the outlook variable, you'll get a value of 0.247. Now compare this to the information gain of the windy attribute — this value is actually pretty good, right? We have 0.247, which is a pretty good value for information gain.
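If you'd like to verify these numbers yourself, here is a rough sketch in Python. The overall counts (9 yes / 5 no) and the outlook split are taken from the discussion above; the yes/no split under windy's true and false branches isn't spelled out here, so those sub-counts are an assumption based on the standard version of this play-prediction data set (true: 3 yes / 3 no, false: 6 yes / 2 no) — they reproduce the quoted 0.048. The same helper works for humidity and temperature.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """IG = entropy(parent) - weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Whole data set: 9 days "yes", 5 days "no"
print(round(entropy([9, 5]), 3))                                      # ~0.940

# Windy: sub-counts assumed (true: 3 yes / 3 no, false: 6 yes / 2 no)
print(round(information_gain([9, 5], [[3, 3], [6, 2]]), 3))           # ~0.048

# Outlook: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rain 3 yes / 2 no
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247
```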
Now let's look at the information gain of the attribute humidity. Over here we have seven instances which say high and seven instances which say normal. Under the high branch node we have three instances which say yes and the remaining four instances say no; similarly, under the normal branch we have six instances which say yes and one instance which says no. So when you calculate the information gain for the humidity variable, you're going to get a value of 0.151. Now this is also a pretty decent value, but when you compare it to the information gain of the attribute outlook, it is less. Now let's look at the information gain of the attribute temperature. The temperature attribute can hold hot, mild and cool: under hot we have two instances which say yes and two instances which say no, under mild we have four instances of yes and two instances of no, and under cool we have three instances of yes and one instance of no. Now when you calculate the information gain for this attribute, you'll get a value of 0.029, which is again very low. So what you can summarize from here is that if we look at the information gain for each of these variables, we'll see that for outlook we have the maximum gain — 0.247, which is the highest information gain value — and you must always choose the variable with the highest information gain to split the data at the root node. That's why we assign the outlook variable at the root node. So guys, I hope this use case was clear; if any of you have doubts, please keep commenting those doubts. Now let's move on and look at what exactly a confusion matrix is. The confusion matrix is the last topic for descriptive statistics; after this I'll be running a short demo where I'll be showing you how you can calculate the mean, median, mode, standard deviation, variance and all of those values by using R. So let's talk about the confusion matrix. Now guys, what is a confusion matrix? Don't get confused, this is not a complex topic. A confusion matrix is a matrix that is often used to describe the performance of a model, and it is specifically used for classification models or classifiers. What it does is calculate the accuracy, or the performance, of your classifier by comparing your actual results and your predicted results. So this is what it looks like — true positive, true negative and all of that. Now this is a little confusing; I'll get back to what exactly true positive, true negative and all of these stand for. For now, let's look at an example and try to understand what exactly a confusion matrix is. So guys, I made sure that I put examples after each and every topic, because it's important that you understand the practical part of statistics — statistics is not just theory, you need to understand how the calculations are done. So here, let's look at a small use case. Let's consider that you're given data about 165 patients, out of which 105 patients have a disease and the remaining 60 patients don't have a disease. What you're going to do is build a classifier by using these 165 observations: you'll feed all of these 165 observations to your classifier, and it will predict the output every time a new patient's details are fed to it.
Now, out of these 165 cases, let's say that the classifier predicted yes 110 times and no 55 times. Yes basically stands for "yes, the person has the disease" and no stands for "no, the person does not have the disease" — that's pretty self-explanatory. So it predicted 110 times that the patient has the disease and 55 times that the patient doesn't have the disease. However, in reality only 105 patients in the sample have the disease and 60 patients do not have the disease. So how do you calculate the accuracy of your model? You basically build the confusion matrix. This is how the matrix looks: it denotes the total number of observations that you have, which is 165 in our case; actual denotes the actual values in the data set and predicted denotes the values predicted by the classifier. So where the actual value is no and the predicted value is no, your classifier was correctly able to classify 50 cases as no — since both of these are no, those 50 it classified correctly. But 10 of the cases it classified incorrectly, meaning that the actual value is no but your classifier predicted yes — that's why there is a 10 over here. Similarly, it wrongly predicted that five patients do not have the disease whereas they actually did have it, and it correctly predicted 100 patients which had the disease. I know this is a little bit confusing, but if you look at these values: no-no is 50, meaning that it correctly predicted 50 values, and no-yes means that it wrongly predicted yes for values where it was supposed to predict no. Now what exactly are true positive, true negative and all of that? I'll tell you exactly what they are. True positives are the cases in which we predicted a yes and the patients actually do have the disease — that is this value of 100, so we have 100 true positives. True negatives are the cases where we predicted no and they don't have the disease — that's the 50, and this is also a correct classification. False positives are the cases where we predicted yes but they do not actually have the disease — that's the 10 over here — and this is also known as a type 1 error. False negatives are the cases where we predicted no but they actually do have the disease — that's the 5 — and this is known as a type 2 error. So guys, basically the true positives and the true negatives are the correct classifications, while the false positives and false negatives are the misclassifications. So this was the confusion matrix, and I hope this concept is clear.
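To tie those four numbers together, here is a minimal sketch of how the usual summary metrics fall out of this confusion matrix. The counts are the ones from the example above; the accuracy and misclassification-rate formulas are the standard ones.

```python
# Confusion matrix for the 165-patient example:
# rows = actual, columns = predicted.
TP = 100  # predicted "yes", actually has the disease
TN = 50   # predicted "no",  actually healthy
FP = 10   # predicted "yes", actually healthy          (type 1 error)
FN = 5    # predicted "no",  actually has the disease  (type 2 error)

total = TP + TN + FP + FN                  # 165
accuracy = (TP + TN) / total               # correct predictions / all predictions
misclassification_rate = (FP + FN) / total

print(round(accuracy, 3))                  # ~0.909
print(round(misclassification_rate, 3))    # ~0.091
```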
So guys, that was descriptive statistics. Now before we go to probability, I promised you all that we'll run a small demo in R — we'll try and understand how mean, median and mode work in R, so let's do that first. So again, what we just discussed so far was descriptive statistics; next we're going to discuss probability and then we'll move on to inferential statistics, which is basically the second type of statistics. Now to make things clearer for you, let me just zoom in. It's always best to perform practical implementations in order to understand the concepts in a better way, so here we'll be executing a small demo that will show you how to calculate the mean, median, mode, variance and standard deviation, and how to study the variables by plotting a histogram. Don't worry if you don't know what a histogram is — it's basically a frequency plot, there's no big science behind it. This is a very simple demo, but it also forms a foundation that every machine learning algorithm is built upon — you can say that most of the machine learning algorithms, actually all the machine learning and deep learning algorithms, have this basic concept behind them: you need to know how the mean, median, mode and all of that are calculated. So guys, I'm using the R language to perform this and I'm running it on RStudio; for those of you who don't know the R language, I will leave a couple of links in the description box, you can go through those videos. What we're doing is randomly generating numbers and storing them in a variable called data, so if you want to see the generated numbers, just run the line data — this variable basically stores all our numbers. Now what we're going to do is calculate the mean. All you have to do in R is call the mean function on the data that you're calculating the mean of, and I've assigned this whole thing to a variable called mean, which will hold the mean value of this data. So now let's look at the mean: for that I've used the print function with mean, and the mean is around 5.99. Next is calculating the median — it's very simple guys, all you have to do is use the median function and pass the data as a parameter to this function. That's all you have to do; R provides functions for each and everything. Statistics is very easy when it comes to R, because R is basically a statistical language, so all you have to do is name the function and that function is already built into R. So our median is around 6.4. Similarly we'll calculate the mode — let's run this function; I basically created a small function for calculating the mode, and this is our mode, meaning that this is the most recurrent value. Now we're going to calculate the variance and the standard deviation. For that, again, we have a function in R called var — all you have to do is pass the data to that function. Similarly we'll calculate the standard deviation, which is basically the square root of your variance, and now we'll print the standard deviation — this is our standard deviation value. Finally we'll just plot a small histogram. A histogram is nothing but a frequency plot — it'll show you how frequently a data point occurs — and this is the histogram that we've just created. It's quite simple in R, because R has a lot of packages and a lot of inbuilt functions that support statistics. It is a statistical language that is mainly used by data scientists, data analysts and machine learning engineers, because they don't have to sit and code these functions — all they have to do is mention the name of the function and pass the corresponding parameters. So guys, that was the entire descriptive statistics module, and now we'll discuss probability. Before we understand what exactly probability is, let me clear out a very common misconception — people often tend to ask me this question: what is the relationship between statistics and probability? Probability and statistics are related fields. Probability is a mathematical method used for statistical analysis; therefore we can say that probability and statistics are interconnected branches of mathematics that deal with analyzing the relative frequency of events. They're very interconnected fields — probability makes use of statistics and statistics makes use of probability. So that is the relationship between statistics and probability. Now let's
understand what exactly is probability so probability is the measure of How likely and event will occur to be more precise it is the ratio of desired outcome to the total outcomes now the probability of all outcomes always sum up to one now the probability will always sum up to one probability cannot go beyond one okay so either your probability can be zero or it can be one or it can be in the form of decimals like 0.52 or 0.55 or it can be in the form of 0.5 0.7 0.9 but its value will always stay between the range 0 and 1. okay now the famous example of probability is rolling a dice example so when you roll a dice you get six possible outcomes right you get one two three four and five six phases of a dice now each possibility only has one outcome so what is the probability that on rolling a dice you'll get three the probability is one by six right because there's only one phase which has the number three on it out of six phases there's only one face which has the number three so the probability of getting three when you roll a dice is one by six similarly if you want to find the probability of getting a number five again the probability is going to be one by six all right so all of this will sum up to one all right so guys uh this is exactly what probability is it's a very simple concept we all learned it in eighth standard onwards right now let's understand the different terminologies that are related to probability now there are three terminologies that you often come across when we talk about probability we have something known as the random experiment okay it's basically an experiment or a process uh for which the outcomes cannot be predicted with certainty all right that's why you use probability you're going to use probability in order to predict the outcome with some sort of certainty sample space is the entire possible set of outcomes of a random experiment an event is one or more outcomes of an experiment so if you consider the example of rolling a dice now let's say that you want to find out the probability of getting a 2 when you roll a dice okay so finding this probability is the random experiment the sample space is basically your entire possibility okay so one two three four five six phases are there and out of that you need to find the probability of getting a 2 right so all the possible outcomes will basically represent your sample space okay so one to six are all your possible outcomes this represents your sample space now event is one or more outcome of an experiment so in this case my event is to get a two when I roll a dice right so my event is the probability of getting a 2 when I roll a dice so guys this is basically what uh random experiment sample space and eventually means all right now uh let's discuss the different types of events there are two types of events that you should know about there is disjoint and non-disjoint events disjoint events are events that do not have any common outcome for example if you draw a single card from a deck of cards it cannot be a king and a queen correct it can either be king or it can be Queen now non-disjoin events are events that have common outcomes for example a student can get 100 marks in statistics and 100 marks in probability all right and also the outcome of a ball delivered can be a no ball and it can be a six right so this is what non-disjoint events are all right these are very simple to understand right now let's move on and look at the different types of probability distribution all right I'll be discussing the 
three main probability distribution functions — the probability density function, the normal distribution and the central limit theorem. The probability density function, also known as the PDF, is concerned with the relative likelihood of a continuous random variable taking on a given value. So the PDF gives the probability of a variable lying between the range A and B — basically, what you're trying to do is find the probability of a continuous random variable over a specified range. Now this graph denotes the PDF of a continuous variable; this graph is also known as the bell curve — it's famously called the bell curve because of its shape — and there are three important properties that you need to know about a probability density function. First, the graph of a PDF is continuous over a range; this is because you're finding the probability that a continuous variable lies between the ranges A and B. The second property is that the area bounded by the curve of a density function and the x-axis is equal to 1 — basically, the area below the curve is equal to 1 — because it denotes probability, and again, probability cannot be more than one, it has to be between 0 and 1. Property number three is that the probability that a random variable assumes a value between A and B is equal to the area under the PDF bounded by A and B. What this means is that the probability value is given by the area of the graph: the area under the curve between A and B is the probability that the random variable will lie in the range A to B, while the total area under the whole curve is 1. So I hope all of you have understood the probability density function — it's basically the probability of finding the value of a continuous random variable between the range A and B. Now let's look at our next distribution, which is the normal distribution. The normal distribution, which is also known as the Gaussian distribution, is a probability distribution that denotes the symmetric property of the mean, meaning that the idea behind this function is that data near the mean occurs more frequently than data away from the mean. What that means is that the data around the mean represents the entire data set, so if you just take a sample of data around the mean, it can represent the entire data set. Now, similar to the probability density function, the normal distribution appears as a bell curve. When it comes to the normal distribution there are two important factors: we have the mean of the population and the standard deviation. The mean determines the location of the center of the graph, and the standard deviation determines the spread and the height of the graph: if the standard deviation is large, the curve is going to be short and wide, and if the standard deviation is small, the curve is tall and narrow. So that was it about the normal distribution. Now let's look at the central limit theorem. The central limit theorem states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough. Now that's a little confusing, so let me break it down for you. In simple terms, if we had a large population and we divided it into many samples, then the mean of all the samples from the population will be almost equal to the mean of the entire population, meaning that the sample means are approximately normally distributed around the population mean.
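Here is a quick simulation sketch of that idea, assuming NumPy is available. It draws a deliberately non-normal (skewed) population, takes many samples of size 50, and shows that the sample means cluster tightly around the population mean — the scale parameter, sample size and number of samples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately skewed "population" (not normal at all)
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(round(population.mean(), 3))       # population mean, ~2.0
print(round(np.mean(sample_means), 3))   # mean of the sample means, close to the above
print(round(np.std(sample_means), 3))    # small spread: the sample means cluster tightly
```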
Now let's look at the central limit theorem. The central limit theorem states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough. That sounds a little confusing, so let me break it down: in simple terms, if we have a large population and we divide it into many samples, then the mean of all the samples taken from the population will be almost equal to the mean of the entire population, meaning that each of the samples is normally distributed. So if you compare the mean of each sample, it will be almost equal to the mean of the population; the graph usually shown for this gives a clearer picture, with the mean of each sample falling almost along the same line. That is exactly what the central limit theorem states. The accuracy, or the resemblance to the normal distribution, depends on two main factors: the first is the number of sample points you consider, and the second is the shape of the underlying population, which obviously depends on the standard deviation and the mean of the sample. In short, the central limit theorem states that each sample will be normally distributed in such a way that the mean of each sample coincides with the mean of the actual population, and this holds true mainly for large data sets; for a small data set there are more deviations, because of the scaling factor — the smallest deviation in a small data set changes the value drastically, whereas in a large data set a small deviation hardly matters. Now let's move on and look at our next topic, which is the different types of probability. This is an important topic, because many problems can be solved simply by understanding which type of probability applies. There are three important types: marginal, joint and conditional probability, so let's discuss each of these. The probability of an event occurring unconditioned on any other event is known as marginal probability, or unconditional probability. Say you want to find the probability that a card drawn is a heart: the probability will be 13 by 52, since there are 13 hearts in a deck and 52 cards in total. That's marginal probability. Now let's understand joint probability: joint probability is a measure of two events happening at the same time. If the two events are A and B, the probability of A and B occurring together is the intersection of A and B. For example, the probability that a card is a four and red is a joint probability; the answer is 2 by 52, because there is one four in hearts and one four in diamonds, both red in color, and 2 by 52 simplifies to 1 by 26. Moving on, let's look at conditional probability: if the probability of an event or an outcome is based on the occurrence of a previous event or outcome, you call it conditional probability. The conditional probability of an event B is the probability that the event will occur given that an event A has already occurred. If A and B are dependent events, the expression for conditional probability is given by the formula below, where the first term on the left-hand side, P(B|A), is the probability of event B occurring given that event A has already occurred.
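The expression being pointed at on the slide isn't read out, so here it is reconstructed in standard notation, for both the dependent and the independent case:

```latex
% Conditional probability when A and B are dependent events
P(B \mid A) = \frac{P(A \cap B)}{P(A)}

% When A and B are independent events, the condition has no effect
P(B \mid A) = P(B), \qquad P(A \cap B) = P(A)\,P(B)
```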
Like I said, that is the expression when A and B are dependent events; if A and B are independent events, the expression simplifies, because P(B|A) is just P(B), and P(A) and P(B) are simply the individual probabilities of A and B. Now, in order to understand conditional, joint and marginal probability, let's look at a small use case. We'll take a data set that examines the salary package and the training undergone by candidates. In it there are 60 candidates without training and 45 candidates who have enrolled for edureka's training, and the task is to assess the training against the salary package. In total we have 105 candidates, out of which 60 have not enrolled for edureka's training and 45 have. This is a small survey that was conducted, along with a rating of the salary package each candidate got. Reading through the data you can see, for instance, that 5 candidates without edureka training got a very poor salary package, and 30 candidates with edureka training got a good package; so basically we're comparing the salary package of a person depending on whether or not they enrolled for edureka training. Now let's look at the problem statements. First: find the probability that a candidate has undergone edureka's training. Quite simple — which type of probability is this? It's marginal probability. The probability is 45 divided by 105, since 45 candidates have the training and 105 is the total number of candidates, which gives approximately 0.42. Next question: find the probability that a candidate has attended edureka's training and also has a good package. This is obviously a joint probability problem. Since our table is nicely formatted, we can directly read off that the people who have a good package along with edureka training number 30; so out of 105 people, 30 have both edureka training and a good package. Remember that the question asks specifically for candidates who attended edureka's training and have a good package, so you consider both factors: that number is 30, divided by the total number of candidates, which is 105. Next: find the probability that a candidate has a good package given that he has not undergone training. This is clearly conditional probability, because a condition is being defined — you want the probability of a candidate having a good package given that he has not undergone any training. The number of people who have not undergone training is 60, and out of those, 5 got a good package. That's why the answer is 5 by 60 and not 5 by 105: since the question says "given that he has not undergone training", you only consider the people who have not undergone training. So 5 divided by 60 gives a probability of around 0.08, which is pretty low. Those are the different types of probability, worked out on the table above and summarised in the sketch below.
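To tie the three types together, here is a small sketch of my own that recomputes the three answers directly from the counts quoted above; the variable names are invented for the example.

```python
# counts quoted in the survey above
total = 105
with_training = 45
with_training_good = 30       # trained candidates with a good package
without_training = 60
without_training_good = 5     # untrained candidates with a good package

# marginal probability: P(training)
p_training = with_training / total
print("P(training) =", round(p_training, 2))                            # ~0.42

# joint probability: P(training and good package)
p_training_and_good = with_training_good / total
print("P(training and good) =", round(p_training_and_good, 2))          # ~0.29

# conditional probability: P(good package | no training)
p_good_given_no_training = without_training_good / without_training
print("P(good | no training) =", round(p_good_given_no_training, 2))    # ~0.08
```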
Okay, so that was all about the different types of probability. Now let's move on and look at our last topic in probability, which is Bayes' theorem. Bayes' theorem is a very important concept in statistics and probability, and it is majorly used in the Naive Bayes algorithm. For those of you who aren't aware, Naive Bayes is a supervised learning classification algorithm, and it is mainly used in Gmail spam filtering — a lot of you might have noticed that Gmail has a folder called spam, and all of that filtering is carried out through machine learning, with Naive Bayes as the algorithm behind it. So now let's discuss what exactly Bayes' theorem is and what it denotes. Bayes' theorem is used to show the relation between one conditional probability and its inverse; basically, it gives the probability of an event occurring based on prior knowledge of conditions that might be related to that event. Mathematically, Bayes' theorem is represented by the equation shown below: the term on the left-hand side is known as the posterior, the probability of occurrence of event A given event B; the first term in the numerator on the right is referred to as the likelihood, which measures the probability of occurrence of B given event A; P(A) is known as the prior, which refers to the prior probability of A; and P(B) is again simply the probability of B. That is Bayes' theorem. In order to better understand it, let's look at a small example. Say we have three bowls: bowl A, bowl B and bowl C. Bowl A contains two blue balls and four red balls, bowl B contains eight blue balls and four red balls, and bowl C contains one blue ball and three red balls. If we draw one ball from each bowl, what is the probability of drawing a blue ball from bowl A, given that we drew exactly two blue balls in total? If you didn't understand the question, please re-read it — I'll pause for a second or two. Now what I'm going to do is draw a blueprint and tell you exactly how to solve the problem, but I want you to come up with the solution on your own; the formula is given to you, everything is given to you, all you have to do is come up with the final answer. So first of all, let A be the event of picking a blue ball from bag A, and let X be the event of picking exactly two blue balls, because these are the two events whose probabilities we need to work with. What we want is the probability of occurrence of event A given X — that is, given that we're picking exactly two blue balls, what is the probability that we picked a blue ball from bag A? By the definition of conditional probability, the equation looks like the one shown below.
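Neither equation is spoken aloud in the transcript, so here is a reconstruction of the two formulas being pointed at: the general form of Bayes' theorem, and the conditional-probability form used for the bowls example, with A and X as defined above.

```latex
% Bayes' theorem: posterior = likelihood x prior / evidence
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% For the bowls example, with A = "blue ball drawn from bowl A"
% and X = "exactly two blue balls drawn in total":
P(A \mid X) = \frac{P(A \cap X)}{P(X)}
```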
In that equation, the left-hand side is the probability of occurrence of event A given event X, the numerator is the probability of A and X occurring together, and the denominator is the probability of X alone. So what we need to do is find those two probabilities: the probability of A and X occurring together, and the probability of X — that is the entire solution. How do you find the probability of X? X represents the event of picking exactly two blue balls, and there are three ways in which that can happen: you pick a blue ball from bowl A and a blue ball from bowl B; or a blue ball from bowl A and a blue ball from bowl C; or a blue ball from bowl B and a blue ball from bowl C. You need to find the probability of each of these. Step two is to find the probability of A and X occurring together, which is the sum of the first two terms, because in both of those cases we are picking a blue ball from bag A. So find out this probability and let me know your answer in the comment section — we'll see if you get it right. I've given you the entire solution; all you have to do is substitute the values, so pause on this screen for a second or two and go through it once more. Remember that you need to calculate two probabilities: the first is the probability of picking a blue ball from bag A given that you're picking exactly two blue balls, and the second is the probability of picking exactly two blue balls. Make sure you mention your answers in the comment section. For now, let's move on and look at our next topic, which is inferential statistics. We just completed the probability module; now we'll discuss inferential statistics, which is the second type of statistics — we discussed descriptive statistics earlier. Like I mentioned, inferential statistics, also known as statistical inference, is a branch of statistics that deals with forming inferences and predictions about a population based on a sample of data taken from that population. The question you should ask is: how does one form inferences or predictions from a sample? The answer is that you use point estimation. So what is point estimation? Point estimation is concerned with using the sample data to measure a single value which serves as an approximate value, or the best estimate, of an unknown population parameter. That's a little confusing, so let me break it down: for example, in order to calculate the mean of a huge population, what we do is first draw a sample of the population and then find the sample mean; the sample mean is then used to estimate the population mean. That is a point estimate — you're estimating the value of one of the parameters of the population, in this case the mean. A quick sketch of this idea follows below.
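As a small illustration of point estimation (my own example with made-up numbers), the sketch below draws a sample from a synthetic population and uses the sample mean as a point estimate of the population mean.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed

# a synthetic "population" of 100,000 values with true mean ~50
population = rng.normal(loc=50, scale=10, size=100_000)

# draw a sample and use its mean as the point estimate of the population mean
sample = rng.choice(population, size=500, replace=False)
print("population mean:", round(population.mean(), 2))
print("point estimate (sample mean):", round(sample.mean(), 2))
```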
The two main terms in point estimation are the estimator and the estimate. The estimator is a function of the sample that is used to find out the estimate — in this example, the function that calculates the sample mean — and the realized value of the estimator is the estimate. I hope point estimation is clear now. So how do you find estimates? There are four common methods. The first is the method of moments: you form an equation from the sample data set and then analyze a similar equation for the population as well, like the population mean, population variance and so on; in simple terms, you take known facts about the population and extend those ideas to the sample, and once you do that you can analyze the sample and estimate more essential or more complex values. Next we have maximum likelihood: this method basically uses a model to estimate a value, and it is majorly based on probability, so there's a lot of probability involved. Next is the Bayes estimator, which works by minimizing the error or the average risk; the Bayes estimator has a lot to do with Bayes' theorem. Finally we have best unbiased estimators, where several unbiased estimators can be used to approximate a parameter. Let's not get into the depth of these estimation methods. These were a couple of methods used to find an estimate, but the most well-known approach is interval estimation — one of the most important estimation methods, and the place where the confidence interval comes into the picture. Apart from interval estimation we also have something known as margin of error, and I'll be discussing all of this in the upcoming slides. First, let's understand what an interval estimate is: an interval, or range of values, which is used to estimate a population parameter is known as an interval estimate. That's quite understandable — basically, say you're trying to find the mean of a population; instead of a single number you build a range, and your value will lie somewhere in that range or interval. This way your output is more accurate, because instead of predicting a single point estimate you have estimated an interval within which the value might occur. The usual illustration shows clearly how a point estimate and an interval estimate differ: the interval estimate is more informative because you're not focusing on one particular value, you're saying the value lies within a range, between the lower confidence limit and the upper confidence limit. If you're still confused about interval estimation, here's a small example: if I state that I will take 30 minutes to reach the theater, that's a point estimate; but if I state that I will take between 45 minutes and an hour to reach the theater, that's an interval estimate. Now, interval estimation gives rise to two important statistical terminologies: one is known as the confidence interval and the other is known as the margin of error.
It's important that you pay attention to both of these terminologies. The confidence interval is one of the most significant measures used to check how good a machine learning model is. So what is a confidence interval? It is the measure of your confidence that the estimated interval contains the population parameter — the population mean or any such parameter. Statisticians use the confidence interval to describe the amount of uncertainty associated with a sample estimate of a population parameter. That's a lot of definition, so let me explain confidence intervals with a small example. Say you perform a survey of a group of cat owners to see how many cans of cat food they purchase in one year. You test your statistic at the 99% confidence level and you get a confidence interval of (100, 200). This means you think the cat owners buy between 100 and 200 cans in a year, and since the confidence level is 99%, it shows you're very confident the result is correct. So your confidence interval here is 100 to 200 and your confidence level is 99% — that's the difference between a confidence interval and a confidence level: within the confidence interval your value is going to lie, and the confidence level shows how confident you are about the estimation. Now let's look at margin of error. The margin of error, for a given level of confidence, is the greatest possible distance between the point estimate and the value of the parameter it is estimating; you can think of it as a deviation from the actual point estimate. The margin of error can be calculated using the formula z_c multiplied by the standard deviation divided by the square root of the sample size n, where z_c denotes the critical value for the chosen confidence level. Now let's understand how you estimate confidence intervals. The level of confidence, denoted by c, is the probability that the interval estimate contains the population parameter — say the mean you're trying to estimate. The area beneath the curve between -z_c and +z_c is exactly that probability: the interval estimate should contain the value you're predicting. Those two points are known as the critical values — basically your lower and upper limits for that confidence level. There's also something known as the z-score, which can be calculated using the standard normal table; if you look it up anywhere on Google you'll find the z-score table, or standard normal table. To understand how this is done, take a small example: say the level of confidence is 90%, meaning you are 90% confident that the interval contains the population mean. The remaining 10% is distributed equally over the two tail regions, 0.05 on each side of c, and these z-scores are read off the standard normal table, as in the small sketch below.
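As a side note of my own, those critical values can also be computed with SciPy instead of being looked up in a printed table; the confidence levels below are just the usual examples.

```python
from scipy.stats import norm

# for a given confidence level, the leftover probability is split equally
# between the two tails, so the critical value is the z-score that leaves
# alpha/2 in the upper tail
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    z_c = norm.ppf(1 - alpha / 2)
    print(f"{confidence:.0%} confidence -> z_c ~ {z_c:.3f}")

# prints roughly 1.645, 1.960 and 2.576
```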
1.645, for example, is the value obtained from the standard normal table for a 90% confidence level, and that is how you estimate the level of confidence. To sum it up, these are the steps involved in constructing a confidence interval: first, identify a sample statistic — the statistic you will use to estimate a population parameter, for example the sample mean; next, select a confidence level, which describes the uncertainty of the sampling method; after that, find the margin of error using the equation I explained on the previous slide; and finally, specify the confidence interval. Now let's look at a problem statement to better understand this concept: a random sample of 32 textbook prices is taken from a local college bookstore; the mean of the sample is given, and the sample standard deviation is 23.44. Using a 95% confidence level, find the margin of error for the mean price of all textbooks in the bookstore. This is a very straightforward question — all you have to do is substitute the values into the equation. We know the formula for the margin of error: take the z-score from the table, multiply by the standard deviation, which is 23.44, and divide by the square root of the number of samples, which here is 32 textbooks. So the margin of error comes out to approximately 8.12. A pretty simple question, and it's worked out in the sketch below.
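Here is that substitution done in a few lines of Python (my own sketch); 1.96 is the usual critical value for a 95% confidence level, and the other numbers come from the problem statement above.

```python
import math

z_c = 1.96     # critical value for a 95% confidence level
s = 23.44      # sample standard deviation of the textbook prices
n = 32         # number of textbooks sampled

margin_of_error = z_c * s / math.sqrt(n)
print(round(margin_of_error, 2))   # ~8.12
```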
Now that you know the idea behind confidence intervals, let's move ahead to one of the most important topics in statistical inference, which is hypothesis testing. Statisticians use hypothesis testing to formally check whether a hypothesis should be accepted or rejected. Hypothesis testing is an inferential statistical technique used to determine whether there is enough evidence in a data sample to infer that a certain condition holds true for an entire population. To understand the characteristics of a general population, we take a random sample and analyze the properties of the sample; we test whether or not the identified conclusion represents the population accurately, and finally we interpret the results. Whether or not to accept the hypothesis depends on the probability value we get from the test. There are a few steps followed in hypothesis testing: you begin by stating the null and the alternative hypothesis (I'll tell you what exactly these terms are), then you formulate an analysis plan, after that you analyze the sample data, and finally you interpret the results. To understand the whole idea, let's look at an example. Consider four boys: Nick, John, Bob and Harry. These boys were caught bunking a class, and as a punishment they were asked to stay back at school and clean their classroom. So John decided the four of them would take turns to clean the classroom; he came up with a plan of writing each of their names on chits and putting them in a bowl, and every day they would pick a name from the bowl, and that person had to clean the class. That sounds fair enough, but it has been three days and everybody's name has come up except John's. Assuming that this event is completely random and free of bias, what is the probability that John is not cheating? This can be solved using hypothesis testing. We'll begin by calculating the probability of John not being picked on a given day, assuming the event is free of bias. The probability that John is not picked on a day is three out of four, which is 75%, and 75% is fairly high. But if John is not picked for three days in a row, the probability drops down to approximately 42%, and if John is not picked for twelve days in a row, the probability drops down to 3.2% — at that point the probability of John cheating becomes fairly high. So, in order to come to a conclusion, statisticians define what is known as a threshold value. Considering the above situation, if the threshold value is set to five percent, it would indicate that if the probability lies below five percent then John is cheating his way out of detention, but if the probability is above the threshold value then John is just lucky and his name isn't getting picked. This probability and the threshold give rise to the two important components of hypothesis testing: the null hypothesis and the alternate hypothesis. The null hypothesis corresponds to the assumption being true, while the alternate hypothesis is when your result disproves the assumption. Therefore, in our example, since the probability of the event occurring is less than five percent, the event is biased, and hence it supports the alternate hypothesis — the arithmetic behind those percentages is sketched below.
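A quick sketch of my own to verify the percentages quoted in the John example:

```python
# probability that John is NOT picked on a single fair draw
# (3 other names out of 4 in the bowl)
p_not_picked = 3 / 4

# probability of John not being picked several days in a row
print(round(p_not_picked ** 3, 2))    # ~0.42  -> 42%
print(round(p_not_picked ** 12, 3))   # ~0.032 -> 3.2%
```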
So guys, with this we come to the end of this session, but before I wrap up we're going to run a quick demo based on inferential statistics. Let's quickly open up RStudio — I'll be running this in the R language, and if you don't know R that's okay, I'm going to leave a couple of links in the description, and this is pretty understandable, it's just basic math. In this demo we'll be using the gapminder data set to perform hypothesis testing. The gapminder data set contains a list of 142 countries with their respective values for life expectancy, GDP per capita and population, every five years from 1952 to 2007. Don't worry, I'll show you the data set. We'll begin by installing and loading the gapminder package: install.packages will install the gapminder package, and library(gapminder) will load it. Once it's loaded, let's view the data set: you have the different countries, the continents, the year, the life expectancy, the population and the GDP per capita for each country. The next step is to load the famous dplyr package provided by R — we're specifically looking to use the pipe operator from this package. For those of you who don't know what the pipe operator does: it allows you to pipe the data on the left-hand side into the expression on the right-hand side of the pipe; it's quite self-explanatory, a pipe which connects two steps. So let's install and load this package as well. Next, we're going to compare the life expectancy of two places, Ireland and South Africa, and perform a t-test to check whether the comparison supports the null hypothesis or the alternate hypothesis. Let's run the code and apply the t-test comparing the life expectancy of these two places. Notice the mean in group Ireland and the mean in group South Africa: the life expectancy differs by roughly a scale of 20. Now we need to check whether this difference in life expectancy between South Africa and Ireland is actually valid and not just down to pure chance, and that is exactly why the t-test is carried out. Pay special attention to the p-value here, also known as the probability value: the p-value is a very important measurement when it comes to ensuring the significance of a model. A model is said to be statistically significant only when the p-value is less than the predetermined statistical significance level, which is ideally 0.05, so your p-value has to be much smaller than 0.05. As you can see from our output, the p-value is far smaller than 0.05 — an extremely small value, which is a good thing. In the summary of the model, notice another important parameter called the t-value; I've been talking about the t-test, and this is its test statistic, which here is 10.067. A larger t-value suggests that the alternate hypothesis is true and that the difference in life expectancy is not equal to zero by pure luck; hence, in our case, the null hypothesis is rejected. We can clearly see from the output that the alternate hypothesis holds, and this is calculated at a 95% confidence level — the confidence interval here is 15 to 22 and the confidence level is 95%, exactly what we discussed a couple of minutes ago. A rough sketch of this comparison is given below.
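The demo itself is narrated in R; purely as a rough illustration, here is a Python analogue of the same comparison using pandas and SciPy's two-sample t-test. The file path and column names (`country`, `lifeExp`) are assumptions about a gapminder extract on disk, not part of the original demo.

```python
import pandas as pd
from scipy import stats

# hypothetical path to a gapminder extract with 'country' and 'lifeExp' columns
df = pd.read_csv("gapminder.csv")

ireland = df.loc[df["country"] == "Ireland", "lifeExp"]
south_africa = df.loc[df["country"] == "South Africa", "lifeExp"]

# two-sample t-test (Welch's, i.e. not assuming equal variances,
# which matches the default behaviour of R's t.test)
t_stat, p_value = stats.ttest_ind(ireland, south_africa, equal_var=False)
print("t =", round(t_stat, 3), "p =", p_value)

# a p-value far below 0.05 means the ~20-year gap in mean life expectancy
# is very unlikely to be due to chance alone
```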
Now, just to end the demo, I'll show you a small visualization: for each continent we plot how the life expectancy varies with respect to the GDP per capita for that continent. So this is our plot, and if you look at the illustration you can almost see a linear relationship between life expectancy and GDP per capita for each of the continents. This also shows how well the R language can be used for statistical analysis — look at how clean the graph looks, and how clearly it shows that there is almost a linear dependency between GDP per capita and life expectancy. So, what is the importance of, or the need for, machine learning? Ever since the technical revolution we've been generating an immeasurable amount of data: as per research, we generate around 2.5 quintillion bytes of data every single day, and it is estimated that 1.7 MB of data will be created every second for every person on earth. That is a lot of data. This data comes from sources such as the cloud, IoT devices, social media and so on — since all of us spend so much time on the internet, we generate a lot of data through all the chatting we do, the images we post on Instagram, the videos we watch, and so on. So how does machine learning fit into all of this? Since we're producing this much data, we need a method that can analyze, process and interpret it and make sense out of it, and that method is machine learning. There are a lot of top-tier, data-driven companies such as Netflix and Amazon which build machine learning models using tons of data in order to identify profitable opportunities and avoid unwanted risks. Through machine learning you can predict risks, predict profits and identify opportunities that will help you grow your business. Now I'll show you a couple of examples where machine learning is used. I'm sure all of you have binge-watched something on Netflix. The most important thing about Netflix is its recommendation engine — most of Netflix's revenue comes from it. The recommendation engine studies the movie-viewing patterns of its users and then recommends relevant movies to them, depending on the users' interests, the type of movies they watch, and so on. That is how Netflix uses machine learning. Next we have Facebook's auto-tagging feature. The logic behind auto-tagging is machine learning and neural networks: Facebook makes use of its DeepFace face verification system, which studies the facial features in an image and tags your friends and family. Another example is Amazon's Alexa. Alexa is an advanced virtual assistant based on natural language processing and machine learning; it can do much more than just play music — it can book your Uber, connect with other IoT devices in your house, track your health, order food online, and so on. Data and machine learning are the main factors behind Alexa's success. Another such example is the Google spam filter: Gmail basically makes use of machine learning to filter out spam messages.
If you open your Gmail inbox you'll see that there are separate sections: one for primary, one for social, one for spam and one for your general mail. Gmail makes use of machine learning algorithms and natural language processing to analyze emails in real time and then classify them as either spam or non-spam; this is another famous application of machine learning. To sum this up, let's look at a few reasons why machine learning is so important. The first reason is obviously the increase in data generation: because of the excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data, and this is where machine learning comes in — it uses data to solve problems and find solutions to the most complex tasks faced by organizations. Another important reason is that it improves decision making: by making use of various algorithms, machine learning can be used to make better business decisions, for example to forecast sales, predict downfalls in the stock market, identify risks and anomalies, and so on. The next reason is that it uncovers patterns and trends in data: finding hidden patterns and extracting key insights from data is the most essential part of machine learning, and by building predictive models and using statistical techniques, machine learning allows you to dig beneath the surface and explore data at a minute scale. Understanding data and extracting patterns manually would take many days, whereas machine learning algorithms can perform such computations in less than a second. Another reason is that it solves complex problems: from detecting the genes linked to the deadly ALS disease, to building self-driving cars and face detection systems, machine learning can be used to solve the most complex problems. Now that you know why machine learning is so important, let's look at what exactly machine learning is. The term machine learning was first coined by Arthur Samuel in the year 1959.
Looking back, that year was probably among the most significant in terms of technological advancements. If you browse the net for what machine learning is, you'll get at least a hundred different definitions. The first and very formal definition was given by Tom Mitchell: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. I know this is a little confusing, so let's break it down into simple words. In simple terms, machine learning is a subset of artificial intelligence which provides machines the ability to learn automatically and improve from experience without being explicitly programmed to do so; in that sense, it is the practice of getting machines to solve problems by gaining the ability to think. But wait — how can a machine think or make decisions? Well, if you feed a machine a good amount of data, it will learn how to interpret, process and analyze this data by using machine learning algorithms. The figure at the top of the slide shows how the machine learning process works: it begins by feeding the machine lots and lots of data; by using this data the machine is trained to detect hidden insights and trends, and these insights are then used to build a machine learning model, by using an algorithm, in order to solve a problem. So basically you feed a lot of data to the machine, the machine gets trained on this data, draws useful insights and patterns from it, and then builds a model by using machine learning algorithms; this model helps you predict the outcome or solve a complex business problem. That's a simple explanation of how machine learning works. Now let's move on and look at some of the most commonly used machine learning terms. First of all we have the algorithm — quite self-explanatory: an algorithm is a set of rules or statistical techniques used to learn patterns from data, and it is the logic behind a machine learning model. An example of a machine learning algorithm is linear regression, the most simple and basic machine learning algorithm. Next we have the model, the main component of machine learning: a model maps the input to the output by using a machine learning algorithm and the data you feed the machine, so it is a representation of the entire machine learning process — the model is fed input, which is a lot of data, and it outputs a particular result or outcome by using machine learning algorithms. Next we have something known as the predictor variable: a predictor variable is a feature of the data that can be used to predict the output. For example, say you're trying to predict the weight of a person based on the person's height and age; here the predictor variables are the height and the age, because you're using them to predict the person's weight. Weight, on the other hand, is the response or the target variable.
The response variable is the feature, or output variable, that needs to be predicted using the predictor variables. After that we have something known as training data. The data that is fed to a machine learning model is always split into two parts: training data and testing data. The training data is used to build the machine learning model, and it is usually much larger than the testing data, because if you're trying to train the machine you obviously need to feed it a lot more data; the testing data is just used to validate and evaluate the efficiency of the model. So those were a few terms I thought you should know before we move any further. Now let's discuss the machine learning process — this is going to get interesting, because I'm going to give you an example and walk you through how the process works. First, let's define the different stages involved in the machine learning process. A machine learning process always begins with defining the objective, or the problem you're trying to solve. The next stage is data gathering, or data collection, where the data needed to solve the problem is collected. This is followed by data preparation, or data processing; after that comes data exploration and analysis; the next stage is building the machine learning model; this is followed by model evaluation; and finally you have the prediction, or the output. Now let's try to understand this entire process with an example. Our problem statement is to predict the possibility of rain by studying the weather conditions, so say you're given this problem statement and asked to use the machine learning process to solve it. The first step is to define the objective of the problem statement: our objective is to predict the possibility of rain by studying weather conditions. In the first stage of a machine learning process you must understand what exactly needs to be predicted — in our case, the possibility of rain. At this stage it is also essential to take mental notes on what kind of data can be used to solve the problem, and the type of approach you can follow to get to the solution. A few questions worth asking during this stage are: what are we trying to predict, what are the target features, what are the predictor variables, what kind of input data do we need, and what kind of problem are we facing — is it a binary classification problem or a clustering problem? Don't worry if you don't know what classification and clustering are; I'll be explaining them in the upcoming slides. So that was the first step of the machine learning process: define the objective of the problem. Now let's move on to step number two, which is data collection, or data gathering. At this stage you must be asking questions such as: what kind of data is needed to solve the problem, is that data available, and if it is available, how can I get it? Once you know the type of data that is required, you must understand how you can obtain it — data collection can be done manually or by web scraping.
But if you're a beginner who is just looking to learn machine learning, you don't have to worry about getting the data: there are thousands of data resources on the web, and you can simply download data sets from websites such as Kaggle. Coming back to the problem at hand, the data needed for weather forecasting includes measures such as humidity level, temperature, pressure, locality, whether or not you live in a hill station, and so on; such data must be collected and stored for analysis. The next stage is preparing your data. The data you collect is almost never in the right format, so you'll encounter a lot of inconsistencies in the data set — missing values, redundant variables, duplicate values and so on. Removing such values is very important, because they can lead to wrongful computations and predictions, which is why at this stage you must scan the entire data set for inconsistencies and fix them. The next step is exploratory data analysis. Data analysis is all about diving deep into the data and finding all the hidden mysteries — this is where you become a detective. EDA, or exploratory data analysis, is like the brainstorming stage of machine learning: data exploration involves understanding the patterns and trends in your data, and at this stage all the useful insights are drawn and the correlations between the variables are understood. What sort of correlations? For example, in the case of predicting rainfall, we know there is a strong possibility of rain if the temperature has fallen; such correlations have to be understood and mapped at this stage. This stage is followed by stage number five, which is building the machine learning model. All the insights and patterns derived during data exploration are used to build the model. This stage always begins by splitting the data set into two parts, training data and testing data — earlier in this session I told you what training and testing data are. The training data is used to build and train the model, and the logic of the model is based on the machine learning algorithm being implemented. In the case of predicting rainfall, since the output will be in the form of true or false, we can use a classification algorithm like logistic regression. Choosing the right algorithm depends on the type of problem you're trying to solve, the data set you have and the level of complexity of the problem; in the upcoming sections we'll discuss the different types of problems that can be solved using machine learning, so don't worry if you don't yet know what a classification algorithm or logistic regression is. All you need to know is that at this stage you build a machine learning model by using a machine learning algorithm and the training data set. The next step in the machine learning process is model evaluation and optimization. After building a model using the training data set, it is finally time to put the model to the test: the testing data set is used to check the efficiency of the model and how accurately it can predict the outcome, and once you calculate the accuracy, any improvements to the model have to be implemented at this stage — methods like parameter tuning and cross-validation can be used to improve the performance of the model. A rough end-to-end sketch of these stages is given below.
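Purely as an illustration of the split–train–evaluate flow just described (not code from the lecture), here is a minimal scikit-learn sketch; the weather features and the rain labels are randomly generated stand-ins, so every number and column in it is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# made-up weather data: columns stand in for humidity, temperature, pressure
X = rng.random((500, 3))
# made-up rule for the label "will it rain?" (True/False)
y = (X[:, 0] > 0.6) & (X[:, 1] < 0.5)

# split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# build the model on the training data (classification -> logistic regression)
model = LogisticRegression()
model.fit(X_train, y_train)

# evaluate on the testing data
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```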
This is followed by the last stage, which is prediction. Once the model is evaluated and improved, it is finally used to make predictions. The final output can be a categorical variable or a continuous quantity; in our case, for predicting the occurrence of rainfall, the output will be a categorical variable — true or false, yes or no, where yes represents that it's going to rain and no represents that it won't. As simple as that. So guys, that was the entire machine learning process [Music] Now, without wasting any more time, let us understand what regression in machine learning is. The main goal of regression is the construction of an efficient model to predict a dependent attribute from a bunch of attribute variables. A regression problem is one where the output variable is a real or continuous value, like salary, weight, area, etc. We can also define regression as a statistical method used in applications like housing and investing to predict the relationship between a dependent variable and a bunch of independent variables. For example, in a finance or investing application we can predict the value of a stock depending on independent variables like how many years it takes for the stock to mature, how many days it will take to grow, and other such variables, and based on that we can make a prediction of whether our investment will end up in profit or in loss. Or take housing: using parameters like how many years the house has been standing, how many people have lived in it, the area of the house and the number of rooms, we can predict the price of a house. That is basically what regression is. Now let's look at the various regression techniques: simple linear regression, polynomial regression, support vector regression, decision tree regression, random forest regression, and logistic regression, which is really more of a classification technique. For now we'll be focusing on simple linear regression, but let's briefly go over each. One of the most common regression techniques is simple linear regression, in which we predict the outcome of a dependent variable Y based on an independent variable X; the relationship between the variables is linear, hence the word linear regression. Then comes polynomial regression: in this technique we transform the original features into polynomial features of a given degree and then perform regression on them. After this we have support vector regression, or SVR: we identify a hyperplane with maximum margin such that the maximum number of data points are within that margin; it is quite similar to the support vector machine classification algorithm. Then we have decision tree regression: a decision tree can be used for both regression and classification, but in the case of regression we use the ID3 algorithm (Iterative Dichotomiser 3) to identify the splitting node by reducing the standard deviation. After this we have random forest regression, which is basically an ensemble of the predictions of several decision tree regressions. So that is all about the types of regression.
For now we are going to focus on simple linear regression, so let's take a look at what exactly it is. Simple linear regression is a regression technique in which the independent variable has a linear relationship with the dependent variable. The straight line in the diagram is the best fit line, and the main goal of simple linear regression is to consider the given data points and plot the best fit line that fits the model as well as possible. For a real-life analogy, take the resale value of a car: there are different parameters, like how many years the car has been in the market, how many kilometres it has been driven and the kind of mileage it gives, and all these independent variables are more or less linearly connected to the price of the car. That's one example for understanding linear regression, and in the use case I'll show you how you can predict the price of a car. Now, talking about linear regression terminologies, there are a few you have to be thorough with. First of all, the cost function. The best fit line is based on the linear equation in which the dependent variable to be predicted is denoted by Y, the point where the line touches the y-axis is the intercept b0, b1 is the slope of the line, X represents the independent variable that determines the prediction of Y, and the error in the resulting prediction is denoted by e. The cost function provides the best possible values for b0 and b1 to make the best fit line for the data points; we do this by converting the problem into a minimization problem, in which the error between the actual value and the predicted value is minimized. We square the error difference and sum the errors over all data points; dividing by the total number of data points gives the average squared error over all the data points, also known as the mean squared error or MSE, and we change the values of b0 and b1 so that the MSE settles at its minimum. That's one terminology, the cost function. Then we have gradient descent, the next important terminology for linear regression: it is a method of updating the b0 and b1 values to reduce the MSE. The idea is to keep iterating the b0 and b1 values until the MSE reaches its minimum; to update b0 and b1 we take gradients from the cost function, and to find these gradients we take partial derivatives with respect to b0 and b1 — these partial derivatives are the gradients used to update the values of b0 and b1. I'm sure this might be a little confusing if you are new to gradient descent and cost functions, but you don't have to worry, because in Python we'll be using sklearn, the scikit-learn library: you just have to plug your data into the linear regression module that is already available there and you'll be done with it. The formulas described above are written out below for reference.
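Written out, the line equation, the mean-squared-error cost function and the gradient-descent updates described above look like this (α is the learning rate, a symbol the lecture doesn't name, so treat it as my notation):

```latex
% the simple linear regression line
y = b_0 + b_1 x + e

% cost function: mean squared error over n data points
J(b_0, b_1) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (b_0 + b_1 x_i) \bigr)^2

% gradient descent updates (alpha is the learning rate)
b_0 \leftarrow b_0 - \alpha \frac{\partial J}{\partial b_0}, \qquad
b_1 \leftarrow b_1 - \alpha \frac{\partial J}{\partial b_1}
```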
When I implement the linear regression model, you'll see how easy it is to actually implement linear regression in Python. After this, let's talk about a few advantages and disadvantages of linear regression. Talking about the advantages first: linear regression performs exceptionally well for linearly separable data, it is very easy to implement and interpret, and it is very efficient to train; and even though linear regression is prone to overfitting, that can be handled reasonably well using dimensionality reduction techniques, regularization and cross-validation. One more advantage is that it supports extrapolation beyond a specific data set. Now let's talk about a few disadvantages as well. The most common disadvantage of linear regression is that it takes the assumption of linearity between the dependent and independent variables; the next is that it is often very prone to noise and overfitting, which is not a good sign for any regression or classification model; it is also quite sensitive to outliers; and the last one is that it is prone to multicollinearity. Those are the advantages and disadvantages of linear regression. Now let's look at a few use cases. We can use it for sales forecasting — predicting the price of an item or forecasting sales figures; we can use it for risk analysis; for disease prediction, where we take a disease data set and use the several features that are linearly connected to a particular risk; for housing applications, to predict prices and see which aspects are decisive in the price of a house; and of course for finance applications, to predict stock prices, evaluate investments and so on. The basic idea behind linear regression is to find the relationship between the dependent and independent variables and obtain the best fitting line that predicts the outcome with the least error. We can also use linear regression in simple real-life situations, like predicting SAT scores with regard to the number of hours of study and other decisive factors. Now that we are done with the use cases, let's look at the specific use case I'm going to show you with the scikit-learn library, where we'll implement a linear regression model. Let me take you through the steps: first of all we load the data; after that we explore the data, taking a look at the different features and the data points we have in the data set; then we slice the data according to our requirements; then we train and split the data, using the fit and predict methods that we have in scikit-learn, and generate the model; and after we are done making the model, we evaluate its accuracy. So let's take it up to PyCharm. I'll show you a very simple example first, with the diabetes data set that comes with scikit-learn, which we can simply import from the datasets module.
module. After this simple example I'm going to show you a custom data set of car resale values, where we'll predict prices for used cars. So without wasting any more time, let's take it up to Python. Now that we are in PyCharm, guys, we'll implement linear regression using the simple data set that comes with the scikit-learn library. First we import the basic libraries: matplotlib.pyplot as plt and numpy as np. From sklearn we import the datasets module, because that is where the diabetes data set lives, and we import linear_model for the linear regression, and from sklearn.metrics we import mean_squared_error for the evaluation. These are all the libraries I'm going to use. One thing before that: don't just use these imports as they are if you haven't installed the packages yet. I have already installed them, but you can go to the project interpreter in the PyCharm settings, hit the add button and install all the packages you need, for example scikit-learn. After this the first thing to do is load your data set. I'll take a variable called disease and assign datasets.load_diabetes() to it, because that's the data set we're going to use, and I'll print it once just to see what we're dealing with. So this is our data set, guys: we have the data array, we have a target and we have a description, so it's already nicely sliced up, which is pretty good for our purpose right now because I'm showing you a simple implementation of linear regression. I'll comment out that print line because we don't need it any more, and I'll take one more variable, say disease_X, and put disease.data into it. After this I'm going to split the data into train and test, so I'll take disease_X_train and disease_X_test, and the same pair of variables for y as well, disease_y_train and disease_y_test. For the training set I'll slice the data from the start up to minus 30, so the last 30 data entries are left out, and those last 30 entries become the test set. We do the same thing for y, except there we slice disease.target instead of the data, again keeping the last 30 entries for testing. So we have successfully split our data.
Now what happens after splitting the data is that you have to generate your model. For that I'll take one variable, let's call it reg for regression, and assign linear_model.LinearRegression() to it; that is how you generate the model. After this you have to fit your training data into the model using the fit function, so I'll call reg.fit and pass in disease_X_train, which has the feature values, and disease_y_train, which has the target values. After this I'll make a prediction variable as well, y_predict, and inside it I'll use the predict function on disease_X_test, because that is what we are testing right now. So our model is basically done, guys. I'll just write a little code for checking the error as well: I'll use mean_squared_error and pass in disease_y_test and the y_predict that we get after the prediction. After this I'll print that error, and let's print the weights and the intercept as well; there are attributes for both, coef_ gives you the coefficients, that is the weights, and intercept_ gives you the intercept. So your model is done, guys: we have used all these libraries, loaded the data set, split the data and generated the model. When I run this code, we get the mean squared error, which comes out around 2004, and then the intercept and the weights. Now this might look a bit abstract, so I'll add a few things to make it clearer: I'll slice the data with np.newaxis and take just column 2, so we keep a single feature, and then I'll plot the graph. I'll use plt.scatter with disease_X_test and disease_y_test for the data points, and plt.plot with disease_X_test and y_predict for the line. The first time I ran it I had forgotten plt.show, which is why no graph appeared; once that is added we can see the graph, and the line plotted through the points is our best fit line. So we have successfully implemented linear regression in Python for this diabetes data set. Note that I had to slice out a single feature to do this; if you try to plot a best fit line using all the columns of the data at once, it isn't going to work, the plotting will just throw you an error because the data is no longer two dimensional.
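To make this easier to follow, here is the whole diabetes example condensed into one runnable sketch. This is my own consolidation of what was just narrated, assuming scikit-learn, numpy and matplotlib are installed; the variable names and the choice of 30 held-out rows follow the walkthrough above.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

# load the built-in diabetes data set
disease = datasets.load_diabetes()

# keep a single feature (column 2) so the best fit line can be plotted in 2D
disease_X = disease.data[:, np.newaxis, 2]

# hold out the last 30 rows for testing, train on the rest
disease_X_train, disease_X_test = disease_X[:-30], disease_X[-30:]
disease_y_train, disease_y_test = disease.target[:-30], disease.target[-30:]

# generate the model and fit it on the training data
reg = linear_model.LinearRegression()
reg.fit(disease_X_train, disease_y_train)

# predict on the held-out rows and evaluate with mean squared error
y_predict = reg.predict(disease_X_test)
print("Mean squared error:", mean_squared_error(disease_y_test, y_predict))
print("Weights:", reg.coef_)
print("Intercept:", reg.intercept_)

# scatter the test points and draw the best fit line through them
plt.scatter(disease_X_test, disease_y_test)
plt.plot(disease_X_test, y_predict)
plt.show()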
Now that we are done with the diabetes example, I'll show you one more use case with a custom data set that holds car resale values, so let's get that done. I'll clear the previous code, and in the project directory you can already see cars.csv, which is the custom data set I'm going to use. First the imports: numpy as np, pandas as pd, because we need pandas to load the CSV file, matplotlib.pyplot as plt, and from sklearn.linear_model we import LinearRegression. I was also going to import r2_score from sklearn.metrics and the statsmodels library, but we won't need those here, so I'll remove them. First of all I have to load the data, so I'll name the variable cars and call pd.read_csv on the file; if you don't know how to load custom CSV files into your program, we have a detailed Edureka tutorial on exactly that. Let me get a quick exploration of the data: I'll print the head, and to get a better picture I'll print cars.columns as well, which gives me the names of all the columns in this data set. So we have sales in thousands, the four-year resale value, the price in thousands, and then engine size, horsepower, wheelbase, width, length, curb weight, fuel capacity and fuel efficiency. That's what our data looks like, guys. Let me plot a graph first of all: I'll call plt.figure with figsize 16 by 8, then a scatter plot where I check the relationship between the horsepower and the price of the car in thousands, with the points in black, the x label as horsepower and the y label as price. When I run this I get a scatter of all the data points, and as the horsepower goes up the price clearly increases: on the x axis the horsepower goes up to about 450, and a car with around 450 horsepower sits near 70,000 dollars, a 150 horsepower car is around 20,000, and a simple 50 horsepower car is around 10,000. So I think it's pretty evident that we have a relationship between horsepower and the price of the car, and we'll build the model around that. I'll take one variable, x, and put the horsepower column of cars into it, because horsepower is our independent variable, and I'm going to reshape those values into minus 1 and 1.
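Before going on with the narration, here is the complete car resale example in one place, a minimal sketch covering both the steps just described and the fitting and plotting narrated next. The file name cars.csv matches the walkthrough, but the exact column labels I use here ('Horsepower' and 'Price_in_thousands') are assumptions; check cars.columns and adjust them to whatever your copy of the file actually contains.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# load the custom data set (file name and column labels assumed from the walkthrough)
cars = pd.read_csv("cars.csv")
print(cars.head())
print(cars.columns)

# reshape the single feature into a column vector, since scikit-learn expects a 2D X
X = cars["Horsepower"].values.reshape(-1, 1)
y = cars["Price_in_thousands"].values.reshape(-1, 1)

# generate the model and fit it
reg = LinearRegression()
reg.fit(X, y)
print("Coefficient:", reg.coef_[0][0])
print("Intercept:", reg.intercept_[0])

# scatter the raw points and overlay the predictions as the best fit line
predictions = reg.predict(X)
plt.figure(figsize=(16, 8))
plt.scatter(cars["Horsepower"], cars["Price_in_thousands"], c="black")
plt.plot(cars["Horsepower"], predictions, c="blue", linewidth=2)
plt.xlabel("Horsepower")
plt.ylabel("Price (in thousands)")
plt.show()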
I'm going to take one more independent variable y that's going to be my price in thousands guys I'm going to do the reshaping again it's just basically to avoid the errors guys reshaping that I'm doing over here and I'll just generate my model right now so I'll use linear regression and after this I'll try to fit my X and Y over here using the fit method so let's write X and Y now I'm going to print the coefficient using the coefficient function that I've shown you before I'm going to take the values as 0 and 0. again for the same thing I want The Intercept as well so instead of coefficient I'll write as intercept and I'll remove this one after this let's make one variable let's just say predictions that's going to store the predicted value and we're going to use the predict function over here so I'm just going to put X over here in the predict function and let's just plot a graph again so I'll write figure size again let's just take 16 by 8. we'll use the scatter plot again we will just write it again so first of all we have cars and horsepower [Music] then we have cars price in thousands after this I'm just going to write C is equal to Black now we have one more plot for the line so I'm going to use plt.plot inside this we'll have cars horsepower we'll use the predictions as well let's just make it blue we'll put the label again so the X label is going to be the horsepower and the Y label is going to be prices or just price I'll write PLT dot show now when I run this I should be getting uh okay we'll get the first plot before getting the second plot so we have a plot like this so actually the horsepower is over here that is going up till 450 horsepower and this is actually our price now when I close this okay I've made a mistake somewhere okay this is actually now when I run this again it should be fine I'll close this and after this okay so I've got the mistake that I was making so before running this program I will add the line width as well and let's just run this again so first I'm getting this which is horsepower right here and we have the price right over here after this let's see so we have a best fit line over here as you can see guys so this is a simple linear regression implementation on a custom data set [Music] so today we'll be discussing logistic regression so let's move forward and understand the what and why of logistic regression now this algorithm is most widely used when the dependent variable or you can say the output is in the binary format so here you need to predict the outcome of a categorical dependent variable so the outcome should be always discrete or categorical in nature Now by discrete I mean the value should be binary or you can say you just have two values it can either be 0 or 1 it can either be yes or a no either be true or false or high or low so only these can be the outcomes so the value which you need to predict should be discrete or you can say categorical in nature whereas in linear regression we have the value of y or you can say the value you need to predict is in a Range so that is how there's a difference between linear regression and logistic regression now you must be having a question why not linear regression now guys in linear regression the value of y or the value is you need to predict is in a range but in our case as in the logistic regression we just have two values it can be either zero or it can be one it should not entertain the values which is below zero or above one but in linear regression we have the value of y in the range so 
here in order to implement logistic regression we need to clip this part so we don't need the value that is below zero or we don't need the value which is above one so since the value of y will be between only 0 and 1 that is the main rule of logistic regression the linear line has to be clipped at 0 and 1. now once we clip this graph it would look somewhat like this so here you are getting a curve which is nothing but three different straight lines so here we need to make a new way to solve this problem so this has to be formulated into equation and hence we come up with logistic regression so here the outcome is either 0 or 1 which is the main rule of logistic regression so with this a resulting curve cannot be formulated so hence the main aim to bring the values to 0 and 1 is fulfilled so that is how we came up with logistic regression now hail once it gets formulated into an equation it looks somewhat like this so guys this is nothing but a S curve or you can say the sigmoid curve or sigmoid function curve so this sigmoid function basically converts any value from minus infinity to Infinity to your discrete values which a logistic regression wants or you can say the values which are in binary format either 0 or 1. so if you see here the values as either 0 or 1 and this is nothing but just a transition of it but guys there's a catch over here so let's say I have a data point that is 0.8 now how can you decide whether your value is 0 or 1. now here you have the concept of threshold which basically divides your line so here threshold value basically indicates the probability of either winning or losing so hereby winning I mean the value is equals to 1 and by losing I mean the value is equals to zero but how does it do that let's say I have a data point which is over here let's say my cursor is at 0.8 so here I'll check whether this value is less than a threshold value or not let's say if it is more than my threshold value it should give me the result as 1 if it is less than that then should give me the result as 0. so here my threshold value is 0.5 now I need to Define that if my value let's say 0.8 it is more than 0.5 then the value shall be rounded off to 1 and let's say if it is less than 0.5 let's say I have a values 0.2 then should reduce it to zero so here you can use the concept of threshold value to find your output so here it should be discrete it should be either 0 or it should be one so I hope you caught this curve of logistic regression so guys this is the sigmoid S curve so to make this curve we need to make an equation so let me address that part as well so let's see how an equation is formed to imitate this functionality so over here we have an equation of a straight line which is y is equals to MX plus C so in this case I just have only one independent variable but let's say if we have many independent variable then the equation becomes M1 X1 plus M2 x 2 plus M3 X3 and so on till mnxn now let us put in B and X so here the equation becomes Y is equals to B1 X1 plus b 2 x 2 plus B3 X3 and so on till b n x n plus c so guys your equation of the straight line has a range from minus infinity to Infinity but in our case or you can say in logistic equation the value which we need to predict or you can say the Y value it can have the range only from 0 to 1. 
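As a quick numeric illustration of the sigmoid and the 0.5 threshold just described (this is my own toy example, not code from the video):

import numpy as np

def sigmoid(z):
    # squashes any value from minus infinity to infinity into the range (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.5, 4.0])
probabilities = sigmoid(z)
print(probabilities)   # roughly 0.018, 0.269, 0.5, 0.818, 0.982

# apply the 0.5 threshold: anything at or above it is rounded up to 1, below it to 0
predictions = (probabilities >= 0.5).astype(int)
print(predictions)     # [0 0 1 1 1]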
So in that case we need to transform this equation, and to do that we divide y by 1 minus y, which gives us the odds. Now if y equals 0, then 0 over 1 minus 0 is 0, and if y equals 1, then 1 over 1 minus 1 is 1 over 0, which is infinity, so the range of the odds is 0 to infinity. But we want the range to be from minus infinity to infinity, so we take the logarithm of this quantity. We end up with the log of y over 1 minus y, and equating that to the straight-line expression gives us the final logistic regression equation. And guys, don't worry, you don't have to write or memorize this formula: in Python you just call the LogisticRegression function and everything is handled for you automatically. I don't want to scare you with the maths behind it, but it's always good to know how this formula was generated.
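To keep the spoken algebra straight, here is the same derivation written out as equations, in the standard textbook form and using the same y and b notation as above:

y = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c, \quad \text{range } (-\infty, \infty)

\frac{y}{1-y} \in (0, \infty) \quad \text{(the odds, when } y \in (0,1)\text{)}

\log\!\left(\frac{y}{1-y}\right) = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c, \quad \text{range } (-\infty, \infty)

\Rightarrow \quad y = \frac{1}{1 + e^{-(b_1 x_1 + \dots + b_n x_n + c)}} \quad \text{(the sigmoid curve)}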
Next let us see the major differences between linear regression and logistic regression. First of all, in linear regression the value of y, the variable you need to predict, is continuous in nature, whereas in logistic regression it is categorical: the value you predict is discrete, it has just two values, 0 or 1. For example, whether it is raining or not raining, whether it is humid outside or not, whether it is going to snow or not, these are all cases where the outcome is discrete. Second, linear regression solves regression problems: you calculate the value of y from the value of x, and that y lies in a range. Logistic regression solves classification problems: it classifies the data and just tells you whether an event is happening or not. Third, the graph of linear regression is a straight line, whereas the curve we got for logistic regression is the S-shaped sigmoid curve, and using the sigmoid function you predict your y values. Moving ahead, let us see the various use cases where logistic regression is implemented in real life. The very first is weather prediction. Keep in mind that both linear and logistic regression can be used for weather: linear regression helps you predict what the temperature will be tomorrow or the day after, whereas logistic regression will only tell you whether it is going to rain or not, whether it is cloudy or not, whether it is going to snow or not, because those outcomes are discrete. Next we have classification problems: logistic regression can also be extended to multi-class classification, so it can help you tell whether something is a bird or not a bird, whether an animal is a dog or not a dog, whether it is a reptile or not a reptile. It also helps you determine illness. Let's say a patient goes for a routine checkup in hospital: the doctor performs various tests, records features like the sugar level, the blood pressure, the age of the patient and the previous medical history, and finally checks the patient data and determines the outcome of the illness and its severity. So these are the various use cases in which you can use logistic regression. I guess that's enough of the theory part, so let's move ahead and see some practical implementations. I'll be implementing two projects: in the first one I have the data set of the Titanic, where we predict which factors made people more likely to survive the sinking of the Titanic ship, and in the second one we do data analysis on SUV cars, where we have data about who purchased an SUV and we figure out which factors made people more interested in buying one. Let's start with the very first project, the Titanic data analysis. Some of you might know that there was a ship called the Titanic which hit an iceberg and sank to the bottom of the ocean, and it was a big disaster at the time because it was the first voyage of the ship and it was supposed to be really strongly built and one of the best ships of that era; of course there's a movie about this as well, so many of you might have watched it. What we have is data on the passengers, those who survived and those who did not survive in this particular tragedy, and what you have to do is look at this data and analyze which factors contributed the most to a person's chances of survival, so using logistic regression we can predict whether a person survived or died. So first let us explore the data set. We have an index, then the first column is the passenger ID, then the survived column, which has two discrete values, 0 for did not survive and 1 for survived, and then the passenger class, with three values, 1, 2 and 3, which tells you whether the passenger was traveling in first, second or third class.
Then we have the name of the passenger, the sex or gender of the passenger, and the age. Then we have SibSp, which is the number of siblings or spouses aboard the Titanic, with values such as 1, 0 and so on, and Parch, which is the number of parents or children aboard the Titanic. Then we have the ticket number, the fare, the cabin number and the embarked column; in the embarked column we have three values, S, C and Q, where S stands for Southampton, C stands for Cherbourg and Q stands for Queenstown. So these are the features we'll be applying our model on, and we'll perform various steps before implementing logistic regression. These are the steps required to implement any such algorithm. The very first step is to collect your data, or to import the libraries that are used for collecting the data and taking it forward. My second step is to analyze the data: here I can go through the various fields and ask questions like, did women and children survive better than men, did the rich passengers survive more than the poor ones, did the money matter, as in were the people who paid more to get onto the ship evacuated first, and what about the workers, what was the survival rate if you were a worker on the ship and not just a traveling passenger. All of these are very interesting questions and we'll go through them one by one, so in this stage you analyze and explore your data as much as you can. My third step is to wrangle the data, and data wrangling basically means cleaning your data, so here you remove the unnecessary items, and if you have null values in the data set you clean those up before taking it forward. Then you build your model using the train data set and test it on the test data, so we'll perform a split which divides the data set into training and testing subsets, and finally you check the accuracy so as to ensure how accurate your values are. So let's go into these steps in detail. Number one, collect your data, or you can say import the libraries, and let me show you the implementation part side by side. I'll open my Jupyter notebook and implement all of these steps one by one. So guys, this is my Jupyter notebook; first let me rename it to, let's say, Titanic data analysis. Our first step was to import all the libraries and collect the data, so let me import the libraries first. I'll import pandas, which is used for data analysis, as pd; then I'll import numpy as np, and numpy is a library in Python which stands for numerical Python and is widely used to perform scientific computation; next I'll import seaborn, which is a library for statistical plotting, as sns; I'll also import matplotlib.pyplot as plt, and to make this plotting library render inside the Jupyter notebook all I have to write is percentage matplotlib inline; and next I'll be importing one module as well so as to calculate the
basic mathematical functions so I say import maths so these are the libraries that I'll be needing in this Titanic data analysis so now let me just import my data set so I'll take a variable let's say Titanic data and using the pandas I will just read my CSV or you can see the data set I'll write the name of my data set that is titanic.csv now I have already showed you the data set so over here let me just print the top 10 rows so for that I'll just say I'll take the variable Titanic data dot head and I'll say the top 10 rows so now I'll just run this so to run this I just have to press shift plus enter or else you can just directly click on the cell so over here I have the index we have the passenger ID which is nothing but again the index which is starting from one then we have the survived column which has the categorical values or you can say the discrete values which is in the form of zero or one then we have the passenger class we have the name of the passenger sex H and so on so this is the data set that I'll be going forward with next let us print the number of passengers which are there in this original data set so for that I'll just simply type in print I'll say number of passengers and using the length function I can calculate the total length so I'll say Len and inside this I'll be passing this variable which is Titanic data so I'll just copy it from here I'll just paste it dot index and next let me just print this one so here the number of passengers which are there in an original data set we have is 891 so around this number we're traveling in the Titanic ship so over here my first step is done we have just collected data imported all the libraries and find out the total number of passengers which are traveling in Titanic so now let me just go back to presentation and let's see what is my next step so we're done with the collecting data next step is to analyze your data so over here we'll be creating different plots to check the relationship between variables as in how one variable is affecting the other so you can simply explore your data set by making use of various columns and then you can plot a graph between them so you can either plot a correlation graph you can plot a distribution graph it's up to you guys so let me just go back to my Jupiter notebook and let me analyze some of the data over here my second part is to analyze data so I just put this in header 2. 
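Collected in one place, here are the imports and the loading step just described, as a minimal sketch. It assumes the file sits next to the notebook as titanic.csv, and %matplotlib inline is a Jupyter-only line, shown here as a comment.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
# %matplotlib inline   # Jupyter-only: makes plots render inside the notebook

# load the Titanic data set and look at the first 10 rows
titanic_data = pd.read_csv("titanic.csv")
print(titanic_data.head(10))

# total number of passengers in the original data set (891 here)
print("Number of passengers:", len(titanic_data.index))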
now to put this in header 2 I just have to go on code click on markdown and I just run this so first let us plot a count plot where you can pay between the passengers who survived and who did not survive so for that I'll be using the seabon library so over here I have imported c bond as SNS so I don't have to write the whole name I'll simply say sns.count plot I'll say x is good to survived and the data that I'll be using is the Titanic data or you can say the name of variable in which you have stored your data set so now let me just run this so over here as you can see I have survived column on my x-axis and on the y-axis I have the count so 0 basically stands for did not survive and one stands for the passengers who did survive so over here you can see that around 550 of the passengers who did not survive and they were around 350 passengers who only survived so here you can basically conclude that there are very less survivors than non-survivors so this was the very first plot now let us got another plot to compare the sex as to whether out of all the passengers who survived and who did not survive how many were men and how many were female so to do that I'll simply say sns.count plot I add the Hue as sex so I want to know how many females and how many males survive then I'll be specifying the data so I am using Titanic data set and let me just run this okay I've done a mistake over here so over here you can see I have survived column on the x-axis and I have the count on the Y now so here your blue color stands for your male passengers and orange stands for your female so as you can see here the passengers who did not survive that has a value 0 so we can see that majority of males did not survive and if we see the people who survive here we can see the majority of females survive so this basically concludes the gender of the survival rate so it appears on average women were more than three times more likely to survive than men next let us plot another plot where we have the Hue as the passenger glass so over here we can see which class at the passenger was traveling in whether it was traveling in class 1 2 or 3. 
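Here are the count plots described so far, plus the passenger-class version that is narrated next, as one short sketch. The column names Survived, Sex and Pclass assume the standard Titanic CSV, and titanic_data is the DataFrame loaded earlier.

# survivors vs non-survivors
sns.countplot(x="Survived", data=titanic_data)
plt.show()

# the same split by gender: far more women appear in the survived bar
sns.countplot(x="Survived", hue="Sex", data=titanic_data)
plt.show()

# and split by passenger class (narrated next): third class dominates the non-survivors
sns.countplot(x="Survived", hue="Pclass", data=titanic_data)
plt.show()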
So for that I'll just write the same command, sns.countplot, keep Survived on the x axis, and change the hue to the passenger class, so the variable is named Pclass, with the data again being the Titanic data. This is the result: blue for first class, orange for second class and green for third class. The passengers who did not survive were mostly from the third class, the lowest or cheapest class to get onto the Titanic, and the people who did survive mostly belonged to the higher classes; classes one and two rise higher in the survived bar than the third class. So we can conclude that the passengers who did not survive were mostly third class, and the passengers traveling in first and second class tended to survive more. Next let's plot a graph for the age distribution. Here I can simply use pandas: I take the Age column of the Titanic data and call plot.hist on it to get a histogram. You can notice that we have a lot of young passengers, children between the ages of 0 and 10, then a big group of average-age people, and as the age goes up the count gets smaller. So that is the analysis of the age column: we have more young and middle-aged passengers traveling on the Titanic. Next let me plot a graph for the fare as well, so I take the Fare column of the Titanic data and again plot a histogram. Here you can see the fare sits mostly between 0 and 100. Now let me add a bin size so as to make it clearer, so I'll set bins equal to, let's say, 20, and I'll increase the figure size as well by giving figsize the dimensions 10 by 5; note that the argument for the number of bins is spelled bins, I typed bin at first.
so it is bins so this is more clear now next let us analyze the other columns as well so I'll just type in Titanic data and I want the information as to what all columns are left so here we have passenger ID which I guess it's of no use then we have see how many passengers survived and how many did not we also see the analysis on the gender basis we saw when the female tend to survive more or the men tend to survive more then we saw the passenger class where the passenger is traveling in the first class second class or third class then we have the name so in name we cannot do any analysis we saw the sex we saw the age as well then we have sib SP so this stands for the number of siblings or the spouses which are aboard the Titanic so let us do this as well so I'll say sns.count plot I'll mention X as Civ SP and I'll be using the Titanic data so you can see the plot over here so over here you can conclude that it has the maximum value on zero so you can conclude that neither a children nor a spouse was on board the Titanic now second most highest value is one and then we have very less values for two three four and so on next if I go above if we saw this column as well similarly you can do for parts so next we have passed or you can say the number of parents or children which were both the Titanic so similarly you can do this as well then we have the ticket number so I don't think so any analysis is required for Ticket then we have fair so far we have already discussed as in the people who tend to travel in the first class you will pay the highest pair then we have the cabin number and we have embarked so these are the columns that we'll be doing data wrangling on so we have analyzed the data and we have seen quite a few graphs in which we can conclude which variable is better than another or what are the relationship they hold basically means cleaning your data so if you have a large data set you might be having some null values or you can say any n values so it's very important that you remove all the unnecessary items that are present in your data set so removing this directly affects your accuracy so I'll just go ahead and clean my data by removing all the Nan values and unnecessary columns which has a null value in the data set so next I'll be performing data wrangling so first of all I'll check whether my data set is null or not so I'll say Titanic data which is the name of my data set and I'll say is null so this will basically tell me what all values are null and it will return me a Boolean result so this basically checks the missing data and your result will be in Boolean format as in the result will be true or false so false mean if it is not null and true means if it is null so let me just run this over here you can see the values as false or true so false is where the value is not null and true is where the value is done so over here you can see in the cabin column we have the very first value which is null so we have to do something on this so you can see that we have a large data set so the counting does not stop and we can actually see the sum of it we can actually print the number of passengers who have the Nan value in each column so I'll say Titanic underscore data is null and I want the sum of it so I'll save dot sum so this will basically print the number of passengers who have the NN values in each column so we can see that we have missing values in each column that is 177 then we have the maximum value in the cape in column and we have very Less in the Embark column 
that is just 2. Now if you don't want to stare at these numbers you can also plot a heat map and analyze it visually, so I'll call sns.heatmap on the isnull result, set yticklabels to False and run it. As we have already seen, there are three columns with missing data: the Age column, where roughly 20 percent of the values are missing, the Cabin column, which has a very large share of missing values, and the Embarked column with just two. You can also add a cmap for color coding, which makes the graph more readable: the yellow marks stand for True, that is where the values are null. So we have concluded that Age has missing values, Cabin has a lot of missing values, and Embarked has so few that they are barely visible. To remove these missing values you can either replace them, that is put in some dummy values, or you can simply drop the column. Let's first pick the Age column: I'll plot a box plot with the passenger class, Pclass, on the x axis and Age on the y axis, using the Titanic data, and you can see that the ages in first and second class tend to be higher than in the third class; that might depend on experience, how much you earn, or any number of reasons, but the conclusion is that passengers traveling in class 1 and class 2 tend to be older than those in class 3. So we have found missing values in Age, and one way to handle that is to drop the column or simply fill in some values; filling in values like this is called imputation. To continue the data wrangling, let me print the head of the data set, say the first five rows. We have Survived, which is categorical, so this is the column I can apply logistic regression to, my y value, the value I need to predict; then we have the passenger class, the name, the ticket number, the fare and the cabin, and we have already seen that Cabin has a lot of null values, so first of all we'll just drop that column. For dropping it I call drop on the Titanic data with the Cabin column, mention axis equal to 1 and set inplace to True, and when I print the head again you can see the Cabin column is gone. You can also drop the remaining NaN values, the entries that are not a number, by calling dropna with inplace equal to True. Now let me plot the heat map again, passing in the isnull check, with yticklabels set to False and the color bar turned off as well, and this will basically help me check whether the null values have actually been removed from the data set or not.
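Here is the data wrangling just walked through as one sketch: counting the missing values, visualising them, and dropping the Cabin column and the remaining NaN rows. The 'viridis' colormap is just my pick; the video only says to add a cmap.

# count the missing values per column and visualise them as a heat map
print(titanic_data.isnull().sum())
sns.heatmap(titanic_data.isnull(), yticklabels=False, cmap="viridis")
plt.show()

# age vs passenger class: first and second class passengers tend to be older
sns.boxplot(x="Pclass", y="Age", data=titanic_data)
plt.show()

# drop the Cabin column (mostly missing), then drop any remaining rows with NaN values
titanic_data.drop("Cabin", axis=1, inplace=True)
titanic_data.dropna(inplace=True)

# confirm the data set is clean: every count should now be zero and the heat map all dark
print(titanic_data.isnull().sum())
sns.heatmap(titanic_data.isnull(), yticklabels=False, cbar=False)
plt.show()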
And as you can see, there are no null values left; the heat map is entirely black now. You can also check the sum again: I'll copy the isnull line from above and call sum on it, and it tells me the data set is clean, there is no null or NaN value anywhere. So we have wrangled the data, or you can say cleaned it, although here we really did just one step of data wrangling, dropping one column; you can do a lot more, for example you can fill the missing values with some other value, like the mean of the column, instead of dropping them. Now if I look at the data set again with head, I still have a lot of string values, and these have to be converted to categorical variables in order to implement logistic regression, because whenever you apply machine learning you need to make sure there are no plain string values among your input variables. So we'll convert them into dummy variables, and this can be done using pandas. In my case the Survived column is what I need to predict, where 0 stands for did not survive and 1 stands for survived, so let me convert the other variables into dummies. I'll use pd.get_dummies, and you can press Tab to autocomplete and Shift+Tab to read the documentation, and I'll pass in the Sex column of the Titanic data. If you run this you'll see two columns: in the female column 0 means not a female and 1 means it is a female, and in the male column 0 means not a male and 1 means male. Now we don't need both of these columns, because one of them alone is enough to tell us the gender: if the value of male is 1, it is definitely a male and not a female. So we drop the first column by passing drop_first equal to True, which leaves just one column, male, with the values 0 and 1. Let me store this in a variable called sex, and look at sex.head, just the first five rows, to see what it looks like. So we've handled Sex; Age and SibSp are already numerical, then we have the ticket number, the fare and the Embarked column, where the values are S, C and Q, so here also we can apply the get_dummies function. I'll take a variable, let's say embark, use the pandas library, pass in the column name Embarked, and print the head of it, which shows C, Q and S. Here also we can drop the first column, because two values are enough: Q means the passenger boarded at Queenstown, S means Southampton, and if both values are 0 then the passenger is definitely from Cherbourg, the third value. So again I say drop_first equal to True and run it. Similarly you can do this for the passenger class as well, where we have three classes, 1, 2 and 3: I'll copy the same statement, name the variable pcl, pass in the column name Pclass and drop the first column, so we are left with just 2 and 3, and if both of those values are 0 then the passenger was definitely traveling in first class. Now that we have made these values categorical, my next step is to concatenate all these new columns into the Titanic data set, so using pandas I call pd.concat with the Titanic data, sex, embark and pcl, and mention axis equal to 1, and when I print the head you can see the new columns have been added: the male column, which tells whether a person is male or female, the Q and S columns for Embarked, where a 1 under Q means the passenger boarded at Queenstown and both being 0 means Cherbourg, and the passenger class columns 2 and 3, where both being 0 means the passenger was traveling in first class. So I hope you've got this till now. Now the original columns are irrelevant, so we can just drop them: we'll be dropping the Pclass, Embarked and Sex columns, and I'll even drop the passenger ID, because it's nothing but an index starting from 1.
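Here is the dummy-variable step in one sketch, including the column drops that are narrated here and just below; the column names again assume the standard Titanic CSV.

# convert the string/categorical columns into 0/1 dummy variables, dropping one level of each
sex = pd.get_dummies(titanic_data["Sex"], drop_first=True)          # keeps just 'male'
embark = pd.get_dummies(titanic_data["Embarked"], drop_first=True)  # keeps 'Q' and 'S'
pcl = pd.get_dummies(titanic_data["Pclass"], drop_first=True)       # keeps 2 and 3

# bolt the new columns onto the data set, then drop the originals and the irrelevant columns
titanic_data = pd.concat([titanic_data, sex, embark, pcl], axis=1)
titanic_data.drop(["Sex", "Embarked", "Pclass", "Name", "Ticket", "PassengerId"],
                  axis=1, inplace=True)
print(titanic_data.head())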
So I'll drop those, and I don't want the Name column either, so I'll delete that as well, and we can drop the Ticket column too, and then I'll mention the axis and set inplace equal to True; keep in mind that the column names start with an uppercase letter. Now let me print the data set again: this is my final data set, guys. We have the Survived column with the values 0 and 1, and I notice we forgot to drop the original passenger class column, so no worries, I'll drop that and run it again. Now we have Survived, Age, SibSp, Parch, Fare, the male column and the dummy columns we just created. So this was all the data wrangling, or you can say cleaning the data: we converted the gender to the male column, Embarked to Q and S, and the passenger class to 2 and 3. My next step is training and testing the data: we'll split the data set into a train subset and a test subset, build the model on the train data and then predict the output on the test data set, so let me go back to Jupyter and implement this as well; I'll put this under heading 3. First you need to define your dependent variable and independent variables. My y is the output, the value I need to predict, so I take the Survived column of the Titanic data, which has the discrete outcome 0 or 1, and everything else becomes the features, or you can say the independent variables, so X is the Titanic data with the Survived column dropped; everything else is a feature that leads to the survival rate. Once the dependent and independent variables are defined, the next step is to split the data into training and testing subsets, and for that we use scikit-learn: I write from sklearn.cross_validation import train_test_split, which is where this function lived in older scikit-learn versions. Now if you press Shift and Tab you can open the documentation and look at the examples: you get X_train, X_test, y_train and y_test from the train_test_split function by passing in your independent and dependent variables, a test size and a random state. So I'll copy that, pass in X and y, set the test size to 0.3, which basically means the data set is divided in a 70/30 ratio, and add a random state, let's say 1. That part isn't strictly necessary, but if you want exactly the same result as mine, the random state makes it take exactly the same sample every time. Next I have to train and predict by creating a model: logistic regression is imported from the linear_model module, so I write from sklearn.linear_model import LogisticRegression, and next I'll create an instance of this logistic regression model.
I'll call it logmodel and set it equal to LogisticRegression. Now I just need to fit my model, so I call logmodel.fit and pass in X_train and y_train, and the output shows all the details of the logistic regression estimator, the class weight, dual, fit_intercept and all those things. Then I need to make predictions, so I take a variable called predictions and assign logmodel.predict with X_test to it. So we have created a model, fitted it and made predictions; now, to evaluate how the model has been performing, you can simply calculate the accuracy or you can also compute a classification report, and don't worry guys, I'll show you both methods. I write from sklearn.metrics import classification_report, then call classification_report and pass in y_test and the predictions. So this is my classification report: we have the precision, the recall, the f1-score and the support, and the precision values come out around 0.75, 0.72 and 0.73, which is not that bad. In order to get at the accuracy you can also use the concept of a confusion matrix, so I write from sklearn.metrics import confusion_matrix, and once the function has been imported successfully I call confusion_matrix and again pass in the same variables, y_test and predictions. I hope you guys already know the concept of a confusion matrix, so can you give me a quick confirmation whether you remember it or not? If not, I can quickly summarize it. Okay, Swati is not clear with this, so let me tell you in brief what a confusion matrix is all about. A confusion matrix is nothing but a two-by-two matrix with four outcomes, and it basically tells us how accurate your values are: the columns are predicted no and predicted yes, and the rows are actual no and actual yes. So let me just fill in the values we have just calculated, which are 105, 21, 25 and 63.
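Pulling the last few steps together, here is the split/fit/evaluate part as one sketch. The video imports train_test_split from the old sklearn.cross_validation module; in current scikit-learn it lives in sklearn.model_selection, which is what I use here.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Survived is what we predict; every other column is a feature
X = titanic_data.drop("Survived", axis=1)
y = titanic_data["Survived"]

# 70/30 train/test split; random_state just makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# build the model, fit it on the training subset and predict on the test subset
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)

# precision/recall/f1, the confusion matrix and the overall accuracy
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))  # the video gets roughly [[105 21] [25 63]]
print(accuracy_score(y_test, predictions))    # about 0.78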
so as you can see here we have got four outcomes now 105 is the value where our model has predicted no and in reality it was also a no so here we have predicted no and an actual no similarly we have 63 as a predicted yes so here the model predicted yes and actually also it was a yes so in order to calculate the accuracy you just need to add the sum of these two values and just divide the whole by the sum so here these two values tells me where the model has actually predicted the correct output so this value is also called as true negative this is called as false positive this is called as true positive and this is called as false negative now in order to calculate the accuracy you don't have to do it manually so in Python you can just import accuracy score function and you can get the results from that so I'll just do that as well so I'll say from sklearn dot matrix import accuracy score and I'll simply print the accuracy I'll pass in the same variables that is y test and predictions so over here it tells me the accuracy has 78 which is quite good so over here if you want to do it manually you have to plus these two numbers which is 105 plus 63 so this comes out to almost 168 and then you have to divide it by the sum of all the phone numbers so 105 plus 63 plus 21 plus 25 so this gives me a result of 2 1 4. so now if you divide these two number you'll get the same accuracy that is 78 or you can say 0.78 so that is how you can calculate the accuracy so now let me just go back to my presentation and let's see what all we have covered till now so here we have first split our data into train and test subset then we have built our model on the train data and then predicted the output on the test data set and then my fifth step is to check the accuracy so here we have calculator accuracy to almost 78 percent which is quite good you cannot say that accuracy is bad so here it tells me how accurate your results are so here my accuracy score defines that and hence we got a good accuracy so now moving ahead let us see the second project that is SUV data analysis so in this a car company has released new SUV in the market and using the previous data about the sales of their SUV they want to predict the category of people who might be interested in buying this so using the logistic regression you need to find what factors made people more interested in buying this SUV so for this let us see a data set where I have user ID I have gender as male and female then we have the age we have the estimated salary and then we have the purchased column so this is my discrete column or you can say the categorical column so here we just have the value that is 0 and 1 and this column we need to predict whether a person can actually purchase a SUV or Not So based on these factors we will be deciding whether a person can actually purchase a SUV or not so we know the salary of a person we know the age and using these we can predict whether person can actually purchase SUV or not so let me just go to my Jupiter notebook and let us Implement logistic regression so guys I'll not be going through all the details of data cleaning and analyzing the part so that part I'll just leave it on you so just go ahead and practice as much as you can all right so my second project is SUV predictions all right so first of all I have to import all the libraries so I say import numpy as NP and similarly I'll do the rest of it all right so now let me just print the head of this data set so this we've already seen that we have columns as 
user ID, gender, age, the estimated salary, and then the purchased column that tells us whether the person actually bought an SUV or not. Now let us go straight to the algorithm part, so we'll start with logistic regression and how you can train the model, and for that we first need to define the independent variables and the dependent variable. I want my X, the independent variables, to be dataset.iloc with all the rows, which is what the colon stands for, and only columns two and three, with .values at the end, so it fetches every row but only the age and estimated salary columns, because those are the factors used to predict the dependent variable. My dependent variable is the purchased column, so for y I again take all the rows but only the fourth column, the purchased column, with .values; I had just forgotten one square bracket over here, which is easily fixed. So my independent variables are age and salary and the dependent variable is the purchased column. Now you must be wondering what this iloc function is: iloc is basically an indexer for a pandas DataFrame and it is used for integer-based indexing, or you can say selection by position. If I print the independent variables I see the age as well as the salary, and if I print the dependent variable I just see values of 0 and 1, where 0 stands for did not purchase. Next let me divide the data set into training and testing subsets, so I write from sklearn.cross_validation import train_test_split, press Shift and Tab, go to the examples and copy the same line, adjusting the variable names. I'll set the test size to 0.25, so the data is divided into a 75/25 train/test ratio, and let's say I take the random state as 0, which basically ensures the same result, or you can say the same samples are taken, whenever you run the code.
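The scaling, fitting and accuracy check are narrated next; here is the whole SUV workflow in one sketch. The file name suv_data.csv is a placeholder for whatever the CSV is called in your copy, I import train_test_split from sklearn.model_selection rather than the old cross_validation module, and note that the test set should only be transformed with the already-fitted scaler, not refitted.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the SUV data (columns: user ID, gender, age, estimated salary, purchased)
dataset = pd.read_csv("suv_data.csv")

# age and estimated salary (columns 2 and 3) as features, purchased (column 4) as the label
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# 75/25 split with a fixed random_state so the same samples are drawn on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# scale the inputs so the large salary values don't dominate the age column
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# fit logistic regression, predict on the test set and check the accuracy
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred) * 100)  # the video reports about 89 percent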
So the random state basically ensures that the same samples are taken whenever you run the code, so you get the same result each time so let me just run this now you can also scale your input values for better performance, and this can be done using StandardScaler, so I'll say from sklearn.preprocessing import StandardScaler now why do we scale although we are using a very small data set here, in a production environment you'll be working with large data sets with hundreds of thousands of tuples, and scaling the features down definitely helps the performance to a large extent the preprocessing module contains the methods and functionality required to transform your data, so I'll scale down both the training and the test inputs first I make an instance of it, sc = StandardScaler(), then X_train becomes sc.fit_transform(X_train), and similarly I transform X_test my next step is to import logistic regression, so I'll say from sklearn.linear_model import LogisticRegression then I make an instance of it as classifier = LogisticRegression(random_state=0) and simply fit the model by passing in X_train and y_train, and here it shows me all the parameters of the logistic regression then I have to predict the values, so I'll say y_pred = classifier.predict(X_test) so now we have scaled down our input values, applied logistic regression and predicted the values, and we want to know the accuracy to know the accuracy I import accuracy_score from sklearn.metrics, or you can do it manually by creating a confusion matrix, and I pass in my y_test and my predicted values so over here I get the accuracy as 0.89, and if you want it in percentage you just multiply it by 100, which gives 89 percent so here I have taken my independent variables as age and salary, predicted how many people are likely to purchase the SUV, and evaluated the model by checking the accuracy, and we get an accuracy of 89 percent which is great
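A sketch of the scaling, fitting, prediction and accuracy steps just walked through, continuing from the X_train/X_test/y_train/y_test split above. One small note: the sketch fits the scaler only on the training data and reuses it for the test data, which avoids information from the test set leaking into the scaling.

```python
# Sketch of the scaling, model fit, prediction and accuracy check above,
# continuing from the X_train/X_test/y_train/y_test variables.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit the scaler on the training data only
X_test = sc.transform(X_test)         # reuse the same scaling for the test data

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred) * 100)   # roughly 89% on this data set
```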
so what is classification if I have to tell you about classification, let's first step back to machine learning machine learning is nothing but a series of instructions you give to the computer so that it can learn the patterns from your data set to give you an example, imagine there is a trending topic, say you want to find out whether Prime Minister Modi will become the Prime Minister of the country once again or not so what you will do is collect the data set from multiple different sources and build an algorithm where you get a label as yes or no, yes he will continue as the next prime minister or no he will not you collect that data set, feed it to the computer, and that process of learning from it is called machine learning now machine learning is basically of three types one is called supervised machine learning, another is called unsupervised machine learning, and the third one is reinforcement learning when we speak about supervised machine learning, as the name suggests there is some supervision involved, like a teacher teaching a kid we give the training examples, a training data set with a proper label on top of it, and that is called supervised machine learning so if I draw it for you, supervised machine learning looks like this you have a structured data set with one column called the label, which is what you want to predict, and you have various predictors with which you want to predict it to give you an example, imagine you want to predict the pricing of an apartment in a particular community the variables could be how many floors it has, what the pollution level is, how many educational institutions are nearby, and so on, and based on those the pricing will change it is called supervised machine learning because we provide the independent variables or predictors and we also provide the labelled data now supervised machine learning is again of two types the first is regression based supervised machine learning and the second is classification based supervised machine learning what is the difference regression based supervised machine learning is where what you want to predict is continuous in nature, for example the community prices, which is a continuous value, and if the target is continuous we go with regression whereas if you want to predict a discrete outcome, for example whether I will win the match or not, whether a particular employee will churn out of the company or not, or whether a person has cancer or not, so if the output you want to predict is in the form of yes or no, or true or false, that is classification based supervised machine learning so classification based supervised machine learning is the process of dividing the data set into different categories or groups by adding a label so always remember, whenever you want to predict classes in the data set, whether something will happen or not, whether a person will commit credit card fraud or not, whether
the employee will churn out of the company or not, or whether a particular person will have diabetes as a disease or not, all these questions where you want a yes or no, true or false, or where you want to predict classes in the data set, fall under classification based supervised machine learning so today I will tell you about various forms of classification based supervised machine learning, although we will do a deep dive into decision trees, and you will be able to understand how the decision tree connects to the topic we are learning today now we have various algorithms for classification based supervised machine learning, and an algorithm is nothing but a set of mathematical equations first of all we have something called a decision tree, then we have random forest, then Naive Bayes, and then KNN, which is k nearest neighbors let me give you a few statements about these algorithms, and there are many others, but today we will focus on one of them, which is the decision tree so first let's start with the decision tree believe it or not, a decision tree is something you use every day in your daily life, because you take decisions based on various further decisions for example, to join today's webinar you might have checked whether it is a weekday or a weekend, what the time of the webinar is, what the topic is and who is conducting it, and based on all of that you took the decision of whether to attend or not we call it a decision tree because it is a graphical representation of all the possible solutions to a decision, and it is like a tree because a tree also starts with a root and then branches out similarly, here is the simplest decision tree example imagine you want to decide whether to go to a restaurant or to buy a hamburger you start with the root node, which checks whether I am hungry or not if I am not hungry at all, I simply go and sleep if I am hungry, then I check whether I have around 25 dollars with me if I have the money I go to a restaurant, and if I don't I buy a hamburger that is the simplest representation of a decision tree, you decide something on the basis of the previous outcomes and you can imagine any sort of example here, say
you want to find out whether a person will commit credit card fraud or not again it will depend on previous circumstances, for example on how much the person's salary is, what the person's job profile is, and how many cards the person already has, and on that basis you decide whether they are likely to commit fraud or not so that is the simplest form of a classification based algorithm then we have the next algorithm, which is called random forest now what is a random forest in the last example you built one decision tree, but a single decision tree can sometimes overfit so people said, why should I trust only one decision tree for example, whenever you take a bold decision in your life you don't trust only a single voice, you want to hear it from multiple different people to make your decision stronger if one doctor tells you that you have a certain disease, you often talk to a second, third or fourth doctor to confirm whether it is true that is the idea behind random forest it is called random forest because you now build a number of decision trees, like a forest of trees, no longer a single tree so you take your training data set, split it into multiple samples, build multiple decision trees on them, and then decide the final answer based on the majority it is also called a bagging style methodology, because we bring the outcomes of the various trees together to make a more powerful decision so that is another algorithm, random forest then after random forest we have something called Naive Bayes Naive Bayes is also a simple algorithm, and it is based on Bayes' theorem, which in turn is based on conditional probability so in Naive Bayes we decide whether something will happen or not on the basis of probabilities let me illustrate with this example imagine you want to find out whether a person has a disease or not say the probability of having the disease is 0.10 and the probability of not having it is 0.90 given the person is diseased, you further work out the probability that the test comes back positive and the probability that it comes back negative, and similarly on the other branch, given the person does not have the disease, you work out the probability of the test being positive and of it being negative so you check the outcomes of all the possible combinations, and that is what the Naive Bayes algorithm, built on Bayes' theorem and conditional probability, does
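To make the conditional-probability idea concrete, here is a small worked sketch of Bayes' theorem for the disease example. Only the 0.10 / 0.90 prior comes from the example above; the test sensitivity and false-positive rate are hypothetical numbers chosen purely for illustration.

```python
# Worked sketch of Bayes' theorem for the disease example above.
# Only the prior (0.10 diseased / 0.90 not diseased) comes from the example;
# the sensitivity and false-positive rate below are hypothetical.
p_disease = 0.10
p_no_disease = 0.90
p_pos_given_disease = 0.95      # hypothetical: P(test positive | disease)
p_pos_given_no_disease = 0.05   # hypothetical: P(test positive | no disease)

# P(disease | positive test) = P(pos | disease) * P(disease) / P(pos)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * p_no_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ≈ 0.679 with these example numbers
```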
then we also have something called k nearest neighbors, or KNN, which is another fine algorithm that helps you with classification now what is k nearest neighbors what happens in this case is something like this say I give you a customer transaction data set customer number one has bought product number one from a particular vendor at a particular rate, and this is the profit and the revenue this customer has given us, and similarly you have customer number two, customer number three, customer number four now if I ask you which customers have similar profiles, you can group your customers and see which ones behave in a similar way, for example you might say customers one, three and four behave similarly because they give us high profit and high revenue margins that is what happens in k nearest neighbors, you find out who the nearest neighbors are, that is, what similarities exist in the patterns, and the algorithm does this with a distance based mechanism to find out which data points are similar to each other so, moving on to the second topic, let's go into detail about what a decision tree is all about so far I have touched on classification based algorithms and their different types, but since the focus of today's class is the decision tree, let's take a deep dive into it a decision tree is a graphical representation of all possible solutions to a decision based on certain conditions what does that mean imagine you have a problem statement in your mind, should I accept a job offer or not how would you solve it with a decision tree first of all you start with what we call the root node, which here checks the salary if the salary is greater than or equal to 50,000 I move down the tree, and if it is not I simply do not accept the offer now suppose the salary is greater than 50,000, then I check another variable, whether I have to commute more than one hour if I have to commute more than one hour I decline the offer, and if not I still consider the option and check the next condition, for example whether the company also offers perks like free coffee, snacks or
other things like that if yes, you finally accept the offer, otherwise you decline it that is how a decision tree works, it keeps on splitting until you are able to reach your decision, which here was whether to accept the offer or not now let me take this further with another example imagine this is my data set, where I have fruits with a colour and a diameter a green fruit with a diameter of 3 is a mango, a yellow fruit with a diameter of 3 is also a mango, a red fruit with a diameter of 1 is a grape, another red fruit with a diameter of 1 is again a grape, and a yellow fruit with a diameter of 3 can be a lemon now if you have this kind of data set and you want to predict the label of the fruit, whether it is a mango, a lemon or a grape, you have to build a decision tree classifier on top of it so first of all you take the data set and start with a root node, say, is the diameter of the fruit greater than or equal to 3 or not if it is, you are left with three rows, the green one of diameter 3 which is a mango, the yellow one of diameter 3 which is a lemon, and the yellow one of diameter 3 which is a mango, and wherever the condition fails you are left with the rows where the diameter is less than 3, which are the two red grapes on that side I don't have to split any further, because I already have a pure result, if the diameter is less than 3 I know it is a grape whereas on the other side it can be a mango or it can be a lemon, I am not sure of my decision, so I split further, for example by asking is the colour equal to yellow or not if the colour is yellow I am left with the two rows that are mango or lemon, and if the colour is not yellow I am left with the remaining row, the green mango so that is how the splitting works in a decision tree
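Here is a small sketch of a decision tree fitted on that toy fruit table with scikit-learn. The colour encoding (one-hot columns) is an assumption made just to get the data into numeric form; the talk only describes the splits conceptually.

```python
# Sketch of a decision tree on the toy fruit table above.
# The one-hot encoding of colour is one reasonable choice; the talk does not
# specify an encoding, it only describes the splits.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

fruits = pd.DataFrame({
    "color":    ["green", "yellow", "red", "red", "yellow"],
    "diameter": [3, 3, 1, 1, 3],
    "label":    ["Mango", "Mango", "Grape", "Grape", "Lemon"],
})
X = pd.get_dummies(fruits[["color", "diameter"]])   # colour -> 0/1 columns
y = fruits["label"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
# One pure split (on diameter, or equivalently on the red colour) separates the
# grapes; a yellow fruit of diameter 3 can still be either a Mango or a Lemon,
# so that branch stays mixed, exactly as described above.
```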
so how do you decide which criterion or which variable to choose to split the tree that is done on the basis of the Gini index and the Information Gain your decision tree basically works on conditions, and if a split gives you a pure subset there is no need to split further, but if the subset is not pure you keep on splitting until you get one so in this case we got 100 percent mango on one side and a 50 percent mango, 50 percent lemon mix on the other, and which split to prefer depends on measures like the Gini index in the next slides let me explain the Gini index and Information Gain and how they work for example, imagine the feature is whether the colour is green or not on that basis, if the colour is green you get one row, and if not you get the other two rows, and at the next level the tree may decide on another basis, such as whether the diameter is greater than or equal to 3, and so on now let's go ahead with the questions that are probably coming to your mind about decision tree terminology you would be thinking, how should I decide which feature to use and which feature not to use that is an excellent question for understanding decision trees, but before I get to it let me tell you some of the terminology we commonly use while building a decision tree a decision tree has a tree based structure like this every decision tree has a root node, which is where the decision tree starts it represents the entire population or sample, and it gets divided further into two or more homogeneous sets, so it is the first feature on the basis of which your tree starts, just like a real tree starts with its root once you have the root, the tree starts getting branches, and we keep on splitting until we reach a decision then we also have parent and child nodes a child node is an intermediate node that a branch leads to, where a further condition is checked, like in the previous example where we asked whether the diameter is greater than or equal to 3 and looked at what happens if it is true and what happens if it is false, so it is not the final decision then we have the term splitting, which is what we keep doing until we reach a desired node, and finally the tree ends with a leaf node so always remember, you start your tree with the root node and you end it at a
leaf node, which is where you get the final decision of whether to do something or not okay and then pruning is the activity where you cut down parts of the decision tree, because if the tree keeps growing it becomes a case of overfitting, you shouldn't build a thousand branches of the tree when it is not even required, and I will explain this point as well now let's move further and look at an example of the Gini index and Information Gain this is where you were asking me which question to ask and when, that is, how you decide which feature has to be taken first so let's take up an example, and I would encourage every one of you to follow this closely, because this is the basis on which the algorithm decides how to split further, and we will do some maths on it imagine there is a data set where I want to find out whether I will play a match or not, say a cricket match or a football match the decision tree could look like this if the Outlook is sunny and the humidity is very high then I will not play the match, on the other hand if the Outlook is sunny and the humidity is normal I may play the match if the Outlook is overcast then I will always play, and if the Outlook is rainy and the winds are very strong I may not play the match, whereas if the Outlook is rainy and the wind is weak I may play the match now the question again is, how did I decide to start with Outlook, and how did I decide on these other nodes and features let me show you the entire data set it is basically a 14 day data set, and this is exactly what happens in a classification based algorithm, you get a data set like this where you want to find out whether I will play or not on the basis of four different variables or features, Outlook, temperature, humidity and wind so one data point reads like this, if the Outlook is sunny, the temperature is hot, the humidity is high and there is no wind, then I will not play the match, and similarly I have multiple data points now I have to decide how to build a decision tree, and out of these four features which one I should use first as my root node so from here let me illustrate what the Gini index is, what Information Gain is and how they are used to build your decision tree but before I explain the Gini index or Information Gain, one thing you should always understand is the concept of impurity what is impurity imagine there is a basket full of apples and a tray labelled Apple in this case you will never make a
mistake, because here you have only apples and only one label, so everything is perfect and there is basically no impurity in your data set on the other hand, let me flip the story imagine the basket has different fruits, an apple, a banana, grapes, cherries and so on, and there are many labels, and you have to match each fruit with its label in this case the impurity will not be zero, because there is now a chance of misclassification that is a very important concept, when everything is of one kind misclassification cannot happen, but when you have multiple labels with multiple fruits the impurity will not be zero this is associated with a term called entropy, which you may have come across in your chemistry classes in a simple sense, entropy is the randomness of the sample space whenever you are not sure of your decision, the entropy is high imagine I give you a data set where there is a 51 percent chance that an employee will leave the organization and a 49 percent chance that they will not, then you are really not sure of your decision and the entropy is very high, whereas if you are very sure of your decision the entropy is very low so we want the feature that gives us the lowest entropy, rather than the highest, to be selected as a good feature we generally calculate entropy with this formula, and don't get scared by it because everything happens automatically in R or Python the formula is, minus the probability that something will happen times log base 2 of that probability, minus the probability that it will not happen times log base 2 of that probability, so the entropy is minus p times log2 of p minus q times log2 of q for example, you plug in the probability that I will play the match and the probability that I will not play the match, and you get the entropy now let's see how we do this in our case we have 14 instances or rows in the data set, where 9 times I play the match and 5 times I do not so the total entropy is determined by taking the probability that I will play the match, which is 9 out of 14, multiplied by log base 2 of 9 out of 14,
and then the second term, the probability that I will not play the match, which is 5 out of 14, multiplied by log base 2 of 5 out of 14, with the minus signs applied as in the formula once you calculate this you find that the entropy of the entire system is 0.94, so that is the entropy of your whole sample space, and that is the first thing to work out now how does this entropy help in selecting which feature to take we go through the features one by one, whether I should take Outlook, temperature, humidity or wind first let's work it out for Outlook what are the distinct values of Outlook it can be sunny, overcast or rainy, those are the three combinations now when the Outlook is sunny, 2 times I play the match and 3 times I do not, when the Outlook is overcast I always play the match, and when the Outlook is rainy 3 times I play and 2 times I do not, that is how I have bifurcated it in the next step I work out the entropy of Outlook for each of its values so for sunny, remember the formula, the probability that I play is 2 out of 5 and the probability that I do not play is 3 out of 5, and plugging those into the entropy formula gives 0.971, which is the entropy when the Outlook is sunny next you calculate the entropy when the Outlook is overcast when it is overcast you are always playing, so there is no uncertainty at all, and if you apply the formula you get 0.
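A small sketch reproducing these entropy numbers in Python, using the counts quoted above (9 play / 5 don't overall, 2 play / 3 don't when sunny, 4 play / 0 don't when overcast):

```python
# Small sketch reproducing the entropy numbers from the play-the-match example.
from math import log2

def entropy(p_yes, p_no):
    """Two-class entropy; a term contributes 0 when its probability is 0."""
    return sum(-p * log2(p) for p in (p_yes, p_no) if p > 0)

print(round(entropy(9/14, 5/14), 2))   # 0.94  -> entropy of the whole data set
print(round(entropy(2/5, 3/5), 3))     # 0.971 -> Outlook = sunny (2 yes, 3 no)
print(entropy(4/4, 0/4))               # 0.0   -> Outlook = overcast (always play)
```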
and third, what happens when the Outlook is rainy in that case you again apply the formula and you get 0.971 so you find the entropy for each distinct value of the feature, Outlook being sunny, overcast and rainy, and then you finally calculate the Information Gain what you do is take the weighted entropy of the feature, that is, 5 out of 14 of the days are sunny, so 5/14 times the sunny entropy, plus 4 out of 14 are overcast, so 4/14 times the overcast entropy, plus 5 out of 14 are rainy, so 5/14 times the rainy entropy, and that gives you the information from Outlook once you have that, you subtract it from the total entropy we found on the last slide, which was 0.94, and what you get is called the Information Gain from that particular feature now don't get scared that you have to do this calculation by hand, what I am trying to explain is that this is how the algorithm works, for each and every feature it calculates the information gain, and the more information gain a feature has, the more important that variable is in predicting whether something will happen or not so for Outlook I got an information gain of 0.247 with all the calculations, then I proceed with wind and calculate its information gain, which comes out to 0.048, and similarly we calculate it for all four features let me put them together for you so these were the four variables, Outlook, temperature, humidity and wind, and I calculate the information gain for each one of them the information gain for Outlook is 0.247, and since the feature with the highest information gain becomes your root node, we pick Outlook as the root similarly, as the tree grows, the information gain is calculated again at each branch node and whichever feature gives the most information gain there is picked up next, and that is how you build your decision tree and finally end up with a tree like this
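Continuing the same numbers, here is a sketch of the information-gain calculation for Outlook; it redefines the small entropy helper from the previous sketch so it runs on its own.

```python
# Information gain for Outlook, using the counts from the 14-day table above.
from math import log2

def entropy(p_yes, p_no):
    return sum(-p * log2(p) for p in (p_yes, p_no) if p > 0)

total_entropy = entropy(9/14, 5/14)               # 0.940 for the whole data set

# (fraction of days with this Outlook value, entropy of that subset)
outlook = [
    (5/14, entropy(2/5, 3/5)),   # sunny:    2 play, 3 don't
    (4/14, entropy(4/4, 0/4)),   # overcast: 4 play, 0 don't
    (5/14, entropy(3/5, 2/5)),   # rainy:    3 play, 2 don't
]
weighted_entropy = sum(w * e for w, e in outlook)
info_gain_outlook = total_entropy - weighted_entropy
print(round(info_gain_outlook, 3))   # ≈ 0.247, the highest of the four
                                     # features, so Outlook becomes the root
```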
okay, so now let me quickly show you how we build a decision tree in Python, and after that I will summarize how you decide which algorithm to pick and when I have already built the decision tree, so I will quickly walk you through the commands as you know, in Python we start by importing packages, so I am importing numpy, I am importing matplotlib for plotting the charts, and I am importing various pieces from sklearn for the machine learning part, the LabelEncoder, the DecisionTreeClassifier, the classification report and the tree plotting utilities after the imports I read my data set I am showing you this with the help of the iris data set, which is one of the most popular data sets for learning to build a decision tree so imagine this is my data set, and I am showing you six rows of it, where I want to find out whether a particular flower is setosa, versicolor or virginica, based on the sepal length, sepal width, petal length and petal width so I have different dimensions of the flowers, lengths and widths, and based on those I want to predict the species as with every machine learning data set, we first play with the data I check the info, so the sepal length is a float, the petal length is a float, and the species is an object, because it is a categorical column after this I also check whether there are any null values present in the data set, because if there are we either have to get rid of them or replace them with some imputed value once that is done I also plot the data, because we usually do visualization to understand the data set better with seaborn's sns.pairplot I plot all the possible pairwise plots, so I can see what the combinations look like for the three species, setosa, versicolor and virginica, for example the trend between sepal length and sepal width, or between sepal width and petal length, so I am understanding the relationships between the variables I am also checking the correlation, where the stronger the shade the stronger the correlation all of this is done as part of exploratory data analysis before you even start building your machine learning model
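A sketch of those exploratory steps is below. It loads the iris data from seaborn just to keep the example runnable; the column name "species" and the exact source of the data in the walkthrough are assumptions.

```python
# Sketch of the exploratory steps described above, loading iris from seaborn
# so the example is self-contained; the walkthrough may load it elsewhere.
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")

df.info()                        # four float columns plus 'species' as object
print(df.isnull().sum())         # check for missing values before modelling

sns.pairplot(df, hue="species")  # pairwise scatter plots of the measurements
plt.show()

# correlation heatmap of the numeric columns; stronger shades mean a stronger
# linear relationship between two measurements
sns.heatmap(df.drop(columns="species").corr(), annot=True)
plt.show()
```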
once the exploration is done, I take the species column, which is what I want to predict, as my target or dependent variable, and all the measurement columns become my independent variables, which I call X now think about it, the variable I want to predict is the flower species as text, but the computer cannot understand text, it can only understand numbers, so I convert the species into classes 0, 1 and 2 with the encoder, so that instead of the three flower names setosa, versicolor and virginica, my label to predict becomes 0, 1 or 2 once I have converted that and it has become my target, I split the data set into training and test data, so basically 80 percent of the data goes into the training set and the remaining 20 percent I keep as the test set then I call the DecisionTreeClassifier, because I want to build a decision tree, fit it on my training data, and check the predictions on the test data to see how accurate my decision tree is so I get my precision, which is good, I get my recall, my F1 score and the support, and all these metrics are used to judge how accurate the predictions are, the higher the precision the better the results and finally I show you what the tree looks like, I have plotted the decision tree, and you can see it starts with the petal length and shows the Gini index at each node a small note here, the Gini index and Information Gain are simply two alternative criteria for deciding the splits, you can build the tree with either one and they usually lead to very similar trees so you can see that based on the petal length the tree gets split, then based on petal length thresholds of different values it splits further and further, and finally you reach the leaf nodes that tell you whether the particular flower is versicolor or virginica so this is how the entire decision tree is built in Python, and I know it takes time at first, but once you are comfortable as a data scientist it is very simple to build all of this in Python or R
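Here is a sketch of those modelling steps, continuing from the iris DataFrame loaded in the previous sketch. Note that LabelEncoder assigns the class numbers alphabetically, which may differ slightly from the manual mapping mentioned in the talk.

```python
# Sketch of the modelling steps walked through above: encode the species,
# split 80/20, fit a decision tree and inspect the predictions.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report

df = sns.load_dataset("iris")
X = df.drop(columns="species")                   # the four measurement columns
y = LabelEncoder().fit_transform(df["species"])  # species names -> 0/1/2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)        # 80% train, 20% test

clf = DecisionTreeClassifier(criterion="gini", random_state=0)  # or criterion="entropy"
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))     # precision, recall, F1, support

plot_tree(clf, feature_names=list(X.columns), filled=True)  # visualise the splits
plt.show()
```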
so now, finally, coming back to how you decide which algorithm to use and when this is based on the scikit-learn cheat sheet, and it works something like this first you look at how many samples or data points you have in the data set if you have more than 50 samples you move ahead, and if you have fewer than 50 data points the advice is simply to get more data then you decide what you want to predict if you want to predict a category and you have labelled data you go towards classification, if the data is not labelled you go towards clustering, if you want to predict a quantity you go towards regression, and if you just want to do exploratory analysis you can go for dimensionality reduction so basically this is a cheat sheet which we generally use to decide what to do with a data set and when so now let's understand what a random forest is a random forest is constructed using multiple decision trees, and the final decision is obtained by the majority vote of these decision trees let me make things very simple with an example suppose we have three independent decision trees, and I have an unknown fruit and I want these trees to tell me what exactly this fruit is so I pass the fruit to the first decision tree, the second decision tree and the third decision tree a random forest is nothing but a combination of these decision trees, so their results are fed into the random forest algorithm the first decision tree classifies it as a peach, the second says it is an apple, and the third says it is a peach, so the random forest sees two votes for peach and one for apple and says the unknown fruit is a peach this is based on the majority voting of the decision trees, and that is how a random forest classifier comes to a decision when predicting an unknown value this was a classification problem, so it took the majority vote, whereas if it had been a regression problem it would have taken the mean now, before going further, remember that the building blocks of a random forest are decision trees, and that is why studying decision trees is important, because if we understand one decision tree we can apply the same concepts to the random forest so let's look at the important terms, using the same small decision tree from the previous example, because these terms are relevant to random forest as well the first is the root node here the entire training data is fed to the root node, and each node asks a true or false question with respect to one of the features and, in response, partitions the data set into different subsets that is what is happening here with the condition that if the body mass is greater than or equal to 3500 it asks a yes or no question, and based on that the next partition is made, and if not it simply classifies the species then again comes the splitting, and this is very important, the splitting takes place with the help of the Gini or entropy methods, which help decide the optimal split, and we will discuss the splitting methods very soon then we have the decision nodes, which provide the link to the leaf nodes, and these matter because only the leaf nodes tell us the real predictions, that is, to which class the species belongs and finally the leaf nodes are the end points where no further division takes place and where we obtain our predictions okay, so now coming to another important thing, the working of a random forest to understand the working of a random forest we have to understand a few important concepts, random sampling with replacement, feature selection, and the ensemble technique that is
used in random forests, which is bootstrap aggregation, also known as bagging we will understand this with the help of a simple example, and then we will see how feature selection is done for both classification and regression problems, that is, how a random forest selects features for constructing its decision trees in a random forest the best split is chosen based on Gini impurity or Information Gain, and we will get to that as well first let us understand random sampling with replacement here we have a small subset of the same penguin data set, with six rows and four features, that is four columns, and the arrows show that we will be creating three subsets from it these three subsets will become our decision trees, because we will construct one decision tree from each subset so let us create the first subset, and you can see it has been created randomly, and for convenience let me also show you the other subsets now look carefully in the first subset we have certain random rows and certain features, Island and body mass, though we do not yet know how those features were selected, in the second subset we have Island and flipper length, and in the third subset we have body mass and flipper length when I talk about these columns, that is feature selection, so remember that term now for the rows, random sampling is nothing but selecting rows randomly from the data to create each subset so what is replacement replacement can be seen in the second subset, where the Gentoo species row is repeated that means we are working with repeated rows, and a row can be repeated again in the second or the third subset, so this is random sampling with replacement, meaning the random forest can use the same row multiple times across multiple decision trees that is the basic idea of random sampling with replacement and feature selection in a random forest another term worth noting is that these small subsets are also known as bootstrap data sets, and when we aggregate the results of all these data sets it becomes bootstrap aggregation, so I am just filling in the gaps so that the concepts become clearer later now let's draw the decision trees of these subsets for the first subset we again take body mass as the root node and split on whether the body mass is greater than or equal to 3500 if it is no then the species is Chinstrap, and if it is yes we partition again based on the island, where Torgersen gives Adelie and Biscoe gives Gentoo we construct the decision trees of the remaining subsets the same way for the second subset we take flipper length and split on whether it is greater than or equal to 190 if yes the species is Gentoo, and if no we make another decision based on the island, where Torgersen gives Adelie and Dream Island gives Chinstrap, and that is how the decision tree of the second subset takes its decisions, based on the depth of the tree and the features it has selected
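The random sampling with replacement described above can be sketched directly on the seaborn penguins data; the subset size, the number of subsets and the two-features-per-subset choice here simply mirror the example, they are not fixed rules.

```python
# Sketch of random sampling with replacement: each bootstrap subset draws rows
# *with* replacement and keeps only a couple of feature columns, mirroring the
# three-subset example above.
import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset("penguins").dropna()
feature_cols = ["island", "bill_length_mm", "flipper_length_mm", "body_mass_g"]

rng = np.random.default_rng(0)
bootstrap_subsets = []
for i in range(3):                                        # three subsets, as in the example
    rows = df.sample(n=6, replace=True, random_state=i)   # the same row can repeat
    cols = rng.choice(feature_cols, size=2, replace=False)  # two random features
    bootstrap_subsets.append(rows[list(cols) + ["species"]])

print(bootstrap_subsets[0])
```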
now let's create the third decision tree from the third subset, and we get a tree where the root checks whether the body mass is greater than 4000 if yes then it is clearly a Gentoo, and if no we partition again with respect to flipper length, and if the flipper length is greater than or equal to 190 the species is Adelie, else it is Chinstrap so that is how decision tree three makes its decisions let's keep these three decision trees with us, we will make sense of them in a moment, but before that let us understand how feature selection is done in a random forest, that is, how the columns are being selected for classification, by default the number of features per tree is taken as the square root of the total number of features here I have four features, so the square root of four is two, and each decision tree is constructed with two features if I had 16 features it would be the square root of 16, which is four, so four features would go into each decision tree and if this had been a regression problem, then by default the features would be selected by taking the total number of features and dividing it by three so that is how feature selection is done by default in a random forest now let us consolidate our learning with the ensemble technique, which here is bootstrap aggregation random forest uses an ensemble technique, and ensembling just means that you aggregate the results of the decision trees, take the majority vote in the case of classification and the mean in the case of regression, and give that as the output so we have plotted all our decision trees again, and below them there is an unknown data point whose species I want to predict we feed this unknown data to each of the decision trees and see what each one predicts decision tree one says the species seems to be Chinstrap, decision tree two says that based on the data this species is Adelie, and decision tree three says the species is Chinstrap all of these results are fed to the random forest classifier, which sees two votes for Chinstrap and one for Adelie, so the new data point is classified as Chinstrap so this is how bootstrap aggregation works, the decisions taken by the different decision trees are combined and aggregated by majority voting and we get an ensembled result from the random forest, and that is the simple concept of the ensemble technique used in a random forest
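The two ideas just described, the per-tree feature subset size and the majority-vote aggregation, can be sketched in a few lines; in scikit-learn's RandomForestClassifier both happen internally, with max_features="sqrt" corresponding to the square-root rule for classification.

```python
# Sketch of the default feature-subset rule and the majority-vote aggregation.
import math
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

n_features = 4
print(math.isqrt(n_features))   # 2 features per tree for classification (sqrt rule)
print(round(n_features / 3))    # ~1 feature per tree for regression (n/3 rule)

# majority vote over the three toy trees from the example above
votes = ["Chinstrap", "Adelie", "Chinstrap"]
print(Counter(votes).most_common(1)[0][0])   # -> 'Chinstrap'

# scikit-learn does both internally: bootstrapped rows, sqrt feature subsets,
# and majority voting across n_estimators trees; fit it once the features are
# numeric, e.g. forest.fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
```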
okay, so now let's move on to the splitting methods that we use in a random forest there are several, like Gini impurity, Information Gain or chi-square so let's discuss Gini impurity first Gini impurity is used to measure the likelihood that a randomly selected example would be incorrectly classified by a specific node, and it is called an impurity metric because it shows how far that node is from a pure division another thing to note about Gini impurity is that it is 0 when all of the elements belong to a single class, higher values indicate that the elements are mixed across classes, and a value of 0.5 indicates that the elements are uniformly distributed across two classes now moving on to Information Gain this is another splitting method a random forest can use, and Information Gain utilizes entropy the features that provide the most information about a class are the ones selected, and entropy itself is simply a measure of randomness or uncertainty in the data right, so now let's move on to the advantages of random forest, and there are several of them let's focus first on low variance since a random forest overcomes the limitations of a single decision tree, it has the advantage of low variance, because it combines the results of multiple decision trees, each of which is trained on a limited subset of the data, so each tree is smaller and less prone to fitting the noise the next point is reduced overfitting again, since we are working with multiple shallower decision trees and with bootstrap aggregation, or bagging, the overall model fits well without trying to learn the noise, and this is one of the reasons random forest is so popular, you don't have to worry as much about overfitting the data another advantage is that normalization is not required, because random forest works on a rule based approach and another advantage is that it gives really good accuracy, which we will also see in our hands on it gives good precision and recall and generalizes well on unseen data compared to other machine learning classifiers such as Naive Bayes, SVM or KNN a few more advantages are that it is suitable for both classification and regression problems, it works well with both categorical and continuous data, so you can use it with most data sets, and it performs well on large data sets, which is why random forest is so widely used in machine learning problems now moving on to some disadvantages of random forest the first disadvantage is that it requires more training time because of the multiple decision trees, if you have a huge data set you may be constructing hundreds of decision trees and that takes a lot of training time another disadvantage is that interpretation becomes really complex when you have multiple decision trees a single decision tree is easy to interpret, but when you combine hundreds of trees into a random forest it becomes quite hard to apprehend exactly what the model is predicting, where the splits occur and which features are being selected another disadvantage is that it requires more memory, because we are working with and storing multiple decision trees, and finally it is computationally expensive and requires a lot of resources for training all of those trees so this was all about random forest, the theory part of it
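Before moving to the hands-on, here is a small sketch of the Gini impurity measure mentioned in the splitting-methods discussion above, just to make the 0-for-pure, higher-for-mixed behaviour concrete; the penguin labels are only example inputs.

```python
# Small sketch of Gini impurity: 0 for a pure node, larger values for more
# mixed nodes (0.5 for a 50/50 two-class node).
from collections import Counter

def gini_impurity(labels):
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["Adelie"] * 6))                     # 0.0  -> pure node
print(gini_impurity(["Adelie"] * 3 + ["Gentoo"] * 3))    # 0.5  -> evenly mixed
print(gini_impurity(["Adelie", "Gentoo", "Chinstrap"]))  # ≈ 0.67, three classes
```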
Now let's look at the disadvantages of random forest. The first is that it requires more training time because of the multiple decision trees: with a huge data set you could be constructing hundreds of decision trees, and that takes a lot of training time. Another disadvantage is that interpretation becomes really complex with multiple decision trees; interpreting a single decision tree is easy, but when you combine hundreds of trees into a random forest it becomes quite hard to understand exactly what the model is predicting, where the splits occur, which features are being selected, and so on. It also requires more memory, since we are storing multiple decision trees, and it is computationally expensive and needs a lot of resources to train and store all those trees. So that was the theory of random forest; now let us move on to a practical demonstration, a hands-on with random forest. Let's import a few basic Python libraries in our Jupyter notebook: we import pandas as pd, numpy as np, and seaborn as sns. Seaborn is needed here because we want to load the penguins data set, which comes preloaded with seaborn; seaborn ships several practice data sets, which is convenient for beginners. The asterisk next to the cell means it is still running, so let it finish. We get our data in an object called df and can see the first five entries; the data frame is shown as a table of rows and columns, with the species, island, bill length, bill depth, flipper length, body mass, and the sex of each penguin. Our task is to classify each penguin into the correct species. Checking the shape of the data shows 344 rows and 7 columns, and df.info() gives us the non-null counts along with the data types: species, island, and sex are of object data type, whereas bill length, bill depth, flipper length, and body mass are floating-point. Next we count the null values with df.isnull().sum(): there are about two null values in each of the feature columns such as bill length, bill depth, flipper length, and body mass, and 11 null values in the sex column. Since these are very few, we simply drop those rows from the data frame, and checking again with isnull().sum() confirms they have been removed. Now let us do some feature engineering. We have object data types in our data frame, and before feeding it to the random forest algorithm we have to transform the categorical (object) columns into numeric ones. There are various ways to do this in Python, such as one-hot encoding or a mapping function, but here we will use one-hot encoding.
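A minimal sketch of the loading and cleaning steps just described; the variable names are my own, and the comments summarize what the transcript reports about this data set:

```python
import pandas as pd
import numpy as np
import seaborn as sns

# Load the penguins data set that ships with seaborn
df = sns.load_dataset("penguins")

print(df.head())            # species, island, bill/flipper measurements, body mass, sex
print(df.shape)             # (344, 7)
df.info()                   # object dtypes for species/island/sex, floats for the measurements
print(df.isnull().sum())    # a couple of nulls in the measurements, more in 'sex'

# The null counts are small, so drop those rows and confirm
df = df.dropna()
print(df.isnull().sum())
```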
So let us do that. First, apply it to the sex column. There are two unique values, male and female, and we use pandas get_dummies to perform the one-hot encoding: get_dummies converts each unique value into its own column, so the two values male and female become two columns in the data frame. One thing to note here is the dummy variable trap: we only have two unique values, but if a column had six or seven unique values, one-hot encoding it would add a lot of features to the data frame and lead to unnecessary complexity, so to keep things simple I use one-hot encoding when the number of unique values is low, which is the case here. Also, one of the two columns is redundant, since it just repeats the information in the other, so I drop the first column and keep only the male column: if the value is 1 the penguin is male, and if it is 0 the penguin is female, so a single column is enough. Now apply one-hot encoding to the island feature in the same way. Island has three unique values, Torgersen, Biscoe, and Dream, and it is of object data type, so again we use pd.get_dummies on the island feature and look at the head: the three unique values are converted into three columns, and once more we drop the first column and keep the remaining two. Here too we can infer the dropped category: for example, if the Torgersen column is 1, then the island is neither Dream nor Biscoe. Remember that these two dummy data frames, for island and for sex, are still separate and not yet part of the main data frame, so we now concatenate them into the original one: we create a new data frame, new_data, using pd.concat on the original df plus the island and sex dummies with axis=1, that is, along the columns. Looking at the head, everything is now concatenated into a single data frame, which is what we need before splitting the data into train and test sets. This new data frame still has some repeated columns that need to be deleted: we drop the original sex and island columns, since they are redundant given the male, Dream, and Torgersen columns, using new_data.drop with the column names, axis=1, and inplace=True. The head of the resulting data frame shows the first five rows.
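A sketch of the encoding and concatenation steps, assuming df is the cleaned penguins frame from the previous sketch; drop_first=True plays the role of dropping the redundant first dummy column:

```python
import pandas as pd

# One-hot encode 'sex' and 'island', dropping the first (redundant) dummy column
sex_dummies = pd.get_dummies(df["sex"], drop_first=True)        # keeps a single 'Male' column
island_dummies = pd.get_dummies(df["island"], drop_first=True)  # keeps 'Dream' and 'Torgersen'

# Concatenate along the columns, then drop the original categorical columns
new_data = pd.concat([df, island_dummies, sex_dummies], axis=1)
new_data.drop(["island", "sex"], axis=1, inplace=True)
print(new_data.head())
```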
It is now time to create a separate target variable: we store only the species column from new_data in a variable called y, and y.head() shows the first five species, confirming that the target variable has been created. Looking at y.unique() we see the three unique penguin species, Adelie, Chinstrap, and Gentoo, and the data type is still object, so we again need to convert it to numeric. This time we use Python's map function and map Adelie to 0, Chinstrap to 1, and Gentoo to 2; all the values are now mapped to numbers, which is another way to convert a categorical value into a numeric one in Python. Next we drop the target column species from the main data frame, and checking the new frame confirms the species column is gone. We store this new data frame in X and split the data: from sklearn.model_selection we import train_test_split and split into 70 percent training data and 30 percent test data, with random_state set to 0. Fixing the random state like this makes the code reproducible: if I run this cell again I get the same split, whereas choosing a different random state would give a different split and slightly different results. Printing the shapes of X_train, y_train, X_test, and y_test confirms the 70/30 split: X_train has 233 rows and 7 features, X_test has 100 rows and 7 features, and correspondingly y_train has 233 values and y_test has 100. Now we train the random forest classifier on the training set: we import RandomForestClassifier from sklearn.ensemble (we have already discussed what an ensemble is), create a classifier where n_estimators is the number of decision trees, here 5, with the criterion set to entropy and random_state again 0, and fit it on X_train and y_train. Next we make predictions: we create y_pred by predicting on X_test and print it, and then we check the quality of the random forest with a confusion matrix. From sklearn.metrics we import classification_report, confusion_matrix, and accuracy_score, print the confusion matrix of y_test versus y_pred, and see an accuracy score of about 98 percent: only two cases have been misclassified and the rest have been classified correctly. Printing the classification report of y_test and y_pred, we get a precision of about 96 percent, a recall (true positive rate) of 100 percent, which is very nice, and an F1 score that is also good, about 98 percent.
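Putting the target mapping, the split, the training, and the evaluation together, here is a minimal sketch that assumes the new_data frame built above; the final loop also previews the criterion and tree-count experiment discussed next:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Target: map the species names to integers
y = new_data["species"].map({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})
X = new_data.drop("species", axis=1)

# 70/30 split, fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Random forest with 5 trees and the entropy criterion
clf = RandomForestClassifier(n_estimators=5, criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# Experimenting with the criterion and the number of trees, as discussed next
for n in (5, 7, 12):
    model = RandomForestClassifier(n_estimators=n, criterion="gini", random_state=0)
    model.fit(X_train, y_train)
    print(n, "trees:", accuracy_score(y_test, model.predict(X_test)))
```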
So this is giving us a good result, but what if we change the criterion from entropy to Gini? Let's experiment with that too, and also try a different number of trees. Again we import RandomForestClassifier from sklearn.ensemble and fit it, but this time using seven trees instead of five, with the criterion set to gini and random_state 0. Running this, predicting, and checking the accuracy score, we get about 99 percent accuracy after changing the criterion and the number of trees. You can keep experimenting with different numbers of decision trees; for example with 12 trees the accuracy drops back to about 98 percent, and since seven trees gave us 99 percent, let's keep seven, because it gives us really good accuracy. So that is the random forest classifier and how it works with several trees and different split criteria to give very good accuracy on our training and test data. Next, the KNN algorithm. KNN stands for K nearest neighbors, and it is an example of a supervised learning algorithm where you classify a new data point based on its neighbors, that is, based on which existing data points are closest to it. For example, suppose on one side you have dogs and on the other side you have cats; if a new data point, a picture of a new animal, lies near the cats, we classify it as a cat, whereas if it lies nearer to the dogs we classify it as a dog. That is the neighborhood idea, and it is something we see in regular day-to-day life as well: our parents tell us not to spend time with certain children because they are not good at their studies or their behavior, on the assumption that we would become like them, which is again a real-life example of judging a person by the company they keep. Now let's look at the features of the KNN algorithm. As mentioned, KNN is a supervised learning algorithm, typically used for supervised learning problems, and it is very simple and intuitive: your class is predicted based on your nearest neighbors, as the name suggests. It is also a non-parametric technique, and I would like to spend a couple of minutes on what non-parametric means. Typically, supervised machine learning algorithms come in two kinds, parametric and non-parametric.
When we say parametric, we mean that the algorithm assumes there is an underlying function or distribution describing the data. For example, linear regression assumes that the relationship between X and Y is linear in nature, and similarly a Gaussian model assumes a normal distribution of the data points; many supervised machine learning algorithms assume some kind of functional association between the predictors, the inputs that help you predict, and the variable you are trying to predict. On the other hand, there are algorithms such as KNN, the Parzen window, or discriminant analysis that are grouped under non-parametric techniques, because they do not assume any particular distribution or any particular functional relationship in the data whose pattern you are trying to learn. KNN is one such technique. The Parzen window is another: in the Parzen-window approach the volume, or the area that a region of the data covers, is fixed and you count the number of data points K that fall within that region to estimate the underlying density, whereas in KNN it is the opposite, the K is fixed, you find the volume needed to capture the K nearest neighbors, and from that you estimate the probability or underlying density; in the classification setting you simply associate the class of the point you are predicting with its K nearest neighbors, where K can be 3, 4, 5, 10, 20, and so on. Then there is discriminant analysis, of which there are two kinds, linear discriminant analysis and multiple discriminant analysis: there you transform the features into a new feature space and apply a parametric approach in that space, for example projecting the features onto a line; if the projection is onto a subspace of more than one dimension, it becomes multiple discriminant analysis. Even if you did not fully follow those three types, what you need to keep in mind is that a non-parametric technique does not assume any distribution or functional relationship in the underlying data, which gives us a lot of flexibility, whereas parametric techniques assume some functional relationship between the data points or some distribution. So where would you use a non-parametric versus a parametric technique? You would use a non-parametric technique when you do not know the functional relationship, and when you have a reasonable amount of data to support it; on the other hand, a non-parametric technique like KNN does not work well on very high-dimensional data, so in that case you would prefer a parametric technique, while with smaller, lower-dimensional data sets a non-parametric technique is often a reasonable choice. Similarly,
if you understand that the relationship between the data points might be, for example, linear, you would use a parametric technique, because you are assuming that some such relationship, linear or otherwise, exists between the data points. So, to summarize again: non-parametric is an approach where you do not assume any distribution or functional relationship, whereas parametric assumes a functional relationship or a distribution for the data points. The other feature of KNN is that it is a lazy algorithm. What do we mean by lazy? In most supervised learning algorithms you train your model on the training data set, you then have your model, and you apply that model to the test data set to classify or predict, for example whether a new image is of a cat or a dog; this can be an algorithm like a support vector machine, regression, or logistic regression, which learns the features or parameters of the model from the training data and then applies that learned model to the test data. In the case of KNN there is no training step at all, which is why it is called a lazy algorithm: at the moment you actually want to predict, it goes and computes the distances of the new data point, say the new image of a cat or dog, from all the other data points you have, and then it checks which K data points are nearest to it. Nothing happens, no calculations are done, until the point at which you are actually trying to predict something, so there is no training step involved. KNN can be used for both classification and regression, as we just mentioned: you can use it to classify something, like whether an image is a cat or a dog, or to predict a value, some forecast, for example. It is also based on feature similarity. What do we mean by feature similarity? If, for example, you are classifying cats versus dogs, then the features can be things like whether the eyes look like a dog's, how the ears look, the tongue, the face, and so on; there can be many such features, and how similar these features are between two data points is what the KNN algorithm uses. And again, as I said, there is no training step involved. Those are the features of the KNN algorithm, so now let's look at some simple examples of how it works. As you can see on this slide, we have two classes of data: one is all the blue data points and the other is the orange data points. If you have a new data point, this pink one here, which class should it belong to, class A or class B? What you do is start calculating the distance of this pink data point from every orange square and blue triangle data point, and then you
decide which class it belongs to; to do that you first have to assume a particular K, say K equal to 3, meaning we look at the nearest three data points. In that case, if we draw a circle around the three nearest neighbors, two of them are of the orange square kind, so we predict that this new data point belongs to class A. If instead the K value were 7, as in the next example, you would see that 4 out of the 7 nearest neighbors are of the blue triangle kind, and we would classify the point as belonging to class B. So the prediction changes depending on the value of K, and the question becomes: what should the value of K be? Typically you run a trial-and-error search and pick the best K value, but what one needs to understand is that as K increases, the partition line becomes more and more linear: it becomes less flexible and starts to approximate a nearly linear dividing boundary. As you increase K, the bias increases but the variance reduces, and vice versa. We know that in machine learning problems bias and variance are the two things we are trying to manage: bias is how far your predictions are from the actual class or value, while variance is how much variability there is in your predictions. So as K increases the bias goes up and the variance goes down; at the other extreme, with K equal to 1, where you look at only the single nearest neighbor, the bias is the least and the decision boundary is the most flexible, but the variance is the maximum. That is the trade-off, and that is how we determine K: we search, by trial and error, for a K value that keeps both the bias and the variance acceptably low. Now, how do we calculate the distance itself? Distances can be of many kinds; the example here is the Euclidean distance, but there are others such as the Manhattan distance or the Mahalanobis distance, and you can look up references for more. The Euclidean distance between two points P1 = (x1, y1) and P2 = (x2, y2) is the square root of the sum of the squared differences of their coordinates, sqrt((x2 - x1)^2 + (y2 - y1)^2); this is just one kind of distance, and Manhattan and Mahalanobis distances are alternatives. Calculating a distance becomes quite challenging in cases where, for example, you are trying to measure how close two LinkedIn profiles are, or trying to classify the category of an electrocardiogram, and there we have to bring in more creativity in deciding what kind of distance to use; a small sketch of the two simplest distance measures follows below.
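As a generic illustration of the distance measures just mentioned (not course code), the snippet below computes the Euclidean and Manhattan distances between two points:

```python
import numpy as np

def euclidean_distance(p1, p2):
    """Straight-line distance: sqrt of the sum of squared coordinate differences."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    return np.sqrt(np.sum((p1 - p2) ** 2))

def manhattan_distance(p1, p2):
    """City-block distance: sum of absolute coordinate differences."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    return np.sum(np.abs(p1 - p2))

print(euclidean_distance((1, 2), (4, 6)))   # 5.0
print(manhattan_distance((1, 2), (4, 6)))   # 7.0
```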
Now let's talk about some use cases where KNN can be used, an example being book recommendation: if you have purchased books on Amazon or elsewhere, some of those recommendations are based on the KNN algorithm, and since KNN works on features, the nearest neighbors of a particular book might be determined by features such as who the author is, what the topic is, and so on. There are other use cases as well: classifying satellite images, classifying handwritten digits in image analytics, classifying electrocardiograms, and so on; these are all places where KNN can typically be used. Now let's get into some hands-on. To start the hands-on session I'll go to the Jupyter notebook I already have on my system, with some code written; we will take two examples, both based on data sets that are openly available, so you can easily get access to the data. We start by importing the necessary libraries: pandas, seaborn, numpy, and matplotlib. Pandas and numpy are there for data manipulation, for storing data as matrices or arrays, and for performing mathematical operations on them, while seaborn and matplotlib are used for plotting; the get_ipython line simply tells Jupyter to render the images inline in the notebook instead of opening a new window. We run this cell to import all the packages we are going to use, and then we import the breast cancer data that is available in the scikit-learn data sets and load it into a variable called cancer. This object is a dictionary-like bunch of attributes: data; target, which is nothing but whether the tumour is malignant or benign (malignant means it is cancerous, benign means it is just a tumour and not cancerous); target names; a description; and feature names, the features that tell us whether a particular case is malignant or benign. Printing the description, we see there are 569 data points with about 30 attributes such as radius, texture, perimeter, and so on, along with their max and min values. After looking at some of the feature names (radius, texture, and so on), we set up a pandas data frame of this data: 569 rows, all of which are feature values. The target variable tells us whether a data point is malignant or benign, with 0 being cancerous and 1 being non-cancerous, and we convert the target into a data frame as well. Looking at a couple of example rows, we see one row of data with all its feature values. Next, we use the StandardScaler from scikit-learn preprocessing to standardize the variables, and we initialize this StandardScaler into a variable called scaler.
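A minimal sketch of the loading and scaling steps just described; the frame and column names (df_features, df_scaled, is_benign) are illustrative choices, not necessarily the instructor's:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
print(cancer["DESCR"][:500])          # 569 samples, 30 numeric features
print(cancer["feature_names"][:5])

# Features into a DataFrame; target: 0 = malignant, 1 = benign
df_features = pd.DataFrame(cancer["data"], columns=cancer["feature_names"])
df_target = pd.DataFrame(cancer["target"], columns=["is_benign"])

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
scaled = scaler.fit_transform(df_features)
df_scaled = pd.DataFrame(scaled, columns=cancer["feature_names"])
print(df_scaled.head())
```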
Standardizing is nothing but bringing all the features onto a comparable scale, because the absolute values can cause issues for the prediction: one feature, say a temperature, might range from 0 to 100 while another, say a price, might range from 1,000 to 100,000. So we rescale everything to a common scale, centred on a mean of zero with unit variance, so that the samples can be compared fairly. First we fit the standardization on the data set we have, which calculates the means and variances, and then we apply it to transform the actual values. Looking at the scaled values, the top five rows show that the features are now standardized and roughly centred around zero. Then we divide the data into train and test sets: we will train the model and test it on a separate data set. If you use the same random state you should get the same result, otherwise you may get a different one. We keep the test size at 30 percent, which means the data set is split into two parts, a train part with 70 percent of the data and a test part with 30 percent, again using train_test_split from the scikit-learn package. This gives us the X's and the y's, where the X's are the predictors and y is the variable being predicted, whether the tumour is cancerous or not. Now we import the KNeighborsClassifier, the actual algorithm, from the scikit-learn package, initialize it, and fit it on the data; among the parameters you can see things like the leaf size and, most importantly, the number of neighbors, and we are taking K equal to 1 for now; we will look at the results with that, then change it and see how the results vary. We run the prediction and evaluate the results: we import the classification report and the confusion matrix, which show whether we were able to correctly classify the cancerous samples as cancerous and the non-cancerous as non-cancerous. Comparing actuals against predictions, a handful of data points (five plus four) are classified wrongly and all the others are classified correctly. Looking at the calculated metrics: precision, which tells us how many of the samples predicted as cancerous actually were cancerous, is quite high at about 94 to 95 percent; recall, which tells us how many of all the truly cancerous samples we managed to catch, is again about 94 to 95 percent; and the F1 score, which is a combination of precision and recall, is quite good as well. So with K equal to 1 we are already getting good results.
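Here is a minimal sketch of the split, the K = 1 classifier, and the trial-and-error loop over K that is described next; it assumes the df_scaled and df_target frames from the previous sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    df_scaled, df_target["is_benign"], test_size=0.30, random_state=0)

# KNN with a single neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

# Trial and error: error rate for K = 1..40, then pick the K with the lowest error
error_rate = []
for k in range(1, 41):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(np.mean(model.predict(X_test) != y_test))

plt.plot(range(1, 41), error_rate, marker="o")
plt.xlabel("K")
plt.ylabel("error rate")
plt.show()
print("best K:", int(np.argmin(error_rate)) + 1)
```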
Now let's see how to choose the K value. This is basically a bunch of code that runs K from 1 to 40 and checks the accuracy each time, a trial-and-error to see where we get the best result so that we can then use that K value. Looking at the plot after running K from 1 to 40, the error starts decreasing, and somewhere around K equal to 21 we get the minimum error, so for us the best K value is 21. Comparing the results between K equal to 1 and K equal to 21: with K equal to 1 we got about 94 to 95 percent accuracy, and with K equal to 21 the accuracy goes up to almost 99 percent; earlier we had nine misclassified data points out of all the test points, and now only two are misclassified. So that was one example of applying KNN, on the freely available breast cancer data set; now let's look at another example, the iris data set, which is also freely available. Iris is a type of flower, and the iris data set comprises 50 samples each of three species of iris: Iris setosa, Iris versicolor, and Iris virginica. We again start by importing the necessary libraries and look at what this iris data set is; running the cell initially gave an issue because the whole code had not been copied, and running it again shows the flowers: this one is Iris setosa, this one is Iris versicolor, and this one is Iris virginica. We now import the data set and use the seaborn package to plot some of the data, that is, do some exploratory data analysis. Looking at the top five rows, the data consists of the sepal length, sepal width, petal length, and petal width of each sample: the petal is the coloured part of the flower and the sepal is the green part. So for each sample of setosa, versicolor, or virginica we have these four measurements, and we will see whether we can use KNN to classify the data points into these categories of flowers. For some quick exploratory data analysis we run a pair plot on the data set (while it renders, I'll copy the next part of the code). In the resulting plot the green points are the setosa flowers, and the pair plot draws each pair of measurements, such as sepal length against petal length, for setosa versus versicolor and virginica; we can see that the setosa points are quite clearly separable from the others. A small sketch of this kind of pair plot is shown below.
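A generic sketch of that exploratory pair plot, using the iris data set that ships with seaborn (the column names follow seaborn's copy of the data, which may differ slightly from the notebook shown in the video):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")   # sepal_length, sepal_width, petal_length, petal_width, species
print(iris.head())

# Pairwise scatter plots, coloured by species; setosa separates cleanly from the others
sns.pairplot(iris, hue="species")
plt.show()
```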
Let's see whether we can actually classify these flowers using KNN. Here we also run a kernel density estimation on the setosa flower to check what kind of distribution it has: this is the KDE plot using the seaborn package, only for the setosa samples, plotting sepal length against sepal width, and we see that the density is concentrated around one region, with a roughly linear relationship between the two measurements. Now we do the same standardization of the variables that we did in the cancer data set case: we import the StandardScaler from scikit-learn preprocessing, initialize it, and standardize everything except the species column, which is a categorical value (the kind of flower), so we remove that column and standardize the rest of the data. We convert the result into a pandas data frame, and looking at the top five rows the values are now transformed onto a standardized scale. Then we divide the data again into train and test sets, with about 70 percent training data and 30 percent test data, and we use KNN to classify the flowers, again starting with K equal to 1, checking the results, and then doing a trial and error to find the best value of K. With K equal to 1, predicting on the test data and looking at the classification report and the confusion matrix, we already get quite a good prediction, with just two misclassified points and an accuracy of around 96 percent. To choose the best value of K we again run K from 1 to 40 and plot the errors; the error decreases and then increases, and it is minimum at around K equal to 3, or even 5, or maybe 11.
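A compact sketch of that iris workflow end to end, covering scaling, splitting, fitting KNN, and comparing a couple of K values; again this is a generic illustration built on seaborn's copy of the data rather than the exact notebook:

```python
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

iris = sns.load_dataset("iris")
X = iris.drop("species", axis=1)           # standardize everything except the label
y = iris["species"]
X_scaled = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=0)

for k in (1, 3):                            # K = 1 first, then a better K found by trial and error
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    pred = knn.predict(X_test)
    print("K =", k, "accuracy =", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```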
So let's choose one of these values, say K equal to 3, and see whether it improves the accuracy. We now see that even the two data points that were misclassified earlier are classified properly, so the accuracy improves to 100 percent, and that is an example of how you can choose K. We have covered the hands-on; as references there are three textbooks which I found quite good, some of them available online, and you can refer to them if you want to learn more about KNN; they are also good references for several of the other techniques. Now let us talk about Naive Bayes, and let us understand it with an example. Suppose I cannot figure out which are the best days to play football with my friends, and all the possible conditions are given to us: summer, monsoon, and winter, which is the outlook; sunny or not sunny, which relates to the humidity; and windy or not windy, which describes the winds. If I have noted down, for every day, whether it was good or bad to play football along with the combination of weather conditions on that day, that is perfect: we can build a Naive Bayes classifier from that. The Naive Bayes classifier comes from Bayes' theorem, and it rests entirely on the assumption of independence. What does independence mean? It means that one variable has no relationship, no association, with another variable. Now, if you think about this in a linear, correlation sense, in summer there are obviously more sunny days, in monsoon fewer, and in winter fewer still, so season and sunshine are in fact correlated. But even though such a relationship exists, Naive Bayes assumes that all these variables are independent of each other: whatever effect one condition has should not matter to the others; summer, monsoon, and winter carry their own weight and have nothing to do with the other conditions, and every condition is treated as equally significant. So what happens in Naive Bayes is that we estimate the posterior probability of every event. For example, for the sunny conditions we have a distribution of whether play happens in summer, in monsoon, and in winter; similarly for the windy conditions, and then for combinations, such as what happens to play when it is windy versus not windy. At the end, the outcome that gets selected is the one with a posterior probability greater than 0.5. So what do I mean by posterior probability?
The simple probabilistic classifier asks: what is the probability of event A happening given B? In our problem, what we are trying to figure out is the probability of play happening given that the outlook is sunny, the winds are calm, and there is no rain on a given day. So we calculate the posterior probability of play happening given these conditions, we also calculate the posterior probability of play not happening given the same conditions, and then we normalise these probabilities; mathematically, these are the calculations behind Naive Bayes. Here we are not talking about a single event but three: outlook, wind, and rain. So the calculation becomes a product of three independent factors: the probability of the outlook being sunny given that play happens, multiplied by the probability of no wind given that play happens, multiplied by the probability of no rain given that play happens, and all of that multiplied by the overall probability of play happening. That is what is done mathematically in the Naive Bayes application of Bayes' theorem: the posterior probability of event A given conditions B is obtained by taking the likelihood (for example, the probability of the observed weather given that play happened that day), multiplying it by the class prior (the total probability of play happening), and dividing by the probability of the predictor (for example, the total probability of sunny conditions). In the formula, X is the outlook and C is what we are trying to predict, and the terms are the posterior, the likelihood, the class probability, and the predictor's prior probability. How does this map to our table? For the likelihood piece of the sunny conditions: out of the six sunny days in the data, play happened on two, which gives 2/6, or 1/3; and the class probability is simply, out of all the days given to us, on how many days play happened. That is how the table is calculated. The table comes from a data set you can find online by searching for the golf play days data set, and it contains all the data needed for this example. What I will do is carry out this Naive Bayes classification with you in Python: instead of doing all these probabilistic calculations numerically with pen and paper, I will achieve the same thing in a very limited number of lines of code, and once that is done I will come back and explain this probability table. I am going to use some basic Python libraries for this.
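Before the Python walkthrough, it may help to pin down the formula just described in symbols; this is standard Bayes' rule plus the naive independence assumption, with the weather terms of this example used purely as labels:

```latex
% Bayes' theorem: posterior = likelihood x class prior / predictor probability
P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}

% Naive Bayes with several conditions assumed independent, e.g. outlook, wind, rain:
P(\text{play} \mid \text{sunny},\, \text{wind}=\text{no},\, \text{rain}=\text{no})
  \propto P(\text{sunny} \mid \text{play})\,
          P(\text{wind}=\text{no} \mid \text{play})\,
          P(\text{rain}=\text{no} \mid \text{play})\,
          P(\text{play})
```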
First, let me change my working directory and read the data set into a pandas data frame. Looking at this data set, it has 14 days, and for each day you have the outlook (overcast, rainy, or sunny), the temperature (hot, cool, or mild), the humidity, the wind, and whether play happened. The slides use only three variables, but in our hands-on I will use all four predictor variables. Before anything else, let me convert every column into a categorical variable, because right now not every column in the data frame is categorical, and then create a second data frame in which every column is replaced by its category codes so that we have numbers: wherever the conditions are sunny we now have one code and rainy conditions have another code, and similarly play becomes 1 when play happens and 0 when it does not. This data frame is available online, so you can get it yourself. Now that the data frame is ready, I will divide it into training and testing sets: there are 14 records, so let's take 10 records for training as input and keep the last four records for testing. Next I need to create my X and my y: y_train is the play column taken from the training frame, since play is what we want to predict, and X_train is the training data frame with the play column dropped.
I do the same thing for my test data frame as well. Those who are new to Python may find this a little strange, but these are the only steps that need to be performed whenever you implement Naive Bayes, or indeed any algorithm of this kind. So now we have the training data frame in which the play variable is not present, and the y containing only the play variable; both have 10 records with matching indices, and similarly the test frames have four records each with matching indices, so we know which record is which. Now comes multinomial Naive Bayes, and it takes just about three lines of code: first I initialize the model, then I fit my data by calling fit with X_train and y_train, and the model object is ready; after that you can simply get the classification outcomes. If you look at our test data frame, we have four records, all with sunny outlook, high temperature, low humidity, and wind, and their actual outcomes are that on the first two days play did not happen and on the next two days it did. To get the model's prediction I simply write y_out = model.predict(X_test), which gives the prediction for all four inputs; comparing them with the actual outcomes, out of four records we are predicting three correctly. If you want to check the accuracy of your model, you can simply print it for both training and testing: the training accuracy is model.score(X_train, y_train), and the testing accuracy is computed the same way on the test data; here we get 80 percent accuracy on training and 75 percent on testing. That is the advantage of doing this activity in Python, and with that we have successfully implemented the Naive Bayes classifier in the Python programming language. But what is happening in the back end? In terms of Bayes' theorem, all the tabulated frequency tables are calculated from the data set, and once the frequency tables are ready they are substituted into our formula to calculate the probabilistic scores. For example, what is the probability of summer given that it was a playing day? In total, play happened on nine days, which becomes our denominator, and the number of those days that fell in summer is our numerator, which gives a probability of about 0.33. Then we calculate the class probability, which is the overall probability of play happening, nine out of the 14 days, about 0.64, and the predictor probability, how many of the 14 days were summer, which is five out of 14. Put everything into the equation, and that is the score we get; we do this for each and every condition.
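Stepping back to the code for a moment, here is a minimal, self-contained sketch of the workflow implemented above; since the exact golf play days file is not reproduced here, a small hypothetical 14-day table with the same kinds of columns stands in for it:

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the 14-day "golf play days" data set described above
data = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                 "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temp":     ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                 "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity": ["high", "high", "high", "high", "normal", "normal", "normal",
                 "high", "normal", "normal", "normal", "high", "normal", "high"],
    "windy":    ["no", "yes", "no", "no", "no", "yes", "yes",
                 "no", "no", "no", "yes", "yes", "no", "yes"],
    "play":     ["no", "no", "yes", "yes", "yes", "no", "yes",
                 "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

# Convert every column to category codes so the model sees numbers
coded = data.apply(lambda col: col.astype("category").cat.codes)

# First 10 rows for training, last 4 for testing
train, test = coded.iloc[:10], coded.iloc[10:]
X_train, y_train = train.drop("play", axis=1), train["play"]
X_test, y_test = test.drop("play", axis=1), test["play"]

model = MultinomialNB()
model.fit(X_train, y_train)
print("predictions:", model.predict(X_test))
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```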
Coming back to the hand calculation, here we computed the same thing for winter as well, and once we have done it for all three conditions, winter, sunny, and windy, we substitute those values into the formula; that gives us a probability of more than 0.5, so we can say that for that combination of conditions play can happen. Look at another example: if a single card is drawn from a standard deck of playing cards, the probability that the card is a king is 4/52, since there are four kings in a standard deck; "this card is a king" is the event, and its prior probability is 1/13. If evidence is provided, for instance someone looks at the card and tells us the single card is a face card, then the posterior probability can be calculated using Bayes' theorem: since every king is also a face card, the probability of it being a face card given that it is a king is 1, and since there are three face cards in each suit (jack, queen, and king), the probability of a face card is 3/13. Combining these, Bayes' theorem gives the posterior probability that the card is a king as (1 x 1/13) / (3/13) = 1/3. Now, as you have seen already, the support vector machine comes under supervised machine learning, and we use it specifically for the task of classification. A support vector machine is a discriminative classifier formally defined by a separating hyperplane: it represents the examples as points in a space, mapped so that the points of the different categories are separated by a gap that is as wide as possible. Suppose I have some data points marked X and some marked as circles, so in this data there are two classes, X and circle. Given this kind of binary classification problem, the expectation in a support vector machine is that I draw a hyperplane that separates these two classes as much as possible, with a gap as wide as possible; that is the intuition behind the support vector machine. Now that you have the intuition, let's understand how the SVM works. Here is one more example: I have one set of points in green and another set in red, belonging to two different classes. What I am going to do is draw a hyperplane that separates these two classes as much as possible, and while drawing it I make sure the hyperplane is equidistant from my support vectors. The support vectors are simply the points that lie closest to the hyperplane; in this example, the data points nearest to the hyperplane I have drawn are the support vectors. If I use this SVM model, it is going to draw this kind of hyperplane
to make sure that it separates the two classes, here red and green, as much as possible while staying equidistant from my support vectors, the support vectors being nothing but the points nearest to the hyperplane; that is how the separation between two classes is achieved in a support vector machine. In this example the hyperplane I have drawn is a simple linear hyperplane, essentially a straight line separating the two classes of data points. Apart from a straight line, there are other kinds of boundaries we can draw, and the types of hyperplane we can draw are referred to as SVM kernels. The example we have just seen uses a linear SVM kernel; the other important kernels are the radial basis function (RBF) kernel and the polynomial kernel. With a linear kernel I draw a hyperplane that is a straight line; with a polynomial kernel the hyperplane is based on a polynomial function of the variables, with a degree that I choose; and with the RBF kernel I use radial basis functions to separate the data points. These three are the important kernels in SVM, and this is one of the commonly asked interview questions on the topic of support vector machines. Now let's look at some of the use cases where SVM can be applied: face detection, text and hypertext categorization, image classification, bioinformatics (for example remote homology detection), handwriting recognition, and generalized predictive control; in general, wherever we are dealing with a classification task we can use an SVM model. Now that we have a theoretical understanding of what SVM is, let's have a quick walkthrough of how to implement it. These are the common steps we are going to follow: we load the data, we explore it, and then we split it into two parts, because training uses the training data and once training is complete we check how well the model has learned using the test data; after the split we train the SVM model, and finally we evaluate the model and observe how it performs. That is the overview of the implementation, so let's create the notebook in Google Colab and see it in action; this is the notebook I have already prepared, and I'll give you a walkthrough as we proceed.
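As a quick illustration of the three kernel options just mentioned (a generic sketch, not the course notebook), this is how they are selected when instantiating scikit-learn's SVC:

```python
from sklearn.svm import SVC

linear_svm = SVC(kernel="linear")          # straight-line (hyperplane) boundary
poly_svm   = SVC(kernel="poly", degree=3)  # polynomial boundary of the chosen degree
rbf_svm    = SVC(kernel="rbf")             # radial basis function boundary
```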
So let's work it out: let's create the notebook in Google Colab and see how we can implement this SVM in action. I'll come back to my Google Colab — this is the notebook I have already prepared, and I'll give you a walkthrough as we proceed. In my first cell I'm importing the NumPy library, the pandas library, and, for creating plots, the matplotlib library. If you are comfortable with seaborn you can use the seaborn library as well; in my example I'm just using matplotlib, because we are not so interested in elaborate visualization — we want to understand how the model works. I execute this cell and it takes care of the necessary imports. Once that is done, I import the SVM model, which is available inside the scikit-learn library: from sklearn.svm I import SVC, where SVC stands for support vector classification. When I instantiate SVC I can mention which kernel I want to use, and if I'm working with a polynomial kernel I can also mention the degree of the polynomial to use while fitting my dataset. Along with that I'm importing the datasets module — scikit-learn ships with small toy datasets that help in the learning journey, and we are going to use one of them, the famous Iris dataset, which is normally used for multi-class classification. I load the Iris dataset and extract only two features, petal length and petal width, because I don't want to complicate things — I just want to be able to visualize the data — and I assign the Iris target to my variable y. Then I check whether each sample is Setosa or Versicolor; in other words, I take this dataset, which by default is a multi-class classification problem, and convert it into a binary classification task. You'll get a better understanding once I execute the next cell: this prepares my dataset, and once it is prepared I create a scatter plot just to show what the dataset looks like. On the x-axis I have petal length, on the y-axis I have petal width, and the blue points refer to class 0 while the orange points refer to class 1.
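Here is a compact sketch of what those first cells might look like — a hedged reconstruction rather than the exact notebook; in particular, the Setosa-vs-Versicolor relabelling below is an assumption based on the description above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from sklearn import datasets

# Load Iris and keep only two features so the data is easy to visualize.
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]   # petal length, petal width
y = iris["target"]

# Turn the multi-class problem into a binary one (Setosa vs. Versicolor).
setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]

# Scatter plot: blue points are class 0, orange points are class 1.
plt.scatter(X[y == 0, 0], X[y == 0, 1], label="class 0")
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="class 1")
plt.xlabel("petal length"); plt.ylabel("petal width"); plt.legend()
plt.show()
```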
So this is how my dataset looks: you can clearly see that I have one set of points in one region and another set of points in another region, which makes it a classic example for understanding how an SVM draws a hyperplane. We have the dataset ready, and as I mentioned already, to fit a support vector machine I'm going to draw a line that is equidistant from my support vectors. In this example, this point is a support vector because it is the nearest point to the line on one side, and this other point is the nearest one to the SVM line on the other side; I'll place the hyperplane so that it is equidistant from those support vectors — that is how the separating line gets drawn. So we now have an intuition; let's see whether we get the outcome we are expecting. Here I initialize my model: I create an SVC with the kernel set to linear, because we just saw that a straight line can separate this data effectively, and I set C to infinity, which makes it a hard-margin classifier. A hard-margin classifier means I don't want any margin violations — I want a line that cleanly separates the two classes — so setting C to infinity gives me that hard classifier. Once the model is initialized, I perform the fit on my dataset. This is the common flow we follow: initialize the model, then fit it on the data. Since the SVM is a supervised machine learning model, I have to specify both my input X and my output y, hence svm_classifier.fit(X, y). I execute this, the fit is performed, and the output confirms the parameters that were used. Once the fit is found, I can display the weight terms with svm_classifier.coef_, and the bias or intercept with svm_classifier.intercept_, which comes out to about −3.78. That means the line the model has drawn has a bias term of about −3.78, and the two weights are about 1.29 and 0.82 respectively — that is what the fit found for our data.
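A minimal sketch of this fit step, assuming the X and y prepared above (the printed numbers will of course depend on the data):

```python
from sklearn.svm import SVC

# C = infinity gives a "hard margin" classifier: no margin violations allowed.
svm_classifier = SVC(kernel="linear", C=float("inf"))
svm_classifier.fit(X, y)

print(svm_classifier.coef_)       # the weight terms
print(svm_classifier.intercept_)  # the bias / intercept term
```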
Next, to get a better visualization, I have created a function called plot_svc_decision_boundary, which takes the trained SVM model along with an x-min and an x-max. Inside it, w and b are extracted from the coef_ and intercept_ attributes of the model. To draw the decision boundary I need a set of points, so I take x0 = np.linspace(xmin, xmax, 200), and I specify what the decision boundary should look like: it is given by w0·x0 + w1·x1 + b = 0. I know x0, I know w0, I have w1 and I have b, so the only term I do not have is x1 — and rearranging gives x1 = −(w0/w1)·x0 − b/w1. Substituting that expression gives me the pairs of points along the boundary. Along with this there is a property of SVMs: the margin — the distance between the hyperplane and a support vector — is given by 1/w1. Hence I compute a gutter up and a gutter down: gutter up, the line on which one support vector lies, is the decision boundary plus the margin, and gutter down, the line below the hyperplane on which the other support vector lies, is the decision boundary minus the margin. Then I mark where the support vectors are: I can access their coordinates through the support_vectors_ attribute of the trained model, and I highlight them with a simple scatter plot, along with the decision boundary and the gutter-up and gutter-down lines. I execute this, which defines the function, and then I call it for my range of x from 0 to 5.5. What we have just created is the hyperplane — the solid middle line you see — and the two highlighted points are the support vectors, while the dotted lines are the gutter up and gutter down we just computed. Let me add some labels so the plot is clearer: I label the hyperplane, label the support vectors, and call plt.legend(), and now the plot clearly shows which line is the hyperplane and which points are the support vectors. So this is the intuition behind support vector machines: we draw a hyperplane that separates the points we have, and whichever points are nearest to the hyperplane are called the support vectors. To access them we use that attribute — svm_classifier.support_vectors_ tells me exactly where they are: one support vector is at roughly (1.9, 0.4) and the other is at about (3.0, 1.1). Using all these attributes we have been able to create this visualization.
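For reference, here is a sketch of what such a plotting helper might look like — a reconstruction from the description above, so treat the function body as an assumption; coef_, intercept_ and support_vectors_ are real scikit-learn attributes:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_svc_decision_boundary(svm_clf, xmin, xmax):
    # Extract the weights and bias from the trained model.
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]

    # Decision boundary: w0*x0 + w1*x1 + b = 0  =>  x1 = -(w0/w1)*x0 - b/w1
    x0 = np.linspace(xmin, xmax, 200)
    decision_boundary = -w[0] / w[1] * x0 - b / w[1]

    # The margin (distance from the hyperplane to a support vector) is 1/w1.
    margin = 1 / w[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin

    # Highlight the support vectors and draw the boundary plus the two gutters.
    svs = svm_clf.support_vectors_
    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA',
                label="support vectors")
    plt.plot(x0, decision_boundary, "k-", linewidth=2, label="hyperplane")
    plt.plot(x0, gutter_up, "k--", linewidth=2)
    plt.plot(x0, gutter_down, "k--", linewidth=2)

# Example call, as in the walkthrough:
# plot_svc_decision_boundary(svm_classifier, 0, 5.5)
# plt.scatter(X[:, 0], X[:, 1], c=y); plt.legend(); plt.show()
```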
Now, whenever we are working with support vector machines it is very important that we scale the data first — if I do not scale the data, I will not get a good fit from my SVM model. Here I've given one more example where the features of X are on completely different scales; you can clearly see it is not scaled. Executing this cell gives a visualization of how the fit looks in the scaled versus unscaled case: when the data is unscaled, the hyperplane and the margins around it end up squashed very close together, and it becomes difficult to separate the two classes properly, but if I scale the data correctly — here I have used a StandardScaler — the fit becomes much easier and the model works noticeably better. So that covers using the linear SVM model to fit our dataset. If I go further down, we also have some examples of non-linear classifiers. To try this out I'm generating an example dataset called make_moons, created with the scikit-learn dataset generator, and you can clearly see that I cannot use a linear classifier here — a linear classifier being an SVM model where I draw a straight line to split the data points — because wherever I try to draw a straight line, I cannot split this data effectively. That brings us to the challenge: if I have a dataset like this, which is not linearly separable, how do we go about fitting our SVM model? To save us, we can take the SVM model and give it a polynomial kernel — or, equivalently here, add polynomial features so that the classifier can draw a curved boundary like this. To show you how it works, I take this data and use a pipeline that takes care of the standard scaler as well as the classifier. Coming down to the part we are interested in: I import PolynomialFeatures and generate polynomial features for my data — that is, I fit and transform the data, modifying my existing X — and then I scale it with the StandardScaler and send it into my SVM classifier, combining it all together. Observe what happens: with the polynomial features applied, I have increased the degrees my model can learn, so instead of only a straight line, the model now has the ability to learn this more complex representation, because I have increased the model complexity by adding polynomial features. And to make sure we follow a clean path while doing it, I have defined this as a scikit-learn pipeline — there is a short sketch of this setup just below.
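Here is a hedged sketch of that non-linear setup on make_moons; the exact pipeline steps and hyperparameters in the notebook are not shown, so the degree, C and loss values below are placeholders:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# A dataset that a straight line cannot separate.
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Chain the three steps described above: add polynomial features,
# scale them, then fit a linear SVM classifier on the expanded data.
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)
```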
If you are new to data science and machine learning, I highly recommend you learn this concept of the scikit-learn pipeline. A pipeline helps us chain multiple operations together into a single object: here we have created a pipeline that adds polynomial features to my input data, performs scaling on top of that, and then performs the binary classification using the SVM, and finally I perform the fit on my dataset. When I call fit, the pipeline takes my input X, chains all of these activities together, and then performs the fit against my target y. Once the fit is complete, we can validate how the model is performing. [Music] So what is the clustering technique? Clustering is something we use for grouping, and there is a very easy way to understand it. You would have seen this while we were going through the COVID-19 situation: governments came up with containment zones. On what criteria did the government decide which area is supposed to be a containment zone, which areas should have restrictions applied, and which areas can be considered normal? That is where the clustering technique comes in. When I say government, I mean the people who take the final decision in such situations — the prime minister or the chief ministers of a particular state — deciding whether or not to go for a lockdown, or which areas are containment zones and which are not. Those high-level decisions are taken based on the clustering output generated by these algorithms from a number of inputs: the population of a particular area, how many people are affected, how many hospitals are present, how many people have recovered, and so on. Based on these multiple criteria, a clustering technique is applied on top of the data, and according to that, people or areas are segregated — these observations belong to one cluster, those observations belong to another — so that organizations can take decisions at a very high level. That is what the clustering technique is about: the clustering output contains the different groups; the algorithm itself groups the components based on however many clusters you want to generate, and it may not give a direct answer — on top of the generated output, people take the business-related decisions. Now, within clustering techniques, what different types do we have? There are multiple types depending on the kind of output you want to produce, but let's first understand the most famous and most widely used one, which is the k-means clustering algorithm.
More than 90 percent of people end up using the k-means clustering algorithm — it is by far the most popular clustering technique. So how does this technique work? By default, K is something that you, as the user, are supposed to provide as an input. There are multiple techniques in machine learning with a K in the name — k-means clustering, k-nearest neighbours, k-fold cross-validation — and wherever you see that notation, K is an input you are supposed to provide to your algorithm; always remember this. The algorithm cannot identify the K value on its own, although everything else is taken care of by the algorithm. In k-means clustering, K means the number of clusters you want to generate. When you have a thousand observations, a thousand records in your historical data, how many clusters do you want? One cluster obviously means the entire dataset is treated as a single group; you could instead create two clusters out of the data, or three, four, five, or ten. How is this done? There are several steps involved in the k-means clustering algorithm: choose the number of clusters — that is the first step, deciding your K; then initialize the centroids, which happens as a one-time activity and serves as the starting point for the machine; then assign the clusters, move the centroids, optimize, and finally converge to the final clusters. I know it can be difficult to understand just by looking at this list, so let me show you a very simple example — let me take a simple diagram to show how it works. Let's say I take some historical data just to explain how the k-means clustering algorithm works; assume these are our data points. What is written in the first step? Choose the number of clusters — to keep things simple, let's say we want two clusters generated. Because the number of clusters I want is two, two centroids are placed on the data. So the first step is to choose the number of clusters, and initializing those centroids is the second step.
Now what is the next step? Let's say this is observation number one, our first data point. We take the distance from every observation to every centroid. This first observation is closer to the red centroid, so the algorithm assigns it to the red cluster. Likewise, for the second observation we measure the distance to both centroids; the green centroid is closer, so we mark it as green. And what if the distances are exactly equal? Then the algorithm simply forces the observation into one of the centroids. The number of centroids, remember, is chosen based on the K value you provided. You repeat the same process for every observation, marking each one with whichever centroid it is closest to, so that initially every observation is marked either green or red — these observations are now considered green observations, and those are considered red observations. That is step one of the iteration. Step two: take all the red observations and compute the average of their x and y coordinates, and repeat the same averaging of x and y coordinates for the green observations. You end up with new centroid positions — in the initial step we took random centroids, but now we have centroids computed from the previous iteration. Then you repeat the process: take the distance from every observation to every centroid, assign each observation based on the nearest distance, mark it red or green accordingly, and continue until you see no change in the clusters. In every step the centroids might keep moving — first here, then there — but eventually they settle somewhere, and if you repeat the process nothing changes any more. At that stage, whatever observations are grouped as green form one cluster and whatever observations are marked red form the other, and that is the clustering output the algorithm generates. A small sketch of one assign-and-update pass follows below.
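To make that assign-and-update loop concrete, here is a minimal NumPy sketch of the k-means iteration — purely illustrative, with made-up data and K = 2; in practice you would use scikit-learn's KMeans, as in the hands-on later:

```python
import numpy as np

# Toy 2-D data and two randomly initialized centroids (K = 2).
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [6.0, 9.0], [1.2, 0.5], [7.0, 8.5]])
centroids = X[np.random.choice(len(X), size=2, replace=False)]

for _ in range(10):  # repeat until the centroids stop moving (capped here for brevity)
    # Step 1: assign each observation to its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 2: move each centroid to the average of the points assigned to it.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # converged: no further change in the clusters
    centroids = new_centroids

print(labels)
print(centroids)
```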
So, to summarize: you choose some K value, the centroids are initialized, each observation is assigned based on the nearest distance, the centroids are recomputed accordingly, and you repeat the process until the centroids no longer move; then you select the final cluster output. Likewise, there are other types of clustering techniques. The second type we have is fuzzy c-means clustering. In fuzzy c-means the output looks similar, but there are cases where one observation can belong to more than one cluster. The primary difference between c-means and k-means is this: if an observation is at an equal distance from two centroids, k-means will force that observation to be part of one cluster, whereas in c-means, based on the distances, there is a chance that an observation can belong to two clusters at once. It purely depends on the business use case — and the business outcome — whether people decide to go for k-means or fuzzy c-means, since there may be observations that can belong to one or more clusters. The third type is agglomerative clustering, which we also call hierarchical clustering. These techniques are built hierarchically: based on mathematical expressions describing how the data points are grouped, the algorithm builds a dendrogram, and on top of that dendrogram your observations can be grouped, as you can see on the screen. One thing you need to understand about unsupervised learning algorithms: as I said, you may not have clarity on the data — the assumption is that, at the very least, you don't fully understand it. When you don't have clarity on the data, how can you say whether a point is an outlier or not? When you have clarity on the data, especially when you are working with supervised learning algorithms, you can; but here you don't even have an output column, so how would you evaluate whether a point has been classified or clustered properly against the historical data? Those concerns are primarily constraints when you are working with supervised learning algorithms. Of course, if outliers are present, they will simply end up assigned to one of the clusters; you shouldn't blindly remove them, because you might be killing actual original values in the data, but you also cannot expect that outliers will be recognized in every case in unsupervised learning.
The next type of clustering technique we have is divisive clustering. Divisive clustering also arranges the data in the form of a dendrogram, but the difference is that it starts with all data points in one cluster and splits the root into child clusters, and it stops when single-item clusters are created — that is, when every cluster contains just one observation. There is one more technique for building a clustering solution, which we call mean shift clustering: we take the mean of every candidate cluster, shift it towards the region with the highest density of points, and keep repeating the process to see which observations' means end up together; according to that, we identify which observation belongs to which particular cluster. So these are the different types of clustering techniques we can apply to cluster data as part of unsupervised learning. Now let's jump into a small hands-on: let's take a simple Python example in a Jupyter notebook, using scikit-learn, and see how we build a clustering technique on top of some data. Let me show you the dataset I'm going to use: I'm taking movie metadata information. This dataset has a number of observations — you can see lots of movies, such as Avatar, Pirates of the Caribbean, Spectre, The Dark Knight, Star Wars, John Carter, Spider-Man 3, Tangled, and so on — and for every movie we have a lot of information: who the director is, who the actors are, the director's Facebook likes, the actors' Facebook likes, the gross of the movie, the number of reviews, the IMDb rating, the runtime, the budget, and more. Rather than building the clustering on every column, what I'll do is take two columns: the number of director Facebook likes and the number of actor Facebook likes. I read the dataset using the pandas library — pd.read_csv — and I can see the observations that are present, including the actor Facebook likes and director Facebook likes columns. If I want, I can select the director Facebook likes column alone and see, for every movie, how many Facebook likes the director has.
If a director has a large number of Facebook likes, it means the director is a famous person; likewise, if an actor has a large number of Facebook likes, that actor is very famous — that is the intuition behind these two features. So I extract these independent components: I take all the records of director Facebook likes versus actor Facebook likes and form an object called new_data by applying iloc as a filter — iloc stands for index location, and it lets me filter out exactly the columns I want. Now the director Facebook likes and actor Facebook likes for every movie are loaded into new_data. Next I import KMeans from sklearn.cluster: this algorithm is available as part of the scikit-learn library, the Python library that has most of the standard algorithms implemented. If you are familiar with object-oriented programming in Python you will be able to relate to this: k-means clustering is implemented as a class, and I create an object from it by providing the input n_clusters=5 — meaning I want to build five clusters out of this data. Once the object is created, I call its fit method with new_data as my input; fit kicks off the process that actually builds the clustering with five clusters. Now that the model has been generated, I can extract the final centroids from it — initially random centroids are taken, but at the end you get the final centroid positions, which are nothing but the centre point of every cluster — and I can also print the labels, which are the cluster numbers assigned to each of the roughly 5,000 movies. We can't read much from the raw label array, so I get the unique values in the labels along with their counts, and now you can see the data is clustered into five different clusters: cluster 0 has around 4,700 movies — most of the observations moved there — cluster 1 has 104 movies, cluster 2 has 11, cluster 3 has 87, and cluster 4 has 67. So that is how the data points have been distributed, and that is how the clustering technique divided the data into five clusters; a condensed sketch of these steps is given below.
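A condensed sketch of this hands-on, assuming the usual movie-metadata column names such as director_facebook_likes and actor_1_facebook_likes (the exact column positions used with iloc in the notebook aren't visible, so treat the selection below as illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Read the movie metadata and keep just the two columns we want to cluster on.
data = pd.read_csv("movie_metadata.csv")
new_data = data[["director_facebook_likes", "actor_1_facebook_likes"]].dropna()
# (dropna is a precaution so KMeans can run; it isn't mentioned in the walkthrough)

# Build five clusters out of the data.
kmeans = KMeans(n_clusters=5)
kmeans.fit(new_data)

print(kmeans.cluster_centers_)                   # final centroid positions
print(pd.Series(kmeans.labels_).value_counts())  # how many movies fell into each cluster
```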
Those generated labels are effectively my output column, so I create them as a new column, cluster, in new_data. Then I use a seaborn lmplot that plots director Facebook likes against actor Facebook likes with data=new_data, using the cluster column as the hue so that each point is coloured by its cluster, with the 'coolwarm' palette. You can see that pretty much every observation is now categorized into an individual cluster; this is the graphical representation we use to see how the movies have been segregated: these movies here are cluster 0, those over there are cluster 3, these are cluster 2, these are cluster 1, and these are cluster 4. And you can clearly see the story the k-means clustering came up with. Movies made by new directors with new actors fall into the biggest category — you will see the largest number of movies here, because new people keep entering the film industry and most movies are made with new directors and new actors. Then the other clusters separate out very clearly: famous actors making films with new directors, average-fame directors making films with new actors, very famous directors making films with new actors, and very famous directors making films with very famous actors — somebody making a film with Tom Cruise, say. If you did this activity manually it would take you quite a long time to segregate each and every component, but with the algorithm we were able to cluster it within five minutes — that is the beauty of algorithms: you provide your data, and based on the properties of the data the algorithm builds the clustering output automatically, which is very simple to do. That is how we can build a clustering technique on top of given data, and that is our hands-on example; a short sketch of the labelling and plotting step is below.
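The visualization step might look roughly like this — a sketch that assumes seaborn's lmplot and the hypothetical column names from the previous sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Attach the generated labels as a new output column.
new_data["cluster"] = kmeans.labels_

# Colour each movie by its cluster on the two-feature scatter.
sns.lmplot(x="director_facebook_likes", y="actor_1_facebook_likes",
           data=new_data, hue="cluster", palette="coolwarm", fit_reg=False)
plt.show()
```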
[Music] So there is a phenomenon called impulsive buying, and big retailers take advantage of machine learning and the Apriori algorithm to make sure we tend to buy more — that is what we call Market Basket Analysis. First, let's set the context: in today's world the goal of any organization is to increase revenue, and can this be done by pitching just one product at a time to the customer? The answer is clearly no. Hence organizations began mining data related to frequently bought items. Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items: they find associations between different items and products that can be sold together, which assists in the right product placement. Typically it figures out what products are being bought together, so organizations can place those products near each other. Let's understand this better with an example: say people who buy bread usually buy butter too; the marketing team at a retail store should target customers who buy bread and butter and provide them an offer so that they buy a third item — eggs, or some kind of jam. If a customer buys bread and butter and sees a discount or an offer on eggs, they will be encouraged to spend more and buy the eggs. That is what Market Basket Analysis is all about. And this is just a small example — imagine handing the data of ten thousand items from a supermarket to a data scientist; just imagine the number of insights we can get out of it. That is why association rule mining is also important to understand. Association rules can be thought of as an if-then relationship: suppose item A is bought by the customer; then the chance of item B being picked by the same customer under the same transaction ID is worked out. So we have a simple if-then relationship with two elements. First there is the antecedent — the "if" — which is an item or a group of items typically found in the itemsets or datasets; then there is the consequent — the "then" — which is an item that comes along with the antecedent, or with a group of antecedents. Now here comes a constraint: suppose we made a rule about one item; we still have around 999 other items to consider for rule making (and with 10,000 items, far more), and this is where the Apriori algorithm comes into play. But before we look at the algorithm itself, let's understand the mathematics involved. There are three ways to measure the association: support, confidence, and lift — the short snippet below shows how all three are computed. Support gives the fraction of transactions which contain items A and B; it basically tells us about the frequency of frequently bought items or combinations of items, so with it we can filter out the items that have a low frequency. Confidence tells us how often items A and B occur together, given the number of times A occurs. Typically, when we work with the Apriori algorithm we have to define these thresholds ourselves, and how exactly do we decide the values? To be honest, there is no single fixed way to define them. Suppose we have assigned the support value as 2.
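To make the three measures concrete, here is a small illustrative snippet; the five toy transactions are made up for the example:

```python
# Support, confidence and lift for a rule A -> B, computed on toy transactions.
transactions = [
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "eggs"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions containing all the given items."""
    return sum(items <= t for t in transactions) / n

A, B = {"bread"}, {"butter"}
supp_AB    = support(A | B)            # support of the rule A -> B
confidence = supp_AB / support(A)      # how often B appears when A does
lift       = confidence / support(B)   # strength over random co-occurrence

print(supp_AB, confidence, lift)       # 0.6, 0.75, 0.9375
```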
What exactly does that mean? It means that unless an item's frequency is at least 2 percent, we will not consider that item for the Apriori algorithm. This makes sense, because considering items that are bought very rarely is a waste of time: if only 2 percent of users buy an item along with the item they already bought, a rule built on it won't be of much use, so we don't even consider it for the Apriori run. Now suppose that after this filtering we still have around 5,000 items left; creating association rules for all of them is a practically impossible task for anyone, and this is where the concept of lift comes into play. Lift indicates the strength of a rule over the random occurrence of A and B — it tells us the strength of any given rule. To find it, we take the support of A and B together and divide it by the product of the individual supports of A and B; focus on the denominator — it is the product of the individual support values of A and B, not their joint support. So lift explains the strength of a rule: the higher the lift, the stronger the rule. Say for the if-then relationship "if A then B" the lift value is 4; that means if you buy A, the chance of buying B is four times what it would be by chance; if the lift value is 2, it is two times, and so on — that is how the lift value is read. So let's get started with the Apriori algorithm and see how it works. Consider these transactions at a local market: transaction 1 contains A, B, C; transaction 2 contains A, C, D; transaction 3 contains B, C, D; transaction 4 contains A, D, E; and transaction 5 contains B, C, E. If we write out some of the rules: if a customer buys A they may also purchase D; if they buy C they may go for A; if they buy A they may go for C; and if they buy B and C together they may buy A as well. We can then compute the measures for each rule. For A → D, the support is 2/5, the confidence is 2/3, and the lift is 10/9. For C → A — that is, the possibility that those who purchased C will also buy A — the support is 2/5, the confidence is 2/4, and the lift is 5/6, and in the same way we can work out A → C and B, C → A. That is how we see the probability of one product being bought given another. Now let's go further into the Apriori algorithm itself. Apriori uses frequent itemsets to generate the association rules, and it is based on the concept that any subset of a frequent itemset must itself be a frequent itemset, where a frequent itemset is an itemset whose support value is greater than a threshold value.
Now let's take a simple example. Say we have the following data from a store, with a transaction ID and the items bought: transaction 1 contains items 1, 3, 4; transaction 2 contains 2, 3, 5; transaction 3 contains 1, 2, 3, 5; transaction 4 contains 2, 5; and transaction 5 contains 1, 3, 5. For the first iteration, let's assume the minimum support value is 2; we create the itemsets of size 1 and calculate their support values. As you can see, item 4 has a support value of 1, which is less than the minimum support value, so we discard item 4 in the upcoming iterations, and that gives us the final table F1 — itemsets with a support value below the minimum support of 2 are eliminated. Next we create the itemsets of size 2 and calculate their support values: all the combinations of the itemsets in F1 are used in this iteration (this candidate set is C2), and itemsets having support less than 2 are eliminated again — in this case that is the itemset {1, 2}. Now let's understand what we mean by pruning and why it makes Apriori one of the best algorithms for finding frequent itemsets. For pruning, we divide the itemsets in C3 into their subsets and eliminate any itemset that has a subset with support less than 2 — that is, a subset not present in F2. So in the third iteration we discard {1, 2, 3} and {1, 2, 5}, because they both contain {1, 2}; this is the main highlight of the Apriori algorithm, since {1, 2} was already found to be infrequent. Then in the fourth iteration, again, if any subset of a candidate itemset is not in the previous frequent set, we remove that itemset; the remaining size-4 candidate has a support of less than 2, so we stop here, and the final frequent itemsets are the ones left in F3. Note that we have not calculated the confidence values yet — that is something we still have to do. With this we get the following itemsets: for {1, 3, 5} the subsets are {1, 3}, {1, 5}, {3, 5}, {1}, {3} and {5}, and for {2, 3, 5} the subsets are {2, 3}, {2, 5}, {3, 5}, {2}, {3} and {5}. Before we turn these into rules, the small snippet below cross-checks the frequent itemsets we just worked out by hand.
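As a quick cross-check of the hand-worked example, here is a hedged sketch using mlxtend, the same library used in the implementation later; with a minimum support of 2 out of 5 transactions it should surface itemsets such as {1, 3, 5} and {2, 3, 5}:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# The five store transactions from the worked example.
transactions = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5], [1, 3, 5]]

# One-hot encode the transactions, then run Apriori with min support 2/5.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=2/5, use_colnames=True)
print(frequent_itemsets)
```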
For every subset S of an itemset I, we identify the output rule: the rule is "S recommends I minus S", and its confidence is the support of I divided by the support of S; if that confidence meets the minimum confidence threshold, the rule is selected. Now let's create the rules and apply them on the itemset F3, assuming our minimum confidence value is 60 percent. Rule 1 uses the subset {1, 3}: it says that 1 and 3 together imply 5 — if a user is purchasing items 1 and 3, there is a good possibility they will go for the fifth item as well. Here the confidence is support{1, 3, 5} divided by support{1, 3}, which is 2/3, about 66.6 percent; that is above 60 percent, so rule 1 is selected. In rule 2 we have {1, 5} → {1, 3, 5} minus {1, 5}, meaning that if someone purchases 1 and 5 there is a good possibility of them going for 3 as well; the confidence is support{1, 3, 5} divided by support{1, 5}, which is 2/2, translating to 100 percent — much greater than 60 percent — so it is selected too. Similarly, rule 3 is {3, 5} → {1}: if someone purchases 3 and 5 they have a high chance of going for 1 as well; the confidence is support{1, 3, 5} divided by support{3, 5}, which again comes to about 66.66 percent, so rule 3 is selected as well. Now about applying the rules: whenever the confidence is above the 60 percent we defined, the rule is selected. Take rule 5, where those purchasing item 3 may go for both 1 and 5: here the confidence works out below 60 percent (support{1, 3, 5} over support{3} is 2/4, i.e. 50 percent), so it is rejected. The same goes for rule 6 — its confidence is also below 60 percent, so it is rejected — because we assumed a minimum confidence level of 60 percent, and wherever the confidence comes out lower than that, we simply reject that particular rule. So that is how we create rules in the Apriori algorithm; the same steps can be applied to the itemset {2, 3, 5}, and we can check which rules are accepted and which are rejected. Now let's see how we can implement the Apriori algorithm in Python. Let's fetch a sample dataset to work on. To do the actual analysis we'll use the PyCharm IDE — you can use any IDE, or even the plain Python command line. In the file we'll be working with, we have access to the invoice number, the stock code, the description, the quantity, the invoice date, the unit price, the customer ID, and the country — that is the file we are going to work on. In the meantime we are also going to install PyCharm as our main IDE; it is one of the most popular IDEs for Python, and it is offered as a free Community edition, so you can use the Community edition to get started. Once we have it installed, we can open it up.
Now we can get started. First we are going to work with the pandas and mlxtend libraries, importing them and reading the dataset. So we import pandas as pd, then from mlxtend.frequent_patterns we import apriori, and again from mlxtend.frequent_patterns we import association_rules. Then we create a data frame using pd.read_excel — if it were a CSV file we would use read_csv instead — and we pass the path to the file, which in this case sits in the user's Downloads folder under the name online_retail.xlsx. In case some libraries are missing, we can install them as and when required: in PyCharm, go to File, then Settings, then Project Interpreter; if you want to add a specific library, click the plus sign — you can see the whole list of currently installed libraries there and add more one by one. For example, if mlxtend is not installed, search for it and click Install Package, and it will be installed and stored on the system as a library; the same goes for any other package we need, since at the end we are going to use apriori from the mlxtend library we imported. Once the packages are installed we can come back out, and you will see no errors are shown for the imports; this is the path of the file where we stored online_retail.xlsx — you can confirm the file is available in the Downloads directory. Once we have specified the file, we use df.head() to see the first few entries of the dataset we have imported: it returns the first few rows as a response so we can confirm the data loaded correctly. A sketch of these setup cells is given below.
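A sketch of those setup cells — the file path is whatever your download location is; the one below is only a placeholder:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Read the online retail workbook (use pd.read_csv for a CSV file instead).
df = pd.read_excel("C:/Users/<your-user>/Downloads/online_retail.xlsx")

# Peek at the first few rows to confirm the data loaded correctly.
print(df.head())
```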
Now that we have access to the dataset, the next step is to focus on data cleaning, which includes removing spaces from some of the descriptions, dropping the rows that don't have invoice numbers, and removing the credit transactions — for example, there are rows with blank invoice numbers, and there are credit transactions we don't want to keep. So from the data frame we take the Description column and apply the string strip function to it, which removes the stray spaces. Then we drop the rows where the invoice number is missing: we drop along axis 0 with the subset set to the invoice number column and inplace set to true. Next we convert the invoice number column to a string type using astype. And then we filter the data frame to keep only the rows whose invoice number does not contain the letter "C", since those are the credit transactions, and we save the result back into our data frame. After we are done cleaning up, we need to consolidate the items into one transaction per row, with each product as a column, and for the sake of keeping the data small we are only going to look at the sales for one specific country. The cleanup steps look roughly like the sketch below.
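The cleanup described above, as a hedged sketch — column names like Description and InvoiceNo follow the usual online-retail dataset, so adjust them if yours differ:

```python
# Remove stray spaces from the product descriptions.
df['Description'] = df['Description'].str.strip()

# Drop rows without an invoice number, then make sure the column is a string.
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

# Remove the credit transactions (their invoice numbers contain a 'C').
df = df[~df['InvoiceNo'].str.contains('C')]
```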
so after we are done cleaning up we need to consolidate the items into one transaction per row with each product and for the sake of keeping the data set small we are only going to be looking at the sales for a single country so let's say we want to focus on France so here we can create a basket and in the basket we first take the data frame and segregate it based on the Country column where we define France as the country we want to use for consolidation and then we are going to group by the invoice number so here we use groupby on the invoice number and after the invoice number we add the description so we group by InvoiceNo and Description and then we take the Quantity column and on the quantity we apply sum let's put the rest on the next line just to avoid confusion so after sum we apply unstack and then reset_index and then we fill the missing values with fillna where we define the fill value to be a simple zero and then we use set_index where we define the invoice number as the index and save that as the basket so basically here we have defined the consolidation of the items into one transaction per row with each particular product and to keep the number of transactions low we focus our attention only on a single country which is France so after we are done with that part next we have to work on the encoding so basically there are a lot of zeros in the data but we also need to make sure that any positive value is converted to a one and anything less than or equal to zero is set to zero so for that we are going to go ahead and perform the encoding here we can define a function encode_units so if x is less than or equal to zero then we return zero and if x is greater than or equal to one then we return one and then we define the basket sets by applying applymap with encode_units on the basket and then we simply drop the POSTAGE column with inplace set to true here so basically we first remove all the anomalies and then since there are a lot of zeros in the data we make sure that any positive value is converted to one and anything less than or equal to zero stays at zero which is what we have done here and then we are going to generate the frequent item sets that have a support value of at least seven percent and this number is chosen so that we get close enough results and we don't get any errors thrown and then we can generate the rules with the corresponding support confidence and lift so for doing that we make use of frequent item sets where we call apriori on the basket sets with minimum support of 0.07 and use_colnames set to true and the rules are then defined through association_rules where we use the metric lift and the minimum threshold set to 1.
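put together, the consolidation, encoding and rule generation described here could look roughly like the sketch below; the column names and the POSTAGE drop are assumptions based on how the walkthrough describes the Online Retail data, and the lift and confidence cut-offs used at the end are the ones mentioned in the next step

# one transaction per row, one column per product, restricted to France
basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# encode quantities: anything positive becomes 1, anything zero or below becomes 0
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)  # assumption: drop the postage column as in the walkthrough

# frequent item sets with at least 7% support, then rules ranked by lift
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# keep only the strong rules (the filtering described next)
print(rules[(rules['lift'] >= 6) & (rules['confidence'] >= 0.8)])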
and then we are going to filter the data frame using standard pandas code for a large lift and a high confidence value of 0.8 and if we run this we'll be able to see the output directly in the console so here we simply work with the rules and when we run them we define the rules where the lift is greater than 6 and where the confidence value is greater than or equal to 0.8 altogether so that's how we can make use of the apriori algorithm here [Music] so let's look at some of the examples and let's define what reinforcement learning really is so guys reinforcement learning is a part of machine learning where an agent is put in an environment and it learns to behave in this environment by performing certain actions okay so it basically performs actions and it either gets rewards for those actions or it gets a punishment and observing the reward which it gets from those actions reinforcement learning is all about taking an appropriate action in order to maximize the reward in a particular situation so guys in supervised learning the training data comprises the input and the expected output and so the model is trained with the expected output itself but when it comes to reinforcement learning there is no expected output here the reinforcement agent decides what actions to take in order to perform a given task in the absence of a training data set it is bound to learn from its own experience all right so reinforcement learning is all about an agent that is put in an unknown environment and it is going to use a trial and error method in order to figure out the environment and then come up with an outcome okay now let's look at reinforcement learning with an analogy so consider a scenario wherein a baby is learning how to walk the scenario can go about in two ways now in the first case the baby starts walking and makes it to the candy here the candy is basically the reward it's going to get so since the candy is the end goal the baby is happy it's positive okay so the baby is happy and it gets rewarded a set of candies now another way in which this could go is that the baby starts walking but falls due to some hurdle in between the baby gets hurt and it doesn't get any candy and obviously the baby is sad so this is a negative reward okay or you can say this is a setback so just like how we humans learn from our mistakes by trial and error reinforcement learning is also similar okay so we have an agent which is basically the baby and a reward which is the candy over here okay and with many hurdles in between the agent is supposed to find the best possible path to reach the reward so guys I hope you all are clear with reinforcement learning now let's look at the reinforcement learning process so generally a reinforcement learning system has two main components all right the first is an agent and the second one is an environment now in the previous case we saw that the agent was a baby and the environment was a living room wherein the baby was crawling okay the environment is the setting that the agent is acting on and the agent over here represents the reinforcement learning algorithm so guys the reinforcement learning process starts when the environment sends a state to the agent and then the agent will take some actions based on the observations in turn the environment will send the next state and the respective reward back to the agent the agent will update its knowledge with the reward returned by the environment and it uses that to evaluate its
previous action so guys this Loop keeps continuing until the environment sends a terminal state which means that the agent has accomplished all his tasks and he finally gets the reward okay this is exactly what was depicted in this scenario so the agent keeps climbing up ladders until he reaches his reward to understand this better let's suppose that our agent is learning to play Counter-Strike okay so let's break it down now initially the RL agent which is basically the player player one let's say it's a player one who is trying to learn how to play the game okay he collects some state from the environment okay this could be the first date of Counter-Strike now based on this state the agent will take some action okay and this action can be anything that causes a result so if the player moves left or right it's also considered as an action okay so initially though action is going to be random because obviously the first time you pick up Counter-Strike you're not going to be a master at it so you're going to try with different actions and you're just going to pick up a random action in the beginning now the environment is going to give a new state so after clearing that the environment is now going to give a new state to the agent or to the player so maybe he's across stage one now he's in stage two so now the player will get a reward R1 from the environment because it cleared stage one so this reward can be anything it can be additional points or coins or anything like that okay so basically this Loop keeps going on until the player is dead or reaches the destination okay and it continuously outputs a sequence of States actions and rewards so guys this was a small example to show you how reinforcement learning process works so you start with an initial State and once a player clears that state he gets a reward after that the environment will give another stage to the player and after it clears that state it's going to get another reward and it's going to keep happening until the player reaches his destination all right so guys I hope this is clear now let's move on and look at the reinforcement learning definitions so there are a few Concepts that you should be aware of while studying reinforcement learning let's look at those definitions over here so first we have the agent now an agent is basically the reinforcement learning algorithm that learns from trial and error okay so an agent takes actions like for example a soldier in Counter-Strike navigating through the game that's also an action okay if he moves left right or if he shoots at somebody that's also an action okay so the agent is responsible for taking actions in the environment now the environment is the whole Counter-Strike game okay it's basically the world through which the agent moves the environment takes the agent's current state and action as input and it Returns the agency reward and its next state as output all right next we have action now all the possible steps that an agent can take are called actions so like I said it can be moving right left or shooting or any of that all right then we have state now state is basically the current condition returned by the environment so whichever State you are in if you are in state 1 or if you're in state 2 that represents your current condition all right next we have reward a reward is basically an instant return from the environment to appraise Your Last Action okay so it can be anything like coins or it can be additional points so basically a reward is given to an agent after 
it clears the specific stages next we have policy policy is basically the strategy that the agent uses to find out his next action based on his current state policy is just the strategy with which you approach the game then we have value now value is the expected long-term return with discount so value and action value can be a little bit confusing for you right now but as we move further you'll understand what I'm talking about okay so value is basically the long term return that you get with discount okay discount I'll explain in the further slides then we have action value now action value is also known as Q value okay it's very similar to Value except that it takes an extra parameter which is the current action so basically here you'll find out the queue depending on the particular action that you took all right so guys don't get confused with value and action value we look at examples in the further slides and you'll understand this better okay so guys make sure that you're familiar with these terms because you'll be seeing a lot of these terms in the further slides all right now before we move any further I'd like to discuss a few more Concepts okay so first we'll discuss the reward maximization so if you haven't already realized it the basic aim of the RL agent is to maximize the reward now how does that happen let's try to understand this in depth so the agent must be trained in such a way that he takes the best action so that the reward is maximum because the end goal of reinforcement learning is to maximize your reward based on a set of actions so let me explain this with a small game now in the figure you can see there is a fox there's some meat and there's a tiger so our agent is basically the fox and his end goal is to eat the maximum amount of meat before being eaten by the tiger now since the fox is a clever fellow he eats the meat that is closer to him rather than the meat which is closer to the tiger now this is because the closer he is to the tiger the higher are his chances of getting killed so because of this the rewards which are near the tiger even if they are bigger meat chunks they will be discounted so this is exactly what discounting means so our agent is not going to eat the meat chunks which are closer to the tiger because of the risk all right now even though the meat chunks might be larger he does not want to take the chances of getting killed okay this is called discounting okay this is where you discount because you improvise and you just eat the meat which are closer to you instead of taking risks and eating the meat which are closer to your opponent all right now the discounting of reward Works based on a value called gamma we'll be discussing gamma in our further slides but in short the value of gamma is between zero and one okay so the smaller the gamma the larger is the discount value okay so if the gamma value is lesser it means that the agent is not going to explore and he's not going to try and eat the meat chunks which are closer to the tiger okay but if the gamma value is closer to 1 it means that our agent is actually going to explore and it's going to try and eat the meat chunks which are closer to the tiger all right now I'll be explaining this in depth in the further slides so don't worry if you haven't got a clear concept yet but just understand that reward maximization is a very important step when it comes to reinforcement learning because the agent has to collect maximum rewards by the end of the game all right now let's look at another 
concept which is called exploration and exploitation so exploration like the name suggests is about exploring and capturing more information about an environment on the other hand exploitation is about using the already known exploited information to heighten the rewards so guys consider the fox and tiger example that we discussed now here the fox eats only the meat chunks which are close to him but he does not eat the meat chunks which are closer to the tiger okay even though they might give him more rewards he does not eat them if the fox only focuses on the closest rewards he will never reach the big chunks of meat okay this is what exploitation is about you're just going to use the currently known information and you're going to try and get rewards based on that information but if the fox decides to explore a bit it can find the bigger reward which is the big chunks of meat this is a exactly what exploration is so the agent is not going to stick to one corner instead he's going to explore the entire environment and try and collect bigger rewards all right so guys I hope you all are clear with exploration and exploitation now let's look at the markers decision process so guys this is basically a mathematical approach for mapping a solution in reinforcement learning in a way the purpose of reinforcement learning is to solve a Markov decision process okay so there are a few parameters that are used to get to the solution so the parameters include the set of actions the set of states the rewards the policy that you're taking to approach the problem and the value that you get okay so to sum it up the agent must take an action a to transition from a start state to the end State s while doing so the agent will receive a reward R for each action that he takes so there's a series of actions taken by the agent Define the policy or it defines the approach I and the rewards that are collected Define the value so the main goal here is to maximize the rewards by choosing the optimum policy all right now let's try to understand this with the help of the shortest path problem I'm sure a lot of you might have gone through this problem when you were in college so guys look at the graph over here so our aim here is to find the shortest path between a and d with minimum possible cost so the value that you see on each of these edges basically denotes the cost so if I want to go from a to c it's going to cost me 15 points okay so let's look at how this is done now before we move and look at the problem in this problem the set of states are denoted by the nodes which is a b c d and the action is to Traverse from one node to the other so if I'm going from A to B that's an action similarly a to c that's an action okay the reward is basically the cost which is represented by each Edge over here all right now the policy is basically the path that I choose to reach the destination so let's say I choose ACD okay that's one policy in order to get to D I'm choosing ACD which is a policy okay it's basically how I'm approaching the problem so guys here you can start off at node a and you can take baby steps to your destination now initially your clueless so you can just take the next possible node which is visible to you so guys if you're smart enough you're going to choose a to c instead of a b c d or ABD all right so now if you're at node C you want to Traverse to node D you must again choose a wise path all right you just have to calculate which path has the highest cost or which path will give you the maximum 
rewards so guys this is a simple problem we're just trying to calculate the shortest path between a and d by traversing through these nodes so if I Traverse from ACD it gives me the maximum reward okay it gives me 65 which is more than any other policy would give me okay so if I go from ABD it would be 40 when you compare this to ACD gives me more rewards so obviously I'm going to go with ACD okay so guys it was a simple problem in order to understand how Markov decision process works all right so guys I want to ask you a question what do you think I did here did I perform exploration or did I perform exploitation now the policy for the above example is of exploitation because we didn't explore the other nodes okay we just selected three nodes and we Traverse through them so that's why this is called exploitation we must always explore the different nodes so that we can find a more optimal policy but in this case obviously ACD has the highest reward and we're going with ACD but generally it's not so simple there are a lot of nodes there are hundreds of nodes to Traverse and they're like 50 60 policies okay 50 60 different policies so you make sure you explore through all the policies and then decide on an Optimum policy which will give you a maximum reward so guys this is our code and this is executed in Python and I'm assuming that all of you have a good background in Python okay if you don't understand python very well I'm going to leave a link in the description you can check out that video on Python and then maybe come back to this later okay but I'll be explaining the code to you anyway but I'm not going to spend a lot of time explaining each and every line of code because I'm assuming that you know python okay so let's look at the first line of code over here so what we're going to do is we're going to import numpy okay numpy is basically a python library for adding support for large multi-dimensional arrays and matrices and it's basically for computing mathematical functions okay so first we're going to import that after that we're going to create the r Matrix okay so this is the r Matrix next we're going to create a queue Matrix and it's a six into six Matrix because obviously we have six states starting from zero to five okay and we're going to initialize the value to zero so basically the queue Matrix is going to be in initialize to 0 over here all right after that we're setting the gamma parameter to 0.8 so guys you can play with this parameter and you know move it to 0.9 or move it lower to 0.8 okay you can see what happens then then we'll set an initial stage okay initial stage is set as one after that we're defining a function called available actions okay so basically what we're doing here is since our initial state is 1 we're going to check our row number one okay this is our row number one okay this is row number zero this is row number one and so on so we're going to check the row number one and we're going to find the values which are greater than or equal to zero because these values basically represent the nodes that we can travel to now if you select minus one you can't Traverse to minus one okay I explained this earlier the minus 1 represents all the nodes that we can't travel to but we can travel to these nodes okay so basically over here we're checking all the values which are equal to 0 or greater than Z 0 these will be our available actions so if our initial state is 1 we can travel to other states whose value is equal to 0 or greater than zero and this is stored 
in this variable called available Act all right now this will basically get the available actions in the current state okay so we're just storing the possible actions in this available act variable over here so basically over here since our initial state is 1 we're going to find out the next possible States we can go to okay that is stored in the available act variable now next is uh this function chooses at random which action to be performed within the range so if you remember over here so guys initially we are in stage number one okay our available actions is to go to stage number three or stage number five sorry room number three or room number five okay now randomly we need to choose one room so for that we're using this line of code okay so here we're randomly going to choose one of the actions from the available act this available act like I said earlier stores all are possible actions okay from the initial State okay so once it chooses an action it's going to store it in next action so guys this action will represent the next available action to take now next is our Q Matrix remember this formula that we used so guys this formula that we use is what we're going to calculate in the next few lines of code so in this block of code we're just executing and Computing the value of Q okay this is our formula for computing the value of Q current state comma action our current state comma action gamma into the maximum value so here basically we're going to calculate the maximum index meaning that we're going to check which of the possible actions will give us the maximum Q value all right if you remember in our explanation over here this value over here Max Q of 5 comma 1 5 comma 4 and 5 comma 5 we had to choose a maximum Q value that we get from these three so basically that's exactly what we're doing in this line of code we're calculating the index which gives us the maximum value after we finish Computing the value of Q we'll just have to update our Matrix after that we'll be updating the Q value and we'll be choosing a new initial State okay so this is the update function that is defined over here okay so I've just called the function over here so guys this whole set of code will just calculate the Q value okay this is exactly what we did in our examples after that we have the training phase so guys remember the more you train an algorithm the better it's going to learn okay so over here I've provided around 10 000 iterations okay so my range is 10 000 iterations meaning that my agent will take 10 000 possible scenarios and it'll go through 10 000 iterations to find out the best policy so here exactly what I'm doing is I'm choosing the current state randomly after that I am choosing the available action from the current state so either I can go to stage 3 or state five then I'm calculating the next action and then I'm finally updating the value in the Q Matrix and next we just normalize the queue Matrix so sometimes in our Q Matrix the value might exceed okay let's say it exceeded to 500 600 so that time you want to normalize a matrix okay we want to bring it down a little bit okay because larger numbers we won't be able to understand and computation will be very hard on larger numbers that's why we perform normalization you're taking your calculated value and you're dividing it with the maximum Q value into 100 all right so you're normalizing it over here so guys this is the testing phase okay here you'll just randomly set a current state and you want given any other data because 
you've already trained a model okay you're going to give a current state then you're going to tell your agent that listen you're in room number one now you need to go to room number five okay so he has to figure out how to go to room number five because we've trained them now all right so here we've set the current state to one and we need to make sure that it's not equal to 5 because 5 is the end goal so guys this is the same Loop that we executed earlier so we're going to do the same iterations again now if I run this entire code let's look at the result so our current state here with chosen as one okay and if we go back to our Matrix you can see that there is a direct link from one to five which means that the root that the agent should take is one to five okay directly you should go from one to five because it will get the maximum reward like that okay let's see if that's happening so if I run this it should give me a direct path from one to five okay that's exactly what happened so this is the selected path so directly from one to five it went and it calculated the entire Q Matrix for me so guys this is exactly how it works now let's try to set the initial stage as um let's say two so if I set the initial stage as 2 and if I try to run the code let's see the path that it gives so the selected path is two three four five now it shows this path because it's giving us the maximum reward from this path okay this is the Q Matrix that it calculated and this is the selected path [Music] now for a robot and environment is a place where it has been put to use now remember this robot is itself the agent for example an automobile Factory where a robot is used to move materials from one place to another now the task we discussed just now have a property in common now these tasks involve an environment and expect the agent to learn from the environment now this is where traditional machine learning fails and hence the need for reinforcement learning now it is good to have an established overview of the problem that is to be solved using the Q learning or the reinforcement learning so it helps to define the main components of a reinforcement learning solution that is the agent environment action rewards and states so let's suppose we have to build a few autonomous robots for an automobile building Factory now these robots will help the factory Personnel by conveying them the necessary parts that they would need in order to build the car now these different parts are located at nine different positions within the factory warehouse the car part includes these chassis Wheels dashboard the engine and so on and the factory workers have prioritized the location that contains the body or the chassis to be the topmost but they have provided the priorities for other locations as well which we will look into the moment now these locations within the factory look somewhat like this so as you can see here we have L1 L to L3 all of these stations now one thing you might notice here that there are little obstacle prison in between the locations so L6 is the top priority location that contains the chassis for preparing the car bodies now the task is to enable the robots so that they can find the shortest route from any given location to another location on their own now the agents in this case are the robots the environment is the automobile factory warehouse so let's talk about these states so the states are the location in which a particular robot is present in the particular instance of time which will denote 
its States now machines understand numbers rather than letters so let's map the location codes to number so as you can see here we have map location L1 to the state 0 L to N1 and so on we have L8 as state seven and N line at state 8. next what we're going to talk about are the actions so in our example the action will be the direct location that a robot can go from a particular location right consider a robot that is at L2 location and the Direct locations to which it can move are L5 L1 and L3 now the figure here may come in handy to visualize this now as you might have already guessed the set of actions here is nothing but the set of all possible states of the robot for each location the set of actions that a robot can take will be different for example the set of actions will change if the robot is in L1 rather than L2 so if the robot is in L1 it can only go to L4 and L2 directly now that we are done with the states and the actions let's talk about the rewards so the states are basically 0 1 2 3 4 and the actions are also 0 1 2 3 4 up till 8. now the rewards now will be given to a robot if a location which is the state is directly reachable from a particular location so let's take an example suppose L line is directly reachable from L8 right so if a robot goes from LA to L line and vice versa it will be rewarded by one and if a location is not directly reachable from a particular equation we do not give any reward a reward of zero now the reward is just a number here and nothing else it enables the robots to make sense of the movements helping them in deciding what locations are directly reachable and what are not now with this queue we can construct a reward table which contains all the reward values mapping between all possible States so as you can see here in the table the positions which are marked green have a positive revon and as you can see here we have all the possible rewards that a robot can get by moving in between the different states now comes an interesting decision now remember that the factory administrator prioritized L6 to be the top most so how do we incorporate this fact in the above table now this is done by associating the topmost priority location with a very high reward than the usual ones so let's put 999 in the cell L6 comma L6 now the table of rewards with a higher reward for the topmost location looks something like this we have now formally defined all the vital components for the solution we are aiming for the problem discussed now we will shift gears a bit and study some of the fundamental concepts that Prevail in the world of reinforcement learning and queue learning so first of all we'll start with the Bellman equation now consider the following square of rooms which is analogous to the actual environment from our original problem but without the barriers now suppose a robot what needs to go to the room marked in the green from its current position a using the specified Direction now how can we enable the robot to do this programmatically one idea would be introduce some kind of a footprint which the robot will be able to follow now here a constant value is specified in each of the rooms which will come along the robot's way if it follows the direction specified above now in this way if it starts at location a it will be able to scan through this constant value and will move accordingly but this will only work if the direction is prefix and the robot always starts at the location a now consider the robot starts at this location rather than its previous 
one now the robot now sees Footprints in two different directions it is therefore unable to decide which way to go in order to get the destination which is the Green Room it happens primarily because the robot does not have a way to remember the directions to proceed so our job now is to enable what with the memory now this is where the Bellman equation comes into play so as you can see here the main reason of the Bellman equation is to enable the robot with the memory that's the thing we're going to use so the equation goes something like this V of s gives maximum of a r of s comma a plus comma of v s Dash where s is a particular state which is a room a is the Action Moving between the rooms s Dash is the state to which the robot goes from s and gamma is the discount Factor now we'll get into it in a moment and obviously R of s comma a is a reward function which takes a state s and action a and outputs the reward now V of s is the value of being in a particular state which is the footprint now we consider all the possible actions and take the one that use the maximum value now there is one constraint how you are regarding the value footprint that is the room marked in the yellow just below the Green Room it will always have the value of 1 to denote that is one of the nearest room adjacent to the green room now this is also to ensure that a robot gets a reward when it goes from a yellow room to The Green Room let's see how to make sense of the equation which we have here so let's assume a discount factor of 0.9 as remember gamma is the discount value or the discount Factor so let's take a 0.9 now for the room which is marked just below the one or the yellow room which is the asterisk Mark for this room what will be the V of s that is the value of being in a particular state so for this V of s would be something like maximum of a we'll take 0 which is the initial of r s comma a plus 0.9 which is gamma into one so that gives us 0.9 now here the robot will not get any reward for going to a state marked in yellow hence the r s comma a is zero here but the robot knows the value of being in the yellow room hence V of s Dash is one following this for the other states we should get 0.9 then again if we put 0.9 in this equation we get 0.81 then 0.729 and then we again reach the starting point so this is how the table looks with some value Footprints computed from the Bellman equation now a couple of things to notice here is that the max function helps the robot to always choose the state that gives it the maximum value of being in that state now the discount Factor gamma notifies the robot about how far it is from the destination this is typically specified by the developer of the algorithm that would be installed in the robot now the other states can also be given their respective values in a similar way so as you can see here the boxes edges listen to the green one have one and if we move away from one we get 0.9 0.81 0.729 and finally we reach 0.66 now the robot now can proceed its way through the Green Room utilizing these value Footprints even if it's dropped at any arbitrary room in the given location now if a robot lands up in the highlighted Sky Blue Area it will still find two options to choose from but eventually either of the parts will be good enough for the robot to take because of the way the valley Footprints are not laid out now one thing to note here is that the development equation is one of the key equations in the world of reinforcement learning and Q learning so if we think 
realistically our surroundings do not always work in the way we expect there is always a bit of stochasticity involved in it so this applies to robot as well sometimes it might so happen that the robots Machinery got corrupted sometimes the robot may come across some hindrance on its way which may not be known to it beforehand right and sometimes even if the robot knows that it needs to take the right turn it will not so how do we introduce the stochasticity in our case now here comes the mark of decision process now consider the robot is currently in the Red Room and it needs to go to the green room now let's now consider the robot has a slight chance of dysfunctioning and might take the left or the right or the bottom turn instead of taking the upper turn in order to get to The Green Room from where it is now which is the Red Room now the question is how do we enable the robot to handle this when it is out in the given environment right now now this is a situation where the decision making regarding which turn is to be taken is partly random and partly another control of the robot now partly random because we are not sure when exactly the robot mind is functional and partly under the control of the robot because it is still making a decision of taking a turn right on its own and with the help of the program embedded into it so a Markov decision process is a discrete time stochastic Control process it provides a mathematical framework for modeling decision making in situations where the outcomes are partly random and partly under control of the decision maker now we need to give this concept a mathematical shape most likely an equation which then can be taken further now you might be surprised that we can do this with the help of the pelman equation with a few minor tweaks so if we have a look at the original Bellman equation V of X is equal to maximum of r s comma a plus comma V of s Dash what needs to be changed in the above equation so that we can introduce some amount of Randomness here as long as we are not sure when the robot might not take the expected turn we are then also not sure in which room it might end up in which is nothing but the room it moves from its current room at this point according to the equation we are not sure of the S Dash which is the next state or 0 but we do know all the probable turns the robot might take now in order to incorporate each of these probabilities into the above equation we need to associate a probability with each of the turns to quantify the robot if it has got any explicitness chance of taking this turn now if we do so we get PS is equal to maximum of RS comma a plus gamma into summation of s Dash p s comma a comma s Dash into V of s Dash now the PSA and S Dash is the probability of moving from room s to S Dash with the action a and the summation here is the expectation of the situation that the robot incurs which is the randomness now let's take a look at this example here so when we associate the probabilities to each of these terms we essentially mean that there is an 80 chance that the robot will take the upper turn now if you put all the required values in our equation we get V of s is equal to maximum of R of s comma e plus gamma of 0.8 into V of room up plus 0.1 into V of room down 0.03 into room of V or from left plus 0.03 into V of room right now note that the value Footprints will not change due to the fact that we are incorporating stochastically here but this time we will not calculate those value Footprints instead we will let 
the robot figure it out on its own until this point we have not considered rewarding the robot for its action of going into a particular room we are only rewarding the robot when it gets to the destination now ideally there should be a reward for each action the robot takes to help it better assess the quality of its actions and the rewards need not always be the same but it is much better to have some amount of reward for the actions than to have no rewards at all right and this idea is known as the living penalty in reality the reward system can be very complex and particularly modeling sparse rewards is an active area of research in the domain of reinforcement learning so by now we have got the equation that we need so what we're going to do now is transition to Q learning so this equation gives us the value of going to a particular state taking the stochasticity of the environment into account now we have also learned very briefly about the idea of the living penalty which deals with associating each move of the robot with a reward now Q learning uses the idea of assessing the quality of an action that is taken to move to a state rather than determining the possible value of the state which is being moved to so earlier we had 0.8 into V of s1 plus 0.1 into V of s2 plus 0.03 into V of s3 and so on now if we incorporate the idea of assessing the quality of the action for moving to a certain state then the environment with the agent and the quality of the actions will look something like this so instead of 0.8 into V of s1 we will have Q of s1 comma a1 then Q of s2 comma a2 and Q of s3 comma a3 now the robot has four different states to choose from and along with that there are four different actions for the current state it is in so how do we calculate Q of s comma a that is the cumulative quality of the possible actions the robot might take so let's break it down now from the equation V of s equals the maximum over a of R of s comma a plus gamma into the summation over s dash of P of s comma a comma s dash into V of s dash if we discard the maximum function we have R of s comma a plus gamma into the summation of P into V now essentially in the equation that produces V of s we are considering all possible actions and all possible states from the current state that the robot is in and then we are taking the maximum value caused by taking a certain action and this expression produces a value footprint for just one possible action in fact we can think of it as the quality of the action so Q of s comma a is equal to R of s comma a plus gamma into the summation of P into V now that we have got an equation to quantify the quality of a particular action we are going to make a little adjustment in the equation we can now say that V of s is the maximum of all the possible values of Q of s comma a right so let's utilize this fact and replace V of s dash with a function of Q so Q of s comma a becomes R of s comma a plus gamma into the summation over s dash of P of s comma a comma s dash into the maximum over a dash of Q of s dash comma a dash so the equation for V has now turned into an equation for Q which is the quality but why would we do that now this is done to ease our calculations because now we have only one function Q to calculate and R of s comma a is a quantified metric which produces the reward for moving to a certain state now the qualities of the actions are called the Q values and from now on we will refer to the value footprints as the Q values
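since those spoken equations are easier to follow written out, here they are in symbols exactly as described in the walkthrough, where s' is the next state, P(s,a,s') is the transition probability and gamma is the discount factor

V(s) = \max_a \big[ R(s,a) + \gamma \, V(s') \big]                        % deterministic Bellman equation (the rooms example)
V(s) = \max_a \big[ R(s,a) + \gamma \sum_{s'} P(s,a,s') \, V(s') \big]     % stochastic version, the Markov decision process
Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s,a,s') \, \max_{a'} Q(s',a')         % quality of taking action a in state s
V(s) = \max_a Q(s,a)                                                       % the value is the best available Q value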
now an important piece of the puzzle is the temporal difference so the temporal difference is the component that will help the robot calculate the Q values with respect to the changes in the environment over time so consider that our robot is currently in the marked state and it wants to move to the upper state one thing to note here is that the robot already knows the Q value of making that action that is moving to the upper state and we know that the environment is stochastic in nature and the reward that the robot will get after moving to the upper state might be different from an earlier observation so how do we capture this change for the temporal difference we calculate the new Q of s comma a with the same formula and subtract the previously known Q of s comma a from it so this will in turn give us the temporal difference now the equation that we just wrote gives the temporal difference in the Q values which further helps to capture the random changes that the environment may impose now Q of s comma a is updated as follows so Q t of s comma a is equal to Q t minus 1 of s comma a plus alpha into TD t of s comma a now here alpha is the learning rate which controls how quickly the robot adapts to the random changes imposed by the environment Q t of s comma a is the current Q value and Q t minus 1 of s comma a is the previously recorded Q value so if we replace TD of s comma a with its full form equation we should get Q t of s comma a is equal to Q t minus 1 of s comma a plus alpha into R of s comma a plus gamma into the maximum of Q of s dash comma a dash minus Q t minus 1 of s comma a now that we have all the little pieces of Q learning together let's move forward to its implementation part now this is the final equation of Q learning right so let's see how we can implement this and obtain the best path for any robot to take now to implement the algorithm we need to understand the warehouse locations and how they can be mapped to different states so let's start by recollecting the sample environment so as you can see here we have L1 L2 L3 till L9 and as you can see here we have certain borders also so first of all let's map each of the above locations in the warehouse to numbers or states so that it will ease our calculations right so what I'm going to do is create a new Python 3 file in the Jupyter notebook and I'll name it Q learning okay so let's define the states but before that what we need to do is import numpy because we're going to use numpy for this purpose and let's initialize the parameters that is the gamma and alpha parameters so gamma is 0.75 which is the discount factor whereas alpha is 0.9 which is the learning rate now next what we're going to do is define the states and map them to numbers so as I mentioned earlier L1 is 0 and so on till L9 which is 8 so we have defined the states in numerical form so the next step is to define the actions which as mentioned above represent the transitions to the next state so as you can see here we have an array of actions from 0 to 8.
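before the reward table is written out in the next step, here is a rough numpy sketch of the Q learning update just derived, under the assumptions stated in the comments; the names train_q_table and rewards are illustrative, not the instructor's exact code, and rewards stands for the 9 by 9 reward table that gets defined next

import numpy as np

gamma = 0.75  # discount factor
alpha = 0.9   # learning rate

# locations L1..L9 mapped to states 0..8, as described
location_to_state = {'L1': 0, 'L2': 1, 'L3': 2, 'L4': 3, 'L5': 4,
                     'L6': 5, 'L7': 6, 'L8': 7, 'L9': 8}
actions = list(range(9))

def train_q_table(rewards, iterations=1000):
    """rewards: a 9x9 numpy array like the reward table described in the walkthrough."""
    Q = np.zeros((9, 9))
    for _ in range(iterations):
        state = np.random.randint(0, 9)                        # pick a random current state
        playable = [a for a in actions if rewards[state, a] > 0]
        if not playable:                                        # no directly reachable location
            continue
        action = int(np.random.choice(playable))                # here the action index is also the next state
        next_state = action
        # temporal difference: R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)
        td = rewards[state, action] + gamma * np.max(Q[next_state]) - Q[state, action]
        Q[state, action] += alpha * td                          # Q learning update
    return Q

the get optimal route function mentioned shortly would then just walk from the start location towards the end location by repeatedly moving to the state given by np.argmax over the trained Q row for the current state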
now what we're going to do is Define the reward table so as you can see it's the same Matrix that we created just now that I showed you just now now if you understood it correctly there isn't any real barrier limitation as depicted in the image for example the transition L4 to L1 is allowed but the reward will be zero to discourage that path or in tough situation what we do is add a minus one there so that it gets a negative reward now in the above code snippet as you can see here we took each of the states and put once in the respective state that are directly reachable from the certain State now if you refer to that reward table once again which we created the above area construction will be easy to understand but one thing to note here is that we did not consider the top priority location L6 yet we would also need an inverse mapping from the states back to its original location and it will be cleaner when we reach to the utter depths of the algorithms so for that what we're going to do is have at the inverse map location State the location we will take the distinct State and location and convert it back now what we'll do is we'll Now define a function get optimal which is the get optimal route which will have a start location and an N location [Music] start deep learning with an analogy so how do you think our brain is able to identify the difference between a dog and a cat so the reason is since the day we have born we are actually seeing different types of cats and dogs in our day-to-day life and because of that we are able to identify the difference or Spot the Difference between the two that is dog and a cat even if we see different types of cats and dogs we still know which is a cat and which is a dog so this is because we have seen a lot of cats and dogs in our entire life but what if I want a machine to do that task for me so how will a machine identify whether the given image is of a dog or a cat so one way of doing that is we can train our machine to a lot of images a lot of images of different cats and dogs that has different breeds of cats and dogs and then what will happen once our training is done we can provide it with an input image then we'll manually extract certain features features like nose whiskers colors edges it can be anything the important features which actually helps us in classifying whether the input image is a cat or a DOT then we make a machine learning model with that and once it is done our machine learning model is able to predict whether the input image is of a cat or a dog but if you notice here one very big disadvantage with this is we have to manually extract features features such as nose whiskers all those features we have to manually extract it and provide it to our machine learning model and trust me guys in every scenario it is not possible if you have large number of inputs you cannot do that you cannot manually extract the features or basically you can say columns which are important for you in order to predict what the object is right and that led to the evolution of deep learning so what happens in deep learning we skip the manual step of feature extraction so what we do we directly take the input image and feed it to our deep learning algorithm and because of that what happens whatever features are we already manually provided apart from that there might be many other features which are important for example if our features don't include the length of the neck and that is one of the major feature in order to identify or classify whether it 
is a cat or a dog now what will happen our algorithm will automatically determine that feature and will take into consideration and then it will classify whether the input image is a cat or a DOT so what happens here we provide an input image of a dog and then it will automatically learn certain features even if you don't provide it with those features and after that it will give us the probability and according to this particular scenario it says 95 chances of being a dog three person chances of some other animals similarly two person chances of some other animal so since the highest probability goes with dog so the prediction says that the object or the input image is nothing but a dog fine guys so we'll move forward and we'll understand how exactly deep Learning Works so the motivation behind deep learning is nothing but the human brain as we have seen in the previous analogy as well what we are trying to do we are trying to mimic the human brain we are trying to mimic the way we think the way we take decisions the way we solve problems we are trying to make sure that we have a system that can mimic our own brain so obviously the motivation for deep learning has to be our brain and how our brain works with the help of our brain cells which we call neuron now let us understand how a neuron works and let me tell you guys this is what we think how a neuron works this is what our studies tell us so yeah so these are called dendrites so dendrites receive signals from other neurons and then it passes those signals to the cell body now this cell body is where we perform certain function it can be a sigma function That's What We believe We believe it performs Sigma function that is nothing but sums all the inputs then through Exon what happens these signals are fired to the next neuron and the next neuron is present at some distance and that distance is nothing but synapse so it fires only when the signals coming from the cell body exceeds a particular limit then only the cell or this neuron will fire the signals to the next neuron and this is how the neuron at the brain cell works and we take the same process forward and we try to create artificial neurons so let us understand why we need artificial neurons with an example so I have a data set of flowers say and that data set includes separate lens sepal weights better length and petal width now what I want to do I want to classify the type of flower on the basis of this data set now there are two options either I can do it manually I can look at the flower manually and determine by its color or any amines and I can identify what sort of a flower it is or I can train a machine to do that now let me tell you the problem with doing this process manually first of all there might be millions of inputs that will be given to you and for a human brain to perform that particular task is next to Impossible and at the same time we always get tired at some point of time right so we cannot just continue working for a long period of time in single stretch and the third point is human error risk which is always there so these are few limitations with the human brain so what we can do we can train a machine to do that task for us or we can put our brain inside a machine so that it can classify the flowers for us so with this what will happen the machine will never get tired and and will make better predictions as well so this is why we create artificial neurons so that there's a system present that can mimic our brain and this is what exactly happens so this 
particular artificial neuron can actually classify the flowers or you can say can divide the flowers on the basis of certain features in our case it is separate length sepal width petal length and petal width so on that basis it can classify the two flowers so what we need here we need some sort of a system that can actually separate the two species and what is that system is nothing but an artificial neuron so we need artificial neuron and one type of artificial neuron is a perception now let me explain you perception with the flow diagram that is there in front of your screen now over here what happens we have set of inputs like X1 X2 dash dash dash till except now these inputs will be multiplied with their corresponding weights which is W1 W2 W3 till WN in our case now these weights actually Define how important our input is so if the value of weight is high we know that this particular input is very very important for us after multiplication all of these are summed together and then it is fed to an activation function now the reason of using activation function is to provide a threshold value so if our signal is above that threshold value a neuron or you can say our perceptron will fire else it won't fire so that is the reason why we use an activation function there can be different types of activation function there can be sigmoid there can be step function sine function depending on our use case we Define the activation function now the main idea was to Define an algorithm in order to learn the values of the weights which are w1w to W3 in our case to learn the values of the weights that are then multiplied with the input features in order to make a decision whether a neuron fires or not in context of patent classification which is in our case flowers so we are trying to classify the two species of flowers such an algorithm could be useful to determine if a sample belongs to one class or one type of a specie or another class or another type of the flower we can even call perceptron as a single layer binary linear classifier to be more specific because it is able to classify inputs which are linearly separable and our main task here is to predict to which of the two possible categories a certain data point belongs based on a set of input variables now there's an algorithm on which it works so let me explain you that so the first thing we do is we initialize the weights and the threshold now these weights can actually be a small number or a random number and it can even be zero so it depends on the use case that we have then what we do we provide the input and then we calculate the output now when we are training our model or we are training our artificial neuron we have the output values for a particular set of inputs so we know the output value but what we do we give the input and we see what will be the output of our particular neuron and accordingly we need to update the wage in order to reduce the loss or you can say in order to reduce the difference between the actual output and the desired output so what happens uh let me tell you that so we need to update the weight so how we are going to update the width is we can say the new weight is equal to the old weight plus the learning rate learning rate we'll discuss about it later in the session but generally we choose learning rate somewhere between 0.5 to 0.01 after that what happens we find the difference between the desired output and the actual output and then we multiply it with the input so on that basis we get a new weight so 
this is how we update the wage and then what happens we repeat the steps two and three so we are going to repeat the steps two and three that is we are going to apply this new weight again we are going to calculate the output again we are going to compare it with the desired output and if it matches then it's fine otherwise we are going to update it again so this is how the whole perceptron learning algorithm works fine guys so we'll move forward and we'll see the various types of activation functions so these are the various types of activation functions that we use although there can be many more activation functions again I'm telling it depends on your use case so we have a step function so if our output is actually above this particular value then only a neuron will fire or you can say the output will be plus one or if it's less than this particular value then we'll have no output or we'll say zero output similarly for the sigmoid function as well and same goes for sine function so let us move forward and we'll focus on various applications of this perceptron now as I've told you earlier as well it can be used to classify any linearly separable set of inputs now let me explain you with the diagram that is there in front of your screen so we have different types of dogs and we have horses and we want a line that can separate these two so our first iteration will produce this sort of a line but here we can notice that we have error here as we have classified horse one of the horses as a dog and a dog as a horse so error is two here similarly we now what happens we have updated the weights now after updating the weight what happens our error has reduced so what happens now we have actually classified all the horses correctly but one dog we have classified wrong and we have considered it as a horse so our error becomes one once again what will happen if you can remember the step two and three of our perceptron learning algorithm will be done and then after that our weights will be updated and our desired output will become equal to our actual output and we get a line something like this so what we have now we have properly classified dogs and horses so this lines separates both of them this is how we can actually use a single layer perceptron in order to classify any linearly separable set of inputs now we can even use it in order to implement a logic gauge that is r and and now let me tell you how you can do that first we'll look at or gate now in our gate what happens here is the truth table truth table according to that we have two inputs X1 and X2 so if both are 0 we get a zero if any one of the input is high or any one of the input is one we get the output as one so what we need to do is we need to make sure that our weights are present in such a way that we get the same output so how we have done that when the value of weight is equal to 1 and then we provide the input X1 and X2 after passing through this activation function we get this sort of a graph which is the graph for our or a gate now we have X1 and X2 as inputs so we provide first input as 0 x 1 and this also has zero so 0 into 0 is again 0 0 into 0 is 0 and then we pass it through this activation function now we need to make sure that whatever value that comes here should be greater than 0.5 then only our neuron or the perceptron will fire but since this value is actually less than 0.5 so the neuron won't fire and our output will be zero as you can see it over here now let's take the other set of inputs now our X1 is 0 and 
Now let me tell you how you can do that. First we'll look at the OR gate. According to its truth table we have two inputs, X1 and X2: if both are 0 we get a 0, and if any one of the inputs is 1 we get an output of 1. So we need to set the weights in such a way that we get that same output. How have we done that? We keep both weights equal to 1, provide the inputs X1 and X2, and pass the weighted sum through the activation function, which gives us the graph for the OR gate. Take the first pair of inputs, X1 = 0 and X2 = 0: 0 times 1 plus 0 times 1 is 0, and we pass that through the activation function. We need the value to be greater than 0.5 for the neuron, or perceptron, to fire; since 0 is less than 0.5, the neuron won't fire and the output is zero, as you can see here. Now take the next set of inputs, X1 = 0 and X2 = 1: 1 times 1 is obviously 1, which is bigger than 0.5, so the neuron fires and we get the output 1, as you can see in the graph. Similarly, when X1 is 1 and X2 is 0 the weighted sum is again greater than 0.5, the neuron fires, and we get the same sort of graph. And when both inputs are 1, the output is 2, which is greater than 0.5, so we get an output of 1. If you notice, with the help of a single-layer perceptron we are able to separate the ones from the zeros: anything above this line is a one and anything below it is a zero. This is how we implement the OR gate. Similarly, when I talk about the AND gate, there is a difference in the truth table: in an AND gate we need both inputs to be high to get a high output, and if any input is low we get a low output. That is the reason we choose a threshold of 1.5, which means the neuron fires and outputs 1 only if the weighted sum is above 1.5, and there is only one case where that happens, when both inputs are high: X1 = 1 and X2 = 1 gives 1 plus 1, which is 2, and 2 is obviously greater than 1.5, so the neuron fires and we get a 1. For all the other inputs the sum is less than 1.5, so the neuron doesn't fire and we get a zero output. So this is how we can implement the AND and OR gates. Now let's move forward, guys, and understand the use case, which is on the MNIST data set. The reason for using the MNIST data set is that it is already clean and will be a perfect example for this. The MNIST data set contains handwritten digits from 0 to 9, with 55,000 training images along with 10,000 testing images. We will train our model with those 55,000 training images and then test the accuracy of the model with the help of those 10,000 testing images. And for all of this we first need to understand what exactly TensorFlow is, so let's move forward and understand TensorFlow. As I've told you earlier, we use the TensorFlow library to implement deep learning models, and the way data is represented in a deep learning model is called tensors. Now, what are tensors? Tensors are just multi-dimensional arrays — you can say an extension of two-dimensional tables, or matrices, to data with higher dimensions. Let me explain with the examples in front of your screen: this particular data is a tensor of dimension 6, because we have six rows and only a single column. Over here we have four columns as well as six rows, so this becomes a tensor of dimension 6 by 4. Similarly over here we have another dimension, a third dimension in which we have two values, so we consider this a tensor of dimension 6 by 4 by 2.
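Just to make the idea of tensor dimensions concrete, here is a tiny sketch — the shapes mirror the examples on the slide, while the zero values are only placeholders I chose for illustration:

import tensorflow as tf   # TensorFlow 1.x style API

t1 = tf.zeros([6])        # dimension 6: six rows, a single column of values
t2 = tf.zeros([6, 4])     # dimension 6 x 4: six rows, four columns
t3 = tf.zeros([6, 4, 2])  # dimension 6 x 4 x 2: a third axis holding two values
print(t1.shape, t2.shape, t3.shape)   # (6,) (6, 4) (6, 4, 2)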
so this is nothing but a way of representing data in tensorflow now if you consider the tensorflow library at its core it is nothing but a library that performs Matrix and manipulation that is what tensorflow is now let us move forward and understand tensorflow in a bit detail so as the name tells that it consists of two words tensor as well as flow now we understood what exactly tensor is now when I talk about flow it is nothing but a data flow graph so let me just give you an example that is there in front of your screen so we've talked about weights and inputs so we provide these weights and inputs and we perform a matrix multiplication so weight is one tensor X input is one tensor then we perform matrix multiplication after that we add a bias then what we do we add all of these so what is this this is nothing with the sigma function in this perceptron that we have seen then we pass it through an activation function and the name of that activation function is relu or relu where you can say it and then our neuronal file so this is nothing but a flow or you can say a data flow graph now let us understand few code basics of tensorflow so we'll move forward and understand the code basics of tensorflow now the tensorflow programs actually consist of two parts one is building a computational graph and another is running a computational graph so we'll first understand how to build a computational graph now you can think of a computational graph as a network of nodes and with each node known as an operation and running some function that can be as simple as addition or subtraction or it can be as complex as say some multivariate equation now let me explain it to you with the code that is there in front of your screen so first thing you do you import the tensorflow library then what you do you define two nodes and these nodes are constants so we'll call that function we'll call it as TF dot constant and we'll provide a value that is 3 and it is nothing but a float number of 32 bits similarly we defined one more node which is a constant and it contains value four so these are nothing but your constant nodes so this is basically what computational graph is so basically we have built a computational graph and in this graph each node takes zero or more tensors as inputs and produces a tensor as an output and one type of node is a constant that I've told you earlier as well and these tensorflow constants it takes no inputs and it outputs a value which is stored internally now what I'll do I'll actually execute this in my pycharm so for that I'll open it so this is my python guys and I've already installed tensorflow so the first thing that I need to do is import the tensorflow library for that I'm going to type in here import tensorflow as TF so if you are familiar with the python Basics you know what it means actually so I am importing tensorflow library and in order to call that I'm going to use the word TF so after that I'm going to Define my Node 1 and node 2 which are constant nodes so for that I'm going to type in here Node 1 equal to TF dot constant and then I'm going to define the constant value in this so it'll be 3 and it'll be a float value of TF dot float of 32-bit all right and now I'm going to Define my second constant node so I'll type build here node 2 TF dot constant I'll put 4 in here and four that's it and so the TF dot flow 32 will be present implicitly I don't need to do that again and again so I have created a computational graph here so what if I print it now so let us see what 
will be the output: print(node1, node2). Let's run it and see what happens. If you notice, printing the nodes does not output the values 3 and 4, which is what you might be expecting; instead they are nodes that, when evaluated, would produce 3 and 4 respectively. To actually evaluate the nodes we need to run the computational graph, so let me show you that; for that I'm again going to open the slides. As I've told you earlier, we need to run this computational graph within a session, and what is a session? A session encapsulates the control and state of the TensorFlow runtime. The code in front of your screen creates a session object and then invokes its run method to run enough of the computational graph to evaluate node1 and node2 — and that is how it does it, by running the computational graph in a session. Now let me show you practically how it happens; again I'm going to open my PyCharm. This is my PyCharm, guys. Let me first comment out the print statement, and now I'm going to create a session: sess = tf.Session(). Then I'm going to print the result, so I type print(sess.run([node1, node2])) — I want to run node1 and node2. When I run this it actually gives me the values 3 and 4 — and yes, it does. So what did we do? We first saw how to build a computational graph, then we understood that these are all nodes, so in order to get their values we need to evaluate them, and we do that by running the computational graph inside a session; when we run that session we get the output 3 and 4, which is nothing but the values of those nodes. One thing you must have noticed is that this will always produce a constant result — we'll see how to avoid that shortly. So we have seen how to run a computational graph as well. Now let me explain with one more example: we take three constant nodes a, b and c, perform certain operations like multiplication, addition and subtraction, run the session and finally close it. This is how the diagram looks: a contains the value 5, b contains 2 and c contains 3. First we multiply a and b, and we add c and b, which gives us two more nodes, d and e; then we subtract e from d and get the final output. Now let me execute this practically in my PyCharm. This is my PyCharm again, guys. The first thing I need to do is import tensorflow as tf, then define the three constant nodes: first a = tf.constant(5), then one more constant node b = tf.constant(2), and then the last constant node, c = tf.constant(3).
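For reference, the first little example we just walked through looks roughly like this when put together — a sketch of what was typed in PyCharm:

import tensorflow as tf   # TensorFlow 1.x style API

node1 = tf.constant(3.0, tf.float32)
node2 = tf.constant(4.0)               # tf.float32 is implied here
print(node1, node2)                    # prints the node descriptions, not 3.0 and 4.0

sess = tf.Session()
print(sess.run([node1, node2]))        # evaluating the graph in a session prints [3.0, 4.0]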
all right so we have three constant nodes and now we are going to perform certain operations in them for that I'm going to Define one node let b d which will be equal to TF dot multiply so the two nodes that we want to multiply so that'll be a comma B and then there'll be one more node in order to perform some operation that is addition to TF Dot add add c and b and then we're going to Define one more node let b f and inside that node we're going to perform the subtraction operation TF dot subtract D and E so we have build a computational graph now we need to run it and you know the process of doing that says is equals to TF dot session then we are going to Define a variable Let It Be or UTS outs whatever name that you want to give in and just type in assess a DOT Run f let's see if that happens or not and then we're gonna print it so for that I'm going to type in here print out let's go ahead and run this and see what happens so we have got the value 5 which is correct because if you notice our presentation as well let me open it for you over here we also we get the value Phi similar to our implementation in python as well so this is how you can actually build a computational graph and run a computational graph I've given you an example now guys let us move forward because these are all the constant nodes what if I want to change the value that is there in the node so for that we don't use the constant nodes for that we use placeholders and variables let me explain it to you first I'll open my slides so since these are all constant nodes so we cannot perform any operation once we have provided a value it will remain constant so basically a graph can be parameterized except external inputs as well and what are these These are nothing but your placeholders and these placeholders is basically a promise to provide values later so there's an example that is there in front of your screen over here these three lines are bit like a function or a Lambda in which we Define a two input parameters A and B and then in operation possible in them so we are actually performing addition so we can evaluate this graph with multiple inputs by using feed underscore dict parameter as you can see we are doing it here so we are actually passing all these values to our placeholders here so these all values will be passed and accordingly we'll get the output so let me show you practically how it happens so I'm going to open my pycharm once more I'll remove all of this and yeah so the first placeholder I'm going to name it as a TF a DOT placeholder and what sort of a placeholder it'll be so I'll consider it as float number of 32 bits similarly I'm going to Define one more variable as well I'm going to name it as B then I'm going to Define an operation that I'm going to perform them so I'm just going to type in here Adder underscore node equals to A plus b and now our placeholders are currently empty so that told you earlier as well placeholders are nothing but a promise in order to feed values later so this is how we have built a computational graph now our next step is to start a session uh so for that I'm going to type in as says equals to TF dot session correct since these placeholders are currently empty and we know that these placeholders are nothing but a promise in order to provide them with certain values later so let's go ahead and provide the values or you can say a list of values so I'm going to type in here print says dot run Adder underscore node and then the values that I'm going to feed in uh so 
basically, I'm going to feed in a dictionary: a gets a list of values [1, 3], and one more, b, gets a list of values [2, 4]. All right, this is done, so let's execute it and see what happens. We get the output 3 and 7: if you add 1 and 2 you get 3, and similarly if you add 3 and 4 you get 7. It's pretty easy mathematics, but my main focus was to make sure you understand what placeholders are. So we understood placeholders; now it's time to understand what variables are. Basically, in deep learning we typically want a model that can take arbitrary inputs, and in order to make the model trainable we need to be able to modify the graph to get new outputs with the same input. What helps us do that? Variables. They basically allow us to add trainable parameters to our graph. To declare variables you can refer to the code in front of your screen: we have taken two variables and one placeholder. The first variable has the value 0.3, the other has -0.3, and the placeholder will obviously remain empty — later on we feed some values into it. Then we create a linear model, or you can say an operation, in which we multiply W with x and then add a bias, the b value, to it. After that we need to initialize all the variables in the TensorFlow program, and for that you must explicitly call a special operation, tf.global_variables_initializer(), and then just run the session. Now let's execute this practically, guys; let me remove all of this. Our first variable will be W, so we call tf.Variable and store the value 0.3 in it, of float type 32-bit, so just type tf.float32. Then we define one more variable, let it be b, so tf.Variable with the value -0.3, again of float type 32-bit, tf.float32. Now I define a placeholder x = tf.placeholder(tf.float32), and then the operation we're going to perform: linear_model = W * x + b, where x is the placeholder value and b is the bias we add. All right. Now, as we've seen earlier, constants are initialized when you call tf.constant and their value will never change; by contrast, variables are not initialized when you call tf.Variable. So to initialize all the variables in the TensorFlow program, you need to explicitly call a special operation, and how do you do it? Just type init = tf.global_variables_initializer(). That's all, and then we run a session — you know the process: sess = tf.Session(), then sess.run(init). Now let's print it, and before that we need to provide the x placeholder with some values; we'll do that in the print statement itself, so I type print(sess.run(linear_model, {x: [1, 2, 3, 4]})) — the values passed to x are a dictionary again, with a list of values from 1 to 4. That's all, and we are going to run it now.
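Putting the placeholder and variable pieces together, the script at this point looks roughly like this — a sketch; I've named the bias variable b_var only to keep it distinct from the placeholder b of the earlier example, which in the video lives in a separate script:

import tensorflow as tf   # TensorFlow 1.x style API

# placeholders: a promise to supply values later via feed_dict
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b

# variables: trainable parameters of the graph
W = tf.Variable([0.3], tf.float32)
b_var = tf.Variable([-0.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b_var

init = tf.global_variables_initializer()   # variables must be explicitly initialized
sess = tf.Session()
sess.run(init)

print(sess.run(adder_node, {a: [1, 3], b: [2, 4]}))    # [3. 7.]
print(sess.run(linear_model, {x: [1, 2, 3, 4]}))       # approximately [0. 0.3 0.6 0.9]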
so this is how we get the output simple mathematics what we have the first value of w will be 0.3 and the first value of a b will be a minus 0.3 and the first value of x will be 1. so it'll be 3 minus 3 which is again 0. similarly for other values as well you can calculate it it's absolutely cut it so what we have done we have created a model but we don't know how good it is yet so let's see how we can actually evaluate that for that I'm again going to open my slides so now in order to evaluate the model on the training data we need a placeholder y as you can see in front of your screen to provide the desired values and we need to write a loss function so this placeholder y will actually be provided with the desired values for each set of inputs and then we're going to calculate the loss function and how are we going to do that we're going to minus the actual output with the desired output and then we're going to do the square of it after that we're going to sum all of these Square deltas and then we're going to Define one single scalar as loss so this is how we are actually going to calculate the loss and then after that we need to provide the values to X and Y placeholders so what I'm gonna do I'm going to open my pycharm and then I'm going to show you how correct our model is on the basis of the values that we provide to Y placeholder so guys this is my pycharm again and since here what I'm going to do is I'm going to define a placeholder first so Y is equals to TF dot placeholder tf.432 so I'm going to type in a TF dot load 32 squared Delta TF dot Square the actual output minus the desired output and then I'm going to sum all those losses or you can say Square deltas TF dot reduce underscore sum so I'm going to type in here Square deltas then finally print it so I'm going to type in SAS dot run loss X colon 1 comma two comma three comma four zero comma minus 1 comma minus 2 comma minus three and that's all and we are good to go let's run it and see what will be the loss so the loss is 23.66 which is very very bad now next step is to realize how to actually reduce this loss so let's go ahead with that I'm going to open my slides once more now in order to reduce this loss tensorflow provides optimizers that slowly change each variable in order to minimize the loss function and the simplest Optimizer is gradient descent in order to do that we have to call the function called TF Dot gradients so as you can see it in the code itself so we have a tf.train.gradient descent Optimizer and this is nothing but the learning rate 0.01 then train equals to Optimizer dot minimize the loss so we're going to call this Optimizer which is nothing but the grade InDesign Optimizer in order to minimize the loss and then we are going to run it so let's see if that happens or not so for that again I'm going to open my pycharm and over here let me first comment this print statement and now I'm going to type in here optimizer equals to TF dot train gradient descent optimizer and the learning rate will be 0.01 and guys let me tell you this is just an introductory session to tensorflow and deep learning so all the modules that we have discussed at the beginning will be covered in detail so whatever topics are there in those modules will be covered in detail so you don't need to worry about it so I'm just giving you a general introduction an overview of how things work in tensorflow all these things all these gradient descent optimizers all these things will be discussed in detail in the upcoming sessions now I'm 
going to type in here: train = optimizer.minimize(loss). Now let's run it: sess.run(init), and then we run the training step, feeding values to our x and y placeholders — x will have the values [1, 2, 3, 4] and y will have the values [0, -1, -2, -3] — and finally print(sess.run([W, b])). Let me first comment out the earlier lines — and I made a small mistake here, that W should be uppercase — so now we're good to go; let's run this and see what happens. These are our final model parameters: the value of W comes out to around -0.999969 and the value of b to around 0.9999082, which is essentially W = -1 and b = 1. So this is how we actually build a model, evaluate how good it is, and then try to optimize it in the best way possible. I've just given you a general overview of how things work in TensorFlow.
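Pulled together, this little linear-regression exercise looks roughly like the sketch below; note that the 1000-iteration training loop is my assumption of how the training step was run repeatedly — the video only dictates the individual lines:

import tensorflow as tf   # TensorFlow 1.x style API

W = tf.Variable([0.3], tf.float32)
b = tf.Variable([-0.3], tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)                        # desired outputs

linear_model = W * x + b
loss = tf.reduce_sum(tf.square(linear_model - y))     # sum of the squared deltas

optimizer = tf.train.GradientDescentOptimizer(0.01)   # learning rate 0.01
train = optimizer.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):                                 # repeatedly nudge W and b to shrink the loss
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})

print(sess.run([W, b]))                               # roughly [-1.0, 1.0]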
So now is the time to actually implement the AND and OR gates that I was talking about at the beginning of the session. Let me first remove all of this. In order to implement the AND gate, our training data should consist of the truth table for AND, and we know that truth table: if any of the inputs is low the output will be low, and only if both inputs are high will the output be high. One thing to note here, guys, is that the bias is implemented by adding an extra value of 1 to all the training examples. So, enough with the explanation, let's go ahead and code it. I type: T, F = 1., -1., and the bias will always be one, so bias = 1.0. Now I provide the training data: train_in = [[T, T, bias], [T, F, bias], [F, T, bias], [F, F, bias]] — if one input is true and the other is also true, we have the bias; if one is true and the other false, again the bias; and so on for the remaining two rows. And the training output, train_out, will be [[T], [F], [F], [F]] — if both inputs are true the output is true, and if any input is false the output is false, so there is only one condition in which we have a true output (oops, I forgot the commas at first, let me just add those). All right, this is done. Now, as we know, TensorFlow works by building a model out of empty tensors, then plugging in known values and evaluating the model, like we did in the previous example. Since the training data we have provided will remain constant, the only special TensorFlow object we have to worry about in this case is our 3-by-1 tensor of weights, so we define a variable and put some random values in it: w = tf.Variable(tf.random_normal([3, 1])). What is it? It is basically a variable, so its value may change on each evaluation of the model as we train, with all values initialized to normally distributed random numbers. Now that we have our training data and weight tensor, we have everything needed to build our model using TensorFlow. What do we need next? We need to define an activation function, and we're going to define our own step function (although you could use a predefined function as well; that totally depends on you). So I write a function step: first is_greater = tf.greater(x, 0), then one more variable, as_float = tf.to_float(is_greater), then doubled = tf.multiply(as_float, 2), and finally return tf.subtract(doubled, 1). That is how we define our step function. With the step function defined, the output, the error and the mean squared error of our model can each be calculated in one short line. Let me show you: output = step(tf.matmul(train_in, w)); for the error, error = tf.subtract(train_out, output); and for the mean squared error, mse = tf.reduce_mean(tf.square(error)). Now, the evaluation of certain tensor functions can also update the values of variables, like our tensor of weights w. In our case we're going to update the weights w: first we calculate the desired adjustment based on the error, then we add it to w — if you recall, we did the same thing in the previous example, where we were updating the weights and biases. So I type: delta = tf.matmul(train_in, error, transpose_a=True), and then train = tf.assign(w, tf.add(w, delta)) — first the matrix multiplication and then the addition. The model has to be evaluated by a TensorFlow session, which we have seen earlier, but before that we also need to initialize all the variables: sess = tf.Session(), and then sess.run(tf.initialize_all_variables()).
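Collected in one place, the AND-gate perceptron we just built looks roughly like this — a sketch that follows the lines dictated above; anything beyond them is an assumption:

import tensorflow as tf   # TensorFlow 1.x style API

T, F = 1., -1.                       # encode true/false as +1 / -1
bias = 1.0                           # bias added as an extra input of 1 to every example
train_in = [[T, T, bias], [T, F, bias], [F, T, bias], [F, F, bias]]
train_out = [[T], [F], [F], [F]]     # AND: true only when both inputs are true

w = tf.Variable(tf.random_normal([3, 1]))   # 3x1 weight tensor (two inputs plus the bias)

def step(x):                          # our own step activation function
    is_greater = tf.greater(x, 0)
    as_float = tf.to_float(is_greater)
    doubled = tf.multiply(as_float, 2)
    return tf.subtract(doubled, 1)    # maps {0, 1} to {-1, +1}

output = step(tf.matmul(train_in, w))
error = tf.subtract(train_out, output)
mse = tf.reduce_mean(tf.square(error))

delta = tf.matmul(train_in, error, transpose_a=True)  # adjustment derived from the error
train = tf.assign(w, tf.add(w, delta))                # add it to the weights

sess = tf.Session()
sess.run(tf.initialize_all_variables())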
Now our next task is to run iterations until we get zero error. So we define a variable err and our target: err is nothing but our error, which can be 1 or 0 because we are using a binary output, and our target is to make it 0. Next we define the epoch — for now you can think of an epoch as nothing but a cycle, the number of iterations required to reach the desired output, or you can say to reduce the error to zero. So I type epoch, max_epochs = 0, 10 (you can give whatever variable names you want), which means we start from the zeroth epoch and the maximum number of epochs will be 10. Then the loop: while err is greater than target and epoch is less than the maximum epoch, increase the value of epoch by 1, then err, _ = sess.run([mse, train]), and finally print the epoch and the mean squared error. So basically it will print the cycle, or epoch, and the error for that particular cycle. Let's go ahead and execute this and see what happens — oops, I typed the spelling of "square" wrong, pardon me for that, so let me fix it, s-q-u-a-r-e — all right, let's execute it once more. And yes, in three epochs we got a mean squared error of zero, which means it took us three iterations to reduce the error to zero. So this is how you can actually implement a logic gate — or you can say, this is how you can classify the high and low outputs of a particular logic gate using a single-layer perceptron — and you can do the same for the OR gate as well. Now what I'm going to do is create a new Python file and name it mnist, which is nothing but the data set on which we are going to perform the classification of handwritten digits. We're going to execute the use case I told you about earlier: the MNIST data set has handwritten digits from 0 to 9, with 55,000 training examples as well as 10,000 test examples. The first thing to do is download the data set, but before that let me just import the TensorFlow library as tf. Now let's download the data: from tensorflow.examples.tutorials.mnist import input_data, and then mnist = input_data.read_data_sets("MNIST_data", one_hot=True). This input_data helper is nothing but a lightweight class which stores the training, validation and testing sets as NumPy arrays, and one_hot=True is nothing but one-hot encoding. Let me tell you what one-hot encoding is (let me just comment a few lines): one-hot encoding means that if I'm classifying something as a seven — if I classify that my digit is seven — then I represent it with ten positions, 0 through 9, where the bit at position 7 is 1 and all the rest are 0. Similarly, if I want to represent the digit two, only the bit at position 2 is 1 and the rest are all zeros. So it's like only one output is active at a time, that's all — I hope you got the concept of one-hot encoding.
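To make the one-hot idea concrete, here's a tiny illustration in plain NumPy (my own sketch, not code from the video):

import numpy as np

def one_hot(digit, num_classes=10):
    # a vector with ten positions, 0 through 9, with a 1 only at the digit's position
    vec = np.zeros(num_classes)
    vec[digit] = 1
    return vec

print(one_hot(7))   # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
print(one_hot(2))   # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]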
So our next step is to start the session like we do every time, but this time I'll use sess = tf.InteractiveSession(). Now we're going to build the computation graph by creating nodes for the input images and the target output classes, so I'll define some placeholders: x = tf.placeholder(tf.float32, shape=[None, 784]). The input images x will consist of 2D tensors of floating-point numbers; we assign a shape of [None, 784], where 784 is the dimensionality of a single flattened 28-by-28-pixel MNIST image of a handwritten digit, and None indicates that the first dimension, corresponding to the batch size, can be of any size — we are not putting any restriction on that. Now I'm going to define the placeholder y, which will be nothing but our real labels, or you can say the desired output: y = tf.placeholder(tf.float32, shape=[None, 10]), because we have 10 classes. y is also a 2D array where each row is a one-hot 10-dimensional vector indicating which digit class the corresponding MNIST image belongs to. The next step is to define the weights and biases for our model, like we did in the previous example. We could imagine treating these like additional inputs, but TensorFlow has an even better way to handle them, and that is variables. So let's do that: W = tf.Variable(tf.zeros([784, 10])) — I initialize it to zeros, and the shape is 784 by 10, that is, 28 × 28 pixels and 10 classes. Similarly for the bias: b = tf.Variable(tf.zeros([10])), initialized to zeros with shape 10. We pass the initial value for each parameter in the call to tf.Variable. As you can see, we initialize both W and b as tensors full of zeros; W is a 784 × 10 matrix because we have 784 input features and 10 outputs, and b is a 10-dimensional vector because we have 10 classes. We have learned that before we can use variables in a session we need to initialize them first, so we run sess.run(tf.global_variables_initializer()). All right, we have initialized all the variables, so our next task is to predict the class and define the loss function. We can now implement the regression model, and it takes only one line: we multiply the vectorized input image x by the weight matrix W and add the bias. This is the model's predicted output, so I'll call it y_: y_ = tf.matmul(x, W) + b. Now we can specify a loss function very easily. The loss indicates how bad the model's prediction was on a single example, and we try to minimize it while training across all the examples. Here our loss function will be cross entropy — you can say the difference between the target output and the predicted output — and for that I'm going to make use of the softmax cross entropy function. I'll name the variable cross_entropy: cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_)). Let me give you a brief idea of what is happening here: labels=y is our target output and logits=y_ is our actual, predicted output, so this line calculates the difference between the target output and the actual output for all the examples, then sums them all and finds the mean — that is basically what this cross_entropy variable does.
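Collecting the pieces described so far, the softmax-model setup looks roughly like this — a sketch; note that in this video's naming y holds the true labels and y_ the predictions, which is the reverse of the official TensorFlow tutorial:

import tensorflow as tf   # TensorFlow 1.x style API
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
sess = tf.InteractiveSession()

x = tf.placeholder(tf.float32, shape=[None, 784])   # flattened 28x28 input images
y = tf.placeholder(tf.float32, shape=[None, 10])    # one-hot true labels (desired output)

W = tf.Variable(tf.zeros([784, 10]))                # 784 input features, 10 classes
b = tf.Variable(tf.zeros([10]))
sess.run(tf.global_variables_initializer())

y_ = tf.matmul(x, W) + b                            # the model's predicted logits
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_))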
Now that we have defined our model and the training loss function, it is straightforward to train using TensorFlow. TensorFlow has a wide variety of built-in optimization algorithms, as I've told you earlier, and for this example we'll use steepest gradient descent with a step length — a learning rate — of 0.5 to descend the cross entropy. So I type: train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy). What this one line does is minimize the cross entropy, which is nothing but the loss function we have defined. In the next step we load 100 training examples in each training iteration — every time a training iteration happens it takes 100 examples — and we run the train_step operation, which reduces the error, using feed_dict to replace the placeholder tensors x and y with the training examples; x will contain the input images and y will contain the actual, or you can say desired, outputs. For that I type: for _ in range(1000): batch = mnist.train.next_batch(100), then train_step.run(feed_dict={x: batch[0], y: batch[1]}). That's it. Now we need to evaluate our model — we need to figure out how well it is doing — and for that I'm going to make use of the tf.argmax function. Let me show you how it works: correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1)).
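Continuing directly from the sketch above (so x, y, y_, cross_entropy and mnist are the ones defined there), the training loop and the evaluation node we just wrote fit together roughly like this:

# steepest gradient descent on the cross entropy, learning rate (step length) 0.5
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

for _ in range(1000):                      # 1000 iterations, 100 images and labels per batch
    batch = mnist.train.next_batch(100)
    train_step.run(feed_dict={x: batch[0], y: batch[1]})

# one boolean per example: did the predicted digit match the true digit?
correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1))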
so basically this uh TF dot ARG Max or Y underscore comma 1 is a label our model thinks is most likely for each input that means it is our predicted value while TF dot arcmax or Y comma 1 is a true label it is uh there in our data set present already and we know that it is true so what we are doing we are using TF dot equal function to check if our actual prediction matches the desired prediction so this is how it is working so now what we're going to do we're going to calculate the accuracy to determine what fraction are correct we cast The Floating Point numbers and then take the mean and repeat so now I'm going to Define a variable for accuracy so I'm just going to type in accuracy equals to TF dot reduce underscore mean TF dot cast correct underscore prediction TF Dot float32 and what we can do finally we can evaluate our accuracy on the test data and this should give us accuracy of about 90 so let's see if that happens I'm going to type in here print accuracy dot evaluate eval feed underscore date equal to X colon mnist DOT test dot images comma y colon amnest DOT test dot labels that's it guys and I've done a mistake here instead of Y it will be y underscore because this is our predicted value not the actual value and why we have considered as actual value and this y underscore will be our predicted values this is the mistake that I made so yeah now I think the code looks pretty fine to me and we can go ahead and run this let us see what happens when we run this so guys it is complete now and this is the accuracy of a model which is uh 91.4 percent and which is pretty bad when you talk about a data set like mnist but yeah with a single layer which is very very good so we have got an accuracy of around 92 percent on the mnist data set which means that whatever the test data sets were there that is like 10 000 test images so on those test images whatever the prediction our model has made are 91.4 percent correct so now there are certain limitations of single layer perceptron let us understand that so in order to understand that we'll take an example so we have an xor gate here and this is the truth table so according to this truth table if any one of the input is high then the output is high and if both the inputs are low output is low and if both the inputs are high output is low so how can you classify the high and the low outputs with a single line definitely you can't if you see the points one point is here one is here here and here in which these two points are hype outputs and these two are low so how can you classify with a single line definitely you can't so now what's the answer to this what if we use multiple neurons so using multiple neurons we can have two lines that can separate it now we can solve this problem if we have multiple neurons so if we use two neurons we can have two lines and that can actually separate the high outputs as well as the low outputs so this is where we use multi-layer perceptron with back propagation so what are multi-layer perception multi-layer perceptions they have the same structure like the single layer perceptron the only difference is they have more than one hidden layer so let me uh explain it to you with an example so this is how a typical material perception looks like so we have input layer we have two hidden layers and we have one output layer as well now typically each of these input layers are connected to the next hidden layer each neuron is connected to the next neuron present at the adjacent layer but the neurons of the same layer 
are not connected to each other now what happens a set of inputs are passed to these input layers and the output of this input layer will be passed to the first hidden layer then after activation function of the first hidden layer the output will be passed to the next hidden layer as the input and similarly finally we get the output now you must be thinking how the model learns from here so the basically the model learns by updating the weights and the algorithm that it uses is called back propagation so the back propagation algorithm helps the model to learn and update the weights in order to increase the efficiency so basically at this process of from input layer to the output layer is called a feed forward process and then when we back propagate it in order to increase the efficiency or accuracy so that we can update the wage that is called as back propagation so let us move forward and understand what exactly is back propagation so what is back propagation now let us understand this with an example so we'll take the inputs as the leads generated from various sources and my aim is to classify the leads on the basis of the priority so there might be certain leads which won't make it that much difference to me whereas compared to the other leads so in that case I need to make sure the leads which are important gets the highest amount of weight how am I going to do that first we'll see the output then accordingly we'll calculate the error and based on that error we are going to update the way and this process is nothing but your back propagation in a nutshell I can say right although the algorithm is pretty complex but yeah this is basically what happens so in order to classify the leads and the basis of priorities we need to provide the maximum weight to the most importantly and how we're going to do that we're going to compare the actual output and the desired output and according the difference we can update the weights so what is back propagation so back propagation is nothing but a supervised learning algorithm for multi-layer perception now let us understand what exactly is this algorithm let us understand this with an example that is there in front of your screen so these two are input neurons these two are hidden neurons and these two are output neurons now our aim is to get 0.01 and 0.99 as our output and at the same time we have inputs as 0.05 and 0.10 initially we take the weights as we can see it here so 0.15 W1 0.20 W2 so these are the ways plus we have two biases as well now what we need to do is we need to make sure that we have weights in such a way that we get output as 0.01 and 0.99 but let us see if we get that same output when we provide these kind of ways the net input for this particular Edge one will be what it'll be W1 into i1 that is the first input plus W2 into I2 plus b 1 into 1 which gives us the answer as 0.3775 similarly the output for H1 will be nothing but the activation function the output after the activation function and we are using a sigmoid function because our brain also uses a sigmoid functions that's what we believe in and at the same time it is easily differentiable and if you differentiate it twice you get the same number so the output of H1 will be 0.59 something so let me go back so the app the output of this particular H1 will be 0.59 something then we're going to calculate the output of H2 as well similarly and we get the output as 0.59684378 next up we are going to repeat the process for the output layer neurons as well so for the output 
layer, the net input will be net_o1 = w5 × out_h1 + w6 × out_h2 + b2, where b2 is the bias. We get a net input of this sort, and then the output out_o1 after the activation function will be about 0.751; similarly, the output for o2 will be about 0.77. Now if you notice, this is not the desired output: our desired output was 0.01 and 0.99, but what we got instead was 0.75 and 0.77. So we need to update the weights, and for that we calculate the error. The error for output o1 is nothing but the sum of ½ (target − output)², where target is your desired output and output is your actual output; so the error of the o1 neuron comes to 0.278, and similarly for o2 it is 0.023, so the total error comes down to about 0.29. Next we need to update the weights so as to decrease this error. What we do first is calculate the change in the total error with respect to one of the weights — we'll take w5, for example, just to show you. We apply the chain rule here: using partial derivatives, ∂E_total/∂w5 = (∂E_total/∂out_o1) × (∂out_o1/∂net_o1) × (∂net_o1/∂w5) — the change in E_total with respect to out_o1, times the change in out_o1 with respect to net_o1, times the change in net_o1 with respect to w5 — and we multiply these together to get the term we want. Let's see how we do that. How does the total error change with respect to the output? We calculate ∂E_total/∂out_o1 and it comes to around 0.741. Similarly, we calculate how much the output out_o1 changes with respect to the net input, and we get that value; after that we calculate how the net input changes with respect to w5 as well. Finally we put all these values together and find the change in the error with respect to the weight w5, which comes to around 0.082167041. Now it's time to update the weight. How do we do that? We follow this formula: w5⁺, the updated weight, equals w5 minus the learning rate times ∂E_total/∂w5, which comes to around 0.35891648. Similarly, we can calculate the other weights as well: we repeat the same process for each of them, then check again how much loss remains; if the loss still prevails, we repeat the same backpropagation learning algorithm again for all the weights, and this process keeps on repeating. So this is how backpropagation actually works. Now what I'm going to do is use the same MNIST data set and push the accuracy up from around 92 percent to somewhere between 97 and 99 percent with the help of a multi-layer network. So guys, as I've told you earlier, we're going to use the same MNIST data set that we used with the single-layer perceptron, and I'm going to perform the classification using a multi-layer convolutional network. Now, what are convolutional networks? Basically, these networks are used to classify images. What I'm doing here is simply showing you how we can increase the accuracy using convolutional neural networks.
You don't have to go into much detail about it, because you'll be learning about it in the upcoming modules; I'm just giving you a general overview, or you can say a taste, of how things work in convolutional neural networks. So we give an input image to this convolutional network, and the input image is processed in the first convolutional layer using the filter weights. This results in 16 new images, one for each filter in the convolutional layer. The images are also down-sampled, so the image resolution decreases from 28 × 28 to 14 × 14. These 16 smaller images from the first convolutional layer are then processed in the second convolutional layer; we need filter weights for each of these 16 channels, and we need filter weights for each output channel of this layer. There are 36 output channels in total, so there are 16 × 36 = 576 filters in the second convolutional layer. The resulting images are down-sampled again, to 7 × 7 pixels, so the output of the second convolutional layer is 36 images of 7 × 7 pixels each. These are then flattened into a single vector of length 7 × 7 × 36 = 1764, which is used as the input to a fully connected layer with 128 neurons, and this feeds into another fully connected layer with 10 neurons, one for each of the classes, which is used to determine the class of the image — that is, which digit is depicted in it. Here the input image depicts the number seven, and four copies of the image are shown. Basically, whatever filter we have in each layer slides over the image pixels, and we take the dot product of the filter and the image pixels behind it; we repeat the same process in each of these layers and for each of these images, calculating the dot product each time. Now let me just show you that we can increase the accuracy. I've already run the code, because it would take time if I did it right now, and this is the accuracy we ended up with: 98.8 percent. On the test set we had around 10,000 test samples, out of which we predicted 9,876 correctly, which is pretty good — if you compare it with the single-layer perceptron example we took, where we were getting an accuracy of around 92 percent, here we are getting close to 99 percent, which is actually very good.
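The video doesn't show the convolutional code on screen, but the architecture just described would look roughly like this as a sketch — the layer sizes follow the description, while the kernel sizes, padding, pooling and optimizer choices are my own assumptions:

import tensorflow as tf   # TensorFlow 1.x style API

x = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.float32, [None, 10])

images = tf.reshape(x, [-1, 28, 28, 1])                            # 28x28 grayscale images
conv1 = tf.layers.conv2d(images, filters=16, kernel_size=5,
                         padding="same", activation=tf.nn.relu)    # 16 filters -> 16 images
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)     # down-sample 28x28 -> 14x14
conv2 = tf.layers.conv2d(pool1, filters=36, kernel_size=5,
                         padding="same", activation=tf.nn.relu)    # 16 x 36 = 576 filters in total
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)     # down-sample 14x14 -> 7x7
flat = tf.reshape(pool2, [-1, 7 * 7 * 36])                         # flatten to a vector of length 1764
fc1 = tf.layers.dense(flat, 128, activation=tf.nn.relu)            # fully connected layer, 128 neurons
logits = tf.layers.dense(fc1, 10)                                  # one neuron per digit class

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)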
So now let's move on and discuss a couple of questions that every non-programmer usually has about data science and machine learning. Now guys, data science, artificial intelligence and machine learning might seem intimidating, but they aren't actually as complex as you might think. Many of the tools developed over the last decade or so have helped make artificial intelligence and machine learning more accessible to engineers with varying degrees of experience and knowledge. Today we've got to a stage where it's accessible even to people who have barely written a line of code in their life. That's very exciting, but if you're completely new to the field it can be challenging to know how to get started, and as a beginner it's natural to have a lot of questions regarding data science and machine learning, so I'll be addressing the top three questions frequently asked by beginners, or by non-programmers. What exactly do I mean by non-programmers? A non-programmer can be anyone: a software tester who basically works on an automated tool, a product manager, a researcher who's creating content, a content writer, a student who has just graduated or is still studying, a marketing consultant — basically, you can be from any background and still take up data science and machine learning. Now, if you want to get into the depth of the field then obviously you do require programming languages; however, if you just want to start off and you cannot do programming at all, there are a couple of tools that I'll be discussing today, and these tools do not require you to have any prior programming knowledge. In the further slides you'll see me discuss these tools and mention a couple of their features. But first, let me answer a couple of very frequently asked questions. The first question I usually get is: "I'm looking to make a career in data science but I have no prior programming experience — do I need to know programming for machine learning and data science?" In a nutshell, the answer is yes: if you want a career in machine learning, having some form of programming knowledge always helps. But let me tell you that programming is just a part of machine learning, so if you're looking to make a career out of it, I'd say it is important to know a programming language like Python, at least to get started. Due to recent breakthroughs, though, there are a lot of machine learning and data science tools that do not have that programming requirement — you do not have to know programming languages. But if you ask me personally, I'd say you should still learn to program, because only then will you know how the entire process works — you'll know how machine learning really works because you're implementing it through code. Another question is: are there any tools that can help me with data science and machine learning without knowing any programming language? The answer to this is a simple yes. There are a lot of tools in the market these days that will help you get started with machine learning; these are especially useful for business applications of machine learning like predictive modeling, statistical analysis and so on. There are a lot of such tools and I'll be discussing many of them in the upcoming slides, so don't worry, we'll talk more about the tools in the further slides. Another question is: do I need to know advanced mathematics, or advanced statistics, in order to learn data science and machine learning? This depends on what exactly you're looking for. If you really want to go in depth into the field of data science and machine learning, and if you want to understand the "why" behind the working of machine learning algorithms — which is fundamental to understanding machine learning — then I'd say yes, you do need to learn mathematics. Advanced mathematics basically means you have to have a really good understanding of statistics, because statistical analysis plays a very important role in machine learning and data science. A lot of topics under mathematics, which include probability, of course statistics, and linear algebra, can help you know more about machine learning algorithms, about data science, and about the different methods
and techniques that are used to derive information from data now not knowing advanced math is not an excuse to not learning machine learning so to sum it up for you if you want to go in depth into a machine learning and data science then knowing advanced mathematics is a prerequisite because it'll help you understand the algorithms the formulas how the learning is done and many other machine learning Concepts right so for now let's move on and let's look at why exactly anybody should go for these data science and ml tools right so we just looked at a couple of questions that non-programmers have now let's look at how you all can get started and why you should go for data science and ml tools now the first reason is that you don't require programming skills to use data science and machine learning tools now this is especially advantages to all of you non-id professionals who don't have experience with programming in python or in our another reason is that they provide a very interactive user interface which is very easy to use and you know you can learn very quickly by using that UI now another reason is that these tools provide you can say a very constructive way to define the entire data science workflow and then you can implement it without worrying about any coding bugs or any errors right so given the fact that these tools don't require you to do any sort of coding it's faster and easier to process data and to build strong machine learning models also all the processes that are involved in the workflow are automated and they require minimal human intervention so basically all the processes are mostly drag and drop right whatever you want or whatever model you want it's just basic drag and drop in these tools all you have to do is drag the modules that you want in your workflow now another reason is that many data driven companies have adapted to these data science tools and and they've started looking for professionals who are able to handle and manage such tools the law lot of companies that are taking up professionals who have knowledge of ml tools like IBM Watson and so on so guys those were a couple of advantages of using data science and machine learning tools now let's take a look at the top tools that any non-programmer can use in order to get started with a data science and ml so guys note that this list is in no particular order right I have not ranked them according to the tools so the first tool is rapidminer now it's no surprise that rapidminer made it to this list one of the most widely used data science and ml tools preferred by not only beginners who are not well equipped with programming skills but it is also preferred by experienced data scientist rapidminer is the all-in-one tool that takes care of the entire data science workflow from data processing to data modeling and even deployment so if you're from a non-technical background a rapid Miner is one of the best tools for you and it provides a very strong and a very interactive UI that only requires you to dump the data there is literally no coding required right so all you have to do is you just have to dump the data into the tool and it will build predictive models and machine learning models that use um complex and convoluted algorithms in order to achieve precise outputs now let me discuss a couple of features of rapidminer so like I said that it provides a powerful visual programming environment right you can perform a lot of data visualization through this tool it also comes with an inbuilt a rapid minor 
Hadoop integration that allows you to connect to Hadoop frameworks for data mining and analysis. Having Hadoop support is important because it helps with big data analytics — you can push all your data into the Hadoop framework and then perform analytics through RapidMiner. Apart from that, it supports pretty much any data format, and it performs top-class predictive analytics by cleaning the data expertly, so data cleaning and data wrangling are done very easily. It uses programming constructs that automate high-level tasks such as data modeling, so data modeling is also automated — you don't have to code a single line, you just load the data into the tool and do a couple of drag and drops. That's why RapidMiner is known as one of the best tools for data science and ML. Now let's look at our next tool, which is DataRobot. DataRobot is an automated machine learning platform which builds precise predictive models in order to perform extensive data analysis. It is known as one of the best tools for data mining and feature extraction, so professionals with less programming experience go for DataRobot because it is considered one of the simplest tools for data analysis. Just like RapidMiner, DataRobot is a single platform that can be used to build an end-to-end data science solution; it uses best practices in creating solutions that can model real-world business cases and solve complex real-world problems. Coming to its features: it automatically identifies the most significant features and builds a model around them, so data modeling is done very easily — data modeling basically means building a model that can predict your output, the model takes in all the important features needed for that prediction, and this entire process is automated in DataRobot. Another thing is that it runs your data on different machine learning models and checks which model provides the most accurate outcome, so it tests where your accuracy is maximum. It is also very fast at building, training and testing predictive models, and at performing text mining, data scaling and so on. It can run large-scale data science projects, and it incorporates model evaluation methods like parameter tuning, so model evaluation is also easily carried out — basically your end-to-end data science and machine learning process is covered by this single tool. Now let's move on and discuss another tool, which is BigML. BigML eases the process of developing machine learning and data science models by providing readily available constructs that help with classification, regression and clustering problems. It incorporates a wide range of machine learning algorithms and helps you build a strong model without much human intervention, which lets you focus on important tasks like improving your decision-making process. Some of its features: it is a very detailed and comprehensive machine learning tool that supports complex machine learning algorithms, with full support for supervised and unsupervised learning, which includes anomaly detection, association
mining, regression, clustering and classification. Another feature is that it provides a simple web interface and APIs that can be set up in a fraction of the time traditional systems take, so comparatively it's extremely fast. It also creates visually interactive predictive models that make it easy to find correlations among the different features in your data — data visualization plays a pretty important role in data science, and BigML really helps in creating visualizations that help you understand the dependencies between your features and your data. Apart from that, it incorporates bindings and libraries for the most popular data science languages like Python and Java, so it has support for all of these languages. That was about BigML; now let's look at another tool called MLbase, or Machine Learning Base. MLbase is an open-source tool that is one of the best platforms for creating large-scale machine learning projects. Before I tell you its features, let me tell you that there are three main components in this tool: there is the ML Optimizer, whose main purpose is to automate machine learning pipeline construction; then there is MLI, an API focused on developing algorithms and performing feature extraction for any sort of difficult, high-level computation; and then there is MLlib, Apache Spark's own machine learning library, which is currently supported by the Spark community. Now let me tell you a couple of features of MLbase. Like I mentioned, it provides a very simple UI for developing machine learning and predictive models. Having a good user interface is actually pretty important when it comes to these tools, because you need to easily grasp what exactly is happening — how you're going to build a model, how you're going to add data, which features you're going to select, and how you're going to improve your model through parameter tuning, cross-validation and so on. A simple UI makes all of that easier and more understandable, and MLbase has one of the simplest UIs among these tools. Another feature is model evaluation: it tests your data on different learning algorithms to find out which model gives the best accuracy, so it compares accuracy across different models. It is also easy to use — non-programmers who are new to data science or to the whole machine learning process can easily work with data science modules because of how simple the tool is — and it can scale larger and more convoluted projects much more effectively than traditional systems. So that was all about MLbase. Now let's discuss the next tool, which is Google Cloud AutoML. Google Cloud AutoML is a platform of machine learning products that allows professionals with limited experience in data science to train high-end models which are specific to their business
needs. This Cloud AutoML tool is one of the best machine learning platforms, built on more than ten years of Google research, to help you build predictive models. Let me discuss some of its features. First of all, professionals with minimal experience in machine learning can easily train and build high-level machine learning models, so it's very easy to use for people who don't have an ML or data science background, and it comes with a lot of documentation and tutorials. It supports people from many different backgrounds and takes a very detailed, comprehensive approach to the entire process, so I would say it's one of the best tools for a beginner. Apart from that, it has full integration with many other Google Cloud services, which helps with data mining and data storage — both very important parts of data science and ML. It also generates a REST API for making predictions about the output, and having APIs helps you connect with other tools. And it provides a very simple UI to create custom machine learning models that can be trained, tested, improved and deployed through the same platform — so again it's an end-to-end solution covering the whole data science workflow, and it is quite easy to use. Moving on to the next tool, we have Auto-WEKA. This is one of my favorite tools when it comes to machine learning and data science. It is an open-source, UI-based tool which is ideal for beginners because it provides a very intuitive interface for performing all your data science related tasks. Apart from having a simple UI, it supports automated data processing, performs very extensive EDA, and covers supervised and unsupervised learning algorithms — all of this is in the tool. I think this is one of the best tools for beginners, and also if you want to go into more depth with data science and ML; it's perfect for newbies who are just getting started. It also has a huge community of developers who have been kind enough to publish tutorials and research papers about using this tool, so it is preferred by a lot of developers as well — I'm aware of a couple of data scientists who use Python and R and who also use Auto-WEKA, because they find it extremely easy and a very strong tool for building machine learning models. Now let me tell you a couple of its features. WEKA provides a huge range of machine learning algorithms, so you can perform classification, regression, clustering, anomaly detection, association mining, data mining and so on. It also provides a very interactive graphical interface for performing data mining tasks, data analysis and EDA. Another feature is that it allows developers to test their models on a varied set of possible test cases and helps identify the model that gives the most precise output, so like a lot of the other tools in this list, this tool also helps with model
evaluation — it tests your data on different models and chooses the one that is most accurate or gives the most precise output. Apart from that, it also comes with a simple yet intuitive command-line interface for running basic commands. So those were a couple of features of Auto-WEKA. Now let's move on and look at the next tool, which is IBM Watson Studio. We're all aware of how much IBM has contributed to this AI-driven world — IBM is known as one of the biggest contributors of tools, concepts and technologies in AI and data science. Like most services provided by IBM, IBM Watson Studio is an AI-based tool which is used for extensive data analysis, machine learning, data science and so on. A lot of organizations make use of IBM Watson, including in the healthcare sector, and IBM also has a tool (I'm not sure of the name) used for weather forecasting and for predicting storms, earthquakes and so on. So IBM has contributed majorly in the field of AI and data science, apart from being extremely recognized in the market. Let's take a look at a couple of features of IBM Watson Studio. One feature is that it supports data preparation, exploration and modeling within a span of a few minutes, with the entire process automated, so it's a very quick tool that can run your end-to-end process in a couple of minutes. It also supports multiple data science languages and tools, such as Python 3 notebooks, Jython scripting, SPSS Modeler and Data Refinery. For coders and data scientists it offers integration with RStudio, Scala, Python and so on — you could say IBM Watson Studio is a more advanced tool, and a lot of data scientists and coders make use of it; apart from being automated it also provides support for programming languages like R and Python. Another feature is the SPSS Modeler, which provides drag-and-drop functionality and is quite simple because it lets you explore data and build strong models very easily — all you have to do is drag and drop. So those were a couple of features of IBM Watson Studio: it is a more advanced tool, it supports the different languages that come under data science, it has really good data analysis support — the EDA performed in this tool is really good — and it helps extract the most significant variables and then builds a model on those variables. Now let's look at our next tool, which is Tableau. Tableau is known as the most popular data visualization tool in the market. It allows you to break down raw, unformatted data into a processable and understandable format. The tool is mainly focused on data visualization — you can also perform a lot of data analysis in it, but it is one of the best tools for data visualization, and a lot of high-end, data-driven companies use it to track their business growth and to visualize and analyze their data. Let me tell you a couple of features of Tableau. Tableau Desktop allows you to create customized reports and dashboards that give you real-time updates. It can also connect to multiple data sources and it
can visualize a huge amount of data to find correlations and patterns in it — that's what data visualization is important for, understanding the correlations in your data. So it can connect to different data sources and then visualize the data. Apart from this, Tableau also provides cross-database join functionality, which allows you to create calculated fields and join tables; this helps in solving complex data-driven problems. It is a fairly advanced tool for data science — it goes in depth with your data and tries to bring out the most significant features through graphs and plots — and it is a very intuitive tool which uses drag and drop to derive useful insights and perform data analysis. So those were a couple of features of Tableau. Now let's move on and discuss Trifacta. Trifacta is an enterprise data wrangling platform that helps you understand what exactly is in your data and how it can be useful for different analytic explorations, which is the key to identifying the value in your data. Trifacta is considered one of the best tools for performing data wrangling, data cleaning and analysis — data wrangling and cleaning are among its most important capabilities. Let's look at a few features of Trifacta. It allows you to connect to multiple data sources irrespective of where the data lives, so you can have connections with various sources. It also provides a very interactive UI for understanding the data, not only to derive the most significant variables but also to remove unnecessary or redundant ones. Apart from that, it provides visual guidance, machine learning workflows and feedback that guide you in assessing the data and performing any sort of data transformation. It is also very good at monitoring your data: it monitors inconsistencies in the data, removes null or missing values, and makes sure data normalization is performed so that you avoid any bias in the output. So Trifacta is mainly a very good tool for data preparation, data wrangling and so on. Now let's discuss the last tool for the day, which is KNIME. KNIME is an open-source data analytics platform aimed at creating out-of-the-box data science and machine learning applications. Building data science applications involves a lot of tasks — data processing, data wrangling, EDA and so on — and this tool makes sure that the entire process is automated. It provides a very interactive and intuitive user interface which makes it easy to understand the whole data science methodology. A few features of KNIME: it can be used to build an end-to-end data science workflow without any coding — you just drag and drop the modules. It also provides support to embed tools from different domains, including scripting in R and in Python, and it provides APIs to integrate with Apache Hadoop, so it has support for all the important tools needed in data science. It is also compatible with various data source formats, including simple text formats such as CSV, PDF and XLS, as well as unstructured data formats
including images, GIFs and so on. Another feature is that it provides full-fledged support for performing data wrangling, feature extraction, normalization, data modeling and model evaluation, and it even allows you to create interactive visualizations — so like I said, it covers the entire data science process end to end. KNIME is actually one of my favorite tools as well: it covers the entire data science workflow, it's a very understandable tool, it performs data analysis in a very in-depth manner, and the fact that it supports machine learning languages is very good, because not only non-IT professionals but also programmers and data scientists can go for this tool — there are actually a lot of data scientists who use KNIME to perform EDA and so on. So those were the top tools for data science and machine learning if you are a non-programmer [Music] now let's move on and look at a data scientist sample resume. I created a sample resume to help you understand what exactly you should put in yours. An introduction is very necessary — keep it simple and short, but make sure that it draws attention, and make sure you sneak in your skills and your experience. Here I've written: a data scientist with n-plus years of hands-on experience in delivering valuable insights via data analytics and advanced data-driven methods, also proficient in building statistical models using R and Python — so I've not only mentioned my experience, I've also mentioned a few skills. The education field I'll let you deal with; you can fill it out according to your own education. Within the experience field you have to list all the data science related projects you've done — whether it's text mining, NLP, predictive modeling, building machine learning algorithms, deep learning algorithms, anything — you have to mention all your data science projects in your experience field. Here I've mentioned that I used Python and Spark to scrape, clean and analyze large data sets, which basically says that I have a good understanding of the data life cycle — I know how to clean, analyze and explore data sets. Then I've said that I created machine learning models with Python, and I've also mentioned that I created a model used to predict the energy usage of commercial buildings with 98% accuracy. Mentioning your projects is very important, because that's the only proof of your capability — you have to impress your interviewer here, show them the projects you've done and how they changed or contributed to your company. Next I've mentioned designing and developing real-time recommendation engines, then transforming raw data into MySQL, and finally expertise in Tableau for data visualization. Make sure you don't skip data visualization — a lot of data scientists don't pay much attention to it, but it is one of the most important aspects of the data life cycle, because it's where you show the stats and demonstrate the growth of the company or how a particular variable or product is affecting the business. Then I've mentioned a few skills: proficient in languages such as R and Python — and this
has to be your number one skill: you have to know R and Python very well. Then there is a strong understanding of predictive modeling and machine learning algorithms, which I mentioned earlier — these are very important, and predictive modeling is nothing but using machine learning algorithms in order to predict an outcome. Then a good understanding of data mining, cleaning and modeling — again here I'm covering the data life cycle — and then I've said efficient in graphical modeling and data visualization using Tableau, and an in-depth understanding of deep learning using neural networks. So that was the entire data scientist resume. Now I'll give you a few key points to keep in mind while building your resume. First of all, state your career objective clearly and precisely: state exactly what you're applying for and where your interest lies. Within the educational qualifications you can mention your bachelor's and master's degrees; usually people with a background in computer science or statistics are preferred, but I'm not saying it's necessary — there are a lot of data scientists who don't have a matching or equivalent degree, so it's not a prerequisite, although mostly people with a computer science or statistics background are preferred. Next you can mention your professional experience: here you can describe your contributions to the company, write about the machine learning models you created, and show how they solved a business problem and benefited your organization. After that you can mention your technical and non-technical skills. Technical skills are the practical abilities acquired from working on real-time projects and in the field, so you have to prove your capabilities by listing them out — make sure you list all your relevant technical skills, because this is the best way to impress your interviewer by showing how much you know, what you've worked on, and which tools and technologies you are comfortable with. In the non-technical skills you can impress your interviewer by showing how strong your communication skills and your analytical thinking are, because a data scientist is not just a hardcore programmer — you have to have very good business acumen and very good communication skills, since you're handling a business. A data scientist is going to talk to stakeholders and communicate the company's numbers to them, so a data scientist has to have excellent communication skills [Music] so let's start with the interview questions. In the beginning we'll focus on some of the fundamental questions, which are more for you to understand what data science is than something a particular interviewer would ask directly. For instance, many people wonder what data science is all about. There are many online sources and blogs which describe data science, and in a nutshell this is what it boils down to: a person who is very good at understanding computer algorithms, who understands statistical and mathematical ideas, and who applies these two kinds of knowledge from computer science and mathematics to a particular application — a business application
where somebody sees value coming out of the data. So that's how data science approaches a problem: when you combine these two powerful concepts from computer science and mathematics on a real-world application, the outcome of a data science project should go in a direction where people see a return on investment — the people you bring in, the technology you bring in, the ideas you work on, all of it should give some return on what you have invested. That's why industries have started looking at data science. The subjects which are very important for you to know are statistics, computer science and applied mathematics, and then subjects like linear algebra, calculus and a few more: from computer science, algorithms and data structures are very useful; from mathematics and statistics, things like calculus, linear algebra and matrix factorization; and from the application side, it's more about your experience in the industry — if you have worked in retail, you know how the business processes in retail work. People also often ask whether, on the technology side, we need experience in a language like Python, or for example R programming as well. Python is one of the most sought-after programming skills, particularly when you want to build solutions in the data science domain, and with the availability of libraries like NumPy and pandas, Python has established its ground very strongly by giving a robust framework for designing data science solutions. In particular, built-in structures like lists, dictionaries, tuples and sets are among the capabilities that set Python in its own league of programming languages for coming up with data science solutions. There are many other libraries as well for building machine learning algorithms, but these are the common ones you would normally find people using. Also, with distributions like Anaconda, Python has shown its capability even for production-grade solutioning, where you make sure that all the dependencies a particular library needs for building a data science solution are in one place. So it's quite a popular programming language right now; R is equally good in terms of producing a quick prototype for most modeling tasks, but Python is moving into a production-grade space where things can be deployed after the prototype into a production environment and face the customers from day one. Now let's talk about something a bit more specific to the data. When people are doing any sort of data analysis, they normally face something that we know by the name selection bias. So what actually is selection bias? The fundamental place where you start any data analysis is by selecting a representative sample. Say you are working for a company which has, let's say, 1 billion records in its databases — a very large number representing various customers' data, depending on which feature you are working on; if you collect all of it, it might easily come to 1 billion records, which in a structured form is the number
of rows. With that enormous volume of data, any analysis you take up might need a lot of filters — say I only want to analyze one particular feature of my products, and I only want customers from the top four or five regions. But later on, if you would like to do an analysis which covers most of your customer base, there comes the tricky situation of not being able to use the entire volume of 1 billion records while still wanting to do a really good study based on the data you have. So in statistics we normally use the idea of randomized selection: out of these 1 billion records we choose a small subset, let's say 1 million, which is a true representation of the entire population. But while we do this randomized selection, there is a chance of introducing bias into the analysis, simply because you are not using the entire population. Selection bias is exactly this characteristic that can creep in while you are sampling from a large population of data. A very common example is an exit poll analysis of an election, done before the results come out, where you have not chosen a representative sample — you have only asked a selective few people from a particular constituency, and they have an opinion toward one candidate, which does not represent the opinion of the entire population in that constituency. So selection bias is very important to handle, and most of the time people employ randomized selection or sampling techniques like stratified sampling to minimize it. Those were some fairly generic questions, so let's move on to some statistical questions and how to deal with different types of data. Any analysis on structured information — by structured I mean many rows and many columns, so it looks like tabular data — can use two different formats, the long and the wide. Let me show you an example: you have a record of two customers and you store just two values, the height and the weight, as columns. With these two customers, height and weight being separate columns is one format, the wide one. Now transform this by having one column called attribute: you bring those two columns in as a single attribute column and put the values in another column — this format is called the long one. So instead of having two separate columns for two of your attributes, you put both of them into one column, and there are a lot of benefits to that depending on the task at hand, particularly in data visualization: certain visualizations need your attributes not as separate columns but as one column that holds the attribute names, which might then go into building your legends. So the long and the wide are very common formats.
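As a quick illustration of that reshape — a minimal sketch, assuming a small made-up pandas DataFrame with customer, height and weight columns (none of these names come from the video) — the wide-to-long conversion described above can be done with pandas' melt:

```python
import pandas as pd

# Wide format: one column per attribute (hypothetical example data)
wide = pd.DataFrame({
    "customer": ["C1", "C2"],
    "height": [172, 165],
    "weight": [70, 58],
})

# Long format: one "attribute" column holding the attribute names
# and one "value" column holding the measurements
long_df = wide.melt(id_vars="customer", var_name="attribute", value_name="value")
print(long_df)
```

The long table is what most plotting libraries expect when the attribute names should become a legend.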
People deal with both of these data formats very frequently, depending on the task at hand, particularly when building visualization dashboards. Now, talking a bit more from the data analysis perspective, people know that in statistics the normal distribution is kind of the godfather of distributions. There are many distributions people try to check for in their data, but the moment people see a normal distribution coming up in any data, things become a bit easier to understand. In a typical case, finding the distribution of the data you've been given tells you a lot about what the data is like. If I'm analyzing the salaries of the employees in my company, I might see a thick cluster in the center where the majority of people sit with moderate salary ranges, and then extremes on the left and the right. That's why people very commonly refer to a bell curve whenever they talk about salaries, and then they start talking about the top 25 percent of performers in the company, the bottom 25 percent, and the middle which represents normal performance. This bell-shaped distribution is very commonly understood and used in data analysis, as are other distributions, but the normal distribution has its own significance. So when somebody asks you anything about the normal distribution, the first thing you should visualize is the symmetrical bell-shaped curve, and the moment you have that bell shape in your imagination, start thinking of its properties: what is the mean of a normal distribution, what is the standard deviation, and in particular the special case we call the standard normal distribution, where the mean is exactly zero and the standard deviation is exactly one. There are different places where normal distributions are used, and if you are comfortable with ideas like the central limit theorem or the law of large numbers, you might want to relate those to the normal distribution as well, particularly the central limit theorem. The core idea is a distribution which is symmetrical around the mean — that's what a normal distribution is. Depending on which variable you are analyzing in a given data set — an employee's salary, the sales of your business, the number of customer interactions on your product — any variable you define can have this symmetrical bell-shaped distribution, and the moment you understand that something follows a normal distribution, all the properties of that distribution are revealed. That's the importance of analyzing the distribution of your data, and the normal distribution is the most common one. In many statistical techniques and even model building exercises, if something follows a normal distribution, many other possibilities for applying certain modeling techniques open up, and there are many modeling techniques in statistics and machine learning with a fundamental assumption that things should follow a normal distribution — if that assumption does not hold, the model is wrong. So there are many use cases for knowing what the normal distribution is, but in simple terms it's a symmetrical bell-shaped curve.
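To make the "mean zero, standard deviation one" property concrete, here is a minimal sketch (assuming NumPy is available; the sample size and seed are arbitrary choices of mine) that draws from the standard normal and checks its mean, its standard deviation and the familiar one-standard-deviation rule:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.standard_normal(1_000_000)   # standard normal: mean 0, std 1

print(round(samples.mean(), 3))            # close to 0
print(round(samples.std(), 3))             # close to 1

# Roughly 68% of values fall within one standard deviation of the mean
within_one_sd = np.mean(np.abs(samples) <= 1)
print(round(within_one_sd, 3))             # close to 0.683
```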
A/B testing is quite a popular approach, particularly for people who are working with a product. As a company you may have many features inside a product — for instance LinkedIn has a web page with a lot of features: a jobs portal, places where you can connect to professionals in a similar industry, posts that people share on the platform, and so on — and when you are looking at a change, for instance redesigning the entire website's design and aesthetics or changing one particular feature, these changes are normally accompanied by a process called A/B testing. Say you are working as an analyst with LinkedIn and one fine day they come out with a new feature, a new design or some new change to the website. You would then define a framework for testing this change by defining a metric. In simple terms my metric could be: if I change the website from A to B, is the number of footprints on my website going to go down or not? If I can successfully establish that after rolling out the new website the number of customers visiting is not going to go down, I can be confident that the change works and roll out the new feature. In this framework we normally have two sets of users, so that we can identify the risk associated with bringing the new feature onto the platform: in a randomized way we take one user group and expose them to the older website, and another user group and expose them to the new features or the new website. When we compare the results on a particular metric, like the number of clicks or the number of purchases, we should be able to see whether these two groups are essentially the same or quite different. If the difference is on the negative side, we say the feature is not good; and if there is no difference at all, we say that even if we bring in this new feature nothing is going to change. So the A/B testing framework is quite robust in its own way, and it is a very common interview topic — if you have worked as a data analyst, or you expect to sit for a data analyst kind of role, knowing the A/B testing framework is very important. Sometimes, when you do this kind of A/B testing analysis, people also ask what the sample size should be for the users who participate in the A/B test.
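As a rough illustration of how such a comparison might be evaluated — a minimal sketch rather than the exact procedure described here, assuming per-user engagement numbers for the two randomized groups and SciPy available; the data below is simulated — a two-sample test indicates whether the observed difference between group A and group B is likely real or just noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical per-user click counts for the old site (A) and the new site (B)
clicks_a = rng.poisson(lam=5.0, size=5000)
clicks_b = rng.poisson(lam=5.2, size=5000)

# Two-sample t-test: is the difference in mean clicks statistically significant?
t_stat, p_value = stats.ttest_ind(clicks_b, clicks_a, equal_var=False)

print(f"mean A = {clicks_a.mean():.2f}, mean B = {clicks_b.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the change really moved the metric;
# otherwise the two groups look the same and the rollout is unlikely to hurt or help.
```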
Also, when you are building models, there are certain statistical measures which have to be evaluated at the end of the model building exercise — if you're building a machine learning model, you want to see whether the metrics you are evaluating it on are really good or not. Sensitivity is one of those metrics, and to explain it I'm going to show you something we normally refer to as the confusion matrix; I'll spend some time on this and then come to what we mean by sensitivity. Let's say you are building a model for predicting whether a particular customer is going to purchase from my platform within one month or not — a very simple problem statement which might include many variables — and finally you have a model which says, with 90% accuracy, whether a customer of an online e-commerce platform is going to buy within the next month. Without going into the details of the model, let's assume that after you build it you have the results, and while analyzing and evaluating those results you produce a confusion matrix. When you are building a model that follows the supervised way of learning, you have historical data on whether people bought again within one month after their first transaction, so you can create a good training data set containing that information and train your model on it. The confusion matrix then compares what the actual data says with what you predicted. The box where the actual data says the customer will buy and you also predict the same is the true positive (TP): the prediction is true, in the positive direction — the purchase happened. Diagonally opposite the TP is the TN, the true negative, where your model's prediction that the customer will not buy matches the actual data. Both of these — the true positives and the true negatives — are the right predictions from your model. But consider the off-diagonal elements, the false negative and the false positive: both of these are errors. The false positive (FP) is the case where the actual data says the customer is not going to buy, but your prediction says he is going to buy — the prediction is positive whereas the actual outcome is negative, so it's a false prediction. The false negative (FN) covers the cases where you predict the customer is not going to buy, but the actual data says the customer actually bought the product — again the model is wrong. These type 1 and type 2 errors need to be taken care of when you are building any machine learning model: if they are low, your model moves toward that hundred percent accuracy mark, but normally any machine learning model has its own limitations. Now, the true positives and the true negatives need to be controlled. If my model is very good on the positive cases, where the customer is buying, but doing a very bad job on the cases where the customer is not buying, then the model has some issues — it is doing well in one place and very badly in the other — and I need a metric to find that out. Sensitivity helps us do exactly that.
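To make the four cells concrete, here is a minimal sketch (assuming scikit-learn is installed; the two label arrays are made-up "will buy / won't buy" outcomes, with 1 meaning a purchase) that builds a confusion matrix and unpacks TN, FP, FN and TP from it:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual outcomes vs. model predictions (1 = will buy, 0 = won't buy)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp} (type 1 error), FN={fn} (type 2 error)")
```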
In simple terms, sensitivity is the ratio of true positives to all the actual positive cases — TP divided by (TP + FN). So imagine the type 2 error, the false negatives, growing: my sensitivity is going to come down; and if my true positives are very high, the sensitivity will also be high. This is related to what we call statistical power: if the sensitivity is really good, I can say that my positive cases are predicted well. The counterpart of sensitivity for the negative cases is what we know as specificity, the ratio of true negatives to all the actual negative cases, and in a very good machine learning model we need to make sure that sensitivity and specificity are both balanced. In other words, sensitivity here means the ratio of true positives to the total number of actual positive events, and as I mentioned, both sensitivity and specificity play a good role when you want to evaluate a model's output. One more common problem — and these questions often follow one another, since sensitivity and specificity are about a machine learning model's output, and once the model is done you want to understand whether it is good or not — is overfitting and underfitting of a given machine learning model. These words are very common, and the idea is that depending on the complexity of your model, you might adapt too exactly to your data points, or you might over-generalize. For instance, if I have red and blue dots and I draw a curve which separates the red from the blue, I am building a classifier using some modeling technique. By drawing a smooth curve, like the one shown in black, I might be over-generalizing — there might be some red dots on the other side of that boundary, as you can see — whereas if I am a bit more flexible and draw the green boundary, it takes care of those red dots that ended up on the wrong side. But the point when building any model is that you need to generalize to the pattern found in the data: if you don't generalize well enough you are underfitting, and if you make the fit too specific to the training points you are overfitting. That zigzag green curve might be represented by a polynomial that is a lot more complex than the smooth black curve, so you need to be careful — particularly in regression models, where the fit is represented by a line or a polynomial, you need to make sure the polynomial is neither too complex nor too simple, otherwise you end up in an overfitting or an underfitting situation. We need a good balance between the two. In summary, when statistical questions come up they will mostly cover basic statistical properties, as you would expect: things like standard deviation, averages, how to interpret the median, how to interpret quartiles — the first quartile, the second quartile and so on — and what you mean by percentiles. Questions that are a bit more complex in nature might be discussions around sensitivity, overfitting and underfitting. These are the core statistical ideas.
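As a small illustration of that balance — a minimal sketch on made-up noisy data, assuming only NumPy, with the polynomial degrees chosen arbitrarily by me — fitting the same points with a degree-1, a degree-3 and a degree-9 polynomial shows underfitting, a reasonable fit and overfitting respectively:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy pattern

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)     # fit a polynomial of this degree
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)     # error on the very points it was fit on
    print(f"degree {degree}: training MSE = {train_mse:.4f}")

# The degree-9 fit has the lowest training error but chases the noise (overfitting),
# while degree 1 is too simple to capture the pattern (underfitting);
# checking the error on held-out data is what exposes the difference.
```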
You want to prepare from a basics level, covering properties like the standard deviation and the mean, up to things like overfitting, underfitting, sensitivity and specificity — that will make your ground a bit stronger when you go for interviews, and these are at least the bare minimum statistical concepts to understand; anything less than that and you might face some difficulties in the interview. Now let's talk about questions which relate a bit more to data analysis, and see what kind of data analysis questions might pop up in the interview. A generic one is this: people normally do analysis on structured data, which is in rows and columns, but there are cases when the data is not so well structured — for instance textual data on Twitter, where you might run something like sentiment analysis, a commonly known technique. The sentiment analysis could be for a brand, for an election campaign, or about your product features and so on. Text analytics is a really large domain of its own, and in Python as well as R there are a number of libraries: R has libraries like tm, the text mining package, and in Python we have packages like pandas and NumPy, and also packages like NLTK, which is built specifically for natural language processing and can deal with many different text mining and text analytics approaches. In comparison, as I said, the robustness of Python is greater than that of R, but in terms of features both are powerful enough with the libraries and packages they offer. Another fundamental starting point is this: you are given a data set and asked to do some basic analysis — a typical scenario being "I am a retail business and my sales are going down" — and you need to dig through and understand what the problem really is. You might first look at the transactional data present in the system, and then you might also go outside your own network, maybe getting the sentiments of your customers from social media platforms and so on, so there will be different sources of data that you collect. But collecting the data is not the only task, and building a model or doing statistical analysis comes much later; what comes right after you have collected your data is making sure that the integrity of the data is maintained, getting rid of all the unwanted noise, and then finally preparing the data for the modeling exercise or for descriptive analytics on top of it. This cleaning and understanding of the data, along with a lot of exploration through plots, in essence takes close to 70 to 80 percent of your time in any data analysis task. If your company maintains its data in a very well-structured way, the heavy time spent on data cleaning might be reduced; otherwise you have to take this on for any new project for which the data is not already available, or where you don't have an existing pipeline that does this cleaning — in that case you have to write it yourself.
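As a tiny illustration of what that cleaning step tends to look like in code — a minimal sketch on a hypothetical transactions table, assuming pandas; the column names and values are invented for the example and are not from the video — the typical first moves are deduplication, normalizing inconsistent labels and handling missing values:

```python
import pandas as pd

# Hypothetical raw transactional data with the usual problems:
# duplicated rows, inconsistent labels and missing values
df = pd.DataFrame({
    "region": ["north", "north", "South ", "south", None],
    "units":  [3, 3, None, 5, 2],
    "price":  [9.99, 9.99, 4.50, None, 7.25],
})

df = df.drop_duplicates()                              # remove repeated records
df["region"] = df["region"].str.strip().str.title()    # normalize inconsistent labels
df["units"] = df["units"].fillna(0)                    # decide how missing values are treated
df = df.dropna(subset=["price"])                       # drop rows unusable for the analysis

print(df)   # quick check before any exploration or modeling
```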
This is very, very important: if you don't do the cleaning and understand the data well, the analysis or the models you build might end up giving you very bad performance — as I said, people normally spend around 80 percent of their time on this task. Also, when you are analyzing something like "my sales are going down, what do I do", it is not possible to answer complex problems like this with just one variable, so you will often need to move beyond one variable and talk about bivariate or multivariate analysis. This question comes up frequently, where you are asked to distinguish between univariate, bivariate and multivariate analysis, and the idea is very simple: in most analyses it is not only one variable which decides the end output, there are multiple factors involved. When there are multiple factors involved, you also want to look at things like correlation — with multiple variables you want to see whether there is any correlation between them. Sales are going down, but because of what? Is it because my sales representatives are not going to the market, or are my products bad, or is there some other reason? With all the variables in one place you can dig deeper to see whether any relationships emerge, and when you bring these variables together and analyze the problem collectively, you come out with really crisp answers to what you are trying to analyze. Moving on, there are also times when people do some sort of grouping with the data while sampling. You get a data set onto your system or whichever server you are doing the analysis on, but there may be many cases where even randomized sampling does not give you a true representative of the population. In those cases you might want to do systematic sampling or cluster-based sampling: for example, you might decide to analyze the issue with only five regions in mind, and with those five regions form different clusters; or in systematic sampling you might decide, within those five regions, to analyze only one product which is not doing well in sales. Sampling techniques like the cluster-based one or systematic sampling — and there are different names for these — let you give a very good interpretation of what really went wrong in whatever analysis you are doing. Sales going down is one example, but you can adapt this to other analyses as well. The idea is that instead of a purely randomized sampling, where you are not sure which segment of the data ended up in the analysis, at the end you will be able to say that this is not a random sample but data from these five regions, and that helps you put the end results of your analysis in the right perspective.
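To show one way such a grouped selection can be done in practice — a minimal sketch, not the exact procedure described above, using stratified sampling (mentioned earlier alongside these techniques) on a hypothetical pandas DataFrame named sales with a region column; it assumes pandas 1.1 or newer for groupby-level sampling — the same fraction is drawn from every region so no group dominates the sample:

```python
import pandas as pd

# Hypothetical sales records spread across four regions
sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"] * 500,
    "revenue": range(2000),
})

# Stratified sample: take 10% of the rows from every region separately,
# instead of 10% of the whole table at random
stratified = sales.groupby("region", group_keys=False).sample(frac=0.1, random_state=0)

print(stratified["region"].value_counts())   # each region contributes equally
```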
One more quite useful idea, related to what we saw earlier about moving from one variable to multiple variables, is eigenvalues and eigenvectors. This is a concept borrowed from linear algebra which helps us bring different variables together as linear combinations. In some complex analyses a data set might have a great many columns — let's assume you have a data set with 1 million rows and 10,000 columns, where those 10,000 columns are features; problems like that exist, but most of the time not all 10,000 input variables are useful. So what we can do is transform the data set into a lower-dimensional space, by which I mean those 10,000 columns can be reduced to, let's say, only 100 columns. Eigenvalues and eigenvectors are the ideas that help us with this transformation: the question is whether these 100 new variables can be represented as linear combinations of the 10,000 original variables, and if I am able to do that, my dimensionality is reduced, the time I take to do the analysis is reduced, and the representability I get with only 100 variables goes up. So it's quite a powerful idea — the eigenvectors are those linear combinations of many variables — and the calculation of eigenvectors normally happens on a correlation or covariance matrix, where correlation, as you know, measures how strongly two variables are related. That's why eigenvectors can help us compress the data we have: one eigenvector can represent a hundred variables together. A commonly used method for reducing the dimensions of a large data set, PCA or principal component analysis, is actually based on eigenvalues and eigenvectors, so if somebody asks you about eigenvalues and eigenvectors in an interview, also talk about PCA, which is built on these two concepts — that gives the interviewer a good sense that you know about eigenvalues and eigenvectors and that you can also think of an application like PCA.
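Here is a minimal sketch of that connection (assuming NumPy and scikit-learn; the data is random and only 5 features are used instead of 10,000 to keep it readable): the eigenvectors of the covariance matrix are the directions PCA works with, and the eigenvalues tell you how much variance each direction carries:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=7)
X = rng.normal(size=(1000, 5))                       # 1000 rows, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=1000)      # make one column nearly redundant

# Eigen-decomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)       # eigh: covariance is symmetric
print(np.sort(eigenvalues)[::-1])                     # variance carried by each direction

# PCA keeps the leading directions (here 2 of 5), i.e. the top eigenvectors
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_)                        # matches the largest eigenvalues
```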
Now, we talked about false positives and false negatives in our confusion matrix example, and about type 1 and type 2 errors — this is exactly the same idea, but let's drill further into scenarios where the false positives are more important and scenarios where the false negatives are more important. By importance we mean: are we allowed to make this mistake? If you are building a machine learning model, are we even allowed to be wrong on either the positive or the negative cases? Take an example from the medical domain, where we have a process called chemotherapy, which is normally given to cancer patients — a radiation-based therapy which kills cancerous cells and is very focused on those cells. Say you are building a model for detecting cancer from a CT image; this model will obviously not be 100 percent correct, since every machine learning model has its limitations, but you are required to predict whether a patient has cancer or not, and based on that a radiologist might decide whether chemotherapy is right for this patient. Now imagine you have predicted somebody to be positive for cancer, but the patient does not actually have cancerous cells. In that case you might end up saying let's go ahead with the chemotherapy, but the side effects of chemotherapy are very adverse, because you are giving the therapy to healthy cells when the patient doesn't have cancer. In cases like this the false positives become more important: it would be less damaging for your model to say the patient doesn't have cancer, even if there is a slight possibility of cancer being present, because then you are not exposing the patient to chemotherapy, which is more harmful. The false negative is not good either in this example, but here the false positive carries more weight; both are bad, as we know from the confusion matrix discussion. In simple terms, it is better not to expose a healthy patient to a treatment like chemotherapy on the basis of a false positive. You can think of other examples in a similar context. Now, where is the case where the false negative becomes more important? Suppose you are building a model to help decide whether to convict a particular suspect based on all the records and arguments presented in court. What happens if you let a criminal go free because your model produced a false negative — the person actually is a criminal, but based on all the evidence you had, the model predicted that the person is not? You are letting a criminal walk free in society, which is more harmful than holding that person; perhaps for a prolonged period you would want to gather more evidence and build a stronger case, so keeping a suspect behind bars for longer is considered better than letting a likely criminal walk free from the judicial system. So in these cases the other error, the false negative, becomes more important. Keep in mind that it is very easy to confuse false negatives and false positives, but if you always keep an example in mind there is no room for confusion: the confusion matrix is the basis on which these two ideas come in, and if you can put these examples in front of you every time, you can talk from them. If you try to explain what a false negative is purely in terms of the formula you might get confused, but if you take an example and then explain it, things are much clearer for you as well as for the person hearing it in the interview.
There are also cases where both errors matter about equally, typically in the banking industry. Say you are building a model that decides whether to give a loan to a person based on the input attributes collected from the customer's application. If the customer is actually creditworthy and you miss the opportunity by not giving the loan, you lose business; if the customer has a bad credit history and you give the loan anyway, you take a risk of losing your money. So in this example both errors play an equal role: whether your false positives or your false negatives are high, you end up losing a chunk of money. Keep these three examples in mind and false positives versus false negatives should never be confusing again. Now let's also talk about building a machine learning model. So far we have discussed what happens after the model is built, so let's step back and see how we normally build one and what process we follow. Given a dataset, we divide it into different parts: the commonly known splits are the training data and the test data, and we often keep one more portion of the larger dataset called the validation data. People frequently confuse the test data with the validation data. During training you obviously use the training data, but the training process can also involve a validation step, in which one part of the data is dedicated to validating the model while it is being trained, so the final model is well trained and validated at the same time; only when the model is completely done do you move on to testing. You can picture it like this: out of a thousand records you keep about 700 for training, 100 for validation and the remaining 200 for testing, giving you three splits (see the sketch below). To explain this with an example, there is a procedure called k-fold cross-validation, where k can be any number, most commonly five or ten, hence five-fold or ten-fold cross-validation. While building the model you work with a training set and a validation set, keeping a small portion of the data for validation and using the rest for training; the block labelled "test set" on the slide can be read as the validation set here. That validation subset is a rolling window which changes in every fold: in the first fold you hold out one subset for validation and train on the rest, in the next fold you move the window to another subset and train on the remainder, and so on. When the model built through this k-fold procedure is done, you finally use it on the held-out test data to see whether the accuracy is good. This approach brings in a lot of performance improvement.
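Here is a minimal sketch of the 1000-record split and the k-fold loop described above, assuming scikit-learn and a synthetic dataset; the split sizes mirror the example in the text, and everything else is an illustrative assumption.

```python
# Illustrative sketch of a train/validation/test split plus k-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 1000 records -> 800 for training + validation, 200 held out as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=200, random_state=42)

# Rolling validation window: each fold holds out a different slice of the 800 records.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X_trainval):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_trainval[train_idx], y_trainval[train_idx])
    fold_scores.append(model.score(X_trainval[val_idx], y_trainval[val_idx]))

print("Mean validation accuracy:", np.mean(fold_scores))

# Only after the folds are done do we touch the test set, exactly once.
final_model = LogisticRegression(max_iter=1000).fit(X_trainval, y_trainval)
print("Test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```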
People have also found the validation set to be a really good way of tuning the parameters of many machine learning models. As you may know, there are hyperparameters, typically in neural network models, that need to be tuned as training proceeds, and we cannot use the test data for that tuning, so the validation set comes in very handy (a small tuning sketch follows at the end of this passage). We just talked about cross-validation: as you keep moving the validation subset in each fold, first fold, second fold, third fold, you are performing cross-validation, and the idea behind it is to see how well your final model generalizes to the data you have. Independent of which portion of the data you trained on, the model should generalize, because it often happens that a model does very well on the training data but very badly at test time, which is the familiar overfitting and underfitting problem. With cross-validation you have made sure the model has trained on various subsets of the data, and at every stage a different small subset has served as validation, so irrespective of which data you used, the model should do well on the test cases. That is the capability cross-validation brings in. With those two pillars, statistical analysis and basic data analysis, I hope you are getting a sense of whether a question is coming from a model-building perspective or from an ordinary data analysis perspective. Now let's go deeper into questions that relate directly to machine learning: most of the questions so far have been about how we do analysis after a model is built, or how we perform plain data analysis such as A/B testing frameworks, but you may also be asked something squarely from the machine learning domain. Interviewers might start with basics such as "what do you mean by machine learning?" The idea should be clear by now: given a set of data points from a particular domain, you build a learning algorithm that takes the historical data and predicts something for the future. We have seen many examples already: predicting whether a person should be convicted given the evidence, deciding whether to give a loan to a customer, or predicting the onset of cancer from a patient's historical records. These algorithms are becoming ever more complex, now working on speech data and face data that are mostly used in biometric authentication systems, and new use cases keep coming up across industries. In machine learning the two most commonly used types of learning are supervised and unsupervised learning; there are two other types as well, semi-supervised learning and reinforcement learning. The distinction revolves around whether, for a given set of input attributes, you have a label that can guide the learning for each data point, or you do not.
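Tying back to the hyperparameter-tuning point above, here is a hedged sketch of tuning a parameter with cross-validation (here scikit-learn's GridSearchCV) so that the test set is never used for tuning; the model choice, the parameter grid and the data are illustrative assumptions.

```python
# Illustrative: tune a hyperparameter with cross-validation, never with the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each candidate value of C is scored on validation folds carved out of the training data.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print("Best C found on validation folds:", search.best_params_)
print("Final, untouched test-set score: ", search.score(X_test, y_test))
```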
If you do have the label, the approach can be supervised learning; if you do not, the approach is unsupervised. Some examples of supervised learning algorithms are support vector machines, regression, naive Bayes and decision trees. In very simple terms, suppose you are given images of different fruits and, from the characteristics of each fruit, you must identify whether it is an apple, a banana or an orange. If the labels are available, the model learns that, given these characteristics, this one is an apple, this one an orange and this one a banana. In the other case, with a clustering approach where no such apple/banana/orange label exists, we simply segregate the data points using the input features, perhaps colour, texture or shape, and then we might say that fruits with an elongated shape fall into a bucket we could call bananas, while a more spherical shape might be an apple or an orange. So depending on the presence of a label we use either supervised or unsupervised learning; both approaches are common, and some algorithms can even be used both ways depending on how you model the problem, but the fundamental difference comes from whether we have the label or not. Within supervised learning, one of the main problem types is classification, where you are given a set of input attributes and the label is a category: fruit labels such as banana, apple and orange; a customer we want to classify as a likely defaulter or a good customer, which gives two classes; the same two-class setup when we want to detect whether a patient has cancer or not; or detecting whether a malicious file is a virus, a trojan, a worm or something else, in which case there are more than two classes. The fundamental point is that we are following a supervised learning algorithm while the type of problem we are solving is classification, so instead of saying "classification algorithm" we can also say it is a classification problem solved with a supervised learning algorithm. Typical classification algorithms include logistic regression, decision trees, support vector machines and so on.
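To make the labelled-versus-unlabelled distinction concrete, here is a small illustrative sketch: the same toy "fruit" features are fed once to a supervised classifier (labels available) and once to k-means clustering (no labels). The feature values are invented purely for illustration.

```python
# Same feature matrix, two treatments: supervised (with labels) vs unsupervised (without).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Invented features: [weight_g, elongation_score, roundness]
X = np.array([[120, 0.9, 0.2],   # banana-like
              [115, 0.8, 0.3],   # banana-like
              [180, 0.3, 0.9],   # apple-like
              [170, 0.2, 0.95],  # apple-like
              [140, 0.5, 0.85],  # orange-like
              [150, 0.4, 0.9]])  # orange-like
y = ["banana", "banana", "apple", "apple", "orange", "orange"]

# Supervised: the labels guide the learning.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Predicted label:", clf.predict([[125, 0.85, 0.25]]))

# Unsupervised: no labels, the algorithm only groups similar points together.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_)
```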
Now let's talk about one particular classification algorithm, logistic regression. It is a very commonly used algorithm: banks and companies as big as American Express have leveraged logistic regression to a great extent and built really robust implementations of it, particularly for banking use cases such as predicting whether a customer is going to default if issued a credit card or given a loan. Decisions of that kind can be taken very robustly with a logistic regression model, and the algorithm is best suited to two-class, or binary, problems where the answer is either yes or no (a small sketch appears at the end of this passage). It is a common technique, and in any problem with binary classes you might use logistic regression: a political leader winning an election or not, somebody passing an examination or not, or, as mentioned, whether to give a loan to a customer depending on whether he or she is likely to become a defaulter. Keep in mind that logistic regression works best for classification problems with two classes. Another widely used family of algorithms is recommender systems, and this one hardly needs an introduction, it is that common nowadays. Take Amazon: when you browse a product, the widget at the bottom of the page says "you may also like" or "customers who bought this also bought", and those recommendations come from a recommender system running in the back end. On YouTube, when you watch one video, the next videos start to queue up one after another; that again is a recommender system working behind the scenes. Netflix likewise adapts to the movies you watch and starts suggesting movies you might like. Facebook uses the idea for recommending friends: based on data such as your contact and mail lists, it curates friend suggestions. All of these algorithms benefit the business in some way: for Amazon, a recommendation below the page means people may buy more than one product in a transaction; for Facebook, the network grows, the connections between users become stronger, and consequently the ads Facebook wants to sell grow too, because more users and more connections mean more knowledge about the interactions and behaviours people show on a social network. The fundamental idea behind all these recommender systems is to make a meaningful comparison between two users or between two items. For Amazon, what is the similarity between any two products? If the similarity to the product in consideration is really high, recommend it. Or, if two users on Facebook turn out to be very similar, you might want to suggest to each of them that there is another person they may want to connect with. Many such use cases emerge once you get into the deeper workings of recommender systems, but in simple terms the question is always how to compare two items, where an item might be a product, a person or a movie, or how to compare two users. Famous approaches include collaborative filtering, both user-based and item-based collaborative filtering algorithms, and nowadays people have also moved on to latent factor models such as SVD, singular value decomposition, and many others.
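Circling back to logistic regression, as promised above: a minimal sketch of a two-class default-prediction model on invented applicant features. The column meanings, data and decision threshold are all illustrative assumptions, not a production banking model.

```python
# Illustrative two-class logistic regression: will an applicant default (1) or not (0)?
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features: [annual_income_k, credit_utilisation, past_missed_payments]
X = np.array([[95, 0.20, 0], [40, 0.85, 3], [70, 0.30, 1], [30, 0.90, 4],
              [85, 0.25, 0], [45, 0.75, 2], [60, 0.40, 1], [35, 0.95, 5]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 1 = defaulted in the historical data

model = LogisticRegression(max_iter=1000).fit(X, y)

applicant = np.array([[55, 0.60, 2]])
prob_default = model.predict_proba(applicant)[0, 1]
print(f"Estimated probability of default: {prob_default:.2f}")
print("Decision:", "review / decline" if prob_default >= 0.5 else "approve")
```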
We talked about classification and said that logistic regression handles a binary-class problem, but you might also be asked about linear regression. What if you do not want a class for a user or a patient, but instead want a crisp value? When I classify a file as good or bad, where bad could mean a virus, a worm or a trojan, those are classes; but suppose I want to know the exact value of a house in a particular locality of my city, how do I calculate that? Linear regression is one such technique: it regresses over input data, which might include properties of the house such as the number of bedrooms and the area in square feet, and finally predicts a crisp value, the price expressed in dollars or any other currency. The idea, once again, is that you have training data with labels: from past data I know that a house with these attributes sold at a certain price, so I use that as my training data and build my model for the future. Then, for any similar pattern in data that arrives later, say a new property built in some location, I can use the model to predict the price, because similar features in that locality should correspond to prices in a particular range. A model like linear regression learns those patterns in the input attributes and tries to predict the price of the house. Geometrically, given a set of data points, I want a very generic model: a line that passes as close as possible to all the points. You could draw infinitely many lines through a two-dimensional cloud of points, but the one closest to the points is the best line; a red line lying far from the points is not a good fit, whereas a blue line lying close to all the data points is one of the best lines you can get. So the idea is to fit a line passing as closely as possible through the data and to minimise the so-called error, which is the sum of all those distances. That is simple linear regression, and the fundamental assumption is that the variables have a linear relationship, meaning that as one variable increases the other increases proportionally. If the pattern is not linear, if the points are arranged in a way that only a polynomial model can capture and no sensible straight line fits, then linear regression is not very useful, because the relationship is no longer linear; in that case you might go for another regression approach, such as polynomial regression, which can model a non-linear relationship with the independent attributes. It is a very common technique, and a large chunk of its explanation comes from statistical ideas such as hypothesis testing, p-values and confidence intervals.
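A minimal sketch of the house-price idea above: fit a model that minimises the squared distances to the training points (a plane here, since there are two features), then predict a crisp price for a new property. The numbers are invented for illustration.

```python
# Illustrative linear regression: predict a house price from simple attributes.
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: [area_sqft, bedrooms] -> price in dollars
X = np.array([[900, 2], [1200, 3], [1500, 3], [1800, 4], [2100, 4], [2500, 5]])
y = np.array([150_000, 200_000, 240_000, 290_000, 330_000, 400_000])

model = LinearRegression().fit(X, y)    # minimises the sum of squared errors

new_house = np.array([[1600, 3]])
print(f"Predicted price: ${model.predict(new_house)[0]:,.0f}")
print("Coefficients (per sqft, per bedroom):", model.coef_)
```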
So if somebody asks you about linear regression, it is best to start by explaining how you build a linear regression model and then give some examples of it; and if you are not comfortable with ideas like p-values or hypothesis testing, refresh them before any interview, because if linear regression comes up those concepts will need to be explained in a bit more depth. When I talked about recommendation algorithms I mentioned collaborative filtering, user-based and item-based collaborative filtering, the two commonly used recommendation approaches, normally referred to as UBCF and IBCF. As mentioned, the idea is to compare two users or to compare two items, where an item can be anything: a product, a movie or a person. The way the model is built is this: given many users and, in this example, their ratings for movies, we want to work out whether we can recommend suitable movies to a user based on the behaviour, that is, the ratings, of other users. For example, Carol has not seen the movie 21, so that cell is a question mark; can we predict its value? If the prediction comes out close to 2, I will not recommend the movie, but if it comes out around three, four or five, the model says yes, recommend this movie to Carol.
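Here is a minimal sketch of the user-based idea for the "Carol and the movie 21" situation: compute cosine similarity between users over the movies they have both rated, then fill in the missing rating with a similarity-weighted average. The ratings matrix and movie titles are invented for illustration.

```python
# Illustrative user-based collaborative filtering with an invented ratings matrix.
import numpy as np

movies = ["21", "Up", "Heat", "Coco"]
ratings = {                      # np.nan marks "not rated yet"
    "Alice": [4.0, 3.0, 5.0, 1.0],
    "Bob":   [5.0, 2.0, 4.0, 1.0],
    "Carol": [np.nan, 3.0, 4.0, 2.0],
}

def cosine(u, v):
    """Cosine similarity over the positions both users have rated."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    u, v = u[mask], v[mask]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target, movie_idx = "Carol", movies.index("21")
num, den = 0.0, 0.0
for user, vals in ratings.items():
    vals = np.array(vals)
    if user == target or np.isnan(vals[movie_idx]):
        continue
    sim = cosine(np.array(ratings[target]), vals)
    num += sim * vals[movie_idx]       # weight each neighbour's rating by similarity
    den += sim

predicted = num / den
print(f"Predicted rating of '21' for Carol: {predicted:.2f}")
print("Recommend" if predicted >= 3 else "Don't recommend")
```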
Now one more fundamental issue that arises when you build models. As I mentioned earlier, after you collect the data it requires a lot of cleaning and exploration, and in that process you will often find some extreme points. For instance, if I am building a regression model to predict house prices and there is one house that somebody managed to sell at a very high price, through an auction or some other marketing gimmick, that point might mislead the model and pull it towards the outlier. We do not want the model drifting towards an outlier; we need to deal with it separately, and if there is no good explanation for the outlier in terms of an input attribute, it is better to remove it. On the other hand, suppose I am analysing e-commerce data, going through all the products and the sales they saw in the last week, and during that week there was a big sale day, which e-commerce companies run frequently. On a sale day you would obviously expect purchases to spike, but is that an outlier to me? Not really, because I can explain it by saying it was a discount day, so I can handle it separately: either set aside all the points belonging to the sale day and analyse them on their own, or add a variable indicating whether a given day was a sale day and keep the outlier in, in which case the analysis goes in a different direction. The point is that you must handle outliers before building your model or doing any analysis, otherwise your insights or your model's output may point you in a completely different direction. There are different ways to handle outliers: some people remove any data point that falls outside the range of the mean plus or minus three standard deviations, and some people use a percentile rule, removing any point greater than the 99th percentile (a short sketch follows at the end of this passage). That is a bit like removing the toppers from the score data of an SAT or CAT examination; outliers can cause problems when you try to explain the model. You can see this intuitively too: if you have the scores of candidates who appeared for an examination and one outlier candidate scored extremely high, do you have a way of explaining that outlier? You might simply call that person talented, but that does not really explain anything in the model, so it is better to keep such exceptional cases aside and do the analysis with the rest, which gives you good patterns and good insight from the data. That is how you normally handle an outlier. Another question that often comes up is: if you are given an analytics project with a lot of data, how do you normally approach it? In typical cases the first step is to dive deep into the problem at hand and define it very crisply; never define a problem that is broad. For example, if you are building a customer segmentation model using a clustering approach, "build a customer segmentation model for all product categories" is a broad problem, but "build a customer segmentation model only for the fashion category of products" makes it crisp. Defining and understanding the problem statement is the foremost task. Then comes the exercise of exploring the data, in which you identify outliers and missing values and apply any transformations you need, such as converting from a long format to a wide format or vice versa; that is the second and third step. Once the data is in good shape, after removing outliers, handling missing values and so on, you start to understand relationships, how the input attributes relate to one another, which is the stage where you prepare for any insight-building or model-building exercise. If you build a model at this step, the immediate next step is to validate it, checking whether the model is really good or not on testing data. Once all of that is done and you have produced an insight or a model, you will want to see how it performs in the long term, because no model is static: your data grows daily, so a really robust model should be updated as new data becomes available. Over time you should track and analyse how well the model performs on real-world data, and if the performance degrades, it is time to retrain the model and come out with an updated version on the data you now have.
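Coming back to the two outlier-removal rules mentioned above (mean ± 3 standard deviations, or trimming beyond the 99th percentile), here is a small pandas/NumPy sketch on an invented price column; the thresholds and the data are assumptions for illustration.

```python
# Illustrative outlier filtering: 3-sigma rule and 99th-percentile trimming.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
prices = pd.Series(rng.normal(300_000, 50_000, size=1_000))
prices.iloc[0] = 2_500_000          # an auction-style extreme sale

# Rule 1: keep points within mean +/- 3 standard deviations.
mu, sigma = prices.mean(), prices.std()
within_3sigma = prices[(prices - mu).abs() <= 3 * sigma]

# Rule 2: drop anything above the 99th percentile.
below_p99 = prices[prices <= prices.quantile(0.99)]

print("Original rows:      ", len(prices))
print("After 3-sigma rule: ", len(within_3sigma))
print("After 99th pct trim:", len(below_p99))
```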
One more task in the cleaning process is treating missing values, and there are quite a number of techniques for that. For instance, suppose you have an attribute such as age and you are analysing it across various segments of people: teenagers, working professionals, people still in college and so on. If an age value is missing in one of these categories, say the teenagers, then because I know I am analysing a group of teenagers I can look at the average age in that group and impute a value. Instead of discarding the entire row just because the age is missing, I can fill it in with a simple measure like the group average, and that will not be badly wrong, because there is strong evidence that teenagers are roughly in the range of 16 to 20; even if my average is off, it is off by only a year or two, which is fine if I want to retain the data. In many applications, discarding a row because of missing values can be very costly because the data is limited, which is why people usually impute with something like the minimum, maximum or mean of the group; more sophisticated, pattern-based imputation is also possible, but this gives you the idea. In other cases, if no imputed value makes sense and anything you put in would be misleading, then it is better to remove the row, but only if you have a surplus of data; if not, be cautious about removing anything, particularly rows with missing values. The next question pertains to the k-means machine learning algorithm: every time you run it, you have to define the value of k. There are approaches like the elbow curve, a plot with the number of clusters on the x-axis and the WSS, the within-cluster sum of squares, also known as the distortion, on the y-axis. The idea is that as you increase k while building the k-means clustering model, you look for a point where the distortion becomes low, meaning the data points within a cluster are as close to each other as possible while points in different clusters, say a point in cluster one and a point in cluster two, are as far apart as possible. If the points inside a cluster are very spread out and the clusters sit very close to each other, the distortion will be high and the value of k probably needs to be increased further. At some point the elbow curve shows a small kink, a sharp dip in the distortion values, and that is an appropriate value of k for the k-means model. On the plot you can see the number of clusters on the x-axis and the within-group sum of squares on the y-axis; there is a sharp dip at the point circled in red, after which the values more or less saturate, so in this example an appropriate value of k would be 6.
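An illustrative sketch of the elbow curve described above: fit k-means for a range of k values and plot the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The synthetic data and the range of k are assumptions.

```python
# Illustrative elbow curve: within-cluster sum of squares (inertia) versus k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.0, random_state=42)

ks = range(1, 11)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)          # distortion / within-cluster sum of squares

plt.plot(list(ks), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Look for the elbow where the curve stops dropping sharply")
plt.show()
```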
I hope you are now getting comfortable with questions around data analysis, statistics and even the machine learning part. The next pillar, which is very closely associated with statistics, is probability; in most of the standard literature, probability and statistics go together and are practically inseparable. In the naive Bayes algorithm, a machine learning algorithm built on Bayes' theorem, probability ideas are used heavily, and there are some really niche probability concepts, such as probabilistic graphical models, that build on Bayes' theorem and the fundamental properties of probability. I will not go to that depth here; if you have expertise in probability, probabilistic graphical models or naive Bayes style algorithms, feel confident to speak about it, but most of the time the probability questions asked in interviews are quite basic and fundamental, depending of course on where you are interviewing. So let's look at an approach for attacking a probability problem. I will read it out: in any 15-minute interval there is a 20 percent probability that you will see a shooting star; what is the probability that you see at least one shooting star in a period of one hour? The known information is that in every 15-minute interval there is a 20 percent chance of seeing at least one shooting star, and we want the probability over one hour. Let's build this up systematically. We know one fact: the probability of seeing a shooting star in a 15-minute interval is 20 percent, which is 0.2 in probability terms, and as you know a probability always lies between 0 and 1.
If I want the probability of not seeing a shooting star in 15 minutes, which is the complementary event, I take one minus the probability of seeing one, so 0.8 is the probability of not seeing a shooting star in a 15-minute interval. Next I need the probability of not seeing a shooting star over a whole hour. Seeing a shooting star in the first 15-minute interval, the second, the third and the fourth within an hour are independent events, so I can multiply the probabilities: 0.8 × 0.8 × 0.8 × 0.8, which comes to about 0.41. So the probability of not seeing a shooting star in a period of one hour is roughly 0.41, and once we know that, all we have to do to get the value we are interested in, the probability of seeing at least one shooting star, is take 1 minus the probability of not seeing one, which comes to about 0.59. It is quite a simple approach once you see how it works. There may be trickier ways of wording the same question, but if you know this approach and have the idea of independent events right, you will be able to set up the formulation the same way. Also keep in mind that, given a sample space and the events within it, the probabilities cannot sum to more than one, and complementary events sum to exactly one; that is why we can do this subtraction. If I am tossing a coin, the probability of tails is 0.5 and the probability of heads is one minus the probability of tails, and the same logic applies here, since we have framed this as a binary outcome. Whatever problem you are given, the approach stays the same even if the language and the trickiness of the question increase; this is how you can attack it.
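A quick numeric check of the shooting-star calculation above, both analytically and with a small Monte Carlo simulation; the simulation is purely for illustration.

```python
# Verify P(at least one shooting star in an hour) given P(star in any 15 min) = 0.2.
import random

p_15min = 0.2
p_hour = 1 - (1 - p_15min) ** 4            # four independent 15-minute intervals
print(f"Analytical answer: {p_hour:.4f}")   # ~0.5904

# Monte Carlo sanity check.
random.seed(0)
trials = 200_000
hits = sum(
    any(random.random() < p_15min for _ in range(4))
    for _ in range(trials)
)
print(f"Simulated answer:  {hits / trials:.4f}")
```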
There is one more classic question: generate a random number between one and seven using only a die. Let's think about how to approach it. We know how to generate random numbers from a given set of outcomes, but here we must do it by rolling a die, and a die only has the numbers one to six, so the probability of getting any one face in a single roll is one divided by the total number of possibilities, that is, one sixth. There is no way to get a seventh outcome from a single roll, but if we roll the die twice the number of possibilities increases: we now have 36 different ordered outcomes, six times six, running from (1,1), (1,2), (1,3) and so on all the way to (6,6). Keep in mind that this also relates to sampling technique: if we want numbers between one and seven, each should be equally likely, otherwise we might get, say, the number one far more often than the number two, and the selection would be biased rather than properly randomised. So how do we make the seven outcomes equally likely from these 36 combinations? First we find a number divisible by seven that is close to 36: we exclude one of the possibilities, say the outcome (6,6), and keep only the remaining 35 possible outcomes (if (6,6) comes up, we simply roll again). With that one outcome removed we are left with all the combinations from (1,1) up to (6,5), and we can divide them into seven parts, each containing five possible outcomes. We then assign a value to each part: the first five outcomes are assigned to part one, so whenever any of those five outcomes appears we say the random number generated is one; the next five outcomes go to part two, and whenever one of those comes up while rolling the two dice we assign the value two, and so on. This way we have made sure that each random number between one and seven is equally likely, five outcomes out of thirty-five, which is one in seven; if we did not do this there would be a bias. You can see here how statistics merges with probability, and how the two ideas and subjects travel together.
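A small simulation of the two-roll trick just described: map the first 35 of the 36 ordered outcomes onto the numbers 1 to 7 in groups of five, and simply re-roll whenever (6, 6) comes up; the function names are illustrative.

```python
# Generate a uniform random number in 1..7 using only rolls of a fair six-sided die.
import random
from collections import Counter

def roll_die():
    return random.randint(1, 6)

def rand7():
    while True:
        # Encode two rolls as a number 0..35; 35 corresponds to the outcome (6, 6).
        outcome = (roll_die() - 1) * 6 + (roll_die() - 1)
        if outcome == 35:
            continue                 # re-roll the excluded (6, 6) case
        return outcome // 5 + 1      # 35 outcomes -> 7 groups of 5 -> values 1..7

random.seed(1)
counts = Counter(rand7() for _ in range(70_000))
print(dict(sorted(counts.items())))  # each value should appear roughly 10,000 times
```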
One more question along the same lines: a couple has two children, at least one of which is a girl; what is the probability that they have two girls? This becomes a bit like a two-coin-toss problem, where we enumerate the combinations. The couple has two children and at least one of them is a girl, so that is the known fact; now all we need to do is write out the possible combinations: first child boy and second child girl, first child girl and second child boy, both boys, and both girls, and all four combinations are equally likely. Since at least one of the children is a girl, the combination where both children are boys is excluded, leaving three equally likely combinations, and only one of them, two girls, favours the event we are asked about, so the probability of having two girls is one in three. Now let's take up this question: a jar has a thousand coins, of which 999 are fair, with one head and one tail, and one coin has been tampered with so that both of its sides are heads. We pick a coin at random from the jar and toss it ten times; given that we have seen ten heads in those ten tosses, what is the probability that the next, eleventh toss also comes up heads? Let's first put the two possibilities in place: 999 fair coins and one double-headed coin. The probability of choosing a fair coin from the jar is 999 out of the thousand coins, which is 0.999, so very high, while the double-headed one has only a 0.001 probability of being picked, but it can still turn up. The probability of seeing ten heads in a row then comes from two cases: either you pick a fair coin and toss ten heads in a row, or you pick the double-headed coin, for which heads is guaranteed because both sides are heads. For the first case, each toss of a fair coin is independent and comes up heads with probability 0.5, so the joint probability is 0.999 multiplied by 0.5 ten times over, which is about 0.00098. For the second case, picking the double-headed coin guarantees heads every time, so the joint probability is simply 0.001. Given these two cases, what we want is the probability of an eleventh head. Combining the probabilities, the likelihood that the ten observed heads came from a fair coin works out to about 0.49, and the likelihood that they came from the double-headed coin to about 0.51. Finally, with a fair coin the probability of another head is 0.5, and with the double-headed coin it is always 1, so we weight these by the two likelihoods we just computed, given that ten heads have already been seen, and putting it all into one formulation, the probability that the next toss is also a head comes to about 0.75.
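Here is a short numeric check of the coin-jar reasoning above using Bayes' theorem; the numbers mirror the example (999 fair coins, one double-headed, ten heads observed).

```python
# Probability that the 11th toss is heads, given 10 heads from a coin drawn at random.
p_fair, p_double = 999 / 1000, 1 / 1000

# Likelihood of observing 10 heads under each kind of coin.
like_fair, like_double = 0.5 ** 10, 1.0

# Posterior probability of each coin given the 10 observed heads (Bayes' theorem).
evidence = p_fair * like_fair + p_double * like_double
post_fair = p_fair * like_fair / evidence        # ~0.494
post_double = p_double * like_double / evidence  # ~0.506

p_next_head = post_fair * 0.5 + post_double * 1.0
print(f"P(fair | 10 heads)   = {post_fair:.3f}")
print(f"P(double | 10 heads) = {post_double:.3f}")
print(f"P(11th toss is head) = {p_next_head:.3f}")   # ~0.753
```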
That 0.75 is higher than the 0.5 you would get if the double-headed coin were not present in the jar, because in that case every coin would be fair. So thanks a lot for listening to this session; I hope it helps, and I hope you all build a really successful career in data science. Thank you. I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist and subscribe to the Edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 134,047
Keywords: yt:cc=on, data science, Data Science full course, data science for beginners, data science course, data science 2023, data science course 2023, data science tutorial, data science training, data science tutorial for beginners, introduction to data science, data scientist, what is data science, Learn Data Science, who is a data scientist, data science skills, statistics for data science, python data science tutorial, data science edureka, Edureka data science, Edureka
Id: xiEC5oFsq2s
Length: 682min 18sec (40938 seconds)
Published: Tue Jan 10 2023