Python Machine Learning Tutorial (Data Science)

Captions
If you're looking for a machine learning tutorial with Python and Jupyter Notebook, this tutorial is for you. You're going to learn how to solve a real-world problem using machine learning and Python. We're going to start off with a brief introduction to machine learning, then we're going to talk about the tools you need, and after that we're going to jump straight into the problem we're going to solve. You'll learn how to build a model that can learn and predict the kind of music people like. So by the end of this one-hour tutorial, you will have a good understanding of machine learning basics and you'll be able to learn more intermediate to advanced level concepts. You don't need any prior knowledge in machine learning, but you need to know Python fairly well. If you don't, I've got a couple of tutorials for you here on my channel; the links are below this video. I'm Mosh Hamedani and I'm super excited to be your instructor. On this channel I have tons of programming tutorials that you might find helpful, so be sure to subscribe as I upload new tutorials every week. Now let's jump in and get started.

In this section you're going to learn about machine learning, which is a subset of AI, or artificial intelligence. It's one of the trending topics in the world these days and it's going to have a lot of applications in the future. Here's an example. Imagine I ask you to write a program to scan an image and tell if it's a cat or a dog. If you want to build this program using traditional programming techniques, your program is going to get overly complex. You will have to come up with lots of rules to look for specific curves, edges, and colors in an image to tell if it's a cat or a dog. But if I give you a black and white photo, your rules may not work. They may break, and then you'll have to rewrite them. Or I might give you a picture of a cat or a dog from a different angle that you did not predict before. So solving this problem using traditional programming techniques is going to get overly
complex or sometimes impossible. Now, to make matters worse, what if in the future I ask you to extend this program such that it supports three kinds of animals: cats, dogs, and horses? Once again, you'll have to rewrite all those rules. That's not going to work. So machine learning is a technique to solve these kinds of problems, and this is how it works. We build a model, or an engine, and give it lots and lots of data. For example, we give it thousands or tens of thousands of pictures of cats and dogs. Our model will then find and learn patterns in the input data, so we can give it a new picture of a cat that it hasn't seen before and ask it, is it a cat or a dog or a horse? And it will tell us with a certain level of accuracy. The more input data we give it, the more accurate our model is going to be. So that was a very basic example, but machine learning has other applications in self-driving cars, robotics, language processing, vision processing, forecasting things like stock market trends and the weather, games, and so on. So that's the basic idea about machine learning. Next we'll look at machine learning in action.

A machine learning project involves a number of steps. The first step is to import our data, which often comes in the form of a CSV file. You might have a database with lots of data; we can simply export that data and store it in a CSV file for the purpose of our machine learning project. So we import our data. Next we need to clean it, and this involves tasks such as removing duplicated data. If you have duplicates in the data, we don't want to feed this to our model, because otherwise our model will learn bad patterns in the data and it will produce the wrong result. So we should make sure that our input data is in a good and clean shape. If there is data that is irrelevant, we should remove it. If it is duplicated or incomplete, we can remove or modify it. If our data is text-based, like the names of countries or genres of music or cats and dogs, we need to convert it to numerical
values. So this step really depends on the kind of data we're working with; every project is different. Now that we have a clean data set, we need to split it into two segments, one for training our model and the other for testing it, to make sure that our model produces the right result. For example, if you have a thousand pictures of cats and dogs, we can reserve 80 percent for training and the other 20 percent for testing. The next step is to create a model, and this involves selecting an algorithm to analyze the data. There are so many different machine learning algorithms out there, such as decision trees, neural networks, and so on. Each algorithm has pros and cons in terms of accuracy and performance, so the algorithm you choose depends on the kind of problem you're trying to solve and your input data. Now the good news is that we don't have to explicitly program an algorithm. There are libraries out there that provide these algorithms. One of the most popular ones, which we are going to look at in this tutorial, is scikit-learn. So we build a model using an algorithm. Next we need to train our model, so we feed it our training data. Our model will then look for the patterns in the data, so next we can ask it to make predictions. Back to our example of cats and dogs: we can ask our model, is this a cat or a dog, and our model will make a prediction. Now the prediction is not always accurate. In fact, when you start out, it's very likely that your predictions are inaccurate. So we need to evaluate the predictions and measure their accuracy. Then we need to get back to our model and either select a different algorithm that is going to produce a more accurate result for the kind of problem we're trying to solve, or fine-tune the parameters of our model. Each algorithm has parameters that we can modify to optimize the accuracy. So these are the high-level steps that you follow in a machine learning project. Next we'll look at the libraries and tools for machine learning.

In this lecture we're going to
look at the popular Python libraries that we use in machine learning projects. The first one is NumPy, which provides a multi-dimensional array. Very, very popular library. The second one is pandas, which is a data analysis library that provides a concept called data frame. A data frame is a two-dimensional data structure similar to an Excel spreadsheet, so we have rows and columns. We can select data in a row or a column or a range of rows and columns. Again, very, very popular in machine learning and data science projects. The third library is Matplotlib, which is a two-dimensional plotting library for creating graphs and plots. The next library is scikit-learn, which is one of the most popular machine learning libraries that provides all these common algorithms like decision trees, neural networks, and so on. Now when working with machine learning projects, we use an environment called Jupyter for writing our code. Technically we can still use VS Code or any other code editor, but these editors are not ideal for machine learning projects because we frequently need to inspect the data, and that is really hard in environments like VS Code and the terminal. If you're working with a table of 10 or 20 columns, visualizing this data in a terminal window is really, really difficult and messy. So that's why we use Jupyter; it makes it really easy to inspect our data. Now to install Jupyter, we're going to use a platform called Anaconda. So head over to anaconda.com/download. On this page you can download the Anaconda distribution for your operating system. We have distributions for Windows, Mac, and Linux. So let's go ahead and install Anaconda for Python 3.7. Download. All right, so here's Anaconda downloaded on my machine. Let's double click this. All right, first it's going to run a program to determine if the software can be installed, so let's continue, and once again continue, continue, pretty easy, continue one more time. I agree with the license agreement. Okay, you can use the default installation location, so
don't worry about that. Just click Install. Give it a few seconds. Now the beautiful thing about Anaconda is that it will install Jupyter as well as all those popular data science libraries like NumPy, pandas, and so on, so we don't have to manually install these using pip. All right, now as part of the next step, Anaconda is suggesting to install Microsoft VS Code. We already have this on our machine so we don't have to install it; we can go with Continue and close the installation. Now finally we can move this to trash because we don't need this installer in the future. All right, now open up a terminal window and type jupyter, with a y, space, notebook. This will start the notebook server on your machine. So enter. There you go. You can see these default messages here; don't worry about them. Now it automatically opens a browser window pointing to localhost port 8888. This is what we call the Jupyter dashboard. On this dashboard we have a few tabs. The first tab is the Files tab, and by default this points to your home directory. Every user on your machine has a home directory. This is my home directory on Mac. You can see here we have a Desktop folder as well as Documents, Downloads, and so on. On your machine you're going to see different folders. So somewhere on your machine you need to create a Jupyter notebook. I'm going to go to Desktop. Here's my desktop, I don't have anything here. And then click New. I want to create a notebook for Python 3.
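Once the installation finishes, one quick way to confirm that the libraries mentioned above are available is to look them up from Python. This is only a minimal sketch and not something shown in the video: the helper name check_libraries is made up for illustration, and it reports whether each package can be found without actually importing it.

```python
import importlib.util

# Import names of the tools discussed above. Note that scikit-learn is
# imported as "sklearn" and the classic Jupyter server as "notebook".
LIBRARIES = ["numpy", "pandas", "matplotlib", "sklearn", "notebook"]


def check_libraries(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_libraries(LIBRARIES).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If any entry is reported as missing, the Anaconda installation probably did not complete, or the terminal is using a different Python than the one Anaconda installed.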
In this notebook we can write Python code and execute it line by line. We can easily visualize our data, as you will see over the next few videos. So let's go ahead with this. All right, here's our first notebook. You can see by default it's called Untitled. Let's change that to hello world. So this is going to be the hello world of our machine learning project. Let's rename this. Now if you look at your desktop, you can see this file, helloworld.ipynb. This is a Jupyter notebook. It's kind of similar to our .py files where we write our Python code, but it includes additional data that Jupyter uses to execute our code. So back to our notebook, let's do a print hello world and then click this Run button here, and here's the result printed in Jupyter. So we don't have to navigate back and forth between the terminal window; we can see all the results right here. Next I'm going to show you how to load a data set from a CSV file in Jupyter.

All right, in this lecture we're going to download a data set from a very popular website called kaggle.com. Kaggle is basically a place to do data science projects. So the first thing you need to do is to create an account. You can sign up with Facebook, Google, or using a custom email and password. Once you sign up, come back here on kaggle.com. Here in the search bar, search for video game sales. This is the name of a very popular data set that we're going to use in this lecture. So here in this list you can see the first item with this kind of reddish icon, so let's go with that. As you can see, this data set includes the sales data for more than 16,000 video games. On this page you can see the description of various columns in this data set. We have rank, name, platform, year, and so on. So here's our data source. It's a CSV file called vgsales.csv. As you can see, there are over 16,000 rows and 11 columns in this data set. Right below that you can see the first few records of this data set. So here's our first record. The ranking for this game is one, it's the Wii Sports
game for Wii as the platform, and it was released in the year 2006. Now what I want you to do is to go ahead and download this data set, and as I told you before, you need to sign in before you can download this. So this will give you a zip file. As you can see here, here's our CSV file. Now I want you to put this right next to your Jupyter notebook. On my machine that is on my desktop, so I'm going to drag and drop this onto the Desktop folder. Now if you look at the desktop, you can see here is my Jupyter hello world notebook, and right next to that we have vgsales.csv. With that, we go back to our Jupyter notebook. Let's remove the first line and instead import pandas as pd. With this we're importing the pandas module and renaming it to pd, so we don't have to type pandas dot several times in this code. Now let's type pd.read_csv and pass the name of our CSV file, that is vgsales.csv. Now because this CSV file is in the current folder, right next to our Jupyter notebook, we can easily load it. Otherwise we'll have to supply the full path to this file. So this returns a data frame object, which is like an Excel spreadsheet. Let me show you. So we store it here, and then we can simply type df to inspect it. So one more time, let's run this program. Here's our data frame with these rows and columns. So we have rank, name, platform, and so on. Now this data frame object has lots of attributes and methods that we're not going to cover in this tutorial; that's really beyond the scope of what we're going to do. So I'll leave it up to you to read the pandas documentation or follow other tutorials to learn about pandas data frames. But in this lecture I'm going to show you some of the most useful methods and attributes. The first one is shape. So df.shape, let's run this one more time. So here's the shape of this data set. We have over 16,000 records and 11 columns. Technically this is a two-dimensional array of 16,000 by 11. Okay, now you can see here we have another segment for writing code, so we
don't have to write all the code in the first segment. So here in the second segment, we can call one of the methods of the data frame, that is df.describe. Now when we run this program, we can see the output for each segment right next to it. So here's our first segment; here we have these three lines and this is the output of the last line. Below that we have our second segment; here we're calling the describe method, and right below that we have the output of this segment. So this is the beauty of Jupyter: we can easily visualize our data. Doing this with VS Code and terminal windows is really tedious and clunky. So what is this describe method returning? Basically it's returning some basic information about each column in this data set. So as you saw earlier, we have columns like rank, year, and so on. These are the columns with numerical values. Now for each column we have the count, which is the number of records in that column. You can see our rank column has 16,598 records, whereas the year column has 16,327 records. So this shows that some of our records don't have a value for the year column; we have null values. So in a real data science or machine learning project, we'll have to use some techniques to clean up our data set. One option is to remove the records that don't have a value for the year column, or we can assign them a default value. That really depends on the project. Now another attribute for each column is mean, so this is the average of all the values. Now in the case of the rank column this value doesn't really matter, but look at the year. So the average year for all these video games in our data set is 2006, and this might be important in the problem we're trying to solve. We also have standard deviation, which is a measure to quantify the amount of variation in our set of values. Below that we have min. As an example, the minimum value for the year column is 1980.
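To make the point about count and missing values concrete, here is a small sketch that uses a made-up four-row table instead of the real vgsales.csv. The column names and values are assumptions chosen for illustration, and filling with the median is just one possible default, not something the video prescribes.

```python
import pandas as pd

# Made-up miniature stand-in for vgsales.csv; one game has no Year value.
df = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4],
        "Name": ["Game A", "Game B", "Game C", "Game D"],
        "Year": [2006.0, 1985.0, None, 2008.0],
    }
)

print(df.shape)  # (number of rows, number of columns)

# describe() summarizes the numeric columns only (Rank and Year here).
stats = df.describe()
print(stats)

# count ignores missing values, so Year reports fewer entries than Rank.
print(stats.loc["count", "Rank"], stats.loc["count", "Year"])

# The two clean-up options mentioned above:
dropped = df.dropna(subset=["Year"])               # remove incomplete rows
filled = df.fillna({"Year": df["Year"].median()})  # or assign a default value
```

Which option is appropriate depends on the project: dropping rows loses data, while filling in a default can distort statistics such as the mean.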
So quite often when we work with a new data set, we call the describe method to get some basic statistics about our data. Let me show you another useful attribute. So in the next segment let's type df.values. Let's run this. As you can see, this returns a two-dimensional array. This square bracket indicates the outer array and the second one represents the inner array. So the first element in our outer array is an array itself. These are the values in this array, which basically represent the first row in our data set, so the video game with ranking 1, which is called Wii Sports. So this was a basic overview of pandas data frames. In the next lecture I'm going to show you some of the useful shortcuts of Jupyter.

In this lecture I'm going to show you some of the most useful shortcuts in Jupyter. Now the first thing I want you to pay attention to is this green bar on the left. This indicates that this cell is currently in the edit mode, so we can write code here. Now if we press the Escape key, green turns to blue, and that means this cell is currently in the command mode. So basically the activated cell can be either in the edit mode or the command mode. Depending on the mode, we have different shortcuts. So here we're currently in the command mode. If we press H, we can see the list of all the keyboard shortcuts. Right above this list you can see Mac OS modifier keys. These are the extra keys that we have on a Mac keyboard. If you're a Windows user, you're not going to see this. So as an example, here is the shape of the Command key, this is Control, this is Option, and so on. With this guideline you can easily understand the shortcut associated with each command. Let me show you. So here we have all the commands when a cell is in the command mode. For example, we have this command, open the command palette. This is exactly like the command palette that we have in VS Code. Here's the shortcut to execute this command, that is Command, Shift, and F. Okay, so here we have lots of shortcuts. Of course you're not going
to use all of them all the time, but it's good to have a quick look here to see what is available for you. With these shortcuts you can write code much faster. So let me show you some of the most useful ones. I'm going to close this. Now with our first cell in the command mode, I'm going to press B, and this inserts a new cell below this cell. We can also go back to our first cell, press Escape, now the cell is in the command mode, and we can insert an empty cell above this cell by pressing A. So either A or B, A for above and B for below. Okay, now if you don't want this cell, you can press D twice to delete it, like this. Now in this cell I'm going to print a hello world message, so print hello world. Now to run the code in this cell, we can click on the Run button here. So here's our print function, and right below that you can see the output of this function. But note that when you run a cell, this will only execute the code in that cell. In other words, the code in other cells will not be executed. Let me show you what I mean. So in the cell below this cell, I'm going to delete the call to the describe method. Instead I'm going to print ocean. Now I'm going to put the cursor back in this cell where we print the hello world message and run this cell. So you can see hello world is displayed here, but the cell below is still displaying the describe table, so we don't see the changes here. Now to solve this problem, we can go to the Cell menu on the top and run all cells together. This can work for small projects, but sometimes you're working with a large data set, so if you want to run all these cells together, it's going to take a lot of time. That is the reason Jupyter saves the output of each cell, so we don't have to rerun that code if it hasn't changed. So this notebook file that we have here includes our source code organized in cells as well as the output for each cell. That is why it's different from a regular .py file where we only have the source code. Here we also have autocompletion and IntelliSense. So in the
cell, let's type df, our data frame, dot. Now if you press Tab, we can see all the attributes and methods in this object. So let's call describe. Now with the cursor on the name of the method, we can press Shift and Tab to see this tooltip that describes what this method does and what parameters it takes. So here in front of Signature you can see the describe method; these are the parameters and their default values, and right below that you can see the description of what this method does. In this case, it generates descriptive statistics that summarize the central tendency and so on. Similar to VS Code, we can also convert a line to a comment by pressing Command and slash on Mac or Control slash on Windows. Like this. Now this line is a comment. We can press the same shortcut one more time to remove the comment. So these were some of the most useful shortcuts in Jupyter. Now over the next few lectures we're going to work on a real machine learning project, but before we get there, let's delete all the cells here so we start with only a single empty cell. So here in this cell, first I'm going to press the Escape button. Now the cell is blue, so we are in the command mode, and we can delete the cell by pressing D twice. There you go. Now the next cell is activated and it's in the command mode, so let's delete this as well. We have two more cells to delete. There you go, and the last one, like this. So now we have an empty notebook with a single cell.

Hey guys, I just wanted to let you know that I have an online coding school at codewithmosh.com where you can find plenty of courses on web and mobile development. In fact, I have a comprehensive Python course that teaches you everything about Python from the basics to more advanced concepts. So after you watch this tutorial, if you want to learn more, you may want to look at my Python course. It comes with a 30-day money-back guarantee and a certificate of completion you can add to your resume. In case you're interested, the link is below this video.

Over the next few
lectures we're going to work on a real machine learning project. Imagine we have an online music store. When our users sign up, we ask their age and gender, and based on their profile we recommend various music albums they're likely to buy. So in this project we want to use machine learning to increase sales. So we want to build a model. We feed this model with some sample data based on the existing users. Our model will learn the patterns in our data so we can ask it to make predictions. When a user signs up, we tell our model, hey, we have a new user with this profile, what is the kind of music that this user is interested in? Our model will say jazz or hip hop or whatever, and based on that we can make suggestions to the user. So this is the problem we're going to solve. Now back to the list of steps in a machine learning project. First we need to import our data. Then we should prepare or clean it. Next we select a machine learning algorithm to build a model. We train our model and ask it to make predictions. And finally we evaluate our algorithm to see its accuracy. If it's not accurate, we either fine-tune our model or select a different algorithm. So let's focus on the first step. Download the CSV file below this video. This is a very basic CSV that I've created for this project. It's just some random made-up data, it's not real. So we have a table with three columns: age, gender, and genre. Gender can either be one, which represents a male, or zero, which represents a female. Here I'm making a few assumptions. I'm assuming that men between 20 and 25 like hip hop, men between 26 and 30 like jazz, and after the age of 30 they like classical music. For women, I'm assuming that if they're between 20 and 25 they like dance music, if they're between 26 and 30 they like acoustic music, and just like men, after the age of 30 they like classical music. Once again, this is a made-up pattern; it's not the representation of the reality. So let's go ahead and download this CSV. Click on this dot dot icon here and
download this file. In my Downloads folder, here we have this music.csv. I'm going to drag and drop this onto the desktop because that's where I've stored this hello world notebook. So I want you to put the CSV file right next to your Jupyter notebook. Now back to our notebook, we need to read the CSV file. So just like before, first we need to import the pandas module, so import pandas as pd, and then we'll call pd.read_csv, and the name of our file is music.csv. As you saw earlier, this returns a data frame, which is a two-dimensional array similar to an Excel spreadsheet. So let's call that music_data. Now let's inspect this music_data to make sure we loaded everything properly. So run. So here's our data frame. Beautiful. Next we need to prepare or clean the data, and that's the topic for the next lecture.

The second step in a machine learning project is cleaning or preparing the data, and that involves tasks such as removing duplicates, null values, and so on. Now in this particular data set we don't have to do any kind of cleaning because we don't have any duplicates, and as you can see, all rows have values for all columns, so we don't have null values. But there is one thing we need to do. We should split this data set into two separate data sets: one with the first two columns, which we refer to as the input set, and the other with the last column, which we refer to as the output set. So when we train a model, we give it two separate data sets, the input set and the output set. The output set, which is in this case the genre column, contains the predictions. So we're telling our model that if we have a user who's 20 years old and is a male, they like hip hop. Once we train our model, then we give it a new input set. For example, we say, hey, we have a new user who is 21 years old and is a male, what is the genre of the music that this user probably likes? As you can see, in our input set we don't have a sample for a 21-year-old male, so we're going to ask our model to predict that.
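The split into an input set and an output set described above can be sketched with pandas as follows. The rows here are hypothetical and only follow the made-up pattern from the tutorial; the real music.csv is not reproduced, and the exact genre labels are an assumption. Dropping a column and selecting a column with square brackets are one common way to do the split.

```python
import pandas as pd

# Hypothetical rows following the made-up pattern described above
# (gender: 1 = male, 0 = female). Not the contents of the real music.csv.
music_data = pd.DataFrame(
    {
        "age": [20, 23, 25, 26, 31, 20, 25, 27, 34],
        "gender": [1, 1, 1, 1, 1, 0, 0, 0, 0],
        "genre": [
            "HipHop", "HipHop", "HipHop", "Jazz", "Classical",
            "Dance", "Dance", "Acoustic", "Classical",
        ],
    }
)

# Input set: every column except the one we want to predict.
# drop() returns a new data frame and leaves music_data unchanged.
X = music_data.drop(columns=["genre"])

# Output set: the single column that holds the answers.
y = music_data["genre"]

print(X.head())
print(y.head())
```

By convention the input set is named with a capital X and the output set with a lowercase y; both keep the same row order, so row i of X corresponds to entry i of y.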
That is the reason we need to split this data set into two separate sets, input and output. So back to our code. This data frame object has a method called drop. Now if you put the cursor on the method name and press Shift and Tab, you can see this tooltip. So this is the signature of this drop method; these are the parameters that we can pass here. The parameter we're going to use in this lecture is columns, which is set to None by default. With this parameter we can specify the columns we want to drop. So in this case we set columns to an array with one string, genre. Now this method doesn't actually modify the original data set. In fact it will create a new data set but without this column. By convention we use a capital X to represent that data set, so capital X equals this expression. Now let's inspect X. As you can see, our input set, or X, includes these two columns, age and gender. It doesn't have the output or predictions. Next we need to create our output set. So once again we start with our data frame, music_data. Using square brackets we can get all the values in a given column, in this case genre. Once again this returns a new data set. By convention we use a lowercase y to represent that, so that is our output data set. Let's inspect that as well. So in this data set we only have the predictions or the answers. So we have prepared our data. Next we need to create a model using an algorithm.

The next step is to build a model using a machine learning algorithm. There are so many algorithms out there, and each algorithm has its pros and cons in terms of performance and accuracy. In this lecture we're going to use a very simple algorithm called decision tree. Now the good news is that we don't have to explicitly program these algorithms; they're already implemented for us in a library called scikit-learn. So here on the top, from sklearn.tree let's import DecisionTreeClassifier. So sklearn is the package that comes with the scikit-learn library. This is the most popular machine learning
library in Python. In this package we have a module called tree, and in this module we have a class called DecisionTreeClassifier. This class implements the decision tree algorithm. Okay, so now we need to create a new instance of this class. So at the end, let's create an object called model and set it to a new instance of DecisionTreeClassifier, like this. So now we have a model. Next we need to train it so it learns patterns in the data, and that is pretty easy. We call model.fit. This method takes two data sets, the input set and the output set, so they are capital X and y. Now finally we need to ask our model to make a prediction. So we can ask it, what is the kind of music that a 21-year-old male likes? Now before we do that, let's temporarily inspect our initial data set, that is music_data. So look what we got here. As I told you earlier, I've assumed that men between 20 and 25 like hip hop music, but here we only have three samples for men, aged 20, 23, and 25. We don't have a sample for a 21-year-old male. So if we ask our model to predict the kind of music that a 21-year-old male likes, we expect it to say hip hop. Similarly, I've assumed that women between 20 and 25 like dance music, but we don't have a sample for a 22-year-old female. So once again, if we ask our model to predict the kind of music that a 22-year-old woman likes, we expect it to say dance. So with these assumptions, let's go ahead and ask our model to make predictions. So let's remove the last line, and instead we're going to call model.predict. This method takes a two-dimensional array. So here's the outer array; in this array each element is an array. So I'm going to pass another array here, and in this array I'm going to pass a new input set, a 21-year-old male, so 21 comma 1. That is like a new record in this table. Okay, so this is one input set. Let's pass another input set for a 22-year-old female. So here's another array; here we add 22 comma 0. So we're asking our model to make two predictions at the same time. We
get the result and store it in a variable called predictions, and finally let's inspect that in our notebook. Run. Look what we got. Our model is saying that a 21-year-old male likes hip hop and a 22-year-old female likes dance music. So our model could successfully make predictions here. Beautiful. But wait a second, building a model that makes predictions accurately is not always that easy. As I told you earlier, after we build a model we need to measure its accuracy, and if it's not accurate enough, we should either fine-tune it or build a model using a different algorithm. So in the next lecture I'm going to show you how to measure the accuracy of a model.

In this lecture I'm going to show you how to measure the accuracy of your models. Now in order to do so, first we need to split our data set into two sets, one for training and the other for testing, because right now we're passing the entire data set for training the model and we're using two samples for making predictions. That is not enough to calculate the accuracy of a model. A general rule of thumb is to allocate 70 to 80 percent of our data for training and the other 20 to 30 percent for testing. Then instead of passing only two samples for making predictions, we can pass the data set we have for testing. We'll get the predictions, and then we can compare these predictions with the actual values in the test set. Based on that we can calculate the accuracy. That's really easy. All we have to do is to import a couple of functions and call them in this code. Let me show you. So first on the top, from sklearn.model_selection we import a function called train_test_split. With this function we can easily split our data set into two sets for training and testing. Now right here after we define the X and y sets, we call this function, so train_test_split. We give it three arguments: X, y, and a keyword argument that specifies the size of our test data set, so test_size, and we set it to 0.2. So we are allocating 20 percent of
our data for testing. Now this function returns a list, so we can unpack it into four variables right here: X_train, X_test, y_train, and y_test. So the first two variables are the input sets for training and testing, and the other two are the output sets for training and testing. Now when training our model, instead of passing the entire data set, we want to pass only the training data set, so X_train and y_train. Also when making predictions, instead of passing these two samples, we pass X_test, so that is the data set that contains input values for testing. Now we get the predictions. To calculate the accuracy, we simply have to compare these predictions with the actual values we have in our output set for testing. That is very easy. First on the top we need to import a function, so from sklearn.metrics import accuracy_score. Now at the very end we call this function, so accuracy_score, and give it two arguments: y_test, which contains the expected values, and predictions, which contains the predicted values. Now this function returns an accuracy score between zero and one, so we can store it here and simply display it on the console. So let's go ahead and run this program. So the accuracy score is one, or 100 percent. But if we run this one more time, we may see a different result, because every time we split our data set into training and test sets we'll have different data sets, because this function randomly picks data for training and testing. Let me show you. So put the cursor in the cell. Now you can see this cell is activated. Note that if you click this button here, it will run this cell and also insert a new cell below this cell. Let me show you. So if I go to the second cell, press the Escape button, now we are in the command mode, press D twice, okay, now it's deleted. If we click the Run button, you can see this code was executed and now we have a new cell. So if you want to run our first cell multiple
times, every time we have to click this and then run it, and then click again and run it. It's a little bit tedious, so I'll show you a shortcut: activate the first cell and press Control and Enter. This runs the current cell without adding a new cell below it. So back here, let's run it multiple times. Okay, now look: the accuracy dropped to 0.75. It's still good. So the accuracy score here is somewhere between 75 and 100 percent. But let me show you something. If I change the test size from 0.2 to 0.8, essentially we're using only 20 percent of our data for training this model, and we're using the other 80 percent for testing. Now let's see what happens when we run this cell multiple times. So Control and Enter. Look, the accuracy immediately dropped to 0.4. One more time: now 46 percent, 40, 26. It's really, really bad. The reason this is happening is that we are using very little data for training this model. This is one of the key concepts in machine learning: the more data we give to our model, and the cleaner that data is, the better the result we get. If we have duplicates, irrelevant data, or incomplete values, our model will learn bad patterns in our data. That is why it's really important to clean our data before training our model. Now let's change this back to 0.2 and run this one more time. Okay, now the accuracy is 1. Then 75 percent. Now we drop to 50. Again, the reason this is happening is that we don't have enough data. Some machine learning problems require thousands or even millions of samples to train a model. The more complex the problem is, the more data we need. For example, here we're only dealing with a table of three columns, but if you want to build a model to tell if a picture is a cat or a dog or a horse or a lion, we'll need millions of pictures. The more animals we want to support, the more pictures we need. In the next lecture we're going to talk about model persistence.

So this is a very basic implementation of building and training a model to make predictions. Now, to simplify things, I have removed all the code
that we wrote in the last lecture for calculating the accuracy, because in this lecture we're going to focus on a different topic. So basically we import our data set, create a model, train it, and then ask it to make predictions. Now, this piece of code that you see here is not what we want to run every time we have a new user, or every time we want to make recommendations to an existing user, because training a model can sometimes be really time consuming. In this example we're dealing with a very small data set that has only 20 records, but in real applications we might have a data set with thousands or millions of samples. Training a model for that might take seconds, minutes, or even hours. That is why model persistence is important. Once in a while we build and train our model and then save it to a file. Next time we want to make predictions, we simply load the model from the file and ask it to make predictions. That model is already trained; we don't need to retrain it. It's like an intelligent person. So let me show you how to do this. It's very, very easy.

On the top, from the sklearn.externals module we import joblib. This joblib object has methods for saving and loading models. So after we train our model, we simply call joblib.dump and give it two arguments: our model and the name of the file in which we want to store this model. Let's call that music-recommender.joblib. That's all we have to do. Now, temporarily, I'm going to comment out this line; we don't want to make any predictions, we just want to store our trained model in a file. So let's run this cell with Control and Enter. Okay, look: in the output we have an array that contains the name of our model file. This is the return value of the dump method. Now back to our desktop. Right next to my notebook you can see our joblib file. This is where our model is stored. It's simply a binary file. Now back to our Jupyter notebook. As I told you before, in a real application we don't want to train a model every time, so
let's comment out these few lines. I've selected these few lines; on Mac we can press Command and slash, on Windows Control and slash. Okay, these lines are commented out. Now, this time, instead of dumping our model we're going to load it, so we call the load method. This time we don't have a model to pass; we simply pass the name of our model file. This returns our trained model. Now with these two lines we can simply make predictions. Earlier we assumed that men between 20 and 25 like hip-hop music. Let's print predictions and see if our model is behaving correctly or not. So Control and Enter. There you go. So this is how we persist and load models.

Earlier in this section I told you that decision trees are the easiest to understand, and that's why we started machine learning with decision trees. In this lecture we're going to export our model in a visual format, so you will see how this model makes predictions. That is really, really cool. Let me show you. Once again I've simplified this code, so we simply import our data set, create input and output sets, create a model, and train it. That's all we are doing. Now I want you to follow along with me. Type everything exactly as I show you in this lecture. Don't worry about what everything means; we're going to come back to it shortly.

On the top, from sklearn import tree. This object has a method for exporting our decision tree in a graphical format. So after we train our model, let's call tree.export_graphviz. Now here are a few arguments we need to pass. The first argument is our model. The second is the name of the output file. Here we're going to use keyword arguments, because this method takes so many parameters and we want to selectively pass keyword arguments without worrying about their order. So the parameter we're going to set is out_file. Let's set this to music-recommender.dot. This is the DOT format, which is a graph description language; you'll see that shortly. The other parameter we want to set is feature_names. We set this to an array of two strings, age and gender. These are the features, or the columns, of our data set, so they are the properties or features of our data. Okay, the other parameter is class names, so class_names. We should set this to the list of classes or labels we have in our output data set, like hip hop, jazz, classical, and so on. This y data set includes all the genres, or all the classes, of our data, but they're repeated a few times in this data set. So here we call y.unique(); this returns the unique list of classes. Now we should sort this alphabetically, so we call the sorted function and pass it the result of y.unique(). The next parameter is label. We set this to the string 'all'. Once again, don't worry about the details of these parameters; we're going to come back to this shortly. So set label to 'all', then rounded to True, and finally filled to True. So this is the end result. Now let's run this cell using Control and Enter. Okay, here we have a new file, music-recommender.dot. That's a little bit funny.

We want to open this file with VS Code, so drag and drop it into a VS Code window. Okay, here's the DOT format. It's a textual language for describing graphs. Now, to visualize this graph we need to install an extension in VS Code. On the left side, click the Extensions panel and search for dot. Look at the second extension here, Graphviz (dot) language by Stephanvs. Go ahead and install this extension and then reload VS Code. Once you do that, you can visualize this dot file. So let me close this tab. All right, look at this dot-dot-dot here on the right side. Click this. You should have a new menu. Open Preview to the Side: click that. All right, here's the visualization of our decision tree. Let's close the dot file. There you go. This is exactly how our model makes predictions. We have this binary tree, which means every node can have a maximum of two children. On top of each node we have a condition. If this condition is true, we go to the child node on the left
side; otherwise we go to the child node on the right side. So let's see what's happening here. The first condition is age less than or equal to 30.5. If this condition is false, that means the user is older than 30, so the genre of music they're interested in is classical. Here we're classifying people based on their profile; that is the reason we have the word class here. So a user who is older than 30 belongs to the class of classical, or people who like classical music. Now what if this condition is true? That means the user is 30 or younger, so now we check the gender. If it's less than 0.5, which basically means it equals 0, then we're dealing with a female, so we go to the child node here. Now once again we have another condition. We're dealing with a female who is 30 or younger. Once again we need to check their age: is the age less than 25.5? If that's the case, then that user likes dance music; otherwise they like acoustic music. So this is the decision tree that our model uses to make predictions. Now, if you're wondering why we have these floating point numbers like 25.5, these are basically the rules that our model generates based on the patterns it finds in our data set. As we give our model more data, these rules will change, so they're not always the same. Also, the more columns or features we have, the more complex our decision tree is going to get. Currently we have only two features, age and gender.

Now back to our code. Let me quickly explain the meaning of all these parameters. We set filled to True, so each box or each node is filled with a color. We set rounded to True, so they have rounded corners. We set label to 'all', so every node has labels that we can read. We set class_names to the unique list of genres, and that's for displaying the class for each node right here. And we set feature_names to age and gender, so we can see the rules in our nodes.

Hey, thank you for watching my tutorial. I hope you learned a lot and you're excited to learn more.
If you enjoyed this tutorial, please like and share it with others, and be sure to subscribe to my channel, as I upload new videos every week. Once again, thank you, and I wish you all the best. [Music]
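The last three lectures in the transcript (measuring accuracy, persisting a model, and exporting the decision tree) can be sketched in one short script. This is a minimal sketch under stated assumptions, not the instructor's exact notebook: the inline data frame is a hypothetical stand-in for the tutorial's CSV file, the random_state arguments are added only to make the run repeatable (the transcript deliberately omits them to show the score changing between runs), and the files are written to a temporary directory rather than next to the notebook. It also imports the standalone joblib package, because recent scikit-learn releases no longer provide sklearn.externals.joblib.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical stand-in for the tutorial's data set (gender: 1 = male, 0 = female).
music_data = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
              + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3),
})
X = music_data[["age", "gender"]]  # input set
y = music_data["genre"]            # output set

# 1. Measure accuracy: hold out 20 percent of the rows for testing.
#    random_state is an assumption added for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)  # fraction of correct predictions
print(f"accuracy: {score:.2f}")

# 2. Persist: train once on all rows, save to a file, then load it back
#    and predict without retraining.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "music-recommender.joblib")
full_model = DecisionTreeClassifier(random_state=0).fit(X, y)
joblib.dump(full_model, model_path)

loaded_model = joblib.load(model_path)
samples = pd.DataFrame([[21, 1], [22, 0]], columns=["age", "gender"])
loaded_predictions = loaded_model.predict(samples)
print(loaded_predictions)

# 3. Visualize: export the trained tree in DOT format with the same
#    keyword arguments the transcript describes.
dot_path = os.path.join(out_dir, "music-recommender.dot")
export_graphviz(
    full_model,
    out_file=dot_path,
    feature_names=["age", "gender"],
    class_names=sorted(y.unique()),
    label="all",
    rounded=True,
    filled=True,
)
print(f"tree written to {dot_path}")
```

Removing random_state from the train_test_split call and rerunning the script should reproduce the run-to-run variation in accuracy that the transcript demonstrates, and raising test_size toward 0.8 should show the drop caused by training on too little data.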
Info
Channel: Programming with Mosh
Views: 1,605,691
Keywords: machine learning python, machine learning tutorial, machine learning, python, python tutorial, jupyter notebook, data science, python data science, python tutorial advanced, data science python, python machine learning, artificial intelligence, programming with mosh, mosh hamedani, code with mosh, jupyter, machine learning tutorial for beginners, machine learning with python, data science tutorial, machine learning course, python programming, introduction to machine learning
Id: 7eh4d6sabA0
Length: 49min 43sec (2983 seconds)
Published: Thu Sep 17 2020