Exploratory Data Analysis With Pandas || Python Machine Learning PT.1

Captions
So the first part of any good data science project is exploratory data analysis, and that's exactly what we're going to be going through in this video with pandas, Seaborn, and Matplotlib. Alrighty, so we're going to be going through a couple of key things today. First and foremost, we're going to cover how to load data into Python using pandas. We're then going to perform some exploratory data analysis, so we'll calculate summary statistics, we'll visualize unique values, and a whole bunch of other things. And last but not least, we'll also visualize our data using Seaborn and Matplotlib, so we'll analyze our trend and we'll also calculate some correlation statistics. In terms of the specific scenario that we're going to be tackling, we've got a bit of a mock scenario set up: a business has approached us with some data, and they want to be able to forecast transactions. Specifically, the data that they've given us is a list of accounting transactions, so in there we've got accounts, we've got cost centers, and we've got year and month columns. What they want to be able to do in the end is predict what the value of each one of those specific accounts is going to be in the future. Ultimately, they want to take the machine learning model that we build and integrate it into an app, so they can expose it to their front-end business users; all the users need to do is click a few buttons, hit predict, and have their prediction shown to them. There are quite a few moving parts in this particular scenario, so we're going to start off with exploratory data analysis, and we'll also cover the CRISP-DM model. If you've never heard of the CRISP-DM model before, don't stress, we're going to cover it a little later on in the video. Let's get to coding already! So, as mentioned, the goal of this entire machine learning regression series is to be able to build an app that allows a user to enter some fields, hit predict, and get a result back from a machine learning model.
Really, this entire series is going to encompass the entire data science lifecycle, all the way from business understanding through to deployment and finally integrating it into an application. In terms of the app that we're going to be building by the end of this series, we'll be able to choose a particular year (in this case I've chosen 2022), choose a month, choose a cost center, choose an account, and hit predict, and what we get back is a machine learning prediction. In this case, because we're using regression, we're predicting a continuous value, so you can see that as a result of choosing these particular fields (2022, April, cost center CC200, and account 3000000) and hitting predict, I've got a result of $374. If I change this to, for example, a different cost center, say CC301, and hit predict, then I should get a different prediction, and you can see that the prediction has been updated; we can also see it in our visualization. We'll be covering building this app right at the end of the series, but today we're going to be focusing on the first two rungs of the CRISP-DM model: business understanding and data understanding. So let's jump into a Jupyter notebook and start laying out the framework for our machine learning model. In this case, I'm just working inside of Jupyter Notebook, so all I need to do is hit New, then Python 3. The first thing that we're going to do is name our notebook, so we'll call it Regression Model. Good practice whenever working in notebooks is to include lots of documentation, so that others can pick up your notebook and so that you remember what you've actually gone and done. In this particular case, what we're going to be doing is following the CRISP-DM model of data science. Now, if you haven't heard of the CRISP-DM model before, the best way to remember the steps in it is this mnemonic:
"Barry drove directly to the medical emergency department." Each word represents a step in the data science lifecycle, and ideally these are steps that you should consistently take whenever you're performing any machine learning or deep learning task. Specifically, they stand for Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment. Each one of these steps helps ensure that you get a good machine learning model and that you set yourself up for success as part of your data science project. So let's add those into this particular cell and lay out our notebook so that we've got that structure set up. All right, so that's our framework laid out. Now, the first step is business understanding. There's no code involved in this particular step, because it's all about understanding what our end outcome is. If we step into this cell, we can add in a few notes. In this particular case, a business has approached us and asked us to help out in forecasting their accounting transactions. Because we're looking at transaction values, we know that this particular task is going to be a regression task, since we're trying to predict a continuous value. We can add in a few notes to help us remember what our business understanding step is. In this particular case, the company has also told us that they've only got data for roughly three years, and their data quality is okay, but they're not too sure. All right, so that's our business understanding step done really, really quickly. Ideally, when you're performing this business understanding step, it's really good to speak to people that are working within the business and find out what factors are influencing certain outcomes. In this case, we've just done a quick business understanding walkthrough.
Our next step, which is going to be the main focus for this video, is our data understanding step. As part of it, we'll read in our data, we'll start pre-processing it, and we'll also use some visualizations to get a feel for what we've actually got on our hands. The first thing that we're going to do is bring in our dataset, and we can do that using pandas. So we've used the import statement to bring in pandas as pd, and now we can use the read_csv function to bring in our dataset. In this particular case, the dataset that we've got is one CSV, and it's just called regression.csv, so that's all we need to pass. Okay, so we've successfully read in our dataset, but you can see that we haven't actually started looking at what it looks like yet. The first thing that you should almost always do whenever you bring in a dataset is just run the head function; this is going to show you the first five rows of data. Perfect. So, as the business was saying, they've got some accounting transactions, and it looks like we've got a year dimension, a month column, a cost center column, an account column, an account description, an account type, and an amount column. On first observation, it looks pretty clear that our target column is going to be amount, so we're going to be trying to predict the amount, and our feature columns are going to be all of the others: year, month, cost center, account, account description, and account type. We can also take a look at the last five rows within our dataset using the tail function. Perfect. Cool, so it looks like we've got a bunch of different types of accounts, we've got data up until 2021 (which seems pretty consistent with us having three years' worth of data), and it looks like we've got some different cost centers and different accounts as well. You're starting to see that a large part of data understanding is really working out what data we've got on hand.
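As a rough sketch of this loading step: in the video the data comes from `pd.read_csv('regression.csv')`, but since that file isn't available here, the snippet below builds a small mock frame instead. The column names are assumptions taken from the narration.

```python
import pandas as pd

# In the video the data is loaded from a CSV:
#   df = pd.read_csv('regression.csv')
# Here we build a small mock frame with the same (assumed) columns
# so the snippet is self-contained.
df = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2021, 2021],
    'month': ['January', 'February', 'January', 'March', 'April', 'December'],
    'cost_center': ['CC200', 'CC200', 'CC301', 'CC301', 'CC200', 'CC301'],
    'account': [3000000, 4000001, 3000000, 1000000, 4000001, 3000000],
    'account_description': ['Cash', 'Accounts Payable', 'Cash',
                            'Product Sales', 'Accounts Payable', 'Cash'],
    'account_type': ['Asset', 'Liability', 'Asset',
                     'Revenue', 'Liability', 'Asset'],
    'amount': [1020.0, -350.5, 874.25, 901.0, -120.75, 2378.0],
})

first_rows = df.head()   # first five rows
last_rows = df.tail()    # last five rows
print(first_rows)
```

Here `amount` is the target and the remaining columns are candidate features, exactly as discussed above.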
We also need to check whether there are any issues in it. One of the things that's really important whenever you're performing any machine learning project is to actually evaluate the quality of your data, and one good check is to see whether or not you've got any missing values. This is a pretty easy check using the df.info function: we can just type in df.info() and we can see that it looks like we've got no missing values (everything is saying non-null), and it looks like we've got 4,212 values within our dataset. Another good check that you can perform is to check the uniqueness of the values within each column; this gives you a feel for how many different categories you're going to have within your data. We can do that using a quick loop, where we loop through our data frame columns. If we access the df.columns value, you can see that it just returns an index of all of our columns: year, month, cost center, account, account description, account type, and amount, which correspond to the columns we saw up there. What we can do now is loop through these columns and check the uniqueness for each one of them. If we were to do this for one single column, we could just type in, say, df['account'].unique(), and you can see we've returned an array of unique values; this is a complete list of all the unique accounts that we've got within our data frame. Rather than doing this for each individual column, we can just loop through them all and print them out to the screen. Perfect. So you can see that we've printed out the name of each column, how many unique values there are, and an example of all of the unique values. In our year column, we've got three years, as the client promised: 2019, 2020, and 2021.
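The missing-value and uniqueness checks just described can be sketched like this, again on a mock frame with assumed column names:

```python
import pandas as pd

# Mock frame standing in for the transactions data (column names assumed).
df = pd.DataFrame({
    'account': [3000000, 4000001, 3000000, 1000000],
    'account_type': ['Asset', 'Liability', 'Asset', 'Revenue'],
    'amount': [1020.0, -350.5, 874.25, 901.0],
})

# Missing-value check: info() prints a non-null count per column.
df.info()

# Uniqueness check: loop through every column and report how many
# distinct values it holds, plus an example of those values.
unique_counts = {}
for col in df.columns:
    uniques = df[col].unique()
    unique_counts[col] = len(uniques)
    print(f'{col}: {len(uniques)} unique values -> {uniques}')
```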
In our month column, we've got all of the months: January, February, March, April, May, June, July, August, September, October, November, and December. What we might do later on is convert these months into a period index, or rather than using the word, convert them to a numeric value for visualization; that's just a point to note as we head into our data preparation step. We can see that we've got a number of different cost centers (nine of them), a few accounts (we saw those already), and a few account descriptions. Notice that the number of account descriptions matches the number of accounts, which potentially tells us that we've got two columns representing the same data; later on, when we get into our data preparation step, we might choose to drop one of those columns. For account type, we've only got four different types: revenue, expenses, assets, and liabilities. And for amount, we've got a whole bunch of different values, so that looks fine. The fact that we don't have the same number of amounts as transactions potentially tells us that we have some line items with a similar value, so at some point we might choose to remove some of those duplicate values or evaluate whether or not they're actually adding value to our data frame. All right, now that we've done that, another good check is to take a look at the spread of our data, and we can create some summary statistics using the df.describe function. This is really just showing our numeric values at the moment: if we check our data types, you can see that it's only been calculated for year, account, and amount, and it's excluded our month, cost center, account description, and account type. In terms of our year, this really isn't all that relevant, since it's really a categorical value; our account, again, looks like a categorical value.
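A minimal sketch of the summary-statistics check: `describe()` only covers numeric columns by default, so string columns such as the account type drop out automatically (mock data, values invented for illustration):

```python
import pandas as pd

# Mock frame: one string column, two numeric columns.
df = pd.DataFrame({
    'year': [2019, 2020, 2021, 2021],
    'account_type': ['Asset', 'Liability', 'Asset', 'Revenue'],
    'amount': [1020.0, -350.5, 874.25, 2378.0],
})

# describe() summarizes only the numeric columns by default, so
# account_type (a string column) is excluded automatically.
summary = df.describe()
print(summary)
```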
In terms of our amount, this is good to know: the minimum value is $1,020 and our maximum value is $2,378, and again we've got 4,212 values, so that all looks good. Up until now, a large majority of our data understanding has been to do with reviewing tables and looking at numbers. Whenever you're going through the data understanding step, one of the easiest ways to get a feel for your data is to start visualizing it and take a look at what the data actually looks like, so let's go ahead and do that. In order to visualize our data, we're going to be using two key libraries, Matplotlib and Seaborn, so let's import these as dependencies. All right, those are imported. The first visualization that we're going to take a look at is the spread of our transactions: what are the values and distributions of each one of our transactions across our account types? We can do that using probably one of my favorite plots, the violin plot, so let's do that. Perfect, that's our violin plot done. What we've done is pass in our x value, which is going to be our account type, and our y value, which is going to be our amount. It's looking a little bit small at the moment, so we can make it a little bit bigger and add a title as well. Perfect, that's a little bit easier to read: we've now added our title, "Account Type Violin Plot", and we've also made it a little bit bigger. So if we take a look at the spread of our transactions, we can see that our revenue accounts tend to average out around the $800 to $900 mark; expenses are pretty close, looks like they're a little bit below; our assets are looking like they're probably around that value as well; and liabilities look like they have a big spread of values, with transactions spread out from around the $900 mark down to around the negative $900 mark.
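The violin plot just described, as a sketch with mock amounts; the figure size and title follow the video's intent, but the exact `figsize` and the column names are assumptions:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Mock transactions (column names assumed from the narration).
df = pd.DataFrame({
    'account_type': ['Revenue', 'Revenue', 'Expense', 'Expense',
                     'Asset', 'Asset', 'Liability', 'Liability'],
    'amount': [900.0, 850.0, 780.0, 810.0, 860.0, 820.0, -900.0, 950.0],
})

# A violin plot shows the distribution of amounts per account type.
plt.figure(figsize=(12, 6))                      # make it a bit bigger
ax = sns.violinplot(x='account_type', y='amount', data=df)
ax.set_title('Account Type Violin Plot')         # add a title
plt.savefig('violin.png')
```

Filtering the frame first (e.g. `df[df['account_type'] == 'Liability']`) before plotting is how the narration drills into a single account type.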
All right, cool, so that tells us a little bit more about our data. What we might choose to do now is interrogate our liability account a little bit more, because it's looking like it's probably the furthest from a normal distribution, so let's take a look into that. Again, we're just going to create a violin plot, but rather than displaying everything, we're going to focus in on our liability accounts; all we need to do is add a filter to our data frame, setting the account type equal to liability. Okay, so right now we've got a violin plot for our liabilities, and it's looking a little bit weird. We can also hone in and take a look at the accounts that sit beneath that; in order to do that, we just change our x variable to be account rather than account type. Alrighty, cool, so it looks like we've only got one account in there, account 4000001. If there were a bigger spread of accounts, we might choose to perform some data transformations; for now, we're just going to leave it, because we're just getting an understanding of our data. What we might choose to do, though, is do the same for one of our other account types, so let's take a look at revenue, for example. All right, cool, so we've got a few different revenue types, so rather than choosing an account, we might choose to use account description. This is our product sales, and it looks like we're averaging around the $900 mark, similar for licensing revenue. This gives us a good feel for the different types of revenue that the particular business we're looking at has: product sales, licensing revenue, service revenue, and fee revenue. Perfect. Now, the next thing we might choose to do is take a look at the trend within our data. Up until now we've been looking at these accounts in isolation; we haven't actually been looking at the trend across the years.
What we can do is convert our date columns: if we scroll back up, we can convert our year and month into a date column and visualize our data using a period or a date. The first thing that we need to do is create a date column within our data frame, so let's go ahead and do that. In order to create a date column, we're going to need a date, but at the moment, if we take a look at our data frame, we've only got a year and a month column. What we're going to do is convert these into a date string: for this particular case, we're just going to append a day column as the first of that particular month, and then we're going to string them together. We're also going to need to convert our month, which is a string description, to an actual month number, and we can do that using a month map dictionary, so let's do that first. Perfect, that's our month map dictionary prepared. What this allows us to do is enter a month name and get an index returned, which is going to let us convert each one of these months to a period, so let's go ahead and create a period dimension. What I'm doing here is creating a new column called period, and then looping through each one of the values within our month column and applying this month map transformation. Perfect, that's now done, so if we check our data frame again, you can now see that we have a period dimension. If we check another month, you can see that we've also converted February; we could try December, for example, and again we've performed that conversion successfully. Now we're just going to quickly create another column for our day, setting day equal to one; if we check that, perfect, we've got a day.
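The month map and the period/day columns just described, sketched on a mock frame:

```python
import pandas as pd

# Month-name -> month-number lookup, as described in the narration.
month_map = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12,
}

df = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'month': ['January', 'February', 'December'],
})

# Apply the mapping to build a numeric period column, plus a
# constant day column (first of the month).
df['period'] = df['month'].apply(lambda m: month_map[m])
df['day'] = 1
print(df)
```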
Now what we're going to do is string all of those columns together, and once that's done, we'll convert it to a datetime. Cool, we've now got a date column, so if we check again, we've got a date string. But if we take that data type, right now it's not actually a datetime column, it's just a string; we can check the data types using the dtypes attribute, and you can see our date is still an object, not actually a date. We can quickly convert that using the pandas to_datetime function, and if we check our dtypes again, you can now see that our date column is truly a datetime column. All right, cool, so all of that data transformation is done. We're sort of veering into the data preparation step here, but in this case it's for our visualization, so it's okay. Now that the transformation is done, we can start to visualize by date, so let's prepare some line plot visualizations, again with Seaborn. The first one that we're going to do is take a look at revenue, because more often than not revenue tends to be the most seasonal account whenever you're working with businesses, so let's visualize revenue first. In this case, we're going to be using the line plot from Seaborn; we can access that using sns.lineplot, and again we're going to pass through our x value, our y value, and our data. That's our basic function; now, because we want to filter down and specifically focus on revenue accounts, we're again going to perform a quick filter on our data frame. Perfect, that's our high-level plot. It's not looking that great right now, so we can clean it up a little bit: specifically, we probably want to make it a bit bigger, remove these estimator bars, and actually see the individual accounts. But even without doing that yet, it does look like we've got a little bit of a trend.
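Backing up to the date assembly and dtype conversion described above, here is a sketch on a mock frame; the hyphen separator is an assumption (any string that pd.to_datetime can parse would work):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'period': [1, 2, 12],
    'day': [1, 1, 1],
})

# String the year/period/day columns together into a date string...
df['date'] = (df['year'].astype(str) + '-'
              + df['period'].astype(str) + '-'
              + df['day'].astype(str))
print(df.dtypes)   # date is still an object (plain strings)

# ...then convert to a true datetime column with pd.to_datetime.
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)   # date is now datetime64
```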
It looks like we start off high, drop a bit lower, then increase, go high, drop a little lower, and then increase again, so it looks like there might be some seasonality. In this case, we're not going to be using time series techniques; we're going to be using ensemble and linear models, but that's fine, because we can factor this in by including our dates as columns. So let's quickly change up this visualization so we can clean it up and get a better idea of when the seasonality is actually happening. All right, that's all of our accounts visualized, but right now it's looking a little bit messy, so what we might choose to do is just focus on one particular account rather than looking at all of them at the same time. Again, you can start to see that sort of seasonality: starts off high, goes lower, then goes high, apart from this particular account here, which seems to be service revenue; that one starts off high and goes up. If we focus on, say, just product sales, we might get a better idea of what's happening there. All right, that's looking a little bit better, a little bit clearer to see what's happening. It looks like we start off high in January, drop a little lower towards the middle months, and then rise again towards January, so it looks like we've got a little bit of seasonality there. Let's also take a look at service revenue, since that seemed to be bucking the trend and doing its own thing. All right, so it looks like service revenue might not be as seasonal as our product sales. Perhaps, if they're operating a consulting firm, this might be attached to the number of chargeable days in a month; a lot of the time, consulting revenue is attached to the number of chargeable days or billable hours. That's a good thing to know, and it's good to know that our trends do differ slightly between the accounts that we're taking a look at.
Again, this is just something to note as part of our data understanding step. This is a fairly high-level data understanding stage as well, so keep in mind that there are more steps that you can perform: you can dig deeper, add additional datasets, and so on. All right, so the last thing that we're going to do is take a look at correlation between our accounts, so let's set up our notebook and give it another section. Now, in our particular case, our accounts are all within one column, which means that if we were to calculate correlation, we wouldn't actually see the correlation between accounts. If we do that, you can see that we're just getting correlation between our different columns: year, account, amount, and period. In our particular case, we want to check if there's correlation between accounts. Say we're taking a look at our service revenue: perhaps our service revenue is attached to, or correlated with, our staff costs, and we want to be able to capture that relationship. Let's take a look at whether or not we've even got a staff cost column; yep, you can see that we've got a staff expenses column. Performing this type of analysis gives us a better overview of whether there's any relationship between the accounts that we've got. In order to perform it, we need to reshape our data frame a little bit: specifically, we want each account to have its own column, with the value of that particular account inside of that column. So let's reshape our data frame so that we can perform that type of analysis. In this case, we're just going to loop through each one of the rows within our data frame, create a bunch of new columns, and if that particular row matches the account column, take that amount and put it into that column. So let's go ahead and do that.
All right, the first thing that we're going to do is create a column per individual account, and you can do that really easily using the pd.get_dummies function. You can see that just by calling that function, we've got a column for each one of our individual accounts. At the moment, we've got a one showing up in each one of these columns, but we actually want the value of that particular transaction to be in the column. In order to do that, we just need to loop through our regular data frame, and if the account within a row matches a particular column, take the amount and sub it in for that particular one. The first thing to do to get started on that is to join these two data frames: we want to join our dummy data frame with our regular data frame, so let's do that. All right, that's going to perform our join, so if we run that cell, you can see that we've now joined them. Now what we want to do is actually loop through and perform that transformation. The iterrows function is going to allow us to loop through each one of these rows; if we print out our rows now, you can see that we're printing out each one of them, but we want some place to store the results of the transformation. We don't want to do it in place within our regular data frame, because we're going to want to use that data frame for our training later on. If we create a new dictionary, we can store all of our transformed results in it and then recreate a new data frame from that dictionary, so let's do that. All right, if we take a look at our dictionary now, you can see that we've got one key for each row that we want to transform; account 1000000 is showing up with 1,344, and so on. Now, the last thing that we need to do in order to perform this analysis is to turn the dictionary into a data frame and then calculate our correlation.
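The reshaping loop just described, sketched on a small mock frame; the dictionary layout (row index mapped to account-value pairs) follows the narration:

```python
import pandas as pd

# Mock transactions: two accounts across two months (values invented).
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-01',
                            '2019-02-01', '2019-02-01']),
    'account': [1000000, 3000000, 1000000, 3000000],
    'amount': [1344.0, 500.0, 1200.0, 650.0],
})

# One indicator column per account...
dummies = pd.get_dummies(df['account'])
joined = df.join(dummies)

# ...then loop through the rows and swap each indicator for the
# transaction amount, storing results keyed by row index in a dict.
transformed = {}
for idx, row in joined.iterrows():
    values = {}
    for account_col in dummies.columns:
        values[account_col] = row['amount'] if row['account'] == account_col else None
    transformed[idx] = values
print(transformed)
```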
In order to create a data frame from a dictionary, we just need to use the from_dict method and pass through our dictionary. If we now take a look at our data frame, you can see that it's prepared, but the one problem we've got is that we want our rows, which are accounts, to be our columns, and we also want to fill in all of the missing values; you can see that because there's no value at a given intersection, it just shows up as NaN (not a number). We can perform this transformation really easily: all we need to do is transpose our data frame and then fill in the missing values with a zero, and you can see we've now converted our rows, which were originally our accounts, into individual columns, and we've also filled in our missing values. Now, if we want to calculate correlation, all we need to do is call corr, and we've calculated our correlation. But that's a little bit tricky to read, so rather than just looking at a table, we can create a heat map, which seems to be the best way to visualize correlation. We can create one using the sns.heatmap function; all we need to do is pass our correlation table through to it. If we copy this and pass it through to our heat map, you can see that we've now got a heat map; again, let's make it a little bit bigger and add a title. Perfect. Cool, so it looks like there might be a little bit of a relationship between account 3000000 and pretty much all of the other accounts; it's not strong correlation, but it is there. Likewise, it looks like there's a little bit between account 4000001, which I think was a liability account, and the others. Alrighty, so let's check what those accounts actually are: if we grab a couple of rows, we can see what account 3000000 is and what account 4000001 is. Okay, so account 3000000 looks to be cash.
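The final reshape-and-correlate step, sketched from a hand-written dictionary in the shape the iterrows loop produces (accounts and values are mock):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Dictionary of transformed rows: {row_index: {account: amount_or_None}}.
transformed = {
    0: {1000000: 1344.0, 3000000: None},
    1: {1000000: None, 3000000: 500.0},
    2: {1000000: 1200.0, 3000000: None},
    3: {1000000: None, 3000000: 650.0},
}

# Rebuild a frame from the dictionary, transpose so the accounts
# become columns, and fill the gaps (NaN) with zero.
wide = pd.DataFrame.from_dict(transformed).T.fillna(0)

# Correlation between accounts, visualized as a heat map.
corr = wide.corr()
plt.figure(figsize=(10, 8))
ax = sns.heatmap(corr)
ax.set_title('Account Correlation')
plt.savefig('corr.png')
```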
That kind of makes sense; cash is going to be related to a lot of the things that are happening in the business, so one is probably going to impact the other. Let's take a look at the other account as well: that's accounts payable. All right, that kind of makes sense too; if you've got changes in cash flow, then you may need to draw down on an overdraft or something, so there might be a bit of a relationship there. And that about wraps up our analysis. Just to quickly summarize: we took a look at our data, checked whether or not there were any missing values, looked at our unique values, did a bunch of visualization, looked at our trends, and last but not least, calculated our correlation. We did quite a fair bit in this video, but within any data science project, your data understanding and data preparation steps are going to be the large majority of your work. Thanks so much for tuning in, guys! Hopefully you found this video useful; if you did, be sure to give it a thumbs up, hit subscribe, and tick that bell. And if you've got any questions at all, be sure to drop them in the comments below. Peace!
Info
Channel: Nicholas Renotte
Views: 6,519
Keywords: python, pandas, matplotlib
Id: rr-KwIjinpM
Length: 29min 55sec (1795 seconds)
Published: Thu Jul 02 2020