Learn Data Analysis with Pandas

Captions
Sorry about that. Yes! Wow, this is so cool. Oh no, what happened? I forgot some 32s, I think. Yes, and this is supposed to be x2. Cool. That doesn't seem right, but I'll stare at it more later; it should be a little more than that value, and I actually got a really huge negative number. Well, neither here nor there. Interesting, but very cool, a very cool algorithm to look up and put some hands on. It's just very cool to see how programming 20 or 30 years ago was so much more about interacting with the actual brain of the computer, as opposed to now, when we're so abstracted. Python is so easy to just say something in.

I can't send you the file, Michael... actually, I don't think this file is going to open on a phone. It's almost definitely not going to open on a phone unless you have a way of opening Jupyter notebooks on your phone, but just in case, I'm going to send it to you. This Jupyter notebook is really great. Okay, I'm sending it right now. I would recommend doing this on your laptop; it's just preference. I believe PyCharm handles IPython notebooks now just the same as Jupyter Notebook does, so I don't think it really matters. For people who joined late: you can download this Jupyter notebook file and set it up so that you can follow along.

A side question: what's the difference between using a notebook versus plain Python? I just quickly answered that before: I don't think there is a big, meaningful difference anymore. The IPython notebook used to be a pretty specific format, and now it's not anymore. Thank God for open source, right? Open source is the best thing in the world, because we have this ability to constantly build off of the things that other people make, because they also got frustrated at a very similar problem. I'm going to grab a glass of water while we're doing the introductions, and then we should be ready to go. Sounds good; let me bring up the right window and start sharing. Okay, I think Sam is sharing, so I have stopped sharing. Feel free to share video. Can you see my screen now? Not yet. Hello everyone, can you see my screen now? Awesome. Are we ready to start, or should we wait a little bit? Maybe one or two more minutes. In the meanwhile, people can start downloading the file so they can follow along.

Okay, welcome everyone. My name is Vivian; I'm the school director of NYC Data Science Academy. Welcome to today's event. Before the event officially gets off the ground, I just want to introduce who we are, what today's topic is, and who today's speaker is. NYC Data Science Academy was established in 2013; we have been lucky to be in this industry for the past eight years. We recently became fully nationally accredited, which means we can accept federal financial aid and 529 funding; we also accept a lot of corporate sponsorship if you want to study with us. We offer all kinds of programs: if you're a working professional, there are a few offerings that will work well with your schedule, and if you are currently in school or very new to this field, we have offerings at all levels you may consider. The best way to figure things out is to browse our website.
We have bootcamp courses, and you can also contact us through a phone call or email; we have admissions officers and a support team to help you figure out what would be the best fit for you. So, today's event is going to focus on data analysis with pandas. Our speaker is Sam. He is an assistant instructor at the academy, and he also helps coordinate the online bootcamp; a lot of online students who are working professionals work closely with him. For today's event, feel free to use the chat, and make sure you set it to go to all panelists and attendees. We were lucky to get a quote from the founder and CEO of Course Report: with over eight years of experience teaching data science and a 4.8-out-of-5 review rating on Course Report, it's not surprising that NYC Data Science Academy has become the only data science bootcamp to receive ACCET accreditation. We are very excited for the bright future with accreditation.

As I mentioned, we offer intensive programs such as the bootcamp, and we also offer part-time plans to fit all kinds of needs. We are the only school that teaches both Python and R. We offer hands-on learning with four industry projects, and we offer lifetime job support: many students who graduated back in 2013, '14, '15 reach out to us when they look for their second or third job and keep close to our network. We have around 2,000 alumni across the globe working in industry now. Since we started running the in-person bootcamp in 2015, we have been rated the best bootcamp for the last five years, and since we started offering online programs, we have been rated the best online bootcamp.

The process to find out whether it's a good fit is pretty straightforward. You can go to our website, nycdatascience.com, pick any program, like the data science bootcamp or the data analytics program, and just click apply to give us your information for assessment. Usually there's a 15-minute to half-hour behavioral interview, and then a tech assessment to complete. The goal is not to see how well you can code; it's more to understand your thinking process, so if you don't have a coding background, that's fine: we want to see how you approach simple problems.

Before we get started, I have one or two more slides to go over. We offer remote live classes, which means the class happens live, and you also have the option to join the class in person in New York City. We also have two online programs, full-time and part-time, so based on the length and whether you can make yourself available in person or fully online, you have three different choices. Our next start date is September 27th. Our in-person and remote live bootcamp admissions are nearly closed, but I think we still have online bootcamp spots if you're interested, and for the November start we only have the online program available. For the analytics bootcamp, both the remote live and in-person options are available for September 27th and November. Feel free to ask any questions in the chat; I will support Sam along the way during the event and answer a few of them. That's mine, so, Sam, it's your turn.

I was grabbing the mouse of my desktop computer and wondering why it didn't work. So, hi everybody. As Vivian said, my name is Sam Adino. I am a bootcamp coordinator for the remote online bootcamp, the IDL program, which is interactive distance learning. If you take that program, you'll be interacting with me and some of our other wonderful IDL coordinators.
I believe the recording will be shared, and I am here today to teach you about pandas. Pandas is a library we use for data manipulation in Python; it also has visualization tools, and you can do a lot with it. It's a really fascinating library. It's built on another library called NumPy, a wonderful library that lets us do calculations very quickly. With those two combined, pandas gives you a really nice, almost graphical way to view the data, and NumPy is the powerful engine underneath that powers the whole thing. Thank you for joining once again, and we'll kick it off.

As I said, we're covering pandas, we're covering NumPy to a much smaller extent, and we're also covering how we do visualizations with them. The topics we'll cover are: getting started with pandas; reading and saving DataFrames; inspecting and manipulating DataFrames; grouping and aggregation; and data visualization.

Let me take a brief moment to point out which files you should have gotten in your download. Here in the Jupyter Notebook file viewer you can see we have our IPython notebook; we have the user guide, which is a helpful guide to our data (as we'll find out, the data is very complicated and a little encoded, so we'll need some techniques to decode it into something more useful); "images" is just a couple of images we use in the notebook; and then we have our data: you should only have the 2018 natality sampled CSV in your data folder. The files are shared via the Google Drive link in the chat, so if you don't have them, please download them from there.

Great, let's go back to the notebook and start talking about code. The first thing we have to do: with packages in Python, you don't just get them, you have to import them. You have to ask for what you want. In a sense that's nice, because it means packages can have names that are also used in other packages, which is a slightly higher-level concept we won't go too far into, but it essentially means everything is separated very cleanly: nothing interacts unless we want it to. So we go over a really basic and fundamental command, the import. When we say import numpy, I'm asking: please give me the package numpy. The numpy package is an object that contains a whole bunch of things, and we use the period (the dot) to denote when something is inside another object; such a thing is often referred to as a method or attribute of the object, of numpy here. So once we import numpy, we can take a square root using numpy.sqrt.

But what else can we do? When you have a new package, and you don't want to look through the documentation that packages so often have online, you can instead use the dir function, which stands for directory (if you're on a Windows machine you may know dir, because it's part of your command line). When we call dir on numpy, you can see there is a whole bunch of things, so many things, but a couple of them look like things that make sense to us: we have adding, we've got a max and a min, there's an "any", so we can do something boolean a little bit later, or we can make an array.
You can see there are a lot of different methods contained within the numpy package that are really useful, and I encourage you to explore a little more on your own. To continue through our code block: we say the square root of 16 is numpy.sqrt(16). Also, if I want the documentation, or to understand a function a little better, I can put my cursor inside the parentheses and hit Shift+Tab, and this gives me a big introduction to what the function does ("Return the non-negative square-root of an array, element-wise") along with all the parameters it accepts, i.e., which arguments I can put into the function. We'll bring this documentation up another couple of times, just to make sure everybody understands it.

Another important thing: we have to ask Python to print to the console. We say print, and the f before the string is essentially just formatting; we go over this more in the program, and there are notes on it here, but essentially it lets me do this nice printout: you see the square root of 16 printed neatly inside the string. Then we can also see the value of pi: numpy, being a math package, has the value of pi stored for easy use. A lot of the text in the notebook repeats what I'm saying, because it's supposed to work as a standalone document, so if you want to look back over it on your own afterward, you'll have more than enough to go on.

Now, it's really annoying to type numpy every time I want something from numpy. It's five characters, and over the course of your entire programming lifetime those three extra characters are going to suck a lot of time out of your day. It sounds crazy, but it's true: you want to shorten things as much as you can. So we'll go with a very well-established convention in the field, which is to make an alias when we import numpy. An alias is just a different name for the same thing: we say import numpy as np, and now we use the alias np to access the things stored in numpy. It's still exactly what we had before (numpy by any other name is just as sweet), and we get the same printout as above; numpy and np are equivalent.

Now that we've built that little foundation, we can talk about pandas, because importing and aliasing are the two important pieces. You could alias either of these packages as anything you'd like; it won't give you an error, but people will not enjoy interacting with your code. The np for numpy and the pd for pandas are essentially an industry-wide standard: no matter whose code you read, it will almost always be written like this. So, to fit in with the rest of the data scientists, make sure you're hitting that import numpy as np, import pandas as pd; eventually it'll be second nature and you won't want to use any other name. So we'll import pandas and then show all the versions we have with pd.show_versions().
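As a minimal sketch of what those first notebook cells might look like (the exact cell contents aren't shown in the recording, so treat this as a reconstruction):

    import numpy as np
    import pandas as pd

    # dir() lists everything a package exposes; handy for exploring
    print(len(dir(np)))

    # Use the alias to reach functions and constants inside the package
    print(f"The square root of 16 is {np.sqrt(16)}")
    print(f"The value of pi is {np.pi}")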
We see pandas, we see the installed dependencies, we see numpy, and a whole bunch of other stuff: these are the packages I have downloaded, and it shows the version of everything connected to that pandas install.

Kareem brings up a question in the chat: np.show_versions doesn't work. That's because numpy is a different package from pandas. pandas has a specific method, show_versions, written into it that is exclusive to pandas. If you want the numpy version, you do np and then version with two underscores on either side, np.__version__, and that gives you the numpy version. Hilariously, using pandas' show_versions we actually get the same result down here: it shows us which numpy we have. Good question, Kareem.

With that said, we can move forward to the actual important part. We imported the packages; the things to remember from this part are that we import pandas as pd, and that the dot accesses something inside the package. We'll use that dot quite often. Now we'll look toward working with the data, which is really the most important part, right? pandas is for data management, data wrangling, data cleaning. Here's how it's done: you read in the CSV using the pd.read_csv command. pd.read_csv is very simple; it just takes a string, where the string is where the file is located. It makes the assumption, when you run it, that you're starting from wherever your Python file is located. In this situation, it's assuming I am here in my file system, and I only need to click through to data and then to my natality sample; you can see that in the code we have "data", the file separator, and then the 2018 natality sampled .csv. I'll run this, it takes a second to load, and now we have loaded our data and we can look at it. That's our data.
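In code, the read is a single call. The variable name df matches the session; the file name is approximate, since it is only spoken aloud in the recording:

    import pandas as pd

    # The path is resolved relative to where the notebook is running
    df = pd.read_csv("data/2018_natality_sampled.csv")
    df   # in a notebook, the last expression in a cell is displayed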
As you might be able to tell if you look at this data a little more, a lot of it isn't really helpful at first glance. We have a birthplace that's a number; the days of the week are also numbers; birth time looks like a 24-hour clock, but with no separation between hours and minutes. Sometimes you'll want to open up the data and immediately jump in, but unfortunately we can't really do that here. So I want to briefly go over to another file included in the Google Drive: the user guide PDF, the user guide to this data. I know it's going to be a little small, so let me get it as big as I can. There's some information about how the data was collected, a table of contents that lets you navigate easily, some detailed technical notes, an introduction, and some really in-depth material. I won't deal too much with the in-depth stuff; I just want to show you the documentation. This documentation is like a dictionary for your data: you'll see these values over here, where for birth month 01 stands for January, 02 for February, 03 for March, 04 for April, and so on.

Reading the documentation for your data is going to be very important for actually understanding what it is. (And yes, the user guide is course-platform agnostic.) The collectors are essentially transferring all of this information into something that's stored more compactly: it's much easier to store a two-digit number than that many characters, especially at scale. We've sampled the data down, but the original data set has 3.8 million rows, and writing "Wednesday" or "January" 3.8 million times adds a huge amount of file size compared to writing "01". A data dictionary should be provided for most data sets; the majority will have one, especially if they are official or have any kind of encoding like this one does.

So we understand the data is a little complicated and we can't just immediately work with it; let's cut things down. Let's reduce our scope. That's a very important thing in data science: sometimes you need to reduce your scope and cut down on the amount of information you're looking at right now. We'll take a subset of the columns, using the strings that indicate the column names: birth year, birth month, birth time, and birth day of week. We grab these four columns by indexing. If you're familiar with Python, you know square brackets let you index into a lot of different data types; that's also true for pandas, which sticks to the Python convention that square brackets mean indexing. So we make a list of the columns we want and subset using that list; we use a list because we're working with multi-dimensional data. The df, by the way, is just a variable name: at the very beginning we saved our CSV file to df, and we call it df because it is a DataFrame.

We're doing two things here. First, an indexing: please give me back this collection of columns. That's a good way to think about it: if you want one column back you can use just a string, but if you want multiple columns back, you need a list. Second, we say sample. We have a lot of data (even sampled down to less than one percent, we still have 38,000 rows), so let's cut it down further, to 2,000. Sample won't grab the first rows; it grabs 2,000 random rows, and you can see that by the index numbers on the side: see how they're not in order? When we don't use sample and just print the DataFrame normally, the index numbers are in order. And of course, Python is zero-indexed. So we have this new itty-bitty DataFrame, much smaller and nicer to work with, with two thousand rows and four columns.

On the reproducibility question: I'm drawing a blank on the exact call, I think it's np.random.seed. To set a random seed, you say np.random.seed and then a number, like 400, and that sets the seed so you get the same answer every time. We go over things like this in the program as well, so I won't spend more time on it.
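Putting those pieces together; the column names below are plausible snake_case guesses, since only the spoken names are given:

    import numpy as np

    cols = ["birth_year", "birth_month", "birth_time", "birth_day_of_week"]

    np.random.seed(400)               # fix the randomness so reruns match
    df_subset = df[cols].sample(2000)

    # Equivalently, sample() accepts the seed directly:
    df_subset = df[cols].sample(2000, random_state=400)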
Great. So now we have this smaller data set. Let's say I did a bunch of really complicated calculations, I'm crunching all the numbers, I just spent three hours doing all this work, and I want to make sure my data gets saved, because the reality is that this is all in memory right now. If my computer shuts down for some reason, if I lose power, this isn't going to be saved. You need to explicitly save your work, and the way we do that is the to_csv command. to_csv essentially packages up this DataFrame, this df_subset, and puts it into a file. In the same way that read_csv pulls from a string path, to_csv writes to a string path: we're saying, please write, in the data folder, a file called df_write_test.csv. Let's see if it's there; I'm going to be a hacker and use a magic command.

I would say the notebook saves your code too, but when you don't save the data, you have to rerun the whole notebook. If you do something that mutates the data, that changes it, and it takes a really long time to run, you're going to have issues, because when you start that notebook up again you have to run the whole thing to get back to where you were. So it's important to save to CSV once you've changed something meaningful about the data and would like to keep a separate copy.

I did something a little funky here: this will not work if you're on a PC; you have to use different commands, and we go over that at the beginning of our course. I used Linux commands to check whether I had written the file, so what I'm seeing here is: this is a file, this is a file, these are the files in my data folder. Aiden, that symbol is like a subtraction sign, a dash, not a tilde; and again, if you're on Windows this listing command won't work. Do we export to Parquet? We mostly work with CSVs here, but yes, you can.

An important thing about CSVs is that they have a very important first line called the header; the header essentially determines the column names. You can see that when I read the CSV and set my header argument to None, a weird thing happens: my beautiful column header names are now part of my data, which is not something I want. Those column headers are not an observation, so we don't want them in there, and this happened because of the header=None argument. You might ask: what other arguments does read_csv have? The answer is: a lot. You will use some of them sometimes, and as you get more comfortable working with data there may be ways to read it in that help you slightly, but for the most part you won't mess with most of this. So remember: don't change the header to None unless you really need to, or you know your data doesn't have a header. That was a brief overview of how we read in data and how we filter and select certain things in the data.
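A sketch of the save-and-reload round trip, including the header pitfall (the file name mirrors the one spoken in the session; index=False is my own common-practice addition):

    # Write the subset out so a crash or restart doesn't lose the work
    df_subset.to_csv("data/df_write_test.csv", index=False)

    # Reading it back normally: the first line supplies the column names
    back = pd.read_csv("data/df_write_test.csv")

    # With header=None, pandas invents numeric column names and the real
    # header line is swallowed into the data as an ordinary row
    oops = pd.read_csv("data/df_write_test.csv", header=None)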
Now we'll talk about different ways to look at parts of your data via several commands that are real easy and nice to use, specifically for viewing. And yes, I would leave header=None out. And no, we don't recommend making a cheat sheet: doing it more and more will build that knowledge on its own. I find it's much better to build an understanding of where the information is located before you try to memorize anything; learn to find where the information lives, and through that repetition you will learn it.

Cool, so let's look at these functions. We have the head function, which shows the head of the DataFrame: it gives five rows, which is the default, and I can also make it give me back 10 rows, or one row, and you'll notice it always starts with the first row, then the second, third, etc. df.tail is the exact opposite: it is the tail, so we see the last five rows, and once again I can see the last 10, or just one. Oh, that's 11; just 1, there we go.

Now, remember: variable names, everybody, variable names. df is our original data set; you can see we have a ton of columns, we haven't filtered. Earlier we saved our subsets to df2 and df_subset, so here we are working once again with the whole data set. The sample command grabs a thousand random rows; under the hood it's just drawing random row positions, but you'll learn about that in the course too. The nice thing about sample is this: say I don't know the exact number of samples I want, I just know the proportion. I can instead call sample with the argument frac, for fraction (not fracking), set to 0.5, and the 0.5 means 50 percent, so I get 50 percent of my data back. When we have replace=True, that means that when we're pulling rows from our DataFrame, we don't care if we pick the same row twice. Let's say I didn't know at all what replace did: I'd put my cursor between the parentheses of the sample function, hit Shift+Tab, and see all the arguments again. Looking at replace (I do know what it does, but for the sake of argument), it says it allows or disallows sampling of the same row more than once. So with replace=True I can pick the same thing twice, and here we get half the data, with all the columns.

Those are a couple of ways to take a very quick peek at your data. A lot of people actually prefer, instead of head or tail, to do a sample of five: rather than the first five or last five entries, you look at five random entries, which can be nicer, because there may be things happening at the beginning or the end of the data that mislead you about trends elsewhere. But it's totally a preference; people do it a lot of different ways.
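The quick-look commands from this part, collected in one place as a sketch:

    df.head()        # first 5 rows (the default)
    df.head(10)      # first 10
    df.tail(1)       # last row

    df.sample(1000)                     # 1,000 random rows
    df.sample(frac=0.5)                 # a random 50% of the rows
    df.sample(frac=0.5, replace=True)   # same, but a row can be drawn twice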
Great. So we've been able to take rows and look at different rows, and take columns and look at different columns, and cut those down. Now let's talk about sorting. We can sort the DataFrame on a specific column by specifying that column in the sort_values function. We run sort_values, and when we go over to the mother's age column, all the mothers' ages are now sorted: we can see 13 all the way up to 50. But say I want a more complex ranking: I want to rank one thing and then rank a second thing within it, mother's age and then birth month. When I do this, it first sorts everyone by mother's age and then, within that, sorts by birth month. As with filtering on columns, when you're doing something with multiple columns, you should give your pandas functions lists of those columns as strings.

What about descending? Once again, you can open the documentation using Shift+Tab with your cursor inside the parentheses; I'll use the argument ascending=False, and now we have the exact opposite order. Something I forgot to explain about these methods: look at df.sort_values. When you see ascending=True in the signature, that means True is the default value. This answers Sheri's question, which was whether in the code we need to specify least-to-greatest: no, the default for sort_values is ascending. Whenever you look at documentation and see ascending=<something> in the signature, that's what the argument will be equal to when you run the function without specifying it. I hope that answers your question.

Cool, so let's get back to actually working with the data and set up to decode some of it. Remember, our data is encoded: birth months are numbers, days of the week are numbers, it's all numbers, and we'd like it to be a little more human-readable. So we want to make a DataFrame, but essentially a translation DataFrame: the idea is to store the information about which number refers to which day of the week, and then attach this DataFrame to the other one so that we have the completed information there. This is just one of many ways you could do something like this. So let's talk about the pd.DataFrame setup. We want two columns here, which we specify by name, "name" and "number", and the default is to assume everything else is rows. When we specify the columns this way, we're assuming Sunday is the first term in our first row and 1 is the second term in our first row; Monday is the first term in our second row and 2 is the second term in our second row. We can see this very quickly by running the function: name and number show up as our column headers, and, just as I said, Sunday goes with 1 because they're in the same list together, Monday is 2, then Tuesday, Wednesday, Thursday, Friday, Saturday, et cetera. So we've created this sort of translation DataFrame.
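A sketch of both pieces; the mothers_age and birth_month column names are guesses at the real encoded field names:

    # Sort by mother's age, breaking ties by birth month
    df.sort_values(["mothers_age", "birth_month"])
    df.sort_values(["mothers_age", "birth_month"], ascending=False)

    # The little translation table: day names alongside their number codes
    day_of_week_df = pd.DataFrame(
        [["Sunday", 1], ["Monday", 2], ["Tuesday", 3], ["Wednesday", 4],
         ["Thursday", 5], ["Friday", 6], ["Saturday", 7]],
        columns=["name", "number"],
    )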
For those of you familiar with set theory, merging will be a little intuitive, because you understand unions and intersections and things like that. For everyone else, let's think about Venn diagrams. Venn diagrams are really cool and a great way to help you visualize merging. We have our data, where we want to merge these day names in, and we want to merge it with the days-of-the-week table here. The overlap of our Venn diagram is the birth day of week and the number, because these numbers correspond to real values in this DataFrame, and all we're doing is making those values match up.

It becomes important which side you're on: in this case df is on the left side and day_of_week_df is on the right side, and this matters because we need to specify which columns we're matching on. As I said, birth day of week is the column that's encoded as a number, and "number" is the name of the column in our smaller DataFrame that holds that information. So I use the pd.merge command and list the arguments as above. Something to note: if the name were the same in both DataFrames, I could specify just the one name using the on argument, but since it's not, I have to use left_on and right_on. We run this, and it takes a second, because we're starting to do some complicated stuff now: matching 38,000 rows to single values, adding them, and spitting all that data back out. As we can see, we still have our birth day of week as normal, but all the way at the end we have name and number: we have successfully attached these two new columns onto the end of the DataFrame. Every time we matched a 4, we added "Wednesday" and 4 to that row, and wherever we matched a 1, we put "Sunday" and 1 on that row.

All right, I just ran the code, we're good to go, we can do some analysis. But wait: a lot of functions in pandas are non-mutating, so you may be wondering where your columns went. Somebody stole my columns! I put them right here, I had just merged them so neatly, and someone has taken my columns, and that's a shame. The reality is that with non-mutating functions, pandas is trying to let you try stuff out without fundamentally changing your data. In this situation, when I run the merge command, it returns a separate DataFrame: this isn't df, this is not day_of_week_df, this is an entirely new object created by merging those two objects. That's very important to keep in mind. So what we do below is set the result back equal to df: we have to update our variable in order to go forward with our analysis and keep those extra columns. Now we've got our columns in.

Yes, Tigran asks: can we use an argument for that? You can. There's a specific argument, inplace=True, that you can set in a lot of pandas functions, which makes them mutating. I warn against it at the beginning, because sometimes you'll accidentally run code you didn't mean to run, and you'll essentially have to restart your whole notebook from the beginning. So just be careful with inplace; I find it's really nice when you're writing functions that handle certain operations on DataFrames. Good to bring that up.
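The merge just described, sketched out (column names as guessed above):

    # df sits on the left, the lookup table on the right; left_on/right_on
    # name the matching columns, since they're called different things
    df = pd.merge(
        df,
        day_of_week_df,
        left_on="birth_day_of_week",
        right_on="number",
    )
    # Without the assignment back to df, the merged result is discarded,
    # because merge() never modifies its inputs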
Cool, so we have our newly merged DataFrame. As we saw above, we have an identical column: birth day of week and number are the same column. We are repeating data, and that's a bad thing to do, so we're going to drop one of them. I do df.drop, and you can see that number is no longer part of the DataFrame: we've shuttled number out of our DataFrame and into the aether, with only birth day of week remaining. If you'd like to drop multiple columns, you can pass a list, and if you'd like to make this a mutating function, you can use inplace=True. This does not change the data by default, so we'll have to re-save it at the end.

Here is the rename function, which lets us rename columns. It's pretty straightforward: you pass a dictionary, because a dictionary lets you specify a column name and then what it should be turned into. Here we have the column named "name", and we'd like to change it to "birth_day_of_week_name", and you can see the new name show up. But the number column has shown up again, so once again we have to save everything back. And this brings up probably one of the most important and useful things about pandas. Before, we'd really only been doing one command at a time: we say df.drop, then we say df.rename, then we save. Here we start doing something a little further than that, and it's called chaining methods. In pandas you can do this really cool thing where, because pretty much everything in pandas returns a DataFrame as an object, you can just keep slapping methods onto the back of it. We know df.drop returns a DataFrame, so when I'm done dropping the column, I can then rename a column of the DataFrame produced by that drop. I can chain these functions together to write one really powerful line of code that handles a lot of different things. And as you can see, we finally saved it: we go to the end, and we only have our birth_day_of_week_name, under the right title now, and we've dropped the number column. Does anybody have any questions about this section before we move on to selection and filtering? Chaining is pretty cool, so I get it if it's a little confusing.
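The chained drop-and-rename just described might look like this:

    # drop() returns a DataFrame, so rename() can be chained right onto it,
    # and the final result is saved back to df in one statement
    df = (
        df.drop(columns=["number"])
          .rename(columns={"name": "birth_day_of_week_name"})
    )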
Cool, I'll move on then, and quickly go over the two questions that were posted. What do left_on and right_on do, again? They specify the columns you'd like to combine on: as we saw, the columns we wanted to match were birth day of week and number, and those columns match up between our data sets. And yes, the rename output shows the dropped column again, because the drop wasn't saved yet; we save it all at the end, which is the last thing I explained. So, quickly: left_on says, on my DataFrame on the left, this is the name of the matching column, and right_on specifies that column's name on the right.

A lot of people are asking a very similar question: when do you use square brackets, which are lists, versus curly braces, which are dictionaries, versus parentheses, which are tuples? We don't really talk about tuples in this event, but we do in our actual data science course. The way I like to think about it: a list is a collection of objects. A grocery list is a list of things, right? Whenever you need a collection of things and you don't need to specify anything else about them, you want a list, so you use square brackets. A dictionary is something you look things up in; that's what a dictionary is IRL: you pick up a Merriam-Webster's, you look something up, and it gives you a definition. There's a key you're looking for that gives you a value, the definition. So whenever you need to convert something, change something, or store data in a way where a reference key gives you something else, that's when you use the curly braces, a dictionary. Dictionaries are really good for translating, and we saw that with rename: the key of my dictionary, on the left-hand side, is the column I want to find, and the value, on the right-hand side, is what I want to change it to. Tuples are immutable objects, and we only use them in very specific situations. Aiden has a very good, concise definition: a dictionary uses a key to get a value; a list uses an index to get a value. I hope that clears up some of the confusion.

And yes, a list in R is like a dictionary, because R vectors are more like lists, but they're also kind of like arrays; it's really confusing. You'll find as you go through the course that the data types are slightly different between R and Python, but a lot of the thinking is the same; things are just called different names. A lot of stuff in R is also non-mutating, which is why R has dplyr, which we teach in our course; it's really great, and very similar to pandas chaining.

Cool, let's move on to selection and filtering. We've talked about selecting certain columns, but we haven't talked about selecting certain rows, so let's go over selecting columns once more. We'll select these five columns, birth year, birth month, birth time, birth day of week, and birth day of week name, and use them to subset. This is non-mutating, of course, so we need to save it back if we'd like to keep it; we'll call it df2. Something subtle but important to keep in mind is the order in which we give this list. Say I wanted birth time to be first: I cut it out of here and place it at the beginning, and you'll see the columns are now rearranged (thank you, Martina). Before, birth year was first, but specifying birth time first puts birth time in the first column. Let's set this back so we're all on the same page, and remember you can use Ctrl+Z (Cmd+Z on a Mac) to undo what you type; it's just like a normal text editor.
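A quick sketch of that column selection and reordering (column names again guessed):

    cols = ["birth_year", "birth_month", "birth_time",
            "birth_day_of_week", "birth_day_of_week_name"]
    df2 = df[cols]                    # subset; non-mutating, so we save it

    # The order of the list is the order of the resulting columns
    df[["birth_time", "birth_year", "birth_month"]]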
Great, so now we have to start thinking about how to filter rows. Columns are easy, because you just specify their names. (Yes, a single quote is the same as a double quote, except that if you open a string with one kind, say double quotes, you have to use single quotes inside the string to represent quote marks, otherwise you'd just end the string. You can also use escape characters, which we'll teach you about in our program.)

We can do some really cool stuff with pandas; it was built around working really well with data. It was originally created at AQR Capital Management (yes, a big, very fancy quantitative hedge fund), which some of you may have heard of. We can use a regular boolean statement. We're asking for birth months 7 to 12, July to December, so we say we'd like df.birth_month to be greater than 6. For the observant among us, I'm doing something a little different here: I'm just saying df-dot. An interesting thing is that all of the columns are also exposed on the DataFrame as attributes, so birth_month is one of our attributes and we're allowed to say .birth_month. We can run dir on the DataFrame really quickly to see: there are a million entries, but you can see all of our columns in there as attributes, birth day of week, birth month, birth place, birth time, birth weight, birth year (the bool entry in there is something else, not a column). So we are able to specify these columns using the dot.

Now that we can create this list of booleans, how do we actually use it? Pretty much the same way we use column indexing: you stick it in square brackets, slap it on the end of the df, and you're good to go. And here we see: great, I've reduced a pretty significant number of my rows, I still have all of my columns, and we've successfully filtered our data set using a boolean statement. Booleans are just true or false; it's a funny name for true or false.

However, say we want something a little more complicated, with strings. We can use the isin method, which is great: it checks whether each value is contained in a list. Pretty straightforward. We say we'd like Sunday, Monday, or Tuesday; we specify the birth_day_of_week_name from our df and use isin, and we get the same sort of thing, a reduced data set. You can see we've got a Tuesday, we've got Sunday, and Monday is in there somewhere; we'll show you how to confirm that in a minute with a function called value_counts.

For those of you who have taken formal logic, this should come as no surprise; for those who haven't: we can combine boolean statements, depending on which operator we use, the "and" or the "or" (there are a couple of others too, but we won't talk about those), and it combines the true and false values.
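The three filtering patterns in one sketch:

    # A comparison on a column yields a True/False Series (a boolean mask)
    later_months = df[df.birth_month > 6]

    # isin() tests membership in a list of values
    early_week = df[df["birth_day_of_week_name"].isin(["Sunday", "Monday", "Tuesday"])]

    # Masks combine with & (and) and | (or); wrap each condition in parentheses
    both = df[(df.birth_month > 6)
              & df["birth_day_of_week_name"].isin(["Sunday", "Monday", "Tuesday"])]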
So we check that df.birth_month is greater than six AND df.birth_day_of_week_name is in Sunday, Monday, or Tuesday. In these situations I find it's a lot easier to think about it in words rather than code: we want all of the rows where the birth month is July or later and the day of the week is Sunday, Monday, or Tuesday. There's that "and": it has to be July or later, and the day needs to be Sunday, Monday, or Tuesday. This gives us back a DataFrame that is smaller than either of the two other DataFrames, because it's an intersection of those two conditions. We go over this kind of thing in much more depth in the program, but that's essentially one of the ways you do filtering for rows, specifically rows. (And yes, isin is kind of like a find function: you're checking each individual value in your column against whatever is in your list.)

Now let's think about aggregation. Aggregation is a very important part of data science. Group-by aggregation in general is really important for data; we do it all the time. We like to look at sums (annual reports, returns, investments, these are all essentially sums), and we also look at medians and means. We have to be able to group the data in some way and then aggregate it, because the data isn't very meaningful if the only thing you can do is take the average of the whole data set. That's fine, maybe it gives you some insight to start, but most of the time we're really trying to look at these values for individual groups.

With that in mind, we're going to try to understand whether there's any meaningful difference here. I have a hypothesis that there is no meaningful difference across birth months in how many babies are born; we'll see if I'm right or wrong. Does month play a big role in how many babies are born? I'll do a groupby. In the groupby I specify which column I want to group by, and as with a lot of these functions, you can specify more than one column: pretty much any time you can put in a single string, you can almost always put in a list. df.groupby returns a special GroupBy object, which lets me do aggregations. In the next line I begin my aggregation: I'm just going to count, because that's all I care about. My question is how many babies are born; it's not how many are admitted to the NICU, it's not what the ages are, I just need a raw count. So we specify count. There are a bunch of other aggregations (mean, median, pretty much all the hits), so definitely play around and look at the documentation to see what else is there.

Here's something that's a little strange but will make sense as you do this more. When we run an aggregation like count, it counts the number of rows for every column, so you get the same number repeated across all the columns. We don't want that; we just need the count one time, because that tells us how many babies there are. The way we do that is to take an arbitrary column that is not your group-by column and filter on it. That's what the double brackets are: double brackets referring to a single column return a DataFrame instead of a Series.
We go over that a little more in the course; we don't have time here, unfortunately. So we get the DataFrame using the double bracket, and then we use rename to rename our birth_year column to something that actually makes sense: number_of_births. You cannot use the column you're grouping by; technically, the column you're grouping by no longer exists in your DataFrame, it's the index, as you can see here with birth_month.

Yes, Kareem, we're ignoring year. A couple of questions came in. Can you use the column you're grouping by, or does it have to be a different one? I just answered that: it has to be different from the one you're grouping by. Are we normalizing it? Sort of: it compares months regardless of year. That is correct, Kareem; we're taking an agnostic approach to year and combining all the data together to see what happens in the aggregation. Obviously, with all 3.8 million rows you'd get a much better look, but this is only one year anyway, the 2018 natality data, so it doesn't much matter.

Shang asks: why not create a new column? Well, there are only 12 months, and there are 38,000 rows here, 3.8 million in the full data set. Adding a column to that data set would just add a huge amount of duplicate information: any time birth_month is 1, the column would say 3,089 births, and that's not useful, it just takes up space. Say you have a hundred thousand Januaries, birth month of 1: you'd now have another column repeating the same number a hundred thousand times. Instead we make a separate DataFrame, like we're doing now. We're aggregating, making the data smaller, and then we can store the result in a way that's linked to the original table. If I want the specific number of births, that number lives in a separate table, because when I want it, I want it for a specific purpose; I don't need it all the time. I hope that clears up why we're not just slapping another column on there.

So we can look through the result, and it's a little hard to see, but it looks like things are more or less even. Can you find min and max births as well? Yes, but I won't go over that here; we cover it in our data analysis lectures. I will say there are min and max commands, built into Python and pandas, that let you find the min and max of a list or of a Series object.

Up above we saved birth_month_groups as our GroupBy object, so now we'll perform a different aggregation on it: the average, also known as the mean. (Yes, describe is also very good: df.describe() gives you summary statistics, the count, mean, standard deviation, and five-number summary, for the numeric columns if you push your DataFrame through it. Very helpful command.) We're looking at a different column this time and doing a different aggregation, and notice that the format in which we specify the single column is also different: in the count example the double bracket came after the aggregation, and here the double bracket comes before the aggregation.
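Both aggregations sketched together; names are reconstructions of the ones used on screen:

    birth_month_groups = df.groupby("birth_month")

    # Count rows per month, keep one arbitrary non-grouped column,
    # and rename it to something meaningful
    births_per_month = (
        birth_month_groups.count()[["birth_year"]]
                          .rename(columns={"birth_year": "number_of_births"})
    )

    # Same GroupBy object, different aggregation: column specified up front
    age_by_month = birth_month_groups[["mothers_age"]].mean()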
So if you have an aggregation that cares about a certain column, you specify that column; if you have an aggregation you just apply over the whole data set, you don't have to. To show that quickly, I'll delete mother's age from here and take the average of the whole data set: you can see I just took the average of every column, which for some columns is very helpful and for others less so. So we want one specific column, which is why we include the double brackets and mother's age, and we get back this lovely DataFrame, which says, much more clearly than the data above, that there isn't really any meaningful difference between the month a baby is born in and the age of the mother.

We're going to do one more groupby, this time by the day name, to see whether any days of the week are significant. It's the same setup as before: we grab birth_year, rename it to number_of_births, do our groupby on birth_day_of_week_name this time, and count again. And there does actually seem to be some difference. It would be nice if these were in order, but we'll get to that in a little bit. (There is a way to fix that, David, when autocomplete doesn't come up with the variable or function names, but I can't remember it off the top of my head right now. It's a very common problem students have, though, so fret not; it is solvable.)

Here we also look at the mother's age aggregation by day of the week, and, how crazy would it be if this weren't true: the day of the week does not influence the age of the mother. That makes sense, right? We wouldn't think a specific day of the week would have a big impact; maybe month might, but day of the week shouldn't really affect anything.

Cool. Does anyone have any questions before I move on to the smoking-risk exploratory data analysis section? We're going to do some plotting, a little more aggregation, and try to draw some more conclusions and insight. Great.

In the smoking analysis section, we're looking at a specific column, the number of cigarettes smoked in the third trimester; the higher the number, the more severe the level of smoking. Then the admit-to-NICU column records whether the baby was admitted to the NICU, the neonatal intensive care unit. Because we're analyzing whether smoking has an effect, there are a lot of cases where the admitted-to-NICU value is missing. It's not really missing; it's just "U", for unknown, and we'd like to get rid of those values. So we do some filtering. We use double square brackets, and inside the double square brackets is just a list: we're specifying the cigarettes-in-trimester-three column and the admit-NICU column. Then we create our smokers DataFrame with one of our boolean statements, using isin to keep only the known values. Now we can see our whole smokers data set: we only have our two columns, and we have almost all of our rows.
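A sketch of that filter; the column names are guesses at the encoded natality fields, and I'm assuming the known admission values are recorded as "Y" and "N":

    # Keep the two columns of interest, then drop rows where NICU admission
    # is unknown ("U"), keeping only the recorded yes/no values
    smokers = df[["cig_tri3", "admit_nicu"]]
    smokers = smokers[smokers["admit_nicu"].isin(["Y", "N"])]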
We started with 38,015 rows, and now we're at 37,983, so we kicked a little bit of data out, but for the most part we kept all of it.

Great, so here is that function I was telling you about before: value_counts. value_counts is really cool; it lets you inspect one of your columns. It only works when you specify a column this way, because it's a method of the pandas Series object. It gives you all the values, sorted from most frequent to least frequent, with how many rows fall in each category. The majority of people smoke no cigarettes; the next highest is 10 cigarettes (probably 10 over the entire trimester); and 99, which I think means 99 or more, is the third-highest category. And yes, the number shown is the frequency of the value.

Here is a little bit of a complicated piece of code: we're going to go through two pretty complicated code blocks pretty quickly. I would love to go more in depth, but you'd have to come take our course. Here we're building the admit counts: we're trying to see how many people were actually admitted out of each of these categories, so we can start making some claims about how smoking in the third trimester affects, not necessarily your pregnancy, but the chance of being admitted to the NICU. We do two group-bys, because we want to count not only the different cigarette levels but also the number of yeses and nos, so we do this double-stacked groupby. size is just a clever, quick way to get how many rows each group has, and then reset_index is pretty important: I can't go into exactly why here, but essentially you won't have the right data type if you don't reset the index. rename is just our rename function, so we rename the count column after we're done, and then we look at the first few values.

So here we go: for zero cigarettes in the trimester and "no", we see that the majority of people who smoked no cigarettes were not admitted; then we have no cigarettes but admitted, which is a very small amount. But even looking at the whole table, it still wouldn't be that useful; it's hard to get a sense of what's going on. What we'd like instead is percentages. It's much easier to look at percentages and ask: does the percent of people admitted to the NICU go up as smoking goes up?

So we're going to write this for loop. I won't go too deep into it, but we'll go over it briefly. vals is an empty list that will eventually hold the values for how many cigarettes are smoked; perc_in_nicu holds the percent admitted to the NICU. At the very end, after calculating these (this whole block is calculating), we zip the two lists together, which makes tuples, and turn that into a DataFrame. The first thing we do is loop: for each value in the unique values of the cigarettes column, which says, give me all the different values the cigarettes-in-trimester-three column can take, we append that value to the list called vals. The numerator is all the cases where the amount of cigarettes in the trimester is equal to or above that value. (I'm not entirely sure what your question is; let's hold off on it, and I'll see if there's time after the panel.)
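A reconstruction of those two blocks, under the same guessed column names and "Y"/"N" coding as above:

    # Two-level groupby: rows counted per (smoking level, admitted) pair
    admit_counts = (
        smokers.groupby(["cig_tri3", "admit_nicu"])
               .size()
               .reset_index()
               .rename(columns={0: "number_of_cases"})
    )

    # For each smoking level: of the births at that level or higher,
    # what percent ended up admitted to the NICU?
    vals, perc_in_nicu = [], []
    for value in admit_counts["cig_tri3"].unique():
        vals.append(value)
        at_least = admit_counts["cig_tri3"] >= value
        admitted = admit_counts["admit_nicu"] == "Y"
        numerator = admit_counts[at_least & admitted]["number_of_cases"].sum()
        denominator = admit_counts[at_least]["number_of_cases"].sum()
        perc_in_nicu.append(100 * numerator / denominator)

    percents = pd.DataFrame(list(zip(vals, perc_in_nicu)),
                            columns=["cig_tri3", "perc_in_nicu"])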
Great, and so we're going to write this for loop. I won't go too deep into it, but we'll go over it briefly. vals is an empty list that will eventually hold the values for how many cigarettes are smoked; the per-in-NICU list is the percent admitted to the NICU. At the very end, after this whole block does its calculating, we'll zip the two lists together, which makes tuples, and then turn that into a data frame.

The first thing we do is loop: for each value in the unique smoking levels, which essentially says "give me all the different values the cigarettes column can take", we append that value to the vals list. The numerator is all the cases where the amount of cigarettes in the trimester is equal to or above that value. (I'm not entirely sure what your question is; let's hold off on it and I'll see if there's time after the panel.) So we're looking for at least a certain level of smoking, which is this part, and then whether they were admitted into the NICU; we take the number of cases in that regard and sum it. Remember, number of cases is just this value here from our previous data frame. Our denominator should be all of the values regardless of whether or not they were admitted, so it's just the first part of our boolean, the number of cases once again, and then we sum it. Then we make the percent: we divide the number of people who were admitted by the total population under our constraints.

As we can see, the number goes up: as you smoke more cigarettes, your likelihood of being admitted is significantly higher. However, as always, just because something looks very convincing in your data does not mean it's necessarily true. Remember: correlation does not imply causation. Please always keep that in mind when you're doing data analysis.
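A sketch of that loop, under the same assumptions as before (the "Y" code for an admission is my guess, and the names are the hypothetical ones from the earlier sketches):

```python
import pandas as pd

vals = []          # smoking levels
per_in_nicu = []   # share admitted at each level OR higher

for value in admit_counts["cigs_tri3"].unique():
    vals.append(value)

    at_least = admit_counts["cigs_tri3"] >= value   # "at least this much"
    admitted = admit_counts["admit_NICU"] == "Y"    # assumed "yes" code

    numerator = admit_counts[at_least & admitted]["number_of_cases"].sum()
    denominator = admit_counts[at_least]["number_of_cases"].sum()
    per_in_nicu.append(numerator / denominator)

# zip() pairs each level with its percentage; the DataFrame makes the
# result easy to inspect and, later, to plot.
risk = pd.DataFrame(zip(vals, per_in_nicu),
                    columns=["cigs_tri3", "percent_in_NICU"])
```

Computing "this level or higher" rather than exact levels smooths over sparse categories, which becomes important later in the lecture.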
Great, and now for the big flashy part: visualization, just to wrap things up, and then I don't know if there will be a lot of questions and such after, but this is the crème de la crème; we're going to make some nice graphs. We'll look at our data frame and the number of births, because remember how before we were a little confused? It was difficult to say whether any individual month had a serious impact on our value. It's hard to just look at a list of numbers and see the meaningful relationships between them, which is why we plot.

Here I generate a graph using the built-in pandas plot function. This is a really awesome function: you can essentially just put .plot at the end, setting an argument for the kind of plot you'd like as well as the title, and if we save the result to ax, it saves the graph as an object, and then we can edit things about that object: the y label, the x label, what the ticks say, how much of the graph we see. We'll see some of that in just a second.

So here, with the number of births per month in 2018, it seems like we have a really big spike. However, the graph is slightly misleading: it's only looking at the range between 2,900 and 3,500 births, which is a very limited window and really dramatizes the differences in the data. We want a more reasonable graph that doesn't draw out the differences so aggressively, and the way we do that is by controlling the y limit of our axes: we give ylim an argument with the range it should span.

Oh, and the %matplotlib inline thing (the question was: what does %matplotlib inline do?) is kind of goofy. It's this weird thing that you used to specifically need and now mostly don't, but you still need it the first time. The first time you run a plot without it, the cell just prints the text representation of the plot object, and you have to run it a second time for it to work fine. It's really weird, but it's just one of those things; programming is quirky.

Great, so let's look at the graph again, this time with a much more reasonable range, and it looks totally normal; it doesn't really look like there's a significant difference anymore, right? We essentially drew a graph that very aggressively showed a specific view of the data, and this is something to be really careful with, both in reading graphs on your own and in making them. Be very certain you aren't effectively lying to people with your graphs. I've seen things where the x-axis is reversed to confuse people about the progression of time, or positive values are on the bottom instead of the top so a positive gain looks like a negative one; just really weird stuff.

Yes, Tigran is correct: the %matplotlib inline command is for Jupyter notebooks; it is not for any sort of independent IDE (integrated development environment).

So maybe I don't want a line graph; maybe I want a bar chart, because I think a bar chart will let me see the differences between the months a little better. It seems like months two and four are a little lower, and that makes sense: some months have 31 days, others have fewer, and it makes sense for February to have the fewest births because it's the shortest month. An important thing to keep in mind: there are external factors that are not included in your data set that you must be careful of, like February having 28 days most of the time.

Cool, so now we look at the days of the week, and that doesn't look very good, does it? No, that's pretty bad; we don't even have the days of the week in the right order. So we need to do something about it, and we can: we can specify the order in which the data comes back. This is very similar to what I did earlier with the columns, but here we'll use loc. loc stands for location, and there's an equivalent command called iloc, which stands for index location. With loc we use the specific names of things; with iloc we use the numeric index they're located at, so for iloc, 0 would be the first row, 1 the second row, and so on. Here we specify the order we'd like the days in, use loc to get them back in that order, and now we can make some nice plots. In both plots, births are significantly higher during the week. I think that's probably because hospitals try to schedule births for when there's more staff; once again, I'm not a medical expert and I'm not offering any medical advice here, but my conclusion would be that they want more staff on hand for births, so they schedule more of them on weekdays.
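Put together, the plotting and reordering steps might look like this; a sketch assuming a hypothetical months-indexed births_per_month frame alongside the births_by_day frame from earlier (and note that %matplotlib inline only works inside a notebook):

```python
%matplotlib inline  # Jupyter-only magic: render plots beneath the cell

# Save the axes object so we can keep editing the figure afterwards.
ax = births_per_month.plot(kind="line",
                           title="Number of Births per Month, 2018")
ax.set_ylabel("number of births")
ax.set_xlabel("month")
ax.set_ylim(0, 3500)  # start at 0 so small gaps aren't dramatized

# Days of the week come back in arbitrary order; .loc with an explicit
# list of index labels reorders the rows before plotting.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
births_by_day.loc[day_order].plot(kind="bar",
                                  title="Births by Day of Week, 2018")
```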
Great, and now we'll go into our last little section; we've got about three more code blocks, so we're almost done. I know this has been a long one, but I hope you've gotten a lot of really cool information out of it; I enjoyed it, and hopefully you'll look up the school after (we'll talk a little more about that afterwards as well).

In this last bit we're looking at the visualization for the smoking risk data from before. We already have our smoking risk data set above, and this is a really important reason why we save things to variable names: so we can use them later when we want them. We have our really long title, which we put on another line just for readability. As we can see, there are some pretty specific thresholds at which the graph jumps: zero to one cigarette is a big jump, then around 16 or more, and then greater than 22; these are really big jumps in our threshold, and then it kind of plateaus off. So you want to ask: what actually happens here, and what am I actually looking at? It's kind of weird that we have these really big thresholds and that the data is not continuous; in an ideal world you would see some kind of continuous relationship where the graph just goes up, not in these big steps. So we'd like to look at the data a little more closely.

The first thing we'd like to determine is whether or not someone is high risk, and we'll say that high risk is about a 30% chance of being admitted to the NICU. So here is that 30% cutoff: we use our simple boolean statement, greater than or equal to 0.3, which is our 30%, and we just take the first result. That essentially gives us how many cigarettes you need to smoke to be in the threshold, because the number of cigarettes smoked is our index.

Then we want to look at the data frame a little more to get a better idea, and I'll just run this so we can see the result for the overall data. Now we're getting the percents again, but split out further: we're only looking at the individual groupings for cigarettes in trimester three, so instead of "two or more cigarettes smoked", this line looks at exactly two cigarettes smoked. And as we can see, we get some of those really big jumps because, as we get further down into the data and the number of cigarettes smoked gets higher, the data becomes much more sparse. We have some smokers at, you know, 60 cigarettes during the trimester, and neither of those smokers had their children admitted to the NICU. So there are situations where the sampling is just so low, where there are not enough observations for you to observe the effect, and it's important in these situations to look at this and try to understand how it impacts your data and the results you're coming to. That could mean asking for more data, trying to collect data specifically for that demographic to get a better viewpoint, or making sure you proceed with caution in your analysis and accept that it's a gray area: you won't really be able to determine exactly what's going on, but you should be able to make a really good guess.

There are two last points I want to go through. First, after ten, the majority of smokers answer in multiples of five. This is a bit of human psychology: the third trimester is a long time, and asking how many cigarettes you smoked specifically over that period is hard, so people say "ah, 20, I don't know, maybe 15"; some people answer very precisely and others don't. Second, the 99 category, which I assume means 99 or more: somebody goes "yeah, I smoke a lot of cigarettes, what box do I pick?", they pick 99, and great, box selected. However, as we do move down the data, and this is strongly supported by that 99 category, there does seem to be a very significant increase in the risk of having your child admitted to the NICU. I'm no medical expert, but as common sense in our culture has told us: smoking is bad. Also, yes, cigarettes are sold in packs; that makes a lot of sense as well. And since we were doing the "at least" version, some of this data was hidden from us, which is why it was important to come back at the end and check how these values were actually stratified between the different groups.
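One way the cutoff lookup and the per-level breakdown could be sketched, reusing the hypothetical risk and admit_counts frames from the earlier sketches (again, the "Y"/"N" codes are assumptions):

```python
# "High risk" = a 30%+ chance of NICU admission. With the smoking
# level as the index, the first row past the cutoff tells us how many
# cigarettes it takes to cross the threshold.
risk = risk.set_index("cigs_tri3")
cutoff = risk[risk["percent_in_NICU"] >= 0.30].index[0]
print(f"High risk starts at {cutoff} cigarettes in the third trimester")

# Exact-level (exactly N cigarettes, not "N or more") percentages
# expose how sparse the upper tail of the data really is.
per_level = (
    admit_counts.pivot(index="cigs_tri3", columns="admit_NICU",
                       values="number_of_cases")
    .fillna(0)
)
per_level["pct_admitted"] = per_level["Y"] / (per_level["Y"] + per_level["N"])
```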
And with that, that essentially concludes the lecture part of the talk. I'd like to thank everybody very much for coming and showing up; you had a lot of really great questions, and I loved interacting with you throughout the lecture. It was a lot of fun, and I hope you found it really useful. I hope you're interested in learning more with us and coming to hang out and learn some data science; we love meeting new students, and we hope to meet you very soon. You can check out the course offerings on our website, which is linked below, and you can also email us at admissions at nycdatascience.com to get more information about the program.

"Yes, this was your first webinar, right?" Yes, this was my first webinar, and I'm scrolling through some of these questions; I think it worked really well. "Is this similar to the format of the online courses?" I would say this is similar to the remote residential: if you're a remote student in the residential program, you will have a live instructor talking you through the lectures. If you take the IDL program, the interactive distance learning program, you watch pre-recorded lectures at a slower pace, and the instructors interact with you more through office hours and one-on-one meetings. The residential bootcamp is much more like this.

And one last thing: our next cohorts are going to be starting on September 27th, October 8th, and January 3rd, 2022, so feel free to check out our main website. Also, if you're interested in maybe starting a career in data science, we're going to have an event next week, next Wednesday, and if you really liked the data visualization part, we're going to have another event on October 6th, I believe. I attached the links in the chat, but I'm also going to send a follow-up email about it.

Questions? "I believe at the moment there are physical classes, like you can go onto campus?" Yes, and the professor is on campus most of the time. And they're really nice; if you have questions, they're going to explain everything. Other questions? Yes, Luke: the October events are on Eventbrite, and we also have a webinar page on our main website; I'm going to put the link below so you can check our upcoming and past events (maybe this one will be up tomorrow). You do have to register for the upcoming events, but you can do it through the link I just sent in the chat. The deadline for the January 3rd cohort is going to be, I think, around December, but if you start your application earlier, you'll get access to our pre-work and have time to learn. Other questions? If not, I'm going to close it now. Nice meeting you all, have a good night, bye! See everybody; it was so wonderful lecturing for all of you, and I hope to see you in class.
Info
Channel: NYC Data Science Academy
Views: 109
Rating: 5 out of 5
Id: hfpIQewmIIU
Length: 114min 55sec (6895 seconds)
Published: Thu Sep 23 2021