Why I chose Python & Polars for Data Analysis

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so when I first started getting into working with data for the company I was working for at the time I had a few choices I got put through an Excel course and that was fine but this was around about the time that I was starting to teach myself Python and I could see that there was a clear divide in between the two now I had the option to go ahead and carry on my Excel course and learn more about quow query and this was kind of like the decided path for the sort of company that I was working for at the time we didn't really do a whole load of data analysis it was basically just to look at our own customer data and figure out where we can improve things but as I started to learn more Python and learn about where the data actually came from through the apis through competitor analysis and web scraping which is kind of what I built my channel around it became clear to me that I didn't want to actually export that data out and then repport it into Excel to work with it in power query or do anything like that now we had no proprietary business like bi tools or anything like that I was allowed to make the decision myself so having watched a few videos it was clear to me the option of being able to pull the data down from the API from our company system and then actually go ahead and scrape competitor data from their websites and pull all of that into a panda's data frame and analyze and analyze it there it was just a no-brainer for me that that was the choice that I was going to go for I also didn't have a load of fun with Excel when it came to lots and lots of rows and manipulating and even doing the most basic things was really challenging and even simple things like when you're working with it in when working with a load of data in pandas you create multiple data frames you don't chop up your original data you just build from that uh little things like that made my choice very clear so in this video what I want to show you is how to is how I kind of started with that I want to take some data that I've scraped from the from from the uh internet and I'm going to pull it all into a data frame and I'm going to show you just how simply you can work with it to actually show off how pandas or in this case polers which is very similar just a different sort of approach but basically the same thing I just prefer the words and terminology for the functions that it uses how you can actually take that and you can then present that to your boss or whoever and say hey look this is what I've done this might help us out here this will help us out there and then you're obviously going to help yourself out in the long in the long run too so let's get some data going let's load up our data frame and I'll show you how we can start to look at insights for the data that we have making the choice to Learn Python for data became my path but if you're interested in working with data in any capacity then it's all about how you look at it and how to find those useful insights this is where today's sponsor brilliant can really make a difference they make it easy for anyone to pick up and get started and with their learn by doing and Hands-On lessons give you the opportunity to play around with the concepts it will help you build up a solid foundation and power up your all important problem solving skills this is by far the most effective way to learn and is way more interesting and captivating than watching lecture videos there's thousands of interactive lessons in maths data analysis programming and AI already to help you regardless of your skill level and completing just a few minutes each day will quickly become a habit and become that powerful growth tool beats mindless scrolling there's a ton of new data content now too including how to explore data visually which which is where I went to brush up my skills on things like aggregating filtering and grouping data like we're going to do later on in this video so to try everything brilliant has to offer for free for a full 30 days visit brilliant.org johnwatson Rooney or click on the link in the description you'll also get 20% off an annual premium subscription so we are in our Jupiter notebook now in my project folder I'm going to leave some instructions for you of how to actually get to this point uh save me just running through it and just boring everybody so the first thing we want to do is we want to import in polers and we do this as PL like so now we want to create our data frame so we do DF and we're going to do pl. read CSV now what's great about this is we can actually read in a all the csvs in our folder just by this one command so we just do star. CSV like so and then if we hit DF do shape you can see I have 33 columns and 6,227 rows the next thing I want to do is want to have a look at DF do describe um just so we can see a little bit more about what's going on here and this will show us some of the information so we can see these columns the count uh of items so we start looking for null values you can see here these ones I've got have high null values because there's only 675 as opposed to 6,000 and so on and so forth and you can see there's lots of null values here as well now obviously I was only able to do this because all of my CSV files were the same if you have them at a different number of columns or whatever you need to um import them in separately create their own data frames and then go from there so now we have this in one last thing to check which is the schema it's always useful to see I don't need the parentheses here we can see the names of all the columns so there's going to be a few main commands that we're going use there's going to be uh filter Group by and uh select so select is going to select columns filter is going to filter rows and group by is going to basically group the data together based on the criteria that we give it so if we do DF do um let's do select first and let's have a look at pl. call to select the column and let's have a look at the price column to start with like so now this is going to be our our shape is the correct amount of row uh rows with one column obviously the price we can see them all there and it's telling us that this is an I64 which is exactly what we wanted it to be now if it wasn't we would need to change this and we may need to do this later on but you can do we could do DF do cast and then within our uh parentheses we create some brackets we give it the column price and then what we want it to be so we could do PL dot I think it's float 60 before then if we were to find the price in this bearing in mind that we haven't saved this this is just its own thing here we can see now that the price is a float um if we were wanted to make this permanent we could just do DF is equal to DF but I don't want the price to be a float I want it to be an integer uh we might work with this later but this is a good opportunity to just have a look at the data in fact what I'll do is we'll just do DF do head to look at the top four uh five rows and we can see that there is an added this is actually a date stream but it's come through as an I64 number you would probably want to convert this but I'm not going to be doing anything with the date in this analysis so um that' be one thing that I would going to leave out but we want to sort of figure out what columns we actually want here so we definitely want the number of beds that's going to be useful for us and coming across here we're going to want the post code this is going to be interesting obviously the price and the property type as well um this is going to be the information that we're going to work with so I've decided what I'm going to do is I'm going to look see if there's a correlation between um the type of the property either like a house or a flat or whatever the price and then the postcode location and we need to work with that now obviously there is going to be some but you know it's going to be interesting for us to see as we go forward um how we can pull this information out and have a look so to select those columns we would just do DF do select pl. call and we can then say we want the best we want the price what else do we say we want to forget the columns when I'm doing this I tend to do this with the schema let's get the schema back up because it's nice and easy to see like so we can just copy The Columns over that we wanted property type and the post code which is out here somewhere we hit enter you see now we have our own new our new own data frame which is obviously separate um that we've just created with this piece of information and is there anything else that we going to want from this possibly not we'll leave it like this for now we might come back and revisit that um but this will do for now so what I'm going to do is I'm just going to I'm going to save this as um post code I'm going to save it like this and that means that now I have this subset data frame here with just this information from it and if we were to look at the main data frame here's everything back again so this is how we can easily sort of like me manage all our different data frames like instead of having to have different sheets or whatever whatever on our Excel page or however you want to handle it there so now we've got our data narrowed down a bit there's a few things that we still want to do as you can see we have this property type here now we're going to want to use this to sort of figure out um the different types of properties that have different costs and values obviously different prices we also need to split up the postcode at the moment this is a UK set of data set so this is obviously a UK postcode but the whole thing is quite a specific location we only want the first three so we'll be chopping this up but first of all let's have a quick look at doing unique this is quite a good way of being able to sort of see the unique value so we do subset is equal to and this is a list and I'm just going to put property type in here and this is just going to show me all of the unique values that there are for the property type so you can see we have eight in this case here so we know what we're kind of looking at and we can now filter these down um a bit more should we need to uh we could also for example if you put beds in here you'll see you know one of these properties has 12 beds or 20 I mean that's land so that's a bit different so that's kind of a we probably won't want to look at that but for example you know nine bed house it's pretty pretty something there I suppose so as you can tell so let's have uh another look oh there's one more thing I want to show you which is Glimpse Glimpse is kind of useful as well you get to have a kind of a quick look at the data um in kind of more of a raw State um you can see like it's got all this information here so that's kind of useful as well so let's go ahead and start uh actually clear clearing this up a little bit so we have um something better to work with so let's get rid of this and we're going to do they're going to split the post code up first so we'll do post code I've called this data frame post codes not ideal and we'll dowi columns this is another um one of the methods that we use on our data frame this is basically what it says going to be working with some columns and changing or adding data in some way so we're going to select the postcode column because this is the one we want to work with post code and we going to do do string because this is string in this post code see down here string and now we can do do split and I'm going to split this on a space so you can see that the the data has a space in the middle so that's going to work just fine for me and then again we can chain these again chain these together so we can do with columns again and again pl. and the same column and then it's a list because we will have split it up in fact I'll show you this in just a second and we want to get the zero index like so we need get here like we need list. get my B list. get right so if I was to just I'm just going to remove this bit for the moment if we run just this like it is you see we return this list back here because we've done split split is obviously just like the python split we are split witing and it returns us a list now that's fine but by changing the Expressions here we're basically saying we're going to do with columns again same column but because it's now a list see list we can do list.get and I'm asking for the index of zero if I put the index of one you see we get the second half which we don't want we want the zero index like this so now I have my property type and my post code narrow my post code in a more of a general area so we can actually think about you know having a look at the differences in prices in different locations so again I'm just going to override my data frame here and just save it like this I don't need to do this anymore so now we have this let's have a look at the mean values for uh each of these kind of like groups of postcodes so to do that we need to do filter so let's do postcode do filter and we're going to be looking at pl. call this is one of the things I like about um polers actually it's really straightforward is saying oh this is looking for column so it's pretty straightforward and easy to do then we're going to take our property type like this we're going to say this is going to be equal to houses and remember we looked at the all the unique values up here I should probably include that down again actually what I'm going to do is I'm just going to create a new uh a new cell here and we're going to do unique again and it was our subset and we wanted it on property value type property type close that off cool so now I can see these here we have some that are not specified that's fine so let's do um let's do bungalows in this case here so now we're basically saying give me all of the rows that match Bungalow so I can just close this off and run it and here we are here's all of our Bungalows chain this with do select and we're going to do pl. call on the price and I'm going to have have the mean which is going to be our average here so across this whole data set the average Bungalow price is just about 299 but you'll see now this has turned this into a float a 64b float we can actually do cast on the end of this I showed youcast at the start where we were doing it because this is the an output is the data frame so we'll do price and we do pl. int64 close our brackets and now we have it as an integer so you can use cast anytime you end with a data frame it's really handy to be able to chain these all together so let's have a look at something else let's put in another filter in the middle of this filter pl. call beds and we'll make this equal to one and we can see now the average price across the whole of the thing so this is a very general not that useful is uh for one is 128,000 and and two is 246,000 is there three yeah three and on so on and so forth um so that's kind of like one way we can start to look at things here and we can of course put in houses here and we'll say two bed houses 171,000 so versus Bungalows 246,000 so in the case of small houses they seem to be cheaper generally across the board or at least across my data set which is obviously you know just a small 6,000 or so listings of properties for sale so there we go so the next thing that you want to look at is I've done filter and I've done select and we looked at mean so now we want to do group by and so what I'm going to do is postcode which was this data frame that we created Group by and this is going to like as it says group all of our data together so we want to group it by post code this is the one that we want to do want to want to get everything together like this and then we want to do Aggregate and we want to do pl. we want to do of the uh price and I'm going to say uh mean again like so and let's see I've got the syntax in complete one more bracket there and now we have a list of all of the average prices for these this sort of area this general area uh again just come through as a float 64 so what do we do again we just go ahead and cast this over here as well and change this to an integer like so so what be we've got here is a long list of all these different postcodes so what we can do is we can do we need to move uh I think we can do do sort can we do do sort on WE off this yes we can so we'll do sort by and we want to sort on the sort on the price and we'll do descending is equal to true so this will give us the most expensive ones at the top we can see out of my data set these areas have the highest average uh prices again this is just a my data set we could end up saying that there's not that many um properties in this to match but you know that's why I've pulled so this is what we're looking at right here we can actually dive even deeper into this as well so we could go ahead and say our postcode we're going to be group by um so I'm going to save this as prices by postcode while typing like so great so we have this now so let's try and narrow this uh broaden it even more so for example this is the area and this this is like the general area and then this is kind of like a subset of it so we want to go for just these first two and we we'll bride we'll broaden this out even more and then we'll have a sort of a good overview of the data so what I'm going to do now is I'm going to say this is with columns so we're going to be working with these columns again and we want to do PL do call which is the post code like this I'm going to do string because this is a string as it says here St Str and then we'll do do head and two so we hit enter oh I missed a bracket I always do that so now you can see that we have our CBS and is all matching here but now we need to do group by again so we can sort of regroup all of this data and this is essentially the same as this so what I'm going to do is uh because this basically gives us a data frame is I'm going to just going to grab all of this and we'll put it on here and we'll run it again and now you'll see that they have grouped all of these together and if we were to let's go ahead and just grab the top ones so I'm just going to put head again and this is obviously the head of the data frame rather than the string which we did in the middle and now we have the top five sort of shorthand postcode areas with the most expensive um houses or prices for uh any property within now what we can do here is we can actually save this to uh price totals like so I'm going to come back to the top here and I'm going to do import cbor as SNS I'm just going to run all the sales just to make make sure nothing goes wrong and now we can do SNS do set theme this isn't necessary but I'm just going to do it anyway and we can do SNS barplot now we can do price totals price totals and we'll have the X column is going to be equal to our post code and I think this should be price we'll double check in just a second why there we go now this is a very uninspiring graph because we chopped this right down uh where are we so we come all the way to the end here and we can just you know make this a bit bigger like so same deal there we go so we can start to look at this and we can start to compare and you can of course put these the other way around if you wanted to grow out sideways like so okay so let's construct ourselves another nice uh long expression let's find um the sort of average price for houses depending on the number of beds so to start with we want to do our again uh filter we're going to go back to using the postcode data frame it's just there available for us so we're going to use this one so we'll do filter and we're going to filter on our property type column so pl. call property type and we're going to say that that is going to be equal to going to look at houses houses like this and now we want to do a group buy because we want to group Buy on the number of beds so we'll do uh beds and we'll do aggregate function or a function pl. call on the price so we just chaining all these Expressions together now and we want to have the mean of the price so there we go now we've got this sort of data here but again we have our it's converted our uh price into a float 64 and you can see that we've got some dodgy numbers going on here um so let's go ahead and bring this in here so we want to do we've got this whole cast and sort and head so I'm going to copy that we'll put you here as well and now we have uh the top 10 of price for the number of beds and you can see here that generally speaking funnily enough the more beds you have apart from this bit of anonymous this bit of data which is a bit strange I'd be interested to see where that came from obviously the more beds you have the higher the price we can do Tail as well which I think is probably just going to give us this as well yes so there we go obviously that's not overly surprising um one one thing I want to show you is that when you create when you construct these sort of long uh Expressions it's very easy just to drop in and drop out the different bits of information so we take uh land for example here we can now just change this and we've got the same amount the same information but for the type of land so it's very easy and I'm going to change this back two heads so we get the top ones it's very easy just to drop this in and I think this one here um is character properties so let's go ahead and put that in cool so we can see right away that the character properties are much more expensive and uh maybe generally have uh the this different number of beds but you get the idea of what you can start to build and construct with these long um expressions and obviously you don't have to do this you can split these up but for example I know that when I run this is my data frame when I run this this is going to end up with a data frame as well so then I can do group by and when I end up here I know that I'm going to have data left and I can do cast to change the data type then I can sort it because I have that data frame left and of course if you wanted to you can then throw in caborn to do some kind of um graphing if you want um it's up to you this it's fairly straightforward I tend to only use S graph plot or ra plot I think kind of get the idea not that particularly useful in this but it is when you have more different more and different data points so from here if I was going to continue to analyze this data let's go back up so I can see the schema I would be changing added which to a date and I'd be looking at seeing how long um things have been on sale because we have somewhere down here um sold STC so we can filter those out and we can have a look how long things have been on sale and what sort of property stale stay on sale for the longest that would be an interesting piece of data to look at we focused in much more on the price and the type and the number of beds and the averages in this instance so hopefully you enjoyed this video if you want to know how I got all of this data make sure you subscribe because that video is coming up next
Info
Channel: John Watson Rooney
Views: 5,265
Rating: undefined out of 5
Keywords:
Id: 39jB8nJBrCI
Channel Id: undefined
Length: 24min 32sec (1472 seconds)
Published: Fri Apr 19 2024
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.