Apache Spark / PySpark Tutorial: Basics In 15 Mins

Video Statistics and Information

Captions
In recent years a new term has been coined: big data, which essentially means hundreds of thousands or even millions of rows of data. It's vital that any data scientist, engineer, or analyst knows how to process this data quickly, because this is the kind of data most companies are dealing with nowadays. So you're the person they need, and the tool they need you to use is Apache Spark, a modern solution for processing big data. If you don't know how to use Spark, or even what it is, I highly recommend you watch this video to the end. It's only going to take a few minutes of your time, so don't be intimidated; take it step by step, and don't get left behind.

Today we're going to be using PySpark, the Python way of writing Spark code. A lot of people use it, for many reasons; it's extremely popular. The downside of the Python version is that it's a little bit slower to run, but it's still used heavily today, I promise you. Okay, let's get coding. It's going to be nice and simple, and it's going to really pay off. And if for some reason you're not subscribed to this YouTube channel, you should probably get on that as well.

So here in our Python environment we have access to Spark commands through the variable sc. If I output sc, you can see it's basically this SparkContext thing. We'll use sc very shortly, but for now I'm going to make a Python list called nums: I'll let nums equal the list of the range from 0 to 1,000,001.
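The setup just described can be sketched as follows. The plain-Python part runs anywhere; the `sc.parallelize` call (left as a comment) assumes a live PySpark session where `sc` is the SparkContext, as in the video:

```python
# Build the list on the driver machine: the numbers 0 through 1,000,000.
nums = list(range(0, 1_000_001))
print(len(nums))  # 1000001 -- a million and one items

# In a live PySpark shell, sc is the SparkContext, and this would
# distribute the list across the cluster as an RDD:
# nums_rdd = sc.parallelize(nums)
# nums_rdd.take(5)  -> the first five elements, as a Python list
```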
Basically, nums is a Python list of the numbers 0, 1, up until a million, so technically it should have a million and one items. Even though we're not using Spark yet, I can still use some Spark terminology: this list lives on what's called the driver machine. It's called the driver because it's the machine that's going to tell the other ones what to do. I know we haven't even talked about how there are other machines yet, but what Spark is going to do is distribute our data across multiple machines, our cluster of machines, which will do the processing for us to speed it up. Cluster in general just means a group, but in the context of Spark and distributed computing it means the group of machines dedicated to performing this task, and for us that's going to be processing our data in Spark.

So the first thing we need to do is tell our cluster, or Spark, what data to distribute, and we're going to distribute this same Python list. Normally you'd actually load from a file rather than doing what we're about to do, because the whole point is working with really big data sets, but for a simple example we're going to use what's called parallelize: a Spark function that takes in a Python list and distributes it into what's called an RDD, a resilient distributed dataset. The easiest way to explain it is with an example, so I'm going to make a variable called nums_rdd, which is equal to sc.parallelize(nums): we're saying, Spark, do something, parallelize, and passing it the list nums. If I output nums_rdd, we do have an RDD. Notice that when I output an RDD it doesn't give me its contents, and that's because we're distributing the data across our cluster of machines. This whole thing here is our driver program saying, hey Spark, distribute this data; Spark says, okay, I'll distribute it
for you, and I'll give you back this RDD, which is just a variable of type RDD, a resilient distributed dataset. To us it's really a handle we can use to tell Spark whatever we want to do with the data.

Maybe we do want to know what we distributed. Of course we already know, because we just created it and haven't actually done anything to it yet, but you can ask for what's in there with the collect method. If I do nums_rdd.collect(), it returns a Python list of all the information, so this should be the exact same list as before, and it is. Now, this is a very scary operation, because all of this data is distributed across our cluster on many different machines, and if it's a lot of data, like a million numbers, it's risky to bring it all back to the driver machine. We don't want to do that; we want our processing to be done in parallel, distributed on our cluster. That's the whole point, so we're not going to call collect most of the time. Of course, it would be a bit of a disaster if we couldn't even look at what was in our distributed dataset, because how would we know what to do with it if we don't know how it looks? You'd have to keep track on paper. What we can do instead of collect is called take. If I do nums_rdd.take(5), it takes the first five elements of the RDD and returns them as a Python list.

I know it's kind of annoying how many things I had to explain up front, but it's really important to understand, so that now we can go ahead and do some interesting work. One thing you might want to do is apply some function to every element in the RDD, each of these numbers being one element. Maybe we want to square each element. What I can do is make a new RDD: I'll make a variable called
squared_nums_rdd, and set it to a method call on our previous RDD: nums_rdd.map(...). What map does is apply a function to every element in the RDD, so we have to pass it a function. In Python we can pass either a lambda function, which is basically a function without a name, or an actual function that has a name. A lot of the time when you're writing Spark code the functions are very, very small, so we just write lambda functions, like this one: a function that takes a parameter called x. Specifically, we have to pass map a function that takes in one parameter, because map's job is to apply that function to each element, one at a time, and the single parameter will be that element. The function we want is x squared: we want to take an x and return the square of x. In Python, if you're not familiar with how to do that, 2 ** 2 is 4, so 2 squared is 4, and if this were 3, then 3 squared is 9.
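In a PySpark session the squaring step would be `squared_nums_rdd = nums_rdd.map(lambda x: x ** 2)`. As a sketch, the same lambda applied with Python's built-in `map` gives the equivalent result on the driver, without a cluster:

```python
nums = list(range(0, 1_000_001))

# The same one-parameter lambda PySpark's map would apply element by element:
square = lambda x: x ** 2
squared_nums = list(map(square, nums))

# Mirrors squared_nums_rdd.take(5) from the video:
print(squared_nums[:5])  # [0, 1, 4, 9, 16]
```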
Okay, so all I want to return here is x ** 2 (asterisk asterisk — that's a tough thing to say), x squared. If I output this thing — remember, .take is the only sensible way to get useful information back — and take the top five, it sure looks like everything's been squared. Great.

Right now our RDD is filled with just numbers, so every element is a number, but we are absolutely not restricted to that. We can make each element a list or a tuple of items, so we can store extra information about an item. For example, we could store each element along with how many digits the number has, so each element would actually be a pair, a tuple, where the first item is the number itself and the second is its number of digits. Watch how we can do that with the same map function. First, if you don't know how to get the number of digits of an integer in plain Python: if I had, say, 546, that's a number, but I can convert it to a string, str(546), and strings have a length, so len(str(546)) is the number of digits. That's what we're going to use here. If you're listening right now, I strongly encourage you to try to think through the code yourself; we already have all the tools to perform this task. Either way, I'm going to go ahead and write it. It's exactly the same as before: we call map, except we return the number together with the length of its string as a pair. I'll make a new variable called pairs, equal to squared_nums_rdd, our previous RDD, dot map, with a function that takes one parameter and returns a pair of two items, where the first is x itself, the number, and then
the second, same as before, is the length of the string of x. Then if I do pairs.take(25), a few more this time, we clearly have the number of digits alongside each squared number.

As you've seen, map is awesome; you're going to use it all the time to perform transformations on your data. But it doesn't do everything. One thing it doesn't do is remove things from your RDD. Maybe there's stuff in here we don't want; maybe we only want the numbers with an even number of digits. We can remove the numbers with an odd number of digits, or equivalently keep the numbers with an even number of digits, with the function filter. Very similarly to map, I'll make a variable called even_digit_pairs, equal to pairs.filter(...). Filter also takes a function of one parameter, but this time it returns true or false: true if we want to keep the element, false if we don't. So we pass in a function that takes one parameter, and remember that x is now a tuple, a pair, where x[0] is the number and x[1] is the number of digits. What we want to return is x[1] % 2 == 0: take the number of digits, divide by 2, take the remainder, and check whether it equals zero. That returns true exactly when the element has an even number of digits, which are the ones we want. If I run even_digit_pairs.take(25), there you go: we have numbers with two digits, numbers with four digits, and so on. It doesn't output everything, just the first 25.
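The pairing and filtering steps can be sketched in plain Python with list comprehensions standing in for the RDD calls — in PySpark: `pairs = squared_nums_rdd.map(lambda x: (x, len(str(x))))` and `even_digit_pairs = pairs.filter(lambda x: x[1] % 2 == 0)`:

```python
nums = list(range(0, 1_000_001))
squared_nums = [x ** 2 for x in nums]

# Pair each squared number with its digit count:
pairs = [(x, len(str(x))) for x in squared_nums]

# Keep only the pairs with an even number of digits:
even_digit_pairs = [p for p in pairs if p[1] % 2 == 0]

# Mirrors even_digit_pairs.take(5): the first even-digit squares.
print(even_digit_pairs[:5])  # [(16, 2), (25, 2), (36, 2), (49, 2), (64, 2)]
```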
So filter and map are both awesome: you can transform each element however you want, and you can remove or keep whatever you want. But what neither of them does is any sort of grouping. You can see here that we clearly have a group of all twos and a group of all fours. This comes up very, very often, for things like aggregation: averages, minimums, maximums per group. They're very important. So to finish this lecture off, and hopefully it's been awesome so far, we're going to do one more function. We're going to group all of the like information, so all the twos will be one group, all the fours another, all the sixes another, and suppose we want to compute the average for each group. We'd end up with only a few elements: one average for the twos, one for the fours, one for the sixes, and so on.

To do that, we first need to flip each pair so that the group is the first item. This is because Spark often works with the notion of key-value pairs, where the key is what we group on and the value is whatever information we're considering. I'm going to do this next part quickly; I encourage you to try it, but if not, that's okay, don't worry. Right now we just need to flip each element in the RDD. I'll make a variable called flipped_pairs, equal to our previous RDD, even_digit_pairs, dot map, with a function that takes the pair and just flips it around. If I output that, it should be flipped, and it is. Now we need to do the grouping, a group-by-key, because for each key we want all of the items beside that key: rather than many separate elements, we want one key together with all of its values in one element. I would
do that by calling groupByKey. Taking the previous RDD, flipped_pairs, I'll say grouped equals flipped_pairs.groupByKey(). I don't need to pass it anything, because it's just going to group by key. Let me show you what that produces. You'll see it takes a while, because grouping is a much harder operation than a map or a filter, but it did group: we have a key, which is the number of digits, and then all of the items. I know you don't see all the items; that's because PySpark does this irritating thing where it returns an iterable object rather than a list, so let me quickly fix that up so you can see the groups. I'll just overwrite the variable: grouped equals grouped.map(lambda x: ...), keeping the first item of each pair the same and converting the second to a list so you can see what it is. It's going to be huge, so I won't take very much; let me take just a couple so you can see. The first element is the numbers with two digits, all right there, then the numbers with four digits, all right there, and those lists get pretty big.

Now that we've successfully grouped the information, we can actually compute our average: we just map the sum of each group's elements divided by the group's length. With this thing called grouped, I can say averaged equals grouped.map with a function where we keep the first item, our key, the number of digits, exactly the same, and then produce the average: the sum of the list, sum(x[1]), divided by its length, len(x[1]). It's getting a little tricky with the brackets here, I've got to not mess this up, but that looks like it matches up. And if we look at averaged, we should be able to collect this
information, because we're only going to have a few elements, and I'll show you why: we only have at most 12 digits here, and I guess the numbers are getting pretty big at this point — there's our huge average. For each of the even digit counts we took the average and returned it, and it's fine to collect that, because we only have six elements. That's how it's going to flow: you start with something you're not going to want to collect immediately, but often, after a bunch of transformations, you can collect it later.

Although this is great, it was actually a very slow way to do what we just did, because we grouped all the information using groupByKey. If you want the full explanation of why this was the wrong thing to do, and what a better way looks like, watch my whole hour-long tutorial on Spark, where we talk a little more about what's going on under the hood so we can write better programs. Now I'm going to complete this by saying: if you want to see Spark in action, go ahead and watch my final data science project, where I use Spark to solve a real-world problem. People loved it; it's very interesting. I'll link it above — check it out. I'll catch you next time. Thanks so much for watching!
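The flip, group-by-key, and averaging pipeline from the lecture — in PySpark, `flipped_pairs = even_digit_pairs.map(lambda x: (x[1], x[0]))`, `grouped = flipped_pairs.groupByKey()`, and `averaged = grouped.map(lambda x: (x[0], sum(x[1]) / len(x[1])))` — can be sketched in plain Python, with a dictionary of lists standing in for the grouped RDD:

```python
from collections import defaultdict

nums = list(range(0, 1_000_001))
even_digit_pairs = [(x * x, len(str(x * x))) for x in nums
                    if len(str(x * x)) % 2 == 0]

# Flip each pair so the digit count becomes the key:
flipped_pairs = [(digits, n) for n, digits in even_digit_pairs]

# Group values by key, standing in for groupByKey():
grouped = defaultdict(list)
for digits, n in flipped_pairs:
    grouped[digits].append(n)

# One average per group, as in the averaged RDD:
averaged = {digits: sum(vals) / len(vals) for digits, vals in grouped.items()}
print(sorted(averaged))  # the six even digit counts: [2, 4, 6, 8, 10, 12]
```

As in the video, collecting `averaged` is cheap because only six key-value pairs remain after the aggregation.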
Info
Channel: Greg Hogg
Views: 18,887
Rating: 4.9185185 out of 5
Keywords: Apache, Spark, PySpark, Big Data, Tutorial
Id: QLQsW8VbTN4
Length: 17min 16sec (1036 seconds)
Published: Thu Mar 25 2021