AWS EMR Big Data Processing with Spark and Hadoop | Python, PySpark, Step by Step Instructions

Captions
What's up, everyone, welcome back to another episode of AWS tutorials. Today I'll talk about AWS EMR, which is a very popular service in the big data and machine learning world, and then I'm going to show you an example of how to run a spark-submit job on an EMR cluster to process data from Stack Overflow. We're going to use Python and PySpark for this demo.

Before we get to the demo, I want to talk about what exactly EMR is. EMR stands for Elastic MapReduce, an AWS service that lets you scale and run your big data servers on demand in the cloud. That's very beneficial, because big data servers typically require very high computing power and are very expensive, so you don't want to run them all the time. If you can spin a cluster up when you need it and tear it down when you don't, that saves a lot of money over time. The other benefit is that when you spin up an EMR cluster, you can choose whichever big data frameworks you need installed and configured on it, such as Apache Spark, Hadoop, Hive, TensorFlow, and so on.

The way EMR works is that it uses EC2 instances behind the scenes as its worker nodes, which do all the heavy lifting for data processing, and it uses S3 buckets as its file system. One EMR cluster consists of a master node, which acts like the brain of the cluster and delegates tasks to the core nodes, and you can choose as many core nodes as you want in your cluster; each node has all the software you selected installed on it. The reason it has to use an S3 bucket as the file system is that EMR works very differently from your local machine. On a laptop, you can use the same machine as both the server and the file system, reading and writing locally. In EMR, each node is an individual server and they all process the data in parallel, so you can't just save the data on one machine, because the other core nodes wouldn't be able to access it. The data has to live in a central location that every machine can reach, and AWS S3 is a perfect platform for that.

The whole flow works like this: you run a spark-submit job on the master node, it delegates the tasks to the core nodes, and the core nodes read the data source from the S3 bucket. While they're processing, they save intermediate files to the S3 bucket as well, so the other core nodes working in parallel can access them too. That's what makes an EMR cluster so powerful and fast. One last thing I want to mention is that it's very easy to integrate EMR with other AWS services, such as Kinesis, if you want to publish data to a stream, and DynamoDB, if you want to save your data to a database after you finish processing it.
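As an aside, the same kind of cluster can also be provisioned from code rather than through the console. Here is a minimal boto3 sketch, assuming placeholder names for the cluster, log bucket, and key pair, a generic 5.x release label, and the two m5.xlarge core nodes used later in this demo; it is not the exact setup from the video.

```python
# Minimal boto3 sketch for provisioning an EMR cluster similar to the one in this demo.
# All names, the release label, and the key pair are placeholders; adjust for your account.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="emr-demo-cluster",                      # placeholder cluster name
    ReleaseLabel="emr-5.33.0",                    # any EMR 5.x release that bundles Spark
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    LogUri="s3://my-emr-logs-bucket/",            # placeholder log bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",              # needed later to SSH into the master node
        "KeepJobFlowAliveWhenNoSteps": True,      # keep the cluster up until we terminate it
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```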
All right, now that we've learned about EMR, let's spin up a cluster and run a spark-submit job on it.

Right now I'm on the homepage of the AWS console, and step number one is to create the EMR cluster, so I'm going to type in EMR and then Create cluster. Give it a name; I'll just call it emr-demo-cluster. For logging, this is the S3 bucket that EMR creates for you automatically to store the logs. For launch mode we choose Cluster, because we're going to run the spark-submit job manually; Step execution is where you define the steps beforehand, and when the EMR cluster finishes all the steps it terminates itself, which isn't what we want here. For the release we choose version 5, and since we're going to use Spark, we select the combination that includes Spark, Hadoop, and YARN; that's all we need. For the instance type we keep m5.xlarge, and the number of instances is where you choose how many core nodes you want; for a very simple spark-submit job like ours, two core nodes is enough. For the key pair, I already have one created for this account, so I'll choose that. This is very important, because you need a key pair in order to SSH into your EMR cluster, so if you don't have one, stop here and create one first; I'll include a link down below with instructions on how to create a key pair. For permissions we keep everything at the defaults, which allows the EMR cluster to access our S3 bucket. Then I hit Create cluster, and it starts provisioning. This will take about ten minutes, so meanwhile we'll download the Stack Overflow data while we wait.

This is the website where we get the survey data from Stack Overflow. Scrolling down, we see the results from all the years, and we'll just choose this one. You can click into the full data set and hit the download button to get everything; I already have it downloaded on my machine, so I won't do it again. The download is a zip file containing four files, and the detailed survey results file is all we need. I already have it open in Google Sheets, and it's just survey responses from other people.

Now we're going to create an S3 bucket and upload this file to it so that the EMR cluster can access it for data processing. I go back to the AWS console, type in S3, and open it in a new tab. This is the bucket EMR created for storing the logs; now we'll create a different one to store our data source. I'll give it a unique name, keep it in the US East region, enable versioning, enable server-side encryption, and hit Create. Once it's done, I click on it and create a new folder, which I'll just call data_source, enable server-side encryption, and hit Create folder. Then I click into it, hit Upload, and add the file we need. One thing to watch out for: S3 doesn't like spaces in names, so change the spaces in the file name to underscores or dashes. Then I hit Upload.
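If you prefer to script the bucket setup instead of clicking through the console, a rough boto3 equivalent looks like this, assuming a placeholder bucket name and a renamed survey file:

```python
# Rough boto3 equivalent of the console bucket setup above.
# The bucket name is a placeholder; S3 bucket names must be globally unique.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.create_bucket(Bucket="my-emr-demo-bucket")

# Match the console step that turns on versioning for the new bucket
s3.put_bucket_versioning(
    Bucket="my-emr-demo-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Rename the survey file first so the key contains no spaces,
# e.g. "survey results public.csv" becomes "survey_results_public.csv"
s3.upload_file(
    Filename="survey_results_public.csv",             # local file (placeholder name)
    Bucket="my-emr-demo-bucket",
    Key="data_source/survey_results_public.csv",      # "folder" prefix plus file name
)
```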
Okay, the upload is done, and now we can move on to the next step, which is writing the code for the spark-submit job to process the data. I'm going to open VS Code, open an empty folder called tutorial, and create a file called main.py.

First things first, we do some imports: from pyspark.sql we import SparkSession, and the next thing we'll use is the col function. Then we define two variables, one for the data source path and one for where we want to save the data after we process it. I go back to the S3 bucket, copy the S3 URL, paste it in, and append the file name; for the output path we do the same thing, saving to the same S3 bucket but in a different folder that we'll call data_output. Then we call the main method.

Now we define the main method. First we build the SparkSession: give the app a name (I'll just call it demo app, but you can call it whatever you want) and then getOrCreate. Next we read the data from the S3 path; we use the csv reader because our data is a CSV file, and make sure the header option is enabled, which tells Spark to take the first row of the data file as the header. Then we print out how many records we have in the data source file.

Next we select data based on conditions. Looking at the columns we have, we'll use the country column to select all the rows where the country is United States, and the work-week hours column to select all the rows where the hours worked are greater than 45, so we can see how many workaholics there are in the United States. One very important thing here: make sure you add parentheses around the two conditions, otherwise it won't work, because the & operator binds more tightly than the comparisons. Then we print out how many records are in the selected data, again with a count.

Finally, we save the data to our S3 bucket. We write with overwrite mode, which means that if there's already data in that folder it gets overwritten, and we save it as a Parquet file. You could save it as CSV if you want, but Parquet is a more popular format in the big data world, so I'm saving it as Parquet. Then we print out something like "done".

That's everything. Basically, the script reads in the data file we just uploaded to S3, selects all the rows in the data frame where the country is United States and the weekly work hours are greater than 45 (you can do whatever you want with the data, this is just one example), and finally saves the selected data to the S3 output path.
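Putting the narration together, main.py looks roughly like the sketch below. The bucket paths and the exact column names ("Country" and "WorkWeekHrs") are assumptions on my part, so check them against the headers in your copy of the survey file.

```python
# main.py - rough reconstruction of the script described above.
# Bucket paths and column names ("Country", "WorkWeekHrs") are assumptions;
# adjust them to match your S3 bucket and the survey file's actual headers.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_DATA_SOURCE_PATH = "s3://my-emr-demo-bucket/data_source/survey_results_public.csv"
S3_DATA_OUTPUT_PATH = "s3://my-emr-demo-bucket/data_output"


def main():
    spark = SparkSession.builder.appName("demo-app").getOrCreate()

    # header=True makes Spark use the first CSV row as the column names
    all_data = spark.read.csv(S3_DATA_SOURCE_PATH, header=True)
    print("Total number of records in the source data: {}".format(all_data.count()))

    # Parentheses around each condition are required because & binds
    # more tightly than the comparison operators
    selected_data = all_data.where(
        (col("Country") == "United States") & (col("WorkWeekHrs") > 45)
    )
    print("Number of selected records: {}".format(selected_data.count()))

    # Overwrite any previous output in the folder and save as Parquet
    selected_data.write.mode("overwrite").parquet(S3_DATA_OUTPUT_PATH)
    print("All done")


if __name__ == "__main__":
    main()
```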
All right, now let's go back to the EMR console and see if it's done spinning up. You can see the cluster is in the Waiting state and the nodes are running. One thing we need to do before we can SSH into the EMR cluster is open port 22 in the security groups, so I scroll down, click through to the master node's security group, hit Edit inbound rules, scroll all the way down, add a rule, choose SSH on port 22, set the source to My IP, and hit Save rules.

Now let's go back to the EMR cluster; when I click here, it gives me instructions on how to connect to it over SSH. I have a Mac, so I'll follow these, but if you're on Windows, follow the Windows instructions. I open the terminal and type ssh -i followed by the location of my key pair (I saved mine here, but you have to point at wherever you saved yours), then paste the connection string from the console, hit Enter, answer yes, and now I'm in.

On the master node, I create a file called main.py, press the i key for insert, go back to the code we just wrote, copy everything, and paste it in. It looks good, so I hit the Escape key, type :wq, and hit Enter. To run the spark-submit job, I type spark-submit followed by the main.py file we just created and hit Enter, and it's running. It finished successfully, but the logs are a bit hard to read, so let me search through them: the total number of records in our data source is about 64,000, and the total after we filter is about 1,500, so roughly 1,500 of the respondents based in the United States worked more than 45 hours a week, according to the survey. The write step also succeeded, and the output was saved to the S3 location.

Now when we go back to the S3 bucket, there's a new folder created for our data output, and when we click on it we see the Parquet files. One more thing: you can go back to your EMR cluster, click Application user interfaces, and hit refresh; you should see the job listed as successful, along with its duration, and when you click on it you can see all the tasks and stages.

One thing you have to do before you go is terminate your EMR cluster, otherwise it's going to keep charging you money. All you need to do is hit Terminate here and then confirm, and now it's being terminated. And that's it, everyone. I hope you learned something, and if you liked this video, please give it a thumbs up. I'll see you in the next one.
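For reference, the connect, submit, and terminate steps described above come down to a few terminal commands; the key pair path, master public DNS, and cluster id below are placeholders (the exact connection string is shown on the cluster's Connect instructions page).

```bash
# Connect to the master node; the key pair path and host name are placeholders
ssh -i ~/path/to/my-key-pair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# On the master node: create main.py (e.g. with vi), paste the script, then submit it
spark-submit main.py

# When you're done, terminate the cluster (this can also be done from the console);
# the cluster id is visible on the cluster detail page
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
```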
Info
Channel: Felix Yu
Views: 3,043
Keywords: aws, emr, big data, spark, PySpark, hadoop, data processing on the cloud, spark-submit
Id: a_3Md9nV2Mk
Length: 16min 42sec (1002 seconds)
Published: Fri Aug 20 2021