Apache Spark Tutorial | Spark tutorial | Python Spark

Captions
hello and welcome to the Apache Spark with Python course. In this course overview lecture we'll see what this course covers and what you will learn from it. In the first section we'll develop a conceptual understanding of what Apache Spark is, then we'll learn how to install Spark on a local computer. No matter whether you are using Windows, Mac or Linux, you will be able to follow along, and we'll try out our first Spark job on your laptop, which is to count the occurrences of different words in an article. In section 2 we'll learn Spark's core abstraction, which is the resilient distributed dataset. We'll start by learning the basics about RDDs, then we'll look at different ways of creating Spark RDDs. Next we'll demo several of the most popular Spark transformations such as map, filter and flatMap. Then we'll look at set operations, which can help us to integrate several isolated data sources. Next we're going to go through several important Spark RDD actions such as reduce, collect and count. After we have covered both Spark RDD transformations and actions, we're going to recap the difference between those two operations and deep dive into Spark's lazy evaluation mechanism. Before we finish this section I will show you a critical Spark capability to optimize performance, which is caching. In section 3 we'll get to know Spark's master and slave architecture and develop our understanding of some of the most important components in the Spark ecosystem, such as Spark Core, Spark SQL and Spark Streaming. In section 4 we'll start by introducing pair RDDs, which are one of the most useful building blocks in Spark applications for working with key-value style data sources. Then we'll look at different ways to create pair RDDs in Spark: we can either directly return pair RDDs from a list of key-value data structures called tuples, or turn a regular RDD into a pair RDD. Next we'll dive into several useful Spark operations available for pair RDDs, such as reduceByKey, groupByKey and sortByKey. Then we will apply the knowledge we have learned so far to analyze some California real estate price data through the pair RDD API. After that we'll discuss an advanced Spark feature that lets users control the layout of pair RDDs across nodes: partitioning. We're going to finish section 4 by learning the different join operation types on pair RDDs. In section 5 we'll introduce more advanced Spark programming features which allow us to share information across different nodes in an Apache Spark cluster via broadcast variables and accumulators. In section 6 we'll be talking about a critical component of Spark, Spark SQL, which is Spark's interface for working with structured and semi-structured data. To better understand Spark SQL we must get to know two important concepts: DataFrame and Dataset. Then we'll take a closer look at how we can analyze the same Stack Overflow survey data using Spark SQL instead of Spark RDDs, and we'll compare the pros and cons of using Spark SQL vs.
spark rdd's we'll finish this section 6 by demoing several useful performance tuning techniques work with spark SQL in the last section we'll see how to skill our spark or work flow that running our spark application in Amazon EMR cluster through sparks emit [Music] hello in this lecture we'll provide a high-level overview of what Apache spark is Apache spark is a fast 'memory data processing engine which allows data workers to efficiently execute streaming machine learning or SQL workloads that require fast iterative access to data sets essentially spark is a computational engine which can schedule and distribute applications consisting of many computational tasks across many spark worker machines speed is a very critical aspect in processing large data sets as it means the difference between exploring data interactively and waiting minutes or hours one of the main advantages of SPARC regarding speed is his ability to run in computations in memory apache spark has an advanced daj execution engine that supports a cyclic data flow and a memory computing spark enables applications in Hadoop clusters to run up to a hundred times faster in memory and ten times faster even when running on disk on the generality side SPARC is designed to cover a broad range of workflows at his core SPARC provides a general programming model that enables developers to write an application by composing arbitrary operators such as mappers reducers joins and grid bytes and filters this composition makes it easy to express a wide array of computations including iterative machine learning streaming complex queries and batch processing which previously can only be done through multiple distributed systems by supporting these workflows in the same engine spark makes it easy to combine different process models seamlessly in the same application which is often critical and data analysis pipelines for example we can write one spark application that classifies data in real time through spark machine learning library when the data is ingested from streaming sources via spark streaming in the same time the data scientists can also query the resulting data in real time through spark SQL that's it for this lecture I hope hello and welcome back we're going to install Java 8 and get on our local box so that we can fetch our source code of our spark project from github and run it locally on our laptop we need to install a Java development kit because spark is built on top of the skill of programming language which runs on the job of virtual machine also to run our programs we are used to Python API for spark pi spark PI spark is built on top of the sparks Java API so we need to Java development kit to be able to run our programs first let's check if Java is installed on your laptop here just launch a command line terminal and type Java dash version if you already have Java installed this command would print out which version of Java it's currently being used if you don't have Java installed idles gel app top or if your Java version is older than 8 let's move on to install Java on your local box if you already have the desired Java version installed you can skip this installation step here we gooo gooo install Java SE the first entry is the official website to download Oracle's Java development kit for now we have to install Java 8 since SPARC does not support Java 9 for the moment just click on the link on the java SE development kit downloads page accept that license click on the corresponding Installer that matches your platform 
we're go through the installation process for linux and windows if you have a Mac the process is similar to one on Windows if you have a Windows machine you can skip ahead now let's go through the process to install Java on Linux I'm running a 64 bits Linux so I should download this one after in this delation has finished open a terminal so we can finish the process first let's be sure we don't have Java installed now we have to create a folder to unzip the file we've just downloaded usually software's that don't come with a Linux distribution should be saved under the opt folder so let's create JDK folder under opt now let's change to the directory where our Java development kit files was downloaded in my case it's the Downloads folder command to unzip the file under the file we just created you let's check if the file unzipped correctly here it is now we have to add that Java Runtime environment to our environment variables first let's go under the jru folder on our java development kit let's copy the path to this directory so we can add to our environment variables to add a new environment variable we have to edit our dot barcia receipt file under the home folder type export Java home equal and paste a path which just competent also let's add the binaries of the Java Runtime environment to our path so we can use the Java commands from the command line type export path equal dollar sign path Kalon dollar sign Java home slash bin save this file and source it to apply that Changez now we have Java installed as a side note there are easier ways to install Java on Linux but this depends on the package manager of your Linux distribution this method that I just told you should work for any distribution Linux users can skip ahead and now to the git installation step on Windows let's be sure we don't have Java installed now let's go back to the download page since I'm running a 64 bits windows this should be the one for me you after downloading is dollar just click on it and follow the steps you now let's check if the installation was correct opening command prompt and type Java - version I have uploaded all the source code for this course to github to download the source code from github you need to have git installed on your laptop you can check if you have git installed by opening a command line terminal and typing git - - version if you've get installed this command would print out the git version you have if you don't have get installed let's go to the process of installation here we Google download get the first entry is the official GUID website just click on it now choose your platform if we have a Linux machine the download page just shows you how to download the get using the package manager for different Linux distributions you can just follow the steps for the one you have if you have a Windows or Mac machine you have to download the Installer after download installer click on it and follow the steps here check the option to add get to your path this will allow you to run get commence on the command line for the rest you can just keep the default options now let's check if the installation was correct open a command prompt and type get - - version now that we have git installed let's download the source code just open a browser and type HTTP github.com jaylee tutorial python spark tutorial now we are at the github repository for our spark code click the clone or download button which displays the git clone URL for you to download the repository they click the clipboard icon to copy 
the URL to your clipboard go back to the terminal and type clone and paste the URL hit enter to download the git repository after the download is complete we'll have a spark tutorial directory under the current directory that's it for this lecture and I'll see you next time [Music] hello everyone and welcome back in this lecture we're going to download and set up apache spark here we go go apache spark the first entry is the official apache spark web site click on it click on download select their spark release you wish to download and click on the link click on the mirror URL to start the download let's go through the process of setting up spark for both Linux and Windows machines on Mac you can follow the same steps as a Linux if you have a Windows machine you can skip ahead now on Lenox open up a terminal first let's create the folder where we are going to unzip spark usually on Linux software that don't come with a distribution should be stored under the opt folder so let's create a pachi spark folder under opt you now let's use the tar command to one-zip spark under the footer we've just created you now that we have spark installed let's add the environment variables to make it easier to use spark first go to the folder where you unzipped spark and copy the path you next let's ended the dot bash RC file under the home folder to add the environment variables if you have a Mac machine you should added the dot bash profile file instead now let's add the spark home variable just type export spark home equals and paste the path you've just copied at last let's update our path variable to also include the spark binaries just type path equals dollar sign path Colin and dollar sign spark home slash bin save this file and source it to apply the changes now we can check if spark is set up correctly by starting a pike spark session just type I spark and hit enter we get a Python interpreter with a spark session available don't worry about the warnings for now if you're a Linux or a Mac user you can skip now on Windows let's unzip the file Swift just downloaded for then I'm using 7-zip you can unzip whatever folder you want I created an apache spark folder under the local disk and i will unzip it there first unzip the file to get the tar file now unzip at our file into the folder you want running spark and Windows is not different from other operating systems but Hadoop has a problem with Windows NTFS file system to be able to run spark download windle's dot exe from this URL create a new folder named Hadoop under the local disk inside this create another folder named bin and save it there let's add some environment variables to be able to run spark right-click on top of the windows logo and go to system now click on advanced system settings then click on environment variables first at a Duke home variable click on new give your variable the name Hadoop home and the values should be the path to the hadoo directory we've just created now let's at the spark home variable click on new again give your variable the name spark home the value should be the path to the directory where spark is now let's add the Hadoop binary and spark binaries to the path variable click on the path variable and then on edit click on new and at percentage Hadoop home percentage /bin you click on new again and at percentage spark home percentage slash Bend close the windows by clicking on okay before running spark we have to create a folder TMP under the local disk under TMP treat a new folder named hive Hadoop needs these 
folders to write correctly. Now open the command prompt and type winutils chmod 777 and the path to the folder you've just created. This will set the appropriate permissions for Hadoop to use this folder. Now we can check if Spark is set up correctly by starting a PySpark session: just type pyspark and hit enter. We get a Python interpreter with a Spark session available; don't worry about the warnings for now. Before we go, let's change the Spark log configuration. This will clean up the standard output for when we run our Spark programs. We will change the log level from INFO to ERROR to make the standard output less noisy, making it easier to read the output of our Spark programs, as Spark logs a lot of stuff at INFO level. Go to the folder where Spark is and go to the conf folder. Now make a copy of the log4j.properties.template file and rename it to log4j.properties. Open this file and go to the line that says set everything to be logged to the console. In the following line, log4j.rootCategory, where it says INFO, change it to ERROR. With this, only error messages will be logged to the standard output. That's it for this lecture, see you next time.

Hello everyone and welcome back. In this lecture we are going to run our first Spark job, which is called word count. What the job does is count the occurrences of each word in a real article. You can open the project we downloaded from GitHub in the last lecture using any text editor or Python IDE you like. I'll be using Visual Studio Code, which is a lightweight, free and powerful code editor, but you are welcome to use your favorite one. In our project there is a directory called in; all our input data sources are in this directory. The article we're going to analyze is in a file called word_count.txt. Let's open it up. This is a short article about the history of New York with less than 1000 words. We are going to load this file and count the number of each word using Spark. Next let's open up the word count file under the rdd folder. This is the first Spark program we are going to run. Don't worry if you don't fully understand all the code here; this is just an exercise to get your hands dirty with Spark. Let me quickly walk you through this file. First we create the SparkContext, which we import from the PySpark API. This context is the entry point to Spark Core. Our Spark app is named word count, and we are going to run our application on an embedded Spark instance on our local box, which could use up to three of our CPU cores. Then we set the log level to ERROR; if we changed the Spark log configuration file, this is not necessary. Then we load the word count file as an RDD. RDD stands for resilient distributed dataset, and we are going to see more of what that is later. Next we are going to split the article into separate words using white space as the delimiter. Finally we calculate the occurrence of each word and print out the results. Now let's run this Spark job. Open a command line terminal; we are going to use the spark-submit script that comes with Spark. Just type spark-submit, the path to the script you wish to run, and hit enter. Let's go through the output. As you see, most of the words only appeared once, the word "state" appeared five times, and some common words such as "to" appeared 17 times. Congratulations, you have just run your first Spark program. Again, don't worry if you don't fully understand all the source code of this program; we'll go through all of it in the later lectures.
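As a rough illustration of the program being walked through here, a word count in PySpark can look like the following sketch; the file name and the local[3] master are taken from this lecture, so the actual repository code may differ slightly.

```python
# A minimal sketch of the word count job described above (file path and
# thread count follow the lecture, not necessarily the exact repo code).
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local[3]", "wordCounts")   # embedded Spark, 3 worker threads
    sc.setLogLevel("ERROR")

    lines = sc.textFile("in/word_count.txt")              # load the article as an RDD of lines
    words = lines.flatMap(lambda line: line.split(" "))   # split each line into words
    word_counts = words.countByValue()                    # action: dict of word -> count

    for word, count in word_counts.items():
        print("{} : {}".format(word, count))
```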
Hello everyone. In this section we're going to talk about RDD, which is short for resilient distributed dataset. The entire world of Spark is built around RDDs, which are Spark's core abstraction for working with data. The RDD is the core object that we will be using when developing Spark applications, and it is probably the most important concept that you want to understand and know how to use, so let's dive in. RDD stands for resilient distributed dataset, so let's start by talking about what a dataset is. A dataset is basically a collection of data; it can be a list of strings, a list of integers, or even a number of rows in a relational database. RDDs can contain any type of object, including user-defined classes. An RDD is simply an encapsulation around a very large dataset. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark will automatically distribute the data contained in RDDs across your cluster and parallelize the operations you perform on them.

Once RDDs are created, what can we do with them? RDDs offer two types of operations: transformations and actions. Transformations basically apply some function to the data in an RDD to create a new RDD. One of the most common transformations is filter, which will return a new RDD with a subset of the data in the original RDD; for example, we can use filter to create a new RDD holding just the strings that contain the word Friday. Actions, on the other hand, compute a result based on an RDD. One of the most popular actions is first, which returns the first element in an RDD. To summarize, every Spark program will work as follows: generate initial RDDs from external data, apply transformations such as map and filter on those RDDs, then launch actions such as count to fire off the computation of the result, which will be optimized and executed by Spark. In the later lectures of this section we will go through each of these steps in detail. That's it for this lecture, I hope you have enjoyed it.

Hello and welcome back. In this lecture we'll be talking about how to create RDDs. Spark provides two ways to create RDDs: loading an external dataset, and parallelizing a collection in your driver program. The simplest way to create RDDs is to take an existing collection in your program and pass it over to SparkContext's parallelize method. In case you don't know what SparkContext is, the SparkContext object represents a connection to a computing cluster; we will talk about SparkContext in great detail in later lectures. Once you have a SparkContext, you can use it to build RDDs. Once you pass a collection over to SparkContext's parallelize method, all the elements in the collection will be copied to form a distributed dataset that can be operated on in parallel. This approach is very handy when you're learning Spark or running some simple tests, since you can quickly create your RDDs with little effort and run some operations on them. However, this approach is not very practical in real-life scenarios, because it requires that the entire dataset fits into the memory of the driver machine before you distribute it across your cluster. In a lot of cases the scale of the dataset we're dealing with is terabytes, which definitely won't fit into the memory of a single machine. So a common way to create RDDs in Spark is to load the dataset from external storage.
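To make the two creation paths concrete, here is a small sketch; the word list and the file path are only illustrations, not data from the course.

```python
# Sketch of the two ways to create an RDD described above.
from pyspark import SparkContext

sc = SparkContext("local[*]", "createRdd")

# 1) Parallelize a collection that already lives in the driver program
#    (handy for learning and tests, not for terabyte-scale data).
words_rdd = sc.parallelize(["spark", "rdd", "transformation", "action"])
print(words_rdd.count())  # 4

# 2) Load an external dataset, e.g. a local text file, as an RDD of lines.
#    (The path is illustrative; in the course project the inputs live under "in/".)
lines_rdd = sc.textFile("in/word_count.txt")
print(lines_rdd.first())
```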
So where do RDDs come from? External storage can be the local file system; in our previous word count example we have already seen how to load a text file on our local disk as an RDD of strings using the SparkContext textFile method. More realistically, the external storage is a distributed file system such as Amazon S3 or HDFS, and there are lots of other data sources which can be integrated with Spark and used to create RDDs, including JDBC, Cassandra, Elasticsearch, etc. We won't cover all of them in this course; if you're interested, take a look at the references in the next lecture.

Hello and welcome back. As we've discussed in the previous lectures, RDDs support two types of operations: transformations and actions. In this lecture we'll be talking about transformations. Transformations are operations on RDDs which return a new RDD; keep in mind that transformations return a new RDD instead of mutating the existing input RDD. The two most common transformations are filter and map. The filter transformation takes in a function and returns an RDD formed by selecting those elements which pass the filter function. The filter function can be used to remove some invalid rows to clean up the input RDD, or just to get a subset of the input RDD based on the filter function. The map transformation takes in a function and passes each element in the input RDD through the function, with the result of the function being the new value of each element in the resulting RDD. The map transformation is versatile in that it can do a lot of things: for example, it can be used to make HTTP requests to each URL in our input RDD, or it can be used to calculate the square root of each number. It is worth noting that the return type of the map function is not necessarily the same as its input type. Take a look at the following example: we have an RDD of strings, and our map function parses the strings and returns an integer which is the length of the string. Our input RDD type would be a string RDD, and the resulting RDD would be an integer RDD.
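A tiny sketch of that exact point, with made-up strings, showing map changing the element type:

```python
# map can change the element type: here a string RDD becomes an integer RDD.
from pyspark import SparkContext

sc = SparkContext("local[*]", "mapReturnType")

strings = sc.parallelize(["spark", "python", "rdd"])
lengths = strings.map(lambda s: len(s))   # string RDD -> integer RDD

print(lengths.collect())   # [5, 6, 3]
```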
Now let's take a look at a real problem that we could solve using the Spark filter and map transformations. I'm back at our Spark tutorial project. The file we're going to analyze is some global airport data that lives in the airports.text file under the in directory. This is a CSV format file; from left to right, each column represents the airport ID, name of the airport, main cities served by the airport, country where the airport is located, IATA/FAA code, ICAO code, latitude, longitude, altitude, timezone, DST and timezone in Olson format. Open the airports in USA problem file under the rdd airports package. The task for us is to create a Spark program to read the airport data from the airport text file under the in directory, find all the airports which are located in the United States, and output the airport's name and the city's name to the airports in USA text file under the out directory. The sample output would look like this.

Let's see how we can solve this problem. Just open up the airports in USA solution file under the same package. First we initialize the SparkConf object. The SparkConf object specifies various Spark parameters for a Spark application. Here we set the application name for our Spark application; this will show up in the Spark web UI, which we will see in a later lecture. Then we set the master URL of the Spark cluster. In this example we will be running Spark in local mode, so we can specify local in the master parameter. local[2] means this Spark job will run two worker threads on two cores of the CPU on my local box. If we set it to local[*], it will run locally on all the available cores. If we set just local, it will run locally with only one thread. In this way we have constructed our SparkConf object and set the application name and master URL. Then we can create a SparkContext object by passing the SparkConf object as a constructor parameter. As we have mentioned before, SparkContext is the main entry point for Spark functionality; the SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster. Then we call the textFile method on the SparkContext object to load our input file as a string RDD; each item in the string RDD represents a line in the input airport file. Next we need to find out all the airports which are located in the United States, so here we call the filter method on the string RDD. The filter method takes a function as an argument; the function takes a string as an argument, which represents each element in the RDD, and returns a boolean to decide whether this element should appear in the resulting RDD or not. So here we split each line using a comma delimiter. Let's take a look at how this comma delimiter is declared. Go to the commons folder in our project and click on Utils. As you see, this comma delimiter is a regular expression which matches commas, but not commas within double quotations. We can see there are some cities which have commas in their names, but those commas are quoted; we shouldn't use those commas as the delimiter, we should only use commas which are outside of quotations as the delimiter to split our input lines. Let's go back to our solution file. To be able to use our Utils class we have to import it into our file; since the commons module is in our root directory, we have to add the root directory to our current path to be able to import the commons module, and we do this here.
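For reference, a comma-splitting regular expression of the kind described here (split on commas that sit outside double quotes) can be written as follows; treat the exact pattern as an illustration rather than the repository's literal Utils code.

```python
import re

# Split on commas that are NOT inside double quotes, as described above.
# This is one common way to express that rule; the course's Utils class
# may use a slightly different but equivalent regex.
COMMA_DELIMITER = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

line = '123,"La Guardia Airport","New York, NY","United States"'
print(COMMA_DELIMITER.split(line))
# ['123', '"La Guardia Airport"', '"New York, NY"', '"United States"']
```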
After splitting the line, we take the fourth split, which is the country of the airport, and return the airports whose country equals the United States. Two things we need to be aware of: first, the split method returns a list of the split results, and the index starts from zero, so the index of 3 means taking the fourth split. Secondly, if you look at the input file, all the countries are double quoted, so we need to add the double quotation marks to the United States as well. Next, let's map the filtered airport lines to the airport name and city name pair, as required. Since Python does not allow us to write multi-line lambda functions, we have to move the logic to split the lines to another function. So again we split the string using commas, then take the second column, which is the name of the airport, and the third column, which is the city of the airport, join them together with a comma, and return the joined string to the map function. Lastly, we output the resulting RDD to the airports in USA file in the output directory. Let's run this application. To run this application, just use the spark-submit command that we saw in the previous lectures: open a terminal and type spark-submit and then the path to our file. After the computation is completed, let's go check out the out folder. We have an airports in USA file. Since we have specified our Spark application to run on two worker threads, the result of each worker is output to a separate file, so we have two output files. We can look at those two files; they contain the expected airport name and city name pairs. Now we have solved our first Spark task using the filter and map transformations.

Next it's time for you to do a practice. Let's open the airports by latitude problem file. We will still be analyzing the same input airport data. We want to find out all the airports whose latitudes are larger than 40 and output the airport name and latitude pair to a file. It's your turn, try to implement your solution in this file. Take your time to work on this task, as this is your first Spark practice. If you get stuck, take a look at our airports in USA solution file to see how we used the filter and map functions. We'll take a look at the solution for this task in the next lecture.
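As a sketch of the solution just walked through: the commons.Utils helper, the exact file names and the output path are assumptions based on this lecture, so the repository code may differ in detail.

```python
from pyspark import SparkConf, SparkContext
from commons.Utils import Utils   # assumed helper exposing the comma-splitting regex

def airport_name_and_city(line: str) -> str:
    splits = Utils.COMMA_DELIMITER.split(line)
    # column 1 is the airport name, column 2 is the city it serves
    return "{}, {}".format(splits[1], splits[2])

if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    airports = sc.textFile("in/airports.text")

    # keep only airports whose country column (index 3, still double-quoted) is the US
    airports_in_usa = airports.filter(
        lambda line: Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

    airports_in_usa.map(airport_name_and_city).saveAsTextFile("out/airports_in_usa.text")
```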
In this lecture we're going to talk about another popular transformation called flatMap. Sometimes we want to produce multiple elements from each input element; this is where flatMap comes in handy. As with map, the function provided to flatMap is applied to each element of the input RDD, but with flatMap the results are flattened before being returned. A popular use case of the flatMap transformation is to split up an input string into words, as we have seen in the previous word count example: we have as an input RDD a list of lines in a text, and we want to get as a result the list of words in this text. If we use the map transformation, we get as a result an RDD where each element is a list of the words for each line; to get the desired result we would have to then flatten this RDD. Flattening can be seen as unpacking each element, in this case each list, of the RDD; the result would be an RDD with the elements from each list. On the other hand, if we use the flatMap transformation, we get the desired result right away. You might want to ask when we should use flatMap over map. Map should be used if you have a one-to-one relationship between the rows of the input data and the rows of the RDD you're creating; flatMap should be used if you have a one-to-many relationship between the rows of the input data and the rows of the RDD you're creating.

Let's revisit our first Spark program, which is the word count example. Open the word count example under the rdd package. We will run our application locally on three worker threads. Then we load the input article file as a string RDD and call flatMap on the initial string RDD. As we explained before, flatMap works by applying a function that returns a sequence for each element in the list and flattening the results into a single list. We take each line as an argument and split the line using space, which will return us an array of the words of that line. In this way we get back a new string RDD; each item in the new word RDD is a word in the original input file. Next we call the countByValue operation on the word RDD. The countByValue operation will return the count of each unique value in the input RDD as a map of value and count pairs; we'll talk more about the countByValue operation later. Now we get back a map of each word and its count.
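A short sketch of the contrast just described, using two made-up lines of text:

```python
# Contrast of map vs flatMap on lines of text, as described above.
from pyspark import SparkContext

sc = SparkContext("local[3]", "mapVsFlatMap")

lines = sc.parallelize(["hello spark", "hello python"])

as_lists = lines.map(lambda line: line.split(" "))
print(as_lists.collect())     # [['hello', 'spark'], ['hello', 'python']]

as_words = lines.flatMap(lambda line: line.split(" "))
print(as_words.collect())     # ['hello', 'spark', 'hello', 'python']

print(dict(as_words.countByValue()))   # {'hello': 2, 'spark': 1, 'python': 1}
```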
Lastly, we print out all the key pairs of this map. We can run this application. To run this application, just use the spark-submit command that we saw in the previous lectures: open a terminal and type spark-submit and then the path to our file. As you can see, the output is each word in the original file and its count. That's it for this lecture, I'll see you next time.

Hello and welcome back. In this lecture we'll be talking about another type of Spark transformation: set operations. RDDs support many of the operations of mathematical sets. Some set operations are performed on a single RDD; the most popular ones are the sample and distinct operations. The sample operation will create a random sample from an RDD. It is quite useful for testing purposes: sometimes we want to take a small random sample of a larger dataset, apply some transformations, and do it all on our laptop. This is when the sample operation comes in handy. The sample method takes three arguments. The first one is whether the sampling is done with replacement or not; sampling with replacement is a way of doing sampling, and it's more of a statistical term rather than a Spark concept. I have posted an article which explains what sampling with replacement is; if you're interested in learning more about it, take a look at the next lecture. The second one is the sample size as a fraction: let's say we want to take one tenth of the original dataset, we can just put 0.1 as the sample size. The third argument is the seed used for generating random numbers. Next we'll talk about the distinct operation. The set property which is often missing from RDDs is the uniqueness of elements, as our RDDs often have duplicates. The distinct transformation will just return the distinct rows from the input RDD. Keep in mind that the distinct operation is quite expensive because it requires shuffling all the data across partitions to ensure that we receive only one copy of each element; you should avoid using the distinct operation if deduplication is not necessary for your Spark workflow.

There are some set operations which are performed on two RDDs and produce a resulting RDD from those two RDDs; some popular ones are union, intersection, subtract and cartesian product. It's important to note that all of these operations require that the RDDs being operated on are of the same type. Let's start by introducing the union operation. The union operation gives us back an RDD consisting of the data from both input RDDs. This is quite useful in a lot of use cases; for instance, we can use it to aggregate log files from multiple sources. It is worth mentioning that, unlike the mathematical union operation, if there are any duplicates in the input RDDs, the resulting RDD of Spark's union operation will contain duplicates as well. The next operation we'll talk about is intersection, which returns the common elements which appear in both input RDDs. Intersection will also remove all duplicates, including the duplicates within a single RDD, before returning the results. Keep in mind that the intersection operation is quite expensive, since it requires shuffling all the data across the partitions to identify common elements. The next one is subtract. The subtract function takes in another RDD as an argument and returns an RDD that only contains elements present in the first RDD and not in the second RDD. This is useful if we want to remove some elements from an existing RDD. Similar to the intersection and distinct operations, the subtract operation requires a shuffling of all the data, which could be quite expensive for large datasets.
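A small illustration of the single-RDD and two-RDD set operations described above; the numbers are made up for the example, and the printed ordering can vary between runs.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "setOps")

a = sc.parallelize([1, 2, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

print(a.distinct().collect())            # e.g. [1, 2, 3, 4]
print(a.sample(withReplacement=False, fraction=0.5, seed=42).collect())

print(a.union(b).collect())              # keeps duplicates: [1, 2, 2, 3, 4, 3, 4, 5]
print(a.intersection(b).collect())       # [3, 4]
print(a.subtract(b).collect())           # [1, 2, 2]  (order may vary)
```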
The last operation we want to talk about is cartesian. The cartesian transformation returns all possible pairs of a and b, where a is in the source RDD and b is in the other RDD. The cartesian product can be very handy if you want to compare the similarity between all possible pairs; for example, we can compute every user's rating for each movie. We could also take the cartesian product of an RDD with itself, which would be useful if we would like to analyze things like product similarity.

Now let's take a look at a real-life example. Here we have two tab-separated log files from one of NASA's Apache servers. Each log file contains the hostname, log name, time, the HTTP method, the URL, the response code and the number of bytes. We have two log files: one contains 10,000 log lines collected on July 1st 1995, the other one contains 10,000 log lines collected on August 1st 1995. Let's open the union log problem file under the rdd nasa apache web logs package. The task is to create a new RDD which contains the log lines from both July 1st and August 1st, take a 0.1 sample of those log lines, and save it to the sample nasa logs file in the out directory. Make sure the header lines are removed from the resulting RDD. Let's see how we solve this problem. Open the union log solution file under the same package. First we create a SparkConf object; in this program we specify local[*] as the master URL, which basically means our Spark application will run on all the available cores of our local CPU. Then we create a SparkContext object from the SparkConf we created. Next we load those two input log files as two string RDDs. Then we call the union method on the July 1st log RDD and supply the August 1st log RDD as an argument; this will give us back an aggregate RDD that contains items from both RDDs. Then we filter out the header lines; we have extracted the filtering logic to a separate method called is not header. We get back a clean log file RDD which doesn't contain the header lines. Then let's take a sample of 0.1 on the RDD, and lastly save the resulting RDD as a text file. Let's run this application. To run this application, just use the spark-submit command that we saw in the previous lectures: open a terminal and type spark-submit and the path to our file. After the execution is done, we check out the output files in the out directory. As you see, we have four output files; it means our Spark application ran on four worker threads, and each output file contains a portion of the aggregated log files. Now we're done with the Spark program which demos how we use the union operation to consolidate logs from different times.

It's time for you to do a practice. Let's open the same hosts problem file. We'll still be working on those two log files. Your task is to create a Spark program to generate a new RDD which contains the hosts which were accessed on both July 1st and August 1st, so the resulting RDD will only contain the common hosts, not the full log lines. Now give it a try and implement the solution in this file. We will discuss the example solution in the next lecture.
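A sketch of the union solution described above; the input and output file names and the exact header check are assumptions based on this lecture, so the repository code may differ.

```python
from pyspark import SparkConf, SparkContext

def is_not_header(line: str) -> bool:
    # Assumed header format: a line naming the columns, starting with "host"
    # and containing "bytes".
    return not (line.startswith("host") and "bytes" in line)

if __name__ == "__main__":
    conf = SparkConf().setAppName("unionLogs").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    july_logs = sc.textFile("in/nasa_19950701.tsv")     # assumed file names
    august_logs = sc.textFile("in/nasa_19950801.tsv")

    aggregated = july_logs.union(august_logs)           # lines from both days
    cleaned = aggregated.filter(is_not_header)          # drop the header lines

    sample = cleaned.sample(withReplacement=True, fraction=0.1)
    sample.saveAsTextFile("out/sample_nasa_logs.csv")
```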
In this lecture let's talk about actions that you can perform on RDDs. Actions are the second type of RDD operation; they are the operations which return a final value to the driver program or persist data in an external storage system. Actions will force the evaluation of the transformations required for the RDD they were called on. Next let me walk you through some of the most popular actions in Spark. Collect: the collect operation retrieves the entire RDD and returns it to the driver program in the form of a regular collection or value. Let's say you have a string RDD; when you call the collect action on it, you would get a list of strings, and the same applies for other RDD types as well. So after the resulting collection is returned, we can manipulate the results, such as iterating over the collection to print it out at the driver machine or persisting it to disk. This is quite helpful if your Spark program has filtered RDDs down to a relatively small size and you want to deal with them locally. Just be aware that the entire dataset must fit in memory on a single machine, as it all needs to be copied to the driver when the collect action is called, so the collect action shouldn't be used on large data
sets in a lot of cases collect action can be called on rdd's because they are too large to be fit into the memory of the driver machine collect operation is widely used in unit tests to compare the value of our RDD with our expected result as long as the entire contents of the RTD can fit in memory let's take a look at a collection example under the rd d dot collect package here we have a list of words in the driver program then we call parallelized method on the spark context object to convert the list of words to a string RDD then we can call collect on the string RTD to convert the RTD back to a list of strings lastly we print the contents of the list of strings let's run this application to run this application just use the spark submit comet that we saw in the previous lectures open a terminal and type spark submit and the path to our fire you next operation we want to talk about count and count by value actions if you just want to count how many rows in an RDD count operation is a quick way to do that it would return the count of the elements can by value we'll look at unique values in the each role of your ID D and return a map of each unique value to its count this is useful when your RTD contains duplicate rows and you want to count how many of each unique row value you have we have already see the usage of coun by value in our previous word count example it'll return us a map of each word and it's count let's take a look at another count example under the RDD dot-com package this time we'll run this application first to run this application just use the SPARC submit comet that we saw in the previous lectures open a terminal and type SPARC submit and the path to our file we have an input string RDD if we call the count operation on the RDD it'll give us the total count of items in the RTD the duplicates would count as well we have two Hadoop's in the RTD both of them count next we call can by value on the RTD this will return us a map of each unique word and is count as you see we have two Hadoop's in other words only appear once next we talk about the take action take action takes an element from the RDD this operation can be very useful if we would like to take a peek at the cardd for unit tests and quick debugging you can just take let's say the first three rows of the RDD and print them out to the council take we return and elements from the RTD and it'll try to reduce the number of partitions it accesses so it is possible that the take operation could end up giving us back a biased collection and it doesn't necessary return the elements in the order we might expect let's open the take example under the RTD take package just run this application to run this application just use this Park submit comment that we saw in the previous lectures open a terminal and type spark submit and the path to our fire as you see we have the same string rdd's here if we call take operation on the string RDD it gives it back three elements from the string RDD the next action is save as text file save as text file can be used to write data out to a distributed storage system such as HDFS or Amazon s3 or even local file system we have already see the usage of save as text file in lots of our previous examples the last action we want to talk about is reduce reduce is probably the most common action in our SPARC program the reduced action takes a function that operates on two elements of the type in the input RDD and returns a new element of the same type SPARC RDD reduce function reduces the 
elements of his RDD using its specified binary function this function produces the same result when repetitively apply on the same set of RTD data and reduces to a single value with reduce operation we can easily sum up all the elements of an RDD count the total number of elements or perform some other types of aggregations let's take a look at the reduce example file under the rd d dot reduce package here we have an integer R D D of 1 2 3 4 & 5 if we call reduce function on the r DD the reduce function we pass in takes two arguments what the function does is to return the product of those two arguments this reduce operation will be applied to all the items in our DD and return a single value which is the product of all the integers and the input our DD let's run the application to run this application just use the SPARC submit comment that we saw in the previous lectures open a terminal and type SPARC submit and the path to our file as you see we get a hundred-twenty which is the product of 1 2 3 4 and 5 as time for you to do a practice we open the sum of number problem file under the sum of numbers package your task is to create a spark program to read the first 100 prime numbers from the Prime noms text file and print the sum of those numbers to Council let's open the prime knobs text file under the in directory as you see each role of the input file contains two prime numbers separated by spaces yes your turn to implement the solution we will discuss the simple solution in the next lecture hello and welcome back in this lecture we're going to take a look at a sample solution for the sum of number problem let's open the sum of numbers solution file under the sum of number package we first load the prime number file as a string RDD our original file is a tap separative file so here we split the lines by tap however the split results might contain empty strings as well so we need to filter out those empty strings on Python the empty string returns false so here if the number is an empty string this expression would return false otherwise it would return true in this way we get back a new string RDD with each item being the string representation of one of the first a hundred numbers then we do a map transformation to convert each number from string type to an integer finally we can call the reduce action the integer RDD the reduce function we passing takes two arguments what the function does is to return the sum of those two arguments this reduce operation will be applied to all the items in our DD and return a single value which is the sum of all the integers in the input RDD let's run this application to run this application just use the sparks emit command that we saw in the previous lectures open a terminal and type spark summit and then path to our file see the sum of the first 100 prime numbers is 2 for 1 double 3 that's it for this lecture I hope you have enjoyed it [Music] hello and welcome back in this lecture we will talk about several important aspects about rdd's first of all our duties are distributed each RDD is broken into multiple pieces called partitions and these partitions are divided across the clusters let's say your spud cluster has 8 nodes an RDD can be split into 8 partitions the number of partitions is configurable and the rdd's spread across multiple nodes can be operated on in each node in parallel and independently this partitioning process is done automatically by SPARC so you don't need to worry about all the details about how your data is partitioned across 
[Music]

Hello and welcome back. In this lecture we will talk about several important aspects of RDDs.

First of all, RDDs are distributed. Each RDD is broken into multiple pieces called partitions, and these partitions are divided across the cluster. Let's say your Spark cluster has 8 nodes; an RDD can then be split into 8 partitions. The number of partitions is configurable, and an RDD spread across multiple nodes can be operated on by each node in parallel and independently. This partitioning is done automatically by Spark, so you don't need to worry about the details of how your data is partitioned across the cluster.

Secondly, RDDs are immutable: they cannot be changed after they are created. You might want to ask why RDDs are designed to be immutable. Immutability rules out a significant set of potential problems caused by updates from multiple threads at once.

Lastly, RDDs are resilient. RDDs are a deterministic function of their input. This, plus immutability, means the parts of an RDD can be recreated at any time: in case any node in the cluster goes down, Spark can recover those parts from the input and pick up from where it left off. Spark does the heavy lifting for you to make sure RDDs are fault tolerant.

[Music]

In the previous lectures we have seen that RDDs support two types of operations: transformations and actions. Transformations are operations on RDDs that return a new RDD, such as map and filter. Actions are operations that return a result to the driver program or write it to storage, and kick off a computation, such as count and collect.

Transformations and actions are different because of the way Spark computes RDDs. Even though new RDDs can be defined at any time, they are only computed by Spark in a lazy fashion, that is, the first time they are used in an action. Let's take a look at the following example: we load a string RDD from a text file and then filter the lines that start with "Friday". If Spark loaded and stored all the data in the file as soon as it saw the loading statement, we would end up wasting a lot of storage space, because we then immediately filter out many lines by applying the filter transformation. Instead, Spark delays the computation: after it sees the whole chain of transformations, it computes only the data that is needed for the result. In our case, Spark only starts calculating the result when we call first(). In fact, what happens under the hood is that Spark scans the file only until the first line starting with "Friday" is detected; it doesn't even need to go through the entire file.

Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action. Rather than thinking of an RDD as containing specific data, it might be better to think of each RDD as consisting of instructions on how to compute the data, built up through transformations. Spark uses lazy evaluation to reduce the number of passes it has to take over our data by grouping operations together. If you're ever confused about whether a given function is a transformation or an action, you can look at its return type: transformations return RDDs, whereas actions return some other data type. This is a quite useful tip for telling whether an operation is a transformation or an action.
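As a rough illustration of the lazy evaluation behaviour described above, here is a short sketch. The input file in/logs.txt is a hypothetical placeholder; the point is simply that no work happens until the first() action runs.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazyEvaluationDemo")

# No data is read yet: this only records how to compute the RDD.
lines = sc.textFile("in/logs.txt")  # hypothetical input file

# Still nothing is computed; the filter is just added to the plan.
friday_lines = lines.filter(lambda line: line.startswith("Friday"))

# first() is an action, so Spark now scans the file, stopping as soon
# as it finds the first line that starts with "Friday".
print(friday_lines.first())
```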
Hello and welcome back. We're going to talk about one of the most critical Spark capabilities, which is saving a data set in memory or on disk across operations for performance optimization. Sometimes we would like to call actions on the same RDD multiple times. If we do this naively, the RDD and all of its dependencies are recomputed each time an action is called on it. This can be very expensive, especially for iterative algorithms which call actions on the same data set many times. If you want to reuse an RDD in multiple actions, you can ask Spark to persist it by calling the persist method on the RDD. When you persist an RDD, the first time it is computed in an action it will be kept in memory across the nodes, which allows future actions to be much faster, often by more than 10 times. Caching is a critical tool for iterative algorithms and fast interactive use.

Let's take a look at the persist example file under the rdd.persist package. Here we have an integer RDD, and we persist it using the MEMORY_ONLY storage level (we'll talk more about storage levels in a moment). Then we call reduce on this RDD. At this point the parallelize transformation is executed to distribute the RDD from the driver program to the worker threads, and reduce is called on the partitions. Since this RDD is persisted, it will be kept in memory across the worker threads. When we call count on this RDD afterwards, Spark won't run the parallelize transformation again; it goes straight ahead and does the count action.

Each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the data set on disk or keep it in memory. These levels are set by passing a StorageLevel object to the persist method; the cache method is a shorthand for using the default storage level, which is MEMORY_ONLY. With the MEMORY_ONLY storage level, the RDD is stored as deserialized Java objects in memory. If the RDD cannot fit into memory, some partitions won't be cached and will be recomputed on the fly each time they are needed. MEMORY_ONLY is the default level.

There are some other storage levels. MEMORY_AND_DISK stores the RDD as deserialized Java objects in memory; if the RDD cannot fit into memory, the partitions that don't fit are stored on disk and read from disk when they are needed. MEMORY_ONLY_SER is similar to MEMORY_ONLY, but it stores the RDD as serialized Java objects in memory; this is more space efficient than deserialized objects, especially when using a fast serializer, but more CPU intensive to read. MEMORY_AND_DISK_SER is similar to MEMORY_ONLY_SER, but it saves the partitions that cannot fit into memory to disk instead of recomputing them on the fly each time they are needed. DISK_ONLY stores the RDD partitions only on disk.

You might want to ask which storage level we should choose. Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency, and there are some factors to consider before selecting the most suitable one. If the RDDs can fit comfortably with the default storage level, MEMORY_ONLY is the ideal option; it is the most CPU-efficient choice, allowing operations on the RDDs to run as fast as possible. If not, try using MEMORY_ONLY_SER to make the objects much more space efficient but still reasonably fast to access. Don't save to disk unless the functions that computed your data sets are expensive or they filter out a significant amount of the data; otherwise recomputing a partition may be as fast as reading it from disk.

What would happen if you attempted to cache more data than fits in memory? Spark will evict old partitions automatically using a least-recently-used cache policy. For the MEMORY_ONLY storage level, Spark will recompute these partitions the next time they are needed; for the MEMORY_AND_DISK storage level, Spark will write these partitions to disk. In either case your Spark job won't break even if you ask Spark to cache too much data, but caching unnecessary data can cause Spark to evict useful data and lead to longer recomputation times. You should call the unpersist method on an RDD when you want to remove it from the cache. That's it for this lecture, I hope you have enjoyed it.
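Here is a minimal sketch of the persist pattern described in this lecture. The numbers and app name are placeholders, but persist, cache, StorageLevel and unpersist are the standard PySpark calls.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistDemo")

integer_rdd = sc.parallelize(range(1, 101))

# Keep the RDD around across actions. MEMORY_ONLY is the same level the
# cache() shorthand defaults to.
integer_rdd.persist(StorageLevel.MEMORY_ONLY)

# First action: the RDD is computed and its partitions are cached.
print(integer_rdd.reduce(lambda x, y: x + y))

# Second action: served from the cached partitions instead of being
# recomputed from the parallelize step.
print(integer_rdd.count())

# Drop it from the cache once it is no longer needed.
integer_rdd.unpersist()
```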
Info
Channel: Level Up
Views: 137,533
Rating: 4.922987 out of 5
Keywords: Apache Spark, Apache Spark Tutorial, spark tutorial, spark, spark python, python spark
Id: IQfG0faDrzE
Length: 93min 48sec (5628 seconds)
Published: Tue Jun 05 2018