PySpark Full Course [2024] | Learn PySpark | PySpark Tutorial | Edureka

Captions
PySpark is a Python API for Apache Spark, an open-source distributed computing framework for big data processing. PySpark provides a simple and efficient way for developers to perform complex data processing and analysis tasks using Spark's powerful engine. Hi everyone, welcome to this PySpark full course. We have an exciting agenda lined up for you, but before we get started, if you like our videos then please subscribe to the Edureka YouTube channel and hit the bell icon to stay updated with all the latest trending technologies, and if you are interested in our PySpark certification training, check out the link given in the description box. Now, without any delay, let us go through the agenda. First we will start by understanding what Apache Spark is. Next we will explore the Apache Spark architecture, followed by some strategies for how to become a Spark developer. We will then see the introduction to Apache Spark with Python, and to get some practical experience we will start by installing PySpark, then move on to PySpark RDDs, and also learn about PySpark DataFrames. After that we will dive into some popular PySpark programming topics, covering PySpark SQL and PySpark Streaming. Up next we will look into PySpark MLlib, then PySpark training, and finally we will end the session with Spark interview questions and answers. By the end of this full course you will have had plenty of opportunities for hands-on practice, and you will have a solid understanding of PySpark and be well prepared to work with Spark in a professional setting. So let us get started with our first topic: what is Apache Spark?

Spark is an open-source, scalable, massively parallel, in-memory execution environment for running analytics applications. You can think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Coming to big data processing, much like MapReduce, Spark works by distributing the data across the cluster and then processing that data in parallel. The difference is that unlike MapReduce, which shuffles files around on disk, Spark works in memory, and that makes it much faster at processing data than MapReduce. It is also said to be the lightning-fast unified analytics engine for big data and machine learning. Now let's look at the interesting features of Apache Spark. Coming to speed, you can call Spark a swift processing framework, because it is 100 times faster in memory and 10 times faster on disk when compared with Hadoop; not only that, it also provides high data processing speed. Next, powerful caching: it has a simple programming layer that provides powerful caching and disk persistence capabilities, and Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. As you all know, Spark itself was designed and developed for real-time data processing, so it is an obvious fact that it offers real-time computation and low latency because of in-memory computation. Next, polyglot: Spark provides high-level APIs in Java, Scala, Python, and R, and Spark code can be written in any of these four languages; it also provides a shell in Scala and Python. These are the various features of Spark. Now let's see the various components of the Spark ecosystem. Let me first tell you about the Spark Core component.
It is the most vital component of the Spark ecosystem and is responsible for basic I/O functions, scheduling, monitoring, and so on. The entire Apache Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R, and Java. As I have already mentioned, Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. The Spark ecosystem library is composed of various components like Spark SQL, Spark Streaming, and the machine learning library. Now let me explain each of them. The Spark SQL component is used to leverage the power of declarative queries and optimized storage by executing SQL-like queries on Spark data, which is present in RDDs and other external sources. Next, the Spark Streaming component allows developers to perform batch processing and streaming of data in the same application. Coming to the machine learning library, it eases the deployment and development of scalable machine learning pipelines, covering things like summary statistics, correlations, feature extraction, transformation functions, optimization algorithms, and so on. The GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation.

Now, talking about the programming languages: Spark supports Scala, the functional programming language in which Spark itself is written, so Spark supports Scala as an interface. Spark also supports a Python interface: you can write a program in Python and execute it over Spark, and if you compare the code, Python and Scala look very similar. R is very famous for data analysis and machine learning, so Spark has also added support for R, and it also supports Java, so you can go ahead and write your code in Java and execute it over Spark. Next, the data can be stored in HDFS, the local file system, or Amazon S3 in the cloud, and Spark supports SQL and NoSQL databases as well. So this is all about the various components of the Spark ecosystem.

Now let's see what's next. When it comes to iterative distributed computing, that is, processing data over multiple jobs and computations, we need to reuse or share data among multiple jobs. In earlier frameworks like Hadoop there were problems while dealing with multiple operations or jobs: we needed to store the data in some intermediate, stable, distributed storage such as HDFS, and multiple I/O operations made the overall computation of jobs much slower, and replication and serialization made the process slower still. Our goal here is to reduce the number of I/O operations to HDFS, and that can be achieved only through in-memory data sharing. In-memory data sharing is 10 to 100 times faster than network and disk sharing, and RDDs try to solve all these problems by enabling fault-tolerant, distributed, in-memory computation. So now let's understand what RDDs are. RDD stands for Resilient Distributed Dataset. RDDs are considered to be the backbone of Spark and are one of its fundamental data structures. An RDD is also known as a schema-less structure that can handle both structured and unstructured data. In Spark, anything you do is around RDDs: when you read data in Spark, it is read into an RDD; when you transform the data, you perform transformations on an old RDD and create a new one; and finally you perform some actions on the RDD and store the data present in the RDD to persistent storage. A Resilient Distributed Dataset is an immutable, distributed collection of objects.
Your objects can be anything like strings, lines, rows, objects, collections, and so on. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Talking about the distributed aspect, each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Because of this you can perform transformations or actions on the complete data in parallel, and you don't have to worry about the distribution because Spark takes care of that. RDDs are highly resilient, that is, they are able to recover quickly from issues, as the same data chunks are replicated across multiple executor nodes; even if one executor fails, another will still process the data. This allows you to perform functional calculations against your dataset very quickly by harnessing the power of multiple nodes.

So this is all about RDDs; now let's have a look at some of their important features. RDDs have a provision for in-memory computation, and all transformations are lazy, that is, results are not computed right away until an action is applied, so RDDs support in-memory computation and lazy evaluation as well. Next, fault tolerance: RDDs track data lineage information to rebuild lost data automatically, and this is how they provide fault tolerance to the system. Next, immutability: data can be created or retrieved anytime, but once defined its value cannot be changed, and that is why I said RDDs are immutable. Next, partitioning: partitions are the fundamental unit of parallelism in a Spark RDD, and all the data chunks are divided into partitions. Next, persistence: users can reuse RDDs and choose a storage strategy for them. Finally, coarse-grained operations apply to all elements in the dataset through map, filter, or group-by operations. So these are the various features of RDDs.

Now let's see the ways to create an RDD. There are three ways: you can create an RDD from a parallelized collection, from an existing RDD, or from external data sources like HDFS, Amazon S3, HBase, and so on. Now let me show you how to create RDDs. I'll open my terminal and first check whether my daemons are running or not. Here I can see that both the Hadoop and Spark daemons are running, so first let's start the Spark shell; it will take a bit of time to start. Now the Spark shell has started, I can see the Spark version as 2.1.1, and we have a Scala shell here. I will show you how to create RDDs in three different ways using the Scala language. First, let's see how to create an RDD from a parallelized collection. sc.parallelize is the method I use here; it is the SparkContext's parallelize method for creating a parallelized collection. So I will give sc.parallelize, and here I will parallelize the numbers 1 to 100 into five different partitions and apply collect as an action to start the process. In the result you can see an array of the numbers 1 to 100. Now let me show you how the partitions appear in the web UI of Spark. The web UI port for Spark is localhost:4040. Here we have just completed one job, the sc.parallelize followed by collect, and you can see all five tasks succeeded, because we divided the job into five partitions. Let me show you the partitions: this is the DAG visualization, that is, the directed acyclic graph visualization.
Here you have applied only parallelize as a method, so you can see only one stage. You can see the RDD that has been created, and coming to the event timeline you can see the tasks that were executed across the five partitions; the different colours indicate scheduler delay, task deserialization time, shuffle read time, shuffle write time, executor computing time, and so on. Here you can see the summary metrics for the created RDD: the maximum time it took to execute the tasks across the five partitions in parallel is just 45 milliseconds. You can also see the executor ID, the host ID, the status (succeeded), the duration, the launch time, and so on. So this is one way of creating an RDD, from a parallelized collection.

Now let me show you how to create an RDD from an existing RDD. Here I'll create an array called a1 and assign the numbers 1 to 10. I got the result: an integer array of 1 to 10, and now I will parallelize this array. Sorry, I got an error; it should be sc.parallelize of a1. Okay, so I created an RDD called parallelCollection. Now I will create a new RDD from the existing one, that is, val newRDD = a1.map(...): I take a1 as a reference, map over the data, and multiply each element by two. So what should the output be? It should be 2, 4, 6, 8, up to 20, correct? Let's see how it works. Yes, we got the output, the even numbers from 2 up to 20. So that is one method of creating a new RDD from an old RDD, and I have one more method, from external file sources. What I will do here is give val test = sc.textFile and pass the HDFS file location, that is, hdfs://localhost:9000 as the path; I have a folder called example and in that I have a file called sample. So I have one more RDD created here. Now let me show you this file that I have already kept in the HDFS directory: I will browse the file system and show you the /example directory that I created. Here you can see the example directory and the sample input file, with the same path location. So this is how I can create an RDD from an external file source; in this case I have used HDFS. That is how we can create RDDs in three different ways: from parallelized collections, from external data sources, and from an existing RDD.

Now let's move further and see the various RDD operations. RDDs support two main operations, namely transformations and actions. As I've already said, RDDs are immutable, so once you create an RDD you cannot change any content in that RDD, so you might be wondering how an RDD applies those transformations. When you run any transformation, it runs that transformation on the old RDD and creates a new RDD; this is basically done for optimization reasons. Transformations are the operations which are applied on an RDD to create a new RDD, and these transformations work on the principle of lazy evaluation. So what does that mean? It means that when we call some operation on an RDD it does not execute immediately; Spark maintains a record of the operation that is being called. Since transformations are lazy in nature, we can execute the operation at any time by calling an action on the data; hence, with lazy evaluation, data is not loaded until it is necessary.
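The shell demo above is in Scala; since this course's main language is Python, here is a minimal PySpark sketch of the same three ways of creating an RDD and of lazy transformations versus actions. The SparkContext setup and the HDFS path are assumptions added for illustration, not the exact environment used in the video.

```python
from pyspark import SparkContext

# Assumed local setup for illustration; the video works inside an existing shell.
sc = SparkContext("local[*]", "rdd-creation-sketch")

# 1. From a parallelized collection (1 to 100, five partitions).
nums = sc.parallelize(range(1, 101), 5)
print(nums.collect()[:10])           # action: triggers the actual computation

# 2. From an existing RDD: transformations are lazy and return a new RDD.
doubled = nums.map(lambda x: x * 2)  # nothing runs yet
print(doubled.take(10))              # action: [2, 4, 6, ..., 20]

# 3. From an external data source (hypothetical HDFS path).
text = sc.textFile("hdfs://localhost:9000/example/sample")
print(text.count())                  # action: number of lines in the file

sc.stop()
```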
Actions, on the other hand, analyze the RDD and produce a result; a simple action is count, which counts the rows in the RDD and produces a result. So I can say that transformations produce new RDDs and actions produce results. Before moving further with the discussion, let me tell you about the three different workloads that Spark caters to: batch mode, interactive mode, and streaming mode. In batch mode we run a batch job: you write a job and then schedule it, and it works through a queue or a batch of separate jobs without manual intervention. In interactive mode you have an interactive shell where you execute commands one by one: you execute one command, check the result, and then execute the next command based on the output, and so on; it works similar to a SQL shell. The shell is the one which executes the driver program, and in shell mode you can run against the cluster; it is generally used for development work or for ad hoc queries. Then comes streaming mode, where the program is continuously running: as and when the data comes, it takes the data, applies transformations and actions to it, and produces results. These are the three different workloads that Spark caters to.

Now let's see a real-time use case; here I'm considering Yahoo as an example. So what were Yahoo's problems? Yahoo properties are highly personalized to maximize relevance; the algorithms used to provide that personalization, that is, targeted advertisements and personalized content, are highly sophisticated, and the relevance models must be updated frequently because stories, news feeds, and ads change over time. Yahoo has over 150 petabytes of data stored on a 35,000-node Hadoop cluster, which should be accessed efficiently to avoid the latency caused by data movement and to gain insights from the data in a cost-effective manner. To overcome these problems, Yahoo looked to Spark to improve the performance of its iterative model training. The machine learning algorithm for news personalization required 15,000 lines of C++ code; on the other hand, the same machine learning algorithm took just 120 lines of Scala code, and that is the advantage of Spark. This algorithm was ready for production use within just 30 minutes of training on 100 million datasets. Spark has rich APIs available in several programming languages, has resilient in-memory storage options, and is compatible with Hadoop through YARN and the Spark-on-YARN project. Yahoo uses Apache Spark for personalizing its news web pages and for targeted advertising; not only that, it also uses machine learning algorithms running on Apache Spark to find out what kind of news users are interested in reading, and for categorizing news stories to find out which users would be interested in reading each category of news. Spark runs over Hadoop YARN to use existing data and clusters, the extensive API of Spark and its machine learning library ease the development of machine learning algorithms, and Spark reduces the latency of model training via in-memory RDDs. So this is how Spark has helped Yahoo improve performance and achieve its targets.

Now let's look at the Spark architecture. In your master node you have the driver program, which drives your application, so the code that you are writing behaves as the driver program, or if you are using the interactive shell, the shell acts as the driver program. Inside the driver program, the first thing you do is create a SparkContext; assume that the SparkContext is a gateway to all Spark functionality.
It is similar to your database connection: any command you execute in your database goes through the database connection, and similarly, anything you do on Spark goes through the SparkContext. Now, this SparkContext works with the cluster manager to manage various jobs. The driver program and the SparkContext take care of executing the job across the cluster: a job is split into tasks, and these tasks are distributed over the worker nodes. So any time you create an RDD in the SparkContext, that RDD can be distributed across various nodes and cached there; the RDD is said to be partitioned and distributed across various nodes. The worker nodes are the slave nodes whose job is basically to execute the tasks; a task is executed on the partitioned RDDs in the worker nodes and then returns the result back to the SparkContext. The SparkContext takes the job, breaks it into tasks, and distributes them to the worker nodes; these tasks work on the partitioned RDDs, perform whatever operations you want to perform, and then collect the results and give them back to the main SparkContext. If you increase the number of workers, you can divide jobs into more partitions and execute them in parallel over multiple systems, which will be a lot faster; increasing the number of workers also increases memory, so you can cache the jobs and execute them even faster. This is all about the Spark architecture.

Now let me give you an infographic idea of that architecture. Spark follows a master-slave architecture. The client submits the Spark user application code; when the application code is submitted, the driver implicitly converts the user code, which contains transformations and actions, into a logical directed acyclic graph called a DAG. At this stage it also performs optimizations such as pipelining transformations. Then it converts the logical graph, the DAG, into a physical execution plan with many stages; after converting it into a physical execution plan, it creates physical execution units called tasks under each stage, and these tasks are bundled and sent to the cluster. Now the driver talks to the cluster manager and negotiates resources, and the cluster manager launches the needed executors. At this point the driver also sends the tasks to the executors based on data placement. When executors start, they register themselves with the driver, so that the driver has a complete view of the executors, and the executors then start executing the tasks assigned by the driver program. At any point of time while the application is running, the driver program monitors the set of executors that are running, and the driver node also schedules future tasks based on data placement. This is how the internal working takes place in the Spark architecture.

As mentioned earlier, there are three different types of workloads that Spark can cater to. First, batch mode: we run a batch job; you write the job and then schedule it, and it works through a queue or batch of separate jobs without manual intervention. Next, interactive mode: this is an interactive shell where you execute commands one by one, check the result, and then execute the next command based on the output; it works similar to a SQL shell, the shell is the one which executes the driver program, and it is generally used for development work or for ad hoc queries. Then comes streaming mode, where the program is continuously running: as and when the data comes, it takes the data, applies transformations and actions to it, and produces output results. These are the three different types of workloads that Spark actually caters to.
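As a rough illustration of those three workloads, here is what each might look like in practice. The script name, host, and port are hypothetical, and the streaming snippet is a sketch using the classic DStream API rather than anything shown in the video.

```python
# Batch mode: a job submitted and scheduled, e.g. via spark-submit (hypothetical script name):
#   spark-submit --master yarn wordcount_job.py
#
# Interactive mode: the PySpark shell, which itself acts as the driver program:
#   pyspark
#
# Streaming mode: a continuously running program, sketched here with the DStream API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

# Hypothetical socket source; counts words in each incoming batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()               # the job now runs continuously
ssc.awaitTermination()    # until stopped externally
```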
Now let's move ahead and see a simple demo; let's understand how to create a Spark application in the Spark shell using Scala. Assume that we have a text file in an HDFS directory and we are counting the number of words in that text file. Before I start, let me first check whether all my daemons are running or not; I'll type sudo jps. All my Spark and Hadoop daemons are running: I have Master and Worker as Spark daemons, and NameNode, ResourceManager, NodeManager, and so on as Hadoop daemons. The first thing I do here is run the Spark shell, which takes a bit of time to start. In the meanwhile, let me tell you that the web UI port for the Spark shell is localhost:4040. This is the web UI for Spark: if you click on Jobs right now, we have not executed anything, so there are no details here; you have jobs and stages, and once you execute jobs you will have the records of the tasks you have executed, so here you can see the status of the various jobs and tasks. Now let's check whether our Spark shell has started or not. Yes, you have the Spark version as 2.1.1 and you have a Scala shell here. Before I start the code, let's check the content present in the input text file by running this command: I'll write val test = sc.textFile and give the HDFS path location where I have stored the text file; sample is the name of the text file. Now let me give test.collect so that it collects and displays the data present in the text file. In my text file I have the words hadoop, research, analyst, data, science, and science; this is my input data. Now let me apply the transformations and actions: I'll give val map = sc.textFile, specify my input path location, apply the flatMap transformation to split the data that is separated by spaces, and then map each word to the pair (word, 1).
Now this will be executed. Next, let me apply the step that starts counting; remember that until an action is applied, Spark will not start the execution process. Here I have applied reduceByKey to count the number of words in the text file. Now we are done with the transformations, so the next step is to specify the output location to store the output file, and saving is the action that actually triggers execution: I'll give counts.saveAsTextFile and then specify the location for my output file; I'll store it in the same location where I have my input file and specify my output file name as output9. I forgot to give double quotes; now I will run this. It's completed, so now let's see the output. I will open my Hadoop web UI by giving localhost:50070 and browse the file system to check the output. As I said, I have example as the directory I created, and in that I have output9 as my output, and two part files have been created; let's check each of them one by one. We have the count for data as 1, analyst as 1, and science as 2; this is the first part file. Now let me open the second part file, where you have the count for hadoop as 1 and research as 1. So hadoop is 1, research is 1, analyst is 1, data is 1, and science is 2. You might be thinking that "data science" is one word, but in the program code we have asked it to count words separated by a space, and that is why the count for science is 2. I hope you got an idea of how word count works.

Similarly, I will now parallelize the numbers 1 to 100 and divide the job into five partitions to show you how tasks are partitioned: I write sc.parallelize with the numbers 1 to 100, divide them into five partitions, and apply the collect action to collect the numbers and start the execution, and it displays an array of the numbers 1 to 100. Now let me explain the job stages, partitions, event timeline, DAG representation, and everything else. Let me go to the web UI of Spark and click on Jobs; these are the jobs that have been submitted. Coming to the word count example, this is the DAG visualization: first the text file is read, then the flatMap transformation is applied, then the words are mapped to (word, 1) pairs, then reduceByKey is applied, and then the output is saved with saveAsTextFile. This is the entire DAG visualization of the steps we covered in the program. Here it shows the completed stages (two stages) and the duration (2 seconds), and if you click on the event timeline it just shows the executor that was added; in this case you cannot see any partitions because the job was not split into multiple partitions. This is how you can see the event timeline and the DAG visualization. You can also see the stage IDs, descriptions, the submission time (I have just submitted it now), the duration it took to execute the task, the output bytes, the shuffle read, the shuffle write, and much more. A PySpark version of this word count job is sketched below.
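The demo above uses the Scala shell; for reference, here is a minimal PySpark sketch of the same word count job. The HDFS paths are placeholders chosen for illustration, not the exact paths used in the video.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# Placeholder HDFS paths; adjust to your own cluster layout.
lines = sc.textFile("hdfs://localhost:9000/example/sample")

counts = (lines
          .flatMap(lambda line: line.split(" "))   # split each line into words
          .map(lambda word: (word, 1))             # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))        # sum the counts per word

counts.saveAsTextFile("hdfs://localhost:9000/example/output9")  # action: triggers execution
print(counts.collect())

sc.stop()
```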
Now, to show you the partitions: in the other job we just applied sc.parallelize, so it shows a single stage, and the succeeded tasks show as 5/5, that is, the job was divided into five partitions and all five tasks executed successfully. Here you can see the five different tasks executed in parallel; depending on the colours, it shows the scheduler delay, the shuffle read time, executor computing time, result serialization time, getting result time, and more. You can see that the maximum duration it took to execute the five tasks in parallel at the same time is one millisecond, so in memory Spark has much faster computation. You can see the IDs of all five tasks, all successful, the locality level, the executor and host IP, the launch time, the duration each took, everything; you can also see that we created an RDD and parallelized it. Similarly, for the word count example, you can see the RDD that was created, the operations applied to execute the job, and the duration they took; even here it is just one millisecond to execute the entire word count example, and you can see the IDs, locality level, and executor ID. In this case we executed the job in two stages, so it shows just those two stages. This is all about how the web UI looks and the features and information you can see in the Spark web UI after executing a program in the Scala shell. In this program, we first gave the path to the input location and checked the data present in the input file, then we applied the flatMap transformation, created an RDD, applied the action to start the execution, and saved the output file to this location. I hope you got a clear idea of how to execute a word count example and check the various features in the Spark web UI, like partitions, DAG visualizations, and everything else.

Now, here are a few reasons why Spark is considered to be the most powerful big data tool in the current era. Firstly, its ability to integrate with Hadoop: Spark can be integrated well with Hadoop, and that's a great advantage for those who are familiar with the latter. Technically a standalone project, Spark has been designed to run on the Hadoop Distributed File System, or HDFS; it can work straight away with MapR, it can run on HDFS inside MapReduce, and, deployed on YARN, it can even run on the same cluster alongside MapReduce jobs. Following the first reason, the second reason says that it meets global standards: according to technology forecasts, Spark is the future of worldwide big data processing, and the standards of big data analytics are rising immensely with Spark, driven by high-speed data processing and real-time results. By learning Spark now, one can meet the global standards and ensure compatibility between the next generation of Spark applications and distributions by being a part of the Spark developer community. If you love technology, contributing to a growing technology in its growing stage can give a boost to your career; after this you can stay up to date with the latest advancements that take place in Spark and be among the first to build the next generation of big data applications. The third reason says that it is much faster than MapReduce: because Spark is an in-memory data processing framework, it is all set to take over all the primary processing for Hadoop workloads in the future. Being way faster and easier to program than MapReduce, Spark is now among the top-level Apache projects and has acquired a large community of users
as well as contributors. The CTO of Databricks, one of the brains behind the Apache Spark project, puts forward Spark as a multi-phase query tool that could help democratize the use of big data; he has also projected the possible end of the MapReduce era with the growth of Apache Spark. Following the third reason, we have the fourth reason, which says Spark is capable of performing in a production environment. The number of companies that are using Spark, or are planning to do the same, has exploded over the last year; there is a massive surge in the popularity of Spark, the reason being its matured open-source components and an expanding community of users. The reasons why Spark has become one of the most popular projects in big data are its ingrained high-performance tools handling distinct problems and workloads, its swift and simple programming interface, and high-end programming languages like Scala, Java, and Python. There are several reasons why enterprises are increasingly adopting Spark, ranging from speed, efficiency, and ease of use, to a single integrated system for all data pipelines, and many more. Spark, being the most active big data project, has been deployed in production by all major Hadoop as well as non-Hadoop vendors across multiple sectors, including financial services, retail, media houses, telecommunications, and the public sector. Now, the last important reason for Spark being so powerful is the rising demand for Spark developers. Spark is brand new and yet already spread throughout the big data market; the use of Spark is increasing at a very fast pace among many top-notch companies like NASA, Yahoo, Adobe, and many more. Apart from those belonging to the Spark community, there is only a handful of professionals who have learned Spark and can work with it, and this in turn has created a soaring demand for Spark developers. In such a scenario, learning Spark can give you a steep competitive edge, and by learning Spark at this point in time you can demonstrate recognized validation for your expertise. This is what John Tripier, Alliances and Ecosystem Lead at Databricks, has to say: "The adoption of Apache Spark by businesses large and small is growing at an incredible rate across a wide range of industries, and the demand for developers with certified expertise is quickly following suit." So these were the few important reasons why Apache Spark is considered to be the most powerful tool in the current IT industry.

Let's move ahead and understand the roadmap to become an Apache Spark developer. There is always a thin line between becoming a certified Apache Spark developer and being an actual Spark developer capable enough to perform in real-time applications. So how do we become a certified Apache Spark developer who is capable of performing in real time? The step-by-step approach is given below. To become an expert-level Spark developer you need to follow the right path and get expert-level guidance from certified real-time industry experts. For a beginner, it is the best time to take up a training and certification exam. Once the certification training has begun, you should start up your own projects to understand the working terminology of Apache Spark. Spark's major building blocks are the RDDs, or Resilient Distributed Datasets, and DataFrames; Spark also has the capability to integrate with high-performance programming languages like Python, Scala, and Java. PySpark RDDs are the best example of the combination of Python and Apache Spark.
You can also understand how to integrate Java with Apache Spark through an amazing article, the Spark Java Tutorial from Edureka, which I have linked in the description box below. Once you have a better grip on the major building blocks of Spark, which are the RDDs and DataFrames, you can move ahead into learning some of the major components of Apache Spark, which are mentioned below: Spark SQL, Spark MLlib, Spark GraphX, SparkR, Spark Streaming, and a lot more. Once you get the required training and certification, it's time for you to take the most important one, the CCA 175 certification. You can begin by solving some sample CCA 175 Spark certification examination papers, which I have linked in the description box below, and once you get a proper idea and the confidence, you can register for the CCA 175 examination and excel with your true Spark and Hadoop developer certification. So this is the roadmap to become a true and certified Apache Spark developer.

Now that we have discussed the roadmap, let us move ahead and discuss the Apache Spark developer's salary. The Apache Spark developer is one of the most highly decorated professionals, with handsome salary packages compared to others. We will now discuss the salary trends of Apache Spark developers in different nations. First, India: in India, the average salary offered to an entry-level Spark developer is between 6 lakhs and 10 lakhs per annum, while for an experienced Spark developer the salary trends are between 25 lakhs and 40 lakhs per annum. Next, the United States of America: in the United States, the salary offered to a beginner-level Spark developer is 75,000 to 100,000 US dollars per annum; similarly, for an experienced Spark developer, the salary trends lie between 145,000 and 175,000 dollars per annum.

Now with this, let us move ahead and discuss the skills of a Spark developer. The skills required to become an excellent Spark developer are: being capable of loading data from different platforms into the Hadoop platform using various ETL tools; deciding on an effective file format for a specific task based on business requirements; cleaning data through streaming APIs or user-defined functions; effectively scheduling Hadoop jobs; hand-holding with Hive and HBase for schema operations; having the capability to work with Hive tables and to assign schemas; deploying HBase clusters and continuously managing them; executing Pig and Hive scripts to perform various joins on datasets; applying different HDFS formats and structures to speed up analytics; maintaining the privacy and security of Hadoop clusters; fine-tuning Hadoop applications; troubleshooting and debugging any Hadoop ecosystem at runtime; and finally, installing, configuring, and maintaining an enterprise Hadoop environment if required. Now we shall move ahead and understand the Apache Spark developer's roles and responsibilities. The roles and responsibilities of a Spark developer are: to be capable of writing executable code for analytics services and Spark components; to have knowledge of high-performance programming languages like Java, Python, and Scala; to be well versed with related technologies like Apache Kafka, Storm, Hadoop, and ZooKeeper; to be responsible for system analysis, including design, coding, unit testing, and other software development life cycle activities; and to gather user requirements, convert them into strong technical tasks, and provide economical estimates for the same.
You should also be a team player with global standards, so as to understand project delivery risks, ensure the quality of technical analysis, have expertise in solving issues, and review code and use cases to ensure they meet the requirements. So these are the few roles and responsibilities of a Spark developer. Now let us move ahead and look at the companies that are using Apache Spark. Apache Spark is one of the most widespread technologies; it has changed the face of many IT industries and helped them achieve their current accomplishments. Let us now look at some of the tech giants and major players in the IT industry that are among the users of Spark: a few of these companies are Oracle, Dell, Yahoo, CGI, Facebook, Cognizant, Capgemini, Amazon, IBM, LinkedIn, Accenture, and many more.

Now, moving on to Apache Spark with Python: let me first brief you about the PySpark ecosystem. As you can see from the diagram, the Spark ecosystem is composed of various components like Spark SQL, Spark Streaming, MLlib, GraphX, and the core API component. The Spark SQL component is used to leverage the power of declarative queries and optimized storage by executing SQL-like queries on Spark data, which is present in RDDs and other external sources. The Spark Streaming component allows developers to perform batch processing and streaming of data with ease in the same application. The machine learning library eases the development and deployment of scalable machine learning pipelines. The GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformations. And finally, the Spark Core component: it is the most vital component of the Spark ecosystem, responsible for basic input/output functions, scheduling, and monitoring; the entire Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R, and Java. In today's session I will specifically discuss the Spark API in the Python programming language, which is more popularly known as PySpark.

Now you might be wondering, why PySpark? Well, to get a better insight, let me give you a brief introduction to PySpark. As we already know, PySpark is the collaboration of two powerful technologies: Spark, which is an open-source cluster computing framework built around speed, ease of use, and streaming analytics, and of course Python, which is a general-purpose, high-level programming language that provides a wide range of libraries and is majorly used for machine learning and real-time analytics. This gives us PySpark, a Python API for Spark that lets you harness the simplicity of Python and the power of Apache Spark in order to tame big data. PySpark also lets you use RDDs and comes with the default integration of the Py4J library; we'll learn about RDDs later in this video. Now that you know what PySpark is, let's see the advantages of using Spark with Python. As we all know, Python itself is very simple and easy, so when Spark is used from Python it makes PySpark quite easy to learn and use. Moreover, Python is a dynamically typed language, which means RDDs can hold objects of multiple data types. Not only this, it also makes the API simple and comprehensive, and talking about readability of code, maintenance, and familiarity, the Python API for Apache Spark is far better than other programming languages. Python also provides various options for visualization, which is not possible using Scala or Java, and moreover, you can conveniently call R directly from Python.
On top of this, Python comes with a wide range of libraries like NumPy, Pandas, scikit-learn, Seaborn, and Matplotlib, and these libraries aid in data analysis and also provide mature, time-tested statistics. With all these features, you can effortlessly program in PySpark, and in case you get stuck somewhere or have a doubt, there is a huge and very active PySpark community out there whom you can reach out to with your query.

Now I will make good use of this opportunity to show you how to install PySpark on your system. Here I am using a Red Hat-based CentOS system; the same steps can be applied to other Linux systems as well. In order to install PySpark, first make sure that you have Hadoop installed in your system; if you want to know more about how to install Hadoop, please check out our Hadoop playlist on YouTube, or you can check out our blog on the Edureka website. First of all, you need to go to the official Apache Spark website, which is spark.apache.org, and in the download section you can download the latest Spark release, which supports the latest version of Hadoop, or Hadoop version 2.7 or above. Once you have downloaded it, all you need to do is extract it, or rather say untar the file contents, and after that you need to put the path where Spark is installed into the .bashrc file. You also need to install pip and Jupyter Notebook using the pip command, and make sure that the version of pip is 10 or above. As you can see here, this is what our .bashrc file looks like: we have put in the paths for Hadoop and Spark, as well as the PySpark driver Python variable set to Jupyter Notebook; what that does is that the moment you run the PySpark shell, it automatically opens a Jupyter notebook for you. I find the Jupyter notebook much easier to work with than the shell; it's a personal choice.

Now that we are done with the installation part, let's dive deeper into PySpark and learn a few of its fundamentals, which you need to know in order to work with PySpark. This timeline shows the various topics which we will be covering under the PySpark fundamentals, so let's start off with the very first topic on our list, that is, the SparkContext. The SparkContext is the heart of any Spark application: it sets up internal services and establishes a connection to a Spark execution environment. Through a SparkContext object you can create RDDs, accumulators, and broadcast variables, access Spark services, run jobs, and much more. The SparkContext allows the Spark driver application to access the cluster through a resource manager, which can be YARN or Spark's own cluster manager; the driver program then runs the operations inside the executors on the worker nodes, and the SparkContext uses Py4J to launch a JVM, which in turn creates a JavaSparkContext. There are various parameters which can be used with the SparkContext object, like the master, the app name, spark home, the pyFiles, the environment, the batch size, the serializer, the configuration, the gateway, and much more; among these parameters, the master and app name are the most commonly used. To give you a basic insight into how a Spark program works, I have listed its basic life cycle phases: the typical life cycle of a Spark program includes creating RDDs from external data sources or parallelizing a collection in your driver program, then lazily transforming the base RDDs into new RDDs using transformations, then caching a few of those RDDs for future reuse, and finally performing actions to execute parallel computation and produce the results.
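As a quick illustration of those SparkContext parameters and the life cycle just described, here is a minimal sketch; the master URL and app name are arbitrary values chosen for the example, not anything prescribed by the video.

```python
from pyspark import SparkConf, SparkContext

# master and appName are the two most commonly used parameters.
conf = (SparkConf()
        .setMaster("local[2]")               # arbitrary choice: two local cores
        .setAppName("pyspark-fundamentals"))

sc = SparkContext(conf=conf)

# Life cycle: create an RDD, transform it lazily, cache it, then run actions.
rdd = sc.parallelize(range(10))
squares = rdd.map(lambda x: x * x).cache()   # nothing computed yet
print(squares.count())                       # action: 10
print(squares.sum())                         # reuses the cached RDD: 285

sc.stop()
```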
The next topic on our list is the RDD, and I'm sure people who have already worked with Spark are familiar with this term, but for people who are new to it, let me explain. RDD stands for Resilient Distributed Dataset, and it is considered to be the building block of any Spark application. The reason behind this is that these elements run and operate on multiple nodes to do parallel processing on a cluster, and once you create an RDD it becomes immutable; by immutable I mean that it is an object whose state cannot be modified after it is created, but we can transform its values by applying certain transformations. RDDs have good fault tolerance and can automatically recover from almost any failure, which is an added advantage. Now, to achieve a certain task, multiple operations can be applied on these RDDs, and they are categorized in two ways: the first are the transformations and the second are the actions. Transformations are the operations which are applied on an RDD to create a new RDD; these transformations work on the principle of lazy evaluation, meaning that when we call some operation on an RDD it does not execute immediately. Spark maintains a record of the operations being called through a directed acyclic graph, also known as a DAG, and since the transformations are lazy, the data is not loaded until it is necessary; the moment we call an action, all the computations are performed in parallel to give you the desired output. A few important transformations are map, flatMap, filter, distinct, reduceByKey, mapPartitions, and sortBy. Actions are the operations which are applied on an RDD to instruct Apache Spark to apply the computation and pass the result back to the driver; a few of these actions are collect, collectAsMap, reduce, take, and first. Now let me implement a few of these for your better understanding.

First of all, let me show you the .bashrc file which I was talking about. In the .bashrc file we provide the paths for all the frameworks which we have installed on the system; for example, you can see here that we have installed Hadoop. The moment I install and unzip, or rather say untar, a framework, I shift it to one particular location, which is /usr, and inside that we have lib, and inside that I have installed Hadoop and also Spark. Now, as you can see here, we have two lines (I'll highlight them for you): the PySpark driver Python variable set to jupyter, and its options variable set to notebook. What that does is that the moment I start Spark, it automatically redirects me to the Jupyter notebook. Let me just rename this notebook as "rdd tutorial", and let's get started. Here, to load any file into an RDD, suppose I'm loading a text file, you need to use sc, which is the SparkContext: sc.textFile, and you need to provide the path of the data which you are going to load. One thing to keep in mind is that the default path the notebook takes is the HDFS path, so in order to use the local file system you need to mention the file: scheme with two forward slashes. Now, once our sample data is inside the RDD, to have a look at it we need to invoke an action, so let's go ahead and take a look at the first five elements of this particular RDD.
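Roughly, the notebook cell just described might look like the following sketch; the local file path is a placeholder, not the exact file used in the video, and `sc` is the SparkContext that the PySpark shell or notebook already provides.

```python
# The PySpark shell / notebook already provides `sc` as the SparkContext.
# Placeholder local path; note the file:// scheme, since HDFS is the default.
sample = sc.textFile("file:///home/edureka/sample_blockchain.txt")

# take() is an action, so this is the point where the file is actually read.
for line in sample.take(5):
    print(line)
```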
The sample data I have taken here is about blockchain, and as you can see we have the first five elements here. Suppose I need to convert all the data into lowercase and split it word by word; for that I'll create a function, and to that function I'll pass this RDD. As you can see, I'm creating rdd1, a new RDD, using the map transformation and passing in the function I just created to lowercase and split the data. If we have a look at the output of rdd1, all the words are in lowercase and all of them are separated by spaces. There is another transformation, known as flatMap, which gives you a flattened output, and I am passing the same function which I created earlier; let's go ahead and have a look at the output for this one. As you can see, we got the first five elements, which are the same ones we got here: "the", "contracts", "transactions", "and", "records". Just one thing to keep in mind: flatMap is a transformation, whereas take is an action. Now, the contents of the sample data contain stop words, so if I want to remove all the stop words, all I need to do is create a list of stop words, in which I have mentioned, as you can see, "a", "all", "the", "as", and "is". These are not all the stop words; I've chosen only a few of them just to show you what exactly the output will be. Here we use the filter transformation with the help of a lambda function, in which we specify x not in stop_words, and we create another RDD, rdd3, which takes its input from rdd2. Let's go ahead and see whether "and" and "the" are removed or not: we get "contracts", "transactions", "records", and if you look at the output, "and" and "the" are no longer in this list. Now suppose I want to group the data according to the first three characters of any element; for that I'll use groupBy, with a lambda function again. Let's have a look at the output: you can see we have "edg" grouping words like "edges", so the first three letters of the grouped words are the same. Similarly we can group by the first two letters; let me just change it to two, and you can see we have "gu" grouping words like "guide". Now, these are the basic transformations and actions, but suppose I want to find out the sum of the first ten thousand numbers: all I need to do is initialize another RDD, num_rdd, using sc.parallelize over the range one to ten thousand, and use the reduce action to see the output; you can see here we have the sum of the numbers ranging from one to ten thousand. So this was all about RDDs.

The next topic on our list is broadcasts and accumulators. In Spark we perform parallel processing with the help of shared variables: when the driver sends a task to the executors on the cluster, a copy of the shared variable is sent to each node of the cluster, thus maintaining high availability and fault tolerance. Apache Spark supports two types of shared variables: one of them is the broadcast variable and the other one is the accumulator. Broadcast variables are used to save a copy of the data on all the nodes in a cluster, whereas the accumulator is a variable that is used for aggregating the incoming information via different associative and commutative operations.
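Here is a short sketch of how those two shared variables are typically used in PySpark; the lookup table and the bad-record counting scenario are invented for illustration, and `sc` is again the notebook's existing SparkContext.

```python
# Broadcast: ship a read-only lookup table to every executor once.
country_codes = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: aggregate information from the executors back to the driver.
bad_records = sc.accumulator(0)

def resolve(code):
    # .value reads the broadcast copy local to each executor.
    name = country_codes.value.get(code)
    if name is None:
        bad_records.add(1)   # commutative, associative update
    return name

codes = sc.parallelize(["IN", "US", "XX", "IN"])
print(codes.map(resolve).collect())   # ['India', 'United States', None, 'India']
print(bad_records.value)              # 1, read on the driver after an action has run
```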
Now, moving on to our next topic, which is the Spark configuration: the SparkConf class provides a set of configurations and parameters that are needed to execute a Spark application on the local system or on a cluster. When you use a SparkConf object to set values for these parameters, they automatically take priority over the system properties. This class contains various getter and setter methods, some of which are: the set method, which is used to set a configuration property; setMaster, which is used for setting the master URL; setAppName, which is used to set the application name; the get method, which retrieves the configuration value of a key; and finally setSparkHome, which is used for setting the Spark installation path on worker nodes. Coming to the next topic on our list, which is SparkFiles: the SparkFiles class contains only class methods, so the user cannot create a SparkFiles instance. It helps in resolving the paths of files that are added using the SparkContext's addFile method. The SparkFiles class contains two class methods, get and getRootDirectory: get is used to retrieve the absolute path of a file added through SparkContext.addFile, and getRootDirectory is used to retrieve the root directory that contains the files added through SparkContext.addFile.

These are small topics, and the next topic that we will be covering on our list is DataFrames. A DataFrame in Apache Spark is a distributed collection of rows under named columns, which is similar to relational database tables or Excel sheets. It also shares common attributes with RDDs. A few characteristics of DataFrames: they are immutable in nature, that is, just like RDDs, you can create a DataFrame but you cannot change it; they allow lazy evaluation, that is, a task is not executed unless and until an action is triggered; and moreover, DataFrames are distributed in nature and are designed for processing large collections of structured or semi-structured data. They can be created using different data formats, for example by loading data from source files such as JSON or CSV, or from an existing RDD; you can use databases like Hive and Cassandra, you can use Parquet files, you can use CSV or XML files: there are many sources through which you can create a particular DataFrame. Now let me show you how to create a DataFrame in PySpark and perform various actions and transformations on it; let's continue in the same notebook which we have here. Here we have taken the NYC flights data, and I'm creating a DataFrame called nyc_flights_df. To load the data we are using the spark.read.csv method, and I need to provide the path, which is a local path (by default it takes the HDFS path, same as with RDDs). One thing to note here is that I've provided two extra parameters, inferSchema and header; if we do not provide these as true, or we skip them, then if your dataset contains the names of the columns in the first row it will take those as data as well, and it will not infer the schema. Once we have loaded the data into our DataFrame, we use the show action to have a look at the output, and as you can see, it gives us the top 20 rows of the dataset: we have the year, month, day, departure time, departure delay, arrival time, arrival delay, and many more attributes.
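A sketch of roughly what that load might look like; the SparkSession creation and the CSV path are assumptions added so the example is self-contained, since in the notebook a session is typically already available.

```python
from pyspark.sql import SparkSession

# In the notebook a session may already exist; shown here for completeness.
spark = SparkSession.builder.appName("nyc-flights-sketch").getOrCreate()

# Placeholder local path; header/inferSchema keep column names and types intact.
nyc_flights_df = spark.read.csv(
    "file:///home/edureka/nyc_flights.csv",
    header=True,
    inferSchema=True,
)

nyc_flights_df.show(5)   # show() is an action; by default it prints the top 20 rows
```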
Now, to print the schema of this DataFrame, you need the printSchema method. Let's have a look at the schema: as you can see, we have year as integer, month as integer (almost half of the columns are integers), the carrier as string, the tail number as string, the origin as string, the destination as string, and so on. Suppose I want to know how many records there are in my DataFrame: you need the count function for this, and it provides the result; as you can see, we have about 3.3 million records here, 3,336,776 to be exact. Suppose I want to look at just three columns, the flight name, the origin, and the destination: we need to use the select operation, and as you can see, we get the top 20 rows. That was a select query on this DataFrame, but if I want to check the summary of a particular column, suppose the lowest or the highest value in the distance column, I need to use the describe function. I'll show you what the summary looks like: for distance, the count is the total number of rows, and we also have the mean, the standard deviation, the minimum value, which is 17, and the maximum value, which is 4983. This gives you a summary of any particular column you want. Now that we know the minimum distance is 17, let's go ahead and filter our data using the filter function where the distance is 17; you can see we have one record, from the year 2013, in which the distance is 17. Similarly, suppose I want to look at the flights which originate from EWR: we use the filter function here as well. There is another clause, the where clause, which is also used for filtering. Suppose I want to filter the flight data to see the flights that took off on the second of any month: here, instead of filter, we can also use the where clause, which gives us the same output. We can also pass multiple conditions: suppose I want the day of the flight to be the seventh, the origin to be JFK, and the arrival delay to be less than zero (that is, none of the delayed flights); to look at these numbers we use the where clause and separate all the conditions using the and symbol. As you can see, in every row the day is 7, the origin is JFK, and the arrival delay is less than zero.

Now, these were the basic transformations and actions on a DataFrame. One thing we can also do is create a temporary table for SQL queries: if someone is not comfortable with all these transformations and actions and would rather use SQL queries on the data, they can use registerTempTable to create a table from their DataFrame. What we'll do is convert the nyc_flights_df DataFrame into an nyc_flights table, which can be used later, and SQL queries can be performed on this table. You remember, in the beginning we used nyc_flights_df.show; now we can use select * from nyc_flights to get the same output. Suppose we want to look at the minimum air time of any flight: we use select min(air_time) from nyc_flights as the query, and we pass all SQL queries to the sqlContext.sql function.
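Pulling the DataFrame operations and the SQL route together, here is a compact sketch continuing from the loading sketch above (it reuses `nyc_flights_df` and `spark`). The column names are assumed from a typical NYC-flights dataset, and the video's older registerTempTable/sqlContext.sql calls are replaced here by the equivalent createOrReplaceTempView and spark.sql.

```python
from pyspark.sql.functions import col

nyc_flights_df.printSchema()
print(nyc_flights_df.count())

# Column selection and a per-column summary.
nyc_flights_df.select("carrier", "origin", "dest").show(5)
nyc_flights_df.describe("distance").show()

# filter() and where() are interchangeable; combine conditions with &.
nyc_flights_df.where(
    (col("day") == 7) & (col("origin") == "JFK") & (col("arr_delay") < 0)
).show(5)

# Register the DataFrame as a temporary view and query it with SQL instead.
nyc_flights_df.createOrReplaceTempView("nyc_flights")
spark.sql("SELECT MIN(air_time) AS min_air_time FROM nyc_flights").show()
```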
We pass all the SQL queries into the sqlContext.sql function, and as you can see here the minimum air time is 20. Now, to have a look at the records in which the air time is the minimum of 20, we can use nested SQL queries: suppose I want to check which flights have the minimum air time of 20; that cannot be done in a simple SQL query, we need a nested query for that. So we select * from nyc_flights where the air time is in, and inside that we have another query, which is select min(air_time) from nyc_flights. Let's see if this works or not; as you can see here, we have two flights which have the minimum air time of 20. So guys, this is it for data frames, so let's get back to our presentation and have a look at the list which we were following. We completed data frames; next we have storage levels. Now StorageLevel in PySpark is a class which helps in deciding how the RDDs should be stored; based on this, RDDs are either stored in disk, or in memory, or in both. The StorageLevel class also decides whether the RDDs should be serialized and whether their partitions should be replicated. Now the final topic for today's list is MLlib. MLlib is the machine learning API provided by Spark, which is also available in Python, and this library is heavily used in Python for machine learning as well as real-time streaming analytics. The various algorithm families supported by this library are as follows. First of all we have spark.mllib collaborative filtering: spark.mllib supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries, and to learn these latent factors spark.mllib uses the alternating least squares (ALS) algorithm. Next we have mllib.clustering: clustering is an unsupervised learning problem in which we try to group subsets of entities with one another on the basis of some notion of similarity. Next we have frequent pattern mining, FPM: frequent pattern mining is mining frequent items, item sets, subsequences or other substructures, which is usually among the first steps in analyzing a large-scale data set, and this has been an active research topic in data mining for years. We have linear algebra: this module supports PySpark MLlib utilities for linear algebra. We have collaborative filtering, and we have classification: for binary and multiclass classification as well as regression analysis, various methods are available in the spark.mllib package, and some of the most popular algorithms used in classification are naive Bayes, random forest, decision tree, SVM and so on. And finally we have linear regression: linear regression comes from the family of regression algorithms, and finding relationships and dependencies between variables is the main goal of regression. The PySpark MLlib package also covers other algorithm classes and functions.
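As a rough guide, the algorithm families named above map approximately to the following pyspark.mllib modules, and the storage levels mentioned earlier are applied with a persist call. This is a minimal sketch written against the Spark 2.x era MLlib API as an assumption, not code taken from the demo.

```python
# Rough mapping of the algorithm families above to pyspark.mllib modules.
from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS            # model-based collaborative filtering
from pyspark.mllib.clustering import KMeans             # clustering
from pyspark.mllib.fpm import FPGrowth                  # frequent pattern mining
from pyspark.mllib.linalg import Vectors                # linear algebra utilities
from pyspark.mllib.classification import NaiveBayes, SVMWithSGD
from pyspark.mllib.tree import DecisionTree, RandomForest
from pyspark.mllib.regression import LinearRegressionWithSGD, LabeledPoint

# Storage levels control how an RDD is cached: memory, disk, or both,
# serialized or not, with or without replication.
some_rdd = sc.parallelize(range(10))
some_rdd.persist(StorageLevel.MEMORY_AND_DISK)
```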
Now let's try to implement the concepts which we have learned in this PySpark tutorial session. Here we are going to build a heart disease prediction model, and we are going to predict it using a decision tree, with the help of classification as well as regression; these are all part of the MLlib library. Let's see how we can perform these types of functions and queries. First of all, what we need to do is initialize the Spark context; next we are going to read the UCI heart disease data set and clean the data. So let's import the pandas and numpy libraries here. Now let's create a data frame, heart_disease_df, and as mentioned earlier we are going to use the read_csv method here; since the file does not have a header, we have provided header as None. The original data set contains 303 rows and 14 columns. Now for the categories of diagnosis of heart disease that we are predicting, the value 0 is for less than 50 percent diameter narrowing and the value 1 is for more than 50 percent diameter narrowing. Here we are using the numpy library; these are fairly old methods which show a deprecation warning, but no issues, it will work fine. So as you can see here, we have the categories of diagnosis of heart disease that we are predicting: value 0 for less than 50 percent and value 1 for greater than 50 percent. What we did here was clear the rows which have a question mark, that is, which have missing values. Now, having a look at the data set, you can see we have 0 in many places instead of the question marks which we had earlier, and we are saving it to a txt file; you can see that after dropping the rows with any empty values we have 297 rows and 14 columns, and this is what the cleaned data set looks like. Now we are importing the MLlib regression module, because what we are going to do is create a labeled point, which is a local vector associated with a label or a response; for that we need to import mllib.regression. We take the text file which we just created, now without the missing values, and next we pass the data line by line into MLlib LabeledPoint objects, converting any -1 labels to 0. Let's have a look after parsing the training lines; okay, we have the labels 0 and 1, that's cool. Next we are going to perform classification using the decision tree, so for that we need to import pyspark.mllib.tree. Then we split the data into training and testing data; we split the data here 70 to 30, which is a standard ratio, 70 being the training data set and 30 being the testing data set. Next we train the model using the training set: we have created a training model with DecisionTree.trainClassifier, using the training data, the number of classes, the categorical features which we have given, and a maximum depth of 3, to which we are classifying. Next we evaluate the model based on the test data set and evaluate the error: here we are creating predictions, using the test data to get predictions from the model which we created, and we are also going to find the test error. As you can see here, the test error is 0.2297, and we have created a classification decision tree model whose printout shows the split conditions on the features along with the predicted values at the leaves.
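Here is a minimal sketch of the classification step described above, assuming the cleaned text file of comma-separated rows produced earlier. The file path, the parsing helper parse_point and the numClasses value of 2 are illustrative assumptions, not the exact code from the demo.

```python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

# Parse each comma-separated line into a LabeledPoint (features first, label last).
# The file path and column layout are assumptions for illustration.
def parse_point(line):
    values = [float(x) for x in line.split(",")]
    label = 0.0 if values[-1] <= 0 else 1.0    # binary label: 0 = <50% narrowing, 1 = >50%
    return LabeledPoint(label, values[:-1])

data = sc.textFile("file:///home/edureka/heart_disease_cleaned.txt").map(parse_point)

# 70/30 split into training and test sets
training_data, test_data = data.randomSplit([0.7, 0.3])

model = DecisionTree.trainClassifier(training_data,
                                     numClasses=2,
                                     categoricalFeaturesInfo={},
                                     maxDepth=3)

# Evaluate: compare predictions against the true labels and compute the test error
predictions = model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)
test_error = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test_data.count())
print("Test error = %s" % test_error)
print(model.toDebugString())
```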
So as you can see, our model is pretty good. Next we'll use regression for the same purpose, so let's perform regression using a decision tree. As you can see, we have the trained model where we are using DecisionTree.trainRegressor on the same training data we created for the decision tree classifier; we used classification before, and now we are using regression. Similarly, we are going to evaluate our model using our test data set and find the test error, which for regression is the mean squared error. So let's have a look at the mean squared error here: it is 0.168, which is good. Finally, if we have a look at the learned regression tree model, you can see we have created a regression tree model of depth 3 with 15 nodes, and here we have all the features and splits of the tree. [Music] So guys, looking at the system requirements: here I'll be explaining the minimum system requirements. The minimum RAM required is around 4 GB, but it is advised to use an 8 GB RAM system, and the minimum free disk space should be at least 25 GB. The minimum processor should be an i3 or above to have a smooth programming experience, and most of all, the system should have a 64-bit operating system; in case you are using a virtual machine in VirtualBox, it should also support a 64-bit image of the operating system. Now these are all the hardware requirements. Coming to the software requirements, we need Java 8 or above, we need Hadoop 2.7 or above as Spark runs on top of Hadoop, and we need pip with version 10 or above; pip is a package management system used to install and manage software packages written in Python, and you can use conda as well. Finally we need the Jupyter notebook; this step is optional, but the programming experience in a Jupyter notebook is far better than in the shell. So let's go ahead and see how we can install PySpark on our systems. Here I have a Windows system, and in order to install PySpark I'm using VirtualBox and I'll create a virtual machine inside it, because most of the time PySpark is used in a Linux environment, so that's what I'm going to use. To install VirtualBox, all you need to do is go to the official VirtualBox website, and in the download section you will find the latest version; you need to click on the Windows hosts or the relevant Linux distribution link, but if you already have Linux you don't need VirtualBox. So for Windows you can click on this one and install. I've already installed VirtualBox and created my VM; this VM has CentOS 7 as the base image, and CentOS is a Red Hat distribution operating system, so it works on the Linux platform. Now, firstly, we need to check whether we have Hadoop and Java installed or not, and for that we need to check the .bashrc file. The .bashrc file contains the paths to all the frameworks that are being used; for example, as you can see, we have Hadoop installed in our system, we have all the paths to Hadoop, and we have Java installed. We can also check the Hadoop version which we are running: as you can see we have Hadoop 2.7.3, and to check the Java version we need to type java -version; we have Java 8 running on the system. So now that we have Hadoop and Java installed in our system, we need to install Spark. To install Spark we need to go to the Apache official website, spark.apache.org/downloads, and there you can select which version of Spark you
want; the latest stable version here is from June 8, 2018, and it is pre-built for Apache Hadoop 2.7 and later versions. As we saw earlier we have Hadoop 2.7.3, so that's good. Now to download Apache Spark you need to click on this link, and here you'll get various mirror sites and links from where you can download the tar file. I've already downloaded it, so let me show you: as you can see here, it's a tgz file, which is a tar archive. We need to extract this file and place it in the specific location where we want it; let me just close this first. For that, first we go to Downloads, and as you can see we have spark-2.3.1-bin-hadoop2.7.tgz. We need to extract this file, so we use the command tar -xvf followed by the spark-2.3.1 file name; what it will do is extract, or rather say untar, the file in the Downloads section. Now if we look at the list of elements, we can see we have spark-2.3.1-bin-hadoop2.7 extracted, and we still have the tgz file as well. Next we need to move this to the specific location where we want our frameworks to be; what I usually do is keep all my frameworks like Hadoop, Spark, Kafka, Flume and Cassandra in my user library location, that is /usr/lib. As you can see, I have Cassandra, Flume, Hive, Maven and Storm there, and now I have copied Spark as well. Now that we have copied Spark to a specific location, we need to put its path in the .bashrc file too, so let me again open the .bashrc file. As you can see here, I've put in the path for Apache Spark; there are two parts you need to configure here, which are SPARK_HOME and the PATH. SPARK_HOME has the path to where Spark has been moved after being extracted from the tar file, and we also need to provide the path of the bin folder which sits inside the Spark folder. After we have mentioned the path in the .bashrc file, we need to type source .bashrc; what happens is that the moment we add the path of a particular framework or application to the .bashrc file it is not applied, so in order to apply it we use the command source .bashrc. Now in order to move to Spark we just use cd with a dollar sign and SPARK_HOME, and we are inside Spark. If we have a look at the elements inside Spark, we find there's a python folder; if you go inside python you can see we have all the different libraries and the setup file which are used to run PySpark, and there is a pyspark folder too, inside which we have the various libraries and programs for which Python is used. Now that we have installed Spark and mentioned its path in the .bashrc file, it's time to install the Jupyter notebook as well. To install Jupyter notebook we first need to install pip or conda; as I mentioned earlier, pip is a package management system and it's used to install and manage software packages. This is the command to install pip; make sure that the pip version is 10 or above to install the Jupyter notebook. In order to install Jupyter after we have installed pip, we use the command pip install jupyter. This will install the Jupyter notebook on our system, and after it's been installed, if we need to use the Jupyter notebook we just type jupyter notebook in our command line, and what it will do is open the Jupyter notebook for us. As you can see here, in the New section we have Python 2, and we'll use this while writing programs for PySpark. Now one thing to keep in mind is that we have the Jupyter notebook here
and we have PySpark, but Jupyter and PySpark are not communicating with each other. To make that happen we need to go to the .bashrc file again, and once we have given the path for Spark we need to provide the path for the PySpark driver, which is the Jupyter notebook, as well. One more important thing to note is that if you are using Spark with Scala you need to provide the path for Scala as well, and for using the Jupyter notebook all you need to do is put in these two lines, which set the PySpark driver python to jupyter and the driver python options to notebook. Now that we have Spark installed, let's run Spark: for that we need to go into the Spark home, and inside that we use the command ./sbin/start-all.sh. What it will do is start the master and the worker, but if you want to start the master and the worker separately you can use start-master.sh and start-slave.sh; generally I use start-all.sh as it starts both the master and the worker nodes. Okay, now to check whether Spark is running or not we use the command jps, and as you can see here we have Master and Worker running along with Hadoop's ResourceManager, NameNode, SecondaryNameNode, NodeManager, Jps and the DataNode. One more important thing is that after you have made changes to the .bashrc file, you again need to go to the command line and type source .bashrc; that will apply the path for the notebook as well as for PySpark. So as you'll see here, the moment I type pyspark and press enter, PySpark starts running and I am redirected to the Jupyter notebook. What happens is that this Jupyter notebook is communicating with the PySpark environment, so we can go to Python 2, create a new notebook, and start writing our programs here as well. But it ultimately comes down to your choice whether you want to do all the programming in the shell or continue doing it in the Jupyter notebook; personally I find the Jupyter notebook easier to work with, as you have various options here to cut, copy, insert, stop the kernel and much more. So let's see whether this notebook works or not: here I am creating an RDD, and RDDs, resilient distributed datasets, are a key concept in PySpark. As you can see, the star mark here shows that the process is running in the background, and if I have a look at the RDD which I just created, you can see the PySpark shell is working absolutely fine. As you can see here in the shell, we have the notebook app open and it shows us some messages related to the notebook; the last message is saving file at Untitled.ipynb, which is the extension for a Jupyter notebook. So this is it, guys; I hope you understood how to install PySpark on your system and what the dependencies, hardware and software requirements are. Keep in mind that the Jupyter setup is an optional step: if you want to do all the programming in the shell itself then there is no need for Jupyter, but Jupyter does add a certain level of sophistication to the programming. [Music] When it comes to iterative distributed computing, that is, processing data over multiple jobs, we need to reuse or share the data among multiple jobs. Now in earlier frameworks like Hadoop there were many problems while dealing with multi-job workloads: we needed to store the data in some intermediate, stable, distributed storage such as HDFS, and the multiple input/output operations make the overall job slower.
There were also replications and serializations, which in turn made the process even slower. Our goal here was to reduce the number of input/output operations to HDFS, and this can only be achieved through in-memory data sharing; in-memory data sharing is 10 to 100 times faster than network and disk sharing. Now RDDs try to solve all these problems by enabling fault-tolerant, distributed, in-memory computations. So let's understand what RDDs are. RDD stands for resilient distributed dataset; they are considered the backbone of Apache Spark, and as I mentioned earlier, it is one of the fundamental data structures. These are schema-less structures that can handle both structured and unstructured data. Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, they are able to recover quickly from any issues, as the same data chunks are replicated across multiple executor nodes; thus, even if one executor fails, another will still process the data. This allows you to perform functional computations against your data set very quickly by harnessing the power of multiple nodes. Now RDDs support two types of operations, namely transformations and actions. Transformations are the operations which are applied on an RDD to create a new RDD, and these transformations work on the principle of lazy evaluation. So what does lazy evaluation mean? It means that when we call some operation on an RDD, it does not execute immediately; Spark maintains a record of which operations have been called through a directed acyclic graph, known as a DAG, and since the transformations are lazy in nature, we can execute operations at any time by calling an action on the data. Hence, in lazy evaluation, data is not loaded until it is necessary, and this helps in optimizing the required computation and the recovery of lost data partitions. Actions, on the other hand, are the operations which are applied on an RDD to instruct Apache Spark to apply the computation and pass the result back to the driver. The moment an action is invoked, all the computations in the pipeline happen; this gives us the result, which is returned to the driver or stored in a distributed file system. Let's have a look at a few of the important transformations and actions: we have transformations like map, flatMap, filter, distinct, reduceByKey and mapPartitions; do not worry, I'll show you exactly how these work and what they are used for. And under actions we have collect, collectAsMap, reduce, countByKey, take, countByValue and many more. So let's have a look at some of the important features of a PySpark RDD. First of all, in-memory computation: RDDs have a provision for in-memory computation, which makes processing even faster. All the transformations are lazy, as I mentioned earlier; they do not compute the results right away. RDDs track lineage information to rebuild lost data automatically, therefore they are fault tolerant. Data can be created or retrieved anytime, and once defined its value cannot be changed; this refers to the immutability of data. Partitioning is the fundamental unit of parallelism in a PySpark RDD, and users can reuse RDDs and choose a storage strategy for them, which implies persistence. Finally, an operation applies to all the elements in the data set through map, filter or groupBy operations, which means RDDs support coarse-grained operations.
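As a small illustration of the lazy-evaluation behaviour described above, here is a minimal sketch, assuming an active SparkContext named sc; the example data is made up for illustration.

```python
# Transformations only build up the lineage (DAG); nothing is computed yet.
numbers = sc.parallelize([1, 2, 3, 4, 5])
squared = numbers.map(lambda x: x * x)        # lazy: no job runs here
evens = squared.filter(lambda x: x % 2 == 0)  # still lazy

# Only when an action is called does Spark execute the whole pipeline
# and ship the result back to the driver.
print(evens.collect())   # [4, 16]
print(squared.count())   # 5
```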
Now, there are three ways to create RDDs: one can create an RDD from a parallelized collection, it can be created from another RDD, or it can be created from external data sources like HDFS, Amazon S3, HBase or any database of that sort. So let's create some RDDs and work on them. I'm going to execute all my practicals in the Jupyter notebook, but you can execute them in the shell as well. To create an RDD from a parallelized collection we use the sc.parallelize method; sc stands for SparkContext, which can be found under the SparkSession. The SparkSession contains the Spark context, the streaming context and the SQL context; this was changed after the release of Spark 2.0, as earlier the Spark context, SQL context and streaming context were all distributed separately and had to be loaded separately. Now sc.parallelize is the Spark context's parallelize method to create a parallelized collection; this allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data. So as you can see here, I am creating a list: I have assigned Ross as 19, Joey 18, Rachel 16, Phoebe 17 and Monica 20. Now that we have created our RDD, which is my_rdd, we will use the take method to return the values to the console, which is our notebook, and take is itself an RDD action. So guys, if you remember, as I told you earlier, when an action is invoked all the computations which are lined up in the lineage graph of transformations performed on the RDD take place at once. A common approach in PySpark is to use the collect action, which returns all the values in your RDD from the Spark worker nodes to the driver; there are performance implications when working with a large amount of data, as this translates to a large volume of data being transferred from the Spark worker nodes to the driver. For small amounts of data this is perfectly fine, but as a matter of habit you should pretty much always use the take method: it returns the first n elements, where n is passed as an argument to the take action, instead of the whole data set, and it is more efficient because it first scans one partition and uses those statistics to determine the number of partitions required to return the result. So, to see the elements in my RDD, I'm going to use my_rdd.take and pass 6 as the argument, and as you can see here, this is the output of the RDD. Another way to take the input of a text file is through the sc.textFile method, and here you need to provide the absolute path of the file which you are going to use; I'm creating a new RDD here, and to have a look at it we use the take method to see the first five elements. As you can see, we have the first five elements: the first, second, third, fourth and fifth. Now we can also take a CSV file as input through sc.textFile, so I'll show you how it's done. Here we are going to use sc.textFile with the absolute path; as you can see, I am loading a FIFA players.csv, and there's another argument I have passed here, which is minPartitions. It indicates the minimum number of partitions that make up the RDD; the Spark engine can often determine the best number of partitions based on the file size, but you may want to change the number of partitions for performance reasons, and hence the ability to specify the minimum number of partitions here.
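Here is a minimal sketch of these RDD creation steps, assuming an active SparkContext sc; the file paths and the variable names my_rdd, text_rdd and fifa_rdd are illustrative assumptions.

```python
# Create an RDD from a parallelized collection of (name, age) pairs
my_rdd = sc.parallelize([("Ross", 19), ("Joey", 18), ("Rachel", 16),
                         ("Phoebe", 17), ("Monica", 20)])
print(my_rdd.take(6))          # take(n) returns at most the first n elements to the driver

# Create an RDD from a text file (absolute path on the local file system)
text_rdd = sc.textFile("file:///home/edureka/sample.txt")
print(text_rdd.take(5))

# A CSV file can be read the same way; minPartitions hints at how many
# partitions the resulting RDD should at least have.
fifa_rdd = sc.textFile("file:///home/edureka/WorldCupPlayers.csv", minPartitions=4)
print(fifa_rdd.getNumPartitions())
print(fifa_rdd.count())
```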
After that we have used the map transformation: here we use the map function to transform the data from a list of strings to a list of lists, and we are going to use a lambda function. Putting sc.textFile and the map function together allows us to read the text file and split it by the delimiter to produce an RDD composed of a parallelized list of list collections. So if we have a look at the first three elements of this particular RDD, you can see we have 201, the round ID, the match ID, the team initials and so on. Now, to have a look at the number of partitions: in general Spark picks the number of partitions automatically, but in case we want to get the number of partitions for a particular RDD we use the getNumPartitions method. Here we specified 4, so the output should be 4, I guess; yes, the output is 4. And if you want to have a look at the number of rows, or the number of records, in a particular RDD, we use the count method; as you can see here, we have around 37,000 rows. Now back to the sample text file which we took earlier: suppose I want to convert all this data to lowercase and divide these paragraphs into words. For that we create a user-defined function; I'll show you how it's done. As you can see here, we have created a function which uses the lower and split methods, and we are creating a new RDD here, the split_rdd, by passing the original RDD through this function using the map transformation. Map is basically used for executing a transformation on each and every element of a particular RDD. If you have a look at the output of the split_rdd, you can see all the elements are now separated into individual words and all of them are in lowercase. Next we use the flatMap transformation; it is similar to map, but the new RDD flattens out all the elements. So let's use the flatMap transformation, and if we have a look at the RDD, you can see the output is flattened: it's no longer a list of lists but one flat sequence of words, which is more easily readable. Now here I am going to create a stopwords RDD which contains a few common stop words; I'm not using all the stop words here. My agenda here is to remove all the stop words from the RDD we have, so for that we are going to use the filter transformation: we define a lambda function x such that x is not in stopwords, and we create a new RDD, rdd1, by filtering the results from the RDD created above. If we have a look at the output of this new RDD, it most probably won't contain the words defined in the stopwords RDD; as you can see, the earlier output contains contracts, transactions, and, and the, whereas the filtered output contains only contracts and transactions, because and and the are mentioned in the stopwords list and so are not included in the new RDD. For that we use the filter transformation; filter can be used in many ways, and mostly it is used with the help of lambda functions in a PySpark RDD. What I am doing here is creating a filtered RDD which will contain all the elements starting with c; I am using the lambda function x such that x starts with c. Let's have a look at the output of this new filtered RDD; another important thing to notice here is that I'm using the distinct transformation, which returns a new RDD containing the distinct elements of the source RDD.
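A minimal sketch of these word-level transformations, assuming the text RDD created earlier; the variable names and the small stop-word list are illustrative assumptions.

```python
# Lowercase each line and split it into words
def to_words(line):
    return line.lower().split()

split_rdd = text_rdd.map(to_words)        # list of word-lists, one per line
words_rdd = text_rdd.flatMap(to_words)    # flatMap flattens everything into one RDD of words

# Remove a few common stop words using filter with a lambda
stopwords = ["a", "an", "and", "the", "of", "to", "in", "is"]
rdd1 = words_rdd.filter(lambda x: x not in stopwords)

# Keep only the distinct words that start with 'c'
c_words = rdd1.filter(lambda x: x.startswith("c")).distinct()
print(c_words.take(10))
```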
As you can see in the output, we have control, claim, code, computing, connections, case, car, all the elements starting with c. Next, I'm going to execute a small word count program; I hope you understand what word count is, the output of this will basically give us the count of each particular word, and here I am taking the output for the first 10 words. For that I am using the map function with a lambda x such that each word x is paired with the value 1; then I am grouping it by key using the groupByKey method, which is used to group the data according to the key. Then I'm creating another RDD, rdd_frequency, which takes the grouped RDD and maps the values with sum, creating the sum for each particular key, which is the word; then I'm again using the map transformation, and sortByKey is given as false, so the output will contain the words together with their counts, sorted from the highest count downwards. So let's have a look at the top 10 words and their counts: as you can see, and has 48 counts, the has 38, and so on, and we can see that they has 7. Now the initial rdd1, which we created after removing all the stop words, contains 660 rows; I'm going to use the distinct method to see how many distinct elements there are in rdd2, so if we do rdd2.count it should be less than or equal to the rdd1 count. So guys, it's 440, so 220 duplicate entries were removed from the list. Now suppose I want to have a look at the elements of this rdd2 which I created, and I want to see the elements which share the same first three letters; I'll show you the output and you'll understand. As you can see, we have edg, which gives edge and edges; we have year and years; alg gives algorithm; sca gives scale and so on. If we execute it for other prefixes we see gu gives guide, gr gives groundwork, growing, gradual, grand, graphics and gradually, and ga gives gains, gained and gain. These are small tricks which will help you execute your code faster with the help of RDDs. Next I am going to use the sample method: I'm creating a new RDD, sample_rdd, and I am passing rdd1 through sample, and as you can see here I have two arguments, which are False and 0.1. Now what does this mean? False is the withReplacement parameter; I want the output to be sampled without replacement, so I assign False, and 0.1 is the fraction of the data which we are going to take as the sample. So here I'm going to take 10 percent of the original data with rdd1.sample; rdd1 contains 660 rows, so 10 percent of that should be around 66 or 67. Let's collect the sample RDD: if we take the count of the new sample RDD you can see it's 51, which is reasonably close to 66, and as we have used False here it is sampled without replacement. If you have a look at the output of the sample RDD, it contains a sample from the original RDD; you can see we have contrast, established, rays and so on.
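Here is a minimal sketch of the word-count and sampling steps just described, assuming the rdd1 of words built above. A groupByKey followed by a sum is shown because that is the approach described in the walkthrough, although reduceByKey (used a little later) is usually the more efficient choice; the sample output in the comment is only indicative.

```python
# Word count: pair each word with 1, group by the word, sum the ones,
# then sort by count in descending order.
pairs = rdd1.map(lambda word: (word, 1))
grouped = pairs.groupByKey()
frequency = grouped.map(lambda kv: (sum(kv[1]), kv[0])).sortByKey(False)
print(frequency.take(10))      # e.g. [(48, 'and'), (38, 'the'), ...]

# Distinct elements
rdd2 = rdd1.distinct()
print(rdd1.count(), rdd2.count())

# 10% sample of rdd1, without replacement
sample_rdd = rdd1.sample(False, 0.1)
print(sample_rdd.count())
print(sample_rdd.take(10))
```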
Next, what we are going to learn are some functions like join, reduce, reduceByKey and sortByKey, and we are also going to look at unions. For that I'm creating two RDDs which are key-value pairs: in a, the key a is given the value 2 and b the value 3, and similarly we have b containing a with the value 9, b with the value 7 and c with the value 10. In order to join these RDDs I am going to create another RDD, c, and the method to join these two is that I'm going to join a with b, so I'll use a.join and pass b as the parameter. If we have a look at the output of c (here I'm using collect because c has a very small amount of data in it, but it's advised to use take as well), we can see that a has two values, 2 and 9, and b has two values, 3 and 7. This is one kind of join; similarly you can perform other types, like left outer join and right outer join. Now here I am going to create a num_rdd which will take all the numbers from 1 to 50,000, and here I am using the range method to get all the numbers between 1 and 50,000. Next we are going to use reduce, another action, which reduces the elements of an RDD using a specified method; here the method is a lambda function on x and y such that the output is x plus y, so this will give us the sum of the numbers from 1 to 50,000, and as you can see the sum of those numbers is pretty huge. Next I am going to use the reduceByKey method and show you the importance of this function; it works in a similar way to reduce but performs the reduce on a key-by-key basis. As you can see, the data_key RDD has certain numbers assigned to the alphabets: we have a 4, b 3, c 2, a 8, d 2, b 1 and d 3, and we have another parameter, which is 4; this is the number of parallel tasks to be used while taking the input into the data_key RDD. As you can see here, with reduceByKey we are using the same lambda function, and in the result a has 12, which is the sum of 4 and 8; d has 5, from 2 and 3; c has 2; and b has 4, from 1 and 3. Now, in order to save your RDD to a file, you use the saveAsTextFile method: here we are going to save the values of rdd3 into a file called data.txt, and you need to provide the absolute path where you want to store the data. So let's check where our data is stored; it's on the desktop, and as you see here we have one folder created, data.txt, and inside that we have the part files and the success file, and it contains all the elements. Here I'm creating another RDD, test, to show you the sortByKey function and how it works: the sortByKey transformation orders a key-value RDD by the key and returns an RDD in ascending or descending order; by default it's ascending, but you can also use it to get the output in descending order. As you can see, I'm using sc.parallelize on test and I'm calling sortByKey with true and 1; true is for ascending, so as you can see I have 1 3, 2 5, a 1, b 2 and d 4: 1 comes first, then 2, then a, b and d, which is the ascending order of the keys. Next we are going to learn about how union works: the union transformation returns a new RDD which is the union of the source RDD and the argument RDD which has been passed. Here we have two RDDs, union_rdd and union2, containing certain elements; if you want to take the union of these two RDDs into a new RDD, or just want to see the output of this union, we use the union method, and the syntax is pretty simple. As you can see, it created the union of these two RDDs, and intersection also works in a similar manner.
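A minimal sketch of these pair-RDD and aggregation operations, assuming an active SparkContext sc; the values mirror the ones described above, the output path is an illustrative assumption, and the sortByKey keys are kept as strings so the example also runs under Python 3.

```python
# Key-value RDDs and an inner join
a = sc.parallelize([("a", 2), ("b", 3)])
b = sc.parallelize([("a", 9), ("b", 7), ("c", 10)])
c = a.join(b)
print(c.collect())                     # [('a', (2, 9)), ('b', (3, 7))]

# reduce: sum of the numbers from 1 to 50,000
num_rdd = sc.parallelize(range(1, 50001))
print(num_rdd.reduce(lambda x, y: x + y))

# reduceByKey: reduce values key by key (4 = number of partitions)
data_key = sc.parallelize([("a", 4), ("b", 3), ("c", 2), ("a", 8),
                           ("d", 2), ("b", 1), ("d", 3)], 4)
rdd3 = data_key.reduceByKey(lambda x, y: x + y)
print(rdd3.collect())                  # a -> 12, b -> 4, c -> 2, d -> 5

# Save to a text file (one folder with part files), sort by key, and union
rdd3.saveAsTextFile("file:///home/edureka/Desktop/data.txt")
test = [("1", 3), ("2", 5), ("a", 1), ("b", 2), ("d", 4)]
print(sc.parallelize(test).sortByKey(True, 1).collect())
union_rdd = sc.parallelize([1, 2, 3])
union2 = sc.parallelize([3, 4, 5])
print(union_rdd.union(union2).collect())
```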
Next, what we are going to learn is mapPartitionsWithIndex: it is similar to map, but it runs the function f separately on each partition and provides an index for the partition; it is useful for determining the data skew within partitions. For example, I have the function f here which takes a split index and an iterator and yields the split index; I have created mpwi, which is the mapPartitionsWithIndex RDD (it's just a name, it's not necessary to name your RDD like this), and if we do the sum you can see the output is 6. Next, we have two RDDs here, a and b, and I'll show you how intersection works; it works in a fashion similar to union, but what it gives is the intersection of these two sets, so it will produce an RDD containing the elements found in both RDDs. Ideally it should give us the output of 3 and 4, which are common to both RDDs; let's see how it can be implemented, and as you can see the output is 4, 3, which is what we expected. Now we are going to use the subtract method: what it will do is subtract the set of RDD b from RDD a, so the output should be 2 and 1. Another interesting thing to consider is the Cartesian product: we use the cartesian method here to get the product of RDDs a and b, and as you can see, every element of RDD a has been mapped with every element of RDD b. So guys, I hope you have understood a lot about the transformations, actions and operations which can be performed on an RDD to get your desired output. Now let's do something interesting: we are going to find the PageRank of certain web pages, and we are going to use RDDs to solve this problem. So let's understand what PageRank is first: PageRank is basically the rank of a web page, and it was developed at Google. The PageRank algorithm, as you can see here, is based on a particular formula; the algorithm was developed by Sergey Brin and Larry Page, and Larry Page is the person after whom the PageRank algorithm is named. These developers of the PageRank algorithm later founded Google, which is why Google uses the PageRank algorithm and is one of the best search engines in the world. Now, the PageRank of a particular web page indicates its relative importance within a group of web pages; the higher the PageRank, the higher up it will appear in the search results. The importance of a page is defined by the importance of all the web pages that provide an outbound link to the web page in consideration: for example, let's say that web page X has a very high relative importance, and web page X has an outbound link to web page Y; hence web page Y will also have a very high importance. If we have a look at the formula here, basically the PageRank of a particular page at a particular iteration is equal to the summation, over all its inbound links, of the PageRank of the linking page divided by the number of outbound links on that page. Now, to get a clear idea about how this works, let's take an example: here we have four pages, Netflix, Amazon, Wikipedia and Google. Initially, what we do is assign the value of 1 to all the pages; keep that in mind. So according to our formula, in the zeroth iteration, to get the PageRank we divide it equally among all the pages in the network, so the PageRank at iteration 0 for Netflix is one by four, for Amazon it's one by four, for Wikipedia it's one by four and for Google it's one by four. This is the initial iteration, so it doesn't count; from iteration one, suppose we are talking about Netflix, if you look at the formula it says
the PageRank of the inbound link divided by the number of outbound links on that page. Netflix has Wikipedia coming in as an inbound link; the only inbound link to Netflix comes from Wikipedia, so we need to consider Wikipedia. The previous PageRank of Wikipedia is one by four, and we divide it by the number of outbound links going from Wikipedia, which are one going to Netflix, one to Amazon and one to Google; so it's one by four divided by 3, which gives us 1 by 12. Now let's have a look at Amazon: Amazon has inbound links coming from Netflix and from Wikipedia as well, so it will take the summation over Netflix and Wikipedia. Considering Netflix, we have 1 by 4 from the previous iteration divided by the number of outbound links from Netflix, which is 2, so 1 by 4 divided by 2 is 1 by 8; then we add the term for Wikipedia, which is 1 by 4 divided by its 3 outbound links, that is 1 by 12; so it's 1 by 8 plus 1 by 12, which is 2.5 by 12 if you do the calculation. So this is how the PageRank algorithm works; now we'll implement the same using RDDs, so let's go ahead and get started. For our example we have four pages, which are a, b, c and d. Consider that d has two inbound links, so we'll do the summation, the first term coming from a and the second one coming from b; as stated earlier, it will be the summation of the previous PageRank of a divided by its number of outbound links, which is 3 for a, and for b it is 2. So let's see how it can be implemented. Step one is creating a nested list of web pages with outbound links and initializing the ranks. Here I've created an RDD, page_links: for a the outbound links are b, c and d; for c it's b; for b it's d and c; and for d it's a and c. Initially we'll assign the page ranks as 1, as I mentioned earlier. Now, after creating ranks for the nested list, we have to define the number of iterations for running PageRank, and we are going to write a function that takes two arguments: the first argument of our function is the list of web pages to which the current page provides outbound links, and the second argument is the rank of the web page providing those outbound links. The function will return the contribution to all the web pages in the first argument. So as you can see, the first argument of our function is the list of web page URIs and the second one is the rank; the code is quite self-explanatory. Our function rank_contribution will return the contribution to the PageRank for the list of URIs, which is the first variable; the function first calculates the number of elements in our URI list, then it calculates the rank contribution for the given URIs, and finally, for each URI, the contributed rank is returned. Let's first create a pair RDD of the link data, and then we create the pair RDD for our rank data as well; as you can see, the rank is 1. Next we set the number of iterations, which is 20, and we define s as 0.85; s is known as the damping factor. So let's assign these values; now it's time to write our final loop to update the PageRank of every page, with the number of iterations defined as 20 and the damping factor at 0.85, so let's compute it.
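Here is a minimal sketch of what such a loop might look like, assuming the page_links and page_ranks pair RDDs and the rank_contribution function described above. The variable names and, in particular, the exact form of the damped update are written from the description and are assumptions; the demo's exact code may differ.

```python
# Nested list of pages with their outbound links, and initial ranks of 1.0
page_links = sc.parallelize([("a", ["b", "c", "d"]),
                             ("c", ["b"]),
                             ("b", ["d", "c"]),
                             ("d", ["a", "c"])])
page_ranks = sc.parallelize([("a", 1.0), ("b", 1.0), ("c", 1.0), ("d", 1.0)])

def rank_contribution(uris, rank):
    """Yield (uri, contribution) for every outbound link of a page."""
    num_uris = len(uris)
    for uri in uris:
        yield (uri, rank / num_uris)

num_iterations = 20
s = 0.85                                   # damping factor
num_pages = page_links.count()

for _ in range(num_iterations):
    # Join links with current ranks, spread each page's rank over its outbound links,
    # sum the contributions per page, then apply the damping factor.
    contributions = page_links.join(page_ranks).flatMap(
        lambda kv: rank_contribution(kv[1][0], kv[1][1]))
    summed = contributions.reduceByKey(lambda x, y: x + y)
    # One common way to apply the damping factor; the demo's update may differ.
    page_ranks = summed.map(lambda kv: (kv[0], (1 - s) / num_pages + s * kv[1]))

print(page_ranks.collect())
```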
Now if you have a look at the output of the page ranks: as you can see, in the 49th block there are no actions performed; the action comes in the 50th block, I'd rather say. So what it will do is run this loop through 20 iterations, and to confirm that you can go to the console, where you can see the processes running; this will take a certain amount of time based on the number of iterations which we have, and it will give us the final output of the ranks. So let's investigate our loop: first we join the page_links RDD to the page_ranks RDD via an inner join; then the second line of the for block calculates the contribution to the PageRank by using the rank_contribution function we defined previously; in the next line we aggregate all the contributions; and in the last line we update the rank of each web page by using the map function. As you can see, we have the output of b as 0.22, d as 0.24 and a as 0.25. Now the damping factor also plays an important role here; the value of c must be even lower than the values of a, b and d, so we can say that a has the highest PageRank, followed by d and then b, and the minimum has to be c. One important thing to note is that the sum of all the page ranks is always equal to 1: if we look back at our earlier problem, in iteration zero, if we add all the values, one by four plus one by four plus one by four plus one by four, that is 1; in iteration one, if we add 1 by 12 plus 2.5 by 12 plus 4.5 by 12 plus 4 by 12, that gives us 1 again, and it's similar for iteration two. The more iterations we run, the clearer and more accurate the output will be. [Music] So I'm sure you might be wondering why exactly we need data frames. The concept of data frames comes from the world of statistical software used in empirical research; data frames are designed for processing large collections of structured or semi-structured data. Observations in a Spark data frame are organized under named columns, which helps Apache Spark understand the schema of the data frame, and this helps Spark optimize the execution plan for queries on it. Data frames in Apache Spark have the ability to handle petabytes of data, so they can handle the large amounts of data that are usually big data. They have support for a wide range of data formats and sources, which we will learn about later in this video, and finally, last but not the least, they have API support for different languages like Python, R, Scala and Java, which makes them easier to use for people with different programming backgrounds. Another important feature of data frames is that the DataFrame API supports elaborate methods for slicing and dicing the data; it includes operations such as selecting rows, columns and cells by name or by number, filtering out rows, and many other operations. Statistical data is usually very messy and contains lots of missing and wrong values, so a critically important feature of data frames is the explicit management of missing data. Now let's understand what data frames are: data frames generally refer to tabular data, a data structure representing rows, each of which consists of a number of observations or measurements, which are known as columns; alternatively, each row may be treated as a single observation of multiple variables. In any case, each column has the same data type, but while the row, which is the record, may be heterogeneous, the column data type must be homogeneous. Data frames usually contain some metadata in addition to the data,
for example the column and the row names. A data frame is like a 2D data structure, similar to a SQL table or a table in a spreadsheet. Now there are a few important features of data frames that we must know of. First of all, they are distributed in nature, which makes them highly available and fault tolerant. The second feature is lazy evaluation: lazy evaluation is an evaluation strategy which holds the evaluation of an expression until its value is needed; it avoids repeated evaluation, and lazy evaluation in Spark means that execution will not start until an action is triggered. In Spark, lazy evaluation comes into the picture when Spark transformations occur: transformations are lazy in nature, meaning that when we call some operation it does not execute immediately; Spark maintains a record of which operation has been called through the DAG, which is the directed acyclic graph, and since transformations are lazy in nature we can execute operations at any time by calling an action on the data, hence in lazy evaluation data is not loaded until it is necessary. There are many advantages of lazy evaluation, like increased manageability, saved computation, increased speed, the ability to handle greater complexity, and the optimization it provides by reducing the number of queries. Finally, we can say that data frames are immutable; by immutable I mean that a data frame is an object whose state cannot be modified after it is created, but we can transform its values by applying certain transformations, as with RDDs. Now a data frame in Apache Spark can be created in multiple ways: it can be created using different data formats, for example by loading data from JSON, CSV, XML or parquet files; it can also be created from an existing RDD; it can be created from various databases like the Hive database or the Cassandra database; and it can also be created from files residing in the local file system as well as HDFS. Let's have a look at the various important classes of data frames and SQL: we have pyspark.sql.SQLContext, which is the main entry point for data frame and SQL functionality; we have pyspark.sql.DataFrame, a distributed collection of data grouped into named columns; we have pyspark.sql.Column for column expressions in a data frame, and similarly we have the same for Row; and next we have pyspark.sql.GroupedData, which holds the aggregation methods returned by DataFrame.groupBy. We'll learn more about these important classes later in the video, so let's go ahead and create some data frames. Now I use the Jupyter notebook instead of the PySpark shell, as I personally find it easier to work with; ultimately it comes down to your choice. So guys, let's go ahead and start the demo. To start the PySpark shell, all you need to type is pyspark once you have configured Spark on your virtual machine or personal computer; there are a few lines of configuration you need to add to the .bashrc file if you want the Jupyter notebook to be integrated with PySpark, and I've done that, so as soon as I enter pyspark, instead of going to the PySpark shell I am redirected to the Jupyter notebook. As you can see here, it has opened the Jupyter notebook for me, so here I'm going to select a new Python 2 file. Firstly we'll create example data of departments and employees, and I'll show you how it's done: the first and foremost step for data frames is to import pyspark.sql. Now that the import is successful, as
you can see, I'm creating an employee record with the Row function; basically, the rows will have the columns first name, last name, email and salary. Next I am going to create some employee data: as you can see, I've created five employees with the data in the same format as given, that is first name, last name, email and salary, and these employees one, two, three and four are all in the format of the Employee row. Now that we have created the employees, let's go ahead and create the departments. As I was saying, Python is very easy and data frames are easy to use; I'll show you how easy it is. Suppose I want to look at the values of employee three: all I need to do is execute print with employee3, and as you can see we have a Row in which the first name is Muriel, the last name is not specified, and we have the email ID and the salary. Now suppose we want to see the first value of the employee row which we created earlier: as you can see here, if I print the employee with 0 in square brackets, it's the first name. So basically here I'm using the Row function to create the employees and the departments, and now I will create the department-with-employees instances from the departments and the employees. Here, as you can see, all I needed to do was type print and department4, and whatever was in department4, which is the ID and the name, the name being the name of the department, is shown: we have the ID of the department and the name, the development department. Next we'll create data frames from a list of rows; this is where it gets a little trickier. Here, as you can see, we have used spark.createDataFrame, and this will create a data frame, dframe: it will contain the departmentWithEmployees1 and departmentWithEmployees2 instances, which in turn hold department 1 and department 2, where employees one, two and five are in department one and employees three and four are in department two. If we display the dframe, we see the department as a struct with an id and a name, and then we have the employees as an array, as we have given here, which has the first name, last name, email and salary. So guys, this is how we create a data frame. Now let's have a look at a FIFA World Cup use case; I hope Argentina recovers from the Croatia match too. Here we have the data of all the players in the FIFA World Cup, so let's go ahead and load the data; we have the data in CSV format, and I'll show you how it's done. I'm going to select a new notebook for this one. As you can see here we have fifa_df, and I'm using the function spark.read.csv; this takes the data from the CSV file at the provided file path, and I've set inferSchema as True and header as True. If we don't set these values to True, then, since the first row of a data set usually holds the names of the columns, Spark will not infer the schema and it will take the first row as values of the data frame. As you can see here, by default the path is taken as an HDFS path, so in order to give the path of a local file you need to put in file: and double forward slashes. Now that we have the data frame in fifa_df, let's see how this data frame looks: as you can see here, we have the round ID, the match ID, the team initials, the coach name and the team they belong to, the lineup, the player name, the position at which they played, and the event.
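A minimal sketch of the two creation paths just shown, building a data frame from Row objects and reading one from a CSV file; the field values, the file path and the spark session variable name are illustrative assumptions rather than the demo's exact data.

```python
from pyspark.sql import Row

# Build a data frame from Row objects (names and values are made up for illustration)
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee("Jane", "Doe", "jane@example.com", 100000)
employee2 = Employee("John", "Smith", "john@example.com", 120000)
Department = Row("id", "name")
department1 = Department("123456", "Development")

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
dframe = spark.createDataFrame([departmentWithEmployees1])
dframe.show()

# Read a CSV file into a data frame; inferSchema and header let Spark
# derive the column types and names from the file itself.
fifa_df = spark.read.csv("file:///home/edureka/WorldCupPlayers.csv",
                         inferSchema=True, header=True)
fifa_df.show()
fifa_df.printSchema()
```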
Now if you want to print the schema of this particular data frame, all we need to do is use printSchema, and as you can see here we have the round ID which is integer, the match ID which is also integer, the team initials which are string, and here we have the option of nullable as true, so even if there is no value specified it will take the value as null. Next, suppose I want to see all the columns in my data frame; we saw that earlier when we used the show command, but another way to do it is using .columns. If you want to know the number of rows, or the number of records, in our data frame, all you need to do is give the name of the data frame and use the count function; as you can see here we have 37,784 rows. And if you want to look at the number of columns, we could count them manually, but when that's not possible we can use the length of the columns list; as you can see here, we have 8 columns. Now suppose we want to see the summary of a particular column: we use the describe function. What it does is give a summary, that is, the count, which is around 37,000, the number of rows; the mean of the particular column, which in our case is null because it's a string; the standard deviation; the minimum, which here is according to alphabetical order, so we get a coach name beginning with A; and the maximum. Similarly, we can describe the position column too: as you can see, here we have a count of only 4143, the mean and standard deviation are null, the minimum value is C, as in captain, and the maximum is GKC, the goalkeeper captain. Now suppose you want to see the player name and the coach name in a particular format: we just need to use the select function and specify the column names, and as you can see we have the player name and the coach name. There is another function in data frames, which is the filter function: suppose we want to filter out all the players and all the data according to a match ID; we use the filter function here, and on top of that we use the show function to get our desired results. As you can see here, all the data we have is for the match ID 1096, and it's showing only the top 20 rows. Similarly, you can see that we can filter according to the position of Captain, and here we have our result; and if I now want to check how many records I have in this particular data frame in which the position is given as Captain, you can see we have 1510 records where the position is defined as Captain. It's very similar to SQL; we'll see how we can use SQL queries too later on. Similarly, if we filter using the position of GK, which is the goalkeeper, you can see we have all those records. We can also use the filter on more than one parameter: here I have taken the position as Captain and a particular event, so we get the result of all the players whose position was Captain and who participated in that event; there were only two results here. Now that we have seen filter and select, let's go ahead and see how we can order or group the data. To order, we use the orderBy function; here we are ordering according to the match ID, and as you can see the lowest match ID was 25, and it goes in increasing order; you can also change the ascending parameter to False to get a descending order result.
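Here is a minimal sketch of the selection, filtering and ordering calls described above, assuming the fifa_df data frame loaded earlier. The column spellings and the event literal are assumptions based on a commonly used World Cup players CSV, not confirmed from the demo.

```python
fifa_df.printSchema()
print(fifa_df.columns)
print(fifa_df.count(), len(fifa_df.columns))

fifa_df.describe("Position").show()
fifa_df.select("Player Name", "Coach Name").show()

fifa_df.filter(fifa_df.MatchID == 1096).show()
print(fifa_df.filter(fifa_df.Position == "C").count())

# Multiple filter conditions combined with &
fifa_df.filter((fifa_df.Position == "C") & (fifa_df.Event == "G40'")).show()

# Order by match ID, ascending by default; pass ascending=False for descending
fifa_df.orderBy(fifa_df.MatchID).show()
fifa_df.orderBy(fifa_df.MatchID, ascending=False).show()
```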
Now here I am registering a temporary table, fifa_table, for the same data frame which we had earlier, that is fifa_df, and now I'm going to use SQL queries. As you can see, using the SQL context I've passed the command select * from fifa_table; this gives us the same result as fifa_df.show, as the same data has now been exposed as a table which can be used in a SQL context. In this SQL context we can pass any SQL queries that we wish for; we'll see that later in a different use case. So I hope you guys understood the basic functions: filtering, orderBy, select, show, groupBy, all of that can be used on the particular data frames, and you also saw how we can pass SQL queries here; you just need to invoke sqlContext.sql, and inside that you can give your SQL queries to be executed on the table, because the data frame has been registered as the table fifa_table. Now that we have seen how to create a data frame using a CSV data file and also apply some of the functions like selecting, ordering, filtering, and also passing SQL queries to that data frame, let's go ahead with another amazing use case, which is the superhero use case. Here we have the data of all the superheroes; I'll show you that one too. Let me load the data set; this data set is also in CSV format, so we'll use the same method here to read the data in. As you can see, the data is in a CSV file, superheroes.csv, and we'll load the data frame in the same format. So guys, keep in mind that if the top row of the data set which you are using has all the column names, please make sure you set inferSchema and header as True. As I mentioned earlier, you can use the show command to see the contents of the data frame, but you can also specify the number of rows: here, since I have specified show(10), it will show me the top 10 rows of the particular data frame, and as you can see here we have the serial numbers starting from 0 to 9.
So now that we have seen how to create a data frame from a CSV file, apply functions like select, order by and filter, and pass SQL queries against it, let's move on to another amazing use case, the superhero use case, where we have data about superheroes. This data set, superheroes.csv, is also in CSV format, so we load it the same way. Keep in mind that if the top row of your data set holds the column names, you should set inferSchema and header to true. As I mentioned earlier, you can use the show command to see the contents of a data frame, but you can also specify the number of rows: show(10) displays only the top 10 rows, with the serial numbers running from 0 to 9.

Let's have a look at the columns: we have the serial number, the name, the gender, the eye colour of the superhero, the race to which it belongs (alien, human, cosmic entity and so on), the hair colour, the height, the publisher (there are various publishers of these superheroes, such as Marvel, DC, Dark Horse and NBC), the skin colour, the alignment and the weight. Using printSchema we can see which data type is associated with each column; race, for example, is a string, and nullable is set to true, as mentioned earlier.

Now let's use the filter function and see how many male superheroes we have, which is 505, and how many female superheroes, around 200. Next I create another data frame, race_df, which takes the superhero data frame, groups it by race and counts each race. It shows only the top 20 rows, but there are more races of superheroes than we expected, not quite what we see in the comic books. Similarly I create skin_df, which groups the superhero data frame by skin colour and counts how many superheroes have each skin colour: we have 21 green superheroes, 5 grey ones, a few red and golden ones, but for the majority, 662, the skin colour is not specified. That is one of the benefits of a data frame: it tolerates null values in your data set without throwing an error.

So far we have seen group by and order by; now let's look at the sort function. I create weight_df and sort it by the weight column, using the desc function, which stands for descending. At the top we have Sasquatch, male, red eyes, with a weight of 900, the highest value, whatever the unit of weight may be. Next I create dc_heroes to filter out all the heroes from DC Comics; counting them gives 215 heroes, and similarly, filtering with the publisher as Marvel Comics and counting gives 388 heroes from Marvel. If you want a count of heroes for every publisher at once, use group by on the publisher along with the count function: Marvel Comics has 388, DC Comics 215, Image Comics 14, and there are 15 records with a null publisher. Personally I feel DC has a much better storyline than Marvel, but that is beside the point.
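A compact sketch of the superhero walkthrough, assuming a file superheroes.csv and column names such as Gender, Race, Skin color, Weight and Publisher; these names are inferred from the narration and may be spelled differently in the actual file.

```python
from pyspark.sql.functions import desc

superhero_df = spark.read.csv("superheroes.csv", header=True, inferSchema=True)

superhero_df.show(10)        # only the top 10 rows
superhero_df.printSchema()

print(superhero_df.filter(superhero_df["Gender"] == "Male").count())
print(superhero_df.filter(superhero_df["Gender"] == "Female").count())

race_df = superhero_df.groupBy("Race").count()          # heroes per race
skin_df = superhero_df.groupBy("Skin color").count()    # heroes per skin colour
race_df.show()
skin_df.show()

weight_df = superhero_df.sort(desc("Weight"))           # heaviest first
weight_df.show(5)

dc_heroes = superhero_df.filter(superhero_df["Publisher"] == "DC Comics")
marvel_heroes = superhero_df.filter(superhero_df["Publisher"] == "Marvel Comics")
print(dc_heroes.count(), marvel_heroes.count())

superhero_df.groupBy("Publisher").count().show()        # count per publisher
```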
So we can say that Marvel and DC are the two major contenders in the superhero comic market. Next I register a table of superheroes from the data frame superhero_df and, just as before, show you how to pass SQL queries. Selecting * from the superhero table is equivalent to superhero_df.show(), although the output is not formatted quite as nicely as it is for the data frame. If we pass the query select distinct eye colour from the superhero table, we get the list of all the distinct eye colours: yellow, violet, grey, green, yellow-brown, indigo, silver, purple and so on. If I want to know how many distinct eye colours there actually are, the count comes to 23. We also saw earlier that the maximum weight was 900 units, for Sasquatch; we can get the same result from SQL by selecting the maximum of the weight column from the superhero table. It returns only the maximum record, but it is the same: the maximum weight is 900 units.

So these are some of the functions and features you can use with data frames. As I mentioned earlier, data frames are very important: they handle structured and semi-structured data, they can load petabytes of data, which is essential for big data computation, and most importantly they are used for slicing and dicing, that is, clearing out and cleaning the data set.
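The same superhero questions can be expressed as SQL, as walked through above; the table name and the column names ("Eye color", "Weight") are again assumptions from the narration.

```python
superhero_df.createOrReplaceTempView("superhero_table")

spark.sql("SELECT * FROM superhero_table").show()
spark.sql("SELECT DISTINCT `Eye color` FROM superhero_table").show()
spark.sql("SELECT COUNT(DISTINCT `Eye color`) AS n_colors FROM superhero_table").show()
spark.sql("SELECT MAX(Weight) AS max_weight FROM superhero_table").show()
```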
Now, the PySpark SQL module is a higher-level abstraction over the PySpark core used for processing structured and semi-structured data sets. With PySpark SQL you can process data using SQL as well as HiveQL, which is why it has gained popularity among database programmers and Apache Hive users. Moreover, it provides an optimized API and can read data from various types of sources: CSV, JSON and other file formats, or databases. Let me show you how to apply SQL queries to our data frames to make them more accessible. First we import SparkSession and also SQLContext for Spark SQL. Earlier we loaded the data frame with spark.read.csv; here we are going to use sqlContext.read.load, again with inferSchema and header set to true, and load the data into the data frame df.

Let's look at the schema of this data: it is the New York City flights data, with the year, month, day, departure time, departure delay, arrival time, arrival delay, tail number, flight number, origin, air time, distance and much more. Suppose I want to rename a particular column, say rename dest to destination; for that we use the withColumnRenamed function, and if you look at the schema of the new data frame df1 you can see that dest has been changed to destination. You can rename other columns the same way. To see basic statistical information for a data frame you use the describe command, as mentioned earlier: to get the summary of a particular column you pass its name, but to summarise the whole data frame you just call describe without any column name. The raw output is quite haphazard, but we can prettify it using the pandas library. These libraries are a major reason people choose Python for Spark programming rather than Scala or Java; their availability makes visualisation and machine learning much easier.

If we want only particular columns from the data set, say the flight, origin, destination and distance, I create a new data frame df2 using a select, and I also apply distinct so that we keep only the unique combinations of flight, origin, destination and distance. Next I import asc, which stands for ascending and is used for sorting, create another data frame sorted_df by passing the data frame to the sort function, and sort it in ascending order of the flight column; as you can see, the flight numbers now start from one. The data set is huge and has many columns, so to look at only the few columns we saw earlier (flight, origin, destination and distance) and get a cleaner output, I run a select on this sorted data frame; all the flights start from flight number one. I then send the output of the select and sort into a new data frame, sorted_df2. As you saw, there are many duplicate rows; the data set is so large that many flights originate from JFK with destination LAX and a fixed distance of 2,475, so to get the distinct values we use the distinct command. Looking at the distinct output of sorted_df2, the flight number is 1 but the tail number, the carrier and the air time differ. Now let's check how many rows sorted_df2 contains.
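The flights walkthrough can be sketched as below, under the assumption of a CSV file (call it nyc_flights.csv) with columns named along the lines of flight, origin, dest, distance, carrier and tailnum; those names follow the nycflights-style data described above and may differ in your file. The SQLContext entry point mirrors the video; on Spark 2.x you could equally use spark.read.

```python
from pyspark.sql import SQLContext
from pyspark.sql.functions import asc

sqlContext = SQLContext(spark.sparkContext)

# read.load defaults to parquet, so the format is given explicitly for CSV
df = sqlContext.read.load("nyc_flights.csv", format="csv",
                          header=True, inferSchema=True)
df.printSchema()

df1 = df.withColumnRenamed("dest", "destination")   # rename a single column
df1.printSchema()

df.describe().show()             # summary of every numeric column
# df.describe().toPandas()       # prettier output via the pandas library

df2 = df.select("flight", "origin", "dest", "distance").distinct()

sorted_df = df.sort(asc("flight"))                   # ascending by flight number
sorted_df2 = sorted_df.select("flight", "origin", "dest", "distance")
sorted_df2.distinct().show()
```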
It should be around three hundred thousand; yes, roughly 336,000 rows. Next we drop the duplicates based on the carrier and the tail number: dropDuplicates removes every row that repeats the same carrier and tail number combination, and we can do this for as many columns as we like. Counting this data frame now gives only about 4,000 rows, so after removing the duplicates on just those two columns we are down to four thousand. If you want to join two data frames, we have the join function: I create another data frame, joined_df, pass it the two data frames sorted_df2 and sorted_df, and join them on the basis of the flight column; looking at the output, we have the joined result of the two data frames. Now suppose I want only the rows whose origin is JFK, MCO or EWR: apart from the filter and select functions there is another function, where, which performs the same action, so the resulting data frame contains only flights originating from JFK, MCO and EWR. Finally, to compute an average we first import the avg function; I create df3 using the aggregate function with the average of the distance, and the average distance comes out to about 1,039. So guys, that's it for PySpark SQL programming.
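Continuing the same sketch for the de-duplication, join, where and aggregation steps described above (column names still assumed):

```python
from pyspark.sql.functions import avg

# keep one row per (carrier, tailnum) combination
deduped = sorted_df.dropDuplicates(["carrier", "tailnum"])
print(deduped.count())

# join the two DataFrames on the flight column
joined_df = sorted_df2.join(sorted_df, on="flight")
joined_df.show()

# where() is an alias for filter()
df.where(df["origin"].isin("JFK", "MCO", "EWR")).show()

# average distance over the whole DataFrame
df.agg(avg("distance")).show()
```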
[Music] The next topic in our discussion is PySpark Streaming. Spark Streaming is a scalable, fault-tolerant streaming system that takes the RDD batch paradigm and processes the data in batches, which speeds up the overall task. Spark Streaming receives an input data stream, which is internally broken into multiple smaller batches; the size of these batches is based on the batch interval. The Spark engine then processes those batches of input data to produce a set of batches of processed data. The key abstraction for Spark Streaming is the DStream, or discretized stream, which represents the small batches that make up the stream of data. DStreams are built on RDDs, which lets Spark developers work in the same context of RDDs and batches and apply it to streaming problems. Spark Streaming also integrates with MLlib, the machine learning library (we will cover machine learning and MLlib programming later in this video), as well as with SQL, DataFrames and GraphX, which widens your horizon of functionality. Being a high-level API, Spark Streaming provides fault tolerance and exactly-once semantics for stateful operators, and it has built-in receivers that can take in as many sources as needed. Data can be ingested from many sources such as Kafka, Flume, Twitter, Kinesis or TCP sockets, processed using complex algorithms expressed with high-level functions like map, reduce, join and window, and finally the processed data is pushed out to various file systems, databases and live dashboards.

[Music] Now, what exactly is machine learning? Machine learning is a method of data analysis that automates analytical model building, using algorithms that iteratively learn from data. It allows computers to find hidden insights without being explicitly programmed where to look, and it focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Machine learning uses the data to detect patterns in a data set and adjust the program's actions accordingly. Most industries working with large amounts of data have recognised the value of machine learning technology: by gleaning insights from this data, often in real time, organisations are able to work more efficiently or gain an advantage over competitors.

Let's look at the various industries where machine learning is used. Government agencies, such as public safety and utilities, have a particular need for machine learning; they use it for face detection, security and fraud detection. In marketing and sales, websites that recommend items you might like based on previous purchases use machine learning to analyse your buying history and promote other items you would be interested in. Analysing data to identify patterns and trends is key to the transportation industry, which relies on making routes more efficient and predicting potential problems to increase profitability. In financial services, banks and other businesses in the financial industry use machine learning technology for two key purposes: to identify important insights in data and to prevent fraud. In healthcare, machine learning is a fast-growing trend thanks to the advent of wearable devices and sensors that can use data to assess a patient's health in real time. Finally, in biometrics, the science of establishing the identity of an individual based on physical, chemical or behavioural attributes is one of the major applications of machine learning.

Now let's look at the typical machine learning life cycle. It is divided into two phases, training and testing. For training we use 70 to 80 percent of the data, and the remaining data is used for testing. First we take the data and train it with a particular algorithm, which produces a model; then the remaining 20 to 30 percent of the data is passed to that model and we find out the accuracy of the model with certain tests. That is what a typical machine learning life cycle looks like.

There are three major categories of machine learning, as I mentioned earlier: supervised, reinforcement and unsupervised learning. Let's understand these terms in detail, starting with supervised learning. Supervised learning algorithms are trained using labelled examples, that is, inputs where the desired output is known. The learning algorithm receives a set of inputs along with the corresponding correct outputs, learns by comparing its actual output with the correct output to find errors, and then modifies the model accordingly. Through methods like classification, regression, prediction and gradient boosting, supervised learning uses these patterns to predict the values of the label on additional unlabelled data.
It is called supervised learning because the process of an algorithm learning from the training data set can be thought of as a teacher supervising the learning process. Supervised learning is majorly divided into two categories, classification and regression. Regression is the problem of estimating or predicting a continuous quantity: what will the value of the S&P 500 be one month from today, how tall will a child be as an adult, how many customers will leave for a competitor this year. These are examples of questions that fall under the umbrella of regression. Classification, on the other hand, deals with assigning observations to discrete categories rather than estimating continuous quantities. In the simplest case there are two possible categories, a case known as binary classification, and many important questions can be framed in those terms: will a given customer leave us for a competitor, does a given patient have cancer, does a given image contain a dog or not. Classification mainly uses classification trees, support vector machines and random forest algorithms, whereas regression uses linear regression, decision trees, Bayesian networks and fuzzy classification. There are other algorithms, such as artificial neural networks and gradient boosting, that also come under supervised learning.

Next we have reinforcement learning, which is learning how to map situations to actions so as to maximise a reward; it is often used for robotics, gaming and navigation. With reinforcement learning the algorithm discovers through trial and error which actions yield the greatest rewards; it is told whether an answer is correct or not, but not how to improve it. The agent is the learner or decision maker whose job is to choose actions that maximise the expected reward over a given amount of time; actions are what the agent can do, and the environment is everything the agent interacts with. The algorithm, whose ultimate goal is to acquire as much numerical reward as possible, gets penalised each time its opponent scores a point and rewarded each time it manages to score a point against the opponent; it uses this feedback to update its policy and gradually filters out all the actions that lead to a penalty. Reinforcement learning is useful in cases where the solution space is enormous or infinite, and it typically applies where the machine can be thought of as an agent interacting with its environment. There are many reinforcement learning algorithms; a few of them are Q-learning, SARSA (state-action-reward-state-action), deep Q-networks, deep deterministic policy gradient (DDPG) and trust region policy optimisation (TRPO).

The last category of machine learning is unsupervised learning. As I mentioned earlier, supervised learning tasks find patterns where we have a data set of right answers to learn from, whereas unsupervised learning tasks find patterns where we do not. This may be because the right answers are unobtainable or infeasible to obtain, or because for a given problem there isn't really a right answer at all. A large subclass of unsupervised tasks is the problem of clustering, which refers to grouping observations together in such a way that members of a common group are similar to each other and different from members of other groups.
A common application is in marketing, where we wish to identify segments of customers or prospects with similar preferences or buying habits. A major challenge in clustering is that it is often difficult or impossible to know how many clusters should exist, or how the clusters should look. Unsupervised learning is used against data that has no historical labels: the system is not told the right answer, the algorithm must figure out what is being shown, and the goal is to explore the data and find some structure within it. Unsupervised learning works well on transactional data, and these algorithms are also used to segment text topics, recommend items and identify data outliers. There are two major classes of unsupervised learning: one is clustering, as discussed, and the other is dimensionality reduction, which includes techniques like principal component analysis, tensor decomposition, multidimensional scaling and random projection.

Now that we have understood what machine learning is and what its types are, let's look at the components of the Spark ecosystem and understand how machine learning plays a role here. Spark has a component named MLlib: PySpark MLlib is the machine learning library, a wrapper over the PySpark core used to do analysis with machine learning algorithms. It works on distributed systems, it is scalable, and we can find implementations of classification, clustering, linear regression and other machine learning algorithms in it. We know that PySpark is good for iterative algorithms, and many machine learning algorithms have been implemented in PySpark MLlib using exactly that iterative style; apart from PySpark's efficiency and scalability, the MLlib APIs are very user friendly. Software libraries built to solve a particular set of problems usually come with their own data structures, and PySpark MLlib comes with several, including dense vectors, sparse vectors, and local and distributed matrices. The major MLlib algorithm families are clustering, frequent pattern matching, linear algebra, collaborative filtering, classification and linear regression.

Now let's see how we can leverage MLlib to solve a few problems. Let me explain the first use case. A system was hacked, but the metadata of each session the hackers used to connect to the servers was found. It includes features like the session connection time, the bytes transferred, whether a Kali trace was used, the number of servers corrupted, the number of pages corrupted, the location and the WPM typing speed. There are three potential hackers, two confirmed and one not yet confirmed. The forensic engineers know that the hackers trade off attacks, meaning they should each have roughly the same number of attacks: if there were 100 attacks, then in a two-hacker situation each would have about 50, and in a three-hacker situation each would have about 33. So we are going to use clustering to find out how many hackers were involved. I am going to use a Jupyter notebook to do all the programming, so let me open a new Python notebook. First we import all the required libraries and initiate the Spark session, and then we read the data using the spark.read method.
Here we use spark.read.csv, since our data set is in CSV format, and we pass header and inferSchema as true. By default spark.read looks in HDFS, so to change the default location to your local file system you prefix the path with file: and two forward slashes and then give the absolute path of the data file you are going to read. Let's look at the first record of the data frame and also at a summary of the data set; for the summary we use the describe function, whose output is quite haphazard. To see just the names of the columns we use dataset.columns: we have the session connection time, the bytes transferred, Kali trace used, servers corrupted, pages corrupted, the location and the WPM typing speed, where WPM stands for words per minute.

Next we import the Vectors and VectorAssembler libraries; these are the machine learning libraries we are going to use. What VectorAssembler does is take a set of columns and combine them into a single feature column. Our features consist of the session time, the bytes transferred, Kali trace used, the servers corrupted, the pages corrupted and the WPM typing speed. One thing to note is that the feature selection is up to us: if the model does not produce the right output with these features, we can change them accordingly to get the desired output. I create vec_assembler, which takes all the attributes defined above and produces the features column, and then I make the final data by applying the assembler's transform on the data set we have.

Next we import the StandardScaler library. Centering and scaling happen independently on each feature, by computing the relevant statistics on the samples in the training set; the mean and standard deviation are then stored to be used on later data via the transform method. Standardising the data set is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data. So we compute the summary statistics by fitting the StandardScaler and then normalise each feature to have unit standard deviation, which gives us the cluster_final_data.

Now it is time to find out whether there were two or three hackers, and for that we are going to use K-means. I create kmeans_3 and kmeans_2: kmeans_3 takes the scaled features with the k value as 3, and kmeans_2 takes the scaled features with the k value as 2. Then we create models for both by fitting them on the cluster_final_data.
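Here is a sketch of the clustering pipeline just described, assuming the session metadata lives in a file such as hack_data.csv and that the column names follow the features the instructor lists (they may be spelled differently in the actual data set):

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

dataset = spark.read.csv("file:///path/to/hack_data.csv",
                         header=True, inferSchema=True)

feat_cols = ["Session_Connection_Time", "Bytes Transferred", "Kali_Trace_Used",
             "Servers_Corrupted", "Pages_Corrupted", "WPM_Typing_Speed"]

assembler = VectorAssembler(inputCols=feat_cols, outputCol="features")
final_data = assembler.transform(dataset)

# normalise each feature to unit standard deviation
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
cluster_final_data = scaler.fit(final_data).transform(final_data)

model_k3 = KMeans(featuresCol="scaledFeatures", k=3).fit(cluster_final_data)
model_k2 = KMeans(featuresCol="scaledFeatures", k=2).fit(cluster_final_data)
```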
Now, WSSSE stands for the within set sum of squared errors; let's look at its value for the model with k equal to 3, that is three clusters, and for the model with k equal to 2. For k equal to 3 the within set sum of squared errors is about 434, and for k equal to 2 it is about 601. Let's also look at the values of k from 2 up to 8: the WSSSE keeps getting lower and lower (for k equal to 8 it is about 198), so on its own it does not tell us much about whether there were more than two or three hackers. The last key fact the engineers mentioned was that the attacks should be evenly split between the hackers, so let's check that with the transformation and the prediction column. Grouping by prediction, with three clusters the counts are 167, 79 and 88, which is not evenly distributed; but if we look at the data for k equal to 2 with the model k2 and do the prediction, the counts are evenly distributed. This means only two hackers were involved: the clustering algorithm created two equal-sized clusters with k equal to 2, with a count of 167 for each of them. So this is one way to find out how many hackers were involved, using K-means clustering.
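The WSSSE comparison and the even-split check can be sketched like this; computeCost is the Spark 2.x API used in videos of this vintage and was removed in Spark 3.x, where ClusteringEvaluator plays that role.

```python
# within set sum of squared errors for each candidate model (Spark 2.x API)
print("k=3 WSSSE:", model_k3.computeCost(cluster_final_data))
print("k=2 WSSSE:", model_k2.computeCost(cluster_final_data))

# WSSSE keeps falling as k grows, so it is not decisive on its own
for k in range(2, 9):
    model = KMeans(featuresCol="scaledFeatures", k=k).fit(cluster_final_data)
    print("k =", k, "WSSSE =", model.computeCost(cluster_final_data))

# the deciding clue: the attacks should be split evenly between the hackers
model_k3.transform(cluster_final_data).groupBy("prediction").count().show()
model_k2.transform(cluster_final_data).groupBy("prediction").count().show()
```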
[Music] Let's move forward with our second use case, customer churn prediction. Customer churn prediction is big business: it minimises customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms and other verticals. The prediction process is heavily data driven and often utilises advanced machine learning techniques; here we will take a look at the type of customer data typically used, do some preliminary analysis of the data, and train churn prediction models, all with PySpark and its machine learning framework.

Here is the story of this use case. A marketing agency has many customers that use its service to produce ads for their client customers, and they have noticed quite a bit of churn among those clients. They basically assign account managers at random right now, but they want you to create a machine learning model that will help predict which customers will churn, so that they can assign the customers most at risk of churning to an account manager. Luckily they have some historical data, so can you help them out? Do not worry, I'll show you how. We need to create a classification algorithm that classifies whether or not a customer churned; the company can then test it against incoming data for future customers to predict which customers will churn and assign them an account manager.

Let's import the libraries first; we are going to use logistic regression to solve this. The data is saved as customer_churn.csv, so we use the spark.read method to read the historical data, then look at its schema to understand exactly what we are dealing with; to see the schema of any data frame we use the printSchema method. We have the name, age, total purchase, account manager, years, number of sites, onboard date, location, company and the churn column. Looking at the data, we have records for 900 customers; I used the count method to get exactly the number of rows. Now let's load up the test data as well and look at its schema: the test data is in the same format as the training data.

Next we would import the VectorAssembler library, but since we already imported it earlier with "from pyspark.ml.feature import VectorAssembler", I am not going to import it again. First we must transform our data using the vector assembler to get to a single column where each row of the data frame contains a feature vector; this is a requirement for the regression API in MLlib. As you can see, I am using the age, the total purchase, the account manager, the years and the number of sites, and again this choice depends on the person creating the model: if the model does not give the output we require, we can change the input columns. I create output_data, a data frame that contains the input data transformed using these input columns, or features, into a single column named features. Looking at the schema of this new output data, all the columns are the same except that at the end we have an additional features column, and it is a vector; this is what will help us predict customer churn. To see what we are dealing with, let's look at the first element: at the end we have the features column, a dense vector containing the five values from the columns, the age of 42, the total purchase, 0.0 for the account manager, 10.2 years and the number of sites, which is 8. Now we create our final data: we take this output data from the vector assembler and select only the features and the churn columns, so the final data has just those two columns. Finally we split our data into training and testing data using the randomSplit method, dividing it in the ratio of 70 to 30.
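A sketch of the feature-assembly and split steps for the churn data, with the file name customer_churn.csv and column names (Age, Total_Purchase, Account_Manager, Years, Num_Sites, Churn) assumed from the narration:

```python
from pyspark.ml.feature import VectorAssembler

data = spark.read.csv("customer_churn.csv", header=True, inferSchema=True)
data.printSchema()
print(data.count())                       # 900 customers in the video

assembler = VectorAssembler(
    inputCols=["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites"],
    outputCol="features")
output_data = assembler.transform(data)   # adds a single feature-vector column

final_data = output_data.select("features", "Churn")
train, test = final_data.randomSplit([0.7, 0.3])   # 70/30 split
```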
Now we create our logistic regression model, using the churn column as the label, and train it on the training data we just produced from our final data. Looking at the summary of the model we just created, we can see statistics for churn and the prediction: the mean, standard deviation, minimum and maximum values. Now that we have a model, let's use an evaluator on the raw prediction output. For that we first import the BinaryClassificationEvaluator, then create a predictions data frame in which we evaluate the test data with the model. Looking at the output, on the left we have the features, then the churn, then the raw prediction according to our model, then the probability and finally the prediction. The evaluator takes the raw prediction column and the label column, churn, and tells us how accurate our model is: about 77 percent. Earlier I loaded the test data; as you can see, it is the new_customers.csv file. So, to recap: we took the original data, split it in a 70 to 30 ratio, created a model and trained it on the training data, and tested it on the testing data. Now we use the incoming new data, the new customers, to see how our model does. Again we use the vector assembler, and I create a results data frame by transforming this new test data with the logistic regression model; the new test data also contains a features column, because we ran the vector assembler on it just before. The raw results are quite haphazard to read, so instead I select just the company and the prediction: the prediction for Cannon-Benson is churn, for Barron-Robertson it is churn, and likewise for Sexton-Golden and Parks-Robinson. So our model was about 77 percent accurate; we can play with the features column to see whether the model produces a more accurate output, and if we are satisfied with 77 percent that is fine, but we can always tune it according to our preferences.
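And a sketch of training, evaluating and then scoring the new customers, matching the steps above; the evaluator's default metric is the area under the ROC curve, which is the roughly 0.77 figure quoted in the video, and the new_customers.csv file name and the Company column are assumptions from the transcript.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

lr = LogisticRegression(labelCol="Churn")         # featuresCol defaults to "features"
model = lr.fit(train)
model.summary.predictions.describe().show()       # churn vs prediction statistics

# evaluate on the held-out 30 percent
predictions = model.evaluate(test).predictions
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                          labelCol="Churn")
print("AUC:", evaluator.evaluate(predictions))    # roughly 0.77 in the video

# score the brand-new customers with the same assembler and model
new_customers = spark.read.csv("new_customers.csv", header=True, inferSchema=True)
new_test_data = assembler.transform(new_customers)
results = model.transform(new_test_data)
results.select("Company", "prediction").show()
```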
[Music] In a world where data is being generated at an alarming rate, the correct analysis of the data at the correct time can be very useful, and one of the most amazing frameworks for handling big data in real time and performing analysis is Apache Spark. If we talk about the programming languages being used nowadays for different purposes, Python would top the chart, as it is used almost everywhere. Talking about the features of Apache Spark, the most important is speed: it is almost 100 times faster than traditional data processing tools and frameworks. It has powerful caching: a simple programming layer provides powerful caching and disk persistence capabilities. Coming to deployment, Apache Spark can be deployed through Mesos, through Hadoop via YARN, or via Spark's own cluster manager. The most important features that help Spark achieve its fast speed in real-time computation with low latency are the use of in-memory computation, the lazy evaluation of transformations, the directed acyclic graph (DAG) and much more. Spark is also polyglot, which means it can be programmed in various languages like Python, Scala, Java and R, and that is one of the reasons Apache Spark has taken over machine learning and exploratory analysis.

Now let's look at the various companies that use Apache Spark: Yahoo, Alibaba, Nokia, Netflix, NASA, Databricks (which offers the official enterprise distribution of Apache Spark), TripAdvisor and eBay, so you can see Spark is used a lot in the industry. Let's look at the various industry use cases, starting with healthcare. As healthcare providers look for novel ways to enhance the quality of healthcare, Apache Spark is slowly becoming the heartbeat of many healthcare applications. Many healthcare providers use Apache Spark to analyse patient records along with past clinical data to identify which patients are likely to face health issues after being discharged from the clinic; this helps hospitals prevent re-admittance, as they can deploy home healthcare services to the identified patients, saving costs for both the hospitals and the patients. Apache Spark is also used in genome sequencing to reduce the time needed to process genome data: it used to take several weeks to organise all the chemical compounds with genes, but with Apache Spark on Hadoop it takes just a few hours.

Coming to finance, banks are using Apache Spark to access and analyse social media profiles, call recordings, complaint logs, emails and forum discussions to gain insights that help them make the right business decisions for credit risk assessment, targeted advertising and customer segmentation. One financial institution with retail banking and brokerage operations used Apache Spark to reduce its customer churn by 25 percent: the institution had divided its platforms between retail banking, trading and investment, but the bank wanted a 360-degree view of the customer, regardless of whether it is a company or an individual, so to get that consolidated view it uses Apache Spark as the unifying layer, which helps the bank automate analytics with machine learning by accessing the data from each repository.

Talking about media, Apache Spark is used in the gaming industry to identify patterns from real-time in-game events and respond to them, harvesting lucrative business opportunities like targeted advertising, automatic adjustment of gaming levels based on complexity, player retention and many more. Conviva, a company averaging about 4 million videos per month, uses Apache Spark to reduce customer churn by optimising video streams and managing live video traffic, thus maintaining a consistently smooth, high-quality viewing experience. You all might have heard about Netflix: Netflix uses Apache Spark for real-time stream processing to provide online recommendations to its customers. Streaming devices at Netflix send events that capture all member activity and play a vital role in personalisation; it processes 450 billion events per day, which flow to the server-side applications and are then directed to Apache Kafka.

Coming to the retail and e-commerce industry, one of the largest e-commerce platforms, Alibaba, runs some of the largest Apache Spark jobs in the world in order to analyse hundreds of petabytes of data on its platform; some of the Spark jobs that perform feature extraction on image data ran for several weeks.
Millions of merchants and users interact with the Alibaba e-commerce platform; each of these interactions is represented as a complicated, large graph, and Apache Spark is used for fast processing of sophisticated machine learning on this data. eBay also uses Apache Spark to provide targeted offers, enhance the customer experience and optimise overall performance. Apache Spark is leveraged at eBay through Hadoop YARN: YARN manages all the cluster resources to run the generic tasks, and eBay's Spark users leverage Hadoop clusters in the range of 2,000 nodes, 20,000 cores and 100 TB of RAM through YARN. Finally, coming to the travel industry, TripAdvisor, a leading travel website that helps users plan a perfect trip, is using Apache Spark to speed up its personalised customer recommendations: TripAdvisor provides advice to millions of travellers by comparing hundreds of websites to find the best hotel prices for its customers, and the time taken to read and process the hotel reviews into a readable format is reduced with the help of Apache Spark. And we all know Uber: every day this multinational online taxi dispatch company gathers terabytes of event data from its mobile users; by using Kafka, Spark Streaming and HDFS to build a continuous ETL pipeline, Uber can convert unstructured event data into structured data as it is collected and then use it for further, more complex analysis.

As I was saying earlier about Spark being polyglot, programming in Spark can be done in various languages like Scala, Python and R, so you may ask which one you should choose to begin with. Spark was developed in Scala, the default language in which it was written, and Scala is very similar to Java, but the recent emergence of data analytics and machine learning has made it difficult for Scala to keep up, so Spark came up with a Python API. Let's look at the reasons to go with Python. First, it is easy to learn for programmers because of its syntax and standard libraries; moreover, it is a dynamically typed language, which means RDDs can hold objects of multiple types (we will discuss RDDs later in this video). It is portable and can be used with various operating systems, such as Windows, Solaris, Linux, the PlayStation and macOS. And lastly, Scala does not have sufficient data science tools and libraries like Python for machine learning and natural language processing; Spark MLlib, the machine learning library, has fewer ML algorithms, but they are ideal for big data processing. In summary, we can say Scala lacks good visualisation and local data transformation tools, along with the heavily used machine learning libraries.

Now, edureka provides a detailed and comprehensive training on Apache Spark in Python, the PySpark developer certification training. This course is designed to provide the knowledge and skills to become a successful Spark developer using Python: you will get in-depth knowledge of concepts such as the Hadoop distributed file system, the Hadoop cluster, Flume, Sqoop and Apache Kafka, and you will learn about the APIs and libraries Spark offers, such as Spark Streaming, MLlib and Spark SQL. This PySpark developer course is an integral part of a big data developer's career path; it is designed to provide the knowledge and skills to become a successful Hadoop and Spark developer and to help you clear the CCA175 Spark and Hadoop developer examination.
The course has 12 modules in total, plus one bonus module, and it focuses on Cloudera's Hadoop and Spark developer certification training. Module 1 is the introduction to big data, Hadoop and Spark: you will understand big data, the limitations of existing solutions to the big data problem and how Hadoop solves it, the Hadoop ecosystem, Hadoop architecture, HDFS, rack awareness and replication; you will learn about the Hadoop cluster architecture and the important configuration files in a Hadoop cluster, and you will also get an introduction to Spark, why it is used, and the difference between batch processing and real-time processing. Module 2 is an introduction to Python for Apache Spark: at the end of this module you will be able to define Python, understand operands and expressions, write your first Python program, understand command-line parameters and flow control, take input from the user and perform operations on it, and you will also learn about numbers, strings, tuples, lists, dictionaries and sets. Module 3 covers functions, object-oriented programming, modules, errors and exceptions in Python: you will learn how to create generic Python scripts, how to address errors and exceptions in code, and how to extract and filter content using regex. Module 4 is a deep dive into the Apache Spark framework: you will understand Apache Spark in depth, learn about the various Spark components, create and run various Spark applications, and at the end learn how to perform data ingestion using Sqoop. Module 5 is playing with Spark RDDs: you will learn about Spark RDDs, the resilient distributed datasets, and the RDD-related manipulations for implementing business logic, such as the transformations, actions and functions performed on RDDs. Module 6 covers DataFrames and Spark SQL: you will learn about Spark SQL, which is used to process structured data with SQL queries, about the DataFrames and Datasets in Spark SQL along with the different kinds of SQL operations performed on DataFrames, and about Spark and Hive integration. Module 7 is machine learning using Spark MLlib: you will learn why machine learning is needed, the different machine learning techniques and algorithms, and their implementation using Spark MLlib. Module 8 is a deeper dive into Spark MLlib: you will implement various algorithms supported by the machine learning library, such as linear regression, decision trees, random forests and many more. Module 9 is understanding Apache Kafka and Apache Flume: you will understand Kafka and the Kafka architecture, go through the details of a Kafka cluster, learn how to configure different types of Kafka clusters, and see how messages are produced and consumed using the Kafka APIs in Java; you will also get an introduction to Apache Flume, its basic architecture and how it integrates with Apache Kafka for event processing. Module 10 is Apache Spark Streaming: in this module you will work on Apache Spark Streaming, which is used to build scalable, fault-tolerant streaming applications.
You will learn about DStreams and the various transformations performed on the streaming data, and get to know the commonly used streaming operators, such as the sliding window operator and the stateful operators. Module 11 is Apache Spark Streaming data sources: you will learn about the different streaming data sources, such as Kafka and Flume, and by the end of the module you will be able to create a Spark Streaming application. Module 12 is the in-class project, which brings together everything we have learned so far, that is Hadoop, Spark, Kafka, Flume and much more, and as a bonus there is another module on GraphX, in which you will learn the key concepts of Spark GraphX programming and operations, along with the different graph algorithms and their implementations.

Now that we have seen the training structure offered by edureka, let's understand what exactly PySpark is. Apache Spark is an open-source cluster computing framework for real-time processing developed by the Apache Software Foundation, and PySpark is nothing but the Python API for Apache Spark. Let's look at the various components of the Spark ecosystem. The core engine of the entire Spark framework provides utilities and architecture for the other components. Spark Streaming enables analytical and interactive applications for live streaming data. MLlib, the machine learning library of Spark, is built on top of Spark to support the various machine learning algorithms. GraphX is the graph computation engine, which combines data-parallel and graph-parallel concepts. SparkR is the package for the R language that enables R users to leverage Spark's power from the R shell. And finally we have PySpark, the API developed to support Python as a programming language for Spark.

The PySpark shell links the Python API to the Spark core and initialises the SparkContext. The SparkContext is the heart of any Spark application: it sets up internal services and establishes a connection to a Spark execution environment, and the SparkContext object in the driver program coordinates all the distributed processes and allows resource allocation. The cluster manager provides executors, which are JVM processes with logic; the SparkContext object sends the application to the executors and then executes the tasks in each executor. When you have installed Spark on your system, just by typing pyspark you can enter the Spark shell, and it looks something like this; just make sure that all the daemons of Spark are running in the background.
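Inside the pyspark shell the spark and sc objects are created for you; in a standalone script you build them yourself. A minimal sketch follows, in which the application name and the local master are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my_first_app")       # placeholder application name
         .master("local[*]")            # all local cores; point at a cluster manager in production
         .getOrCreate())

sc = spark.sparkContext                 # the SparkContext that coordinates the executors
print(sc.version, sc.master)

spark.stop()
```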
[Music] Now let's move on to the Spark interview questions and answers, starting with: what is Apache Spark? Apache Spark is an open-source cluster computing framework for real-time processing. There are three main keywords there: it is an open-source project, it is used for cluster computing, and it supports in-memory computing along with near-real-time processing. There are lots of projects that support cluster computing; Spark differentiates itself by doing the computing in memory. It has a very active community, and among the Hadoop-ecosystem technologies Apache Spark is a very active Apache project, with multiple releases last year. Basically, it is a framework that supports in-memory computing and cluster computing.

You may face this question: how is Spark different from MapReduce, or how do you compare Spark with MapReduce? MapReduce is the processing methodology within the Hadoop ecosystem, where HDFS, the Hadoop Distributed File System, handles storage and MapReduce supports distributed computing. The comparison helps us understand the technology better, even though they are two different methodologies. Spark is very simple to program, whereas MapReduce has no abstraction, in the sense that we have to provide all the implementations ourselves. Spark has an interactive mode to work with; MapReduce has no interactive mode, although components like Apache Pig and Hive facilitate interactive computing or interactive programming on top of it. Spark supports real-time stream processing; to be precise, within Spark the stream processing is called near-real-time processing, because nothing in the world is truly real-time, and it does the processing in micro-batches (I will cover this in detail when we move on to the streaming concept). MapReduce, on the other hand, does batch processing on historical data. With stream processing we get the data as it arrives, process it, and either store the result or publish it. Latency-wise, MapReduce has very high latency because it has to read the data from the hard disk, while Spark has very low latency because it can reuse data already cached in memory. There is a small catch here: the first time the data gets loaded, Spark also has to read it from the hard disk, the same as MapReduce; once it has been read it stays in memory, so Spark is good whenever we do iterative computing, that is, processing the same data again and again, especially in machine learning and deep learning where we use iterative computing. There Spark performs much better, and you will see performance improvements up to 100 times faster than MapReduce. But if it is one-time, fire-and-forget processing, Spark may give you roughly the same latency as MapReduce, perhaps with some improvement because of its building block, the RDD. That is the key comparison factor between Spark and MapReduce.

Now let's get on to the key features of Spark. We discussed speed and performance: it uses in-memory computing, so it is much better when we do iterative computing. It is polyglot, in the sense that programming with Spark can be done in any of these languages: Python, Java, R or Scala. For input it accepts many data formats, such as JSON and Parquet. And the key selling point of Spark is its lazy evaluation: Spark computes the DAG, the directed acyclic graph, of the steps needed to achieve the final result. We give it all the steps as well as the final result we want, it calculates the optimal set of steps that actually need to be executed, and only those steps are executed. Basically it is lazy execution: only when a result actually needs to be produced will that specific result be processed.
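Lazy evaluation is easy to see in a couple of lines; assuming a SparkContext named sc, nothing below runs until the action at the end is called:

```python
rdd = sc.parallelize(range(1, 1_000_001))

# transformations only record steps in the DAG; no job runs yet
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# the action triggers execution of exactly the steps needed for this result
print(evens.count())

# cache() keeps the data in memory, which is what makes repeated
# (iterative) passes over the same data so much faster
evens.cache()
print(evens.take(5))
```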
Spark also supports real-time computing, through the component called Spark Streaming, and it gels with the Hadoop ecosystem very well: it can run on top of Hadoop and leverage HDFS to do the processing, so the Hadoop cluster's containers can be used for distributed computing, the resource manager can be used to manage and share resources, and it can leverage data locality, doing the processing near to where the data is located within HDFS. It also has a fleet of machine learning algorithms already implemented, right from clustering to classification to regression, achieved using MLlib within Spark, and there is a component called GraphX that lets us solve problems using graph theory. These are the things we can consider the key features of Spark.

When you discuss the installation of Spark you may come across this question: what is YARN, and do you need to install Spark on all nodes of a YARN cluster? YARN, Yet Another Resource Negotiator, is the resource manager within the Hadoop ecosystem: it provides the resource management platform across all the clusters, and Spark provides the data processing. So wherever the resources are being used, Spark runs in that location to do the data processing, and yes, we do need Spark installed on all the nodes where the Spark cluster is located, basically because we need those libraries; additionally, we need to increase the RAM capacity on the worker machines, since Spark consumes a huge amount of memory to do its processing. It will not work the MapReduce way; internally it generates the DAG and does the processing on top of YARN. At a high level, YARN is like a resource manager, or an operating system, for distributed computing: it coordinates all the resource management across the fleet of servers, and on top of it we can have multiple components, with Spark in particular helping us achieve in-memory computing. So YARN is nothing but a resource manager to manage the resources across the cluster, Spark runs on top of it, yes, Spark must be installed on all the nodes where the YARN cluster is used, and additionally the memory must be increased on all the worker nodes.

The next question goes like this: what file systems does Spark support? When we work on an individual system we have the operating system's own file system, but in a distributed cluster, or a distributed architecture, we need a file system where data can be stored in a distributed mechanism. Hadoop comes with the file system called HDFS, the Hadoop Distributed File System, where data gets distributed across multiple systems and is coordinated by two different types of components, the NameNode and the DataNodes. Spark can use HDFS directly, so you can have any files in HDFS and start using them within the Spark ecosystem, which gives the added advantage of data locality: when it does the distributed processing, the processing is done locally on the particular machine where the data is located. To start with, in standalone mode, you can use the local file system as well; this is used especially when we are doing development or a proof of concept. Amazon's cloud provides another file system called S3, the Simple Storage Service, a block storage service that can also be leveraged within Spark for storage, and a lot of other file systems are supported as well; for example there are file systems like Alluxio that provide in-memory storage, so we can leverage that particular file system too.
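In practice the file system is chosen simply by the URI scheme in the path; all the hosts, buckets and paths below are placeholders, and the S3 example additionally assumes the hadoop-aws connector and credentials are configured:

```python
# local file system (handy for development and POCs)
local_df = spark.read.csv("file:///tmp/sample.csv", header=True, inferSchema=True)

# HDFS: namenode host, port and path are placeholders
hdfs_df = spark.read.csv("hdfs://namenode:8020/data/sample.csv",
                         header=True, inferSchema=True)

# Amazon S3 through the s3a connector (bucket name is a placeholder)
s3_df = spark.read.csv("s3a://my-bucket/data/sample.csv",
                       header=True, inferSchema=True)
```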
So we have seen the features and functionality available within Spark; now let's look at the limitations of using it, because every component that comes with huge power and advantages has its own limitations as well. The next question: illustrate some limitations of using Spark. It utilizes more storage space for its installation, although in the big data world that is not a serious constraint, since storage is relatively cheap. A developer needs to be careful while running applications on Spark, because it uses in-memory computing; it manages memory well, but if you load a huge amount of data in the distributed environment and then perform a join, data gets transferred over the network, and the network is a genuinely costly resource. So the design should aim to minimize data transfer over the network, and by all possible means we should facilitate distributing the data over multiple machines: the more we distribute, the more parallelism we can achieve. Cost efficiency is another consideration: if you compare the cost of a given workload, say processing 1 GB of data with five iterative passes, in-memory computing is always costlier because memory is more expensive than disk storage, and that can become a bottleneck. We cannot grow the memory of a single machine beyond a certain limit, so we have to grow horizontally, and once the data is held in memory across a cluster, the network transfer concerns come back into the picture; we have to strike the right balance that delivers the in-memory computing we actually need. Spark also consumes a large amount of memory compared to Hadoop, and it outperforms Hadoop mainly when the workload is iterative: both Spark and the alternatives have to read the data the first time from the hard disk or another data source, and Spark's performance really shines when it processes data already available in its cache. The DAG gives some advantage during processing, but it is the in-memory computing that provides most of the leverage.

The next question lists some use cases where Spark outperforms Hadoop in processing. The first is real-time processing: Hadoop cannot handle real-time processing, but Spark can.
In the Lambda architecture, which most big data projects follow, you have three layers: a batch layer, a speed layer and a serving layer. In the speed layer, whatever data comes in has to be processed, stored and handled immediately, and for that type of real-time processing Spark is the best fit. The Hadoop ecosystem does have other components for real-time processing, such as Storm, but when you want to combine machine learning with Spark Streaming in the same computation, Spark is much better; so in a Lambda architecture Spark gels with the speed layer and the serving layer far better and gives better performance. For batch processing, especially machine learning workloads, we leverage iterative computing, and Spark can perform up to 100 times faster than Hadoop: the more iterative the processing, the more data is read from memory, and the bigger the gain over Hadoop MapReduce. Again, remember that if you process the data only once, read it, process it and deliver the result, Spark may not be the best fit; that can be done with MapReduce itself.

There is also a component called Akka, a messaging and coordination system. Spark uses Akka internally for scheduling: for the tasks the master assigns to the workers and for the master's follow-up on those tasks, basically asynchronous coordination. As developers we do not have to do any Akka programming ourselves; it is used internally by Spark for scheduling and coordination between master and workers.

Within Spark we have a few major components, so the next question: name the components of the Spark ecosystem. Spark comes with a core engine that carries the core functionality Spark needs, and RDDs are the building blocks of that Spark Core engine; the basics of file interaction and file system coordination are handled by Spark Core. On top of the core engine sit a number of other offerings for machine learning, graph computing and streaming, and the most used of them are Spark SQL, Spark Streaming, MLlib, GraphX and SparkR. At a high level: Spark SQL is designed for processing structured data, so we can write SQL queries against it; it gives us an interface to interact with structured data, and the language is almost identical to standard SQL, I would say 99 percent the same, with most commonly used SQL functionality implemented in Spark SQL. Spark Streaming is the offering that supports stream processing. MLlib is the offering for machine learning.
It has a list of machine learning algorithms already implemented, which we can leverage and use directly. GraphX is the graph processing offering within Spark; it lets us do graph computing against our data, things like PageRank calculation, how many connected entities there are, how many triangles, all of which give meaning to that data. SparkR is the component that lets us use the R language within the Spark environment; R is a statistical programming language, and through SparkR we can run that statistical computing inside Spark. Beyond these there are other components, such as the approximate query engine BlinkDB, that are still in a beta stage, but the ones above are the most heavily used components within Spark.

Next question: how can Spark be used alongside Hadoop? Even though Spark often performs much better, it is not a replacement for Hadoop; it coexists with Hadoop, and using the two together gets us the best result. Spark can do the in-memory computing and handle the speed layer, while Hadoop brings the resource manager, which we can use to run Spark; and for processing that does not need in-memory computing, for example one-time, process-and-store work, we can use MapReduce, where the computing cost is much lower than with Spark. So we can combine them and strike the right balance between batch processing and stream processing when we have Spark along with Hadoop.

Now let's take some more detailed questions on Spark Core. As mentioned earlier, the core building block of Spark Core is the RDD, the resilient distributed dataset. It is not a physical entity but a logical one; you will not see the RDD materialize until you take some action. The RDD definitions are used to build the DAG, and they are optimized into a plan for how the dataset is transformed from one structure to another; only when you take an action against an RDD does the resultant data come into existence, and that result can then be stored in any file system, HDFS, S3 or anything else. RDDs can exist in partitioned form, meaning they are distributed across multiple systems, and they are fault tolerant: if any partition of an RDD is lost, only that specific partition needs to be regenerated, which is a huge advantage. So if someone asks what the big advantages of an RDD are: it is fault tolerant and can regenerate lost partitions, it can exist in a distributed fashion, and it is immutable, meaning once an RDD is created it cannot be changed.

The next question is how we create RDDs in Spark. There are two ways: one is to take any collection available in Scala, or whichever language we are using, and create the RDD through the Spark context with the parallelize function; the other is to create the RDD by loading data from an external source.
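A minimal sketch of both creation paths, reusing `sc` from the first example; the small collection and the HDFS path are invented.

```python
# 1) Parallelize an in-driver collection, asking for 4 partitions explicitly.
ratings = sc.parallelize([("p1", 4), ("p2", 5), ("p1", 3)], numSlices=4)

# 2) Load from an external source (the path is hypothetical).
logs = sc.textFile("hdfs:///logs/app/2023-02-02.log")

print(ratings.getNumPartitions())
```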
If the data lives in a distributed file system like HDFS, the RDD leverages the underlying file system's distribution mechanism, so its partitions are made available across a number of systems. Strictly speaking, HDFS may not even count as an external source, since it is Hadoop's own file system, and when Spark works alongside Hadoop that is usually the file system we use; besides HDFS we can also read from other sources such as Parquet files or S3 and create the RDD from any of them.

The next question: what is executor memory in a Spark application? Every Spark application has a fixed heap size and a fixed number of cores for its Spark executors. An executor is the execution unit available on every worker machine; it is what carries out the tasks on that machine. Irrespective of whether you use the YARN resource manager or something like Mesos, each worker machine runs an executor, and the tasks are handled within it. The memory allocated to an executor is what we define as the heap size, and both the amount of memory and the number of cores each executor may use within the Spark application can be controlled through Spark's configuration.

Next question: define partitions in Apache Spark. Any data, whether small or large, can be divided across multiple systems; the process of dividing the data into multiple pieces and storing them across multiple systems as separate logical units is called partitioning. By default, the conversion of the data into an RDD happens on the system where the partition exists, so the more partitions we have, the more parallelism we get, though at the same time we have to be careful not to trigger a huge amount of network data transfer. Every RDD can be partitioned within Spark, and partitioning is what helps us achieve parallelism; the key to the success of a Spark program is minimizing network traffic, that is, minimizing data transfer between the systems, while doing the parallel processing.

What operations does an RDD support? We can run many operations against an RDD, and they fall into two groups. One is transformations, in which the RDD is transformed from one form to another, say filtering or grouping; reduceByKey and filter are small examples, and the result of a transformation is always another RDD. The other is actions, which we take against an RDD to get a final result: counting how many records there are, or storing the result into HDFS, those are actions, and multiple actions can be taken against an RDD. The data itself only comes into existence when I take some action against the RDD.
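Here is a hedged sketch tying the partition and operation questions together, again reusing `sc`; the numbers and the output path are invented.

```python
data = sc.parallelize(range(100), numSlices=4)     # explicitly ask for 4 partitions
print(data.getNumPartitions())                     # -> 4

evens = data.filter(lambda n: n % 2 == 0)          # transformation: returns another RDD, nothing runs yet
more_parallel = evens.repartition(8)               # redistribute into 8 partitions (this causes a shuffle)

print(more_parallel.count())                       # action: returns a value to the driver
# more_parallel.saveAsTextFile("hdfs:///tmp/evens")  # another action: writes the data out (path hypothetical)
```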
The next question: what do you understand by transformations in Spark? Transformations are essentially functions, mostly higher-order functions like the ones we have in Scala, that are applied against the RDD, that is, against the list of elements the RDD holds, and the result only materializes once we take some action against it. In this particular example I read a file into an RDD called raw_data, then I apply a transformation using map: inside map I pass a function that splits each record on the tab character, so that split is applied to every record of raw_data, and the resulting movies_data is again another RDD. Of course this is a lazy operation: movies_data only comes into existence when I take some action against it, like count, print or store; only those actions generate the data.

Next question: define the functions of Spark Core. Spark Core takes care of memory management and fault tolerance of RDDs, it schedules and distributes tasks and manages the jobs running within the cluster, and it handles storing data to and reading data from the storage system, the file-system-level operations. Spark Core programming can be done in any of these languages: Java, Scala, Python and R. Core sits at the horizontal level, and on top of it we have a number of other components.

There are different types of RDDs, and one special type is the pair RDD. So, next question: what do you understand by a pair RDD? Its data exists in pairs, as keys and values, which lets me use special transformations on it, such as collecting all the values that correspond to the same key, rather like what happens in the sort-and-shuffle phase of Hadoop: consolidating or grouping all values for a key, or applying a function across all the values of each key, for example getting the sum of the values per key. With a pair RDD we can get that done.

A question from Jason: what are vector RDDs? In machine learning there is a huge amount of processing done with vectors and matrices, lots of vector operations, since we transform data into vector form. A vector, in the usual sense, has a direction and a magnitude, so we can do operations like summing two vectors, or comparing the difference between vectors A and B against the difference between A and C; if A and B are closer than A and C, we can say A and B are more similar in terms of features. A vector RDD is used to represent vectors directly, and it is used extensively while doing machine learning. Thank you, Jason.
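Putting the tab-split example and the pair-RDD aggregation described above into code, here is a hedged sketch reusing `sc`; the file path and the assumption that the third column holds a genre are invented.

```python
raw_data = sc.textFile("hdfs:///data/movies.tsv")
movies_data = raw_data.map(lambda line: line.split("\t"))   # the tab-split transformation (still lazy)

genre_counts = (movies_data
                .map(lambda fields: (fields[2], 1))          # pair RDD of (key, value)
                .reduceByKey(lambda a, b: a + b))            # sum the values for each key

print(genre_counts.take(5))                                  # action: only now does anything execute
```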
There is another question: what is RDD lineage? For any data processing, any transformation we do, Spark maintains something called a lineage: a record of how the data got transformed. The data sits in partitioned form across multiple systems, the transformations go through multiple steps, and in the distributed world it is very common for machines to fail or drop off the network, so the framework has to be in a position to handle that. Spark handles it through RDD lineage: it can regenerate just the lost partition. Assume the data is distributed across five machines out of ten, and one of those five machines is lost; only the partition on that lost machine needs to be regenerated, and Spark knows how to rebuild that data, from which data source it was generated and what its prior step was, because the complete lineage is maintained internally by the Spark framework. That is what we call RDD lineage.

What is the Spark driver? To put it simply, for those from a Hadoop background, you can compare it to the Application Master. Every application has a Spark driver, which holds a Spark context that orchestrates the complete execution of the job: it connects to the Spark master, delivers the RDD graph, that is the lineage, to the master, and coordinates the tasks that get executed in the distributed environment, the parallel processing, the transformations and the actions against the RDDs. It is the single point of contact for that specific application; the driver is short-lived, the Spark context within it coordinates between the master and the running tasks, and the driver can be started on any of the nodes in the Spark cluster.

Name the types of cluster managers in Spark. Whenever you have a group of machines you need a manager for the resources. We have already seen YARN, Yet Another Resource Negotiator, which manages the resources of Hadoop; Spark can run on top of YARN. Sometimes I may want Spark alone in my organization, not alongside Hadoop or any other technology; then I can go with standalone mode, since Spark has a built-in cluster manager, and Spark alone can run across multiple systems. But generally, if we have a cluster we try to leverage various other computing frameworks as well, graph processing with Giraph and so on, and in that case we go with YARN or a generalized resource manager like Mesos. YARN is specific to Hadoop and ships with it; Mesos is a cluster-level resource manager, a separate top-level Apache project, which I can use when I have multiple clusters within the organization.

What do you understand by a worker node? In a cluster, in a distributed environment, we have a number of workers, called worker nodes or slave nodes, which do the actual processing: they get the data, do the work and return the result, while the master node assigns what has to be done by which worker node, generally the node where the data is located. In the big data space, Hadoop especially always tries to achieve data locality, taking resource availability into account as well, in terms of CPU and memory. Assume some data is replicated on three machines and all three are busy, with no CPU or memory free to start another task; Spark will not wait for those machines to finish and free up resources, it will start the processing on some other machine close to the machines holding the data and read the data over the network. Worker nodes, then, are the machines that do the actual work; they report to the master on resource utilization, and the tasks running within the worker machines carry out the processing.
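As a rough illustration of how the cluster manager choice and the executor sizing from the earlier questions show up in code, here is a hedged sketch; the host names, ports and sizes are placeholders, and normally you would pick exactly one master URL, or supply it via spark-submit or spark-defaults.conf instead of hard-coding it.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")                       # single machine, for development
         # .master("spark://master-host:7077")     # Spark's own standalone cluster manager
         # .master("yarn")                         # YARN, Hadoop's resource manager
         # .master("mesos://mesos-master:5050")    # Mesos
         .config("spark.executor.memory", "4g")    # heap size per executor
         .config("spark.executor.cores", "2")      # cores per executor
         .getOrCreate())
```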
What is a sparse vector? A few minutes back I answered a question about vectors: a vector is a way of representing data in multi-dimensional form. Suppose I want to represent a point in space; I need three dimensions, x, y and z, so the vector has three dimensions. If I have to represent a line in space I need two points, the start and the end, so I need a vector with two dimensions, one holding each point; the same way, to represent a plane I need another dimension to hold two lines, and in this fashion I can represent any data in vector form. Now think of the ratings of products across an organization, say Amazon. Amazon has millions of products, and no single user will ever have used and rated them all; over a whole lifetime a user might rate only a few hundred products, a tiny fraction of a percent. If I represent all of a user's ratings as a vector, the first position refers to the product with id 1, the second to the product with id 2, and so on, so the vector has millions of positions but only a hundred or so carry values from one to five; everything else is zero. Sparse means thinly distributed, and rather than storing all those zeros, a sparse vector stores only the non-zero entries as positions and values: which position has which value, with every unlisted position understood to be zero. That way we do not waste space storing the zeros, and that is what the sparse vector is used for.

Let's discuss some questions on Spark Streaming. How is streaming implemented in Spark? Explain with examples. Spark Streaming is used for processing real-time streaming data; to be precise it is micro-batch processing, where data is collected over every small interval, say every half second or every second, and then processed. Each micro batch of data becomes what we call a DStream, and a DStream behaves like an RDD: whatever transformations and actions I can do with an RDD, I can do with a DStream as well. Spark Streaming can read data from Flume, HDFS or other streaming sources, and store the results in a dashboard or any other database, and it provides very high throughput because the processing is spread across a number of systems in a distributed fashion.
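Tying back to the sparse vector question above, here is a small hedged sketch using Spark's ML vector type; the catalogue size, positions and ratings are invented.

```python
from pyspark.ml.linalg import Vectors

# One user's ratings over a catalogue of 1,000,000 products, with only three products rated.
ratings = Vectors.sparse(1_000_000, {0: 4.0, 57: 5.0, 90210: 3.0})

print(ratings.size)      # 1000000 logical positions
print(ratings.indices)   # only the non-zero positions are stored...
print(ratings.values)    # ...together with their values; every other position is implicitly 0
```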
Like RDDs, DStreams are partitioned internally and have built-in fault tolerance: even if some data or some transformed RDD is lost, it can be regenerated from the existing or source data. So the DStream is the building block of Spark Streaming, with the same fault tolerance mechanism we have for RDDs; a DStream is a specialized form of RDD meant specifically for use within Spark Streaming.

Next question: what is the significance of the sliding window operation? That is a very interesting one. In streaming data, the density or the business significance of the data can oscillate a lot. For example, Twitter shows trending hashtags; just because a hashtag is very popular for a short burst, maybe because someone gamed the system and generated a huge number of tweets, it might appear millions of times within a two or three minute window, but that alone should not put it into the trending list for the whole day or month. So what we do is work over a window: we take the current time frame plus T minus 1, T minus 2 and so on, and compute an average or a sum, applying the complete business logic against that window. Any drastic spike or dip in the pattern of the data gets normalized, and that is the biggest significance of the sliding window operation in Spark Streaming. Spark handles the sliding window for us: it keeps the prior data, T minus 1, T minus 2, and how big a window needs to be maintained can all be handled easily within the program at an abstract level.

Next question: what is a DStream? The expansion is discretized stream; it is the abstract, virtual representation of the data for Spark Streaming. Just as RDDs are transformed from one form to another, a series of RDDs taken together is called a DStream, so a DStream is really a representation of a group of RDDs, and I can apply the streaming functions, transformations and actions available in Spark Streaming against it. I define the interval at which data should be collected and processed, the micro batch, which could be every second, every 100 milliseconds or every five seconds; all the data received in that interval is treated as one piece of data, and that is called a DStream.
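Here is a hedged sketch of a sliding-window count over a socket source, using the classic DStream API and reusing `sc`; the host, port, and the 30-second window recomputed every 10 seconds are all assumptions.

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)              # 1-second micro batches
lines = ssc.socketTextStream("localhost", 9999)          # hypothetical text source

hashtags = lines.flatMap(lambda l: l.split()).filter(lambda w: w.startswith("#"))
pairs = hashtags.map(lambda tag: (tag, 1))

# Count each hashtag over the last 30 seconds, recomputed every 10 seconds.
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10)
windowed.pprint()

# ssc.start(); ssc.awaitTermination()   # uncomment to actually run the stream
```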
Next question: explain caching in Spark Streaming. Spark uses in-memory computing, so whatever data a stream generates during computation sits in memory; but as you run more and more processing and other jobs need memory, the least-used RDDs, or the least-used results derived from them, get cleared out of memory. Sometimes I need certain data to stay in memory permanently. A simple example is a dictionary: I want the dictionary words always available in memory because I may run a spell check against tweet comments or feedback comments any number of times. So what I can do is cache or persist that data in memory; then, even when other applications need memory, this specific data will not be removed and stays available for further processing. We can also define whether the cached data should live in memory only or in memory and on disk.

Let's discuss some questions on Spark GraphX. Is there an API for implementing graphs in Spark? In graph theory everything is represented as a graph, which has vertices and edges, and in Spark these are represented using RDDs: GraphX extends the RDD with an EdgeRDD and a VertexRDD, and by creating the edges and vertices I can create a graph. This graph can exist in a distributed environment, so we are in a position to do parallel processing on it as well. GraphX is a way of representing the data as a graph of edges and vertices, and yes, it provides the APIs to create the graph and do processing against it.

What is PageRank in GraphX? Once the graph is created we can calculate the PageRank for a particular node, very much like the PageRank of websites in Google: the higher the PageRank, the more important that node or edge is within that graph. A graph is a connected set of data, all connected through properties, and how important a property is gets captured by a value associated with it. Within PageRank we can calculate a static PageRank, which runs a fixed number of iterations, or a dynamic PageRank, which keeps executing until it reaches a saturation level defined by whatever criteria we set; these are available directly as graph operators, as APIs within GraphX.

What is a lineage graph? That is similar in spirit to the graph representation: every RDD internally holds the relation describing how it was created and how it was transformed, step by step, so the complete lineage, the complete history or path, is recorded. That is used if any partition of an RDD is lost, so it can be regenerated; even if the complete RDD is lost we can regenerate it, because the lineage holds the full information about which partitions existed where, what transformations they underwent and what the resultant values were. If anything is lost in the middle, Spark knows where to recalculate from and what actually needs to be recalculated, which saves a lot of time, and if an RDD is never used it will never be recalculated; recalculation is triggered only on demand, by an action, which is how Spark uses memory optimally.
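Tying back to the caching question above, a minimal hedged sketch of pinning a lookup dataset in memory, reusing `sc`; the word-list path is hypothetical.

```python
from pyspark import StorageLevel

dictionary = sc.textFile("hdfs:///reference/english-words.txt").map(lambda w: w.strip())

# Keep it in memory, spilling to disk if memory runs short; dictionary.cache()
# would give the default MEMORY_ONLY behaviour instead.
dictionary.persist(StorageLevel.MEMORY_AND_DISK)

print(dictionary.count())   # the first action materializes and caches the data
```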
Does Apache Spark provide checkpointing? Take the streaming example again: if any data is lost within a sliding window, we cannot get it back, it is simply gone. Say I am computing an average over a sliding window of 24 hours; every 24 hours the window keeps sliding along, and if I lose a machine, or assume there is a complete failure of the cluster, I may lose the data because it all lives in memory. How do we recover if the system is lost? Spark follows something called checkpointing: we can checkpoint the data, and it is provided directly by the Spark API, we just have to give the location where it should be checkpointed, and when the system starts again we can read that data back and regenerate whatever state it was in. So, to answer the question directly: yes, Apache Spark provides checkpointing, and it helps us regenerate the state we were in earlier.

Let's move on to Spark MLlib. How is machine learning implemented in Spark? Machine learning is a huge ocean by itself, and it is not a technology specific to Spark; it is a subset of the data science world, with different categories of algorithms, clustering, regression, dimensionality reduction and so on. Most of these algorithms have been implemented in Spark, and Spark is nowadays the preferred framework for machine learning processing, because most machine learning algorithms need to be executed iteratively, again and again, until they reach the optimal result, maybe 25 or 50 iterations, or until a target accuracy is reached. Spark is a very good fit whenever you process the same data repeatedly, since the data stays in memory: I can read it fast, store the result back into memory, and read it fast again. All of these machine learning algorithms are provided within a separate component called MLlib, and within MLlib there are further pieces, such as featurization for extracting features; you may wonder how we can process images, and the core of processing an image, audio or video is extracting the features and comparing how closely they are related, which is where vectors and matrices come into the picture. We can also build pipelines of processing, do step one, take the result and feed it into step two, and MLlib supports persistence as well: the generated, processed result can be persisted and reloaded back into the system to continue from that point onwards.

Next question: what are the categories of machine learning? Machine learning as such has different categories: supervised, unsupervised and reinforcement learning. Supervised and unsupervised are the most popular. I will give an example: suppose I want to do character recognition; while training, I can tell the system that this particular image belongs to this particular character or number, and train it that way, which is supervised. Sometimes I will not know the labels in advance: assume I have a mix of images, cars, bikes, cats, dogs, and I do not know how many categories there are; I want the system to group them, and then I look at each group, identify the pattern within it and give the category a name, say, all these images look like a boat. Whether we provide those labels or leave it to the system is what distinguishes the different types of machine learning, and again, machine learning itself is not specific to Spark; Spark simply helps us run these machine learning algorithms.
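Returning to the checkpointing question above, here is a hedged sketch of how a streaming job can be made recoverable; the checkpoint directory is hypothetical, and `sc` is reused from the earlier sketches.

```python
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///checkpoints/tweet-pipeline"   # hypothetical location

def create_context():
    # Factory that builds the StreamingContext and its DStream graph from scratch.
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint(CHECKPOINT_DIR)      # tell Spark where to store the checkpointed state
    # ... define the input DStreams and transformations here ...
    return ssc

# Recover the previous state from the checkpoint if one exists, otherwise build a fresh context.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
```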
What are the Spark MLlib tools? MLlib is the machine learning library, the machine learning offering within Spark. It has a number of algorithms implemented, and it provides a good facility to persist results: in machine learning we generally generate a model, the learned pattern of the data, and that model can be persisted in different forms, Parquet among them, and reloaded later. It has methods to extract features from a set of data, say the common features across a million images, plus other utilities, such as controlling the random seed, and pipelines, which are quite specific to Spark: with a pipeline I can arrange the sequence of steps the data should go through, run algorithm one first and feed its result into algorithm two, defining that whole sequence as a pipeline. These are the enabling features of Spark MLlib.

What are some popular algorithms and utilities in Spark MLlib? Popular ones include regression, classification, basic statistics and recommendation systems; a recommendation system is essentially implemented for you, and all we have to do is supply the data. Given a full dump of the ratings and products across an organization, we can build a recommendation system in no time which, for any user, suggests the products that user may like, to be displayed in the search results, working on the basis of the feedback provided for products bought earlier. There is clustering, and dimensionality reduction: training on huge amounts of data is very compute intensive, and we may have to reduce the dimensions, especially the matrix dimensions, without losing the features that matter, and there are algorithms for exactly that. There is feature extraction: finding the features within, say, an image and comparing the common features across images is how we group them or answer a query like whether a person looking like a given image is present in the database. For example, a police or crime department maintains a list of people who have committed crimes; when a new photo comes in and they search, they will not have that exact photo bit for bit, because it was taken with a different background, lighting, location and time, so the bits and bytes will differ, but visually the person is the same. To search for photos that look similar to the input photograph, we extract the features of each photo and match the features rather than the bits and bytes. And there are optimization utilities as well, a number of algorithms for optimizing the processing and the pipelining.
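As a small hedged sketch of the pipeline idea, chaining feature extraction with a model and persisting the result; the tiny training set, column names and output path are all invented, and `spark` is reused from the first sketch.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

train = spark.createDataFrame(
    [("great product, works well", 1.0), ("broke after one day", 0.0)],
    ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")         # step 1: split text into words
tf = HashingTF(inputCol="words", outputCol="features")            # step 2: turn words into a feature vector
lr = LogisticRegression(maxIter=10)                               # step 3: fit a classifier

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
model.write().overwrite().save("hdfs:///models/sentiment-demo")   # persist the fitted pipeline (path hypothetical)
```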
Let's move on to Spark SQL. Is there a module to implement SQL in Spark, and how does it work? It is not plain SQL; it is more similar to Hive: for whatever structured data we have, we can read it and extract meaning from it using SQL, and Spark SQL exposes APIs we can use to read the data and create DataFrames. Spark SQL has four major parts: the data source API; the DataFrame, an abstract representation of tabular, multi-dimensional structured data, rather like a spreadsheet; and on top of the DataFrame an interpreter and optimizer and the SQL service, so any query I write gets interpreted, optimized and executed, fetching the data from the DataFrame or reading it from the data source and doing the processing.

What is a Parquet file? It is a file format in which structured data, especially the results of Spark SQL, can be stored or persisted. Parquet is again open source from Apache, a data serialization technique, and to be precise it is a columnar storage format: it consumes less space, stores the data with its keys and values, and also lets you access specific data out of that Parquet form with a query. So Parquet is an open source, columnar data serialization format for storing, persisting and retrieving data.

List the functions of Spark SQL. It can be used to load varieties of structured data, and yes, Spark SQL works only with structured data; you can use SQL-like statements to query it from your program, and it can be used with external tools that connect to Spark. It integrates very well with SQL, and using Python, Java or Scala code we can create an RDD from structured data directly through Spark SQL, which makes it easier for people from a database background to write programs faster.

The next question: what do you understand by lazy evaluation? Whenever you give Spark an operation it does not do the processing immediately; it looks at the final result you are asking for, and until you ask for that final result, that is, until you trigger an action, no actual processing happens. Spark just records the transformations it has to do, and when you finally request the action it completes the data processing in an optimized way and gives you the final result. To answer it straight: lazy evaluation means doing the processing only when the resultant data is needed; if the data is not requested, no processing is done.

Can you use Spark to access and analyze data stored in a Cassandra database? Yes, it is possible, and not only Cassandra, any of the NoSQL databases can very well be processed. Cassandra also works in a distributed architecture, it is a NoSQL database, so Spark can leverage data locality: the queries can be executed locally where the Cassandra nodes are available, which makes query execution faster and reduces the network load, and the Spark executors will try to start on the machines where the Cassandra nodes, and therefore the data, are located, so the processing is done locally.
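To make the Spark SQL and Parquet questions above concrete, here is a hedged sketch reusing `spark`; the tiny table and the warehouse path are invented.

```python
people = spark.createDataFrame([("alice", 34), ("bob", 41)], ["name", "age"])

people.createOrReplaceTempView("people")                 # expose the DataFrame to SQL
adults = spark.sql("SELECT name FROM people WHERE age > 35")
adults.show()

# Columnar persistence: write and read back as Parquet (path hypothetical).
people.write.mode("overwrite").parquet("hdfs:///warehouse/people")
reloaded = spark.read.parquet("hdfs:///warehouse/people")
```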
Next question: how can you minimize data transfers when working with Spark? From a code design point of view, the success of a Spark program depends on how much you reduce network transfer, because network transfer is a very costly operation and you cannot parallelize it away. Spark gives us two particular mechanisms for this: broadcast variables and accumulators. A broadcast variable helps us ship static data or reference information to multiple systems, so if some data needs to be available to multiple executors in common, we broadcast it; and when I want to consolidate values produced on multiple workers into a single centralized location, I use an accumulator. These give us data distribution and data consolidation in the distributed world at the API level, at an abstract level, without us doing the heavy lifting; Spark takes care of that for us.

What are broadcast variables? As we just discussed, I may want a common value to be available on multiple executors, on multiple workers. A simple example: you want to spell check tweet comments, and the dictionary holds the full list of correct words; I want that dictionary available in each executor, so that the tasks running locally in those executors can refer to it and get the processing done while avoiding repeated network data transfer. The process of distributing that data from the Spark context to the executors where the tasks run is achieved using broadcast variables; it is built into the Spark API, we create the broadcast variable through it, and the framework takes care of making the data available on all executors.

Explain accumulators in Spark. In the same spirit as broadcast variables, we have accumulators. A simple example: you want to count how many error records exist when the data is distributed across multiple systems and multiple executors are each doing the processing and counting records, and ultimately I want the total count. I ask Spark to maintain an accumulator; it is maintained in the Spark context, in the driver program, of which there is one per application, it keeps getting accumulated, and whenever I want I can read the value and take appropriate action. So accumulators and broadcast variables look more or less opposite to each other, but their purposes are quite different.

Why is there a need for broadcast variables when working with Apache Spark? A broadcast variable is read-only, and it is cached in memory in a distributed fashion on the executors; it eliminates the work of moving the data from a centralized location, the Spark driver or a particular program, to all the executors in the cluster where the tasks will run, and we do not need to worry about where in the cluster those tasks execute. Compared with accumulators, broadcast variables are read-only: the executors cannot change the value, they can only read it, not update it, so mostly a broadcast variable is used like the cache we have for an RDD.
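A minimal hedged sketch of both mechanisms side by side, reusing `sc`; the word list and log lines are invented.

```python
dictionary = sc.broadcast({"spark", "hadoop", "python"})   # read-only copy shipped to every executor
error_count = sc.accumulator(0)                            # counter the executors add to, the driver reads

def check(line):
    if "ERROR" in line:
        error_count.add(1)                 # executors can only add; the driver reads the total
    return [w for w in line.lower().split() if w in dictionary.value]

lines = sc.parallelize(["ERROR spark job failed", "INFO python task done"])
print(lines.flatMap(check).collect())      # the action triggers the work (and the accumulator updates)
print(error_count.value)                   # read the consolidated count back on the driver
```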
Next question: how can you trigger automatic cleanups in Spark to handle accumulated metadata? There is a TTL parameter we can set; it takes effect alongside the running jobs, and at intervals intermediate results are written to disk and unnecessary data is cleaned up: the least-used RDDs get cleared, keeping both the metadata and the memory clean.

What are the various levels of persistence in Apache Spark? When we say data should be persisted, it can be at different levels: in memory only, in memory and on disk, or on disk only, and while it is being stored we can ask for it to be kept in a serialized form. The reason we persist at all is that I want this particular RDD, this form of the data, kept around for reuse; if I do not need it immediately I may not want it occupying memory, so I write it to disk and read it back whenever the need arises.

The next question: what do you understand by a schema RDD? A schema RDD is used especially within Spark SQL: the RDD carries its meta information, its schema, very similar to a database schema describing the structure of the data, and when I have the structure it is easy to handle the data, since the data and the structure exist together. In Spark this schema RDD is now called a DataFrame, a term that is very popular in languages such as R; it holds the data along with the meta information saying which columns it has and what structure it is in.

Explain a scenario where you would use Spark Streaming. Assume you want to do sentiment analysis of tweets: the data will be streamed, so we use a tool like Flume to harvest the information from Twitter and feed it into Spark Streaming, which extracts or identifies the sentiment of each tweet and marks whether it is positive or negative, maybe with a percentage of positive and negative sentiment, and stores it in some structured form along with the tweet id. Then you can leverage Spark SQL to do grouping or filtering based on the sentiment, and maybe use a machine learning algorithm to ask what drives a particular tweet to the negative side: is there any similarity among the negative tweets, are they specific to a product, to the time the tweet was posted, or to the region it was posted from? Those analyses can be done by leveraging the MLlib of Spark, so MLlib, Streaming, SQL and Core all work together; they are different offerings available to solve different problems.

I hope you all enjoyed it. Thank you, folks. Please be kind enough to like this video, and you can comment with any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 100,094
Keywords: yt:cc=on, pyspark full course, pyspark full course 2023, complete pyspark course, Pyspark tutorial for beginners, pyspark tutorial, pyspark online training, Pyspark training, pyspark tutorial jupyter notebook, Introduction to PySpark for Beginners, spark with python, data analytics using pyspark, apache spark with python, pyspark dataframes, pyspark rdd, introduction to pyspark, pyspark api, what is pyspark, pyspark edureka, apache spark edureka, edureka
Id: sSkAuTqfBA8
Length: 238min 31sec (14311 seconds)
Published: Thu Feb 02 2023