Learn Apache Spark in 10 Minutes | Step by Step Guide

Video Statistics and Information

Captions
Ninety percent of the world's data was generated in just the last two years. In the early 2000s, the amount of data being generated grew exponentially with the use of the internet, social media, and various digital technologies. Organizations found themselves facing a massive volume of data that was very hard to process. To address this challenge, the concept of Big Data emerged.

Big Data refers to extremely large and complex data sets that are difficult to process using traditional methods. Organizations across the world wanted to process this massive volume of data and derive useful insights from it. Here's where Hadoop comes into the picture.

In 2006, a group of engineers at Yahoo developed a software framework called Hadoop, inspired by Google's MapReduce and Google File System technology. Hadoop introduced a new way of data processing called distributed processing. Instead of relying on a single machine, we can use multiple computers to get to the final result. Think of it like teamwork: each machine in a cluster gets some part of the data to process, they all work simultaneously, and in the end we combine the output to get the final result.

Hadoop has two key components. The first is the Hadoop Distributed File System (HDFS), which is like a giant storage system for our data: it divides the data into multiple chunks and stores those chunks across different computers. The second is MapReduce, a clever way of processing all of this data in parallel: you divide your data into multiple chunks and process them together, similar to a team of friends working on a very large puzzle. Each person gets a part of the puzzle to solve, and in the end we put everything together to get the final result.

So, with Hadoop, we have two things: HDFS (Hadoop Distributed File System) for storing our data across multiple computers, and MapReduce for processing all of that data in parallel. This allowed organizations to store and process very large volumes of data. But here's the thing: although Hadoop was very good at handling Big Data, it had a few limitations. One of the biggest problems was that it relied on storing data on disk, which made things much slower. Every time a job runs, Hadoop reads the data from disk, processes it, and writes the results back to disk, and this constant disk I/O makes data processing a lot slower. Another issue was that Hadoop processed data only in batches, meaning we had to wait for one job to complete before submitting another. It was like waiting for every friend in the group to finish their part of the puzzle individually before putting the pieces together.

So, there was a need to process all of this data faster and in real time. Here's where Apache Spark comes into the picture. In 2009, researchers at the University of California, Berkeley, developed Apache Spark as a research project. The main motivation behind Apache Spark was to address the limitations of Hadoop, and this is where they introduced a powerful concept called the RDD (Resilient Distributed Dataset).

The RDD is the backbone of Apache Spark. It allows data to be kept in memory, which enables much faster data access and processing. Instead of reading and writing data from disk repeatedly, Spark processes the data entirely in memory, meaning the RAM (Random Access Memory) inside our computers. This in-memory processing is what makes Spark up to 100 times faster than Hadoop. Yes, you heard that right: up to 100 times faster than Hadoop. Additionally, Spark lets you write code in various programming languages such as Python, Java, and Scala, so you can start writing Spark applications in your preferred language and process your data at large scale.
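To make the in-memory idea concrete, here is a minimal PySpark sketch (not from the video; the app name and numbers are arbitrary, and "local[*]" simply runs everything on one machine for illustration). It builds an RDD, caches it in RAM, and reuses the cached copy for a second computation:

from pyspark.sql import SparkSession

# Start a local Spark session (the entry point discussed later in the video).
spark = SparkSession.builder.appName("rdd-cache-sketch").master("local[*]").getOrCreate()

# Distribute a plain Python range across the cluster as an RDD.
numbers = spark.sparkContext.parallelize(range(1_000_000))

# cache() asks Spark to keep the computed RDD in memory after the first action,
# so the second action reuses the in-memory copy instead of recomputing it.
squares = numbers.map(lambda x: x * x).cache()

print(squares.count())  # first action: computes the squares and caches them
print(squares.sum())    # second action: served from the cached, in-memory data

spark.stop()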
Apache Spark became very popular because it was fast and could handle and process a lot of data efficiently. Here are the main components of the Spark ecosystem. The most important part is Spark Core, which handles processing data across multiple computers and ensures everything works efficiently and smoothly. Then there is Spark SQL: if you want to write SQL queries directly on your dataset, you can easily do that using Spark. Next is Spark Streaming: if you want to process real-time data, like the live data you see in Google Maps or Uber, you can do that using Spark Streaming. And finally there is MLlib, which is used for training large-scale machine learning models on Big Data using Spark.

With all of these components working together, Apache Spark became a powerful tool for processing and analyzing Big Data. Nowadays, in almost any company, you will see Apache Spark being used to process Big Data.

Now, let's understand the basic architecture behind Apache Spark. A standalone computer is generally used to watch movies, play games, and so on, but when you want to process really large data, you can't do it on a single machine. You need multiple computers working together on individual tasks so that you can combine the output at the end and get the desired result. And you can't just take ten computers and start processing your Big Data: you need a proper framework to coordinate the work across all of these machines, and Apache Spark does exactly that.

Apache Spark manages and coordinates the execution of tasks on data across a cluster of computers, with the help of something called a cluster manager. Any job we write in Spark is called a Spark application. Whenever we run one, the request goes to the cluster manager, which grants the resources the application needs so that it can complete its work.

A Spark application has two important components: the driver process and the executor processes. The driver process is like the boss, and the executors are like the workers. The main job of the driver process is to keep track of all the information about the Spark application and respond to commands and input from the user. Whenever we submit anything, the driver makes sure it moves through the Spark application properly: it analyzes the work that needs to be done, divides it into smaller tasks, and assigns those tasks to the executor processes. It is basically the boss or manager making sure everything works properly. The driver process is the heart of the Spark application, because it keeps everything running smoothly and allocates the right resources based on the input we provide. The executor processes are the ones that actually do the work: they execute the code assigned by the driver process and report back the progress and results of the computation.
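To connect this architecture to code: when you create a Spark session, you can tell the cluster manager how much memory and how many executors your application should get. The sketch below is only an illustration, not code from the video; the app name and resource values are made up, and on a real cluster (YARN, Kubernetes, or Spark standalone) these settings are often passed to spark-submit instead of being set in the script:

from pyspark.sql import SparkSession

# The driver process runs this script; the cluster manager grants the
# executor resources requested below (ignored when running purely locally).
spark = (
    SparkSession.builder
    .appName("architecture-sketch")              # name shown in the Spark UI
    .config("spark.executor.instances", "4")     # number of worker (executor) processes
    .config("spark.executor.memory", "2g")       # RAM per executor
    .config("spark.driver.memory", "1g")         # RAM for the driver ("boss") process
    .getOrCreate()
)

# The driver splits this job into tasks and ships them to the executors;
# each executor counts its share of the rows and reports back.
print(spark.range(10_000_000).count())

spark.stop()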
Now, let's talk about how Apache Spark executes code in practice. When we write code for Apache Spark, the first thing we need to do is create a Spark session, which is essentially what establishes the connection with the cluster manager. You can create a Spark session from any of the supported languages: Python, Scala, or Java. No matter which language you use to write your Spark application, the first thing you create is the Spark session.

You can perform simple tasks, such as generating a range of numbers, with just a few lines of code. For example, a single line of code can create a DataFrame with one column containing a thousand rows with values from 0 to 999. A DataFrame is simply a representation of data in rows and columns, similar to a spreadsheet in MS Excel. The concept is not new to Spark: Python and R have DataFrames too, but there the DataFrame lives on a single computer, whereas in Spark the DataFrame is distributed across multiple computers. To make sure the data can be processed in parallel, it is divided into multiple chunks; this is called partitioning. You can have a single partition or many partitions, and you can specify this while writing your code. All of this is expressed through transformations, which are the instructions that tell Apache Spark how to modify the data to get the desired result.

For example, say you want to find all the even numbers in a DataFrame. You can use the filter transformation to specify that condition. But here's the thing: if we run only this code, we will not get the desired output. In most programming languages, once you run the code you get the output immediately, but Spark doesn't work like that. Spark uses lazy evaluation: it waits until you have written your entire chain of transformations and then generates an execution plan based on the code you have written. This allows Spark to analyze your entire data flow and execute it efficiently.

To actually execute the transformations, we have actions. There are multiple actions available in Apache Spark; one of them is count, which gives us the total number of records in a DataFrame. When we run an action, Spark executes the whole chain of transformations and gives us the final output.

Here's an example that brings all of these concepts together in one small project. The first thing we need to do is import SparkSession: from pyspark.sql import SparkSession. This is the entry point of a Spark application. Once you do that, you can use SparkSession.builder with getOrCreate() to create the Spark session, so that you can import a dataset and start writing queries. You then have all the details available, such as the Spark version, the app name, and so on.

Now, say we have a dataset called "tips". To read this data, we can use the spark.read.csv function: provide the path and set the header option to true, and Spark will load the entire CSV file, using the first row as column names. As you can see, our data contains total bill, tip, sex, smoker, day, time, and size, all imported from the CSV file. If you print the type of this object, you will see that it is a pyspark.sql.dataframe.DataFrame.

Now you can create a temporary view on top of this DataFrame. If you call createOrReplaceTempView, it creates a table-like view inside Spark, and you can write SQL queries against it. For example, you can run the query SELECT * FROM tips: pass it to spark.sql and it runs directly on top of our DataFrame. So what we really did was import the data, expose it as a table, and then write SQL queries on top of it.

You can also convert this Spark DataFrame into a Pandas DataFrame, so if you want to apply any Pandas function, you can do that from Spark as well. And to see lazy evaluation in practice: if you filter the data for sex equal to female and day equal to Sunday, running that statement alone does not execute anything. Spark waits for an action to be performed, and the action here is show. Once you call show, Spark runs the whole chain and you can finally see the results. That filter is the transformation we discussed earlier, and show is the action we were talking about.
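Putting the range, filter, and count pieces together, a small sketch of the transformation-then-action flow might look like this (the app name is arbitrary and this is not the video's exact code):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

# Transformation 1: a DataFrame with one column, "id", holding values 0..999.
numbers = spark.range(1000)

# Transformation 2: keep only the even numbers.
# Nothing runs yet -- Spark only records the plan (lazy evaluation).
evens = numbers.filter(F.col("id") % 2 == 0)

# Action: count() forces Spark to execute the whole plan and return a result.
print(evens.count())  # 500

spark.stop()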
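And here is a hedged sketch of the "tips" walkthrough itself. The file path is a placeholder, header=True makes the first CSV row the column names, and the filter values ("Female", "Sun") follow the commonly distributed tips dataset, so adjust them to your own copy of the data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tips-sketch").getOrCreate()

# Read the CSV; header=True uses the first row as column names,
# inferSchema=True guesses column types instead of reading everything as strings.
tips = spark.read.csv("tips.csv", header=True, inferSchema=True)
print(type(tips))  # <class 'pyspark.sql.dataframe.DataFrame'>

# Register a temporary view so the data can be queried with plain SQL.
tips.createOrReplaceTempView("tips")
spark.sql("SELECT * FROM tips").show(5)

# Convert to a Pandas DataFrame if you want to use Pandas functions
# (requires pandas to be installed; the data is collected to the driver).
tips_pd = tips.toPandas()

# Lazy evaluation in action: filter() only records the plan;
# show() is the action that actually runs it and prints the rows.
female_sunday = tips.filter((F.col("sex") == "Female") & (F.col("day") == "Sun"))
female_sunday.show()

spark.stop()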
Like this, you can do a lot of things. You can go to the Spark documentation and explore it in detail: there are many more functions available, and for each one you will find a detailed explanation.

I hope this gave you a clear understanding of Apache Spark and how it executes your code. If you want to build an entire data engineering project involving Apache Spark, you can watch the full project video I have mentioned; it will give you a complete understanding of how a data engineering project is built from start to end.

That's all for this video. If you have any questions, let me know in the comments, and I'll see you in the next one. Thank you.
Info
Channel: Darshil Parmar
Views: 12,652
Keywords: darshil parmar, darshil parmar data engineer project, data engineering, what is apache spark, apache spark tutorial, learn apache spark, apache spark for beginners, apache spark for dummies, apache spark project, spark vs hadoop, apache spark vs hadoop, how to learn apache spark, apache spark for big data, big data processing with apache spark
Id: v_uodKAywXA
Length: 10min 46sec (646 seconds)
Published: Sun Jul 16 2023