It is often said that ninety percent of the world's data was generated
in just the last two years. Starting in the early 2000s, the amount of data being generated exploded
with the spread of the internet, social media, and other digital
technologies. Organizations found themselves facing massive volumes of
data that were very hard to process. To describe this challenge, the
term Big Data emerged. Big Data refers to extremely large and complex
data sets that are difficult to process using traditional methods. Organizations across the
world wanted to process this massive volume of data and derive useful insights from it.
Here's where Hadoop comes into the picture. In 2006, Hadoop, created by Doug Cutting and Mike Cafarella
and developed heavily by engineers at Yahoo, emerged as an open-source software framework for exactly this problem.
They were inspired by Google's papers on MapReduce and the Google
File System. Hadoop popularized a way of data processing called distributed
processing. Instead of relying on a single machine, we can use multiple computers to get
the final result. Think of it like teamwork: each machine in a cluster will get some part of
the data to process. They will work simultaneously on all of this data, and in the end, we will
combine the output to get the final result. Hadoop has two key components.
The first is the Hadoop Distributed File System (HDFS), which acts like a giant storage system for
our data sets: it splits the data into chunks and stores those chunks across different computers.
The second is MapReduce, a clever way of processing all of that data in parallel: you divide the data
into chunks and process them at the same time, much like a team of friends solving a very large puzzle,
where each person gets a part of the puzzle and, at the end, everything is put together for the final
result. So, with Hadoop, we have two things: HDFS, which stores our data across multiple computers,
and MapReduce, which processes all of that data in parallel. Together, they allowed
organizations to store and process very large volumes of data.
But here's the thing: although Hadoop was very good at handling Big Data, it had a few limitations. One of the
biggest problems with Hadoop was that it relied on storing data on disk, which made
things much slower. Every time we ran a job, it would read the data from
disk, process it, and then write the results back to disk.
All of this disk I/O made data processing a lot slower. Another issue with Hadoop was that it processed
data only in batches. This means we had to wait for one job to complete before submitting
the next one. It was like waiting for the whole group of friends to complete their puzzles
individually and then putting them together. So, there was a need to process all of this
data faster and in real-time. Here's where Apache Spark comes into the picture. In 2009,
researchers at the University of California, Berkeley, developed Apache Spark as
a research project. The main reason behind the development of Apache Spark was
to address the limitations of Hadoop. This is where they introduced the powerful concept
called RDD (Resilient Distributed Dataset). RDD is the backbone of Apache Spark. It allows
data to be kept in memory, which enables much faster data access and processing. Instead of reading
and writing data to disk at every step, Spark keeps the data in memory whenever it can.
By memory, we mean the RAM (Random Access Memory) of the machines in the cluster. And
this in-memory processing can make Spark up to 100 times faster than Hadoop MapReduce for some workloads. Yes, you heard it
right, up to 100 times faster. Additionally, Spark also gives you the ability to write code in
various programming languages such as Python, Java, and Scala. So, you can
easily start writing Spark applications in your preferred language
and process your data at a large scale. Apache Spark became very popular because
it was fast, could handle huge amounts of data, and processed it efficiently.
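As a small illustration of the in-memory idea described above, here is a hedged PySpark sketch (assuming Spark is installed locally; the numbers and app name are made up): an RDD is built, a derived RDD is cached in memory, and two actions reuse it instead of recomputing everything from scratch.

```python
from pyspark.sql import SparkSession

# A local session just for illustration; on a real cluster this would go through the cluster manager.
spark = SparkSession.builder.appName("rdd-in-memory-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))        # an RDD spread across the available cores
squares = numbers.map(lambda x: x * x).cache()    # ask Spark to keep this result in memory

print(squares.count())   # first action: computes the squares and caches them
print(squares.sum())     # second action: reuses the cached, in-memory data

spark.stop()
```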
Here are the different components of the Apache Spark ecosystem. One of the most important parts is
called Spark Core. It helps with processing data across multiple computers and ensures everything
works efficiently and smoothly. Another part is Spark SQL. So, if you want to write SQL queries
directly on your dataset, you can easily do that using Spark SQL. Then there is Spark Streaming: if
you want to process real-time data, like the live updates you see in apps such as Google Maps or Uber, you can do that
using Spark Streaming. And finally, we have MLlib, which is used for training large-scale
machine learning models on Big Data using Spark. With all of these components working together, Apache Spark became a powerful tool for
processing and analyzing Big Data. Nowadays, in almost every data-driven company, you will see Apache
Spark being used to process Big Data. Now, let's understand the basic architecture
behind Apache Spark. A single, standalone computer is fine for everyday
things like watching movies or playing games. But when you
want to process really Big Data, you can't do that on one machine. You need
multiple computers working together on individual tasks so that you can combine the output at
the end and get the desired result. You can't just take ten computers and start processing
your Big Data. You need a proper framework to coordinate work across all of these different
machines, and Apache Spark does exactly that. Apache Spark manages and coordinates
the execution of tasks on data across a cluster of computers. It relies on something called a
cluster manager (such as YARN, Kubernetes, or Spark's own standalone manager). Any program we write with Spark is called a Spark application. Whenever we
run anything, the request goes to the cluster manager, which grants resources to the application
so that it can complete its work. Inside a Spark application, there are two
important kinds of processes: the driver process and the executor processes.
The driver process is like the boss, and the executor processes are like the workers. The
main job of the driver process is to keep track of all the information about the Spark
application and to respond to commands and input from the user. So, whenever we submit
anything, the driver process makes sure it moves through the Spark application
properly: it analyzes the work that needs to be done, divides it into smaller tasks, and
assigns these tasks to executor processes. So, it is basically the boss or a manager who
is trying to make sure everything works properly. The driver process is the heart of a
Spark application because it keeps everything running smoothly and requests the
right resources from the cluster manager based on the work we submit. Executor processes are the ones that
actually do the work: they execute the code assigned by the driver and report
the progress and results of the computation back to it.
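To see where the driver, the executors, and the cluster manager show up when you configure an application, here is an illustrative PySpark sketch; the resource numbers are made up, the right values depend on your cluster, and some settings (like driver memory) normally have to be set before the driver starts, for example through spark-submit.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-executor-demo")
    .config("spark.executor.instances", "4")   # how many executor (worker) processes to request
    .config("spark.executor.memory", "4g")     # memory given to each executor
    .config("spark.executor.cores", "2")       # CPU cores given to each executor
    .config("spark.driver.memory", "2g")       # memory for the driver (the "boss") process
    .getOrCreate()
)

# The cluster manager decides which machines the executors actually start on;
# the driver then splits the work into tasks and hands them to those executors.
```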
Now, let's talk about how Apache Spark executes code in practice. When we actually write our code in Apache Spark, the first thing
we need to do is create the Spark session. It is basically making the connection with the
cluster manager. You can create a Spark session with any of these languages: Python, Scala,
or Java. No matter what language you use to begin writing your Spark application, the first
thing you need to create is a Spark session. Once you have it, you can perform simple tasks, such
as generating a range of numbers, by writing just a few lines of code. For
example, you can create a DataFrame with one column containing a thousand rows, with values
from 0 to 999, by writing a single line of code. A DataFrame is simply
a representation of data in rows and columns, similar to a spreadsheet in MS Excel. The concept of a DataFrame
is not unique to Spark: Python (through Pandas) and R have DataFrames too. The difference is that a Pandas
DataFrame lives on a single computer, whereas a Spark DataFrame is distributed across
multiple computers. To process all of this data in parallel, Spark divides
your data into multiple chunks. This is called partitioning. You can have a single partition
or multiple partitions, and you can specify how many while writing your code.
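Here is a hedged sketch of those ideas (the column name "number" and the partition count of 8 are just illustrative choices): a one-column DataFrame with values 0 to 999 is created, and its partitioning is inspected and changed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-and-partitions-demo").getOrCreate()

# One column with a thousand rows, values 0 to 999.
df = spark.range(1000).toDF("number")
df.show(5)                                  # peek at the first few rows

print(df.rdd.getNumPartitions())            # how many chunks (partitions) Spark chose by default
df_repartitioned = df.repartition(8)        # explicitly split the data into 8 partitions
print(df_repartitioned.rdd.getNumPartitions())

spark.stop()
```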
Work on a DataFrame is expressed through transformations. Transformations are basically the instructions that tell Apache Spark
how to modify the data and get the desired result. For example, let's say you want to find
all the even numbers in a data frame. You can use the filter transformation function to
specify this condition. But here's the thing: if we run just this transformation, we will not see any
output yet. In most programming languages, once you run the code, you get the output
immediately. Spark doesn't work like that. Spark uses lazy evaluation. It waits
until you actually ask for a result, and only then does it build an execution plan
from all the transformations you have written. This allows Spark to look at your entire
data flow and execute it efficiently. To actually execute the transformation
block, we have something called actions. There are multiple actions available in Apache
Spark. One of them is the count action, which gives us the total number of records
in a DataFrame. When we run an action, Spark executes the whole chain of transformations
and gives us the final output.
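Here is a hedged sketch of that flow, using the range DataFrame from earlier (the column produced by spark.range is called "id"): the filter is only recorded as a transformation, and nothing actually runs until the count action is called.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-evaluation-demo").getOrCreate()

df = spark.range(1000)                        # one column named "id" with values 0..999
evens = df.filter(F.col("id") % 2 == 0)       # transformation: just recorded, not executed

print(evens.count())                          # action: Spark builds the plan, runs it, prints 500

spark.stop()
```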
Here's a small end-to-end example that ties all of these concepts together in a single project. The first thing we need to do is import SparkSession. You can
do that using the following code: from pyspark.sql import SparkSession. This gives us the entry point
for the Spark application. Once you have done that, you can use SparkSession.builder and call
getOrCreate() on it. This creates the Spark session so that you can import the dataset
and start writing queries. The session object also shows details such
as the Spark version and the app name. Now, suppose we have this dataset called
"tips". If you want to read this data, you can use a simple function called spark.read.csv. If you
provide the path and set the header option to True, Spark will load the data from the CSV file, and you can display it. As
you can see, our data contains total_bill, tip, sex, smoker, day, time, and size. All of this
data is being read in from the CSV file. If you print the type of this object, you will
see that it is a pyspark.sql.dataframe.DataFrame.
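A hedged sketch of those first steps (the path tips.csv is a placeholder for wherever your copy of the dataset lives, and inferSchema is an extra option so the numeric columns are not read as strings):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips-demo").getOrCreate()

# header=True tells Spark that the first row of the CSV contains the column names.
df = spark.read.csv("tips.csv", header=True, inferSchema=True)

df.show(5)         # display the first few rows of the tips data
print(type(df))    # <class 'pyspark.sql.dataframe.DataFrame'>
```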
Now, you can create a temporary view on top of this DataFrame. If you use the function createOrReplaceTempView, it registers a table-like view
inside Spark, and you can write SQL queries on top of it. For example, you can run the query SELECT
* FROM tips: pass this query to spark.sql, and it runs that SQL directly
against our DataFrame. So, what we really did was import the data, expose it as a
table, and then write SQL queries on top of it. You can also convert this Spark
DataFrame into a Pandas DataFrame, so if you want to apply any Pandas function, you can do
that from Spark as well.
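A hedged sketch of the view, SQL, and Pandas steps (the setup lines are repeated so the snippet runs on its own; toPandas assumes the pandas package is installed and pulls all the data onto the driver, so it only makes sense for small datasets):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips-demo").getOrCreate()
df = spark.read.csv("tips.csv", header=True, inferSchema=True)   # placeholder path

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("tips")
spark.sql("SELECT * FROM tips").show(5)

# Convert the (small) Spark DataFrame into a Pandas DataFrame for Pandas-style functions.
pandas_df = df.toPandas()
print(pandas_df.head())
```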
Coming back to lazy evaluation: suppose you just filter the rows where sex is Female and the day is Sunday. When we run that statement on its own, Spark does
not execute anything yet. It waits for an action to be performed. The action here
is show. So, once you run show, Spark runs the whole chain, and
then you will be able to see the results. The filter is the transformation
we discussed earlier in the video, and show is the action we were talking about.
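And a hedged sketch of that last step (again with the setup repeated; "Female" and "Sun" are how the values appear in the common version of the tips dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips-demo").getOrCreate()
df = spark.read.csv("tips.csv", header=True, inferSchema=True)   # placeholder path

# Transformation only: Spark records the filter but does not execute it yet.
female_sunday = df.filter((df.sex == "Female") & (df.day == "Sun"))

# Action: show() triggers the actual computation and prints the matching rows.
female_sunday.show()
```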
Like this, you can do a lot more. You can go to the Spark documentation and explore it in detail: there are many functions available, and for each one
you will find a detailed explanation. I hope you now understand everything about
Apache Spark and how it executes all of this code. If you want to do an entire data
engineering project involving Apache Spark, you can watch the video mentioned in the
transcript. It will give you a complete understanding of how a data engineering
project is built from start to end. That's all from this video.
If you have any questions, let me know in the comments, and I'll
see you in the next video. Thank you.