It is often said that ninety percent of the world's data was generated
in just the last two years. Starting in the early 2000s, the amount of data being generated exploded
with the spread of the internet, social media, and other digital
technologies. Organizations found themselves facing massive volumes of
data that were very hard to process. To describe this challenge, the
term Big Data emerged. Big Data refers to extremely large and complex
data sets that are difficult to process using traditional methods. Organizations across the
world wanted to process this massive volume of data and derive useful insights from it.
Here's where Hadoop comes into the picture. In 2006, Hadoop, created by Doug Cutting and Mike Cafarella
and developed heavily by engineers at Yahoo, emerged as an open-source software framework for exactly this problem.
They were inspired by Google's papers on MapReduce and the Google
File System. Hadoop popularized a way of data processing called distributed
processing. Instead of relying on a single machine, we can use multiple computers to get
the final result. Think of it like teamwork: each machine in a cluster will get some part of
the data to process. They will work simultaneously on all of this data, and in the end, we will
combine the output to get the final result. Hadoop has two key components.
The first is the Hadoop Distributed File System (HDFS), which acts like a giant storage system for
our data sets: it splits the data into chunks and stores those chunks across different computers.
The second is MapReduce, a clever way of processing all of that data in parallel: you divide the data
into chunks and process them at the same time, much like a team of friends solving a very large puzzle,
where each person gets a part of the puzzle and, at the end, everything is put together for the final
result. So, with Hadoop, we have two things: HDFS, which stores our data across multiple computers,
and MapReduce, which processes all of that data in parallel. Together, they allowed
organizations to store and process very large volumes of data.
But here's the thing: although Hadoop was very good at handling Big Data, it had a few limitations. One of the
biggest problems with Hadoop was that it relied on storing data on disk, which made
things much slower. Every time we ran a job, it would read the data from
disk, process it, and then write the results back to disk.
All of this disk I/O made data processing a lot slower. Another issue with Hadoop was that it processed
data only in batches. This means we had to wait for one job to complete before submitting
the next one. It was like waiting for the whole group of friends to complete their puzzles
individually and then putting them together. So, there was a need to process all of this
data faster and in real-time. Here's where Apache Spark comes into the picture. In 2009,
researchers at the University of California, Berkeley, developed Apache Spark as
a research project. The main reason behind the development of Apache Spark was
to address the limitations of Hadoop. This is where they introduced the powerful concept
called RDD (Resilient Distributed Dataset). RDD is the backbone of Apache Spark. It allows
data to be kept in memory, which enables much faster data access and processing. Instead of reading
and writing data to disk at every step, Spark keeps the data in memory whenever it can.
By memory, we mean the RAM (Random Access Memory) of the machines in the cluster. And
this in-memory processing can make Spark up to 100 times faster than Hadoop MapReduce for some workloads. Yes, you heard it
right, up to 100 times faster. Additionally, Spark also gives you the ability to write code in
various programming languages such as Python, Java, and Scala. So, you can
easily start writing Spark applications in your preferred language
and process your data at a large scale. Apache Spark became very popular because
it was fast, could handle huge amounts of data, and processed it efficiently.
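As a small illustration of the in-memory idea described above, here is a hedged PySpark sketch (assuming Spark is installed locally; the numbers and app name are made up): an RDD is built, a derived RDD is cached in memory, and two actions reuse it instead of recomputing everything from scratch.

```python
from pyspark.sql import SparkSession

# A local session just for illustration; on a real cluster this would go through the cluster manager.
spark = SparkSession.builder.appName("rdd-in-memory-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))        # an RDD spread across the available cores
squares = numbers.map(lambda x: x * x).cache()    # ask Spark to keep this result in memory

print(squares.count())   # first action: computes the squares and caches them
print(squares.sum())     # second action: reuses the cached, in-memory data

spark.stop()
```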
Here are the different components of the Apache Spark ecosystem. One of the most important parts is
called Spark Core. It helps with processing data across multiple computers and ensures everything
works efficiently and smoothly. Another part is Spark SQL. So, if you want to write SQL queries
directly on your dataset, you can easily do that using Spark SQL. Then there is Spark Streaming: if
you want to process real-time data, like the live updates you see in apps such as Google Maps or Uber, you can do that
using Spark Streaming. And finally, we have MLlib, which is used for training large-scale
machine learning models on Big Data using Spark. With all of these components working together, Apache Spark became a powerful tool for
processing and analyzing Big Data. Nowadays, in almost every data-driven company, you will see Apache
Spark being used to process Big Data. Now, let's understand the basic architecture
behind Apache Spark. A single, standalone computer is fine for everyday
things like watching movies or playing games. But when you
want to process really Big Data, you can't do that on one machine. You need
multiple computers working together on individual tasks so that you can combine the output at
the end and get the desired result. You can't just take ten computers and start processing
your Big Data. You need a proper framework to coordinate work across all of these different
machines, and Apache Spark does exactly that. Apache Spark manages and coordinates
the execution of tasks on data across a cluster of computers. It relies on something called a
cluster manager (such as YARN, Kubernetes, or Spark's own standalone manager). Any program we write with Spark is called a Spark application. Whenever we
run anything, the request goes to the cluster manager, which grants resources to the application
so that it can complete its work. Inside a Spark application, there are two
important kinds of processes: the driver process and the executor processes.
The driver process is like the boss, and the executor processes are like the workers. The
main job of the driver process is to keep track of all the information about the Spark
application and to respond to commands and input from the user. So, whenever we submit
anything, the driver process makes sure it moves through the Spark application
properly: it analyzes the work that needs to be done, divides it into smaller tasks, and
assigns these tasks to executor processes. So, it is basically the boss or a manager who
is trying to make sure everything works properly. The driver process is the heart of a
Spark application because it keeps everything running smoothly and requests the
right resources from the cluster manager based on the work we submit. Executor processes are the ones that
actually do the work: they execute the code assigned by the driver and report
the progress and results of the computation back to it.
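To see where the driver, the executors, and the cluster manager show up when you configure an application, here is an illustrative PySpark sketch; the resource numbers are made up, the right values depend on your cluster, and some settings (like driver memory) normally have to be set before the driver starts, for example through spark-submit.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-executor-demo")
    .config("spark.executor.instances", "4")   # how many executor (worker) processes to request
    .config("spark.executor.memory", "4g")     # memory given to each executor
    .config("spark.executor.cores", "2")       # CPU cores given to each executor
    .config("spark.driver.memory", "2g")       # memory for the driver (the "boss") process
    .getOrCreate()
)

# The cluster manager decides which machines the executors actually start on;
# the driver then splits the work into tasks and hands them to those executors.
```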
Now, let's talk about how Apache Spark executes code in practice. When we actually write our code in Apache Spark, the first thing
we need to do is create the Spark session. It is basically making the connection with the
cluster manager. You can create a Spark session with any of these languages: Python, Scala,
or Java. No matter what language you use to begin writing your Spark application, the first
thing you need to create is a Spark session. Once you have it, you can perform simple tasks, such
as generating a range of numbers, by writing just a few lines of code. For
example, you can create a DataFrame with one column containing a thousand rows, with values
from 0 to 999, by writing a single line of code. A DataFrame is simply
a representation of data in rows and columns, similar to a spreadsheet in MS Excel. The concept of a DataFrame
is not unique to Spark: Python (through Pandas) and R have DataFrames too. The difference is that a Pandas
DataFrame lives on a single computer, whereas a Spark DataFrame is distributed across
multiple computers. To process all of this data in parallel, Spark divides
your data into multiple chunks. This is called partitioning. You can have a single partition
or multiple partitions, and you can specify how many while writing your code.
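Here is a hedged sketch of those ideas (the column name "number" and the partition count of 8 are just illustrative choices): a one-column DataFrame with values 0 to 999 is created, and its partitioning is inspected and changed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("range-and-partitions-demo").getOrCreate()

# One column with a thousand rows, values 0 to 999.
df = spark.range(1000).toDF("number")
df.show(5)                                  # peek at the first few rows

print(df.rdd.getNumPartitions())            # how many chunks (partitions) Spark chose by default
df_repartitioned = df.repartition(8)        # explicitly split the data into 8 partitions
print(df_repartitioned.rdd.getNumPartitions())

spark.stop()
```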
Work on a DataFrame is expressed through transformations. Transformations are basically the instructions that tell Apache Spark
how to modify the data and get the desired result. For example, let's say you want to find
all the even numbers in a data frame. You can use the filter transformation function to
specify this condition. But here's the thing: if we run just this transformation, we will not see any
output yet. In most programming languages, once you run the code, you get the output
immediately. Spark doesn't work like that. Spark uses lazy evaluation. It waits
until you actually ask for a result, and only then does it build an execution plan
from all the transformations you have written. This allows Spark to look at your entire
data flow and execute it efficiently. To actually execute the transformation
block, we have something called actions. There are multiple actions available in Apache
Spark. One of them is the count action, which gives us the total number of records
in a DataFrame. When we run an action, Spark executes the whole chain of transformations
and gives us the final output.
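Here is a hedged sketch of that flow, using the range DataFrame from earlier (the column produced by spark.range is called "id"): the filter is only recorded as a transformation, and nothing actually runs until the count action is called.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-evaluation-demo").getOrCreate()

df = spark.range(1000)                        # one column named "id" with values 0..999
evens = df.filter(F.col("id") % 2 == 0)       # transformation: just recorded, not executed

print(evens.count())                          # action: Spark builds the plan, runs it, prints 500

spark.stop()
```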
Here's a small end-to-end example that ties all of these concepts together in a single project. The first thing we need to do is import SparkSession. You can
do that using the following code: from pyspark.sql import SparkSession. This gives us the entry point
for the Spark application. Once you have done that, you can use SparkSession.builder and call
getOrCreate() on it. This creates the Spark session so that you can import the dataset
and start writing queries. The session object also shows details such
as the Spark version and the app name. Now, suppose we have this dataset called
"tips". If you want to read this data, you can use a simple function called spark.read.csv. If you
provide the path and set the header option to True, Spark will load the data from the CSV file, and you can display it. As
you can see, our data contains total_bill, tip, sex, smoker, day, time, and size. All of this
data is being read in from the CSV file. If you print the type of this object, you will
see that it is a pyspark.sql.dataframe.DataFrame.
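A hedged sketch of those first steps (the path tips.csv is a placeholder for wherever your copy of the dataset lives, and inferSchema is an extra option so the numeric columns are not read as strings):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips-demo").getOrCreate()

# header=True tells Spark that the first row of the CSV contains the column names.
df = spark.read.csv("tips.csv", header=True, inferSchema=True)

df.show(5)         # display the first few rows of the tips data
print(type(df))    # <class 'pyspark.sql.dataframe.DataFrame'>
```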
Now, you can create a temporary view on top of this DataFrame. If you use the function createOrReplaceTempView, it registers a table-like view
inside Spark, and you can write SQL queries on top of it. For example, you can run the query SELECT
* FROM tips: pass this query to spark.sql, and it runs that SQL directly
against our DataFrame. So, what we really did was import the data, expose it as a
table, and then write SQL queries on top of it. You can also convert this Spark
DataFrame into a Pandas DataFrame, so if you want to apply any Pandas function, you can do
that from Spark as well.
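A hedged sketch of the view, SQL, and Pandas steps (the setup lines are repeated so the snippet runs on its own; toPandas assumes the pandas package is installed and pulls all the data onto the driver, so it only makes sense for small datasets):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips-demo").getOrCreate()
df = spark.read.csv("tips.csv", header=True, inferSchema=True)   # placeholder path

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("tips")
spark.sql("SELECT * FROM tips").show(5)

# Convert the (small) Spark DataFrame into a Pandas DataFrame for Pandas-style functions.
pandas_df = df.toPandas()
print(pandas_df.head())
```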
Coming back to lazy evaluation: suppose you just filter the rows where sex is Female and the day is Sunday. When we run that statement on its own, Spark does
not execute anything yet. It waits for an action to be performed. The action here
is show. So, once you run show, Spark runs the whole chain, and
then you will be able to see the results. The filter is the transformation
we discussed earlier in the video, and show is the action we were talking about.
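And a hedged sketch of that last step (again with the setup repeated; "Female" and "Sun" are how the values appear in the common version of the tips dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips-demo").getOrCreate()
df = spark.read.csv("tips.csv", header=True, inferSchema=True)   # placeholder path

# Transformation only: Spark records the filter but does not execute it yet.
female_sunday = df.filter((df.sex == "Female") & (df.day == "Sun"))

# Action: show() triggers the actual computation and prints the matching rows.
female_sunday.show()
```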
Like this, you can do a lot more. You can go to the Spark documentation and explore it in detail: there are many functions available, and for each one
you will find a detailed explanation. I hope you now understand everything about
Apache Spark and how it executes all of this code. If you want to do an entire data
engineering project involving Apache Spark, you can watch the video mentioned in the
transcript. It will give you a complete understanding of how a data engineering
project is built from start to end. That's all from this video.
If you have any questions, let me know in the comments, and I'll
see you in the next video. Thank you.