For the past five years, Spark has been on an absolute tear, becoming one of the most widely used technologies in big data and AI. Today's cutting-edge companies like Facebook, Apple, Netflix, Uber, and many more have deployed Spark at massive scale, processing petabytes of data to deliver innovations ranging from detecting fraudulent behavior to delivering personalized experiences in real time, and many such innovations are transforming every industry. Hi all, I welcome you all to this full course session on Apache Spark, a complete crash course consisting of everything you need to know to get started with Apache Spark from scratch. But before we get into the details, let's look at our agenda for today for better understanding and ease of learning.
The entire crash course is divided into 12 modules. In the first module, Introduction to Spark, we will try to understand what exactly Spark is and how it performs real-time processing. In the second module, we will dive deep into the different components that constitute Spark, and we will also learn about the Spark architecture and its ecosystem. Next up, in the third module, we will learn what exactly resilient distributed datasets are in Spark. The fourth module is all about DataFrames; in this module we will learn what exactly DataFrames are and how to perform different operations on DataFrames. Moving on, in the fifth module we will discuss the different ways that Spark provides to perform SQL queries for accessing and processing data. In the sixth module, we will learn how to perform streaming on live data streams using Spark Streaming, and in the seventh module we will discuss how to execute different machine learning algorithms using the Spark machine learning library. The eighth module is all about Spark GraphX; in this module we are going to learn what graph processing is and how to perform graph processing using the Spark GraphX library. In the ninth module, we will discuss the key differences between two popular data processing paradigms, MapReduce and Spark. Talking about the tenth module, we will integrate two popular frameworks, Spark and Kafka. The eleventh module is all about PySpark; in this module we will try to understand how PySpark exposes the Spark programming model to Python. Lastly, in the twelfth module, we'll take a look at the most frequently asked interview questions on Spark, which will help you ace your interview with flying colors. Thank you, guys; while you are at it, please do not forget to subscribe to the Edureka YouTube channel to stay updated with trending technologies. There has been a buzz around the world that Spark is the future of the big data platform, which is a hundred times faster than MapReduce and is also a go-to tool for all solutions. But what exactly is Apache Spark, and why is it so popular? In this session I will give you a complete insight into Apache Spark and its fundamentals. Without any further ado, let's quickly look at the topics to be covered in this session.
First and foremost, I will tell you what Apache Spark is and its features. Next, I will take you through the components of the Spark ecosystem that make Spark the future of the big data platform. After that, I will talk about the fundamental data structure of Spark, that is, the RDD; I will also tell you about its features, its operations, the ways to create RDDs, etc., and at the last I'll wrap up the session by giving a real-time use case of Spark. So let's get started with the very first topic and understand what Spark is. Spark is an open-source, scalable, massively parallel, in-memory execution environment for running analytics applications. You can just think of it as an in-memory layer that sits on top of multiple data stores, where data can be loaded into memory and analyzed in parallel across the cluster. Coming to big data processing, much like MapReduce, Spark works by distributing the data across the cluster and then processing that data in parallel. The difference here is that unlike MapReduce, which shuffles files around on disk, Spark works in memory, and that makes it much faster at processing data than MapReduce. It is also said to be the lightning-fast unified analytics engine for big data and machine learning.
So now let's look at the interesting features of Apache Spark. Coming to speed, you can call Spark a swift processing framework. Why? Because it is a hundred times faster in memory and ten times faster on disk when compared with Hadoop; not only that, it also provides high data processing speed. Next, powerful caching: it has a simple programming layer that provides powerful caching and disk persistence capabilities, and Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. As you all know, Spark itself was designed and developed for real-time data processing, so it's an obvious fact that it offers real-time computation and low latency because of in-memory computation. Next, polyglot: Spark provides high-level APIs in Java, Scala, Python, and R, and Spark code can be written in any of these four languages. Not only that, it also provides a shell in Scala and Python. These are the various features of Spark. Now, let's see the various components of the Spark ecosystem.
Let me first tell you about the Spark Core component. It is the most vital component of the Spark ecosystem, which is responsible for basic I/O functions, scheduling, monitoring, etc. The entire Apache Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R, and Java. As I have already mentioned, Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. The Spark ecosystem library is composed of various components like Spark SQL, Spark Streaming, the machine learning library, and GraphX. Now, let me explain each of them. The Spark SQL component is used to leverage the power of declarative queries and optimized storage by executing SQL queries on Spark data, which is present in RDDs and other external sources. Next, the Spark Streaming component allows developers to perform batch processing and streaming of data in the same application. Coming to the machine learning library, it eases the deployment and development of scalable machine learning pipelines, like summary statistics, correlations, feature extraction, transformation functions, optimization algorithms, etc. And the GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. Now talking about the programming languages: Spark supports Scala, a functional programming language in which Spark itself is written, so Spark supports Scala as an interface. Then Spark also supports a Python interface; you can write a program in Python and execute it over Spark, and if you see the code in Python and Scala, both are very similar. Then, R is very famous for data analysis and machine learning, so Spark has also added support for R. It also supports Java, so you can go ahead and write the code in Java and execute it with Spark. Next, the data can be stored in HDFS, the local file system, Amazon S3 cloud storage, and it also supports SQL and NoSQL databases as well. So this is all about the various components of the Spark ecosystem.
Now, let's see what's next. When it comes to iterative distributed computing, that is, processing data over multiple jobs and computations, we need to reuse or share data among multiple jobs. In earlier frameworks like Hadoop, there were problems while dealing with multiple operations or jobs: we needed to store the data in some intermediate, stable, distributed storage such as HDFS, and the multiple I/O operations made the overall computation of jobs much slower, and there were replications and serializations which in turn made the process even slower. Our goal here was to reduce the number of I/O operations to HDFS, and this can be achieved only through in-memory data sharing. In-memory data sharing is 10 to 100 times faster than network and disk sharing, and RDDs try to solve all these problems
by enabling fault-tolerant, distributed, in-memory computations. So now let's understand what RDDs are. RDD stands for resilient distributed dataset. They are considered to be the backbone of Spark and are one of the fundamental data structures of Spark. They are also known as schema-less structures that can handle both structured and unstructured data. In Spark, anything you do is around RDDs: you're reading the data in Spark into an RDD; when you're transforming the data, you're performing transformations on an old RDD and creating a new one; then at last you perform some actions on the RDD and store the data present in the RDD to persistent storage. A resilient distributed dataset is an immutable distributed collection of objects. Your objects can be anything like strings, lines, rows, objects, collections, etc. RDDs can contain any type of Python, Java, or Scala objects, even including user-defined classes. Talking about the distributed environment, each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Due to this, you can perform transformations or actions on the complete data in parallel, and you don't have to worry about the distribution, because Spark takes care of that. RDDs are highly resilient, that is, they are able to recover quickly from any issues, as the same data chunks are replicated across multiple executor nodes; thus, even if one executor fails, another will still process the data. This allows you to perform functional calculations against a dataset very quickly by harnessing the power of multiple nodes. So this is all about RDDs. Now, let's have a look at some of the important features of RDDs.
RDDs have a provision of in-memory computation, and all transformations are lazy, that is, they do not compute the results right away until an action is applied; so RDDs support in-memory computation and lazy evaluation as well. Next, fault tolerance: RDDs track the data lineage information to rebuild lost data automatically, and this is how they provide fault tolerance to the system. Next, immutability: data can be created or retrieved at any time, and once defined, its value cannot be changed; that is the reason why I said RDDs are immutable. Next, partitioning: it is the fundamental unit of parallelism in a Spark RDD, and all the data chunks are divided into partitions in an RDD. Next, persistence: users can reuse RDDs and choose a storage strategy for them. Finally, coarse-grained operations apply to all elements in the dataset through map, filter, or group-by operations. So these are the various features of RDDs. Now, let's see the ways to create RDDs. There are three ways to create RDDs: one can create an RDD from parallelized collections, one can create an RDD from an existing RDD or other RDDs, and an RDD can also be created from external data sources, like HDFS, Amazon S3, HBase, etc. Now let me show you how to create RDDs.
I'll open my terminal and first check whether my daemons are running or not. Here I can see that the Hadoop and Spark daemons are both running. So now, first, let's start the Spark shell; it will take a bit of time to start. Now the Spark shell has started, and I can see the version of Spark as 2.1.1, and we have a Scala shell over here. Now I will tell you how to create RDDs in three different ways using the Scala language. First, let's see how to create an RDD from parallelized collections. sc.parallelize is the method used here; this SparkContext method creates a parallelized collection. So I will give sc.parallelize, parallelize the numbers 1 to 100 into five different partitions, and apply collect as an action to start the process.
In the result, you can see an array of the numbers 1 to 100. Now let me show you how the partitions appear in the web UI of Spark. The web UI port for Spark is localhost:4040. Here we have just completed one job, that is, the sc.parallelize followed by collect, and you can see all five tasks that succeeded, because we divided the job into five partitions. So let me show you the partitions. This is the DAG visualization, that is, the directed acyclic graph visualization; since we applied only parallelize, you can see only one stage here. Here you can see the RDD that has been created, and coming to the event timeline, you can see the tasks that were executed in five different partitions, and the different colors imply the scheduler delay, task deserialization time, shuffle read time, shuffle write time, executor computing time, etc. Here you can see the summary metrics for the created RDD: the maximum time it took to execute the tasks in five partitions in parallel is just 45 milliseconds. You can also see the executor ID, the host ID, the status (succeeded), duration, launch time, etc. So this is one way of creating an RDD, from parallelized collections.
Now, let me show you how to create an RDD from an existing RDD. Here I'll create an array called a1 and assign the numbers one to ten. So I got the result here: I have created an integer array of 1 to 10, and now I will parallelize this array a1. Sorry, I got an error; it is sc.parallelize(a1). So I created an RDD called parallelCollection. Now I will create a new RDD from the existing one, that is, val newRDD is equal to a1 map of the data present in the RDD; I take a1 as the reference, map the data, and multiply each value by two. So what should be our output if I multiply the data by two? It would be 2, 4, 6, 8, up to 20, correct? Let's see how it works. Yes, we got the output, that is, multiples of the numbers 1 to 10: 2, 4, 6, 8, up to 20. So this is one method of creating a new RDD from an old RDD.
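As a minimal sketch of this step, assuming the spark-shell sc (in the sketch I map the parallelized RDD rather than the plain Scala array, which is the usual way to derive a new RDD from an existing one):

    // An integer array of 1 to 10, parallelized into an RDD.
    val a1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val parallelCollection = sc.parallelize(a1)

    // Derive a new RDD from the existing RDD by doubling every element.
    val newRDD = parallelCollection.map(data => data * 2)

    // Action: returns Array(2, 4, 6, ..., 20) to the driver.
    newRDD.collect()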
And I have one more method, that is, from external file sources. What I will do here is give val test is equal to sc.textFile, and here I will give the HDFS file location as the path, that is, hdfs://localhost:9000, and I have a folder called example, and in that I have a file called sample. So I got one more RDD created here. Now let me show you this file that I have already kept in the HDFS directory. I will browse the file system and show you the /example directory that I have created. Here you can see the example directory that I created, and here I have sample as the input file, so you can see the same path location. So this is how I can create an RDD from external file sources; in this case I have used HDFS as the external file source. And this is how we can create RDDs in three different ways: from parallelized collections, from external data sources, and from an existing RDD.
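A minimal sketch of the external-source method, assuming the spark-shell sc and the HDFS path used in this demo (substitute your own file location):

    // Load a text file from HDFS into an RDD of lines.
    val test = sc.textFile("hdfs://localhost:9000/example/sample")

    // Action: bring the lines back to the driver to inspect them.
    test.collect().foreach(println)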
So let's move further and see the various operations that RDDs support. RDDs actually support two main operations, namely transformations and actions. As I have already said, RDDs are immutable, so once you create an RDD, you cannot change any content in the RDD. So you might be wondering how an RDD applies those transformations, correct? When you run any transformation, it runs that transformation on the old RDD and creates a new RDD; this is basically done for optimization reasons. Transformations are the operations which are applied on an RDD to create a new RDD. Now, these transformations work on the principle of lazy evaluation. So what does it mean? It means that when we call some operation on an RDD, it does not execute immediately; Spark maintains a record of the operations being called. Since transformations are lazy in nature, we can execute them at any time by calling an action on the data. Hence, in lazy evaluation, data is not loaded until it is necessary. Actions, in turn, analyze the RDD and produce a result; a simple action can be count, which counts the rows in an RDD and produces a result. So I can say that transformations produce new RDDs and actions produce results.
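Here is a small sketch of that idea, again assuming the spark-shell sc (the file path is just the one from this demo):

    // Transformations are only recorded, not executed.
    val lines    = sc.textFile("hdfs://localhost:9000/example/sample") // lazy
    val lengths  = lines.map(line => line.length)                      // lazy
    val longOnes = lengths.filter(len => len > 10)                     // lazy

    // Nothing has run yet. The first action triggers the whole lineage.
    val howMany = longOnes.count() // action: now Spark actually reads and computes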
Before moving further with the discussion, let me tell you about the three different workloads that Spark caters to: batch mode, interactive mode, and streaming mode. In the case of batch mode, we run a batch job: you write a job and then schedule it, and it works through a queue or batch of separate jobs without manual intervention. Then, in the case of interactive mode, it is an interactive shell where you go and execute the commands one by one; you execute one command, check the result, and then execute another command based on the output result, and so on. It works similar to a SQL shell, and the shell is the one which executes the driver program; you can also run it in cluster mode. It is generally used for development work or for ad hoc queries. Then comes the streaming mode, where the program is continuously running; as and when data comes, it takes the data, does some transformations and actions on it, and produces results. So these are the three different workloads that Spark caters to.
Now let's see a real-time use case. Here I'm considering Yahoo as an example. So what are the problems of Yahoo? Yahoo's properties are highly personalized to maximize relevance. The algorithms used to provide personalization, that is, the targeted advertisements and personalized content, are highly sophisticated, and the relevance model must be updated frequently because stories, news feeds, and ads change over time. Yahoo has over 150 petabytes of data stored on a 35,000-node Hadoop cluster, which should be accessed efficiently to avoid the latency caused by data movement and to gain insights from the data in a cost-effective manner. So, to overcome these problems, Yahoo looked to Spark to improve the performance of its iterative model training. The machine learning algorithm for news personalization required 15,000 lines of C++ code; on the other hand, the same machine learning algorithm took just 120 lines of Scala code. That is the advantage of Spark, and this algorithm was ready for production use after just 30 minutes of training on a hundred million datasets. Spark's rich API is available in several programming languages, has resilient in-memory storage options, and is compatible with Hadoop through YARN and the Spark-on-YARN project. Yahoo uses Apache Spark for personalizing its news web pages and for targeted advertising. Not only that, it also uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and also for categorizing news stories to find out what kind of users would be interested in reading each category of news. Spark runs over Hadoop YARN to use existing data and clusters, the extensive API of Spark and its machine learning library eases the development of machine learning algorithms, and Spark reduces the latency of model training via in-memory RDDs. So this is how Spark has helped Yahoo improve performance and achieve its targets. So I hope you understood the concept of Spark and its fundamentals.
Now, let me just give you an overview of the Spark architecture. Apache Spark has a well-defined layered architecture where all the components and layers are loosely coupled and integrated with various extensions and libraries. This architecture is based on two main abstractions: the first one is resilient distributed datasets, that is, RDDs, and the next one is the directed acyclic graph, called the DAG. In order to understand the Spark architecture, you need to first know the components of the Spark ecosystem and its fundamental data structure, the RDD. So let's start by understanding the Spark ecosystem. As you can see from the diagram, the Spark ecosystem is composed of various components like Spark SQL, Spark Streaming, the machine learning library, GraphX, SparkR, and the Core API component. Talking about Spark SQL, it is used to leverage the power of declarative queries and optimized storage by executing SQL queries on Spark data, which is present in RDDs and other external sources. Next, the Spark Streaming component allows developers to perform batch processing and streaming of data in the same application. Coming to the machine learning library, it eases the development and deployment of scalable machine learning pipelines, like summary statistics, cluster analysis methods, correlations, dimensionality reduction techniques, feature extraction, and many more. Now, the GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. Coming to SparkR, it is an R package that provides a lightweight front end to use Apache Spark. It provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets, and it also supports distributed machine learning using the machine learning library. Finally, the Spark Core component: it is the most vital component of the Spark ecosystem, responsible for basic I/O functions, scheduling, and monitoring. The entire Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R, and Java.
Now, let me tell you about the programming languages. First, Spark supports Scala; Scala is a functional programming language in which Spark is written, and Spark supports Scala as an interface. Then Spark also supports a Python interface: you can write a program in Python and execute it over Spark, and again, if you see the code in Scala and Python, both are very similar. Then, coming to R, it is very famous for data analysis and machine learning, so Spark has also added support for R. It also supports Java, so you can go ahead and write Java code and execute it over Spark. Spark also provides an interactive shell for Scala, Python, and R, where you can go ahead and execute the commands one by one. So this is all about the Spark ecosystem. Next, let's discuss the fundamental data structure of Spark, that is, the RDD, the resilient distributed dataset. In Spark, anything you do is around RDDs: you're reading the data in Spark, and it is read into an RDD; again, when you're transforming the data, you're performing transformations on an old RDD and creating a new one; then at last you will perform some actions on the data and store the dataset present in the RDD to persistent storage. A resilient distributed dataset is an immutable distributed collection of objects; your objects can be anything like strings, lines, rows, objects, collections, etc. Now, talking about the distributed environment, each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Due to this, you can perform transformations and actions on the complete data in parallel, and you don't have to worry about the distribution, because Spark takes care of that. Next, as I said, RDDs are immutable, so once you create an RDD, you cannot change any content in the RDD. So you might be wondering how an RDD applies those transformations, correct? When you run any transformation, it runs that transformation on the old RDD and creates a new RDD; this is basically done for optimization reasons. Let me tell you one thing here: RDDs can be cached and persisted. If you want to save an RDD for future work, you can cache it, and it will improve Spark's performance. An RDD is a fault-tolerant collection of elements that can be operated on in parallel; if an RDD partition is lost, it will automatically be recomputed by using the original transformations. This is how Spark provides fault tolerance.
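As a small sketch of the caching idea, assuming the spark-shell sc (StorageLevel is part of the standard Spark API):

    import org.apache.spark.storage.StorageLevel

    // Build an RDD and mark it for reuse.
    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(n => n.toLong * n)

    // cache() keeps the computed partitions in memory after the first action;
    // persist() lets you pick a storage level explicitly.
    squares.cache()
    // squares.persist(StorageLevel.MEMORY_AND_DISK) // alternative to cache()

    squares.count() // first action: computes and caches
    squares.count() // second action: served from the cached partitions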
There are two ways to create RDDs: the first one is by parallelizing an existing collection in your driver program, and the second one is by referencing a dataset in an external storage system such as a shared file system, HDFS, HBase, etc. Now, transformations are the operations that you perform on an RDD, which create a new RDD; for example, you can perform filter on an RDD and create a new RDD. Then there are actions, which analyze the RDD and produce a result; a simple action can be count, which will count the rows in an RDD and produce a result. So I can say that transformations produce new RDDs and actions produce results. So this is all about the fundamental data structure of Spark, that is, the RDD. Now let's dive into the core topic of today's discussion, the Spark architecture. So this is the Spark architecture: in your master node, you have the driver program, which drives your application. The code that you're writing behaves as the driver program, or if you are using the interactive shell, the shell acts as the driver program. Inside the driver program, the first thing that you do is create a SparkContext. Assume that the SparkContext is a gateway to all Spark functionality; it is similar to your database connection. Any command you execute in a database goes through the database connection; similarly, anything you do on Spark goes through the SparkContext.
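A minimal sketch of creating that gateway in a standalone Scala application (the spark-shell creates sc for you; the app name and master URL below are just placeholder assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        // The SparkConf describes the application; the SparkContext is the
        // gateway through which every Spark operation goes.
        val conf = new SparkConf()
          .setAppName("my-spark-app") // placeholder name
          .setMaster("local[*]")      // placeholder master; a YARN or standalone URL on a cluster
        val sc = new SparkContext(conf)

        val data = sc.parallelize(1 to 10)
        println(data.sum())

        sc.stop()
      }
    }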
Now this SparkContext works with the cluster manager to manage various jobs. The driver program and the SparkContext take care of executing the job across the cluster: a job is split into tasks, and then these tasks are distributed over the worker nodes. So any time you create an RDD in the SparkContext, that RDD can be distributed across various nodes and can be cached there; RDDs are said to be cached, partitioned, and distributed across various nodes. Now, worker nodes are the slave nodes whose job is to basically execute the tasks. The task is then executed on the partitioned RDDs in the worker nodes, and the result is returned back to the SparkContext. The SparkContext takes the job, breaks the job into tasks, and distributes them to the worker nodes; these tasks work on the partitioned RDDs, perform whatever operations you wanted to perform, and then collect the results and give them back to the main SparkContext. If you increase the number of workers, then you can divide the job into more partitions and execute them in parallel over multiple systems; this will actually be a lot faster. Also, if you increase the number of workers, it will also increase your memory, and you can cache the data so that the job can be executed much faster. So this is all about the Spark architecture. Now, let me give you an infographic idea of the Spark architecture. It follows a master-slave architecture. Here, the client submits the Spark user application code. When an application code is submitted, the driver implicitly converts the user code that contains transformations and actions into a logical directed acyclic graph called the DAG; at this stage it also performs optimizations such as pipelining transformations. Then it converts the logical graph, the DAG, into a physical execution plan with many stages. After converting it into the physical execution plan, it creates physical execution units called tasks under each stage; these tasks are then bundled and sent to the cluster. Now the driver talks to the cluster manager and negotiates resources, and the cluster manager launches the needed executors. At this point the driver will also send the tasks to the executors based on data placement. When the executors start, they register themselves with the driver, so that the driver has a complete view of the executors, and the executors now start executing the tasks that are assigned by the driver program. At any point of time when the application is running, the driver program will monitor the set of executors that run, and the driver node also schedules future tasks based on data placement. So this is how the internal working takes place in the Spark architecture.
There are three different types of workloads that Spark caters to. First, batch mode: in the case of batch mode, we run a batch job; you write the job and then schedule it, and it works through a queue or batch of separate jobs without manual intervention. Next, interactive mode: this is an interactive shell where you go and execute the commands one by one. You'll execute one command, check the result, and then execute the other command based on the output result, and so on; it works similar to a SQL shell, and the shell is the one which executes the driver program. It is generally used for development work, or it is also used for ad hoc queries. Then comes the streaming mode, where the program is continuously running: as and when the data comes, it takes the data, does some transformations and actions on it, and then produces output results. So these are the three different types of workloads that Spark actually caters to. Now, let's move ahead and see a simple demo. Let's understand how to create a Spark application in the Spark shell using Scala. Assume that we have a text file in an HDFS directory and we are counting the number of words in that text file. So let's see how to do it. Before I start running, let me first check whether all my daemons are running or not, so I'll type sudo jps. All my Spark daemons and Hadoop daemons are running: I have Master and Worker as the Spark daemons, and the NameNode, DataNode, ResourceManager, NodeManager, and everything else as the Hadoop daemons. The first thing that I do here is run the Spark shell; it takes a bit of time to start. In the meanwhile, let me tell you that the web UI port for the Spark shell is localhost:4040. So this is the web UI of Spark; if you click on Jobs right now, we have not executed anything, so there are no details over here. There you have jobs and stages; once you execute the jobs, you'll have the records of the tasks that you have executed here, and you can see the stages of the various jobs and tasks executed. So now let's check whether our Spark shell has started or not. Yes, you have your Spark version as 2.1.1 and you have a Scala shell over here. Before I start the code, let's check the content that is present in the input text file by running this command. I'll write val test is equal to sc.textFile, because I have saved a text file over there, and I'll give the HDFS path location; I've stored my text file in this location, and sample is the name of the text file. Now let me give test.collect so that it collects and displays the data that is present in the text file. In my text file I have the words Hadoop, research, analyst, data, science, and science, so this is my input data. Now let me apply the transformations and actions. I'll give val map is equal to sc.textFile with my input path location, and I'll apply the flatMap transformation to split the data, which is separated by spaces, and then map each word to a pair of word comma one. This gets recorded. Let me tell you one thing here: before applying an action, Spark will not start the execution process. Here I have applied reduceByKey to count the number of words in the text file; the execution itself will start once we apply an action. So now we are done with applying the transformations; the next step is to specify the output location to store the output file. I will give counts.saveAsTextFile and then specify the location for the output file; I'll store it in the same location where I have my input file and specify my output file name as output9.
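Here is a compact sketch of that whole word count, assuming the spark-shell sc and the HDFS paths from this demo (substitute your own input and output locations):

    // Read the input file from HDFS as an RDD of lines.
    val lines = sc.textFile("hdfs://localhost:9000/example/sample")

    // Transformations (lazy): split lines into words, pair each word with 1,
    // then sum the counts per word.
    val counts = lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: writing the result triggers the actual execution.
    counts.saveAsTextFile("hdfs://localhost:9000/example/output9")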
I forgot to give the double quotes; after adding them, I will run this. So it's completed now. Now let's see the output. I will open my Hadoop web UI at localhost:50070 and browse the file system to check the output. As I have said, I have example as the directory that I created, and in that I have specified output9 as my output, so I have two part files created. Let's check each of them one by one. We have the data count as one, analyst count as one, and science count as two; this is the first part file. Now let me open the second part file for you: there you have Hadoop count as one and research count as one. So now let me show you the text file that we specified as the input. As I have told you, Hadoop counts as one, research as one, analyst as one, data as one, and science and science as one and one. You might be thinking data science is one word; no, in the program code we asked to count the words separated by a space, and that is why we have the science count as two. I hope you got an idea of how word count works. Similarly, I will now parallelize the numbers 1 to 100 and divide the tasks into five partitions to show you what partitions of tasks are. I will write sc.parallelize with the numbers 1 to 100, divide them into five partitions, and apply the collect action to collect the numbers and start the execution. So it displays an array of 100 numbers.
Now let me explain the job stages, partitions, event timeline, DAG representation, and everything. Let me go to the web UI of Spark and click on Jobs. These are the jobs that have been submitted. Coming to the word count example, this is the DAG visualization; I hope you can see it clearly. First you collected the text file, then you applied the flatMap transformation and mapped it to count the number of words, then applied reduceByKey, and then saved the output file with saveAsTextFile. So this is the entire DAG visualization of the number of steps that we have covered in our program. Here it shows the completed stages, that is, two stages, and it also shows the duration, that is, 2 seconds. If you click on the event timeline, it just shows the executor that was added, and in this case you cannot see many partitions because you have not split the job into various partitions. So this is how you can see the event timeline and the DAG visualization. You can also see the stage IDs, descriptions, when you submitted them (I have just submitted this now), and it also shows the duration that it took to execute the task, the output bytes, the shuffle read, the shuffle write, and many more. Now, to show you the partitions: in this one you just applied sc.parallelize, right? So it is showing just one stage where you applied the parallelize operation, and it shows the succeeded tasks as 5/5; that is, you divided the job into five tasks, and all five tasks were executed successfully. Here you can see the partitions of the five different tasks that were executed in parallel; depending on the colors, it shows the scheduler delay, the shuffle read time, executor computing time, result serialization time, getting result time, and many more. You can see that the duration it took to execute the five tasks in parallel at the same time is at most one millisecond, so with in-memory computation Spark is much faster. You can see the IDs of all five tasks, all successes; you can see the locality level, the executor and the host IP, the launch time, the duration it took, everything. You can also see that we created an RDD and parallelized it. Similarly, for the word count example you can see the RDD that was created and the operations that were applied to execute the task, and you can see the duration it took; even here it's just one millisecond to execute the entire word count example, and you can see the IDs, locality level, and executor ID. In this case we executed the job in two stages, so it is showing the two stages. So this is all about how the web UI looks and the features and information that you can see in the web UI of Spark after executing a program in the Scala shell. In this program, you can see that we first gave the path to the input location and checked the data present in the input file, then we applied flatMap transformations and created an RDD, and then applied an action to start the execution of the task and saved the output file in this location. So I hope you got a clear idea of how to execute a word count example and check the various features in the Spark web UI, like partitions and DAG visualizations, and I hope you found the session interesting.
Apache Spark: this word can generate a spark in every Hadoop engineer's mind. It is a big data processing framework which is lightning fast at cluster computing, and the core reason behind its outstanding performance is the resilient distributed dataset, or in short, the RDD, and today I'll focus on the topic of RDDs using Spark. Before we get started, let's have a quick look at the agenda for today's session. We shall start with understanding the need for RDDs, where we'll learn the reasons for which RDDs were required. Then we shall learn what RDDs are, where we'll understand what exactly an RDD is and how they work. Later I'll walk you through the fascinating features of RDDs, such as in-memory computation, partitioning, persistence, fault tolerance, and many more. Once I finish the theory, I'll get your hands on RDDs, where we'll practically create and perform all possible operations on RDDs, and finally I'll wind up this session with an interesting Pokémon use case, which will help you understand RDDs in a much better way. Let's get started. Spark is one of the top mandatory skills required by each and every big data developer. It is used in multiple applications which need real-time processing, such as Google's recommendation engine, credit card fraud detection, and many more. To understand this in depth, we shall consider Amazon's recommendation engine. Assume that you are searching for a mobile phone on Amazon and you have certain specifications of your choice; then the Amazon search engine understands your requirements and provides you the products which match the specifications of your choice. All this is made possible because of the most powerful tool existing in the big data environment, which is none other than Apache Spark, and the resilient distributed dataset is considered to be the heart of Apache Spark. So with this, let's begin our first question: why do we need RDDs? Well, the current world is expanding technology, and artificial intelligence is the face of this evolution.
the machine learning algorithms and the data needed
to train these computers are huge the logic behind all these algorithms
are very complicated and mostly run in a distributed
and iterative computation method the machine learning
algorithms could not use the older mapreduce. Grams, because the traditional
mapreduce programs needed a stable State hdfs and we know that hdfs generates redundancy
during intermediate computations which resulted in a major
latency in data processing and in hdfs gathering data for multiple processing units
at a single instance. First time consuming along
with this the major issue was the HTF is did not have
random read and write ability. So using this old
mapreduce programs for machine learning
problems would be Then the spark was introduced compared to mapreduce spark is an advanced big data
processing framework resilient distributed data set which is a fundamental
and most crucial data structure of spark was the one which made it all possible rdds
are effortless to create and the mind-blowing
property with solve. The problem was it's in memory
data processing capability Oddity is not a distributed
file system instead. It is a distributed
collection of memory where the data needed
is always stored and kept available. Lynn RAM and because of
this property the elevation it gave to the memory
accessing speed was unbelievable The Oddities our fault tolerant and this property bought it
a Dignity of a whole new level. So our next question would be what are rdds the resilient
The resilient distributed datasets, or RDDs, are the primary underlying data structures of Spark. They are highly fault tolerant and store data amongst multiple computers in a network. The data is written into multiple executor nodes, so that in case of a calamity, if any executing node fails, then within a fraction of a second it gets backed up from the next executor node, with the same processing speed as the current node. The fault-tolerance property enables them to roll back their data to the original state by applying simple transformations onto the lost part in the lineage. RDDs do not need anything like a hard disk or any other secondary storage; all they need is the main memory, which is RAM. Now that we have understood the need for RDDs and what exactly an RDD is, let us see the different sources from which data can be ingested into an RDD. The data can be loaded from any source like HDFS, HBase, Hive, SQL, you name it, they've got it; the collected data is dropped into an RDD. And guess what, RDDs are free-spirited: they can process any type of data; they won't care if the data is structured, unstructured, or semi-structured. Now, let me walk you through the features of RDDs which give them an edge over the other alternatives. In-memory computation: the idea of in-memory computation brought groundbreaking progress in cluster computing; it increases the processing speed when compared with HDFS. Moving on to lazy evaluation: the phrase "lazy" explains it all; Spark logs all the transformations you apply onto an RDD and will not throw any output onto the display until an action is provoked. Next is fault tolerance: RDDs are absolutely fault tolerant; any lost partition of an RDD can be rolled back by applying simple transformations onto the lost part in the lineage. Speaking about immutability, the data once dropped into an RDD is immutable, because the access provided by the RDD is read-only; the only way to access or modify it is by applying a transformation onto an RDD which is prior to the present one. Discussing partitioning, the important reason for Spark's parallel processing is its partitioning; by default Spark determines the number of parts into which your data is divided, but you can override this and decide the number of blocks you want to split your data into. Let's see what persistence is: Spark's RDDs are totally reusable; users can apply a certain number of transformations onto an RDD and preserve the final RDD for future use, and this avoids all the hectic process of applying all the transformations from scratch. And now, last but not the least, coarse-grained operations: the operations performed on RDDs using transformations like map, filter, flatMap, etc. apply to all the elements in the dataset at once; hence, every operation applied onto an RDD is coarse-grained. These are the features of RDDs.
Moving on to the next stage, we shall understand the creation of RDDs. RDDs can be created using three methods: the first method is using parallelized collections, the next method is by using external storage like HDFS, HBase, Hive, and many more, and the third one is using an existing RDD which is prior to the present one. Now let us understand and create an RDD through each method. Spark can be run on virtual machines like the Spark VM, or you can install a Linux operating system like Ubuntu and run it standalone, but we here at Edureka use the best-in-class CloudLab, which comprises all the frameworks you need in a single-stop cloud framework: no need for any hectic process of downloading files, setting up environment variables, looking for hardware specifications, etc. All you need is a login ID and password to the all-in-one, ready-to-use CloudLab, where you can run and save all your programs. Let us fire up our Spark shell using the command spark2-shell. Now that the Spark shell has been fired up, let's create a new RDD. Here we are creating a new RDD with the first method, which is using parallelized collections. We are creating a new RDD by the name parallelizedCollectionsRDD; we are starting from the Spark context, and we are parallelizing an array into the RDD which consists of the days of the week: Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday. Now let's create this new RDD. The parallelized collections RDD is successfully created; now let's display the data which is present in our RDD. So this was the data which is present in our RDD.
is successfully created now, let's display the data which is present
in as pack file a TD. It's the data which is present in
as pack file ID is a collection of alphabets
starting from A to Z. Now. Let's create a new already
using the third method which is using
an existing iridium, which is prior to the present
one in the third method. I'm creating a new Rd
by the name verts and I'm creating a spark context and paralyzing a statement
into the RTD Words, which is spark is
a very powerful language. So this is
a collection of Words, which I have passed
into the new. You are DD words. Now. Let us apply a transformation on to the RTD and create
a new artery through that. So here I'm applying
map transformation on to the previous rdd that is words and I'm storing
the data into the new ID which is WordPress. So here we are applying
map transformation in order to display the first letter
of each and every word which is stored
in the RTD words. Now, let's continue. The transformation is been
applied successfully now, let's display the contents
which are present in new ID which is word pair So as explained we have displayed
the starting letter of each and every word as s is starting letter of spark
is starting letter of East and so on L is starting
letter of language. Now, we have understood
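A minimal sketch of that third method, assuming the spark-shell sc (the sentence is the one from this demo, and pairing each word with its first letter is my reading of the wordPair step):

    // Parallelize a sentence into an RDD of words.
    val words = sc.parallelize(Seq("Spark", "is", "a", "very", "powerful", "language"))

    // Derive a new RDD from the existing one: pair each word with its starting letter.
    val wordPair = words.map(w => (w.charAt(0), w))

    wordPair.collect() // Array((S,Spark), (i,is), (a,a), (v,very), (p,powerful), (l,language))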
Now that we have understood the creation of RDDs, let us move on to the next stage, where we'll understand the operations that are performed on RDDs. Transformations and actions are the two major operations performed on RDDs. Let us understand what transformations are: we apply transformations in order to access, filter, and modify the data which is present in an RDD. Now, transformations are further divided into two types: narrow transformations and wide transformations. Let us understand what narrow transformations are: we apply narrow transformations onto a single partition of the parent RDD, because the data required to process the RDD is available on a single partition of the parent RDD. Examples of narrow transformations are map, filter, flatMap, and mapPartitions. Let us move on to the next type of transformation, which is wide transformations. We apply wide transformations onto multiple partitions of the parent RDD, because the data required to process the RDD is available on multiple partitions of the parent RDD. Examples of wide transformations are reduceByKey and union.
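Here is a small sketch contrasting the two, assuming the spark-shell sc:

    val nums = sc.parallelize(1 to 10, 2)

    // Narrow transformations: each output partition depends on a single
    // parent partition, so no data moves between partitions.
    val doubled = nums.map(_ * 2)
    val evens   = doubled.filter(_ % 4 == 0)

    // Wide transformation: reduceByKey groups values by key, so data from
    // multiple parent partitions must be shuffled together.
    val byParity = nums.map(n => (n % 2, n)).reduceByKey(_ + _)

    byParity.collect() // Array((0,30), (1,25))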
Now, let us move on to the next part, which is actions. Actions, on the other hand, are considered to be the next kind of operation, used to display the final results. Examples of actions are collect, count, take, and first. Till now we have discussed the theory part of RDDs; let us start executing the operations that are performed on RDDs. In the practical part we will be dealing with an example of IPL match data. Here I have a CSV file which has the IPL match records, and this CSV file is stored in my HDFS. I'm loading my matches.csv file into the new RDD, which is CKfile, as a text file. So the matches.csv file has been successfully loaded as a text file into the new RDD, which is CKfile. Now let us display the data which is present in our CKfile RDD using an action command; collect is the action command which I'm using in order to display the data present in my CKfile RDD. Here we have in total six hundred and thirty-six rows of data, which consist of IPL match records from the year 2008 to 2017. Now let us see the schema of the CSV file. I am using the action command first in order to display the schema of the matches.csv file; this command will display the starting line of the CSV file we have. So the schema of the CSV file is the ID of the match, the season, the city where the IPL match was conducted, the date of the match, team one, team two, and so on. Now let's perform further operations on the CSV file. Moving on, I'm about to split out the second column of my CSV file, which contains the information regarding the cities which conducted the IPL matches, so I am using this operation in order to display the places where the matches were conducted. The transformation has been successfully applied, and the data has been stored into the new RDD, which is states. Now let's display the data which is stored in our states RDD using the collect action command; these are the places where the matches were conducted. Now let's find out the city which conducted the maximum number of IPL matches. Here I'm creating a new RDD again, which is statesCount, and I'm using a map transformation, counting each and every city and the number of matches conducted in that particular city. The transformation is successfully applied, and the data has been stored into the statesCount RDD.
Now let us create a new RDD by the name statesCountM and apply the reduceByKey and map transformations together, considering tuple one as the city name and tuple two as the number of matches conducted in that particular city, and apply the sortByKey transformation to find out the city which conducted the maximum number of IPL matches. The transformations are successfully applied, and the data is stored into the statesCountM RDD. Now let's display the data present in statesCountM; here I am using the take action command in order to take the top 10 results which are stored in the statesCountM RDD. According to the results, we have Mumbai, which hosted the maximum number of IPL matches, which is 85, from the year 2008 to the year 2017. Now let us create a new RDD by the name fil and use flatMap in order to filter out the match data which was conducted in the city Hyderabad and store the remaining data into the fil RDD. The transformation has been successfully applied; now let us display the data present in our fil RDD, which consists of the matches that were conducted excluding the city Hyderabad. So this is the data present in our fil RDD, which excludes the matches played in the city Hyderabad. Now let us create another RDD by the name fil and store the data of the matches which were conducted in the year 2017; we shall use the filter transformation for this operation. The transformation has been applied successfully and the data has been stored into the fil RDD. Now let us display the data present there; we shall use the collect action command, and now we have the data of all the matches which were played in the year 2017. Similarly, we can find out the data of the matches which were conducted in the year 2016 and store the same data into our new RDD, which is fil2; I have used the filter transformation in order to filter out the data of the matches conducted in the year 2016, and I have saved that data into the new RDD, which is fil2. Now let us understand the union transformation, where we'll apply the union transformation onto the fil RDD and the fil2 RDD in order to combine the data present in both the RDDs. Here I'm creating a new RDD by the name unionRDD, and I'm applying the union transformation on the two RDDs that we created before: the first one is fil, which consists of the data of the matches played in the year 2017, and the second one is fil2, which consists of the data of the matches played in the year 2016. Here I'll be clubbing both the RDDs together and saving the data into the new RDD, which is unionRDD.
Now let us display the data which is present in the new RDD, which is unionRDD; I am using the collect action command in order to display the data. Here we have the data of the matches which were played in the years 2016 and 2017. Now let's continue with our operations and find out the player with the maximum number of man of the match awards. For this operation I am applying a map transformation and splitting out column number 13, which consists of the data of the players who won the man of the match award for that particular match. The transformation has been successfully applied, column number 13 has been successfully split out, and the data has been stored into the manOfTheMatch RDD. Now we are creating a new RDD by the name manOfTheMatchCount, applying a map transformation onto the previous RDD, and counting the number of awards won by each and every particular player. Then we shall create a new RDD by the name manOfTheMatch, applying reduceByKey onto the previous RDD, which is manOfTheMatchCount, and again applying a map transformation, considering tuple one as the name of the player and tuple two as the number of matches in which he played and won the man of the match award, and let us use the take action command in order to print the data which is stored in our new RDD, manOfTheMatch. According to the result, we have AB de Villiers, who won the maximum number of man of the match awards, which is 15. So these are a few of the operations that were performed on RDDs.
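As a consolidated sketch of this IPL analysis, assuming the spark-shell sc, assuming matches.csv sits at the HDFS path shown below, and assuming the column positions used in this demo (city in column 2, player of the match in column 13); adjust the path and indices to your file:

    // Load the IPL match records as an RDD of CSV lines.
    // (For brevity this sketch does not strip the CSV header row.)
    val CKfile = sc.textFile("hdfs://localhost:9000/example/matches.csv")

    // City with the most matches: split out column 2, count per city, sort descending.
    val statesCountM = CKfile
      .map(line => (line.split(",")(2), 1))
      .reduceByKey(_ + _)
      .map { case (city, count) => (count, city) }
      .sortByKey(false)
    statesCountM.take(10)

    // Player with the most man-of-the-match awards (column 13 assumed).
    val manOfTheMatch = CKfile
      .map(line => (line.split(",")(13), 1))
      .reduceByKey(_ + _)
      .map { case (player, count) => (count, player) }
      .sortByKey(false)
    manOfTheMatch.take(10)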
Now, let us move on to our Pokémon use case so that we can understand RDDs in a much better way. The steps to be performed in the Pokémon use case are: loading the PokemonData.csv file from external storage into an RDD, removing the schema from the PokemonData.csv file, finding the total number of Water-type Pokémon, and finding the total number of Fire-type Pokémon. I know it's getting interesting, so let me explain each and every step practically. Here I am creating a new RDD by the name PokemonDataRDD1, and I'm loading my CSV file from external storage, that is, my HDFS, as a text file. So the PokemonData.csv file has been successfully loaded into our new RDD. Let us display the data which is present in our PokemonDataRDD1; I am using the collect action command for this. Here we have 721 rows of data covering all the types of Pokémon we have. Now let us display the schema of the data we have: I have used the action command first in order to display the first line of the CSV file, which happens to be the schema of the CSV file. So we have the index of the Pokémon, the name of the Pokémon, its type, total points, HP, attack points, defense points, special attack, special defense, speed, generation, and we can also find whether a particular Pokémon is legendary or not. Here I'm creating a new RDD, which is noHeader, and I'm using the filter operation in order to remove the schema from the PokemonData.csv file. The schema of the PokemonData.csv file has been removed because Spark considers the schema as data to be processed; for this reason we remove the schema. Now let's display the data which is present in the noHeader RDD; I am using the collect action command for this. So this is the data which is stored in the noHeader RDD, without the schema. So now let us find out the number of partitions into which our noHeader RDD has been split.
transformation in order to find out the number of partitions. The data was split
in two according to the result. The no header rdd is been split
into two partitions. I am here creating a new rdt
Next, I am creating a new RDD by the name waterRDD and using a filter transformation in order to find the water type Pokémon in our PokemonData.csv file, and I'm using the collect action command to print the data present in waterRDD. So these are the water type Pokémon we have in our PokemonData.csv file. Similarly, let's find the fire type Pokémon: I'm creating a new RDD by the name fireRDD, applying a filter operation to find the fire type Pokémon present in our CSV file, and using the collect action command to print the data present in fireRDD. These are the fire type Pokémon present in our PokemonData.csv file. Now let us count the total number of water type Pokémon present in the file; I am using the count action for this, and we have 112 water type Pokémon. Similarly, let's find the total number of fire type Pokémon using the same count action command: we have a total of 52 fire type Pokémon in our PokemonData.csv file.
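A sketch of the filter and count steps (it assumes the type appears as the literal strings "Water" and "Fire" in the row text, which is an assumption about the file, not something confirmed in the video):

    // filter by type and count
    val waterRDD = noHeader.filter(line => line.contains("Water"))
    val fireRDD  = noHeader.filter(line => line.contains("Fire"))
    waterRDD.count()   // 112 in this dataset
    fireRDD.count()    // 52 in this dataset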
Let's continue with further operations, where we'll find out the highest defense strength of a Pokémon. I am creating a new RDD by the name defenseList and applying a map transformation, splitting out column number six in order to extract the defense points of all the Pokémon present in our PokemonData.csv file. The data has been stored successfully into the new RDD, defenseList. Now I'm using the max action command in order to print the maximum defense strength out of all the Pokémon. So we have 230 points as the maximum defense strength among all the Pokémon.
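A sketch of extracting the defense column and taking the maximum (the column index 6 follows the column ordering read out above; the cast to a numeric type is an assumption):

    // column 6 holds the defense points; pull it out as numbers and take the max
    val defenseList = noHeader.map(line => line.split(",")(6).toDouble)
    defenseList.max()     // 230.0 -- the highest defense strength in the dataset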
In our further operations, let's find the Pokémon which come under the category of having the highest defense strength, which is 230 points. In order to find the name of the Pokémon with the highest defense strength, I'm creating a new RDD with the name defenseWithPokemonName and applying a map transformation onto the previous RDD, noHeader, splitting out column number six (the defense strength) so that we can extract the rows which have a defense strength of 230 points. Now I'm creating a new RDD again, with the name maximumDefensePokemon, and applying a groupByKey transformation in order to display the Pokémon which have the maximum defense points, that is 230 points. According to the result, we have Steelix, Steelix Mega, Shuckle, Aggron and Aggron Mega as the Pokémon with the highest defense strength of 230 points.
Now we shall find the Pokémon having the least defense strength. Before we find that Pokémon, let us find the least defense points present in the defenseList. To do this, I have created a new RDD named minimumDefensePokemon and applied the distinct and sortBy transformations onto the defenseList RDD in order to extract the least defense points present in it, and I have used the take action command to display the data present in the minimumDefensePokemon RDD. According to the results, we have five points as the least defense strength of a particular Pokémon. Now let us find the name of the Pokémon which comes under the category of having five points as its defense strength. Let us create a new RDD, defenseWithPokemonName2, apply a map transformation, split out column number six and store the data into this new RDD. The transformation has been applied successfully and the data is now stored in defenseWithPokemonName2. Now let us apply the further operations: I am creating another RDD with the name minimumDefensePokemon and applying a groupByKey transformation in order to extract the data from the rows which have the defense points as 5.0. The data has been loaded successfully, so let us display the data present in this RDD. According to the results, we have two Pokémon which come under the category of having five points as their defense strength: Chansey and Happiny are the two Pokémon with the least defense strength.
The world of Information Technology and big data processing started to see multiple potentialities from Spark coming into action. One such pinnacle in Spark's technology advancements is the data frame, and today we shall understand the technicalities of data frames in Spark. A data frame in Spark is all about performance. It is a powerful, multifunctional and integrated data structure where the programmer can work with different libraries and perform numerous functionalities without breaking a sweat, and without having to master every API and library involved in the process. Without wasting any time, let us look at the topics for today's discussion. The agenda for understanding data frames in Spark is as follows: we will begin with what data frames are, where we will learn what exactly a data frame is, how it looks and what its functionalities are. Then we shall see why we need data frames, where we shall understand the requirements which led to the invention of data frames. Later, I'll walk you through the important features of data frames, and then we shall look into the sources from which data frames in Spark get their data. Once the theory part is finished, I will get us involved in the practical part, where the creation of a data frame happens to be the first step. Next we shall work with an interesting example related to football, and finally, to understand data frames in Spark in a much better way, we shall work with the most trending topic as our use case, which is none other than Game of Thrones. So let's get started. What is a data frame?
In simple terms, a data frame can be considered a distributed collection of data. The data is organized under named columns, which provide us operations to filter, group, process and aggregate the available data. Data frames can also be used with Spark SQL, and we can construct data frames from structured data files, RDDs, or from external storage like HDFS, Hive, Cassandra, HBase and many more. With this, let us look at a simplified example which gives a basic description of a data frame. We shall deal with an employee database, where we have entities and their data types. The name of the employee is the first entity and its respective data type is string; similarly, the employee ID is of string data type, the employee phone number is of integer data type, the employee address is of string data type, and finally the employee salary is of float data type. All this data is stored into external storage, which may be HDFS, Hive or Cassandra, using the data frame API, with the respective schema consisting of the name of each entity along with its data type. Now that we have understood what
exactly a data frame is, let us quickly move on to the next stage, where we shall understand the requirements for a data frame. It provides support for multiple programming languages, it has the capacity to work with multiple data sources, it can process both structured and unstructured data, and finally it is well versed with slicing and dicing data. The first requirement is support for multiple programming languages. The IT industry required a powerful and integrated data structure which could support multiple programming languages at the same time, without requiring additional APIs; the data frame was the one-stop solution which supported multiple languages through a single API, the most popular languages being R, Python, Scala, Java and many more. The next requirement was to support multiple data sources. We all know that in a real-time approach, data processing will never end up at a single data source; the data frame is one such data structure which has the capability to support and process data from a variety of data sources, with Hadoop, Cassandra, JSON files, HBase and CSV files being a few examples. The next requirement was to process structured and unstructured data. The big data environment was designed to store huge amounts of data regardless of its type, and Spark's data frame is designed in such a way that it can store a huge collection of both structured and unstructured data in a tabular format along with its schema. The last requirement was slicing and dicing data: the humongous amount of data stored in a Spark data frame can be sliced and diced using operations like filter, select, groupBy, orderBy and many more, and these operations are applied upon the data stored in the form of rows and columns in a data frame. These were a few crucial requirements which led to the invention of data frames. Now, let us get
into the important features of data frames, which give them an edge over the other alternatives: immutability, lazy evaluation, fault tolerance and distributed memory storage. Let us discuss each feature in detail. The first one is immutability: similar to resilient distributed datasets, the data frames in Spark are also immutable. The term immutable depicts that the data stored in a data frame will not be altered; the only way to alter the data present in a data frame is by applying transformation operations onto it, which produce a new data frame. The next feature is lazy evaluation, which is the key to the remarkable performance offered by Spark: similar to RDDs, data frames in Spark will not produce any output until and unless an action command is encountered. The next feature is fault tolerance: Spark's data frames do not lose their data, as they follow the principle of being fault tolerant to the unexpected calamities which tend to destroy the available data. The next feature is distributed storage: Spark's data frames distribute the data across multiple locations so that, in case of a node failure, the next available node can take its place to continue the data processing. The next point is about the multiple data sources that the Spark data frame can support. The Spark API can integrate itself with multiple programming languages such as Scala, Java, Python and R, and with SQL, making itself capable of handling a variety of data sources such as Hadoop, Hive, HBase, Cassandra, JSON files, CSV files, MySQL
and many more. So this was the theory part and now let us move
into the Practical part where the creation of a dataframe happens
to be the first step. So before we begin the practical part, let us load the libraries required in order to process the data in data frames. These are the few libraries we require before we process the data using data frames. Now that we have loaded all the libraries, let us begin with the creation of our data frame. We shall create a new data frame with the name employee and load the data of the employees present in an organization; the details of the employees consist of the first name, the last name and the mail ID, along with the salary. So the first data frame has been successfully created. Now let us design the schema for this data frame. The schema is described as shown: the first name is of string data type, and similarly the last name is of string data type along with the mail address, and finally the salary is of integer data type (you can also give it float data type). The schema has been successfully defined; now let us create the data frame using the createDataFrame function. Here I'm creating a new data frame by starting a Spark context, using the createDataFrame method and loading the data from employee and the employee schema. The data frame has been created; now let's print the data existing in the data frame empDF. I am using the show method here, and the data present in empDF has been successfully printed.
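A rough sketch of what this creation step could look like in the spark-shell (the column names, sample rows and schema here are illustrative assumptions, not the exact values used in the video):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // sample employee rows (illustrative data)
    val employee = Seq(
      Row("John", "Smith", "john@example.com", 25000),
      Row("Mary", "Jones", "mary@example.com", 30000)
    )

    // schema: first name, last name, mail ID, salary
    val employeeSchema = StructType(List(
      StructField("firstName", StringType, true),
      StructField("lastName",  StringType, true),
      StructField("mail",      StringType, true),
      StructField("salary",    IntegerType, true)
    ))

    val empDF = spark.createDataFrame(sc.parallelize(employee), employeeSchema)
    empDF.show()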
Now let us move on to the next step in today's discussion: working with an example related to the FIFA dataset. The first step in our FIFA example is loading the schema for the CSV file we are working with; the schema has been successfully loaded. Now let us load the CSV file from our external storage, which is HDFS, into our data frame, fifaDF. The CSV file has been successfully loaded into the new data frame, so let us print the schema of the data frame using the printSchema command. The schema has been displayed, and we have the following credentials of each and every player in our CSV file. Now let's move on to further operations on the data frame. We will count the total number of player records in our CSV file using the count command; we have a total of 18,207 players. Now let us look at the details of the columns we are working with: these consist of the ID of the player, name, age, nationality, potential and many more. Next, let us take the value column, which holds the worth of each and every player, and use the describe command in order to see the highest and the lowest value given to a player. So we have a count of 18,207 players, the minimum worth given to a player is 0 and the maximum is 9 million pounds. Now let us use the select command in order to extract the name and nationality columns and find the name of each player along with his nationality; here we can display the top 20 rows with each player and his nationality. Similarly, let us find the players playing for a particular club: here we have the top 20 players playing for their respective clubs along with their names, for example Messi playing for Barcelona, Ronaldo for Juventus, and so on. Now let's move to the next stage: let us find the players who are most active in a particular national team or club with age less than 30 years; we shall use the filter transformation to apply this operation. Here we have the details of the players whose age is less than 30 years, their club and nationality, along with their jersey numbers. With this, we have finished our FIFA example.
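A sketch of the FIFA steps just described (the file path and the column names such as "Value", "Name", "Nationality", "Club" and "Age" are assumptions about the dataset, not confirmed by the video):

    // read the CSV with a header into a data frame (path is hypothetical)
    val fifaDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs://localhost:9000/data/fifa.csv")

    fifaDF.printSchema()
    fifaDF.count()                                   // ~18,207 players
    fifaDF.describe("Value").show()                  // min / max worth of a player
    fifaDF.select("Name", "Nationality").show()      // top 20 players with nationality
    fifaDF.select("Name", "Club").show()
    fifaDF.filter(fifaDF("Age") < 30)
      .select("Name", "Age", "Club", "Nationality").show()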
Now, to understand data frames in a much better way, let us move on to our use case, which is about the hottest topic, Game of Thrones. Similar to our previous example, let us design the schema of the CSV file first. This is the schema for our first CSV file, which holds the data about the Game of Thrones character deaths. Now let us create the schema for the next CSV file: I have named it schema2 and defined the data types for each entity, so the schema has been successfully designed for the second CSV file as well. Now let us load the CSV files from our external storage, which is HDFS. The location of the first CSV file, character_deaths.csv, is on HDFS as defined above, the schema is provided as schema, and the header-true option is also provided; we are using the spark.read function for this and loading the data into our new data frame, the Game of Thrones data frame. Similarly, let's load the other CSV file, battles.csv, into another data frame, the Game of Thrones battles data frame. The CSV files have been successfully loaded.
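The loading step might look roughly like this (the HDFS paths and data frame names are assumptions; the schemas are the ones defined just above):

    // read both Game of Thrones CSV files with explicit schemas (paths are hypothetical)
    val gotDF = spark.read
      .option("header", "true")
      .schema(schema)                 // schema defined for character_deaths.csv
      .csv("hdfs://localhost:9000/data/character_deaths.csv")

    val gotBattlesDF = spark.read
      .option("header", "true")
      .schema(schema2)                // schema2 defined for battles.csv
      .csv("hdfs://localhost:9000/data/battles.csv")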
Let us continue with the further operations. Now let us print the schema of the Game of Thrones data frame using the printSchema command. Here we have the schema, which consists of the name, allegiances, death year, book of death and many more. Similarly, let's print the schema of the Game of Thrones battles data frame; this is the schema for our new data frame. Now let us display the data frame we have created using the following command: the data frame has been successfully printed and this is the data we have in it. Now, let's continue
with the further operations. We know that there are multiple houses present in the story of Game of Thrones; now let us find each individual house present in the story. Let us use the following command in order to display every house in the Game of Thrones story, and these are the houses we get. Now let's continue with the further operations. The battles in Game of Thrones were fought for ages; let us classify the wars waged according to the year of their occurrence. We shall use the select and filter transformations and access the columns holding the details of the battle and the year in which it was fought. Let us first find the battles fought in the year 298: the following code consists of a filter transformation which provides the details we are looking for. According to the result, these were the battles fought in the year 298, and we have the details of the attacker kings and the defender kings, the outcome for the attacker, their commanders and the location where each war was fought. Now let us find the wars waged in the year 299; these are their details. Similarly, let us also find the wars waged in the year 300; these were the wars fought in the year 300.
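A sketch of the year-wise classification (the column names "year", "attacker_king", "defender_king", "attacker_outcome", "battle_type" and "location" are assumptions about the battles.csv layout):

    // battles fought in a given year; repeat with 299 and 300
    gotBattlesDF
      .filter(gotBattlesDF("year") === 298)
      .select("name", "year", "attacker_king", "defender_king",
              "attacker_outcome", "battle_type", "location")
      .show()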
Now, let's move on to the next operations in our use case. Let us find the tactics used in the wars waged, and also the total number of wars waged using each type of tactic; the following code will help us. Here we are using the select and groupBy operations in order to find each type of tactic used in the wars. So they have used ambush, siege, razing and pitched battle tactics, and most of the time they have used the pitched battle type. Now let us continue with the further operations. The ambush type of battles are the deadliest; let us find the kings who fought battles using this kind of tactic, and also the outcome of the battles fought. The following code will help us extract the data we need: we are using the select and where commands and selecting the columns year, attacker king, defender king, attacker outcome, battle type, attacker commander and defender commander. Now let us print the details: these were the battles fought using the ambush tactic, and these were the attacker kings and the defender kings along with their respective commanders and the wars waged in a particular year.
Now, let's move on to the next operation: let us focus on the houses and extract the deadliest house among the rest. The following code will help us find the deadliest house and the number of battles it waged. Here we have the details of each house and the battles it waged; according to the results, the Stark and Lannister houses are the deadliest among the others. Now let's continue with the rest of the operations and find the deadliest king among the others. We shall use the following command in order to find the deadliest king, that is, the one who fought the most wars. According to the results, we have Joffrey as the deadliest king, who fought a total of 14 battles.
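The deadliest house and king queries are essentially a group-and-count followed by an ordering; a sketch under the same column-name assumptions as above (the attacking-house column name "attacker_1" is an assumption):

    import org.apache.spark.sql.functions._

    // houses that attacked the most
    gotBattlesDF.groupBy("attacker_1")
      .count()
      .orderBy(desc("count"))
      .show()

    // kings that attacked the most
    gotBattlesDF.groupBy("attacker_king")
      .count()
      .orderBy(desc("count"))
      .show()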
Now, let us continue with the further operations and find out the houses which defended the most wars waged against them. The following code will help us find the details; according to the results, the Lannister house defended the most battles waged against it. Now let us find the defender king who defended the most battles waged against him: according to the result, Robb Stark is the king who defended the most battles waged against him. Let's continue with the further operations. Since the Lannister house is my personal favourite, let me find the details of the characters in the Lannister house. This code will describe their name and gender (1 for male and 0 for female) along with their respective counts. Let me first find the male characters in the Lannister house: here we have used the select and where commands in order to find the details of the characters present in the Lannister house, and the data has been stored into the DF1 data frame. Let us print the data present in the DF1 data frame using the show command; these are the male characters present in the Lannister house. Similarly, let us find the female characters present in the Lannister house: these are the characters in the Lannister house who are female. So we have a total of 69 male characters and 12 female characters in the Lannister house. Now, let us continue with the next operations. At the end of the day, every episode of Game of Thrones had
a noble character. Let us now find all the noble characters among all the houses that we have in our Game of Thrones CSV file; the following code will help us find the details. The details of all the characters from all the houses who are considered to be noble have been saved into the new data frame DF3; now let us print the details from the DF3 data frame. These are the top 20 members from all the houses who are considered noble, along with their genders. Now let us count the total number of noble characters from the entire Game of Thrones story: there are a total of 430 noble characters. Nonetheless, we have also seen a few commoners whose role in Game of Thrones is found to be exceptional, so we shall find the details of all those commoners who were highly dedicated to their roles in each episode. The data of all the commoners has been successfully loaded into the new data frame DF4; now let us print the data present in DF4 using the show command. These are the top 20 characters identified as commoners among all the Game of Thrones stories. Now let us count the total number of common characters: there are a total of 487 commoners among all the stories of Game of Thrones. Let us continue with the further operations. There were a few characters who were considered to be important and equally noble, and hence were carried on until the last book; let us filter out those characters and find the details of each one of them. The data of all the characters who are considered noble and carried on until the last book has been stored into a new data frame; now let us print the data existing in that data frame. According to the results, we have two candidates who are considered noble and whose characters are carried on until the last book. Amongst all the battles, I found the battles
of the last books to be generating more adrenaline in the readers. Let us find the details of those battles using the following code, which helps us find the wars fought in the last years of the Game of Thrones. These are the details of the wars fought in the last years, along with the details of the kings, their commanders and the locations where the wars were fought.

Welcome to this interesting session of the Spark SQL tutorial from Edureka. In today's session, we are going to learn about
how we will be working with Spark SQL and what you can expect from this particular session. We will first learn why Spark SQL, what libraries are present in Spark SQL and what the important features of Spark SQL are; we will also do some hands-on examples, and in the end we will see an interesting use case of stock market analysis. Now, why Spark SQL? Why are we learning it, why is it really important for us to know about it, is it really hot in the market, and if yes, why? We want all those answers from this session. If you're coming from a Hadoop background, you must have heard a lot about Apache Hive. In Apache Hive, SQL developers can write queries in the SQL way, and they get converted to MapReduce jobs which give you the output. Now, we all know that MapReduce is slower in nature, and since MapReduce is slower, your overall Hive query is also going to be slower. That was one challenge: even if you have, let's say, less than 200 GB of data, or a smaller dataset, the performance in Hive was not that great. Hive also does not have any resuming capability: if a query gets stuck, you can only start it again. Hive cannot even drop encrypted databases, which was also one of the challenges when you deal with the security side. Now, what has Spark SQL done? Spark SQL has solved
almost all of these problems. In the last sessions you have already learned how Spark is faster than MapReduce, so in this session we will take leverage of all that. Since Spark is faster because of in-memory computation (and we have already seen what in-memory computation is: whatever we compute happens directly in memory), Spark SQL is also bound to be faster. So if I talk about the advantages of Spark SQL over Hive: number one, it is going to be faster in comparison to Hive. A Hive query which is, let's say, taking around 10 minutes can be finished in Spark SQL in less than one minute; don't you think that's an awesome capability? Second, consider a company which has been developing Hive queries for the last 10 years. They were able to process their data, but they were worried about the performance, because Hive was not giving them the level of processing speed they were looking for. Now there's a challenge: they came to know about Spark SQL, and they learned that they can execute everything in Spark SQL and that it is going to be faster as well, fine. But if this company has been working in Hive for the past 10 years, they must have already written a lot of Hive code; if you ask them to migrate to Spark SQL, will it be an easy task? No, right, definitely it is not going to be an easy task. Why? Because Hive syntax and Spark SQL syntax, even though they both follow the SQL way of writing things, still carry a big difference, so it would take a very good amount of time for that company to convert all of their queries to the Spark SQL way. Spark SQL came up with a smart solution: even if you are writing the query in Hive, you can execute that Hive query directly through Spark SQL. Don't you think that is again a very important and awesome facility? Even if you're a good Hive developer, you need not worry about how you will migrate to Spark SQL; you can keep writing Hive queries, and your query will automatically get converted to Spark SQL.
Similarly, in Apache Spark, as we have learned in the past sessions, especially through Spark Streaming, Spark Streaming enables real-time processing; you can perform your real-time processing using Apache Spark. You can take leverage of this facility even in Spark SQL: you can do real-time processing and at the same time also perform your SQL queries. In Hive that was a problem; you cannot do that, because Hive is all about Hadoop, and Hadoop is all about batch processing, where you keep historical data and then later process it. So Hive also follows the same approach, only the batch processing mode, but Apache Spark also takes care of real-time processing. Now, how does all this happen? Spark SQL always uses the metastore services of your Hive to query the data stored and managed by Hive. When you were learning about Hive, you learned that in Hive everything we do is always stored in the metastore, so the metastore was the crucial point: using that metastore you were able to do everything, for example when you run any sort of query or create a table, everything gets stored in that same metastore. Spark SQL also uses the same metastore: whatever metastore you have created with respect to Hive can also be used for your Spark SQL, and that is something really awesome about Spark SQL, because you need not create a new metastore or worry about new storage space; everything you have done with respect to Hive can be reused. Now you can ask me: then how is it faster, if they are using the same metastore? Remember the processing part: Hive was slower because of its processing model, since it converts everything to MapReduce, and that made the processing very, very slow. Here, since the processing is in-memory computation, Spark SQL is always going to be faster. The metastore is only used to fetch the data and the metadata; for anything related to the processing stage, it happens in memory, thus it is going to be faster.
Now let's talk about some success stories of Spark SQL and look at some use cases. Twitter sentiment analysis: if you remember our Spark Streaming session, we did a Twitter sentiment analysis there. You have seen that we first got the data from Twitter with the help of Spark Streaming, and later we analyzed everything with the help of Spark SQL. So you can see the advantage of Spark SQL: in Twitter sentiment analysis, let's say you want to find out about Donald Trump, you fetch every tweet related to Donald Trump and then perform analysis, checking whether it is a positive, negative, neutral, very negative or very positive tweet. We have already seen the same example in that particular session. What we want to show is that once you're streaming the data in real time, you can also analyze it using Spark SQL, doing all the processing in real time. Similarly, in stock market analysis you can use Spark SQL in a lot of places, and you can adopt it for banking fraud cases, transactions and so on. Let's say your credit card is currently getting swiped in India, and in the next 10 minutes the same credit card is getting swiped in, let's say, the US; that is definitely not possible, right? So you do all that processing in real time: you detect everything with respect to Spark Streaming, and then you apply your Spark SQL to verify whether it matches the user's trend or not. All those things you can do with Spark SQL. Similarly, you can use it in the medical domain. Now let's talk about some Spark SQL features.
When SQL got combined with Spark, we started calling it Spark SQL. Now, definitely, when we are talking about SQL we are talking about either structured or semi-structured data; SQL queries cannot deal with unstructured data, so that is one thing you need to keep in mind. Spark SQL also supports various data formats. You can get data from Parquet; you must have heard about Parquet, that it is a columnar-based storage and a very compressed format of data, but not human readable. Similarly, you must have heard about JSON and Avro, where we keep the values as key-value pairs, and about Cassandra and HBase, which are NoSQL databases; you can get data from all of these sources. You can also convert your SQL queries to RDDs, so you will be able to perform all the transformation steps on them. Now, if we talk about performance and scalability, on this graph the red curve, which relates to Hadoop, is much higher compared to the blue one, which denotes the performance with respect to Spark SQL; so you can notice that Spark SQL performs much better in comparison to Hadoop. On the y-axis we are taking the running time and on the x-axis the number of iterations. Talking about a few more Spark SQL features: for example,
we have for example, you can create a connection
with simple your jdbc driver or odbc driver, right? These are simple
drivers being present. Now, you can create your connection with his path
SQL using all these drivers. You can also create a user
defined function means let's say if any function is
not available to you and that gives you can create
your own functions. Let's say if function
Is available use that if it is not available, you can create a UDF means
user-defined function and you can directly execute that user-defined function
and get your dessert sir. So this is one example
where we have shown that you can convert. Let's say if you don't have
an uppercase API present in subsequent how you
can create a simple UDF for a and can execute it. So if you notice there what we are doing
let's get this is my data. So if you notice in this case, this is data set is
my data part. So this is I'm generating
as a sequence. I'm creating it as a data frame
see this 2df part here. Now after that we
are creating a / U DF here and notice we are converting
any value which is coming to my upper case, right? We are using this to uppercase
API to convert it. We are importing this function
and then what we did now when we came here,
we are telling that okay. This is my UDF. So UDF is upper by because we have created
here also a zapper. So we are telling that this is my UDF
in the first step and then Then when we are using it, let's say with our datasets
what we are doing so data sets. We are passing year
that okay, whatever. We are doing convert it to my upper developer you DFX
convert it to my upper case. So see we are telling you
we have created our / UDF that is what we are passing
inside this text value. So now it is just
getting converted and giving you all the output
in your upper case way so you can notice
that this is your last value and this is your
uppercase value, right? So this got converted to my upper case
in this particular. Love it. Now. If you notice here
also same steps. We are how to we
can register all of our UDF. This is not being shown here. So now this is
how you can do that spark that UDF not register. So using this API, you can just register
your data frames now similarly, if you want to get the output after that you can get
it using this following me so you can use the show API
to get the output for this Sparks
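A sketch of the UDF flow being described, assuming a small illustrative dataset (the column name "text" and the sample values are assumptions):

    import spark.implicits._
    import org.apache.spark.sql.functions.udf

    // illustrative data frame with a single string column
    val dataset = Seq("hello", "spark sql", "edureka").toDF("text")

    // define an upper-case UDF and also register it for SQL use
    val upperUDF = udf((s: String) => s.toUpperCase)
    spark.udf.register("upperUDF", (s: String) => s.toUpperCase)

    // apply it through the DataFrame API ...
    dataset.select(upperUDF(dataset("text")).alias("upper_text")).show()

    // ... or through SQL on a temporary view
    dataset.createOrReplaceTempView("words")
    spark.sql("SELECT upperUDF(text) FROM words").show()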
Now, let's look at the Spark SQL architecture. What happens is, you get the data from various formats: let's say you can get it from CSV, from JSON, or through JDBC. There is a data source API, and using the data source API you fetch the data; after fetching the data you convert it to a data frame. So what is a data frame? In the last module we learned that when we were creating an RDD, let's say this was my cluster: this is one machine, this is another machine, this is another machine. When we were creating RDDs across the cluster, we were distributing all our values; let's say block B1 was there, so we were keeping the values in memory across machines, and we were calling that an RDD. Now, when we work with SQL, we have to store table data: let's say there is a table with column details, name and age, and some values. If I have to keep this data in my cluster, I will first keep the column details, name and age, in memory, and after that some of the row details; another part of the table, with other values, will sit on another node, but there as well you are going to have the column details along with the rows. If you notice, this sounds similar to an RDD, but it is not exactly like an RDD, because here we are not only keeping the data, we are also keeping the column information on all the data nodes, or worker nodes. So we keep the column details along with the row details, and this structure is called a data frame. That is what we are going to do: we convert the data to a data frame using the DataFrame API, and then, using the DataFrame DSL or using Spark SQL or HQL, we process the results and produce the output; we will learn about all these things in detail. Now, let's see the Spark SQL libraries. There are multiple APIs available,
data source API we have data frame API. We have interpreter
and Optimizer and SQL service. We will explore
all this in detail. So let's talk about
data source appear if we talk about data source API
what happens in data source API, it is used to read and store the structured
and unstructured data into your spark SQL. So as you can notice in Sparks
equal we can give fetch the data using multiple sources like you can get it
from hive take Cosette. Inverse ESP Apache
BSD base Oracle DB so many formats available, right? So this API is going to help you to get all the data
to read all the data store it where ever you want to use it. Now after that your data frame API is going
to help you to convert that into a named Colin and remember I
just explained you that how you store
the data in that because here you are not keeping
like I did it. You're also keeping
the named column as well as Road it is That is
the difference coming up here. So that is
what it is converting. In this case. We are using data
frame API to convert it into your named column
and rows, right? So that is what you
will be doing. So at it also follows the same
properties like your IDs like your attitude is
Pearl easily evaluated in all same properties
will also follow up here. Okay now interpret
Now, the Interpreter and Optimizer step: what happens here? From the DataFrame API we first create the named columns, after that we create an RDD, we apply our transformation steps and we do our action step to output the value; all of that happens in the Interpreter and Optimizer. Next, let's talk about the SQL service. The SQL service is again going to help you, because after the transformations and actions, using the Spark SQL service you get your Spark SQL outputs. Whatever processing you have done, in terms of transformations and all of that, the Spark SQL service is the entry point for working with structured data in Apache Spark; it helps you fetch the results from your optimized or interpreted data. This completes the whole diagram. Now, let us see how we can perform our queries using Spark SQL. If we talk
about Spark SQL queries: first of all, we can go to the spark shell itself and execute everything there; you can also execute your program using Eclipse, directly from there. So let's say you are logged in to your spark shell session. First you need to import SparkSession, because in the 2.x versions you must have heard that there is something called SparkSession; in our last session we have already learned about all these things. So SparkSession is something we import; after that we create the session, spark, using a builder function: look at this, we use the builder API, then appName, we provide a configuration, and then we call getOrCreate. Then we import all the implicits; once imported, we say that we want to read this employee.json, and in the end we want to output the value. So this df becomes my data frame containing the values of my employee.json; the JSON gets converted to my data frame, and in the end we just output the result. If you notice here, we first import SparkSession and execute it, then we build our session, then we import, then we read the JSON file using the read.json API, reading our employee.json which is present in this particular directory, and we output it. You can see that JSON is a key-value format, but when I do df.show it just shows all my values here.
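A sketch of that flow (the path, app name and config value are placeholders):

    import org.apache.spark.sql.SparkSession

    // build (or reuse) a SparkSession
    val spark = SparkSession.builder()
      .appName("SparkSQLExample")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    import spark.implicits._

    // read a JSON file into a data frame and display it
    val df = spark.read.json("path/to/employee.json")
    df.show()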
Now, let's see how we can create our dataset. We have understood the theory of how a dataset is created; in a dataset, first of all, we create a case class, and you can see we are creating a case class Employee. Then we just create a sequence, put in the values for the name and age columns, and display the output of this dataset. We also create a primitive dataset to demonstrate mapping of data frames to datasets; you can notice that we are using toDS instead of toDF in this case. Now, you may ask what the difference is with respect to a data frame. A data frame and a dataset both look the same: a dataset also has the named columns, rows and everything; it was introduced later, in version 1.6 and onward. What it provides is an encoder mechanism, using which, when you are, let's say, reading the data back, you do not have to do the deserialization step, so it is going to be faster: performance-wise the dataset is better, and that's the reason it was introduced later, and nowadays people are moving from data frames to datasets. So we just output it in the end, and you can see the same thing in the output: we create the Employee case class, put the values inside it, create a dataset and look at the values. These are the steps we have just understood. Now, how can we read a file? To read the file as a dataset, we use read.json followed by as[Employee]; Employee was, remember, the case class which we created in the last step. So we create it like this, and we just output this value with show; we can see this output as well.
Now, let's see how we can add a schema to an RDD. In order to add the schema to an RDD, you can see that we import all the required libraries; after that we use the Spark context textFile to read the data, split it with respect to a comma, map the attributes to the Employee case class, convert the age value to an integer, and then convert it to a data frame with toDF. After that we create a temporary view or table: let's create this temporary view, employee. Then we use spark.sql and pass in our SQL query; you can notice that we are passing the query and accessing this employee. Now, what is this employee? It is the temporary view which we have created, because the challenge in Spark SQL is that whenever you want to execute an SQL query, you cannot say select * from the data frame; that is not even supported. Instead, we need to create a temporary table or a temporary view; you can notice we are using createOrReplaceTempView, and why replace? Because if it already exists, we override on top of it. So now we are creating a temporary table which will be exactly similar to this data frame, and then you can directly execute all the queries on your temporary view or temporary table: you can notice that, instead of using employeeDF, which was our data frame, I am using the temporary view here. Then, in the end, we just map the names and ages and output the values; the rest is just the execution part of it, where we show all the steps, and in the end we output all these values.
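A sketch of this reflection-based approach (the text file path and its "name,age" layout are assumptions; Employee is the case class from the earlier sketch):

    import spark.implicits._

    // infer the schema from the case class by mapping each CSV line to Employee
    val employeeDF = spark.sparkContext
      .textFile("path/to/employee.txt")          // lines like "John,28"
      .map(_.split(","))
      .map(attrs => Employee(attrs(0), attrs(1).trim.toLong))
      .toDF()

    // SQL only works against a view, not against the DataFrame directly
    employeeDF.createOrReplaceTempView("employee")
    val youngstersDF = spark.sql(
      "SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30")
    youngstersDF.map(row => "Name: " + row(0)).show()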
Let's see this transformation step now. In this case, you can notice that we can map over the youngstersDF: we convert the name into a string for the transformation part, checking that this is the string-type name, and we just show this value. Then we use the map encoder from the implicits, which is available to us, to map the name and age; remember, in the Employee case class we have the name and age columns that we want to map. So in this case we are mapping the names to the ages: you can notice that we are doing this for the ages of our youngstersDF data frame, which we created earlier, and the result is an array with the names mapped to their respective ages. You can see this output here, for example name is John and age is 28; that is what we are talking about. So while earlier it was represented one way, in this case the output comes out in this particular mapped format. Now, let's talk about
how we can read the file and add the schema to it programmatically. We first import the types class into the spark shell; that is what we have done with the import statement. Then we import the Row class into the spark shell; Row will be used in mapping the RDD to the schema. You can notice we're importing this as well, and then we are creating an RDD called employeeRDD; in this case you can notice that we create the same kind of RDD with the help of the textFile API. Once we have created it, we define our schema: this is the schema string, which we define as name, then a space, then age, because these are the two fields we have in the employee.txt data, name and age. Once we have done that, we split the schema string with respect to the space, map each field name and pass it into a StructField; so we are defining a fields array, which is what we obtain after mapping over the schema string. Then, in the end, we turn this fields array into a schema: we pass it into a StructType and it gets converted into our schema. That is what we will do; you can see all this execution, the same steps we are just executing
in this terminal. Now, let's see how we are going to transform the results. We have already created the RDD; now let's create a row RDD: we want to transform the employeeRDD using the map function into rowRDD. So look at this: we take employeeRDD, split each line with respect to the comma, and then, since we have name and then age, we take attributes(0) and attributes(1); and why are we trimming? Just to ensure there are no stray spaces, which we don't want to keep unnecessarily, and that's the reason we define this trim statement. Once we are done with this, we define a data frame, employeeDF, and we combine the rowRDD and the schema into it. So if you notice, the rowRDD which we defined here and the schema which we defined in the last step (remember, the schema we created with respect to the fields) are both passed in, and we say that we are going to create a data frame; this helps us in creating the data frame. Now we can create our temporary view on the basis of employeeDF: let's create an employee temporary view, and then we can execute any SQL query on top of it. As you can see, with spark.sql we can write all the SQL queries and directly execute them. Now we want to output the values, and we can quickly do that: let's say we want to display the names; we can say, okay, attributes(0) contains the name, and we can use the show command. So this is how we perform the operations in the programmatic schema way. This is the same output; we're just executing this whole thing, and you can notice here as well that we are just using attributes(0), which represents my output.
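A sketch of this programmatic-schema approach (the file path and the name/age layout are assumptions, consistent with the earlier sketches):

    import spark.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    val employeeRDD = spark.sparkContext.textFile("path/to/employee.txt")

    // build the schema from a plain string
    val schemaString = "name age"
    val fields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    // convert each line to a Row, then attach the schema
    val rowRDD = employeeRDD.map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim))

    val employeeDF = spark.createDataFrame(rowRDD, schema)
    employeeDF.createOrReplaceTempView("employee")
    spark.sql("SELECT name FROM employee")
      .map(attributes => "Name: " + attributes(0)).show()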
Now, let's talk about JSON data: how we can load JSON files and work on them. In this case, we first import our libraries, and once we are done with that, we can just say read.json and bring up our employee.json; you can see the execution of this part. Similarly, we can also write the data back in Parquet format, or read values from Parquet. If we want to write, let's say, this employeeDF data frame in the Parquet way, I can say write.parquet: an employee.parquet will be created, and all the values will be converted into employee.parquet. The only thing is that if you go and look, employee.parquet will be a directory that gets created, and you will notice that you cannot read the data in it, because it's not human readable. Now, let's say you want to read it back: you can bring it back by using read.parquet, reading this employee.parquet which we just created, then you create a temporary view or temporary table, and then, by using standard SQL, you can execute queries on your temporary table. In this way you can read your Parquet file data, and in the end we just display the result; you can see the similar output of this. This is how we can execute all these things.
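A sketch of the Parquet round trip (the directory and view names are placeholders):

    // write the data frame out as Parquet (creates the employee.parquet directory)
    employeeDF.write.parquet("employee.parquet")

    // read it back, expose it as a view, and query it with standard SQL
    val parquetDF = spark.read.parquet("employee.parquet")
    parquetDF.createOrReplaceTempView("parquetEmployee")
    spark.sql("SELECT name FROM parquetEmployee").show()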
Once we have done all this, let's see how we can create our data frames from a file path. Let's say we have this file, employee.json; we can create a data frame from the JSON path using read.json, and then we can print the schema: printSchema is going to print the schema of my employee data frame. Then we create a temporary view of this data frame, using createOrReplaceTempView, which we have seen last time as well, and after that we can execute our SQL query, for example selecting from employee where age is between 18 and 30. We can see the output of this execution: all the employees whose age is between 18 and 30 show up in the output. Now, let's see the RDD operation way. We are going to create another RDD, otherEmployeeRDD, which is going to store the content of an employee record, George from New Delhi, Delhi; see this part, here we create it using makeRDD, and it stores a JSON string containing that record, where New Delhi is the city name and Delhi is the state. Then we assign the content of this otherEmployeeRDD to otherEmployee by using spark.read.json and reading the value from it, and in the end we use the show API; you can notice the output coming up. Now, let's see how to work with a Hive table.
if you want to read that, so let's do it
with the case class and Spark sessions. So first of all, we are going to import
a guru class and we are going to use path session
into the Spartan. So let's do that for a way. I'm putting this row
this past session and not after that. We are going to create a class
record containing this key which is of integer data type and a value which is
of string type. Then we are going
to set our location of the warehouse location. Okay to this pathway rows. So that is what we are doing. Now. We are going to build
a spark sessions back to demonstrate the hive
example in spots equal. Look at this now, so we are creating Sparks
session dot Builder again. We are passing the Any app name to it we have passing
the configuration to it. And then we are saying
that we want to enable The Hive support now once we have done that we are importing
this spark SQL library center. And then you can notice
that we can use SQL so we can create now a table SRC so you can see create table
if not exist as RC with column to stores the data
as a key common value pair. So that is what we
are doing here. Now, you can see all
this execution of the same step. Now. Let's see the sequel operation
happening here now in this case what we can do. We can now load the data
from this example, which is present to succeed. Is this KV m dot txt file, which is available to us and we want to store it
into the table SRC which we have just
created and now if you want to just view the all
this output becomes a sequence select aesthetic form SRC and it is going to show up all the values you
can see this output. Okay. So this is the way you can show
up the virus now similarly we can perform the count operation. Okay, so we can say
select Counter-Strike from SRC to select the number
of keys in there. See tables, and now
select all the records, right so we can say
that key select key gamma value so you can see that we can perform all
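This flow closely follows the standard Spark SQL Hive example; here is a sketch of it, where the warehouse path and the kv1.txt path are assumptions.

```scala
import java.io.File
import org.apache.spark.sql.{Row, SparkSession}

case class Record(key: Int, value: String)

// assumed warehouse location for Hive-managed tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.sql

// create a key/value table and load the sample file into it
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

sql("SELECT * FROM src").show()          // view the loaded rows
sql("SELECT COUNT(*) FROM src").show()   // count the number of keys
```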
Similarly, we can create a Dataset of strings, stringsDS, from the sqlDF DataFrame which we already have: we just call map, provide a case pattern that matches each Row into its key-value pair, and in the end we show all the values. See this execution, and you can then notice the output we wanted.
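As a sketch (assuming the src table from the Hive example above and a spark-shell session), that map into a typed Dataset looks like this:

```scala
import org.apache.spark.sql.Row
import spark.implicits._   // provides the encoders needed by .map

// query the Hive table, then turn each Row into a formatted string
val sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
```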
Now let's see the rest of it. We can create another DataFrame here: a DataFrame recordsDF that stores records containing the values between 1 and 100. Then we create a temporary view, records, for it; that is what we are doing, creating a temporary view so that we can run our SQL queries over it. Now we can query all the values, and you can also notice that we are doing a join operation here: we can display the content of the join between the records view and the src table. We can do a join on this part, so we can also perform all the join operations and get the output.
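A sketch of that last step, reusing the Record case class and the src table from the earlier Hive sketch (the value format is an assumption):

```scala
import spark.implicits._

// a DataFrame of generated records with keys 1..100
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// join the generated records with the Hive table on the key column
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
```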
Now, let's see our use case. In this use case we are going to analyze the stock market with the help of Spark SQL. Let's understand the problem statement first. Everybody must be aware of the stock market: a lot of activity happens there, and you want to analyze it in order to make some profit out of it. So let's say our company has collected a lot of data for ten different companies and wants to do some computation on it. They want to compute the average closing price, list the companies with the highest closing prices, compute the average closing price per month, list the number of big price rises and falls, and compute some statistical correlation. These are the things we are going to do with the help of Spark SQL. This is a very common situation: we want to process huge data, we want to handle input from multiple sources, we want to process the data in real time, and it should be easy to use, not very complicated. All these requirements are handled by Spark SQL, and that's the reason we are going to use it. As I said, we are going to use ten companies, and on those ten companies we are going to perform our analysis. We will be using daily stock data from Yahoo Finance for all of these stocks. This is how the data will look: it has the date, the opening price, the high, the low, the closing price, the volume, and the adjusted close. All this data will be present. Now, let's see how we can implement this stock analysis using Spark SQL.
So what do we have to do for that? This is how my data flow diagram will look. Initially we have a huge amount of real-time stock data, which we are going to process through Spark SQL, converting it into named columns. Then we are going to create an RDD for functional programming. Then we use our Spark SQL queries, which will calculate things like the average closing price per year and the company with the best closing price per year, and through these stock SQL queries we get our outputs. It is not only this; we are also going to compute a few other queries, which we will execute as we go. This is how the flow will look: we initially have the data which I have just shown you, from it we create a DataFrame, then we create an RDD of the closing prices, and we will see what we do with it. Then we calculate the average closing price per year: we hit a Spark SQL query and get the result in a table. So what are we going to do in this case? First of all, we initialize Spark SQL: we import all the required libraries and then start our Spark session. After importing everything required, we create our case class with whatever fields are needed; you can notice that here. Then we define our parse-stock function and the stock schema, because we have already learnt how to create a schema. Then we define our parseRDD: in parseRDD we first remove the header line, then we read the CSV file for the first company, parse it, and convert it into a DataFrame, so we pass it through the RDD. Once we are done, if we want to print the output, we can do it with the help of the show API.
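Here is a rough sketch of that loading step for one company; the file name, header check, and column names are assumptions about how the demo data is laid out, and a spark-shell session is assumed.

```scala
import org.apache.spark.rdd.RDD
import spark.implicits._

// one row of daily stock data: date, open, high, low, close, volume, adjusted close
case class Stock(dt: String, openprice: Double, highprice: Double, lowprice: Double,
                 closeprice: Double, volume: Double, adjcloseprice: Double)

def parseStock(line: String): Stock = {
  val f = line.split(",")
  Stock(f(0), f(1).toDouble, f(2).toDouble, f(3).toDouble,
        f(4).toDouble, f(5).toDouble, f(6).toDouble)
}

// drop the header line, parse every remaining line into a Stock
def parseRDD(rdd: RDD[String]): RDD[Stock] =
  rdd.filter(!_.startsWith("Date")).map(parseStock)

val stockDF = parseRDD(spark.sparkContext.textFile("company1.csv")).toDF()
stockDF.show()
```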
with this now we want to let's say display the average of addressing closing price
for n and for every month, so if we can do all of that also
by using select query, right so we can say this data frame
dot select and pass whatever parameters are required
to get the average know, You can notice are inside this we are creating
the Elias of the things as well. So for this DT, we are creating
areas here, right? So we are creating the Elias
for it in a binder and we are showing
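A sketch of that monthly average, assuming the stockDF and column names from the previous sketch:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// average adjusted closing price grouped by year and month, with aliased columns
stockDF.select(year(to_date($"dt")).alias("yr"),
               month(to_date($"dt")).alias("mo"),
               $"adjcloseprice")
  .groupBy("yr", "mo")
  .agg(avg("adjcloseprice").alias("avg_close"))
  .orderBy("yr", "mo")
  .show()
```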
Here, what we are going to do now is check the days when the closing price for Microsoft went up by more than 2 dollars over the opening price, and we want to get and display that result. You can notice that wherever the difference is greater than 2 we get the value back. We hit an SQL query to do that, and you can see the query we are running on stocksMSFT, the DataFrame and temporary view we created for Microsoft. In that query we put the condition that the difference between the closing price and the opening price must be greater than 2: say the stock closed at 100 US dollars and in the morning it opened at, say, 98 US dollars; wherever the difference is 2 or greater, only that output do we want to get.
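In SQL form that check could look like the sketch below; the view name stocksMSFT and the threshold follow the description above, and the column names are the assumed ones from the earlier sketch.

```scala
stockDF.createOrReplaceTempView("stocksMSFT")

// days where the close finished more than 2 dollars above the open
spark.sql("""
  SELECT dt, openprice, closeprice, closeprice - openprice AS diff
  FROM stocksMSFT
  WHERE closeprice - openprice > 2
""").show()
```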
That is what we are doing here. Now, once we are done with that, we are going to use the join operation. We will be joining two of these stocks in order to compare their closing prices. First we create a union of these stocks and then display the joined rows. Look at what we do: we use Spark SQL, and if you look closely at the query, we select from one stock and join it with the other, and in the end we output it. Here you can see you can do a comparison of the closing prices for these stocks. You can also include more companies; right now we have just shown an example with two, but you can do it for more companies as well. Next, if you notice, we write this result in the Parquet file format and save it into a particular location, creating the joined stock data as a Parquet file. If we want to read it back we can do that and show the output, but whatever file you have saved as Parquet you will not be able to open and read directly, because Parquet files are in a binary columnar format which you can never read as plain text.
Next, you will see the average closing price per year. I am going to show you all these things running as well, not just explain how they run; I will show them in execution too. In this case, if you notice, we again create our DataFrame here and again execute a query on top of whatever table we have. Because we want to find the average closing price per year, we create a new table containing the average closing price of the first few companies and then display this new table. In the end we register this table as a temporary table so that we can execute our SQL queries on top of it. So you can notice that we create this new table, and into this new table we put our SQL query; that SQL query finds the average closing price of each of these companies per year. Then, with what we have now, we apply a transformation step on this new table, which we created with the year and the corresponding three-company data, to build the companyAll table. You can notice that we create this companyAll table: first we create the transformed table companyAll and display the output, so you can see that we hit the SQL query and in the end we print this output.
Similarly, if we want to compute the best average close, we can do that. Again it is the same approach; once you have learned the basic stuff, you will notice that everything follows a similar pattern. In this case we want to find the best of the average closing prices, so we create this bestCompany table. It should contain the best average closing price among the three companies, which we can get with the greatest function. We create that, display the output, and again register it as a temporary table. Once we have done that, we can hit our queries. Let's say we want to check the best-performing company per year. For that we create the final table in which we compute everything: we perform the join and the SQL queries needed to work out which company is doing the best, and then we display the output, which is what you see showing up here. We again store it as a temporary view.
Here, in the same way, comes the correlation part of the story. We will be using the statistics library to find the correlation between two companies' closing prices. Correlation, in the finance and investment industries, is a statistic that measures the degree to which two securities move in relation to each other; the closer the correlation is to 1, the stronger it is. It is simply about how two variables relate to each other. For example, your age is correlated with your salary: when you are young you usually earn less, and as you get older you typically earn more because you are more experienced. Similarly, your salary also depends on your educational qualification and on the institute where you studied; if you are from IIT or IIM, your salary will probably be higher than from other campuses. It is a matter of probability, that's all I am saying. So if I had to correlate education and salary in this case, I could easily build a correlation; that is what correlation means, and we are going to do all of that with respect to our stock analysis. Now, what are we doing in this case? You can notice that we create series1, where we hit a select query, map out the first company's closing prices, and convert them to an RDD; in the same way we do it for series2 with the second company. Then, in the end, we use Statistics.corr to compute the correlation between them.
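A sketch of that correlation step; stockDF1 and stockDF2 are hypothetical DataFrames for the two companies, assumed to have rows for the same trading days in the same order.

```scala
import org.apache.spark.mllib.stat.Statistics

// pull each company's adjusted closing prices out as an RDD[Double]
val series1 = stockDF1.select("adjcloseprice").rdd.map(_.getDouble(0))
val series2 = stockDF2.select("adjcloseprice").rdd.map(_.getDouble(0))

// Pearson correlation between the two closing-price series
val correlation = Statistics.corr(series1, series2, "pearson")
println(s"Correlation between the two closing-price series: $correlation")
```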
So you can see this is how we can execute everything. Now let's go to our VM and see everything in execution. There is a question from one of the attendees: how will we get this VM? You will be getting this VM from Edureka, so you need not worry about that; once you enroll for the course, you get the VM from Edureka itself. Even if you are working on a Mac operating system, the VM will work; every operating system is supported, so you can run it on any OS without trouble. Edureka makes sure you are not troubled with any of the setup: they ensure that whatever is required for your practicals is taken care of. That is why they have created their own VM, which is also smaller in size compared to the Cloudera or Hortonworks VMs, and that is definitely more helpful for you. All of this will be provided to you. Another question: will I learn this whole project from the sessions? Yes. Right now we have just had a high-level view of how the sessions look, but when we actually teach all these things in the course, it is in a much more detailed format: we keep showing you each step in detail, including the projects, so you will also be learning with the help of a project on each topic. And if I am stuck in a project, who will help me? There is a 24x7 support team: if you get stuck at any moment, you just need to raise a ticket, make a call, or send an email, and the technical team will immediately help you. They are all technical people, and even the trainers will assist you with any technical query. Great, awesome, thank you. Now, if you notice, this is my data; we were executing all the operations on this data.
Now, what we want to do: if you notice, this is the same code which I have just shown you; let us execute it. In order to execute it we connect to the spark shell, so let's get connected, and once the shell is connected we will go step by step. First we import our packages; this takes some time while it connects. Once it is connected, you can notice that I am importing all the important libraries we have already learned about. After that you initialise your Spark session, so let's do that, again the same steps as before. Once that is done, we create our Stock case class. We could also have executed this directly from Eclipse; I just want to show you step by step whatever we have learnt. Now you can see, for company one, we get the file, create an RDD from it, and do the parsing, and if we want to do some computation we can even look at the values; that is what we are doing here, so let's execute this. Similarly for the second company, the next one, and so on; I am just copying all of these together because there are a lot of companies for which we have to do the same step, so let's run it for all the ten companies. As you can see, printSchema is giving its output now, and I can execute the rest in the same way; all the outputs will be shown here for company four, company five, and all the other companies, and you can see this in execution. After that we create our temporary views so that we can execute our SQL queries, so let's do it for company one and then create the temporary tables for the rest. Once we are done, we can run our queries. For example, we can display the average adjusted closing price for the first company for each month, so we hit this query. All these queries run on the temporary views, because we run the SQL statements against the registered views rather than directly on the DataFrames. You can see this getting executed; because we have called .show, that is why you are getting this output.
Similarly, if we want to list the closing prices for MSFT which went up by more than 2 dollars in a day, we can execute that query now; we have already understood this query in detail, and seeing its execution helps you appreciate what you have learned. See, this is the output showing up. After that, we join the stocks' closing prices in the same way, save the joined view as a Parquet table, read it back, and create a new table. Let's execute these three queries together because we have already walked through them. Look at this: in this case we do the join and display its output, then we save it into the Parquet file, and then we read it back again. Then we create our new table, doing that join and so on, and then we want to see this output and register it again as a temp table. Once we are done with this step as well, which was step six, we want to perform a transformation on the new table corresponding to the three companies so that we can compare them; we want to create the bestCompany table containing the best average closing price for these three companies, and we want to find the company with the best average closing price per year. So let's do all of that: you can see the best company of the year. Here also we do the same thing,
registering it as a temp table. Okay, there is a small mistake here: if you notice, in one place the name ends with 1, but in the other place we refer to a different name, so there is a mismatch. I am just correcting it; here also it should be 1, and I am updating it in the sheet itself so that it starts working. Wherever the name appears, I have to make it consistent; that is the change I need to do here as well, and you will notice it starts working. This is actually a good point: whenever you work on something like this, you always need to make sure the names and values you put in match everywhere. In fact, let me also explain why the error started from here: the DataFrame I created here uses the 1 suffix. Let's execute it now, and you will notice the query starts to work; see, this is working. After that I create a temp table, say companyAll; this is the temp table we have created, and you can see it. In this case, if I keep using companyAll itself it is going to work, because here I am anyway querying the temporary table we created. So let's execute, and you can see it has started working. Further to that, we want to compute the correlation between the two companies, so we can do that; this is going to give me the correlation between the two columns, and we can see it here. This is the correlation: the closer it is to 1, the stronger it is, and here it is about 0.9, which is a large value, so the two are highly correlated, meaning they definitely influence each other's stock price. So this is all about the project. Now, welcome to this interesting session on Spark Streaming from Edureka. What is Spark Streaming? Is it really important? Definitely, yes. Is it really hot? Definitely, yes; that is the reason we are learning this technology, and it is one of the most sought-after skills in the market. When I say hot, I mean in terms of the job market.
So let's see what our agenda for today will be. We are going to discuss the Spark ecosystem, where we will see what Spark is and how Spark Streaming fits into the wider Spark ecosystem, and why Spark Streaming. We will have an overview of Spark Streaming and get into the basics of it. We will learn about DStreams and about DStream transformations. We will learn about caching and persistence, accumulators, broadcast variables, and checkpoints; these are the more advanced concepts of Spark Streaming. And then at the end we will walk through a use case of Twitter sentiment analysis. Now, what is streaming?
Let's understand that; let me start with an example. Suppose there is a bank. I am pretty sure all of you have used the credit cards and debit cards that banks provide. Now let's say you have done a transaction from India just now, and within an hour your card is getting swiped in the US. Is it even possible for your card to physically reach the US within an hour? Definitely not. So how will the bank realise that it is a fraudulent transaction? The bank cannot let that transaction happen; they need to stop it at the time the card is being swiped, either by blocking it or by giving you a call to ask whether it is a genuine transaction, something of that sort. Do you think they put a person behind the scenes manually looking at all the transactions and blocking them by hand? No. They require something where the data is continuously streamed in, and in real time, with the help of some pattern detection, they do some processing; if a transaction does not look genuine, they immediately block it, give you a call, or maybe send an OTP to confirm whether it is genuine. They do not wait until the next day to complete that check, otherwise nobody would catch it. That is how stream processing works. It has been said that without stream processing of data, big data is not even possible, and we cannot even talk about the internet of things; that is a well-known statement in the industry. Companies like YouTube, Netflix, Facebook, Twitter, iTunes, and Pandora are all using streaming. So what is streaming? We have just seen an example to get an idea of it. As the internet keeps growing, these streaming technologies are becoming more popular day by day. Streaming is a technique to transfer data so that it can be processed as a steady and continuous flow, meaning that as and when the data arrives, you are continuously processing it as well. In fact, this real-time streaming is what is driving big data and the internet of things. There will be a lot of pieces to it: the fundamental unit of Spark Streaming, the DStream; the way we transform our streams; and how companies plug it into their business intelligence. We will see more details in the further slides.
But before that, let's talk about the Spark ecosystem. When we talk about the Spark ecosystem, there are multiple libraries present in it. The first one is Spark SQL: with Spark SQL, a SQL developer can write a query in the SQL way, it gets converted into the Spark way, and then it gives you the output; it is somewhat analogous to Hive, but it is faster in comparison to Hive. Then there is Spark Streaming, which is what we are going to learn: it enables all the analytical and interactive applications for your live streaming data. MLlib is mostly for machine learning, and the interesting part about MLlib is that it has almost completely replaced Mahout; most of the core contributors of Mahout have moved towards MLlib because the response is faster and the performance is really good. Then there is GraphX. Let me give you an example: everybody must have used Google Maps. What do you do in Google Maps? You search for a path, you put in your source and your destination, and it evaluates different paths and provides you an optimal one. How does it find the optimal path? These kinds of things can be done with graph processing, so wherever you can model the problem as a graph, you can use GraphX. Then there is SparkR, the package provided for R. R is open source and is mostly used by analysts, and the Spark community wanted analysts to move towards Spark as well; that is why SparkR is supported, so that analysts can execute their queries using the Spark environment, get better performance, and also work on big data. That is all about the ecosystem components. Below all of this we have the core engine: the Spark Core engine is the one that defines all the basics of Spark, and all the RDD-related functionality is defined in the Spark Core engine.
Moving further: as we have just discussed, we are now going to look at Spark Streaming, which enables analytical and interactive applications for live streaming data. Why Spark Streaming? We have already seen that it is very important; that is why we are learning it. It is so powerful that it is now used by many companies to drive their marketing: they get an idea of what a customer is looking for. In fact, we are going to learn a use case along those lines, where we will use Spark Streaming for Twitter sentiment analysis, which can be used for crisis management, for checking how your products and services are being received, or for target marketing. Companies around the world use it in this way, and that is why Spark Streaming is gaining popularity; because of its performance, it is beating other platforms at the moment. Now, moving further, let's look at Spark Streaming's features. It is very easy to scale: you can scale to multiple nodes, even running on hundreds of nodes. Speed is very good, meaning in a very short time you can stream as well as process your data. It is fault tolerant, ensuring that you do not lose your data. It integrates your batch and real-time processing, and it can also be used for business analytics, which is used to track the behaviour of your customers. As you can see, this is quite powerful, and we are getting to know so many interesting things about Spark Streaming. Next, let's quickly have an overview so that we can get some of the basics of Spark Streaming.
As we have just discussed, Spark Streaming is for real-time streaming data. It is a useful addition to the Spark core API. We have already seen that at the base level of the ecosystem we have Spark Core, and on top of that we have Spark Streaming. In fact, Spark Streaming adds a lot of value to the Spark community, because many people join the Spark community specifically to use it; it is that powerful, and everyone wants to use it, because the other existing frameworks are not as good in terms of performance and ease of use. If you compare the programming effort against, say, Apache Storm, which is also used for real-time processing, you will notice that Spark Streaming is much easier from a developer's point of view; that is why so many people are showing interest in this area. It also enables high throughput and fault tolerance, so that you can stream your data and process everything reliably. The fundamental unit of Spark Streaming is the DStream. What is that? Let me explain. A DStream is basically a series of RDDs used to process the real-time data. If you look at the slide: when you get the data, it is a continuous stream, and you divide it into batches of input data; we call each of these a micro-batch, and out of them we get batches of processed data. It is still real time, but it works in batches, because you are always processing some small slice of the data even though it arrives continuously. That is what we call a micro-batch. Moving further, let's see a few more details on it.
From where can you get all your data? What can your data sources be here? We can stream the data from multiple sources such as Kafka and Flume; you also have data stores such as HBase and MongoDB, which are NoSQL databases, Elasticsearch, PostgreSQL, and the Parquet file format, and you can get data from all of these. After that, you can do the processing with the help of machine learning, or you can do the processing with the help of Spark SQL, and then produce the output. This is a very strong point: you bring the data in using Spark Streaming, but the processing can be done using other frameworks as well, like applying machine learning on the data you are getting in real time, or applying Spark SQL on the data you are receiving in real time. Moving further, this is the overall picture of Spark Streaming: you get the data from multiple sources like Kafka, Flume, HDFS, Kinesis, or Twitter, bring it into Spark Streaming, do the processing, and store it back to your HDFS, maybe push it to your databases, or publish it to a UI dashboard; Tableau, AngularJS, and many other UI dashboards are available in which you can publish your output.
Alright, let us break it down into a more fine-grained picture. We get our input data stream and feed it into Spark Streaming, which turns it into batches of input data; once the Spark engine executes them, we get batches of processed data. We have just seen the same diagram before, so the same explanation applies. Breaking it down further, we get a DStream, which is a continuous series of RDDs, multiple sets of RDDs over time. So we get, say, an RDD at time 1, and because real-time data keeps arriving, in the next second we get more data, and in the second after that we get more again. So we collect the data from time 0 to time 1 and say we have an RDD at time 1, and it keeps proceeding like that as time goes on. In the next step we extract the words from the input stream: if you notice what we are doing here, from some point we start applying our operations, whatever processing we want. As and when we get the data in a time frame, we start applying operations on it; it can be a flatMap operation, it can be any sort of operation, it can even be a machine learning operation, and then we generate the words or whatever results from it. So this is how, step by step, we can look at it: first at a very high level, then in more detail, and finally we have seen how we can process the data over time while we are streaming it as well.
Now, one important point: just as the Spark context is the main entry point for any Spark application, similarly, to work on Spark Streaming you require a streaming context. When you are passing your input data stream and working on the Spark Streaming engine, you have to use the streaming context; it is through the streaming context that you get the batches of input data. So a streaming context consumes a stream of data in Apache Spark, and it registers an input DStream to produce a receiver object. It is the main entry point, as we discussed: just as the Spark context is the main entry point for a Spark application, your streaming context is the entry point for your Spark Streaming application. Does that mean the Spark context is no longer an entry point? No: when you create a streaming context, it depends on your Spark context. You cannot create a streaming context without a Spark context, so the Spark context is definitely still required. Spark also provides a number of default source implementations, like reading data from Twitter, Akka actors, and ZeroMQ, which are accessible from the context, so it supports many things out of the box. Now, if you look at how we initialise a streaming context, this is just to give you an idea: we import these two libraries, and after that, you can see, I am passing the Spark context sc and an interval of one second, meaning we collect the data for every one second; you can increase this number if you want. Then this is your ssc: whatever happens in every one-second interval, I am going to process it.
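In code, that initialisation is just a couple of lines; this sketch assumes a spark-shell session where sc already exists.

```scala
import org.apache.spark._
import org.apache.spark.streaming._

// a streaming context built on top of the existing SparkContext,
// collecting and processing the incoming data every 1 second
val ssc = new StreamingContext(sc, Seconds(1))
```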
Now let's go to the DStream topic. The full form of DStream is discretized stream. It is the basic abstraction provided by the Spark Streaming framework: it represents a continuous stream of data, either the stream received from the source or the processed stream generated by transforming the input stream. The streaming context belongs to Spark Streaming, while the Spark context belongs to Spark Core; if you remember the ecosystem we discussed, the Spark context sits in Spark Core, and the streaming context is built with the help of the Spark context. In fact, it is only through the streaming context that you can perform your streaming work: just as without a Spark context a Spark application cannot do anything, without a streaming context your streaming application cannot do anything. The streaming context is simply built on top of the Spark context. So, a DStream is a continuous stream of data, received from a source or produced by transforming the input stream, and internally a DStream is represented by a continuous series of RDDs; this is important. Remember the example we just saw: every second, whatever data you have collected, you perform your operation on it. The data you collect in each interval is one RDD, and the whole continuous series of these RDDs is what we call the DStream. So in the first second, whatever data got collected, I process it; in the second second the same happens with that second's data, and in the third second again. Looking at the diagram, every second, whatever data gets collected, we do the processing on top of it, and the whole continuous series of RDDs you see here is called the DStream. Moving further, let's understand operations on the DStream. Say you are doing an operation on the DStream: you take the data from time 0 to 1, apply some operation on it, and whatever output you get you call, for example, the words DStream, because the operation you are applying is a flatMap into words. Similarly, whatever operation you apply, you get a corresponding output DStream for it as well. That is what is happening in this particular example.
Now, flatMap. flatMap is an API very similar to map; it kind of flattens out your values. Let me explain with an example. Suppose I have the line "hi this is edureka welcome", and let's say it sits in an RDD. Now, on this RDD, I apply flatMap, and I define a function for it: say I take each element a and call a.split, splitting with respect to space. What is going to happen in this case? I am not writing the exact syntax here, just giving you an idea of how it works. It is going to flatten up this line with respect to the split you mentioned, which means it creates each word as its own element: "hi" as one element, "this" as one element, "is" as another element, "edureka" as one element, and "welcome" as one element. That is how flatMap works: it flattens up your whole file. That is what we are doing in our stream as well, so this is how it will work.
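As a tiny sketch of that behaviour (plain RDDs in a spark-shell, not the streaming job itself):

```scala
// each line is split on spaces and every word becomes its own element
val lines = sc.parallelize(Seq("hi this is edureka", "welcome"))
val words = lines.flatMap(a => a.split(" "))
words.collect()   // Array(hi, this, is, edureka, welcome)
```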
Now that we have understood this part, let's understand input DStreams and receivers. What can the input DStreams be? They can come from basic sources or advanced sources. As basic sources we have file systems and socket connections; as advanced sources we have Kafka, Flume, and Kinesis. Input DStreams are the DStreams representing the stream of input data received from the streaming sources, and again there are the two types of sources we just discussed, basic and advanced. Moving further, let's see how this works. If you look here, some events come in, they go to your receiver, and then a DStream is formed: RDDs get created and we perform our steps on them. So the receiver sends the data into the DStream, where each batch contains RDDs; that is what the receiver is doing here.
Moving further, let's understand transformations on the DStream. There are multiple transformations available; let's talk about the most popular ones. We have map, flatMap, filter, reduce, and groupBy. You get your input data, you apply any of these operations, meaning any of these transformations, and a new DStream is created from it. Let's explore them one by one, starting with map. With map, whatever function you define is applied to each element, producing transformed batches of data; say x maps to y, z maps to x, and so on for every element. So map returns a new DStream by passing each element of the source DStream through a function which you have defined. Next, flatMap, which we have just discussed: it flattens things out. It is very similar to map, but each input item can be mapped to zero or more output items, and it returns a new DStream by passing each source element through a function. We have just seen an example of that, so it should be easy for you to see the difference between it and map. Moving further, filter: as the name states, you can filter out values. Say you have a huge data set and you want to work with only some filtered part of it, maybe remove some of it, maybe apply some logic such as keeping only the lines that contain a particular word; in that case the new DStream contains only the records matching that criterion, and most of the time the output is smaller than the input. Reduce does a kind of aggregation over the whole data; say at the end you want to sum up all the data you have, that can be done with the help of reduce. After that, groupBy: groupBy combines all the records with a common key. As you can see in this example, all the names starting with C get grouped together, all the names starting with J get grouped together, and so on.
Now, what is a windowed stream? To give you an example of windowing, everybody knows Twitter, so let me explain it that way. Say in the first 10 seconds the tweets coming in are hashtag A, hashtag A, hashtag A; the trending hashtag in that window is clearly A. Maybe in the next 10 seconds it is hashtag A, hashtag B, hashtag B, so B is trending in that window. Then in another 10 seconds it is hashtag B, hashtag B again, so B is trending there as well. But now I want to find out which hashtag is trending over the last 30 seconds; because I can combine those windows, I can compute that easily. That is a windowing operation: you are not only looking at your current window, but also at the previous windows. By current window I mean, say, the latest 10-second slot in which you are doing this computation; you are not computing only with respect to that current window, you are also taking the previous windows into account. Now, suppose I ask you: can you give me the output of what is trending in the last 17 seconds? You will not be able to answer, because you do not have partial information for 7 seconds; you have information for 10, 20, 30 seconds, multiples of the slide, but not for an intermediate point. Keep this in mind: you can only perform the windowing operation with respect to your window size and slide; you cannot compute an arbitrary partial window.
Now, let's get back to the slides. It is the same idea shown here: we are not only considering the current window, but also the previous windows. Next, let's understand the output operations on DStreams. The output operations allow the DStream's data to be pushed out to external systems: whatever processing you have done, you can store the output in multiple places, in a file system, in a database, or in other external systems. That is what is reflected here. As for the supported output operations: we can print the values; we can use saveAsTextFiles, which saves into your HDFS (if you want, you can also save into the local file system); you can save it as an object file; you can save it as a Hadoop file; or you can apply the foreachRDD function. What is the foreachRDD function? We usually spend more time on this in the Edureka sessions, but just to give you an idea: it is a very powerful primitive that allows your data to be sent out to external systems. Using it, you can send the data across to your database, web service, file system, or any external system; using it, you will be able to transfer the data and send it out to your external systems.
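A sketch of those two kinds of output operations; wordCounts is a hypothetical DStream, and the HDFS path and the println stand in for a real external sink.

```scala
// write every micro-batch out to HDFS (one directory per batch interval)
wordCounts.saveAsTextFiles("hdfs:///user/output/wordcounts")

// push each record to an external system, opening resources once per partition
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // open a connection to the external system here (database, web service, ...)
    records.foreach(record => println(record))   // placeholder for the real send
  }
}
```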
Now, let's understand caching and persistence. DStreams also allow developers to cache or persist the stream's data in memory: you can keep your data in memory, cached, for a longer time, and even after an action completes it is not deleted, so you can reuse it as many times as you want. You can simply use the persist method to do that. For input streams which receive data over the network, maybe using Kafka, Flume, or sockets, the default persistence level is set to replicate the data to two nodes for fault tolerance, so the data is also copied to a second node, as you can see in the diagram. Next, let's understand accumulators, broadcast variables, and checkpoints. These are mostly about performance: they help you improve and monitor the performance of your application. Accumulators are variables that are only added to through an associative and commutative operation. If you are coming from a Hadoop background and have done MapReduce programming, you must have seen counters; they are used to count things, which helps you debug the program and do some analysis from the console itself. You can do something similar with accumulators: you can implement your counters with them, and you can also sum things up with them. If you want to track them through the UI, you can do that as well; as you can see, the accumulators show up in the Spark UI. Similarly, we have broadcast variables. Broadcast variables allow the programmer to keep a read-only variable cached on all the machines, so it is effectively cached on every node. They can be used to give every node a copy of a large input data set in an efficient manner, and Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce the communication cost. As you can see here, we pass the broadcast value to the Spark context and it then broadcasts it to all these places; this is how it works in the application.
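A small sketch of both ideas together; the lookup map and the "bad records" counter are made-up examples, and sc is the spark-shell SparkContext.

```scala
// an accumulator used as a counter (it also shows up in the Spark UI)
val badRecords = sc.longAccumulator("Bad Records")

// a read-only lookup table cached on every executor
val lookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

sc.parallelize(Seq("IN", "US", "XX")).foreach { code =>
  if (!lookup.value.contains(code)) badRecords.add(1)   // count codes we cannot resolve
}

println(badRecords.value)   // prints 1 on the driver
```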
Generally, when we teach this in the full classes, since these are advanced concepts, we expand on them with practicals; right now I just want to give you an idea of what these things are. When you go through the practicals, seeing how the accumulators behave and how a variable gets broadcast, things become much clearer; for now I just want everybody to have a high-level overview. Moving further, what are checkpoints? Checkpoints are similar to checkpoints in gaming: they let the application run 24/7 and make it resilient to failures unrelated to the application logic. If you look at this diagram, we are creating the checkpoints. There is metadata checkpointing, which saves the information defining the streaming computation, and there is data checkpointing, which saves the generated RDDs to reliable storage; both of these generate checkpoints. Now, moving forward, we are going to move towards our project, where we will perform our Twitter sentiment analysis.
Let's discuss this very important use case of Twitter sentiment analysis. It is going to be very interesting, because we will do real-time sentiment analysis on Twitter. There can be many applications of sentiment analysis, but we will take a specific topic, and it will be fun. Generally, in the full course this is covered in more detail; in a webinar, going really deep is not very feasible, but during the Edureka training you will learn all of this thoroughly. So, let's talk about some use cases of Twitter. As I said, there can be multiple use cases, because there is a reason companies keep doing this: social media is extremely active these days. You must have noticed that even politicians have started using Twitter, and their tweets are shown on news channels along with statistics of how people are reacting to them, positively or negatively, whenever a politician posts something. The same goes for sports: if the FIFA World Cup is going on, Twitter will be filled with tweets about it. So how can we make use of this, how can we do some analysis on top of it? That is what we are going to learn. Sentiment analysis can be done for crisis management, for service adjusting, for target marketing, and more. When a new movie is released, the moviemakers can get an idea beforehand of how it is going to perform, roughly what range of profit it will make. Interesting, right? It is even used in political campaigns; you must have heard that in the US, when the presidential election was happening, social media and this kind of analysis were used heavily and played a major role in that election. Similarly, investors may want to predict whether they should invest in a particular company, or a company may want to decide which customers to target for advertisement, because we cannot target everyone; targeting everyone is very costly, so we want to narrow it down to the set of people for whom the advertisement will be most effective, which is also more cost effective. And if you want to review products and services, you can do that with this as well.
Now let's see it in terms of the use case; I will show you a practical of how it works. First of all, we import all the required packages, because we are going to perform Twitter sentiment analysis and we need some packages for that; that is the first step. Then we need to set all our authentication details, because without authentication we cannot do anything. Here the challenge is that we cannot directly put a username and password in the code; don't you think it would get compromised if we did? So Twitter came up with something very smart: OAuth authentication tokens. You go to dev.twitter.com, log in there, and you will find all these authentication tokens available to you; you take them and put them here. Then, using the DStream transformations we have learned, you do the computation, so you will have your DStream transformations in place. Then you generate your tweet data and save it into a particular directory. Once you are done with this, you extract the sentiment, and you're done.
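As a rough sketch of those steps (the OAuth keys, the keyword filter, and the output directory are placeholders, and the spark-streaming-twitter and twitter4j packages are assumed to be on the classpath):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// placeholder OAuth tokens -- generate your own at dev.twitter.com
System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

val ssc = new StreamingContext(sc, Seconds(5))
val tweets = TwitterUtils.createStream(ssc, None, Seq("Trump"))   // filter tweets by keyword

// keep just the text and save each micro-batch to an output directory
tweets.map(_.getText).saveAsTextFiles("output/tweets")

ssc.start()
ssc.awaitTermination()
```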
Now let me quickly show you how it works in our VM. One more interesting thing about Edureka is that you get all these preconfigured machines, so you need not worry about where you will get all of this from. Is it very difficult to install? When I was trying to set up this open-source software myself, some things did not work in my operating system, and we have generally seen people face such issues, so to resolve all of that we provide everything preconfigured. A question from one of the attendees: does this VM come preinstalled? Yes, it has everything preinstalled, whatever is required for your training; that is the best part of what we provide. In this case your Eclipse will already be there; you just need to go to your Eclipse location and double-click on it, so you need not go and install Eclipse, and even Spark will already be installed for you. Let us go into our project.
This is the project in front of you, the project we are going to walk through. You can see that we first imported all the libraries, then we set up our authentication, and then we applied the DStream transformations, extracted the sentiment, and saved the output at the end. These are the things we have done in this program; now let's execute it. Running this program is very simple: go to Run As and click on Scala Application. You will notice it starts executing the program. It is bringing in the tweets for Trump; we used Trump here, and most of these tweets turn out to be negative. Trump is anyway a hot topic, so if I make the window a little bigger you will notice a lot of negative tweets coming up. Now I am stopping it so that I can show you something. Yes, it is filtering the tweets: we have actually written the keyword in the program itself, at one location, and using that we are asking for tweets about Trump. Here we do the analysis, and it also tells us whether each tweet is positive or negative; since hardly any of the tweets in this stream read as positive, that is why you see them marked as negative. Similarly, for any other tweet we will get its sentiment, and if I keep it running we will see multiple negative tweets coming up. So that's how this program runs and how we execute it.
we can distract it. Even the output results will be
getting through at a location as you can see in this
if I go to my location here, this is my actual project
where it is running so you can just come to this location here
are on your output. All your output
is Getting through there so you can just take a look as but yes, so it's
everything is done by using space thing apart. Okay. That's what we've
seen right reverse that we were seeing
it with respect to these three transformations in a so we have done all that
with have both passed anybody. So that is one
of those awesome part about this that you can do such
a powerful things with respect to your with respect
Now, let's analyze the results. As we have just seen, it is classifying the tweets about Trump as positive or negative. This is where your output is getting stored, and as I showed you, the output will appear like this. The output explicitly tells you whether each tweet is neutral, positive, or negative, and we have done it all with the help of Spark Streaming alone. Now, we have done it for Trump, as I just explained: we have put the keyword in the program itself, and based on that alone we are getting all these results. We can apply the same sentiment analysis to other topics, just like we have learned. So I hope you have found all of this, especially this use case, very useful, and that it has shown you that, yes, this can be done with Spark. Right now we have put Trump there, but if you want you can keep changing the hashtag, because that is how we are doing it; you can keep changing the tag. Maybe a cricket match is going on and you want to pull the tweets for that: in that case, instead of Trump, you can put any player name or a team name, and you will see all those tweets coming up. That is how you can play with this. There are multiple examples we can play with, and this use case can be evolved into many other kinds of use cases; you can keep transforming it according to your own needs. So that's it about Spark Streaming, which is what I wanted to discuss. I hope you
have found it useful. Now, about classification: just to give you an example, you must have noticed the spam folder in your email; I hope everybody has seen the spam folder in their Gmail. When any new email comes in, how does Google decide whether it is a spam email or a normal email? That is done as an example of classification. Clustering: let's say I go to Google News and type something; it groups all the related news together, and that is called clustering. Regression is also one of the very important techniques. In regression, let's say you have a house and you want to sell that house, but you have no idea what the optimal price for your house should be; regression will help you achieve that. Collaborative filtering you might have seen when you go to the Amazon web page and they show you a recommendation, "you can buy this because you are buying that"; this is done with the help of collaborative filtering.
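None of this is part of the demo itself, but just to connect the regression idea to Spark, here is a tiny sketch using Spark 1.x MLlib with made-up house sizes and prices; a real model would of course need proper features and feature scaling.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  object HousePriceSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("HousePrice").setMaster("local[*]"))

      // Toy data: label = price, single feature = area in square feet
      val training = sc.parallelize(Seq(
        LabeledPoint(50.0,  Vectors.dense(1000.0)),
        LabeledPoint(75.0,  Vectors.dense(1500.0)),
        LabeledPoint(100.0, Vectors.dense(2000.0))
      )).cache()

      // Train a simple linear model (tiny step size because the feature is unscaled)
      val model = LinearRegressionWithSGD.train(training, 100, 1e-7)

      // Ask the model for a price estimate for an 1800 sq ft house (toy numbers, so
      // treat the value loosely)
      println(model.predict(Vectors.dense(1800.0)))

      sc.stop()
    }
  }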
Before I move to the project, I want to show you some practicals of how we will be executing Spark programs. So let me take you to the VM machine, which is provided by Edureka. These machines are also provided by Edureka, so you need not worry about where you will get the software from or how to install it; everything is taken care of. Once you come to this, you will see a machine like this; let me close this. What will happen is you will see a blank machine like this; let me show you. This is how your machine will look. Now, in order to start working, you will open the terminal by clicking on this black icon. After that, you can go to Spark. How can I work with Spark? In order to execute any program in Spark from the shell, you enter spark-shell; if you type spark-shell, it will take you to the Scala prompt, where you can write your Spark program using the Scala programming language. You can notice that it is also showing me version 1.5.2; that is the version of your Spark. You can also see that a SparkContext is available as sc when you get connected to your spark-shell; it is available to you by default. Let us get connected; it takes some time. Okay, we got connected to the Scala prompt. Now, if I want to come out of it, I will just type exit, and it will let me come out of the prompt. Secondly, I can also write my programs in Python: if I want to do programming in Spark but with the Python programming language, I will connect with PySpark, so I just need to type pyspark in order to get connected. I am not getting connected to it now because I am not going to require it; I will be explaining everything with Scala. But if you want to get connected, you can type pyspark. So let's again get connected to spark-shell, and meanwhile, while this is getting connected, let us create a small file. Currently, if you notice, I
don't have the file yet. Okay, I already have a.txt, so let's do cat a.txt. I have some data: one, two, three, four, five. This is my data, which is with me. Now, what I am going to do is push this file to HDFS. Let me first check whether it is already available in my HDFS: hadoop dfs -ls a.txt, just to quickly check if it is already there. There is no such file, so let me first put this file into HDFS with hadoop dfs -put a.txt; this will put it in the default HDFS location. Now, if I want to read it, I can do hadoop dfs -cat a.txt. Again, I am assuming that you are aware of these HDFS commands. You can see now this one, two, three, four, five coming from the Hadoop file system. Now, what I want to do is use this file in Spark. How can I do that? Let us
come over here. In Scala, we do not declare variables with explicit data types the way we do in Java, where we would write something like int a = 10 to define a variable. In Scala we do not use the data type; instead we use var. So if I write var a = 10, it will automatically identify that it is an integer value; notice it will tell me that a is of type Int. Now, if I want to update this value to 20, I can do that. But let's say I want to update it to "ABC" like this: this will throw an error, because a is already inferred as an integer and you are trying to assign the string "ABC" to it. That is the reason you got this error. Similarly, there is one more keyword, called val. If I write val b = 10, it works almost exactly the same, but there is one difference: in this case, if I do b = 20, you will see an error. And why this error? Because when you define something as a val, it is a constant; it is not going to be a variable anymore, it will be a constant, and that is the reason that if you define something as a val it will not be updatable; you will not be able to update that value. So this is how you will be writing your programs in Scala: var for your variables and val for your constants. Here is a short recap.
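In short, what we just saw at the prompt looks like this (a small recap, not the exact shell transcript):

  var a = 10        // type Int is inferred automatically
  a = 20            // fine: a var can be reassigned
  // a = "ABC"      // error: type mismatch -- a was inferred as Int

  val b = 10        // a val is a constant
  // b = 20         // error: reassignment to val is not allowed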
Now let's use this for the example we have learned. Let's say I want to create an RDD. So: val numbers = sc.textFile, and remember this API, we have learned sc.textFile already. Let me give it this file, a.txt. If I give it a.txt, it will create an RDD from this file; you can see it is telling me that it created an RDD of String type. Now, if I want to read this data, I will call numbers.collect. This will print the values that were available. Can you see, this line that you are seeing here is coming from memory: it is reading the RDD, and that is the reason it is showing up in this particular manner. So this is how you will be performing this step; the same two lines are sketched below.
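The two commands we just ran in the shell are, in essence (assuming a.txt is already in the default HDFS directory):

  val numbers = sc.textFile("a.txt")   // builds an RDD[String]; nothing is read yet
  numbers.collect()                    // action: reads the file and returns its lines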
Now, the second thing: I told you that Spark can work on standalone systems as well, right? So right now what was happening was that we executed this on top of HDFS. Now, if I want to execute this against our local file system, can I do that? Yes, it can still do that. What do you need to do for that? In that case the difference comes here: for the file you are giving here, instead of giving the path like that, you prefix it with the file:// keyword, and after that you give your local path. For example, /home/edureka is a local path, not an HDFS path, so you will be writing file:///home/edureka/a.txt. If you give this, it will load the file into memory, but not from your HDFS; instead it loads it from your local file system. That is the difference. As you can see, in the second case I am not even using my HDFS. Which means what? Now, can you tell me why this error? This is interesting. Why this error, "input path does not exist"? Because I have given a typo here. Okay. Now, if you notice, I did not get this error earlier; why did I not get the error that this file does not exist? I still did not get any error because of lazy evaluation. Lazy evaluation made sure that even though I had given the wrong path while creating the RDD, nothing had been executed yet; all the output, or the error you made by mistake, you receive only when you hit the action, which is collect. Now, in order to correct this, I fix the path and call collect on the RDD again, and this time if I execute it, it works. Okay, you can see this output: 1, 2, 3, 4, 5. So this time it works, and now we should be more clear about lazy evaluation: even if you give the wrong file name, it does not matter until an action runs. Here is a short recap.
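A short recap of what lazy evaluation did for us here; the path is the local one used in the VM, so adjust it for your machine.

  // Wrong file name: the RDD is still created, no error is raised at this point
  val bad = sc.textFile("file:///home/edureka/a.tx")
  // bad.collect()  -- only now would Spark fail with "Input path does not exist"

  // Corrected path: the action succeeds and returns the file's lines
  val good = sc.textFile("file:///home/edureka/a.txt")
  good.collect()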
Now, suppose I want to use Spark in a production environment, but not on top of Hadoop. Is it possible? Yes, you can do that, although usually that is not what you do. But yes, if you want to, you can: there are a lot of options, and you can also deploy it on your Amazon clusters, for example; a lot of things like that are possible. How will the data be distributed in that case? You will be using some other distributed storage system. In that case you are not using HDFS, so you will not get HDFS's distribution across the cluster or its redundancy, but you can use something like Amazon S3 for that. Okay, so that is how you will be using this. So this is how you will be performing your practicals and how you will be working with Spark; I will be training you on this, as I told you. So this is how things work. Now, let us see
an interesting use case. For that, let us go back to our slides; this is going to be very interesting. So let's see this use case. Look at this; this is very interesting. This use case is earthquake detection using Spark. In Japan, you might have already seen that there are so many earthquakes; you might not have seen one yourself, but you must have heard about it, that there are so many earthquakes that happen in Japan. Now, how do we solve that problem with Spark? I am just going to give you a glimpse of the kind of problems we solve in the sessions; we are definitely not going to walk through this in detail, but you will get an idea of how Spark fits in. Just to give you a little bit of a brief here: all these projects you will learn at the time of the sessions. So let's see how we will be using Spark here. Everybody must know what an earthquake is: an earthquake is a shaking of the surface of the Earth caused by events that happen in the tectonic plates. If you are from India, you might have seen that recently there was an earthquake that originated in Nepal, and even two days back there was another incident. So these earthquakes keep coming. Now, the very important part is this: if the earthquake is a major one, perhaps followed by a tsunami, forest fires, or a volcano, it is very important to see that the earthquake is going to come; they should be able to predict it beforehand. It should not happen that at the last moment they get to know, okay, the earthquake has come, only after it has struck; it should not happen like that. They should be able to estimate all these things beforehand; they should be able to predict beforehand. This is the system Japan is already using, so this is a real-time use case that I am presenting: Japan is already using this Spark-based approach in order to address the earthquake problem. We are going to see how they are using it. Okay. Now, let's see what happens
in the Japan earthquake early-warning model. Whenever there is an earthquake coming, for example at 2:46 p.m. on March 11, 2011, Japan's earthquake early warning system detected it. The thing was, as soon as it was detected, they immediately started sending alerts to the lifts, to the factories, to every station, and through TV stations; they immediately told everyone. So all the students who were in school got the time to go under their desks. The bullet trains that were running were stopped, because the tracks would start shaking and the bullet trains were already running at very high speed; they wanted to ensure that there would be no casualties because of that, so all the bullet trains stopped. All the elevators and lifts that were running were stopped; otherwise some incident could happen. Sixty seconds before the earthquake they were able to inform almost everyone: they sent the messages, they broadcast on TV, and they did all those things immediately for all the people, so that at least this message reached whoever could receive it, and that saved millions of lives. They were able to achieve something that powerful, and they did all of this with the help of Apache Spark. That is the most important part. You can notice that everything they are doing there, they are doing on a real-time system, right? They cannot just collect the data and process it later; they did everything as a real-time system. They collected the data, immediately processed it, and as soon as they detected the earthquake, they immediately informed everyone. In fact, this happened in 2011, and now they use it very frequently, because Japan is one of the areas which is very frequently affected by all this. So, as I said, the main thing is that we should be
able to process the data in real time, and the other big thing we are finding is that you should be able to handle data from multiple sources, because data may be coming from many different sources, each of them suggesting some event or other based on which we are predicting that, okay, this earthquake can happen. It should also be very easy to use, because if it is very complicated, then for a user it becomes very hard to use, and you will not be able to solve the problem. And in the end, how to send the alert messages is also important. Okay, so all those things are taken care of by Spark. Now, there are two kinds of waves in an earthquake: the number one is the primary wave and the second is the secondary wave. The primary wave is like when the earthquake is just about to start: it is the initial signal that the earthquake is going to begin. The secondary wave is more severe, and it arrives after the primary wave. What happens with the secondary wave is that when it starts, it can do the maximum damage, because the primary wave is just the initial wave, but the secondary wave comes on top of that. There are more details with respect to this; I am not going into the detail of that, but yes, there are more details with respect to that. Now, what we are going to do using Spark is build our ROC (receiver operating characteristic) curve for the prediction model. So let's go and see in our machine how we will be calculating our ROC, using which we will be solving this problem later, and we will be calculating this ROC with the help of Apache Spark. Let us again come back to this machine. Now, in order to work on that, let's first exit
from this console. Once you exit from this console, here is what you are going to do. I have already created this project and kept it here, because we just want to give you an overview of it. Let me go to my Downloads section; there is a project called earth2. So this is your project. Initially, you will not have all of these things: you will not have this target directory, the project directory, or the bin directory. We will be using the SBT framework. If you do not know SBT, it is the Scala build tool, which takes care of all your dependencies. It is very similar to Maven; if you already know Maven, this will feel very similar, but at the same time I prefer SBT, because in my experience SBT build files are easier to write. So you will write this build file as build.sbt. In this file you give the name of your project, the version of your project, the version of Scala you are using, and the dependencies you have, with their versions; for example, for Spark we are using version 1.5.2. You are telling SBT: whatever I am writing in my program, if I require anything related to spark-core, go and get it from the Apache Spark repository, download it, and install it; if I require any dependency for my Spark Streaming program for this particular version 1.5.2, go to this link and fetch it; and a similar thing for Spark MLlib. You are just declaring them; a sketch of such a build.sbt follows below.
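As an illustration, such a build.sbt could look roughly like this; the project name is the one used here, and the Scala version is one that Spark 1.5.2 was built against, so adjust both to your own setup.

  name := "earth2"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"      % "1.5.2",
    "org.apache.spark" %% "spark-streaming" % "1.5.2",
    "org.apache.spark" %% "spark-mllib"     % "1.5.2"
  )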
Once you have done this, you will create a folder structure. Your folder structure would be: you need to create an src folder; after that you create a main folder; inside the main folder you create another folder called scala; and inside that you keep your program. So here you will be writing your program; you can see this streaming code written in a Scala source file. Let's keep the code itself as a black box for now; you will be writing the code to achieve this problem statement. Now, what we are going to do is come out of this to the main project folder, and from here we will run sbt package. It will start downloading things according to your build file: it will check your program, and whatever dependencies you require for spark-core, spark-streaming, and spark-mllib, it will download and install; it will just download and install them. We are not going to execute it now, because I have already done it before and it also takes some time; that is the reason I am not doing it now. Once you have run this package step, you will find all these directories, the target directory and the project directory; these got created at that point. Now, what is going to happen once you have created this: you will go to your Eclipse, so your Eclipse will open; let me open my Eclipse. This is how your Eclipse project looks. Now, I already have this program in front of me, but let me tell you how you will bring this program in. You will go to your Import option, and within Import you will select Existing Projects into Workspace. Next, once you do that, you need to select your main project; for example, you need to select this earth2 project that you have created, and click OK. Once you do that, a project directory coming from this earth2 will appear here. Now what we need to do is go to src/main; ignore all the other programs, I require only this one Scala file, because this is where I have written my main function. After that, once you reach this, you need to go to Run As, Scala Application, and your Spark code will start to execute. This will return me an ROC value. Okay, let's see this output. Once it has finished executing, you can see that the area under the ROC curve is this value, and this is all computed with the help of our Spark program. Similarly, there are other programs as well that will help you clean the data and so on; I am not walking through all of that. Now, let's come back to my slides and see what the next step is that we will be doing. You
can see this is what will be getting created next; I am keeping my ROC here. Now, after you have created your ROC, you will build your graph. In Japan there is one important thing: Japan is an area already affected by earthquakes, and the trouble here is that you do not want to start sending the alert even for a minor earthquake, right? I don't want to do all that for minor tremors. In fact, the buildings and the infrastructure there are constructed in such a way that if any earthquake below magnitude six comes, there will be no damage; they are designed so that there will be no damage to them. So this is the major thing when you work with the Japan earthquake model: below magnitude six they are not even worried, but above six they are worried. For that there will be a graph simulation, which you can do with Spark as well. Once you generate this graph, you will see that anything which is going above six should immediately trigger the alert. Now, ignore all the programming side of this, because that is what we have just created, and I am only showing you this execution. If you have to visualize the same result, this is what is happening: this is showing my ROC, and only if my earthquake magnitude is going to be greater than six do we raise the alert and send it to all the people; otherwise we keep calm. That is the project we generally show in our Spark sessions. It is not the only project; we also create multiple other projects as well. For example, we create a model just like how Walmart does it: whatever sales are happening, they analyze that using Apache Spark, and at the end they make you visualize the output of whatever analytics they are doing. That is all done using Spark. All those things we walk through when we do the full sessions; all these things you learn there. I feel that for all these projects, since you do not yet know every topic, you are not able to get a hundred percent of the project, but once you know each and every topic you will definitely have a clearer picture of how Spark is handling all these use cases. Graphs are very attractive when it comes to modeling real-world data, because they are intuitive, flexible, and the theory supporting them has been maturing for centuries. Welcome, everyone, to today's session
on Spark GraphX. So without any further delay, let's look at the agenda first. We start by understanding the basics of graph theory and different types of graphs. Then we'll look at the features of GraphX. Further, we'll understand what a property graph is and look at various graph operations. Moving ahead, we'll look at different graph processing algorithms, and at last we'll look at a demo where we will try to analyze Ford GoBike data using the PageRank algorithm. Let's move to the first topic. We'll start with the basics of graphs. Graphs are basically made up of two sets, called vertices and edges. The vertices are drawn from some underlying type, and the set can be finite or infinite. Each element of the edge set is a pair consisting of two elements from the vertex set. So your vertices are V1, V2, V3, and V4, and your edges are (V1, V3), then (V1, V2), then (V2, V3), and then (V2, V4). Basically, we represent the vertex set by enclosing all the vertex names in curly braces: we have V1, V2, V3, and V4, and we close the curly braces. To represent the edge set we use curly braces again, and inside them we specify the two vertices which are joined by each edge: for this edge we will use (V1, V3), for this edge we will use (V1, V2), for this edge we will use (V2, V4), and at last for this edge we will use (V2, V3), and then we close the curly braces. So this is your vertex set, and this is your edge set. Now, one very important thing: if the edge set contains (U, V), or in our case the edge set contains (V1, V3), then V1 is adjacent to V3. Similarly, your V1 is adjacent to V2, then V2 is adjacent to V4, and looking at this you can say V2 is adjacent to V3. Now, let's quickly move
ahead and look at the different types of graphs. First we have undirected graphs. In an undirected graph we use straight lines to represent the edges, and the order of the vertices in the edge set does not matter. So undirected graphs are usually drawn using straight lines between the vertices. It is almost similar to the graph we have seen on the last slide. Similarly, we can again represent the vertex set as {5, 6, 7, 8} and the edge set as (5, 6), then (5, 7), and so on. Now, talking about directed graphs: in a directed graph the order of the vertices in the edge set matters, so we use arrows to represent the edges, as you can see in the image. That was not the case with the undirected graph, where we were using straight lines. In a directed graph we use an arrow to denote each edge, and the important thing is that the edge pair contains the source vertex, which is 5 in this case, and the destination vertex, which is 6 in this case, and this is never the same as (6, 5); you cannot represent this edge as (6, 5), because the direction always matters in a directed graph. Similarly, you can see that 5 is adjacent to 6, but you cannot say that 6 is adjacent to 5. For this graph the vertex set would be the same, {5, 6, 7, 8}, which was similar in the undirected graph, but in a directed graph your edge set consists of ordered pairs: your first tuple, this one, will be (5, 6); your second edge, which is this one, would be (5, 7); and at last this edge would be (7, 8). In the case of an undirected graph you could write this one as (8, 7), or you could write this one as (7, 5), but that is not the case with the directed graph: you have to follow the source vertex and the destination vertex to represent the edge. So I hope you guys are clear with undirected and directed graphs. Now, let's talk about
vertex-labeled graphs. In a vertex-labeled graph, each vertex is labeled with some data in addition to the data that identifies the vertex, and we call this identifying value the vertex ID. So there will be extra data attached to each vertex. Let's say this vertex would be (6, purple), since we are adding the color; this vertex would be (8, green); we'll see this one as (7, red); and this one as (5, blue). Now, the 6, 5, 7, and 8 are vertex IDs, and the additional data attached is the color, like blue, purple, green, or red. But only the identifying data is present in the pairs of the edge set, or you can say only the ID of each vertex is present in the edge set. So here the edge set is again similar to your directed graph: that is, the source ID, which is 5, and then the destination ID, which is 6 in this case; for this edge it is (5, 7), and for this one it is (7, 8). So we are not specifying the additional data attached to the vertices, that is, the color; we only specify the identifiers of the vertices, that is, the numbers. Your vertex set, though, would be something like: this vertex would be (5, blue), the next vertex would be (6, purple), the next vertex would be (8, green), and your last vertex would be written as (7, red). So basically, when you are specifying the vertex set in a vertex-labeled graph, you attach the additional information in the vertex set, but the edge set is represented the same way as in a directed graph, where you just specify the source vertex identifier and then the destination vertex identifier. I hope that you guys are clear with undirected, directed, and vertex-labeled graphs. So let's quickly move forward.
Next we have cyclic graphs. A cyclic graph is a directed graph with at least one cycle, and a cycle is a path along the directed edges from a vertex back to itself. So once you see over here, you can see that from vertex 5 it moves to vertex 7, then it moves to vertex 8, then with the arrows it moves to vertex 6, and then again it moves back to vertex 5. There should be at least one cycle in a cyclic graph. There might be a new component, say a vertex 9, which is attached over here; it would still be a cyclic graph, because it has one complete cycle over here. The important thing to notice is that the arrows should make the cycle, like from 5 to 7, then from 7 to 8, then 8 to 6, and 6 to 5. Let's say instead that there is an arrow from 5 to 6 and an arrow from 6 to 8, so we have flipped the arrows; in that situation this is not a cyclic graph, because the arrows are not completing the cycle. Once you move from 5 to 7 and then from 7 to 8, you cannot move from 8 to 6; and similarly, once you move from 5 to 6 and then 6 to 8, you cannot move from 8 to 7. So in that situation it is not a cyclic graph. So let's clear all of this: we will represent this cycle as 5, then following the arrows we go to 7, then we move to 8, then we move to 6, and at last we come back to 5. Now we have the edge-labeled graph. Basically, an edge-labeled graph is a graph whose edges are associated with labels. One can indicate this by making the edge set a set of triplets. For example, this edge in the edge-labeled graph will be denoted as the source, which is 6, then the destination, which is 7, and then the label of the edge, which is blue; so this edge would be defined as (6, 7, blue). Then for this one, similarly, the source vertex is 7, the destination vertex is 8, and the label of the edge is white. Similarly, for this edge it is (5, 7) together with its label, and at last for this edge it is (5, 6) together with its label. So all these four edges become the edge set for this graph, and the vertex set is almost the same, that is {5, 6, 7, 8}. Now, to generalize this, I would say (X, Y, A), where X is the source vertex, Y is the destination vertex, and A is the label of the edge. Edge-labeled graphs are usually drawn with the labels written adjacent to the arcs specifying the edges; as you can see, we have mentioned blue, white, and all those labels next to the edges. So I hope you guys are clear with the edge-labeled graph, which is nothing but labels attached to each and every edge. Now, let's talk about weighted graphs. A weighted graph is an edge-labeled graph where the labels can be operated on by arithmetic operators or comparison operators, like the less-than or greater-than symbols; usually these labels are integers or floats, and the idea is that some edges may be more expensive, and this cost is represented by the edge labels, or weights. In short, weighted graphs are a special kind of edge-labeled graph where each edge is attached to a weight, generally an integer or a float, so that we can perform addition or subtraction or other kinds of arithmetic operations, or conditional operations like less than or greater than. So we'll again represent this edge as (5, 6) with weight 3, and similarly we'll represent this edge as (6, 7) with weight 6, and similarly we represent the other two edges as well. So I hope that you guys are clear with weighted graphs. Now let's quickly
move ahead and look at the directed acyclic graph. This is a directed acyclic graph, which is basically a directed graph without cycles. As we just discussed for cyclic graphs, here you can see that it is not completing a cycle given the directions of the edges, right? We can move from 5 to 7, then 7 to 8, but we cannot move from 8 to 6; similarly, we can move from 5 to 6 and then 6 to 8, but we cannot move from 8 to 7. So this is not forming a cycle, and these kinds of graphs are known as directed acyclic graphs. They appear as special cases in CS applications all the time, and the vertex set and the edge set are represented the same way as we have seen earlier. Now, talking about the disconnected graph: vertices in a graph do not need to be connected to other vertices. It is basically legal for a graph to have disconnected components, or even lone vertices without a single connection. So this is a disconnected graph, which has four vertices but no edges. Now, let me tell you something important, that is, what sources and sinks are. Let's say we have one arrow from 5 to 6 and one arrow from 5 to 7. Vertices with only in-arrows are called sinks, so 7 and 6 are known as sinks, and vertices with only out-arrows are called sources. As you can see in the image, 5 only has out-arrows, to 6 and 7, so it is called a source. We'll talk about this in a while, guys, once we are going through the PageRank algorithm. So I hope that you guys know what vertices are, what edges are, how vertices and edges represent a graph, and what the different kinds of graphs are. Let's move to the next topic. So next, let's understand what Spark GraphX is. Talking about
GraphX: GraphX is a component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new graph abstraction, a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators, like finding a subgraph, joining vertices, or aggregating messages, and it also exposes an optimized variant of the Pregel API. In addition, GraphX provides a collection of graph algorithms and builders to simplify your graph analytics tasks. So basically, GraphX extends the Spark RDD; GraphX provides an abstraction, a directed multigraph with properties attached to each vertex and edge, and we'll look at this property graph in a while; GraphX also gives you some fundamental operators; and it provides some graph algorithms and builders which make your analytics easier. Now, to get started, you first need to import Spark and GraphX into your project. As you can see, we are importing Spark first, then we are importing Spark GraphX to get the GraphX functionality, and at last we are importing the Spark RDD to use the RDD functionality in our program. But let me tell you that if you are not using the Spark shell, then you will also need a SparkContext in your program. So I hope that you guys are clear with the features of GraphX and the libraries which you need to import in order to use GraphX; they are shown below.
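These are the standard GraphX imports from the Spark programming guide:

  import org.apache.spark._
  import org.apache.spark.graphx._
  // RDD is needed for some of the examples that follow
  import org.apache.spark.rdd.RDD

If you are not working in the spark-shell, remember that you would also create your own SparkContext before using these.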
and look at the property graph. Now property graph is something as the name suggests property
graph have properties attached to each vertex and Edge. So the property graph is a directed multigraph with
user-defined objects attached to each vertex and Edge. Now you might be wondering
what is undirected multigraph. So a directed multi graph is a
directed graph with potentially multiple parallel edges
sharing same source and same destination vertex. So as you can see in the image that from San Francisco
to Los Angeles, we have two edges and similarly
from Los Angeles to Chicago. There are two edges. So basically in
a directed multigraph, the first thing is
the directed graph, so it should have a Direction. Ian attached to the edges
and then talking about multigraph so
between Source vertex and a destination vertex, there could be two edges. So the ability to
support parallel edges basically simplifies
the modeling scenarios where there can be
multiple relationships between the same vertices
for an example. Let's say these are two persons so they can be friends
as well as they can be co-workers, right? So these kind of scenarios
can be Easily modeled using directed multigraph now. Each vertex is keyed by
a unique 64-bit long identifier, which is basically the vertex ID
and it helps an indexing. So each of your vertex
contains a Vertex ID, which is a unique
64-bit long identifier and similarly edges have corresponding source and
destination vertex identifiers. So this Edge would have this vertex identifier as
well as This vertex identifier or you can say Source vertex ID
and the destination vertex ID. So as we discuss
this property graph is basically parameterised
over the vertex and Edge types, and these are the types
of objects associated with each vertex and Edge. So your graphics basically
optimizes the representation of vertex and Edge types and it reduces the in
memory footprint by storing the primitive data types
in a specialized array. In some cases it might be
desirable to have vertices with different property types
in the same graph. Now this can be accomplished
through inheritance. So for an example to model
a user and product in a bipartite graph, or you can see
that we have user property and we have product property. Okay. So let me first tell you
what is a bipartite graph. So a bipartite graph
is also called a by graph which is a set
of graph vertices. Opposed into two disjoint sets
such that no two graph vertices within the same
set are adjacent. So as you can see over here, we have user property and then
we have product property but no to user property
can be adjacent or you can say there should be no edges that is joining any
of the to user property or there should be no Edge that should be joining
product property. So in this scenario
we use inheritance. So as you can see here, we have class vertex
property now basically what we are doing we
are creating another class with user property. And here we have name, which is again a string
and we are extending or you can say we are inheriting
the vertex property class. Now again, in the case
of product property. We have name that is
name of the product which is again string and then
we have price of the product which is double and we are again extending
this vertex property graph and at last You're grading a
graph with this vertex property and then string. So this is how we
can basically model user and product as
a bipartite graph. So we have created user property as well as we have created
this product property and we are extending
this vertex property class. No talking about
this property graph. It's similar to your rdd. So like your rdd property graph
are immutable distributed and fault tolerant. So changes to the values
or structure of the graph. Basically accomplished by producing a new graph
with the desired changes and the substantial part
of the original graph which can be your structure
of the graph or attributes or indices. These are basically reused
in the new graph reducing the cost of inherent
functional data structure. So basically your property graph once you're trying to change
values of structure. So it creates a new graph
with changed structure or changed values and zero substantial part
of original graph. Re used multiple times
to improve the performance and it can be
your structure of the graph which is getting reuse
or it can be your attributes or indices of the graph
which is getting reused. So this is how your property
graph provides efficiency. Now, the graph is partitioned across the executors
using a range of vertex partitioning rules, which are basically
Loosely defined and similar to our DD
each partition of the graph can be recreated on different machines
in the event of Failure. So this is how your property
graph provides fault tolerance. So as we already
discussed logically the property graph corresponds
to a pair of type collections, including the properties
for each vertex and Edge and as a consequence the graph class contains
members to access the vertices and the edges. So as you can see we have graphed class then you
can see we have vertices and we have edges. Now this vertex Rd DVD
is extending your rdd, which is your body
D and then your vertex ID and then your vertex property. Similarly. Your Edge rdd is extending your Oddity with your Edge
property so the classes that is vertex rdd and HR DD extends under
optimized version of your rdd, which includes vertex idn
vertex property and your rdd which includes your Edge
property and Booth this vertex rdd and hrd provides additional
functionality build on top of graph computation and leverages internal
optimizations as well. So this is the reason we use
this Vertex rdd or Edge already because it already extends your
already containing your word. X ID and vertex property or your Edge property
it also provides you additional functionalities built
on top of craft computation. And again, it gives you some
internal optimizations as well. Now, let me clear
this and let's take an example of property graph where the vertex property might contain the user
name and occupation. So as you can see in this table
that we have ID of the vertex and then we have property
attached to each vertex. That is the username
as well as the Station of the user or you can see
the profession of the user and we can annotate
the edges with the string describing the relationship
between the users. So so as you can see
first is Thomas who is a professor
then second is Frank who is also a professor then as you can see third is Jenny. She's a student and forth is Bob who is a doctor now Thomas is
a colleague of Frank. Then you can see that Thomas is academic
advisor of Jenny again. Frank is also a Make advisor of Jenny and then the doctor
is the health advisor of Jenny. So the resulting graph
would have a signature of something like this. So I'll explain this in a while. So there are numerous ways
to construct the property graph from raw files or RDS or even synthetic
generators and we'll discuss it in graph Builders, but the very probable and most General method
is to use graph object. So let's take a look
at the code first. And so first over here, we are assuming
that Parker context has already been constructed. Then we are giving
the SES power context next. We are creating an rdd
for the vertices. So as you can see for users, we have specified idd
and then vertex ID and then these are two strings. So first one would be your username and the second one
will be your profession. Then we are using SC paralyzed
and we are creating an array where we are specifying
all the vertices so And that is this one and you are getting
the name as Thomas and the profession is Professor similarly
for to well Frank Professor. Then 3L Jenny cheese student
and 4L Bob doctors. So here we have created
the vertex next. We are creating
an rdd for edges. So first we are giving
the values relationship. Then we are creating
an rdd with Edge string and then we're using SC
paralyzed to create the edge and in the array we are
specifying the A source vertex, then we are specifying
the destination vertex. And then we are
giving the relation that is colleague similarly
for next Edge resources when this nation is one and then the profession
is academic advisor and then it goes so on. So then this line we
are defining a default user in case there is a relationship
between missing users. Now we have given
the name as default user and the profession is missing. Nature trying to build
an initial graph. So for that we are using
this graph object. So we have specified users
that is your vertices. Then we are specifying the
relations that is your edges. And then we are giving
the default user which is basically
for any missing user. So now as you can see over here, we are using Edge case class
and edges have a source ID and a destination ID, which is basically
corresponding to your source and destination vertex. And in addition
to the Edge class. We have an attribute member which stores The Edge property
which is the relation over here that is colleague or it is academic advisor or it
is Health advisor and so on. So, I hope that you guys are clear
about creating a property graph how to specify the vertices
how to specify edges and then how to create a graph Now
we can deconstruct a graph into respective vertex and Edge views by using
a graph toward vertices and graph edges members. So as you can see
we are using craft or vertices over here
and crafts dot edges over here. Now what we are trying to do. So first over here the graph
which we have created earlier. So we have graphed vertices dot filter Now
using this case class. We have this vertex ID. We have the name and then
we have the position. And we are specifying
the position as doctor. So first we are trying
to filter the profession of the user as doctor. And then we are trying to count. It. Next. We are specifying
graph edges filter and we are basically
trying to filter the edges where the source ID is greater
than your destination ID. And then we are trying
to count those edges. We are using
a Scala case expression as you can see to
deconstruct the temple. You can say to deconstruct the result on the other hand
craft edges returns a edge rdd, which is containing
Edge string object. So we could also have used
the case Class Type Constructor as you can see here. So again over here we
are using graph dot s dot filter and over here. We have given case h and then
we are specifying the property that is Source destination
and then property of the edge which is attached. And then we are filtering it and
then we are trying to count it. So this is how using Edge class
either you can see with edges or you can see with vertices. This is how you can go ahead
and deconstruct them. Right because you're
grounded vertices or your s dot vertices returns
a Vertex rdd or Edge rdd. So to deconstruct them, we basically use
this case class. So I hope you guys are clear about
transforming property graph. And how do you use this case? Us to deconstruct
the protects our DD or HR DD. So now let's quickly move ahead. Now in addition to the vertex and Edge views
of the property graph Graphics also exposes
a triplet view now, you might be wondering
what is a triplet view. So the triplet view
logically joins the vertex and Edge properties
yielding an rdd edge triplet with vertex property
and your Edge property. So as you can see
it gives an rdd. D with s triplet and then it has vertex property as well as
H property associated with it and it contains an instance
of each triplet class. Now. I am taking example of a join. So in this joint we are trying
to select Source ID destination ID Source attribute then this is your Edge attribute and then at last you
have destination attribute. So basically your edges has
Alias e then your vertices has Alias as source. And again your vertices
has Alias as Nation so we are trying to select
Source ID destination ID, then Source, attribute
and destination attribute, and we also selecting
The Edge attribute and we are performing left join. The edge Source ID should
be equal to Source ID and the h destination ID should
be equal to destination ID. And now your Edge
triplet class basically extends your Edge class
by adding your Source attribute and destination
attribute members which contains the source
and destination properties and we can use the triplet view of a graph
to render a collection of strings describing
relationship between users. This is vertex 1 which is again
denoting your user one. That is Thomas and
who is a professor and is vertex 3, which is denoting you Jenny
and she's a student. And this is your Edge, which is defining
the relationship between them. So this is a h triplet which is denoting
the both vertex as well as the edge which denote
the relation between them. So now looking at this code
first we have already created the graph then we
are taking this graph. We are finding the triplets and then we are
mapping each triplet. We are trying to find out
the triplet dot Source attribute in which we are picking
up the username. Then over here. We are trying to pick up
the triplet attribute, which is nothing
but the edge attribute which is your academic advisor. Then we are trying to pick up the triplet
destination attribute. It will again pick
up the username of destination attribute, which is username
of this vertex 3. So for an example
in this situation, it will print Thomas is
the academic advisor of Jenny. So then we are trying
to take this facts. We are collecting the facts using this forage we have
Painting each of the triplet that is present in this graph. So I hope that you guys are clear
with the concepts of triplet. So now let's quickly take
a look at graph Builders. So as I already told you that Graphics provides
several ways of building a graph from a collection of vertices
and edges either. It can be stored in our DD
or it can be stored on disk. So in this graph object first,
we have this apply method. So basically this apply
method allows creating a graph from rdd of vertices and edges and duplicate vertices
are picked up our by Tralee and the vertices which are found in the Edge rdd
and are not present in the vertices rdd are assigned
a default attribute. So in this apply method first, we are providing
the vertex rdd then we are providing the edge rdd and then we are providing
the default vertex attribute. So it will create the vertex
which we have specified. Then it will create the edges which are specified and
if there is a vertex which is being referred
by The Edge, but it is not present
in this vertex rdd. So So what it does it
creates that vertex and assigns them the value of
this default vertex attribute. Next we have from edges. So graph Dot from edges
allows creating a graph only from the rdd of edges which automatically creates
any vertices mentioned in the edges and assigns
them the default value. So what happens over here
you provide the edge rdd and all the vertices that are present in the hrd
are automatically created and Default value is assigned
to each of those vertices. So graphed out from adjustables basically
allows creating a graph from only the rdd of vegetables and it assigns the edges as
value 1 and again the vertices which are specified by the edges
are automatically created and the default value which we are specifying over here
will be allocated to them. So basically you're from has double supports
deduplicating of edges, which means you can remove
the duplicate edges, but for that you have
to provide a partition strategy in the unique edges parameter
as it is necessary to co-locate The Identical edges on the same partition duplicate
edges can be removed. So moving ahead men of the graph
Builders re partitions, the graph edges by default
instead edges are left in their default partitions. So as you can see,
we have a graph loader object, which is basically used to load. Crafts from the file system so graft or group edges requires
the graph to be re-partition because it assumes that identical edges
will be co-located on the same partition. And so you must call
graph dot Partition by before calling group edges. So so now you can see the edge
list file method over here which provides a way to load
a graph from the list of edges which is present
on the disk and it It passes the adjacency list
that is your Source vertex ID and the destination vertex ID
Pairs and it creates a graph. So now for an example, let's say we have two and one
which is one Edge then you have for one which is another Edge and then you have 1/2
which is another Edge. So it will load these edges and then it will
create the graph. So it will create 2, then it will create
for and then it will create one. And for to one it will create the edge and then
for one it will create the edge and at last we create
an edge for one and two. So do you create a graph
something like this? It creates a graph
from specified edges where automatically
vertices are created which are mentioned by the edges
and all the vertex and Edge attribute
are set by default one and as well as one
will be associated with all the vertices. So it will be 4 comma
1 then again for this. It would be 1 comma
1 and similarly it would be 2 comma 1 for this vertex. Now, let's go back to the code. So then we have
this canonical orientation. So this argument
allows reorienting edges in the positive direction that is from the lower Source ID to the higher
destination ID now, which is basically required
by your connected components algorithm will talk about this algorithm
in a while you guys but before this
this basically helps in view orienting your edges, which means your Source vertex, Tex should always be less
than your destination vertex. So in that situation it
might reorient this Edge. So it will reorient this Edge
and basically to reverse direction of the edge
similarly over here. So with the vertex which is coming from 2 to 1
will be reoriented and will be again reversed. Now the talking about the minimum Edge partition
this minimum Edge partition basically specifies
the minimum number of edge partitions
to generate There might be more Edge partitions
than a specified. So let's say the hdfs
file has more blocks. So obviously more partitions
will be created but this will give you
the minimum Edge partitions that should be created. So I hope that you guys are clear
with this graph loader how this graph loader Works how you can go ahead
and provide the edge list file and how it will create the craft from this Edge list file and
then this canonical orientation where we are again going
and reorienting the graph and then we have
Minimum Edge partition which is giving the minimum
number of edge partitions that should be created. So now I guess you guys are
clear with the graph Builder. So how to go ahead and use
this graph object and how to create graph
using apply from edges and from vegetables method and then I guess
you might be clear with the graph loader object and where you can go ahead and
create a graph from Edge list. Now. Let's move ahead and talk
Now, let's move ahead and talk about VertexRDD and EdgeRDD. So as I already told you, GraphX exposes RDD views of the vertices and edges stored within the graph. However, because GraphX maintains the vertices and edges in optimized data structures, these data structures provide additional functionalities as well. Now, let us see some of the additional functionalities which are provided by them. Let's first talk about VertexRDD. As I already told you, VertexRDD[A] basically extends RDD[(VertexId, A)] and adds the additional constraint that each vertex ID occurs only once. Moreover, VertexRDD[A] represents a set of vertices, each with an attribute of type A. Internally, this is achieved by storing the vertex attributes in a reusable hash-map data structure.
So suppose this is our hash-map data structure, and suppose two VertexRDDs are derived from the same base VertexRDD. They can then be joined in constant time without hash evaluations: you don't have to go ahead and re-hash the vertex IDs of both VertexRDDs, you can simply join them, and this is one of the ways in which VertexRDD provides optimization. Now, to leverage this indexed data structure, the VertexRDD exposes multiple additional functions, as you can see here: filter, mapValues, minus, diff, leftJoin, innerJoin and aggregateUsingIndex. So let us first discuss these functions.
So basically the filter function filters the vertex set but preserves the internal index; based on some condition, it filters the vertices that are present. Then mapValues is basically used to transform the values without changing the IDs, which again preserves your internal index: it does not change the IDs of the vertices, it only helps in transforming their values. Now, talking about the minus method, it returns the vertices that are unique to this set, based on their vertex IDs. So if you provide two sets of vertices, where the first contains V1, V2 and V3 and the second contains V3, it will return V1 and V2, and this is done with the help of the vertex IDs. Next we have the diff function, which basically removes from this set the vertices that appear in another set. Then we have leftJoin and innerJoin: these join operators take advantage of the internal indexing to accelerate joins, so you can go ahead and perform a left join or an inner join. Next you have aggregateUsingIndex, which is basically nothing but reduceByKey, except that it uses the index on this RDD to accelerate the reduceByKey operation.
So again, filter is actually implemented using a BitSet, thereby reusing the index and preserving the ability to do fast joins with other VertexRDDs. Similarly, the mapValues operator does not allow the map function to change the vertex ID, which again helps in reusing the same hash-map data structure. Both leftJoin and innerJoin are able to identify whether the two VertexRDDs being joined are derived from the same hash map, and in that case they use a linear scan instead of costly point lookups. So this is the benefit of using VertexRDD. To summarize, your VertexRDD uses a reusable hash-map data structure and tries to preserve the indexes, so that it is easier to derive a new VertexRDD from an existing one, and while performing join operations it is pretty easy to do a linear scan and join two VertexRDDs. So it actually helps in optimizing your performance. A minimal sketch of these VertexRDD operations follows.
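A minimal sketch of these VertexRDD operations, assuming the same SparkContext sc; the graph and its attributes are purely illustrative:

  import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

  val vertices = sc.parallelize(Seq((1L, 28.0), (2L, 35.0), (3L, 42.0)))
  val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
  val graph    = Graph(vertices, edges)
  val vset: VertexRDD[Double] = graph.vertices

  // filter preserves the internal index; mapValues keeps the vertex IDs.
  val above30 = vset.filter { case (_, age) => age >= 30.0 }
  val doubled = vset.mapValues(age => age * 2)

  // Joins between VertexRDDs derived from the same base avoid re-hashing.
  val joined = vset.innerJoin(doubled) { (id, a, b) => a + b }

  // aggregateUsingIndex is reduceByKey accelerated by the existing index.
  val msgs   = sc.parallelize(Seq((1L, 1), (1L, 1), (3L, 1)))
  val counts = vset.aggregateUsingIndex(msgs, (a: Int, b: Int) => a + b)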
Now moving ahead, let's talk about EdgeRDD. As you can see, EdgeRDD[ED] extends RDD[Edge[ED]]. It organizes the edges in block partitions using one of the various partitioning strategies, which is defined by your PartitionStrategy parameter. Within each partition, the edge attributes and the adjacency structure are stored separately, which enables maximum reuse when changing the attribute values. So basically, while storing your edge attributes and your source and destination vertices, they are kept separately, so that when only the attribute values change, the adjacency structure can be reused as many times as needed.
Now, as you can see, we have three additional functions over here: mapValues, reverse and innerJoin. In EdgeRDD, mapValues basically transforms the edge attributes while preserving the structure, so you can use mapValues to map the values of your EdgeRDD. Then you can use the reverse function, which reverses the edges while reusing both the attributes and the structure, so the source becomes the destination and the destination becomes the source. Now talking about innerJoin, it basically joins two EdgeRDDs that are partitioned using the same partitioning strategy. As we already discussed, the same partition strategy is required because, to co-locate the edges, identical edges must reside in the same partition for the join operation to be performed on them. A minimal sketch of these EdgeRDD operations follows.
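A minimal sketch of these EdgeRDD operations, continuing with the illustrative graph from the previous sketch:

  import org.apache.spark.graphx.EdgeRDD

  val eset: EdgeRDD[Int] = graph.edges

  // Transform edge attributes while reusing the adjacency structure.
  val weighted = eset.mapValues(e => e.attr * 10)

  // Reverse every edge, reusing both the attributes and the structure.
  val reversed = eset.reverse

  // Join two EdgeRDDs that share the same partitioning; the function
  // receives the source ID, destination ID and both attributes.
  val joinedEdges = eset.innerJoin(weighted) { (src, dst, a, b) => a + b }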
Now, let me quickly give you an idea about the optimizations performed in GraphX. GraphX basically adopts a vertex-cut approach to distributed graph partitioning. So suppose you have five vertices and they are connected; let's not worry about the arrows, or the direction, right now. Either the graph can be divided along the edges, which is one approach, or it can be divided along the vertices, in which case it would be split something like this. So rather than splitting the graph along edges, GraphX partitions the graph along vertices, which can reduce both communication and storage overhead. Logically, what happens is that your edges are assigned to machines while your vertices are allowed to span multiple machines: a vertex may be divided across multiple machines, but each edge is assigned to a single machine. The exact method of assigning edges depends on the partition strategy; the partition strategy is the one which basically decides how to assign the edges to different machines, or you can say different partitions. The user can choose between different strategies by partitioning the graph with the help of the Graph.partitionBy operator.
Now, as we discussed, the Graph.partitionBy operator repartitions the graph and relocates the edges, and we basically try to put identical edges on a single partition so that different operations, like join, can be performed on them. Once the edges have been partitioned, the main challenge is efficiently joining the vertex attributes with the edges. Because real-world graphs typically have more edges than vertices, we move the vertex attributes to the edges, and because not all partitions will contain edges adjacent to all vertices, we internally maintain a routing table. The routing table is the one which broadcasts the vertices and implements the join required for the operations. So I hope that you guys are clear with how VertexRDD and EdgeRDD work, how the optimizations take place, and how the vertex cut optimizes the operations in GraphX. Below is a small sketch of partitionBy.
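A small sketch of partitionBy, again continuing with the illustrative graph; the chosen strategy is just one of the available options:

  import org.apache.spark.graphx.PartitionStrategy

  // Repartition the edges so that identical edges are co-located;
  // this is required before calling groupEdges.
  val partitioned = graph.partitionBy(PartitionStrategy.RandomVertexCut)

  // Merge parallel edges between the same pair of vertices by
  // adding their attributes together.
  val merged = partitioned.groupEdges((a, b) => a + b)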
Now, let's talk about graph operators. Just as RDDs have basic operations like map, filter and reduceByKey, property graphs also have a collection of basic operators that take user-defined functions and produce new graphs with transformed properties and structure. The core operators that have optimized implementations are defined in the Graph class, and convenient operators that are expressed as a composition of the core operators are defined in the GraphOps class. However, thanks to Scala implicits, the operators in GraphOps are automatically available as members of the Graph class, so you can use them through the Graph class as well. Now, as you can see, we have a list of operator categories: property operators, structural operators, join operators, and something called neighborhood operators. So let's talk about them one by one.
Now, talking about property operators: just like RDDs have the map operator, the property graph contains the mapVertices, mapEdges and mapTriplets operators. Each of these operators yields a new graph with the vertex or edge properties modified by a user-defined map function. Based on that function, it transforms the vertices if it is mapVertices, the edges if it is mapEdges, and the triplets if it is mapTriplets. Now, the important thing to note is that in each case the graph structure is unaffected, and this is a key feature of these operators, because it allows the resulting graph to reuse the structural indices of the original graph. Each time you apply a transformation, it creates a new graph and the original graph is unaffected, so its structural indices can be reused in creating the new graphs.
Now, talking about mapVertices: let me use the highlighter. First we have mapVertices, which maps, or you can say transforms, the vertices. You provide a function over the vertex ID and the vertex attribute, and it gives you a graph with the new vertex property. The same is the case with mapEdges: you provide a function over the edges, transforming the attribute from ED to ED2, and the graph that is returned carries the changed edge attribute ED2, as you can see here. The same goes for mapTriplets: using mapTriplets you work on the edge triplet, where you can access the source vertex attribute, the destination vertex attribute and the edge attribute, apply a transformation over them, and it will again return a graph with the transformed values. So I guess you guys are now clear with the property operators; a minimal sketch follows.
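A minimal sketch of the property operators, continuing with the same illustrative graph (Double vertex attributes, Int edge attributes); none of these calls change the graph structure:

  // Transform vertex attributes, edge attributes and triplets.
  val olderByOne  = graph.mapVertices((id, age) => age + 1)
  val halvedEdges = graph.mapEdges(e => e.attr.toDouble / 2)
  val described   = graph.mapTriplets(t => s"${t.srcAttr} -> ${t.dstAttr}")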
So let's move to the next set of operators, that is the structural operators. Currently, GraphX supports only a simple set of commonly used structural operators, and we expect more to be added in the future. As you can see, in the structural operators we have the reverse operator, then the subgraph operator, then the mask operator, and then the groupEdges operator. So let's talk about them one by one. First, the reverse operator: as the name suggests, it returns a new graph with all the edge directions reversed. Basically it will change your source vertex into the destination vertex and your destination vertex into the source vertex, so it reverses the direction of your edges. The reverse operation does not modify vertex or edge properties or change the number of edges, so it can be implemented efficiently without data movement or duplication.
So next we have the subgraph operator. The subgraph operator takes vertex and edge predicates, or you can say vertex and edge conditions, and returns the graph containing only the vertices that satisfy the vertex predicate and only the edges that satisfy the edge predicate while also connecting vertices that satisfy the vertex predicate. So basically you give a condition on the edges and the vertices, and only the vertices and edges fulfilling those predicates are returned, and the resulting graph stays consistent. Now, the subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest and to eliminate the rest of the components. So you can see, this is the edge predicate, which takes the edge triplet with the vertex and edge attributes and returns a Boolean, and this is the vertex predicate, which takes the vertex ID and the vertex attribute. It will yield a graph which is a subgraph of the original graph and fulfills those predicates.
Now, the next operator is the mask operator. The mask operator constructs a subgraph by returning a graph that contains only the vertices and edges that are also found in the input graph. Basically, you can treat this mask operator as a comparison between two graphs: suppose we are comparing graph 1 and graph 2, it will return the subgraph which is common to both graphs. This can be used in conjunction with the subgraph operator, basically to restrict a graph based on the properties in another, related graph. So I guess you guys are clear with the mask operator: over here we are providing a graph, then we are providing the input graph as well, and it returns a graph which is basically the common subset of both of these graphs. Now, talking about groupEdges: the groupEdges operator merges the parallel edges in a multigraph, that is, the duplicate edges between a pair of vertices are merged, and their attributes can be aggregated. In many numerical applications the edges can be added and their weights combined into a single edge, which again reduces the size of the graph. For example, if you have two vertices V1 and V2 and there are two edges between them with weights 10 and 15 in the same direction, you can merge those two edges and represent the weight as 25. So this will actually reduce the size of the graph. A short sketch of these structural operators follows.
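A short sketch of the structural operators on the same illustrative graph; the predicates are arbitrary examples:

  // Keep only vertices with age >= 30 and edges with a small attribute.
  val restricted = graph.subgraph(
    epred = t => t.attr < 100,
    vpred = (id, age) => age >= 30.0)

  // Intersect the original graph with the restricted one.
  val masked = graph.mask(restricted)

  // Reverse all edge directions without touching any properties.
  val flipped = graph.reverse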
Now, looking at the next set of operators, the join operators. In many cases it is necessary to join data from external collections with graphs; for example, we might have extra user properties that we want to merge with an existing graph, or we might want to pull vertex properties from one graph into another. These are some of the situations where you go ahead and use the join operators. As you can see over here, the first operator is joinVertices. The joinVertices operator joins the vertices with the input RDD and returns a new graph with the vertex properties obtained after applying the user-defined map function; the vertices without a matching value in the RDD retain their original value. Now, talking about outerJoinVertices, it behaves similarly to joinVertices except that the user-defined map function is applied to all the vertices and it can change the vertex property type. So suppose you have an old graph which has a vertex attribute holding an old price, and then you created a new graph from it which has a new price as the vertex attribute; you can go ahead and join these two and perform an aggregation of both the old and the new prices in the new graph. In that kind of situation the join operators are used. A minimal sketch follows.
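A minimal sketch of the join operators, continuing with the illustrative graph; the external RDD here is hypothetical:

  // Hypothetical external data keyed by vertex ID.
  val extra = sc.parallelize(Seq((1L, 5.0), (3L, 2.0)))

  // joinVertices keeps the vertex property type; vertices without a
  // matching entry retain their original attribute.
  val bumped = graph.joinVertices(extra) { (id, age, delta) => age + delta }

  // outerJoinVertices may change the property type; the Option is None
  // for vertices that have no matching entry in the RDD.
  val tagged = graph.outerJoinVertices(extra) { (id, age, opt) =>
    (age, opt.getOrElse(0.0))
  }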
Now moving ahead, let's talk about neighborhood aggregation. A key step in many graph analytics tasks is aggregating information about the neighborhood of each vertex; for example, we might want to know the number of followers each user has, or the average age of the followers of each user. Many iterative graph algorithms, like PageRank, shortest paths and connected components, repeatedly aggregate the properties of neighboring vertices. Now, there are four operators under neighborhood aggregation. The first one is aggregateMessages, which is the core aggregation operation in GraphX. This operator applies a user-defined sendMsg function, as you can see over here, to each of the edge triplets in the graph, and then it uses a mergeMsg function to aggregate those messages at the destination vertex. The user-defined sendMsg function takes an EdgeContext, which exposes the source and destination attributes along with the edge attribute, and functions like sendToSrc and sendToDst that are used to send messages to the source and destination vertices.
Now, you can think of sendMsg as the map function in MapReduce, and the user-defined mergeMsg function, which takes two messages destined to the same vertex and combines or aggregates them into a single message, as the reduce function in MapReduce. The aggregateMessages operator returns a VertexRDD which contains the aggregated message at each of the destination vertices; vertices that did not receive a message are not included in the returned VertexRDD. So only those vertices which actually received messages, after merging, are returned; any vertex which hasn't received a message will not be included in the returned VertexRDD.
Now, in addition, as you can see, aggregateMessages takes an optional tripletFields argument, which indicates what data is accessed in the EdgeContext. The possible options are defined in TripletFields, and the default value is TripletFields.All, which indicates that the user-defined sendMsg function may access any of the fields in the EdgeContext. This tripletFields argument can be used to notify GraphX that only part of the EdgeContext will be needed, which basically allows GraphX to select an optimized join strategy. So I hope that you guys are clear with aggregateMessages; a minimal sketch follows.
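A minimal sketch of aggregateMessages on the illustrative graph, simply counting the incoming edges of each vertex (a fuller example, the average age of senior followers, is walked through later in the demo):

  import org.apache.spark.graphx.VertexRDD

  // Every triplet sends the message 1 to its destination vertex and the
  // messages arriving at the same vertex are summed together.
  val inCounts: VertexRDD[Int] = graph.aggregateMessages[Int](
    sendMsg  = ctx => ctx.sendToDst(1),
    mergeMsg = (a, b) => a + b)

  inCounts.collect.foreach(println)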
Let's quickly move ahead and look at the second operator, mapReduceTriplets. In earlier versions of GraphX, neighborhood aggregation was accomplished using the mapReduceTriplets operator. It takes a user-defined map function, which is applied to each triplet and can yield messages, and those messages are aggregated using a user-defined reduce function. So it basically applies the map function to all the triplets and then aggregates the resulting messages with the reduce function. The newer replacement for the mapReduceTriplets operator is aggregateMessages. Now moving ahead, let's talk about computing degree information.
So one of the common aggregation tasks is computing the degree of each vertex, that is, the number of edges adjacent to each vertex. In the context of a directed graph it is often necessary to know the in-degree, the out-degree and the total degree of a vertex; these kinds of things are pretty important, and the GraphOps class contains a collection of operators to compute the degrees of each vertex. As you can see, we have maxInDegrees, maxOutDegrees and maxDegrees: maxInDegrees tells us the maximum number of incoming edges, maxOutDegrees tells us the maximum number of outgoing edges, and maxDegrees tells us the maximum number of incoming plus outgoing edges. Now moving ahead to the next operator, that is collecting neighbors.
Neighbors in some cases. It may be easier to express the computation by collecting
neighboring vertices and their attribute
at each vertex. Now, this can be easily
accomplished using the collect neighbors ID and
the collect neighbors operator. So basically your collect
neighbor ID takes The Edge direction
as the parameter and it returns a Vertex rdd that contains the array
of vertex ID that is neighboring
to the particular vertex now similarly The Collection
neighbors again takes the edge directions as the input and it will return you the array with the vertex ID and
the vertex attribute both now, let me quickly open my VM and let us go through
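A small sketch of the degree and neighborhood operators, continuing with the illustrative graph:

  import org.apache.spark.graphx.EdgeDirection

  // Degree information as VertexRDD[Int] keyed by vertex ID.
  val inDeg  = graph.inDegrees
  val outDeg = graph.outDegrees
  val allDeg = graph.degrees

  // The vertex with the highest total degree.
  val busiest = allDeg.reduce((a, b) => if (a._2 > b._2) a else b)

  // Collect neighboring vertex IDs, or IDs together with attributes.
  val neighborIds   = graph.collectNeighborIds(EdgeDirection.Either)
  val neighborAttrs = graph.collectNeighbors(EdgeDirection.Out)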
Now, let me quickly open my VM and take you through the Spark directory. Let me first open my terminal. First I'll start the Hadoop daemons; for that I will go to the Hadoop home directory and run the start-all.sh script. Let me check whether the Hadoop daemons are running or not. So as you can see, NameNode, DataNode, SecondaryNameNode, NodeManager and ResourceManager, all the Hadoop daemons, are up. Now I will navigate to the Spark home and start the Spark daemons as well; I can see the Spark daemons are running. Let me first minimize this and take you to the Spark home; this is my Spark directory, and I'll go inside. Now, let me first show you the data which is present by default with your Spark installation. We'll open this in a new tab.
So you can see we have two files in this graphx data directory. Meanwhile, let me take you to the example code. This is the examples directory, and inside src/main/scala you can find the graphx directory; inside this graphx directory you have some of the sample codes which are present over here. I will take you to this AggregateMessagesExample.scala. Now, meanwhile, let me open the data as well, so you'll be able to understand.
Now, this is the followers.txt file. Basically, you can imagine these are the edges, each represented by a pair of vertices: this is vertex 2 and this is vertex 1, then this is vertex 4 and this is vertex 1, and so on. These lines are representing the edges, and if you remember, I have already told you that inside the GraphLoader class there is a function called edgeListFile which takes the edges from a file like this and then constructs the graph based on it. Second, you have this users.txt. These are basically the vertices with their vertex IDs: the vertex ID for this vertex is 1, for this one it is 2, and so on, and then this is the data which is attached, or you can say the attribute of the vertex. So these are the vertex IDs, which are 1, 2, 3 respectively, and this is the data associated with each vertex: this is the username, and this might be the name of the user, and so on. Now, you can also see that in some of the cases the name of the user is missing, as in this case. So these are the vertices, or you can say the vertex IDs and vertex attributes.
Now, let me take you through this AggregateMessagesExample. As you can see, we are giving the package name, which is org.apache.spark.examples.graphx, then we are importing the GraphX classes, that is the Graph class as well as VertexRDD, and next we are importing this GraphGenerators utility; I'll tell you in a moment why we are using it. Then we are using a SparkSession over here. So this is an example where we are using the aggregateMessages operator to compute the average age of the more senior followers of each user. Okay, so this is the AggregateMessagesExample object, and this is the main function, where we are first initializing the SparkSession, then giving the name of the application (you have to provide the application name), and then the getOrCreate method. Next, we are initializing the SparkContext as sc. Now coming to the code: we are specifying a graph, and this graph contains Double and Int, that is Graph[Double, Int]. I just told you that we are importing GraphGenerators; this GraphGenerators is used to generate a random graph for simplicity, so you would have some number of edges and vertices to work with.
Then you are using this logNormalGraph: you're passing the SparkContext and specifying the number of vertices as 100, so it will generate one hundred vertices for you. Then you are calling mapVertices and mapping each ID to a double, so this will basically set each vertex attribute to its ID as a double, which we will treat as the age. Next we are calculating olderFollowers, which is a VertexRDD[(Int, Double)], that is, the key is the vertex ID and the data associated with each vertex is a pair of an Int counter and a Double age. So you have this graph, which is generated randomly, and then you are performing aggregateMessages on it. This is the aggregateMessages operator, and if you can remember, we first have the send message part. Inside this triplet we are specifying a function: if the source attribute of the triplet is greater than the destination attribute of the triplet, that is, if the follower's age is greater than the age of the person they are following, then in that situation it will send a message to the destination vertex containing the counter, that is 1, and the age of the source attribute, that is the age of the follower. So you can see, if the age of the destination is less than the age of the source attribute, it tells you the follower is older than the user, and in that situation we send 1 to the destination along with the age of the source, or you can say the age of the follower. Then, second, I have told you that we have the merge messages part.
So here, in this reduce function, we are adding the counters and adding the ages. Now what we are doing next is dividing the total age by the number of older followers to derive the average age of older followers; this is the reason why we passed the age of the source vertex in the message. First we are specifying this variable, that is avgAgeOfOlderFollowers, which will be a VertexRDD[Double], and then we take this olderFollowers, that is the VertexRDD we just computed, and we map its values. So in each vertex we have the ID and the value, and in this situation we are using a case over the count and the total age: we take the total age and divide it by the count which we have gathered from the send message and aggregated using the reduce function. At last we display the result and then we stop the Spark session. A condensed sketch of this logic follows.
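A condensed, hedged reconstruction of the logic just described, close to the bundled Spark example but simplified; it assumes the SparkContext sc of a running shell:

  import org.apache.spark.graphx.{Graph, VertexRDD}
  import org.apache.spark.graphx.util.GraphGenerators

  // Random graph; each vertex attribute is its ID as a Double ("age").
  val ageGraph: Graph[Double, Int] =
    GraphGenerators.logNormalGraph(sc, numVertices = 100)
      .mapVertices((id, _) => id.toDouble)

  // For every edge where the follower (src) is older than the user (dst),
  // send (1, follower age) to the destination, then add the pairs up.
  val olderFollowers: VertexRDD[(Int, Double)] =
    ageGraph.aggregateMessages[(Int, Double)](
      triplet =>
        if (triplet.srcAttr > triplet.dstAttr)
          triplet.sendToDst((1, triplet.srcAttr)),
      (a, b) => (a._1 + b._1, a._2 + b._2))

  // Divide the total age by the counter to get the average age.
  val avgAgeOfOlderFollowers: VertexRDD[Double] =
    olderFollowers.mapValues((id, value) => value match {
      case (count, totalAge) => totalAge / count
    })

  avgAgeOfOlderFollowers.collect.foreach(println)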
So let me quickly open the terminal. I will go to the examples directory; I already took you through the source directory where the code is present inside scala, and inside that there is a graphx directory where you will find the code, but to execute the example you need to go to the jars directory. This is the Scala examples jar which you need to execute. But before this, let me take you to HDFS. The URL is localhost:50070, and we'll go to Utilities and then Browse the file system.
So as you can see, I have created a user directory in which I have specified the username, that is edureka, and inside edureka I have placed my data directory, where we have this graphx folder, and inside graphx we have both the files, that is followers.txt and users.txt. In this program we are not referring to these files, but the coming examples will be referring to them, so I would request you to first move them to this HDFS directory so that Spark can refer to the files in data/graphx. Now, let me quickly minimize this. The command to execute is spark-submit, where I provide the Spark examples jar and specify the class name with the --class option. To get the class name I will go to the code, first take the package name from here, and then take the class name, which is AggregateMessagesExample; so this is my class. And as I told you, you have to provide the name of the application, so let me keep it as example, and I'll hit enter.
and I'll hit enter. So now you can see the result. So this is the followers and this is the average
age of followers. So it is 34 Den. We have 52 which is
the count of follower. And the average age is
seventy six point eight that is it has
96 senior followers. And then the average age of the followers is
ninety nine point zero, then it has
four senior followers and the average age is 51. Then this vertex has
16 senior followers with the average age
of 57 point five. 5 and so on you can see
the result over here. So I hope now you guys are clear
with aggregate messages how to use aggregate messages how to specify
the send message then how to write the merge message. So let's quickly go back
to the presentation. Now, let us quickly move ahead and look at some
The first one is PageRank. PageRank measures the importance of each vertex in a graph, assuming that an edge from u to v represents an endorsement of v's importance by u. For example, if a Twitter user is followed by many other users, that user will rank high. GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank object. Static PageRank runs a fixed number of iterations, which can be specified by you, while dynamic PageRank runs until the ranks converge; what we mean by that is it runs until the ranks stop changing by more than a specified tolerance, so it keeps iterating until the PageRank of each vertex has stabilized. The Graph class also allows calling these algorithms directly as methods. A minimal sketch follows.
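A minimal sketch of dynamic PageRank, assuming the followers.txt and users.txt files at the illustrative paths used earlier and a SparkContext sc; it mirrors the bundled example that is walked through next:

  import org.apache.spark.graphx.GraphLoader

  // Load the follower graph and run PageRank until the ranks change by
  // less than the given tolerance.
  val followerGraph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
  val ranks = followerGraph.pageRank(0.0001).vertices

  // Attach usernames from users.txt, whose fields are id,username,...
  val users = sc.textFile("data/graphx/users.txt").map { line =>
    val fields = line.split(",")
    (fields(0).toLong, fields(1))
  }
  val ranksByUsername = users.join(ranks).map {
    case (id, (username, rank)) => (username, rank)
  }
  ranksByUsername.collect.foreach(println)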
Now, let's quickly go back to the VM. This is the PageRank example; let me open this file. First we are specifying the graphx examples package, then we are importing the GraphLoader. As you can remember, inside this GraphLoader class we have that edgeListFile method, which will basically create the graph using the edges, and we have those edges in our followers.txt file. Now coming back to the PageRank example, we are importing the Spark SQL SparkSession. This is the PageRankExample object, inside which we have created a main function, and we have similarly created the SparkSession builder, specified the app name which is to be provided, and then the getOrCreate method; this is where we are initializing the SparkContext. As I told you, using this edgeListFile method we are basically creating the graph from the followers.txt file. Now, we are running PageRank over here, so ranks will give you the PageRank of all the vertices inside this graph which we have just loaded using the GraphLoader class. If you pass an integer number of iterations, it runs that fixed number of iterations, otherwise, if you pass a double tolerance value, it runs until convergence. So we are running PageRank on this graph and we have taken the vertices.
Now after this, we are trying to load the users.txt file; we split each line by comma, convert field 0 to a Long and keep field 1. Basically, field 0 in users.txt is your vertex ID, or you can say the ID of the user, and field 1 is the username, so we are loading these two fields. Next, we are building the ranks by username: we take the users and we join the ranks, so this is where we are using the join operation. In ranksByUsername we are basically attaching the usernames to the PageRank values: we take the users, then we join the ranks, which we are getting from the PageRank step, and then we map out the username and the rank. It will run some iterations over the graph and try to converge, and after converging you can see the user and the rank: the maximum rank is with BarackObama, which is 1.45, then ladygaga with 1.39, then odersky, and so on. Let's go back to the slide. So now, after PageRank, let's quickly move ahead to connected components.
The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. A minimal sketch follows.
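A minimal sketch, reusing the followerGraph and users RDD from the PageRank sketch above:

  // Label every vertex with the smallest vertex ID in its component.
  val cc = followerGraph.connectedComponents().vertices

  val ccByUsername = users.join(cc).map {
    case (id, (username, component)) => (username, component)
  }
  ccByUsername.collect.foreach(println)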
So let us quickly go back to the VM and go inside the graphx directory; now we'll open this ConnectedComponentsExample. Again it's the same: we import the GraphLoader and the Spark session. This is the ConnectedComponentsExample object, next is the main function, and inside the main function we are again specifying the SparkSession, then the app name, then the SparkContext, so it's similar. Again, using this GraphLoader class and the edgeListFile method we are loading the followers.txt file. Now on this graph we are using the connectedComponents algorithm, and then we are trying to find the connected components. At last we are again loading this users file, that is users.txt, and we are trying to join the connected components with the usernames. Over here it is the same thing which we discussed in PageRank, that is, taking field 0 and field 1 of your users.txt file, and at last we are joining the users with the connected components from here. Then we are printing ccByUsername.collect. So let us quickly go ahead and execute this example as well; let me first copy this object name, and let's name this run example2. So as you can see, justinbieber gets the component label 1, then you can see this one gets the label 3, then this one gets 1, then BarackObama gets 1, and so on. So this basically gives you an idea about the connected components. Now, let's quickly move back to the slide and discuss the third algorithm, that is triangle counting.
So basically, a vertex is part of a triangle when it has two adjacent vertices with an edge between them; those three form a triangle, and that vertex is part of it. GraphX implements a triangle counting algorithm in the TriangleCount object, which determines the number of triangles passing through each vertex, providing a measure of clustering. We can compute the triangle count of the social network data set from the PageRank section. One more thing to note is that triangle count requires the edges to be in canonical orientation, that is, your source ID should always be less than your destination ID, and the graph must be partitioned using the Graph.partitionBy method. A minimal sketch follows.
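A minimal sketch, again reusing sc and the users RDD from the earlier sketches:

  import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

  // Load the edges in canonical orientation and partition the graph,
  // both of which triangleCount expects.
  val triGraph = GraphLoader
    .edgeListFile(sc, "data/graphx/followers.txt", canonicalOrientation = true)
    .partitionBy(PartitionStrategy.RandomVertexCut)

  // Number of triangles passing through each vertex.
  val triCounts = triGraph.triangleCount().vertices

  val triCountByUsername = users.join(triCounts).map {
    case (id, (username, count)) => (username, count)
  }
  triCountByUsername.collect.foreach(println)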
Now let's quickly go back; let me open the graphx directory again and we'll see the triangle counting example. So again it's the same: the object is TriangleCountingExample, and the main function is the same as well. Now we are again using this GraphLoader class and loading the followers.txt file, which contains the edges, as you can see here. We are loading the edges in canonical order and then using partitionBy, where we are passing RandomVertexCut, which is the partition strategy; this is how you can go ahead and implement a partition strategy, partitioning the graph for triangle count. Next, we are trying to find out the triangle count for each vertex: we have this triCounts variable, we are using the triangleCount algorithm and then taking the vertices, so it will execute triangle count over this graph which we have just loaded from the followers.txt file. And again we are basically joining the usernames: we take the usernames, we perform the join between users and triCounts, which comes from here, and then we print the values. So again, this is much the same; let us quickly go ahead and execute this triangle counting example. Let me copy this, I'll go back to the terminal, I'll name it example3 and change the class name, and I hit enter.
So now you can see the triangle counts: the count associated with justinbieber is 0, while for odersky and jeresig it is 1, and so on. For better understanding, I would recommend you to take this followers.txt, create the graph by yourself, attach these usernames to the vertices, and then you will get an idea about why it is giving the number as 1 or 0. The subgraph which connects vertices 1, 2 and 4 does not complete any triangle, so the value for those three vertices is 0, whereas the second subgraph, which connects vertices 3, 6 and 7, completes one triangle, so those three vertices have the value 1. Now, let me quickly go back. So now I hope that you guys are clear with all the concepts of graph operators and graph algorithms.
So now is the right time to look at a Spark GraphX demo where we'll go ahead and try to analyze the Ford GoBike data. Let me quickly go back to my VM. First, let me show you the website where you can go ahead and download the Ford GoBike data: over here you can go to download the bike trip history data, and you can download this 2017 trip data. I have already downloaded it, and to avoid typos I have already written all the commands. So first, let me go ahead and start the spark-shell. Now I'm inside the spark shell; let me first import GraphX and the Spark RDD package. So I've successfully imported GraphX and the Spark RDD package. Now, let me create a Spark SQL context as well; I have successfully created the Spark SQL context, and this is basically for running SQL queries over the data frames. Now, let me go ahead and import the data. I'm loading the data into a data frame: the format of the file is CSV, then in the options the header is already present, so that's true, then it will automatically infer the schema, and in the load parameter I have specified the path of the file. So I'll quickly hit enter. The data is loaded in the data frame; to check, I'll use df.count, which will give me the count, and you can see it has around 519,000 rows. Now, let me quickly go back and print the schema.
So this is the schema: the duration in seconds, then we have the start time and end time, then the start station ID, the start station name, the start station latitude and longitude, then the end station ID, end station name, end station latitude and longitude, then the bike ID, the user type, the birth year of the member and the gender of the member. Now, I'm trying to create a data frame called justStations, which will only contain the station ID and station name, which I'll be using for the vertices. So here I am creating a data frame with the name justStations, where I am selecting the start station ID and casting it as float, then selecting the start station name, and then using the distinct function to keep only the unique values. So I'll quickly go ahead and hit enter.
So again, let me go ahead and use this justStations and print the schema: you can see there is the station ID and then the start station name, and this justStations data frame contains the unique station values. Now again, I am creating stations, where I'm selecting the start station ID and the end station ID, then I am using distinct, which will again give me the unique values, and I'm using flatMap where I specify the iterable: we take x(0), that is your start station ID, and x(1), which is your end station ID. Then again I'm applying this distinct function so that it keeps only the unique values, and at last we have the toDF function which converts it to a data frame. So let me quickly go ahead and execute this. Now I am printing the schema, and as you can see it has one column, that is value, of data type long. So I have taken all the start and end station IDs, iterated over them using this flatMap, then using the distinct function I have kept the unique values and converted them to a data frame. I can use these stations, and using them I will basically keep each of the stations in a vertex; this is the reason why I'm taking the unique stations from the start station ID and end station ID, so that I can go ahead and define the vertices as the stations.
So now we are creating our set of vertices and attaching a bit of metadata to each one of them, which in our case is the name of the station. As you can see, we are creating stationVertices, which is again an RDD of vertex ID and String. We are using the stations which we have just created, we are joining them with justStations where the station value should be equal to the justStations station ID, and then selecting the station ID and the start station name. Then we are mapping row(0) and row(1): row(0) will basically be your vertex ID and row(1) will be the String, that is the name of your station. So let me quickly go ahead and execute this, and let us quickly print it using collect foreach println. Now, over here we are basically creating the trip edges from all our individual rides: we take the station values and then we add a dummy value of one. As you can see, I am selecting the start station and end station from the df, which is the first data frame which we loaded, and then I am mapping it to row(0) and row(1), which are your source and destination, and then I'm attaching a value of one to each one of them. So I'll hit enter.
Now, let me quickly go ahead and print these stationEdges: it is just taking the source station ID and the destination station ID, or you can say the vertex IDs, and attaching the value one to each one of them. So now you can go ahead and build your graph. But, as we discussed, we need a default station, because you can have situations where your edges might be indicating some vertices which are not present in your vertex RDD; for that situation we need to create a default station, so I created a default station called Missing Station. Now we are all set and we can go ahead and create the graph. The name of the graph is stationGraph, the vertices are stationVertices, which we have created and which basically contain the station ID and station name, then we have stationEdges, and at last we have the defaultStation. So let me quickly go ahead and execute this. Now I need to cache this graph for faster access, so I'll use the cache function. Let us quickly go ahead and check the number of vertices; these are the number of vertices. Again, we can check the number of edges as well; these are the number of edges. And for a sanity check, let's check the number of records that are present in the data frame: as you can see, the number of edges in our graph and the count in our data frame are the same. A condensed sketch of this graph construction follows.
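A condensed sketch of the graph construction just performed; the file name and the column names (start_station_id, start_station_name, end_station_id) are assumptions based on the schema shown above, and spark is the SparkSession of the shell:

  import org.apache.spark.graphx.{Edge, Graph, VertexId}
  import org.apache.spark.rdd.RDD

  val df = spark.read.format("csv")
    .option("header", "true").option("inferSchema", "true")
    .load("2017-fordgobike-tripdata.csv")

  // One vertex per station, keyed by station ID and carrying its name.
  val stationVertices: RDD[(VertexId, String)] = df
    .select("start_station_id", "start_station_name").distinct()
    .rdd.map(row => (row.getAs[Number](0).longValue, row.getString(1)))

  // One edge per ride, with a dummy weight of 1.
  val stationEdges: RDD[Edge[Long]] = df
    .select("start_station_id", "end_station_id").rdd
    .map(row => Edge(row.getAs[Number](0).longValue,
                     row.getAs[Number](1).longValue, 1L))

  val defaultStation = "Missing Station"
  val stationGraph = Graph(stationVertices, stationEdges, defaultStation).cache()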
So now let's go ahead and run PageRank on our data; we can either run a set number of iterations or run it until convergence, and in my case I'll run it till convergence. So it's ranks, then stationGraph, then pageRank, and I have specified a double value so it will run till convergence; let's wait for some time. Now that we have executed the PageRank algorithm, we have the ranks attached to each of the vertices. Let us quickly go ahead and look at the ranks: we are joining the ranks with the stationVertices, then we are sorting in descending order, taking the first 10 rows and printing them. So let's quickly go ahead and hit enter. You can see these are the top 10 stations with the highest PageRank values, so you can say they have the most incoming trips.
Now one question would be: what are the most common destinations in the data set, from location to location? We can do this by performing a grouping operation and adding the edge counts together; basically this will give a new graph, except each edge will now be the sum of all the semantically same edges. So again we are taking the stationGraph and performing groupEdges, aggregating the edge values e1 and e2 together. Then we are using the triplets and sorting them in descending order, and then we are printing, from the triplet, the source vertex and the number of trips, and then the destination attribute, or you can say the destination station. So you can see there are 1,933 trips from the San Francisco Ferry Building to this station, then you can see there are 1,411 trips from San Francisco to this location, then there are 1,025 trips from this station to San Francisco, and it goes on like that. So now we have a directed graph; that means our trips are directional, from one location to another, and we can go ahead and find the number of trips that go into a specific station and then leave from a specific station.
So basically we are trying to find the inbound and outbound values, or you can say the in-degree and out-degree of the stations. Let us first calculate the in-degrees: I'm using the stationGraph with the inDegrees operator, then I'm joining it with the stationVertices, then sorting it again in descending order and taking the top 10 values. So let's quickly go ahead and hit enter. These are the top 10 stations, and you can see the in-degrees, so there are that many trips coming into these stations. Now similarly, we can find the out-degrees; again, you can see the out-degrees as well, so these are the stations and these are their out-degrees. So again, you can go ahead and perform some more operations over this graph: you can find the station which has the most trips coming in but fewer trips leaving, and, on the contrary, you can find the stations where there are more trips leaving than coming in. A short sketch of these analysis queries follows.
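A short sketch of the analysis queries run above, continuing with the stationGraph and stationVertices from the previous sketch:

  import org.apache.spark.graphx.PartitionStrategy

  // PageRank until convergence, with station names attached.
  val stationRanks = stationGraph.pageRank(0.0001).vertices
  stationRanks.join(stationVertices)
    .sortBy(_._2._1, ascending = false)
    .take(10).foreach(println)

  // Most common station-to-station trips: co-locate identical edges,
  // then merge the parallel edges by summing their weights.
  stationGraph.partitionBy(PartitionStrategy.RandomVertexCut)
    .groupEdges((a, b) => a + b)
    .triplets.sortBy(_.attr, ascending = false)
    .take(10)
    .foreach(t => println(s"${t.srcAttr} -> ${t.dstAttr}: ${t.attr} trips"))

  // Busiest destinations by in-degree.
  stationGraph.inDegrees.join(stationVertices)
    .sortBy(_._2._1, ascending = false)
    .take(10).foreach(println)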
So I guess you guys are now clear with Spark GraphX. We discussed the different types of graphs, then moving ahead we discussed the features of GraphX, then we discussed property graphs: we understood what a property graph is, how you can create vertices, how you can create edges, and how to use VertexRDD and EdgeRDD. Then we looked at some of the important graph operators, and at last we understood some of the graph algorithms. So I guess now you guys are clear about how to work with Spark GraphX. Today's video is on Hadoop versus Spark.
Now, as we know, organizations from different domains are investing in big data analytics today. They're analyzing large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. All these findings are helping organizations achieve more effective marketing, new revenue opportunities and better customer service, and they're trying to gain competitive advantages over rival organizations along with other business benefits. Apache Spark and Hadoop are two of the most prominent big data frameworks, and I see people often comparing these two technologies, which is exactly what we're going to do in this video. We'll compare these two big data frameworks based on different parameters, but first it is important to get an overview of what Hadoop is and what Apache Spark is.
So let me just tell you a little bit about Hadoop. Hadoop is a framework to store and process large sets of data across computer clusters, and it can scale from a single computer system up to thousands of commodity systems that offer local storage and compute power. Hadoop is composed of modules that work together to create the entire Hadoop framework. These are some of the components that we have in the Hadoop framework, or the Hadoop ecosystem: for example, let me tell you about HDFS, which is the storage unit of Hadoop, and YARN, which is for resource management. There are different analytical tools like Apache Hive and Pig, NoSQL databases like Apache HBase, and even Apache Spark and Apache Storm fit into the Hadoop ecosystem for processing big data in real time. For ingesting data we have tools like Flume and Sqoop: Flume is used to ingest unstructured or semi-structured data, while Sqoop is used to ingest structured data into HDFS. If you want to learn more about these tools, you can go to Edureka's YouTube channel and look for the Hadoop tutorial, where everything has been explained in detail.
Now, let's move to Spark. Apache Spark is a lightning-fast cluster computing technology that is designed for fast computation. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark performs operations similar to Hadoop's modules, but it uses in-memory processing and optimizes the steps. The primary difference between MapReduce in Hadoop and Spark is that MapReduce uses persistent storage, while Spark uses resilient distributed datasets, known as RDDs, which reside in memory. Coming to the different components in Spark: Spark Core is the base engine for large-scale parallel and distributed data processing, and additional libraries built on top of the core allow diverse workloads such as streaming, SQL and machine learning. Spark Core is also responsible for memory management and fault recovery, for scheduling, distributing and monitoring jobs on a cluster, and for interacting with the storage systems as well.
Next up, we have Spark Streaming. Spark Streaming is the component of Spark which is used to process real-time streaming data; it enables high-throughput and fault-tolerant stream processing of live data streams. Then we have Spark SQL. Spark SQL is a module in Spark which integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language, and for those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools, where you can extend the boundaries of traditional relational data processing. Next up is GraphX: GraphX is the Spark API for graphs and graph-parallel computation, and thus it extends the Spark resilient distributed dataset with a resilient distributed property graph. Next is Spark MLlib for machine learning: MLlib stands for Machine Learning Library, and Spark MLlib is used to perform machine learning in Apache Spark.
Now, since you've got an overview of both these frameworks, I believe the ground is all set to compare Apache Spark and Hadoop. Let's move ahead and compare Apache Spark with Hadoop on different parameters to understand their strengths; we will be comparing these two frameworks based on the following parameters. Let's start with performance first. Spark is fast because it has in-memory processing, and it can also use disk for data that doesn't fit into memory. Spark's in-memory processing delivers near real-time analytics, and this makes Spark suitable for credit card processing systems, machine learning, security analytics and processing data from IoT sensors. Now, let's talk about Hadoop's performance. Hadoop was originally designed to continuously gather data from multiple sources, without worrying about the type of data, and to store it across a distributed environment, and MapReduce uses batch processing; MapReduce was never built for real-time processing. The main idea behind YARN is parallel processing over a distributed data set. The problem with comparing the two is that they have different ways of processing, and the ideas behind their development are also divergent.
Next, ease of use. Spark comes with user-friendly APIs for Scala, Java, Python and Spark SQL. Spark SQL is very similar to SQL, so it becomes easier for SQL developers to learn it, and Spark also provides an interactive shell for developers to query and perform other actions with immediate feedback. Now, let's talk about Hadoop. You can ingest data into Hadoop easily, either by using the shell or by integrating it with multiple tools like Sqoop and Flume, and YARN is just a processing framework that can be integrated with multiple tools like Hive and Pig for analytics. Hive is a data warehousing component which performs reading, writing and managing large data sets in a distributed environment using a SQL-like interface. To conclude, both of them have their own ways to make themselves user-friendly.
Now, let's come to the cost. Hadoop and Spark are both Apache open source projects, so there's no cost for the software; the cost is only associated with the infrastructure. Both products are designed in such a way that they can run on commodity hardware with a low TCO, or total cost of ownership. Well, now you might be wondering about the ways in which they are different, since so far they sound the same. Storage and processing in Hadoop is disk-based, and Hadoop uses standard amounts of memory, so with Hadoop we need a lot of disk space as well as faster transfer speeds; Hadoop also requires multiple systems to distribute the disk input/output. In the case of Apache Spark, due to its in-memory processing it requires a lot of memory, but it can deal with standard speeds and amounts of disk, as disk space is a relatively inexpensive commodity. Since Spark does not use disk input/output for processing, it instead requires large amounts of RAM to execute everything in memory, so Spark systems incur more cost. But yes, one important thing to keep in mind is that Spark's technology reduces the number of required systems: it needs significantly fewer systems, even if each costs more, so there will be a point at which Spark reduces the cost per unit of computation, even with the additional RAM requirement.
There are two types of data processing: batch processing and stream processing. Batch processing has been crucial to the big data world; in the simplest terms, batch processing is working with high data volumes collected over a period. In batch processing, data is first collected, then processed, and then the results are produced at a later stage, and it is an efficient way of processing large, static data sets. Generally we perform batch processing on archived data sets, for example, calculating the average income of a country or evaluating the change in e-commerce over the last decade. Now, stream processing: stream processing is the current trend in the big data world. The need of the hour is speed and real-time information, which is what stream processing provides. Batch processing does not allow businesses to quickly react to changing business needs, and real-time stream processing has seen a rapid growth in demand. Now, coming back to Apache Spark versus Hadoop: YARN is basically a batch-processing framework. When we submit a job to YARN, it reads data from the cluster, performs the operation, and writes the results back to the cluster; then it again reads the updated data, performs the next operation, writes the results back to the cluster, and so on. On the other hand, Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming as well.
Now, let's come to fault tolerance. Hadoop and Spark both provide fault tolerance, but they have different approaches. For HDFS and YARN, both master daemons, that is the NameNode in HDFS and the ResourceManager in YARN, check the heartbeats of the slave daemons; the slave daemons are the DataNodes and NodeManagers. If any slave daemon fails, the master daemon reschedules all pending and in-progress operations to another slave. This method is effective, but it can significantly increase the completion time for operations, even with a single failure, since Hadoop uses commodity hardware; another way in which HDFS ensures fault tolerance is by replicating data. Now let's talk about Spark. As we discussed earlier, RDDs, or resilient distributed datasets, are the building blocks of Apache Spark, and RDDs are the ones which provide fault tolerance to Spark. They can refer to any data set present in an external storage system like HDFS, HBase or a shared file system, and they can be operated on in parallel. RDDs can persist a data set in memory across operations, which makes future actions up to ten times faster, and if an RDD is lost, it will automatically get recomputed by applying the original transformations. This is how Spark provides fault tolerance; a tiny illustrative sketch follows.
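A tiny illustrative sketch of that idea, assuming a SparkContext sc; the HDFS path is hypothetical:

  import org.apache.spark.storage.StorageLevel

  // Each transformation is recorded in the lineage, so a lost partition
  // can be recomputed from the source data.
  val lines  = sc.textFile("hdfs://localhost:9000/data/events.log")
  val errors = lines.filter(_.contains("ERROR")).map(_.toUpperCase)

  // Persist in memory so repeated actions reuse the cached partitions;
  // if an executor is lost, only the lost partitions are recomputed.
  errors.persist(StorageLevel.MEMORY_ONLY)
  println(errors.count())
  println(errors.count())   // served from the cache the second time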
At the end, let us talk about security. Well, Hadoop has multiple ways of providing security. Hadoop supports Kerberos for authentication, but it is difficult to handle; nevertheless, it also supports third-party vendors like LDAP for authentication, and they also offer encryption. HDFS supports traditional file permissions as well as access control lists, and Hadoop provides Service Level Authorization, which guarantees that clients have the right permissions for job submission. Spark currently supports authentication via a shared secret; Spark can integrate with HDFS and use HDFS ACLs, or access control lists, and file-level permissions, and Spark can also run on YARN, leveraging the capability of Kerberos.
of these two Frameworks based on these following parameters. Now, let us understand use cases where these Technologies
fit best use cases were Hadoop fits best. For example, when you're analyzing
archive data yarn allows parallel processing over huge amounts of data parts
of data is processed parallely and separately on
different data nodes and gathers result
from each node manager in cases when instant results
are not required now Hadoop mapreduce is a good
and economical solution for batch processing. However, it is incapable of processing data
in real-time use cases where Spark fits best
first, real-time big data analysis. Real-time data analysis means processing data that is being generated by real-time event streams coming in at the rate of billions of events per second. The strength of Spark lies in its ability to support streaming of data along with distributed processing, and Spark claims to process data a hundred times faster than MapReduce in memory, and about ten times faster when working with disk. Second, graph processing: Spark contains a graph computation library called GraphX, which simplifies our life; in-memory computation along with the in-built graph support improves the performance of the algorithm by a magnitude of one or two degrees over traditional MapReduce programs. Third, iterative machine learning algorithms: almost all machine learning algorithms work iteratively, and as we have seen earlier, iterative algorithms involve I/O bottlenecks in the MapReduce implementations, because MapReduce uses coarse-grained tasks that are too heavy for iterative algorithms. Spark caches the intermediate data set after each iteration and runs multiple iterations on the cached data set, which eventually reduces the I/O overhead and executes the algorithm faster in a fault-tolerant manner. So, at the end, which one is the best? The answer
to this is that Hadoop and Apache Spark are not competing with one another; in fact, they complement each other quite well. Hadoop brings huge data sets under control on commodity systems, and Spark provides real-time, in-memory processing for those data sets. When we combine Apache Spark's abilities, that is, its high processing speed, advanced analytics, and multiple integration support, with Hadoop's low-cost operation on commodity hardware, we get the best results. Hadoop complements Apache Spark's capabilities; Spark does not completely replace Hadoop, but the good news is that the demand for Spark is currently at an all-time high. If you want to learn more about the Hadoop ecosystem tools and Apache Spark, don't forget to take a look at the Edureka YouTube channel and check out the big data and Hadoop playlist. Welcome, everyone, to
today's session on Kafka Spark Streaming. So without any further delay, let's look at the agenda. First, we will start by understanding what Apache Kafka is. Then we will discuss the different components of Apache Kafka and its architecture. Further, we will look at different Kafka commands. After that, we'll take a brief overview of Apache Spark and understand the different Spark components. Finally, we'll look at a demo in which we will use Spark Streaming with Apache Kafka. Let's move to our first slide. So in a real-time scenario, we have different
systems or services which will be communicating with each other, and the data pipelines are what establish the connections between two servers or two systems. Now, let's take the example of an e-commerce website. It can have multiple servers at the front end, like a web application server for hosting the application; it can have a chat server to provide chat facilities to customers; then it can have a separate server for payments, and so on. Similarly, an organization can also have multiple servers at the back end which will be receiving messages from the different front-end servers based on the requirements. They can have a database server which will be storing the records, then they can have security systems for user authentication and authorization, and then they can have a real-time monitoring server, which is basically used for recommendations. All these data pipelines become complex with the increase in the number of systems, and adding a new system or server requires more data pipelines, which again makes the data flow more complicated. Managing these data pipelines also becomes very difficult, as each data pipeline has its own set of requirements; for example, data pipelines which handle transactions should be more fault tolerant and robust, while on the other hand a clickstream data pipeline can be more fragile. So adding or removing pipelines becomes more difficult in such a complex system. I hope you now understand the problem out of which messaging systems originated. Let's move to the next slide
and we'll understand how Kafka solves this problem. A messaging system reduces the complexity of data pipelines and makes the communication between systems simpler and more manageable. Using a messaging system, you can easily establish remote communication and send your data across the network. Different systems may use different platforms and languages, and a messaging system provides a common paradigm, independent of any platform or language, so it decouples the platforms on which your front-end server and your back-end server are running. You can also establish asynchronous communication and send messages so that the sender does not have to wait for the receiver to process the messages. Another benefit of a messaging system is reliable communication: even when the receiver or the network is not working properly, your messages won't get lost. Now, talking about Kafka: Kafka
decouples the data pipelines and solves this complexity problem. The applications which produce messages to Kafka are called producers, and the applications which consume those messages from Kafka are called consumers. As you can see in the image, the front-end server, web application server 1, web application server 2, and the chat server are producing messages to Kafka, and these are called producers; and your database server, security systems, real-time monitoring server, other services, and data warehouse are consuming the messages and are called consumers. So your producer sends the message to Kafka, Kafka stores those messages, and consumers who want those messages can subscribe and receive them. You can also have multiple subscribers to a single category of messages: your database server and your security system can both be consuming the same messages produced by application server 1. And again, adding a new consumer is very easy: you can add a new consumer and just subscribe to the message categories that it requires. So you can add a new consumer, say consumer 1, and subscribe to the category of messages produced by application server 1. So let's quickly move ahead and talk about
Apache Kafka. Apache Kafka is a distributed publish/subscribe messaging system. Messaging traditionally has two models: queuing and publish/subscribe. In a queue, a pool of consumers may read from a server and each record goes to only one of them, whereas in publish/subscribe the record is broadcast to all consumers, so multiple consumers can get the record. The Kafka cluster is distributed and has multiple machines running in parallel, and this is the reason why Kafka is fast, scalable, and fault tolerant. Now, let me tell you that Kafka was developed at LinkedIn and later became a part of the Apache project. Now, let us look at some
of the important terminologies. We'll first start with topics. A topic is a category or feed name to which records are published, and topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or multiple consumers that subscribe to the topic and consume the data written to it. For example, you can have sales records getting published to a sales topic, product records getting published to a product topic, and so on. This segregates your messages, and a consumer will only subscribe to the topics it needs; again, a consumer can also subscribe to two or more topics. Now, let's talk
about partitions. Kafka topics are divided into a number of partitions, and partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers, which means each partition can be placed on a separate machine so that multiple consumers can read from a topic in parallel. So in the case of the sales topic, you can have three partitions, partition 0, partition 1, and partition 2, from which three consumers can read data in parallel. Now, moving ahead, let's talk about producers. Producers are the ones who publish data
to topics of their choice. Then you have consumers: consumers can subscribe to one or more topics and consume data from those topics. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. So suppose you have a consumer group, say consumer group 1, with three consumers in it: consumer A, consumer B, and consumer C. From the sales topic, each record is read once by consumer group 1, and it can be read by either consumer A, consumer B, or consumer C, but it will be consumed only once within that single consumer group. But again, you can have multiple consumer groups subscribing to a topic, in which case one record is consumed by multiple consumers, that is, one consumer from each consumer group. So now let's say you have consumer group 1 and consumer group 2. In consumer group 1 we have two consumers, consumer 1A and consumer 1B, and in consumer group 2 we have two consumers, consumer 2A and consumer 2B. If consumer group 1 and consumer group 2 are both consuming messages from the sales topic, a single record will be consumed by consumer group 1 as well as consumer group 2, and a single consumer from each of the two groups will consume the record once. So I guess you are now clear on the concept of consumers and consumer groups. Consumer instances can be separate processes or run on separate machines.
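As a concrete illustration of consumer groups, here is a minimal sketch of a Java consumer that joins a group named consumer-group-1 and subscribes to the sales topic from the example above. The group name, topic, and broker address are assumptions taken from this discussion, and a reasonably recent kafka-clients dependency is assumed.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SalesConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "consumer-group-1");   // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sales"));
            while (true) {
                // Each record of "sales" is delivered to exactly one consumer in this group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```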
Now, talking about brokers: a broker is nothing but a single machine in the Kafka cluster, and ZooKeeper is another Apache open-source project. It stores the metadata information related to the Kafka cluster, like broker information, topic details, etc.; ZooKeeper is basically the one managing the whole Kafka cluster. Now, let's quickly go
to the next slide. Suppose you have a topic; let's assume this is the sales topic, and it has four partitions: partition 0, partition 1, partition 2, and partition 3. You have five brokers over here. Let's take the case of partition 1: if the replication factor is 3, it will have three copies, which will reside on different brokers. Say one replica is on broker 2, the next is on broker 3, and the next is on broker 5; and as you can see, replica 5 is hosted on broker 5, so the ID of a replica is the same as the ID of the broker that hosts it. Now, moving ahead, one of the replicas of partition 1 will serve as the leader replica. Say the leader of partition 1 is replica 5; any consumer consuming messages from partition 1 will be served by this replica, and the other two replicas are basically there for fault tolerance, so that if broker 5 goes down or its disk becomes corrupt, then either replica 3 or replica 2 will take over as the leader. This is decided on the basis of the most in-sync replica: the replica which is most in sync with the current leader will become the next leader. Similarly, partition 0 may reside on broker 1, broker 2, and broker 3; partition 2 may reside on broker 4, broker 5, and, say, broker 1; and the third partition might reside on another three brokers. So suppose this is the leader for partition 2, this is the leader for partition 0, this is the leader for partition 3, and this is the leader for partition 1. You can see that four consumers can consume data in parallel from these brokers: this consumer can consume data from partition 2, this consumer can consume data from partition 0, and similarly for partition 3 and partition 1. Maintaining the replicas helps with fault tolerance, and keeping different partition leaders on different brokers helps with parallel execution, or you could say with consuming those messages in parallel. So I hope that you guys are clear about topics, partitions, and replicas. Now, let's move to the next slide. So this is how the whole
Kafka cluster looks: you have multiple producers, which are producing messages to Kafka; then this whole thing is the Kafka cluster, where you have two nodes, node 1 with two brokers, broker 1 and broker 2, and node 2 with two brokers, broker 3 and broker 4; consumers will be consuming data from these brokers, and ZooKeeper is the one managing this whole Kafka cluster. Now, let's look at some basic Kafka commands and understand how Kafka works: how to start ZooKeeper, how to start a Kafka server, and how to produce some messages to Kafka and then consume messages from Kafka. So let me quickly switch to my VM, and let me
quickly open the terminal. Let me quickly execute sudo jps so that I can check all the daemons that are running on my system. You can see I have the NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer. Now, as all the HDFS daemons are running, let us quickly go ahead and start the Kafka services. First I will go to the Kafka home directory; let me show you the directory. My Kafka is in /usr/lib. Now, let me quickly start the ZooKeeper service, but before that, let me show you the zookeeper.properties file. The client port is 2181, so my ZooKeeper will be running on port 2181, and the data directory in which my ZooKeeper will store all the metadata is /tmp/zookeeper. So let us quickly start ZooKeeper; the command is the bin zookeeper-server-start script, and then I'll pass the properties file, which is inside the config directory. Meanwhile, let me open another tab. Here I will be starting
my first Kafka broker. But before that, let me show you the properties file. We'll go into the config directory again, where I have server.properties. This is the properties file of my first Kafka broker. First we have the server basics: the broker ID of my first broker is 0, and the port is 9092, on which my first broker will be running; this section contains all the socket server settings. Then, moving ahead, we have the log basics. Here, this is the log directory, which is /tmp/kafka-logs, so this is where my Kafka will store all the messages or records which are produced by the producers; all the records which belong to broker 0 will be stored at this location. The next section is the internal topic settings, in which the offsets topic replication factor is 1 and the transaction state log replication factor is 1. Next we have the log retention policy: the log retention hours value is 168, so your records will be stored for 168 hours by default and then deleted. Then you have the ZooKeeper properties, where we have specified zookeeper.connect, and as we saw in the zookeeper.properties file, our ZooKeeper will be running on port 2181, so we are giving the address of ZooKeeper as localhost:2181. At last we have the group coordinator settings, so
let us quickly go ahead and start the first broker. The script file is kafka-server-start.sh, and then we have to give the properties file, which is server.properties for the first broker. I'll hit enter, and meanwhile, let me open another tab. Now I'll show you the next properties file, which is server-1.properties. The things you have to change to create a new broker are: first, the broker ID. My earlier broker ID was 0; the new broker ID is 1. Again, you can replicate this file, and for another new server you would change the broker ID to 2. Then you have to change the port, because my first broker, broker 0, is already running on 9092, so the new broker should bind to a different port, and here I have specified 9093. The next thing to change is the log directory; here I have added a "-1" to the default log directory, so all the records stored on broker 1 will go to this particular directory, that is /tmp/kafka-logs-1. The rest of the settings are similar, so let me quickly start the second broker as well, and let me open one more terminal and start the third broker too. So ZooKeeper has started, the first broker has started, this is the second broker which has also started, and this is the third broker. So now let me
quickly minimize this and open a new terminal. First, let us look at some commands related to Kafka topics. I'll quickly go ahead and create a topic. Again, let me first go to my Kafka home directory. The script file is kafka-topics.sh; the first parameter is --create, then we have to give the address of ZooKeeper, because ZooKeeper is the one which actually holds all the details related to your topics. The address of my ZooKeeper is localhost:2181. Then we give the topic name, so let me name the topic kafka-spark. Next we have to specify the replication factor of the topic; it will replicate all the partitions inside the topic that many times. For --replication-factor, as we have three brokers, let me keep it as 3, and then we have --partitions; I will keep that as three as well, because we have three brokers running and our consumers can then consume messages in parallel from the three brokers. Let me press enter. So now you can see the topic has been created.
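If you prefer to create the topic programmatically instead of through the shell script, the Kafka AdminClient can do the same thing. This is a minimal sketch with the same layout as the CLI example (3 partitions, replication factor 3); note it assumes a recent kafka-clients version and talks to the brokers directly rather than to ZooKeeper like the older CLI shown here.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Same topic layout as the CLI example: 3 partitions, replication factor 3.
            NewTopic topic = new NewTopic("kafka-spark", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();

            // List topic names to confirm the topic exists.
            System.out.println(admin.listTopics().names().get());
        }
    }
}
```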
Now, let us quickly go ahead and list all the topics. The command is ./bin/kafka-topics.sh again, then --list, and again we provide the address of ZooKeeper. So to list the topics we go to the kafka-topics script file, give the --list parameter, and then give the ZooKeeper address, which is localhost:2181. I'll hit enter, and you can see I have this kafka-spark topic; the kafka-spark topic has been created. Now, let me show you
one more thing. Again we go to bin/kafka-topics.sh, and this time we describe the topic: I pass the address of ZooKeeper, which is localhost:2181, and then the topic name, which is kafka-spark. So now you can see here: the topic is kafka-spark, the partition count is 3, the replication factor is 3, and the config is as follows. Here you can see all three partitions of the topic, that is partition 0, partition 1, and partition 2; the leader for partition 0 is broker 2, the leader for partition 1 is broker 0, and the leader for partition 2 is broker 1. So we have different partition leaders residing on different brokers; this is basically for load balancing, so that different partitions can be served from different brokers and consumed in parallel. You can also see that the replicas of this partition reside on all three brokers, the same for partition 1 and partition 2, and it shows you the in-sync replicas: for this partition they are 2, then 0, then 1, and similarly for partitions 1 and 2. So now let us quickly go ahead; I'll resize this window to half the screen and open one more terminal. The reason I'm doing this is that we can produce a message from one console and receive the message in another console. For that, I'll start the Kafka console producer first. The command is ./bin/kafka-console-producer.sh, and in the case of the producer you have to give the parameter --broker-list, which is localhost:9092. You can provide any one of the running brokers and it will discover the rest of the brokers from there, so you only have to provide the address of one broker. You can also provide a set of brokers, giving it as localhost:9092,localhost:9093, and so on; here I am passing the address of the first broker. Next I have to mention the topic, so the topic is kafka-spark, and I'll hit enter. My console producer has started; let me produce a message saying hi. Now, in the second terminal,
I will go ahead and start the console consumer. Again, the command is kafka-console-consumer.sh, and in the case of the consumer you have to give the parameter --bootstrap-server. This is the thing to notice, guys: for the producer you give the broker list, while for the consumer you give the bootstrap server, and it is again the same address, localhost:9092, which is the address of my broker 0. Then I give the topic, which is kafka-spark. Adding the parameter --from-beginning gives me the messages stored in that topic from the beginning; otherwise, if I do not give --from-beginning, I will only see the recent messages produced after starting this console consumer. So let me hit enter, and you would expect to see the message "hi" first. Well, I'm sorry guys, the topic name I typed is not correct; sorry for my typo. Let me quickly correct it and hit enter. As you can see, I am receiving the messages: I received "hi", and then let me produce some more messages. Now you can see that all the messages I am producing from the console producer are being consumed by the console consumer. This console producer and console consumer are basically used by developers to test the Kafka cluster. If there is a producer running and producing messages to Kafka, you can start a console consumer and check whether the producer is producing messages or not, or check the format in which your messages are being produced to the topic; that kind of testing is done with the console consumer. Similarly, with the console producer, if you are building a consumer, you can produce a message to the Kafka topic and then check whether your consumer is consuming that message or not. This is basically used for testing.
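The same test can also be done from code. A minimal Java producer that sends the same "hi" message to the kafka-spark topic might look like the sketch below; it assumes the broker on localhost:9092 and plain string keys and values.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ConsoleLikeProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Equivalent to typing "hi" into the console producer for the kafka-spark topic.
            producer.send(new ProducerRecord<>("kafka-spark", "hi"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("sent to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
            producer.flush();
        }
    }
}
```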
Now, let us quickly go ahead, close this, and get back to our slides. I have briefly covered Kafka and its concepts; basically, I have given you a small, brief idea of what Kafka is and how Kafka works. Now that we have understood why we need messaging systems, what Kafka is, the different Kafka terminologies, how the Kafka architecture works, and some of the basic Kafka commands, let us now understand what Apache Spark is. So basically Apache Spark is an open-source cluster
computing framework for near real-time processing. Spark provides an interface for programming the entire cluster with implicit data parallelism and fault tolerance. We'll talk about how Spark provides fault tolerance, but regarding implicit data parallelism: it means you do not need any special directives, operators, or functions to enable parallel execution; Spark provides data parallelism by default. Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, machine learning algorithms, and streaming. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of the application. Spark does not keep writing data to disk; instead, it transforms the data and keeps the data in memory, so that multiple operations can be applied over the data quickly, and only the final result is stored on disk. On top of that, Spark can also do batch processing a hundred times faster than MapReduce, and this is why Apache Spark is the go-to tool for big data processing in the industry. Now, let's quickly move
ahead and understand how Spark does this. The answer is the RDD, that is, the resilient distributed data set. An RDD is a read-only, partitioned collection of records, and it is the fundamental data structure of Spark. Basically, an RDD is an immutable, distributed collection of objects. Each data set in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster, and an RDD can contain any type of Python, Java, or Scala objects. Now, about fault tolerance: an RDD is a fault-tolerant collection of elements that can be operated on in parallel. How does an RDD achieve that? If an RDD partition is lost, it will automatically be recomputed by applying the original transformations, and this is how Spark provides fault tolerance. So I hope you are clear on how Spark provides fault tolerance. Now let's talk about how we can create RDDs. There are two ways to create RDDs: the first is parallelizing an existing collection in your driver program, and the second is referring to a data set in an external storage system, such as a shared file system; it can be HDFS, HBase, or any other data source offering a Hadoop input format. Spark makes use of the concept of the RDD to achieve
fast and efficient operations. Now, let's quickly look at how an RDD is used. First we create an RDD, either by parallelizing a collection or by referring to an external storage system. Once you have an RDD, you can apply multiple transformations over it, like filter, map, union, etc., and each transformation gives you a new RDD, or you could say the transformed RDD. At last you apply some action and get the result; this action can be count, first, take, collect, and those kinds of functions. So now this is a brief idea of what an RDD is and how RDDs work.
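Here is a minimal sketch of that create-transform-act cycle using Spark's Java API; the numbers and the specific transformations are only illustrative.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddWorkflowSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddWorkflowSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 1. Create an RDD by parallelizing an existing collection in the driver.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));

        // 2. Apply transformations; each one returns a new RDD, the original is never mutated.
        JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);
        JavaRDD<Integer> doubled = evens.map(n -> n * 2);

        // 3. Apply actions to materialize results on the driver.
        System.out.println("count   : " + doubled.count());
        System.out.println("first   : " + doubled.first());
        System.out.println("collect : " + doubled.collect());

        sc.close();
    }
}
```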
Now let's quickly move ahead and look at the different workloads that can be handled by Apache Spark. We have interactive streaming analytics, machine learning, data integration, and Spark streaming and processing. Let us talk about them one by one. First is Spark
streaming and processing. Basically, as you know, data arrives at a steady rate, or you could say as continuous streams. One option is to store the data set on disk and then apply some processing and analytics over it later to get results, but that does not work for every case. Take the example of financial transactions, where you have to identify and refuse potentially fraudulent transactions. If you store the data stream first and apply the processing later, it would be too late and someone would have gotten away with the money. In that scenario, you need to take the input data stream, apply some transformations over it quickly, and then take actions accordingly: you can send a notification, or reject that fraudulent transaction, something like that. Then, if you want, you can store the results or the data set in a database or a file system. So there are scenarios where we have to process the stream of data and then store it, perform some analysis on it, or take some necessary actions; this is where Spark streaming comes into the picture, and Spark is a best fit for processing those continuous input data streams. Now moving on to the next workload,
that is, machine learning. As you know, first we create a machine learning model, then we continuously feed the incoming data streams to the model, and we get continuous output based on the input values. We reuse intermediate results across multiple computations in multi-stage applications, which normally incurs substantial overhead due to data replication, disk I/O, and serialization, and that makes the system slow. What Spark does is store the intermediate results in RDDs in distributed memory instead of stable storage, which makes the system faster. As we saw, in Spark all the transformations are applied on RDDs and the transformed RDDs can be kept in memory itself, so we can quickly apply further iterative algorithms over them without spending time on things like data replication or disk I/O; all those overheads are reduced. Now, you might be wondering that memory is always limited, so what if the memory runs out? If the distributed memory is not sufficient to store the intermediate results, then Spark will store those results on disk. So I hope you are clear on how Spark performs these iterative machine learning algorithms and why Spark is fast.
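A rough sketch of that caching pattern in Spark's Java API is shown below. The update rule inside the loop is a toy placeholder (a real algorithm such as gradient descent would compute a gradient there); the point is that persist keeps the working set in memory across iterations, spilling to disk only if memory runs out.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class IterativeCachingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeCachingSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Double> points = sc.parallelize(Arrays.asList(1.0, 2.5, 3.7, 4.2, 5.9));

        // Keep the working set in memory, spilling to disk only if memory runs out,
        // so each iteration below reuses cached partitions instead of re-reading input.
        points.persist(StorageLevel.MEMORY_AND_DISK());

        double estimate = 0.0;
        for (int i = 0; i < 10; i++) {
            final double current = estimate;
            // Toy iterative update over the cached data set.
            estimate = points.map(p -> p - current).reduce(Double::sum) / points.count();
        }
        System.out.println("estimate after 10 iterations: " + estimate);

        sc.close();
    }
}
```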
Let's look at the next workload, which is interactive streaming analytics. As we already discussed streaming data: a user runs ad hoc queries on the same subset of data, and each query would normally do disk I/O against stable storage, which can dominate the application's execution time. Let me take the example of a data scientist. You have continuous streams of data coming in, and your data scientist will ask some questions and execute some queries over the data, then view the result, then perhaps alter the initial question slightly after seeing the output, or drill deeper into the results and execute more queries over the gathered result. So there are multiple scenarios in which your data scientist would be running interactive queries over the streaming data. Now, how does Spark help with this interactive streaming analytics? Each transformed RDD may be recomputed each time you run an action on it, but when you persist an RDD in memory, Spark keeps the elements around on the cluster for faster access, and whenever you execute the next query over the data, the query runs quickly and gives you an almost instant result. So I hope you are clear on how Spark helps with interactive streaming analytics. Now, let's talk
about data integration. As you know, in large organizations data is produced by different systems across the business, and you need a framework which can integrate these different data sources. Spark is one framework that does this: you can take data from Kafka, Cassandra, Flume, HBase, or Amazon S3, perform real-time or near-real-time analytics over it, apply some machine learning algorithms, and then store the processed result in Apache HBase, MySQL, HDFS, or even back into Kafka. So Spark gives you multiple options for where to pick the data up from and, again, where to write the data to. Now, let's quickly move ahead and talk about the different Spark components. So you can see here
so you can see here. I have a spark or engine. So basically this
is the core engine and on top of this core engine. You have spark SQL spark
streaming then MLA, then you have graphics
and the newest Parker. Let's talk about them one
by one and we'll start with spark core engine. So spark or engine
is the base engine for large-scale parallel and distributed data processing
additional libraries, which are built on top
of the core allows divers workloads Force. Streaming SQL machine learning
then you can go ahead and execute our on spark or you can go ahead
and execute python on spark those kind of workloads. You can easily go
ahead and execute. So basically your spark
or engine is the one who is managing all your memory, then all your fault
recovery your scheduling your Distributing of jobs and monitoring jobs on a cluster and interacting
with the storage system. So in in short we
can see the spark or engine is the heart of Spock and on top of this all of these libraries
are there so first, let's talk about
Spark Streaming. Spark Streaming is the component of Spark used to process real-time streaming data, as we just discussed, and it is a useful addition to the Spark core API. It enables high-throughput, fault-tolerant stream processing of live data streams, so you can perform all your streaming data analytics using Spark Streaming. Then you have Spark SQL over here. Spark SQL is a newer module in Spark which integrates relational processing with Spark's functional programming API, and it supports querying data either via SQL or via the Hive Query Language (HQL). For those of you who are familiar with an RDBMS, Spark SQL is an easy transition from your earlier tools, where you can extend the boundaries of traditional relational data processing.
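As a small illustration, here is a hedged sketch of querying data with Spark SQL through the Java API; the JSON path and the column names are assumptions for the example, not part of this course's demo.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlSketch")
                .master("local[*]")
                .getOrCreate();

        // The JSON path is hypothetical; any structured source (CSV, Parquet, Hive) works the same way.
        Dataset<Row> transactions = spark.read().json("/tmp/transactions.json");
        transactions.createOrReplaceTempView("transactions");

        // Plain SQL over the registered view.
        Dataset<Row> totals = spark.sql(
                "SELECT payment_type, SUM(price) AS total FROM transactions GROUP BY payment_type");
        totals.show();

        spark.stop();
    }
}
```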
Now, talking about GraphX: GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD with a resilient distributed property graph; at a high level, GraphX extends the RDD abstraction by introducing the resilient distributed property graph, which is a directed multigraph with properties attached to each vertex and edge. Next we have SparkR, which provides a package for the R language so you can leverage Spark's power from an R shell. Next you have Spark MLlib; MLlib stands for machine learning library, and Spark MLlib is used to perform machine learning in Apache Spark. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines; this includes summary statistics, correlations, classification and regression, collaborative filtering techniques, cluster analysis methods, dimensionality reduction techniques, feature extraction and transformation functions, and optimization algorithms. It is basically the machine learning package on top of Spark. Then you also have something called PySpark, which is the Python package for Spark, so you can leverage Python over Spark. So I hope that you guys are clear
on the different Spark components. Before moving to the Kafka Spark streaming demo: I have only given you a brief intro to Apache Spark. If you want a detailed tutorial on Apache Spark or its different components, like Apache Spark SQL, Spark DataFrames, Spark Streaming, Spark GraphX, or Spark MLlib, you can go to the Edureka YouTube channel. So now we are here, guys; I know you have been waiting for this demo for a while. So now let's go ahead and look at the Kafka Spark streaming demo. Let me quickly go
ahead and open my virtual machine, and I'll open a terminal. Let me first check all the daemons that are running on my system. My ZooKeeper is running, the NameNode is running, the DataNode is running, my ResourceManager is running, all three Kafka brokers are running, then the NodeManager is running, and the JobHistoryServer is running. Now I have to start my Spark daemons, so let me first go to the Spark home and start the Spark daemons; the command is sbin/start-all.sh. Let me quickly execute sudo jps to check my Spark daemons, and you can see the Master and Worker daemons are running. So let me close this terminal and go to the project directory. Basically, I have two projects: this one is the Kafka transaction producer, and the next one is the Spark streaming Kafka master. First we will produce messages from the Kafka transaction producer, and then we'll stream the records produced by it using the Spark streaming Kafka master project. So first, let me take you through this Kafka
card transaction producer. So this is
our cornbread XML file. Let me open it with G edit. So basically this is a me. Project and and I have used
spring boot server. So I have given Java version as a you can see
cough cough client over here and the version of Kafka client, then you can see I'm putting
Jackson data bind. Then ji-sun and then I
am packaging it as a war file that is web archive file. And here I am again specifying
the spring boot Maven plugins, which is to be downloaded. So let me quickly go ahead and close this and we'll go
to this Source directory and then we'll go inside main. So basically this is the file
that is sales Jan 2009 file. So let me show you
the file first. So these are the records which I'll be producing
to the Kafka. So the fields
are transaction date than product price payment type the name city state
country account created then last login latitude and longitude. So let me close this file
and then the application.yml is the main property file. In this application.yml I am specifying the bootstrap server, which is localhost:9092, and the broker list, which again resides on localhost:9092. Next I have the product topic: the topic is transaction, and the partition count is 1. The acks config controls the criteria under which requests are considered complete, and the "all" setting we have specified will result in blocking on the full commit of the record; it is the slowest but most durable setting. Now, talking about retries: it will retry three times. Then we have the minimum pool size and the maximum pool size, which are basically for implementing Java threads, and at last we have the file path. This is the path of the file which I showed you just now, so messages will be read from this file.
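For reference, setting those same producer options in Java code would look roughly like the sketch below; the property names come from the standard ProducerConfig constants, while the values mirror the ones described above.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerConfigSketch {
    static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // "all" blocks until the full in-sync replica set has committed the record:
        // the slowest but most durable acknowledgement setting.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry a failed send up to three times before giving up.
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }
}
```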
Let me quickly close this file, and we'll look at application.properties. Here we have specified the properties for the Spring Boot server: we have the server context path, that is /edureka, then the spring application name, that is kafka-producer, we have the server port, and the spring events timeout is 20. So let me close this as well. Let's go back and go inside java/com/edureka/kafka. We'll explore the important files one by one. So let me first take you
through the dto directory. Over here we have Transaction.java; this is where we define the model. You can see these are the fields from the file which I showed you: transaction date, product, price, payment type, name, city, state, country, and so on, and we have created a variable for each field. What we are doing is creating getter and setter functions for all these variables. We have getTransactionId, which returns the transaction ID, and setTransactionId, which sets the transaction ID. Similarly, we have getTransactionDate for getting the transaction date, then setTransactionDate, which sets the transaction date using the variable; then we have getProduct, getPrice, setPrice, and all the getter and setter methods for each of the variables. This is the constructor: here we take all the parameters, like transaction date, product, and price, and set the value of each of the variables using the this operator, so we set the values for transaction date, product, price, payment type, and all the other fields. Next, we also create a default constructor, and then over here we override the toString method, returning the transaction details: "transaction date" and then the value of the transaction date, "product" and then the value of the product, "price" and then the value of the price, and so on for all the fields. So this is the model of the transaction; we can create an object of this Transaction class and easily send the transaction object as the value. This is the main reason for creating this Transaction model.
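An abbreviated sketch of such a model class is shown below; only three of the fields are included here, whereas the real Transaction class in the project has one field, getter, and setter per column of the sales file.

```java
// Abbreviated sketch of the Transaction model described above.
public class Transaction {
    private String transactionDate;
    private String product;
    private double price;

    public Transaction() { }                        // default constructor for deserialization

    public Transaction(String transactionDate, String product, double price) {
        this.transactionDate = transactionDate;     // "this" distinguishes the field from the parameter
        this.product = product;
        this.price = price;
    }

    public String getTransactionDate() { return transactionDate; }
    public void setTransactionDate(String transactionDate) { this.transactionDate = transactionDate; }

    public String getProduct() { return product; }
    public void setProduct(String product) { this.product = product; }

    public double getPrice() { return price; }
    public void setPrice(double price) { this.price = price; }

    @Override
    public String toString() {
        return "Transaction{date=" + transactionDate + ", product=" + product + ", price=" + price + "}";
    }
}
```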
Now let me quickly go ahead and close this file. Let's go back and first take a look at the config: this is KafkaProperties.java. As I showed you in the application.yml file, we have taken all the parameters we specified over there, that is, the bootstrap server, product topic, partition count, brokers, file name, and thread count; for each of these properties, like the file path, we have created a variable, and then we are doing the same thing as we did with our Transaction model: creating a getter and a setter method for each of these variables. So you can see we have getFilePath, which returns the file path, then setFilePath, where we set the file path using the this operator; similarly, we have getProductTopic and setProductTopic, then getters and setters for the thread count, for the bootstrap server, and for all the other properties. Now we can use this KafkaProperties class anywhere and easily extract those values using the getter methods. So let me quickly close
this file, and I'll take you to the configurations. In this configuration, we create an object of KafkaProperties, as you can see; then we create a Properties object and set the properties. You can see that we set the bootstrap server config and retrieve the value from the KafkaProperties object via its getBootstrapServer function; then we set the acknowledgement config, getting it from the getAcknowledgement function, and we use the getRetries method. So from the KafkaProperties object we call those getter methods, retrieve the values, and set them in this Properties object. We also have the partitioner class: we use the default partitioner, which is present in the org.apache.kafka.clients.producer.internals package. Then we create a producer over here and pass this props object, which sets the properties; we pass the key serializer, which is the StringSerializer, and then the value serializer, where we create a new custom JSON serializer with Transaction as its type, and it returns the producer. Then we implement threads: we get the minimum pool size and the maximum pool size from KafkaProperties, so over here we are implementing Java threads. Now let me quickly close this Kafka producer configuration, where we configure our Kafka producer. Let's go back and quickly go to this API, which has the EventProducerApi.java file. So here we are basically
EPA dot Java file. So here we are basically
creating an event producer API which has this
dispatch function. So we'll use this dispatch
function to send the records. So let me quickly
close this file. Let's go back. We have already seen this config and configurations in which we are basically
retrieving those values from application dot yml file and then we are Setting
the producer configurations, then we have constants. So in Kafka constants or Java, we have created this Kafka
constant interface where we have specified the batch size account limit
check some limit then read batch size minimum
balance maximum balance minimum account maximum account. Then we are also implementing
daytime for matter. So we are specifying all
the constants over here. Let me close this file. Let's go back then this is
Manso will not look at these two files, but let me tell you what
does these two files to these two files are
basically to record the metrics of your Kafka like time in which your thousand records have
been produced in cough power. You can say time in which records
are getting published to Kafka. It will be monitored and then
you can record those starts. So basically it helps in optimizing the performance
of your Kafka producer, right? You can actually know
how to do Recon. How to add just
those configuration factors and then you can
see the difference or you can actually
monitor the stats and then understand or how you can actually make
your producer more efficient. So these are basically for those factors but let's
not worry about this right now. Let's go back next. Let me quickly take you
through this file utility. You have FileUtility.java. What we are doing over here is reading each record from the file using a file reader: you can see we have this list, then a BufferedReader and a FileReader. First we read the file, then we split each of the fields present in the record, and then we set the values of those fields over here. We also handle some of the exceptions that may occur, like NumberFormatException or ParseException; all those kinds of exceptions are handled over here, and then we close the reader. So in this file we are basically reading the records. Now let me close this and go back. Let's take a quick look at the serializer. This is the custom JSON serializer: in the serializer we have created a custom JSON serializer, and this is basically to write the values as bytes. The data you pass will be written as bytes, because, as we know, data is sent to Kafka in the form of bytes, and this is the reason we created this custom JSON serializer.
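A minimal version of such a serializer, using Jackson and assuming a recent kafka-clients version (where the configure and close methods of the Serializer interface have default implementations), could look like this sketch.

```java
import org.apache.kafka.common.serialization.Serializer;

import com.fasterxml.jackson.databind.ObjectMapper;

// Minimal sketch of a custom JSON serializer: Kafka transports raw bytes, so the
// object is converted to a JSON byte array before it is handed to the producer.
public class JsonSerializerSketch<T> implements Serializer<T> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, T data) {
        try {
            return data == null ? null : mapper.writeValueAsBytes(data);
        } catch (Exception e) {
            throw new RuntimeException("JSON serialization failed for object " + data, e);
        }
    }
}
```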
Now let me quickly close this and go back. The next file is for my Spring Boot web application, so let's not get into that; let's look at EventsRead.java. Over here we have the event producer API, and now we are trying to dispatch those events. To show you how the dispatch function works, let me go back and open services; EventProducerImpl is the implementation, so let me show you how this dispatch works. First we initialize: using the file utility we read the file, getting the path from the KafkaProperties object by calling its getter for the file path. Then we take the product list and try to dispatch it. In dispatch we use the Kafka producer: we create the ProducerRecord object, we get the topic from the KafkaProperties, we get the transaction ID from the transaction, and then we use the event producer's send to send the data. Finally, we try to monitor this, but let's not worry about the monitoring and stats part, so we can ignore it. Next, let's quickly go back and look at the last file, which is the producer. Let me show you this event producer: what we do here is create a logger, and in this onCompletion method we receive the record metadata. If the exception is not null, it throws an error saying so along with the record metadata; otherwise it logs the "sent message to topic partition" message with the record metadata, the topic, and all the details regarding topic, partitions, and offsets. So I hope you have understood how this Kafka producer works. Now it's time to go ahead and quickly execute it. So let me open
a terminal over here. First, to build this project we need to execute mvn clean install; this will install all the dependencies. As you can see, our build is successful, so let me minimize this. This target directory is created after you build a Maven project, so let me quickly go inside it; this is the root.war file, that is, the root web archive file, which we need to execute. Let's quickly go ahead and execute this file. But before that, to verify whether the data is getting produced to our Kafka topic, for testing, as I already told you, we need to open a console consumer so that we can check whether the data is getting published or not. Let me quickly minimize this, go to the Kafka directory, and run ./bin/kafka-console-consumer with --bootstrap-server localhost:9092. Okay, let me check the topic; what's the topic? Let's go to our application.yml file; the topic name is transaction. Let me quickly minimize this, specify the topic name, and hit enter. Now let me place this console aside, and let's quickly go ahead and execute our project. For that, the command is java -jar, and then we provide the path of the file inside the target directory; the file is root.war, and here we go. So now you can see in the console consumer that the records are getting published: there are multiple records which have been published to our transaction topic, and you can verify this using the console consumer. This is exactly where developers use the console consumer. So now we have successfully verified our producer; let me quickly stop the producer, and let me stop the consumer as well. Let's quickly minimize this, and now let's go
to the second project, that is, the Spark streaming Kafka master. Again, we have specified all the dependencies that are required; let me quickly show you those dependencies. You can see over here we have specified the Java version and the artifacts, or you can say the dependencies: we have the Scala compiler, then spark-streaming-kafka, then the Kafka clients, then the JSON data binding, then the Maven compiler plugin; all the required dependencies are specified over here. Let me quickly close it. Let's move to the source directory, then main, and then let's look at the resources. Again, this is the application.yml file: we have port 8080, then we have the bootstrap server over here, then the broker over here, then the topic is transaction, the group is transaction, the partition count is one, and then the file name; we won't be using this file name, though. Let me quickly go ahead
and close this. Let's go back to the Java directory, com/sparkdemo; this is the model. It's the same: these are all the fields that are there in the transaction; you have transaction date, product, price, payment type, name, city, state, country, account created, and so on. Again, we have specified all the getter and setter methods over here, and similarly we have created this TransactionDTO constructor, where we take all the parameters and set the values using the this operator. Next, we again override the toString function, returning details like "transaction date" and then the value of the transaction date, "product" and then the value of the product, and similarly all the fields. So let me close this model and go back. Let's look at the Kafka package;
then we have the serializer. This is the JSON serializer which was there in our producer, and this is the transaction decoder; let's take a look. You have the decoder, which implements Decoder, and we pass this TransactionDTO. Then you can see we have this fromBytes method, which we override, reading the values from the bytes into the TransactionDTO class; if it fails to parse, we log "JSON processing failed for object" with the details, and you can see we have this transaction decoder constructor over here. So let me quickly close this file again and go back. Now let's take a look
at the Spark streaming app, where the data which the producer project produces to Kafka will actually be consumed by the Spark streaming application. Spark streaming will stream the data in real time and then display the data. In this Spark streaming application, we create a conf object and set the application name as KafkaSandbox, with the master as local[*]; then we have the JavaSparkContext, so here we specify the Spark context, and next we specify the JavaStreamingContext. This object will be used to take the streaming data, so we pass the JavaSparkContext over here as a parameter and then specify the batch duration, that is, 2000 milliseconds. Next, we have the Kafka parameters;
to connect to Kafka, you need to specify these parameters. In the Kafka parameters we specify the metadata broker list, that is, localhost:9092, and auto offset reset, which is set to smallest; then in topics, the name of the topic from which we will be consuming messages is transaction. Next, we create a Java pair input DStream. This DStream is a discretized stream, which is the basic abstraction of Spark streaming and is a continuous sequence of RDDs representing a continuous stream of data; a DStream can either be created from live data, from Kafka, HDFS, or Flume, or it can be generated by transforming existing DStreams using operations. So over here we create a JavaInputDStream, passing String and TransactionDTO as type parameters, and create the direct Kafka stream object. Then we use this KafkaUtils and call the method createDirectStream, where we pass the parameters: ssc, that is, your Spark streaming context; then String.class, which is your key class; then TransactionDTO.class, which is your value class; then StringDecoder, to decode your key, and the transaction decoder, to decode your transaction value; then the Kafka parameters you created, where you specified the broker list and auto offset reset; and then the topics, which is your transaction topic. Next, using this direct Kafka stream, you continuously iterate over each RDD and print the new RDD with the RDD partition count and size, then the RDD count, and then each record in the RDD. Then you start the Spark streaming context and wait for the termination. So this is the Spark streaming application.
and execute this application. They've been close this file. Let's go to the source. Now, let me quickly go ahead and
delete this target directory. So now let me quickly open the
terminal MV and clean install. So now as you can see the target
directory is again created and this park streaming Kafka
snapshot jar is created. So we need to execute this jar. So let me quickly go ahead
and minimize it. Let me close this terminal. So now first I'll start
this pop streaming job. So the command is Java -
jar inside the target directory. We have this spark streaming of
college are so let's hit enter. So let me know quickly go ahead
and start producing messages. So I will minimize this and I
will wait for the messages. So let me quickly close
this pot streaming job and then I will show
you the consumed records so you can see the record that is consumed
from spark streaming. So here you have got record
and transaction dto and then transaction date
products all the details, which we are specified. You can see it over here. So this is how spark
streaming works with Kafka now, it's just a basic job again. You can go ahead and you
can take Those transaction you can perform some real-time
analytics over there and then you can go ahead and
write those results so over here we have just given
you a basic demo in which we are producing
the records to Kafka and then using spark streaming. We are streaming those records
from Kafka again. You can go ahead and you can perform
multiple Transformations over the data multiple actions and produce some real-time
results using this data. So this is just a basic demo
where we have shown you how to basically
produce recalls to Kafka and then consume those records
using spark streaming. So let's quickly go
back to our slide. Now as this was a basic project. Let me explain you one of the cough
Now, as this was a basic project, let me explain one of the Kafka-Spark streaming projects at Edureka. So basically there is a company called techreview.com. This techreview.com basically provides reviews for recent and different technologies, like smart watches, phones, different operating systems and anything new that is coming onto the market. The company decided to include a new feature which will basically allow users to compare the popularity or trend of multiple technologies based on Twitter feeds, and second, for the USP, they want this comparison to happen in real time. So they have assigned you this task: you have to go ahead, take the real-time Twitter feeds, and then show the real-time comparison of various technologies. The company is asking you to identify the minutely trend of different technologies by consuming Twitter streams and writing the aggregated minutely counts to Cassandra, from where the dashboarding team will come into the picture; they will dashboard that data and it can show you a graph where you can see how the trends of two, or various, technologies are going. Now the solution strategy is this: you have to continuously stream the data from Twitter, then you will be storing those tweets inside a Kafka topic. Second, you have to perform Spark streaming: you will be continuously streaming the data and applying some transformations which will basically give you the minutely trend of the two technologies, and you'll write that back to a Kafka topic. At last, you'll write a consumer that will be consuming messages from the Kafka topic and that will write the data to your Cassandra database. So first you have to write a program that will be consuming data from Twitter and writing it to a Kafka topic. Then you have to write a Spark streaming job which will be continuously streaming the data from Kafka, perform analytics to identify the minutely trend, and then write the data back to a Kafka topic. Then you have to write the third job, which will basically be a consumer that will consume data from the Kafka topic and write the data to a Cassandra database.
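To make the middle step of that strategy concrete, here is a minimal, hedged sketch of how the minutely aggregation could look with PySpark's DStream API; the topic name "tweets", the keyword list and the 60-second window are illustrative assumptions, and writing back to Kafka or Cassandra is only indicated in a comment.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x/2.x streaming-kafka package

TECHNOLOGIES = ["android", "ios"]  # hypothetical technologies being compared

sc = SparkContext(appName="MinutelyTechTrend")
ssc = StreamingContext(sc, 10)                    # 10-second micro-batches
ssc.checkpoint("/tmp/trend-checkpoint")           # needed for windowed aggregation

tweets = KafkaUtils.createDirectStream(
    ssc, ["tweets"], {"metadata.broker.list": "localhost:9092"}).map(lambda kv: kv[1])

# Count mentions of each technology over a sliding 60-second (minutely) window.
mentions = tweets.flatMap(lambda t: [w for w in TECHNOLOGIES if w in t.lower()]) \
                 .map(lambda tech: (tech, 1)) \
                 .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 60, 60)

# A real job would push each windowed batch back to Kafka or Cassandra via foreachRDD.
mentions.pprint()

ssc.start()
ssc.awaitTermination()
```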
PySpark is a powerful framework which has been heavily used in the industry for real-time analytics and machine learning purposes. Before I proceed with the session, let's have a quick look at the topics which we will be covering today. I'm starting off by explaining what exactly PySpark is and how it works. As we go ahead, we'll find out the various advantages provided by PySpark. Then I will be showing you how to install PySpark on your systems. Once we are done with the installation, I will talk about the fundamental concepts of PySpark like the SparkContext, DataFrames, MLlib, RDDs and much more, and finally I'll close off the session with a demo in which I'll show you how to implement PySpark to solve real-life use cases. So without any further ado, let's quickly embark on our journey to PySpark. Now, before I start off with PySpark, let me first brief you about the PySpark ecosystem. As you can see from the diagram, the Spark ecosystem is composed of various components like Spark SQL, Spark Streaming, MLlib, GraphX and the core API component. The Spark SQL component is used to leverage the power of declarative queries and optimized storage by executing SQL-like queries on Spark data, which is present in RDDs and other external sources. The Spark Streaming component allows developers to perform batch processing and streaming of data with ease in the same application. The machine learning library eases the development and deployment of scalable machine learning pipelines. The GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. And finally the Spark Core component: it is the most vital component of the Spark ecosystem, responsible for basic input-output functions, scheduling and monitoring. The entire Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python and Java, and in today's session I will specifically discuss the Spark API in the Python programming language, which is more popularly known as PySpark.
You might be wondering: why PySpark? Well, to get a better insight, let me give you a brief intro to PySpark. Now, as you already know, PySpark is the collaboration of two powerful technologies: Spark, which is an open-source cluster computing framework built around speed, ease of use and streaming analytics, and the other one is Python, which is a general-purpose, high-level programming language. Python provides a wide range of libraries and is majorly used for machine learning and real-time analytics. Combining the two gives us PySpark, which is a Python API for Spark that lets you harness the simplicity of Python and the power of Apache Spark in order to tame big data. PySpark also lets you use RDDs and comes with a default integration of the Py4J library. We'll learn about RDDs later in this video. Now that you know what PySpark is, let's see the advantages of using Spark with Python. As we all know, Python itself is very simple and easy, so using Spark from Python makes PySpark quite easy to learn and use. Moreover, it's a dynamically typed language, which means RDDs can hold objects of multiple data types. It also makes the API simple and comprehensive, and talking about readability of code, maintenance and familiarity, the Python API for Apache Spark is far better than other programming languages. Python also provides various options for visualization, which is not possible using Scala or Java; moreover, you can conveniently call R directly from Python. On top of this, Python comes with a wide range of libraries like NumPy, Pandas, Scikit-learn, Seaborn and Matplotlib, and these libraries help in data analysis and also provide mature, time-tested statistics. With all these features you can effortlessly program in PySpark, and in case you get stuck somewhere or have a doubt, there is a huge PySpark community out there whom you can reach out to with your query, and that community is very active.
So I will make good use of this opportunity to show you how to install PySpark on a system. Now, here I'm using a Red Hat Linux based CentOS system; the same steps can be applied on other Linux systems as well. In order to install PySpark, first make sure that you have Hadoop installed on your system. If you want to know more about how to install Hadoop, please check out our Hadoop playlist on YouTube, or you can check out the blog on Edureka's website. First of all, you need to go to the Apache Spark official website, which is spark.apache.org, and in the download section you can download the latest version of the Spark release, which supports the latest version of Hadoop, or Hadoop version 2.7 or above. Once you have downloaded it, all you need to do is extract it, or rather say untar the file contents, and after that you need to put the path where Spark is installed into the bashrc file. Now, you also need to install pip and Jupyter Notebook using the pip command, and make sure that the version of Python is 2.7 or above. So as you can see here, this is what our bashrc file looks like: here you can see that we have put in the paths for Hadoop and Spark, as well as PYSPARK_DRIVER_PYTHON, which is the Jupyter Notebook. What it will do is that the moment you run the PySpark shell, it will automatically open a Jupyter notebook for you. Now, I find the Jupyter Notebook much easier to work with than the shell; it's purely a personal choice. Now that we are done with the installation part, let's dive deeper into PySpark and learn a few of its fundamentals, which you need to know in order to work with PySpark. This timeline shows the various topics which we will be covering under the PySpark fundamentals. So let's start off with the very first topic on our list, that is the SparkContext.
The SparkContext is the heart of any Spark application. It sets up internal services and establishes a connection to a Spark execution environment. Through a SparkContext object you can create RDDs, accumulators and broadcast variables, access Spark services, run jobs and much more. The SparkContext allows the Spark driver application to access the cluster through a resource manager, which can be YARN or Spark's own cluster manager. The driver program then runs the operations inside the executors on the worker nodes, and the SparkContext uses Py4J to launch a JVM, which in turn creates a JavaSparkContext. Now, there are various parameters which can be used with a SparkContext object, like the master, the app name, the Spark home, the py files, the environment, the batch size, the serializer, the configuration, the gateway and more. Among these parameters, the master and the app name are the most commonly used.
Now, to give you a basic insight into how a Spark program works, I have listed down its basic lifecycle phases. The typical life cycle of a Spark program includes creating RDDs from external data sources or by parallelizing a collection in your driver program; then lazily transforming the base RDDs into new RDDs using transformations; then caching a few of those RDDs for future reuse; and finally performing actions to execute the parallel computation and produce the results.
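As a minimal sketch of that lifecycle (the master URL, app name and numbers here are illustrative assumptions, not taken from the session):

```python
from pyspark import SparkContext

# The master and appName parameters mentioned above.
sc = SparkContext(master="local[2]", appName="LifecycleDemo")

numbers = sc.parallelize(range(1, 11))        # create an RDD from a collection
squares = numbers.map(lambda n: n * n)        # lazy transformation
squares.cache()                               # cache for future reuse
print(squares.reduce(lambda a, b: a + b))     # action triggers the computation: 385

sc.stop()
```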
The next topic on our list is the RDD, and I'm sure people who have already worked with Spark are familiar with this term, but for people who are new to it, let me just explain it. RDD stands for Resilient Distributed Dataset. It is considered to be the building block of any Spark application. The reason behind this is that these elements run and operate on multiple nodes to do parallel processing on a cluster, and once you create an RDD it becomes immutable — by immutable I mean it is an object whose state cannot be modified after it is created, but we can transform its values by applying certain transformations. RDDs have good fault-tolerance ability and can automatically recover from almost any failure, which is an added advantage. Now, to achieve a certain task, multiple operations can be applied on these RDDs, and they are categorized in two ways: the first is transformations and the second is actions. Transformations are the operations which are applied on an RDD to create a new RDD. These transformations work on the principle of lazy evaluation; transformations are lazy in nature, meaning that when we call some operation on an RDD it does not execute immediately. Spark maintains the record of the operations being called with the help of directed acyclic graphs, also known as DAGs, and since the transformations are lazy in nature, the data is not loaded until it is necessary; the moment we call an action on the data, all the computations are performed in parallel to give you the desired output. A few important transformations are map, flatMap, filter, distinct, reduceByKey, mapPartitions and sortBy. Actions are the operations which are applied on an RDD to instruct Apache Spark to apply the computation and pass the result back to the driver. A few of these actions include collect, collectAsMap, reduce, take and first. Now, let me implement a few of these for your better understanding.
So first of all, let me show you the bashrc file which I was talking about. Here you can see that in the bashrc file we provide the paths for all the frameworks which we have installed on the system. For example, you can see we have installed Hadoop; the moment we install and unzip, or rather say untar it, I have shifted all my frameworks to one particular location. As you can see, it is under usr, inside that we have lib, and inside that I have installed Hadoop and also Spark. Now, as you can see here, we have two lines; I'll highlight this one for you: PYSPARK_DRIVER_PYTHON, which is Jupyter, and we have given the option as notebook. What it will do is that the moment I start PySpark, it will automatically redirect me to the Jupyter Notebook. So let me just rename this notebook as rdd tutorial. Let's get started. Here, to load any file into an RDD — suppose I'm loading a text file — you need to use the sc, that is the Spark context: sc.textFile, and you need to provide the path of the data which you are going to load. One thing to keep in mind is that the default path which the RDD, or the Jupyter Notebook, takes is the HDFS path; so in order to use the local file system, you need to mention file: followed by double forward slashes. Once our sample data is inside the RDD, to have a look at it we need to invoke an action. So let's go ahead and take a look at the first five objects, or rather say the first five elements, of this particular RDD. The sample data I have taken here is about blockchain; as you can see, we have one, two, three, four and five elements here. Suppose I need to convert all the data into lowercase and split it word by word. For that I will create a function, and to that function I'll pass this RDD. As you can see here, I'm creating rdd1, that is a new RDD, using the map function, or rather say the transformation, and passing the function which I just created to lowercase and split the data. If we have a look at the output of rdd1, as you can see, all the words are in lowercase and all of them are separated with the help of a space. Now there is another transformation, known as flatMap, to give you a flattened output, and I am passing the same function which I created earlier. Let's go ahead and have a look at the output for this one. As you can see, we got the first five elements, which are the same ones as we got here — the contracts, transactions, and the records.
So just one thing to keep in mind is that flatMap is a transformation, whereas take is an action. Now, as you can see, the contents of the sample data contain stop words, so if I want to remove all the stop words, all I need to do is create a list of stop words, in which I have mentioned, as you can see, "a", "all", "the", "as", "is" and "and". These are not all the stop words; I've chosen only a few of them just to show you what the output will be. Now we are using the filter transformation here with the help of a lambda function, which we have specified as x not in stop_words, and we have created another RDD, rdd3, which will take its input from rdd2. So let's go ahead and see whether "and" and "the" are removed or not. As you can see: contracts, transactions, records. If you look at output five, we had contracts, transactions, "and" and "the"; now "and" and "the" are not in this list. Suppose I want to group the data according to the first three characters of any element. For that I'll use groupBy, and I'll use the lambda function again. Let's have a look at the output: you can see we have "edg" and "edges", so the first three letters of both words are the same. Similarly, we can do it using the first two letters; let me just change it to two: you can see we have "gu" with "guid" and "guide". Now, these are the basic transformations and actions. But suppose I want to find out the sum of the first thousand numbers, or rather say the first 10,000 numbers. All I need to do is initialize another RDD, which is num_rdd, use sc.parallelize with the range 1 to 10,000, and then use the reduce action to see the output. You can see here we have the sum of the numbers ranging from one to ten thousand. Now, this was all about RDDs.
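Here is a compact, hedged sketch of the transformations and actions just demonstrated; the file path, stop-word list and variable names are assumptions standing in for the notebook's actual values.

```python
from pyspark import SparkContext

sc = SparkContext(master="local[2]", appName="RDDTutorial")

# Load a local text file (the "file://" prefix avoids the default HDFS path).
sample = sc.textFile("file:///home/edureka/sample_blockchain.txt")

rdd1 = sample.map(lambda line: line.lower().split())     # list of words per line
rdd2 = sample.flatMap(lambda line: line.lower().split()) # one flat stream of words
print(rdd2.take(5))

stop_words = ["a", "all", "the", "as", "is", "and"]       # a small illustrative list
rdd3 = rdd2.filter(lambda word: word not in stop_words)

grouped = rdd3.groupBy(lambda word: word[:3])             # group by first 3 letters
print([(key, list(values)) for key, values in grouped.take(2)])

num_rdd = sc.parallelize(range(1, 10001))
print(num_rdd.reduce(lambda a, b: a + b))                 # sum of 1..10000 = 50005000
```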
The next topic that we have on our list is broadcast and accumulators. In Spark we perform parallel processing with the help of shared variables. When the driver sends any task to the executors present on the cluster, a copy of the shared variable is also sent to each node of the cluster, thus maintaining high availability and fault tolerance. This is done in order to accomplish the task, and Apache Spark supports two types of shared variables: one of them is the broadcast variable and the other one is the accumulator. Broadcast variables are used to save a copy of the data on all the nodes in a cluster, whereas the accumulator is a variable that is used for aggregating incoming information via different associative and commutative operations.
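A minimal sketch of both shared-variable types, assuming an existing SparkContext named sc; the lookup table and counter here are purely illustrative:

```python
# Broadcast: ship a read-only copy of a lookup table to every node once.
country_codes = sc.broadcast({"IN": "India", "US": "United States"})
names = sc.parallelize(["IN", "US", "IN"]).map(lambda c: country_codes.value[c])
print(names.collect())

# Accumulator: aggregate values from the workers back to the driver.
bad_records = sc.accumulator(0)

def check(line):
    if not line.strip():
        bad_records.add(1)  # commutative, associative update
    return line

sc.parallelize(["ok", "", "ok", ""]).foreach(check)
print(bad_records.value)  # 2
```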
Now, moving on to our next topic, which is the Spark configuration: the SparkConf class provides a set of configurations and parameters that are needed to execute a Spark application on the local system or on any cluster. When you use a SparkConf object to set the values of these parameters, they automatically take priority over the system properties. This class contains various getter and setter methods, some of which are: the set method, which is used to set a configuration property; setMaster, which is used for setting the master URL; setAppName, which is used to set an application name; the get method, to retrieve the configuration value of a key; and finally setSparkHome, which is used for setting the Spark installation path on worker nodes.
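A short sketch of those setters and getters, assuming local mode; the concrete values are placeholders:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")                 # setMaster: master URL
        .setAppName("ConfDemo")                # setAppName: application name
        .setSparkHome("/usr/lib/spark")        # setSparkHome: install path on workers
        .set("spark.executor.memory", "1g"))   # set: any configuration property

print(conf.get("spark.app.name"))              # get: read a value back

sc = SparkContext(conf=conf)
```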
Now coming to the next topic on our list, which is SparkFiles. The SparkFiles class contains only class methods, so the user cannot create a SparkFiles instance. It helps in resolving the paths of the files that are added using the SparkContext.addFile method. The SparkFiles class contains two class methods, which are the get method and the getRootDirectory method. get is used to retrieve the absolute path of a file added through SparkContext.addFile, and getRootDirectory is used to retrieve the root directory that contains the files that are added through SparkContext.addFile.
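A quick sketch of those two class methods, assuming an existing SparkContext sc and a hypothetical local file lookup.csv:

```python
from pyspark import SparkFiles

sc.addFile("/home/edureka/lookup.csv")         # ship the file to every node

print(SparkFiles.get("lookup.csv"))            # absolute path of the added file
print(SparkFiles.getRootDirectory())           # directory holding all added files
```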
Now, these are smaller topics, and the next topic that we will be covering on our list is DataFrames. DataFrames in Apache Spark are a distributed collection of rows under named columns, which is similar to relational database tables or Excel sheets. They also share some common attributes with RDDs. A few characteristics of DataFrames: they are immutable in nature, that is, the same as with RDDs, you can create a DataFrame but you cannot change it; they allow lazy evaluation, that is, the task is not executed unless and until an action is triggered; and moreover, DataFrames are distributed in nature and are designed for processing large collections of structured or semi-structured data. They can be created using different data formats, like loading the data from source files such as JSON or CSV, or you can load it from an existing RDD; you can use databases like Hive or Cassandra; you can use Parquet files; you can use CSV or XML files. There are many sources through which you can create a particular DataFrame. Now, let me show you how to create a DataFrame in PySpark and perform various actions and transformations on it. Let's continue this in the same notebook which we have here. Here we have taken the NYC flights data, and I'm creating a data frame called nyc_flights_df. To load the data we are using the spark.read.csv method, and you need to provide the path, which is the local path; by default it takes HDFS, same as with our RDD. One thing to note down here is that I've provided two extra parameters, which are inferSchema and header. If we do not provide these as true, or we skip them, what will happen is that if your data set has the names of the columns in the first row, it will take those as data as well, and it will not infer the schema.
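A minimal sketch of that load, assuming a Spark 2.x SparkSession named spark and a hypothetical local CSV path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NYCFlights").getOrCreate()

# header=True keeps the first row as column names; inferSchema=True types the columns.
nyc_flights_df = spark.read.csv("file:///home/edureka/nyc_flights.csv",
                                inferSchema=True, header=True)
```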
Once we have loaded the data into our data frame, we need to use the show action to have a look at the output. As you can see here, we have the output; it gives us the top 20 rows of the particular data set. We have the year, month, day, departure time, departure delay, arrival time, arrival delay and many more attributes. To print the schema of the particular data frame you need the transformation, or rather say the action, printSchema. So let's have a look at the schema: as you can see, we have year, which is integer, month, integer — almost half of them are integers — and we have the carrier as string, the tail number as string, the origin as string, the destination as string, and so on. Now suppose I want to know how many records there are in my database, or rather the data frame: you need the count function for this, and it will give us the result. As you can see here, we have around 3.3 million records — three million thirty-six thousand seven hundred seventy-six, to be exact. Now suppose I want to have a look at the flight name, the origin and the destination, just these three columns from the particular data frame: we need to use the select option. As you can see here, we get the top 20 rows. Now, what we saw was the select query on this particular data frame, but if I want to see, or rather check, the summary of any particular column — suppose I want to check what the lowest or the highest value is in the particular distance column — I need to use the describe function. I'll show you what the summary looks like: for the distance column, the count is the total number of rows; we have the mean, the standard deviation, the minimum value, which is 17, and the maximum value, which is 4983. This gives you a summary of the particular column.
Now that we know that the minimum distance is 17, let's go ahead and filter our data using the filter function, where the distance is 17. You can see here we have one record, in which, in the year 2013, the minimum distance is 17. Similarly, suppose I want to have a look at the flights which are originating from EWR: we use the filter function here as well. Now, another clause, the where clause, is also used for filtering. Suppose I want to have a look at the flight data and filter it to see if the day on which the flight took off was the second of any month: here, instead of filter, we can also use a where clause, which will give us the same output. We can also pass multiple parameters, or rather say multiple conditions: suppose I want the day of the flight to be the seventh, the origin to be JFK, and the arrival delay to be less than 0 — that is, none of the delayed flights. So just to have a look at these numbers, we'll use the where clause and combine all the conditions using the & symbol. You can see here that in all the output the day is 7, the origin is JFK and the arrival delay is less than 0.
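A hedged sketch of those select, filter and where calls; the column names mirror what the schema above suggests and should be treated as assumptions:

```python
nyc_flights_df.select("carrier", "origin", "dest").show()

nyc_flights_df.filter(nyc_flights_df.distance == 17).show()
nyc_flights_df.filter(nyc_flights_df.origin == "EWR").show()

# where() is an alias of filter(); combine conditions with & and parentheses.
nyc_flights_df.where((nyc_flights_df.day == 7) &
                     (nyc_flights_df.origin == "JFK") &
                     (nyc_flights_df.arr_delay < 0)).show()
```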
Now, these were the basic transformations and actions on the particular data frame. One thing we can also do is create a temporary table for SQL queries: if someone is not comfortable with all these transformations and actions and would rather use SQL queries on the data, they can use registerTempTable to create a table for their particular data frame. What we'll do is convert the nyc_flights_df data frame into an nyc_flights table, which can be used later, and SQL queries can be performed on this particular table. You remember that in the beginning we used nyc_flights_df.show; now we can use select * from nyc_flights to get the same output. Now suppose we want to look at the minimum air time of any flight: we use select the minimum air time from nyc_flights — that is the SQL query, and we pass the SQL query to the sqlContext.sql function. You can see here we have the minimum air time as 20. Now, to have a look at the records in which the air time is the minimum, 20, we can also use nested SQL queries: suppose I want to check which flights have the minimum air time of 20 — that cannot be done in a simple SQL query; we need a nested query for that. So: select * from nyc_flights where the air time is in, and inside that we have another query, which is select the minimum air time from nyc_flights. Let's see if this works or not. As you can see here, we have two flights which have the minimum air time of 20.
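A sketch of the temp-table route, again assuming the column name air_time and an available sqlContext (on Spark 2.x you could equally use createOrReplaceTempView with spark.sql):

```python
nyc_flights_df.registerTempTable("nyc_flights")

sqlContext.sql("select * from nyc_flights").show()
sqlContext.sql("select min(air_time) from nyc_flights").show()

# Nested query: all rows whose air_time equals the minimum air_time.
sqlContext.sql("""select * from nyc_flights
                  where air_time in (select min(air_time) from nyc_flights)""").show()
```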
So guys, this is it for data frames. Let's get back to our presentation and have a look at the list which we were following. We completed DataFrames; next we have storage levels. StorageLevel in PySpark is a class which helps in deciding how the RDDs should be stored: based on this, RDDs are either stored on disk or in memory, or in both. The StorageLevel class also decides whether the RDDs should be serialized or their partitions replicated.
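For instance, a hedged one-liner showing how a storage level is picked when persisting an RDD (assuming an existing RDD named rdd3 from the earlier example):

```python
from pyspark import StorageLevel

# Keep the data in memory, spill to disk if it does not fit, without replication.
rdd3.persist(StorageLevel.MEMORY_AND_DISK)
```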
The final and last topic for today's list is MLlib. MLlib is the machine learning API which is provided by Spark, and it is also available in Python; this library is heavily used in Python for machine learning as well as real-time streaming analytics. The various algorithm families supported by this library are as follows. First of all, we have spark.mllib: PySpark MLlib supports model-based collaborative filtering, in which users and products are described by a small set of latent factors, which we can use to predict the missing entries; to learn these latent factors, spark.mllib uses the alternating least squares (ALS) algorithm. Next we have MLlib clustering: clustering is an unsupervised learning problem where we try to group subsets of entities with one another on the basis of some notion of similarity. Next we have frequent pattern mining (FPM): frequent pattern mining is mining frequent items, itemsets, subsequences or other substructures, and it is usually among the first steps in analyzing a large-scale data set; it has been an active research topic in data mining for years. We have linear algebra: MLlib provides utilities for linear algebra. We have collaborative filtering. We have classification: for binary classification, various methods are available in the spark.mllib package, as well as multiclass classification and regression analysis; in classification, some of the most popular algorithms used are Naive Bayes, random forest, decision tree and so on. And finally we have linear regression: linear regression comes from the family of regression algorithms, and finding relationships and dependencies between variables is the main goal of regression. The PySpark MLlib package also covers other algorithm classes and functions.
Let's now try to implement all the concepts which we have learned in this PySpark tutorial session. Here we are going to use a heart disease prediction model, and we are going to predict using a decision tree, with the help of classification as well as regression; these are all part of the MLlib library. Let's see how we can perform these types of functions and queries. First of all, what we need to do is initialize the Spark context. Next, we are going to read the UCI data set for heart disease prediction and clean the data. So let's import the pandas and numpy libraries here. Let's create a data frame as heart_disease_df, and as mentioned earlier we are going to use the read CSV method here; here we don't have a header, so we have provided header as none. The original data set contains 303 rows and 14 columns. The categories of diagnosis of heart disease that we are predicting are: the value 0 for less than 50% narrowing, and the value 1 for more than 50% diameter narrowing. Here we are using the numpy library; these are somewhat old methods, which is why it's showing a deprecation warning, but no issues, it will work fine. As you can see here, we have the categories of diagnosis of heart disease that we are predicting: the value 0 is for less than 50% and the value 1 is for greater than 50%. What we did here was clear the rows which have the question mark or which have empty values. Now, to have a look at the data set: you can see we have 0 in many places instead of the question marks which we had earlier, and now we are saving it to a txt file. You can see here that after dropping the rows with any empty values, we have 297 rows and 14 columns, and this is what the new cleaned data set looks like.
Now we are importing the MLlib library and the regression module. What we are going to do is create a labeled point, which is a local vector associated with a label or a response; for that we need to import mllib.regression. We are taking the text file which we just created, now without the missing values. Next, we parse the data line by line into MLlib LabeledPoint objects, and we convert the -1 labels to 0. Let's have a look after parsing the lines: okay, we have the labels 0 and 1; that's good. Next, we perform classification using a decision tree; for that we need to import pyspark.mllib.tree. Next, we split the data into training and testing data, and we split it here into the standard 70:30 ratio, 70% being the training data set and 30% being the testing data set. Next, we train the model using the training set: we have created a training model with the decision tree's trainClassifier, using the training data, and we have specified the number of classes, the categorical features, and the maximum depth to which we are classifying, which is 3. Next, we evaluate the model on the test data set and compute the error. Here we are creating predictions, using the test data to get the predictions through the model which we trained, and we are also going to find the test error. As you can see here, the test error is 0.2297, and we have created a classification decision tree model in which you can see the feature splits and their threshold values. So as you can see, our model is pretty good.
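A condensed, hedged sketch of that classification flow; the parsing helper, file path and parameter values are assumptions that mirror the description rather than the exact notebook code:

```python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

def parse_line(line):
    values = [float(x) for x in line.split(",")]
    label = 0.0 if values[-1] <= 0 else 1.0        # fold -1/0 into 0, positives into 1
    return LabeledPoint(label, values[:-1])

data = sc.textFile("file:///home/edureka/heart_disease_clean.txt").map(parse_line)
training_data, test_data = data.randomSplit([0.7, 0.3])

model = DecisionTree.trainClassifier(training_data, numClasses=2,
                                     categoricalFeaturesInfo={}, maxDepth=3)

predictions = model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)
test_err = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test_data.count())
print("Test error:", test_err)
print(model.toDebugString())
```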
So now, next, we'll use regression for the same purpose. Let's perform the regression using a decision tree. As you can see, we have the trained model, and we are using the decision tree's trainRegressor with the training data, the same data which we used for the classification decision tree model up there; there we used classification, now we are using regression. Similarly, we evaluate our model using our test data set and find the test error, which is the mean squared error here for regression. Let's have a look at the mean squared error: it is 0.168, which is good. Finally, if we have a look at the learned regression tree model, you can see we have created a regression tree model to a depth of 3 with 15 nodes, and here we have all the features and the splits of the tree.
Hello folks, welcome to Spark interview questions. This session has been planned to cover the commonly asked interview questions related to the Spark technology and their general answers; the expectation is that you are already aware of this particular technology to some extent. In general, I will cover the commonly asked questions as well as give an introduction to the technology, so let's get started. The agenda for this particular session: we are going to cover the basic questions and questions related to the Spark core technologies. When I say Spark core, that's going to be the base, and on top of Spark core we have four important components, that is Streaming, GraphX, MLlib and SQL. All these components have been created to satisfy specific requirements; again, we will get an introduction to these technologies and get into the commonly asked interview questions, and the questions are also framed in such a way that they cover the spectrum of doubts as well as the features available within that specific technology. So let's take the first question and look into the answer, and how commonly this is covered.
What is Apache Spark? Spark is with the Apache Foundation now; it's open source. It's a cluster computing framework for real-time processing. So there are three main keywords here: Apache Spark is an open-source project, it's used for cluster computing, and it does in-memory processing along with real-time processing. It's going to support in-memory computing; there are lots of projects which support cluster computing, and along with that, Spark differentiates itself by doing in-memory computing. It has a very active community, and out of the Hadoop ecosystem technologies, Apache Spark is very active — we got multiple releases last year; it's a very active project among the Apache projects. Basically, it's a framework that supports in-memory computing and cluster computing.
You may face this specific question: how is Spark different from MapReduce, or how can you compare it with MapReduce? MapReduce is the processing paradigm within the Hadoop ecosystem; within the Hadoop ecosystem we have HDFS, the Hadoop distributed file system, and MapReduce supports distributed computing. So how is Spark different, and how can we compare Spark with MapReduce? In a way, this comparison is going to help us understand the technology better, though strictly speaking we cannot compare them directly; they are two different methodologies. Spark is very simple to program, whereas in MapReduce there is no abstraction, in the sense that we have to provide all the implementations ourselves. Interactivity: Spark has an interactive mode to work with; MapReduce has no interactive mode, although there are some components like Apache Pig and Hive which facilitate interactive computing or interactive programming. Spark supports real-time stream processing; to be precise, within Spark the stream processing is called near real-time processing — nothing in the world is truly real-time processing — it does the processing in micro-batches. I'll cover that in detail when we move on to the streaming concept. In MapReduce you do batch processing on historical data. When I say stream processing, I get the data that is being generated in real time, do the processing, get the result, and either store it or publish it. Latency-wise, MapReduce will have very high latency because it has to read the data from the hard disk, but Spark will have very low latency because it can reprocess or reuse the data already cached in memory. There is a small catch here, though: in Spark, the first time the data gets loaded it has to be read from the hard disk, same as MapReduce. Once it is read, it stays in memory, so Spark is good whenever we need to do iterative computing: whenever you do iterative computing, doing the processing again and again on the same data — especially in machine learning and deep learning, where we use iterative computing — Spark performs much better; you will see a raw performance improvement of up to a hundred times over MapReduce. But if it is one-time, fire-and-forget processing, Spark's latency may be about the same as you would get with MapReduce, maybe with some improvement because of the building block of Spark, the RDD, which may give some additional advantage. So those are the key comparison factors between Spark and MapReduce.
Now, let's get on to the key features of Spark. We discussed speed and performance: it's going to use in-memory computing, so speed- and performance-wise it's going to be much better when we do iterative computing. It's polyglot, in the sense that the programming language to be used with Spark can be any of these: Python, Java, R or Scala; we can do programming with any of these languages. As for data formats, we can give any data format as input, like JSON or Parquet. The key selling point of Spark is its lazy evaluation: it's going to calculate the DAG, the directed acyclic graph, and using that DAG it calculates what steps need to be executed to achieve the final result. We give all the steps as well as the final result we want; it calculates the optimal execution plan and only those steps that need to be executed will be executed. So basically it's lazy execution: only if the results need to be produced will the processing happen. It also supports real-time computing, through a component called Spark Streaming. And it gels with the Hadoop ecosystem very well: it can run on top of Hadoop YARN, or it can leverage HDFS to do the processing; when it leverages HDFS, the Hadoop cluster containers can be used to do the distributed computing, and it can leverage the resource manager to manage and share the resources. So Spark can gel with HDFS very well, and it can achieve data locality: it can do the processing wherever the data is located within HDFS. It has a fleet of machine learning algorithms already implemented, right from clustering to classification to regression — all this logic is already implemented; machine learning is achieved using MLlib within Spark. And there is a component called GraphX which supports graph computing: we can solve problems using graph theory with the GraphX component within Spark. So these are the things we can consider as the key features of Spark.
So when you discuss the installation of Spark, you may come across this question: what is YARN, and do you need to install Spark on all nodes of the YARN cluster? YARN is nothing but Yet Another Resource Negotiator; it's the resource manager within the Hadoop ecosystem. It provides the resource management platform across all the clusters, and Spark provides the data processing. So wherever resources are allocated, those resources will be used to do the data processing. And yes, of course, we need to have Spark installed on all the nodes where Spark executors are located; basically we need those libraries, in addition to the installation of Spark, on all the worker nodes, and we need to increase the RAM capacity on the worker machines as well, because Spark is going to consume huge amounts of memory to do the processing. It will not work the MapReduce way internally; it's going to generate the DAG and do the processing on top of YARN. So YARN, at a high level, is like a resource manager, or like an operating system for distributed computing: it coordinates all the resource management across the fleet of servers, and on top of it I can have multiple components like Spark, Tez or Giraph; Spark especially is going to help us achieve in-memory computing. So, Spark on YARN: YARN is nothing but a resource manager to manage the resources across the cluster, and on top of it we can have Spark; and yes, we need to have Spark installed on all the nodes where the Spark-YARN cluster is used, and in addition to that we need to have the memory increased on all the worker nodes.
The next question goes like this: what file systems does Spark support? What is a file system? When we work on an individual system, we have a file system to work with within that particular operating system; but in a distributed cluster, or a distributed architecture, we need a file system where we can store the data in a distributed mechanism. Hadoop comes with the file system called HDFS, the Hadoop distributed file system, where data gets distributed across multiple systems and is coordinated by two different types of components called the name node and the data node. Spark can use HDFS directly, so you can have any files in HDFS and start using them within the Spark ecosystem, and it gives the additional advantage of data locality: when it does the distributed processing, wherever the data is distributed, the processing can be done locally on the particular machine where the data is located. To start with, in standalone mode you can use the local file system as well; this is used especially when we are doing development or any POC. Amazon's cloud provides another file system called S3, the Simple Storage Service; it's a block storage service, and this can also be leveraged within Spark for storage. Spark supports lots of other file systems too: there are some, like Alluxio, which provide in-memory storage, so we can leverage that particular file system as well. So we have seen all the features and the functionalities available within Spark.
Now we're going to look at the limitations of using Spark. Of course, every component that comes with huge power and advantages will have its own limitations as well, and this question illustrates some limitations of using Spark. Spark utilizes more storage space compared to Hadoop: when it comes to installation, it's going to consume more space, but in the big data world that's not a very big constraint because storage cost is not very high in the big data space. Also, developers need to be careful while running their applications in Spark, the reason being that it uses in-memory computing. Of course it handles memory very well, but if you try to load a huge amount of data in the distributed environment and you try to do a join, the data is going to get transferred over the network, and the network is a really costly resource — so the plan or design should be such that it reduces or minimizes the data transferred over the network, and wherever possible we should facilitate distribution of data over multiple machines: the more we distribute, the more parallelism we can achieve and the more results we can get. Then there is cost efficiency: if you compare how much it costs to do a particular amount of processing — take any unit, say processing 1 GB of data with iterative processing — cost-wise, in-memory computing is always considered costly, because memory is relatively costlier than storage, so that may act as a bottleneck; and we cannot increase the memory capacity of a machine beyond some limit, so we have to grow horizontally. When we have the data distributed in memory across the cluster, of course the network transfer and all those bottlenecks come into the picture, so we have to strike the right balance that lets us achieve in-memory computing: whatever in-memory computing capacity we have will help us do the processing, but it consumes a huge amount of memory compared to Hadoop. Spark performs better than the alternatives when it is used for iterative computing, because for both Spark and the other technologies the data has to be read for the first time from the hard disk or from another data source; Spark's performance is really better when it reads and processes data that is already available in the cache. Of course the DAG gives us a lot of advantage while doing the processing, but the in-memory computing is what gives us the most leverage.
The next question: list some use cases where Spark outperforms Hadoop in processing. The first thing is real-time processing: Hadoop cannot handle real-time processing, but Spark can. Most big data projects follow the Lambda architecture, where you have three layers: a speed layer, a batch layer and a serving layer. In the speed layer, whenever data comes in, it needs to be processed, stored and handled; for those types of real-time processing, Spark is the best fit. Of course, in the Hadoop ecosystem we have other components which do real-time processing, like Storm, but when you want to leverage machine learning along with Spark Streaming for such computation, Spark will be much better. That's why, when you have an architecture like the Lambda architecture and you want all three layers — batch layer, speed layer and serving layer — Spark can gel with the speed layer and serving layer far better and provide better performance. And whenever you do iterative processing, especially machine learning processing, we leverage iterative computing, and Spark can perform up to a hundred times faster than Hadoop: the more iterative processing we do, the more data will be read from memory, and we get much faster performance than we would with MapReduce. Again, remember: if you do the processing only once — read, process and deliver the result — Spark may not be the best fit; that can be done with MapReduce itself. And there is another component called Akka: it's a messaging system, or message coordination system. Spark internally uses Akka for scheduling any task that needs to be assigned by the master to the worker, and for the follow-up of that particular task by the master — basically an asynchronous coordination system, and that's achieved using Akka. Akka programming is used internally by Spark; as such, developers don't need to worry about Akka programming — of course we can leverage it — but Akka is used internally by Spark for scheduling and coordination between the master and the worker.
And within Spark we have a few major components. Let's see what the major components of the Spark ecosystem are. Spark comes with a core engine, which has the core functionalities required by Spark; RDDs are the building blocks of the Spark core engine. The basic functionalities — file interaction, file system coordination and so on — are all done by the Spark core engine, and on top of the Spark core engine we have a number of other offerings: to do machine learning, to do graph computing, to do streaming, we have a number of other components. The majorly used components are Spark SQL, Spark Streaming, MLlib, GraphX and SparkR; at a high level, let's see what these components are. Spark SQL is especially designed to do processing against structured data, so we can write SQL queries and handle the processing; it gives us the interface to interact with the data, especially structured data, and the language we use is very similar to what we use within SQL — I can say 99 percent is the same, and most of the commonly used functionalities within SQL have been implemented within Spark SQL. Spark Streaming supports stream processing; that's the offering available to handle stream processing. MLlib is the offering to handle machine learning: the component is called MLlib, and it has a list of machine learning algorithms already defined; we can leverage and use any of those algorithms. GraphX, again, is a graph processing offering within Spark: it supports graph computing against the data that we have, like PageRank calculation, how many connected entities, how many triangles — all of those give meaning to that particular data. And SparkR is the component that helps us leverage the language R within the Spark environment: R is a statistical programming language where we can do statistical computing, and within the Spark environment we can leverage the R language by using SparkR to get it executed within Spark. In addition to that, there are other components as well, like the approximate query engine called BlinkDB. So these are the majorly used components within Spark.
So, next question: how can Spark be used alongside Hadoop? Even though Spark's performance is much better, it's not a replacement for Hadoop; it's going to coexist with Hadoop. Leveraging Spark and Hadoop together helps us achieve the best result: yes, Spark can do in-memory computing and handle the speed layer, and Hadoop comes with the resource manager, so we can leverage the resource manager of Hadoop to make Spark work. For some processing we don't need to leverage in-memory computing — for example, one-time processing, where we do the processing, forget about it and just store the result — we can use MapReduce, so the processing and computing cost will be much less compared to Spark. So we can amalgamate the two and strike the right balance between batch processing and stream processing when we have Spark along with Hadoop.
Let's have some detailed questions related to Spark core. Within Spark core, as I mentioned earlier, the core building block is the RDD, the resilient distributed dataset. It's not a physical entity; it's a logical entity — you will not see this RDD existing. The existence of the RDD comes into the picture when you take some action. The RDD will be used, or referred to, to create the DAG, and RDDs will be optimized as they transform from one form to another, to make a plan of how the data set needs to be transformed from one structure to another. Finally, when you take some action against an RDD, the existence of the data structure — the resultant data — comes into the picture, and that can be stored in any file system, whether it's HDFS, S3 or any other file system. The RDD can exist in a partitioned form, in the sense that it can get distributed across multiple systems, and it's fault tolerant: if any partition of the RDD is lost, it can regenerate only that specific partition. That's a huge advantage of the RDD: first, it's fault tolerant, in that it can regenerate lost RDDs; it can exist in a distributed fashion; and it is immutable, in the sense that once the RDD is defined, it cannot be changed.
The next question is: how do we create RDDs in Spark? There are two ways to create RDDs. One: using the Spark context, we can take any of the collections available within Scala or Java and, using the parallelize function, create the RDD; it will use the underlying file system's distribution mechanism — if the data is located in a distributed file system like HDFS, it will leverage that and make those RDDs available on a number of systems, following the same distribution for the RDD as well. Or we can create the RDD by loading the data from external sources, like HBase. HDFS we may not consider as an external source; it will be considered as the file system of Hadoop, so when Spark is working with Hadoop, the file system we will mostly be using is HDFS. We can read from HBase, or even from other sources like Parquet files, S3, or formats like Avro; you can read from them and create the RDD.
Next question: what is executor memory in a Spark application? Every Spark application has a fixed heap size and a fixed number of cores for the Spark executor. The executor is nothing but the execution unit available on every machine, and it facilitates doing the processing — doing the tasks — on the worker machine. Irrespective of whether you use the YARN resource manager or another resource manager like Mesos, every worker machine will have an executor, and within the executor the tasks will be handled. The memory to be allocated for that particular executor is what we define as the heap size, and we can define how much memory should be used by that executor within the worker machine, as well as the number of cores that can be used by the executor for this Spark application, and that can be controlled through the configuration files of Spark.
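For illustration, a hedged example of setting those two knobs from code (they can equally be passed to spark-submit); the sizes chosen are arbitrary:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ExecutorSizingDemo")
        .set("spark.executor.memory", "2g")   # heap size per executor
        .set("spark.executor.cores", "2"))    # cores per executor

sc = SparkContext(conf=conf)
```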
Next question: define partitions in Apache Spark. Any data, irrespective of whether it is small or large, can be divided across multiple systems; the process of dividing the data into multiple pieces and storing it across multiple systems as different logical units is called partitioning. In simple terms, partitioning is nothing but the process of dividing the data and storing it on multiple systems, and by default the conversion of the data into an RDD happens on the system where the partition exists. The more partitions, the more parallelism we get; at the same time, we have to be careful not to trigger a huge amount of network data transfer. Every RDD can be partitioned within Spark, and the partitioning helps us achieve parallelism: the more partitions we have, the more parallel executions can be done. The key to the success of a Spark program is minimizing the network traffic while doing the parallel processing and minimizing the data transfer between the systems of Spark.
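A tiny sketch of controlling and inspecting partitions, with an arbitrary partition count chosen for illustration:

```python
rdd = sc.parallelize(range(1, 10001), 8)   # explicitly ask for 8 partitions
print(rdd.getNumPartitions())              # 8

rdd16 = rdd.repartition(16)                # reshuffle into more partitions
print(rdd16.getNumPartitions())            # 16
```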
What operations does an RDD support? We can perform multiple operations against an RDD, and they can be grouped into two types. One is transformations: in transformations, the RDD gets transformed from one form to another — filtering, grouping and so on; small examples are reduceByKey and filter, all of which are transformations, and the resultant of a transformation is another RDD. At the same time, we can take actions against the RDD, which give us the final result: I can count how many records there are, or store the result into HDFS; those are all actions, and multiple actions can be taken against an RDD. The existence of the data comes into the picture only when I take some action against the RDD. Okay.
Next question: what do you understand by transformations in Spark? Transformations are nothing but functions — mostly higher-order functions, as within Scala we have the concept of higher-order functions — which are applied against the RDD, mostly against the list of elements we have within the RDD; the function gets applied, but the existence of the RDD comes into the picture only if we take some action against it. In this particular example, I am reading the file and holding it within the RDD raw_data; then I am doing a transformation using map, so it's going to apply a function: in the map I have a function which splits each record using the tab character, so the split by tab is applied against each record within raw_data, and the resultant movies_data will again be another RDD. Of course, this is a lazy operation — the existence of movies_data comes into the picture only if I take some action against it, like count, print or store; only those actions generate the data.
of the memory management and fault tolerance of rdds. It's going to help us
to schedule distribute the task and manage the jobs running
within the cluster and so we're going to help
us to or store the rear in the storage system as well
as reads data from the storage. System that's to do the file
system level operations. It's going to help us and Spark core programming
can be done in any of these languages
like Java scalar python as well as using our so core is that the horizontal level
on top of spark or we can have a number of components and there are different type
of rdds available one such a special type is parody. So next question. What do you understand
In a pair RDD the data exists in pairs, as keys and values, so I can use some special functions, special transformations, within the pair RDD: for example, collect all the values corresponding to the same key, like what happens within the sort and shuffle phase of Hadoop; consolidate or group all the values corresponding to the same key; or apply some function against all the values corresponding to the same key. If I want to get the sum of the values for each key, I can use a pair RDD to achieve that, as in the small sketch below. So the data within a pair RDD exists as keys and values.
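A minimal pair-RDD sketch in Scala (the sample data is made up purely for illustration):

    val sales = sc.parallelize(Seq(("apples", 3), ("oranges", 2), ("apples", 4)))  // pair RDD of (key, value)
    val totals = sales.reduceByKey(_ + _)        // sum the values corresponding to the same key
    val grouped = sales.groupByKey()             // or group all values for a key together
    totals.collect().foreach(println)            // action: yields (apples,7) and (oranges,2)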
Okay, a question from Jason: what are vector RDDs? In machine learning a huge amount of the processing is handled by vectors and matrices, and we do lots of vector operations, like adding vectors or transforming data into a vector form. A vector, in the normal sense, has a direction and a magnitude, so we can do operations such as summing two vectors, or comparing the difference between vector A and B as well as between A and C: if the difference between A and B is smaller than the difference between A and C, we can say A and B are somewhat similar in terms of features. The vector RDD is used to represent vectors directly, and it is used extensively while doing machine learning. Thank you, Jason. Here is another question: what is RDD lineage?
For any data processing, any transformation that we do, Spark maintains something called a lineage: a record of how the data got transformed. The data is available in partitioned form across multiple systems, and when we do transformations it undergoes multiple steps. In the distributed world it is very common to have failures of machines, or machines going out of the network, and the framework should be in a position to handle that; Spark handles it through the RDD lineage, because it can restore just the lost partition. Assume data is distributed across five machines out of ten, and one of those five machines is lost: only the partition that was on the lost machine, with its latest transformation, needs to be regenerated, and Spark knows how to regenerate that data using the RDD lineage, that is, from which data source it was generated and what its previous steps were. The complete lineage is maintained by the Spark framework internally, and we call that RDD lineage. Next question: what is the Spark driver?
To put it simply, for those who are from a Hadoop and YARN background, we can compare it to the App Master. Every application will have a Spark driver, and the driver holds the SparkContext, which moderates the complete execution of the job: it connects to the Spark master, delivers the RDD graph, that is the lineage, to the master, and coordinates the tasks that get executed in the distributed environment. It can drive the parallel processing, the transformations and the actions against the RDDs, so it is the single point of contact for that specific application. The Spark driver is short-lived, the SparkContext within the driver is the coordinator between the master and the tasks that are running, and the driver can get started on any of the machines in the cluster. Next question: name the types of cluster managers in Spark.
Whenever you have a group of machines, you need a manager to manage the resources, and there are different types of cluster managers. We have already seen YARN, Yet Another Resource Negotiator, which manages the resources of Hadoop; on top of YARN we can make Spark work. Sometimes I may want to have Spark alone in my organization, not alongside Hadoop or any other technology; then I can go with the standalone mode, because Spark has a built-in cluster manager, so Spark alone can be executed across multiple systems. But generally, if we have a cluster, we will try to leverage various other computing frameworks as well, like graph processing with Giraph and so on; in that case we will go with YARN or a generalized resource manager like Mesos. YARN is very specific to Hadoop and comes along with Hadoop; Mesos is a cluster-level resource manager, and if I have multiple clusters within the organization, then I can use Mesos. Mesos is also a resource manager, and it is a separate top-level project within Apache. Next question: what do you understand by a worker node?
In a distributed environment we will have n number of workers; we call each one a worker node or slave node, and it does the actual processing: it gets the data, does the processing and gives us the result. The master node assigns what has to be done by which worker node, and the worker reads the data available on that specific machine. Generally, tasks are assigned to the worker node where the data is located; in the big data space, especially with Hadoop, we always try to achieve data locality. That said, the availability of resources in terms of CPU and memory is also considered. You might have some data replicated on three machines, and all three machines are busy doing other work, with no CPU or memory available to start another task; Spark will not wait for those machines to complete their jobs and free up resources, it will start the processing on some other machine that is near to the machines holding the data and read the data over the network. So, to answer straight, worker machines are the ones that do the actual work; they report to the master on resource utilization, and the tasks running within the worker machines do the actual work. Next question: what is a sparse vector?
Just a few minutes back I was answering a question about what a vector is: a vector is nothing but a way of representing data in multi-dimensional form, and the vector can be multi-dimensional as well. If I am going to represent a point in space, I need three dimensions, x, y and z, so the vector will have three dimensions. If I need to represent a line in space, then I need two points, the starting point and the end point of the line, so I need a vector that can hold two dimensions: the first dimension will hold one point and the second dimension will hold another point, say point B. If I have to represent a plane, then I need another dimension to represent two lines, with each line represented by two points. In the same way I can represent any data in vector form. For example, you might have a huge number of feedback entries or ratings of products across an organization. Take Amazon: Amazon has millions of products, and not a single user would have used all the millions of products within Amazon; we would hardly have used 0.1 percent or even less, maybe a few hundred products, that we have rated over a complete lifetime. If I have to represent all the product ratings with a vector, the first position of the vector refers to the product with ID 1, the second position refers to the product with ID 2, and so on, so I will have a million values within that particular vector; but out of those million values, only the few hundred products I actually rated will have values, varying from 1 to 5, and for all the others it will be 0. Sparse means thinly distributed. Rather than storing all the zeros, we store only the non-zero entries as positions and their corresponding values; any position that is not mentioned is implicitly zero. So to store only the non-zero entities, the sparse vector is used, and we don't waste additional space; that is a sparse vector, as in the sketch below.
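A minimal sketch of a sparse vector using Spark MLlib's linear algebra API (the size, positions and values are just illustrative):

    import org.apache.spark.ml.linalg.Vectors

    // a vector of size 1000000 where only positions 3 and 417 are non-zero
    val ratings = Vectors.sparse(1000000, Array(3, 417), Array(5.0, 4.0))
    println(ratings(3))      // 5.0
    println(ratings(10))     // 0.0 -- unmentioned positions are implicitly zero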
Let's discuss some questions on Spark Streaming. How is streaming data handled in Spark? Explain with examples. Spark Streaming is used for processing real-time streaming data; to be precise, it is micro-batch processing. Data is collected over every small interval, say every 0.5 seconds or every second, and then gets processed. Internally it creates micro-batches, and the data created out of each micro-batch is called a DStream. A DStream is like an RDD, so I can do transformations and actions, whatever I do with an RDD, with a DStream as well. Spark Streaming can read data from Flume, HDFS or other streaming services, and store the results in a dashboard or in any other database, and it provides very high throughput, as the data can be processed across a number of different systems in a distributed fashion. The DStream is partitioned internally, and it has the built-in feature of fault tolerance: even if any data of a transformed RDD is lost, it can regenerate those RDDs from the existing data or from the source data. So the DStream is the building block of streaming, and it has the same fault-tolerance mechanism that we have within the RDD: a DStream is a specialized form of RDD, made specifically to be used within Spark Streaming, as in the sketch below.
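A minimal Spark Streaming sketch in Scala (the host, port and one-second batch interval are assumptions for illustration; sc is an existing SparkContext):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))            // micro-batches collected every 1 second
    val lines = ssc.socketTextStream("localhost", 9999)       // DStream of lines from a socket source
    val words = lines.flatMap(_.split(" "))                   // DStream transformations work like RDD ones
    words.count().print()                                     // an output operation runs on every micro-batch
    ssc.start()
    ssc.awaitTermination()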
Next question: what is the significance of the sliding window operation? That's a very interesting one. In streaming data, when we do the computing, the intensity or the business implication of that specific data may oscillate a lot. For example, within Twitter we talk about the trending hashtag. Maybe someone hacked into the system and generated a huge number of tweets, so for that particular hour the hashtag appeared millions of times; just because it appeared millions of times for that specific two- or three-minute duration, it should not get into the trending hashtags for the whole day or the whole month. So what we do is take an average over a window: the current time frame, T minus 1, T minus 2 and so on; we consider all that data and find the average or the sum, and the complete business logic is applied against that particular window. Any drastic change, to put it precisely a drastic spike or a drastic dip in the pattern of the data, is normalized. That is the significance of using the sliding window operation within Spark Streaming, and Spark can handle the sliding window automatically: it can store the prior data, T minus 1, T minus 2, and how big a window needs to be maintained can be defined easily within the program at the abstract level, as in the sketch below.
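A minimal sketch of a windowed computation on a DStream (the 60-second window and 10-second slide are illustrative assumptions, and hashTagCounts is assumed to be an existing DStream of (hashtag, count) pairs):

    import org.apache.spark.streaming.Seconds

    // keep a count per hashtag over the last 60 seconds, recomputed every 10 seconds
    val windowedCounts = hashTagCounts.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // combine counts within the window
      Seconds(60),                 // window length
      Seconds(10))                 // sliding interval
    windowedCounts.print()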
Next question: what is a DStream? The expansion is discretized stream. That is the abstract form of representation of data for Spark Streaming. Just as RDDs get transformed from one form to another, we have a series of RDDs all put together, called a DStream; so a DStream is nothing but another representation of RDDs, a group of RDDs treated as a stream, and I can apply the streaming functions, or any of the transformations and actions available within Spark Streaming, against this DStream. For the micro-batch, I define at what interval the data should be collected and processed; it could be every second, every hundred milliseconds or every five seconds. I can define that particular period, and all the data received in that duration is considered as one piece of data, and that is called a DStream.
Next question: explain caching in Spark Streaming. Spark internally uses in-memory computing, so any data generated while it is doing the computing will be there in memory; but if you do more and more processing with other jobs and there is a need for more memory, the least used RDDs, or the least used data generated out of actions on the RDDs, will be cleared from memory. Sometimes I may need certain data to stay in memory permanently; a very simple example is a dictionary: I want the dictionary words to always be available in memory, because I may do a spell check against tweet comments or feedback comments at any time. So what I can do is cache it: any data that comes in, I can cache or persist it in memory, so even when there is a need for memory by other applications, this specific data will not be removed and can be used for further processing. The caching can also be defined as in-memory only, or in memory and disk; that also we can define, as in the sketch below.
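A minimal caching sketch following the dictionary example above (the file path is an assumption for illustration):

    import org.apache.spark.storage.StorageLevel

    val dictionary = sc.textFile("hdfs:///data/dictionary.txt")
    dictionary.persist(StorageLevel.MEMORY_AND_DISK)   // keep it around across jobs; spill to disk if memory is needed
    // or simply dictionary.cache() for the default MEMORY_ONLY level
    println(dictionary.count())                        // the first action materializes and caches the data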
Let's discuss some questions on Spark GraphX. The next question is: is there an API for implementing graphs in Spark? In graph theory everything is represented as a graph; a graph has nodes and edges, and in Spark they are all represented using RDDs. There is a component called GraphX which extends the RDD and exposes the functionality to represent a graph: we can have an edge RDD and a vertex RDD, and by combining the edges and vertices I can create a graph, and this graph can exist in a distributed environment, so in the same way we are in a position to do parallel processing as well. GraphX is just a form of representing the data as graphs with edges and vertices, and yes, it provides the API to create the graph and do the processing on the graph, as sketched below.
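A minimal GraphX sketch in Scala (the vertices and edges are made-up sample data):

    import org.apache.spark.graphx.{Edge, Graph}

    // vertex RDD: (vertexId, property) and edge RDD: Edge(srcId, dstId, property)
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph    = Graph(vertices, edges)       // a distributed property graph built from the two RDDs
    println(graph.vertices.count())             // graph operations run in parallel like any RDD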
Next question: what is PageRank in GraphX? Once the graph is created, we can calculate the PageRank for a particular node. That is very similar to how Google has a PageRank for websites: the higher the PageRank, the more important it is. It shows the importance of that particular node or edge within that particular graph. A graph is a connected set of data; nodes are connected using properties, and depending on how important that property makes the connection, there is a value associated with it. Within PageRank we can calculate a static PageRank, which runs a fixed number of iterations, or there is another variant called dynamic PageRank, which keeps executing until we reach a particular saturation level, and the saturation level can be defined with multiple criteria. These graph operations can be executed directly against the graph, and they are all available as APIs within GraphX, as in the sketch below.
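A minimal PageRank sketch against the graph built in the previous sketch (the tolerance and iteration count are illustrative):

    // dynamic PageRank: iterate until the ranks change by less than the tolerance
    val ranks = graph.pageRank(0.0001).vertices

    // static PageRank: run a fixed number of iterations
    val staticRanks = graph.staticPageRank(10).vertices

    ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }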
Next question: what is a lineage graph? The name sounds similar to GraphX, but this is about RDDs: every RDD internally keeps the relation saying how that particular RDD got created, from where, and how it got transformed. The complete lineage, the complete history or path, is recorded within the lineage graph, and it is used in case any particular partition of the RDD is lost, so that it can be regenerated; even if the complete RDD is lost, we can regenerate it. It holds the complete information on what the partitions are, where they exist, what transformations they have undergone and what the result is, so if anything is lost in the middle, Spark knows where to recalculate from and what essential pieces need to be recalculated. That saves us a lot of time, and if an RDD is never used, it will never get recalculated; the recalculation is triggered by an action, only on a need basis, which is why Spark uses memory optimally. Does Apache Spark provide checkpointing?
Officially, yes. Take the example of streaming: if any data is lost within a particular sliding window, we cannot get it back. Say I am maintaining a sliding window of 24 hours to do some averaging; every 24 hours it keeps sliding, and if I lose a system, or there is a complete failure of the cluster, I may lose the data because it is all held in memory. So how do we recover if the system is lost? Spark provides something called checkpointing. We can checkpoint the data, and it is provided directly by the Spark API; we just have to provide the location where it should get checkpointed, and we can read that data back when we bring the system up again, and whatever state it was in, we can regenerate it. So yes, to answer the question straight, Spark supports checkpointing, and it helps us regenerate the earlier state, as in the sketch below.
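A minimal checkpointing sketch (the HDFS directories are hypothetical paths, and this reuses the StreamingContext ssc and the rawData RDD from the earlier sketches):

    // streaming: metadata and generated RDDs are written here so the state can be rebuilt after a failure
    ssc.checkpoint("hdfs:///checkpoints/streaming-app")

    // RDDs can also be checkpointed directly
    sc.setCheckpointDir("hdfs:///checkpoints/rdds")
    val cleaned = rawData.map(_.trim)
    cleaned.checkpoint()     // truncates the lineage; data is saved to the checkpoint directory on the next action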
Let's move on to the next component, Spark MLlib. How is machine learning implemented in Spark? Machine learning is a huge ocean by itself, and it is not a technology specific to Spark; machine learning is a common data science discipline where we have different categories of algorithms, like clustering, regression, dimensionality reduction and so on, and most of these algorithms have been implemented in Spark. Spark is the preferred framework, or preferred component, to run machine learning algorithms nowadays, the reason being that most machine learning algorithms need to be executed iteratively, a number of times, until we get the optimal result, maybe twenty-five or fifty iterations, or until we reach a specific accuracy; you keep running the processing again and again, and Spark is a very good fit whenever you want to do the processing repeatedly, because the data is available in memory: I can read it faster, store the result back into memory, and read it faster again. All these machine learning algorithms are provided within Spark in a separate component called MLlib, and within MLlib we have further components like feature extraction. You may be wondering how it can process images: the core of processing an image, audio or video is extracting the features and comparing how related those features are; that is where vectors and matrices come into the picture. We can have a pipeline of processing as well: do processing step one, take the result and feed it into processing step two; and it has persistence as well, so a generated model can be persisted and reloaded back into the system to continue the processing from that particular point onwards. Next question: what are the categories of machine learning?
Machine learning has different categories: supervised, unsupervised and reinforcement learning. Supervised and unsupervised are very popular. In supervised learning I know well in advance what category a sample belongs to; say I want to do character recognition: while training, I can give the information that this particular image belongs to this particular character or number, and train on that. Sometimes I will not know the categories in advance; assume I have different types of images, cars, bikes, cats, dogs, and I want to know how many categories there are. I will not know in advance, so I want to group them, and then I will realize that all of these belong to a particular category; I will identify the pattern within the category and give it a name, say all these images belong to the boat category because they look like boats. Leaving it to the system, without providing the labels, is a different type of machine learning, unsupervised learning. As such, machine learning is not specific to Spark; Spark just helps us run these machine learning algorithms. Next question: what are the Spark MLlib tools?
MLlib is nothing but the machine learning library, the machine learning offering within Spark. It has a number of algorithms implemented, and it provides a very good feature to persist results: generally in machine learning we generate a model (the learned pattern of the data is called the model), and the model can be persisted in different forms, like Parquet among other formats, stored and read back. It has methodologies to extract features from a set of data: I may have a million images, and I want to extract the common features available within those millions of images. There are other utilities available as well, to define the seed for randomization and so on, as well as pipelines, which are very specific to Spark: I can arrange the sequence of steps to be undergone by the machine learning process, run machine learning algorithm one first, then feed its result into machine learning algorithm two, and so on; we can have a sequence of execution, and that is defined using pipelines. These are the notable features of Spark MLlib; a small pipeline sketch follows.
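A minimal MLlib pipeline sketch (assuming a DataFrame called training with "text" and "label" columns; the stages, parameters and save path are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    // chain feature extraction and the learning algorithm into one sequence of steps
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model    = pipeline.fit(training)            // train the whole pipeline
    model.write.overwrite().save("/tmp/text-model")  // persist the fitted model so it can be reloaded later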
What are some popular algorithms and utilities in Spark MLlib? Some popular ones are regression, classification, basic statistics and recommendation systems. The recommendation system is well implemented: all we have to provide is the data. If you give the ratings and products within an organization, if you have the complete dump, we can build a recommendation system in no time, and for any user we can produce a recommendation: these are the products the user may like, and those products can be displayed in the search results. A recommendation system works on the basis of the feedback users provided for the products they bought earlier. Next, dimensionality reduction: whenever we train with huge amounts of data, it is very compute-intensive, and we may have to reduce the dimensions, especially the matrix dimensions within MLlib, without losing the features; there are algorithms available to do that dimensionality reduction. Then feature extraction: finding which common features are available within a particular image, so I can compare the common features across images; that is how we group images, or answer whether a person looking like this image is in the database or not. For example, assume the police or crime department maintains a list of persons who committed crimes, and we get a new photo; when they do a search, they may not have the exact photo bit for bit: the photo might have been taken with a different background, different lighting, a different location or time, so the data will be a hundred percent different in bits and bytes, but the looks are going to be the same. I want to search for photos looking similar to the photograph given as input; to achieve that, we extract the features of each of those photos and match the features rather than the bits and bytes. There are also a number of algorithms for optimization, in terms of processing or building the pipeline. Let's move on to Spark SQL. Is there a module to implement SQL in Spark? How does it work?
Not SQL directly on raw data, but Spark SQL is very similar to Hive: whatever structured data we have, we can read that data or extract meaning out of it using SQL. It exposes APIs, and we can use those APIs to read the data or create DataFrames. Spark SQL has four major libraries: the data source API, the DataFrame API, the SQL interpreter and optimizer, and the SQL service. A DataFrame is a representation of rows and columns, like Excel data, multi-dimensional structured data in abstract form; on top of a DataFrame I can run a query, and internally there is an interpreter and optimizer, so any query I fire gets interpreted, optimized and executed using the SQL services, fetching the data from the DataFrame, or reading the data from the data source and doing the processing, as in the sketch below.
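A minimal Spark SQL sketch (assuming a SparkSession called spark and a hypothetical JSON file of employees):

    val employees = spark.read.json("hdfs:///data/employees.json")   // build a DataFrame from a data source
    employees.createOrReplaceTempView("employees")                   // expose it to the SQL interpreter

    // the query is parsed and optimized internally before execution
    val seniors = spark.sql("SELECT name, age FROM employees WHERE age > 40")
    seniors.show()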
Next question: what is a Parquet file? It is a file format where data is kept in a structured form; in particular, the result of Spark SQL can be stored or returned in it for persistence. Parquet is again open source from Apache; it is a data serialization technique where we serialize the data in the Parquet form, and to be precise it is columnar storage: it consumes less space, it uses keys and values to store the data, and it also helps you access specific data from that Parquet file using a query. So Parquet is an open-source, columnar data serialization format to store and persist data as well as to retrieve it, as in the sketch below.
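A minimal Parquet read/write sketch (reusing the hypothetical employees DataFrame from the previous sketch; the output path is also an assumption):

    employees.write.parquet("hdfs:///warehouse/employees.parquet")           // columnar, compressed on disk

    val reloaded = spark.read.parquet("hdfs:///warehouse/employees.parquet") // the schema comes back with the data
    reloaded.select("name").show()                                           // only the needed column is scanned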
Next question: list the functions of Spark SQL. It can be used to load varieties of structured data (yes, Spark SQL works only with structured data), you can use SQL-like queries against that data from your program, and it can be used with external tools that connect to Spark as well. It gives very good integration with SQL, and using Python, Java or Scala code we can create an RDD from structured data directly; using Spark SQL I can generate the RDD. So it helps people from a database background write programs faster and more easily. Next question: what do you understand by lazy evaluation?
Whenever you do any operation in the Spark world, it does not do the processing immediately; it looks for the final result that we are asking for, and if we don't ask for a final result, it doesn't need to do the processing. Everything is based on the final action: until we perform an action, there is no actual processing; Spark just records what transformations it has to do, and finally, when we ask for the action, it completes the data processing in an optimized way and gives us the final result. So, to answer straight, lazy evaluation means doing the processing only when the resultant data is needed; if the data is not required, it does not do the processing. Can you use Spark to access and analyze data stored in Cassandra databases?
Yes, it is possible, and not only Cassandra: Spark can work with any of the NoSQL databases. Cassandra also works in a distributed architecture; it is a NoSQL database, so Spark can leverage data locality: the query can be executed locally where the Cassandra nodes are, which makes query execution faster and reduces the network load. The Spark executors will try to get started on the machines where the Cassandra nodes, and therefore the data, are available, and do the processing locally, so it leverages data locality. Next question: how can you minimize data transfers when working with Spark?
If you look at the core design, the success of a Spark program depends on how much you reduce the network transfer; network transfer is a very costly operation and you cannot parallelize it. There are multiple ways, especially two, to avoid this: one is called broadcast variables and the other accumulators. A broadcast variable helps us transfer any static data or information that needs to be published to multiple systems: if some data has to be made available on multiple executors to be used in common, I can broadcast it. And if I want to consolidate values produced on multiple workers into a single centralized location, I can use an accumulator. These two help us achieve data distribution and data consolidation in the distributed world, and the API works at an abstract level where we don't need to do the heavy lifting; that is taken care of by Spark for us. What are broadcast variables?
Just now we discussed it: there may be a common value that I want to be available on multiple executors, multiple workers. A simple example: you want to do a spell check on tweet comments; the dictionary has the right list of words, I have the complete list, and I want that particular dictionary to be available in each executor, so that the tasks running locally in those executors can refer to that dictionary and get the processing done while avoiding network data transfer. So the process of distributing the data from the Spark context to the executors where the tasks are going to run is achieved using broadcast variables, and it is built into the Spark API: using the Spark API we can create the broadcast variable, and the process of distributing the data and making it available in all executors is taken care of by the Spark framework, as in the sketch below.
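A minimal broadcast sketch following the dictionary example (the word set and the tweets RDD are illustrative assumptions):

    val validWords = sc.broadcast(Set("spark", "hadoop", "streaming"))   // shipped once to every executor

    // tasks read the broadcast value locally instead of pulling it over the network each time
    val misspelled = tweets.flatMap(_.split(" ")).filter(word => !validWords.value.contains(word.toLowerCase))
    println(misspelled.count())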
Next question: explain accumulators in Spark. Similar to how we have broadcast variables, we have accumulators as well. A simple example: you want to count how many error records exist in the distributed environment. Your data is distributed across multiple systems, multiple executors; each executor does the processing and counts the records locally, but I may want the total count. So I ask Spark to maintain an accumulator; of course it is maintained in the SparkContext, in the driver program, and the driver program is one per application. It keeps getting accumulated, and whenever I want, I can read those values and take the appropriate action. So accumulators and broadcast variables look more or less opposite to each other, but their purposes are totally different, as in the sketch below.
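A minimal accumulator sketch for the error-count example (the logLines RDD is an illustrative assumption):

    val errorCount = sc.longAccumulator("errorRecords")   // lives in the driver, updated from the executors

    logLines.foreach { line =>
      if (line.contains("ERROR")) errorCount.add(1)       // each executor adds to it locally
    }
    println(s"total errors: ${errorCount.value}")         // only the driver reads the consolidated value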
Why is there a need for broadcast variables when working with Apache Spark? A broadcast variable is a read-only variable, and it is cached in memory in a distributed fashion; it eliminates the work of moving the data from a centralized location, that is the Spark driver or a particular program, to all the executors within the cluster where the tasks get executed, and we don't need to worry about where within the cluster a task will run. Compared with accumulators, broadcast variables are read-only: the executors cannot change the value, they can only read it, so mostly it is used like a cache for reference data. Next question: how can you trigger automatic clean-ups in Spark to handle accumulated metadata?
There is a TTL parameter that we can set; the clean-up gets triggered alongside the running jobs, and in between it writes intermediate results to disk and cleans up unnecessary data, or cleans the RDDs that are not being used: the least used RDDs get cleaned, keeping the metadata as well as the memory clean. What are the various levels of persistence in Apache Spark?
When we say data should be persisted, it can be at different levels: it can be in memory only, in memory and disk, or disk only, and when it is getting stored we can ask it to be stored in serialized form. The reason why we may store or persist an RDD is that I want this particular form of the RDD back for later use; maybe I don't need it immediately, so I don't want it to keep occupying my memory; I write it to disk and read it back whenever there is a need, as in the sketch below.
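A minimal sketch of the persistence levels (reusing the hypothetical moviesData RDD from the earlier transformation example):

    import org.apache.spark.storage.StorageLevel

    moviesData.persist(StorageLevel.MEMORY_ONLY)          // default: keep deserialized objects in memory
    // other levels trade memory for recomputation or disk I/O:
    //   StorageLevel.MEMORY_AND_DISK      - spill partitions that don't fit in memory to disk
    //   StorageLevel.DISK_ONLY            - keep everything on disk
    //   StorageLevel.MEMORY_ONLY_SER      - keep a serialized, more compact form in memory
    moviesData.unpersist()                                // release it when it is no longer needed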
Next question: what do you understand by a SchemaRDD? A SchemaRDD is used within Spark SQL. The RDD has the meta information built into it: it has a schema, very similar to a database schema, describing the structure of that particular data, and when I have the structure it is easy for me to handle the data. So the data and the structure exist together in a SchemaRDD. Nowadays it is called a DataFrame in Spark, and the DataFrame term is very popular in languages like R as well; it holds the data plus the meta information about that data, saying what the columns and the structure are. Explain a scenario where you would be using Spark Streaming.
You may want to do a sentiment analysis of tweets, so the tweets will be streamed in: we use a tool like Flume to harvest the information from Twitter and feed it into Spark Streaming. It extracts or identifies the sentiment of each and every tweet and marks whether it is positive or negative, and accordingly the data, now structured, say the percentage of positive and percentage of negative sentiment, is stored in some structured form. Then you can leverage Spark SQL and do grouping or filtering based on the sentiment, and maybe use a machine learning algorithm to ask what drives a particular tweet to be on the negative side: is there any similarity among all the negative tweets, maybe they are specific to a product, to a specific time when the tweet was posted, or to a specific region from which it was posted; that analysis can be done by leveraging MLlib on Spark. So MLlib, Streaming, SQL and Core all work together; they are different offerings available to solve different problems. With this we are coming
and useful one. The more information
about editor is available in this website to record
at cou only best and keep visiting the website
for blocks and latest updates. Thank you folks. I hope you have enjoyed
listening to this video. Please be kind enough to like it and you can comment any
of your doubts and queries and we will reply them at the earliest do look out
for more videos in our playlist And subscribe to Edureka
channel to learn more. Happy learning.