What exactly is Apache Spark? | Big Data Tools

Video Statistics and Information

Captions
What is Spark, and how is it used in big data? Is it related to Hadoop? Let's look at the architecture of Spark, explain some key terms, see how it relates to other big data solutions, and, most importantly, see if it has a place in your data architecture. [Music]

Spark was developed in 2009 at the Berkeley lab by Matei Zaharia. In 2010 the code was open sourced, in 2013 it was given to the Apache foundation, and in 2014 it was designated as a top-level project. Since then it has become one of the most popular and active big data projects. The goal of Spark was to provide a fast, general-purpose cluster framework for large-scale data processing, designed to overcome the limitations of MapReduce, the most common data processing method in Hadoop at the time.

The foundation of Spark is based on the resilient distributed dataset, or RDD, which is a programming abstraction that represents a collection of read-only objects split across a computing cluster. The RDD can be created from text files, SQL databases, NoSQL databases, HDFS, cloud storage, and pretty much anything else. RDDs allow for standard MapReduce functions, but also joining datasets, filtering, and aggregation. The processing of RDDs is done entirely in memory. The RDD is designed to hide complexity from users, who don't have to worry about defining where specific files are sent or what resources are used to store and retrieve files.

The processing of an RDD is done via drivers and executors. When a program executes, it starts with a driver, which creates a Spark context, essentially an orchestrator that considers the code and determines the possible tasks to be performed. It generates a physical plan and then uses the cluster manager to coordinate all of the executors to schedule and run the tasks. The scheduler within the Spark context is called a DAG (directed acyclic graph), which will assign tasks and the order to execute them out to the worker nodes. Then the executors residing in the worker nodes are dynamically launched by the cluster manager; they run a task and return that result to the driver.

All of this makes up the core engine for Spark. On top of that, there are several library modules that allow developers to easily interact with the core engine. These include Spark SQL, for working with structured data and data frames (this can be interacted with using SQL or Hive, as well as Java, Python, or Scala); Spark Streaming, which allows for ingesting small data batches, or micro-batching, to achieve near real-time data streams; MLlib, to provide a distributed machine learning framework; and GraphX, for distributed graph processing.

Spark can be run on a processing engine such as Hadoop YARN (Cloudera and Hortonworks use this), as well as Apache Mesos, Kubernetes, and Docker Swarm. A managed Spark solution can be used with Amazon EMR, Google Cloud Dataproc, and Azure HDInsight, as well as Databricks across platforms.

The most important feature of Spark is speed. Its fast processing, thanks to the RDD design and in-memory processing, makes it run at significantly faster speeds than other big data options. It's very flexible in programmability, allowing developers to use their preferred programming language. It can handle near real-time data processing, thanks to the speed and ability to read stream data in memory. And Spark allows for more advanced analytics, thanks to the ability to use SQL, machine learning functions, graph processing, and additional modules like SparkR to work with R.

These features have made Spark the preferred option over MapReduce, which has slow processing that makes it less ideal for real-time data, OLTP functions, filtering, and iterative execution. It's common to use the Hadoop Distributed File System as the data storage layer and then Spark as the data processing and data interaction layer. By far the biggest disadvantage of Spark is the extreme amount of RAM required to do all the processing in memory, but for most, the cost of memory is outweighed by the many advantages Spark provides.

Thanks for watching. If you enjoyed this video or learned something, a thumbs up would be really appreciated. Stick around for more data content by subscribing to the channel or clicking a video on screen. See you in the next one. [Music]
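
As a rough illustration of the RDD and driver/executor ideas described in the captions, here is a toy model in plain Python. It is not Spark's API: the class name `ToyRDD`, the `from_list` helper, and the thread pool standing in for executors are all assumptions made for this sketch. It only shows the general shape: read-only partitions, transformations recorded as a plan, and tasks run per partition with results returned to the caller (the "driver").

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ToyRDD:
    """Hypothetical stand-in for an RDD: read-only partitions plus a recorded plan."""

    def __init__(self, partitions, lineage=()):
        # Partitions are stored as tuples so the source data is never modified.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Ordered list of (kind, function) steps; nothing runs until an action.
        self._lineage = tuple(lineage)

    @classmethod
    def from_list(cls, data, num_partitions):
        # Split the data round-robin across partitions (an arbitrary choice here).
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformations return a new ToyRDD with one more step in the plan.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def _run_task(self, partition):
        # One task: apply every recorded step, in order, to a single partition.
        rows = list(partition)
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def collect(self, executors=2):
        # The caller acts as the driver: it hands one task per partition to a
        # pool of workers (threads here, playing the role of executors) and
        # gathers the results back in partition order.
        with ThreadPoolExecutor(max_workers=executors) as pool:
            results = list(pool.map(self._run_task, self._partitions))
        return [row for part in results for row in part]

    def reduce(self, fn, executors=2):
        # Aggregation action: combine all collected rows into a single value.
        return reduce(fn, self.collect(executors))


numbers = ToyRDD.from_list(range(1, 11), num_partitions=3)
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
total = evens_squared.reduce(lambda a, b: a + b)
print(total)  # 220
```

Note that `filter` and `map` only extend the recorded plan; work happens when `collect` or `reduce` is called, which loosely mirrors the split the video describes between the driver planning tasks and the executors running them. A real cluster manager, physical plan, and DAG scheduler are far more involved than this sketch.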
Info
Channel: nullQueries
Views: 80,029
Keywords: apache spark, spark, big data, what is apache spark, spark big data, spark sql, what is spark, hadoop, analytics, databricks, data processing, big data spark, data warehousing, introduction to apache spark, What exactly is apache spark, hadoop tutorial, spark hadoop, apache spark basic concepts, spark tutorial, spark tutorial for beginners, spark streaming, hadoop vs spark, spark architecture, apache spark concepts, apache spark architecture, apache spark databricks
Id: ymtq8yjmD9I
Length: 4min 36sec (276 seconds)
Published: Tue Jul 13 2021