Hadoop In 5 Minutes | What Is Hadoop? | Introduction To Hadoop | Hadoop Explained | Simplilearn

Captions

Let's rewind to the days before the world turned digital. Back then, minuscule amounts of data were generated at a relatively sluggish pace. All the data was mostly documents, in the form of rows and columns. Storing or processing this data wasn't much trouble, as a single storage unit and processor combination would do the job.

But as years passed by, the internet took the world by storm, giving rise to tons of data generated in a multitude of forms and formats every microsecond. Semi-structured and unstructured data was now available in the form of emails, images, audio, and video, to name a few. All this data became collectively known as big data. Although fascinating, it became nearly impossible to handle this big data, and a single storage unit and processor combination was obviously not enough. So what was the solution?

Multiple storage units and processors were undoubtedly the need of the hour. This concept was incorporated in the framework of Hadoop, which could store and process vast amounts of any data efficiently using a cluster of commodity hardware. Hadoop consisted of three components that were specifically designed to work on big data.

In order to capitalize on data, the first step is storing it. The first component of Hadoop is its storage unit, the Hadoop Distributed File System, or HDFS. Storing massive data on one computer is unfeasible; hence, data is distributed amongst many computers and stored in blocks. So if you have 600 megabytes of data to be stored, HDFS splits the data into multiple blocks that are then stored on several data nodes in the cluster. 128 megabytes is the default size of each block. Hence, 600 megabytes will be split into four blocks, A, B, C, and D, of 128 megabytes each, with the remaining 88 megabytes in the last block, E.

So now you might be wondering: what if one data node crashes? Do we lose that specific piece of data? Well, no. That's the beauty of HDFS. HDFS makes copies of the data and stores them across multiple systems. For example, when block A is created, it is replicated with a replication factor of 3 and stored on different data nodes. This is termed the replication method. By doing so, data is not lost at any cost, even if one data node crashes, making HDFS fault tolerant.

After storing the data successfully, it needs to be processed. This is where the second component of Hadoop, MapReduce, comes into play. In the traditional data processing method, the entire data would be processed on a single machine having a single processor. This consumed time and was inefficient, especially when processing large volumes of a variety of data. To overcome this, MapReduce splits data into parts and processes each of them separately on different data nodes. The individual results are then aggregated to give the final output.

Let's try to count the number of occurrences of words, taking this example. First, the input is split into five separate parts based on full stops. The next step is the mapper phase, where the occurrence of each word is counted and allocated a number. After that, depending on the words, similar words are shuffled, sorted, and grouped, following which, in the reducer phase, all the grouped words are given a count. Finally, the output is displayed by aggregating the results. All this is done by writing a simple program. Similarly, MapReduce processes each part of big data individually and then sums the result at the end. This improves load balancing and saves a considerable amount of time.

Now that we have our MapReduce job ready, it is time for us to run it on the Hadoop cluster. This is done with the help of a set of resources such as RAM, network bandwidth, and CPU. Multiple jobs are run on Hadoop simultaneously, and each of them needs some resources to complete the task successfully. To efficiently manage these resources, we have the third component of Hadoop, which is YARN.

Yet Another Resource Negotiator, or YARN, consists of a resource manager, node manager, application master, and containers. The resource manager assigns resources, node managers handle the nodes and monitor the resource usage in the node, and the containers hold a collection of physical resources. Suppose we want to process the MapReduce job we had created. First, the application master requests the container from the node manager. Once the node manager gets the resources, it sends them to the resource manager. This way, YARN processes job requests and manages cluster resources in Hadoop.

In addition to these components, Hadoop also has various big data tools and frameworks dedicated to managing, processing, and analyzing data. The Hadoop ecosystem comprises several other components like Hive, Pig, Apache Spark, Flume, and Sqoop, to name a few. The Hadoop ecosystem works together on big data management.

So here's a question for you: what is the advantage of the 3x replication schema in HDFS? (a) Supports parallel processing, (b) faster data analysis, (c) ensures fault tolerance, (d) manages cluster resources. Give it a thought and leave your answers in the comment section below. Three lucky winners will receive Amazon gift vouchers.

Hadoop has proved to be a game changer for businesses, from startups to big giants like Facebook, IBM, eBay, and Amazon. There are several applications of Hadoop, like data warehousing, recommendation systems, fraud detection, and so on.

We hope you enjoyed this video. If you did, a thumbs up would be really appreciated. Here's your reminder to subscribe to our channel and click on the bell icon for more on the latest technologies and trends. Thank you for watching, and stay tuned for more from Simplilearn.
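The block arithmetic and replication idea from the HDFS part of the captions can be checked with a small simulation. This is only a sketch in plain Python, not HDFS code: the function names, the node names, and the round-robin placement are assumptions made for illustration, and real HDFS uses its own placement policy.

```python
BLOCK_SIZE_MB = 128      # default block size quoted in the captions
REPLICATION_FACTOR = 3   # replication factor quoted in the captions


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last, partially filled block
    return sizes


def place_replicas(num_blocks, data_nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical toy policy for illustration; not HDFS's real placement.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block: [data_nodes[(block + i) % len(data_nodes)]
                for i in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Return the replicas of each block that remain after one node fails."""
    return {block: [n for n in nodes if n != failed_node]
            for block, nodes in placement.items()}


sizes = split_into_blocks(600)
print(sizes)  # [128, 128, 128, 128, 88]

nodes = ["node1", "node2", "node3", "node4"]  # hypothetical data nodes
placement = place_replicas(len(sizes), nodes)
after_crash = surviving_replicas(placement, "node1")
print(all(len(copies) >= 1 for copies in after_crash.values()))  # True
```

With three copies on distinct nodes, losing any single node still leaves at least two copies of every block, which is the fault tolerance the captions describe.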
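The word-count walkthrough in the MapReduce part of the captions (split, map, shuffle and sort, reduce) can be mimicked on a single machine. The following is a minimal sketch in plain Python, not the Hadoop MapReduce API; the function names and the sample sentence are assumptions, since the example shown in the video is not part of the captions.

```python
from collections import defaultdict


def split_input(text):
    """Split the input into parts on full stops."""
    return [part.strip() for part in text.split(".") if part.strip()]


def mapper(part):
    """Emit a (word, 1) pair for every word in one input part."""
    return [(word.lower(), 1) for word in part.split()]


def shuffle(mapped_pairs):
    """Group the emitted counts by word and sort the groups by word."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))


def reducer(word, counts):
    """Sum the grouped counts for one word."""
    return word, sum(counts)


def word_count(text):
    """Run split, map, shuffle, and reduce in sequence on one machine."""
    mapped = []
    for part in split_input(text):
        mapped.extend(mapper(part))  # on a cluster, each part runs separately
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


sample = "Big data is big. Hadoop stores data. Hadoop processes data."
result = word_count(sample)
print(result)
# {'big': 2, 'data': 3, 'hadoop': 2, 'is': 1, 'processes': 1, 'stores': 1}
```

On a real cluster the mapper calls would run in parallel on different data nodes; here they run in a loop only to keep the sketch self-contained.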

Info

Channel: Simplilearn

Views: 1,311,373

Keywords: simplilearn, Hadoop In 5 Minutes, What Is Hadoop?, Introduction To Hadoop, Hadoop Explained, what is hadoop, introduction to hadoop, what is hadoop and big data, hadoop tutorial for beginners, what is hadoop and how does it work, what is big data, hadoop, hadoop tutorial, introduction to hadoop framework, hadoop overview, hadoop overview and history, hadoop architecture overview, simplilearn hadoop, big data, what is big data and hadoop, apache hadoop

Id: aReuLtY0YMI

Length: 6min 20sec (380 seconds)

Published: Thu Jan 21 2021