Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka

Captions
Hey everyone, this is Reshma from Edureka, and in today's tutorial we're going to focus on Hadoop. Thank you all for joining today's session. Before I begin, I want to make sure you can all hear me properly, so kindly drop me a confirmation in the chat window so I can get started. Alright, I've got confirmations from Kanika, Neha, Keshav, Jason and Sebastian, so let's start by looking at the topics we'll be learning today.

We'll start with the big data growth drivers, the reasons data has been growing into big data. Then we'll look at what big data is and at its solution, which is Hadoop. We'll see Hadoop's master/slave architecture and the different Hadoop core components, study how HDFS stores data in data blocks and how the read/write mechanism works in HDFS, and then move to the programming part of Hadoop, which is known as MapReduce. We'll understand it with a MapReduce program and the entire MapReduce job workflow, then look at the Hadoop ecosystem and the different tools it comprises, and finally take up a use case where we'll see how Hadoop has solved big data problems in real life. I hope the agenda is clear to everyone. Alright, it seems everyone is clear with the agenda, so we'll get started with the big data growth drivers.

The reasons behind the growth of big data are numerous; ever since technology advanced, data has been growing every day. If you go back in time, to the 70s or 80s, not many people were using computers, only a fraction of people were dealing with them, and so the data fed into computer systems was also quite small. Now everyone owns a gadget: everyone has a mobile phone, everyone owns a laptop, and they are generating data from these every day. You can also think of the Internet of Things as a factor: nowadays we have smart, interconnected appliances that form a network of things, which is nothing but the Internet of Things, and these smart appliances also generate data when they communicate with each other. One prominent factor behind the rise of big data that comes to mind is social media. We have billions of people on social media, because we humans are social animals: we love to interact and share our thoughts and feelings, and social media websites provide just the platform we need, so we have been using them extensively every day.

If you look at the stats in front of your screen, on Facebook users generate almost 4 million likes every 60 seconds; on Twitter there are almost 300 thousand tweets every 60 seconds; on Reddit there are 18,000 votes cast; on Instagram there are more than 1 million likes; and on YouTube almost 300 hours of new video are uploaded every 60 seconds. That is data for every 60 seconds, so you can imagine the kind of data we deal with every day and how much we have accumulated over the years since social media websites started. That's a lot of data, and it has been rising exponentially, so let's see what Cisco has to say about this. You all know that Cisco is one of the biggest networking companies, and they have been monitoring the data traffic they handle over the years.
They publish these figures in a white paper every year, and from the stats they have provided we can see that by 2020 we'll be dealing with 30.6 exabytes of data. One exabyte is 10 raised to the power 18 bytes, so that's more zeros than you can easily picture. In 2015 we were dealing with only 3.7 exabytes, and in just five years we're going up to 30.6 exabytes; it may be even more in the coming years, because data has been rising exponentially.

Cisco has also mentioned three major reasons for this rise in data. The first is adapting to smarter mobile devices. Gone are the days when we used phones like the Nokia 1100, which could only make and receive calls and send a few lines of text; nowadays everyone uses a smartphone with many different apps, and each of those apps generates a lot of data. The next reason is advances in cellular networks: earlier we had 2G, then came 3G and 4G, and now we're looking forward to 5G. As we advance in cellular network technology it becomes feasible for us to communicate faster and better, and since, as I already told you, we love to share things, it has become very easy to send a message, a video or any kind of file to a friend who is countries apart, and it takes only milliseconds for that person to receive it; that ease of use is why we use it so extensively. The third reason is tiered pricing: the network companies now provide lots of data plans that your entire family can use, including unlimited and shared plans, which again makes heavy usage very convenient. The stats also say we gain 217 new mobile users every 60 seconds, so you could say that almost everyone in the world uses a mobile phone. We are dealing with a lot of data, and that is where the name big data comes in.

So now let us see what big data is. As the name goes, big and data, you have already understood that it is a large amount of data we are dealing with, but if you ask me, I see it as a problem statement that surrounds the inability of traditional systems to process it. When traditional systems were created, we never thought we would have to deal with this amount and this kind of data, so they are unable to store and process data that is being generated at such volume and speed; that is why big data is a problem. Since big data is a problem, IBM has suggested five V's to identify a big data problem, and those are in front of your screen.

The first one is volume. It implies that the amount of data the client is dealing with is so huge that it becomes increasingly difficult to store it in traditional systems, and that is when we should look for a different solution. The next V we'll talk about is variety. We already know that we are dealing with a huge volume of data, exabytes of data, but it is also coming from a wide variety of sources.
We're dealing with mp3 files, video files, images, JSON, and they are all of different kinds: mp3 and video files are unstructured data, JSON files are semi-structured, and there is some structured data as well. The major problem is that most of the data, almost 90% of it, is unstructured. So should we just dump all that unstructured data, or should we make use of it? Obviously we should make use of it. On Facebook we mostly share photos and videos, which are unstructured, and these are very important data because companies use them to make business decisions based on the insights they gain. This data gives companies an opportunity to profile their customers: on Facebook you go around liking different pages, and that is profiling, because now the company knows what kind of things you like and can approach you with advertising. When you're browsing your news feed, you'll see ads popping up on the right-hand side, and those ads are user specific; they know what you like because you have browsed different pages on Facebook, Google and many other websites. That is why this unstructured data, which makes up 90% of the data, is very, very important, and it is also a problem, because our traditional systems are incapable of processing unstructured data.

The next V that comes up is velocity. Let's take a web service to understand this case. Suppose you create a web service and provide it for clients to access; how many events can it handle at a point of time? You might say a thousand or two thousand, so on average there will be around two thousand live connections at any point of time; there is always a restriction on the number of live connections available. Suppose your system has a threshold of five hundred transactions at a point of time, and that is your upper limit. In the big data world you cannot live with numbers like that: you are talking about sensors and machines that are continuously sending you information, like a GPS continuously streaming location data, millions and billions of events per second in real time. You need extended capabilities to withstand the velocity at which data is getting dumped into your systems, so if you think velocity can be a challenge for your customer, then you again propose a big data solution, because this is again a big data problem.
Now the next V we'll talk about is value. If your data set cannot give you the information you need to gain insights and develop your business, then it's just garbage to you. It is very important that you have the right data and can extract the right information out of it; there might be unnecessary data lying around in your data set, and you have to be able to identify which data will give you the value you need to develop your business. Identifying the valuable data is again a problem, and hence again a big data problem. Finally we'll talk about veracity. Veracity is about the uncertainty of data: in simple words, you cannot expect the data to always be correct or reliable. You might get data with missing values, or you may have to work with data that is incorrect or does not always hold true. In other words, veracity means you have to build the system with the understanding that the data may not always be correct and up to the standards; it is up to you as an application developer to integrate the data, flush out the data that does not make sense, extract only the data that does, and use that for making decisions at the end. So these are the five V's that help you identify whether your data is a big data problem or not, and then you can work out an approach to solve it.

That was an introduction to big data. Now we'll understand the problems of big data, and how you should approach a solution, with a story you can relate to; I hope you'll find this part interesting. This is a very typical scenario. This is Bob, and he has opened a small restaurant in a city. He hired a waiter for taking orders, and this is the chef who cooks those orders and finally delivers them to the customers. The cook has access to a food shelf, which is where he gets all the ingredients to cook a particular dish. In this traditional scenario he is getting two orders per hour and he is able to cook two dishes per hour, so it's a happy situation: the customers are getting served, he has all the time he needs, and he has access to the food shelf. It's a happy day. Similarly, if we compare the same scenario with our traditional processing system: data is being generated at a very steady rate, and all of it is structured, which is very easy for our traditional system to process, so it's a happy day for the traditional processing system too.
Now let us talk about a different day. Bob decided to take online orders, and now they are receiving many more orders than expected: from two orders per hour, the orders have risen to ten orders per hour, and the cook has to cook ten dishes every hour. This is quite a bad situation for the cook, because he is not capable of cooking ten dishes an hour when before he was only doing two. Now consider the scenario of our traditional processing system too: a huge number and a huge variety of data is being generated at an alarming rate. You have already seen the stats of how much data is generated every 60 seconds, so the velocity is really high and much of it is unstructured, and our traditional processing system is not capable of handling that, so it's a bad day for our processing system too.

So what should the solution be? I'd ask you: what should Bob do right now in order to serve customers without delay? Alright, I'm getting some answers. Sebastian says that Bob should hire more cooks, and exactly, Sebastian, you are correct. The issue was too many orders per hour, so the solution is to hire multiple cooks, and that is exactly what Bob did: he hired four more cooks, so now he has five cooks, and all of them have access to the food shelf where they get their ingredients. Now there are multiple cooks cooking, and even with ten orders per hour, each cook takes maybe two orders an hour and people get served. But there are still issues, because there is only one food shelf. There might be situations where two cooks want the same ingredient at the same time and fight over it, or the other cooks have to wait until one cook has taken all the ingredients he needs, and by that time maybe something on the stove has already burned because he was waiting for the other cook to finish. So again it is a problem.

Now let us consider the same situation with our traditional processing system. We now have multiple processing units to process all the data, which should solve the problem, right? But again there is a problem, because all these processing units are accessing data from a single point, the data warehouse. Bringing data to the processing units generates a lot of network and input/output overhead, and there would be network congestion because of it. Sometimes one processing unit is downloading data from the data warehouse and the other units have to wait in a queue to access that data, and this completely fails when you want to perform near real-time processing. That is why this solution fails too. So what should the solution be? Can I get a few answers? Okay, Keshav says it should be distributed and parallel, and you are right, Keshav. Since the food shelf was becoming a bottleneck for Bob, the solution was a distributed and parallel approach, so let's see how Bob did that. As a solution, Bob divided an order into different tasks. Let us take the example of meat sauce: say a customer has come into Bob's restaurant and has ordered meat sauce.
So what happens in Bob's kitchen now is that each of the chefs has a different task. In order to prepare meat sauce, this chef over here only cooks meat, this chef over here only prepares sauce, and Bob has also hired a head chef to combine the meat and the sauce and finally serve the customer. So two cooks cook the meat and two cooks prepare the sauce, they are doing this in parallel at the same time, and finally the head chef merges the results and the order is completed. Now, if you remember, the food shelf was also a bottleneck, so what Bob did to solve this was to distribute the food shelves in such a way that each chef has access to his own shelf: this shelf holds all the ingredients this chef might need, and similarly there are three more shelves holding the same ingredients. Again, let's say one of the cooks falls sick. In that case we don't have to worry much, since we have another cook who can also cook meat, so we can handle this problem very easily. Similarly, say a food shelf breaks down and this cook has no access to ingredients; again we don't have to worry, since there are three more shelves, so at that time of disaster we have a backup and he can go ahead and use ingredients from any of the other shelves. So basically we have distributed and parallelized the whole process into smaller tasks, and now there is no problem: Bob's restaurant is able to serve customers happily.

Let me relate this situation to Hadoop. The setup where each of the chefs has his own food shelf is known in Hadoop terms as data locality: it means the data is locally available to the processing units. The part where the different tasks of cooking meat and sauce happen in parallel is known as map in Hadoop terms, and when the results are finally merged together by the head chef into the finished meat sauce, that is known as reduce. We'll be learning MapReduce in Hadoop later in this tutorial, so don't get confused with the terms if you're not able to follow everything right now; you'll be clear by the end of this tutorial, I promise you that. So now Bob is able to handle all ten online orders per hour, and even at times like Christmas or New Year, if Bob is getting more than ten orders per hour, the system he has developed is scalable: he can hire more chefs and more head chefs to serve more orders per hour, and scale up or scale down whenever he needs. This is the ultimate solution Bob had, and it is very effective indeed.

Bob has solved all his problems, but have we solved ours? Do we have a framework that can solve all the big data problems of storing and processing it? Well, the answer is yes: we have something called Apache Hadoop, and this is the framework to process big data, so let us go ahead and see Apache Hadoop in detail. Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion. Now you know there are two major problems in dealing with big data, and the first one is storage. To solve the storage problem of big data, we have HDFS.
Just as Bob solved the food shelf problem by distributing shelves among the chefs, Hadoop solves the storage of big data with HDFS, which stands for Hadoop Distributed File System. All the data we dump is distributed over different machines, and these interconnected machines over which our data is distributed are called a Hadoop cluster in Hadoop terms. And just as Bob divided the tasks among his chefs and made serving much quicker, in order to process big data we have something called MapReduce, the programming part of Hadoop. It allows parallel and distributed processing of the data lying across our Hadoop cluster: every machine in the cluster processes the data it has got, and this is known as map; finally the intermediate outputs are combined to produce the final output, and this is called reduce, hence MapReduce.

Now let us understand the Hadoop architecture, which is a master/slave architecture, and we'll understand it with a very simple scenario that I'm sure you'll all relate to closely, one that is usually found in every other company. We have a project manager, and this project manager handles a team of four people: John, James, Bob and Alice. Whatever project he gets from a client, he distributes across his team members and tracks how the work is going from time to time. Now let us say the project manager has received four projects from a client, projects A, B, C and D, and he has assigned them across the team: John has got project A, James has got B, Bob has got C, and Alice has got D. Everyone is working on a different project, the work is going fine, and he is quite sure he'll be able to meet the deadlines and deliver on time. But there is a problem: Bob applies for leave and tells the project manager that he is going away for a week or two, won't be coming to the office and can't do the work. Now it is a problem for the project manager, because in the end he is liable to the client for work that has not been completed; he has to make sure all the projects are delivered on time. So he thinks of a plan, because he's a very clever fellow. To tackle this problem, the project manager goes to John and says, hey John, how are you doing? John says, yeah, I'm doing great. The manager says, I heard you're doing excellent work on your project. John thinks something's fishy: why is he appreciating me so much today? Then the project manager says, John, since you're doing so well, why don't you take up project C as well? John replies, no, I'm fine with the project I've got, I already have a lot of work to do, I don't think I can take project C. Then the project manager says, no, no, you've got me wrong: you don't have to work on project C. You know Bob is already working on project C; you can keep it as your backup project, and you might never even have to work on it in the end, but you'll get credit for both projects, and I could refer you for a substantial hike. John thinks it's quite a good deal: he might not even have to do the extra work and he'll get a hike for it.
So that's why he agrees and takes up project C. Now the project manager doesn't have to worry about completing project C even if Bob is going out of town, and since he is a very clever fellow, in order to tackle future problems he goes to each of the members and tells them the same thing. Now he has a backup for every project, so if any member ever opts out of the team he has a backup, and this is how the project manager completes all his tasks on time and the client is satisfied. He also makes sure to keep his list updated so he knows who is carrying which backup project.

This is exactly what happens in Hadoop. We have a master node that supervises the different slave nodes: the master node keeps a track record of all the processing going on in the slave nodes, and in case of disaster, if any of them goes down, the master node has always got a backup. If we compare this whole office situation to our Hadoop cluster, this is what it looks like: this is the master node, the project manager in the case of our office, and these are the processing units where the work is actually carried out. This is exactly how Hadoop manages and processes big data using the master/slave architecture; we'll understand more about the master node and the slave nodes in detail later in this tutorial. Any doubts till now? Alright, so now we'll move ahead and take a look at the Hadoop core components, and we're going to look at HDFS first, which is the distributed file system in Hadoop.

First, let's look at the two components of HDFS, since we're already talking about master and slave nodes: the name node and the data node. Since this is a master/slave architecture, the master node is known as the name node and the slave nodes are known as data nodes. The name node maintains and manages all the different data nodes, just like our project manager manages a team, and just like you report to your manager about your work progress, the data nodes do the same thing by sending signals known as heartbeats; a heartbeat is just a signal telling the name node that the data node is alive and working fine. Coming to the data node, this is where your actual data is stored. Remember when we talked about storing data in a distributed fashion across different machines? This is exactly where your data is distributed, and it is stored in data blocks; the data node is responsible for managing your data across data blocks. The data nodes are the slave daemons and the name node is the master daemon. Here you can see another component, the secondary name node, and from the name you might be guessing that it is just a backup for the name node, to take over when the name node crashes, but that is not the purpose of the secondary name node at all; its purpose is entirely different, and I'll tell you what it is in a moment. I'm very sure you'll be intrigued to know how important the secondary name node is.
So now let me tell you about the secondary name node. We were talking about metadata, which is nothing but information about our data: it contains all the modifications that have taken place across the Hadoop cluster, that is, across our HDFS namespace, and this metadata is maintained by HDFS using two files, the fsimage and the edit log. Let me tell you what those are. The fsimage file contains all the modifications that have been made across your Hadoop cluster ever since the name node was started. Say the name node was started 20 days back; then my fsimage will contain the details of all the changes that happened in the last 20 days. Obviously you can imagine there will be a lot of data in this file, and that is why we store the fsimage on disk: you'll find the fsimage file on the local disk of your name node machine. Coming to the edit log, this file also contains metadata, that is, data about your modifications, but it contains only the most recent changes, say whatever modifications took place in the past one hour. This file is small and resides in the RAM of your name node machine.

The secondary name node performs a task known as checkpointing. What is checkpointing? It is the process of combining the edit log with the fsimage. How is it done? The secondary name node gets a copy of the edit log and the fsimage from the name node and merges them to produce a new fsimage. Why do we need a new fsimage? We need an updated fsimage in order to incorporate all the recent changes as well. And why do we need to do this regularly? If you kept maintaining all the modifications only in the edit log, which resides in RAM, the file would keep growing as you make more changes; it would end up taking a lot of space in your RAM and slow down the name node. Also, think about the time of failure: say your name node has failed and you want to set up a new name node. You have all the files needed to set it up, because the secondary name node holds the most recent, updated copy of the fsimage, with all the metadata about the data nodes your name node was managing, so your failure recovery time will be much shorter and you won't lose much data or time setting up a new name node. By default, checkpointing happens every hour. While a checkpoint is happening you might be making more changes, and those are stored in a new edit log; until the next checkpoint we keep maintaining this new edit log, which again contains all the recent changes since the last checkpoint, and at the next checkpoint we take all the modifications in this edit log and combine them with the last fsimage we had. This checkpointing keeps going on, and by default it takes place every one hour; if you want checkpointing to happen at shorter intervals you can do that, and if you want it after a longer time you can configure that as well. So we have studied the HDFS components, we have seen what the name node is and how it manages all the data nodes, and we have also seen the functions of the secondary name node.
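As a side note for those following along with code: the one-hour checkpoint interval mentioned above is driven by configuration properties (dfs.namenode.checkpoint.period and, in recent Hadoop 2.x/3.x releases, dfs.namenode.checkpoint.txns). The video does not show this, so take the following only as a minimal sketch that reads those properties with the standard Configuration API; to actually change the interval you would set the properties in hdfs-site.xml on the name node side, and you should verify the property names against your own distribution's hdfs-default.xml.

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml from the classpath, if present.
        Configuration conf = new Configuration();
        // Default is 3600 seconds, i.e. the "every one hour" mentioned above.
        long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        // A checkpoint can also be triggered earlier once enough edits pile up.
        long txnLimit = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);
        System.out.println("Checkpoint every " + periodSeconds + " s or every "
                + txnLimit + " uncheckpointed transactions, whichever comes first.");
    }
}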
So now let us see how all this data is actually stored in the data nodes. HDFS is a block-structured file system: each file is divided into blocks of a particular size, and by default that size is 128 MB. Let us understand how HDFS stores files in data blocks with an example. Suppose a client wants to store a file of 380 MB in the Hadoop distributed file system. HDFS will divide the file into three blocks, because 380 MB divided by 128 MB, the default block size, is approximately three. The first block will occupy 128 MB, the second block will also occupy 128 MB, and the third block will be the remaining size of the file, which is 124 MB. After my file has been divided into data blocks, these blocks are distributed across the data nodes present in my Hadoop cluster: here you can see that the first block of 128 MB is in data node 1, the next data block is in data node 2, and the final data block is in data node 3. Notice that all the blocks are the same size except for the last one; this helps Hadoop save HDFS space, as the final block uses only as much space as is needed to store the last part of the file. We have therefore saved 4 MB from being wasted in this scenario. It may seem very little, only 4 MB, so what's the big deal? But imagine you are working with tens of thousands of such files; think how much space you can save.

So this was all about data blocks and how HDFS stores them across different data nodes, and I suppose by now you have understood why we need a distributed file system. Let me tell you, we get three advantages when using a distributed file system, and I'll explain them with an example. Imagine I have a Hadoop cluster with four machines: one of them is the name node and the other three are data nodes, where the capacity of each data node is one terabyte. Now suppose I have to store a file of three terabytes. Since each of my data nodes has a capacity of one terabyte, the three-terabyte file will be distributed across my three data nodes, with one terabyte occupied in each. I don't have to worry about how it is stored; HDFS manages that, and it gives me the abstraction of a single computer with a capacity of three terabytes. That's the power of HDFS. The second benefit: now consider that instead of three terabytes I have to store a file of four terabytes, and my cluster capacity is only three terabytes. I'll simply add one more data node to my cluster to fit my requirement, and later on, when I need to store a file of huge size, I can go ahead and add as many machines to the cluster as I need. So this kind of distributed file system is highly scalable. The third benefit: consider that you have a single high-end computer with the processing power to process one terabyte of data in four seconds. When you distribute your file across machines of the same processing power, you are reading that file in parallel, so instead of one machine, if you have four data nodes in your cluster, it takes one fourth of the time it would take with a single computer: it will take you only one second.
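To make the block arithmetic from the 380 MB example concrete, here is a tiny stand-alone sketch; it is not Hadoop code, just the same division that determines how many blocks a file occupies when the block size is 128 MB.

public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 380;   // the example file from above
        long blockSizeMb = 128;  // default HDFS block size in Hadoop 2.x/3.x
        long fullBlocks = fileSizeMb / blockSizeMb;                // 2 full 128 MB blocks
        long lastBlockMb = fileSizeMb % blockSizeMb;               // 124 MB remainder
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0); // 3 blocks in total
        System.out.println(totalBlocks + " blocks; the last block occupies only "
                + lastBlockMb + " MB instead of a full " + blockSizeMb + " MB");
    }
}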
So basically, with the help of a distributed file system, we are able to distribute our large file across different machines, and we are also reducing the processing time by processing it in parallel, which saves a huge amount of time. These are the benefits of using HDFS.

Now let us see how Hadoop copes with data node failure. You know we are storing our data in data nodes, but what if a data node fails? Let us consider the same example: I have a file of 380 MB and three data blocks distributed across three data nodes in my Hadoop cluster. Say the data node that contains the last part of the file crashes. What do we do now? You have lost a part of your file, and you cannot process the file because a part of it is missing. So what do you think could be a solution? I'm getting an answer: Jason says we should have a backup. Yes, exactly. The logical approach would be to have multiple copies of the data, and that is how Hadoop solved it, by introducing something known as the replication factor. You all know what a replica is: a replica is nothing but a copy, and similarly all our data blocks have copies. In HDFS each data block has three copies across the cluster: you can see that this part of the file, the 124 MB data block, is present in data node 2, data node 3 and data node 4, and the same applies to the other data blocks as well. Every data block will be there in my Hadoop cluster three times, so even if one of my data nodes crashes and I lose all the data blocks inside it, I don't have to worry, because there are two more copies present in the other data nodes. We do this because in Hadoop we are dealing with commodity hardware, and it is very likely that commodity hardware will crash at some point; that's why we maintain three copies, so that even if two of them go down we still have one more. This is how HDFS provides fault tolerance.

I have a question from Neha. She is asking whether we have to go ahead and make the replicas of our data blocks ourselves. Well, no, Neha, you don't have to do that: whenever you copy any file into your Hadoop cluster, your file gets replicated by default, and by default it has a replication factor of three, meaning every data block will automatically be present three times across your Hadoop cluster. I hope, Neha, you've got your answer. Okay, she says yes; thank you, Neha, that was a very good question indeed. So we don't have to worry now if a data node crashes; we have multiple copies. You know the proverb, never put all your eggs in one basket; it is very true in this scenario. We are not putting all our eggs in the same basket, we're putting them in three different baskets, so even if one basket falls and all its eggs crack open, we still have enough eggs for our omelette. I hope you have all understood how HDFS provides fault tolerance; if you have any questions you can ask me now, or whenever questions come up you can ask me at the end of this session.
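If you ever do want a replication factor other than the default three, it is just configuration. Here is a minimal sketch, assuming the standard dfs.replication property and the FileSystem API, with a file path chosen purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // New files written by this client will ask for 2 replicas instead of 3.
        conf.set("dfs.replication", "2");
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of a file that is already in HDFS
        // (hypothetical path, matching the word count example used later).
        fs.setReplication(new Path("/wordcount/input/test.txt"), (short) 3);
        fs.close();
    }
}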
So now let us understand what happens behind the scenes when you write a file into HDFS. When you want to write a file across your Hadoop cluster you go through three steps, and the first step is the pipeline setup. Let us understand how the pipeline is set up with an example. Say I have a text file called example.txt and I have divided it into two data blocks, block A and block B. Let's talk in terms of block A first and see how block A is written across the data nodes in HDFS. Here is the client. The client first requests the name node, saying, I have a block that I need to copy. The name node says, okay, I'll give you the IP addresses of three data nodes; you can copy your block to these three data nodes, and you know you have to copy the block three times because the replication factor is three. The name node gives the IP addresses of data nodes 1, 4 and 6 to the client.

Now that the client has got the IP addresses of the three data nodes where block A will be copied, it first goes and checks with data node 1: hey, I want to copy a block onto your data node, are you ready? And can you go and ask data nodes 4 and 6 whether they're ready as well? Data node 1 says, yes, I'm ready, and I'll go ahead and ask 4 and 6. So data node 1 goes to data node 4 and says, hey, the client is asking you to copy a block, are you ready? 4 says, yes, I'm ready, and 1 says, okay, go ahead and ask 6 whether he is ready too. So 4 asks 6, and 6 is also ready. This is how the whole pipeline is set up: block A will be copied first to data node 1, then data node 4, and then data node 6. Now, say some of those data nodes are not available, that is, the IP addresses the name node gave belong to data nodes that are not functioning. In that case, when the client doesn't receive a confirmation, it goes back to the name node and says, the data nodes whose IP addresses you gave me are not working, could you give me others? The name node then checks which data nodes are available at that time and gives the client fresh IP addresses. So now your pipeline is set up: the block will be copied first onto data node 1, then data node 4, then data node 6.

Now comes the second step, where the actual writing takes place. Since all the data nodes are ready to copy the block, the client contacts data node 1 first, and data node 1 copies block A. The client gives data node 1 the responsibility of copying the block to the rest of the pipeline, data node 4 and data node 6. So data node 1 contacts data node 4 and says, copy block A onto yourself and ask data node 6 to do the same; data node 4 then copies block A and passes the message on to data node 6, and data node 6 copies the block as well. Now you have three copies of the block, just as required; this is how the writing takes place.

After that, the next step is a series of acknowledgements. We have a pipeline and we have written our block onto the data nodes we wanted, and now the acknowledgement takes place in the reverse order of the writing.
At first, data node 6 gives an acknowledgement to data node 4 that it has copied block A successfully; then data node 4 receives that acknowledgement and passes it on to data node 1, saying, I have copied block A onto myself and so has data node 6. All these acknowledgements are passed to data node 1, and data node 1 finally gives an acknowledgement to the client node that all three copies of the block have been written successfully. After that the client sends a message to the name node that the write has been successful and the block has been copied to data nodes 1, 4 and 6, and the name node receives that message and updates its metadata, recording which blocks are stored on which data nodes. So this is how the write mechanism takes place: first the pipeline setup, then the actual writing, and then the acknowledgements.

Now, we only talked about a single block. As I told you, my example.txt file was divided into two blocks, block A and block B. The write mechanism for block B is exactly the same; it's just that when the client requests to copy block B it might get the IP addresses of different data nodes, for example block B is copied to data nodes 3, 7 and 9 while block A was copied to 1, 4 and 6. And let me tell you that the writing of block A and block B happens at the same time. The write mechanism takes place in three steps, and the actual writing within a pipeline happens sequentially, meaning the block is first copied to the first data node, then the second, then the third; but the different blocks are copied at the same time. So steps 1a and 1b happen at the same time, and steps 2a and 2b happen at the same time: while block A is being copied onto data node 1, block B is being copied onto data node 3, and similarly the other steps happen in parallel. However many blocks your file contains, all the blocks are copied at the same time, each through its own sequence of steps, onto your data nodes. This is how the write mechanism takes place.

Now let us see the story behind reading a file from the data nodes in your HDFS, and let me tell you that reading is much simpler than writing. Say my client now wants to read the same file that was copied across the data nodes: you know that block A was copied onto data nodes 1, 4 and 6, and block B onto data nodes 3, 7 and 9. My client again requests the name node, saying, I want to read this particular file, and my name node gives the IP addresses of the data nodes where all the data blocks of that file are located. The client node receives those IP addresses, contacts the data nodes, and all the data blocks, block A and block B, are fetched simultaneously and then read by the client. So this is how the entire read mechanism takes place.
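From a client program's point of view, all of that pipeline, replication and block-fetching machinery is hidden behind the FileSystem API: you just open streams. Here is a minimal sketch of writing and then reading a file, assuming a default configuration is available on the classpath; the path and the text are made up for illustration, not taken from the video.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/wordcount/input/example.txt");   // hypothetical path

        // Write: HDFS splits this into blocks and replicates them behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("deer beer river\ncar car river\ndeer car beer\n");
        }

        // Read: the client fetches the blocks from whichever data nodes hold them.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}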
So guys, this is all about HDFS. We have seen how a file is copied across a Hadoop cluster in a distributed fashion, the advantages of using a distributed file system, what the name node and the data nodes are, how your data is stored and how your files are divided into data blocks and spread across the Hadoop cluster. We have also seen how Hadoop deals with a data node failure and how it introduced the replication factor as a backup for your file, and finally how the read and write mechanisms take place. I hope you have all understood the Hadoop distributed file system; if you have any questions you can ask me.

Now let us move on and check what MapReduce is. You remember the example we covered at the start of the session, the cook example, where different chefs cook different parts of a dish at the same time and finally a head chef assembles the dish and gives the desired output. That is what we'll be learning now, with more relevant examples so that you can understand MapReduce better. So let us understand MapReduce with another story, which I'm very sure you'll again find amusing. Consider a situation where we have a professor and four students in a class, and they are reading a Julius Caesar book. The professor wants to know how many times the word Julius occurs in the book, so he asks his students to read the entire book and tell him how many times the word Julius appears. All of the students have a copy of the book and they start counting the word Julius, and it takes them four hours. The first student answers that he counted it 45 times; the second answers 46, maybe he made a calculation mistake or maybe he is correct, we don't know because we don't have the book with us; the third student also replies 45, and the fourth replies 45 as well. The professor decides that three people can't be wrong, he has to go with the majority, and the majority is usually correct, so he goes with the answer that the word Julius appears 45 times in the entire book, and it took a time of four hours.

Then the professor thought it was taking a lot of time, so he applied a different method. Let us assume the book has four chapters. He distributed one chapter to each of the students: he asked student one, you take chapter one and tell me how many times Julius occurs in chapter one, and similarly he assigned chapter two to the second student, chapter three to the third, and chapter four to the fourth. Since each of them is now assigned only one chapter instead of the entire book, they are able to count the word Julius in their chapter in just one hour, and they are doing it at the same time: chapter one is being counted, chapter two is being counted, chapter three is being counted, and chapter four is being counted, all at once. Then everyone gave their respective answer: this student went up to the professor and said, I found the word Julius 12 times in chapter one; the second student said, I found it 14 times in chapter two; for chapter three he says he found it 8 times, and for chapter four he says he found it 11 times. The professor received the answers from all four students and finally added them up to get the answer of 45, and let's assume it took him two minutes to add them up; these are very small numbers, so it might not even take two minutes, but we are just assuming it. So instead of four hours, we are now able to find the correct answer in just one hour and two minutes, which is a very effective solution.
The part where the book was distributed and each of the students was working on a part of it is known as map, and the part where the professor sums up all the numbers is known as reduce, and together this is exactly MapReduce in Hadoop terms. The processing of a single file is divided into parts, those parts are processed simultaneously, and finally the reducer aggregates all the intermediate results and gives you the final output. It is a very effective solution because all the tasks happen in parallel and in less time. I hope that with this example you have understood the essence of MapReduce.

So now let us go ahead and understand MapReduce in detail. MapReduce is the programming unit of Hadoop: it is a framework that gives us the advantage of processing large data sets in a parallel and distributed way. MapReduce consists of two distinct tasks: the first is known as map and the second as reduce, and as the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed, because the reducer needs the intermediate results produced by the map phase in order to combine them and give you the final output. So the first part is the map job, where a block of data is read and processed to produce key-value pairs as intermediate output; then the output of the mappers, which is nothing but key-value pairs, is fed into the reducer; the reducer receives the key-value pairs from multiple map jobs, aggregates the intermediate results, and finally gives you the output, again in the form of key-value pairs. This is how MapReduce works, and we'll understand it in detail now.

Let us look at an example, the word count program. Say we have a small paragraph of text, deer beer river, car car river, deer car beer, and we want to find out how many times each word appears in this paragraph. This is how MapReduce handles it. Since we divide the entire task into parts, here we split the input into three, because there are three sentences: the first is deer beer river, the second is car car river, and the third is deer car beer. The mapping then takes place on each of these sentences, and since I told you a map job is where data is read and key-value pairs are formed, we get keys, which are the individual words, with a value assigned to each, which is nothing but 1. The same happens for the other two sentences as well. So first we divide the input into three splits, as you can see in the figure, and distribute this work among the map tasks; then, in the mapping, we tokenize the words in each mapper and give each a hard-coded value of 1. The reason for the hard-coded value 1 is that every word, in itself, occurs once. So a list of key-value pairs is created, where the key is the individual word and the value is the hard-coded 1.
After the mapper, sorting and shuffling happen, so that all the values belonging to the same key are sent to the corresponding reducer. After the sorting and shuffling, each reducer has a unique key and a list of values corresponding to that key: for example, we got beer two times, so for the key beer the list of values is two 1's. The reducer then counts the values present in that list, so here 1 plus 1 is 2; car was found three times, so there are three 1 values and car will be 3; similarly deer is 2 and river is 2. Finally all the outputs are put together as key-value pairs: the reducer has combined all the intermediate results, and we get a final output showing that beer was found two times in our input, car three times, deer two times and river two times. This is how MapReduce works in Hadoop, and I hope you have understood this word count example, because we'll also go ahead and run this program.

Let me tell you the major parts of a MapReduce program. First you have to write the mapper code: how the mapping will happen, how all the distributed tasks will run at the same time and produce key-value pairs. Then comes the reducer code: how all the intermediate results, the key-value pairs we got from each of the map functions, will be merged. And finally there is the driver code, where you specify all the job configurations, like the job name, the input and output paths, and so on. These are the three parts of a MapReduce program in Hadoop.

Now let's talk about the mapper code. Basically this is a Java program, so for those of you who know Java and have been working with it, this is a very simple program, but let me go through the logic of the whole thing. In our mapper code we have a class called Map which extends the Mapper class, and we mention the data types of our input and output key-value pairs with respect to the mapper. Let me tell you that the mapper accepts its input as a key-value pair and gives its output as a key-value pair as well. Since our input is just plain text, a paragraph, and we have not specified any particular key or value for it, the framework takes the key as the byte offset and the value as each sentence, that is, each tuple of the paragraph we are feeding in. The data type of the key, the byte offset, is LongWritable, since it is just a number. How does the byte offset work? Look at the first tuple of our input: it has three words of four, four and five characters plus two blank spaces, which makes 15 characters, and each character occupies one byte, so together with the line ending the byte offset of the next tuple becomes 16. So that is the key, of type LongWritable, and the input value is each sentence, which is of type Text. The mapper produces its output again as key-value pairs: the output key is each token, that is, each individual word of the tuple, which is again Text, and the output value is the hard-coded 1 assigned to each token, which is an integer, so the data type of our mapper output value is IntWritable.

For the map method, the key is the byte offset and the value is the tuple; we have three tuples, and the method runs on each tuple of our input. The map method takes the key, the value and a context as arguments: the byte offset is our key, the tuple is our value, and the context allows us to write our map output. What we do is store each tuple in a variable called line, then tokenize it, which simply means breaking the tuple into tokens, the individual words present in it, and then assign the hard-coded value 1. Each token becomes a map output key along with the hard-coded value 1, and we use 1 because each word, by itself, occurs at least once in that tuple. So the output key-value pairs look like each token with the hard-coded value 1: if you remember the example we just went through, the output for the first tuple would be deer 1, beer 1 and river 1. So this is the entire mapper code.
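The video only shows this code on screen, so here is a sketch of what the mapper described above typically looks like, the standard Hadoop word count mapper; take the exact class and variable names as my own choices rather than the ones in the video.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset (LongWritable), input value: one line of text (Text).
// Output key: each word (Text), output value: the hard-coded 1 (IntWritable).
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();                 // store the tuple in "line"
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {             // break the line into tokens
            word.set(tokenizer.nextToken());
            context.write(word, one);                   // emit (word, 1)
        }
    }
}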
So now let us take a look at the reducer code. Here too we have a class, called Reduce, which extends the Reducer class, and you remember that reduce takes place only after the shuffling and sorting, so the input here is nothing but the output of the shuffle and sort, which looks like a word along with the list of counts it received from the mappers. If you look at the types, the key here is nothing but Text, and the value is a list of counts of type IntWritable; the reducer finally produces an output with the word and the number of times it has occurred, which is again a word and a number, of types Text and IntWritable, something like what you can see over here. We have a method called reduce, where the input key is Text and the input value is that list of counts. Since it is a list, we just run a loop and sum up the 1's for each token: for beer we got two 1's, so we sum up those two 1's and get the result. The output key is Text, a particular unique word, and the value is the sum of all the 1's in that list: here 1 plus 1 is 2, so the final output is beer 2, and similarly for car the input is 1, 1, 1, so we get car 3. So this is the whole reducer code.

Remember I told you there is one more section in the MapReduce code: the third part is the driver code. This code contains all the configuration details of our MapReduce job, for example the name of the job and the data types of the input and output of the mapper and reducer. You can see the job name is my word count program, and here I have mentioned the name of my class, then the mapper class, which is Map, the reducer class, which is Reduce, and the output key class, which is Text. We also set the output value class, and since in this example we are dealing with the frequency of words, which is nothing but a number, we mention IntWritable. Again, you can set the input format class, which just specifies how a mapper will process a particular input, that is, what the unit of work for each map will be; in our case the input text is processed line by line, so we can specify that as well. Similarly, we can specify the output format class, that is, how the output will be written to our file, which is also line by line. We can also set the input path, the directory from which the job will fetch the input file, and the output path, the directory where the output will be written to. So this is exactly what a driver code contains; it is nothing but the configuration details of your entire MapReduce job.
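Again as a sketch of what was just described, here are a reducer and a driver in the same style, assuming the Map class from the earlier sketch sits in the same package; the class names, job name and argument order follow the explanation above, but treat the details as illustrative rather than the exact code shown in the video.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Reducer: input (word, [1, 1, ...]) from shuffle and sort, output (word, total count).
class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {   // loop over the 1's and add them up
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Driver: the job configuration (name, classes, formats, input/output paths).
public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);    // read the input line by line
        job.setOutputFormatClass(TextOutputFormat.class);  // write the output line by line
        FileInputFormat.addInputPath(job, new Path(args[0]));   // 0th argument: input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // 1st argument: output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}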
Now remember, I told you there is one more section in the MapReduce code, and that third part is the driver code. The driver contains all the configuration details of our MapReduce job, for example the job name and the data types of the input and output of the mapper and reducer. You can see that my job name is my word count program, and I have mentioned the name of my class, the mapper class, which is Map, the reducer class, which is Reduce, and the output key class, which is Text. We also set the output value class, and since in this example we are dealing with the frequency of words, which are just numbers, we have mentioned IntWritable. We can also set the input format class, which specifies how a mapper will process the input, that is, what the unit of work for each map will be; in our case the whole input text is processed line by line, so we specify that as well. Similarly we can specify the output format class, that is, how the output will be written to the file, which is also line by line. And we can set the input path, the directory from which Hadoop will fetch our input file, and the output path, the directory where the output will be written. So that is exactly what a driver code contains: just the configuration details of your entire MapReduce job.
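Putting that driver description together, here is a minimal sketch of what such a driver (the main method) typically looks like; the class name, package and argument handling are assumptions for this sketch rather than a copy of the exact code shown in the video.

package in.edureka.mapreduce;   // illustrative package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "My Word Count Program");   // job name
        job.setJarByClass(WordCount.class);

        job.setMapperClass(Map.class);                 // the mapper class from the sketch above
        job.setReducerClass(Reduce.class);             // the reducer class from the sketch above
        job.setOutputKeyClass(Text.class);             // output key type
        job.setOutputValueClass(IntWritable.class);    // output value type (word frequency)

        job.setInputFormatClass(TextInputFormat.class);    // read the input line by line
        job.setOutputFormatClass(TextOutputFormat.class);  // write the output line by line

        FileInputFormat.addInputPath(job, new Path(args[0]));    // 0th argument: input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // 1st argument: output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}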
I hope you have all understood this program, so let's go ahead and execute it. This is my VM where I have set up HDFS, so let's run the MapReduce program practically. Let me open my IDE first; I am using Eclipse. This is the Java program I just showed you: here is my mapper code, here is my reducer code, and this is my driver code that I explained in detail. As I told you earlier, the starting point is the main method, and that is where my driver code resides. You can see that we have assigned the zeroth argument for the input path and the first argument for the output path. My class name here is WordCount, and this is the package where my class resides, in.edureka.mapreduce. I have also imported the required Hadoop jars for this program, and I have exported the whole program along with all its Hadoop dependencies as a word count jar, which is the jar file you can see over here. So let's go ahead and run it; for that I'll open up my terminal.

Now let's create a directory to hold my input and output: first I'll create one directory and inside that two more directories for input and output. For that you use the command hadoop fs -mkdir (mkdir stands for make directory), and let me call the directory wordcount. Then let's create the subdirectories for input and output, so I'll add input over here and similarly create the output directory. Now that my directories are created, I have to put the data set, the file we are dealing with, into the input directory so that Hadoop can fetch it from there and run the code. Let me show you where my file is: it is here in the home directory, and it is the same file we used in the example, with deer, river, car and so on; it's a simple paragraph, and we're going to perform the word count program on this text file, which is called test.txt. Let me clear the screen. We're done with making our directories, so the next step is to move this text file into our HDFS directory. For that we use the command hadoop fs -put, the name of our file, test.txt, and our HDFS directory, which is wordcount, and we want it in the input subdirectory. That moves it in.

Now we have to run the jar file in order to perform MapReduce on test.txt, and for that we use the command hadoop jar with the name of my jar, the word count jar, plus the name of the package, which as you remember from the code is in.edureka.mapreduce, and the name of the class where my main method is, so that the execution of the MapReduce program starts from there; the class name is WordCount. Press Enter, and it is throwing an exception, because if you remember, in our driver code we said that the input directory is the zeroth argument and the output directory is the first argument, but we haven't passed them anywhere, so we have to go ahead and mention them so that Hadoop can fetch the file from our input directory and store the result in our output directory. So now we mention the input and output directories: my input is wordcount/input and my output goes to wordcount/output, and now let's run it. You can see the MapReduce execution going on: it has read some bytes and written some bytes. Let's check the output. To list my output file I use the command hadoop fs -ls with my output directory, and you can see the output file over here. Let's see what Hadoop has written to this output file, the MapReduce result, and for that I'll use the cat command: hadoop fs -cat with my output directory and the part file. There it is: it counted all the words and has given the final result, Bear four times, Car three, Deer two and River three. This is how Hadoop executes MapReduce and how you can run different MapReduce programs on your system; this was just one simple example, and you can go ahead and run other programs as well. I hope you have all understood this, so let's move on to the next topic.

Now let us take a look at the YARN components. YARN stands for Yet Another Resource Negotiator, and it is essentially MapReduce version 2. The components are the Resource Manager, the Node Manager, the App Master and the Container. The Resource Manager is the master node on the processing side: it receives processing requests, like MapReduce jobs, passes them on to the Node Managers, and monitors whether the MapReduce job is executing correctly. The Node Manager is installed on every data node, so you can think of a Node Manager and a DataNode as living on the same machine, and it is responsible for the App Master and the Container. The Containers are a combination of CPU and RAM, and this is where the entire processing, the MapReduce task, takes place. The App Master is assigned whenever the Resource Manager receives a request for a MapReduce job; only then is it launched, and it monitors whether the MapReduce job is going fine, reports back, and negotiates with the Resource Manager for the resources needed to perform that particular job. So this is again a master-slave architecture, where the Resource Manager is the master and the Node Manager is the slave, responsible for looking after the App Master and the Container.
So this is YARN. Now let us walk through the entire MapReduce job workflow. The client node submits a MapReduce job to the Resource Manager, which as you know is the master node, so this is where a job is submitted. The Resource Manager replies to the client node with an application ID, then contacts the Node Managers and asks them to start containers. The Node Manager is responsible for launching an App Master for each application; the App Master negotiates for containers, that is, the data node environment where the process executes, and then it executes the specific application and monitors its progress. The App Masters are daemons which reside on data nodes and communicate with containers for the execution of tasks on each data node. The App Master receives all the resources it needs from the Resource Manager in order to complete the job and starts a container; when the container is launched, a YarnChild process performs the actual MapReduce work, and finally we get the output. That is how the entire MapReduce job workflow takes place.

Now let us understand what happens behind the scenes when a MapReduce job is running. This is our input block, and its contents are read by the map tasks. Each map task has a circular memory buffer that it writes its output to; the buffer is 100 MB by default, and its size can be tuned by changing the mapreduce.task.io.sort.mb property. When the contents of the buffer reach a certain threshold, by default 0.80, that is, when it fills up to 80%, a background thread starts to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete. Before spilling the contents to disk, the thread first divides the data into partitions corresponding to the reducers they will ultimately be sent to, and within each partition the background thread performs an in-memory sort by key. Each time the memory buffer reaches the spill threshold a new spill file is created, so after the map task has written its last output record there can be several spill files; before the task finishes, the spill files are merged into a single partitioned and sorted output file. The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams, or spill files, to merge at once, and the default is 10. The same happens in the other map tasks, and finally all the outputs from the different maps are fetched and sent to the reducer for aggregation; you can see in this image that the intermediate results from the different maps are merged together and sent to the reducer in order to produce the final result. That is how MapReduce works internally. I hope you have all understood this; any questions? All right.
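If you ever want to tune those map-side spill settings from your own driver, they can be overridden through the job Configuration. The following is a minimal sketch under the assumption that you are on a Hadoop 2.x-style configuration; the class name is hypothetical and the values shown are arbitrary examples, not recommendations.

package in.edureka.mapreduce;   // illustrative package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedWordCountDriver {   // hypothetical variant of the driver sketched earlier
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Shuffle/spill settings discussed above (example values only):
        conf.set("mapreduce.task.io.sort.mb", "200");          // map-side sort buffer, default 100 MB
        conf.set("mapreduce.map.sort.spill.percent", "0.80");  // spill threshold, default 80%
        conf.set("mapreduce.task.io.sort.factor", "10");       // spill files merged at once, default 10
        Job job = Job.getInstance(conf, "My Word Count Program");
        // ...the remaining job setup would be identical to the driver sketch shown earlier
    }
}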
All right, so we'll move on and take a look at the YARN architecture. We have already gone through the components in YARN, so we know there is a Resource Manager, which is the master, and then we have slave nodes, with a Node Manager present on every slave node, responsible for the App Master and the Container. We have several Node Managers here, and what a Node Manager does is send the node status, that is, how each node is performing its part of a MapReduce job, as a report to the Resource Manager. When the Resource Manager receives a job request, a MapReduce job from a client, it asks a Node Manager to launch an App Master. There is only one App Master per application: it is launched only when a MapReduce job arrives from the client and it is terminated as soon as that job is completed. The App Master is responsible for collecting from the Resource Manager all the resources needed to perform the MapReduce job; it asks for those resources and the Resource Manager provides them through the App Master. Finally, the App Master is also responsible for launching a container, which is where the actual MapReduce processing takes place. That is the entire YARN architecture; it is fairly simple, so I hope you have understood it.

Now let us look at the Hadoop architecture by combining the two concepts, the Hadoop Distributed File System and YARN. If you see HDFS and YARN together, we have two master nodes: the master node in the case of HDFS is the NameNode, and in YARN it is the Resource Manager. HDFS is responsible only for storing our big data. We also have the Secondary NameNode, which is responsible for checkpointing, and you already know that a checkpoint is the process of combining the FsImage with the edit log. For actually storing the data we have the DataNodes, which are the worker nodes, and in the case of YARN our worker nodes are the Node Managers, which are responsible for processing the data, that is, for the MapReduce jobs. You can also see that a DataNode and a Node Manager basically reside on the same machine. So this is HDFS and YARN together; you now know how data is stored in Hadoop and how it is processed in Hadoop.

Now let us take a look at what a Hadoop cluster actually looks like. We have different racks that contain different nodes, master and slave nodes together, and all these machines are interconnected and connected to a switch within a rack; in this particular rack we have the master node, the NameNode, the Secondary NameNode and different slave nodes. We can also combine small clusters together in order to obtain one big Hadoop cluster. This is a very simple diagram of what a Hadoop cluster looks like. Now let us see the different modes in which you can launch a Hadoop cluster. We'll start from the bottom, with the multi node cluster mode: the previous image I just showed you is a multi node cluster, with name nodes, which are master nodes, and worker nodes on different machines. Then we have the pseudo distributed mode, which means all the Hadoop daemons, the master daemons and the slave daemons, run on a single local machine. And then we have the standalone or local mode.
In standalone or local mode there are no daemons and everything runs on a single machine. This is only suitable when you just want to try Hadoop out and see how it works, but it completely goes against the idea of a distributed file system, because nothing is distributed at all when you have only a single machine. In pseudo distributed mode the difference is that you have virtualization: even though the hardware is the same, you can still have logical separation. This is also not advisable for production, since if that one machine goes down your entire Hadoop setup is lost, but you can set up your Hadoop cluster in pseudo distributed mode when you want to learn Hadoop, see how the files get distributed and get first-hand experience by logically partitioning a single machine. When you talk about production, you should always go with the multi node cluster mode: you should distribute the tasks, because that is exactly how you get the benefits of big data. Unless the tasks are distributed and performed in parallel by different machines, with a backup plan, backup storage and a backup node for processing when a single machine goes down, you won't get the real benefits of using Hadoop. That is why, for production purposes, you should always go with a multi node cluster. So this was all about Hadoop clusters.

Now let us go ahead and see the Hadoop ecosystem. The Hadoop ecosystem is a set of tools which you can use for performing big data analytics. Let's start with Flume and Sqoop, which are used for ingesting data into HDFS. I already told you that data is being generated at a very high velocity, so in order to cope with that velocity we use tools like Flume and Sqoop to ingest the data into our storage and processing system; they act like a funnel, holding the data for some time and then ingesting it accordingly. Flume is used to ingest unstructured and semi-structured data, which is mostly social media data, and Sqoop is used to ingest structured data, for example tables from relational databases. You already know what HDFS is, the distributed file system used for storing big data, and we have also discussed YARN, Yet Another Resource Negotiator, which is meant for processing big data. Apart from that we have many other tools in the Hadoop ecosystem. We have Hive, which is used for analysis; it was developed by Facebook and it uses Hive Query Language, which is very similar to SQL, so when Facebook developed Hive and wanted to start using it, they didn't have to hire new people, because they could use the people who were already experts in SQL. We have another tool for analytics, Pig. Pig is really powerful: one Pig command is almost equal to 20 lines of MapReduce code, and when you run that one-line Pig command, the compiler implicitly converts it into MapReduce code, but you only have to write a single Pig command for it to perform analytics on your data. Then we have Spark over here, which is used for near real-time processing.
For machine learning we have two more tools, Spark MLlib and Mahout. Then we have tools like ZooKeeper and Ambari, which are used for management and coordination: Apache Ambari is a tool for provisioning, managing and monitoring Apache Hadoop clusters, and Oozie is a workflow scheduler system for managing Apache Hadoop jobs, which is very scalable, reliable and extensible. Then there is Apache Storm, which is used for real-time computation; it is free and open source, and with Storm it is very easy to reliably process unbounded streams of data. We also have Kafka, which handles real-time data feeds, and Solr and Lucene, which are used for searching and indexing. These are the tools in the Hadoop ecosystem, and according to your needs you select the right tools and come up with the best possible solution; you don't have to use all of them at the same time. So this was the Hadoop ecosystem. Any questions or doubts? All right.

Now let us take a look at a use case to understand how we can use Hadoop for big data analytics in real life, and we'll understand it by analyzing an Olympic data set. Let's see what we're going to do with this data set and how it looks. We have an Olympic data set, and we're going to use a Hadoop tool called Pig to make some analyses on it. Let me tell you a little bit about Pig before going ahead with the use case. Pig is a very powerful and very popular tool that is widely used for big data analytics, and with Pig you can write complex data transformations without knowing Java. You saw the program we wrote earlier: it was a fairly small MapReduce program, yet it had almost 70 to 80 lines of Java code, and if you're not good at Java it might be a little hard for you. With Pig you don't have to worry, because Pig uses its own language, known as Pig Latin, which is very similar to SQL, and it also has various built-in operators for joining, filtering and sorting large data sets. A very interesting fact: ten lines of Pig code is roughly equal to 200 lines of MapReduce code, which is why Pig is so popular; it is very easy to learn and it makes dealing with large data sets very easy. Now we have the Olympic data set. It is fairly small, but it works well as an example, so let me tell you what we're going to do with it. These are the analyses we're going to make: first we're going to find the list of the top ten countries that have won the highest number of medals, then we're going to see the total number of gold medals won by each country, and we'll also find out which countries have won the most medals in one particular sport, swimming. Now let us take a look at the data set itself. This is a brief description of it: the first field is athlete, which contains the name of the athlete; then we have the age of the athlete, the country the athlete belongs to, the year of the Olympics in which the athlete played, the closing date, which is the date when the closing ceremony was held for that Olympic year, the sport the athlete is associated with, the number of gold medals won by him or her, the number of silver medals, the number of bronze medals, and the total number of medals won by that athlete.
This is what our data set looks like. Here is the athlete field containing the names of the athletes, like Michael Phelps and Natalie Coughlin, then the age of the athlete, the country, United States, the year, 2008, the closing ceremony date, the sport, swimming, gold medals eight, silver medals zero, bronze medals zero, and total medals eight. So that is how our data set looks, and we're going to perform some operations on it to draw some insights using Pig, so let me show you how to do that. This is my terminal, where I have my Hadoop setup, and we're going to use Pig on it. I have already loaded my data set into HDFS; let me show you where it lies with hadoop fs -ls. These are my input and output directories, so let's go ahead and use Pig to make this analysis; all my results will be stored over here, and I'll show them to you once we have performed all the operations. The first thing we're doing is finding the list of the top ten countries with the highest number of medals, so let me open Pig; this is the shell for Pig. The first thing we need to do is load the data set into Pig. For that I'm going to use a variable and store the data set in it with the LOAD command: you mention the directory, olympic/input, and the name of the data set, olympics_data, which is a CSV file, and after that you write USING PigStorage with a delimiter, so the statement looks like olympic = LOAD 'olympic/input/olympics_data.csv' USING PigStorage('\t');. Why a tab delimiter? Because if you remember, in our data set all the fields are separated by tabs, and that's why we have used backslash t as our delimiter here. Also make sure that you end each line of Pig code with a semicolon, just like you do in SQL. Press Enter, and now let us check this variable, olympic; for that you use the command DUMP followed by the name of the variable. My data set has been loaded, and here we have all the fields: the name of each player, the age, the country they belong to, the year of the Olympics, the closing ceremony date, the sport each athlete is associated with, and the number of gold, silver, bronze and total medals. So my entire data set has been loaded into the variable olympic. Now remember what we're going to do: find the list of the top ten countries with the highest medals, so we don't need all the fields, only the country name and the total medals. For that I'll write one more line, but first let me clear the screen. I'm going to use another variable, country_final, and write country_final = FOREACH olympic GENERATE $2 AS country, $9 AS total_medals;. The numbers you see, $2 and $9, are indexes, so let me go back to the data set and show you why I have mentioned 2 and 9 here. The index of the fields starts from zero: athlete is at index 0, age is at 1, country is at 2 and total medals is at 9, and since we only need the country and the total medals, we mention the indexes of just those two fields. Now let us execute this and go check the variable.
This is another intermediate result: you can see that all the countries are present along with their total medals values, so we have Ukraine here with one and another entry with two. What we want now is to group all the identical countries together, so I'll use another variable, grouped, and execute grouped = GROUP country_final BY country;. Now let's check grouped: all the same countries are grouped together, Trinidad and Tobago here, Serbia and Montenegro, Czech Republic and so on. This result is also intermediate; if you remember, in the previous MapReduce program we got a similar value, and then we counted it to produce the final result, and that is exactly what we're going to do now. Let me also tell you that every piece of Pig code you run gets implicitly translated into MapReduce code, so whatever is happening here is the same thing we did in our earlier program. Now we'll count them: let me use another variable, final_result, and the command FOREACH grouped GENERATE group together with an inbuilt Pig function called COUNT, applied to country_final.total_medals, as f_count. Let's check final_result, and there it is: South Korea has 274 total medals, North Korea has 21, Venezuela has four. But as you can see, this is not in sorted order, and we want the top ten, so let's sort it so that the highest medal winners come out on top. Let me clear the screen; to sort it I'm going to store the sorted result in a variable called sort, ordering final_result by f_count in descending order. Let's go check sort: now we have all the countries in sorted order, and if you scroll up you can see that the United States has the highest number of medals, then come Russia, Germany, Australia and China. But I have the list of all the countries, and I only want the top ten, so I'll eliminate the rest and select just the top ten. For that let me use another variable, final_count, and write LIMIT sort 10, which gives me only the top ten values. Now let's check final_count: this is our final result, the names of the top ten countries along with the total number of medals each country won. Let's store this result in our output directory: for that I use the command STORE final_count INTO my output directory, olympic/output, in a folder I'll call use case one, and it's a success; the final result has been stored in my output directory. Similarly, we can find the answers to the other two questions we had. The second one was to find the top ten countries that won the highest number of gold medals, and this is completely similar to the first one: instead of selecting the field with total medals we select the field with gold medals, and all the other steps stay the same; since the gold medals are at index 6, instead of writing 9 we write 6 in this case.
The third one was to find out which countries have won the most medals in swimming, so let me execute this one for you. It is again very similar, except that instead of two fields we have to select three, because there is one more field involved, the sport. The first thing to do is load the data set, which is the same as before, and for the second step we now generate three fields: $2 as country, as before, plus $5 as sport, since sport is at index 5, and $9 as total_medals. Since we want this for one particular sport, swimming, we filter out all the other sports and take only swimming into account. First let me clear my screen; I'm using another variable, athlete_filter, and an inbuilt function called FILTER: I filter country_final by sport, where the sport is swimming. Let's check athlete_filter: there we have only the rows with the country name and the sport swimming. Again this is an intermediate result, and we want to group all the countries together, so let me use a variable called final_group and the inbuilt GROUP function: group athlete_filter by country. Checking final_group, we have again grouped all the countries together, so now we'll count them using the same COUNT function as before: let me use another variable, final_count, and write FOREACH final_group GENERATE group together with the COUNT function applied to athlete_filter. Let's check final_count: again it is not sorted, and we want to see the top countries that always win medals in swimming, so we sort it, ordering final_count by the count in descending order so that the top country comes first. Let me check sort, and there we have it; I know you have already guessed it, it is obviously going to be the United States, with Michael Phelps winning so many of them. So we have the United States at the top of the table, then Australia, Netherlands, Japan, China, Germany and France. If you want only the top five or top ten you can do it in the same way using LIMIT, but we can also keep it like this. This is the final result we want, and we're going to store it in our output directory, so again we use the same command: STORE sort INTO my output directory, olympic/output, in a folder called use case three. Press Enter, and again it is successful. Now let's come out of the Pig shell; this is my terminal again, so let us view the first result that we stored in our output directory. For that, type hadoop fs -cat with my output directory, which was inside use case one, followed by the part file, and there is my result, successfully stored in my output directory. This is how you can use Pig to make analyses. This was a very small data set and a very simple analysis; you can perform much more complex ones with Pig as well, and you only have to write a few lines of code. I hope that you have all understood this use case.
If you have any doubts, you can ask me questions right now. Do you have any questions? All right, thank you everyone for attending this session. I hope you have all learned about Hadoop, and if you have any queries or doubts, kindly leave them in the comment section below. This video will be uploaded to your LMS, and I'll see you next time; till then, happy learning. I hope you enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist and subscribe to our Edureka channel to learn more. Happy learning.
Info
Channel: edureka!
Views: 568,677
Keywords: yt:cc=on, hadoop, hadoop tutorial, hadoop tutorial for beginners, apache hadoop, introduction to hadoop, overview of hadoop, hadoop overview, hadoop training, hadoop certification, big data, big data tutorial, big data tutorial for beginners, big data hadoop, big data hadoop tutorial, what is hadoop, what is big data, hadoop mapreduce tutorial, hadoop hdfs tutorial, hadoop tutorial for beginners with examples, hadoop architecture, hadoop edureka, big data edureka, edureka
Id: mafw2-CVYnA
Length: 101min 37sec (6097 seconds)
Published: Tue May 09 2017