Big Data & Hadoop Full Course In 12 Hours [2023] | BigData Hadoop Tutorial For Beginners | Edureka

Captions
[Music] Big data is a term used to describe the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it is not the amount of data that is important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. Hi everyone, welcome to this Big Data full course. Today we have an exciting agenda lined up for you, but before we get started, if you like our videos then please do not forget to subscribe to our Edureka YouTube channel and hit the bell icon to stay updated with all the latest trending technologies. Also, if you are interested in our Big Data certification training, then please click on the link given in the description box. Now, without any delay, let us go through the agenda. First, we will start with an introduction to big data by discussing what big data is, the challenges it poses and the opportunities it presents. Next, we will dive deep into the Hadoop fundamentals, followed by HDFS, MapReduce and other key components of the Hadoop ecosystem. We will also discuss advanced topics such as Sqoop, a tool for transferring data between Hadoop and relational databases. Next we have Flume, a tool for collecting, aggregating and moving large amounts of log data. Then we have Pig, a high-level platform for creating MapReduce programs used with Hadoop. Then we have Hive, a data warehousing tool with an SQL-like query language for Hadoop. After that we have HBase, a NoSQL database that runs on top of Hadoop and is used for real-time data access and analysis. After that we will learn about Oozie, which is used to manage and coordinate Hadoop jobs. After that we will dive into some of the popular Hadoop projects, which will give you an understanding of how these technologies are being used in the industry. We will also cover the career opportunities in the big data domain and tips to prepare for Big Data Hadoop interview questions. By the end of this full course you will have had plenty of opportunities for hands-on practice, you will have a solid understanding of the Hadoop ecosystem, and you will be well prepared to work with big data in a professional setting. So let us get started with our first topic, that is: what is big data? [Music] What is big data? With a spike in internet usage and other technologies such as IoT devices, mobile phones, and autonomous devices like robotics, drones, vehicles and appliances, the volume of generated data is growing exponentially at an unprecedented rate. This constant increase in the amount of data generated has led to the emergence of big data. So what is big data? Let's keep it simple: big data refers to a collection of data that is so huge and complex that none of the traditional data management tools are able to store it or process it efficiently. Now, we know that big data involves lots of data, but have you ever stopped to think about just how big big data is? According to Forbes, there are 2.5 quintillion bytes of data created every day. Now you might be wondering how big data tools manage to handle such a huge amount of data. When Netflix offers you personalized recommendations from its library of thousands of movies and TV shows, that's big data at work. Big data helps Netflix determine which programs may be of interest to you, and the recommendation system actually influences 80 percent of the content we watch on Netflix. In the near future, with the growing increase in the volume of data, big data will grow bigger, as the demand for data management experts
will shoot up the gap between the demand for data professionals and their availability the white this will help data scientists and analysts draw higher salaries so what are you waiting for drive into the world of big data and towards a bright future [Music] now I feel sort of it's the best time to tell the story about how data evolved and how big data came find reshma so we'll move forward so sort of what can you notice here I see how technology has evolved earlier we had landline phones but now we have smartphones we have Android we have IOS that are making our lives smarter as well as our phones smarter apart from that we were also using bulky desktops for processing MBS of data now if you can remember we were using floppies and you know how much data it can store right then came hard disk for storing TVs of data and now we can store data on cloud as well and similarly nowadays even self-driving cars have come up I know you must be thinking why are we telling that now if you notice due to this enhancement of Technology we're generating a lot of data so let's take the example of your phones have you ever noticed how much data is generated due to your fancy smartphones your every action even one video that is sent through WhatsApp or any other messenger app that generates data now this is just an example you have no idea how much data you're generating because of every action you do now the deal is this data is not in a format that our relational database can handle and apart from that even the volume of data has also increased exponentially now I was talking about self-driving cars so basically these cars have sensors that records every minute details like the size of the obstacle the distance from the obstacle and many more and then it decides how to react now you can imagine how much data is generated for each kilometer that you drive on that car I completely agree with you reshma so let's move forward and focus on various other factors behind the evolution of data I think you guys must have heard about iot if you can recall the previous slide we were discussing about self-driving cars it is nothing but an example of iot let me tell you what exactly it is iot connects your physical device with internet and makes a device smarter so nowadays you have noticed we have Smart ACS TVs Etc so we'll take the example of smart air conditioners so this device actually monitors your body temperature and the outside temperature and accordingly decides what should be the temperature of the room now in order to do this it has to First accumulated data from where it can accumulate data from internet through sensors that are monitoring your body temperature and the surroundings so basically from various sources that you might not even know about it is actually fetching that data and accordingly it decides what should be the temperature of your room now we can actually see that because of iot we are generating a huge amount of data now there is one start also that is there in front of your screen so if you notice by 2020 will have 50 billion iot devices so I don't think so I need to explain much that how iot is generating huge amount of data so we'll move forward and focus on one more factor that is social media now when we talk about social media I think reshma can explain this better right yeah sort of but I'm pretty sure that even you use it so let me tell you that social media is actually one of the most important factor in the evolution of big data so nowadays everyone is using Facebook Instagram 
YouTube and a lot of other social media websites so this social media sites have so much data for example it will have your personal details like your name age and apart from that even each picture that you like or react to also generates data and even the Facebook pages that you go around liking that is also generating data and nowadays you can see that most people are sharing videos on Facebook so that is also generating huge amount of data and the most challenging part here is that the data is not present in a structured Manner and at the same time it is huge in size isn't that right sorrow can't agree more the point you made about the form of data is actually one of the biggest factor for the evolution of Big Data so do you do all these reasons that we have discussed have not only increased the amount of data but it has also shown us that data is actually getting generated in various formats for example data is generated with videos that is actually unstructured same goes for images as well so there are numerous or you can say millions of ways in which data is getting generated nowadays absolutely and these are just few examples that we have given you there are many other driving factors for the evolution of data so these are few more examples because of which data is evolving and converting to Big Data we'll discuss about the retail part I'm pretty sure that all of you must have visited websites like Amazon Flipkart Etc and reshma I know you visited a lot of times yeah I do and suppose reshma wants to buy shoes so she won't just directly go buy shoes she'll search for a lot of shoes so somewhere her search history will be stored and I know for sure that this won't be the first time that she's buying something so there will be her purchase history as well along with her personal details and there are numerous ways in which she might not even know that she's generating data and obviously Amazon was not present earlier so at that time there is no way that such huge amount of data was generated similarly the data has evolved due to other reasons as well like Banking and finance media and entertainment etc etc so now the deal is what exactly is Big Data how do we consider data as big data so let's move forward and understand what exactly it is okay now let us look at the proper definition of Big Data even though we've put forward our own definitions already so sort of why don't you take us through it yesterday so big data is a term for collection of data sets so large and complex that it becomes difficult to process using on hand database system tools or traditional data processing applications okay so what I understand from this is that our traditional systems are a problem because they're too old-fashioned to process this data or something no reshma the real problem is there is too much data to process when the traditional systems were invented in the beginning we never anticipated that we would have to deal with such enormous amount of the data it's like a disease infected on you you don't change your body orientation when you get infected with the disease right reshma you cure it with medicines couldn't agree more sort of now the question is how do we consider some data as Big Data how do we classify some data as Big Data how do we know which kind of data is going to be hard for us to process well sort of we have the five V's to tell us that so let's take a closer look at what are those so starting with the first V it's the volume of data it's tremendously large so if you look at the 
stats here you can see the volume of data is rising exponentially so now we're dealing with just 4.4 zettabytes of data and by 2020 just in three years is expected that the data will rise up to 44 zettabytes which is like equal to 44 trillion gigabytes so that's really really huge it is because all these humongous all this humongous data is coming from multiple sources and that is the second way which is nothing but variety we deal with so many different kinds of files at all once there are MP3 files videos Json CSV tsv and many more now these are all structured unstructured and semi-structured altogether now let me explain you this with the diagram that is there on your screen so over here we have audio we have video files we have PNG files we have Json log files emails various formats of data now this data is classified into three forms one is structured format now in structure format you have a proper schema for your data so you know what all columns will be there and basically you know the schema about your data so it is structured it is in a structured format or you can say in a tabular format now when we talk about semi-structured files these are nothing but Json XML and CSV files where schema is not defined properly now when I go to unstructured format we have block files here audio files videos and images so these are all considered as unstructured files and sorrow it is also because of the speed of accumulation of all this variety of data altogether which brings us to our third V which is velocity so if you look here earlier we were using Mainframe systems huge computers but less data because there were less people working with computers at that time but as computers evolved and we came to the client server model the time came for the web applications and the internet boomed and as it grew among the masses the web applications got increased over the internet and everyone started using all these applications and not only from their computers and also for mobile devices so more users more appliances more apps and hence a lot of data and when you talk about people generating data or Internet reshma the one kind of application that strikes first in the mind is social media so you tell me how much data you generate alone with your Instagram posts and stories it will be quite a boast if I only talk about myself here so let's talk including every social media user so if you see the stats in front of your screen you can see that for every 60 seconds there are 100 000 tweets actually more than 100 000 tweets generated in Twitter every minute similarly there are 695 000 status updates on Facebook when you talk about messaging there are 11 million messages generated every minute and similarly there are 698 445 Google searches 168 million emails and that equals to almost 1 820 terabytes of data and obviously the number of mobile users are also increasing every minute and there are 217 plus new mobile users every 60 seconds geez that's a lot of data I don't even want to go ahead and calculate the total it would actually scare me yeah that's a lot now the bigger problem is is how to extract the useful data from here and that's when we come to our next V that is value so over here what happens first you need to mine the useful content from your data basically you need to make sure that you have only useful fields in your data set after that you perform certain analytics or you say you you perform certain analysis on that data that you have cleaned and you need to make sure that whatever analysis 
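(A quick aside, going back to the variety point for a moment: here is a tiny sketch in Python of what the same shopping information can look like in structured, semi-structured and unstructured form. All the records and field names are made up purely for illustration.)

```python
import csv, io, json

# Structured: a fixed schema, tabular / CSV style
csv_text = "order_id,customer,amount\n101,Reshma,2499\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON, where the schema is implicit and can vary per record
json_text = '{"order_id": 101, "customer": "Reshma", "items": [{"sku": "shoe-42", "qty": 1}]}'
doc = json.loads(json_text)

# Unstructured: free text (or audio, video, images), with no schema at all
review = "Loved the shoes, delivery was quick, will buy again!"

print(rows[0]["amount"])          # easy: the column is known in advance
print(doc["items"][0]["sku"])     # still navigable, but structure can differ per record
print("shoes" in review.lower())  # needs parsing before it becomes analyzable
```

Coming back to the value point: you need to make sure that whatever analysis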
you have done it is of some value that is it will help you in your business to grow it can basically find out certain insights which were not possible earlier so you need to make sure that whatever big data that has been generated or whatever data that has been generated it makes sense it will actually help your business to grow and it has some value to it now getting the value out of this data is one big challenge let me tell you why and that brings us to our next V which is veracity now this big data has a lot of inconsistencies obviously when you're dumping such huge amount of data some data packets are bound to lose in the process now what we need to do we need to fill up these missing data and then start mining again and then process it and then come up with a good inside if possible so if you can notice there's a diagram in front of your screen so over here we have this field which is not defined similarly this field and if you can notice here when we talk about this minimum value you see the other minimum values and when you talk about this it is it is way more than the other fields present in this particular column similarly goes for this particular element as well okay so obviously processing data like this is one problematic thing and now I get it why big data is a problem statement well we have only five V's now but maybe later on we'll have more so there are good chances that big data might be even more big okay so there are a lot of problems in dealing with big data but there are always different ways to look at something so let us get some positivity in the environment now and let us understand how can we use Big Data as an opportunity yes reshma and I would say the situation is similar to the proverb when life throws you lemons make lemonade yeah so let us go through the fields where we can use Big Data as a boon and there are certain unknown problems solved only because we started dealing with big data and the Boon that you're talking about reshma is big data analytics first thing with big data we figured out how to store our data cost effectively we were spending too much money on storage before until Big Data came into the picture we never thought of using commodity Hardware to store and manage the data which is both reliable and feasible as compared to the costly servers now let me give you a few examples in order to show you how important big data analytics is nowadays so when you go to a website like Amazon or YouTube or Pandora Netflix any other website so they'll actually provide you certain fields in which they'll recommend some products or some videos or some movies or some songs for you right so how do you think they do that so basically whatever data that you are generating on these kind of websites they make sure that they analyze it properly and let me tell you guys that data is not small it is actually big data now they analyze that big data and they make sure that whatever you like or whatever your preferences are accordingly they'll generate recommendations for you and when I go to YouTube I don't know if you guys must have noticed it but I'm pretty sure you must have done that so when I go to YouTube YouTube knows what song or what video that I want to watch next similarly Netflix knows what kind of movies are like and when I go to Amazon it actually shows me what all products that I would prefer to buy right so how do you think it happens it happens only because of big data analytics okay so there's one more example that just popped into my mind I'll share 
with you guys so there's this time when the Hurricane Sandy was about to hit on New Jersey in the United States so what happened then the Walmart used big data analytics to profit from it now I'll tell you how they did it so what Walmart did is that they studied the purchase patterns of different customers when a hurricane is about to strike or any kind of natural Calamity is about to strike on a particular area and when they made an analysis of it so they found out that people tend to buy emergency stuff like flashlight life jackets and a little bit of other stuff and interestingly people also buy a lot of strawberry Pop-Tart strawberry top dots are you serious yeah now I didn't do that analysis so I Walmart did that and apparently it is true so what they did is they stuffed all their stores with a lot of strawberry Pop-Tarts and emergency stuff and obviously it was sold out and they earned a lot of money during that time but my question here reshma is people want to die eating strawberry Pop-Tarts like what was the idea behind strawberry Pop-Tarts I'm pretty unsure about it but yeah since you have given us a very interesting example and Walmart did that analysis we didn't do it so yeah so it is a very good example in order to understand how big data analytics can help your business to grow and find better insight from the data that you have yeah and also if you want to know why strawberry Pop-Tarts maybe later on we can start making an analysis by gathering some more data also yeah that can be possible okay so now let's move ahead and take a look at a case study by IBM how they have used big data analytics to profit their company so if you have noticed that earlier the data that was collected from The Meters that you have in your home that measures the electricity consumed it is actually sending data after one month but nowadays what IBM did they came up with this thing called smart meter and that smart meter used to collect data after every 15 minutes so whatever energy that you have consumed after every 15 minutes it will send that data and because of it big data was generated so we have some stats here which says that we have 96 million reads per day for every million meters which is pretty huge this data the amount of data that is generated is pretty huge now IBM actually realized the data that they're generating it is very important for them to gain something from that data so for that what they need to do for that what they need to do they need to make sure that they analyze this data so they realize that big data analytics can solve a lot of problems and they can get better business Insight through that so let us move forward and see what type of analysis they did on that data so before analyzing the data they came to know that energy utilization and billing was only increasing now after analyzing Big Data they came to know that during Peak load the users require more energy and during off-peak times that users require less energy so what advantage they must have got from this analysis one thing that I can think of right now is they can tell the industries to use their Machinery only during the off-peak times so that the load will be pretty much balanced and you can even say that time of use pricing encourages cost savy retail like industrial heavy machines to be used off peak time so yeah take it save money as well because of peak times pricing will be less than the peak time prices right so this is just one analysis now let us move forward and see the IBM Suite that they 
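(A quick sanity check on that 96-million figure, as a rough sketch: one reading every 15 minutes is 4 readings an hour, so 96 readings per meter per day, and a million meters therefore produce about 96 million reads a day.)

```python
# Rough back-of-the-envelope check of the smart-meter figure quoted above
readings_per_hour = 60 // 15                           # one reading every 15 minutes
readings_per_meter_per_day = readings_per_hour * 24    # = 96
meters = 1_000_000
print(readings_per_meter_per_day * meters)             # 96,000,000 reads per day per million meters
```

So, moving forward, let us see the IBM suite that they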
developed so over here what happens you first dump all your data that you get in this data warehouse after that it is very important to make sure that your user data is secure then what happens you need to clean that data as I've told you earlier as well there might be many fees that you don't require so you need to make sure that you have only useful material or useful data in your data set and then you perform certain analysis and in order to use this Suite that IBM offered you efficiently you have to take care of a few things the first thing is that you have to be able to manage the smart meter data now there is a lot of data coming from all this million Smart Meters so you have to be able to manage that large volume of data and also be able to retain it because maybe later on you might need it for some kind of regulatory requirements or something and next thing you should keep in mind is to monitor the distribution grid so that you can improve and optimize the overall grid reliability so that you can identify the abnormal conditions which are causing any kind of problem and then you also have to take care of optimizing the unit commitment so by optimizing the unit commitment the companies can satisfy their customers even more they can reduce the power outages that is the they can reduce the power outages so that their customers don't get angry more identity 35 problems and then reduce it obviously and then you have also to optimize the energy trading so it means that you can advise your customers when they should use their appliances in order to maintain that balance in the power load and then you also have to forecast and schedule loads so companies must be able to predict when they can profitably sell the Excess power and when they need to hedge the supply and continuing from this now let's talk about how Encore have made use of the I-beam solution so Encore is an electric delivery company and it is the largest electrical distribution and transmission company in Texas and it is one of the six largest in the United States they have more than 3 million customers and their service area covers almost 117 000 square miles and they begin the advanced meter program in 2008 and they have deployed almost 3.25 million meters serving customers of North and Central Texas so when they were implementing it they kept three things in mind the first thing was that it should be instrumented so this solution utilizes the smart electricity meters so that they can accurately measure the electricity usage of a household in every 15 minutes because like we discussed that the smart meters were sending out data every 15 minutes and it provided data inputs that is essential for consumption insights next thing is that it should be interconnected so now the customers have access to the detailed information about the electricity they're consuming and it creates a very Enterprise wide view of all the meter assets and it helped them to improve the service delivery the next thing is to make your customers intelligent now since it is getting monitored already about how each of the household or each customer is consuming the power so now they're able to advise the customers about maybe to tell them to wash their clothes at night because they're using a lot of appliances during the daytime so maybe they could divide it up so that they could use some appliances at off-peak hours so that they can even save more money and this is beneficial for both of them for both the customers and the company as well and they have 
gained a lot of benefits by using the IBM solution so what are the benefits they got is that it enables Encore to identify and fix outages before the customers get inconvenience that means they were able to identify the problem before it even occurred and it also improved the emergency response on events of severe weather events and views of outages and it also provides the customers the data needed to become of active participant in the power consumption management and it enabled every individual household to reduce their electrical consumption by almost five to ten percent and this is how Encore used the IBM solution and made huge benefits out of it just by using big data analytics that IBM performed but let me just interrupt right now so since reshma told us in the beginning as well that there are no free launches in life right so this is an opportunity but there are many problems to encase this opportunity right so let us focus on those problems one by one so the first problem is storing colossal amount of data so let's discuss few stars that are there in front of your screen so data generated in past two years is more than the previous history in total so guys what are we doing top generating so much amount of data I said that by 2020 total Digital Data will grow to 44 Zeta bytes approximately and there's one more start that amazes me is about 1.7 MB of new information will be created every second for every person by 2020 so storing this huge data in traditional system is not possible the reason is obvious the storage will be limited for one system for example you have a server with a storage limit of 10 terabytes but your company is growing really fast and data is exponentially increasing now what you'll do now at one point you'll exhaust all the storage so investing in huge servers is definitely not a cost effective solution so reshma what do you think what can be the solution to this problem uh according to me a distributed file system will be a better way to store this huge data because with this we'll be saving a lot of money let me tell you how because due to this distributed system you can actually store your data in commodity Hardware instead of spending money on high-end servers don't you agree sorrow completely now we know storing is a problem but let me tell you guys it is just one part of the problem let's see few more okay so since we saw that the data is not only huge but it is present in various formats as well like unstructured semi-structured and structured so you not only need to store this huge data but you also need to make sure that a system is present to store this varieties of data generated from various sources and now let's focus on the next problem now let's focus on the diagram so over here you can notice that the hard disk capacity is increasing but the disk transfer performance or speed is not increasing at that rate let me explain you this with an example if you have only 100 Mbps input output Channel and you are processing say one terabytes of data now how much time will it take maybe calculate it'll be somewhere around 2.91 hours right so there will be somewhere around 2.91 hours and I have taken an example of one terabytes what if you're processing some Zeta bytes of data so you can imagine how much time will it take now what if you have a four input output channels for the same amount of data then it will take approximately 0.72 hours or converted to minutes so it'll be around 43 minutes approximately right and now imagine instead of 1 TB you have 
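(To make that arithmetic concrete, here is a rough sketch in Python. The exact figures depend on whether you read the channel as roughly 100 MB/s or 100 Mbps and on TB versus TiB, so treat the numbers as approximate; they come out close to the ~2.91 hours and ~43 minutes quoted above.)

```python
# Back-of-the-envelope: how long does it take to read 1 TB through one I/O channel,
# and through four channels in parallel?
data_bytes = 1 * 10**12          # 1 TB (decimal)
channel_rate = 100 * 10**6       # assume roughly 100 MB/s per channel

one_channel_hours = data_bytes / channel_rate / 3600
four_channel_hours = one_channel_hours / 4

print(f"1 channel : {one_channel_hours:.2f} hours")     # roughly 2.8 hours
print(f"4 channels: {four_channel_hours:.2f} hours "
      f"(~{four_channel_hours * 60:.0f} minutes)")      # roughly 42-43 minutes
```

Now imagine that instead of one terabyte you have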
Zeta bytes of data for me more than storage accessing and processing speed for huge data is a bigger problem okay so reshma has a very good example to discuss yeah so since you were talking about accessing the data and you told us already about how Amazon at different websites and YouTube they make those recommendations so if there was no solution for it if it would take so much time to access the data the recommendation system won't work at all and they make a lot of money just for recommendation system because a lot of people go there and click over there and buy that product right so let's consider that that it is taking like hours or maybe years of time in order to process my that big amount of data so let's say that at one time I purchased an iPhone 5s from Amazon and after two years I'm again browsing onto Amazon and since it took so much time to access the data and I've already switched over to a new iPhone and they are recommending me the old iPhone case for 5S so obviously that won't work I won't go there and click it because I've already changed my phone right so that will be a huge problem for Amazon the recommendation system won't work anymore and I know that reshma changes her phone every year so if she has bought a phone and people are recommending if she has bought a phone now and someone's recommending the case for that phone after two years doesn't make sense to me at all yeah only it'll work if I have both the two phones at the same time but yeah I don't want to waste money on purchasing new iPhone case for my older phone so basically it won't be fair if we don't discuss the solution to these problems reshma we can't leave our viewers with just the problems right it won't be fair what is the solution Hadoop Hadoop is a solution so let's introduce Hadoop now okay so now what is Hadoop so Hadoop is a framework that allows you to first store big data in a distributed environment so that you can process it parallely there are basically two parts one is hdfs that is Hadoop distributed file system for storage it allows you to store data of various formats across a cluster and the second part is mapreduce now it is nothing but a processing unit of Hadoop it allows parallel processing of data that is stored across the hdfs now let us dig deep in hdfs and understand it better yeah so hdfs creates an abstraction of resources let me simplify it for you so similar to virtualization you can see hdfs logically as a single unit for storing big data but actually restoring your data across multiple systems or you can say in a distributed fashion so here you have a Master Slave architecture in which the name node is a master node and the data nodes are slaves and the name node contains the metadata about the data that is stored in the data nodes like which data block is stored in which data node where are the replications of the data block kept and etc etc so the actual data is stored in the data nodes and I also want to add that we actually replicate the data blocks that is present in the data nodes and by default the replication factor is three so it means that there are three copies of each file so sorry I'm going to tell us why do we need that replication since we are using commodity Hardware is right and we know failure rate of these Hardwares are pretty high so if one of the data nodes fail I won't have that data block and that's the reason we need to replicate the data block now this replication Factor depends on your requirements right now let us understand how actually Hadoop 
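(Before that, here is a minimal word-count sketch in the Hadoop Streaming style, just to show what "sending the processing to the data" looks like in practice: a mapper and a reducer written in Python that Hadoop would run on the DataNodes where the blocks live, shipping only the small aggregated results back to the master. The tiny local simulation at the bottom is only there so the sketch runs on its own.)

```python
from itertools import groupby

# mapper.py: runs on each DataNode, against the local block of the input file
def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1            # emit (key, value) pairs

# reducer.py: receives the pairs grouped by key and aggregates them
def reducer(pairs):
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the shuffle/sort that Hadoop Streaming would do across the cluster
    block = ["big data is big", "hadoop processes big data"]
    for word, total in reducer(mapper(block)):
        print(word, total)
```

Now, back to how Hadoop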
provided the solution to big data problems that we have discussed so reshma can you remember what was the first problem yeah it was storing the big data so how hdf has solved it let's discuss it so hdfs provides a distributed way to store Big Data we've already told you that so your data is stored in blocks in data nodes and you then specify the size of each block so basically if you have a 512 MB of data and you have configured hdf as such that it will create 128 megabytes of data block so hdfs will so hdfs will divide the data in four blocks because 512 divided by 128 is 4. and it will store it across different data nodes and it will also replicate the data blocks on the different data nodes so now we are using commodity hardware and storing is not a challenge so what are your thoughts on it sort of I will also add one thing reshma it also solves the scaling problem it focuses on horizontal scaling instead of vertical now you can always add some extra data nodes to your hdfs cluster as and when required instead of scaling the resources of your data nodes so you're not actually increasing the resources of your data nodes you're just adding few more data nodes when you require let me summarize it for you so basically for storing one TB of data I don't need a one TV system I can instead do it on multiple 128 GB systems or even less now reshma what was the second challenge with big data so the next problem was storing variety of data and that problem was also addressed by hdfs so with hdfs you can store all kinds of data whether it's structured semi-structured or unstructured it is because in hdfs there is no pre-dumping schema validation so you can just dump all the kinds of data that you have in one place and it also follows write ones and read many model and due to this you can just write the data once and you can read it multiple times for finding out insights and if you can recall the third challenge was accessing the data faster and this is one of the major challenge with big data and in order to solve it we're moving processing to data and not data to processing so what it means sort of just go ahead and explain it yes reshma I will so over here let me explain you what you mean by actually moving process to data so consider this as our master and these are our slates so the data is stored in the slates so what happens one way of processing this data is what I can do is I can send this data to my master node and I can process it over here but what will happen if all of my slaves will send the data to my master node it'll cause Network congestion plus input output Channel congestion and at the same time my master node will take a lot of time in order to process this huge amount of data so what I can do I can send this process to data that means I can send the logic to all these slaves which actually contain the data and perform processing in the slaves itself so after that what will happen the small chunks of the result that will come out will be sent to our name node so in that way there won't be any network congestion or input output congestion and it will take comparatively very less time so this is what actually means sending process to data [Music] problems faced by traffic now there are a hell lot of problems that are being faced by us in our day-to-day lives a few major problems are delays the first thing many people think of when it comes to congested roadways are the delays during the morning commute there is an additional stress because of delays caused by traffic that can 
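(One last quick sketch before the traffic use case, on the HDFS block arithmetic just described: a 512 MB file with 128 MB blocks gives 4 blocks, and with the default replication factor of 3 that means 12 physical block copies spread across the DataNodes.)

```python
import math

# HDFS block arithmetic from the example above
file_size_mb = 512
block_size_mb = 128          # the configured block size in the example
replication_factor = 3       # HDFS default

blocks = math.ceil(file_size_mb / block_size_mb)
physical_copies = blocks * replication_factor
raw_storage_mb = file_size_mb * replication_factor

print(f"{blocks} blocks, {physical_copies} block replicas, "
      f"{raw_storage_mb} MB of raw cluster storage for a {file_size_mb} MB file")
```

Back to the first problem, the delays: congestion during the morning commute can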
make people late for work and at the end of the day the afternoon Rush is again frustrating time because if the work day is done and the people want to go home and relax and the traffic is preventing it these delays are common to most of the people because it is universal and everyone who has to maneuver through congested routes just in case time a secondary effect of traffic congestion related to delays is the inability to estimate travel times those who regularly travel congested areas know approximately how long it usually takes to get to a particular area depending upon on the time of the day and the day of the week these experienced city drivers have to build in time just in case the traffic is too bad this takes away the time from their Leisure and the time to do other tasks throughout the day also on a few days when the traffic is usually light the built-in extra time may be of no use and the person arrives too early followed by the first problem we move into the second problem which is the fuel consumption and pollution the stopping and starting in traffic jams Burns fuel at a higher rate than the smooth rate of travel on the open Highway this increases the fuel consumption cost commuters additionally for fuel and it also contributes to the amount of emissions released by the vehicles these emissions create air pollution and are related to global warming followed by the second problem the third major problem is the road rage road rage is a senseless reaction into traffic that is common in congested traffic areas if someone is not driving as fast as the person behind him thinks he should or someone cuts in front of someone else it can lead into an incident that is dangerous to the offender and those around him on the road road rage often manifests itself as shouting matches on the road intentional tailgating retaliated traffic Maneuvers and mostly a lack of attention being paid to the traffic around the people involved it is basically a temper tantrum by frustrated drivers in traffic followed by this we have the emergency vehicles when you dial 911 or 108 in case of India and request a police officer or an ambulance or a fire truck and the emergency vehicle is unable to respond in appropriate amount of time because of the traffic congestion it can be a danger to you or your property systems are available that help Elevate this problem by allow allowing the emergency crews to automatically change the traffic lights to keep the line moving so with this we move ahead into the next topic where we will learn exactly how big data is solving it the higher risk of Passenger safety loss of productivity increase in fuel consumption and fuel as all the effects of urban traffic congestion efficient traffic management will reduce congestion improve performance measurements for seamless traffic flow and proficiently manage current roadway assets government organizations and administrative authorities are implementing coordinated traffic signals and variable messages to manage traffic congestion by implementing Big Data Solutions administrators can leverage historical Trends a combination of real-time information and a new age algorithms to improve and traffic networks in urban areas the growing focus on the development of intelligent Network systems and the use of big data analytics will assist traffic management and result in reduced congestion and roadblocks the adoption of advanced sensors and GPS signal systems is revolutionizing the urban traffic Network these systems are designed to help 
reduce network congestion and act as alerts that notify traffic authorities of potential roadblocks and how to avoid them. The sensors are installed in trucks, ships and airplanes and give real-time insights into drivers' capabilities and the traffic, while GPS signals are utilized to spot bottlenecks and predict the condition of the transportation network. Followed by that, we have the emergence of smart vehicles. The advent of smart vehicles will help reduce network congestion across several cities in the world: these are connected vehicles that provide real-time estimation of traffic patterns, which helps authorities with the deployment of management strategies. These systems are designed to improve vehicle-to-infrastructure communications and to monitor traffic control to reduce collisions and accidents. Additionally, the implementation of speed trackers, traffic sensors and display boards will result in smarter roads and help control speed and traffic issues effectively. Followed by that, we have telematics solutions. Telematics is extensively used in traffic management to provide statistics and information such as weather conditions, traffic conditions and navigation systems. These systems provide real-time information, and authorities leverage predictive analysis to determine the state of the transportation network. Moreover, telematics provides speech-based internet access to consumers through wireless links that monitor the driver's state and stress levels and send alerts to the systems if there is any issue, to avoid or reduce the chances of collision. Now we shall discuss the outcomes and solutions offered. Quick data analytics assessments of traffic network conditions identify a set of transportation indicators that are measured using the mobile phone data available to travel agencies to optimize road planning. Some of the solutions offered are: developed an integrated platform to perform data analysis and scheduling in an easy and controlled way using a user-friendly interface; offered an exhaustive understanding of how traffic demand is distributed over the transportation network and how it varies over time; listed the different types of bottlenecks over the transportation network and enabled enhanced route planning; evaluated the travel delay due to congestion based on travel time distribution between peak and non-peak periods; and provided a comprehensive analysis of the total number of trips made to and from each zone based on date, time, month and holiday. Next we shall learn the architecture of intelligent traffic management systems. ITS generally has cameras set up in the most populated areas of metropolitan cities. These cameras collect visual data in the form of photos and videos, and this data is collected and stored in a storage unit; generally HBase is used for the job, as it is capable of storing all sorts of data regardless of the type of data. Now, there are n number of calculations and processing steps that are applied to the collected data: for an offline data processing procedure we use MapReduce, whereas we use some high-end processing frameworks for online and interactive applications. The procedure involves active data mining for fetching the most accurate data. The data is ingested into RDBMS and other servers using Sqoop, where the data is aggregated, and the legacy applications are run over the data to analyze it and provide the traffic department with the latest updates on the live traffic. Now, our next concern is learning the approach to intelligent traffic systems.
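(As an illustration of that storage step, here is a minimal sketch of writing and scanning camera records in HBase from Python. It assumes the happybase client library, an HBase Thrift gateway reachable at "hbase-host", and a pre-created "traffic_frames" table with "meta" and "img" column families; all of those names are placeholders, not part of the case study itself.)

```python
import happybase

# Connect to HBase via its Thrift gateway (host name is a placeholder)
connection = happybase.Connection("hbase-host")
table = connection.table("traffic_frames")   # assumed to exist with families 'meta' and 'img'

# Store one camera frame: the row key encodes camera id + timestamp
table.put(b"cam42-2023-01-01T10:15:00", {
    b"meta:camera_id": b"42",
    b"meta:junction": b"MG Road x Brigade Road",
    b"img:jpeg": b"<binary frame bytes>",
})

# Later, an offline MapReduce or analysis job can scan everything from one camera
for row_key, data in table.scan(row_prefix=b"cam42-"):
    print(row_key, data[b"meta:junction"])

connection.close()
```

Coming back to the approach to intelligent traffic systems: there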
are five different stages in which the Intelligent Traffic Control system works they are traffic data collection traffic data transmission traffic safety measures traffic data analysis and lastly providing the traveler with latest traffic updates as per his needs let's discuss each one button the first one traffic Management Center traffic Management Center is the vital unit of its it is mainly a technical system administered by the Transportation Authority here all data is collected and analyzed for further operations and control management of the traffic in real time or information about local transport Vehicles well organized and proficient operations of traffic Management Center depend on automated data collection with precise location information that analysis of data to generate accurate information and then transmitting it back to Travelers let's understand the entire process in a more detailed way the first stage is data collection strategic planning needs precise extensive and prompt data collection with real-time observation so the data here is collected via varied Hardware devices that lay the base of further its processes this data is collected and stored into a storage unit generally edgebase is used for the job as it is capable of covering all sorts of data regardless of the type of data now there are n number of calculations and processes that need to be applied onto the collected data the next stage is data transmission rapid and real-time information communication is the key to Proficiency in its implementation so the aspect of its consists of the transmission of collected data from the field to TMC and then sending back the analyzed information from TMC to The Travelers traffic related announcements are communicated to The Travelers through the internet SMS or onboard units of vehicles other methods of communications are dedicated short range Communications or dsrc using radio and continuous air interface long in medium range that is c-a-i-l-n using cellular connectivity and infrared links followed by this we have the safety intelligent Transport Systems top priority is to make sure the Travelers are safe it mainly Targets on the making way for emergency vehicles like fire and safety ambulance and curves it looks after real-time traffic information analyzer set and redirects the emergency vehicles to the most favorable routes to reach their destination faster and safer the next stage is data analysis the data that has been collected and received at DMC is processed for further in various steps these steps are error rectification data cleansing data synthesis and adaptive logical analysis the inconsistencies in data are identified with specialized software and rectified after that the data is further alterated and pulled for analysis this mended Collective data is analyzed further to predict traffic scenarios which are available to deliver appropriate information to users lastly The Traveler information travel advisory systems or Tas is used to inform Transportation updates to The Traveler the system delivers real-time information like travel time travel speed delays and accidents on roads change in Road diversions work Zone conditions Etc this information is delivered by a wide range of electronic devices like variable message signs Highway advisory radios internet SMS and automated cell call with urbanization expanding with Speedy stride the number of vehicles on road is also increasing combination of both in return puts enormous pressure on cities to maintain a better 
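(To make the data analysis stage a little more concrete, here is a small, self-contained Python sketch over made-up sensor readings: it cleans out bad records, averages speed per road segment, and flags segments that look congested. All the numbers, thresholds and road names are invented for the example.)

```python
from collections import defaultdict

# (road_segment, speed_kmph) readings coming in from roadside sensors; None = bad record
readings = [
    ("NH44-km12", 18), ("NH44-km12", 22), ("NH44-km12", None),
    ("ORR-km03", 55),  ("ORR-km03", 61),  ("ORR-km03", 58),
]

CONGESTION_THRESHOLD_KMPH = 30

# Error rectification / data cleansing: drop records with missing speeds
clean = [(seg, spd) for seg, spd in readings if spd is not None]

# Data synthesis: average speed per segment
speeds = defaultdict(list)
for seg, spd in clean:
    speeds[seg].append(spd)

for seg, values in speeds.items():
    avg = sum(values) / len(values)
    status = "CONGESTED" if avg < CONGESTION_THRESHOLD_KMPH else "free-flowing"
    print(f"{seg}: average {avg:.1f} km/h -> {status}")
```

As we were saying, all of this puts enormous pressure on cities to maintain a better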
traffic system so that the city keeps on moving without any hassle for the purpose application of intelligent transport system is the only solution its is a win-win situation for both citizens and City administrators where it provides safety and comfort to Citizens and easy maintenance and surveillance to City administrators now moving ahead we shall understand the hardware requirements of its the hardware requirements for its are categorized into three major components those are the field equipment communication systems and traffic Management Center now these three components have further requirements which are mentioned below they are field equipment that is the inductive Loop detectors magnetic detectors infrared and microwave detectors acoustic detectors and video imagine moving ahead the next one is the communication system which requires wired and Wireless Communications followed by that the last one is the traffic Management Center which requires basic facility of staff F signal control unit traffic surveillance freeway control integration for regional control incident detection incident response team information dissemination electronic tolls rail Crossing monitors now that we have a brief idea about the hardware requirements of ideas let's move ahead and understand the major challenges faced by its so the major challenges faced by its are lack of resources for operation and maintenance of its technology lack of In-House technical capacity to process understand and analyze the data lack of advanced analytics Solutions in the public transport industry lack of knowledge on idea systems and capabilities to specify suitable terms when Contracting idea services to vendors followed by that the lack of knowledge among vendors on specific needs of public transport operations which significantly affects the utility of the end product now with this we shall now move ahead and wind up our session discussing the benefits of its data is stored in data centers in different regions with global access its data center is universal life backup it offers real-time statistical data analysis it has common data source which is shared among offline analysis and interactive applications it offers full-text search capabilities inside the storage systems it has an inbuilt indexing system to offer synchronization of traffic data standard headspace interface increases image storing and processing performance it integrates our language support for edgebase hdfs and mapreduce finally exponential reduction in designing Logic for complex data mining [Music] so who is a big data engineer now every data driven business needs to have a framework in place for the data science and data analytics Pipeline and a data engineer is the one who is responsible for building and maintaining this framework now these Engineers must ensure that there is an uninterrupted flow of data between servers and applications so in simple words a data engineer builds tests maintains data structures and architectures for data ingestion processing and deployment of large-scale data intensive application now data Engineers work in tandem with data architect data analysts and data scientists so they must all share these insights to other stakeholders in the company through data visualization and storytelling but what does a big data engineer do exactly now the most crucial part of a big data engineer is to design develop construct install test and maintain the complete data management and processing systems they are basically the ones who 
handle the complete end-to-end infrastructure for data management and processing they build a pipeline for data collection and storage and funnel the data to data analysts and scientists so basically what they do is they create the framework to make data consumable for data scientists and analysts so they can use the data to derive insights from it note that the data Engineers are the Builders of data systems and not those who mine for insights so the data engineer works more behind the scenes and must be comfortable with other members of the team producing Business Solutions from this data now all their responsibilities revolve around this they need to take care of a lot of things while performing these activities hence one of the most sought after skills in data engineering is the ability to design and build data warehouses this is where all the raw data is collected stored and retrieved from without data warehouses all the tasks that a data scientist does will become obsolete it is either going to get too expensive or very very large to scale now data Engineers should always keep in mind that the system which he or she builds needs to be scalable robust and false tolerant so that the system can be scaled up without increasing the number of data sources and can handle a huge amount of heterogeneous data without any failure now imagine a situation wherein the source of data is doubled or tripled but the system cannot scale up will it not cost a lot more time and resources to build the same system again which is suitable for this kind of intake exactly this is why the Big Data Engineers have a role here next he or she is the one that handles the extract transform and load process which is basically the blueprint for how they've collected raw data is processed and transformed into Data ready for analysis now you're going to acquire a lot of data from different sources how do you bring them together to one platform ETL is your answer apart from all this a data engineer should always aim at deriving insights by acquiring data from new sources some of the responsibilities of a data engineer also include improving data foundational procedures integrating new data management Technologies and the software into existing systems and building data collection pipelines and finally one of the major roles of a data engineer is to include performance tuning and make the whole system way more efficient which is pretty self-explanatory if you ask me now most of us have some idea about who a big data engineer is but there's still some confusion about their responsibilities now this ambiguity further increases when we gain more information about the role now let me help you debunk all your queries about it so let's talk about some big data engineer responsibilities first up we have data ingestion now this is associated with the task of getting data out of the source systems and ingesting it into a data Lake now a data engineer would need to know how to efficiently extract the data from a source including multiple approaches for both batch and real-time extraction as well as needing to know about the incremental data loading fitting within small Source windows and parallelization of data loading as well now another small sub-task of data ingestion is data synchronization but because it's such a big issue in the Big Data world we are going to talk about it now since Hadoop and other big data platforms don't support incremental loading of data a data engineer would need to know how to deal with detecting 
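(Since incremental loading and change detection keep coming up, here is a minimal sketch of the watermark idea, using a local SQLite table as a stand-in for the source RDBMS: only rows updated after the last successful run are pulled. The table and column names are invented for the example.)

```python
import sqlite3

# Stand-in source database with an 'updated_at' column we can watermark on
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 250.0, "2023-01-01T08:00:00"),
    (2, 480.0, "2023-01-02T09:30:00"),
    (3, 120.0, "2023-01-03T11:45:00"),
])

def incremental_extract(conn, last_watermark):
    """Pull only the rows changed since the previous run (batch incremental load)."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# The first run already loaded everything up to Jan 1st; this run picks up only the delta
rows, watermark = incremental_extract(src, "2023-01-01T23:59:59")
print(rows)       # rows 2 and 3 only
print(watermark)  # stored away and used as the starting point of the next run
```

So, as said above, a data engineer needs to know how to deal with detecting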
changes in the data source, and with merging and syncing changed data from the sources into the big data environment. Next we have data transformation. This is basically the T in the extract, transform and load that we had discussed earlier; it is basically focused on integration and transformation of data for a specific use case. Now, a major skill set here is the knowledge of SQL: as it turns out, not much has changed in terms of the type of data transformations that people are doing now compared to purely relational environments. Now imagine all this data that you've acquired from various sources; what would you have to do to make it all palatable in the same platform? You need to transform that data, and this is what a data engineer does here. And finally we have performance optimization, which is one of the tougher areas, because anyone can build a slow-performing system; the challenge is to build data pipelines that are both scalable and efficient. So the ability and understanding of how to optimize the performance of an individual data pipeline, and of the overall system, is a higher level of data engineering skill. Now, for example, big data platforms continue to be challenging with regard to query performance and have added complexity to a data engineer's job: in order to optimize the performance of queries and the creation of reports, the data engineer needs to know how to denormalize, partition and index data models. He also needs to understand tools and concepts regarding in-memory models and OLAP cubes. Now let's quickly move ahead and look at the required skills to fulfill these responsibilities. We'll be going through these skills in a clockwise order, so starting with big data frameworks: with the rise of big data in the early 21st century, a new framework was born, and that is Hadoop, all thanks to Doug Cutting for introducing this framework. It not only stores big data in a distributed manner but also processes the data parallelly. There are several tools in the Hadoop ecosystem which cater differently to different purposes and professionals, and for a big data engineer, mastering big data tools is a must. Some of the tools which you will need to master: first of all, you have HDFS, which is the storage part of Hadoop; being the foundation of Hadoop, knowledge of HDFS is a must to start working with the Hadoop framework. Next we have YARN, which performs resource management by allocating resources to different applications and scheduling jobs. Now, MapReduce is a parallel processing paradigm which allows data to be processed parallelly on top of HDFS. Next we have Pig and Hive. Hive is a data warehousing tool on top of HDFS which caters to professionals from an SQL background to perform analytics on top of HDFS, whereas Apache Pig is a high-level platform which is used for data transformation on top of Hadoop. Hive is generally used by data analysts for creating reports, whereas Pig is used by researchers for programming; both are pretty easy to learn if you're already familiar with SQL. Next we have Flume and Sqoop: Flume is a tool which is used to import unstructured data into HDFS, and Sqoop is used to import and export structured data from RDBMS. Next we have ZooKeeper, which acts as a coordinator among the distributed services running in a Hadoop environment; it basically helps with configuration management and synchronizing services. And finally we have Oozie, which is basically a scheduler that binds multiple logical jobs together and helps in accomplishing a complete task. Next up we have real-time processing frameworks. Now, real-time
Next up we have real-time processing frameworks. Real-time processing with quick actions is the need of the hour, whether it is a credit card fraud detection system or a recommendation system. Imagine you wanted a red dress today and Amazon decided to suggest it to you a month later; wouldn't that be completely useless? In such cases you need real-time processing, so it is very important for a data engineer to have knowledge of real-time processing frameworks. Apache Spark is one distributed real-time processing framework that is used rigorously in the industry, and it can be easily integrated with Hadoop, leveraging HDFS as well. Next we have DBMS. A database management system stores, organizes and manages a large amount of information within a single software application, and data engineers need to understand it to manage data efficiently and allow users to perform multiple tasks with ease. This helps data engineers with improved data sharing, data security, data access and better data integration with minimized data inconsistencies; these are fundamentals a data engineer should know before building a scalable, robust and fault-tolerant system. Next we have SQL-based technologies. There are various relational databases used in the industry, such as Oracle DB and Microsoft SQL Server, and a data engineer must know at least one of them. Knowing SQL is also a must: the Structured Query Language is used to structure, manipulate and manage data stored in relational databases, and as data engineers work closely with RDBMSs they need a strong command of SQL.
Next we have NoSQL technologies. As the requirements of organizations have grown beyond structured data, NoSQL databases have been introduced; they can store large volumes of structured, semi-structured or unstructured data with quick iteration and an agile structure, as per application requirements. Some of the most prominently used databases are HBase, Cassandra and MongoDB. HBase is a column-oriented NoSQL database on top of HDFS which is great for scalable and distributed big data stores; it is also great for applications with optimized reads and range-based scans, and it provides consistency and partition tolerance out of CAP. Cassandra is a highly scalable database with incremental scalability, and the best part about Cassandra is the minimal administration and no single point of failure; it is good for applications with fast, random reads and writes, and it provides availability and partition tolerance out of CAP. And finally we have MongoDB, which is a document-oriented, schema-free NoSQL database; it gives full index support for high performance and replication for fault tolerance, it has a master-slave sort of architecture, and it provides consistency and partition tolerance out of CAP. It is rigorously used by web applications and for semi-structured data handling. Next we are going to discuss programming and scripting languages. Various programming languages can serve the same purpose, so knowledge of one programming language is enough; I am saying this because the flavor of the language may change but the logic remains the same. If you are a beginner you can go ahead with Python, as it is an easy language to learn thanks to its syntax and good community support, whereas R has a steep learning curve; it was developed by statisticians and is mostly used by analysts and data scientists.
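Since Python was just recommended and MongoDB was described as a schema-free document store, here is a minimal sketch of touching MongoDB from Python; it assumes a MongoDB server on localhost and the pymongo client, and the database, collection and field names are made up.

```python
# Minimal sketch: inserting and reading documents in MongoDB via pymongo.
# Assumes a local mongod instance; all names and values are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["retail"]

# Schema-free: the second document carries an extra "coupon" field and that is fine.
db.orders.insert_one({"order_id": 1, "item": "red dress", "amount": 49.0})
db.orders.insert_one({"order_id": 2, "item": "shoes", "amount": 30.0, "coupon": "SAVE10"})

# Index support helps read-heavy access patterns, as mentioned above.
db.orders.create_index("order_id")
print(db.orders.find_one({"order_id": 2}))
```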
The next skill we are going to discuss is an important one: ETL, or data warehousing. Data warehousing is very important when it comes to managing a huge amount of data coming in from heterogeneous sources, where you need to apply extract, transform and load. Data warehousing is used for analytics and reporting and is a very crucial part of every business intelligence solution, because this is the part that is going to take you the most time. It is very important for a big data engineer to master one data warehousing or ETL tool; after mastering one it becomes pretty easy to learn new tools, as the fundamentals remain the same. Informatica, QlikView and Talend are very well-known tools used in the industry; Informatica and Talend Open Studio are data integration tools with an ETL architecture, and the major benefit of Talend is its support for the big data frameworks. If you are new to data warehousing and ETL tools I would definitely recommend you start with Talend, because after learning it any other data warehousing tool will become a piece of cake. And finally we have operating systems. Intimate knowledge of Unix, Linux and Solaris is very helpful, as many mathematical tools are based on these systems due to their unique demands for root access to hardware and operating system functionality, above and beyond that of Microsoft Windows or macOS. Some understanding of how to act upon this data is also very valuable for data engineers, so some knowledge of statistical analysis and the basics of data modeling is hugely valuable too. Knowledge of machine learning and cloud will also serve as a big plus: while machine learning is technically something relegated to data scientists, knowledge in this area helps you construct solutions usable by your cohorts, and it has the added benefit of making you extremely marketable, since being able to put on both hats makes you a really formidable asset.
[Music] We start with our first chapter, which is about learning why exactly we need to test big data. Most users might end up asking why exactly we need to test big data: you might have written the queries correctly and your architecture might be just fine, yet there are still many possibilities for failure. Let us take a classic case of a drastic failure that occurred in a bank. The designers of the bank's database wanted to create a phone application that could enable phone banking for the customers, so they named the customer name column CN, the customer bank location pin code CL, the customer ID CID, and the customer phone number CP. Now the bank wants to make key-value pairs of customer ID, which is CID, and customer phone number, which is CP. In this scenario the MapReduce job gets mixed up between the letters P and L, a simple keypad or typing error, and the customers no longer receive their OTPs and phone banking facilities. Just imagine this scenario in a real-time situation; horrible, right? To avoid such mistakes in production we prefer big data testing. So what exactly is big data testing? Big data testing can be defined as a procedure that involves examining and validating the functionality of big data applications. Big data is a collection of a huge amount of data that traditional storage systems cannot handle, and testing such a huge amount of data takes special tools, techniques and terminologies, which will be discussed in the later sections of this tutorial.
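To make the bank scenario above concrete, here is a toy Python sketch of the mapper mix-up; the column names CID, CN, CL and CP follow the story, while the records and function names are made up for illustration.

```python
# Toy sketch of the CP-vs-CL mapper slip from the bank example above.
records = [
    {"CID": "101", "CN": "Asha", "CL": "560001", "CP": "9876543210"},
    {"CID": "102", "CN": "Ravi", "CL": "400001", "CP": "9123456780"},
]

def map_phone(record):
    # Intended key-value pair: (customer ID, customer phone number)
    return (record["CID"], record["CP"])

def map_phone_buggy(record):
    # One-letter slip: CL (location pin code) instead of CP (phone number).
    # The job still runs, but every OTP would be sent to a pin code, not a phone.
    return (record["CID"], record["CL"])

print([map_phone(r) for r in records])        # correct pairs
print([map_phone_buggy(r) for r in records])  # silently wrong pairs
```

The buggy version still runs without any error, which is exactly why this class of mistake needs explicit testing rather than relying on the job completing successfully.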
Followed by this, we shall understand the strategies behind testing big data. Testing an application that handles terabytes of data takes skill of a whole new level and out-of-the-box thinking. The core tests that a quality assurance team concentrates on are based on three scenarios, namely the batch data processing test, the real-time data processing test and, lastly, the interactive data processing test. So what exactly is the batch data processing test? The batch data processing test involves test procedures that run the data when the application is in batch processing mode, where the application is processed using batch processing storage units such as HDFS; it mainly involves running the application against faulty inputs and varying the volume of the data. Next comes the real-time data processing test, which deals with the data when the application is in real-time data processing mode; the application is run using real-time processing tools such as Spark, and it is tested in a real-time environment and checked for stability. And the last one is the interactive data processing test, which integrates real-time test protocols that interact with the application from the point of view of a real-life user; the interactive data processing mode uses interactive processing tools such as HiveQL.
Now moving ahead, we shall understand the different forms of big data. There are three main forms: the structured format, the semi-structured format and the unstructured format. Firstly, what is the meaning of structured data? Any tabular data which is meaningfully organized under rows and columns with easy accessibility is known as structured data; it can be organized under named columns using storage units such as an RDBMS, for example any tabular data stored in an RDBMS. Next, semi-structured data lies between structured and unstructured data: it cannot be directly ingested into an RDBMS, as it includes metadata tags and sometimes duplicate values, so some operations need to be applied to the data before it is ready to be ingested, for example the .csv and .json formats. And lastly, data that does not obey any kind of structure is known as unstructured data; unlike structured data it is difficult to store and retrieve, and most of the data generated by organizations is of the unstructured type, for example image files, video files and audio files. Now let us look at the big data formats in a little more detail, with the sources for each. Structured data is available from sources like data warehouses, databases, ERP and CRM systems; semi-structured data comes from sources such as .csv, .xml and .json files; and the sources for unstructured data are audio files, video files and image files.
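As a small illustration of the difference between these formats, here is a minimal PySpark sketch, assuming a running Spark session; the two HDFS paths and their contents are made up.

```python
# Minimal sketch: reading structured (CSV) versus semi-structured (JSON) input.
# Paths are hypothetical; Spark infers a schema in both cases.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-sketch").getOrCreate()

# CSV: tabular rows under named columns
csv_df = spark.read.csv("hdfs:///data/customers.csv", header=True, inferSchema=True)

# JSON: nested, tag-like structure that Spark flattens into an inferred schema
json_df = spark.read.json("hdfs:///data/clickstream.json")

csv_df.printSchema()
json_df.printSchema()
```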
Now we shall move ahead and understand the big data testing environment. Owning a proper environment for testing big data applications is very crucial, and the basic requirements are as follows: firstly, space for storing, processing and validating terabytes of data should be available; then the cluster and its respective nodes should be responsive; and lastly, data processing resources such as a powerful CPU should be available. With this, let us enter big data testing. The general approach in big data testing involves three stages. First is the data ingestion stage, in which the data is loaded from the source into the big data system using extraction tools; the storage might be HDFS, MongoDB or any other similar store. The loaded data is then cross-checked for errors and missing values; a basic example of a data ingestion tool is Talend. Following the data ingestion phase we have the data processing phase: in this stage the key-value pairs for the data get generated, then the MapReduce logic is applied on the nodes and checked to see whether the algorithm works fine or not, and a data validation process takes place to make sure that the output generated is as expected. And finally, the last stage is the validation of the output. At this stage the output generated is ready to be migrated from the data warehouse; here the transformation logic is checked, the data integrity is verified, and the key-value pairs at the location where the data needs to be dumped are validated for accuracy. So these are the three phases in which big data is tested.
With this, let us enter the next chapter, where we will deal with the different categories in which a big data application can be tested. Firstly, unit testing. Unit testing in big data is completely similar to unit testing in any other application: the complete big data application is divided into segments, each segment is rigorously tested with multiple possibilities for an expected outcome, and if any segment fails, that particular segment is sent back to the development stage for improvement. After unit testing we enter functional testing, which can otherwise be described as the different phases of testing a big data application. A big data application is designed to deal with huge blocks of data, and such a huge volume and variety of data is prone to data issues such as bad data, duplicate values, metadata problems, missing values and whatnot; this is exactly why the pioneers of big data testing designed a procedure for functional testing. The different phases in which the big data is tested are as follows. First is the data validation phase, which deals with the business logic and the layers of the big data application: the data is collected from the source and run against the business use case, and it is checked for accuracy and for its movement through the different layers of the application. At this stage the big data is tested with aggregation and filtering mechanisms, and the data undergoes end-to-end validation and transformation logic based on business rules. The next stage is the data integrity phase: data is checked for completeness with referential integrity validation, data constraints and duplication are verified against error conditions, and boundary testing recognizes the schema limits for each layer. The next phase is the data ingestion phase; this is a very important phase where the data gets ingested into the Hadoop ecosystem. The ability of the application to connect with different data modules is checked here, the data is replayed with messaging systems, and loss of data is monitored. The main motto of this phase is to achieve fault tolerance, continuous data availability and, lastly, a stable connection with a variety of data streams.
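As a minimal illustration of the unit-testing idea above, here is a pytest sketch; the extract_pairs function is a made-up stand-in for one small segment of a pipeline, echoing the bank example from earlier.

```python
# Minimal sketch: unit testing one segment of a pipeline with pytest.
# extract_pairs is hypothetical; a real job would be far more involved.
import pytest

def extract_pairs(record):
    """Build the (customer ID, phone number) pair a downstream job expects."""
    if not record.get("CP"):
        raise ValueError("missing phone number")
    return (record["CID"], record["CP"])

def test_valid_record_produces_expected_pair():
    assert extract_pairs({"CID": "101", "CP": "9876543210"}) == ("101", "9876543210")

def test_missing_phone_is_rejected():
    with pytest.raises(ValueError):
        extract_pairs({"CID": "102", "CP": ""})
```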
Followed by the data ingestion phase, we have the data processing phase. The data processing phase carefully examines and executes the business logic: the business rules are cross-validated, the MapReduce logic is validated at every stage, data is processed from end to end, and the application is checked to ensure exceptions are handled perfectly. Followed by the data processing phase we have the data storage phase, which concentrates on the following parameters: read and write timeouts, continuous data availability, load balancing and, finally, query performance analysis. After all these stages we have one final stage, called the report generation phase. It is the final stage in functional testing and it deals with data validation of measures and dimensions, real-time reporting, data drill-up and drill-down mechanisms and, lastly, the business reports and charts. So that was functional testing; now let us move ahead and understand non-functional testing.
The non-functional testing phase takes care of three major dimensions and characteristics of big data, which are the volume, velocity and variety of big data, and there are five stages involved: data quality monitoring, infrastructure, data security, data performance and the failover testing mechanism. Firstly, data quality monitoring checks for erroneous data records and messages and makes sure that the following parameters about the data are in place: data accuracy, data precision, data timeliness, data consistency and data profiling. The next stage is infrastructure: it ensures continuous service availability in both external and internal big data processing applications, and it also takes care of the data replication factor, data backup and data restore points. Followed by infrastructure we have data security, which is considered the most important aspect of a big data application. The data security stage protects sensitive data, manages user authentication and checks user role-based authorization; it also takes care of data encryption and the masking of personal information. Followed by data security we have data performance. Data performance evaluates every single component: it evaluates the maximum data processing speed and the maximum data capacity, checks the message transfer speed and response time, calculates the number of operations performed per unit time, engages parallel job monitoring and, finally, performs read, write and update operations on real-time databases. And lastly we have the failover test mechanism, which ensures seamless data processing while switching to neighboring data nodes; it creates data recovery points and in parallel stays ready for any unexpected calamities, it is able to replay the data using multiple offsets, and it enables dynamic clustering. With this, let us move ahead into the next type of testing, which is performance testing. Performance testing highly concentrates on the performance delivered by all the components in the big data system, and it includes the following categories: firstly the data collection phase, then the data ingestion phase, the data processing phase and, finally, the component testing phase.
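As a small, concrete example of the data quality monitoring checks just listed, here is a pandas sketch over a tiny made-up batch; a real pipeline would run equivalent completeness, duplication and accuracy checks at cluster scale.

```python
# Minimal sketch: completeness, duplication and accuracy checks with pandas.
# The batch and column names are made up for illustration.
import pandas as pd

batch = pd.DataFrame({
    "CID": ["101", "102", "102", "104"],
    "CP":  ["9876543210", None, "9123456780", "9123456780"],
})

nulls = batch["CP"].isna().sum()                # completeness: missing phone numbers
dupes = batch.duplicated(subset=["CID"]).sum()  # duplication: repeated customer IDs
# accuracy: phone numbers must be exactly 10 digits (the missing value fails too)
bad_format = (~batch["CP"].astype(str).str.fullmatch(r"\d{10}")).sum()

print(f"missing phones: {nulls}, duplicate IDs: {dupes}, malformed phones: {bad_format}")
```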
Firstly, we shall understand the data collection phase. In this stage the big data system is validated based on its speed and capacity to grasp the data within a given frame of time; data can be collected from different sources such as RDBMS databases, data warehouses and many more. The next stage deals with data ingestion: here the application is tested and validated based on its speed and capacity to load the collected data from the source into the destination, which might be HDFS, MongoDB, Cassandra or another similar storage unit. The third stage is the data processing stage: here the application is tested based on the MapReduce logic written, the logic is run against every single node in the cluster, and the processing speeds are validated; the queries to be executed are expected to perform at high speed and with low data latency. And finally we have component testing, which relates to component performance: each component in the system should be highly available and connected, the component backup should come online when any node fails, and high-capacity data exchange should be supported smoothly. Now with this, let us move ahead and understand the performance testing approach, which can be understood through the following flow: the procedure begins by establishing a big data cluster and running the application; then the big data developer designs the workload required to run the test; in the next stage we involve the clients in the test and take their feedback; after that we execute the application with data and analyze the results; and if we find the application to be performing with optimum stability, the process is finished, else we apply the required modifications and retest the application. So that is the performance testing approach.
Followed by this, we shall understand the parameters involved in performance testing. They are: data storage, which takes note of the way the data gets stored in the system; commit logs, which mark the limits for committing logs; concurrency, which checks the number of threads allocated for the read and write processes; caching, which includes the dedicated row cache and key cache; and finally timeouts, which set the timers for the application related to connections, queries and so on. Followed by performance testing we have architecture testing, which concentrates on establishing a stable Hadoop architecture. The architecture of a big data processing application plays a key role in achieving smooth operations, and a poorly designed architecture leads to chaos, which might mean performance degradation, node failures, high data latency and high maintenance. So that is the kind of chaos that may show up if you have a poor architecture in your big data application. Followed by this, we shall understand the big data testing tools, which are majorly classified into four categories: big data ingestion tools, big data processing tools, big data storage tools and, lastly, big data migration tools. Now let us look at examples of each of them. Firstly, the big data ingestion tools: ZooKeeper, Kafka and Sqoop are the best examples for data ingestion.
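As a minimal sketch of what pushing test records into one of these ingestion tools can look like, here is a Python snippet assuming a Kafka broker on localhost:9092 and the kafka-python package; the topic name and payload are made up.

```python
# Minimal sketch: producing synthetic records into a Kafka topic for ingestion testing.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(100):
    producer.send("test-ingestion", {"event_id": i, "payload": "synthetic record"})

producer.flush()  # make sure everything reached the broker before the test asserts on it
```

Varying the number and shape of these synthetic records is one simple way to exercise the volume and faulty-input scenarios described earlier.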
Next we have data processing, and the popular tools used in big data processing are MapR, Hive and Apache Pig. The next kind of tools are the data storage tools, and the most famous examples for data storage are Amazon S3 and HDFS. Finally we have the data migration tools, and popular examples for data migration are Talend and CloverDX. Now with this, we shall move ahead and understand the challenges faced in big data testing. The key challenges are these: big data testing is highly complicated and the process requires highly skilled professionals; automated big data testing procedures are predefined and not suited for unexpected errors; virtual machine latency creates latency in tests, and managing multimedia data is a big hassle; the sheer volume of the data is one major challenge in itself; the testing environment and automation must be developed for different platforms; each component comes from a different technology, hence it requires isolated testing and no single tool can perform end-to-end testing; a high degree of scripting is required for designing test cases; and finally, customized solutions are required to increase performance and test the critical areas. With this, we shall move ahead into the last topic, which deals with the differences between traditional testing and big data testing. The first difference is that big data testing supports all types of data, whereas traditional testing supports only structured data. The next difference is that big data testing requires research and development, whereas traditional testing does not. The third difference is that the data size is unlimited in big data testing, whereas the data size is limited in traditional testing. The fourth difference is that big data testing requires a special environment, whereas traditional testing does not require any kind of special environment. And the last difference is that only highly skilled and qualified candidates can perform big data testing, whereas for traditional testing basic operational knowledge is enough to run the tests.
[Music] Let's take a look at the various application domains that big data offers to industries. As big data is growing at an exponential rate, various fields in day-to-day life are using it to ease the process of storing and processing data. Here I have listed a few of the sectors that have been implementing big data, like healthcare, education, e-commerce, government, IoT and even media and entertainment. Now let's look at each of these domains in depth. First, let's see how big data is used in healthcare industries. You all know there is a huge amount of data generated in healthcare, and that data includes patient records, transactions, research data and much more. Traditionally, healthcare failed to use big data because it had only a limited ability to store and consolidate the data. Okay, I got a query from Piyush: he asks how big data analytics has improved healthcare industries. Thank you, Piyush, for this question. Big data analytics has improved healthcare by providing personalized medicine and prescriptive analytics; not only that, researchers are also mining the data to see which kinds of treatments are more effective for particular conditions, identifying patterns related to drug side effects, and then providing solutions that can help patients and reduce the cost.
Also, with the adoption of mHealth, eHealth and wearable technologies, the volume of data is increasing, and that includes electronic health record data, CRM data (that is, customer relationship management data), fitness trackers, historic patient research data, purchase data and much more. Now let me tell you why demographic data plays a vital role in healthcare industries: here we map the healthcare data sets with the geographical data sets, and by doing that it is possible to predict which diseases will escalate in specific areas; based on such predictions it becomes very easy to strategize the diagnostics and plan for stocking serums and vaccines. So this is how big data analytics has enhanced the healthcare industries. Now let's see how big data is used in real-world clinical analytics. A healthcare organization wanted to replace its legacy data warehousing solution with a data lake that could manage high volumes of data, and for this it selected a company called CTS Tech, a specialist provider for healthcare, to build the solution. CTS Tech designed the solution based on the Cloudera Hadoop distribution, MapReduce, Spark Streaming and other Hadoop technologies. Here is how it works: we have diagnostic results and billing messages, which are the data sources from which we get the data, and this data is injected into the data ingestion stream, where it undergoes Spark Streaming to produce real-time data streams capable of processing 20,000 records per second. These streams are then landed in the Cloudera landing zone, where the data undergoes deduplication, cleansing and standardization, and once that is done it is processed using a MapReduce job in the data processing stream. The processed data is queried using Cloudera Impala, the messages from Impala are populated and stored in the IBM unified data model for healthcare organizations, and from there the results are visualized on dashboards to arrive at the solution. So this is how CTS Tech used big data analytics to deliver a real-world clinical analytics solution.
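Here is a minimal PySpark sketch of the ingest-and-deduplicate step described in that pipeline, assuming a Spark session and a directory of JSON-per-line messages; the paths, schema and column names are made up, and a real deployment would typically read from a streaming source such as Kafka rather than a folder.

```python
# Minimal sketch: streaming ingestion with de-duplication in Spark Structured Streaming.
# All paths and fields are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("clinical-ingest-sketch").getOrCreate()

schema = StructType([
    StructField("message_id", StringType()),
    StructField("patient_id", StringType()),
    StructField("result", StringType()),
])

stream = (spark.readStream.schema(schema).json("hdfs:///landing/results_json/")
          .dropDuplicates(["message_id"]))  # the de-duplication step from the landing zone

query = (stream.writeStream
         .format("parquet")
         .option("path", "hdfs:///curated/results/")
         .option("checkpointLocation", "hdfs:///chk/results/")
         .start())
# query.awaitTermination() would block here in a real job
```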
Now let's see how big data is used in the education sector. Big data is revolutionizing the way we manage education, and I have jotted down a few of the ways here. First, let me tell you how it is used in improving the evaluation of student results. Can you guess the only way to assess the performance of a student? The only measurement is the answers they write in assignments and exams, correct? However, during his or her life each student generates a unique data trail, and by analyzing this data trail in real time one can understand the individual behavior of students, which helps to create an optimal learning environment. It is also possible to monitor student actions, such as how long they take to answer a question, which sources they use for exam preparation, which questions they skip and many more; by considering all these factors one can improve the evaluation of student results. Next, analyzing and creating custom programs. We have lakhs of students across universities, but customized programs can still be created for every individual student. Wondering how that is possible? With the help of a technique called blended learning, which is simply a combination of online and offline learning; it gives students the opportunity to follow the classes they are interested in while still having the possibility of offline guidance from professors. Next, how does big data help us reduce dropout rates? As we have already discussed, it helps improve the evaluation of student results, so it is an obvious consequence that dropout rates at schools and colleges would also reduce. Educational institutions use predictive analytics on all the data they collect to get insights on future student outcomes, and such predictions also help to run a scenario analysis on a course or program before it is introduced into the curriculum, which minimizes the need for trial and error. Next, let's see how to compute the marks of students. Say one attendee scores 90 marks in mathematics and 60 in geography, while Chaitra scores more in geography and less in mathematics. Big data analytics helps to combine and analyze all the data of the students, and based on that, attention can be given to each student in the subject where they lag, along with insights on how to improve their performance. So these are the various ways big data analytics is used in the education sector.
Now let's see a case study of IBM in education: analytics is used to monitor individual student performance to prevent attrition from a course or program, to identify outliers for early intervention, and to identify and develop effective instructional techniques for testing and evaluating the curriculum. In IBM's learning analytics flow model, we capture instructional transactions as they occur in a time-sensitive learning application, which is possible within a learning management or course management system. Using the full capabilities of a learning management system, a 15-week online course is generated, and this course generates thousands of transactions per second; we then perform real-time analysis on these transactions to feed a learning analytics app. To process all this data we need a big database system, and since it is not possible to process it without analytics software, Apache Spark is used for the data processing. We then analyze the transactions, establish patterns and arrive at decisions and courses of action, which are then introduced into the curriculum or the course management system. So this is how IBM has made use of big data analytics in its learning analytics flow model. Now let's see how big data is used in e-commerce industries. Big data is a game changer when it comes to retail and e-commerce: retailers and e-commerce brands are using more and more analytics to drive strategic action and offer a better customer experience. Here are a few of the uses. First, it is used to predict trends: a trend forecasting algorithm combines data from social media posts and web browsing habits to identify what is causing a buzz, and to know which products are being discussed most online we perform a sentiment analysis task; based on the sentiment analysis and the trend forecasting algorithm, it becomes very easy to predict the trends in the market.
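As a toy illustration of the sentiment-analysis idea just mentioned, here is a tiny pure-Python sketch that counts made-up positive and negative keywords; real systems use trained models on far larger vocabularies, so treat this only as a sketch of the intuition.

```python
# Toy sketch: keyword-based sentiment scoring over social posts.
POSITIVE = {"love", "great", "awesome", "recommend"}
NEGATIVE = {"hate", "broken", "refund", "disappointed"}

def sentiment(post: str) -> int:
    words = set(post.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

posts = [
    "I love this red dress, great fit",
    "Package arrived broken, want a refund",
]
buzz = sum(sentiment(p) for p in posts)
print("net sentiment for the product:", buzz)
```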
Next, optimized pricing. Big data enables retailers to identify the best price for goods by tracking transactions, competitors and the cost of goods, and retailers can also map the rise and fall of demand and match the pricing accordingly. Now let's see how analytics has helped Amazon forecast its demand: analytics enables Amazon to predict the traffic on its website along with the possible conversion rate, and through the Amazon Web Services cloud the business has the flexibility to scale up in real time; because it can scale up in real time, analytics enables Amazon to forecast its demand. Next, how can one create a personalized store? There are fast web server technologies, and when these are combined with big data, businesses can generate dynamic websites filled with relevant products based on the historic behavior of a consumer and their personal preferences; by grouping the personal preferences and the historic behavior of a consumer, one can create a personalized store. Next, customer service: customer service is available 24/7 in all e-commerce industries, and big data analytics allows businesses to optimize it. How? It compiles the data from previous online and offline transactions, social media information, purchase history and more, and with this businesses can create a 360-degree view of the customer and provide an enhanced customer experience. Next, sales generation: the main motto of every e-commerce business is to sell its goods and products, and retailers use big data to offer a personalized experience and prevent potential abandonment, which leads to a greater number of sales. So these are a few of the uses in e-commerce industries.
Let's see a real-time use case in e-commerce. Here we have a user who communicates with the e-commerce server portal. In the data stream, data is collected from various sources, including customer information, purchase history, reviews and more, and we have something called an input selection module to remove the noise from the data. This input selection module is based on two factors: first, singular value decomposition, and second, dimensionality reduction. What is singular value decomposition? It is a technique used to speed up recommendations with very fast online performance, requiring just a few simple arithmetic operations, so the data used to speed up the online performance is selected. And what is dimensionality reduction? Here we reduce the number of random variables under consideration that are not required in the data. Based on these two factors, the input selection module removes the noise from the data. This noise-free data is fed as HDFS input in the Hadoop stream, where it undergoes a MapReduce job and produces HDFS output; the data we get after applying the MapReduce job is ready for analysis, so in the data analytics stream we use the R tool to analyze the data, and from there we arrive at textual and graphical reports, which are the reports used in predicting trends, forecasting demand and generating sales.
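Here is a minimal NumPy sketch of the singular value decomposition and dimensionality reduction idea behind that input-selection module, on a tiny made-up user-item matrix; real recommenders work on far larger, sparser matrices.

```python
# Minimal sketch: low-rank approximation of a user-item matrix via SVD.
import numpy as np

# rows = users, columns = products, values = interaction scores (made up)
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 0.0, 5.0, 4.0],
])

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

k = 2  # keep only the two strongest latent factors (dimensionality reduction)
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(approx, 2))  # denoised matrix a recommender could score against
```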
Now let's see how the government sector has made use of big data analytics. In government use cases the same data sets are often applied across multiple applications, which requires multiple departments to work in collaboration. Okay, I got a query from Kalgi: she asks why big data analytics is needed in cyber security and intelligence. I'll surely tell you, but for now let's see how it is used in traffic optimization. Big data helps to aggregate the real-time traffic data generated from road sensors, GPS devices and video cameras, and potential traffic problems in dense areas can be prevented by adjusting public transportation routes in real time. Now I will tell you why it is used in cyber security and intelligence. You all know that cyber attacks are increasing in volume and complexity, which is becoming a tedious task for traditional analytics tools; companies have to protect themselves against all kinds of attacks and also need to be able to detect and respond fast. This is the PDR paradigm, that is prevent, detect and respond, and in order to prevent, detect and respond quickly, the big data analytics approach comes in; these challenges can be overcome with the help of big data analytics, which, for example, is used in enterprises to fight cyber threats. I hope you got an idea about it, Kalgi. Now let's see how it is used in crime prediction and prevention: police departments use advanced and real-time analytics to understand criminal behavior, identify crime and uncover location-based threats. Next, weather forecasting: the National Oceanic and Atmospheric Administration gathers data every minute of every day from land, sea and space-based sensors, and on a daily basis it uses big data to analyze and extract value from over 20 terabytes of data. Next, drug evaluation: the National Institutes of Health use big data technologies to access large amounts of data to evaluate drugs and treatments. Next, tax compliance: big data applications can be used by tax organizations to analyze both unstructured and structured data from a variety of sources in order to identify suspicious behavior and multiple identities, which helps in tax fraud identification. So these are the various areas where big data is used in government.
Now let's see how it is useful in an e-governance portal. What is an e-governance portal? It is an application that uses electronic communication devices, computers and the internet to provide public services to citizens and other people. A citizen or an enterprise user requests certain services via the e-governance portal; these user requests are sent to the big data infrastructure to be processed, and this big data infrastructure has a database that is not centralized, so it integrates with the databases of different ministries, government agencies and local authorities. Both databases are then deployed on a cloud infrastructure, which in this solution is used to reduce the cost of storage, and the Hadoop framework is used to process the data: we run Pig and Hive queries on the Hadoop data platform, store the data into HDFS, and then the response is sent back. So the workflow is this: a user sends a request to the network using an appropriate interface, in this case the web application he interacts with; the request is forwarded to the big data infrastructure, the processing happens using these technologies, and the response is sent back. This is the internal working, but for the user, when he interacts with the web application he gets a response immediately. So this is how big data analytics is used in an e-governance portal. Now let's see how big data is used in IoT: data extracted from Internet of Things devices provides a mapping of device interconnectivity.
Such mappings have been used by various companies and governments to increase efficiency, and IoT is also increasingly adopted as a means of gathering sensory data, which is used in the healthcare industry, retail, vehicles, communication and many more areas. IoT represents the connected devices, but as there is an enormous amount of data flowing from these devices, big data technologies are used to store and process it. Let's take a look at the smart city concept. The combination of IoT and big data is a largely unexplored research area that has brought new and interesting challenges for achieving the goals of future smart cities. You might have heard of Amsterdam: it is the pioneer of the smart city concept, and it has introduced the concept in smart government, smart health, smart retail, smart agriculture and many more areas. I got one more query here, from Geno: he asks how Amsterdam has implemented the smart city concept in the smart home. Geno, I will explain now. In the smart home they use a wearable technology: a person in the home wears a fitness band, and the techies have programmed and integrated this smart device in such a way that when the person wakes up the lights automatically turn on and the coffee machine starts to brew the coffee, and vice versa, when the person wearing the fitness band goes back to sleep, the lights turn off. I hope you got an idea about it, Geno. Now let's move further. In this setup, various smart applications exchange information using embedded sensor devices and other devices integrated with a cloud computing infrastructure, generating large amounts of unstructured data. These large amounts of unstructured data are collected and stored in a cloud or a data center using a distributed, fault-tolerant database, that is NoSQL, which is used to improve a single service or application and is shared among various devices. We also have a programming model for processing large data sets with a parallel, distributed algorithm, which can be used for data analytics to obtain value from the data, and we have query engines like Hive and Mahout to query and structure the data. So this is how cloud and big data are both used in the smart city concept.
Last but not least, let's see how big data is helpful in media and entertainment. Various companies in the media and entertainment industries, like publishers, broadcasters, cable companies and YouTubers, are facing new business models for the way they create, market and distribute their content, and that is happening because of current consumer search behavior and the requirement to access content anywhere, anytime and on any device. Big data provides actionable points of information about millions of individuals, and all these insights are gathered through various data mining activities. Big data applications benefit the media and entertainment industry in media scheduling, ad targeting, content monetization, audience interest analysis and customer churn prevention. Now let me tell you how to prevent customer churn. Customer churn is a serious menace that media companies find almost impossible to tackle, and it has been found that at least 30 percent of customers share their views through social media; until big data arrived, combining and making sense of this user data was next to impossible.
With the advent of big data analytics it is possible to know why customers subscribe and unsubscribe and what kinds of programs they like and dislike with crystal-clear clarity, and by analyzing all these factors it becomes very easy to prevent customer churn. Let's take a look at the Netflix example in big data. Netflix has tons of user data in its database systems; this data is processed and analyzed and patterns are recognized from it. Now a new user comes and searches for a video in the Netflix search engine, and based on the video preferences and the user's choices the videos are ranked, the user experience is tailored, the user continues to watch trending videos, and we arrive at the video similarity algorithm. Now let's see how the decision is made and the video ranking is given in the Netflix decision-making data framework. Netflix uses Cassandra, a NoSQL database, because it is highly scalable and strong on performance. Then we have Priam, the Cassandra helper and cluster management tool that simplifies Netflix's Cassandra administration and also uses an API to query the Cassandra metrics. Next we have the SSTables in S3: the Cassandra data is stored in SSTable format in Amazon S3. Then we have Aegisthus, a platform built by Netflix engineers to work on Hadoop MapReduce, which converts the Cassandra SSTables in S3 into a queryable format, and we have the S3 JSON, that is JavaScript Object Notation, used to structure the query data. Based on all these pieces the decision is made and the video ranking is given, so if you go back, you continue to watch the trending videos and you get the video similarity algorithm. This is how Netflix has used big data analytics.
Now let's move further and see the scope of big data, which will make a massive impact in the near future. First, 6X growth: it is an obvious fact that the amount of data generated in the next five years will be six times more than what was generated in the past five years, so imagine the huge amount of data we will be having; not only that, Hadoop adoption is growing about 29 times faster than the US GDP. Next, an open-source supernova: let me bring out the comparison here. In 1998 only 10 percent of companies used open-source software, that gradually increased to 50 percent in 2011, now 78 percent of companies use open-source software, and in the coming years nearly all companies will be using it; these advancements are because of the growth of big data in the market. Next, why is it termed a meteoric innovation? In 2008 there were already more things connected to the internet than people, now IoT represents over 6.48 billion connected devices, by 2020 the IoT market was expected to connect over 21 billion devices, and by 2030 self-driving cars are expected to rule the roads of the world; that is why it is termed a meteoric innovation. And next, a 3X increase in profit: based on the above three trends, the 6X growth, the open-source software and the meteoric innovation, it is an obvious fact that as the technologies evolve, growth will follow, and that leads to a 3X increase in the profit rate as well; in the next few years the profit rate will tend to increase to three times what it is in today's world. So this is why we say that big data is going to make a massive impact in the near future.
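Before moving on, here is a minimal sketch of the kind of Cassandra access described in the Netflix example, assuming a local Cassandra cluster and the DataStax cassandra-driver package for Python; the keyspace, table and values are made up for illustration.

```python
# Minimal sketch: writing and reading viewing events with the Cassandra Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS views
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS views.watch_events (
        user_id text, video_id text, watched_at timestamp,
        PRIMARY KEY (user_id, watched_at)
    )
""")

# Insert one viewing event and read the user's history back.
session.execute(
    "INSERT INTO views.watch_events (user_id, video_id, watched_at) "
    "VALUES (%s, %s, toTimestamp(now()))",
    ("u42", "house_of_cards_s01e01"),
)
for row in session.execute(
        "SELECT video_id FROM views.watch_events WHERE user_id = %s", ("u42",)):
    print(row.video_id)
```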
[Music] Why do you think big data analytics is so important, and why should we study this topic and know what exactly it is? Let me tell you why. Just like the entire universe and our galaxy are said to have formed due to the Big Bang explosion, data has also been growing exponentially, leading to an explosion of data, and this can simply be termed big data. We are creating about 2.5 quintillion bytes of data every day, and one quintillion amounts to around 10 to the power of 18 bytes, so you can do the math and imagine the amount of data we are creating every day. This data, as you can see from the image I have depicted here, is coming in from various sources, whether from social media, banking sectors, governments or various other institutions, and since it comes from various sources it is not all in the same format. So is big data only about the volume, the enormous amount being generated, or is it defined by various other characteristics as well? I have put down four reasons here to tell you why it is so important and how it is helping many organizations around the globe. The first reason I have stated is making smarter and more efficient organizations. Big data analytics is contributing highly to this, and organizations are adopting it to enable faster decision making. One example I came across that I wanted to share with you is the New York Police Department, the NYPD. Big data and analytics are helping the NYPD and other large police departments anticipate and identify criminal activity before it occurs: they use big data technology to geolocate and analyze historical patterns, and they map these historical patterns against sporting events, paydays, rainfall, traffic flows and federal holidays. Essentially, the NYPD is utilizing these data patterns, scientific analytics and technological tools to do its job to the best of its ability. By using a big data and analytics strategy, the NYPD was able to identify crime hot spots, the places where crime occurrence was higher, and from there they deployed their local officers so that they could reach those places in time, before the crime was actually committed. So this is how the NYPD utilizes the field of big data analytics to prevent crime and make New York a safer place. Now, after exploring the first reason, let's move on to the second one: optimizing business operations by analyzing customer behavior. The best example of this is Amazon. We all know how popular Amazon is and how much we use it on a daily basis. Amazon uses our clickstream data and the historical purchase data of the more than 300 million customers who have signed up, and it analyzes each user's data, how they click on different products and how they navigate through the site, and then it shows each user customized results on customized web pages.
After analyzing all these clicks of every visitor on their website, they are able to better understand the site navigation behavior, the paths people take to buy their products and services, what else a customer looked at while buying a product, and the paths that led a customer to leave the page. This information helps Amazon improve the customer experience and expand its customer base. Now let's see what the third reason is. Big data technologies like Hadoop and cloud-based analytics significantly reduce your cost of storing big data, because buying huge servers and machinery to store it would cost you a lot; Hadoop stores big data in a distributed fashion so that you can process it in parallel, and by using commodity hardware it reduces costs significantly. That brings us to our third reason, and you must have guessed it already: cost reduction. Now let us see how healthcare is using big data analytics to curb its costs. Using new data tools that send automatic alerts when patients are due for immunizations or lab work, more and more physicians can reduce hospitalizations by practicing better preventive care. Patients have also started using new sensor devices at home and on the go; these devices deliver constant streams of data that can be monitored and analyzed in real time, and they help patients avoid hospitalization by self-managing their conditions. For hospitalized patients, physicians can use predictive analytics to optimize outcomes and reduce readmissions. Parkland Hospital in Dallas, Texas, is one such example: it has been using analytics and predictive modeling to identify high-risk patients and predict likely outcomes once the patients are sent home, and as a result Parkland has been able to reduce 30-day readmissions to Parkland and all area hospitals for Medicare patients with heart failure by around 31 percent. For Parkland that is an estimated saving of about five hundred thousand dollars annually, not to mention the savings that patients realize by avoiding those readmissions. So this is how healthcare is widely using big data analytics to reduce its costs significantly. Now let's move forward and see the last reason why big data analytics is so essential: next-generation products. Big data analytics is really contributing to generating high-tech products that satisfy customer needs in new ways, and I have cited three examples here for you. The first example is the Google self-driving car; I am sure most of you have heard about it. The Google self-driving car makes millions of calculations on every trip that help the car decide when and where to turn, whether to slow down or speed up, and when to change lanes, the same decisions a human driver makes behind the wheel, and it does this with the help of big data analytics. Another example of a self-driving car is the Toyota Prius, which is fitted with cameras, GPS as well as powerful computers and sensors to safely drive on the road without the intervention of human beings.
So this is how big data is really contributing to making such high-tech products, which in the long run we will probably be using and which will make our lives easier. Now moving on to the second product I am going to cite here, and it is a really fascinating one: let me ask you, how many of you love watching TV shows and prefer spending your weekends doing nothing but Netflix and chill? Let me guess, almost all of us do; I love binge-watching shows over the weekend, so by now you will have guessed the example I am arriving at: it is Netflix. Netflix committed to two seasons of its extremely popular show House of Cards without even seeing a single episode, and this two-season project cost Netflix about 100 million dollars. So how do you think Netflix was able to take such a big monetary risk? The answer, my friends, is big data analytics. By analyzing viewer data, the company was able to determine that fans of the original House of Cards, which aired in the UK, were also watching movies starring Kevin Spacey, who plays the lead in the show, and directed by David Fincher, who is also one of the show's executive producers. Basically, Netflix is analyzing everything, from what show you are watching to when you pause it or even turn it off. Last year Netflix grew its US subscriber base by around 10 percent and added nearly 20 million subscribers from all around the globe. How fascinating is that? I am sure that the next time you are watching a show on Netflix you will be really happy, because you already know how the back end is working and how Netflix is recommending new shows and movies to you. Now moving on to the third example I have cited here, one of the really cool things I have come across: a smart yoga mat. It has sensors embedded in the mat which are able to provide feedback on your postures, score your practice and even guide you through an at-home practice. The first time you use your smart mat it takes you through a series of movements to calibrate your body shape, size and personal limitations; this personal profile information is then stored in your Smart Mat app, which helps the smart mat detect when you are out of alignment or balance, and over time it automatically evolves with updated data as you improve your yoga practice. So with these interesting and exciting examples I am sure you now have an idea of what big data analytics is doing and how it is improving the sales and marketing of various organizations. Now let's move forward and finally, formally define what big data analytics is. [Music] So, what is big data analytics? Big data analytics examines large and different types of data to uncover hidden patterns, correlations and other insights. Basically, big data analytics is helping large companies facilitate their growth and development, and it majorly involves applying various data mining algorithms on a given set of data, which then aids these organizations in making better decisions.
Now that you know why we need big data analytics and what exactly it is, let us see and explore the different stages involved in this procedure. [Music] These are the different stages involved in the entire procedure. The first stage is identifying the problem: what is the problem we need to solve? This is of course the most important step and the first step of the process. The second step is to design our data requirements: after identifying the problem, we need to decide what kind of data is required for analyzing it. The third step is pre-processing, in which the cleaning of the data takes place and some amount of processing is performed. After the pre-processing stage we come to the fourth stage, which is the analytics stage, where we analyze the processed data using various methods. After the analytics stage we move to the final stage, which is data visualization, where we visualize the data using tools like Tableau and AngularJS; the visualization of data only takes place at the end. So these are the basic five stages in the entire procedure. Now that you have understood this, let's move forward and understand the different types of big data analytics. [Music] There are four basic types: descriptive analytics, predictive analytics, prescriptive analytics and diagnostic analytics. Let us understand the first type, descriptive analytics. Descriptive analytics answers the question, what has happened? It uses data aggregation and data mining techniques to provide insight into the past and then answers what is happening now based on the incoming data. Basically, descriptive analytics does exactly what the name implies: it describes, or summarizes, the raw data and makes it something that is interpretable by humans, and the past I just referred to in this context can be one minute ago or even a few years back. The best example I can cite here for descriptive analytics is the Google Analytics tool. Google Analytics aids organizations and businesses by analyzing their results: the outcomes help businesses understand what has actually happened in the past and validate whether a promotional campaign was successful or not based on basic parameters like page views, so descriptive analytics is an important source for determining what to do next. Another example is what we saw earlier in the next-generation products, Netflix: Netflix uses descriptive analytics to find the correlations among the different movies a subscriber is watching, and to improve the recommendation engine it uses historic sales and customer data. So that is descriptive analytics; now let's move forward to the second type, predictive analytics. Predictive analytics uses statistical models and forecasting techniques to understand the future and answer what could happen; as the word suggests, it predicts, and through predictive analytics we are able to understand the different possible future outcomes. Predictive analytics provides companies with actionable insights based on data: through sensors and other machine-generated data, companies can identify when a malfunction is likely to occur, so the company can preemptively order parts and make repairs to avoid downtime and losses.
The company can then preemptively order parts and make repairs to avoid downtime and losses. An example of this type of analytics is Southwest Airlines: Southwest analyzes the sensor data on its planes to identify potential malfunctions or safety issues. This allows the airline to address possible problems and make repairs without interrupting flights or putting passengers in danger, so it is a great use of predictive analytics to reduce downtime and losses, as well as to prevent delays and incidents. Now let's move forward to the third type, which is prescriptive analytics. Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and answer the question "what should we do?" It allows users to prescribe a number of different possible actions and then guides them towards a solution, so in a nutshell these analytics are all about providing advice. Prescriptive analytics uses a combination of techniques and tools such as business rules, algorithms, machine learning, and computational modeling procedures, and these techniques are applied against input from many different data sets, including historical and transactional data, real-time data feeds, and big data. These analytics go beyond descriptive and predictive analytics by recommending one or more possible courses of action, and the best example for this is the Google self-driving car, which we also saw in the new-generation products section. The Google self-driving car analyzes the environment and then decides the direction to take based on the data: it decides whether to slow down or speed up, whether to change lanes, whether to take a longer route to avoid traffic or prefer a shorter route, and so on. In this way it functions just like a human driver, by using data analytics at scale. Prescriptive analytics is a slightly complex type of analytics and is not yet adopted by all companies, but when implemented correctly it can have a large impact on how businesses make their decisions. Now let's move on to our last type, which is diagnostic analytics. Diagnostic analytics is used to determine why something happened in the past, and it is characterized by techniques like drill-down, data discovery, data mining, and correlations. Diagnostic analytics takes a deeper look at the data to understand the root cause of events; it is helpful in determining what kind of factors and events contributed to a particular outcome, and it mostly uses probabilities, likelihoods, and the distribution of data for the analysis. For example, in a time series of sales data, diagnostic analytics will help you understand why the sales of a company decreased or increased in a particular year. An example of diagnostic analytics could be a social media marketing campaign: you can use it to assess the number of posts, mentions, followers, fans, page views, reviews, pins, and so on, and then analyze the failure and success rate of a campaign at a fundamental level. Thousands of online mentions can therefore be distilled into a single view to see what worked in your past campaign and what did not. So now that we have seen all four types, I hope you've understood the different examples and the differences between them. Now let's move forward and have a look at the
tools required for big data analytics. These are some of the tools I have listed here; there are more tools used for big data analytics, but let's explore the ones I have mentioned: Hadoop, Apache Pig, Apache HBase, Apache Spark, Talend, Splunk, Apache Hive, and Kafka. Let me start with the first one, Hadoop. Hadoop is a framework that allows you to store big data in a distributed fashion so that you can process it in parallel. Apache Pig is a platform that is majorly used for analyzing large data sets and representing them as data flows; Pig is used for scripting, and its language is Pig Latin. Now coming to Kafka: Kafka is a messaging system. What is a messaging system? A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data and do not need to worry about how to share it — that is what Kafka does. Coming to Apache Hive: Apache Hive is a data warehousing tool that allows us to perform big data analytics using the Hive Query Language, which is similar to SQL. Coming to Splunk: Splunk is a log analysis tool. What are logs? Logs are generated on computing as well as non-computing devices and are stored in a particular location or directory; they contain details about every single transaction or operation that you have made. Next is Talend. Talend is an open-source software integration platform which helps you analyze data effortlessly and turn it into business insights, so it helps a company take real-time decisions and become more data-driven. Next is Apache Spark: Apache Spark is an in-memory data processing engine that allows us to efficiently execute streaming, machine learning, and SQL workloads that require fast iterative access to data sets, so it is used for real-time processing. Moving to the last one, Apache HBase: Apache HBase is a NoSQL database that allows you to store unstructured and semi-structured data with ease and provides real-time read and write access. So these were the tools I could list down, along with a brief description of the functions they perform. Now let me move forward and explore the different domains that are using big data analytics. These are some of the domains I've listed out for you, to show how they're using big data analytics and how widely it is being used across different kinds of domains. Healthcare, as we've already discussed, has been using big data analytics to reduce costs, predict epidemics, avoid preventable diseases, and improve the quality of life in general. One of the most widespread applications of big data in healthcare is electronic health records, or EHRs — I'm sure most of you have heard about them; they store a patient's entire data. Now coming to the telecom industry: the telecom industry is one of the most significant contributors to big data. It analyzes call data records in real time to identify fraudulent behavior and act on it immediately, and the marketing division of a telecom company modifies its campaigns to better target customers and uses the insights gained to develop new products and services. Coming to insurance companies.
Insurance companies use big data analytics for risk assessment, fraud detection, marketing, customer insights, customer experience, and much more. Governments across the world are also adopting big data analytics. The Indian government, for example, used big data analytics to get an estimate of trade in the country: economists used central sales tax invoices for trade between two states to estimate the extent to which the states were trading with each other. Coming to banks and financial firms: banks and financial services firms use analytics to differentiate fraudulent interactions from legitimate business transactions. By applying analytics and machine learning, they are able to define the normal activity of a user or customer based on their history and distinguish it from unusual behavior indicating fraud; the analysis systems then suggest immediate actions, such as blocking the irregular transactions, which stops the fraud before it occurs and improves profitability. Moving on to the next domain, automobile: many automobile companies are utilizing big data analytics, and one example is Rolls-Royce. Rolls-Royce embraced big data by fitting hundreds of sensors into its engines and propulsion systems; these sensors record every tiny detail about the operation of those engines and propulsion systems, and changes in the data are reported in real time to engineers, who then decide the best course of action, such as scheduling maintenance or dispatching engineering teams if a problem arises. The next domain is education. Education is one field where big data analytics is being adopted slowly and gradually, but it is very important that we utilize it here, because by opting for big-data-powered technology as a learning tool instead of traditional lecture methods, we can enhance a student's learning and also help a teacher track performance in a better manner. Coming to the last domain, retail: retail includes both e-commerce and in-store, and they are widely using big data analytics to optimize their business strategies, as we already saw with the example of Amazon. So now that we've explored the various domains, let me show you the use cases I have taken to explain how big data analytics is widely being used. I've taken two such use cases. The first use case is Starbucks. The leading coffee house chain makes use of behavioral analytics by collecting data on its customers' purchasing habits in order to send personalized ads and coupon offers to their mobile phones. The company also identifies trends indicating whether customers are losing interest in its products and then directs offers specifically to those customers in order to regenerate their interest. I came across an article by Fox which reported how Starbucks made use of big data to analyze customer preferences and enhance and personalize their experience: they analyzed every member's coffee-buying habits, from their preferred drinks to what time of day they usually order. Even when people visit a new Starbucks location, that store's point-of-sale system is able to identify the customer through their smartphone, and the barista can give them their preferred order. In addition, based on ordering preferences, the Starbucks app will suggest new products that the customers might be interested in.
This is how Starbucks is optimizing its business strategies and increasing its customer base. Now let's move on and see the second use case I want to share with you, which is P&G, Procter & Gamble. Procter & Gamble uses market basket analysis and price optimization to optimize its products. Market basket analysis analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. Apart from this, market basket analysis may be performed on the retail data of customer transactions at a store — stores like Target, Walmart, and so on use market basket analysis to increase their sales and marketing. You can then use the results to plan marketing and advertising strategies or even design a new catalog. For instance, market basket analysis may help you design different store layouts: in one strategy, items that are frequently purchased together are placed in close proximity to further encourage the combined sale of such items. For example, if I go to a store to buy bread, I will also see butter and want to buy butter as well. That's how stores optimize their sales: they place products like butter, bread, milk, and eggs in close proximity, because they know that when a customer comes to buy bread, they might also want to buy butter, milk, or eggs. So how does P&G utilize this? The company uses simulation models and predictive analytics to create the best design for its products. It creates and sorts through thousands of iterations to develop the best design for, say, a disposable diaper, and it uses predictive analytics to determine how moisture affects fragrance molecules in a dish soap, so that the right amount of fragrance comes out at the right time during the dishwashing process. We can't even imagine that a simple product like dish soap has so much thought process and so much analytics applied behind it. I hope you found both these use cases interesting, and there are many more companies utilizing big data analytics in a proficient manner to increase their sales and marketing. Now let's move forward and see the next topic, which is facts and statistics by Forbes. I've collected four of these which I found really interesting and wanted to share with you. The first one states that nearly 50 percent of respondents to a recent McKinsey analytics survey say that analytics and big data have fundamentally changed business practices in their sales and marketing functions — and we have seen examples of this from Starbucks, P&G, and Amazon; these are the kinds of companies responding to such surveys. The next one shows how big data applications and analytics are projected to grow from about 5.3 billion dollars in 2018 to 19.4 billion dollars in 2026, a compound annual growth rate of about 15.49 percent. The next one is an extremely important fact, and something of an eye-opener: according to an Accenture study, 79 percent of enterprise executives agree that companies that do not embrace big data will lose their competitive position and could
face extinction. Even more, 83 percent have pursued big data projects to seize a competitive edge. This fact alone tells you how important this field is, and if your organization or company does not adopt big data analytics, in the future it is going to become obsolete. Now let's see the last fact I have stated here: according to NewVantage Venture Partners, big data is delivering the most value to enterprises by decreasing their expenses (about 49.2 percent) and creating new avenues for innovation (about 44.3 percent). We saw examples of both of these in the section on why we need big data analytics, where we talked about cost reduction as well as new-generation products. So now let's move forward and look at the career prospects in big data analytics. The first point I've stated here is that there is a soaring demand for analytics professionals: technology professionals who are experienced in big data analytics are in high demand as organizations look for ways to exploit the power of big data, and as the data grows, more such people will be required to analyze it. That leads us to the second point, which is huge job opportunities: there are more job opportunities in big data management and analytics than there were last year, and many IT professionals are prepared to invest time and money in training. Now that companies across various domains are adopting big data analytics, there are definitely more job opportunities. Now let's look at the salary aspect, which I think is one of the most important ones, because we want to know what kind of salary we could draw as a big data analytics professional. Six analytics and data science jobs were included in Glassdoor's 50 best jobs in America for the year 2018; these include data scientist, analytics manager, database administrator, data engineer, data analyst, and business intelligence developer, and the average salary of these six analytics jobs, along with data science jobs, is about 95,000 dollars, which is absolutely amazing. Data scientist has been named the best job in America for about three years running, with a median base salary of 110,000 dollars and 4,524 job openings — how wonderful is that? So you can see how great the prospects are in this field; if you are interested, you should definitely learn more about it, and who knows, you might be drawing that kind of salary. In India, the percentage of analytics professionals commanding salaries of less than 10 lakhs has gone down, which is great, and the percentage of analytics professionals earning more than 15 lakhs has increased from about 17 percent in 2016 to 21 percent in 2017, and to the current 22.3 percent in 2018.
Now let me tell you what kind of job titles there are in this field. The first one is big data analytics business consultant, the second is big data analytics architect, the third is big data engineer, the fourth is big data solution architect, the fifth is big data analyst, the sixth is analytics associate, the seventh is business intelligence and analytics consultant, and the last one is metrics and analytics specialist. I've just stated eight here; these may go by different names and different job titles, and I'm sure there are more such titles you can explore. Now let's move on to the skill sets you require if you want to become an analytics professional. These are a few skill sets I've mentioned here, and there can be more depending on the role you're going to play; it may even be restricted to one particular skill set, so it depends on the role you take in this field of big data analytics. The first one I've put here is basic programming: you would obviously be expected to know some general-purpose programming language. The second one is statistical and quantitative analysis: it is preferable if you know about statistics and quantitative analysis. Moving on to data warehousing: knowledge of SQL and NoSQL databases — such as MySQL on the SQL side, and MongoDB, Apache HBase, and Cassandra on the NoSQL side — is also very important. The next one is data visualization, which I think is one of the most important skill sets required: as an analytics professional you should know how to visualize the data in order to improve the business; you need to know what trends are in the data, how it is changing, and what insights it can provide, so you should be able to visualize the data and understand what it is indicating. The next one is specific business knowledge. This is extremely necessary, according to me, because if you're an analytics professional and you don't know what business your company is working on, you won't be able to apply your knowledge of analytics to increase the sales and marketing of the company. So knowledge of the particular business or area you're working in is extremely important. The last skill set I've mentioned here is computational frameworks: out of the tools we discussed in the previous section, one is expected to know at least one or more — Apache Spark, Hadoop, Pig, and so on, again depending on the job role you're going to play. It is important to be aware of at least one or more of the tools and computational frameworks used for big data analytics, because that will give you a basic understanding of how these tools are used for analyzing data. Now let us start with the US primary election use case. In this use case we will be discussing the 2016 primary elections. In the primary elections the contenders from each party compete against each other to represent their political party in the final elections. There are two major political parties in the US, the Democrats and the Republicans. From the Democrats the contenders were Hillary Clinton and Bernie Sanders, and out of them Hillary Clinton won the primary elections; from the Republicans the contenders were Donald Trump, Ted Cruz, and a few others.
As you already know, Donald Trump was the winner from the Republicans. So now let us assume that you are an analyst and you have been hired by Donald Trump, and he tells you: I want to know the different reasons why Hillary Clinton won, and I want to plan my upcoming campaigns based on that so I can win the favor of the people who voted for her. That is the entire agenda, and this is the task that has been given to you as a data analyst. What is the first thing you will need to do? The first thing you'll do is ask for data, and you have got two data sets with you, so let us take a look at what these data sets contain. This is our first data set, the US primary election data set, and these are the different fields present in it. The first field is state, so we've got the state of Alabama; the state abbreviation for Alabama is AL; we've got the different counties in Alabama like Autauga, Baldwin, Barbour, Bibb, Blount, Bullock, Butler, and so on; and then we've got fips — FIPS codes are Federal Information Processing Standards codes, which act as a unique identifier for each county. Then we've got the party; we will be analyzing the Democrats only, because we want to know the reason for Hillary Clinton's win. Then we've got the candidate — as I told you, there were two candidates, Bernie Sanders and Hillary Clinton — and the number of votes each candidate got: Bernie Sanders got 544 in Autauga County and Hillary Clinton got 2,387. The next field represents the fraction of the votes, so if you add the two fractions together you will get one; it basically represents the share of the votes each candidate got. Now let's take a look at our second data set. This data set is the US county demographic features data set. First we again have fips, then the area name — Autauga County, Baldwin County, and the other counties in Alabama and other states as well — then the state abbreviation, which here is only showing Alabama, and the fields that you see after that are the different features. You won't immediately know what they contain because they are written in a coded form, but let me give you an example of what this data set contains — and note that I'm only showing you a few rows, not the entire data set. It has fields like the population in 2014 and 2010, the sex ratio (how many females and males), ethnicity breakdowns (how many Asian, Hispanic, and Black or African American people), and age-group breakdowns (how many infants, how many senior citizens, how many adults). So there are a lot of fields in this data set, and it will help us analyze and actually find out what led to Hillary Clinton's win. Now that you have seen the data sets, you have to understand them, figure out which features or columns you are going to use, and think of a strategy for how you're going to carry out this analysis. So this is the entire solution strategy: the first thing is that you need data, and you've got two data sets with you; the second thing you'll need to do is store that data in HDFS — HDFS is the Hadoop Distributed File System — so you need to store the data there.
The next step is to process that data using Spark components — we will be using Spark SQL, Spark MLlib, and so on. The next task is to transform that data using Spark SQL; transforming here means filtering out the rows and columns that you need in order to process it. The next step is clustering this data using Spark MLlib, and for clustering our data we will be using k-means. The final step is to visualize the results using Zeppelin. Visualizing the data is very important, because without visualization you won't be able to identify the major reasons and you won't be able to gain proper insights from your data. Don't be scared if you're not familiar with terms like Spark SQL, Spark MLlib, or k-means clustering — you will be learning all of these in today's session. So this is our entire strategy; this is how we're going to implement this use case and find out why Hillary Clinton won. Now let me give you a preview of the results — I'll just show the analysis that I have performed and how it looks. This is Zeppelin, which is on the master node of my Hadoop cluster, and this is where we're going to visualize our data. There's a lot of code — don't be scared, it is just Scala code with Spark SQL, and by the end you will learn how to write it — so I'm jumping straight to the visualization part. This is the first visualization we've got, analyzed according to the different ethnicities of people: for example, on the x-axis we have foreign-born persons, and on the y-axis we are seeing the popularity of Hillary Clinton among Asians; the circles represent the values, and a bigger circle means a bigger count. We have made a few more visualizations: a line graph that compares the votes of Hillary Clinton and Bernie Sanders, an area graph that also compares their votes, bar charts, and finally a state- and county-wise distribution of votes. These visualizations will help you derive a conclusion — whatever answer Donald Trump wants from you — and don't worry, I'll explain every detail of how I've made them. So let's get started with Hadoop and Spark. We will start with an introduction to Hadoop and Spark, so let's take a look at what Hadoop is and what Spark is. Hadoop is a framework where you can store large amounts of data across a cluster in a distributed manner and then process it in parallel. Hadoop has two components: for storage it has HDFS, which stands for Hadoop Distributed File System, and it allows you to dump any kind of data across the Hadoop cluster, where it will be stored in a distributed manner on commodity hardware. For processing you've got YARN, which stands for Yet Another Resource Negotiator; this is the processing unit of Hadoop, and it allows parallel processing of the distributed data across your Hadoop cluster in HDFS. Then we've got Spark. Apache Spark is one of the most popular Apache projects, and it is an open-source cluster computing framework for real-time processing. Where Hadoop is used for batch processing, Spark is used for real-time processing, because with Spark the processing happens in memory, and it provides you with an interface for programming entire clusters with implicit data parallelism and fault tolerance.
So what is data parallelism? Data parallelism is a form of parallelization across multiple processors in parallel computing environments — a lot of "parallel" words in that sentence — so let me put it simply: it means distributing your data across nodes which then operate on the data in parallel. Spark works with fault-tolerant storage systems like HDFS and S3, and it can run on top of YARN, because with YARN you can combine different tools like Apache Spark for better processing of your data. If you look at the topology of Hadoop and Spark, both have the same topology, a master–slave topology: in Hadoop, in terms of HDFS, the master node is known as the NameNode and the worker or slave nodes are known as DataNodes, while in Spark the master is simply called the master and the slaves are called workers — these are basically the daemons. So that is a brief introduction to Hadoop and Spark; now let's take a look at how Spark complements Hadoop. There has always been a debate about what to choose, Hadoop or Spark, but let me tell you that there is a stubborn misconception that Apache Spark is an alternative to Hadoop that is likely to bring an end to the era of Hadoop. It is very difficult to say "Hadoop versus Spark", because the two frameworks are not mutually exclusive — they are better when they are paired with each other. So let's see the different challenges we address when we use Spark and Hadoop together. The first point is that Spark processes data up to 100 times faster than MapReduce, so it gives us results faster and performs faster analytics. The next point is that Spark applications can run on YARN, leveraging the Hadoop cluster, and you know that a Hadoop cluster is usually set up on commodity hardware, so we get better processing while using low-cost hardware, which helps cut costs a lot — hence the cost optimization. The third point is that Apache Spark can use HDFS as its storage, so you don't need a separate storage system for Spark; it can operate on HDFS itself, you don't have to copy the same file again just to process it with Spark, and hence you can avoid duplication of files. So Hadoop forms a very strong foundation for any future big data initiatives, and Spark is one of those initiatives, with enhanced features like in-memory processing and machine learning capabilities; you can use it with Hadoop, and Hadoop's commodity hardware gives you better processing at minimum cost. These are the benefits you get when you combine Spark and Hadoop to analyze big data. Now let's see some big data use cases. The first big data use case is web retailing and recommendation engines: whenever you go to Amazon or any other online shopping site to buy something, you will see recommended items popping up below or to the side of your screen, and that is all generated using big data analytics. There is also ad targeting — if you go to Facebook you see a lot of different items being advertised to you — as well as search quality, and abuse and click-fraud detection, where you can use big data analytics. In telecommunications it is used for customer churn prevention and network performance optimization: by analyzing the network you can predict failures and prevent loss before a fault actually occurs. It is also widely used by governments
for fraud detection and cyber security, to introduce different welfare schemes, and for justice. It has been widely used by healthcare and life sciences for health information exchange, gene sequencing, serialization, healthcare service quality improvements, and drug safety — and let me tell you that with big data analytics it has become much easier to diagnose a particular disease and also to find a cure. Here are some more big data use cases: it is also used in banks and financial services for modeling true risk, fraud detection, credit card scoring analysis, and many more, and it can be used in retail, transportation services, hotels, food delivery services — actually in every field you can name. No matter what business you have, if you are able to use big data efficiently, your company will grow, and you will gain different insights by using big data analytics and hence improve your business even more. Nowadays everyone is using big data — you've seen the different fields, all different from each other, yet everyone is using big data analytics, and big data analysis can be done with tools like Hadoop, Spark, and so on. This is why big data analytics is very much in demand today and why it is important for you to learn how to perform big data analytics with such tools. Now let's take a look at a big data solution architecture as a whole. You're dealing with big data, so the first thing you need to do is dump all that data into HDFS and store it in a distributed way; the next thing is to process that data so you can gain insights, and we'll be using YARN, because YARN allows us to integrate different tools which will help us process the big data. These are the tools you can integrate with YARN: you can choose Apache Hive, Apache Spark, MapReduce, or Apache Kafka to analyze big data, and Apache Spark is one of the most popular and most widely used tools with YARN for processing big data. So this is the entire solution as a whole. Now let's take a look at Apache Spark. Apache Spark is an open-source cluster computing framework for real-time processing; it has a thriving open-source community and is the most active Apache project at this moment. Spark's components are what make it fast and reliable, and a lot of these components were built to resolve the issues that cropped up while using Hadoop MapReduce. Apache Spark has the following components, starting with the Spark Core engine: the core engine is the base of the entire Spark framework, and every other component is built on top of it. First we've got Spark SQL: Spark SQL is the Spark module for structured data processing, and you can run modified Hive queries on existing Hadoop deployments with it. Then we've got Spark Streaming: Spark Streaming is the component of Spark used to process real-time streaming data; it is a useful addition to the core Spark API because it enables high-throughput, fault-tolerant stream processing of live data streams. Then we've got Spark MLlib, the machine learning library for Spark — we'll be using Spark MLlib to implement machine learning in our use case too. Then we've got GraphX, the graph computation engine: this is the Spark API for graphs and graph-parallel computation, with a set of fundamental operators like subgraph, joinVertices, and so on. Then you've got SparkR; this is the package for the R language that enables R
users to leverage Spark's power from the R shell. People who have already been working with R and are comfortable with it can use the R shell directly and, at the same time, use Spark through this component, SparkR: you can write all your code in the R shell and Spark will process it for you. Now let's take a deeper look at each of these important components. We've got Spark Core, and Spark Core is the basic engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL development, while the additional libraries built on top of the core allow for streaming, SQL, and machine learning workloads. It is also responsible for scheduling, distributing, and monitoring jobs in a cluster and for interacting with storage systems. Let's take a look at the Spark architecture. Apache Spark has a well-defined, layered architecture where all the components and layers are loosely coupled and integrated with various extensions and libraries. First, let's talk about the driver program. The Spark driver contains the driver program and the Spark context; this is the central point and the entry point of the Spark shell. The driver program runs the main function of the application, and this is where the Spark context is created. What is a Spark context? The Spark context represents the connection to the entire Spark cluster, and it can be used to create resilient distributed datasets, accumulators, and broadcast variables on that cluster. You should know that only one Spark context may be active per Java virtual machine, and you must stop any active Spark context before creating a new one. The driver program runs on the master node of the Spark cluster; it schedules the job execution and negotiates with the cluster manager. The cluster manager is an external service that is responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. Then, on the worker nodes, we have the executors: an executor is a distributed agent responsible for the execution of tasks, and every Spark application has its own executor processes. Executors usually run for the entire lifetime of the Spark application — a phenomenon known as static allocation of executors — but you can also opt for dynamic allocation of executors, where Spark executors are added or removed dynamically to match the overall workload. Okay, so now let me tell you what actually happens when a Spark job is submitted. When a client submits a Spark user application, the code in the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph, or DAG. At this stage the driver program also performs certain optimizations, like pipelining transformations, and then converts the logical DAG into a physical execution plan with a set of stages; after creating the physical execution plan, it creates smaller physical execution units, referred to as tasks, under each stage. These tasks are then bundled to be sent to the Spark cluster. The driver program talks to the cluster manager and negotiates for resources, the cluster manager launches the executors on the worker nodes on behalf of the driver, and at this point the driver is ready to send tasks to the executors based on data placement. Before the executors begin execution, they first register themselves with the driver program, so that the driver has a holistic view of all the executors.
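Just to make this architecture walkthrough concrete, here is a minimal sketch of a driver program. This is not code from the video — the app name, the local master, and the toy job are assumptions purely for illustration — but it shows where the SparkSession and the single active SparkContext live, and what the executors end up running.

```scala
// A minimal driver-program sketch (assumptions: Spark 2.x+, local mode, toy job).
import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("driver-sketch")
      .master("local[*]")            // assumed local master, just for illustration
      .getOrCreate()

    val sc = spark.sparkContext       // only one active SparkContext per JVM

    // A trivial job: the driver builds the DAG, the executors run the tasks.
    val counts = sc.parallelize(1 to 1000000)
      .map(_ % 10)
      .countByValue()

    println(counts)
    spark.stop()                      // stop the context before creating a new one
  }
}
```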
Now the executors will execute the various tasks assigned to them by the driver program, and at any point in time while the Spark application is running, the driver program will keep monitoring the set of executors that are running the application code. The driver program also schedules future tasks based on data placement, by tracking the location of the cached data. So I hope you have understood the architecture of Spark — any doubts? All right, no doubts. Now let's take a look at Spark SQL and its architecture. Spark SQL is a newer module in Spark; it integrates relational processing with Spark's functional programming API, and it supports querying of data either via SQL or via the Hive Query Language. For those of you who are familiar with RDBMS, Spark SQL will be a very easy transition from your earlier tools, because you can extend the boundaries of traditional relational data processing with it. It also provides support for various data sources and makes it possible to weave SQL queries with code transformations, which is why Spark SQL has become such a powerful tool. This is the architecture of Spark SQL, so let's talk about each of these components one by one. First we have the Data Source API: this is the universal API for loading and storing structured data, with built-in support for Hive, Avro, JSON, JDBC, CSV, Parquet, and so on; it also supports third-party integration through Spark packages. Then you've got the DataFrame API. A DataFrame is a distributed collection of data organized into named columns, similar to a relational table in SQL that is used for storing data in tables. It offers a domain-specific language (DSL) applicable to structured and semi-structured data, it can process data from kilobytes to petabytes, on a single-node cluster up to a multi-node cluster, and it provides APIs for Python, Java, Scala, and R programming. So I hope you have understood the architecture of Spark SQL; we will be using Spark SQL to solve our use case. These are the different commands to start the Spark daemons; they are very similar to the Hadoop commands used to start the HDFS daemons. There is a command to start all the Spark daemons — the Spark daemons are the master and the workers — you can use jps, just like with Hadoop, to check whether all the daemons are running on your machine, and then there is a command to start the Spark shell. You can go ahead and try this out; it is very similar to the Hadoop part I showed you earlier, so I'm not going to do it again, and we've already looked at Apache Spark as well. Now let's take a look at k-means and Zeppelin: k-means is the clustering method, and Zeppelin is what we're going to use to visualize our data. Let's talk about k-means clustering. K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure of k-means follows a simple and easy way to classify a data set into a certain number of clusters, which is fixed prior to performing the clustering. The main idea is to define k centroids, one for each cluster, and the centroids should be placed in a cunning way, because different locations cause different results. Let's take an example: let's say that we want to cluster the total population of a certain location
into four different clusters — namely groups one, two, three, and four. The main thing we should keep in mind is that the objects in group one should be as similar as possible, but there should be as much difference as possible between an object in group one and an object in group two. That means the points lying in the same group should have similar characteristics and should be different from the points lying in a different cluster, and the attributes of the objects determine which objects should be grouped together. For example, take the same US county data we are using: consider the second data set — it has a lot of features, as I already told you, like age groups, professions, and ethnicity — and these are the attributes that will allow us to cluster our data. So this is k-means clustering. Here is one more example; let us consider a comparison of income and balance. On the x-axis I've got the gross monthly income and on the y-axis I have the current balance, and I want to cluster my data according to these two attributes. If you see, this is my first cluster and this is my second cluster: one cluster indicates the people who have a high income and a low balance in their account — they spend a lot — and the other cluster comprises the people who have a low income but a high balance — they save. You can see that all the points lying in one group have similar characteristics (low income, high balance), the points in the other group show the opposite characteristics (high income, low balance), and there are a few outliers here and there, but they don't form a cluster. So this is an example of k-means clustering, and we'll be using it to solve our problem. Does anybody have any questions? Here is one more example, and a little problem for you, so you can tell me: I want to set up schools in my city, and these points indicate where each student lives. My question to you is: where should I build my schools if students live around the city in these particular locations? To find that out we will do k-means clustering and find the center points. If you cluster and group all these locations and set up a school at the center point of each cluster, that would be optimum, wouldn't it, because then the students have to travel less — the school will be close to everyone's house. And there it is: we have formed three clusters — the brown dots are one cluster, the blue dots are one cluster, and the red dots are one cluster — and we have set up a school at the center point of each cluster: here is one, here is one, and here is yet another one. This is where I need to set my schools up so that my students do not have to travel that much. So that was all about k-means; now let's talk about Apache Zeppelin. Zeppelin is a web-based notebook which brings data ingestion, data exploration, visualization, sharing, and collaboration features to Hadoop and Spark. Remember when I showed you my Zeppelin notebook: we wrote code there, we even ran SQL there, and we produced visualizations by executing code there. That is how interactive Zeppelin is; it supports many interpreters and it is a very powerful visualization tool.
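Before we move on to Zeppelin itself, here is a small Spark MLlib sketch of the k-means idea above, using made-up income and balance numbers. This is not the video's code; it simply previews the same VectorAssembler plus KMeans pattern that the election use case applies later to the demographic feature columns.

```scala
// Toy k-means sketch with Spark MLlib; the income/balance numbers are made up.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kmeans-toy").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(
  (8000.0,  500.0),   // high income, low balance
  (7500.0,  700.0),
  (2000.0, 9000.0),   // low income, high balance
  (1800.0, 8500.0)
).toDF("income", "balance")

// Assemble the two numeric columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("income", "balance"))
  .setOutputCol("features")

val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")
val model  = kmeans.fit(assembler.transform(people))

model.clusterCenters.foreach(println)   // one centre per cluster
```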
It goes very well with Linux systems and it supports a lot of language interpreters — R, Python, and many others. So now let's move on to the solution of the use case; this is what you've been waiting for. First we will solve the US election use case. The first thing we will do is store the data in HDFS, then we will analyze the data using Scala, Spark SQL, and Spark MLlib, and finally we will visualize the results using Zeppelin. This is the entire US election solution strategy that I told you about — I don't think I need to repeat it, but if you want me to I can. Should I repeat it? All right, most people are saying no, so I will just move past it. Let me go to my VM and execute this for you. This is my Zeppelin; I open my notebook, go to my US election notebook, and this is the code. First of all, I am importing certain packages because I'll be using functions from them: I've imported the Spark SQL packages, and I have also imported the Spark MLlib packages because I'll be using k-means clustering — I have the VectorAssembler package, which gives me the machine learning functions I'm going to use, and I've also imported the KMeans package. Then the first thing you need to do is start the SQL context, so I've started my Spark SQL context here. The next thing you need to do is define a schema, because when we load our data it should be in a particular format, and we have to tell Spark which format that is. So we're defining a schema here; let me take you through the code. I'm storing the schema in a variable called schema, and we define it with a proper structure: we start with a StructType, and since our data set has different fields as columns, we define it as an array of StructFields. We start with the first field by defining it as a StructField; inside it we mention the name of that particular field — I've named it state — then that it is of string type, and then true, which means the field is nullable. Next we've got fips, which is also of string type — I know that fips is a number, but since we are not going to do any numeric operation on fips, we let it stay as a string. Then we've got party as string type, candidate as string type, and then votes as integer type, because we're going to count the number of votes and there are numeric operations we will perform that will help us analyze the data. Then we've got fraction_votes, which you know is a decimal, so we keep it as double type. The next thing is that Spark needs to read the data set from HDFS, and for that you use spark.read with the header option set to true. Header true means you have told Spark that the data set already contains column headers — state, state_abbreviation, and so on are nothing but column headers — so you don't have to define the column headers explicitly, and Spark will not choose a random row as the header; it will use only the column headers your data set has.
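Here is a hedged sketch of the schema definition and CSV read just described. The column names and the HDFS path are my reconstruction from the narration, not the exact code shown in the video.

```scala
// Sketch of the election schema and CSV read (column names and path are assumptions).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("us-election").getOrCreate()

val schema = StructType(Array(
  StructField("state", StringType, true),              // true = nullable
  StructField("state_abbreviation", StringType, true),
  StructField("county", StringType, true),
  StructField("fips", StringType, true),                // numeric code, kept as a string
  StructField("party", StringType, true),
  StructField("candidate", StringType, true),
  StructField("votes", IntegerType, true),
  StructField("fraction_votes", DoubleType, true)
))

val df = spark.read
  .option("header", "true")                             // the file already has headers
  .schema(schema)
  .csv("hdfs:///user/edureka/primary_results.csv")      // hypothetical HDFS path
```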
Then you have to mention the schema you have defined — I defined it in my variable schema, so I mention that — then that the file is in CSV format, and then the path of the file in my HDFS. I store this entire data set in a variable called df. Now what I am going to do is filter out certain rows from my data set, because it contains both the Republican and the Democrat data and I just want the Democrat data — we're going to analyze the Hillary Clinton and Bernie Sanders part. So this is how you split your data set: first we create one more variable called df_r, where we apply a filter for party equal to Republican, and then we store the Democratic Party data into df_d. We're going to use df_d from now on, and df_r, the Republican data, is going to be your assignment for the next class. I am going to analyze the Democrat data, and after this class is over I want you to take the Republican data — the data set is already available in your LMS and you've got the VMs with everything installed — so when you are at home and have free time, analyze the Republican data and tell me what the reasons were that Donald Trump won. Do that analysis, come up with your results and conclusions for the next class, and we'll discuss them; that way you'll learn even more and it will be good practice after today's class. All right, so we take df_d now, and the first thing we do is create a table view, which I'm naming election. Let me show you what it looks like: this is the command I have run in Zeppelin — SQL code run in Zeppelin — and you can see that I have got state, state_abbr, and only the Democrat data. Let's go back. After creating the table view, all of the Democrat data is in my election table, so now I'm creating a temporary variable and running Spark SQL code. The motive of writing this SQL query is that I want to refine my data even more. What I'm trying to analyze is how a particular candidate actually won; I don't need the losing data. Each fips contains the data of both the winning candidate and the losing candidate, because my data set contains both Bernie Sanders and Hillary Clinton — in some counties Bernie Sanders won and in some counties Hillary Clinton won — so I just want to find out who the winner was in each particular county. I'm going to refine the data for that, and this is the query I'm using: I select everything from election and then perform an inner join with another query, so there is one more query nested inside this one. Let me tell you what it does: first we select fips as b — you now have two entries for each fips, since each fips appears twice in the data set, and I've named that column b — and then we take the maximum of fraction_votes.
You know that in each fips we have a maximum fraction_votes, so we can find the winner by seeing who got the maximum fraction of votes. We name that maximum fraction_votes column a, and we group by fips, so for each fips only the row with the maximum fraction_votes will be selected — for a fips like 1001, which appears twice, only the winning row remains — and now I'll have the winner data. I've named this entire inner query groupTT, and then the join condition is that election.fips (from the main table view) should be equal to the b column we created in the groupTT table, and election.fraction_votes should be equal to groupTT.a. Any doubts about how I have written this query? All right. Now, whatever data we get here, I'm storing in election1. Let me show you what is in election1: this was my election table, where you can see I've got each fips twice, like 1067 and 1067, and in election1 there is no repetition of fips — I have only one entry per fips, and that is the row which tells me who won in that county, or in the fips associated with that particular county. You can see that for Bullock it was Hillary Clinton, for Cherokee it was also Hillary Clinton, and then State House District 19 is Bernie Sanders — Alaska is mainly Bernie Sanders. So this is what we've done so far. You can also see that we have got additional columns b and a: a tells you the maximum fraction_votes and b tells you the fips, so the data in fips and the data in b are the same, and the data in fraction_votes and the data in a are the same. Since these columns are repeating and have the same values, I don't want a and b anymore, so I'm going to filter out the columns I don't need. I make a temporary variable again — I'm using it to store some data temporarily — and I write Spark SQL code to select only the columns I want: state, state abbreviation, county, fips, party, candidate, votes, and fraction_votes from election1. I store everything in d_winner — I've created this new variable and I'm assigning whatever was in temp to d_winner — and now I've got only the winner data: all the counties, who won in each particular county, and by what fraction of the votes. All I'm doing so far is refining our data set so that it will be easy for us to draw conclusions or gain insights from it. And let me also tell you that it's not necessary to refine your data set in the exact way I'm doing it: once you've seen and understood your data and figured out what you actually need to do, you can carry out different steps to get there — this is just one way of doing it, my way of doing it. Then we create a table view for d_winner, and we name it Democrats. Let me show you what the Democrats table view looks like — you can press Shift+Enter — and there you have it: we no longer have the columns a and b that we had in election1, so I've just got the winner data.
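Continuing the sketch from earlier (reusing the spark session and df defined there), here is roughly what the filtering, the election view, and the winner-per-county query look like. The video builds election1 first and then drops the extra a and b columns; this condensed version selects only the wanted columns in one go, so treat it as an approximation rather than the video's exact code.

```scala
// Sketch of the Democrat/Republican split, the election view, and the winner query.
val df_d = df.filter("party = 'Democrat'")     // Democrat rows only
val df_r = df.filter("party = 'Republican'")   // assignment data for the next class

df_d.createOrReplaceTempView("election")

// For every fips, keep only the row with the maximum fraction_votes,
// i.e. the winning candidate in that county.
val d_winner = spark.sql("""
  SELECT e.state, e.state_abbreviation, e.county, e.fips,
         e.party, e.candidate, e.votes, e.fraction_votes
  FROM election e
  INNER JOIN (
    SELECT fips AS b, MAX(fraction_votes) AS a
    FROM election
    GROUP BY fips
  ) groupTT
  ON e.fips = groupTT.b AND e.fraction_votes = groupTT.a
""")

d_winner.createOrReplaceTempView("Democrats")
```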
So now let us go back. Next, I want to find out which of the candidates won each state. Whatever result I get is stored in a temporary variable, and I'm assigning everything in that temporary variable to a new variable called d_state; similarly, I'm creating a table view for d_state, which is called state. Let me show you what my state table view contains: there it is — for the state of Connecticut, Hillary Clinton won 55 counties; for Florida, Hillary Clinton won 58 counties. So this is what we've arrived at for our first data set. Now let's see what we can do with our second data set, which contains all the different demographic features. The first thing, again, is to define a schema, and this time I'm naming that schema schema1. Since we have almost 54 columns, I have to define all 54 of them as well — you remember what each of those columns contains. That is exactly what I have done; I don't need to go through every line, because I already showed you how to define a schema, and you have the code in your LMS, so you can take a look at it. The next thing, again, is to read the data set: I'm storing it in a new variable called df1, this is the path in my HDFS where the data set is, and then I have created a table view for it called facts. Now let me show you what facts contains: as you can see, it contains the state abbreviation, the population in 2014, and so on — instead of using the coded column names that were actually in my data set, I have given proper, meaningful names that describe what each column contains, so instead of a code like PST045214 I've got population_2014.
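For the demographics file, the video defines an explicit schema with all ~54 readable column names. As a lighter-weight alternative (my assumption, not the video's code), you can let Spark infer the types and rename the coded headers afterwards; the raw header name PST045214 and the HDFS path shown here are assumptions as well.

```scala
// Sketch of loading the county demographics and giving the coded columns readable names.
val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")                        // instead of the full 54-column schema
  .csv("hdfs:///user/edureka/county_facts.csv")         // hypothetical HDFS path
  .withColumnRenamed("PST045214", "population_2014")    // assumed raw header for 2014 population
  // ...rename the remaining coded columns the same way
  // (persons_65_and_over, female_persons, white_alone, and so on)

df1.createOrReplaceTempView("facts")
```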
so does that make sense right and contains all the 54 demographic features or different features that was in my data set white alone not Hispanic or Latino living in the same house one year and over foreign-born persons language or other than English spoken at home high school graduate or higher so it contains basically all the different features or all the different columns that actually was in my data set and that I have defined in my schema so this is what facts I have so now what I'm going to do is that I'm not going to analyze my whole data based on all this different features I'm going to choose some specific features in order to analyze it uh I'm going to take just a few add one so these are the different features that I'm going to use I'm going to use fips I'm going to use state I'm going to use state abbreviation then area name candidate and people who are over 65 years senior citizens and female people white Alone um black African alone I'm choosing Asian alone Hispanic or Latino so basically what I'm trying to do is I'm trying to check what is the popularity of Hillary Clinton among the foreign people or people from different ethnicities so I'm choosing white people black people and Hispanic people so I'm just trying to analyze it okay and you know that I have stored this in a temporary variable again and then whatever result I'll get by running this spark SQL code I'll store it in a different variable called DFX and then I'll store it and then I'll make a table view for DF facts such as winner facts so let me show you what winter facts look like so it's winner facts you've got Phipps the state is Alabama the state abbreviation is Al for Alabama um the area name is artuga County and the winner was Hillary Clinton and the people over 65 years in that particular county is 13.8 percent female percentage is 51.4 white alone 77.9 and so these show you the data so uh black white or African is 18 and then I've got the different fields that I have selected Asian alone Hispanic or Latino foreign born so I have chosen 14 features to analyze it from all right so now what I'm doing again is that I'm going to divide the Hillary Clinton data and the Bernie Sanders data so that we can analyze only why Hillary Clinton won or why Bernie Sanders won in some particular counties so we are planning to filter the same way we do we divided Democrats and Republican data from our initial primary result data set so this is what you have done so you know that is stored in DF fax so we are putting a filter in DFX where uh candidate is equal to Hillary Clinton so that will be stored in HC in the data of Bernie Sanders will be stored in BS so after that what we are doing is that we are doing a one hot encoding so we'll add two more columns in our data set a WBS and whc e in this case we are going to do one hot encoding and what we're going to do is that we are going to include or we are going to attach two more columns in Winter facts as whc and WBS so it'll just contain either one or zero and so you can edit it in that way whichever County so if you're considering a county let's say artuga County say if Hillary Clinton is the winner it will have a one in whc and in WBS it will have zero similarly in Which counties Bernie Sanders won so Bernie Sanders will have one so WBS will have 1 and whc will have a zero and then we are creating different views for both of these two together so this will only tell me wherever whc is one that means this will only show me the counties where Hillary Clinton won this will 
only show me the counties where Bernie Sanders won and we are creating a view for both of these so for Bernie Sanders the view is WBS and for Hillary Clinton it's WHC then finally we are merging both of them together using union so select all from WHC union all select all from WBS and finally we have stored it in result and we have created a table view known as result so let me show you what this result contains so there it is for Autauga it was Hillary Clinton and we've got the Bernie Sanders data over here at the bottom and I've got all the different fields from my second data set as well the different features that I chose from my second data set to analyze it so now comes the actual analyzing part this is where we're going to perform k-means but first we have to define the feature columns that is you have to define what is the input that you're going to feed to the machine so that the machine learning goes on and finally gives you some kind of result so this is where I'm defining an array of all the different fields from my data set I'm using persons 65 years and older female persons percentage white alone black or African American alone Asian alone Hispanic or Latino foreign born persons language other than English spoken at home bachelor's degree or higher veterans homeownership rate median household income persons below poverty level population per square mile WHC and WBS and then I'm going to use the vector assembler this is what feeds the different machine learning algorithms and here we are using k-means so my input columns are the feature columns and my output column will be called features so whatever result I'm going to get is in features and we have to transform the result so this is the final table view that we have created and you know what transforming means in our strategy we already saw that we have to transform the data first so my updated data set was result so I'm going to transform result with the input columns set to these feature columns and the output table view will be called features and then we're going to perform the k-means clustering and we're going to store it in a variable called kmeans so we're using functions from the Spark MLlib library and we have chosen the clustering KMeans from Spark ML and you know that in k-means we already define how many clusters we need and we need four so we have selected four clusters and then we're going to set the features column as features and the prediction column as predictions so after that we're going to make a model since we have defined our input and output columns in rows we're going to use kmeans.fit on rows and whatever model we get we're going to store it and then we are going to print the cluster centers for each cluster so let me show you what my cluster centers are so after we run this code you can see that these are the different cluster centers so just so I can make you understand what we're going to do after k-means clustering and how to analyze it the numbers here are placed very haphazardly so what I have done is that I've picked out each of the cluster center points and then I have made a new table so this is it so you know that we have four clusters we have got the zeroth cluster the first cluster the second cluster and the third so zero one two three
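As a rough sketch of the assemble-and-cluster step just described, here is what it might look like with the DataFrame-based spark.ml API; the feature column names are assumed stand-ins, `spark` is the session provided by the notebook, and the merged winner table is assumed to be registered as the result view from the previous step:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// the merged winner table registered as the "result" view in the step above
val result = spark.table("result")

// illustrative subset of the feature columns listed a moment ago
val featureCols = Array("age_65_plus", "female_pct", "white_alone", "black_alone",
                        "hispanic_or_latino", "foreign_born", "WHC", "WBS")

// pack the numeric columns into one vector column called "features"
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val rows = assembler.transform(result)

// k-means with the four clusters chosen above; cluster ids land in "prediction"
val kmeans = new KMeans()
  .setK(4)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val model = kmeans.fit(rows)

model.clusterCenters.foreach(println)      // the cluster centres interpreted next
val predictions = model.transform(rows)    // each county row now carries its cluster id
predictions.createOrReplaceTempView("predictions")
```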
okay so with four clusters we have found out the cluster centers according to the different features that we fed into my k-means algorithm so what we observed here in WHC and WBS is that the winning chances of Hillary Clinton were 0.9 whereas the winning chances for Bernie Sanders were 0.1 and then if you observe the differences in the cluster centers for each feature you can see that there is not much difference not even here so it's 50 49 49 51 and then again it is not much of a difference but if you see here it's nine and it goes up to 16 so you can do a more detailed analysis on black or African American so if you want to know the real support of black or African American people and you want to see what their voting pattern was or how popular Hillary Clinton was among them then maybe this could be a good field to analyze because you see the variations in the numbers similarly you can check out other features and you can check out here 16 8 9 and 36 so maybe the Hispanic or Latino field also deserves some more analysis and even here you can see in veterans there's 47806 whereas we've got 182000 so there is also a lot of difference and here it is only 2759 and we've got numbers in the ten thousands and even hundred thousands here so this is how we can identify which of the fields are the main points where you should make your analysis so let's go back to our Zeppelin notebook and here it is so now what we're going to do is visualize the result first so we are counting from predictions you can see that prediction contains my cluster since you know that I have stored my cluster information in prediction this is my output after k-means so this is the count of the counties that lie in each particular cluster you can see that in cluster one I have got 1917 and in cluster two I've got 751 so maybe I should pay more attention to analyzing cluster one right so that's why I've selected cluster one here and we're making different plots so you can see that on the x-axis I have got foreign born people and on the y-axis I have chosen language other than English spoken at home and then we are grouping it by candidate so you can see the lighter blue is for Bernie Sanders and the darker blue is for Hillary Clinton so all these light blues are for Bernie Sanders and you can see that as the number of foreign born people increases you can only see Hillary Clinton in the scatter plot here so there might be a few outliers like back here and the size is defined according to black or African American alone so you remember that this was the feature where we found a lot of variation in the numbers so that's why we sized it according to that and you can see the bigger the circle the more black or African American alone and that's the conclusion we can draw from this scatter plot we can see that as the number of foreign born people increases the popularity of Hillary Clinton is higher in larger groups of foreign born people you can also choose different parameters out of all the different features that you have chosen so remember that we have also seen the variation in veterans so let's choose veterans on the y-axis and let's also change the x-axis and let me just use white alone here so you can see here that there is the
x-axis that has white alone and this is the Veterans so you can see that Hillary Clinton is popular among veterans also in a smaller group of veterans since we have decided to size in black or African-American alone so the size um also represents some values she is popular among the African-American veterans and then as you go ahead and as the count increases you can see actually since it's a scatter plot and it almost represents that uh this is a point as the number of people increases or as the number of white people increases the votes are equally kind of distributed between Bernie Sanders and Hillary Clinton because there are a lot of points in this scatter plot over here and you can go ahead and drag and drop different features and you can make different visualizations on that now what we've done is that we know that there are 1917 counties in my cluster one so I am going to do is that I'm going to see that among these 1917 how many were in favor for Hillary Clinton and how many were in favor for Bernie Sanders um so in cluster number one you can see clearly Hillary Clinton is the winner and Bernie Sanders only has got 764 whereas she got 1153. similarly in cluster 3 again Hillary Clinton is the winner with nine and Bernie Sanders uh with one then it uh it too she's also got 388 and Bernie Sanders was 363 so this was very close call and again in zero you've got 119 and 30. and then we went ahead and created a line chart also of the word distribution for Hillary Clinton and Bernie Sanders so in Keys we have select a prediction the values here are whc and WBS the sum that we have got over here so definitely Bernie Sanders is lagging behind so even though you don't have that table for you you can also find it out according to this line chart so you can see that in cluster zero even again Hillary Clinton was ahead of Bernie Sanders in cluster 2 there was a very neck to neck competition and you can see it in the graph year so this represents cluster two and so you can see you have a neck to neck competition and again in cluster three uh they have got neck to neck competition so this describes the distribution of votes for Hillary Clinton and Bernie Sanders and definitely Hillary Clinton uh knows ahead and that's why of course she won the primary elections so again you can go ahead and we have created the same graph it's only just area graph instead of a line graph the key here are State and candidates so I've got States and candidates over here and the values is counties once if you just hover onto this bar chart you can see that in Connecticut Bernie Sanders won 115 counties in Connecticut Hillary Clinton won 55 only so in Florida Hillary Clinton is 58 in Florida Bernie Sanders is nine and here you can see in the main Bernie Sanders one 462 so Bernie Sanders got a majority of votes for Maine so you can also classify it Statewide you can find out which are the states and as Donald Trump now you will know that which are the states that you can Target right so you know that in Maine a lot of people voted for Bernie Sanders and maybe Hillary uh Clinton is not popular so you can go ahead and lead out so is Donald Trump's party member you can just advise him to go in Maine and carry out different campaigns because uh Hillary Clinton is not so popular there so maybe it would be a little easier to get votes from the people in Maine so this is what you can make a conclusion from and might not be very accurate but this will be very close the thing is you can make different charts you can make bar 
charts you can make pie charts so whatever counties we showed in the bar chart here it is as a pie chart it looks better but maybe it's not as insightful I just placed it so that I can show you that you can make pie charts also so these are the insights you can make after analyzing your U.S county data and this is what you can tell Donald Trump these are the different suggestions that you can actually go and tell Donald Trump that she is popular among the foreign born people and the people who speak different languages she is popular among the Hispanic people and then in Maine she lost a lot of counties she almost lost all of the counties in Maine so these are the different insights that you have got and then you can tell your superior or your employer who has hired you to do that so this is what you can present right so this is at a very beginner's level and there is some more analysis that you need to do I just showed you a few options you can go ahead and try more in the Democrat section also and you remember that you have to do it for the Republican party also now let me see what you've learned today so if you have any questions right now you can just go ahead and ask me so does anyone have any questions so now we will move on and find out the solution for the instant cab use case you remember that we have got the Uber data set which contains the pickup time and the location given by two columns latitude and longitude and we have also got the license number for a particular Uber driver and what we have to do is find the beehive locations that is the points where we will find the maximum pickups and then we also have to find out what is the peak hour of the day so this was the entire strategy we've got the Uber pickup data set then we store the data into hdfs we transform the data set and make predictions by using k-means clustering on the latitude and longitude and find out the beehive points so now let me open my other notebook the Uber notebook so again the first thing that you have to do is copy the Uber data set into your hdfs and we've done that before explaining the U.S county analysis so again the code is kind of the same the first thing is that again we are importing some Spark SQL packages and some Spark MLlib packages because we are going to use k-means clustering and you can see the vector assembler here again spark ml clustering KMeans and other Spark SQL packages then we have to start our SQL context and we're doing it the same way then the first thing again is we have to define a schema now I don't have many fields I've got only four fields if I remember so the first field is the date and timestamp that defines the time so we're defining it as DT the next fields are the latitude the longitude and the base then I'm going to read my data set this is the path in my hdfs where my Uber data set is I define schema as schema here the header is true because again my data set contains column headers and I'm going to store it in DF so the feature columns here are going to be latitude and longitude because I'm going to find out the beehive points the points where I will get my maximum pickups from so again I have set the input columns as the feature columns and the output column as features I'm using the assembler to transform my data set and then again I'm using k-means and using the same elbow method we found out that we should make eight clusters for this data set okay and then we are selecting the prediction column and the output column as
predictions and then we have printed the cluster centers for each cluster so definitely whatever result we are going to find the cluster centers will tell me the exact location so this cluster uh centers that we will find after k-means is actually the Beehive points this will be the point where I will find maximum pickups right so here I have printed my cluster centers and this defines the latitude and longitude and this is going to be my location where I'm going to find the maximum pickups and I got eight results uh like that because I got eight clusters and uh defined the eight centers for different clusters so this is exactly like the k-school problem that I explained to you in K means this is exactly what happens just as we found out the center of each cluster and that is where we are replacing the school or building the new school so similarly this is going to be my beehive point and this is where I will place my maximum number of cabs okay so we found out the Beehive points the next thing we will need to do is we need to find the peak hours because I also need to know at what time should I place my cabs in the location so what we're doing now we are taking a new variable called q and we are selecting hour from the timestamp column and then the Alias name should be our and we're getting it from our prediction or from the result that we got after my K means clustering so now we are grouping it and it will have the different hours of the day and then it will just show me the pickups at the different hours of the day in the location that we found out are the Beehive points and then we're going to count uh how many pickups we are going to get from that place right so we're ordering it by descending so the smaller pickup count will be uh the first and then the larger will be at the bottom similarly again we are creating new variable called T and we're going to do the same thing so here what we're doing is we are selecting the time now or the date the latitude longitude prediction and we filter by hour which is not null so we're filtering out the null values from here so now we have created a table view for categories so let me show you what the categories contain okay let me just go down so I've done some few operations here so let's scroll back up and I'll show you and again we have created table views for T and Q also which is again T and Q all right and then I have made some visualizations for each so and we have created a value P where our is not null so again we have filtered out the null hours and we have created a new to view called p so here's my hours this is my count and in the x-axis that show how many pickups were there and this contains different hours of the day and then I have grouped it by prediction so the size is according to the count so you can see that the bigger the circle means more pickups so you can find out the biggest circle and you know that you can find the biggest circle as you go along the x-axis because this is where the count increases so you can find out the biggest circle would be here and it lies in my fourth cluster and you can see that there are eight hundred or eight thousand nine hundred fifteen pickups at the 17th hour of the day which is around 5 PM and so you know that the maximum pickups are around four o'clock or five o'clock and this lies all in my fourth cluster and so it means my peak hours are around four or five o'clock in the evening right so this is what Insight we have gained and you can tell instant cab CEO that I have found out that 
your calves should be ready around four or five because that's the time when uh people go home from offices or they're going out for dinner or something and this is what an another table view looks like which is tea so here we have latitude and longitude and this is where we are finding the Beehive locations so I have uh got this the distribution in a scatter plot again and you can see that we have got uh very dense points over here it means that these represent the Beehive points so what you can do is that you can just put the US map and scale it according to this scale over here and then you can exactly find out what is the exact location where you need to put your cabs around the 17th hour or the 16th hour of the day all right and you know that we had a lot of rows but the results are only limited by ten thousand if it's around ten thousand rows but we obviously had a lot more and you can check in different uh clusters so now we are analyzing uh uh cluster zero so here if you see this point over here this lies in cluster four and this lies in cluster five and this lies in cluster zero so you can analyze each cluster also so here I have just laid out the latitude and longitude for my zeroth cluster so you can see here where the prediction is equal to zero and I have selected this from the table view of T so here you can find out the exact latitude and longitude and here the latitude is 40.722 and the longitude is negative 73.995 so uh this is how you can point Pinpoint location where your cab should be during a peak hours again if you see this distribution this is just a pie chart that I've created with that tells you what is the count of pickups at each hour of the days starting from 0 to 23. there are 24 slices in this circle so you can see uh that these few slices are the bigger chunks and this is the 19th hour of the day which is around seven o'clock six o'clock five o'clock four o'clock three o'clock and so on so you can see the midnight maybe nobody travels so maybe your cabs could rest or you don't have to place any more cabs during this part of the day um these are the insights that you gained so any questions on that and I think after doing the US County election this was pretty easy to do and this is also uh pretty easy to understand and the results which were also much more clear correct [Music] we started into the heart of cluster let us understand what actually a basic computer cluster is a cluster basically means that it is a collection a computer cluster is also a collection of interconnected computers which are capable enough to communicate with each other and work on a given task as a single unit similarly Hadoop cluster is also a collection of commodity Hardware which can be computers and servers interconnected with each other and work together as a single unit Hadoop cluster has a master and numerous number of slaves Master assigns the task and guides the sleeves now that we know what a Hadoop cluster is let us now understand its advantages over the other similar data processing units some of the major advantages of Hadoop cluster are Hadoop cluster is scalable cost effective flexible fast and resilient to failure let us now discuss each one of them in detail first Hadoop cluster is scalable Hadoop is a beautiful storage platform with unlimited scalability compared to rdbms Hadoop storage Network can be expanded by just adding additional commodity Hardware while rdbms can't scale and process huge amounts of data Hadoop on the other hand can run business applications 
over thousands of computers all together processing petabytes of data the second one is that the Hadoop cluster is cost effective traditional data storage units had many limitations and the major limitation was related to storage Hadoop clusters overcome it drastically with their distributed storage topology Hadoop clusters use commodity hardware and a lack of storage can be handled by just adding additional storage units to the system and the cluster functions as good as new the third one is that Hadoop clusters are flexible flexibility is a major advantage of a Hadoop cluster Hadoop clusters can process any type of data irrespective of whether it is structured semi-structured or completely unstructured this enables Hadoop to process multiple types of data directly from social media the fourth advantage is that Hadoop clusters are fast Hadoop clusters can process petabytes of data within a fraction of a second this is possible because of the efficient data mapping capabilities of Hadoop the secret behind the high speed performance is that the data processing tools are always kept available on the servers that is the data processing tool is available on the same unit where the data needed is stored the fifth advantage is that Hadoop clusters are resilient to failure data loss in a Hadoop cluster is a myth it is practically impossible to lose any data in a Hadoop cluster as it follows data replication which acts as a backup storage unit in case of a node failure so with this let's move on to our next topic which is related to Facebook's Hadoop cluster since its launch in 2004 Facebook has been one of the biggest users of Hadoop clusters it is called the beefiest Hadoop cluster it approximately uses 4000 machines and is capable of processing millions of gigabytes together Facebook has 2.38 billion active users to manage such a huge network Facebook uses multiple storage frameworks and millions of developers writing mapreduce programs in multiple programming languages it also uses SQL which drastically improves the process of search log processing and recommendation systems starting from data warehousing to video and image analysis Facebook is growing day by day by encouraging all possible updates to its cluster the first update is Scuba with a huge amount of unstructured data coming in each and every day Facebook slowly realized that it needed a platform to speed up the entire analysis part this is when Scuba was developed so that Hadoop developers can dive into massive data sets and carry on ad hoc analysis in real time the second update was Cassandra the traditional data storage units started lagging behind when the Facebook search team discovered an inbox search problem the developers were facing issues in storing the reverse indices of messages sent and received by the users the challenge was to develop a new storage solution that could solve the inbox search problem and similar problems in the future the objective was to develop a distributed storage system dedicated to managing large amounts of structured data across multiple commodity servers without even failing once this is when Cassandra was developed the next update is Hive after Yahoo implemented Hadoop for its search engine Facebook thought about empowering its data scientists so that they could store larger amounts of data than the Oracle data warehouse could handle hence
Hive came into existence this tool improved the query capability of Hadoop by using a subset of SQL and soon gained popularity in the world of unstructured data today almost thousands of jobs are run using this system to process a range of applications quickly today Facebook is one of the biggest corporations on earth thanks to its two and a half billion active users let us have an overview of Facebook's Hadoop cluster and then let us move on to the architecture of the Hadoop cluster so this is the overview of Facebook's Hadoop cluster which consists of web servers the ad hoc Hive Hadoop cluster the production Hive Hadoop cluster and many more now that we have gone through a few facts on Facebook's Hadoop cluster let us move on to the Hadoop architecture which has the following components hdfs and yarn let us now begin with hdfs hdfs consists of the following components the name node secondary name node and data node let us discuss each one of them in detail first the name node the name node is responsible for running the master daemons the name node is designed to store the metadata which means the information about the actual data or in short the schema of the data the name node is the first one to encounter the client's request for data it then transfers the request to the data nodes which store the actual data the name node is responsible for managing the health of all the data nodes it receives a heartbeat from all the data nodes at a particular interval of time and it also receives a status update of the tasks assigned if any of the data nodes fails to respond with the heartbeat then the name node considers that data node to be dead and it reassigns its tasks to the next data node the next one is the data node data nodes are called the slaves of the name node and they are responsible for storing the actual data and also for updating the task status and health status to the name node in the form of a heartbeat now the last one is the secondary name node the secondary name node as the name suggests is not actually a backup of the name node but it acts as a buffer which saves the latest updates to the fs image which are obtained in the intermediate process and finally merges them into the final fs image now let us discuss yarn yet another resource negotiator yarn consists of the following elements node manager app master and container let us discuss each one of them in detail first the node manager the node manager is the per node agent of yarn that runs on every slave machine it launches and monitors the containers on that machine and reports their resource usage back to the resource manager the second one is the app master the app master is responsible for negotiating the resources between the resource manager and the node managers and the last one is the container a container is actually a collection of a reserved amount of resources allocated by the resource manager to work on the task assigned by the app master now with this we shall have a look at the overview of the Hadoop cluster architecture and followed by that we shall look into the rack awareness algorithm so this is the architecture of a Hadoop cluster which consists of racks each and every rack consists of a set of computers and one of the racks contains the master and these racks use core switches to communicate with each other now let us move on to the rack awareness algorithm the rack awareness algorithm is all about data storage it says that the first replica of the actual data must be located in the local rack and
the rest of the replicas can be stored on a different remote track let us look on to an example to understand this in a better way here I'm having a data block on the data Node 1 and the data Node 1 is available on the rack 1 which happens to be a local track now according to the rack awareness algorithm the replica of the data Block in data Node 1 can be stored in the remote racks which might be rack 2 or rack 3. as you can see the replicas have been stored in a remote track which is rack number two now let us deal with a different block as you can see we have a new Block in rack number 2 Data note 7. this is the local rack for the data block stored in data node 7. now let us see where the replicas of data node 7 are stored the replicas of data Note 7 are stored into the remote rack which is rack number three and the data block is stored in data node 9 and data node 12. as you can see now we have a new data block stored in the data node 11 and rack 3 is a local rack for the data block stored in data node 11. now let us see where the replicas of data node 11 are stored as you can see the replica blocks of data node 11 are stored in the remote track which is rack number one and the data blocks are stored in data node 2 and data node 4. with this we have finished our theory part now let us get into the Practical part where we'll learn to set up a Hadoop cluster with one master and two sleeves so let us begin with a practical session here we must create three host systems out of which one is the master and the other two are the slaves so I'll be choosing the Linux operating system for this and I'll be using Centos 7. I'll be starting with creating a new virtual machine here I'll be selecting my ISO image this happens to be my ISO image which has the Hadoop pre-installed select the process finish and using the similar process you must create two new host systems and you must name them as Hadoop Slave 1 and slave 2 as I have done here as you can see I have my Hadoop Master here and Hadoop Slave 1 and Hadoop slave 2. let me start each one of them now I have started my Hadoop Master Hadoop Slave 1 and Hadoop slave 2. now all the components of Hadoop cluster which are the Hadoop Master Hadoop slave one and Hadoop slave 2 all are started assuming that you know how to install Hadoop I have chosen Ascent OS operating system which has the Hadoop pre-installed now let's start our localhost our localhost is started now let's see the hdfs web user interface this is how our hdfs web user interface looks like and you can see our name node is in progress and similarly let us try to start the hdfs web user interface in our Hadoop Slave 1 and slave 2. as you can see the hdfs web user has been successfully started on Hadoop Master Hadoop Slave 1 and Hadoop slave 2. now let us begin with setting up our Hadoop cluster with Hadoop master and Hadoop Slave 1 and slave 2. before getting started our first job will be getting to know the IP address of our Hadoop Master to know the IP address of the Hadoop Master we can type in the command fconfig this command will give us the IP address of our Hadoop master here my IP address is 192.168.233.130 similarly similarly let's find out the IP address of our Hadoop Slave 1 and slave 2. now I'm in the slave one here I'll type in my command fconfig which must show me my IP address of slave1 so here the IP address of the Slave 1 is 192.168.233.138 similarly let us find out the IP address of slave 2. 
as you can see the IP address of slave 2 is 192.168.233.129 now our next job is to edit the host and set the IP addresses of the master and slaves now let us open a new terminal you can edit the host by the command VI Etc hosts here I have already edited the host as you can see I have given 192.168.207.134 as my Hadoop master and 192.168.233.128 as my slave 1 and finally 192.168.233.129 as my slave 2. this step must be followed on all the three machines which is the Hadoop Master Hadoop Slave 1 and Hadoop slave 2. once you define the IP addresses you can simply press escape and colon WQ to exit the terminal and save the changes you made in the terminal as you can see I have made the same changes in Hadoop Slave 1 and you can also see the same changes in Hadoop slave 2. remember you have to follow all these changes in all the three machines which includes Hadoop master and Hadoop slaves now we know that we have defined the IP addresses of all the three machines to all the three machines now let us see if they work or not let me open a new terminal and try to Ping the Hadoop slave 1. the name of my Hadoop slave one is slave one so let me write ping to slave 1. as you can see the Ping is successful and the data has been sent to the slave 1. now let me open a new terminal and try to Ping to my next slave using the command ping as you can see my second slave name is slave 2. so I'll be writing slave to and I'll start the Ping as you can see the Ping is successful now similarly let us try this on our slaves too let me open a new terminal on Slave 1 and let me type in the command to Ping the master as you can see the name of my master is master so let us try to Ping to the master the Ping is successful and the data has been sent similarly let us try on our slave 2 if slave 2 can successfully send ping to the master or not let us open a new terminal and write in ping to the master as you can see the data has been successfully sent to the master here so with this we have successfully established a Hadoop cluster consisting of Hadoop master and Hadoop slaves as you can see the pinks are successful the master is communicating with both slay 1 and slave 2. 
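For reference, the /etc/hosts entries being described would look roughly like this, using the addresses narrated above; your own IP addresses will differ:

```
192.168.207.134   master
192.168.233.128   slave1
192.168.233.129   slave2
```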
and similarly the slave one is communicating with master and slave 2 is also communicating with Master now with this we have finished our demo session now let us learn about managing a Hadoop cluster Hadoop is both a command line interface as well as an API it does not require any tool in specific for managing and monitoring utilities yet there are some options available such as ambari and Orton works the most popular one is the ambari let us see how does a typical ambari user interface looks like so this is how a typical ambari user interface looks like we can see the hdfs disk usage percentage data nodes alive memory research graph Network usage graph and CPU loads cluster load name node Heap and many more [Music] what is a Hadoop cluster a cluster is basically a collection a computer cluster is a collection of computers interconnected to each other over a network similarly a Hadoop cluster is a collection of extraordinary computational systems designed and deployed to store optimize and analyze petabytes of Big Data with astonishing agility now that we are clear with the definition of Hadoop cluster we shall move ahead and understand the factors that decide the Hadoop cluster capacity the various factors that we need to look into in order to design and deploy an efficient Hado cluster are volume of data data retention data storage and the type of workload we shall discuss each one of them in detail we shall discuss the first factor which is the volume of data and what is its role when it comes to deciding the capacity of a Hadoop cluster so if you ever wonder how Hadoop came into existence then it is because of huge volume of big data that the traditional processing systems couldn't handle since the introduction of Hadoop the volume of data is increased exponentially so it is important for a Hadoop admin to know the volume of data he needs to deal with and accordingly plan organize and set up the Hadoop cluster with appropriate number of nodes for efficient data management followed by the volume of data the next factor is data retention data retention is all about storing important and valid data there are many situations where data aligned will be incomplete or invalid that may affect the process of data analysis so there is no point in storing such data data retention is a process where the user gets to remove outdated invalid and unnecessary data from the Hadoop storage in order to save and improve the cluster computation speeds followed by data retention the next important feature is the data storage data storage is one of the most crucial factors that come into picture when you are into planning a Hadoop cluster data is never stored directly as it is obtained it undergoes through a process called Data compression here the obtained data is encrypted and compressed using various data encryption and data compression algorithms so that the data security is achieved and the space consumed to save the data is as minimal as possible followed by the data storage the last and the most important factor is the type of workloads this factor is purely performance oriented all this Factor deals with the performance of the cluster the workload on the processor can be classified into three types intensive normal and low some jobs like data storage will cause low workload on the processor jobs like data querying will have an intense workload on both the processor and the storage unit of the cluster so the type of workloads also matter a lot so if you want to have a smooth and efficient cluster then 
the processors you use and the RAM you use should be high-end so these were the factors that decide the planning of a Hadoop cluster followed by this we shall move into the next topic which deals with the hardware requirements of a Hadoop cluster but before that let us understand the Hadoop architecture in the Hadoop architecture we have the following components hdfs and yarn inside hdfs we have the name node and secondary name node which are considered as masters similarly in yarn the resource manager would be your master and when it comes to slaves the data nodes and node managers are the slaves so basically the name node controls all the data nodes existing in a Hadoop cluster the name node is considered as the master daemon that manages the data nodes the name node records all the metadata or the schema of all the files which it receives followed by that the name node receives a heartbeat and a block report from all the data nodes on the other hand data nodes are considered as the slave daemons that run on slave machines the actual data is stored on the data nodes and the data nodes are responsible for serving read and write requests so this was the architecture of Hadoop now let us discuss the hardware requirements the name node secondary name node and job tracker are the masters the name node and secondary name node are the most crucial parts of any Hadoop cluster they are expected to be highly available the name node and secondary name node servers are dedicated to storing all the namespace storage and edit log journaling so the hardware requirement for these nodes is at least four to six SAS storage disks these disks will be divided as follows one terabyte of hard disk space will be dedicated to the operating system followed by that two terabytes of hard disk space will be dedicated to fs image storage followed by that one terabyte of hard disk space will be given to other software like Apache zookeeper and other required software now when we come to the processor we require a hexa-core or at least an octa-core processor with 2 to 2.5 gigahertz processing speed and now if we discuss the RAM we have to include at least 128 GB of RAM for an efficient and flawless performance and finally the major part the internet one should have at least 10 gbps internet speed to have an efficient workflow so these were the requirements for the name node secondary name node and job tracker followed by the name node and job tracker the next crucial components in a Hadoop cluster where the actual data is stored and Hadoop jobs get executed are the data nodes and task trackers respectively let us now discuss the hardware requirements for the data nodes and task trackers the number of nodes in a standard Hadoop cluster is 24 nodes and each node should have a capacity of 4 terabytes of storage followed by that the next important requirement is the processor similar to the name node we also need a hexa-core or at least an octa-core processor with a speed of 2 to 2.5 gigahertz followed by that the RAM we should have at least 128 GB of RAM and finally the internet should be similar to the name node that is 10 gbps so these were the hardware requirements for the Hadoop cluster now let us move ahead and understand the software requirements of the Hadoop cluster when it comes to software the operating system becomes the most important one you can set up your Hadoop cluster using the following operating systems of your choice a few of the most important and recommended operating systems to set up a Hadoop cluster are Solaris Ubuntu Fedora Red Hat and
Centos now that we have understood the hardware and software requirements of Hadoop cluster capacity planning we will now plan a sample Hadoop cluster for better understanding the following problem is based on the same let us assume that we have to deal with a minimum of 10 terabytes of data and assume that there is a gradual growth of data say 25 percent every three months or every quarter of a year in future assuming that the data grows every year and the data in year 1 is 10000 terabytes by the end of five years let us assume that it may grow to 25000 terabytes if we assume 25 percent year by year growth and 10000 terabytes of data per year then in five years the resultant data may be nearly equal to 1 lakh terabytes how exactly can we estimate the number of data nodes that we might require to tackle this data the answer is simple using the formula mentioned here Hadoop storage HS is equal to CRS divided by 1 minus i where C is the compression ratio R is the replication factor S is the size of the data which is being moved into Hadoop and i is the intermediate factor now let's calculate the number of nodes required assuming that we will not be using any sort of data compression we shall keep C as a standard 1 and we all know that the standard replication factor in Hadoop is 3 so R is equal to 3 and next the intermediate factor i is equal to 0.25 then the calculation for Hadoop storage in this case will result as follows HS is equal to 1 into 3 into S whole divided by 1 minus 0.25 which results in HS is equal to 4S so the expected Hadoop storage in this case is 4 times the initial storage the following formula can be used to estimate the number of data nodes which is N is equal to Hadoop storage divided by D which in turn is CRS divided by 1 minus i whole divided by D where D is none other than the disk space available on each node let us assume that 25 terabytes is the available disk space per single node with each node comprising 27 disks of one terabyte each and 2 terabytes dedicated to the operating system also taking the Hadoop storage to be hosted as 5000 terabytes N is equal to 5000 divided by 25 which results in 200.
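Written out, the sizing calculation narrated above is as follows, reading the 5000 terabytes as the Hadoop storage to be hosted and 25 terabytes as the usable disk per node:

```latex
HS = \frac{C \cdot R \cdot S}{1 - i} = \frac{1 \cdot 3 \cdot S}{1 - 0.25} = 4S
\qquad
N = \frac{HS}{D} = \frac{5000\ \text{TB}}{25\ \text{TB per node}} = 200\ \text{data nodes}
```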
hence we need 200 nodes in this scenario so that's how we plan a sample Hadoop cluster now that we have understood how to plan a sample Hadoop cluster we shall move into the final topic for today's discussion which is the Hadoop admin responsibilities the Hadoop admin should be responsible for implementation and administration of Hadoop Administration followed by that he or she should be responsible for testing mapreduce Hive pink and should be having access for all the Hadoop applications the third responsibility is the Hadoop admin should be taking responsibility of cluster maintenance tasks like backup recovery upgrading and patching the fourth responsibility is performance tuning and capacity planning for clusters the last but the most important responsibility of a Hadoop admin is to monitor Hadoop cluster and deploy security [Music] who is a Hadoop developer Hadoop developer is a professional programmer with sophisticated knowledge of Hadoop components and tools a Hadoop developer basically designs develops and deploys Hadoop applications with strong documentation skills so the basic definition of a Hadoop developer is as follows a Hadoop developer is a professional programmer who has some sophisticated knowledge about Hadoop components along with its tools a Hadoop developer basically designs develops and deploys Hadoop applications with strong documentation skills now let us move ahead and understand the roadmap to become a Hadoop developer to become a Hadoop developer this is the roadmap which you need to follow firstly you need a strong grip on SQL Basics and understand terminology of distributed systems which is mandatory followed by that the next important skill is that you need to be comfortable with Linux Basics you need to know the important commands and terminologies in Linux then comes the next stage where you need to have strong programming sales in popular languages such as Java python JavaScript node.js and minimum followed by that you need to have your own Hadoop projects in order to understand the terminology of Hadoop then you need to be comfortable with Java because Hadoop was developed using Java and the most important thing is you need to have a bachelor or a master's degree in computer science and engineering technology and finally you need to have an experience of about two to three years you might want to join some startup companies for experience or you might also want to join some internships based on Hadoop so this is the roadmap to become a successful Hadoop developer followed by this we shall discuss the skills that are required by a Hadoop developer Hadoop development involves multiple Technologies and programming languages so the core and important skills to become a successful Hadoop developer are enlisted as follows firstly you need to have a basic knowledge of Hadoop and its ecosystem later he should be able to work with Linux and execute some of the basic commands in Linux followed by that you are expected to have some hands-on experience with Hadoop core components then Hadoop Technologies like mapreduce Pig hype and hbase are mandatory followed by that you should be able to handle multi-threading and concurrency in ecosystem once you are familiar with that your next stage is to work with the ETL tools the familiarity with ATL tools such as loading data and processing it using Flume and scoop is a compulsory followed by that you should be able to work with back-end programming as well once you are done with that you should be experienced with some 
scripting languages such as Pig Latin and after that you should be having a good knowledge of very languages like hive ql so these were the few important skills which you require to become a successful Hadoop developer now let us move ahead and understand the salary trends of Hadoop developer Hadoop developer is one of the most highly rewarded profiles in the world of IT industry salary estimations based on the most recent updates provided in the social media say the average salary of a Hadoop developer is more than any other professional let us discuss the salary trends of a Hadoop developer compared to other Technologies firstly VMware you can see that the salary transfer of VMware professional lies between the ninety thousand dollars to eighty five thousand dollars followed by that we have my sequel so the MySQL professionals also lie under the same segment followed by that we have.net VB which is a visual Basics then we have IBM Mainframe Developers Then followed by that we have C plus plus developers with an annual income of ninety thousand dollars followed by that we have JavaScript in the same segment then sap developers with a little hike around ninety five thousand dollars per annum then we have teradata developers then we have the second highly paid Unix developers which lie under the segment of hundred and five thousand dollars per annum and finally you can see the Hadoop developers with the most highly paid profile which lie under the segment of hundred and ten thousand dollars paid per annum now with this let us move ahead and discuss the salary trends of Hadoop developer in different countries based on their experience firstly let us consider the United States of America based on experience you can see that the average salary of a Hadoop developer in United States of America with experience around one to two years is around 120 000 per annum followed by America we have the average salary of a Hadoop developer in United Kingdom which comes around 85 000 pounds per annum and finally we have the average salary of a Hadoop developer in India which is around 5 lakh rupees per annum we shall discuss the salary trends of Hadoop developers in a much detailed way you can see that in America for an entry-level Hadoop developers the salary begins from 75 000 to 80 000 and you can see that for an experience Hadoop developer the salary Trends from 125 000 US dollars to 150 000 US dollars now followed by the United States of America we have United Kingdom here you can see that for an entry level Hadoop developer the salary starts from 25 000 pounds to thirty thousand pounds and whereas on the other hand foreign experienced Hadoop developer the salary Trends from 80 000 pounds to ninety thousand pounds per annum now in India we have four lakh package to five lakh package for an entry-level Hadoop developer on the other hand for an experience level Hadoop developer the salary Trends from 45 lakhs to 50 lakhs per annum now we shall move ahead and discuss the job trends for a Hadoop developer you can see that compared to other developers how do developers have seen a gradual growth in the rate of requirement the number of Hadoop jobs has increased at a sharp rate from 2014 to 2019. it has just an almost double in between April 2016 to April 2019. 
50 000 vacancies related to Big Data are currently available in business sectors of India India contributes to 12 percent of Hadoop Developer jobs in the worldwide Market the number of offshore jobs in India is likely to increase at a rapid Pace due to Outsourcing almost all big MNC companies in India are offering handsome salaries for Hadoop developers in India eighty percent of Market employers are looking for Big Data experts from engineering and management domains now with this let us move ahead and discuss about the top companies that are hiring Hadoop Developers the top companies which are hiring Hadoop developers are Oracle Dell capgemini IBM emphasis cji Facebook Twitter LinkedIn Yahoo medium Adobe Infosys cognizant Accenture Oracle Dell Amazon and many more now let us move ahead and understand the roles and responsibilities of a Hadoop developer different companies have different issues with their data so the roles and responsibilities of the developers need to be a varied skill set to be capable enough to handle multiple situations with instantaneous Solutions some of the major and general roles of a Hadoop developer are as follows firstly you should be capable enough to develop Hadoop and implement it with Optimum performance followed by that you should be able to load data from different data sources which might be rdbms dbms data warehouse and many more followed by that you should be capable enough to design build install configure and support Hadoop system followed by that you should be able to translate complex technical requirements in a detailed design you should be able to analyze vast data storages and uncover insights you should be also able to maintain security and data privacy you should be capable to design scalable and high performance web services using Data Tracking high speed data querying is a must you should also be able to load deploy and manage data in edgebase defining job flows using schedulers like zookeeper is mandatory and finally cluster coordination Services through zookeeper is also important now with this let us also discuss about the future of Hadoop Developers major large-scale Enterprises need Hadoop for storing processing and analyzing their big data the amount of data is increasing exponentially and so is the need for this software in the year 2018 the global big data and business analytics Market were standing at 169 billion US Dollars and by 2022 it is predicted to grow to 274 billion US dollars however a PWC report predicts that by 2020 there will be around 2.7 million job postings in data science and analytics in the us alone if you're thinking to learn Hadoop then it's the perfect time [Music] the prerequisites to install Hadoop in Windows operating system are Java so we all know that Hadoop supports only Java version 8. 
so firstly we need to download Java 8 version followed by that a latest Hadoop version which we need for our operating system then the configuration files so these were the prerequisites now let's quickly go ahead and download Java 8 version into our local system and also Hadoop so you can see that this particular web page belongs to Oracle and here you'll be getting your Java development kit number eight so these are the various versions available for Java 8 for Linux as well as Windows so we need a jdk which is compatible with Windows so here you can see that Windows x64 jdk version which will support Windows so this particular link will redirect you and download jdk8 for you into your local system once you click on it it will ask you to accept the license terms from Oracle now you can just click on download followed by this you will be redirected into a login page where you need to create your own account with Oracle so that you can download this jdk don't worry this account is free of cost so you can see the jdk is getting downloaded here so as the jdk is getting downloaded we shall now move ahead and download Hadoop for our local system so this particular web page belongs to Apache organization where we can download Hadoop for free so these are the various versions available for Hadoop which are 2.10 3.1.3 3.2.1 and many more so we shall select the latest version of Hadoop but while you're selecting the latest version of Hadoop please make sure that you're not actually downloading the exact latest version of Hadoop here you can see we have three different versions 3.1.3 3.2.1 3.1.2 as you can see 3.2.1 is the latest version we have to select the version which is earlier to it which is 3.1.3 because this particular version will be the stable version now we shall move ahead and select binary once you select binary you will be redirected into a new web page where you will have a mirror link select that mirror link and your Hadoop will be downloaded for your local system as you can see Hadoop 3.1.3 tar.gz is getting downloaded now here you can see I have successfully downloaded Hadoop version 3.1.3 tar file as well as jdk 8 and those two files have successfully moved into my C drive now let's install Java first now make sure that you create a new folder for Java so select change and here select Windows C drive then select make new folder now rename this new folder as Java click OK and now select next you can see the installation procedure has now been started you can see Java development kit 8 has been successfully installed now we shall enter into program files and move our jdk into Java file because sometimes there will be an error while we set environment variables for Java so you can see inside program files we have another folder called Java so inside Java there you have rjdk so now what I'll be doing is just moving this jdk into Java file which we have created in C drive this one now you can just delete this Java file from your program files so that you don't have to mess with duplication of java file now you have your Java and jdk in one single file which is Java that is you have created in Windows C drive now we shall move ahead and set the environment variables for Java so click windows and then enter into settings and inside the settings select system and inside system just type in environment variables and there you go select the edit the system environment variables option and you have this dialog box here select environment variables and inside the environment variables you 
need to set JAVA_HOME as well as the Path entry for Java. Select New, and here just type in JAVA_HOME, and as the variable value add the location of the JDK bin. Our JDK bin location is in the C drive: inside the C drive we have the Java folder, inside the Java folder we have our jdk1.8.0, and inside the jdk folder we have the bin location. This will be the home location for Java. Select OK, then move to the next section, which is the system variables; inside that select Path and select Edit. Here create a new Path entry, which will be the JDK path — the same location, that is the bin of the JDK. Select OK, then OK again, one more OK, and close it. Now Java has been successfully installed on our local system. Let's check whether Java is functional or not: we can do that by pressing Windows + R, and in the Run box just type in cmd so that you can open your command prompt. Here just type in javac; if you see a set of options popping up in your terminal, it means that Java is working properly — and you can see Java is working just fine. Now let's check the version of Java installed on our local system. This can be checked by typing in java -version, and you can see we have version 1.8 running on our local system. Now that we have successfully installed Java, let us move ahead and install Hadoop. You can see that we downloaded the tar version of Hadoop, so we need to extract it first. You can see that the extraction process has completely finished, that is one hundred percent, but with three errors — you can ignore these errors. Now just close the extraction window and you have your Hadoop folder. Let us rename Hadoop 3.1.3 as just hadoop to reduce the confusion. Now that we have successfully extracted Hadoop, let's set the environment variables for Hadoop — but before that, let's set the configuration of Hadoop. Select hadoop, and inside that you have a folder called etc; inside etc you have another folder with the name hadoop, and inside that you have a set of files. Out of all these files, four are important: core-site.xml, then hdfs-site.xml, followed by mapred-site.xml, and lastly the yarn-site.xml file. We need to edit all four of these files, and once we have edited them we need to edit one last file, which is the hadoop-env.cmd Windows command file — there you're just going to add the JAVA_HOME location. Now let's quickly edit all those four files. We have successfully opened our four important files, which are core-site.xml, mapred-site.xml, yarn-site.xml and hdfs-site.xml, and after those, the last file, which is the hadoop-env.cmd file, where we are going to set the JAVA_HOME location. Now let's first set the values for core-site.xml. The value changed in core-site.xml is a property: inside the configuration I have added one property, which is the default file system, that is fs.defaultFS, pointing to the localhost location on port 9000.
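For reference, the core-site.xml edit described above boils down to a single property. This is a minimal sketch of the commonly used form for a single-node setup (the configuration file linked in the description may contain additional properties):

```xml
<configuration>
  <!-- Point the default file system at the local NameNode on port 9000 -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```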
Now let's save this core-site.xml. Similarly, we need to edit the mapred-site.xml file; inside this we need to add some properties as well. As you can see, we have also edited the configuration in mapred-site.xml, so let's save it. After mapred-site.xml we have yarn-site.xml — let's edit this as well. As you can see, yarn-site.xml has also been updated. Don't worry about these property files: I will link them in the description box below, so you can access them and use the same configuration files to install Hadoop. After yarn-site.xml we have the last one, which is hdfs-site.xml, but before editing this particular file I want you to create a new folder in the Hadoop location called data. Let's see how to create it. This folder goes inside the C drive: this is hadoop, and inside this hadoop folder you need to create a new folder with the name data. Inside data you need to create two more folders, namenode and datanode — the first folder will be namenode, and the other folder will be our datanode. Now let's copy the locations of the datanode and namenode folders: this location is the datanode location, and this one is the namenode location. With the two locations copied onto our clipboard, let's go back to the hdfs-site.xml file and edit the configuration there. You can see that we have edited the configuration of hdfs-site.xml. Inside the configuration we have provided the replication factor as the first property and set its value to one — since we are using a single local system we want to save space, so the replication is only one, whereas the default value for the Hadoop replication factor is 3. After the first property, the second property is our namenode directory, for which we have provided the namenode location, that is the hadoop folder, then the data folder, and inside that the namenode folder; similarly, the last property is the datanode property, where the value is hadoop/data/datanode. Now let's save it. Now that we have successfully edited all four important files, let's get back to the hadoop-env.cmd file and edit the JAVA_HOME location. To be on the safer side, let's get back to the environment variables and copy our JDK location. This is the location for JAVA_HOME — we just want to remove the \bin at the end, so the path up to the jdk folder inside C:\Java is enough to set JAVA_HOME in our hadoop-env.cmd file. Now let's save this file and close it. All the important files have now been successfully edited. Next, let's go back to the environment variables and set the home and path for Hadoop. Select New and type in HADOOP_HOME; this particular location, that is the Hadoop bin folder in the C drive, is the value for HADOOP_HOME. Now select OK. Then let's get back to Path and set the path entries for the Hadoop files: here set up a new path entry for the Hadoop bin, and remember to create another path entry for sbin. To locate sbin, go back into the hadoop folder and select sbin — this will be the path value for sbin — so copy that, add a new entry in Path and paste it. That's how you set sbin. Select OK, OK, and finally another OK, and close the System Properties. We have now successfully set the home and path for Hadoop.
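Before moving on, here is a sketch of the hdfs-site.xml edit described above. The property names are the standard ones; the folder paths are the C:\hadoop\data\namenode and C:\hadoop\data\datanode folders created in the walkthrough, so adjust them if you placed Hadoop elsewhere (on Windows these values are sometimes written with forward slashes or a file:/// prefix — follow whatever the linked configuration file does):

```xml
<configuration>
  <!-- Single-node setup, so keep only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Where the NameNode keeps its metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop\data\namenode</value>
  </property>
  <!-- Where the DataNode keeps the actual blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop\data\datanode</value>
  </property>
</configuration>
```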
Now that we have set the home and path for Hadoop, let's go ahead and fix the configuration files. You can see that inside the bin folder of Hadoop we are missing some important files; to fix this we need a replacement configuration package, which will be available in the description box below. You can click on that particular link and the required configuration files will be downloaded to your local system, and all you need to do is replace that folder with the bin folder in your Hadoop. You can see that there is a new file in my Hadoop folder, which is the downloaded Hadoop configuration archive (a .rar file). Now all you need to do is extract this archive. You can see that the folder has been successfully extracted and all the executable files that you require in your Hadoop have been downloaded successfully. Now what you need to do is move this bin into your Hadoop bin: cut this bin, get back to hadoop, delete the existing bin folder and replace it with the new one. There you go, you have successfully done it. Now let's delete the unnecessary files — there you go, it is clean — so you have all the executable files and your Hadoop has been set up. To check whether Hadoop is functioning properly or not, let's open cmd and type in hdfs namenode -format. If you see a set of log lines popping up in your terminal, that means you have successfully installed Hadoop, and you can see that the NameNode has been formatted successfully. Now let's open a new terminal and start all the Hadoop daemons: here you just need to move into your Hadoop folder, that is cd hadoop; now you are inside hadoop, and from there enter sbin. Once you are inside sbin, type in start-all.cmd, and there you go — all your daemons are getting started. So that's how you install Hadoop on your local Windows operating system, with the version Windows 10. [Music] Now you'd be thinking, how does HDFS work? First of all, as you know, HDFS stands for Hadoop Distributed File System. So let's take a step back and understand what a distributed file system actually is in the first place. A distributed file system is about managing data — that is, files or folders — across multiple nodes or computers. It serves the same purpose as the file system provided by the OS on your PC: for example, for Windows you have something called NTFS, and for Mac you have something called HFS. The only difference between the file system that is there on your PC, that is your local file system, and a distributed file system is that instead of storing data on a single machine, in the case of a distributed file system your data is stored in a cluster, which is again nothing but a bunch of computers connected to each other forming a network. Even though the files are stored across a cluster, the DFS will organize and present your data in such a manner that if you try to access the data from any one of the machines in the cluster, you will feel as if the data is stored on the machine that you are using. In other words, DFS provides you an abstraction of a single big machine that has a combined disk capacity equal to the sum total of the disk capacity of each node in the DFS cluster. Now let me give you an example just to clarify things further. Suppose you have a DFS comprising four computers, where each computer has a disk capacity of 1 TB. In this case the DFS will provide an abstraction of a very big machine that has a combined storage of 4 TB. Now you can go ahead and store a single file of, let's say, 3 TB, which will eventually get stored and distributed across the four computers. So this is all about DFS, guys. Next you may ask why we need DFS in the first place — I mean, we could have just increased the disk capacity of a single machine to whatever is required. Well, first of all, there is a limit up to which you can increase the disk capacity of a single machine.
Even if you somehow manage to store all of the data on a single machine, it would lead to another big problem: it would take a lot of time to process the data using a single machine. Let us take an example to understand this. Let's say I have a 4 GB file that takes four hours to process completely using a single PC. Now, what I did instead was use DFS and store the same 4 GB file on a four-node cluster, where each node stores a chunk of that file equal to 1 GB; therefore, in total, the 4 GB file was distributed across the cluster. Now what I can do is process each chunk of the file in parallel using the four computers, thus reducing the entire processing time to one-fourth of the original, that is, one hour. So that is the advantage that you get with DFS. I guess by now you would have understood what DFS is and why we need it, so let us come back to HDFS again. HDFS is also a distributed file system, but for Hadoop, that allows you to store huge data sets — terabytes and petabytes of data — across a cluster of multiple machines, so that you can go ahead and process the data stored on each machine in parallel. A simple yet quite powerful idea, isn't it? That is the main reason why Hadoop and HDFS became so famous. Now the next question that would be pondering in your mind is: how does HDFS manage the data, who distributes the data across the cluster, and how can one access the data present inside HDFS? To answer these questions you need to understand the architecture of HDFS. An HDFS or Hadoop cluster follows a master-slave topology: you have one master node, and all the remaining nodes are slave nodes. In HDFS the master node is called the NameNode, whereas the slave nodes are called DataNodes. DataNodes are responsible for storing the actual data, whereas the NameNode, being the master node, is responsible for managing all the slave nodes or DataNodes. Another responsibility of the NameNode is maintaining and managing metadata — metadata is information about the data that is present inside the DataNodes. So the NameNode will keep all the information regarding which data is stored on which of the DataNodes. Along with that, the DataNodes are supposed to send a heartbeat, or signal, to the NameNode so as to ensure that all the DataNodes are working. If, let's say, one of the DataNodes stops sending that signal, the NameNode will assume that that particular DataNode has failed, and the same will be notified to the admin so that a new DataNode can be commissioned. Other information in the metadata includes things like file permissions, directory permissions, the data locations and so on — all that is there in the NameNode metadata. So now you have understood the architecture followed by HDFS, where we have one master node as the NameNode and the slave nodes as DataNodes. Now the question is: how are files actually stored inside HDFS? Just as in any file system, in the case of HDFS the files are also stored as blocks. The only difference between the file system block size on your own system and that of HDFS is that the default size of each block in HDFS is 128 MB, since we are dealing with large or huge data sets. It is configurable, and you can go ahead and change the default block size as per your use case. Now let's understand how this works. For example, let's say I want to put a file of 380 MB into HDFS.
This 380 MB file will be broken down into three data blocks: the first two data blocks will be of 128 MB each, making 256 MB, and the last block will occupy 380 minus 256 MB, that is, 124 MB. Then these data blocks will be distributed across the cluster, that is, onto different DataNodes. Now when you look at this implementation you will find one more problem: what will happen if one of the DataNodes containing the data blocks crashes? How will the NameNode — how will HDFS — ensure fault tolerance with respect to the current implementation that we have? Let's understand this problem with an example. We'll take the same example as in the previous case, where we have one file of 380 MB that has been distributed across HDFS as three blocks. Now let's say the third DataNode, holding the 124 MB block, crashes. In this case we will not be able to retrieve the data — we will be facing a data-loss issue. So HDFS has a solution for it, called the replication factor. Whenever you are storing any data or copying any file into HDFS, it is broken down into blocks and each block is replicated — by default three copies are kept — and the copies are distributed across the DataNodes. In case any of the DataNodes fails, we can still retrieve the data from a replica that is stored on some other DataNode. That is how HDFS ensures its fault-tolerance capability. Also, there is one more advantage of having a master-slave topology: we can go ahead and add more nodes on the fly. For example, let's say I want more disk space in my cluster; in that case I can commission new DataNodes on the fly without affecting the current infrastructure. So enough of the theory, guys — let's go ahead and have a look at a real cluster, how it looks and what the different ways are by which we can access HDFS. This is our edge node, guys; let me log in. All right. So basically we'll be having two types of file systems over here: one will be the local file system with respect to your edge node, where all the local files are, and then there will be the distributed file system, that is HDFS. Let me list all the files and directories in both cases. For the local file system I will use the local shell command, which is ls, and using that you can see these are the files that are there inside my edge node. Now, for listing out the directories and files that are there inside HDFS, I will use the HDFS shell command: for that you go ahead with hdfs dfs, then -ls, and then you provide the name of the directory whose files or subdirectories you want to list. For example, in my case I'll be looking at my profile directory, that is /user/edu_bigdata_user. As you can see, these are the files and directories inside my profile directory — so now you would understand how HDFS provides you that abstraction of using a single big machine. Let's go ahead and do some demos as well. For example, let's say I want to create a directory inside HDFS: for that I'm going to say hdfs dfs -mkdir and then the path of the directory, which will be under /user/edu_bigdata_user, and let's name the directory hdfs_dir. All right, the directory has been created. Let's go ahead and list all the files that are there inside HDFS again: for that I'm going to say hdfs dfs -ls /user/edu_bigdata_user — oops, I made a mistake there, sorry. So here you can see that I have a subdirectory called hdfs_dir, which has just been created.
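As a quick recap, the HDFS shell commands used in this demo look roughly like the following; the directory name is the one from this particular cluster, so substitute your own, and <local-file> is a placeholder. The -put and fsck steps appear just below in the demo.

```
# List files on the local edge node vs. inside HDFS
ls
hdfs dfs -ls /user/edu_bigdata_user

# Create a directory inside HDFS
hdfs dfs -mkdir /user/edu_bigdata_user/hdfs_dir

# Copy a local file from the edge node into HDFS (shown next)
hdfs dfs -put <local-file> /user/edu_bigdata_user/hdfs_dir

# Check how the copied data was split into blocks and replicated
hadoop fsck /user/edu_bigdata_user/hdfs_dir
```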
All right, now let's go ahead and copy some data from the edge node into HDFS. If I do ls, these are the files that are there on my edge node; let me choose one file — basically I'll be copying this particular tar file, the CDH tar.gz. Let me check the size of the tar file: all right, the file size is 1.1 GB. Let's go ahead and copy it into the cluster. So I'll say hdfs dfs -put — put is for moving data from your local system to HDFS — then I'll mention the name of the file, the CDH tar.gz, and then I'll mention the directory where I want to copy it inside HDFS, which will be /user/edu_bigdata_user/hdfs_dir, the directory that we created recently. Now let's list the files inside the newly created directory, so as to ensure whether the copy has been done successfully or not. For that I'm again going to do hdfs dfs -ls and mention the directory path — let me copy it and paste it over here. As you can see: found one item, that is the file we copied recently. Now let us go ahead and check the number of blocks that have been created for this particular file that we copied into HDFS. So I'm going to say hadoop — I'm sorry — hadoop fsck, and then again I'm going to mention the path. As you can see, a total of nine blocks have been created for the 1.1 GB file, and each block has been replicated three times by default, so in total there will be 27 block replicas residing in HDFS on different DataNodes. So this is how HDFS works, guys. Now, as a programmer, I don't have to worry about how the data blocks are being created or how they are distributed among different nodes; all I have to worry about is the logic that I'll be running on this data. That's why HDFS provides you all those abstractions, tools and APIs, so that it is very easy for you to go ahead and access and manage the data that is there inside HDFS — with the added fact that whenever I run a job or process the data, the data will be processed locally on each node in a distributed and parallel fashion, so we get a reduced processing time. So whenever you are talking about huge data sets, HDFS is the best option. The last fact that I'm going to mention is that all this HDFS hardware is commodity hardware, so basically Hadoop is a very cost-effective solution to the big data problem. In case you have a specific use case that requires more cluster capacity, you can go ahead and upgrade your cluster or hardware on the fly — you can add more nodes to it so that more parallelism can happen and you can again process large data sets. [Music] What is Hadoop MapReduce and why is it required? Hadoop MapReduce is the processing unit of Hadoop, using which you can process the big data that is present on, or stored in, Hadoop HDFS. But what is the requirement — why do we need Hadoop MapReduce in the first place? It is because the big data stored on Hadoop HDFS is not stored in a traditional fashion: the data gets divided into chunks which are stored in the respective DataNodes, so there is no complete copy of the data present in one single, centralized location. Hence a native client application — a Java application or any other application — cannot process that
data right away and hence we needed a special framework that has the capability of processing the data that stays as a blocks of data into respective data nodes and the processing can go there and process that data and then only bring back the result so that kind of a framework is Hadoop mapreduce and we'll move on to the next slide and we'll see mapreduce in a nutshell so this particular slide basically gives you the overview of mapreduce and what are the things that are related to Mac videos to start with what are the applications of mapreduce or where it is used for example it is used for indexing and searching it is used to create classifiers it can be used to create recommendation engines like it has been created by big e-commerce companies like Amazon Flipkart it can be used for analytics by several companies when we talk about the features of mapreduce it is a programming model it can be used for large-scale distributed model like Hadoop hdfs it has the capability of parallel programming which makes it very useful when I talk about functions that are present in mapreduce there are basically two functions that get executed one is the map function and the second is the reduce function if we talk about design patterns that has already been there in the industry for a long time yes you can also Implement all those design patterns using mapping use like summarization classification recommendation or analytics right like join and selection map views has been implemented by Major giants like Google and it has also been adopted by Apache Hadoop for hdfs for processing data in hdfs for processing data using Peg for processing data using Hive or for storing data or executing queries over the big data using edgebase which is a nosql database right so this is something which actually gives you the overview of mapreduce and what are the various features what are the applications where it is implemented what are the functions that I use that kind of information is given in this slide guys are you able to grab that information and now we'll explore the two biggest advantages of mapreduce the very first Advantage is parallel processing you must be aware of parallel processing from before as well because it's not a very new term using mapreduce you can always process your data in parallel okay as you can see in the diagram there are five slave machines and there's some data that is residing on these machines these boxes are nothing but representing a chunk of data a block of data or a stfs block which is getting processed in the respective slave machines right you can see your circle going on so this simply represents the processing okay so in here data gets processed parallely using Hadoop mapreduce and hence the processing becomes fast so it is as simple as the work time problem that you would have solved in your school days for example you would have solved a problem like if a particular task is done by one person he is going to take one day so same task if it is done by three persons how many days it is going to take to finish the job right so what are we doing there we are actually Distributing the task among three people and hence the time that is taken to execute that job becomes less right similarly same happens in Hadoop mapreduce what happens is entire chunk of data gets divided by Hadoop stfs into hdfs blocks and the processing now processes this data in parallel and hence the processing becomes fast so I'll move on to the next slide and we'll explore the second advantage of Hadoop 
mapreduce that is data locality this is one versatile thing that is given by Hadoop map reduce that is you are able to process the data where it is what does it mean let me tell you the data that you move into Hadoop cluster gets divided into stfs blocks and these blocks are stored in the slave machines or the data nodes right as you can see the data is stored in all these slave machines that are there in this picture right what mapreduce does is it sends the processing it sends the logic to the respective slave nodes or the respective data nodes where the data is actually residing as hdfs blocks so what happens is the processing is executed over a smaller chunk of data in multiple locations in parallel right this saves a lot of time as well as it saves the network bandwidth that is required to move big data from one location to other just remember that this was big data which was broken down into chunks right if you start moving that big data through your network channels into a centralized machine and then process it it will give you no Advantage right because you are going to consume the entire bandwidth just in moving the data to a centralized server right so using mapreduce you are not just doing parallel processing however you are also moving the processing the logic that you would like to execute over big data into the respective slave nodes where the chunks of data are present and hence you are also saving a lot of network bandwidth right which is very beneficial finally when the slave machines are done with the processing of the data that is stored at the slave machines they send back the results to the Master machine because the results are not as big as the blocks that were stored on the slave machine hence it will not be utilizing a lot of bandwidth right so what they do is they send the results back to the Master machine these results are aggregated together and the final result is sent back to the client machine which actually submitted the job so we'll move on to the next slide and we'll explore the traditional versus the mapreduce way for this we'll take a real life analogy of election votes counting okay everyone would be aware of Elections right happens everywhere so you would be aware of booths as well so booths are the location where people come and cast their votes right so there are n number of booths spread across the country right let's take a scenario where we have five booths okay where people will go and cast their votes no we also have a result Center which has all the information of the boots that are there and where they are located okay however when people come and cast their votes in these respective Goods the votes are kept there itself that is who they will have its own n number of votes both B will have n number of votes both c will have n number of votes similarly Booth d and e will also have n number of votes that were casted there itself that information is not shared with the result center right so let's move on and let's see how does the vote counting will happen in the traditional fashion okay so if you solve this problem using the traditional way all these votes will be moved to a centralized result center right here and then the counting would start now in this case if we do this what happens is we need to move all the votes to a result Center which is a costly affair right that is you'll have to gather all the votes and move to a center location so there is a cost involved along with that along with the effort right second point is result Center 
also gets overburdened because it has to count all the votes that were casted in these respective routes right as well as since they are counting votes that were casted in all the booths it is going to take a long time so this process doesn't work very well let's see how does mapreduce solve this problem so the very first thing is mapreduce doesn't follow this approach now when you see the mapreduce way what happens is as you already learned in our previous slide that is mapreduce follows data locality right so that means it is not going to bring all the votes into a centralized result Center instead it will do the counting in the respective boots itself in parallel right so what is happening is so once the words that are casted on every both are counted they are sent back to the result Center and the result center now only has to aggregate the results that were sent from respective goods and announced the winner so this way declaring the result becomes easy and very quick otherwise I'll move on to the next slide okay let's move on now let's understand map videos in detail okay now what did we do in the previous example so we had an input okay and that input was distributed among various booths now every input was processed by a respective map function okay in the starting I told you that mapreduce has got two functions one is map and the other is reduced so the counting part that I talked about which was done on the respective boards was done by the map function so every input at every Booth was counted using the map function right here after that the results were sent to the reduce function so the aggregation part is done by the reduce function and the final result is given as the output so this is what has happened in our previous example okay so the entire thing can be divided into math task and reduce task map task gets an input the output of the map task is given to the reduce task and this reduced task gives the output finally to the client in this slide we'll understand the anatomy of mapreduce so what happens is a mapreduce task works on a key value pair as you can see on the left so when I talk about a map a map takes the input as key value okay and gives an output as a list of key value okay now this list of key value goes to a shuffle phase and an input of key and a list of values given to the reducer finally the reducer gives you a list of key value pairs okay in this slide what you need to understand is mapreduce works with key values itself and the remaining thing will be understanding this in the coming slides we move on to the next slide let us take an example to understand the mapreduce way okay so we had an input right the input that you have gets divided or it gets splitted into various inputs okay so that's process is called input splitting so the entire input gets divided into spreads of data on the basis of the new line character the very first line is the first input that is deer beer and River the second line is the second input car car and reward okay it would be dear car and beer now let's move on to the next phase that is the mapping phase now in the mapping phase if you can see what we do is we create a list of key value pairs okay so the input is key and value so key is nothing but the offset of the line number the line number is the key and the entire line is the value okay so line one the offset would be the key and the value would be dear beer and Reverb in real life the line number or the offset is actually a hexadecimal number however to make it easy we 
will only consider it as one or two. So line number one would be the key, and the entire line would be the value. When it is passed to the mapping function, what the mapping function will do is create the list of key-value pairs: it will read every word from the line and mark a one after it, with a comma — it will mark 1 as the value. So deer comma 1, bear comma 1 and river comma 1. Why are we putting a one after every word? It is because each word is one count in itself: deer in itself is one count, so deer comma 1; bear in itself is one count, so bear comma 1; river in itself is again one count, so river comma 1. So we have the key-value pairs deer comma 1, bear comma 1 and river comma 1. Similarly, when we come to the second line, we'll read each word from the line — car, for instance — and mark a one against it, because it is one count in itself: so car comma 1, again car comma 1, and then river comma 1. What will happen with the third line — what will be the result of the mapping function when that is the input? It is exactly the same: deer in itself is one, car in itself is one, and bear in itself is one. Let's move on to the shuffling phase and see what happens there. In the shuffling phase, for every key a list of values is prepared — key K2 with a list of V2 values. What the shuffling phase will do is find every appearance of the key bear and add its values into one list. Let's see what is happening: you can see that there are two incoming arrows — the first arrow is coming from the first list, so bear, and a 1 is added to its list; the other arrow is coming from the third list, so bear comma 1 again, and instead of adding another key it just adds the value 1 to the list of values. So the result would be bear with the list 1 comma 1, because there were two occurrences of bear in two different lists. Similarly, when I talk about car, another list of values is prepared for car: as you can see, there are three incoming arrows — two arrows are coming from the same list, so car, 1 comma 1, and the third arrow is coming from the last list, so another comma 1. Since there were three occurrences of car, the list of values will have three ones: 1 comma 1 comma 1. Similarly with deer: deer was there in the first list and the third list, hence there were two occurrences, so deer comma 1 and 1 — the ones are nothing but the values that were emitted against the respective keys in the map phase. Can you tell me what the fourth answer will be? The answer would be river, 1 comma 1, because river was found in two different lists: the first list, river, its first occurrence, and then the second list, river again — so river and then the list of values, that is 1 comma 1. Now comes the reducing phase. In the reducing phase we start aggregating the values present in the list against every key. For bear there were two values present in the list, 1 comma 1, so the summation of these values is done: bear comma 2. Similarly for car the values are summed, so 1 plus 1 plus 1 becomes 3, and the result from the reducing function is car 3. For deer it becomes 1 plus 1, so deer 2. What will be the last answer? The answer would be river comma 2, because the values are 1 and 1, and 1 plus 1 becomes 2, so this will be the result for river. And at last the final result is sent back to the client: bear comma 2, car comma 3, deer comma 2 and river comma 2.
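To connect this walkthrough to actual code: the word-count job we are about to run from the bundled examples jar is written in Java, and a minimal mapper/reducer pair in the same spirit looks roughly like the sketch below (class names here are illustrative, not necessarily the exact ones inside the jar). The mapper emits (word, 1) pairs and the reducer sums the list of ones per word, exactly as described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every word in the input line, emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the list of ones collected for each word during shuffle
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would package this into a jar and run it with hadoop jar against an input path and an output directory on HDFS, which is essentially what the pre-built examples jar does for us in the next step.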
This is how the entire word-count process works when you use the MapReduce way. Okay, now let's execute this program practically. I showed you the entire process; now let's run a word-count program over the same input file that we saw and see what the result is. So I'll enter the password — this is the Edureka virtual machine that you can download from your LMS, and I'm sure the setup is already ready on your end; in case you are not ready with the setup, guys, please do that. Okay, so this is the input file that I have already created to save time — this was the input from the example, right: deer bear river, car car river, deer car bear. What we are trying to do is execute a word-count program over this. For your information, let me tell you that a set of MapReduce examples already comes with the Hadoop package that you download, and the word-count example is also a part of it. I'll just take you there: my Hadoop is installed in /usr/lib/hadoop. Let us see the contents of this folder — the MapReduce examples jar is present in the share folder, so I'll change the directory to share, then hadoop, and we need to go to the mapreduce part; we'll find the jar in there. This is the jar that I'm talking about, that is the hadoop-mapreduce-examples jar. Let me show you the examples present in this jar, hadoop-mapreduce-examples-2.2.0.jar: I'll execute this command and it will give me the list of all the classes — all the examples — that are there within this jar. As you can see, one is aggregate word count, then aggregate word list, bbp and so on, each along with a description, so you guys can go through it and try all the examples. What we are going to try is the word-count example: a MapReduce program that counts the words in the input files. This is what we are going to execute now; you can try the other examples as well, as you have already seen the list of examples present in the Hadoop examples jar. Now it's time to execute the word-count example, but before that we need to move the input file to HDFS, and only then can we execute the MapReduce example — it is because MapReduce takes its input from HDFS and dumps its output to HDFS. Okay, so the very first thing is that I'll move the file: I do hadoop dfs — that is, Hadoop distributed file system — -put, meaning I'm trying to put a file into HDFS; I'll give the path of my file on my local file system, that is /home/edureka/Desktop, and then the name of the file, input, and then I'll give the HDFS destination — I would like to store it in the root directory of HDFS. I'll press Enter. I'm sure by now you are very clear with the HDFS commands; in case you are not, I request you to please execute the HDFS commands that are present in the HDFS tutorial blog, or you also have a document in the LMS. Okay, it looks like the file has been copied to HDFS; let me just check if it has been done: hadoop dfs -cat — that is, I'll read the file directly — and I'll put the name of the file, that is input. So I'm telling it to go look for the file named input present in the root directory of HDFS and read it. Okay, so this was the file that we copied, right? Good. So let's execute the word-count example: I'll write hadoop jar, and I'll mention the name of the jar, that is the hadoop-mapreduce-examples jar; I'll mention the class that I want to execute, that is wordcount; I'll give the input path on HDFS, that is /input; and then I'll mention the output directory, that is first example out,
and I'll press enter so looks like it has started the execution it is again connecting to resource manager it is because this client submits the job to the resource manager right good as you can see the map and reduce task has started now first the map map task would be executed you can also track the status of your job at this particular location okay so for every job there's an ID application ID that is generated so you can always go to Port 8088 which is nothing but resource manager sport now as you can see the map has finished hundred percent and finally the reduced space would be executed so reduce phase is also complete now finally the job has executed successfully okay as you can see completed successfully let's go and check the output okay I'll clear the screen so Hadoop DFS hyphen LS and the name of the directory was first example hubbed and enter as you can see there is a success file that has been created and a part file the output is always present in the part files okay so we'll just read the part file now instead of Ls now I'll use the command cat and I'll copy this part and I'll paste at the end of first example output okay and let's press enter this is the command that I'm using to read the final part file the output all right so this is the output br2 car 3 Dr 2 and River 2. let's try and compare this with the input file itself so first we'll read the input file Hadoop TFS hyphen cat backslash input right so this is the input file and now we'll read the output file again the part file now guys can you compare it is the result correct let's say there it says it has come twice so bear one and then Bear 2 right car has come Thrice so car one two and three there has come to ice so dear one and then second finally River came twice so River one and River two so we can see that the output that has been given by the word count example is absolutely correct now do you want me to execute the same word count example on a bigger file will that be interesting let me just show you so for that what I'll do is let's open the browser first let's select the website like Apache Hadoop itself okay so let's open Apache Hadoop Apache Hadoop so we'll open this website so as you can see there's the multiple there is a multiple words there let me open the source file of this page so I'll go to view page source so how about running the mapreduce program on this page source and find out the frequency of every word that is coming in this page will that be something you would like to see good so what I've done is I've written a small python script okay I'll just show you using that python script I am trying to scrape the page that we have just opened that is Apache Hadoop org okay so I'll just change my directory to desktop first I'll clear the screen so the name of the script is script Dot py since it is a python script it's extension is script.py I'll open the file script.dy so in this just I am telling that this is where my python is located I am importing a library and that is called URL lab2 the page that I would like to scrape is HTTP Hadoop dot apache.org that I've already shown you now using this command that is URL lib2.url open Hadoop page I am able to fetch the entire source code of this page in one variable that is called page and then finally I am writing this page variable the content of this page variable in a file that is called Hadoop Apache okay that will be created on the desktop okay very simple script I'll just save the file and I'll execute the script by dot backslash scrape.py 
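For reference, a small script along the lines of the one just described might look like this — Python 2 with urllib2, as in the video; the exact output filename and desktop path are assumptions:

```python
#!/usr/bin/python
# Fetch the page source of hadoop.apache.org and save it to a local file
import urllib2

hadoop_page = urllib2.urlopen("http://hadoop.apache.org")

# Write the raw HTML to a file on the desktop; this file is later copied into HDFS
with open("/home/edureka/Desktop/hadoop_apache", "w") as out:
    out.write(hadoop_page.read())
```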
so you'll see that there will be a file that will get created once this script is finished so right here as you can see Hadoop Apache is the file that has been created I can open it for you so it has the entire source code that was there in the Apaches website right interesting now what I'll do is I'll execute the same word count example but before that I need to move the file into Hadoop hdfs right I hope you remember so I'll do Hadoop DFS hyphen word I'll give this path of this file that is Hadoop Apache the local path and then I'll mention backslash that is I would like to move this file to the root directory of Hadoop okay I'll make the terminal full screen now now the file has been copied now I'll execute the word count example on this so I Do jar I need to give the entire path where my jar is located that is user live Hadoop within that share folder Hadoop mapreduce and then there will be the Hadoop mapreduce example jar next I need to mention the class I would like to execute that is word count now I'll mention the name of the input file that is present on hdfs that is Hadoop Apache and I'll give the name of the output directory where I would like to store the final result of this map reduce program and I'll name it as second example out okay and let's press enter as you can see it's connecting to resource manager again it is because it is somebody who executes all your jobs that are submitted from the client so the mapreduce task has started now as you can see which will be the first phase that will get executed guys is it map or reduce it will be map itself I hope everyone is very clear that map will get execute it first so as you can see map has executed now even the reduced face is also over as you can see and we have the results written on hdfs I'll clear the screen so let's check the results quickly so Hadoop DFS hyphen LS the output directory was second example out and we'll find the part files within it which will have the final output so as you can see this is the part file let's try and cut it okay so Hadoop DFS hyphen CAD I need to mention the entire path I'll copy this and paste paste and enter it will be a huge file it is because the number of words that were present in the file was more than the previous input that we gave right as you can see for every character every word that was there in the file you have the word come right so let me just clear this so let's try and find out the count for Hadoop okay so I'll simply execute a Linux command that is graph on the output that will be received from after reading the file I'll simply look for the word Hadoop sadhu came into so many words as you can see right here so this is the command that I executed and here is the research okay so since this was the source code it is a little messy however when you're working with real data set you will be cleaning that data set first and then only you'll be executing your final processing okay even for cleaning you'll be writing a map with this program okay so how did you like the example Matthew already says that wow that's a great example to observe it gives the taste of real-time scenario okay so Matthew it was not a very real time thing however still we were able to scrape a page and then do a word count on this okay there are a lot of other things that can be done using map reduce because the limits of map views is endless okay let's go back to our slides now now let's understand how mapreduce is using yarn to execute the jobs over the cluster but before we go ahead what does 
yarn stand for yet another resource negotiator it is the one which allocates the resources for various jobs that needs to be executed over the Hadoop cluster it was introduced in Hadoop 2.0 itself till Hadoop 1.0 mapreduce was the only framework or only processing unit that can execute over the Hadoop cluster however in Hadoop 2.0 yarn was introduced using which we were able to go beyond mapreduce as well as you can see in this slide okay so we have hdfs in the bottom in between we have got yarn and using yarn lot of Frameworks are able to connect and utilize hdfs okay so now even mapreduce has to connect using yarn request for resources and then only it can execute the job over hdfs that is Hadoop cluster okay similarly spark can connect other search engines can connect to hdfs storm can connect to it hbase which is a nosql database which can connect to it so the applications of hdfs became huge just because Jan was able to open the gates for other Frameworks other big data analytics tools as well let's move on let's talk about the demons present in Hadoop 2.x cluster which runs the components that is storage and processing let's understand how does resource manager and node manager Works in Hadoop 2.x cluster and manages the processing and the job that needs to be executed over the Hadoop cluster okay so what is resource manager resource manager is the master demon that runs on the Master machine which is a high-end machine node manager on the other hand is the Daemon that runs on slave machines or the data nodes okay along with the data node process right let's understand this in a little more detail and explore other components as well okay so what is a client a client is nothing but with submit some are produced job like we did from the CLI okay that is command line interface so it is also a client using which we were able to submit the mapreduce job okay so it is a client similarly a client could be a Java application or anything now what is the resource manager resource manager as I said is the master Daemon to which all the jobs are submitted from the client it is the one which allocates all the cluster level resources for executing a particular job resource manager runs on a machine which is actually a high-end machine with good configuration because it is the Master machine which has to manage everything over the cluster okay what is node manager node manager is a slave demon that runs on the slave machines or the data nodes so every slave machine will have a node manager running it monitors the resources of a particular data node resource manager manages the cluster resources and node manager manages the data node resources okay so what is job history server it is someone who keeps a track of all the jobs that has been executed over the cluster or has been submitted on the cluster it keeps the track of their status as well okay it also keeps the log files of every execution happened over the Hadoop cluster okay so what is application Master application Master is again a process that is executed on a node machine on a slave machine and created by a resource manager to execute and manage a job it is the one which negotiates the resources from the resource manager and finally coordinates with the node manager to execute the task okay similarly what is a container a container is created by the node manager itself which has been allocated by resource manager and within the container all the jobs are finally executed okay let me give you a pictorial representation of this process so 
yarn application workflow in mapreduce so as I said resource manager so there is a resource manager to which all the jobs are submitted there is a cluster in which there are slave machines on every slave machine there is a node manager running okay The Source manager has got two components one scheduler another one is application manager okay so Matthew says what is the difference between application master and application manager application manager is a component of resource manager which ensures that every task is executed and an application Master is created for it application Master on the other hand is somebody who executes the task and requests for all the resources that are required to execute it okay now let's say our job is submitted to the resource manager as soon as the job is submitted the scheduler schedules the job once the scheduler schedules the job to be executed application manager will create a container in one of the data nodes within this container application Master will be started this application Master will then register with the resource manager and request for a container to execute the task okay now as soon as the container is allocated application Master will now connect with the node manager and request to launch the container so as you can see application Master got allocated these two data nodes now this application Master requested the node manager to launch these containers okay as soon as the containers were launched application Master executed the task within the container and the result was sent back to the client let's understand this in a little sequential manner okay so on the right as you can see these are the lines the first one is the client the second one is the resource manager third one is the node manager and the fourth one is the application master so let's see how are the steps executed between them okay so very first step is client submits the job to the resource manager as you can see right here now the second step is resource manager allocates a container to start the application Master on the slave machines right the third step is application Master registers where the resource manager as you can see right here as soon as it registers it requests for the containers to execute the task that is the fourth step after that application Master notifies the node manager on which the container needs to be launched once the node manager has launched the containers application Master will execute the code within these containers finally the client contacts the resource manager or the application Master to monitor application status okay and at the end finally the application Master unregisters itself from the resource manager and the result is given back to the client right so this is one simple sequential flow of how a mapreduce program is executed using yarn framework foreign [Music] and how it works we'll first start with Hadoop version 1 where we had mapreduce version one so in mrv1 the two core Services were hdfs that is Hadoop distributed file system and mapreduce which forms the basis of almost all Hadoop functionality all other components are built around these services and must use mapreduce to run Hadoop jobs while mapreduce method enables user to focus on the problem at hand rather than the underlying processing mechanism they do limit some of the problem domains that can run in Hadoop framework so now let us know how mapreduce version 1 Works in order to understand its limitations the mapreduce processing model consists of two 
separate steps the first step is parallel map phase in which input data is split into discrete chunks that can be processed independently the second phase is the reduce phase in which the output of map phase is aggregated to produce the desired result this is simple and fairly restricted nature of programming model which lends itself to a very efficient and extremely large scale implementation across thousands of low-cost Community servers when mapreduce is paired with a distributed file system such as hdfs which can provide High aggregate i o bandwidth the economic of the system becomes extremely compelling this is a key factor in the popularity of Hadoop one of the key to Hadoop performance is the lack of data motion such as compute task move to the server on which the data decides and not the other way around specifically the map task can be scheduled on the same physical nodes on which data resides in sdfs which exposes the underlying storage layout across the cluster this design significantly reduces the network input output patterns and keeps most of the i o on local disk or on a neighboring server within the same server rack to understand the new yarn process flow it is helpful to review the original Apache Hadoop mapreduce design as a part of Apache software Foundation mapreduce has evolved and improved as an open source project mapreduce project itself can be broken down into the end user mapreduce apis for programming the desired mapreduce applications the mapreduce runtime which is the implementation of various phases such as map phase short Shuffle or merge aggregations and the reduce phase and the mapreduce framework which is the backend infrastructure required to run users mapreduce application manage cluster resources and schedule thousands of congruent jobs among other things this separation of concerns has high significant benefits particularly for end users where they can completely focus on their applications via the API and let the combination of mapreduce runtime and the framework deal with the complex details such as the source management fault organization scheduling as you can see in the right hand side image the master process is the job tracker which serves as the Clearinghouse for all mapreduce jobs in the cluster each node has a task tracker process that manages tasks on the individual node the task tracker are controlled by the job trackers the job tracker is responsible for resource management tracking resource consumption availability and job lifecycle management like scheduling individual tasks of the job tracking progress providing fault orders for tasks Etc the task tracker has simple responsibilities like launch or tear down tasks on order from job tracker and provide task status information to the job tracker periodically Hadoop mapreduce framework has exhibited some growing pains in particular with regard to the job tracker several aspects including scalability cluster utilization ability to control you upgrade to the stack and support for workloads other than mapreduce itself mapreduce is great for many applications but not everything other programming model better serves requirements such as craft processing and iterative modeling using message passing interface as is often the case much of the Enterprise data is already available in Hadoop sdfs and having multiple paths for processing is critical and clearly necessity furthermore given the mapreduce is essentially for batch processing support for real time and near airtime processing has become an 
important issue for the user base a more robust Computing environment within Hadoop will enable organizations to see an increased return on their Hadoop investment by lowering their operational cost now the first challenge was scalability the processing power available in data center continues to increase rapidly consider the additional Hardware capability offered by a commodity server over three year period in 2009 we had eight cores 16 GB of RAM and 4 into 1tb idea of disk coming down to 2012 we had 16 plus course 72 GB of RAM and 12 into 3 TB of hard disk these new servers are often available at the same price in general server are twice as capable today as they were two or three years ago mapreduce is known to scale to production deployment of approximately 5000 server nodes in 2009 vintage thus for same price the number of CPU cores amount of RAM and local storage available to the user will put continued pressure on the scalability of new Apache Adobe installations so the next point is utilization so in the current system the job tracker views the cluster as composed of nodes with distinct map slots and reduced slots which are not fungible utilization issues occurs because of maps thoughts might be full while reduced thoughts are empty or a vice versa scenario improving the situation is necessary to ensure the entire system could be used to its maximum capacity for high utilization and applying resources when needed next is user agility in a real-time deployment Hadoop is very commonly offered as shared multi-tenant system as a result changes to Hadoop software stack affects a large cross-section of Enterprise against the backdrop users are very keen to control upgrades to the software stack as such upgrades have a direct impact on their applications now let us discuss a problem that Yahoo faced Yahoo the first company to embrace Hadoop in a big way and it was a trendsetter within the Hadoop ecosystem in late 2012 it struggled to handle iterative and stream processing system of data on Hadoop infrastructure due to mapreduce limitations scalability bottleneck was caused by having a single job tracker according to Yahoo the Practical limit of such design are reached up to a cluster of 5000 nodes and 40 000 tasks running congruently after implementing yarn in first quarter of 2013 Yahoo installed more than 30 000 production notes on spark for iterative streaming storm for stream processing Hadoop for batch processing Etc such a solution was possible only after yarn was introduced and multiple processing Frameworks were implemented so to address these needs the yarn project was started by Apache Hadoop Community to give Hadoop the ability to run non-maproduced jobs within the Hadoop framework yarn provides a generic Resource Management framework for implementing Hadoop applications in Hadoop version 2.x mapreduce has undergone a complete renovation yarn is also known as mrv2 that is mapreduce version 2. 
YARN provides both full compatibility with existing MapReduce applications and a new approach for virtually any distributed application. The introduction of YARN does not change Hadoop's ability to run MapReduce jobs; it does, however, position MapReduce as merely one of the application frameworks within Hadoop, which works the same way as it did in MRv1. The new capability offered by YARN is the ability to use new, non-MapReduce frameworks that add many new features to the Hadoop ecosystem. So, as you can see here, in Hadoop 1.x MapReduce serves two purposes: the processing paradigm as well as resource management. In Hadoop 2.x, YARN takes over all the responsibilities of resource management, while MapReduce or other frameworks can be used as processing paradigms. The fundamental idea of YARN is to split the two major responsibilities of the JobTracker, that is, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster. The ResourceManager and the per-node slave, the NodeManager, form the new and generic operating system for managing applications in a distributed manner. YARN relies on three main components for all of its functionality. The first component is the ResourceManager, which is the arbitrator of cluster resources; it has two parts, a pluggable scheduler and an ApplicationsManager that manages user jobs on the cluster. The second component is the per-node NodeManager, which manages user jobs and workflow on the given node; the central ResourceManager and the collection of NodeManagers create the unified computational infrastructure of the cluster. The third component is the ApplicationMaster, a per-job lifecycle manager; the ApplicationMaster is where the user application logic resides. Together, these three components provide a very scalable, flexible, and efficient environment to run virtually any type of large-scale data processing job. The ResourceManager is the ultimate authority that arbitrates the division of resources among all the applications in the system. The per-application ApplicationMaster is a framework-specific entity and is tasked with negotiating for resources from the ResourceManager and working with the NodeManagers to execute and monitor the component tasks. The ResourceManager's pluggable scheduler component is responsible for allocating resources to the various running applications, subject to constraints such as capacities and queues. The scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status and offers no guarantees about restarting tasks that were not completed due to either hardware failure or application failure. The scheduler performs its scheduling function based on the resource requirements of an application, using the abstract notion of a resource container, which incorporates resource dimensions such as memory, CPU, disk and network. The NodeManager is the per-machine slave responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager. The per-application ApplicationMaster is responsible for negotiating appropriate resource containers from the scheduler, tracking their status and monitoring progress; from the system perspective, the ApplicationMaster itself runs as a normal container. One of the crucial implementation details of MapReduce within the new YARN system is the reuse of the existing MapReduce framework without any major surgery. This step was
very important to ensure compatibility for existing MapReduce applications and users. Now, the major benefits of YARN are increased scalability, better memory utilization with containers, and the ability to integrate other frameworks such as Spark, Storm, Hive, etc. By adding new functionality, YARN brings new components into the Apache Hadoop workflow. These components provide finer-grained control for the end user and simultaneously offer more advanced capabilities to the Hadoop ecosystem; let's look at them one by one. First is the ResourceManager. As mentioned earlier, the YARN ResourceManager is primarily a pure scheduler: it is strictly limited to arbitrating requests for available resources in the system made by the competing applications. It optimizes cluster utilization (keeping all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs. To allow for different policy constraints, the ResourceManager has a pluggable scheduler that enables different algorithms, such as those focusing on capacity or fair scheduling, to be used as necessary. As we discussed, the ResourceManager has two components, the scheduler and the ApplicationsManager. The scheduler is responsible for allocating resources to applications and does not offer guarantees about restarting failed tasks; it also has pluggable policies, that is, the Capacity Scheduler, the Fair Scheduler, etc. Moving on to the ApplicationsManager: it is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service of restarting the ApplicationMaster on failure. Next is the NodeManager. The NodeManager is YARN's per-node worker agent, taking care of the individual compute nodes in a Hadoop cluster. Its duties include keeping up to date with the ResourceManager, overseeing application-container lifecycle management, monitoring the resource usage (memory, CPU, network, etc.) of individual containers, tracking node health, log management, and auxiliary services that may be exploited by different YARN applications. On startup, the NodeManager registers with the ResourceManager; it then sends heartbeats with status updates and receives instructions. Its primary goal is to manage the application containers assigned to it by the ResourceManager. YARN containers are described by a container launch context (CLC). This record includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for NodeManager services, and the command necessary to create the process. After validating the authenticity of the container lease, the NodeManager configures the environment for the container, including initializing its monitoring subsystem with the resource constraints specified for the application. The NodeManager also kills containers as directed by the ResourceManager. Next is the ApplicationMaster, an important concept in YARN. The ApplicationMaster is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the containers and their resource consumption. It has the responsibility of negotiating for appropriate resource containers from the ResourceManager, tracking their status and monitoring progress. The ApplicationMaster is the process that coordinates an application's execution in the cluster, and each application has its own ApplicationMaster,
which is tasked with negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the tasks. In the YARN design, MapReduce is just one application framework; this design permits building and deploying distributed applications using other frameworks as well. Once the ApplicationMaster is started, it periodically sends heartbeats to the ResourceManager to affirm its health and to update the record of its resource demands. After building a model of its requirements, the ApplicationMaster encodes its preferences and constraints in a heartbeat message to the ResourceManager; in response to subsequent heartbeats, the ApplicationMaster receives a lease on containers bound to an allocation of resources at a particular node in the cluster. Depending on the containers it receives from the ResourceManager, the ApplicationMaster may update its execution plan to accommodate the excess or lack of resources; container allocation and deallocation can take place in a dynamic fashion as the application progresses. The ApplicationMaster design enables YARN to offer some important features. First, scalability: the ApplicationMaster provides much of the job-oriented functionality of the old JobTracker, so the entire system can scale more dramatically; simulations have shown that jobs may scale to 10,000-node clusters composed of modern hardware without significant issues. As a pure scheduler, the ResourceManager does not have to provide fault tolerance for resources across the cluster; by shifting fault tolerance to the ApplicationMaster instance, control becomes local rather than global. Second, moving all the application-framework-specific code into the ApplicationMaster generalizes the system so that it can now support multiple frameworks such as MPI (Message Passing Interface) and graph processing. In reality, every application has its own instance of an ApplicationMaster; however, it is completely feasible to implement an ApplicationMaster that manages a set of applications, and this concept has even been stretched to manage long-running services which manage their own applications, for example launching HBase in YARN via a special HBase AppMaster. So you can see the responsibilities of the ApplicationMaster: managing the application lifecycle, making dynamic adjustments to resource consumption, managing the execution flow, managing faults, providing status and metrics to the ResourceManager, interacting with the NodeManagers as well as the ResourceManager, and so on. The next component is containers. YARN has a pluggable scheduling component: depending on the use case and user needs, the administrator may select a simple FIFO, Capacity, or Fair Share scheduler, and the scheduler class is set in the yarn-default.xml file (or overridden in yarn-site.xml). At the fundamental level, a container is a collection of physical resources such as RAM, CPU cores and disk on a single node, and there can be multiple containers on a single node. Every node in the system is considered to be composed of multiple containers of a minimum size of memory and CPU, and the ApplicationMaster can request any container as a multiple of that minimum size. A container is supervised by the NodeManager and scheduled by the ResourceManager. Each application starts out as an ApplicationMaster, which is itself a container, often referred to as container 0; once started, the ApplicationMaster must negotiate with the ResourceManager for more containers, and container requests can take place in a dynamic fashion at runtime.
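To see these pieces from the command line, the yarn CLI can report what the ResourceManager's scheduler and the NodeManagers are offering. A small sketch follows; the queue name 'default' is the out-of-the-box queue name and is an assumption about your scheduler configuration, and the exact columns printed vary by Hadoop version.

    # Nodes registered with the ResourceManager, with the memory/vcores each NodeManager offers for containers
    yarn node -list -all
    # Status of a scheduler queue (capacity and used capacity), as seen by the pluggable scheduler
    yarn queue -status default
    # Applications the ResourceManager currently knows about, one ApplicationMaster each
    yarn application -list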
For instance, a MapReduce job may request a certain number of mapper containers and, as they finish their tasks, release them and request more reducer containers. Now, as we have already discussed, YARN containers are described by a container launch context (CLC). This record includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for NodeManager services, and the command necessary to create the process; so a CLC includes everything needed to launch a container. Moving ahead, let us understand the YARN workflow. In earlier Hadoop versions, each node in the cluster was statically assigned the capability of running a predefined number of map slots and a predefined number of reduce slots, and the slots could not be shared between maps and reduces. This static allocation of slots wasn't optimal because slot requirements vary during the MapReduce application lifecycle: typically there is a demand for map slots when the job starts, as opposed to the need for reduce slots towards the end of the job. The resource allocation model in YARN addresses the inefficiencies of static allocation by providing greater flexibility. As discussed previously, resources are requested in the form of containers, where each container has a number of non-static attributes; YARN currently has attribute support for memory and CPU, and the generalized attribute model can also support things like bandwidth or GPUs. First, let's look at the client resource request. A YARN application starts with a client resource request: the client communicates with the ApplicationsManager component of the ResourceManager to initiate this process. The client must first notify the ResourceManager that it wants to submit an application; the ResourceManager responds with an application ID and information about the capabilities of the cluster that will aid the client in requesting resources. Next, the client responds with an application submission context. The application submission context contains the application ID, user, queue, and other information needed to start the ApplicationMaster; in addition, a CLC (container launch context) is sent to the ResourceManager, and as already discussed, the CLC contains job files, security tokens, resource requirements, and other information needed to launch the ApplicationMaster on a node. Once the application has been submitted, the client can also ask the ResourceManager to kill the application or to provide status reports about it. When the ResourceManager receives the application submission context from a client, it schedules an available container for the ApplicationMaster; if there are no applicable containers, the request must wait. If a suitable container can be found, the ResourceManager contacts the appropriate NodeManager and starts the ApplicationMaster. As part of this step, the ApplicationMaster's RPC port and a tracking URL for monitoring the application status are established, through which the client can see the application status. In response to the registration request, the ResourceManager sends the ApplicationMaster information about the minimum and maximum resource capabilities of the cluster; at this point, the ApplicationMaster must decide how to use the capabilities that are currently available. Based on the available capabilities reported by the ResourceManager, the ApplicationMaster requests a number of containers; the request can be very specific, including containers that are multiples of the resource minimum values. The ResourceManager will
respond as best as it can to the request, with container resources assigned to the ApplicationMaster. As the job progresses, heartbeat and progress information is sent from the ApplicationMaster to the ResourceManager; within these heartbeats it is possible for the ApplicationMaster to request and release containers. When the job finishes, the ApplicationMaster sends a finished message to the ResourceManager and exits. Next is the ApplicationMaster-to-container communication. At this point, the ResourceManager has handed off control of the assigned NodeManagers to the ApplicationMaster. The ApplicationMaster independently contacts its assigned NodeManagers and provides them with the container launch context, which includes environment variables, dependencies located in remote storage, security tokens, and the commands needed to start the actual process. The NodeManager then launches the container; when the container starts, all data files, executables and necessary dependencies are copied to local storage on the node, and dependencies can potentially be shared between containers running the same application. Once all containers have started, their status can be checked by the ApplicationMaster; the ResourceManager is absent from the application's progress and is free to schedule and monitor other resources. The ResourceManager can direct the NodeManagers to kill containers; expected kill events happen when the ApplicationMaster informs the ResourceManager of its completion, or when the ResourceManager needs the nodes for other applications. When a container is killed, the NodeManager cleans up its local working directory. When a job is finished, the ApplicationMaster informs the ResourceManager that the job completed successfully; the ResourceManager then tells the NodeManagers to aggregate logs and clean up container-specific files, and the NodeManagers are also instructed to kill any remaining containers, including the ApplicationMaster, if they have not already exited. Now let us quickly summarize the YARN application workflow: first, the client submits an application to the ResourceManager; the ResourceManager allocates a container and starts the ApplicationMaster; the ApplicationMaster registers with the ResourceManager; the ApplicationMaster asks the ResourceManager for containers; the ApplicationMaster notifies the NodeManagers to launch the containers; the application code is executed in the containers; the client contacts the ResourceManager and the ApplicationMaster to monitor the application's health; and at last the ApplicationMaster unregisters with the ResourceManager and releases all the resources it acquired. So let us now quickly see how YARN works in practice; let's go to the VM. This is our VM, and I'll be executing a simple word count example. Let's go to our Hadoop home, under share/hadoop/mapreduce; here we have the hadoop-mapreduce-examples jar, which contains a word count class that we can execute while checking its status on YARN. But before that, I'll show you HDFS: this is the HDFS web UI, and you can see we have one DataNode. Let us browse the file system; this is the word_count_input.txt file, which I'll be giving as input for the word count (if you do not have a file on your HDFS, you can execute the put command to upload one). So let us run the job: the command is hadoop jar, then the name of the jar file (hadoop-mapreduce-examples), the name of the class (wordcount), then the input file and the output directory.
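A cleaned-up sketch of the commands used in this demo is shown below; the file name and the output directory follow the walkthrough, but treat the exact jar file name (it carries a version suffix) and the HDFS paths as assumptions for your own setup.

    # Copy the input file to HDFS if it is not there yet
    hdfs dfs -put word_count_input.txt /word_count_input.txt
    # Run the bundled word count example; watch progress in the YARN web UI at http://localhost:8088
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount /word_count_input.txt /output
    # Read the result once the reduce phase has written it
    hdfs dfs -cat /output/part-r-00000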
Let us execute the command; meanwhile, we go to the YARN web UI at port 8088. In the YARN web UI you can see the application we are executing, its application ID, the user, the application name (word count), the application type, the status (ACCEPTED), the final state (UNDEFINED), one running container, and the allocated CPU vcores; from here you can also track the progress of the application through the tracking UI, where the state shows SUCCEEDED. Back in the terminal, you can see the program has executed successfully, along with the job counters: the number of bytes read, bytes written, read operations and write operations, as well as the MapReduce framework counters such as the number of map input records and map output records, and all the other details for the job you executed. You can also see the log files created while executing this job, and you can track the job on the job history server at localhost:19888, where you again find the word count job and details such as the average map time, shuffle time and average reduce time. Finally, let us go to the HDFS browser and look at the output file that was created; downloading it, you can see the count for each word, for example 'car' is four, 'there' is three and 'river' is two. I had given a simple input file: this is the input we provided and this is the output we received. [Music] So what were the problems associated with relational database systems? As I have already mentioned, for a Hadoop developer the actual game starts after the data is loaded into HDFS: developers play around with this data in order to gain the various insights hidden in it. For this analysis, the data residing in an RDBMS needs to be transferred to HDFS, and the task of writing MapReduce code for importing and exporting data between a relational database and HDFS is tedious. This is where Apache Sqoop comes to the rescue and removes the pain of data ingestion. So why do we need Sqoop? It is a known fact that before big data came into existence, the entire data was stored in relational database servers, in the relational database structure. With the advent of Sqoop, the life of developers became easier: Sqoop provides a CLI for importing and exporting data and internally converts a command into MapReduce tasks, which are then executed over HDFS. It uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism. Not only that, it is also very useful for data analysis, high in performance, and driven entirely from the command line. Now let's understand what Sqoop is; but before that, let me tell you how the name Sqoop came into existence (by now I hope you have got the idea that it is used for data transfer between relational databases and HDFS).
scope let me first tell you how the name came into existence the first two letters in scoop stands for the first two letters in SQL and the last three letters in scope refers to the last three letters in Hadoop that is oop so it clearly depicts that it is SQL to Hadoop and Hadoop to SQL that is how the name of scope came into existence so what is scope it is a tool used for data transfer between rdbms like MySQL Oracle SQL Etc and Hadoop like Hive hdfs Edge base Etc it is used to import the data from rdbms to Hadoop and Export the data from Hadoop to rdbms simple again scoop is one of the top projects by Apache software foundation and works brilliantly with relational databases such as teradata netizer Oracle MySQL Etc it also uses mapreduce mechanism for its operation like important export work and work on a parallel mechanism as well as fault tolerance as I have already mentioned that it provides command line interface for importing and exporting the data the developers just have to provide the basic information like database authentication Source destination operations Etc and the rest of the work will be done by scoop tool itself sounds much reliable correct now let's move further and talk about some of the amazing features of scoop for Big Data developers First full load Apache scope can load whole table by a single command you can also load all the tables from a database using a single command next incremental load scope provides a facility of incremental load where you can load the paths of a table wherever it is updated next parallel Import and Export again as already mentioned scoop uses yarn framework to Import and Export the data which provides fault Tolerance on top of parallelism next compression you can compress your data by using gzip algorithm with compress argument or by specifying compression codec argument next Kerberos security integration so what is Kerberos it's a computer network Authentication Protocol which works on the basis of tickets to allow the nodes that are communicating over a non-secure network to prove their identity to one another in a secure manner next load data directly into Hive and Edge base here it is very simple you can load the data directly to Apache high for analysis and you can also dump your data in hbase which is a nosql database now let's see what's next the architecture is one of the empowering Apache scope with its benefits now as we know the features of Apache scope let's move ahead and try to understand Apache scope's architecture on its working so when we submit our job or a command through scope it is mapped into map task which brings the chunks of data from hdfs and these chunks are exported to a structured data destination and combining all these exported chunks of data we receive the whole data at the destination which in most of the cases is rdbms server next reduce phase is required in case of aggregations but Apache scope just Imports and Export the data it does not perform any aggregations map job launched multiple mappers depending on the number defined by the user for scoop import each mapper task will be assigned with the part of data that is to be imported and scope distributes the input data among all the mappers equally in order to achieve high performance then each mapper creates a connection with the database using GDP C and fetches the part of the data assigned by the scope and then writes that data to hdfs Hive or Edge base based on the arguments provided in the command line interface so this is how scope Import and 
So this is how Sqoop import and export work: Sqoop gathers the metadata, submits only a map job (the reduce phase never occurs here), and stores the data in HDFS; for a Sqoop export it is the same mechanism with the data flowing back to the RDBMS. The Sqoop import tool imports each table of the RDBMS into Hadoop, where each row of the table is treated as a record in HDFS, and the records are stored either as text data in text files or as binary data in sequence files. On the other hand, the Sqoop export tool exports Hadoop files back into RDBMS tables: the records in the HDFS files become the rows of a table, and they are read, parsed into a set of records, and delimited with a user-specified delimiter. So that is all about the Sqoop architecture and its import and export. Now let's execute some Sqoop commands and understand how they work. First we have sqoop import, which imports data from an RDBMS into Hadoop. The command is very simple: you provide the JDBC connection for MySQL (your IP address and database name), your table name, the username of the MySQL user, the password if you have set one (otherwise it is not required), and the target directory. Let's see how to execute it. I open my terminal and check whether all my Hadoop daemons are up and running; they are. Next I run sqoop help to check that Sqoop has been installed properly; it lists the available Sqoop commands, and I will demonstrate a few of them. Then I open another terminal and connect to the MySQL database; my user is edureka, so I give that as the username (you can use root if your user is root). Now I am in MySQL: if you want to create a database you can do so with 'create database <name>', but since I have already created one, I just run 'show databases' to list the databases present. I want to use the employees database, so I run 'use employees' and then 'show tables' to list the 11 tables it contains. To look at the employees table, I can run 'select * from employees'; there is a huge amount of data in this table, with that many rows. Now let's go back to the other terminal, where I will show you how to import the data from this table into HDFS using the sqoop import command. The command goes like this: the host is localhost, the database name is employees, the username is edureka, and the table I have chosen is employees; I have not set a password, so I am not specifying one here. It executes, and you can see the job counters, the MapReduce framework counters, the input and output records, and so on; behind the scenes, the map tasks were executed. Now let's check the HDFS web UI at localhost:50070 and see where the data was imported.
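The import command dictated in this walkthrough, written out as a single shell command, would look roughly like the sketch below; the host, database, user and table match the demo, and since no password is set none is passed (otherwise add --password or -P to prompt for one).

    # Basic import of one table from MySQL into HDFS
    # (no --target-dir, so the output lands under /user/<user>/<table>)
    sqoop import \
      --connect jdbc:mysql://localhost/employees \
      --username edureka \
      --table employees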
One important thing to note: I did not specify a target directory, so by default the data is imported into the folder /user/edureka/employees. Here you can see four different part files into which the data was imported. You might be wondering why four: I did not specify the number of mappers, and Sqoop's default is four mappers, which produces the output in four part files. Opening a part file, you can see the data imported from the RDBMS into HDFS; there is a lot of data here, and the other part files look the same. So that is the simple sqoop import command without specifying a target directory or the number of mappers. Now let's import the data from the RDBMS into HDFS while specifying a target directory. The first part of the command stays the same; this time I set the number of mappers to one, so the output ends up in a single part file, and I name the target directory employee10. It takes a while to execute because there is a lot of data in the database, but it finishes and retrieves the records, and again you can see the MapReduce framework counters, the number of bytes written, read operations, write operations, etc. Back in the HDFS browser, the directory employee10 is there, and the output is in just one part file because the number of mappers was one; you can control the number of mappers independently of the number of files already present in a directory, so the entire record set lands in one part file. Next, let me show you how to import using a where clause: you can import a subset of a table using the where clause of the sqoop import tool, which executes the corresponding SQL query on the database server and stores the result in a target directory in HDFS. The command is the same as before; I just change the target directory to employee11, increase the number of mappers to three, and add the where clause, with a condition that the employee number is greater than 49000. As you would expect, the output contains only the records of employees whose employee number is greater than 49000; checking the target directory employee11, you can see exactly those records.
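The two variations just shown, a single-mapper import into a named directory and a filtered import with a where clause, would look roughly like this; the directory names and mapper counts follow the walkthrough, and the filter column emp_no is an assumption about the employees schema.

    # Import into a chosen HDFS directory with a single mapper (one output part file)
    sqoop import \
      --connect jdbc:mysql://localhost/employees \
      --username edureka \
      --table employees \
      -m 1 \
      --target-dir /employee10
    # Import only a subset of rows, split across three mappers
    sqoop import \
      --connect jdbc:mysql://localhost/employees \
      --username edureka \
      --table employees \
      -m 3 \
      --where "emp_no > 49000" \
      --target-dir /employee11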
So I hope you understand how to do this. Next, let's see how to import all the tables from the RDBMS database server into HDFS. Here each table's data is stored in a separate directory, and the directory name is the same as the table name; it is mandatory that every table in that database has a primary key field. The command is like a simple import, except that you remove the table name and replace import with import-all-tables; it then retrieves all the tables present in the employees database. It imported all the tables from the RDBMS into HDFS; again I did not specify a target directory, so by default everything lands under /user/edureka, and you can see all the tables from the database now present in HDFS. That is how the import-all-tables command works, and that was all about executing the import command in various ways. Now let's see how sqoop export works: it exports data from HDFS back to the RDBMS. The command is again simple: you specify the connection, your table name and username, and instead of a target directory you specify the export directory path. One important thing to note: the target table must already exist in the target database. The data is stored as records in HDFS; these records are read, parsed and delimited with the user-specified delimiter. The default operation is to insert all the records from the input files into the database using insert statements; in update mode, Sqoop instead generates update statements that replace existing records in the database. So first we create an empty table to export our data into: I create a table called employee0, and since a primary key value should never be null, I declare the key column as NOT NULL. Now, for the export itself, the command has the same shape as before, except that import becomes export, the database is the same, the table is employee0, and I specify the path of the export directory. That's all; you can see it exported all the records into the RDBMS. Let's cross-check with 'select count(*)' on the table we exported into: the entire record set is there. That is how sqoop export works. Next, let's see how to list the databases present on the relational database server; here you do not even specify a database name, because you are listing all the databases, so you just use list-databases. The databases present are test, jdbc_test, employees and information_schema; cross-checking with 'show databases' in MySQL gives the same result. We can also list the tables present in a database: again it is very simple, you specify the database name and use list-tables instead of import, and it lists all the tables present in the employees database; cross-checking with 'show tables' gives the same output.
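The export and listing commands from this part of the demo, sketched as shell commands; the table name, export directory and connection details follow the walkthrough and should be adapted to your own setup.

    # Export HDFS records back into an existing, empty MySQL table
    sqoop export \
      --connect jdbc:mysql://localhost/employees \
      --username edureka \
      --table employee0 \
      --export-dir /user/edureka/employees
    # List the databases on the server, and the tables inside one database
    sqoop list-databases --connect jdbc:mysql://localhost/ --username edureka
    sqoop list-tables --connect jdbc:mysql://localhost/employees --username edureka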
Now let's look at sqoop codegen. In an object-oriented application, every database table has a DAO (data access object) class that contains getter and setter methods to initialize its objects; sqoop codegen generates this DAO class automatically, producing the Java class from the table's schema structure. The command is simple: I run sqoop codegen, give the connection and the database name (employees), and specify the table name as well. It creates an employees jar file in which the generated code is packaged; I copy the path it prints and jump into that directory, where you can see the employees class file, the jar file and the generated Java source file. Opening the generated folder (its name ends with 539b9), you can see the generated object code; this is the backend code Sqoop produces, and that is how sqoop codegen works. [Music] Why do we need Flume for data ingestion? Let's take a scenario: say we have hundreds of services running on different servers, producing lots of large logs which should be analyzed together, and we already have Hadoop to process these logs. How do we send all the logs to a place that has Hadoop? It is obvious that we need a reliable, scalable, extensible and manageable way to do it. To solve this problem, Apache Flume came into the picture: it is a reliable, distributed and available service for systematically collecting, aggregating and moving large amounts of streaming data into HDFS, and that is why and where we need Flume. So what is Apache Flume? By now you may have gathered that Flume is a tool for data ingestion into HDFS: it collects, aggregates and transports large amounts of streaming data, such as log files and events, from various sources like network traffic, social media, email messages, etc. The main idea behind Flume's design is to capture streaming data from various web servers into HDFS. It has a very simple and flexible architecture based on streaming data flows, it is fault tolerant, and it provides a reliability mechanism for failure recovery. Flume can also easily integrate with Hadoop and dump unstructured as well as semi-structured data into HDFS, complementing the power of Hadoop; this is why Apache Flume is said to be an important part of the Hadoop ecosystem. Now let's look at the benefits of Flume. There are several advantages of Apache Flume which make it a better choice than the alternatives; I have listed some of them. Flume is scalable, reliable, fault tolerant and customizable for different sources and sinks. It can store data in centralized stores like HBase and HDFS. If the read rate exceeds the write rate, Flume provides a steady flow of data between the read and write operations. Flume also provides reliable message delivery: the transactions in Flume are channel-based, and two transactions, one for the sender and one for the receiver, are maintained for each message. Using Flume we can ingest data from multiple servers into Hadoop; it also helps us ingest online streaming data from various sources and supports a large set of source and destination types. The architecture is what empowers Apache Flume with these benefits, so now that we know the advantages, let's move ahead and understand the Flume architecture. One important thing to note: there is a Flume agent which ingests the streaming data from the various data sources into HDFS. From this figure you can see that web servers, Facebook, cloud and social
media all indicate data sources from which we get the data, and Twitter is one of the famous sources for data streaming. Coming to the Flume agent, it comprises three components: source, channel and sink. First, the source: it accepts the data from the incoming stream and puts it into the channel. Coming to the channel: in general the reading speed is faster than the writing speed, so we need a buffer to absorb the difference between the read and write speeds. The buffer acts as intermediary storage that temporarily holds the data being transferred and therefore prevents data loss; in the same way, the channel acts as local, temporary storage between the source of the data and the persistent data in HDFS. Coming to our last component, the sink: it collects the data from the channel and commits, or writes, it permanently into HDFS. So that is the working of the Flume architecture. Now that we know how Apache Flume works, let's take a look at a practical example where we sink Twitter data and store it in HDFS. In this figure you can see there is a lot of data generated from Twitter; we want to stream this data, and Apache Flume will help us do real-time streaming of Twitter data into HDFS. First we create a Twitter application and get the tweets from it; here we use the experimental Twitter source provided by Apache Flume, a memory channel to buffer the tweets, and the HDFS sink to push the tweets into HDFS. The first step is to create a Twitter application: for this you go to apps.twitter.com and sign in to your account (if you do not have a developer account you have to apply for one by completing the form). Since I already have a developer account, I go to developer.twitter.com, click on my profile, choose Apps, and click Create an App. You have to fill in the required fields, starting with the app name; for the website address you must give a complete URL pattern, and then describe how the app will be used. After everything is done, click Create, and the application is created. Now click on Keys and Tokens; here you will find the consumer API keys, and you regenerate them and create an access token and access token secret. We now have the API key, the API secret key, the access token and the access token secret, so copy the keys and tokens. Why do we need to copy them? Because we need to pass these tokens in our Flume configuration file to connect to this Twitter application, so make sure you copy them correctly. Now let's create a Flume configuration file; in Flume's root directory I have already created a flume.conf file, and I'll show you how to specify the parameters. As we discussed in the Flume architecture, I have configured a source, a sink and a channel: our source is Twitter, the channel is a memory channel, and the sink is HDFS. In the source configuration we pass the Twitter source type and then all four tokens which we received from Twitter, that is, the keys and secrets I copied; make sure you paste them properly. At last, in the source configuration, we pass the keywords on which we are going to fetch the tweets.
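Pulling the pieces of this configuration together (including the sink and channel settings described next), a minimal flume.conf sketch and the command to run the agent could look like the following. The agent name 'TwitterAgent', the HDFS path and the exact property names of the experimental Twitter source are assumptions that should be checked against your Flume release, and the four token values are of course placeholders.

    # Sketch of ~/flume.conf for a Twitter source -> memory channel -> HDFS sink
    cat > ~/flume.conf <<'EOF'
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey       = <API key>
    TwitterAgent.sources.Twitter.consumerSecret    = <API secret key>
    TwitterAgent.sources.Twitter.accessToken       = <access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
    TwitterAgent.sources.Twitter.keywords = hadoop, big data, spark, analytics, cloud computing

    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100
    EOF

    # Start the agent against this file
    flume-ng agent --conf conf --conf-file ~/flume.conf --name TwitterAgent -Dflume.root.logger=DEBUG,console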
I have specified keywords such as spark, hadoop, big data, analytics, cloud computing, etc. Now, in the sink configuration, I have configured the HDFS properties: the channel is the memory channel, the type is hdfs, followed by the HDFS path where the Flume tweets will be stored, the file type, the write format (which is text), the batch size, the roll size, the roll count and so on. At last we set the memory channel properties: the type, the capacity and the transaction capacity. Now save this file and close it; for convenience I copy it from the Flume directory into the home directory, and we are all set for execution. Since the flume.conf file is now in the home directory, I go there and run the command: flume-ng agent, with the configuration file path pointing at the home directory, and -Dflume.root.logger set to DEBUG,console. It starts executing; you can see it processing records, and it creates the flume tweets folder containing the output file. After letting this command run for a while, you can exit with Ctrl+C. Now let's open the web browser, go to the HDFS web UI and check the file present at the configured path: here is the flume tweets folder, and clicking on it shows the output file into which the data was streamed. In the output you can see tweets from users: someone says that as a survivor of a politically motivated attack it is tragic to think this is an acceptable state of political discourse, and someone else mentions a youth association, an NGO based in Kullu working in education, environment protection, health and general awareness; it depends on whom you follow, and you will get all of those tweets. This output is nothing but live streaming of Twitter data. [Music] Apache Pig is an abstraction over MapReduce: it is a tool, or platform, used to analyze larger sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing and processing data. To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language; all these scripts are internally converted into map and reduce tasks. Apache Pig has a component called the Pig engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs. That was a basic introduction to Apache Pig; moving ahead, we shall understand the different modes in which Apache Pig runs. There are two such modes: local mode and MapReduce mode. First, local mode: in local mode Apache Pig executes in a single JVM and is used for development, experimenting and prototyping. Here files are installed and run using localhost; local mode works on the local file system, and the input and output data are stored in the local file system. To open the Grunt shell in local mode we execute the command pig -x local; we shall run this practically in the demo section. Now, moving ahead, we
shall discuss the second mode in which Apache Pig can run, that is, MapReduce mode. MapReduce mode is also known as Hadoop mode; Apache Pig chooses it as its default mode, and here Pig renders Pig Latin into MapReduce jobs and executes them on a Hadoop cluster. It can be executed against a semi-distributed or fully distributed Hadoop installation, and the input and output reside on HDFS. The command for starting Pig in MapReduce mode is pig, or pig -x mapreduce; we shall look at both modes in the demo section and execute Pig scripts there (a quick recap of these commands follows this overview). Next, let us understand the ways to execute a Pig program. Pig scripts are executed in three ways: interactive mode, batch mode and embedded mode. First, interactive mode: here Pig is executed in the Grunt shell; to invoke the Grunt shell, run the pig command, and once it starts you can enter Pig Latin statements and commands interactively at the command line itself. Next is batch mode: in this mode we run a script file with a .pig extension; these files contain the Pig Latin commands, so we write the Pig script, store it at some location, and run that file from the terminal. Lastly, the embedded mode: in this mode we can define our own functions, called user-defined functions (UDFs), using programming languages such as Java and Python. Those are the three ways in which a Pig script can be executed. Moving ahead, the next topic is the features of Pig; there are five important ones. First, ease of programming: writing complex Java programs for MapReduce is quite tough for non-programmers, and Pig makes this process easy because Pig queries are converted into MapReduce jobs internally. Next, optimization opportunities: the way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency. Next, extensibility: users can write user-defined functions containing their own logic to execute over the data sets, which increases extensibility for the programmer. Next, Pig is highly flexible: it can easily handle structured as well as unstructured data, so it is considered highly flexible regardless of the data type. Finally, built-in operators: Apache Pig contains various operators such as sort, filter, joins and many more, all built in, so the programmer does not have to implement these functions externally and can use them directly. Those were the five important features of Apache Pig.
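To recap the modes and execution styles just described, here are the corresponding commands; wordcount.pig is a hypothetical script file name used only for illustration.

    # Interactive: open the Grunt shell in local mode or in MapReduce (Hadoop) mode
    pig -x local
    pig -x mapreduce     # plain 'pig' also defaults to MapReduce mode
    # Batch: run a stored Pig Latin script
    pig -x local wordcount.pig
    # One-off statements can also be passed inline
    pig -x local -e "fs -ls /"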
The next topic in today's discussion is installing Apache Pig on our local system, and we shall look at one of the easiest ways to do it on a Windows operating system. The easiest way is to download a virtual machine: I would suggest Oracle VirtualBox for this task, along with the Cloudera QuickStart VM (the links for both are in the description box below). Once Oracle VirtualBox is installed and running on your local system, you add a new virtual machine to it: select the Import option, and in the dialog box navigate to the location where the Cloudera QuickStart VM image is stored (in my system it is on the F drive). Select Open, and before you click Import you may want to increase the RAM to at least 8 GB; I set it to 9000 MB, which is just a little above 8 GB. Then select Import, and the virtual machine is imported. To start it, just click on it and select the Start button; you can see the Cloudera QuickStart VM version 5.13.0 has been imported and booted up, and we get the QuickStart VM welcome screen. Now, if you want to work with the Pig editor, you should log into Hue first. Remember that in Cloudera the default username is cloudera and the password is also cloudera; in fact, the default username and password for everything in this VM is cloudera. Log into Hue with that username and password (you can select "remember" to save the password), and once you are in, open the menu, choose Editor, and inside Editor you have the editor designed for Pig. You are now in the window where you can write Pig scripts and execute them. We will move ahead to our next topic, and after finishing the theory part we shall come back to Cloudera and execute some basic operations in the Pig terminal. Our next concept is the Pig architecture; first let us go through its diagram. The diagram represents the architecture of Apache Pig, and as shown in the figure there are various components in the Apache Pig framework; let us look at the major ones. First, the parser: initially the Pig scripts are handled by the parser, which checks the syntax of the script and performs type checking and other miscellaneous checks. The output of the parser is a DAG, a directed acyclic graph, which represents the Pig Latin statements and logical operators; in the DAG, the logical operators of the script are represented as nodes and the data flows as edges. Next, the optimizer: the logical plan (the DAG) is passed to the logical optimizer, which carries out logical optimizations such as
projections and pushdowns. Next comes the compiler, which compiles the optimized logical plan into a series of MapReduce jobs. After that we have the execution engine: finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed there, producing the desired results. That was a basic explanation of the Apache Pig architecture. Moving ahead, we shall look at the major advantages of Apache Pig. Some of them are: less code, since Pig needs far fewer lines of code to perform an operation, which reduces the overall length of the program; code reusability, since Pig code is flexible enough to reuse, and you can write a Pig script into a file and run it whenever you need it; and nested data types, since Pig provides the useful nested types tuple, bag and map. Those are a few important advantages of Apache Pig. Now we shall move on to the differences between Apache Pig and Apache MapReduce; there are basically four. First, MapReduce is low-level data processing, whereas Apache Pig is considered a high-level data processing tool. Second, with MapReduce we end up with complex Java programs, whereas with Apache Pig we have simple scripts that are shorter and easier to understand. Third, data operations are tough and complicated to implement in MapReduce, but in Apache Pig you do not have to worry about them because they are already built in. Fourth and last, Apache MapReduce does not allow nested data types, whereas Apache Pig does. Those are the basic differences between Apache MapReduce and Apache Pig. The next topic is the Pig demo, where we go through the basic commands and functionality available in Pig; without further ado, let's begin. We are back in the Cloudera VM we installed on our local system; let's start Pig. As discussed before, Apache Pig can be executed in two modes, local mode and MapReduce mode, and first we shall execute an example in local mode. To start Pig in local mode we type the command pig -x local; firing this command starts the Grunt shell in local mode, as you can see. What are we going to execute? A very simple word count program. For this word count example I have taken a short text, the definition of Apache Pig, which reads: Apache Pig is a high-level platform for creating programs that run on Apache Hadoop; the language for this platform is called Pig Latin; Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. We will take this paragraph and count the number of times each word occurs in it. Now let's
quickly go back to our terminal and execute our program. Back at the terminal, let's clear it with Ctrl+L and start typing our commands. As discussed, we are going to execute this program in local mode, so we are not loading the data into HDFS; instead we use the local location, which is /home/cloudera/Desktop/wordcount. We run the first command to load the data, and it loads successfully. In the next statement we tokenize the paragraph, treating each word as a single token separated by spaces; that script executes. The next statement groups the words by their occurrences, and the one after that counts how many times each word is repeated. The last statement dumps the output stored in the word-count relation: you can see some MapReduce jobs being executed, and then the output appears, listing words such as a, and, on, for, its, Hadoop, Apache and so on, each with its number of repetitions; for example, Hadoop is repeated once and Apache four times. With that done, let us move ahead and execute some examples in Hadoop (MapReduce) mode. Let's close this terminal, open a new one, and begin. As discussed, Apache Pig can be executed in both local mode and MapReduce mode, and having finished the local-mode examples we will now work in MapReduce mode, so we first load some local data into HDFS. Using the copy command I load my local file, pig_tutorial.csv, into HDFS and name it edureka.in; the command runs (with a deprecation warning) and the data is loaded. Using the cat command we can see what the data contains: it is a simple CSV file about students, with id, name, department and year. Now let's start Pig in MapReduce mode; to do so we just type pig and press Enter, and Pig starts in MapReduce mode. Let's execute some statements. With the LOAD statement we load edureka.in, the CSV file we just looked at, using PigStorage with a comma separator, and declare the schema: id as chararray, name as chararray, department as chararray and year as int. The data loads successfully, and the DUMP statement shows the data we loaded. Moving further, we use a FOREACH statement to generate the id, name and department for each record in our data; the command executes successfully.
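The Pig Latin statements walked through in these two demos, collected into a runnable sketch; the relation names, file paths and schema follow the walkthrough but are illustrative, the word count runs in local mode, and the CSV steps assume the file has already been copied to HDFS as edureka.in.

    # Local-mode word count over a small text file
    pig -x local <<'EOF'
    lines  = LOAD '/home/cloudera/Desktop/wordcount' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    DUMP counts;
    EOF

    # MapReduce-mode load of the student CSV from HDFS
    pig <<'EOF'
    students = LOAD 'edureka.in' USING PigStorage(',')
               AS (id:chararray, name:chararray, department:chararray, year:int);
    pig_foreach = FOREACH students GENERATE id, name, department;
    DUMP pig_foreach;
    EOF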
So the relation we dumped holds the projected data; type the semicolon, hit Enter, and there you go, the FOREACH command has been executed and we have generated id, name and department. Now a couple more examples. We use the ORDER BY descending operator to arrange the data in descending order of id; it executes, and when we dump the relation the ids come out in descending order. That is how ORDER BY ... DESC works. Finally, one last command: here we use the FILTER operation to filter the students by department, where department equals 'CSE'. Running this gives us only the students in the CSE department; dump the relation, you can see a few deprecation messages and the MapReduce jobs being executed, and finally the output listing the students that belong to the CSE department. [Music] Why exactly did we need Apache Hive? It all began when Facebook started growing: the number of users rose towards a billion, and along with the users the data grew to thousands of terabytes, with nearly 1 lakh queries and around 500 million photographs uploaded daily. This was a huge amount of data for Facebook to process. The first thought was to use an RDBMS, but we all know an RDBMS could not handle such a huge amount of data, nor was it capable of processing it. The next candidate capable of handling all this big data was Hadoop, but even with Hadoop it was not easy to manage all the queries; they took a long time to execute. One thing all the Hadoop developers had in common was SQL, so they came up with a new solution that has Hadoop's capacity with an SQL-like interface, and that is when Hive came into the picture. So here is the exact definition: Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and data analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Apache Hive is data warehousing software that can be used for data analytics; it is built for SQL users, it manages querying of structured data, it simplifies and abstracts the load on Hadoop, and there is no need to learn Java and the Hadoop API to handle data when using Hive. Next, let's look at Apache Hive applications. Hive is used in many major applications, a few of which are as follows. First, Hive is a data warehousing infrastructure for Hadoop: its primary responsibility is to provide data summarization, query and analysis, and it supports analysis of large data sets in Hadoop's HDFS as well as on the Amazon S3 file system. Second, document indexing with Hive: the goal of Hive indexing is to improve the speed of query lookups on certain columns of a table; without an index a query would have to load an entire table or partition and process all the rows, which would be troublesome, and Hive indexing addresses that. Third, predictive modeling: the data manager allows you to prepare your data so that it can be processed in automated analytics.
It offers a variety of preparation functionalities, including the creation of analytical records and timestamp population. The next important application of Hive is business intelligence: Hive is the data warehousing component of Hadoop, it works well with structured data, and it enables ad-hoc queries against large transactional data sets, so it happens to be a best-in-class tool for business intelligence and helps many companies predict their business requirements with good accuracy. Last but not least, log processing: Apache Hive is a data warehouse infrastructure built on top of Hadoop, it allows processing of data with SQL-like queries, and it is very pluggable, so we can configure it to process our logs quite easily. Those were a few important Hive applications. Now let us move ahead and understand Apache Hive's features. The first and foremost feature is SQL-type queries: the SQL-like queries in Hive help Hadoop developers write queries with ease. The next is its OLAP-based design; OLAP is online analytical processing, which lets users analyse information from multiple database systems at one time, and with Apache Hive we can achieve OLAP effectively. Third, Apache Hive is fast: since we have an SQL-like interface over HDFS, writing and executing queries becomes quicker. Fourth, Apache Hive is highly scalable: Hive tables are defined directly on the Hadoop file system, so Hive is fast, scalable and easy to learn. Fifth, it is highly extensible: Apache Hive uses the Hadoop file system, and HDFS provides horizontal scalability. And finally, ad-hoc querying: using Hive we can run ad-hoc queries to analyse and predict data. Those were the few important features of Apache Hive. Let's move on to the next topic, the Apache Hive architecture, which explains the flow of a query submitted to Hive. The first layer is the Hive client. Hive allows writing applications in various languages, including Java, Python and C++, and it supports different types of clients such as the Thrift server, the JDBC driver and the ODBC driver. What exactly is the Thrift server? It is a cross-language service provider platform that serves requests from all programming languages that support Thrift. Next, the JDBC driver: it is used to establish a connection between Hive and Java applications, and the driver is the class org.apache.hadoop.hive.jdbc.HiveDriver. Finally we come to the ODBC driver, which allows applications that support the ODBC protocol to connect to Hive. After the clients we have the Hive services. The services provided by Hive are the Hive CLI, the Hive web user interface, the Hive metastore, the Hive server, the Hive driver, the Hive compiler and, lastly, the Hive execution engine. The Hive CLI, or command line interface, is a shell where we can execute Hive queries and commands. The Hive web UI is an alternative to the CLI that provides a web-based graphical interface for executing Hive queries and commands. Then the Hive metastore: it is a central repository that stores all the structural information of the various tables and partitions in the warehouse.
It also includes metadata about each column and its type, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored. Next, the Hive server, also referred to as the Apache Thrift server: it accepts requests from the different clients and passes them to the Hive driver. Moving on to the Hive driver: the driver receives queries from different sources such as the web UI, the CLI, Thrift, and the JDBC or ODBC drivers, and it transfers the queries to the compiler. Then the Hive compiler: its purpose is to parse the query and perform semantic analysis on the different query blocks and expressions, converting HiveQL statements into MapReduce jobs. Finally, the Hive execution engine: it works with the generated plan in the form of a DAG, a directed acyclic graph, of MapReduce tasks and HDFS tasks, and in the end it executes the incoming tasks in the order of their dependencies. Below that sit MapReduce and HDFS: MapReduce is the processing layer that executes the map and reduce jobs on the data provided, and HDFS, the Hadoop Distributed File System, is where the data we provide is stored. That is the architecture of Apache Hive. Next, the Apache Hive components. What are the different components present in Hive? First, the shell, where we write our queries and execute them. Then the metastore: as discussed in the architecture, the metastore is where all the details related to our tables, such as the schema, are stored. Next, the execution engine, the component of Apache Hive that turns the query or code we have written into a form Hive can run. The driver is the component that executes the code or query in the form of directed acyclic graphs, and lastly the compiler, which compiles whatever code we write so it can be executed and give us the output. Those are the major Hive components. Moving ahead, let's look at an Apache Hive installation on a Windows system. Edureka is all about presenting technical knowledge in the simplest way possible and then playing around with the technology to understand the complicated parts of it, so let's install Hive on our local system in the simplest way possible. For that we need Oracle VirtualBox, which looks like this. Once you have downloaded and installed Oracle VirtualBox, the next step is to download the Cloudera QuickStart VM; the link will be provided in the description box below. Now let's start the Cloudera QuickStart VM with VirtualBox: select the import option and provide the location where your Cloudera QuickStart VM exists; in my system it is on the local disk drive F. Select open, and make sure your RAM allocation is more than 8 GB; I am providing 9000 MB, which is just above 8 GB, so that Cloudera runs smoothly. Select import and you can see the Cloudera QuickStart VM being imported. Once it has been imported successfully it is ready for deployment; just double-click on it and it starts.
You can see that the Cloudera VM has been imported and started, and we have gone live on Cloudera; you can see Hue, Hadoop, HBase, Impala, Spark and more, all pre-installed in Cloudera. Our concern now is to start Hive, and to start Hive you need to start Hue first. Let me remind you of one thing: in Cloudera, every username and password is 'cloudera' by default, so for the Hue login the default username is cloudera and the password is also cloudera. Let's sign in; you can tick the remember option in case you forget the password. We are connecting to Hue, and now we are live on Hue; from here we can browse the file system, and there we have Hive. Now that we have Hive available on our local system, let's move further and understand a few more concepts. First, the data types. Hive's data types are similar to any other programming language: we have tinyint, smallint, int and bigint; we have float, which in Hive is used for single precision, and double if you want double precision; and we have string and boolean, which work just as in the other programming languages we use daily. Next, the Hive data models. These are the basic data models we use in Hive: we create databases, we store our data in the form of tables, and sometimes we also need partitions. We will work through each of these data models in the demo ahead: first we create databases, inside databases we create tables in which data is stored as rows and columns, and along with that we have partitions. Partitions are a more organised way of storing data. Imagine you are in a school, say standard one, and inside standard one you have sections A, B, C and D; partitioning is like storing different students in different sections, so that when you query for a particular student, say Sam from section B, you don't have to search all four sections; you go straight to section B, call Sam, and you have your record. That is how partitions work. After partitions we have buckets, which work in a similar way; we will understand each of these much better through the practical demo. After the data models we will look at Hive operators. Operators are the same operators we use in normal programming languages, such as arithmetic operators and logical operators, and in the Hive demo we will run some arithmetic and logical operations on the data we have stored in tables. Before we get started, let's have a brief look at the CSV files I have created for today's demo. These are small CSV files I created in MS Excel and saved as .csv files.
I kept the CSV files small to keep the execution time as short as possible; since we are using Cloudera, execution can be a little slow, so smaller files are better. The first file is employee.csv, which has employee id, employee name, salary and age. Then there is employee2.csv, which has the same details plus one more column, country; I included country because we will use it in the joins we perform later. Next we have department.csv, with department id and department name: development, testing, product relationship, admin and IT support. Similarly there is student.csv, with the id, name, course and age of each student, and finally studentreport.csv, which holds student reports: gender, ethnicity, parental education, lunch, test preparation course, maths score, reading score, writing score and so on. Those are the CSV files we will use in today's demo, so let's begin. To start Hive we open a terminal; firing up Hive in Cloudera is really simple, you just type hive and press Enter. Logging is initialized from the configuration files, you will see a note that the Hive CLI is deprecated and migration to Beeline is recommended, and then the Hive CLI starts. To save time I have already prepared a document with all the code we will execute today; it will be linked in the description box below, so you can use the same file and practice the same commands on your own system. The first thing we will do is create a database, using the SQL-style command CREATE DATABASE followed by the database name, edureka. The database is created successfully; you can use SHOW DATABASES to confirm it, and you will see the pre-existing default database along with the edureka database we just created. Next we will create a table, and here you need to understand that there are two types of tables in Hive: managed (internal) tables and external tables. What is the difference? A managed or internal table is the default: if you create a table, say edureka, Hive treats it as an internal table unless you say otherwise. With an internal table your data is not as safe. Imagine you are working with a team
and all your team members have access to your Hive or Hue. The table exists in your Hive, and some inexperienced person makes a few changes and accidentally ends up dropping it. If the table was created as an internal table, dropping it erases the data as well; that is the disadvantage of internal tables. If you create an external table instead, then even if somebody drops the table, only the table definition is removed from Hive and the underlying data files stay where they are. That is the best part of external tables; we will look at both. First we create an internal table, using the SQL-style CREATE TABLE command: the table name is employee and the columns are the id of the employee, name, salary and age; the row format is delimited, and since this is a CSV file the fields are terminated by a comma. Don't forget the semicolon; the statement is not complete without it. Fire it in, and the table is created successfully. Now let's describe the table, which shows the columns present in it: use the keyword DESCRIBE, the table name employee, and the semicolon, and you can see the four columns id, name, salary and age. Next, let's check whether this table is an internal (managed) table or an external one: for that we type DESCRIBE FORMATTED, the table name, and a semicolon. There was a small typo the first time, but once corrected you can see the table type is MANAGED_TABLE. Now let's try external tables; first clear the screen with Ctrl+L. Creating an external table is almost identical to creating an internal one; the only difference is the extra keyword EXTERNAL. Fire it in and the table employee2 is created. Describe the table (again, don't forget the semicolon, we often miss it and get an error) and you can see its columns, then run DESCRIBE FORMATTED on employee2, and after fixing another small typo you can see the table type is EXTERNAL_TABLE. That is how we create an internal or managed table and an external table.
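As a reference for this part of the demo, the two table styles, plus the LOCATION variant created in the next step, would look roughly like the following HiveQL; the column types and the HDFS path are assumptions:

-- managed (internal) table: Hive owns the data, so DROP TABLE removes the data too
CREATE TABLE employee (
  id INT, name STRING, salary FLOAT, age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- external table: DROP TABLE removes only the metadata, the files stay in HDFS
CREATE EXTERNAL TABLE employee2 (
  id INT, name STRING, salary FLOAT, age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

DESCRIBE employee;              -- lists the columns
DESCRIBE FORMATTED employee2;   -- Table Type shows MANAGED_TABLE or EXTERNAL_TABLE

-- external table pinned to a specific HDFS directory (path assumed)
CREATE EXTERNAL TABLE edu_emp (
  id INT, name STRING, salary FLOAT, age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/edureka';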
So now we have understood how to create a database and a table, and the two types of tables: the internal or managed table and the external table. Next, let's create an external table in a particular location. The code is the same except that you specify a LOCATION, here under /user/cloudera, and edu_emp is the table we are creating in Hive. Fire it in, and it is created successfully. Let's go back to Hue and check. One thing to remember: when you create tables, everything sits under Hive's warehouse directory; inside the warehouse you have all the databases we created, our first database edureka, and inside it the tables employee and employee2, while the table created with an explicit LOCATION lives under /user/cloudera. Sometimes a table will not show up right away because of network delays; don't worry, the data will appear. Now, back in Hue: if you want to upload a file into Hue, select the plus option, which opens a dialog box where you can pick any file to upload. Let me select studentreport.csv and click open; the upload runs and the data file is uploaded successfully. If you want to look at the file, just click on it and all the data is shown in Hue. You can also run queries on this data: select Query, then Editor, and you have various editors here, Pig, Impala, Java, Spark, MapReduce, shell, Sqoop, and of course Hive. Select Hive and you get an editor where you can type your queries; you can also browse the available databases and tables from there. That is how you write queries from the Hue interface. Let's not spend more time here, there is a lot to cover, so let's continue with the next topic: altering tables. I have created a new table, employee3, with the columns id, name (string), salary and age (float). Now we will make some alterations to it. The first alteration is to rename the table to emp_table; remember, the table is currently named employee3.
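The alterations performed in the next step would look roughly like this in HiveQL (table and column names follow the narration):

ALTER TABLE employee3 RENAME TO emp_table;              -- rename the table
ALTER TABLE emp_table ADD COLUMNS (surname STRING);     -- add a new column
ALTER TABLE emp_table CHANGE name firstname STRING;     -- rename an existing column
DESCRIBE emp_table;                                     -- verify the changes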
To rename it to emp_table we use the ALTER keyword; fire it in and yes, it works, the name has been changed to emp_table. To confirm, type DESCRIBE emp_table followed by a semicolon; if we get the same columns in the description, the rename worked, and there you go, the same columns are there, so the table has been successfully renamed. Next we add another column to emp_table: a surname column of string type. We do that with the ALTER TABLE keyword, the table name emp_table, and ADD COLUMNS with the column name surname and the data type string. Fire it in and we have successfully added a new column to the table. Describe the table again and you can see the surname column we just added at the end. You can also rename existing columns, so let's try that too: one of the columns in emp_table is name, which holds the employees' names, and since I added surname I will change this column from name to firstname. Run the command, the change is made, and when we describe the table (don't forget the semicolon) you can see that what used to be name is now firstname, alongside surname. Let's clear the screen; that's all for alterations. Now we move on to the next major data model, partitioning. We have dealt with the first two data models, databases and tables: we have learned how to create a database, how to create an internal or managed table and an external table, how to create an external table at a particular location, how to load data into a table, and how to alter a table's name and its columns. So far so good; now we continue with partitioning. As discussed earlier, partitioning is just like a school or a college. Imagine you are in a college, in the computer science section, and the college has several branches, say computer science, mechanical, and electronics and communication. Imagine your name is Harry. If someone comes looking for Harry, there may be many Harrys in the college, but if the person asks specifically for Harry from computer science, the search becomes simple: you don't have to look through electronics and mechanical, you just go into the computer science class and find Harry. That is how partitions work. To run queries with partitions we will create a whole new database and start fresh, a separate database just for the partitioning data model.
I am creating a new database, edureka_student, and it is created successfully. Next let's use this database; to do that you just add the keyword USE and the database name, fire it in, and we are now working inside the edureka_student database. Now let's create a table in it. Here I am creating a normal managed table: the student table has the basic columns id, name and age. You don't see course in the column list because I am going to partition the table on course, so course appears in the PARTITIONED BY clause instead. We have already looked at the student CSV file; the courses this institute offers are Hadoop, Java and Python, and I am going to categorise, or partition, the students based on those courses. Fire in the statement and the partitioned table is created. Before loading data let's describe it: as you can see, the course column is present; we did not lose it, it is simply the partition column. Now let's load the students by course. We load data using LOAD DATA LOCAL INPATH: the student.csv file is at the local location /home/cloudera/Desktop/student.csv, and we load it into the student table in Hive, into the partition for the course Hadoop. Fire in the command, some MapReduce work takes place, and the data is loaded successfully. You can refresh Hive in Hue in two ways, by clicking the refresh button in the browser or by using the manual refresh option; after refreshing you can see the new edureka_student database, the student table inside it, and the directory of students for the course Hadoop. Next we add students for the course Java; for that you only need to replace the course name with Java in the LOAD statement. Then we do the same for the remaining course, Python. So we have uploaded the student details into Hive and partitioned them into three categories, Hadoop, Java and Python.
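The static-partitioning steps just performed correspond roughly to the following HiveQL; the database, table and path names follow the narration, and the exact identifiers are assumptions:

CREATE DATABASE edureka_student;
USE edureka_student;

-- course is the partition column, so it is declared in PARTITIONED BY, not with the regular columns
CREATE TABLE student (
  id INT, name STRING, age INT
)
PARTITIONED BY (course STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- static partitioning: the partition value is passed manually with every load
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
  INTO TABLE student PARTITION (course = 'hadoop');
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
  INTO TABLE student PARTITION (course = 'java');
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
  INTO TABLE student PARTITION (course = 'python');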
Now let's go back to Hue and see whether all three categories are there. After a refresh there is still no sign of Java and Python, but a manual refresh does the trick: it brings up the two new directories, Java and Python, so all three partitions are there, Hadoop, Java and Python, and you can open each one and see the student details. Now, one thing I forgot to mention: there are two types of partitioning, static partitioning and dynamic partitioning. In static or manual partitioning you have to pass the value of the partition column manually while loading data into the table, so the data file itself does not need to contain the partition column; you saw that we passed the partition value manually for Hadoop, Java and Python. With dynamic partitioning you only need to do it once and all three partitions are created automatically: in dynamic partitioning the values of the partition column exist within the table itself, so it is not required to pass them manually. Don't worry, we will run the dynamic partitioning code and it will become much clearer. Let's clear the screen and start fresh with a new database for dynamic partitioning. Earlier we created edureka_student; now we will test dynamic partitioning on a new database, edureka_student2. It is created successfully, and we switch to it with USE; we were in the edureka_student database, now we are inside edureka_student2. Before we start with dynamic partitioning we have to set hive.exec.dynamic.partition to true, because by default partitioning in Hive is static and we need to enable dynamic partitions with this property. Along with that we need another command setting the dynamic partition mode to nonstrict, because by default the partition mode is strict. Execute both and the two required settings are in place. Now let's create a new table named edu_stud, with the same columns: the id of the student, name, course, age and so on. We load the data from the local path /home/cloudera/Desktop/student.csv into the table edu_stud; the data loads successfully, about 267 KB in one file. Now comes the interesting part: we create the partitioned table, again partitioned on course, with the fields separated by commas. Fire it in and the partitioned table, student_part, is created; this is the table we will populate using dynamic partitioning on course.
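The dynamic-partitioning setup just described, together with the INSERT ... SELECT run in the next step, looks roughly like this; the names are assumptions based on the narration:

-- dynamic partitioning is off (and strict) by default
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- staging table that simply holds the raw CSV
CREATE TABLE edu_stud (
  id INT, name STRING, course STRING, age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv' INTO TABLE edu_stud;

-- target table partitioned on course
CREATE TABLE student_part (
  id INT, name STRING, age INT
)
PARTITIONED BY (course STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hive picks the partition value from the last column of the SELECT list,
-- so course goes at the end; one MapReduce job runs per partition created
INSERT INTO TABLE student_part PARTITION (course)
SELECT id, name, age, course FROM edu_stud;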
The only part remaining is to load the data into student_part. We write one statement, and MapReduce automatically segregates the students based on their courses: the Hadoop students go into one partition, the Java students into another, and likewise for Python. Here is how: we INSERT INTO student_part, PARTITION on course, and SELECT the id, name, age and course columns from edu_stud, the staging table that holds the student.csv data. Fire it in and you can see MapReduce jobs running; there are three of them, one for Hadoop, one for Java and one for Python, so it takes a little while, which is exactly why I chose small CSV files. When you take up the course from edureka you get to work on real-time data, gain hands-on experience and use that to get placed in good companies. The stages finish successfully and the data is loaded. Now let's look at what is inside student_part: there you go, these are the records in student_part, separated into partitions by their courses, Hadoop, Java and Python. Now that we have understood static and dynamic partitioning, we move on to the last data model, bucketing. Once we finish bucketing we will look at query operations that can be performed in Hive, then some built-in functions, things like GROUP BY, ORDER BY and SORT BY, and finally we will wind up with the joins available in Hive. For now, let's continue with bucketing. Before starting fresh, let's go back to Hue and check that our dynamic partition was created; after a refresh and a manual refresh, under the edureka_student2 database we have the table student_part, and you can see the partition directories, one per course, along with the default partition holding the remaining records, as we discussed. Now for the last data model, buckets. We create a new database, edureka_bucket, switch to it with USE, and we are inside edureka_bucket. Then we create a new table, which will contain the id, name, salary and age of the employees; the table is created. Next we load the data, using the same file as before, employee.csv, and the data is loaded successfully.
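The bucketing flow, the table just loaded plus the clustered table populated in the next step, looks roughly like this; the names and the number of buckets follow the narration, other details are assumptions:

CREATE DATABASE edureka_bucket;
USE edureka_bucket;

-- plain staging table holding employee.csv
CREATE TABLE edureka_emp (
  id INT, name STRING, salary FLOAT, age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/employee.csv' INTO TABLE edureka_emp;

-- older Hive releases need this flag before filling bucketed tables
SET hive.enforce.bucketing=true;

-- rows are hashed on id and spread across three bucket files
CREATE TABLE emp_bucket (
  id INT, name STRING, salary FLOAT, age INT
)
CLUSTERED BY (id) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

INSERT OVERWRITE TABLE emp_bucket
SELECT * FROM edureka_emp;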
Now comes the main part, the bucketing itself. To enable bucketing in Hive we need to run set hive.enforce.bucketing = true; that's done. Next we cluster the data in this table: we cluster on the id column and divide the rows into three different buckets. Fire in the command and the bucketed table is created successfully. Now we populate it: we insert the data into the three buckets we defined by overwriting the bucketed table with the following statement. You can see the MapReduce jobs being taken care of, one mapper and three reducers, since we asked for three buckets; stage one finishes, the process completes, and the data has been successfully inserted. Let's go back to Hue and check; after a refresh (a manual refresh works better) we can see our database edureka_bucket, inside it the bucketed table emp_bucket, and there is our employee.csv data. Now let's move ahead and look at the basic operations we can perform in Hive. Again we start fresh with a new database; I am creating a new database for every operation in this tutorial just to keep things sorted, so in the file system you can see a separate database for bucketing, a separate one for partitioning and a separate one for learning how to create databases and tables, which keeps everything neatly arranged. For the operations I create a database called hive_query_language and switch to it; doing this each time is also good revision of what we have learned so far. The table is created successfully, we load the employee data into it, and it loads successfully. To see what is in the table we use SELECT * FROM the table edureka_employee, and there you go, those are the details present in it. Now let's look at the operations we can perform on this data. Since we said both mathematical and logical operations are possible in Hive, let's try an addition first: I select the salary column, and as you can see the salaries are 25, 30, 40, 20 thousand rupees and so on. Let me add 5,000 more for every employee using the addition operator; hit Enter and you can see 5,000 has been added, so the first value, which was 25,000, is now 30,000, and every employee suddenly got a 5,000-rupee hike. Now let's take 1,000 back; for that you just replace the addition operator with subtraction, fire it in, and every employee loses a thousand.
The initial amount was 25,000, so removing 1,000 from it results in 24,000; this is always relative to the original values. That is how the arithmetic works. Next let's perform some logical operations. Clear the screen; here I am fetching the employees who have a salary greater than or equal to 25,000, and these are the employees whose salaries are at or above that level. Similarly, let's run another query that finds the employees with salaries less than 25,000; we get two employees with lower salaries, Amit and Chaitanya. That is how you perform basic operations in Hive. Now let's look at the functions you can use in Hive. In the same way, we create a new database, hive_functions, and switch to it. We create a table, employee_function, load the data, and check that it has loaded correctly; it has. Now let's apply some functions to this data. The first function I will apply is the square root, finding the square root of each employee's salary: the square root of 25,000 comes out to 158 point something. That is how you run basic functions on your data. Next, let's find the maximum salary. The job runs and you can see MapReduce work happening; since we are on Cloudera with limited system resources the execution is a bit slow, whereas on a real cluster this would take just a few seconds. And there it is: the maximum salary is 40,000, belonging to Sanjana. Now the minimum salary: the minimum is 15,000, and that is Chaitanya.
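The arithmetic, comparison and built-in-function queries shown above would look roughly like this; the table names follow the demo and are assumptions:

-- arithmetic on a column
SELECT salary + 5000 FROM edureka_employee;   -- give everyone a 5,000 hike
SELECT salary - 1000 FROM edureka_employee;   -- take 1,000 back

-- comparison operators in a WHERE clause
SELECT * FROM edureka_employee WHERE salary >= 25000;
SELECT * FROM edureka_employee WHERE salary < 25000;

-- built-in functions
SELECT SQRT(salary) FROM employee_function;   -- square root of each salary
SELECT MAX(salary)  FROM employee_function;   -- highest salary (runs a MapReduce job)
SELECT MIN(salary)  FROM employee_function;   -- lowest salary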
That is how you run operations in Hive. Let's try a few more, such as converting the employees' names to uppercase; you can see the names converted to uppercase, and similarly we can convert them to lowercase. This is how you learn a technology: you play with it, discover its advantages and disadvantages, and find the ways you can make things work. Now let's move ahead and understand the GROUP BY function in Hive. For that we create a separate database called group, switch to it with USE group; and create a table. The table is created, and we load data into it using the new CSV file, employee2.csv. We use this file because it has the additional country column, and as discussed we will group the employees by country. Looking at the data first, we have three countries, USA, India and UAE, and we will categorise the employees by these countries. While setting this up I made a mistake with the table name, creating employee_order instead, so let's drop that table: to drop a table you use DROP TABLE followed by the table name (my first attempt failed only because the TABLE keyword was missing), and the table is dropped. What we actually wanted was employee_group, so we create employee_group and load employee2 into it, because employee2 carries the country column. Now we run the GROUP BY query to categorise employees by their countries. You can see MapReduce jobs being executed, and there is the result: the employees grouped by country, India, UAE and USA, along with the sum of salaries per country, about 90,000 for India, nearly 1,05,000 for UAE, and 80,000 for USA.
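The string functions and the GROUP BY query just shown correspond roughly to the following HiveQL (table names assumed):

SELECT UPPER(name) FROM employee_function;    -- names in upper case
SELECT LOWER(name) FROM employee_function;    -- names in lower case

-- total salary paid per country (employee2.csv supplies the country column)
SELECT country, SUM(salary)
FROM employee_group
GROUP BY country;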
Let's also run a variation of GROUP BY: here we group by country but keep only the groups whose salary total is greater than or equal to 15,000. It is similar to the previous command, and when it executes we get the same output. Next, the ORDER BY and SORT BY methods. For that we create a new database, orders, use it, and create a new table, employee_order; by now I think you have plenty of practice creating a database, creating a table and loading data into it. The data is loaded, and now we order the rows by salary in descending order. Some MapReduce jobs run, and we see the employees ordered by salary from highest to lowest: Sanjana is in first place with the highest salary of 40,000, working for UAE, and Chaitanya is last with the lowest salary of 15,000, working for India. Let's also run the same query with SORT BY instead of ORDER BY; both work in a similar way, and again the records come out sorted by descending salary. So far we have covered the various operations that can be performed in Hive: arithmetic and logical operations, functions such as maximum and minimum, and GROUP BY, ORDER BY and SORT BY. Now for the last type of operation in Hive, the joins. For that we again create a new database, edureka_join, and switch to it with USE; we are now in edureka_join. We create the first table, emp_join (I initially forgot the semicolon, but once added the table is created) and load the employee data into it. To perform join operations we always need two tables, so in the same edureka_join database we create a second table, the department table, with the columns department id and department name, and we load the department data into it as well.
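For reference, the grouping and ordering queries above, plus the four joins run in the next step, look roughly like this; the join keys and column names are assumptions based on the narration:

-- GROUP BY with a condition on the aggregate (shown here as a HAVING clause)
SELECT country, SUM(salary)
FROM employee_group
GROUP BY country
HAVING SUM(salary) >= 15000;

-- global ordering versus per-reducer sorting
SELECT * FROM employee_order ORDER BY salary DESC;
SELECT * FROM employee_order SORT BY salary DESC;

-- the four join types, joining employee id to department id
SELECT e.name, d.dept_name FROM emp_join e JOIN department d             ON e.id = d.dept_id;
SELECT e.name, d.dept_name FROM emp_join e LEFT OUTER JOIN department d  ON e.id = d.dept_id;
SELECT e.name, d.dept_name FROM emp_join e RIGHT OUTER JOIN department d ON e.id = d.dept_id;
SELECT e.name, d.dept_name FROM emp_join e FULL OUTER JOIN department d  ON e.id = d.dept_id;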
So employee2.csv has the columns id, name, salary, age and country, while department.csv has department id and department name; the department ids are there and the names are development, testing, product relationship, admin and IT support. Both tables are created and both data sets are loaded. Hive gives us four different joins: inner join, left outer join, right outer join and full outer join. Let's perform the first one, the inner join: we select the employee name and the department name, and we join the two tables on the employee id and the department id. You can see the jobs being executed, the MapReduce tasks complete successfully, and the output of the first join is generated. Next, the second type of join, the left outer join; the only difference is the LEFT OUTER JOIN keyword. A job starts and the output of the left outer join is generated as well. Then the right outer join, for which you use the RIGHT OUTER JOIN keyword; fire in the command, the jobs run, and the output of the right outer join is displayed. Finally, the last join operation, the full outer join, using the FULL OUTER JOIN keyword; run it and the output of the full outer join is displayed. That is how join operations are executed in Hive. So we have learned how to create a database, how to create a table, how to load data, and the data models present in Hive, namely databases, tables, partitions and bucketing; we have also understood the various operations that can be performed in Hive, arithmetic and logical operations, functions such as square root, sum, minimum and maximum, other operations such as GROUP BY, SORT BY and ORDER BY, and the joins possible in Hive, inner, left outer, right outer and full outer. Every operation that can reasonably be executed in Hive has been demonstrated in this tutorial, everything is organised here database by database, and you will also get the code I have used in the description box below so you can try it out. If you are looking for an online certification and training on Big Data and Hadoop, check out the link in the description box; during the training you get hands-on experience with real-time data and learn a lot more. So far so good; now we shall also discuss some of the limitations of Hive. First, Hive is not capable of handling real-time data; it is built for batch processing, and if you have to work with real-time data you should go with real-time tools such as Spark and Kafka. For example, imagine you are working with Twitter and you have one lakh comments on a particular post: to process those comments you would first have to load them all into Hive and then process them, and while you are loading the data from Twitter into Hive more comments may arrive
that would simply be missed. So Hive is not preferable for real time; it is suited only to batch-mode processing. Second, it is not designed for online transaction processing, because OLTP works in real time and Hive cannot support real-time processing. Last but not least, Hive queries have high latency: they take a long time to process, and as you have seen, even the small CSV files I used took quite a while. Those are the few important, noticeable limitations of Hive. [Music] Now let's start with databases. A database is a systematic collection of data, and a database management system supports the storage and manipulation of that data, which makes data management easy. For example, an online telephone directory uses a database to store people's phone numbers and other contact details, which a service provider can use to manage billing, client-related issues, handle all that data, and so on. In simple words, a database management system provides the mechanism to store and retrieve data. There are different kinds of database management systems today: the relational database management system, RDBMS; online analytical processing, OLAP; and NoSQL, popularly read as "not only SQL". NoSQL refers to all databases and data stores that are not based on RDBMS principles; it is a newer class of database that has emerged in the recent past as an alternative to relational databases. In 1998 Carlo Strozzi introduced the term NoSQL to name his file-based database. So what is NoSQL actually? NoSQL does not represent a single product or technology; it represents a group of products and various related data concepts for storage and management. NoSQL is an approach to database management that can accommodate a wide variety of data models, including key-value, document, column and graph formats. A NoSQL database is generally non-relational, distributed, flexible and scalable, so we can sum it up as an approach to database design that provides flexible schemas for the storage and retrieval of data beyond the traditional table structures found in relational databases, aimed at large data sets accessed and manipulated at web scale. Now that we understand what NoSQL is, the question is why to use NoSQL. The concept became popular with internet giants like Google, Facebook and Amazon, who deal with huge volumes of data. System response time becomes slow when you use an RDBMS for massive volumes of data. To resolve this we could scale up by upgrading the existing hardware, but that is expensive; the alternative is to distribute the database load across multiple hosts whenever the load increases, a method known as scaling out. NoSQL databases are non-relational, so they scale out better than relational databases, as they are designed with web applications in mind. A NoSQL database is exactly the type of database that can handle all sorts of semi-structured data, unstructured data, rapidly changing data, or big data; in other words, NoSQL databases emerged to resolve the problems of large volumes of semi-structured data. Now let's talk about the features of NoSQL databases.
NoSQL databases never follow the relational model and don't provide tables with flat, fixed-column records, so they don't require object-relational mapping or data normalization. Most NoSQL databases are open source; they can run in a distributed fashion and offer auto-scaling and failover capabilities; and they follow a shared-nothing architecture, which means less coordination and higher distribution. NoSQL databases are non-relational and allow heterogeneous structures of data in the same domain, which supports new-generation web applications. They are either schema-free or have relaxed schemas and do not require any upfront definition of the data's schema, and they offer easy-to-use interfaces for storing and querying data, with APIs that allow low-level data manipulation and selection. Now we can compare SQL and NoSQL. SQL databases have a fixed, static, predefined schema, while NoSQL databases have a dynamic schema. SQL databases are not suited to hierarchical data storage, while NoSQL databases handle it well. SQL databases are vertically scalable, while NoSQL databases are horizontally scalable because they are distributed. SQL databases follow the ACID properties, atomicity, consistency, isolation and durability, while NoSQL databases are guided by the CAP theorem, consistency, availability and partition tolerance. You must be wondering what the CAP theorem is, since it plays an important role in NoSQL databases. The CAP theorem, also called Brewer's theorem, states that it is impossible for a distributed data store to offer more than two out of the three guarantees of consistency, availability and partition tolerance. So some NoSQL databases offer consistency and partition tolerance while others offer availability and partition tolerance, but partition tolerance is common to all of them because NoSQL databases are distributed in nature; based on the requirement, we choose which NoSQL database to use. Different types of NoSQL databases are available based on their data models, so before discussing them let's understand what a data model is. A data model is an abstract model that organizes the description of data, its semantics and its consistency constraints; it emphasizes what data is needed and how it should be organized, rather than what operations will be performed on the data. NoSQL databases come in a variety of types based on their data models: key-value pair based, graph based, column based and document based. We will look at each of these one by one, starting with key-value pair based NoSQL databases. Here data is stored as key-value pairs: each data element in the database is a key-value pair consisting of an attribute name, the key, and a value. These databases are used to handle lots of data and heavy load; examples include Redis, Riak and Voldemort. A quick illustration: data has to be stored as key-value pairs, so the key 'name' has the person's name as its value, making one key-value pair, the actual data; similarly for the key 'dob' the value is 2nd August 1988, and for the key 'address' the value is 32W, NYC. Moving on, let's understand column-oriented NoSQL databases. Column-oriented databases work on columns, where each column is treated separately
Now moving on, let us understand column-oriented NoSQL databases. Column-oriented databases work on columns, where each column is treated separately and the values of a single column are stored contiguously. These databases are used to manage data warehouses, business intelligence, CRM, library card catalogs, etc. Some column-oriented NoSQL databases include HBase, Cassandra, etc., and these databases are based on the Bigtable paper by Google. Let us understand this with a demo. We have row keys associated with columns: row key 1 is associated with column 1, which has value 1, column 2, which has value 2, and column 3, which has value 3, so we can say row key 1 is associated with three columns, where each column has some value. Similarly, row key 2 is associated with column 1 and column 2, which have value 1 and value 2 respectively. Here the row keys are for grouping the columns, but the actual data is stored in the columns. Now let us understand graph-based databases. A graph database stores entities as well as the relations among those entities: the entities are stored as nodes, with the relationships as edges, where an edge gives a relationship between nodes, and every node and edge has a unique identifier. A graph database is schema-less yet multi-relational in nature. These databases are used to manage social networks, logistics, and spatial data; some graph-based NoSQL databases include Neo4j, OrientDB, etc. Let us understand graph databases with a demo. We have a person as a node who has a relationship with another person as a friend, and he has another relationship with a city, that he lives in the city, whose properties could include the address. A node can have multiple relationships and properties: here the person has one more relationship with the city, that he likes the city, with properties including the ratings and reviews. Similarly, the person likes a mall based on properties like ratings and reviews, and the mall node has a relationship with the city, that it is located in the city, with properties including the address. So we can infer that in graph databases the data is stored as nodes, relationships, and properties, where properties can belong to nodes as well as to relationships. Now let us understand document-oriented NoSQL databases. A document database stores data in JSON, BSON, and XML documents. In a document database, documents can be nested, and particular elements can be indexed for faster querying. Document databases make it easier for developers to store and query data by using the same document model format they use in their application code. These databases are used for applications having a dynamic schema and characteristics; some document-oriented NoSQL databases include MongoDB, Couchbase, etc., and MongoDB is the most popular database among the NoSQL databases. Let us understand this with a demo. We have a particular database which can have multiple collections; here we have collection 1 and collection 2.
Let us see how a collection is made. Each collection can have multiple documents with different schemas, since it has a dynamic schema. Here we have document 1 with one key-value pair, document 2 with two key-value pairs, document 3 again with two key-value pairs, and finally document 4 with three key-value pairs. These values can be of any type, like strings, numbers, or even another document. Basically, document-oriented databases are a hierarchical version of key-value based databases. With this we have covered all four kinds of NoSQL databases based on their data models. Now, why did we need HBase? We all know that the traditional data storage system we had was an RDBMS, a relational database management system, for storing and maintaining data, but slowly we faced the rise of big data. Since the rise of big data we have come across new solutions, and Hadoop was one among them, and we started using Hadoop. But when we stored a huge amount of big data inside Hadoop and tried to fetch a few records from it, it was a major issue, because you had to scan the entire Hadoop distributed file system to fetch even the smallest record. This was the limitation of Hadoop: it did not provide random access to the data. This problem was solved using HBase. HBase is similar to database management systems, but it provides us the capability to access data in a random way, so this was the limitation of Hadoop and this is how HBase solved it. Moving ahead, we shall understand the basic definition of HBase. HBase is a distributed, column-oriented database built on top of the Hadoop file system; it is an open-source project and is horizontally scalable. HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop file system, and it is a part of the Hadoop ecosystem that provides random, real-time read and write access to the data in the Hadoop file system. That was the basic definition of HBase; now we will move ahead and understand the basic differences between HBase and HDFS. First, HBase is built on top of Hadoop HDFS, whereas HDFS is one of the major components of Hadoop and is built to store files in a distributed manner. Second, HBase provides fast lookups for larger tables, whereas HDFS, as we discussed, does not provide random access to the data and does not support fast individual record lookups. Third, HBase provides low-latency access to single rows from billions of records, because HBase is capable of giving us random access to the data, that is, we can randomly read the data from any location as well as write to it, whereas HDFS provides high-latency processing and cannot provide random access to the data the way HBase does. The fourth difference is that HBase internally uses hash tables, provides random access, and stores its data in indexed HDFS files for faster lookups, whereas HDFS provides only sequential access to the data. So these are the major differences between HDFS and HBase. After the differences, we shall move on to the storage mechanism in HBase. HBase is a column-oriented database, and the tables in it are sorted by row.
The table schema defines only the column families, which contain key-value pairs. A table can have multiple column families, and each column family can have any number of columns; column values are stored contiguously on disk, and each cell value of the table has a timestamp. In short, in HBase a table is a collection of rows, a row is a collection of column families, a column family is a collection of columns, and a column is a collection of key-value pairs; the accompanying table in the slide gives a brief description of the storage mechanism followed in HBase. After the storage mechanism, we shall now look at the features of HBase. The first feature is that HBase is linearly scalable, because it is built on top of HDFS, and HDFS provides horizontal scalability, a property HBase inherits. The second important feature is that HBase has automatic failure support, which comes from the fault tolerance of HDFS. The third feature is that HBase provides consistent reads and writes, which gives us random access for reading and writing data. The fourth feature is that HBase integrates with Hadoop, both as a source and as a destination. The fifth feature is that it has an easy Java API for clients, and the last important feature is that HBase provides data replication across clusters. After the features, we shall move ahead to the next topic, which is the HBase architecture. In HBase, tables are split into regions and are served by region servers; regions are vertically divided by column families into stores, and stores are saved as files in HDFS. The following image describes the architecture of HBase. HBase has three major components: the client library, a master server, and region servers. Let us understand the master server first. The master server assigns regions to the region servers and takes the help of Apache ZooKeeper for this task; it also handles load balancing of the regions across region servers, and it is responsible for maintaining the state of the cluster by negotiating the load balancing. After the master server, we shall discuss the regions. Regions are nothing but the tables that are split up and spread across the region servers. The region servers have regions that communicate with the client and handle data-related operations; they handle read and write requests for all the regions under them, and they decide the size of a region by following the region size thresholds. The next slide shows how a region is split up. When we take a deeper look into a region server, it contains regions and stores the data as shown below: a store contains a MemStore and HFiles. The MemStore is just like cache memory; anything that is written to HBase is stored there initially, and later the data is transferred and saved into HFiles as blocks and the MemStore is flushed. After the region server, we come to the next important component, which is ZooKeeper. ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, providing distributed synchronization, and many more. ZooKeeper has ephemeral nodes representing the different region servers.
The master servers use these nodes to discover available servers, and in addition to availability, the nodes are also used to track server failures or network partitions. Clients communicate with region servers via ZooKeeper, and in pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper. That was the architecture of HBase; now we shall move ahead to the next important stage of this tutorial, which is the HBase demo. In this demo we will first deal with the installation of HBase on a local system running Windows; after the installation we shall start HBase on our localhost, and then we shall open a terminal and execute some of the commands available in HBase. Without wasting much time, let us start with the installation process. The easiest way to install HBase on your local system is to use Oracle VirtualBox, so you have to download Oracle VirtualBox and install it on your local system. Once you have done that, the next step is downloading the Cloudera QuickStart VM for your local system. After downloading the Cloudera QuickStart VM, you just need to import it using Oracle VirtualBox: select Import, and you will get a new dialog box where you have to provide the location of your Cloudera QuickStart VM, then select Open. You can see that we get a new dialog box here, and remember to increase the RAM size to at least 8 GB for smooth functioning of Cloudera; I have given it 9000 MB, which is just above 8 GB. Now click Import, and it starts importing the Cloudera QuickStart VM into VirtualBox; this might take a little time. Once the process has finished successfully, you can see the Cloudera QuickStart VM icon in your VirtualBox; all you need to do is click it and select Start. A new window pops up, and Cloudera starts up successfully. Now let us open the browser and go to our localhost. You can see that Cloudera is live and the localhost is also live; here is our HBase, so click on it and open all tabs, and you can see the HBase region server and the HBase master service, and everything is online. Remember one thing: in Cloudera, the username and password are cloudera by default. For example, if you want to start up Hive or Hue, you will be redirected to a new web page where Cloudera asks you for the username and password; you just type cloudera as both the username and the password, click Sign In, and you will be logged in. Now that we have started HBase, let us quickly get started with our demo and execute some of the commands available in HBase to get a better idea of how it works. Let us start the terminal, and let us increase the font size so that it is clearly visible. To start HBase from Cloudera's terminal, you need to type hbase shell, and there we go, we are successfully logged in and the HBase shell has started.
I have created a document of the commands that I will be executing today, just to save time. First we will try some of the basic commands in HBase, such as status, which gives us important information about the master, backup masters, region servers, and more; you can see that we have one active master and one server, and since this is a VM we do not have any backup masters here. After status, let us execute the version command, which tells us the version of HBase we are using on the current system: we are using HBase 1.2.0, and the Cloudera version is CDH 5.13.0. Next, let us type another command, table_help, which shows the various operations we can perform on a table, such as enable, disable, flush, and drop, quite similar to what you would do in an RDBMS. Let us clear the screen. One last basic command is whoami, which shows the current user on your local system, that is, user cloudera with groups cloudera and default. There are many other commands in HBase, and we have executed only a few of them, so you can check out the article on the Edureka site to get more information about HBase commands. Now that we have executed some of the basic commands, let us move ahead and create a table. Here I am going to create a table by the name employee, and the table is created successfully; the column families in this table are the name of the employee, the ID of the employee, the designation, the salary, and his or her department. After creating the table, we shall verify whether it was created or not; for that you type list, which shows the existing tables in your HBase, and along with what was already there we now have the employee table. Let us also check the localhost, and yes, here you can see the new table has been created successfully. Moving ahead, let us try a few more operations on the table, such as enabling and disabling it. To disable a table you just write disable followed by the name of the table, and it gets disabled successfully. Now you might wonder whether it is properly disabled or not; to check, you can type scan employee, and if you get the details of the employee table, then the table is not disabled, but if instead you get a message saying the employee table is disabled, you can be sure the table got disabled. That is how it works. Let us quickly clear the screen and move ahead with the next type of operations: this time we will create another table with a different name, employee2.
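If you would rather do the same table administration from code instead of the shell, here is a hedged sketch using the HBase 1.x Java client (the generation of the API that ships with the CDH 5.13 QuickStart VM used above). The column family names mirror the employee table from the demo, and the sketch assumes an hbase-site.xml pointing at your cluster is available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class EmployeeTableAdmin {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // equivalent of: create 'employee','name','id','designation','salary','department'
            HTableDescriptor employee = new HTableDescriptor(TableName.valueOf("employee"));
            for (String family : new String[]{"name", "id", "designation", "salary", "department"}) {
                employee.addFamily(new HColumnDescriptor(family));
            }
            admin.createTable(employee);

            // equivalent of: list
            for (TableName table : admin.listTableNames()) {
                System.out.println(table.getNameAsString());
            }

            // equivalent of: disable 'employee', followed by a quick check
            admin.disableTable(TableName.valueOf("employee"));
            System.out.println("disabled? " + admin.isTableDisabled(TableName.valueOf("employee")));
        }
    }
}
```

HTableDescriptor and HColumnDescriptor are the older 1.x classes; on newer HBase releases you would build the same thing with TableDescriptorBuilder, but the create, list, and disable flow stays the same.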
Now we have two tables in our HBase. Suppose you had to delete or disable multiple tables starting with the letter e, or tables with similar names, as many databases have; for example, in a school you have classes and sections, so for class 10 you might have tables named class10 section A, class10 section B, class10 section C, and so on, and if you had to remove the details of all the students of class 10 you could use a single command. Here I am using disable_all with the prefix e, and this command finds my two tables, employee and employee2, and asks for confirmation, where you provide y for yes and n for no. I do not want to disable my tables, so I will just give n and close it. Moving ahead, we will create a new table with the name student and apply some operations on it: create is the keyword, student is the name of the table, and inside it I have the column families name, age, and course. The table has been created successfully with the name student. Next we shall put some data into HBase. The table name is student, the row is sharath0, and inside the column family name I have a column called fullname, so the full name of Sharath is Sharath Kumar; I load this data into the student table, press Enter, and the data is in. Let us add another value, this time the age: here I am again using the put keyword, loading data into the table student, the row is sharath0, and inside the column family age I have the column presentage, so the present age of Sharath is around 24, and I have added that data as well. After that, let us add the course Sharath is studying: inside Sharath's row in the student table I have the course family, and inside it the pursuing column, and he is pursuing Hadoop in this institution, so let us press Enter to load this data too. Similarly, let us add the same details for another student, Shashank: we add Shashank into the table, then his age, which is around 23, and then the course he is studying, which is Java, and all of that gets loaded as well. Now that we have loaded all the data into the table, let us fetch some details. Here we use the keyword get: we are getting information from the table student, and the particular student whose information we want is Shashank. Press Enter and you get all the information related to Shashank: the complete name of Shashank is Shashank R, the course he is pursuing is Java, and his present age is 23.
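For reference, the same put, get, and scan operations can also be issued through the HBase Java client. The sketch below assumes the student table created above; the row key and column qualifiers (sharath0, name:fullname, age:presentage, course:pursuing) follow the demo, so treat the exact spellings as illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StudentTableClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table student = connection.getTable(TableName.valueOf("student"))) {

            // equivalent of: put 'student','sharath0','name:fullname','Sharath Kumar' (plus the other cells)
            Put sharath = new Put(Bytes.toBytes("sharath0"));
            sharath.addColumn(Bytes.toBytes("name"), Bytes.toBytes("fullname"), Bytes.toBytes("Sharath Kumar"));
            sharath.addColumn(Bytes.toBytes("age"), Bytes.toBytes("presentage"), Bytes.toBytes("24"));
            sharath.addColumn(Bytes.toBytes("course"), Bytes.toBytes("pursuing"), Bytes.toBytes("Hadoop"));
            student.put(sharath);

            // equivalent of: get 'student','sharath0', reading back one cell
            Result result = student.get(new Get(Bytes.toBytes("sharath0")));
            String course = Bytes.toString(result.getValue(Bytes.toBytes("course"), Bytes.toBytes("pursuing")));
            System.out.println("course:pursuing = " + course);

            // equivalent of: scan 'student'
            try (ResultScanner scanner = student.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println("row key = " + Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```

Notice that both the put and the get address a single row directly by its key, which is exactly the random read and write access that plain HDFS cannot give you.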
So this is how you get the data; we will also get Sharath's data in the same way, and there you go: the present age of Sharath is 24, the course he is studying is Hadoop, and his complete name is Sharath Kumar. If you want only a particular piece of information, for example not his age or full name but just the course Sharath is studying in this institution, then you can give the get command with the table student, the row sharath, and the column family course; press Enter and you can see that we get the course family, and inside it the column pursuing with the value Hadoop. Similarly, if you just want the complete name of Sharath, you can use get with the table student, the row sharath, and the name family, and you get the column fullname with his complete name, Sharath Kumar. That is how you fetch data from a table. Now we shall see the complete details of the student table: for that you can use the scan command, which gives you the complete details of all the students present in this particular institute. Right now we have two students, Sharath and Shashank, and this command shows Sharath's age, the course he is studying, and his complete name, followed by Shashank, his age, the course he is studying, and his complete name, Shashank R. Next we shall count the number of rows in the student table; we have two rows right now, and indeed the count command returns two rows, so that is how you perform the count operation in HBase. You can also alter information in your table: I am going to alter the table student, and the column family I will be altering is age, and you can see the process completing. Next we shall alter the name of a particular student: I want to change Shashank's name to Shashank Rao, since the R in his name stands for Rao, so we will rename Shashank R as Shashank Rao. The process completes successfully, so let us scan the student table again and see whether the name has changed: earlier we had Shashank R, and now it has changed to Shashank Rao, so that is how you can alter data present in your table. Now let us try to delete one particular column from the table, the full name: we will delete the fullname column, so Shashank's full name will be removed and only the row for Shashank will remain, and the process completes. So this is how you execute some of the basic commands in HBase, and don't worry about this code, we will drop this file in the description box below so you can access it and practice the same commands on your own system. Next, what is Apache Oozie? Apache Oozie can be defined as a job scheduler system designed and deployed to manage and run Hadoop jobs in a distributed storage environment. It allows users to combine multiple complex jobs to be run in a sequential order to achieve a higher-order job, and the reason for using Oozie along with the Hadoop framework is its capability to bind and integrate itself tightly with Hadoop jobs such as Hive, Sqoop, Apache Pig, and many more.
Oozie is an open-source Java web application available under Apache License 2.0. It is responsible for triggering workflow actions, which in turn use the Hadoop execution engine to actually execute the task; hence Oozie is able to leverage the existing Hadoop machinery designed for tasks such as load balancing, system failover, and many more. Now that we have a brief understanding of Apache Oozie, let us move ahead and understand the major job types that Oozie can practically run. One reason for choosing Apache Oozie for Hadoop jobs is the way it tracks its jobs: Oozie is capable of detecting the completion of tasks through callback and polling. When Oozie starts a task, it provides a unique callback HTTP URL to the task and is notified at that URL when the task is completed; if the task fails to invoke the callback URL, Oozie can poll the task for completion. There are three main types of jobs in Oozie: Oozie workflow jobs, Oozie coordinator jobs, and Oozie bundles. First, the workflow jobs: these are directed acyclic graphs, or DAGs, which specify a sequence of actions to be executed. Next, the coordinator jobs consist of workflow jobs triggered by time and data availability. And lastly, the bundles, which can be referred to as a package of multiple coordinator and workflow jobs. Now let us understand each of these jobs in a little more detail, one by one, starting with the Oozie workflow. A sample workflow with controls such as start, decision, fork, join, and end, and actions like Apache Hive, shell, and Apache Pig, will look like the following diagram. An Oozie workflow is nothing but a sequence of actions that will be carried out, represented in the form of a DAG, or directed acyclic graph. The actions are carried out in a sequence, which means the output of the previous action becomes the input for the present action, and the output of the present action becomes the input for the next action. In a flow of sequential tasks, some tasks can also be performed in parallel: to execute tasks in parallel we use the fork option in Oozie, and the join option is used to merge two parallel tasks back into one. Let us discuss the diagram shown below. The starting phase is start and the last phase is end, and in between we have a MapReduce job, a Pig job, and fork and join; so out of the available job types such as Hive, shell, Pig, and MapReduce, we are using a MapReduce job and a Pig job here. After starting, the workflow enters the MapReduce job first; once the MapReduce job executes, an output is generated, which acts as the input for the upcoming Pig job, and inside the Pig job the tasks get executed and the output is forwarded to the fork. Here two other jobs exist, a MapReduce job and a Hive job, and as said earlier, fork is used to execute two different tasks in parallel to save time, so the output from the Pig job is forwarded to both of them using fork. After that, the outputs generated by the MapReduce job and the Hive job are merged together using the join, the final output is produced, and that is the end of the execution of the Oozie workflow. So this is how the Oozie workflow works in real time, and the different components used in this particular job are start, a MapReduce job, a Pig job, fork, join, a Hive job, and finally end.
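Once a workflow.xml describing such a start, fork, join, and end graph has been placed in HDFS, it can be submitted either from the Oozie command line or through the Oozie Java client. Below is a minimal, hedged sketch using the OozieClient API; the server URL, the HDFS application path, and the nameNode and jobTracker values are placeholders you would replace with your own cluster settings.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // the Oozie server URL; 11000 is the usual default port
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // job configuration, equivalent to a job.properties file
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/cloudera/demo-workflow"); // directory holding workflow.xml
        conf.setProperty("nameNode", "hdfs://localhost:8020"); // placeholder
        conf.setProperty("jobTracker", "localhost:8032");      // placeholder (YARN ResourceManager address)

        // submit and start the workflow job, then check its status once
        String jobId = client.run(conf);
        System.out.println("submitted workflow job " + jobId);

        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("current status: " + job.getStatus());
    }
}
```

The client only hands the DAG over to the Oozie server; the server itself then triggers each MapReduce, Pig, or Hive action and follows the fork and join edges exactly as described above.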
Next are the nodes in the Oozie workflow. There are mainly three control flow nodes in an Oozie workflow: the start, end, and kill nodes. As you can see in the diagram above, we have the start node in the first position and the end node in the last position, and in between we have a MapReduce program based on word count. If this MapReduce job encounters an error, the job is terminated using the kill node, and if the MapReduce job is successful, control flows to the next node, which is the end node. Next we deal with the coordinators. You can schedule complex workflows, as well as workflows that run on a regular schedule, using a coordinator: Oozie coordinators trigger the workflow jobs based on time and data availability, and the workflows inside the coordinator job start when the given condition is satisfied. The definitions required for coordinator jobs are start, end, time zone, frequency, and a few more properties available as control information. First, start gives the start date and time for the particular job, and end defines the end date and time for that job. After that we have the time zone, which represents the time zone of the coordinator application, based on the location where the program is being executed. Then we have the frequency, which gives the frequency, in minutes, at which the job is executed. Apart from these, we have a few more properties, such as timeout, which is the maximum time, in minutes, for which an action will wait to satisfy the additional conditions before getting discarded: a value of 0 indicates that if all the input dependencies are not satisfied at the time of action materialization, the action should time out immediately, while -1 indicates no timeout, meaning the action will wait forever, and -1 is the default. Next is concurrency, the maximum number of actions for a job that can run in parallel, whose default value is 1.
Next we have the execution property, which specifies the execution order if multiple instances of the coordinator job satisfy their execution criteria: it can be FIFO, first in first out, which is the default, LIFO, last in first out, or LAST_ONLY. Finally, the command for submitting a coordinator job is shown in the slide below. One thing to note: if a configuration property used in the definition is not provided in the job configuration while submitting the coordinator job, the job submission will fail. With this we finish the coordinator jobs, and after them we have the Oozie bundle. The Oozie bundle system allows you to define and execute a set of coordinator applications, often called a data pipeline. In an Oozie bundle there is no explicit dependency among the coordinator applications; however, you can use the data dependencies of the coordinator applications to create an implicit data application pipeline. You can start, stop, suspend, resume, and rerun the bundle, which gives better and easier operational control. The most important term to understand about the Oozie bundle is its kick-off time: the kick-off time is the time when the bundle should start and submit its coordinator applications. Advancing in this Apache Oozie tutorial, we shall understand how to install Apache Oozie. There are multiple ways to install it: in one approach you need to install Hadoop, then download Oozie, download the ExtJS library for the Oozie web console, set up Maven, set up a MySQL database, and after that configure all the settings, which seems very complex, right? No worries, in this tutorial we shall learn the simplest way of installing Oozie, and later we can try the more complex ways. First, all we need is Oracle VirtualBox, so just Google Oracle VirtualBox for Windows and you will be redirected to a new web page where you can choose the Oracle VM VirtualBox downloads page, which is the first link; go ahead and click the latest version for Windows, and by clicking the Windows installer button the basic Oracle VirtualBox package gets downloaded to your local system. You can see the VirtualBox installer getting downloaded. After that, open a new tab and Google Cloudera download for VirtualBox; that redirects you to a new web page, and the first link is download QuickStart for CDH 5.13. Clicking on this link takes you to a page where you need to select a platform, so here we shall select VirtualBox and click Get It Now. Here you need to provide your details: why you are using the product, whether for learning or anything else, then your first name, last name, business email ID, the company you work for (which you can ignore if you are a student), your job title, your official phone number, and then accept the data privacy policies; after that press Continue and the VirtualBox image will be downloaded. I have already downloaded the QuickStart CDH 5.13 image on my local system, so I will skip this step. Now that we have downloaded the Cloudera CDH image and the VirtualBox that will run it, let us install them on the local system: you can see that the Oracle VM VirtualBox setup has started, so select Next, and then select Next again.
Here you are provided with different options, such as whether to create a shortcut on your desktop or not; I would prefer not to, so just select Next, then Yes, and the last button will be Install. I have already installed Oracle VirtualBox on my local system, so to save time we shall cancel this, but you have to select the Install button to install VirtualBox on your system. Once VirtualBox is installed, it will look something like this. Here you need to select the Import option, and you will get a new dialog box where you select the Browse button; this takes you to your local file system, where you have to know the location where your Cloudera QuickStart image was downloaded. I have saved my Cloudera QuickStart file in this folder, so I select it and click Open, and to be on the safer side, in the configuration we shall set the RAM size above 8 GB; I will specify 9000 MB, which is just above 8 GB, and then click Import. You can see that importing the appliance has started, and soon the Cloudera QuickStart VM has been successfully imported into VirtualBox. Now let us start it by double-clicking it, or you can just select Start; you can see that the QuickStart VM has started and Cloudera is getting booted up, and there you go, the Cloudera QuickStart VM has booted successfully. The best part about Cloudera is that it has everything you need: Oozie, Hive, Hue, Spark, HBase, Impala, everything a beginner needs to start with. What we need is Oozie, so we need to start Hue and Oozie: let us open Oozie in a new tab and also start Hue. This is how the Hue editor looks, and this is the web console of Oozie; here you have the workflow jobs, the coordinator jobs which I mentioned in the earlier explanation, the bundle jobs, the SLA tab, system info, instrumentation, metrics, and all the additional settings you may require. Now let us look at the editors present in Hue: to find them you select the Query button, and inside the drop-down menu you select Editor, and you can see the various editors present in Hue, such as Impala, Hive, Pig, Java, Spark, MapReduce, shell, Sqoop, and many more. Now that we have understood how to install Oozie on a local system, the editors present in Hue, and the Oozie web console, we shall advance and understand the advantages of Oozie. The first advantage is that Oozie is scalable and reliable for monitoring jobs in a Hadoop cluster; it supports various jobs in the Hadoop ecosystem, like MapReduce, Pig, Hive, streaming, and also Java-based applications, and it has an extensible architecture that supports new programming paradigms. The next advantage is complex workflow action dependencies: an Oozie workflow comprises actions and the dependencies among them. After that we have reduced time to market: the directed acyclic graph specification enables users to specify the workflow easily, and this saves a lot of time to market. Then we have frequency execution: users can specify the execution frequency and can wait for data arrivals to trigger an action in the workflow.
We also have native Hadoop stack integration: Oozie supports all types of jobs, such as Spark, Hive, Pig, and many more, it is validated against the Hadoop stack, and it is integrated with the Yahoo! distribution of Hadoop with security, making it a primary mechanism to manage a variety of complex data analysis workloads. Now let us compare Apache Spark with Hadoop on different parameters to understand their strengths; we will be comparing the two frameworks based on the following parameters. Let us start with performance. Spark is fast because it has in-memory processing; it can also use disk for data that does not fit into memory. Spark's in-memory processing delivers near real-time analytics, and this makes Spark suitable for credit card processing systems, machine learning, security analytics, and processing data from IoT sensors. Now let us talk about Hadoop's performance. Hadoop was originally designed to continuously gather data from multiple sources, without worrying about the type of data, and to store it across a distributed environment, and MapReduce uses batch processing; MapReduce was never built for real-time processing, and the main idea behind YARN is parallel processing over a distributed data set. The problem with comparing the two is that they have different ways of processing, and the ideas behind their development are also divergent. Next, ease of use. Spark comes with user-friendly APIs for Scala, Java, Python, and Spark SQL; Spark SQL is very similar to SQL, so it becomes easier for SQL developers to learn it, and Spark also provides an interactive shell where developers can run queries, perform other actions, and get immediate feedback. Now let us talk about Hadoop: you can ingest data into Hadoop easily, either by using the shell or by integrating it with multiple tools like Sqoop and Flume, and YARN is just a processing framework that can be integrated with multiple tools like Hive and Pig for analytics; Hive is a data warehousing component which performs reading, writing, and managing of large data sets in a distributed environment using a SQL-like interface. To conclude, both of them have their own ways of making themselves user friendly. Now let us come to the costs. Hadoop and Spark are both Apache open-source projects, so there is no cost for the software; cost is only associated with the infrastructure, and both products are designed in such a way that they can run on commodity hardware with a low TCO, or total cost of ownership. Now you might be wondering about the ways in which they differ. Storage and processing in Hadoop is disk based, and Hadoop uses standard amounts of memory, so with Hadoop we need a lot of disk space as well as faster transfer speeds, and Hadoop also requires multiple systems to distribute the disk I/O. In the case of Apache Spark, due to its in-memory processing it requires a lot of memory, but it can deal with a standard speed and amount of disk; disk space is a relatively inexpensive commodity, and since Spark does not use disk I/O for processing, it instead requires large amounts of RAM to execute everything in memory, so Spark systems incur more cost. But one important thing to keep in mind is that Spark's technology reduces the number of required systems: it needs significantly fewer systems that cost more, so there will be a point at which Spark reduces the cost per unit of computation, even with the additional RAM requirement. There are two types of data processing, batch processing and stream processing. Batch processing has been crucial to the big data world; in the simplest terms, batch processing is working with high data volumes collected over a period of time.
In batch processing, data is first collected, then processed, and then the results are produced at a later stage; batch processing is an efficient way of processing large, static data sets, and generally we perform batch processing on archived data sets, for example calculating the average income of a country or evaluating the change in e-commerce over the last decade. Now, stream processing: stream processing is the current trend in the big data world; the need of the hour is speed and real-time information, which is exactly what stream processing delivers. Batch processing does not allow businesses to react quickly to changing business needs in real time, so stream processing has seen a rapid growth in demand. Coming back to Apache Spark versus Hadoop, YARN is basically a batch processing framework: when we submit a job to YARN, it reads data from the cluster, performs the operation, and writes the results back to the cluster, and then it again reads the updated data, performs the next operation, writes the results back to the cluster, and so on. On the other hand, Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming as well. Now let us come to fault tolerance. Hadoop and Spark both provide fault tolerance, but they have different approaches. For HDFS and YARN, the master daemons, that is the NameNode in HDFS and the ResourceManager in YARN, check the heartbeats of the slave daemons, which are the DataNodes and NodeManagers; if any slave daemon fails, the master daemon reschedules all pending and in-progress operations to another slave. This method is effective, but it can significantly increase the completion time of operations even with a single failure, and since Hadoop uses commodity hardware, another way in which HDFS ensures fault tolerance is by replicating data. Now let us talk about Spark. As we discussed earlier, RDDs, or resilient distributed data sets, are the building blocks of Apache Spark, and RDDs are what provide fault tolerance to Spark. They can refer to any data set present in an external storage system like HDFS, HBase, a shared file system, and so on, and they can be operated on in parallel. RDDs can persist a data set in memory across operations, which makes future actions up to ten times faster, and if an RDD is lost, it will automatically be recomputed by using the original transformations; this is how Spark provides fault tolerance. And at the end, let us talk about security. Hadoop has multiple ways of providing security: it supports Kerberos for authentication, though it is difficult to handle; nevertheless, it also supports third-party providers like LDAP for authentication, and it offers encryption. HDFS supports traditional file permissions as well as access control lists, and Hadoop provides service-level authorization, which guarantees that clients have the right permissions for job submission. Spark currently supports authentication via a shared secret; Spark can integrate with HDFS and use HDFS ACLs, or access control lists, and file-level permissions, and Spark can also run on YARN, leveraging the capabilities of Kerberos. So this was the comparison of the two frameworks based on these parameters.
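To see what the in-memory, RDD-based model looks like in practice, here is a small word count sketch written against Spark's Java API; the file paths and the local master URL are placeholders. The cache() call is the part that illustrates the point made above: the counts RDD is kept in memory for reuse, and any lost partition is recomputed from its lineage of transformations rather than re-read from disk.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]"); // placeholder master
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///user/cloudera/input.txt"); // placeholder path

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // tokenize each line
                .mapToPair(word -> new Tuple2<>(word, 1))                      // emit (word, 1)
                .reduceByKey(Integer::sum);                                    // sum the ones per word

        counts.cache();                                                        // keep the RDD in memory for reuse
        System.out.println("distinct words: " + counts.count());              // first action materializes the RDD
        counts.saveAsTextFile("hdfs:///user/cloudera/wordcount-output");       // second action reuses the cached data

        sc.stop();
    }
}
```

This is the same word-count job that a MapReduce program expresses with a mapper and a reducer, just written as a chain of transformations on an RDD that lives in memory between the two actions.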
Now let us understand the use cases where each of these technologies fits best. Hadoop fits best, for example, when you are analyzing archive data: YARN allows parallel processing over huge amounts of data, where parts of the data are processed in parallel and separately on different DataNodes and the results are gathered from each NodeManager. So in cases where instant results are not required, Hadoop MapReduce is a good and economical solution for batch processing; however, it is incapable of processing data in real time. Spark fits best in real-time big data analysis: real-time data analysis means processing data that is generated by real-time event streams coming in at the rate of millions of events per second. The strength of Spark lies in its ability to support streaming of data along with distributed processing, and Spark claims to process data 100 times faster than MapReduce in memory, and 10 times faster when using disk. Spark is also used in graph processing: it contains a graph computation library called GraphX, and in-memory computation along with inbuilt graph support improves the performance of the algorithms by a magnitude of one or two degrees over traditional MapReduce programs. It is also used for iterative machine learning algorithms: almost all machine learning algorithms work iteratively, and as we have seen earlier, iterative algorithms involve input/output bottlenecks in their MapReduce implementations; MapReduce uses coarse-grained tasks that are too heavy for iterative algorithms, whereas Spark caches the intermediate data set after each iteration and runs multiple iterations on the cached data set, which eventually reduces the input/output overhead and executes the algorithm faster in a fault-tolerant manner. So, at the end, which one is the best? The answer is that Hadoop and Apache Spark are not competing with one another; in fact, they complement each other quite well. Hadoop brings huge data sets under control using commodity systems, and Spark provides real-time, in-memory processing for those data sets. When we combine Apache Spark's abilities, that is, its high processing speed, advanced analytics, and multiple integration support, with Hadoop's low-cost operation on commodity hardware, it gives the best results. Hadoop complements Apache Spark's capabilities; Spark cannot completely replace Hadoop, but the good news is that the demand for Spark is currently at an all-time high. Now, this project is based on the e-commerce domain, so let me give you an introduction to it in the context of one of the biggest names among e-commerce platforms, none other than Amazon. If you have ever shopped on Amazon before, which I presume you must have, you will have seen something like this when you click on a product: you view the details of the product. As you know, most e-commerce organizations do not hold any inventory; they tie up with different merchants, and it is the same with Amazon. Amazon provides the merchants, or sellers, a platform to get connected to the buyers, so when you click on the details of a product, you also find something like this: there are 28 offers starting from this price, and if you click on it you can see the names of the different sellers who are selling the same product at different prices, with the prices they are offering listed. But you can see over here that, by default, Amazon has selected Appario Retail Private Limited for this particular product. So how does Amazon do that? It is actually based on a merchant rating system, and as an e-commerce platform you have to ensure that you always display the product from the best merchant in order to ensure quality, because you do not want angry customers, right? So it is very important that your customers are satisfied with their product.
So you have to choose from the different merchants selling the same product in order to decide which merchant's product gets displayed by default, and hence Amazon has a merchant rating system to decide that, and this is exactly what we are going to build: a merchant rating system similar to this one. Here is the problem statement. There are multiple merchants selling the same type of products; as you can see, merchant 1 and merchant 6 are selling the same shirt, merchants 3 and 4 are selling the same shoes, and similarly merchants 5 and 1 are selling the same pants, and there are multiple other merchants selling the same kinds of products. The company wants to build a merchant rating system in order to determine which merchant sells the best product, so that that merchant's product is displayed by default, and let us assume you were hired by the company as a big data expert and assigned this task, so this is now your problem to solve. The first thing your organization will give you before you start your work is the data set, and this is the data set you are going to get. The transaction data set has fields like the transaction ID, customer ID, merchant ID, the timestamp when the purchase was made, the invoice number, the amount, and the segment of the product that was bought. You have another data set, the merchant data, which contains the details about the different sellers or merchants: the merchant ID, the tax registration number, the merchant name, their mobile number, start date, email address, state, country, pin code, description, longitude, and latitude, basically the location, so these are all the details about the merchants that you have. Let me show you the data set: it is sitting in HDFS right now, with the transaction data being a 2 GB file and the merchant data set being 20 MB, because, as you know, there are many transactions but only a limited number of sellers or merchants, which is why the merchant data set is quite small compared to the transaction one. And to tell you, we have not used the entire data present in the original data set; we selected only a subset of about two gigabytes, because the original data set was quite huge and this is a demo project. This is how it looks: this is the CSV file of the transactions data. So here is the approach to solve the problem. The first thing we will do is segregate the merchants based on the price of their products and their sales, into four categories: merchants selling products below five thousand dollars, merchants selling products between five thousand and ten thousand dollars, another category for merchants selling products between ten thousand and twenty thousand dollars, and another category for products above twenty thousand dollars.
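To make the categorization concrete, here is a small, hedged Java sketch that parses one transaction record and assigns it to one of these four price buckets. The comma delimiter and the field order (transaction ID, customer ID, merchant ID, timestamp, invoice number, amount, segment) follow the description of the data set above, but treat the exact positions, the sample row, and the category labels as illustrative assumptions rather than the project's real schema.

```java
public class Transaction {
    String transactionId, customerId, merchantId, timestamp, invoiceNumber, segment;
    double amount;

    // assumed layout: txnId,customerId,merchantId,timestamp,invoiceNo,amount,segment
    static Transaction parse(String csvLine) {
        String[] f = csvLine.split(",");
        Transaction t = new Transaction();
        t.transactionId = f[0];
        t.customerId    = f[1];
        t.merchantId    = f[2];
        t.timestamp     = f[3];
        t.invoiceNumber = f[4];
        t.amount        = Double.parseDouble(f[5]);
        t.segment       = f[6];
        return t;
    }

    // the four price buckets described above (labels are illustrative)
    static String priceCategory(double amount) {
        if (amount < 5000)  return "below-5000";
        if (amount < 10000) return "5000-10000";
        if (amount < 20000) return "10000-20000";
        return "above-20000";
    }

    public static void main(String[] args) {
        // a made-up sample row, just to show the parse-and-bucket flow
        Transaction t = parse("T1001,C55,M07,2023-01-15 10:42:00,INV-889,12500.0,electronics");
        System.out.println(t.merchantId + " -> " + priceCategory(t.amount));
    }
}
```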
We will be using a simple logic to approach this problem. Let us say there is a merchant who is selling his products at quite a low price, and you see that he is still not making a good number of sales; it means he is not selling quality products, because if a merchant sells at quite a low price and people still are not buying from him, it clearly indicates that his products are not up to the mark, so that merchant gets a low rating. On the other hand, if you see a merchant who is selling his products at quite a high price and yet has a very good number of sales, it clearly indicates that his products must be very good, because despite the higher price people are still buying from him, and in that case the rating of the merchant would also be very good. So this is the simple logic that we are going to use to rate our merchants or sellers. You have three options to choose from: Apache Hive, which is a great analytical tool, Hadoop MapReduce, and Apache Pig, and today we will be choosing Hadoop MapReduce. MapReduce is the core component of Hadoop that processes huge amounts of data in parallel by dividing the work into a set of independent tasks; MapReduce is the data processing layer of Hadoop, a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in HDFS, the Hadoop Distributed File System. MapReduce works by breaking the data processing into two phases, the map phase and the reduce phase, and that is exactly how it gets its name. Map is the first phase of processing, where we specify all the complex logic and business rules, and reduce is the second phase, where we specify lightweight processing like aggregation or summing up of the outputs. But the question is, why choose MapReduce? Well, I will give you two reasons. The first is the custom input format: an input format is something that defines how your input files are going to be split and read, and in MapReduce you can create your own custom input format instead of using the default one. This actually makes handling your data much easier, because here we can create a custom input format for the transactions and pass it as an argument. Then we have the distributed cache: the distributed cache is nothing but a facility provided by the MapReduce framework to cache files, files like text files, archives, jars, and so on, that are needed by your application. Let us understand this with an analogy: think of three students sitting at a table solving chemistry problems, and they have one periodic table, so you keep the periodic table in the middle of the table and all three students can refer to it to see the atomic numbers of the different elements and solve their own problems. One periodic table, everyone can refer to it, and everyone solves their own problems; this is what the distributed cache is. With the distributed cache you can put the data that will be referred to by your different data nodes while running the MapReduce jobs, and we will be learning more about how to create your custom input format and how to use the distributed cache in the demo part. Before that, let us just understand how MapReduce exactly works, with a sample MapReduce job execution as an example.
This is your input file, a text input file containing some words. First, the input will get divided into three splits, taking one sentence at a time, so we have three splits here, and this distributes the work among the different map nodes. Then we tokenize the words in each of the mappers and give the value 1 to each of the tokens, or words: deer, 1; bear, 1; river, 1; and similarly here, car, 1; car, 1; river, 1. Now a list of key-value pairs is created, where the key is nothing but the individual word and the value is 1, so this is the key and this is the value, and this happens on all three nodes; the mapping process remains the same on all of them. After the mapper phase, a partitioning process takes place, where the sorting and shuffling happen, so that all the tuples with the same key are sent to the corresponding reducer: all the bear tuples are together, the cars are together, and the deer and river tuples are together. After the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key, for example bear, [1, 1], car, [1, 1, 1], and so on. Now comes the reducing phase: each reducer counts the values present in its list, so the reducer gets the list [1, 1] for the key bear, counts the number of ones in the list, and gives the final output as bear, 2; similarly, for car it counts 3, for deer it counts 2, and for river it counts two ones, so 2. Finally, all the output key-value pairs are collected and written to the output files, so your output file simply combines the results from the different reducers, and here is your final output: we have understood MapReduce with the classic example of the word count program.
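Here is what that walkthrough looks like as an actual Hadoop program: a minimal word count with a mapper, a reducer (also reused as the combiner), and a driver. The input and output paths are passed on the command line, and this is the stock textbook example rather than the merchant-rating code we build later.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map phase: emit (word, 1) for every token in the line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce phase: sum the ones for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation, the "mini reducer"
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```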
same key value goes to the same partition and that each partition is sent to a reducer so by using partitioner it allows to have an even distribution of the map output over the reducer so after that comes the shuffling and sorting part so now the output is shuffled to the reduce node the shuffling is the physical movement of the data which is done over the network once all the mappers are finished and their output is shuffled on the reducer nodes then this intermediate output is merged and sorted which is then provided as an input to the reduce phase now comes the reducers it takes the set of intermediate key value pairs produced by all the mappers as the input and then runs a reducer function on each of them to generate the output the output of the reducer is the final output which is stored in hdfs so if you have multiple reducers the result from different reducers will combine and that is going to be your final output which will be written into the hdfs so this was the mapreduce job execution flow so we'll also be using the distributed cache so you'll have different data nodes each data node will have their local copy of their data and if each of the data nodes needs to refer something we will keep that in the distributed cache and in this case we'll be keeping the merchants file all right so distributed cache is nothing but think of it as a share drive right so if you have multiple users who wants to have access to one data set so you can just put it up in the shared drive and all of your users can share the data use the same data right so this is what distributed cache is so applications specify the files via URLs to Cache via jobcon so I'll be telling you about the job conf later in the demo section and the distributor cache assumes that the file specified via URLs are already present on the file system add the path specified by the URL and are accessible by every machine in the cluster so the framework will copy necessary files to the slave node before any jobs are executed on that node and distributed cache tracks modification timestamps of the cache files so clearly the cache files should not be modified by the applications or externally while the job is executing so how will it works in our case we'll be storing the data into hdfs and we'll be executing map reduce program over that file so we'll store the merchant data in the distributed cache then we'll segregate the transaction data into categories such as less than five thousand five thousand to ten thousand dollars ten thousand to twenty thousand and greater than twenty thousand dollars with the merchant ID then we'll use the merchant file from the distributed cache and map the merchant ID with the merchant name and at last we'll receive the output as the merchant name with date indicating the number of sales in different categories now let us talk about the code sections so the execution will start from the main method where we'll use the tool Runner so the tool Runner can be used to run classes implementing tool interface so it pause the generic Hadoop command line arguments and modifies the configuration of the two then it will point to the run method which will point to the Run Mr jobs so here we are specifying the driver code so I'll be telling you about the driver code later on so next the execution will move to the mapper class which is the transaction mapper the framework first calls the setup method followed by the map method for each key value pair in the input split so in setup method we are loading the cache file 
and calling a method where we'll be resolving the merchant name from the merchant ID next in the map method we are creating the object of transaction which we will be using to catch all the fields of transaction first and then using the object of aggregate data we will create this segment of transactions as we discussed before the four segments less than five thousand ten thousand twenty thousand those segments at last using the merchant ID name map we will resolve or find out the merchant name the output of the map method would be the key which will be the combination of merchant name and date of sale while the value would be in the form of number of sales of different categories next the execution will go to the partitioner code where we'll have the get partition method which will send the records with the same key to the same reducer and it allows the reducer code will execute which will aggregate the data with the same can provide the output so I'll be showing you and explaining you all the codes involved over here in the demo part all right now let's move ahead so first let me take you through this transaction class where we are defining all the variables corresponding to the transaction file as you can see we have got the transaction ID customer ID merchant ID time stamp invoice number invoice amount and segment here so these are the fields in the transactions and we have created the variable for the same so next we are creating the getter and Setter methods for each of the variables so as to read the value of the field and we write the value of the field so as you can see here we are defining the method get so get segment we have got get segment here where we are returning the value of the field and the set method here the set segment where we're writing the value of this field so similarly we're doing it for all the variables as you can see here so we have the get and set for customer ID so we have the get customer ID method which Returns the value of the field and we have got the set customer ID which writes the value of the field so this is similar for all the fields in our transaction data so similarly you can have the aggregate data class which we have used to create the categories of the product so here we have Fields like order below 5000 or they're below 10 000 or they're below 20 000 order above 20 000 which are nothing but the different categories which we have defined earlier and similar to the transaction class here we are defining the getter and Setter method so we have got get total order method which Returns the value of total order and we have set total order method which is writing the value of this field total order so we have the same for all the different variables that we have defined in the aggregate data class the getter and Setter methods then we've got the aggregate writable class so first here we are creating an object of Json class so Json is basically used to convert Java objects to Json format next we're initializing the aggregate data object now we have two Constructors first one is the basic Constructor which is not taking any argument the second Constructor is taking aggregate data format as an argument and trying to initialize the aggregate data object next we have the getter method for the aggregate data which will return the aggregate data object after that we have the right method which will write the value of corresponding Fields using the getter method of the field like for order get order below 5000 get order below 10 000 order below 20 000 get 
order above 20 000 then you have read Fields method which will be calling the setter method of each field to assign the values to the corresponding field and at last we are overriding to string method which will convert the aggregate data object to Json and then return the Json so I hope you guys are clear with the custom input format so now let us take a look at the main Java file which is the merchant Analytics job.java so the main class is the merchant analytics job class inside which all the jobs will reside so the execution will start from the main method so first let us go to the main method so here we are using tool Runner so tool Runner can be used to run classes implementing the tool interface it passes the generic Hadoop command line arguments and modifies the configuration of the tool so toolrunner dot run method runs the given tool after parsing the given generic arguments it uses the given configuration or builds one if not it sets the tools configuration with the possibly modified version of the conf here we are passing the configuration object Merchant analytics job object which is the main class and arguments which we will be providing while executing the job so in our case there are three arguments first is the path of the transaction file second is the path of the merchant file and third is the output directory now we'll execute the run method where we are returning the values of the Run Mr jobs method we're also parsing the arguments that is all the three parts that is the transaction Merchant and output directory to the Run Mr jobs method now let us see the Run Mr jobs method so here we have the driver code so first we initialize the configuration object and then we will initialize the control job object so Control job class encapsulates a mapreduce job and its dependency it monitors the state of the defending jobs and updates the state of this job and now we are creating the object of job class and we will define the properties of the job so first we have the set output key Class Property where we are defining the output format class of the key which is text class similarly we're defining the set output value class for output format class of the value that is the aggregate writable class next we have set jar by class which tells the class where all the mapper and reducer code resides which the merchant analytics job in our case now we are specifying the reducer class which is Merchant order reducer and next we are providing the input directory so the set input the recursive method will read all the files from the directories recursively if we are providing the directory path so first we're adding the input path of the transaction file which is present in the argument zero then here we are also specifying the input format of the file and the mapper class that is the transaction mapper next we're talking about all the merchant file from the directory provided in argument one and adding this file to the distributed cache using the job.ad cache archive method and moving ahead we are setting the output directory path which is provided in argument 2 and we are also adding the timestamp as the subdirectory and at last we are setting the partitioner class that is the merchant partitioner and then we are returning 0 or 1 depending on whether the job has been executed successfully or not and next we will take a look at the transaction mapper class which implements the mapper interface so it Maps input key value pairs to a set of intermediate key value pairs so maps are the 
individual tasks which transform input records into an intermediate record the transformed intermediate records need not to be of the same type as the input records the Hadoop mapreduce framework spawns one map task for each input split generated by the input format for the job and mapper implementations can access the jobcon for the job via the job configurable and initialize themselves the framework first calls the setup method followed by map method for each key value pair in the input split so in the setup method we are loading the merchant file from the cache using the get cache archives method then from each file we are calling the load merchant ID name in Cache so we're calling this method and we are passing the path of the cache files and the configuration objects now in this load merchant ID name in Cache method we are initializing the object of file system using the conf and next we're opening the file and then we are reading the data line by line from the file now here first we're removing the codes from the line and then we're splitting the line using the comma and at last we're putting the merchant ID and Merchant name in the merchant ID name map variable so here you can see in the merchant file that we have merchant ID at index 0 and Merchant name at index 2. so this merchant ID name map will help us in resolving the merchant name from merchant ID and next we are defining exception to notify us if the cache file is not read and at last we're closing the object of the buffered reader now let's talk about the map function so now the map function will be called so the input format of key is long writable and the value is text we're also creating a context of the mapper framework where we will be writing our intermediate output now the output format of the key is text and the value is aggregate writable again here we are removing the codes from the line and then we're splitting the line using comma so in the split area we have all the fields of the transaction data stored in the consecutive indexes now we are creating an object of transaction class and setting the values of the field using Setter methods and next we are creating the objects of aggregate data and aggregate writable class then using transactions get invoice amount field we are deciding that in which aggregate data field the transaction would lie we will set the value of corresponding field of the aggregate data object to 1. and next we have the output key which will contain the merchant name and the date of sale we will set the value of corresponding field of that aggregate data object to one and next we have the output key which will contain the merchant name and the date of the sale so merchant ID name map method will return the merchant name from the merchant ID as we just discussed so we are passing the values as merchant ID and the date at last we'll pass the intermediate result in form of key and value to the context next the result will be sent to the partitioner class that is the merchant partitioner so it's over here so in this class we are overriding the default get partition method and in this method we are converting the key using hash function and using ABS method to return the absolute value of a number and at last we're using the modular function to get the remainder and now we're dividing it with the number of partitions so which is nothing but the number of reducers and in our case we have specified five reducers so the modulo 5 would return the value between 0 and 4. 
and one more thing is the same key would always have the same hash generated and hence the modular result would be also the same and thus the records with the same key will be sent to the same reducer and based on this records are sent to the reducer so next we have the reducer code and it's over here so it's the merchant order reducer the reducer Clause as defined in the driver code resides in the merchant order reducer class so here we have the key input as text and value input as aggregate writable which was written by our mapper class and the output key format is again text and the output value format is aggregate writable so here our execution will move to reduce method where we are passing the input key value and context as argument and here we are again creating the objects of aggregate data and aggregate writable class and next we are taking the input values now here we are calling the setter function of each category getting the earlier value of that category and then adding the new value of the new aggregate data object for that category so it will add the value to the corresponding Fields if there is a record with the same key and at last we are writing the key and value in context dot write method so I have explained you the code so now let us just go ahead and execute it so first let us move to the project directory so we have the palm.xml file which has all the dependencies that we require in order to run our mapreduce job so this is the command so it has my jar file and the path of my jar file then I have got my main class over here which is Merchant analytics job so this is where my main function is and then I'm passing the three parts so first is my transactions dot CSV this is my data set is the path of my data set then my Merchant data this is the path of my Merchant data and finally my output directory which is the result this is the path over here so let us just go ahead and execute this command so the code is run so here are the different parameters on which this mapreduce job was run so you can see the details over here so you can see the number of reduced tasks where five since we had five reducers so you can see all the details here let me just show you the result now so it is in the results directory so there are two directories over here because this was the earlier one that when I had previously executed it so this is the one that we have got right now so let me just show it to you so we have got five part files because there are five reducers so I'm just clicking on one part and you can just click on download all right let me just open it so this is what you get so you get the merchant name and the timestamp over here and also the category or the segregation that we did based on the cost of the orders right so it was order above twenty thousand dollars at this time stamp and the total order was one so this is the format of the result so we have got a lot of rows so this is the result so we have just used a few Fields or parameters from the merchant file we have just used the merchant name and the ID over here since this is a sample project sample demo project but the scope of this particular project is huge you can use a lot of other parameters that was mentioned there like the location you can analyze it based on locations based on the time period where the order was placed so you can take in account different fields and improve this or make this analysis even better by yourself so when you're doing this project as a part of your course curriculum so you will 
be exploring the other fields as well I have just shown you with just using two Fields the merch name and the ID [Music] what is Kafka in general Kafka is a producer to the consumer based messaging system that has a producer that produces the message and the consumer that consumes the message in between the both we have Brokers that distribute the messages to the consumers and data storage unit which is none other than Apache zookeeper to understand more about Apache zookeeper and Kafka you can go through the article Link in the description box below Apache Kafka so basically Apache Kafka is an open source messaging tool developed by LinkedIn to provide low latency and high throughput platform for the real-time data feed it is developed using Scala and Java programming languages so followed by the definition of Kafka we shall enter and understand what exactly is a stream in general a stream can be defined as an unbounded and continuous flow of data packets in real time data packets are generated in the form of key value Pairs and these are automatically transferred from the publisher there is no need to place a request for the same the below image depicts the key value pests that are involved in data stream each and every single key value pair is one single unit of data or it is also called as one single unit of a record so followed by the stream we shall understand what exactly is a Kafka strain Kafka stream is an API that integrates Kafka cluster to the data processing applications which are either written in Java or Scala this API leverages the data processing capabilities of Kafka and increases data parallelism Apache Kafka stream can be defined as an open source client library that is used for building applications and microservices here the input and the output data is stored in the form of Kafka clusters it integrates the intelligibility of Designing and deploying standard applications using the programming languages such as Scala and Java with the benefits of Kafka server side cluster technology so this was the basic definition of Kafka stream now let us understand Kafka stream API in a much better way through its architecture Apache Kafka streams internally use the producer and consumer libraries it is basically coupled with Kafka and the API allows you to leverage the capabilities of Kafka by achieving data parallelism fault tolerance and many other powerful features the following image depicts the basic architecture of Kafka stream here you can see the Kafka cluster which has the input streams and the output streams together followed by that we have Kafka streaming application or the API which takes care of the queries which are received from numerous applications which are connected to Kafka streaming application followed by this we have numerous components present in Kafka stream architecture which are as follows they are input string output stream instance we have two instances here which are stream instance 1 and stream instance 2. 
so inside every instance we have consumers as well as local state and stream topology so in Kafka stream API input stream and output stream can be one single Kafka cluster followed by that we have a consumer which provides the input and receives the output and inside the stream instance we have stream topology and local state we shall understand about stream topology in a much detailed way in the next slide stream topology is all about the directed acyclic graph or the steps in which the particular process is executed followed by that we have a local state local state is none other than a memory allocation which stores the intermediate data or the result provided by the stream topology these results are produced after applying various Transformations such as map flat map Etc so after the data is processed the tasks are united together and sent back to Output strain so this is how the architecture of Kafka stream API works now let us understand more about stream topology so this particular diagram explains the stream topology here you can see the stream processor all the dots which are provided here are none other than stream processors and the line which is connecting them is the Stream the stream is none other than the K value pairs of the data or records so basically the input is read from Kafka cluster first followed by that we apply various operators such as filter map join Aggregate and many more and finally we will receive the results which will be sent back to the output Kafka cluster so this is how the stream topology works now let us discuss the important features of Kafka streams that give internet over the other similar Technologies so the various important features of Apache Kafka streams API are elastic fault tolerant highly viable Integrated Security Java and scalar language support and exactly ones don't worry we shall discuss each one of them in detail firstly we shall discuss about elastic nature Apache Kafka is an open source project that was designed to be highly available and horizontally scalable hence with the support of Kafka Kafka streams API has achieved its highly elastic nature and can be easily expandable so this was the first feature followed by that we have the second feature which is about fault tolerance the data logs are initially partitioned and these partitions are shared among all the servers in the cluster that are handling the data and their respective requests thus Kafka achieves fault tolerance by duplicating each partition over a number of servers followed by fault tolerance we have the next important feature that is highly viable since Kafka clusters are highly available they can be preferred to any sort of use cases regardless of their size they are capable of supporting small scale use cases medium scale use cases also the large-scale use cases followed by highly viable feature we have Integrated Security Kafka has three major security components that offer the best-in-class security for the data in its clusters they are mentioned as follows they are encryption of data using SSL or TLS followed by that authentication of SSL or sasl and finally the authorization of ACLs so followed by the security we have its support for top and programming language the best part of Kafka streams API is that it integrates itself with the most dominant programming languages such as Java and scalar and makes designing and deploying Kafka server-side applications with ease followed by that we have exactly once processing semantics usually stream processing is a 
continuous execution of unbounded series of data or events but in the case of Kafka it is not exactly once means that the user defined statement or logic is executed only once and the updates to the state which are managed by spe or stream processing element are committed only once in a durable back-end store so this is how Apache Kafka streaming API is considered to be having exactly one's processing semantics so followed by the important features we shall go through a sample program based on Kafka streams API so this particular example can be executed using Java programming language yet there are few prerequisites on this one one needs to have Kafka and zookeeper installed in the local system and it should be running in the background if you have not installed zookeeper and Kafka in your local system then I have linked the article in the description box below which will explain you about the detailed installation procedure of Zookeeper and Kafka in your local system once the Zookeeper and Kafka are installed into your local system you need to fire them up once the Kafka and zookeeper are successfully installed into your local system and they are running in the background you can go to Kafka and Define a producer topic and the consumer once the producer topic and consumer are defined you can come back to Kafka article and execute the following code in any of the Java editors the code will count the number of words that you have provided in your text document and you will receive the output as shown in the article here the text given to the code was welcome to edureka Kafka training and this article is based on Kafka streams these were the two sentences given to the program and the output is as shown below here the word welcome is repeated for once 2 is repeated for once edureka once Kafka is repeated for two times and training is once this article is about streams so all these words are repeated for once so this is how exactly you should be receiving the output once after you execute the following code in your Java editor so followed by the example based on Kafka streams we can move ahead and understand the important differences between Kafka and Kafka streams so now the first difference is that in Kafka stream API single Kafka cluster can support as both consumer as well as producer while on the other hand in Kafka we need separate consumer and producer and Kafka considers consumer and producer as separate entities the second difference is that in Kafka API exactly once processing semantics are supported whereas in Kafka it is not by default but you can achieve exactly one's processing in Kafka manually the third difference is that Kafka streams API is capable enough to perform complex operations whereas Kafka is designed to perform only simple operations the fourth difference is that Kafka API supports single Kafka cluster on the other hand in Kafka you need two different clusters for producer and consumer followed by that in Kafka API the code length is significantly shorter when you come into Kafka the code length involved is highly lengthy the next difference between the both s Kafka streams API can support both stateless and stateful networks what are stateless and stateful networks in stateless networks the client provides requests to the server and he gets instantaneous reply from server and here the cookies or the requests which are sent by the client are not stored whereas if you come into stateful Network the client requests the server along with some additional data which is 
required by the server in this case the cookies or the requests which are provided by the client are recorded So Kafka stream API is capable to support both stateless and stateful networks but on the other hand Kafka is capable only to support stateless Network protocols followed by that the Kafka streams API can support multitasking whereas Kafka is not capable to support multitasking at a single task level followed by that Kafka stream API does not support batch processing whereas Kafka is capable to support batch processing Kafka stream API is all about real time so it doesn't have to support batch processing so these were the few important differences between Kafka streams API and Kafka now we shall move ahead and wind up the session with our last topic which are the important use cases based on Apache Kafka streams API Apache Kafka streams API is used in multiple use cases some of the major applications where streams API is being used are mentioned as follows firstly the New York Times the New York Times is one of the powerful media in the United States of America they used Apache Kafka and Apache Kafka streams API to store and distribute the real-time news through various applications and systems to their readers followed by the New York Times we have Trivago Trivago is the global Hotel search platform they use Kafka Kafka connect and Kafka streams to enable their developers to access details of various hotels and provide their uses with the Best in Class service at lowest prices and finally Pinterest Pinterest uses Kafka at a longer scale to power the real-time predictive budgeting system of their advertising system with Apache streams API backing them up they have more accurate data than ever [Music] reason number one the top priority of organizations is now big data analytics well big data has been playing a role of a big game changer in most of the industries over the last few years in fact big data has been adopted by vast number of organizations belonging to various domains and by examining large data sets using big data tools like Hadoop and Spark they are able to identify hidden patterns to find unknown correlations market trends customer preferences and other useful business information and let me tell you that in an article in Forbes it was published that big data adoption has reached up to 53 percent in 2017 from 17 in 2015 with Telecom and financial services leading early adopters and the primary goal of big data analytics is to help companies make better and effective business strategies by analyzing large data volumes the data sources include web server logs internet click stream data social media content and activated reports text from customer emails phone call details and machine data captured by sensors and connected to the internet of things iot as we call it big data analytics can lead to more effective marketing new Revenue opportunities better customer services improved operational efficiency competitive advantages over rival organizations and other business benefits and that is why it is so much widely used an IDC says that the commercial purchases of big data and business analytics related Hardware software and services are expected to maintain a compound annual growth rate or cagr of 11.9 percent through 2020 when revenues will be more than 210 billion dollars and that is huge and the image here clearly shows the tremendous increase in unstructured data like images mails audio Etc which can only be analyzed by adopting Big Data Technologies like Hadoop spark 
Hive and others this has led to Serious amount of skill Gap with respect to available Big Data Professionals in the current it market and hence it is not at all surprising to see a lot buzz in the market to learn Hadoop reason number two big data is revolutionizing various domains now big data is not leaving any stone unturned nowadays what I mean by this is that big data is present in each and every domain allowing organizations to leverage its capability for improving their business values the most common domain which are rigorously using big data and Hadoop are Healthcare retail government banking media and entertainment Transportation natural resources and so on as shown in the image over here well to be honest it is not only limited up to these domains mentioned here big data is spreading even more across different domains as the technology is evolving the data is increasing and hence big data is getting adopted across a wide range of domains hence you can build your career in any of this domain by learning Hadoop reason number three increasing demand for Hadoop professionals the demand of Hadoop can be directly attributed with the fact that this is one of the most prominent technology that can handle big data and is quite cost effective and scalable with the Swift increase in Big Data sources and amount of data Hadoop has become more of a foundation for other big data Technologies evolving around it such as spark Hive Etc and this is generating a large number of Hadoop jobs at a very steep rate you can check out different job portals like indeed.com knock read timesjob Etc and you'll see the demand of Hadoop professionals when you browse through the job postings reason number four scarcity of Big Data Hadoop professionals now the demand must be more but the supply is quite less and as we discussed Hadoop job opportunities are growing at a high Pace but most of this job roles are still vacant due to a huge skill Gap that is still persisting in the market and such scarcity of proper skill set for big data and Hadoop technology has created a vast gap between supply and demand chain and hence now it is the right time for you to step ahead and start your journey towards building a bright career in big data and Hadoop in fact the famous saying now or never this is an apt description that explains the current opportunities in the big data and Hadoop Market reason number five big data and Hadoop salary this in fact is quite rewarding one of the captivating reason to learn Hadoop is the fat paycheck that you're gonna get the scarcity of Hadoop professionals is one of the major reasons behind their High salary and according to payscale.com the salary of a Hadoop professional varies from ninety three thousand dollars to one hundred and twenty seven thousand dollars per annum based on different job roles you can see the different job roles and their annual salaries over here but of course it can vary according to the organization that you're joined in and also with the experience you have reason number six the big data and Hadoop Trend and as per Google Trends Hadoop has a stable graph in the past five years and one more interesting thing to notice here is that the trend of big data and Hadoop are tightly coupled with each other you can see over here that they go hand in hand and have a direct correlation big data is something which talks about the problem that is associated with storage curation processing and analyzing of the huge amount of data and hence it is quite evident that all of the 
companies need to tackle the big data problem one way or another for making Better Business decisions and hence one can clearly deduce that big data and Hadoop has a promising future and is not something that is going to vanish Into Thin Air at least for the next 20 years reason number seven it caters different professional backgrounds Hadoop ecosystem has various tools which can be leveraged by professionals from different backgrounds if you are from programming background you can write mapreduce codes in different languages like Java python Etc if you're exposed to scripting language Apache pig is the best fit for you and if you're comfortable with SQL then you can also go ahead with Apache Hive or Apache drill the market of big data analytics is growing across the world and its strong growth pattern translates into great opportunity for ID professionals it is suited for developers project managers software Architects ETL and data warehousing professionals analytics and business intelligence professionals and also for testing and Mainframe professionals it is also recommendable for freshers who are going to start a fresh career in the ID world so if you start with big data I'm very sure that you're going to have a bright future ahead reason number eight the different big data and Hadoop job profiles so there are various job profiles and big data and Hadoop learning big data and Hadoop doesn't mean that you'll be sticking on to just Analytics you can pursue any one of these job roles based on your professional backgrounds you could be a Hadoop developer a Hadoop admin a data analyst a big data architect or a data scientist software engineer senior software engineer data engineer whatever seems more convenient for you reason number nine Hadoop is a disruptive technology when Hadoop came into the market it completely disrupted the existing market and created a market of its own Hadoop has proven itself as a better option than that of traditional data warehousing systems in terms of Cost scalability Storage and performance over variety of data sources it can handle structured data unstructured data and semi-structured data in fact Hadoop has revolutionized the way data is processed nowadays and has brought a drastic change in the field of data analytics besides this Hadoop ecosystem is going through continuous experimentations and enhancements in a nutshell I would tell you that big data in Hadoop is taking out the World by storm and if you don't want to get affected you have to ride with the tide reason number 10 and the most important one Hadoop is the gateway to all the big data Technologies Hadoop has become a de facto for big data analytics and has been adopted by large number of companies typically besides Hadoop a big data solution strategy involves multiple Technologies in a tailored manner so it is essential for one to not only learn Hadoop but become expert on other big data Technologies falling under the Hadoop ecosystem this will help you to further boost your big data career and grab Elite roles like that of a big data architect data scientists Etc so if you want to become a big data architect or a data scientist Hadoop is the best option to get started with Hadoop is the stepping stone for you to move into the Big Data domain so here were the top 10 reasons to learn Hadoop [Music] so the first book in the beginner section is Hadoop definitive guide so this particular book was written by Tom White and the publisher as Riley media so let's have a quick overview of this particular 
book if you are a complete beginner then there is no better book than Hadoop definitive guide this book guides beginners to build a reliable and easily maintainable Hadoop configuration it helps to work on data sets regardless of sizes and types it has numerous assignments that help you understand Hadoop real-time functionality in a much better way going through this book will help you to understand even the latest changes very easily followed by the first book the second book in The Beginner's section is Hadoop in 24 hours so this particular book was written by Jeffrey even and this book was published by O'Reilly media let's have a quick overview of this book in case if you already have a brief idea on Hadoop and want to have a quick recap of the technology then this book is for you this particular book gives you a perfect overview of building a functional Hadoop platform interface old Hadoop ecosystem components and many more also if you're looking for some real-time examples then it has the best in class setup Solutions ready for download followed by the second book the third book in this section is Hadoop in action this particular book was written by Chuck Lam and the publisher is Manning Publications let's have a quick review of this particular book Hadoop in action is like the one step solution to learn how to from scratch this book basically starts from the default Hadoop installation procedures followed by the installation it explains all about the most crucial components of Hadoop the mapreduce and many more also the book deals with some real-time applications of Hadoop and mapreduce including the major Big Data Frameworks which are used in data analytics so the last book in the beginner section is Hadoop real world Solutions this particular book is written by three authors they are Brian John and Jonathan Owens and the publisher of this particular book is packed publishing and now let's have an overview of this particular book this particular book is for the intermediate Learners who are looking to try out multiple approaches to resolve the problems this book has an in-depth explanation of the concepts problem statements technical challenges steps to be followed and Crystal Clear explanation of the code used you will also understand the procedures to build Solutions using tools like Apache Hive Apache Pig mahaut graph hdfs and many more crucial components now we shall learn about some books for the experienced programmers so the first one amongst the books for the experienced Hadoop developers are prohadu this particular book was written by Jason venner and it was published by apris Publications and the overview goes like this this particular book gets the readers an upgraded stage to play with Hadoop the heard of clusters this book covers every single detail related to Hadoop clusters starting from setting up a Hadoop cluster to analyzing and deriving valuable information for improvising the business and scientific research you can understand to solve the real-time big data problems using the mapreduce Away by dividing problem into multiple chunks and distributing these chunks across the cluster and solved it parallely in a short period of time followed by Pro Hadoop we have the second book in The Experience section that is optimizing Hadoop for mapreduce this particular book was written by Kali Tanner and the publisher is packed publishing let's have an overview of this particular book this book is all about solving the major loopholes in real-time applications of Hadoop and 
mapreduce this book majorly concentrates on the optimization process of mapreduce jobs this book basically starts from introduction of mapreduce and then it takes off to the real-time applications of mapreduce and gives us an in-depth understanding of mapreduce so that we could tune the code for maximum performance followed by optimizing Hadoop for mapreduce we have Hadoop operations this particular book was written by Eric Summers and it was published by O'Reilly media let's have a quick overview of this particular book the necessity for managing operation specific data has grown exponentially and Hadoop has become the standard solution for all the big data problems processing these large-scale industry level problems require a whole new different level of approach and Hadoop cluster configuration this book exactly explains the same and gives you a brief on managing large-scale data sets and Hadoop clusters so followed by Hadoop operations we have scaling Big Data with Hadoop solar this particular book was written by Mr rishikesh and it was published by Pact publishing let's have a quick overview of this particular book this particular book is all about big data Enterprise search engine and with the help of Apache Hadoop and Apache solar together Apache Hadoop and Apache Sola have come up with an approach to help organizations to deal with their big data and resolve their problem of information extraction through an amazing solution and it has extraordinary faced search capabilities this book gives a complete briefing about the scene followed by scaling Big Data with Hadoop solar we have professional Hadoop Solutions this particular book was written by three authors they are Boris Kevin and Alexi and it was published by rocks Publications let's have a quick overview about this book too this book is for advanced and professional level Hadoop developers this book deals with one concept to increase the power and maximize the capability of Hadoop The crucial responsibility of Hadoop developers and Hadoop Architects is to understand the compatibility between Hadoop Frameworks and Hadoop apis and how to integrate them to provide optimized performance and to deliver real-time Solutions now with this let us move ahead into the last book of today's session that is the data analytics with Hadoop so this particular book is written by two authors they are Benjamin and Jenny Kim So this particular book was published by O'Reilly media and let's have a quick overview of this book in recent days machine learning and artificial intelligence are taking over and Hadoop is know by giving up the race it is constantly trying to integrate itself with data science and Hadoop framework has now become the standard for data analytics this book is a perfect guide to understand data warehousing techniques and higher order workflows that Hadoop can perform in the process of data analytics so these were the top 10 best books for learning how to [Music] now who is a big data engineer a big data engineer is somebody who's responsible for collecting the data from various sources transforming it into a usable format and storing it they basically take raw data and convert it into something that is Optimum for stakeholders that access the data it makes it easier to process and derive business insights from stakeholders can be data analysts data scientists and software developers in simple words data Engineers transform data into a format which can be easily analyzed in order to collect and store data these professionals 
design build test and maintain complete infrastructures the system provides a foundation for each and every data driven activity and action that is performed in the organization and while doing so big data Engineers always keep the business requirements in mind now let's look at the big data Engineers roles another typically three kinds of roles that a big data that engineer has to assume first of all we have the generalist now generalists are typically found on small teams or in small companies in this setting data Engineers wear many hats as of one of the few data focused people in a company journalists are often responsible for each step of the data process from managing data to analyze it next we have pipeline Centric data Engineers often found in mid-size companies these Engineers work alongside data scientists to help make use of data that they collect pipeline Centric data Engineers need in-depth knowledge of distributed systems and computer science and finally we have the database Centric profile in larger organizations where managing the flow of data is a full-time job data Engineers focus on analytics databases database-centric data Engineers work with data warehouses across multiple databases and are responsible for developing table schemas now the next logical question will obviously obviously be what does a big data engineer do now Big Data engineer plays a big role in any data driven business which also means they are responsible for many things but most importantly they are responsible for Designing creating testing and maintaining the complete infrastructure and for storing and processing data that is gathered from various sources in order to perform this activity data Engineers need to have a good grasp of fundamental knowledge such as OS programming knowledge and database management system apart from this the professional has to be an expert on SQL development further providing support to data and analytics in database design data flow and Analysis activities the position of the database engineer also plays a key role in development and deployment of innovative Big Data platforms and for advanced analytics and data processing the next thing I want to talk about is building highly scalable robust and 4 tolerance systems now imagine a building we all know that the deeper it is under the ground the higher the building can be constructed without collapsing now a big data engineer does something pretty similar with data now these data Engineers work closely with big data architects in designing a complete architecture both of them make sure that the system must be scalable in terms of either adding new data sources or in handling exponentially growing huge amounts of data Big Data Engineers should also have the capability to architect highly scalable distributed systems using different open source tools designing consideration should incorporate that the system must be robust and fault tolerant where each component should provide a level of fault tolerance they are also involved in the design of Big Data Solutions because of the experience they have with Hadoop based Technologies such as mapreduce Hive mongodb or Cassandra a Big Data engineer builds large-scale data processing systems and is an expert on data warehousing Solutions and should be able to work with the latest nosql database Technologies next let's talk about the biggest process in all these responsibilities which is the ETL or the extract transform and load process mundane as it sounds this is actually the 
process which might take the most amount of time in order to store data in such a format that data analysts and data scientists can analyze and derive meaningful insights from it the raw data collected from various sources need to be transformed data Engineers need to have the knowledge of programming language and tools to perform ETL the ETL process becomes much more complex when Big Data comes into picture with the Advent of huge amounts of data which is getting generated at a very high rate it becomes even more tough to perform ETL a big data engineer should be somebody who does this with utmost proficiency next is the business Acumen aspect now data Engineers should have a good business Acumen so that the system that he or she develops or the data that is transformed and stored should be according to business needs this reduces the cost of deriving insights from data and a good data engineer performs half the transformation that is required for data analytics larger organizations often have multiple data analysts or scientists to help understand data while smaller companies might rely on a data engineer to do so next is data acquisition now data engineer should always look at the bigger picture he or she must have the idea about gaining data from various sources and how the data helps in gaining insights this will help him or her to understand how data is acquired from different sources and can be used in different ways to derive insights from they can also try finding more data sources that can help getting more accurate predictions and better insights next let's talk about the programming languages and tools that a data Engineers should be proficient in some of the responsibilities of a data engineer include improving data foundational procedures integrating new data management Technologies and software into existing systems and building data collection pipelines a big data engineer should embrace the challenge of dealing with petabytes or even exabytes of data on a daily basis a professional so understands how to apply Technologies to solve big data problems and to develop Innovative Big Data Solutions in order to be able to do this the Big Data engineer should have extensive knowledge in different programming or scripting languages like Java Linux Ruby python or R also expert knowledge should be present regarding different nosql or relational database management systems such as mongodb or redis building data processing systems with with Hadoop And Hive using Java or python should be common knowledge also to a big data engineer apart from this he or she should have a good command over at least one programming language and multiple tools knowledge of ETL tools and data warehousing tools is also required apart from the knowledge of individual tools Big Data Engineers should also know how to integrate various tools and create a complete solution based on given requirement now having said all that one very important thing which a data engineer should be looking at is Performance Tuning one of their responsibilities includes performance tuning and making the whole system more efficient with time first the performance of an individual component needs to be improved and then the entire system needs to be optimized with that let's move on to the summary of what we've discussed so far in terms of responsibility now with all that said the basic responsibilities of a big data engineer boys rounded these three things first up data ingestion now this is associated with the task of getting data 
out of the source systems and ingesting it into a data Lake a data engineer should need to know how to efficiently extract data from a source including multiple approaches for both batch and real time apart from these they would also need to know about how to deal with issues around incremental data loading fitting within small Source windows and parallelization of loading data as well a part of it is also data synchronization which could be considered a subtask of data ingestion is data synchronization but because it is such a big issue in the Big Data world since Hadoop and other big data platforms do not support incremental loading of data here the data Engineers should need to know how to deal with changes in Source data merge and sync change data from sources into a big data environment next let's talk about data transformation this is is the tea part in the ETL which is extract transform and load and is mostly focused on integrating and transforming data for a specific use the major skill set here is knowledge of SQL and as it turns out not much has changed in terms of the type of data Transformations that people are doing now as compared to a purely relational environment finally performance optimization and data models now this is one of the tougher areas anyone can build a slow performing system the challenge is to build data pipelines that are both scalable and efficient so the ability and understanding of how to optimize the performance of an individual data Pipeline and the overall system are at a higher level of data engineering skill for example Big Data platforms continue to be challenging with regard to query performance and have added complexity to a data engineer's job in order to optimize performance of queries and the creation of reports and interactive dashboards the data engineer needs to know how to denormalize partition and index data models or understand tools and Concepts regarding in-memory models and olap cubes now that we've spoken about the responsibilities of a data engineer let's talk about the skills that is required to assume these responsibilities these are the basic skills that one should have to fulfill the responsibilities of a big data engineer first of all let's talk about Big Data Frameworks or Hadoop based Technologies now with the rise of big data in the early 21st century a new framework was born and that is Hadoop all thanks to dog cutting for introducing a framework which not only stores big data in a distributed manner but also processes the data parallely there are several tools in the Hadoop ecosystem which caters to different purposes and professionals belonging to different backgrounds but some of the tools which are must to master are hdfs or Hadoop distributed file system which as the name suggests is the storage part of Hadoop but stores data in a distributed cluster being the foundation of Hadoop knowledge of hdfs is a must to start working with this framework next we have yarn which is introduced originally in Hadoop to point x in order to make Hadoop more flexible efficient and scalable yarn performs resource management by allocating resources to different applications and scheduling jobs next we have mapreduce which is a parallel processing Paradigm allowing data to be processed parallely on top of distributed Hadoop storage next we have pig and Hive which look at the data warehousing perspective of big data to perform analytics and scripting next we have Flume and scoop which are popular tools for importing and exporting data to 
hdfs next we have zookeeper which acts as a coordinator amongst the distributed Services running in the Hadoop environment it also helps in configuration management and synchronizing services and finally we have Uzi which is a scheduler binding multiple logical jobs together and helping in accomplishing a complete task the next skill I'm going to talk about is real-time processing framework now real-time processing with quick actions is the need of R either it is to detect fraudulent transactions in a credit card system or a recommendation system each and every one of them needs real-time processing and it is a very important skill for a data engineer to have now Apache spark is one such distributed real-time processing framework which is used in the industry rigorously and it can be easily integrated with Hadoop leveraging hdfs next database Management systems and architecture a database management system is something that stores organizes and manages a large amount of information within a single software application data Engineers need to understand dbms to manage data efficiently and allow users to perform multiple tasks with ease this will help the data engineer in improve data sharing data security access integration and minimize data inconsistencies these are fundamentals that said professionals should know to build a scalable robust and fault tolerance system next we have SQL based Technologies now there are various relational databases that are used in the industry such as Oracle DB MySQL Microsoft SQL Server sqlite Etc now data Engineers must have at least the knowledge of one such database knowledge of SQL is also must structured query languages is used to structure manipulate and manage data stored in relational basis as data Engineers work closely with relational databases they need to have a strong command on SQL next we have nosql Technologies as the requirements of organizations had grown Beyond structured data nosql databases were introduced it could store large volumes of structured semi-structured and unstructured data with quick alteration and agile structure as per application requirements some of the most prominently used databases are hbase Cassandra and mongodb next we have programming languages now various programming languages can serve the same purpose the knowledge of just one programming language is enough as the flavor changes but the logic Remains the Same if you're a beginner you can go ahead with python as it is easy to learn due to its easy syntax and good Community Support whereas R has a steep learning curve which is developed by statisticians now R is mostly used by analysts and data scientists next we have ETL or data warehousing Solutions data warehousing which is very important when it comes to managing a huge amount of data coming in from heterogeneous sources where you need to apply ETL now data warehouse is used for data analytics and Reporting and is a very crucial part of business intelligence it is important for a big data engineer to Master One data warehousing or ETL tool and after mastering one it becomes easy to learn new tools as the fundamentals remain the same Informatica click View and talent are very well known tools used in the industry I would recommend you start with talent because after this learning any data warehousing tool will become a piece of cake for you and finally we have operating systems apart from all these skills intimate knowledge of Unix Linux and Solaris is very helpful as many math tools are going to be based on 
That was all about the skills; we have covered this segment in more detail in a previous video, so you can look at that to understand the skills you will require as a big data engineer. [Music] In this era every organization is acquiring data from all possible sources, analyzing it, and making thought-out, data-driven decisions. Data engineers are the ones who design, build, test and maintain the complete architecture of this large-scale processing system. With increasing data sources and accelerating data growth, various challenges have emerged in storing, processing and handling data, and that is what we call big data. Fun fact: according to an Accenture study, 79 percent of enterprise executives agree that companies that do not embrace big data will lose their competitive position and could eventually face extinction, as quoted by Forbes magazine. Next, let us look at the market trends and projections. The best way to analyze the job trend for a big data engineer is to look at the jobs available on various portals: according to Glassdoor the number of big data engineer jobs is well over 9,000, and in the UK well over 2,000, while according to Indeed the number in India is above 13,000 and in the US above 127,000. In spite of considering big data a challenge, organizations are turning it into an opportunity to find insights and gain a competitive advantage over rivals, and to achieve that they hire big data engineers, who are paid handsomely no matter how fresh they are in the field. Looking at the job distribution per salary range in India for a data engineer: about 33 percent are paid more than 5 lakhs per annum, about 26 percent more than 7.3 lakhs, and about 20 percent more than 8.7 lakhs, which are very high salary brackets; the average salary for a data engineer is almost 8 lakhs per annum and for a senior data engineer almost 16 lakhs. In the US, 32 percent of professionals make more than 90,000 dollars a year and 27 percent make around 105,000 dollars a year; the average US salary for a data engineer is well over 90,000 dollars and for a senior data engineer about 124,000 dollars per annum. Now that we have discussed salary, let us look at a few factors, in the form of skills and technologies, on which it depends. Here we have curated a table listing skills and the average salary they can command: services such as AWS, data analysis, data mining, warehousing, machine learning, programming languages like Java and R, BI and statistical tools like Tableau, database architecture, ETL, and structured query languages. Another influence on salary is experience: an entry-level data engineer makes about 85,000 dollars a year, people with five to eight years of experience bag nearly 103,000 dollars a year, and people with around 10 years of industry experience get over 118,000 dollars a year. Now this salary must be coming from somewhere, so presenting
to you the companies that hire for this job role: as you can see, there are some very big names like Amazon, Google, Bosch, Microsoft and IBM hiring big data engineers. Companies that hire big data professionals are companies invested in the future, and worldwide big data market revenues for software and services are projected to increase from 42 billion dollars in 2018 to 103 billion dollars in 2027, a compound annual growth rate of about 10.5 percent; that sort of growth needs a lot of people doing the work. After going through multiple job descriptions we found that a big data engineer's salary has many variables, and here are the skills you require: a big data engineer must know their operating system, and Linux is one of the most widely used OSes in the industry; they must be proficient in at least one programming language, C, R or Python for instance; database management systems are also a crucial skill; they must know how to work with structured query languages as well as NoSQL technologies; data models and data schemas are among the key skills a data engineer should possess, hence data warehousing is something they should be comfortable with; and what is a big data engineer who does not know the big data frameworks? HDFS, YARN, MapReduce, Pig and Hive should be at their fingertips. Finally, real-time processing with quick actions is the need of the hour, whether it is a credit card fraud detection system or a recommender system; each of them needs real-time processing, so it is very important for a data engineer to have this knowledge. [Music] Now, when we talk about big data, a very basic question often pops up: what are the five V's of big data? Can anybody answer that? Okay, Ravi wants to answer, let me unmute him. Ravi, over to you: "First is volume, the size of data, how much it is growing day by day. Next is variety: we have three types of data, structured, unstructured and semi-structured. Structured data is relational database data, unstructured data is audio, images and files, and semi-structured is things like XML files. Velocity is how fast the data is growing." Very good answer. When big data started, IBM gave a definition with three V's: volume, variety and velocity. Volume is about the amount of data we are dealing with; for example, today Facebook deals with a very huge amount of data. Variety: is it only Facebook generating data? No, Twitter is generating data too, and it is not just social media sites; in the medical domain as well a lot of data is getting generated, and it comes in different varieties, structured and unstructured, videos, audio and so on; that is what we call variety. The third component is velocity: Facebook is just a 10 to 12 year old company, and imagine the growth they have made in that time, with every user posting videos, audio and all kinds of chats; the speed
at which they have grown from scratch to this level, and the pace at which the data arrives, is called velocity. Now, these were the three components that went around the market for a long time, and if you ask me these three are still the major ones, but slowly people realized there should be a fourth V, veracity, which also makes sense, because the data that comes to us cannot be expected to always be clean: there can be missing data, there can be corrupted data in the middle, and how you deal with those scenarios is a dimension of its own. So bad, corrupted or missing data is the category they decided to call veracity. People then said these four are the major components, but it did not stop there: if veracity can be added, why not value, because for the data I am getting I want to know how important it is; somebody said I want to visualize the data, so visualization should also be a V; somebody started talking about the vocabulary or the validity of the data. The list keeps growing, but the V's that carry real value are mainly the first few, and usually in an interview they will not expect you to know all of them; if you do it is all good, but they mainly expect you to understand at least volume, variety and velocity. Sometimes, to make it tricky, they ask for five V's just to see whether you can think a little beyond what you already know; I generally do that myself when I interview people. Moving further, we just talked about structured and unstructured data; can everybody explain the difference between them? Let me add one more category, semi-structured data, as well: can anybody explain these three and give me the differences? Someone says unstructured data is not easy to save into an RDBMS; okay. Structured data is basically in row and column format, easy to parse. Another answer: structured data is DBMS data, unstructured is logs, audio, video, and semi-structured is XML or JSON. Yes, a lot of people are giving the right answer here. If we go back to the 1970s and 1980s, when Oracle and IBM and those companies came into the market, people had data even then; it was small data, but you would be surprised to hear that at that time it was still a challenge to deal with, even though it had some sort of pattern. People used to wonder how to use it, how to manage it, how to
store it and where to store it; all those questions were on people's minds, and that is where companies like Oracle and IBM came into the market with their RDBMS solutions. They delivered a way to store and process data that had some pattern to it, and today I need not tell you where those companies are: Oracle, Microsoft and the rest are now market giants, and most of you would happily work for them given the chance. Things were going fine, but slowly other kinds of data started coming in. The data they were dealing with was structured, but look at today's world: Facebook came up a few years back, and on Facebook you upload videos, audio and pictures, so you start dealing with that kind of data. Do you think this kind of data can be handled by an RDBMS? The answer is no; we cannot even call it structured data, because it does not have any sort of pattern, and that is why we started calling it unstructured data: any data without a pattern, like your audios and videos. Now, the third category is semi-structured data. Take XML as an example: as soon as you see an XML or JSON file, does it have a pattern or not? Yes, it has tags, so your first instinct is to call it structured data. But if it is structured, my question is: can you do all the activities you can do in an RDBMS on XML data? Can you fire triggers on it, can you do everything you do with your traditional data? No. Today you can even push unstructured data into an RDBMS, because they introduced BLOB and CLOB data types, but is that efficient? Not really. So now it starts sounding like unstructured data, and I am confused whether it is structured or unstructured; that is why a third category was created, semi-structured data, to hold in the middle the data that looks partly structured and partly unstructured. Is everybody clear on structured, semi-structured and unstructured data? Okay, moving further, another question: how does Hadoop differ from a traditional processing system using an RDBMS? These are warm-up questions; usually in an interview they will not start with the most complicated ones, they want to gauge your level of expertise first. So can you answer: how does Hadoop differ from a traditional RDBMS-based processing system? And friends, one request: rather than raising your hand, please type your answer in the chat window,
because that way the discussion is more interactive. Okay: someone says processing is distributed; someone says Hadoop can store and process any type of data whereas an RDBMS can store only relational data, okay, I can take that; distributed storage and processing, good answer; parallel processing of large data. A few people are just describing what Hadoop does; read the question properly, I want the differences, the clear-cut differences. This question can also be asked as: is Hadoop replacing the RDBMS, can it replace it? Anybody who knows Hadoop basics should be able to answer this easily, and being able to say it out loud will also give you confidence in an interview. Someone says the two are complementary to each other, and that Hadoop cannot replace the RDBMS. Very good, but what is the reason? As this person says, ACID properties are not supported; in other words, CRUD operations (create, update, delete) are not supported at the row level, and that is the major reason you cannot simply replace an RDBMS with a Hadoop system. Also, when the data is small and structured, which is more efficient, the RDBMS or Hadoop? The RDBMS. Yes, the latest versions are starting to add some of this support, but with a lot of restrictions, so hold on until it properly arrives in the market. Another good point from the chat: HDFS follows the WORM principle, write once, read multiple times. One more important difference: is an RDBMS free of cost? No, it is licensed software you pay for, whereas Hadoop is completely open source. There are companies making money around it all the same: because it is an open-source community, if you get stuck there is no vendor obligated to fix things for you the way there is with an RDBMS, so a lot of companies came up with the idea of providing commercial support: use Hadoop, and if you need help, we will support you and you pay us for that support. That is one model which is happening now.
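To make the write-once, read-many point concrete, here is a minimal sketch using only standard HDFS shell commands; the file and directory names are made up for illustration.

```bash
# Write once: copy a local file into HDFS
hdfs dfs -mkdir -p /data/worm_demo
hdfs dfs -put events.log /data/worm_demo/events.log

# Read many: any number of clients can read the same file concurrently
hdfs dfs -cat /data/worm_demo/events.log | head

# Appending new records to the end of the file is supported...
hdfs dfs -appendToFile more_events.log /data/worm_demo/events.log

# ...but there is no shell command to update a record in place;
# the usual pattern is to write a new file (or partition) instead.
```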
Another difference: an RDBMS mostly deals only with structured data; you basically cannot work with unstructured data in it. As somebody argued, the latest versions of Hive are trying to support CRUD-style operations, but with a lot of restrictions, so until that matures you cannot rely on it; similarly, RDBMS vendors created the CLOB and BLOB data types so you can store unstructured data, but they are not efficient either. So most of the data you deal with in an RDBMS is structured, whereas a Hadoop system can handle unstructured, semi-structured as well as structured data; if we take Hive as an example, it works with structured data on top of Hadoop. Another difference: with an RDBMS you usually work on a single machine, say one server where the RDBMS is installed and running, but with Hadoop you work in a distributed fashion, with multiple machines involved. And as I said, with an RDBMS, when the data is small your computation is very quick, while with Hadoop the computation speed on small data is not that great; with Apache Spark coming into the market that is a different story, because if you have enough memory you can get good speed, but with traditional MapReduce the speed is slow. If you have already done the MapReduce classes you might have noticed this with the word count example: it is not that fast. So with a Hadoop system, especially when the data is small, your speed will be slower than with an RDBMS. This brings me to another question: can anybody tell me the components of Hadoop and the services they provide? I am showing you all the components; can you explain them? Very good, somebody says HDFS is on the storage side; can I get more answers? HDFS for storage, YARN as the cluster resource manager, yes. The name node manages the cluster and the data nodes store the data; okay, I can take that answer, but only partially, can you be a little more explicit? YARN for resource allocation, very good; YARN to schedule and run MapReduce jobs, that is a more appropriate way to put it; and the name node holds the metadata, very good.
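If you have a cluster handy, you can see these components for yourself from the command line; a small sketch, assuming a running HDFS and YARN installation.

```bash
# Ask the name node for a cluster summary: capacity, live/dead data nodes, etc.
hdfs dfsadmin -report

# List the node managers that the YARN resource manager knows about
yarn node -list

# List the applications currently managed by YARN
yarn application -list
```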
In a real-world scenario you can relate it to this: let us say you have a boss, a smiling but very clever boss, and three people, Shri, Vivek and Narasimha, reporting to him. The boss gets a project and distributes it into three parts, P1, P2 and P3: he gives P1 to Shri, P2 to Vivek and P3 to Narasimha. As long as all three deliver on time, the boss is happy. Now imagine Vivek tells the boss he has a family emergency and has to take leave; the boss has a problem, because he still has to deliver every part of the project. So what does the clever boss do? He calls Shri into his cabin and says, you are doing a very good job, I am thinking of promoting you if you keep working like this, so you should take up some senior responsibilities from now on; and as soon as Shri hears the word promotion he is happy and stops listening to anything else. Then the boss adds, since you are taking on senior responsibilities, can you keep a backup of Vivek's work? Shri starts arguing that he is already busy, but the boss says, I am only asking you to keep a backup, not to do the work; only if Vivek is unavailable will you have
to actually work on it, and anyway you are a senior candidate now, and when the boss is the boss you cannot say no. The boss does the same thing with Vivek, telling him it is all confidential and asking him to back up Narasimha's project, and the same with Narasimha, who ends up backing up Shri's project. Now if Vivek really does take emergency leave, the boss faces no trouble, because someone already holds a backup of his work; the only person who is sad is the one doing the extra work, certainly not the boss. Notice the boss is also keeping one more piece of information: which project each person is working on and who holds the backup of what, for example that Shri is working on project P1 and also holds the backup of P2. Why am I telling you all this? Because every Hadoop component can be related to this story; in Hadoop, every human is simply replaced by a machine. The first component we talked about, the name node, is the boss. The second component, the data nodes, are the employees who actually do the work: the nodes where the storage and processing happen, while the boss only instructs and manages. The information the boss keeps about who is working on what is data about the data, and that is called metadata: the name node keeps all the metadata. Then what about the secondary name node? Just as we kept backups of the project files P1, P2 and P3, we also want a backup of the boss's own notes, so we keep something like a secondary name node as a backup of that metadata in case the name node goes down. Then there is the resource manager: the boss's skill of scheduling the work and deciding who gets which part is exactly what the resource manager does; in Hadoop, the resource manager schedules and allocates everything. And the last part, the node manager: do you think you would
be able to work on a project without any skill set? No, you need some skills to work on that project, and that skill set is like the node manager, which manages your own node and helps it execute its task. This is how you can relate everything; I hope this analogy makes it easy to remember all these components, because this is a very important interview question: what are the main components of Hadoop and their services? They also ask about the configuration files, and that is why I explained it this way, so that you can explain it in an interview even if you do not remember every detail. Let us go further: what are the main Hadoop configuration files? This is asked mostly in Hadoop administration interviews; if you have done any Hadoop administration, you know there are a few files you need to configure. One is hadoop-env.sh, where you set your environment variables, for example where your Java home and Hadoop home are. Then core-site.xml, where you define where your name node is going to run, that is, its address and port. In hdfs-site.xml you define things like the replication factor and where the name node and data node directories should physically live. yarn-site.xml and mapred-site.xml define how MapReduce jobs run: which framework or cluster mode you are going to use, local or YARN, and on which machine and port your resource manager should run. The last two are the masters and slaves files: in the masters file you usually mention where the secondary name node should run; when I say secondary name node, it is like a snapshot of the name node, not exactly a hot backup, something that just copies the metadata so that if the name node goes down at least a copy exists, but it does not become active on its own. The slaves file lists where all my data nodes are, that is, which machines will act as slaves.
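If you want to check what those files resolve to on a running cluster without opening the XML by hand, here is a minimal sketch using getconf; the keys shown are standard Hadoop 2.x property names, assumed to apply to a typical setup.

```bash
# Where the name node listens (set in core-site.xml)
hdfs getconf -confKey fs.defaultFS

# Default replication factor and block size (set in hdfs-site.xml)
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.blocksize

# Which framework MapReduce jobs run on (set in mapred-site.xml)
hdfs getconf -confKey mapreduce.framework.name

# List the configured name nodes and secondary name nodes
hdfs getconf -namenodes
hdfs getconf -secondaryNameNodes
```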
People who have done Hadoop administration must have played with these files; these are the major Hadoop configuration files. There are others as well, like hive-site.xml and hbase-site.xml, but those are tool-specific, so they are not counted among the main Hadoop configuration files: if somebody asks you to name the Hadoop configuration files, your answer is these seven. Remember these seven files, and if you are going for a Hadoop admin interview expect a good number of questions on them; they will ask you to explain each file and what you configure where, so make a note of it, it is one of the interviewer's favorite questions for admin roles. Moving further, let us take some HDFS questions. HDFS stores data on commodity hardware, which has a higher chance of failure; that is obvious, because my laptop could be one of the data nodes, and a data node can fail at any moment. So how does HDFS ensure the fault tolerance of the system? Can anybody answer? Very good. I actually answered this beforehand with the boss and employee example: the boss was keeping backups, and Hadoop also creates backups, and in the Hadoop world that backup is called replication. You have, say, one block P1, and you create one or two more copies of that block; that is replication, and that is how Hadoop ensures that even if a machine fails we do not lose data and everything keeps working. A lot of people gave the right answer here: block replication. As you can see in this example, block one is replicated three times, block two is replicated three times, and similarly blocks three, four and five, across four different machines. You can actually check this replication from the command line, as shown below.
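Here is a minimal sketch of how you might verify replication and overall block health on a cluster; the path is just an example.

```bash
# Show how a file is split into blocks, where the replicas live,
# and whether anything is under-replicated or corrupt
hdfs fsck /data/raw/orders -files -blocks -locations

# Cluster-wide health summary (also reports under-replicated blocks)
hdfs fsck /

# Change the replication factor of an existing file to 3 and wait for it
hdfs dfs -setrep -w 3 /data/raw/orders
```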
Now, someone in the chat mentioned that the edit log and FS image are used to recreate the name node state. The FS image and edit log are a different thing from replication, you should not mix them up, but since the question came up let me answer it here, because it is a very famous interview question. You have a name node where you keep the metadata; and where does the name node keep that metadata? Not in the FS image file initially; this itself can be an interview question. Very good: we keep it in memory. Why? Imagine the name node kept its metadata on disk and there are multiple clients, C1, C2, C3. Whenever client one wants to access something, that metadata would have to be read from disk into memory, processed, and then dropped from memory again, which is an input/output operation, and I/O is always expensive; if multiple clients query the name node at the same time there would be far too many I/O operations, every request bringing data into memory and writing it back. That is something we want to avoid, so the design is that whatever metadata is created is created and kept directly in memory, with nothing read from disk on the hot path. Which brings up another, trickier question: RAM is volatile, so if I restart the machine the metadata in memory is gone; how do I make sure I do not lose it? What they do is keep everything in the name node's memory, but at some interval of time take a backup of that metadata to disk, and the backup kept on disk is called the FS image. Now, that FS image is going to be big, and the backup is typically taken on a schedule, say every 24 hours: today I have fsimage version 1, and tomorrow the metadata will be backed up again. But that creates another problem: what if my machine or my RAM fails in the 18th or 23rd hour? Then I would lose up to a day of metadata, which is not good. So what they came up with is: whatever activity happens in between, keep writing it into a small file, and that file is called the edit log. For those 24 hours, whatever you do gets recorded in the edit log, and at the next checkpoint fsimage 1 plus the edit log are merged and fsimage 2 is created. In this scenario, even if I lose the in-memory data in the 23rd hour, the edit log still has the changes, and that is how I ensure I am not losing any metadata. Are you clear about the edit log and FS image now? I often see people simply mug this up and say they know FS image and edit log but get confused when asked to explain, so I hope it is very clear to you now.
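If you are curious what these files actually contain, Hadoop ships offline viewers for both; a minimal sketch, where the fsimage and edits file names are placeholders for whatever sits in your name node's current directory.

```bash
# Dump a checkpointed fsimage to XML with the Offline Image Viewer
hdfs oiv -p XML -i fsimage_0000000000000042 -o fsimage.xml

# Dump an edits file to XML with the Offline Edits Viewer
hdfs oev -p xml -i edits_0000000000000001-0000000000000042 -o edits.xml
```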
people actually kind of confused with this so I hope you should be very clear now on that in this file can this map of prime inter will be configured so basically wherever the physical location of neighborhood you have considered and where you consider I just told you what in hdfs site dot XML right so wherever you have configured that so there will be a name directory in it in that name directly there is another subdirectory called as current directory and that correct directory there is another directory called as SNL directory which is secondary manual directly there we keep this SS image and edit reference clear about this part so this is where we basically give them uh so can you some please summarize the answer once sure you have the same answer of FS image energy clock but correct correct not exactly two press one file will be kind of very different so that's the reason we are creating a smaller version of that time called so that every 24 hours activity begins that and it locks are stored in the distance yes in the name website it's kind of act like a backup that's it it's a very good interview question that's the reason as soon as this question came up I thought to answer it up so it's not a part of this line but this is a very famous interview question that can you explain the success we made an edit block and I can tell you that most of the people fail to explain this here memory is ranked correct correct so it just metadata which is stored correct correct it is just creating a backup of that metadata for the 24 hours activity that means edit log is getting erased and getting and get new data yes yes every 24 hours it just keep on working and kind of erase the data that's all for now okay or it creates a new version it depends how your admin have configured that what is the uh block data goes more than the memory now in that case there is something called as still basic usually it's not it do not move like that but there is some concept called a string so in that case there will be some input output operation happening you have to deal with that so then you are making your name notes wrong so you have to make sure if you should have a good configuration but if you do not have it then in that case you have to do input.com no other option then you have to keep in the disk and then the data will be having input output the window is of 24 backup can which is yes it can be changes to summarize this what we just talked about so in name node uh in name node basically you will be storing all the data but the problem is my Ram is going to be volatile now because of that I want to definitely want to have a backup now we keep a backup in the disk and that whatever backup we are keeping we call it as assessments now FS image backup is always taken in 24 hours slot now the another problem starting with this what happened if I move the data in 23rd in that case I should again create a smaller version of the file called as edit block okay that will also be available now these things will be added and will be basically given what can be the Ram size the bigger the better so definitely there is no right answer for it now definitely if you are saying that 32GB is good I will say how about 128 GB if you say 128 GB is good I will say how about 256 GB because that will be better what if more metadata so we can keep on arguing and keep on improving right so the more the rank better it is for you correct correct women with every change in the CFS that's a bit not yes correct that's that's what 
Now let us move further; I hope everybody is clear on that question. It was a side question, but it is good that it came up, because it is one of the very famous interview questions. Next question: what is the problem with having lots of small files in HDFS, and can you give one method to overcome it? What are the problems, and what is the solution? Change the block size? No, I want a better answer. The name node memory will be overloaded; good, now you are on the right track: the RAM will run out, because if you have many small files your metadata entries will be far too many, and we just learned that all the metadata is stored in the RAM of the name node. Also, more map tasks will be spawned; yes, that is another problem. So what is the solution? Having a larger block size so the name node has less metadata to handle; I can take that, but I am expecting a better answer. Merge the small files and store them together; very good. We can create a .har file: just as you create a zip or RAR file in Windows, in Hadoop you can create a .har file, which stands for Hadoop archive. You bring all the small files together into one archive, and then only one set of metadata entries is kept, so the metadata held by the name node is reduced. How do you do it? With the hadoop archive command: hadoop archive -archiveName, whatever archive name you want to give, then your input location and output location; a fuller example follows below. This is how you deal with small files: it is better to create an archive, the same way in real life you zip files of the same type to keep them together.
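Here is a minimal sketch of archiving a directory of small files into a HAR and reading it back; all paths and names are illustrative.

```bash
# Pack the small_files directory (under parent /data) into one Hadoop archive
hadoop archive -archiveName logs.har -p /data small_files /data/archives

# The archive is still browsable and readable through the har:// scheme
hadoop fs -ls har:///data/archives/logs.har
hadoop fs -ls har:///data/archives/logs.har/small_files
```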
Moving further, another interesting and easy question: suppose a file of size 514 MB is stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. How many blocks will be created in total, and what will their sizes be? Before you answer, what is the default replication factor and what is the default block size in Hadoop 2.x? Now do the calculation and give me the answer. Very good, 15 blocks. A lot of people said five blocks, but don't you think there will also be replication of every block? And the people answering four blocks are wrong too, because there will be a 2 MB block as well: 128 times 4 is 512, so you get four blocks of 128 MB plus one block of 2 MB, which is five blocks, and with the default replication factor of three it is 5 times 3, 15 blocks in total. That is how you calculate it; a very famous interview question. Next: how do you copy a file into HDFS with a block size different from the existing block size configuration? What I am asking is: say your default block size is 128 MB, but while copying the data with hdfs dfs -put you want to use a 32 MB block size instead of the default; what do you do? There is a parameter for that: dfs.blocksize. You just specify the number of bytes you want, so 32 MB is equivalent to 33554432 bytes, and while running the command, whether -put or -copyFromLocal, you pass dfs.blocksize with that value. If you then want to check the block size, there is another command, hadoop fs -stat, which shows you the statistics of the stored file, how it is laid out and so on; concrete commands are shown below. This is sometimes useful in projects, which is why it is a good interview question: in a project you may not want the default size for a particular data set, and rather than changing the cluster-wide configuration files, which is not a good idea, you handle it per command with dfs.blocksize. And note that it is not block_size with an underscore, which was the mistake someone made in the chat; it should be dfs.blocksize.
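A minimal sketch of overriding the block size at copy time and checking the result; the file names are placeholders, and the byte value shown assumes 32 MB.

```bash
# 32 MB = 32 * 1024 * 1024 = 33554432 bytes
hdfs dfs -D dfs.blocksize=33554432 -put bigfile.csv /data/custom_blocks/

# Check block size (%o), replication (%r) and length in bytes (%b)
hadoop fs -stat "blocksize=%o replication=%r bytes=%b" /data/custom_blocks/bigfile.csv

# fsck also shows the individual blocks and where their replicas live
hdfs fsck /data/custom_blocks/bigfile.csv -files -blocks
```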
Now, what is a block scanner in HDFS? This is typically asked in Hadoop administration interviews, so people who have done the administration course should be able to answer, but anyone can try. Someone says it checks whether the block has any empty space left; I can take that partially, but it is not the exact answer. Someone else says it scans the blocks and reports the remaining space; again only partially right. The block scanner actually ensures the integrity of your data blocks: it runs on the data nodes, keeps checking the blocks, and if a block has become corrupted, or the replica count has fallen low, it reports that back so the name node can rectify it. That is why administrators care about it: they are the ones monitoring the health of the data nodes, the data blocks and the name node. There is also another way to keep replication in order: the Hadoop balancer, which looks at how blocks are spread across the data nodes and evens things out, and commands like fsck will report blocks that are under-replicated or over-replicated so they can be fixed; a short sketch follows below. Next question: can multiple clients write into an HDFS file concurrently? Interesting, I am getting a mix of yes and no answers. Think about it: I am not talking about reading, I am talking about writing. Don't you think it would make the file inconsistent if we allowed that? It would, so it is not allowed: HDFS allows a single writer and multiple readers. While one client is writing, the file is effectively locked for other writers; only after that client has finished can another client write, but everybody can read concurrently. That is very important in HDFS: concurrent writes to the same file are not possible, concurrent reads are. It was designed this way because this is a distributed system, and if multiple clients wrote to the same file at the same time, somebody could overwrite my changes, so it is simply not allowed.
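For reference, here is a minimal sketch of the admin-side commands touched on above; the threshold and bandwidth values are just examples.

```bash
# Report blocks that are corrupt, missing, or under-/over-replicated
hdfs fsck / | grep -iE 'under.?replicated|corrupt|missing'

# Rebalance data across data nodes until disk usage is within 10% of the average
hdfs balancer -threshold 10

# Optionally cap the bandwidth the balancer may use (bytes per second)
hdfs dfsadmin -setBalancerBandwidth 10485760
```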
Another question: what do you mean by high availability of the name node, and how is it achieved? I already hinted at this in the boss example. Let me show you the slide: there are two name nodes, one active and one passive (standby). The data nodes report to the active name node, but they also send the same reports to the passive name node; the passive name node does not serve anything itself, it just keeps receiving that information, so it knows the status of the data, where the blocks are being written, everything. Now if the active machine suddenly goes down, the passive name node immediately starts acting in its place, and that is how high availability is ensured: there is no downtime, the passive name node becomes the active one straight away. This is a very famous interview question. A good follow-up from the chat: what is the difference between a passive name node and a secondary name node? The secondary name node is what we used in Hadoop 1.x; it is just a snapshot of the name node, it only copies the metadata to another machine, and if the main name node goes down the secondary name node does not automatically take over: you have to manually bring a name node up, copy the data from the secondary to the primary, and then start working again, so human intervention is required and there will be downtime. With the active and passive pair, the passive name node is continuously collecting the metadata and takes over as soon as the active one is down, so the difference should be very clear now; let us move further.
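On an HA-enabled cluster you can see and exercise this failover from the command line; a minimal sketch, where nn1 and nn2 stand for whatever logical name node IDs your hdfs-site.xml defines (placeholders here).

```bash
# Check which name node is currently active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually fail over from nn1 to nn2 (for example before planned maintenance)
hdfs haadmin -failover nn1 nn2
```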
We have a few more questions, now on MapReduce, and after that Hive and Sqoop, so let us take a short water break of about seven to eight minutes; everybody please be back on time. Okay, everybody is back now, so let us start with the MapReduce topic; give me a second to share my screen. Meanwhile, I have checked with the Edureka team and they have confirmed that you will all receive this video recording on your email ID in a day or two, since a lot of people asked about that, so you can refer to it at any time. Now, a question for you: can you explain the process of spilling in MapReduce? This is an interesting question; think about where the map task keeps its output, and from that you can work out what spilling is. Someone says the map output is spilled to a temp folder on the local file system; very good. What happens is that the output of your map task goes into an in-memory buffer of a fixed size, say 100 MB, with a threshold on it; the buffer slowly keeps filling up, and as soon as it reaches the threshold, say 80 percent of that 100 MB, the contents start being written to the local disk. Notice I am not saying HDFS, I am saying the local disk of the node where the mapper runs. So as soon as the mapper output in memory reaches the threshold limit, it starts spilling that data to local disk, and that is called spilling in MapReduce; the buffer and threshold are configurable, as sketched below. This question shows whether you understand the internal working of MapReduce: once the buffer has spilled, it empties out and can accept more data. Which leads me to another question: can you explain the difference between blocks, input splits and records? This is a very famous interview question; if you have done the MapReduce part you must know it. Very good: Jyoti says a record is a single line of data; someone says a block is a hard cut of the data at exactly 128 MB; the block is based on the block size, while the input split makes sure a line is not broken; so a record is a single line, the block is the physical division in HDFS, and the input split is the logical split. Very good.
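The buffer size and spill threshold mentioned above map to standard MapReduce properties; here is a minimal sketch of overriding them for a single job, using the stock word count example that ships with Hadoop (the jar path may differ on your installation, and the input/output paths are placeholders).

```bash
# Run word count with a 200 MB sort buffer and an 85% spill threshold
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount \
  -D mapreduce.task.io.sort.mb=200 \
  -D mapreduce.map.sort.spill.percent=0.85 \
  /data/input /data/wordcount_output
```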
Which leads me to another question: can you explain the difference between blocks, input splits and records? This is a very famous interview question — can anybody answer? If you have done the MapReduce part, you must know this one. Very good — Jyoti is answering that a record is a single line of data; very good. Kunal is saying a block is a hard cut of the data at 128 MB; very good. And the input split makes sure that a line is not broken — so a record is a single line, a block is 128 MB on HDFS, and the input split is the logical split. Very good. So what usually happens is this: when we talk about a block, the default block size is 128 MB, and that is the physical division of the data — the physical block. When we talk about an input split, suppose your data is 150 MB; it makes sense to have a logical division that respects record boundaries, so a split may end up slightly bigger or smaller than a block in order not to break a line — that is your input split. And a record comes in when you do MapReduce programming: how does your mapper take the data? It takes it line by line, and that one line of data it picks up is called a record. So blocks are the physical divisions, input splits are the logical divisions, and a record is a single line — the logical division is what your MapReduce program actually works on. A very, very famous question. Which brings me to another question, again related to MapReduce: what is the role of the RecordReader in Hadoop MapReduce? "It makes sure a complete line is read" — good, that is how the mapper reads, but can you be a little more explicit? You are coming close. What about the others? We have just said that a record is a single line — so what does the RecordReader do with that single line? "It parses that single line" — very good. Think about how the mapper receives its data: it receives key-value pairs, right? The mapper takes its input as key-value pairs, and that conversion is done by the RecordReader. Look at this: the data gets converted to key-value pairs where the key is the byte offset and the value is the first line, or the second line, or the third line. That conversion is the job of the RecordReader. Now, what is the significance of counters in MapReduce? "They give statistics of the data", "counters are used to validate the data", "to count the bad records" — good, but not just bad records; it can do other things too, that is just one example. A counter helps you gather statistics about some operation you are performing, and you can print the values to the console as well. If you have done the Hadoop course you must have seen some code using counters, where you perform an operation and print the result in your console window. Let's take the bad-record example, since somebody just mentioned it: say we want to find out how much bad data is in this input. It reads the first line — all good, the counter stays at zero. It reads the second line — this one is bad data, so I increment the counter and it becomes one. It reads "Jeff" — the value stays at one, because this is a good record. It reads "Sean" — again the value remains the same. And as soon as it reads the last line of the data, which is also bad, it increments the counter again and reports that I have two bad lines in the input.
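To make the counter idea concrete, here is a minimal, hypothetical mapper sketch that counts bad records the way the walkthrough above describes. The class name, enum and field layout are made up for illustration; note also that the key passed to the mapper is the byte offset produced by the RecordReader, as just discussed.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: counts "bad" records with a custom counter while emitting the good ones.
public class BadRecordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Custom counter group; the enum shows up in the job's counter output.
    public enum DataQuality { BAD_RECORDS }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Key = byte offset supplied by the RecordReader, value = one line (one record).
        String[] fields = line.toString().split(",");
        if (fields.length != 3) {   // "bad" here simply means an unexpected number of fields
            context.getCounter(DataQuality.BAD_RECORDS).increment(1);
            return;                 // skip the bad record
        }
        context.write(new Text(fields[0]), new LongWritable(1));
    }
}
```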
But does that mean counters can only be used for this kind of check? No. Maybe I have data that contains timestamps. In my program I can convert each timestamp to a date type, and once it is a date I have the month available. Now maybe I want to calculate, across all these timestamps, how many times January occurs, how many times February occurs, how many times March occurs, and so on — I want all of those statistics. I can do that very easily with counters. So that is the purpose of counters. And how do we access them? There are classes available for this — Counters is the class, with predefined member functions — and using those you can read the values programmatically. Now, moving further: why is the output of the map phase spilled to the local disk and not to HDFS? Good question — remember, we just discussed spilling. "Because it is an intermediate output" — that's fine, but why are we not keeping that intermediate output in HDFS? That's my question. Very good, Ramya; very good, Narasimha. If you kept it in HDFS, remember there is a replication factor, and that replication would also apply to the mapper output, which we don't want — it would just multiply the number of blocks for data that is only temporary. Does it make sense to replicate your mapper output? Definitely not. So we keep it on the local disk and not in HDFS, precisely to avoid replicating intermediate map output. Which brings me to another question: can you define speculative execution? "It prioritises only some tasks" — coming close, but not exactly. "If a task on a node is taking too much time..." — very good, very good, people. This is what happens in speculative execution: if any of your tasks is running very slowly, the scheduler starts a duplicate of that task on another node. Whichever attempt finishes first wins, and the framework kills the remaining duplicate attempts. It is making sure that a straggler — maybe your task got blocked waiting for some resource, for whatever reason — does not hold up the whole job; a duplicate attempt is launched so that the job still finishes on time. That is speculative execution.
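For reference, speculative execution is normally controlled by these standard per-task-type properties, shown here with their usual defaults; this is just a sketch of where you would switch it on or off, not a tuning recommendation.

```xml
<!-- mapred-site.xml (or per-job): speculative execution is typically on by default;
     these properties enable or disable it separately for map and reduce tasks. -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```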
Which brings me to another question: how will you prevent a file from being split, in case you want the whole file to be processed by the same mapper? I want my file to go to a single mapper — not combined with another file, just not split. Can I get some more answers? It's easy. There are two methods, as shown in the sketch after this answer. Method one: increase the minimum split size so that it becomes larger than your largest file; that setting alone is enough, because if the minimum split size exceeds the file size, the file will never be split. Method two: go to your input format class and override the isSplitable method to return false. Usually people prefer method one, because it is the easiest — you don't need to touch the Java code at all. Now, can you tell me a scenario where file splitting is not needed? It all depends. Say I know that I have only one DataNode. With a single DataNode, do you think it makes sense to divide a file into multiple splits? It is all on the same machine anyway, so you may as well keep it as one unit and process the whole file together. Those are the kinds of situations where you might decide not to split a file, and those are the two methods to achieve it.
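A minimal sketch of "method two" described above — an input format that refuses to split its files, so one mapper processes a whole file. The class name is hypothetical; method one would instead set mapreduce.input.fileinputformat.split.minsize (the standard property name, to the best of my knowledge) to a value larger than your largest file.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format: never split a file, so each file goes to exactly one mapper.
// Use it with job.setInputFormatClass(WholeFileTextInputFormat.class).
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // method two: whole file = one input split = one mapper
    }
}
```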
Moving further: is it legal to set the number of reducer tasks to zero — that's question number one — and where will the output be stored in that case? Is it legal? Definitely. You have all covered Sqoop in the course — was there any reducer in Sqoop? No, Sqoop only uses mappers, no reducers, and that is a whole tool in itself; if a tool doesn't even need a reducer, clearly I am allowed to run without one. Think about the purpose of the reducer: aggregation — maybe you want to sum things up, like in word count. But it is not mandatory that every problem statement requires aggregation. For problems where you do not need aggregation — like Sqoop, where you just copy data from an RDBMS to HDFS or vice versa, no aggregation at all — you can set the number of reducers to zero. And where is the output stored? Whatever the mappers produce is written directly to the HDFS location you specify. So yes, it is legal, and the mapper output goes straight to the output path. Next: what is the role of the ApplicationMaster in a MapReduce job? This is a very famous interview question. "To assign the tasks" — okay, but is that all? "It sets the input splits"; "it manages the application and keeps track of sub-processes"; "to get the resources needed for the task" — very good, now you are coming close; "to create tasks and internal threads" — yes. What the ApplicationMaster does, first of all, is decide how many resources the job needs; it then tells the ResourceManager, "I require this many resources, give me this many containers." The ResourceManager hands back containers — if you have gone through the YARN architecture you will have seen this flow. So the ApplicationMaster first works out how many resources are required; then, once it gets the containers, it coordinates them, gets them working together, collects the output and reports back. It is doing multiple things in a MapReduce job, not just one task — and a very important part of its role is helping the ResourceManager by telling it what the job needs. That is the role of the ApplicationMaster. Which brings me to another question: what do you mean by uber mode? When your MapReduce job runs — I'm not sure whether you have noticed it or not — something called uber mode sometimes appears in your console. Can you tell me what that is, and is there any advantage to enabling it? "It runs on the ApplicationMaster" — very good; can I get some more insight? Say you have a small job. Normally the ApplicationMaster would request containers and the ResourceManager would allocate them, but creating and shipping containers takes time. If the job is small enough, the ApplicationMaster can simply run it inside its own JVM; it may decide to complete the job on its own, and in that case we call it uber mode. Uber mode is for small jobs — say only a handful of mappers and a single reducer. There is a property you can set to true to enable it, and then your ApplicationMaster acts as the JVM and finishes the job itself, saving the container-creation overhead and improving performance. So whenever people have small jobs and want better performance, they usually enable uber mode so the ApplicationMaster executes the tasks itself. But for a bigger job it will not work — there you keep uber mode off, otherwise you will degrade performance, and the job would not fit anyway.
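For reference, a sketch of the properties generally used to enable uber mode; the max-maps/max-reduces values shown are, as far as I recall, the usual default thresholds a job must stay under to be run inside the ApplicationMaster's JVM — treat the exact numbers as assumptions to verify on your own cluster.

```xml
<!-- mapred-site.xml (or per-job): run small jobs inside the ApplicationMaster's JVM. -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>     <!-- job must have at most this many map tasks to be "uberized" -->
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>     <!-- and at most this many reduce tasks -->
</property>
```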
Another question: how will you enhance the performance of MapReduce jobs when dealing with too many small files? If you have many, many small files, how can you improve the performance of your MapReduce job? You are coming close, but not quite — since none of you gave exactly the right answer: there is something called CombineFileInputFormat. What it does is pack the small files together — see this diagram: these are small files, and it combines them so they are processed together. Because the small files get combined, the execution time comes down. This slide shows a practical comparison: the small files were taking this much time, and after combining them the job started taking less time, improving the performance of the system. That is how you improve performance when you have lots of small files. Now let's move to Hive — a very important topic. Question for you: where does the data of a Hive table get stored? The data is obviously going to be in HDFS, but what is the location, what is the default path? Now you are coming close: by default it is kept under /user/hive/warehouse — this is where all your Hive table data gets stored by default. If you want to change it, you can go to your hive-site.xml and update that setting as well. Another question: why is HDFS not used by the Hive metastore for storage? What I mean is, you might have read in your course that the Hive metastore lives in an RDBMS, not in HDFS. What is the reason? Why do we configure the metastore in an RDBMS and not in HDFS? I can tell you, this is one of the most important Hive questions — expect it in any Hive interview; almost everyone asks about the metastore. "Because it needs to use a JDBC connection" — no, that is not the main reason. Okay, let me ask you this, which is actually the main reason: when you create a table in Hive, what happens? Hive creates an entry for that table in the metastore — a row-level insert. You create another table, and again a row is inserted. Now say you delete a table in Hive — that is a row-level delete. Can you do these row-level inserts, updates and deletes in HDFS? No. That in itself is a good enough answer. The second factor is that lookups need to be fast, and an RDBMS gives you that low latency — but the first point is the key one: these CRUD operations simply cannot be done on HDFS, so the metastore cannot live there.
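A minimal hive-site.xml sketch of the two things just mentioned: the default warehouse directory and a metastore backed by an RDBMS. The MySQL URL, database name and driver are placeholders, assuming a MySQL-backed metastore.

```xml
<!-- hive-site.xml: warehouse location and an RDBMS-backed metastore (placeholders). -->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>   <!-- default location of Hive table data in HDFS -->
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```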
Now let's see some scenario questions — with Hive you will usually find scenario questions coming up. Can you all see this one? Suppose I have installed Apache Hive on top of my Hadoop cluster using the default metastore configuration: what will happen if multiple clients try to access Hive at the same time? Very good — with the default (embedded) metastore you can basically have only one connection at a time, so multiple client access is simply not allowed. Asmit has given the right answer: the main point is that, by design, the default metastore does not allow multiple clients, and that is what you need to keep in mind for this scenario. Next: what is the difference between an external table and a managed table? A managed table is also called an internal table. Can I get this answer? "An external table can be a file" — okay, but I want a proper difference. "With an external table the HDFS file won't be deleted if we delete the table" — yes, that is the difference. "An external table is stored in a separate location from the warehouse" — fine, but even a managed table can be stored at some other location by using the LOCATION keyword; the real point is that an external table keeps its data when it gets deleted — very good, that is the major factor. With a managed table, if you drop the table, Hive deletes the entry in your metastore and it also deletes the data files. With an external table, if you drop the table, Hive deletes only the entry in the metastore — it does not touch your actual data. That is the major difference between a managed table and an external table.
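A short HiveQL sketch of that difference, with hypothetical table names, columns and paths:

```sql
-- Managed (internal) table: DROP TABLE removes both the metastore entry and the data files.
CREATE TABLE txn_managed (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: DROP TABLE removes only the metastore entry; the files under LOCATION stay.
CREATE EXTERNAL TABLE txn_external (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/transactions';
```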
Another question: when should we use SORT BY instead of ORDER BY? Both of these belong to Hive and they appear to do exactly the same thing — so when should I use SORT BY and not ORDER BY? (To answer a question from the chat: I understand if you don't follow some of this — you told me at the start that you are only in the fifth module, so these topics haven't been covered for you yet; just listen to the discussion for now, and once you go through those modules you will be comfortable with it.) Now, can I get this answer? "In the case of numericals" — no, you can use ORDER BY for those too. "One of them uses a single reducer, the other uses many" — okay, now you are getting there. "When you use GROUP BY" — no, that's not it either. What actually happens is this: if you have a huge data set, you should use SORT BY, because SORT BY does the sorting across multiple reducers, while ORDER BY does it all in a single reducer. That is the major difference: ORDER BY gives you a total order through one reducer, which becomes slow on big data, so for huge data sets use SORT BY instead of ORDER BY. A lot of people stay confused about this, which is why it is a tricky question — someone who doesn't know the answer will tell you both do the same thing, but that's not correct; there is a difference. Now another question: what's the difference between a partition and a bucket in Hive? I think this is the easiest one for everybody who has done the Hive topic. Simple, right? A partition is the first level of division — you split the data into different directories. A bucket is a sub-division of a partition: within a partition you divide the data further, and those pieces are your buckets. Like in this example: the first-level partitions are the CSE department, the civil department and the electrical department, and within each of those we have created sub-divisions — those are your buckets. That is the difference. Another question — here is the scenario: you have created a transaction table with these columns, fields delimited by comma, and you have inserted 50,000 tuples into it. Now I want to know the total revenue generated for each month, but Hive is taking too much time to process the query. What solution would you propose? This is actually a very good interview scenario. Very good — can I get more answers? You will partition this table — and how will you partition it? You will partition the table by month. If you partition the table this way, you improve the performance. The simple steps, shown in the sketch after this answer, are: create a table partitioned by month, set the required properties to true so that the partitioned insert is enabled, insert the data, and then query it with the month in the WHERE clause — say month = 'January'. So by partitioning the table you improve the performance. Secondly, can I get an answer to this: what is dynamic partitioning and when is it used? There is static partitioning as well, so I want to know specifically what dynamic partitioning is. Very good: in dynamic partitioning, the partitions are created while loading the data into the table — I don't know in advance how many partitions there will be; the values of the partition column become known only at run time, when the partitions are actually created. That is dynamic partitioning.
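A sketch of the partitioning scenario above in HiveQL; the table and column names are hypothetical, and the two SET statements are the usual switches for dynamic partitioning.

```sql
-- Enable dynamic partitioning (partition values discovered at run time).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hypothetical transaction table partitioned by month.
CREATE TABLE transaction_part (cust_id INT, amount DOUBLE)
PARTITIONED BY (txn_month STRING);

-- The partition column goes last in the SELECT; one directory is created per month.
INSERT OVERWRITE TABLE transaction_part PARTITION (txn_month)
SELECT cust_id, amount, txn_month FROM transaction_raw;

-- Monthly revenue now touches only the relevant partition instead of the whole table.
SELECT SUM(amount) FROM transaction_part WHERE txn_month = 'January';
```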
Next: how does Hive distribute rows into buckets? Can I get this answer — how does Hive decide which bucket a row goes to? Very good: it uses a hash function. When you use CLUSTERED BY, internally Hive computes a hash of the bucketing column's value and then takes that hash modulo the number of buckets. Say you want two buckets: if the hash of a value modulo two comes out to one, the row goes into the first bucket; if it comes out to zero, it goes into the other bucket. So the hash values of the different column values go through this modulo operation, and that result decides how the rows are distributed. Which brings another question: suppose I have a CSV file named sample.csv present in the /temp directory with the following entries — how will you load this CSV file into the Hive warehouse using a built-in SerDe? SerDe stands for serializer/deserializer — it is what converts your data to and from bytes. Can I get this answer? "ROW FORMAT DELIMITED FIELDS TERMINATED BY comma" — not exactly, I'm looking for something else; this scenario asks for the SerDe. Let me show you: in this case you say ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' — that is what you need to add; everything else stays the same, you can still load the file from the /temp folder and so on, the only difference is in the ROW FORMAT SERDE clause. Another question: I have a lot of small CSV files present in an input directory in HDFS and I want to create a single Hive table corresponding to these files, with the data in this format. As we know, Hadoop's performance degrades with lots of small files — so how will you solve this problem? Can anybody give me a simple answer? "Concatenate the files" — that can be one solution, but that is at the HDFS level; I am asking from the Hive perspective. Don't you think you can use a sequence file here? A sequence file stores everything in a binary, serialized form, and that is what makes it better and improves performance. So first create a staging table and load the data into it; after that, create a table stored as SEQUENCEFILE and insert the data from the staging table into it — this ensures your queries will run faster. Why do we need this serialization? Two advantages: first, compression — serializing the data compresses it; and second, because the data is now in a compact binary format, transferring it over the network becomes much faster. But remember, there is no free lunch: this also has a disadvantage, because when you deserialize you need to convert it back, and that costs you a bit of performance. So it helps you with compression and transfer, but at the same time it takes back some performance on the way out.
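Putting the two Hive scenarios above together, a hypothetical sketch — the table names and columns are made up, and the SerDe class is the built-in OpenCSVSerde named in the answer.

```sql
-- CSV file loaded through the built-in CSV SerDe (table/columns are hypothetical).
CREATE TABLE sample_csv (id STRING, first_name STRING, last_name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;

LOAD DATA INPATH '/temp/sample.csv' INTO TABLE sample_csv;

-- Small-files pattern: rewrite the staged text data into a single sequence-file table.
CREATE TABLE sample_seq (id STRING, first_name STRING, last_name STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE sample_seq SELECT * FROM sample_csv;
```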
Now some quick Pig questions. Can you give me this answer: what is the difference between a logical plan and a physical plan? It should be simple. I know, guys, you are getting a little tired because this is a long session, but don't worry, we are almost done — I want everybody's attention back now. This must have been one of the first things you learned when you reached the Pig topic. When you execute Pig statements one by one, Pig is not actually running anything yet — it first builds a logical plan. As long as there is no syntax error, it just keeps extending that logical plan. Only when you call DUMP (or STORE) does execution actually start — because of lazy evaluation — and at that point the logical plan is converted into a physical plan, meaning the script starts executing as MapReduce jobs. So if you have given a wrong file path, the logical plan will not give you any error, because there is no syntax problem; only at the time of the physical plan, when the script is converted into MapReduce jobs, will the error appear. The physical plan is when your script actually executes as MapReduce; the logical plan is the initial level. Now, can you tell me what a bag is? Very good: a collection of tuples. If you look at a data file — say this is the data — then this is one tuple, this is a second tuple, this is a third tuple, and the collection of all of these is called a bag. Next: Hive can deal only with structured data, but Pig can also deal with unstructured data — how is Pig able to do that? It is because Pig can work schema-less. In Hive you have column names and types — say a column "age" of type int — and you need to define them. In Pig there is nothing like that: if you do not know the column names and there is no schema defined, you can still refer to the first column as $0, the second column as $1, and so on. If a value is missing, Pig treats it as null, and if you do not define a data type, it treats the field as a bytearray. So Pig treats the data flexibly, and that is one of the major reasons it can handle unstructured or semi-structured data while Hive cannot: even without a schema you can address fields as $0, $1 and so on, as in the sketch that follows.
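A small Pig Latin sketch tying these answers together — schema-less load, positional $0/$1 references, and lazy evaluation up to the DUMP; the input path is a placeholder.

```pig
-- Schema-less load: no column names or types, so fields default to bytearray
-- and are referenced positionally as $0, $1, ...
raw = LOAD '/data/logs/input.txt' USING PigStorage(',');

-- Nothing has executed yet: Pig has only built a logical plan (lazy evaluation).
pairs = FOREACH raw GENERATE $0, $1;

-- DUMP (or STORE) turns the logical plan into a physical plan and runs the job;
-- a wrong input path would only surface as an error at this point.
DUMP pairs;
```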
Now, what are the different execution modes available in Pig? There are two: MapReduce mode and local mode. When you simply type pig, by default it takes you into MapReduce mode, which also means you are accessing HDFS. If you want local mode, you start it as pig -x local, and then Pig reads files from your local file system instead of HDFS. Those are the two execution modes. What about FLATTEN? This one is simple: FLATTEN is an operator available in Pig — if you have nested data like this, you can flatten it so that everything comes out of the bag into top-level fields; it just converts the data from this nested form into this flat form. These are simple questions. Now, can anybody explain these HBase components? This is the last topic, friends, so I want everybody to be attentive here. HBase keeps the data in a distributed manner, and it distributes the data across regions. A region is where a range of rows is kept — this will be your first region, holding these row and column values — and regions are grouped onto region servers, a bit like how you group machines into racks: this is one region server, this is another region server. Then there is the master, called the HMaster — the active master, similar to how the NameNode was in HDFS. And what is ZooKeeper doing here? ZooKeeper helps coordinate everything. Think of an actual zoo: they keep and manage many different categories of animals; similarly, big data has many different tools to keep in step. Inside HBase you do not have the NameNode and DataNode concepts, and there is no YARN either, so ZooKeeper plays the major role of coordinator inside your HBase environment — much like YARN coordinates things elsewhere — and it also maintains the directory structures.
Can anybody tell me what a Bloom filter is? Basically, it helps improve the overall read throughput of your cluster. If you want to look up a specific row or column cell, the Bloom filter lets HBase skip files that cannot contain it, which makes reads very fast. You just need to enable it, and once enabled it improves the throughput of your cluster — that is the role of the Bloom filter in HBase. Coming to the next question: what is the role of the JDBC driver in a Sqoop setup? Sqoop is an important interview topic — HBase, I would say, is asked about less often, but Sqoop questions come up a lot. Very good: you want to connect to an RDBMS, and that RDBMS can be of any type — MySQL, MS SQL, Oracle, DB2 — and the JDBC driver is what lets Sqoop create a connection to whichever RDBMS you are using. This is a very famous question: what's the difference between --target-dir and --warehouse-dir? Both of them put the imported data into an HDFS location, so what's the difference? "--warehouse-dir is for importing all tables" — that is one use case, but I want the proper distinction. With --target-dir you give the exact directory path where the data will be kept. So say your MySQL database has a table called accounts and you want to import it into HDFS using Sqoop: with --target-dir, you specify the HDFS directory name yourself, so you could even change the name from accounts to, say, accounts1. With --warehouse-dir, Sqoop instead creates a sub-directory under it with exactly the same name as the table in your RDBMS — no change to the name. That is the major difference: --warehouse-dir keeps the table's own name under a parent directory, while --target-dir lets you choose the exact (and possibly different) directory name. Now, read this query and tell me what it is doing. "An incremental import"? No. "Importing the employees table" — yes, but can you see the --where clause? It is filtering: it imports only the rows of the employees table where the start date is greater than this value — that is what this command does. Another scenario: in a Sqoop import command you have asked for eight parallel map tasks, but Sqoop is only running four — what can be the reason? Very good: maybe your cluster simply does not have enough cores or slots to run eight parallel map tasks. If fewer cores are available, Sqoop will not be able to scale up to eight mappers and will run fewer of them. So if your resources are limited, this is bound to happen.
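A hypothetical Sqoop import sketch illustrating the options discussed above; the connection string, credentials, table and paths are placeholders.

```bash
# --target-dir picks the exact HDFS directory (the name can differ from the table name);
# --where pushes the filter down to the database; -m asks for 8 parallel mappers,
# though fewer may actually run if the cluster cannot supply that many.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/company \
  --username retail_user -P \
  --table employees \
  --where "start_date > '2012-11-09'" \
  --target-dir /user/edureka/employees_new \
  -m 8

# With --warehouse-dir instead, Sqoop creates a sub-directory named after the table:
#   --warehouse-dir /user/edureka/warehouse   ->   /user/edureka/warehouse/employees
```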
Give me the Sqoop command to show all the databases in a MySQL server. It should be simple: it is sqoop list-databases, not "show databases" — sqoop list-databases with the --connect option pointing at the MySQL server, plus the credentials. That's it; it will list all the databases. Okay, so I hope these sessions will be very useful for all of you. Thank you everyone for making it interactive and amazing. I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist and subscribe to the edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 24,587
Keywords: yt:cc=on, big data full course, big data and hadoop full course, hadoop full course, big data hadoop tutorial for beginners, big data tutorial for beginners, big data hadoop, big data and hadoop, hadoop tutorial for beginners, big data tutorial, hadoop tutorial, learn big data, bid data hadoop, big data, hadoop, big data hadoop tutorial, what is hadoop, hadoop training, big data training, big data analytics, hadoop ecosystem, edureka, big data edureka, Hadoop edureka
Id: 9QxZhapbo0o
Length: 700min 50sec (42050 seconds)
Published: Wed Jan 18 2023