Big Data Full Course 2022 | Big Data Tutorial For Beginners | Big Data Step By Step | Simplilearn

Captions
big data is a term i'm sure you're all familiar with big data as a technology has grown massively over a decade and a half when internet users boomed and companies started generating vast amounts of data it became more popular with the advent of ai machine learning mobile technology and the internet of things big data analytics helps companies from different sectors such as automobile manufacturing e-commerce and logistics to manage processes streamline use data sets in real time as well as improve their organization's decision-making capability big data and analytics can enable organizations to get a better understanding of their customers and help to narrow down their targeted audience thus helping to improve companies marketing campaigns now there is no doubt that the big data market is steadily growing and as per fortune business insights the global big data market is expected to grow from 231.43 billion dollars in 2021 to 549.73 billion us dollars in 2028 at a cagr of 13.2 percent as per monster.com annual trends report big data analytics is most likely going to be the top in demand skill for 2022. at least 96 percent of companies are definitely planning or likely to plan to hire new staff with relevant skills to fill future big data analytics related roles in 2022 this indicates that the career opportunities and big data are very high and the scope is really good now if you are looking for a career in big data in 2022 on the screen you can see some of the top big data companies for 2022. so we have oracle ecommerce leader amazon hpe tech giant ibm and salesforce now let's look at the agenda for today's video on big data full course 2022 so we will start off by learning how to become a big data engineer and then we will see the crucial skills for big data next we will understand big data analytics and look at the top applications in big data after that we are going to learn about big data tutorial and understand the most popular big data framework that is hadoop next we are going to look at the different tools that are part of the hadoop ecosystem then we are going to learn about apache spark and understand the spark architecture finally we are going to close our full course video session by learning the top hadoop interview questions so let's begin today i'm going to tell you how you can become a big data engineer now it should come as no surprise for you guys to know that organizations generate as well as use a whole lot of data this vast volume of data is called big data companies use big data to draw meaningful insights and take business decisions and big data engineers are the people who can make sense out of this enormous amount of data now let's find out who is a big data engineer a big data engineer is a professional who develops maintains tests and evaluates the company's big data infrastructure in other words develop big data solutions based on a company's requirements they maintain these solutions they test out these solutions as to the company's requirements they integrate this solution with the various tools and systems of the organization and finally they evaluate how well the solution is working to fulfill the company's requirements next up let's have a look at the responsibilities of a big data engineer now they need to be able to design implement verify and maintain software systems now for the process of ingesting data as well as processing it they need to be able to build highly scalable as well as robust systems they need to be able to extract data from one 
database, transform it, as well as load it into another data store with the process of ETL, or the extract, transform, load process. And they need to research as well as propose new ways to acquire data, improve the overall data quality, and the efficiency of the system. Now, to ensure that all the business requirements are met, they need to build a suitable data architecture. They need to be able to integrate several programming languages and tools together so that they can generate a structured solution. They need to build models that reduce the overall complexity and increase the efficiency of the whole system by mining data from various sources. And finally, they need to work well with other teams, ones that include data architects, data analysts, and data scientists.

Next up, let's have a look at the skills required to become a big data engineer. The first step is to have programming knowledge. One of the most important skills required to become a big data engineer is that of experience with programming languages, especially hands-on experience. Now, the big data solutions that organizations would want you to create will not be possible without experience in programming languages. I can even tell you an easy way through which you can get experience with programming languages: practice, practice, and more practice. Some of the most commonly used programming languages for big data engineering are Python, Java, and C++. The second skill that you require is to have in-depth knowledge about DBMS and SQL. You need to know how data is maintained as well as managed in a database. You need to know how SQL can be used to transform as well as perform actions on a database, and by extension, know how to write SQL queries for any relational database management system. Some of the commonly used database management systems for big data engineering are MySQL, Oracle Database, and Microsoft SQL Server. The third skill that you require is to have experience working with ETL and warehousing tools. Now, you need to know how to construct as well as use a data warehouse so that you can perform the ETL operation, or the extract, transform, and load operations. As a big data engineer, you'll be constantly tasked with extracting unstructured data from a number of different sources, transforming it into meaningful information, and loading it into other data storages, databases, or data warehouses. What this is, is basically aggregating unstructured data from multiple sources and analyzing it so that you can take better business decisions. Some of the tools used for this purpose are Talend, IBM DataStage, Pentaho, and Informatica. Next up, we have the fourth skill that you require, which is knowledge about operating systems. Now, since most big data tools have unique demands, such as root access to operating system functionality as well as hardware, having a strong understanding of operating systems like Linux and Unix is absolutely mandatory. Some of the operating systems used by big data engineers are Unix, Linux, and Solaris. Now, the fifth skill that you require is to have experience with Hadoop-based analytics. Since Hadoop is one of the most commonly used tools when it comes to big data engineering, it's understood that you need to have experience with Apache Hadoop-based technologies, technologies like HDFS, Hadoop MapReduce, Apache HBase, Hive, and Pig. The sixth skill that you require is to have worked with real-time processing frameworks like Apache Spark. Now, as a big data engineer, you'll have to deal with vast volumes of data, so for this data you need an
analytics engine like spark which can be used for large-scale real-time data processing now spark can process live streaming data from a number of different sources like facebook instagram twitter and so on it can also perform interactive analysis and data integration and now we're at our final skill requirement which is to have experience with data mining and modeling so as a data engineer you'll have to examine massive pre-existing data so that you can discover patterns as well as new information with this you will create predictive models for your business so that you can make better informed decisions some of the tools used for this are r rapidminer weka and now let's talk about a big data engineer's salary as well as other roles they can look forward to now the average salary of a big data engineer in the united states is approximately ninety thousand dollars per year now this ranges from sixty six thousand dollars all the way to hundred and thirty thousand dollars per annum in india the average salary is around seven lakh rupees and ranges from four lakhs to 13 lakhs per annum after you become a big data engineer some of the job roles that you can look forward to are that of a senior big data engineer a business intelligence architect and a data architect now let's talk about certifications that a big data engineer can opt for first off we have the cloudera ccp data engineer a cloudera certified professional data engineer possesses the skills to develop reliable and scalable data pipelines that result in optimized data sets for a variety of workloads it is one of the industry's most demanding performance based certification ccp evaluates and recognizes a candidate's mastery of the technical skills most sought after by employers the time limit for this exam is 240 minutes and it costs 400 next we have the ibm certified data architect big data certification an ibm certified big data architect understands the complexity of data and can design systems and models to handle different data variety including structured semi-structured unstructured volume velocity veracity and so on a big data architect is also able to effectively address information governance and security challenges associated with the system this exam is 75 minutes long and finally we have the google cloud certified data engineer a google certified data engineer enables data driven decision making by collecting transforming and publishing data they should also be able to leverage deploy and continuously train pre-existing machine learning models the length of the certification exam is 2 hours and its registration fee is 200 now let's have a look at how simply learn can help you become a big data engineer simply learn provides the big data architect masters program this includes a number of different courses like big data hadoop and spark developer apache spark and scala mongodb developer and administrator big data and hadoop administrator and so much more this course goes through 50 plus in-demand skills and tools 12 plus real life projects and the possibility of an annual average salary of 19 to 26 lakh rupees per annum it will also help you get noticed by the top hiring companies this course will also go through some major tools like kafka apache spark flume edge base mongodb hive pig mapreduce java scala and much more now why don't you head over to simplylearn.com and get started on your journey to get certified and get ahead we all use smartphones but have you ever wondered how much data it generates in the form of 
texts, phone calls, emails, photos, videos, searches, and music. Approximately 40 exabytes of data gets generated every month by a single smartphone user. Now imagine this number multiplied by 5 billion smartphone users. That's a lot for our mind to even process, isn't it? In fact, this amount of data is quite a lot for traditional computing systems to handle, and this massive amount of data is what we term as big data. Let's have a look at the data generated per minute on the internet: 2.1 million snaps are shared on Snapchat, 3.8 million search queries are made on Google, one million people log on to Facebook, 4.5 million videos are watched on YouTube, and 188 million emails are sent. That's a lot of data. So how do you classify any data as big data? This is possible with the concept of the five V's: volume, velocity, variety, veracity, and value. Let us understand this with an example from the healthcare industry. Hospitals and clinics across the world generate massive volumes of data: 2,314 exabytes of data are collected annually in the form of patient records and test results. All this data is generated at a very high speed, which attributes to the velocity of big data. Variety refers to the various data types, such as structured, semi-structured, and unstructured data. Examples include Excel records, log files, and X-ray images. Accuracy and trustworthiness of the generated data is termed as veracity. Analyzing all this data will benefit the medical sector by enabling faster disease detection, better treatment, and reduced cost. This is known as the value of big data.

But how do we store and process this big data? To do this job, we have various frameworks such as Cassandra, Hadoop, and Spark. Let us take Hadoop as an example and see how Hadoop stores and processes big data. Hadoop uses a distributed file system known as the Hadoop Distributed File System to store big data. If you have a huge file, your file will be broken down into smaller chunks and stored in various machines. Not only that, when you break the file, you also make copies of it, which go into different nodes. This way, you store your big data in a distributed way and make sure that even if one machine fails, your data is safe on another. The MapReduce technique is used to process big data. A lengthy task A is broken into smaller tasks B, C, and D. Now, instead of one machine, three machines take up each task and complete it in a parallel fashion, and the results are assembled at the end. Thanks to this, the processing becomes easy and fast. This is known as parallel processing.
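To make the MapReduce idea just described a little more concrete, here is a minimal sketch in Python that imitates the map, shuffle, and reduce phases on a word-count task. The chunk contents and function names are invented for illustration; a real Hadoop job would distribute these phases across many machines rather than run them in one script.

```python
# Minimal word-count sketch of the MapReduce idea: a big task is split into
# small map tasks, their outputs are grouped (shuffled), and reduce tasks
# combine the partial results. The input chunks below are made up.
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle/sort: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    # Pretend these chunks live on different nodes and are mapped in parallel.
    chunks = ["big data is big", "data is processed in parallel"]
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    print(reduce_phase(shuffle(mapped)))  # e.g. {'big': 2, 'data': 2, 'is': 2, ...}
```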
Now that we have stored and processed our big data, we can analyze this data for numerous applications. In games like Halo 3 and Call of Duty, designers analyze user data to understand at which stage most of the users pause, restart, or quit playing. This insight can help them rework the storyline of the game and improve the user experience, which in turn reduces the customer churn rate. Big data also helped with disaster management during Hurricane Sandy in 2012. It was used to gain a better understanding of the storm's effect on the east coast of the US, and necessary measures were taken. It could predict the hurricane's landfall five days in advance, which wasn't possible earlier. These are some of the clear indications of how valuable big data can be once it is accurately processed and analyzed.

Back in the early 2000s, there was relatively less data generated, but with the rise of various social media platforms and multinational companies across the globe, the generation of data has increased by leaps and bounds. According to the IDC, the total volume of data is expected to reach 175 zettabytes in 2025. That's a lot of data. So we can define big data as a massive amount of data which cannot be stored, processed, and analyzed using the traditional ways. Let's now have a look at the challenges with respect to big data. The first challenge with big data is its storage. Storing big data is not easy, as the data generation is endless. In addition to this, storing unstructured data in our traditional databases is a great challenge. Unstructured data refers to data such as photographs and videos. The second challenge is processing big data. Data is only useful to us if it is processed and analyzed. Processing big data consumes a lot of time due to its size and structure. To overcome these challenges of big data, we have various frameworks such as Hadoop, Cassandra, and Spark. Let us have a quick look at Hadoop and Spark. So what is Hadoop? It is a framework that stores big data in a distributed way and processes it parallelly. How do you think Hadoop does this? Well, Hadoop uses a distributed file system known as the Hadoop Distributed File System to store big data. If you have a huge file, your file will be broken down into smaller chunks and stored in various machines. This is why it is termed as distributed storage. MapReduce is the processing unit of Hadoop. Here we have multiple machines working parallelly to process big data. Thanks to this technique, the processing becomes easy and fast. Let's now move on to Spark. Spark is a framework that is responsible for processing data both in batches and in real time. Spark is used to analyze data across various clusters of computers.
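Since Spark comes up repeatedly in this course, here is a small, hedged PySpark sketch of the kind of batch aggregation it is used for. It assumes the pyspark package is installed and that a CSV file named orders.csv with columns country and amount exists; both are illustrative assumptions, not files from the course.

```python
# Minimal PySpark batch-aggregation sketch.
# Assumptions: pyspark is installed, and "orders.csv" (order_id,country,amount)
# is an invented example file used only for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-aggregation-sketch").getOrCreate()

# Batch processing: read a CSV and aggregate it across the cluster.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

revenue_by_country = (orders
                      .groupBy("country")
                      .agg(F.sum("amount").alias("total_revenue"))
                      .orderBy(F.desc("total_revenue")))

revenue_by_country.show()  # prints the aggregated result to the console
spark.stop()
```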
Now that you have understood big data, Hadoop, and Spark, let's look into a few of the different job roles in this domain. Career opportunities in this field are limitless, as organizations are using big data to enhance their products, business decisions, and marketing effectiveness. So here we will understand in depth how to become a big data engineer. In addition to this, we will also look into the various skills required to become a Hadoop developer, Spark developer, and a big data architect. Starting off with the big data engineer role: do you know who a big data engineer is? Well, big data engineers are professionals who develop, maintain, test, and evaluate a company's big data infrastructure. They have several responsibilities. Firstly, they are responsible for designing and implementing software systems. They verify and maintain these systems. For the ingestion and processing of data, they build robust systems. Extract, transform, load operations, known as the ETL process, are carried out by big data engineers. They also research various new methods to obtain data and improve its quality. Big data engineers are also responsible for building data architectures that meet the business requirements. They provide a solution by integrating various tools and programming languages. In addition to the above responsibilities, their primary responsibility is to mine data from plenty of different sources to build efficient business models. Lastly, they also work closely with data architects, data scientists, and data analysts. Those are quite a lot of responsibilities.

Let us now have a look at the skills required to achieve these responsibilities. As you can see on your screens, we have listed the top seven skills needed to be possessed by a big data engineer. Starting off, the essential skill is programming. A big data engineer needs to have hands-on experience in any one of the programming languages, such as Java, C++, or Python. As a big data engineer, you should also have in-depth knowledge of DBMS and SQL. This is because you have to understand how data is managed and maintained in a database. Hence, you need to know how to write SQL queries for any RDBMS system. Some of the commonly used database management systems for big data engineering are MySQL, Oracle Database, and Microsoft SQL Server. As mentioned earlier, carrying out an ETL operation is a big data engineer's responsibility. Now, you need to know how to construct as well as use a data warehouse so that you can perform these ETL operations. As a big data engineer, you will be continuously tasked with extracting data from various sources, transforming it into meaningful information, and loading it into other data storages. Some of the tools used for this purpose are Talend, IBM DataStage, Pentaho, and Informatica.
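As a rough illustration of the extract, transform, load flow described above, here is a toy Python sketch that uses SQLite in place of a dedicated warehouse or tools like Talend or Informatica. The file raw_sales.csv, the table name, and the columns are assumptions made up for the example.

```python
# Toy ETL sketch: extract rows from a CSV, transform them, load them into a
# SQL table. "raw_sales.csv" and the schema are illustrative assumptions only.
import csv
import sqlite3

# Extract: read raw records from a source file.
with open("raw_sales.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))  # e.g. {"date": "2022-01-05", "amount": " 120.50 "}

# Transform: clean and reshape the data, dropping corrupt or empty records.
clean_rows = [
    (row["date"], float(row["amount"].strip()))
    for row in raw_rows
    if row.get("amount", "").strip()
]

# Load: write the cleaned records into a warehouse-style SQL table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
conn.commit()

# A simple SQL query of the kind a big data engineer writes against an RDBMS.
for sale_date, total in conn.execute(
        "SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date"):
    print(sale_date, total)
conn.close()
```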
Next up, we have the fourth skill that you require, which is knowledge about operating systems. Big data tools run on operating systems, hence a sound understanding of Unix, Linux, Windows, and Solaris is mandatory. The fifth skill that you require is to have experience with Hadoop-based analytics. Since Hadoop is one of the most commonly used big data engineering tools, it's understood that you need to have experience with Apache Hadoop-based technologies like HDFS, MapReduce, Apache Pig, Hive, and Apache HBase. The sixth skill that you require is to have worked with real-time processing frameworks like Apache Spark. As a big data engineer, you would deal with large volumes of data, so for this you need an analytics engine like Spark, which can be used for both batch and real-time processing. Spark can process live streaming data from several different sources like Facebook, Instagram, Twitter, and so on. And now we are at our final skill requirement, which is to have experience with data mining and data modeling. As a big data engineer, you would have to examine massive pre-existing data so that you can discover patterns and new information. This will help you create predictive models that will help you make various business decisions. Some of the tools used for this are R, RapidMiner, Weka, and KNIME. Now let's talk about a big data engineer's salary. In the US, a big data engineer's average annual salary is around 102,000 dollars per annum. In India, a big data engineer makes about seven lakh rupees per annum. That was all about our first role, big data engineer.

Moving on to our second role, we have Hadoop developer. As the name suggests, a Hadoop developer looks into the coding and programming of the various Hadoop applications. This job role is more or less similar to that of a software developer. Moving on to the skills required to become a Hadoop developer, first is knowledge of the Hadoop ecosystem. A Hadoop developer should have in-depth knowledge about the Hadoop ecosystem and its components, which include HBase, Pig, Sqoop, Flume, Oozie, etc. You should also have data modeling experience with OLAP and OLTP. Knowledge of SQL is required. Like a big data engineer, a Hadoop developer must also know popular tools like Pentaho, Informatica, and Talend. Finally, you should also be well versed in writing Pig Latin scripts and MapReduce jobs. The average annual salary of a Hadoop developer in the US is nearly 76,000 dollars per annum. In India, a Hadoop developer makes approximately 4 lakh 57 thousand rupees per annum. Moving to our third job role, we have Spark developer. We saw what Spark is; now let's understand what a Spark developer does. Spark developers create Spark jobs using Python or Scala for data aggregation and transformation. They also write analytics code and design data processing pipelines. A Spark developer needs to have the following skills: they need to know Spark and its components, such as Spark Core, Spark Streaming, the Spark machine learning library, etc. They should also know scripting languages like Python or Perl. Just like the previous job roles, a Spark developer should also have basic knowledge of SQL queries and database structure. In addition to the above, a Spark developer is also expected to have a fairly good understanding of Linux and its commands. Moving to the average annual salary of a Spark developer, it is nearly 81,000 dollars per annum in the US. In India, a Spark developer earns nearly 5 lakh 87 thousand rupees per annum. Let's now move on to our final job role, that is, big data architect. So let's understand who a big data architect is. Well, a big data architect is a professional who is responsible for designing and planning big data systems. They also manage the large-scale development and deployment of Hadoop applications. Moving on to the skills required to become a big data architect: first up, the individual must have advanced data mining and data analysis skills. Big data architects should also be able to implement and use a NoSQL database and cloud computing techniques. They must also have an idea of various big data technologies like Hadoop, MapReduce, HBase, Hive, and so on. Finally, big data architects must hold experience with Agile and Scrum frameworks. Let's now have a look at the average annual salary of a big data architect in the United States and India. In the US, a big data architect earns a whopping 118,000 dollars per annum. Meanwhile, in India, a big data architect makes nearly 19 lakh rupees per annum. These are huge numbers, right? So now that we have understood the job roles of a big data engineer, Hadoop developer, Spark developer, and a big data architect, let us have a look at the companies hiring these professionals. As you all can see on your screens, we have IBM, Amazon, American Express, Netflix, Microsoft, and Bosch, to name a few.

Let us now understand why big data analytics is required with an example. So, all of you listen to music online, isn't it? Here we will take an example of Spotify, which is a Swedish audio streaming platform, and see how big data analytics is used here. Spotify has nearly 96-plus million users, and all these users generate a tremendous amount of data: data like the songs which are played repeatedly, the numerous likes, shares, and the user search history. All this data can be termed as big data here with respect to Spotify. Have you ever wondered what Spotify does with all this big data? Well, Spotify analyzes this big data for suggesting songs to its users. I'm sure all of you might have come across the recommendation list which is made available to each one of you by Spotify. Each one of you will have a totally different recommendation list. This is based on your likes, your past history, like the songs you like listening to, and your playlists. This works on something known as the recommendation system. Recommendation systems are nothing but data filtering tools. They collect all the data and then filter it using various algorithms. This system has the ability to accurately predict what a user would like to listen to next with the help of big data analytics.
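To give a feel for the data-filtering idea behind such recommendation systems, here is a deliberately tiny Python sketch that suggests songs a listener has not heard yet, weighted by how similar other listeners' histories are. The listening data and the simple overlap measure are invented for illustration; production systems like Spotify's are far more sophisticated.

```python
# Toy recommendation sketch: suggest songs that listeners with similar tastes
# also played. The listening history below is invented purely for illustration.
def jaccard(a, b):
    """Similarity between two users = overlap of the sets of songs they played."""
    return len(a & b) / len(a | b) if a | b else 0.0

history = {
    "user_1": {"song_a", "song_b", "song_c"},
    "user_2": {"song_a", "song_b", "song_d"},
    "user_3": {"song_e", "song_f"},
}

def recommend(target, history, top_n=2):
    seen = history[target]
    scores = {}
    for other, songs in history.items():
        if other == target:
            continue
        sim = jaccard(seen, songs)
        for song in songs - seen:  # only score songs the target hasn't heard yet
            scores[song] = scores.get(song, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("user_1", history))  # e.g. ['song_d', ...]
```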
This prediction helps the users stay on the page for a longer time, and by doing so, Spotify engages all its users. The users don't have the need to go on searching for different songs. This is because Spotify readily provides them with a variety of songs according to their taste. Well, this is how big data analytics is used by Spotify. Now that we have understood why big data analytics is required, let us move on to our next topic, that is, what is big data analytics. But before that, there is another term we need to understand, that is, big data. We all hear the term big data many times, but do you know what exactly big data means? Well, we will understand the term big data clearly right now. Big data is a term for data sets that cannot be handled by traditional computers or tools due to their value, volume, velocity, and veracity. It is defined as a massive amount of data which cannot be stored, processed, and analyzed using various traditional methods. Do you know how much data is being generated every day, every second, even as I talk right now? There are millions of data sources which generate data at a very rapid rate. These data sources are present across the world. As you know, social media sites generate a lot of data. Let's take an example of Facebook. Facebook generates over 500-plus terabytes of data every day. This data is mainly generated in terms of your photographs, videos, messages, etc. Big data also contains data of different formats, like structured data, semi-structured data, and unstructured data. Data like your Excel sheets all falls under structured data; this data has a definite format. Your emails fall under semi-structured, and your pictures and videos all fall under unstructured data. All this data together makes up big data. It is very tough to store, process, and analyze big data using RDBMS. If you have looked into our previous videos, you would know that Hadoop is the solution to this. Hadoop is a framework that stores and processes big data. It stores big data using the distributed storage system, and it processes big data using the parallel processing method. Hence, storing and processing big data is no more a problem using Hadoop.

Big data in its raw form is of no use to us. We must try to derive meaningful information from it in order to benefit from this big data. Do you know Amazon uses big data to monitor the items that are in its fulfillment centers across the globe? How do you think Amazon does this? Well, it is done by analyzing big data, which is known as big data analytics. What is big data analytics? In simple terms, big data analytics is defined as the process which is used to extract meaningful information from big data. This information could be your hidden patterns, unknown correlations, market trends, and so on. By using big data analytics, there are many advantages. It can be used for better decision making, to prevent fraudulent activities, and many others. We will look into four of them step by step. I will first start off with big data analytics which is used for risk management. BDO, which is a Philippine banking company, uses big data analytics for risk management. Risk management is an important aspect for any organization, especially in the field of banking. Risk management analysis comprises a series of measures which are employed to prevent any sort of unauthorized activities. Identifying fraudulent activities was a main concern for BDO. It was difficult for the bank to identify the fraudster from a long list of suspects. BDO adopted big data analytics, which helped the bank to narrow down the entire list of suspects. Thus, the organization was able to identify the fraudster in a very short time. This is how big data analytics is used in the field of banking for risk management. Let us
now see how big data analytics is used for product development and innovations with an example all of you are aware of rolls-royce cars right do you also know that they manufacture jet engines which are used across the world what is more interesting is that they use big data analytics for developing and innovating this engine a new product is always developed by trial and error method big data analytics is used here to analyze if an engine design is good or bad it is also used to analyze if there can be more scope for improvement based on the previous models and on the future demands this way big data analytics is used in designing a product which is of higher quality using big data analytics the company can save a lot of time if the team is struggling to arrive at the right conclusion big data can be used to zero in on the right data which have to be studied and thus the time spent on the product development is less big data analytics helps in quicker and better decision making in organizations the process of selecting a course of action from various other alternatives is known as decision making lot of organizations take important decisions based on the data that they have data driven business decisions can make or break a company hence it is important to analyze all the possibilities thoroughly and quickly before making important decisions let us now try to understand this with an use case starbucks uses big data analytics for making important decisions they decide the location of their new outlet using big data analytics choosing the right location is an important factor for any organization the wrong location will not be able to attract the required amount of customers positioning of a new outlet a few miles here or there can always make a huge difference for an outlet especially a one like starbucks various factors are involved in choosing the right location for a new outlet for example parking adequacy has to be taken into consideration it would be inconvenient for people to go to a store which has no parking facility similarly the other factors that have to be considered are the visibility of the location the accessibility the economic factors the population of that particular location and also we would have to look into the competition in the vicinity all these factors have to be thoroughly analyzed before making a decision as to where the new outlet must be started without analyzing these factors it would be impossible for us to make a wise decision using big data analytics we can consider all these factors and analyze them quickly and thoroughly thus starbucks makes use of big data analytics to understand if their new location would be fruitful or not finally we will look into how big data analytics is used to improve customer experience using an example delta an american airline uses big data analytics to improve its customer experiences with the increase in global air travel it is necessary that an airline does everything they can in order to provide good service and experience to its customers delta airlines improves its customer experience by making use of big data analytics this is done by monitoring tweets which will give them an idea as to how their customers journey was if the airline comes across a negative tweet and if it is found to be the airline's fault the airline goes ahead and upgrades that particular customer's ticket when this happens the customer is able to trust the airlines and without a doubt the customer will choose delta for their next journey by doing so 
the customer is happy and the airlines will be able to build a good brand recognition thus we see here that by using analysis delta airlines was able to improve its customer experience moving on to our next topic that is life cycle of big data analytics here we will look into the various stages as to how data is analyzed from scratch the first stage is the business case evaluation stage here the motive behind the analysis is identified we need to understand why we are analyzing so that we know how to do it and what are the different parameters that have to be looked into once this is done it is clear for us and it becomes much easier for us to proceed with the rest after which we will look into the various data sources from where we can gather all the data which will be required for analysis once we get the required data we will have to see if the data that we received is fit for analysis or not not all the data that we receive will have meaningful information some of it will surely just be corrupt data to remove this corrupt data we will pass this entire data through a filtering stage in this stage all the corrupt data will be removed now we have the data minus the corrupt data do you think our data is now fit for analysis well it is not we still have to figure out which data will be compatible with the tool that we will be using for analysis if we find data which is incompatible we first extract it and then transform it to a compatible form depending on the tool that we use in the next stage all the data with the same fields will be integrated this is known as the data aggregation stage the next stage which is the analysis stage is a very important stage in the life cycle of big data analytics right here in this step the entire process of evaluating your data using various analytical and statistical tools to discover meaningful information is done like we have discussed before the entire process of deriving meaningful information from data which is known as analysis is done here in this stage the result of the data analysis stage is then graphically communicated using tools like tableau power bi click view this analysis result will then be made available to different business stakeholders for various decision making this was the entire life cycle of big data analytics we just saw how data is analyzed from scratch now we will move on to a very important topic that is the different types of big data analytics well we have four different types of big data analytics as you see here these are the types and below this are the questions that each type tries to answer we have descriptive analytics which asks the question what has happened then we have diagnostic analytics which asks why did it happen predictive analytics asking what will happen and prescriptive analytics which questions by asking what is the solution we will look into all these four one by one with an use case for each we will first start off with descriptive analytics as mentioned earlier descriptive analytics asks the question what has happened it can be defined as the type that summarizes past data into a form that can be understood by humans in this type we will look into the past data and arrive at various conclusions for example an organization can review its performance using descriptive analytics that is it analyzes its past data such as revenue over the years and arrives at a conclusion with the profit by looking at this graph we can understand if the company is running at a profit or not thus descriptive analytics 
helps us understand this easily we can simply say that descriptive analytics is used for creating various reports for companies and also for tabulating various social media metrics like facebook likes tweets etc now that we have seen what is descriptive analytics let us look into and use case of descriptive analytics the dow chemical company analyzed all its past data using descriptive analytics and by doing so they were able to identify the under utilized space in their facility descriptive analytics helped them in this space consolidation on the whole the company was able to save nearly 4 million dollars annually so we now see here that descriptive analytics not only helps us derive meaningful information from the past data but it can also help companies in cost reduction if it is used wisely let us now move on to our next type that is diagnostic analytics diagnostic analytics asks the question why a particular problem has occurred as you can see it will always ask the question why did it happen it will look into the root cause of a problem and try to understand why it has occurred diagnostic analytics makes use of various techniques such as data mining data discovery and drill down companies benefit from this type of analytics because it helps them look into the root cause of a problem by doing so the next time the same problem will not arise as the company already knows why it has happened and they will arrive at a particular solution for it initsoft's bi query tool is an example of diagnostic analytics we can use query tool for diagnostic analytics now that you know why diagnostic analytics is required and what diagnostic analytics is i will run you through an example which shows where diagnostic analytics can be used all of us shop on e-commerce sites right have you ever added items to your cart but ended up not buying it yes all of us might have done that at some point an organization tries to understand why its customers don't end up buying their products although it has been added to their carts and this understanding is done with the help of diagnostic analytics an e-commerce site wonders why they have made few online sales although they have had a very good marketing strategy there could have been various factors as to why this has happened factors like the shipping fee which was too high or the page that didn't load correctly not enough payment options available and so on all these factors are analyzed using diagnostic analytics and the company comes to a conclusion as to why this has happened thus we see here that the root cause is identified so that in future the same problem doesn't occur again let us now move on to the third type that is predictive analytics as the name suggests predictive analytics makes predictions of the future it analyzes the current and historical facts to make predictions about future it always asks the question what will happen next predictive analytics is used with the help of artificial intelligence machine learning data mining to analyze the data it can be used for predicting customer trends market trends customer behavior etc it solely works on probability it always tries to understand what can happen next with the help of all the past and current information a company like paypal which has 260 plus million accounts always has the need to ensure that their online fraudulent and unauthorized activities are brought down to nil fear of constant fraudulent activities have always been a major concern for paypal when a fraudulent activity occurs people 
lose trust in the company, and this brings in a very bad name for the brand. It is inevitable that fraudulent activities will happen in a company like PayPal, which is one of the largest online payment processors in the world. But PayPal uses analytics wisely here to prevent such fraudulent activities and to minimize them. It uses predictive analytics to do so. The organization is able to analyze past data, which includes a customer's historical payment data and a customer's behavior trend, and then it builds an algorithm which works on predicting what is likely to happen next with respect to their transaction. With the use of big data and algorithms, the system can gauge which of the transactions are valid and which could potentially be a fraudulent activity. By doing so, PayPal is always ready with the precautions that they have to take to protect all their clients against fraudulent transactions.

We will now move on to our last type, that is, prescriptive analytics. Prescriptive analytics, as the name suggests, always prescribes a solution to a particular problem. The problem can be something which is happening currently. Hence, it can be termed as the type that always asks the question, what is the solution? Prescriptive analytics is related to both predictive and descriptive analytics. As we saw earlier, descriptive analytics always asks the question what has happened, and predictive analytics helps you understand what can happen next. With the help of artificial intelligence and machine learning, prescriptive analytics helps you arrive at the best solution for a particular problem. Various business rules, algorithms, and computational modeling procedures are used in prescriptive analytics. Let us now have a look at how and where prescriptive analytics is used, with an example. Here we will understand how prescriptive analytics is used by an airline for its profit. Do you know that when you book a flight ticket, the price of it depends on various factors, both internal and external? Apart from taxes and seat selection, there are other factors like oil prices and customer demand, which are all taken into consideration before the flight fare is displayed. Prices change due to availability and demand. Holiday seasons are a time when the rates are much higher than on normal days, seasons like Christmas and school vacations. Also, on weekends the rates will be much higher than on weekdays. Another factor which determines a flight's fare is your destination. Depending on the place where you are traveling to, the flight fare will be adjusted accordingly. This is because there are quite a few places where the air traffic is less, and in such places the flight fare will also be less. So prescriptive analytics analyzes all these factors that we discussed, and it builds an algorithm which will automatically adjust a flight fare. By doing so, the airline is able to maximize its profit.
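As a toy illustration of how such a prescriptive pricing rule might look in code, here is a short Python sketch that adjusts a base fare using the factors mentioned above. The factors, weights, and numbers are entirely made up; a real system would derive them from large volumes of historical booking data.

```python
# Toy rule-based sketch of the fare-adjustment idea described above.
# All factors and weights are invented for illustration only.
def adjust_fare(base_fare, seats_left, total_seats, holiday_season, weekend,
                low_traffic_destination):
    fare = base_fare
    occupancy = 1 - seats_left / total_seats
    fare *= 1 + 0.5 * occupancy          # higher demand -> higher fare
    if holiday_season:
        fare *= 1.25                     # peak seasons such as Christmas
    if weekend:
        fare *= 1.10                     # weekends cost more than weekdays
    if low_traffic_destination:
        fare *= 0.90                     # quieter routes are priced lower
    return round(fare, 2)

print(adjust_fare(200.0, seats_left=20, total_seats=180,
                  holiday_season=True, weekend=False,
                  low_traffic_destination=False))
```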
These were the four types of analytics. Now let us understand how we achieve these with the use of big data tools. Our next topic will be the various tools used in big data analytics. These are a few of the tools that I will be talking about today: we have Hadoop, MongoDB, Talend, Kafka, Cassandra, Spark, and Storm. We will look into each one of them one by one. We will first start off with Hadoop. When you speak of big data, the first framework that comes into everyone's mind is Hadoop, isn't it? As I mentioned earlier, Apache Hadoop is used to store and process big data in a distributed and parallel fashion. It allows us to process data very fast. Hadoop uses MapReduce, Pig, and Hive for analyzing this big data. Hadoop is easily one of the most famous big data tools. Now let us move on to the next one, that is, MongoDB. MongoDB is a cross-platform, document-oriented database. It has the ability to deal with large amounts of unstructured data. Processing of data which is unstructured, and processing of data sets that change very frequently, is done using MongoDB. Talend provides software and services for data integration, data management, and cloud storage. It specializes in big data integration. Talend Open Studio is a free, open-source tool for processing data easily in a big data environment. Cassandra is used widely for effective management of large amounts of data. It is similar to Hadoop in its feature of fault tolerance, where data is automatically replicated to multiple nodes. Cassandra is preferred for real-time processing. Spark is another tool that is used for data processing. This data processing engine is developed to process data way faster than Hadoop MapReduce. This is possible because Spark does all the processing in the main memory of the data nodes and thus prevents unnecessary input/output overheads with the disk, whereas MapReduce is disk based, and hence Spark proves to be faster than Hadoop MapReduce. Storm is a free big data computation system which works in real time. It is one of the easiest tools for big data analysis. It can be used with any programming language; this feature makes Storm very simple to use. Finally, we will look into another big data tool, which is known as Kafka. Kafka is a distributed streaming platform which was developed by LinkedIn and later given to the Apache Software Foundation. It is used to provide real-time analytics results, and it is also used for fault-tolerant storage. These were a few of the big data analytics tools.

Now let us move on to our last topic for today, that is, big data application domains. Here we will look into the various sectors where big data analytics is actively used. The first sector is e-commerce. Merely 45 percent of the world is online, and they create a lot of data every second. Big data can be used smartly in the field of e-commerce by predicting customer trends, forecasting demands, adjusting the price, and so on. Online retailers have the opportunity to create a better shopping experience and generate higher sales if big data analytics is used correctly. Having big data doesn't automatically lead to a better marketing strategy. Meaningful insights need to be derived from it in order to make the right decisions. By analyzing big data, we can have personalized marketing campaigns, which can result in better and higher sales. In the field of education, depending on the market requirements, new courses are developed. The market requirement needs to be analyzed correctly with respect to the scope of a course, and accordingly the course needs to be developed. There is no point in developing a course which has no scope in the future. Hence, to analyze the market requirement and to develop new courses, we use big data analytics here. There are a number of uses of big data analytics in the field of healthcare, and one of them is to predict a patient's health issues. That is, with the help of their previous medical history, big data analytics can determine how likely they are to have a particular health issue in the future. The example of Spotify that we saw previously showed how big data analytics is used to provide a personalized recommendation list to all its users. Similarly, in the field of media and entertainment, big data analytics is used to understand the
demands of shows songs movies and so on to deliver personalized recommendation list as we saw with spotify big data analytics is used in the field of banking as we saw previously with a few use cases big data analytics was used for risk management in addition to risk management it is also used to analyze a customer's income and spend patterns and to help the bank predict if a particular customer is going to choose any of the bank offers such as loans credit card schemes and so on this way the bank is able to identify the right customer who is interested in its offers it has noticed that telecom companies have begun to embrace big data to gain profit big data analytics helps in analyzing network traffic and call data records it can also improve its service quality and improve its customer experience let us now look into how big data analytics is used by governments across the world in the field of law enforcement big data analytics can be applied to analyze all the available data to understand crime patterns intelligence services can use predictive analytics to focus the crime which could be committed in durham the police department was able to reduce the crime rate using big data analytics with the help of data police could identify whom to target where to go when to petrol and how to investigate crimes big data analytics help them to discover patterns of crime emerging in the area before we move on to the applications let's have a quick look at the big data market revenue forecast worldwide from 2011 to 2027. so here's a graph in which the y-axis represents the revenue in billion us dollars and the x-axis represents the years as it is seen clearly from the graph big data has grown until 2019 and statistics predict that this growth will continue even in the future this growth is made possible as numerous companies use big data in various domains to boost their revenue we will look into few of such applications the first big data application we will look into is weather forecast imagine there is a sudden storm and you're not even prepared that would be a terrifying situation isn't it dealing with any calamities such as hurricane storms floods would be very inconvenient if we are caught off guard the solution is to have a tool that predicts the weather of the coming days well in advance this tool needs to be accurate and to make such a tool big data is used so how does big data help here well it allows us to gather all the information required to predict the weather information such as the climate change details wind direction precipitation previous weather reports and so on after all this data is collected it becomes easier for us to spot a trend and identify what's going to happen next by analyzing all of this big data a weather prediction engine works on this analysis it predicts the weather of every region across the world for any given time by using such a tool we can be well prepared to face any climate change or any natural calamity let's take an example of a landslide and try to understand how big data is used to tackle such a situation predicting a landslide is very difficult with just the basic warning signs lack of this prediction can cause a huge damage to life and property this challenge was studied by the university of melbourne and they developed a tool which is capable of predicting a landslide this tool predicts the boundary where a landslide is likely to occur two weeks before this magical tool works on both big data and applied mathematics an accurate prediction like this 
which is made two weeks before can save lives and help in relocating people in that particular region it also gives us an insight into the magnitude of the upcoming destruction this is how big data is used in weather forecast and in predicting any natural calamities across the world let us now move on to our next application that is big data application in the field of media and entertainment the media and the entertainment industry is a massive one leveraging big data here can produce sky-high results and boost the revenue for any company let us see the different ways in which big data is used in this industry have you ever noticed that you come across relevant advertisements in your social media sites and in your mailboxes well this is done by analyzing all your data such as your previous browsing history and your purchase data publishers then display what you like in the form of ads which will in turn catch your interest in looking into it next up is customer sentiment analysis customers are very important for a company the happier the customer the greater the company's revenue big data helps in gathering all the emotions of a customer through their posts messages conversations etc these emotions are then analyzed to arrive at a conclusion regarding the customer satisfaction if the customer is unhappy the company strives to do better the next time and provides their customers a better experience while purchasing an item from an e-commerce site or while watching videos on an entertainment site you might have noticed a segment which says most recommended list for you this list is a personalized list which is made available to you by analyzing all the data such as your previous watch history your subscriptions your likes and so on recommendation engine is a tool that filters and analyzes all this data and provides you with a list that you would most likely be interested in by doing so the site is able to retain and engage its customer for a longer time next is customer churn analysis in simple words customer churn happens when a customer stops a subscription with a service predicting and preventing this is of paramount importance to any organization by analyzing the behavioral patterns of previously churned customers an organization can identify which of their current customers are likely to churn by analyzing all of this data the organization can then implement effective programs for customer retention let us now look into an use case of starbucks big data is effectively used by the starbucks app 17 million users use this app and you can imagine how much data they generate data in the form of their coffee buying habits the store they visit and to the time they purchase all of this data is fed into the app so when a customer enters a new starbucks location the system analyzes all their data and they are provided with their preferred order this app also suggests new products to the customer in addition to this they also provide personalized offer and discounts on special occasions moving on to our next sector which is healthcare it is one of the most important sectors big data is widely used here to save lives with all the available big data medical researchers are done very effectively they are performed accurately by analyzing all the previous medical histories and new treatments and medicines are discovered cure can be found out even for few of the incurable diseases there are cases when one medication need not be effective for every patient hence personal care is very important personal 
care is provided to each patient depending on their past medical history and individuals medical history along with their body parameters are analyzed and personal attention is given to each of them as we all know medical treatments are not very pocket friendly every time a medical treatment is taken the amount increases this can be reduced if readmissions are brought down analyzing all the data precisely will deliver a long-term efficient result which will in turn prevent a patient's readmission frequently with globalization came an increase in the ease for infectious diseases to spread widely based on geography and demographics big data helps in predicting where an outbreak of epidemic viruses are most likely to occur an american healthcare company united healthcare uses big data to detect any online medical fraud activities such as payment of unauthorized benefits intentional misrepresentation of data and so on the healthcare company runs disease management programs the success rates of these programs are predicted using big data depending on how patients respond to it the next sector we will look into is logistics logistics looks into the process of transportation and storage of goods the movement of a product from its supplier to a consumer is very important big data is used to make this process faster and efficient the most important factor in logistics is the time taken for the products to reach their destination to achieve minimum time sensors within the vehicle analyze the fastest route this analysis is based on various data such as the weather traffic the list of orders and so on by doing so the fastest route is obtained and the delivery time is reduced capacity planning is another factor which needs to be taken into consideration details regarding the workforce and the number of vehicles are analyzed thoroughly and each vehicle is allocated a different route this is done as there is no need for many trucks to travel in the same direction which will be pointless depending on the analysis of the available workforce and resources this decision is taken big data analytics also finds its use in managing warehouses efficiently this analysis along with tracking sensors provide information regarding the underutilized space which results in efficient resource allocation and eventually reduces the cost customer satisfaction is important in logistics just like it is in any other sector customer reactions are analyzed from the available data which will eventually create an instant feedback loop a happy customer will always help the company gain more customers let us now look into a use case of ups as you know ups is one of the biggest shipping company in the world they have a huge customer database and they work on data every minute ups uses big data to gather different kinds of data regarding the weather the traffic jams the geography the locations and so on after collecting all this data they analyze it to discover the best and the fastest route to the destination in addition to this they also use big data to change the routes in real time this is how efficiently ups leverages big data next up we have a very interesting sector that is the travel and tourism sector the global tourism market is expected to grow in the near future big data is used in various ways in this sector let us look into a few of them hotels can increase their revenue by adjusting the room tariffs depending on the peak seasons such as holiday seasons festive seasons and so on the tourism industry uses all of this data 
to anticipate the demand and maximize their revenue big data is also used by resorts and hotels to analyze various details regarding their competitors this analysis result helps them to incorporate all the good facilities their competitors are providing and by doing so the hotel is able to flourish further a customer always comes back if they are offered good packages which are more than just the basic ones looking into a customer's past travel history likes and preferences hotels can provide its customers with personalized experiences which will interest them highly investing in an area which could be the hub of tourism is very wise few countries use big data to examine the tourism activities in their country and this in turn helps them discover new and fruitful investment opportunities let us look into one of the best online homestay networks airbnb and see how big data is used by them airbnb undoubtedly provides its customers with the best accommodation across the world big data is used by it to analyze the different kinds of available properties depending on the customer's preferences the pricing the keywords previous customers ratings and experiences airbnb filters out the best result big data works its magic yet again now we will move on to our final sector which is the government and law enforcement sector maintaining law and order is of utmost importance to any government it is a huge task by itself big data plays an active role here and in addition to this it also helps governments bring in new policies and schemes for the welfare of its citizens the police department is able to predict criminal activities way before it happens by analyzing big data information such as the previous crime records in a particular region the safety aspect in that region and so on by analyzing these factors they are able to predict any activity which breaks the law and order of the region governments are able to tackle unemployment to a great extent by using big data by analyzing the number of students graduating every year to the number of relevant job openings the government can have an idea of the unemployment rate in the country and then take necessary measures to tackle it our next factor is poverty in large countries it is difficult to analyze which area requires attention and development big data analytics makes it easier for governments to discover such areas poverty gradually decreases once these areas begin to develop governments have to always be on the lookout for better development a public survey voices the opinion of a country's citizens analyzing all the data collected from such surveys can help governments build better policies and services which will benefit its citizens let us now move on to our use case did you know that the new york police department uses big data analytics to protect its citizens the department prevents and identifies crimes by analyzing a huge amount of data which includes fingerprints certain emails and records from previous police investigations and so on after analyzing all of this data meaningful insights are drawn from it which will help the police in taking the required preventive measures against crimes now when we talk about evolution of big data we have known that data has evolved in last five years like never before now in fact before going to big data or before understanding these solutions and the need and why there is a rush towards big data technology and solution i would like to ask a question take a couple of minutes and think why are 
organizations interested in big data why is there certain rush in industry where everyone would want to ramp up their current infrastructure or would want to be working on technologies which allow them to use this big data think about it what is happening and why are organizations interested in this and if you think on this you will start thinking about what organizations have been doing in past what organizations have not done and why are organizations interested in big data now before we learn on big data we can always look into internet and check for use cases where organizations have failed to use legacy systems or relational databases to work on their data requirements now over in recent or over past five years or in recent decade what has happened is organizations have started understanding the value of data and they have decided not to ignore any data as being uneconomical now we can talk about different platforms through which data is generated take an example of social media like twitter facebook instagram whatsapp youtube you have e-commerce and various portals say ebay amazon flipkart alibaba.com and then you have various tech giants such as google oracle sap amazon microsoft and so on so lots of data is getting generated every day in every business sector the point here is that organizations have slowly started realizing that they would be interested in working on all the data now the question which i asked was why are organizations interested in big data and some of you might have already answered or thought about that organizations are interested in doing precise analysis or they want to work on different formats of data such as structured unstructured semi-structured data organizations are interested in gaining insights or finding the hidden treasure in the so-called big data and this is the main reason where organizations are interested in big data now there are various use cases there are various use cases we can compare that organizations from past 50 or more than 50 years have been handling huge amount of data they have been working on huge volume of data but the question here is have they worked on all the data or have they worked on some portion of it what have they used to store this data and if they have used something to store this data what is happening what is what is changing now when we talk about the businesses we cannot avoid talking about the dynamism involved now any organization would want to have a solution which allows them to store data and store huge amount of data capture it process it analyze it and also look into the data to give more value to the data organizations have then been looking for solutions now let's look at some facts that can convince you or that would convince you that data is exploding and needs your attention right 55 billion messages and 4.5 billion photos are sent each day on whatsapp 300 hours of video are uploaded every minute on youtube did you guys know that youtube is the second largest search engine after google every minute users send 31.25 million messages and watch 2.77 million videos on facebook walmart handles more than 1 million customer transactions every hour google 40 000 search queries are performed on google per second that is 3.46 million searches a day in fact you could also say that a lot of times people when they are loading up the google page is basically just to check their internet connection however that is also generating data idc reports that by 2025 real-time data will be more than a quarter of all the 
data and by 2025 the volume of digital data will increase to 163 zeta bytes that is we are not even talking about gigabytes or terabytes anymore we are talking about petabytes exabytes and zeta bytes and zeta bytes means 10 to the power 21 bytes so this is how data has evolved now you can talk about different companies which would want to use their data to take business decisions they would want to collect the data store it and analyze it and that's how they would be interested in drawing insights for the business now this is just a simple example about facebook and what it does to work on the data now before we go to facebook you could always check in google by just typing in companies using big data and if we say companies using big data we should be able to find a list of different companies which are using big data for different use cases there are various sources from where you can find we could also search for solution that is hadoop which we'll discuss later but you could always say companies using hadoop and that should take you to the wiki page which will basically help you know what are the different companies which are using this so so-called solution called hadoop okay now coming back to what we were discussing about so organizations are interested in big data as we discussed in gaining insights they would want to use the data to find hidden information which probably they ignored earlier now take an example of rdbms what is biggest drawback in using an rdbms now you might think that rdbms is known for stability and consistency and organizations would be interested in storing their data in oracle or db2 or mysql or microsoft sql server and they have been doing that for many years now so what has changed now now when we talk about rdbms the first question which i would ask is do we have access to 100 of data being online in rdbms the answer is no we would only have 10 or 20 or 30 percent of data online and rest of the data would be archived which means that if an organization is interested in working on all the data they would have to move the data from the archived storage to the processing layer and that would involve bandwidth consumption now this is one of the biggest drawbacks of rdbms you do not have access to 100 of data online in many of the cases organizations started realizing that the data which they were ignoring as being uneconomical had hidden value which they had never exploited i had read a presentation somewhere which said torture the data and it will confess to anything now that's the value of data which organizations have realized in recent past take an example of facebook now this shows what facebook does with its big data and we'll come to what is big data but let's understand the use case now facebook collects huge volumes of user data whether that is sms whether that is likes whether that is advertisements whether that is features which people are liking or photographs or even user profiles now by collecting this data and providing a portal which people can use to connect facebook is also accumulating huge volume of data and that's way beyond petabytes they would also be interested in analyzing this data and one of the reasons would be they would want to personalize the experience take an example of personalized news feed depending on a user behavior depending on what a user likes what a user would want to know about they can recommend a personalized news feed to every particular user that's just one example of what facebook does with its data take an 
example of photo tag suggestions now when you log into facebook account you could also get suggestions on different friends whom you would like to connect to or you would want to tag so that they could be known by others some more examples which show how facebook uses its data are as follows so the flashback collection of photos and posts that receive the most comments and likes okay there was something called as i voted that was used for 2016 elections with reminders and directions to tell users their time and place of polling also something called as safety checks in incidents such as earthquake hurricane or mass shooting facebook gives you safety checks now these are some examples where facebook is using big data and that brings us to the question what is big data this was just an example where we discussed about one company which is making use of that data which has been accumulated and it's not only for companies which are social media oriented like facebook where data is important take an example of ibm take an example of jpmorgan chase take an example of ge or any other organization which is collecting huge amount of data they would all want to gather insights they would want to analyze the data they would want to be more precise in building their services or solutions which can take care of their customers so what is big data big data is basically a term it is used to describe the data that is too large and complex to store in traditional databases and as i gave an example it's not just about storing the data it is also about what you can do with the data it also means that if there is a lot of dynamism involved can you change the underlying storage and handle any kind of data that comes in now before we get into that let's just understand what is big data so big data is basically a term which has been given to categorize the data if it has different characteristics organizations would want to have the big data stored processed and then analyzed to get whatever useful information they can get from this data now there are five v's of big data volume velocity variety value velocity although these are five v's but then there are other v's which also categorize the data as big data such as volatility validity viscosity virality of data okay so these are five v's of big data and if the data has one or all of these characteristics then it can be considered as big data including the other ways which i just mentioned so volume basically means incredible amount of data huge volumes of data data generated every second now that could be used for batch processing that could be used for real-time stream processing okay you might have data being generated from different kind of devices like your cell phones your social media websites online transactions variable devices servers and these days with iot we are also talking about data of getting generated via internet of things that is you could have different devices which could be communicating to each other you could be getting data from radars or leaders or even camera sensors so there is a huge volume of data which is getting generated and if we are talking about data which has huge volume which is getting generated constantly or has been accumulated over a period of time we would say that is big data velocity now this is one more important aspect of big data speed with which the data is getting generated think about stock markets think about social media websites think about online surveys or marketing campaigns or airline industry so if the 
data is getting generated with a lot of speed where it becomes difficult to capture collect process curate mine or analyze the data then we are certainly talking about big data the next aspect of big data is variety now this is where we talk about structured data semi-structured data or unstructured data and here i would like to ask a question what is the difference when do you call data structured semi-structured or unstructured now let's look at an example before we discuss this theoretically i always like to use some examples let's look at a log file and see what it is so if i look at this log file and ask what kind of data the highlighted portion is the answer would be it is structured data it has specific delimiters such as space it has data which is separated by space and if i had a hundred or a thousand or a million rows which had a similar kind of data i could certainly store that in a table i could have a predefined schema to store this data so i would call the highlighted portion structured but if i look at this other portion which is a combination where some data has a pattern and some data doesn't then this is an example of semi-structured data because if i had a predefined structure to store this data the pattern of the data would probably break the structure and if i look at all of the data together then i would certainly call it unstructured data because there is no clear schema which can define this data now this is what i mean by variety of data that is structured data which basically has a schema or a format which can be easily understood you have semi-structured which could be like xml or json or even your excel sheets where some of the data is structured and the rest is not and you have unstructured and when we talk about unstructured we are talking about the absence of schema it does not have a format it does not have a schema and it is hard to analyze which brings its own challenges we will look at a small code sketch of this distinction in a moment right after the value and veracity discussion the next aspect is value now value refers to the ability to turn your data into something useful for the business you would have a lot of data being collected as we mentioned in the previous slides there would be a lot of data wrangling or data pre-processing or cleaning up of the data happening and then finally you would want to draw value from that data but from all the data collected what percentage of the data actually gives us value and if all my data can give me value then why wouldn't i use it this is an aspect of big data the next v is veracity now this means the quality of data billions of dollars are lost every year by organizations because the data which was collected was not of good quality or probably they collected a lot of data and it was erroneous take an example of the autonomous driving projects which are happening in europe or the us where car fleets are on the road collecting data via radar sensors and camera sensors and when this data has to be processed to train algorithms it is often realized that some of the data which was collected was missing values was not appropriate or had a lot of errors and the whole process of collecting the data becomes a repetitive task because the quality of the data was not good this is just one example we can take examples from the healthcare industry or stock markets or financial institutions and so on so extracting loads of data is not useful if the data is messy or poor in quality and that basically means that veracity is a very important v of big data
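before moving on to the remaining v's here is the promised tiny python sketch of the structured semi-structured and unstructured distinction this is only a rough illustration the sample records and the field names in it are made up and are not from any real log file

import csv, io, json

# a structured record has a fixed delimiter and a fixed schema so it maps cleanly onto table columns
structured = "2022-01-15,INFO,payment-service,200"
row = next(csv.reader(io.StringIO(structured)))
print(dict(zip(["date", "level", "service", "status"], row)))

# a semi-structured record carries its structure with it as keys but the shape can differ from record to record
semi_structured = '{"ts": "2022-01-15", "level": "INFO", "extra": {"retries": 2}}'
print(json.loads(semi_structured)["extra"])

# an unstructured record has no schema at all so it has to be handled with text processing
unstructured = "customer called and said the app felt slow after the last update"
print(len(unstructured.split()), "words of free text")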
now apart from veracity volume variety velocity and value we have other v's such as viscosity how dense the data is validity is the data still valid volatility is my data volatile and virality is the data viral now all of these different v's categorize the data as big data now here we would like to talk about a big data case study and we have taken the example of google which obviously is one of the companies which is churning and working on a huge amount of data it is actually said that if you compare one grain of sand with one byte of data then google is handling the whole world's sand every week that is the kind of data which google is processing now in the early 2000s as the number of internet users started growing google also faced a lot of problems in storing the increasing user data and in using traditional servers to manage it now that was a challenge which google started facing could they use traditional data servers to store the data well yes they could storage devices have been getting cheaper day by day but then how much time does it take to retrieve that data what is the seek time what is the time taken to read and process that data thousands of search queries were raised per second no doubt today we could say millions and billions of queries are raised and every query read 100 mb of data and consumed tens of billions of cpu cycles so the requirement was that they wanted to have a large distributed highly fault tolerant file system large to store capture and process huge amounts of data distributed because they could not rely on just one server even if that server had multiple disks stacked up that was not an efficient choice what would happen if that particular machine failed what would happen if the whole server was down so they needed distributed storage and a distributed computing environment and they needed something which could be highly fault tolerant so this was the requirement which google had and the solution which came out as a result was gfs the google file system now let's look at how gfs works normally in any particular linux system or linux server you would have a file system a set of processes and sets of files and directories which could store the data gfs was different to facilitate gfs which could store a huge amount of data there was an architecture which had one master and multiple chunk servers or you could say slave servers or slave machines the master machine was to contain metadata that is data about data when we say metadata we are talking about information about the data and then you have the chunk servers or the slave machines which would be storing the data in a distributed fashion now any client or an api or an application which would want to read the data would first contact the master server it would contact the machine where the master process was running and the client would place a request for reading the data internally what it is doing is requesting the metadata your api or application would want to know from where it can read the data the master server which has the metadata whether that is in ram or on disk we can discuss that later would know which are the chunk servers or the slave machines where the data was stored in a distributed fashion the master would respond back with the metadata information to the client and then the client could use that information to read from or write to these slave machines where the data was actually stored here is a small sketch of that lookup and read flow
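this is only a rough in-memory sketch of the idea the master and the chunk servers are just python dictionaries here and the file names server names and chunk contents are made up purely for illustration

# the master holds only metadata that is which chunks make up a file and which servers hold a replica of each chunk
master_metadata = {
    "file1": [("chunk-1", ["server-a", "server-b", "server-c"]),
              ("chunk-2", ["server-b", "server-c", "server-d"])],
}

# the chunk servers hold the actual chunk contents on their local disks
chunk_servers = {
    "server-a": {"chunk-1": b"first 64 mb of file1 ..."},
    "server-b": {"chunk-1": b"first 64 mb of file1 ...", "chunk-2": b"rest of file1 ..."},
    "server-c": {"chunk-1": b"first 64 mb of file1 ...", "chunk-2": b"rest of file1 ..."},
    "server-d": {"chunk-2": b"rest of file1 ..."},
}

def read_file(name):
    data = b""
    for chunk_id, replicas in master_metadata[name]:   # step 1 ask the master only for metadata
        for server in replicas:                        # step 2 read the chunk from any replica that has it
            if chunk_id in chunk_servers.get(server, {}):
                data += chunk_servers[server][chunk_id]
                break
    return data

print(read_file("file1"))

notice that in this sketch the data itself never flows through the master the master only answers the metadata question and the reads and writes go directly to the chunk servers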
now this is how these processes work together to make up gfs so when we say a chunk server we basically have the files getting divided into fixed size chunks now how would they get divided there would be some kind of chunk size or block size which would determine that if the file is bigger than the pre-decided chunk size then it would be split into smaller chunks and distributed across the chunk servers or slave machines if the file was smaller then it would still use one chunk or block to get stored on the underlying slave machines so these chunk servers or slave machines are the ones which actually store the data on their local disks as regular linux files the client which interacts with the master for metadata and then interacts with the chunk servers for read and write operations would be the one externally connecting to the cluster so this is how it would look you have a master which would obviously be receiving heartbeats from the chunk servers to know their status and receive information in the form of packets which would let the master know which machines were available for storage and which machines already had data and the master would build up the metadata within itself the files would be broken down into chunks for example we can look at file one it is broken down into chunk one and chunk two and file two has one chunk which is one portion of it and then you have file two residing on some other chunk server as well which also lets us know that there is some kind of automatic replication in this file system and each chunk could hold 64 mb of data now that chunk size could be changed based on the data size but the google file system had 64 mb as the basic chunk size each chunk would be replicated on multiple servers the default replication was 3 and that could again be increased or decreased as per requirement this would also mean that if a particular slave machine or chunk server were to die or get killed or crash there would never be any data loss because a replica of the data residing on the failed machine would still be available on some other chunk server or slave machine now this helped google to store and process huge volumes of data in a distributed manner and thus have a fault tolerant distributed scalable storage which allowed them to store a huge amount of data now that was just one example which eventually led to the solution which today we call hadoop now when we talk about big data here i would like to ask you some questions going back to the rdbms case take an example of something like nasa's seti project the search for extraterrestrial intelligence now this was a project where they were looking for a solution to take care of their problem the problem was that they would capture radio signals from space and then analyze this data to find if there was any sign of extraterrestrial intelligence now they had two options for it they could either have a huge server built which could take care of storing the data and processing it or they could go for volunteer computing now volunteer computing basically means that you could have a lot of people volunteering and being part of this project and what they would in turn do is donate the ram and storage of their machines when they are not using them how would that happen basically they would download some kind of patch on their
machine which would run as a screen saver and if the user is not using his machine some portion of data could be transferred to these machines for intermittent storage and processing using ram now this sounds very interesting and this sounds very easy however it would have its own challenges right think about security think about integrity but those those problems are not bigger as much as is the requirement of bandwidth and this is the same thing which happens in rdbms if you would have to move data from archived solution to the processing layer that would consume huge amount of bandwidth big data brings its own challenges huge amount of data is getting generated every day now the biggest challenge is storing this huge volume of data and especially when this data is getting generated with a lot of variety where it can have different kind of formats where it could be viral it could be having a lot of value and nobody has looked into the veracity of data but the primary problem would be handling this huge volume of data variety of the data would bring in challenges of storing it in legacy systems if processing of the data was required now here again i would suggest you need to think what is the difference between reading a data and processing a data so reading might just mean bringing in the data from disk and doing some io operations and processing would mean reading the data probably doing some transformations on it extracting some useful information from it and then storing it in the same format or probably in a different format so processing this massive volume of data is the second challenge organizations don't just store their big data they would eventually want to use it to process it to gather some insights now processing and extracting insights from big data would take huge amount of time unless and until there was an efficient solution to handle and process this big data securing the data that's again a concern for organizations right encryption of big data is difficult to perform if you would think about different compression mechanisms then that would also mean decompressing of data which would also mean that you could take a hit on the cpu cycles or on disk usage providing user authentication for every team member now that could also be dangerous so that led to hadoop as a solution so big data brings its own challenges big data brings its own benefits and here we have a solution which is hadoop now what is hadoop it's an open source framework for storing data and running applications on clusters of commodity hardware hadoop is an open source framework and before we discuss on two main components of hadoop it would be good to look into the link which i was suggesting earlier that is companies using hadoop and any person who would be interested in learning big data should start somewhere here where you could list down different companies what kind of setup they have why are they having hadoop what kind of processing they are doing and how are they using these so called hadoop clusters to process and in fact store capture and process huge amount of data another link which i would suggest is looking at different distributions of hadoop any person who is interested in learning in big data should know about different distributions of hadoop now in linux we have different distributions like ubuntu centos red hat susie debian in the same way you have different distributions of hadoop which we can look on the wiki page and this is the link which talks about products that include apache 
hadoop or derivative works and commercial support which basically means that the sole products that can be called a release of apache hadoop come from apache.org that's the open source community and then you have various vendor-specific distributions like amazon web services cloudera hortonworks ibm biginsights and mapr all these are different distributions of hadoop and all of these vendor-specific distributions are built on top of core apache hadoop in brief we can say that these are the vendors which take apache hadoop and package it within a cluster management solution so that users who intend to use apache hadoop do not have the difficulty of setting up a cluster and a framework themselves they can just use a vendor-specific distribution with its cluster installation and cluster management solutions and easily plan deploy install and manage their cluster let's rewind to the days before the world turned digital back then minuscule amounts of data were generated at a relatively sluggish pace all the data was mostly documents in the form of rows and columns storing or processing this data wasn't much trouble as a single storage unit and processor combination would do the job but as years passed by the internet took the world by storm giving rise to tons of data generated in a multitude of forms and formats every microsecond semi-structured and unstructured data was now available in the form of emails images audio and video to name a few all this data became collectively known as big data although fascinating it became nearly impossible to handle this big data and a single storage unit and processor combination was obviously not enough so what was the solution multiple storage units and processors were undoubtedly the need of the hour this concept was incorporated in the framework of hadoop which could store and process vast amounts of any data efficiently using a cluster of commodity hardware hadoop consisted of three components that were specifically designed to work on big data in order to capitalize on data the first step is storing it the first component of hadoop is its storage unit the hadoop distributed file system or hdfs storing massive data on one computer is unfeasible hence data is distributed amongst many computers and stored in blocks so if you have 600 megabytes of data to be stored hdfs splits the data into multiple blocks that are then stored on several data nodes in the cluster 128 megabytes is the default size of each block hence 600 megabytes will be split into four blocks a b c and d of 128 megabytes each and the remaining 88 megabytes go into the last block e so now you might be wondering what if one data node crashes do we lose that specific piece of data well no that's the beauty of hdfs hdfs makes copies of the data and stores them across multiple systems for example when block a is created it is replicated with a replication factor of three and stored on different data nodes this is termed the replication method by doing so data is not lost at any cost even if one data node crashes making hdfs fault tolerant after storing the data successfully it needs to be processed this is where the second component of hadoop mapreduce comes into play in the traditional data processing method the entire data would be processed on a single machine having a single processor this consumed time and was inefficient especially when processing large volumes of a variety of data but before we move on to processing here is a small sketch of the block splitting and replication we just walked through
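this is only a rough python sketch assuming the default 128 mb block size and a replication factor of 3 the data node names are made up purely for illustration

BLOCK_SIZE_MB = 128
REPLICATION = 3
data_nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def split_into_blocks(file_size_mb):
    # a 600 mb file becomes blocks of 128 128 128 128 and 88 megabytes
    blocks = []
    offset = 0
    while offset < file_size_mb:
        blocks.append(min(BLOCK_SIZE_MB, file_size_mb - offset))
        offset += BLOCK_SIZE_MB
    return blocks

def place_replicas(blocks):
    # spread each block's three replicas over different data nodes so one crash never loses data
    placement = {}
    for i, size in enumerate(blocks):
        replicas = [data_nodes[(i + r) % len(data_nodes)] for r in range(REPLICATION)]
        placement["block-" + chr(ord("a") + i)] = {"size_mb": size, "replicas": replicas}
    return placement

for block, info in place_replicas(split_into_blocks(600)).items():
    print(block, info)

running this prints blocks a to e with block e holding the remaining 88 megabytes and every block placed on three different nodes which is the fault tolerance idea described above now back to the processing side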
to overcome this mapreduce splits the data into parts and processes each of them separately on different data nodes the individual results are then aggregated to give the final output let's try to count the number of occurrences of words in this example first the input is split into five separate parts based on full stops the next step is the mapper phase where every occurrence of each word is counted and allocated the number one after that similar words are shuffled sorted and grouped together following which in the reducer phase all the grouped words are given a count finally the output is displayed by aggregating the results all this is done by writing a simple program and we will see a small code sketch of this flow a little further below similarly mapreduce processes each part of big data individually and then sums up the results at the end this improves load balancing and saves a considerable amount of time now that we have our mapreduce job ready it is time for us to run it on the hadoop cluster this is done with the help of a set of resources such as ram network bandwidth and cpu multiple jobs run on hadoop simultaneously and each of them needs some resources to complete its task successfully to efficiently manage these resources we have the third component of hadoop which is yarn yet another resource negotiator or yarn consists of a resource manager node managers application masters and containers the resource manager allocates resources the node managers handle the nodes and monitor the resource usage on each node and the containers hold a collection of physical resources suppose we want to run the mapreduce job we had created first the application master requests containers from the resource manager and once the resources are allocated the node managers launch the containers on their nodes and the job runs inside them this way yarn schedules job requests and manages cluster resources in hadoop in addition to these components hadoop also has various big data tools and frameworks dedicated to managing processing and analyzing data the hadoop ecosystem comprises several other components like hive pig apache spark flume and sqoop to name a few and the hadoop ecosystem works together on big data management before we dive into the technical side of hadoop we're going to take a little detour to try to give you a visual understanding and relate it to a more real-life setup and we're going to go to a farm in this case so we have a farm far away i almost wish they'd put far far away it does remind me a little bit of a star wars theme so we're going to look at fruit at a farm we have jack who harvests his grapes and then sells them in the nearby town after harvesting he stores his produce in a storage shed or a storage room in this case what we found out though is there was a high demand for other fruits so he started harvesting apples and oranges as well hopefully he has a couple of fields with these different fruit trees growing and you can see that he's working hard to harvest all these different fruits but he has a problem here because there's only one of him so he can't really do more work so what he needs to do then is hire two more people to work with him so that the harvesting is done simultaneously so instead of him trying to harvest all this different fruit he now has two more people in there who are putting their fruit away and harvesting it for him now the storage room becomes a bottleneck to store and access all the fruits in a single storage area they can't fit all the fruit in one place so jack decides to distribute the storage area and give each one of them a separate storage
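as promised here is a small pure python sketch of that word count flow it only imitates the map shuffle and reduce phases it is not the actual hadoop api and the sample sentence is made up purely for illustration

from collections import defaultdict

text = "big data is big. hadoop stores big data. hadoop processes data."
splits = [part.strip() for part in text.split(".") if part.strip()]   # split phase, one part per full stop

# mapper phase each split independently emits a (word, 1) pair for every word it sees
mapped = [(word, 1) for part in splits for word in part.split()]

# shuffle and sort phase pairs for the same word are grouped together
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# reducer phase every group of identical words is given a count and the results are aggregated
counts = {word: sum(ones) for word, ones in sorted(grouped.items())}
print(counts)   # {'big': 3, 'data': 3, 'hadoop': 2, 'is': 1, 'processes': 1, 'stores': 1}

each split could be handled on a different data node which is exactly the load balancing point made above coming back to the farm if you look at this in computer terms we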
have our people that are the processors we have our fruit that's a data and you can see it's storing it in the different storage rooms so you can see me popping up there getting my hello i want fruit basket of three grapes two apples and three oranges i'm getting ready for a breakfast with family a little large family my family's not that large to complete the order on time all of them work parallelly with their own storage space so here we have a process of retrieving or querying the data and you can see from the one storage space he pulls out three grapes she pulls out two apples and then another storage room he pulls out three oranges and we complete a nice fruit basket and this solution helps them to complete the order on time without any hassle all of them are happy and they're prepared for an increase in demand in the future so they now have this growth system where you can just keep hiring on new people they can continue to grow and develop a very large farm so how does this story relate to big data and i hinted at that a little bit earlier the limited data only one processor one storage unit was needed i remember back in the 90s they would just upgrade the computer instead of having a small computer you would then spend money for a huge mainframe with all the flashing lights on it then the cray computers were really massive nowadays a lot of the computers that sits on our desktop are powerful as the mainframes they had back then so it's pretty amazing how time has changed but used to be able to do everything on one computer and you had structured data and a database you stored your structured data in so most of the time you're acquiring databases sql queries just think of it as a giant spreadsheet with rows and columns where everything has a very specific size and fits neatly in that rows and columns and back in the 90s this was a nice setup you just upgraded your computer you would get yourself a nice big sun computer or mainframe if you had a lot of data and a lot of stuff going on and it was very easy to do soon though the data generation increased leading to high volume of data along with different data formats and so you can imagine in today's world this year we will generate more data than all the previous years summed together we will generate more data just this year than all the previous years some together and that's the way it's been going for some time and you can see we have a variety of data we have our structured data which is what we're you think about a database with rows and columns and easy to look at nice spreadsheet we have our semi-structured data they have emails as an example here that would be one example your xml your html web pages and we have unstructured data if you ever look through your folder on photos i have photos that were taken on my phone with high quality i've got photos from a long time ago i got web photos low quality so just in my pictures alone none of them are the same you know there certainly are groups of them that are but overall there are a lot of variety in size and setup so a single processor was not enough to process such high volume of different kinds of data as it was very time consuming you can imagine that if you are twitter with millions of twitter feeds you're not going to be able to do a query across one server there's just no way that's going to happen unless people don't mind waiting a year to get the history of their tweets or look something up hence we start doing multiple processors so they're used to process high volume 
of data and this saved time so we're moving forward we got multiple processors the single storage unit became the bottleneck due to which network overhead was generated so now you have your network coming in and each one of these servers has to wait before it can grab the data from the single stored unit maybe you have a sql server there with a nice setup or a file system going the solution was to use distributed storage for each processor this enabled easy access to storage and access to data so this makes a lot of sense you have multiple workers multiple storage units just like we had our storage room and the different fruit coming in your variety you can see that nice parallel to working on a farm now we're dealing with a lot of data it's all about the data now this method worked and there were no network overhead generated you're not getting a bottleneck somewhere where people are just waiting for data being pulled or being processed this is known as parallel processing with distributed storage so parallel processing distributed storage and you can see here the parallel processing is your different computers running the processes and distributed storage here is a quick demo on setting up cloudera quick start vm in case you are interested in working on a standalone cluster you can download the cloudera quick start vm so you can just type in download cloudera quick start vm and you can search for package now this can be used to set up a quick start vm which would be a single node cloudera based cluster so you can click on this link and then basically based on the platform which you would be choosing to install such as using a vm box or which version of cloudera you would install so here i can select a platform so i can choose box and then you can click on get it now so give your details and basically then it should allow you to download the quick start vm which would look something like this and once you have the zip file which is downloaded you can unzip it which can then be used to set up a single node cloud error cluster so once you have downloaded the zip file that would look something like this so you would have a quick start virtual box and then a virtual boss disk now this can be used to set up a cluster ignore these files which are related to amazon machines and we you don't need to have that so you would just have this and this can be used to set up a cloud error cluster so for this to be set up you can click on file import appliance and here you can choose your quick start vm by looking into downloads quick start vm select this and click on open now you can click on next and that shows you the specifications of cpu ram which we can then change later and click on import this will start importing virtual disk image dot vmdk file into your vm box once this is done we will have to change the specifications or machines to use two cpu cores minimum and give a little more ram because cloudera quickstart vm is very cpu intensive and it needs good amount of ram so to survive i will give two cpu cores and 5gb ram and that should be enough for us to bring up a quick start vm which gives us a cloudera distribution of hadoop in a single node cluster setup which can be used for working learning about different distributions in cloudera clusters working with sdfs and other hadoop ecosystem components let's just wait for this importing to finish and then we will go ahead and set up a quick start vm for our practice here the importing of appliance is done and we see cloudera quickstart machine 
is added to my list of machines i can click on this and click on settings as mentioned i would like to give it more ram and more cpu cores so click on system and here let's increase the ram to at least five and click on processor and let's give it two cpu cores which would at least be better than using one cpu core network it goes for nat and that's fine click on ok and we would want to start this machine so that it uses two cpu cores 5gb ram and it should bring up my cloudera quick start vm now let's go ahead and start this machine which has our quick start vm it might take initially some time to start up because internally there will be various cloudera services which will be starting up and those services need to be up for our cloud era quick start vm to be accessible so unlike your apache hadoop cluster where we start our cluster and we will be starting all our processes in case of cloudera it is your cloudera scm server and agents which take care of starting up of your services and starting up of your different roles for those services i explained in my previous session that for a cloud era cluster it would be these services let me just show you that so in case of apache cluster we start our services that is we start our cluster by running script and then basically those scripts will individually start the different processes on different nodes in case of cloud era we would always have a cloudera scm server which would be running on one machine and then including that machine we would have clouded icm agents which would be running on multiple machines similarly if we had a hortonworks cluster we would have ambari server starting up on the first machine and then embody agents running on other machines so your server component knows what are the services which are set up what are their configurations and agents running on every node are responsible to send heartbeats to the server receive instructions and then take care of starting and stopping off of individual roles on different machines in case of our single node cluster setup in quick start vm we would just have one scm server and one scm agent which will start on the machine which will then take care of all the roles which need to be started for your different services so we will just wait for our machine to come up and basically have cloud sem server and agent running and once we have that we need to follow few steps so that we can have the cloudera admin console accessible which allows you to browse the cluster look at different services look at the roles for different services and also work with your cluster either using command line or using the web interface that is now that my machine has come up and it already is connected to the internet which we can see here we need to do certain things so that we can have our admin console accessible at this point of time you can click on terminal and check if you have access to the cluster so here type in host name and that shows you your host name which is quickstart.cloudera we can also type in hdfs command to see if we have access and if my cluster is working these commands are same as you would give them in a apache hadoop cluster or in any other distribution of a loop sometimes when your cluster is up and you have access to the terminal it might take few seconds or few minutes before there is a connection established between cloudera cm server and cloudera cm agent running in the background which takes care of your cluster i have given a sdfs dfs list command which basically should 
show me what by default exists on my sdfs let's just give it a couple of seconds before it shows us the output we can also check by giving a service cloudera scm server status and here it tells me that if you would want to use cloudera express free run this command it needs 8 gb of ram and it leads to virtual cpu cores and it also mentions it may take several minutes before cloudera manager has started i can login as root here and then give the command service cloud error scm server status remember the password for root is cloudera so it basically says that if you would want to check the settings it is good to have express edition running so we can close this my sdfs access is working fine let's close the terminal and here we have launch cloud error express click on this and that will give you that you need to give a command which is force let's copy this command let's open a different terminal and let's give this command like this which will then go ahead and shut down your cloudera based services and then it will restart it only after which you will be able to access your admin console so let's just give it a couple of minutes before it does this and then we will have access to our admin console here if you see it is starting the cloudera manager server again it is waiting for cloudera manager api then starting the cloudera manager agents and then configuring the deployment as per the new settings which we have given as to use the express edition of cloudera once all this is done it will say the cluster has been restarted and the admin console can be accessed by id and password as cloudera we'll give it a couple of more minutes and once this is done we are ready to use our admin console now that deployment has been configured client configurations have also been deployed and it has restarted the cloudera management service it gives you an access to quick start admin console using username and password as cloud error let's try accessing it so we can open up the browser here and let's change this to 7180 that's the default port and that shows the admin console which is coming up now here we can log in as cloud error cloud error and then let's click on login now as i said cloudera is very cpu intensive and memory intensive so it would slow down since we have not given enough gb ram to our cloud error cluster and thus it will be advisable to stop or even remove the services which we don't need now as of now if we look at the services all of them look in a stop status and that's good in one way because we can then go ahead and remove the services which we will not use in the beginning and later we can anytime add services to the cluster so for example i can click on key value store here and then i can scroll down where it says delete to remove this service from the admin console now anytime you are removing a particular service it will only remove the service from the management by cloudera manager all the role groups under this service will be removed from host templates so we can click on delete now if this service was depending on some other service it would have prompted me with a message that remove the relevant services on which this particular service depends if the service was already running then it would have given me a message that the service has to be stopped before it can be deleted from the cloudera admin console now this is my admin console which allows you to click on services look at the different roles and processes which are running for this service we anyways have access 
to our cloudera cluster from the terminal using our regular sdfs or yarn or maplet commands now i removed a service i will also remove solar which we will not be using for the beginning but then it depends on your choice so we can here scroll down to delete it and that says that before deleting the solar service you must remove the dependencies on the service from the configuration of following services that is hue now hue is a web interface which allows you to work with your sdfs and that is depending on this so click on configure service dependency and here we can make sure that our hue service does not depend on a particular service we are removing so that then we can have a clean removal of the service so i'll click on none and i will say save changes once this is done then we can go ahead and try removing the solar service from our admin console which will reduce some load on my management console which will also allow me to work faster on my cluster now here we have removed the dependency of hue on solar so we can click on this and then we can delete it remember i'm only doing this so that my cluster becomes little lighter and i can work on my focus services at any point of time if you want to add more services to your cluster you can anytime do that you can fix different configuration issues like what we see here with different warning messages and here we have these services which are already existing now if we don't need any of the service i can click on the drop down and click on delete again this says that scoop 2 also has relevance to hue so hue as a web interface also depends on scope 2. as of now we'll make it none at any point of time later you can add the services by clicking the add service option now this is a cluster to which you have admin access and this is a quick start vm which gives you a single node cloud error cluster which you can use for learning and practicing so here we will click on scope 2 and then we will say delete as we have configured the dependency now and we will remove scope 2 also from the list of services which your admin console is managing right so once this is done we have removed three services which we did not need we can even remove scope as a client and if we need we can add that later now there are various other alerts which your cloudera admin console shows and we can always fix them by clicking on the health issues or configuration issues we can click here and see what is the health issue it is pointing to if that is a critical one or if that can be ignored so it says there is an issue with a clock offset which basically relates to an ntp service network time protocol which makes sure that one or multiple machines are in the same time zone and are in sync so for now we can click on suppress and we can just say suppress for all hosts and we can say look into it later and confirm so now we will not have that health issue reported that probably the ntp service and the machines might not be in sync now that does not have an impact for our use case as of now but if we have a kerberos kind of setup which is for security then basically this offset and time zone becomes important so we can ignore this message and we are still good to use the cluster we also have other configuration issues and you can click on this which might talk about the heap size or the ram which is available for machines it talks about zookeeper should be in odd numbers q does not have a load balancer sdfs only has one data node but all of these issues are not to be worried 
upon because this is a single node cluster setup so if you want to avoid all of these warnings you can always click on suppress and you can avoid and let your cluster be in all green status but that's nothing to worry so we can click on cluster and basically we can look at the services so we have removed some services which we don't intend to use now i have also suppressed a offset warning which is not very critical for my use case and basically i am good to start the cluster at any point of time as i said if you would want to add services this is the actions button which you can use to add service so we will just say restart my cluster which will restart all the services one by one starting from zookeeper as the first service to come up we can always click on this arrow mark and see what is happening in the services what services are coming up and in which order if you have any issues you can always click on the link next to it which will take you to the logs and we can click on close to let it happen in the background so this will basically let my services restart one by one and my cluster will then become completely accessible either using hue as a web interface or quick start terminal which allows you to give your commands now while my machines are coming up you can click on hosts and you can have a look at all the hosts we have as of now only one which will also tell you how many roles or processes are running on this machine so that is 25 rolls it tells you what is the disk usage it tells you what is the physical memory being used and using this host tab we can add new host to the cluster we can check the configuration we can check all the hosts in diagnostics you can look at the logs which will give you access to all the logs you can even select the sources from which you would want to have the logs or you can give the host name you can click on search you can build your own charts you can also do the admin stuff by adding different users or enabling security using the administration tab so since we have clicked on restart of a cluster we will slowly start seeing all the services one by one coming up starting with zookeeper to begin with and once we have our cluster up and running whether that is showing all services in green or in a different status we still should be able to access the service now as we saw in apache hadoop cluster even here we can click on sdfs and we can access the web ui once our sdfs service is up by clicking on quick links so the service is not yet up once it is up we should be able to see the web ui link which will allow you to check things from sdfs web interface similarly yarn as a service also has a web interface so as soon as the service comes up under your quick links we will have access to the yarn ui and similarly once the service comes up we will have access to hue which will give you the web interface which allows you to work with your sdfs which allows you to work with your different other components within the cluster without even using the command line tools or command line options so we will have to give it some time while the cloud error scm agent on every machine will be able to restart the roles which are responsible for your cluster to come up we can always click here with tells that there are some running commands in the background which are trying to start my cluster we can go to the terminal and we can switch as hdfs user remember sdfs user is the admin user and it does not have a password unless you have set one so you can just log in as 
sdfs which might ask you for a password initially which we do not have so the best way to do this is by logging in as root where the password is cloud error and then you can log in as hdfs so that then onwards you can give your sdfs commands to work with your file system now since my services are coming up right now when i try to give a sdfs dfs command it might not work or it might also say that it is trying to connect to the name node which is not up yet so we will have to give it some time and only once the name node is up we will be able to access our sdfs using commands so this is how you can quickly set up your quick start and then you can be working using the command line options from the terminal like what you would do in apache hadoop cluster you could use the web interfaces which allow you to work with your cluster now this usually takes more time so you will have to give it some time before your services are up and running and for any reason if you have issues it might require you to restart your cluster several times in the beginning before it gets accustomed to the settings what you have given and it starts up the services at any point of time if you have any error message then you can always go back and look in logs and see what is happening and try starting your cluster so this is how we set up a quick start vm and you can be using this to work with your cloud error clusters we'll start with the big data challenges and the first thing with the big data is you can see here we have a nice chaotic image with all these different inputs server racks all over the place graphs being generated just about everything you can imagine and so the problems that come up with big data is one storing it how do you store this massive amount of data and we're not talking about a terabyte or 10 terabytes we're talking a minimal of 10 terabytes up to petabytes of data and then the next question is processing and so the two go hand in hand because when you're storing the data that might take up a huge amount of space or you might have a small amount of data that takes a lot of processing and so either one will drive a series of data or processing into the big data arena so with storing data storing big data was a problem due to its massive volume just straight up people would have huge backup tapes and then you'd have to go through the backup tapes for hours to go find your data and a simple query could take days and then processing processing big data consumed more time and so the hadoop came up with a cheap way to process the data it used to be like i've had some processes that if i ran on my computer without trying to use multiple cores and multiple threads would take years to process just a simple data analysis can get that heavy in the data processing and so processing can be as big of a problem as the size of the data itself so hadoop as a solution this is the solution to big data and big data storage storing big data was a problem due to its massive volume so we take the hadoop file system or the hdfs and now we're able to store huge data across a large number of machines and access it like this one file system and processing big data consumed more time we talked about some processes you can't even do on your computer because it would take years now the hadoop with the map reduce processing big data was faster and i'm going to add a little notation right here that's really important to note that hadoop is the beginning when we talk about data processing they've added new processes on top 
of mapreduce that even accelerate it your spark setup and some other functionalities but really the basis of all of it where it all starts with the most basic concept is hadoop mapreduce let us now look into hdfs in detail in the traditional approach all the data was stored in a single central database with the rise of big data a single database was not enough for storage and i remember the old sun computers or the huge ibm machines with all the flashing lights now all the data on one of those can be stored on your phone it's almost a bit of humor how much this has accelerated over the years the same thing with our rise of big data no longer can you just store it on one machine no longer can you go out and buy a sun computer and put it all on that one sun computer no more can you buy a cray machine or an enterprise ibm server it's not going to work you're not going to fit it all into one server the solution was to use a distributed approach to store the massive amount of data data was divided and distributed amongst many individual databases and you can see here where we have three different databases going on so you might have actually seen this in one setup where they divided up the user accounts a through g and so on by letter and so the first query would say what's the first letter of this id whatever it was and then it would go into that database to find it so i had a database telling it which database to look for stuff in that was a long time ago and to be honest it doesn't work really well nowadays so that was a distributed database you'd have to track which database you put it in so what is hdfs the hadoop distributed file system hdfs is a specially designed file system for storing huge data sets on commodity hardware and commodity is an interesting term because i mentioned enterprise versus commodity and i'll touch back upon that it has two core components name node and data node the name node is the master daemon there's only one active name node it manages the data nodes and stores all the metadata so it stores all the mapping of where the data is on your hadoop file system now the name node is usually an enterprise machine you spend a lot of extra money on it so you have a very solid name node machine and then we have our data nodes data node data node data node we have three here the data node is the slave there can be multiple data nodes and it stores the actual data and this is where your commodity hardware comes in the best definition i've heard of commodity hardware is the cheap knockoffs this is where you can buy 10 of these and you expect one of them not to work because when they come in they're going to break right away so you're looking at hardware that's not as high-end so where you might have your main master node as your enterprise server your data nodes are just as cheap as you can get them with all the different features you need on them as said earlier the name node stores the metadata metadata gives information regarding the file location block size and so on so our metadata in hdfs is maintained by using two files the edit log and the fs image the edit log keeps track of the recent changes made on the hadoop file system only recent changes are tracked here the fs image keeps track of every change made on hdfs since the beginning now what happens when the edit log file size increases or the name node fails these are big ones so what happens we get our edit log and it just keeps growing until it's too big or our main enterprise
computer that we spent all that money on actually fails because they still fail the solution is we make copies of the edit log and the fs image files so that's pretty straightforward you just copy them over so you have both the most recent edits going on and the long term image of your file system and then we also create a secondary name node which is a node that maintains copies of the edit log and the fs image and combines them both to get an updated version of the fs image now the secondary name node only came in the last i guess two or three years where it became part of the main system and usually your secondary name node is also an enterprise computer and you'll put them on separate racks so if you have three racks of computers maybe the first rack would have the name node and the second rack would have the secondary name node and the reason you put them on different racks is you can have a whole rack go down you could have somebody literally trip over the power cable or the switch that goes between the racks which is the most common thing to go down well if the switch goes down you can easily switch to the secondary name node and while you're getting your switch replaced and replacing that hardware because of the way hdfs works it still is completely functional so let's take a look at the name node we have our edit log our fs image and we have our secondary name node and it copies the edit log over and it copies the fs image and you can see right here you have all the different contents on your main name node now also on your secondary name node and then your secondary name node will actually take these two the edit log and the fs image and combine them so you have a full fs image that is current and up to date the secondary name node creates a periodic checkpoint of the files and then it updates the new fs image on the name node now it used to be this all occurred on the name node before you had a secondary name node so now you can use your secondary name node both to back up everything going on in the name node and to do that lifting in the back where you're combining your edit log into your fs image so it's current and then you end up with your fs image updated and you start a fresh edit log this process of updating happens every hour and that's how it's scheduled you can actually change those schedules but the standard is to update every hour
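if you want to see or change that schedule on your own cluster these are the two hadoop 2.x properties that control it shown here with the getconf command as a rough sketch and the values mentioned are just the shipped defaults

hdfs getconf -confKey dfs.namenode.checkpoint.period   # 3600 seconds, so a checkpoint every hour by default
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # or sooner once this many uncheckpointed transactions pile up in the edit log, 1000000 by default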
so we've taken a look at the master node that is the name node and the secondary name node now let's take a look at the cluster architecture of our hadoop file system the hdfs cluster architecture so we have our name node and it stores the metadata and the block locations so we have our fs image plus our edit log and then we have the backup fs image and edit log and then you have your rack with our switch on top remember i was talking about the switch that's the most common thing to go in the rack is the switches and underneath the switch you have your different data nodes you have your data node 1 2 4 5 maybe you have 10 or 15 on this rack you can stack them pretty high nowadays it used to be you only got about 10 servers on there but now you see racks that contain a lot more and then you have multiple racks so we're not talking about just one rack we also have rack 2 rack 3 4 5 6 and so on until you have rack n so if you had 100 data nodes we would be looking at 10 racks of 10 data nodes each and that is literally 100 commodity server computers and we have a core switch which maintains network bandwidth and connects the name node to the data nodes so just like each rack has a switch that connects all your nodes on the rack you now have core switches that connect all the racks together and these also connect into our name node setup so now it can look up your fs image and your edit log and pull that information your metadata out so we've looked at the architecture from the name node coming down you have your metadata your block locations this then sorts it out you have your core switches which connect everything all your different racks and then each individual rack has its own switch which connects all the different nodes to the core switches so now let's talk about the actual data blocks what's actually sitting on those commodity machines so the hadoop file system splits massive files into small chunks these chunks are known as data blocks each file in the hadoop file system is stored as data blocks and we have a nice picture here where it looks like lego if you ever played with legos as a kid it's a good example we just stack that data right on top of each other but each block has to be the same size so that it can be tracked easily and the default size of one data block is usually 128 megabytes now you can go in and change that but this standard is pretty solid as far as most data is concerned when we're loading up huge amounts of data and while there are certainly reasons to change it 128 megabytes is a pretty standard block so why 128 megabytes if the block size is smaller then there will be too many data blocks along with lots of metadata which will create overhead so that's why you don't really want to go smaller on these data blocks unless you have a very certain kind of data similarly if the block size is very large then the processing time for each block increases then as i pointed out earlier each block is the same size just like your lego blocks are all the same but the last block can be the same size or less so you might only be storing 100 megabytes in the last block and you can think of this as if you had a terabyte of data that you're storing on here it's not going to be exactly divided into 128 megabyte pieces we just store all of it in 128 megabyte blocks except for the last one which could have anywhere between one and 128 megabytes depending on how evenly your data is divided now let's look into how files are stored in the hadoop file system so we have a file a.txt let's say it's 520 megabytes we have block a so we take 128 megabytes out of the 520 and we store it in block a and then we have block b again we're taking 128 megabytes out of our 520 and storing it there and so on block c block d and then block e we only have eight megabytes left because when you add up 128 plus 128 plus 128 plus 128 you only get 512
and so the last eight megabytes goes into its own block the final block uses only the remaining space for storage data node failure and replication and this is really where hadoop shines this is what makes it this is why you can use it with commodity computers this is why you can have multiple racks and have something go down all the data blocks are stored in various data nodes you take each block you store it at 128 megabytes and then we're going to put it on different nodes so here's our block a block b block c from our last example and we have node one node two node three node four node five node six and each one of these represents a different computer it literally splits the data up onto different machines so what happens if node five crashes well that's a big deal i mean we might not even have just node five go down you might have a whole rack go down and if you're a company that's building your whole business off of that you're going to lose a lot of money so what does happen when node 5 crashes or the first rack goes down the data stored in node 5 will be unavailable as there is no copy stored elsewhere in this particular image so the hadoop file system overcomes the issue of data node failure by creating copies of the data this is known as the replication method and you can see here we have our six nodes here's our block a but instead of storing it just on the first machine we're actually going to store it on the second and fourth nodes as well so now it's spread across three different computers and if these are on a rack one of these is always on a different rack so you might have two copies on the same rack but you never have all three on the same rack and you never have more than one copy on one node there's no reason to have more than one copy per node and you can see we do the same thing with block b block c is then also spread out across the different machines and same with block d and block e if node five crashes will the data blocks b d and e be lost well in this example no because we have backups of all three of those on different machines the blocks have their copies on the other nodes due to which the data is not lost even if node 5 crashes and again because they're also stored on different racks even if the whole rack goes down you are still up and live with your hadoop file system the default replication factor is three in total we'll have three copies of each data block now that can be changed for different reasons or purposes but you've got to remember when you're looking at a data center this is all in one huge room these switches are connecting all these servers so they can shuffle the data back and forth really quickly and that is very important when you're dealing with big data and you can see here each block is by default replicated three times that's the standard there are very rare occasions to do four and there are even fewer reasons to do two i've only seen four used once and it was because they had two data centers and so each data center kept two copies
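just as a rough sketch of where those two defaults live and how you would override them for a single file the commands below assume a hypothetical file called bigfile.txt an existing /data directory and hadoop 2.x property names

hdfs getconf -confKey dfs.blocksize                          # 134217728 bytes which is the 128 megabyte default
hdfs getconf -confKey dfs.replication                        # 3 which is the default replication factor
hdfs dfs -D dfs.blocksize=268435456 -put bigfile.txt /data/  # write this one file with 256 megabyte blocks instead
hdfs dfs -setrep -w 2 /data/bigfile.txt                      # change the replication factor of an existing file and wait for it to finish
hdfs fsck /data/bigfile.txt -files -blocks -locations        # list every block of the file and which data nodes hold its replicas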
rack awareness in the hadoop file system a rack is a collection of 30 to 40 data nodes rack awareness is a concept that helps decide where the replica of a data block should be stored so here we have rack one with our data nodes one to four remember i was saying that it used to be only ten machines on a rack then it went to twenty and now it's thirty to forty so you can have a rack with forty servers on it then we have rack two and rack three and we put block a on there now all the replicas of block a cannot be in the same rack so we'll put the replicas onto a different rack and notice that two of these are actually on the same rack but you'll never have all three stored on the same rack in case the whole rack goes down so replicas of block a are created in rack two and by default hadoop does create two of the replicas on the same rack and that has to do with the data exchange and maximizing your processing time and then we have of course our block b and it's replicated onto rack 3 and block c which will then replicate onto rack 1 and so on for all of your data all the way up to block d or however much data you have on there hdfs architecture so let's look over the architecture as a bigger picture we looked at the name node and it stores some metadata something like the name the number of replicas and a path such as /home/foo/data with a replication of three so it has all your different metadata stored on there and then we have our data nodes and you can see our data nodes are each on different racks with our different machines and we have our name node and you're going to see we have a heartbeat or pulse here and a lot of times one of the things that confuses people in classes is when they talk about nodes versus machines so you could have a data node that's a hadoop data node and you could also have a spark node on there spark is a different architecture and these are each daemons that are running on these computers that's why you refer to them as nodes and not just always as servers and machines so even though i use them interchangeably be aware that these are nodes you can even have virtual machines if you're testing something out although it doesn't make sense to have 10 virtual nodes on one machine and deploy it because you might as well just run your code on the machine so we have our heartbeat going on here and the heartbeat is a signal that data nodes continuously send to the name node this signal shows the status of the data node so there's a continual pulse going up and saying hey i'm here i'm ready for whatever instructions or data you want to send me and you can see here we've divided it up into rack 1 and rack 2 and our different data nodes and it also has the replications we talked about how to replicate data in three different locations and then we have a client machine and the client first requests the name node to read the data now if you're not familiar with the client machine the client is you the programmer the client is you logged in external to this hadoop file system sending it instructions and so the client whatever instructions or script you're sending first requests the name node to read the data the name node allows the client to read the requested data from the data nodes the data is read from the data nodes and sent to the client and so you can see here that basically the name node connects the client up and says here's a data stream and now you have the query that you sent out returning the data you asked for and then of course it goes and finishes it with the metadata operations and finalizes your request the other thing the name node handles when the client sends information in is your block operations so your block operations perform creation of data so you're going to create new files and folders you're going to delete folders and it also covers the replication of the files which goes on in the background and so we can see here that we have a nice full picture you can see
where the client machine comes in it queries the metadata the metadata goes in it stores the metadata and then it goes into block operations and maybe you're sending data to the hadoop file system maybe you're querying maybe you're asking it to delete if you're sending data in there it then does the replication and it goes back so you have your data client which is writing the data into the data node and of course replicating it and that's all part of the block operations and so let's talk a little bit about read mechanisms in the hadoop file system the hadoop file system read mechanism we have our hadoop file system client that's you on your computer and the client jvm on the client node so we have our client jvm or java virtual machine and then your client node and we're zooming in on the read mechanism so we're looking at this picture here as you can guess your client is reading the data and i also have another client down here writing data we're going to look a little closer at that and so we have our name node up here and we have our racks of data nodes and your racks of computers down here and so our client the first thing it does is it opens a connection with the distributed file system to hdfs and it goes hey can i get the block locations and so by using an rpc a remote procedure call it gets those locations and the name node first checks if the client is authorized to access the requested file and if yes it then provides the block locations and a token to the client which is shown to the slave for authentication so here's the name node it tells the data node hey here's the token the client's going to come in and get this information from you and it tells the client hey here's where the information is so this is what is telling the script you sent to query your data whether you're writing your script in one of the many setups that you have available through the hadoop file system or a connection through your code and so your client machine at this point then reads so your fs data input stream comes through and you can see right here we did one and two which is verify who you are and give you all the information you need then three you're going to read it from the input stream and the input stream is going to grab the data from the different nodes where it's at and it'll supply the tokens to those machines saying hey this client needs this data here's a token for that we're good to go let me have the data the client will show the authentication token to the data nodes for the read process to begin so after reaching the end of the data block the connection is closed and we can see here where we've gone through the different steps get block locations step one you open up your connection you get the block locations by using the rpc then you actively go through the fs data input stream to grab all those different data blocks and bring them back into the client and then once it's done it closes down that connection so once the client or in this case the programmer has gone in with for example a pig or hive script there's actual coding in hadoop for pulling data called pig or hive once you get that data back we close the connection and delete all those randomly generated tokens so they can't be used anymore and then it's done with that query
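from the client's side all of that negotiation is hidden behind a single command so just as a rough sketch assuming a hypothetical file at /user/cloudera/mynewfile

hdfs dfs -cat /user/cloudera/mynewfile             # open, fetch the block locations over rpc, stream the blocks from the closest data nodes, then close
hdfs dfs -tail /user/cloudera/mynewfile            # same read path but only the last kilobyte of the file
hdfs dfs -get /user/cloudera/mynewfile ./copy.txt  # same read path again but the stream is written out to a local file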
and we can go ahead and zoom in just a little bit more here and look at this even a little closer here's our hadoop file system client and our client jvm our java virtual machine on the client node and we have the data to be read block a and block b and so we request to read blocks a and b and step one goes into the name node step two it then sends the locations in this case the ip addresses of the blocks on dn1 and dn2 where those blocks are stored then the client interacts with the data nodes through the switches and so you have here the core switch so your client node comes in at step three and it goes to the core switch and that then goes to rack switch 1 rack switch 2 and rack switch 3 now if you're looking at this you'll automatically see a point of failure in the core switch and certainly you want a high end switch mechanism for your core switch you want to use enterprise hardware for that and then when you get to the racks that's all commodity all your rack switches so if one of those goes down you don't care as much you just have to get in there and swap it out really quickly you can see here we have block a which is replicated three times and so is block b and it'll pull from there so we come in here and the data is read from dn1 and dn2 as they are the closest to each other and so you can see here that it's not going to read from two different racks it's going to read from one rack whatever the closest setup is for that query the reason for this is if you have 10 other queries going on you want this one to pull all the data through one setup and it minimizes that traffic then the response from the data nodes to the client is that the read operation was successful it says ah we've read the data we're successful which is always good we like to be successful i don't know about you i like to be successful so now that we've looked at the read mechanism let's go ahead and zoom in and look at the write mechanism for the hadoop file system so our hdfs write mechanism when we have the hdfs write mechanism here's our client machine this is again the programmer on their own computer and it's going through the client java virtual machine the jvm this is all occurring on the client node in somebody's office or maybe on the local server for the office so we have our name node we have data nodes and we have the distributed file system so the client first executes create file on the distributed file system it says hey i'm going to create this file over here then it goes through the rpc call just like our read did the client first executes create file on the distributed file system then the dfs interacts with the name node to create a file the name node then provides a location to write the data and so here we have our hdfs client and the fs data output stream so this time instead of the data going to the client it's coming from the client and so here the client writes the data through the fs data output stream keep in mind that this output stream the client could be streaming code i always refer to the client as just being this computer that your programmer is writing on but it could be your sql server where your data is that's current with all your current sales and it's archiving all that information through sqoop one of the tools in the hadoop ecosystem it could be streaming data it could be a connection to the stock servers where you're pulling stock data down from those servers at regular times and that's all controlled you can actually set that code up to be controlled with many of the different features in the hadoop file system and some of the different resources you have that sit on top of it so here the
client writes the data through the fs data output stream and the fs data output stream as you can see goes into write packet so it divides the data up into packets the data is written and the slave further replicates it so here's our data coming in and if you remember correctly part of the fs data setup is it tells it where to replicate it out so the data node itself is like oh hey okay i've got the data coming in and it's also given the tokens of where to send the replications to and then an acknowledgement is sent after the required replicas are made so that goes back up saying hey successful i've written the data and made three replications going through the pipeline of the data nodes and then after the data is written the client performs the close method so the client's done it says okay i'm done here's the end of the data we're finished so after the data is written the client performs a close method and we can see here just a quick recap we go in there just like we did with the read we create the connection this contacts the name node which lets it know what's going on that's steps one and two which also includes the tokens and everything and then we go into step three where we're now writing through the fs data output stream and that sorts it out into whatever data node it's going to go to and also tells it how to replicate it so that data node then sends it to other data nodes so we have replication and of course you finalize it and close everything up it marks it complete to the name node and it deletes all those magical tokens in the background so that they can't be reused and we can go ahead and just do this with an example the setups are the same and whether you're doing a read or a write they're very similar as we come in here from our client and our name node and you can see right here we're actually depicting these as the actual rack with the switch going on and so as the data comes in you have your request like we saw earlier it sends the locations of the data nodes and this actually returns your ip addresses your dynamic node connections they come back to your client your hdfs client and at that point with the tokens it then goes into the core switch and the core switch says hey here it goes the client interacts with the data nodes through the switches and you can see here where you're writing in block a a replication of block a and a second replication of block a on the second server so block a is replicated on the second server and the third server at this point you now have three replications of block a and it comes back and acknowledges and says hey we're done and that goes back into your hadoop file system to the client and says okay we're done and finally the success is written to the name node and it just closes everything down
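again from the client's side that whole pipeline is wrapped in a couple of commands here is a minimal sketch assuming a hypothetical local file called localfile.txt and the quickstart user directory

hdfs dfs -put ./localfile.txt /user/cloudera/                        # create the file on the name node then stream packets to the first data node which pipelines the replicas onward
hdfs dfs -appendToFile ./morelines.txt /user/cloudera/localfile.txt  # append reopens the write pipeline on the last block of the file
hdfs dfs -ls /user/cloudera/localfile.txt                            # the file only shows its final size once the close has been acknowledged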
a quick recap of the hadoop file system and the advantages so the first one is probably one of the biggest although all of these are huge one you have multiple data copies available so it's very fault tolerant whole racks can go down switches can go down even your main name node could go down if you have a secondary name node something we didn't talk too much about is how scalable it is because it uses distributed storage when you run into oh my gosh i'm out of space or i need to do some heavier processing you can just add another rack of computers so you can scale it up very quickly it's a linear scalability where it used to be if you bought a server you would have to pay a lot of money to get that remember the big cray computers coming out the cray computer runs 2 million a year just in maintenance to liquid cool it that's very expensive compared to just adding more racks of computers and extending your data center so it's very cost effective since commodity hardware is used we're talking cheap knockoff computers you still need your high-end enterprise machine for the name node but the rest of them literally it is a tenth of the cost of storing data on more traditional high-end computers and then the data is secure so it provides very high-end data security for your data so all theory and no play doesn't make for much fun so let's go ahead and show you what it looks like as far as some of the dimensions when you're getting into pulling data out of or putting data into the hadoop file system and you can see here i have the oracle vm virtualbox manager i have a couple of different things loaded on there cloudera is one of them so we should probably explain some of these things if you're new to virtual machines the oracle virtualbox allows you to spin up a machine as if it is a separate computer so in this case this is running i believe centos linux and it creates like a box on my computer so the centos is running on my machine while i'm running my windows 10 this happens to be a windows 10 computer and then underneath here i can actually go under let me just open up the general settings it might be hard to see there and you can see i can actually go down to system processor i happen to be on an 8 core machine it has 16 dedicated threads it registers as 16 cpus but 8 cores and i've only designated this vm one cpu so it's only going to use one of my dedicated threads on my computer and the oracle virtualbox is open source you can see right here we're on www.oracle.com i usually just do a search for downloading virtualbox if you search for virtualbox all one word it will come up with this page and then you can download it for whatever operating system you're working with there certainly are a number of different options and let me go and point those out if you're setting this up for demoing for yourself the first thing to note for doing a virtualbox with a cloudera or hortonworks setup on that virtualbox to try out the hadoop system is you need a minimum of 12 gigabytes of ram and it cannot be a windows 10 home edition because you'll have problems with your virtual setup and sometimes you have to go turn on the virtualization settings so it knows it's in there so if you're on a home setup there are other sources there's cloudera and we'll talk a little bit about cloudera here in just a second but they have cloudera online live where you can go try the cloudera setup i've never used it but cloudera is a pretty good company cloudera and hortonworks are two of the common ones out there and we'll actually be running a cloudera hadoop cluster in our demo here so you have oracle virtualbox you also have the option of doing it on vmware that's another virtual machine option this is more of a paid service though there is a free setup for just doing it yourself which will work fine for this and then again cloudera has the new online setup where you can go in there online and for cloudera you want to go to the cloudera quickstart if you type in a search for cloudera quickstart it'll bring you to this website and then you can select your platform in this case i did virtualbox there's
vmware which we just talked about and there's docker docker is a very high-end virtual setup and unless you already know it you really don't want to mess with it then kvm is for if you're on a linux computer and it sets up multiple systems on that computer so the two you really want to use are usually the virtualbox or an online setup and you can see here with the download if you're going into the hortonworks version they call it a sandbox so you'll see the term hortonworks sandbox and these are all test demos you're not going to deploy a single node hadoop system that would just be kind of ridiculous and defeat the whole purpose of having a hortonworks or a hadoop system if it's only installed on one computer in a virtual node so a lot of different options if you're not on a professional windows version or you don't have at least 12 gigabytes of ram to run this you'll want to try and see if you can find an online version and of course simplilearn has our own labs if you sign up for classes we set you up i don't know what it is now but last time i was in there it was a five node setup so you can get around and see what's going on whether you're studying for the admin side or for the programming side and script writing and if i go into my oracle virtualbox and i go under my cloudera and start this up each one has its own flavor hortonworks uses just a login so you log everything in through localhost through your internet explorer or i might use chrome cloudera actually opens up a full interface so you actually are in that setup and you can see when i started it let me go back here once i downloaded this and this is a big download by the way i had to import the appliance in virtualbox the first time i ran it it takes a long time to configure the setup and the second time it comes up pretty quickly and with the cloudera quickstart again this is a pretend single node it opens up and you'll see that it actually has firefox here so here's my web browser i don't have to go to localhost i'm actually already in the quickstart for cloudera and if we come down here you can see getting started i have some information analyze your data manage your cluster your general information on there and what i always want to do to start is to go ahead and open up a terminal window so we'll open up a terminal widen this a little bit let me just maximize this out here so you can see so we are now in a virtual machine this virtual machine is centos linux so i'm on a linux computer on my windows computer and so when i'm on this terminal window this is your basic terminal if i do ls for list you'll see documents eclipse these are the different things that are installed with the quickstart guide on the linux system so this is a linux computer and then hadoop is running on here so now i have a hadoop single node so it has both the name node and the data node and everything squished together in one virtual machine we can then do let's do hdfs telling it that it's the hadoop file system then dfs minus ls now notice the ls is the same i have ls for list here and ls for list there and i click on here it'll take just a second reading the hadoop file system and it comes up with nothing so a quick recap let's go back over this there are three different environments i have this one out here let's just put this in a bright red so you can actually see it i have this environment out here which is my slides i have this environment here where i did a list that's looking at the files on the linux centos computer and then we have this system here which is
looking at the files on the hadoop file system so three completely separate environments and then we connect them so right now i have whatever files i have my personal files and of course we're also looking at the screen for my windows 10 then we're looking at the screen here here's our list let's look at the files and this is the screen for centos linux and then this is looking at the files right here for the hadoop file system so three completely separate file systems this one here which is the linux is running in a virtualbox so this is a virtualbox i'm using one core to run it or one cpu and everything in there has its own file system you can see we have our desktop and documents and whatever in there and then you can see here we right now have no files in our hadoop file system and this hadoop file system currently is stored on the linux machine but it could be stored across 10 linux machines 20 100 this could be stored across petabytes i mean it could be really huge or it could just be in this case a demo where we're putting it on just one computer and then once we're in here let me just see real quick if i can go under view zoom in view zoom in this is just a standard browser so i can use things like control plus to zoom in and it's very common to be in a browser window with the hadoop file system so right now i'm in linux and i'm going to just create a file called my new file and i'm going to use vi which is just a basic editor and we go and type something in here one two three four maybe it's columns 44 66 77 of course a new file system just like your regular computer can hold whatever so in our vi editor you hit your colon i actually work with a lot of other editors too and we'll do write quit to leave vi so let's take a look and see what happened here i'm in my linux system i type in ls for list and we should see my new file and sure enough we do over here there it is let me just highlight that my new file and if i then go into the hadoop system hdfs dfs minus ls for list we still show nothing it's still empty so what i can simply do is go hdfs dfs minus put and i'm going to put my new file and this is just going to copy it from the linux system because i'm in this file folder into the hadoop file system and now if we go in and type in our list for the hadoop file system you will see in here that i now have just the one file on there which is my new file and very similar to linux we can do cat and the cat command simply invokes reading the file so hdfs dfs minus cat and i had to look it up remember in cloudera the format is going to be user then of course the path location and the file name and in here when we did the list here's our list so you can see it lists our file here and we realize that this is under user cloudera and so i can now go user cloudera my new file with the minus cat and we'll be able to read the file in here and you can see right here the file that was in the linux system is now copied into the cloudera system and it's the one two three four that i entered in there and if we go back to the linux and do list you'll still see it in here my new file and we can also do something like this in our hdfs a minus mv i will do my new file and we're going to change it to my new new file and if we do that underneath our hadoop file system the minus mv will rename it so if i go back here to our hadoop file system ls you'll now see instead of my new file it has my new
new file coming up and there it is my new new file we've renamed it we can also go in here and delete this so i can now come in here and in our hdfs dfs we can also do a remove and this will remove the file and so if we come in here and run this we'll see that when i come back and do the list the file is gone and now we just have an empty folder in our hadoop file system and just like any file system we can go ahead and make a directory create a new directory so mkdir for make directory we'll call this my dir so we're going to make a directory my dir it'll take just a second and of course if we do the list command you'll see that we now have the directory in there give it just a second there it comes my dir and just like we did before i can go in here and we're going to put the file and if you remember correctly from our files in the setup i called it my new file so this is coming from the linux system and we're going to put that into my dir that's the target in my hadoop setup and so if i hit enter on there i can now do the hadoop list and that's not going to show the file because remember i put it in a sub-folder so if i just do the hadoop list this will show my directory and then i can do a list on my dir for my directory and you'll see underneath my directory in the hadoop file system it now has my new file put in there and as with any good operating system we need a minus help so just like you can type in help in your linux you can now come in here and type in hdfs help and it shows you a lot of the commands in there underneath the hadoop file system most of them should be very similar to the linux ones and we can also do something like hadoop version and the hadoop version shows that we're on hadoop 2.6.0 and the cdh is cloudera 5 compiled by jenkins and it has a date and all the different information on our hadoop setup so this is some basics in the terminal window let me go ahead and close this out because if you're going to play with this you should really come in here let me just maximize the cloudera window it opens up in a browser window and so once we're in here again this is a browser window which might look like any access to a hadoop file system one of the fun things to do when you're first starting is to go under hue you'll see it up here at the top it has cloudera hue hadoop your hbase your impala your spark these are standard installs now and hue is basically an overview of the file system so come up here and you can see where you can do queries if you have an hbase or a hive database we go over here to the top where it says file browser and if we go under file browser now this is the hadoop file system we're looking at and once we open up the file browser you can now see there's my directory which we created and if i click on my directory there's my new file which is in here and if i click on my new file it actually opens it up and you can see this is in the hadoop file system the file that we created so we covered the terminal window you can see here's a terminal window up here if you were in a web browser it'll look a little different because it actually opens up as a web browser terminal window and we've looked a little bit at hue which is one of the most basic components of hadoop one of the original components for going through and looking at your data and your databases
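to recap the terminal part of that demo in one place here is a rough sketch of the commands assuming a hypothetical local file called mynewfile and the quickstart user cloudera the exact names typed in the video may differ slightly

hdfs dfs -ls /user/cloudera                                         # list the hadoop file system, empty at first
hdfs dfs -put mynewfile /user/cloudera/                             # copy the file from linux into hdfs
hdfs dfs -cat /user/cloudera/mynewfile                              # read it back out of hdfs
hdfs dfs -mv /user/cloudera/mynewfile /user/cloudera/mynewnewfile   # rename it inside hdfs
hdfs dfs -rm /user/cloudera/mynewnewfile                            # delete it again
hdfs dfs -mkdir /user/cloudera/mydir                                # make a directory
hdfs dfs -put mynewfile /user/cloudera/mydir/                       # put the file into that directory
hdfs dfs -ls /user/cloudera/mydir                                   # list the contents of the directory
hdfs dfs -help                                                      # list all the file system commands
hadoop version                                                      # show the hadoop and cdh versions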
of course now they're up to hue 4 and it's gone through a number of changes and you can see there are a lot of different choices in here for other tools in the hadoop ecosystem and i'll go ahead and just close out of this and one of the cool things with virtualbox is i can either save the machine state send the shutdown signal or power off the machine i'll go and just power off the machine completely now suppose you have a library that has a huge collection of books on each floor and you want to count the total number of books present on each floor what would be your approach you could say i will do it myself but then don't you think that will take a lot of time that's obviously not an efficient way of counting the number of books in this huge collection on every floor by yourself now there could be a different approach or an alternative to that you could think of asking three of your friends or three of your colleagues and you could then say if each friend could count the books on one floor then obviously that would make your work faster and easier to count the books on every floor now this is what we mean by parallel processing so when you say parallel processing in technical terms you are talking about using multiple machines and each machine would be contributing its ram and cpu cores for processing and your data would be processed on multiple machines at the same time now this type of process involves parallel processing in our case or in our library example you would have person 1 who would be taking care of books on floor 1 and counting them person 2 on floor 2 then you have someone on floor 3 and someone on floor 4 so every individual would be counting the books on their floor in parallel and that reduces the time consumed for this activity and then there should be some mechanism where all these counts from every floor can be aggregated so what is each person doing here each person is mapping the data of a particular floor or you can say each person is doing a kind of activity or basically a task on every floor and the task is counting the books on that floor then you could have some aggregation mechanism that could basically reduce or summarize this total count and in terms of mapreduce we would say that's the work of the reducer so when you talk about hadoop mapreduce it processes data on different node machines now this is the whole concept of the hadoop framework right that you not only have your data stored across machines but you would also want to process the data locally so instead of transferring the data from one machine to another machine or bringing all the data together into some central processing unit and then processing it you would rather have the data processed on the machines where it is stored so we know in case of a hadoop cluster we would have our data stored on multiple data nodes on their multiple disks and that is the data which needs to be processed but the requirement is that we want to process this data as fast as possible and that can be achieved by using parallel processing now in case of mapreduce we basically have the first phase which is your mapping phase so in the mapreduce programming model you basically have two phases one is mapping and one is reducing now who takes care of things in the mapping phase it is the mapper class and this mapper class has the function which is provided by the developer which takes care of these individual map tasks which will work on multiple nodes in parallel your reducer class belongs to the reducing phase so a reducing phase basically uses a
reducer class which provides a function that will aggregate and reduce the output from different data nodes to generate the final output now that's how your mapreduce works using mapping and then obviously reducing now you could have some other kinds of jobs which are map only jobs wherein there is no reducing required but we are not talking about those we are talking about our requirement where we would want to process the data using mapping and reducing especially when data is huge when data is stored across multiple machines and you would want to process the data in parallel so when you talk about mapreduce you could say it's a programming model you could say internally it's the processing engine of hadoop that allows you to process and compute huge volumes of data and when we say huge volumes of data we can talk about terabytes we can talk about petabytes exabytes and that amount of data which needs to be processed on a huge cluster we could also use the mapreduce programming model and run a mapreduce algorithm in local mode but what does that mean if you would go for a local mode it basically means it would do all the mapping and reducing on the same node using the processing capacity that is the ram and cpu cores of the same machine which is not really efficient in fact we would want to have our mapreduce work on multiple nodes which would obviously have a mapping phase followed by a reducing phase and in between there would be intermediate data generated and different other phases which help this whole processing so when you talk about hadoop mapreduce you are mainly talking about two main components or two main phases that is mapping and reducing mapping taking care of map tasks reducing taking care of reduce tasks so you would have your data which would be stored on multiple machines now when we talk about data data could be in different formats and the developer could specify what format needs to be used to understand the data which is coming in that data then goes through the mapping internally there would be some shuffling and sorting and then reducing which gives you your final output so the way we access data from hdfs or the way our data is getting stored on hdfs we have our input data which would have one or multiple files in one or multiple directories and your final output is also stored on hdfs to be accessed to be looked into and to see if the processing was done correctly so this is how it looks you have the input data which would then be worked upon by multiple map tasks now how many map tasks that basically depends on the file and on the input format so normally we know that in a hadoop cluster you would have a file which is broken down into blocks depending on its size so the default block size is 128 mb which can still be customized based on the average size of data which is getting stored on the cluster so if i have really huge files getting stored on the cluster i would certainly set a higher block size so that every file does not have a huge number of blocks creating a load on the name node's ram because that's tracking the number of elements or objects in your cluster so depending on your file size your file would be split into multiple chunks and for every chunk we would have a map task running now what is this map task doing that is specified within the mapper class so within the mapper class you have the mapper function which basically says what each of these map tasks has to do on each of the chunks which has to be processed
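to make mapping and reducing a little more concrete here is a tiny word count written as a plain unix pipeline on one machine this is only an analogy for the idea and not how hadoop actually runs it and it assumes a hypothetical local text file called books.txt

cat books.txt | tr -s ' ' '\n' | sort | uniq -c
# tr plays the mapper, emitting one word per line like a (word, 1) key value pair
# sort plays the shuffle and sort, bringing identical keys together
# uniq -c plays the reducer, aggregating each key into a final count

hadoop does the same three things but each stage runs as many tasks spread across the data nodes which is exactly what the next part walks through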
this intermediate data is written out locally where it is sorted and shuffled and then you have internal phases such as the partitioner which decides how many reduce tasks would be used or what data goes to which reducer you could also have a combiner phase which is like a mini reducer doing the same reduce operation before the data reaches the reducer then you have your reducing phase which is taken care of by a reducer class and internally the reducer function provided by the developer which would have reduce tasks running on the data which comes as an output from the map tasks finally your output is then generated and stored on hdfs now in case of hadoop it accepts data in different formats your data could be in a compressed format your data could be in parquet your data could be in avro text csv psv or binary format and all of these formats are supported however remember if you are talking about data being compressed then you also have to look into what kind of splittability the compression mechanism supports otherwise when mapreduce processing happens it would take the complete file as one chunk to be processed so hdfs accepts input data in different formats this data is stored in hdfs and that is basically our input which then passes through the mapping phase now what is the mapping phase doing as i said it reads record by record depending on the input format so we have multiple map tasks running on multiple chunks once this data is read it is broken down into individual elements and when i say individual elements i could say this is my list of key value pairs so your records based on some kind of delimiter or without a delimiter are broken down into individual elements and thus your map creates key value pairs now these key value pairs are not my final output these key value pairs are basically a list of elements which will then be subjected to further processing so you would have internal shuffling and sorting of data so that all the relevant key value pairs are brought together which basically benefits the processing and then you have your reducing which aggregates the key value pairs into a set of smaller tuples finally your output is stored in the designated directory as a list of aggregated key value pairs which gives you your output so when we talk about mapreduce one of the key factors here is the parallel processing which it can offer so we know that we have data getting stored across multiple data nodes and you would have a huge volume of data which is split and randomly distributed across data nodes and this is the data which needs to be processed and the best way would be parallel processing so you could have your data getting stored on multiple data nodes or multiple slave nodes and each slave node would again have one or multiple disks and to process this data we basically have to go for a parallel processing approach we have to use mapreduce now let's look at the mapreduce workflow to understand how it works so basically you have your input data stored on hdfs now this is the data which needs to be processed it is stored in input files and the processing which you want can be done on one single file or it can be done on a directory which has multiple files you could also later have multiple outputs merged which we achieve by using something called chaining of mappers so here you have your data getting stored on hdfs now the input format is basically something to define the input specification and how the input
files will be split so there are various input formats now we can search for that so we can go to google and basically search for the hadoop mapreduce yahoo tutorial this is one of the good links and if i look into this link i can search for different input formats and output formats so let's search for input format so when we talk about input format you basically have something that defines how input files are split so input files are split up and read based on what input format is specified this is a class that provides the following functionality it selects the files or other objects that should be used for input it defines the input splits that break a file up into tasks and it provides a factory for record reader objects that read the file so there are different formats if you look in the table here and you can see that the text input format is the default format which reads lines of a text file and each line is considered a record here the key is the byte offset of the line and the value is the line content it says you can have the key value input format which parses lines into key value pairs everything up to the first tab character is the key and the remainder of the line is the value you could also have the sequence file input format which basically works on a binary format so you have input formats and in the same way you can also search for output format which takes care of how the data is handled after the processing is done so the key value pairs provided to the output collector are then written to output files and the way they are written is governed by the output format so it functions pretty much like the input format as described earlier right so we could set what output format is to be followed and again you have text output format sequence file output format null output format and so on so these are different classes which take care of how your data is handled when it is being read for processing or how the data is written when the processing is done so based on the input format the file is broken down into splits and a split logically represents the data to be processed by an individual map task or you could say an individual mapper function so you could have one or multiple splits which need to be processed depending on the file size and depending on what properties have been set now once this is done you have your input splits which are subjected to the mapping phase internally you have a record reader which communicates with the input split and converts the data into key value pairs suitable to be read by the mapper and what is the mapper doing it is basically working on these key value pairs the map task gives you an intermediate output which would then be forwarded for further processing so once that is done and we have these key value pairs being worked upon by map your map tasks as part of your mapper function are generating key value pairs which are your intermediate outputs to be processed further now you could have as i said a combiner phase which is internally a mini reducer phase now the combiner does not have its own class so the combiner basically uses the same reducer class provided by the developer and its main work is to do the reducing or to do some kind of mini aggregation on the key value pairs which were generated by map so once the data is coming in from the combiner then we have internally a partitioner phase which decides how outputs from combiners are sent to the reducers or you could also say that even if i did not have a combiner the partitioner would decide based on the
keys and values based on the type of keys how many reducers would be required or how many reduce tasks would be required to work on the output which was generated by the map tasks now once the partitioner has decided that your data would then be sorted and shuffled and fed into the reducer so when you talk about your reducer it would basically have one or multiple reduce tasks now that depends on what the partitioner decided or determined for your data to be processed it can also depend on the configuration properties which have been set to decide how many reduce tasks should be used now internally all this data is obviously going through sorting and shuffling so that your reducing or aggregation becomes an easier task once we have this done we basically have the reducer where the code for the reducer is provided by the developer and all the intermediate data has then to be aggregated to give you a final output which would then be stored on hdfs and who does this you have an internal record writer which writes these output key value pairs from the reducer to the output files now this is how your mapreduce works wherein the final output data can not only be stored but then read or accessed from hdfs or even used as an input for further mapreduce kind of processing so this is how it looks overall you basically have your data stored on hdfs based on the input format you have the splits then you have the record reader which gives your data to the mapping phase which is then taken care of by your mapper function and mapper function basically means one or multiple map tasks working on your chunks of data you could have a combiner phase which is optional and not mandatory then you have a partitioner phase which decides how many reduce tasks or how many reducers would be used to work on your data internally there is sorting and shuffling of data happening and then based on your output format your record writer will write the output to an hdfs directory now internally you should also remember that data is being processed locally so you would have the output of each task which is being worked upon stored locally however we do not access the data directly from the data nodes we access it from hdfs so our output is stored on hdfs so that is your mapreduce workflow when you talk about mapreduce architecture now this is how it would look you would basically have an edge node or a client program or an api which intends to process some data so it submits the job to the job tracker or you can say the resource manager in case of the hadoop yarn framework now before this step we can also say that an interaction with the name node would have already happened which would have given information about the data nodes which have the relevant data stored then your master process so in hadoop version 1 we had the job tracker and the slaves were called task trackers in hadoop version 2 instead of the job tracker you have the resource manager and instead of task trackers you have node managers so basically your resource manager has to assign the job to the task trackers or node managers and your node managers as we discussed in yarn are basically taking care of the processing which happens on every node
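submitting a job from that edge node is just one command so as a rough sketch here is how you could run the word count example that ships with hadoop it assumes HADOOP_HOME is set and hypothetical /input and /output paths on hdfs where the output directory does not exist yet

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount -D mapreduce.job.reduces=2 /input /output
# -D asks for two reduce tasks instead of the default, the resource manager then hands the map and reduce tasks to the node managers
hdfs dfs -cat /output/part-r-00000
# each reducer writes its own part file containing the aggregated key value pairs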
What happens internally once your application is submitted? Your application, to be run on the yarn processing framework, is handled by the resource manager. Forget about the yarn part as of now, that is, who does the negotiating of resources, who allocates them, how the processing happens on the nodes; that's all to do with how yarn handles the processing request. So you have your data which is stored in HDFS, broken down into one or multiple splits depending on the input format which has been specified by the developer. Your input splits are to be worked upon by one or multiple map tasks, which will be running within containers on the nodes, so you have resources being utilized: for each map task there is some amount of RAM which will be used, and then the same data which has to go through the reducing phase, that is your reduce tasks, will also be utilizing some RAM and CPU cores. Internally you have these functions which take care of deciding on the number of reducers, doing a mini reduce, and reading and processing the data from multiple data nodes. This is how your mapreduce programming model makes parallel processing work and processes your data which is stored across multiple machines, and finally you have your output which gets stored on HDFS. So let's have a quick demo on mapreduce and see how it works on a hadoop cluster. We have discussed briefly that mapreduce contains mainly two phases, your mapping phase and your reducing phase; the mapping phase is taken care of by your mapper function and the reducing phase by your reducer function. In between we also have sorting and shuffling, and then you have other phases, the partitioner and the combiner, and we will discuss all of those in detail in later sessions. But let's have a quick demo on how we can run a mapreduce which already exists as a packaged jar file within your apache hadoop cluster, or even in your cloudera cluster. We can build our own mapreduce programs, package them as a jar, transfer them to the cluster and then run them on a hadoop cluster on yarn, or we can use the already provided default programs. So let's see where they are. These are my two machines which I have brought up, and this would have my apache hadoop cluster running. We can just do a simple start-all.sh; I know that this script is deprecated and it says instead use start-dfs and start-yarn, but it will still take care of starting my cluster on these two nodes, where I would have one name node, two data nodes, one secondary name node, one resource manager and two node managers. If you have any doubt about how this cluster came up, you can always look at the previous sessions where we had a walkthrough of setting up a cluster on apache; you could have your apache cluster running using less than 3 GB of your total machine RAM. Once this cluster comes up we will also have a look at the web UI which is available for the name node and the resource manager. Based on the settings which we have given, our UIs will show us the details of our cluster, but remember the UI is only to browse. Here my cluster has come up; I can just do a jps to look at the java related processes, and that shows me the processes which are running on c1, which are your data node, resource manager, node manager and name node, and on my m1 machine, the second machine which I have configured here, I can always do a jps and that shows me the processes running there as well.
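As mentioned above, each map and reduce task runs inside a container and consumes RAM and vcores. Here is a hedged sketch of the standard per-task resource properties, set through the Configuration API purely for illustration; the particular values are made-up examples, not the settings used on this demo cluster:

```java
import org.apache.hadoop.conf.Configuration;

public class TaskResourceSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Memory requested for each map / reduce task container (in MB).
    conf.set("mapreduce.map.memory.mb", "1024");
    conf.set("mapreduce.reduce.memory.mb", "2048");
    // Virtual cores requested for each map / reduce task.
    conf.set("mapreduce.map.cpu.vcores", "1");
    conf.set("mapreduce.reduce.cpu.vcores", "1");
    // Number of reduce tasks for the job (the map task count comes from the splits).
    conf.set("mapreduce.job.reduces", "2");
    // A Job built from this Configuration would carry these requests to YARN.
    System.out.println("map memory (MB) = " + conf.get("mapreduce.map.memory.mb"));
  }
}
```

If I remember right, the bundled example programs also accept the same keys as -D options on the hadoop jar command line, but treat that as an assumption and check your version's documentation.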
This also means that my cluster is up with two data nodes and two node managers, and here I can have a look at my web UI, so I can just do a refresh, and the same with the other one; I had already opened the web pages. You can always access the web UI using your name node's hostname and the 50070 port. It tells me what my cluster id is, what my block pool id is, it gives you information on the space usage and how many live nodes you have, and you can even browse your file system. I have put in a lot of data here; I can click on browse the file system, and this shows me multiple directories, and these directories have one or multiple files which we will use for our mapreduce example. If you see here, these are my directories which have some sample files, although these files are very small, like 8.7 kilobytes, if you look into this directory. In this other one I have just pulled in some of my hadoop logs and put them on HDFS, and these are a little bigger files, and then we also have some other data which we can see here, data which I have downloaded from the web. Now we can either run a mapreduce on a single file or on a directory which contains multiple files; let's look at that. Before looking at the demo on mapreduce, also remember that mapreduce will create an output directory, so we need a location for the output, plus we need the permissions to run the mapreduce job. By default, since I'm running it using the admin id, I should not have any problem, but if you intend to run mapreduce with a different user then you will have to ask the admin to give that user permission to read and write on HDFS. This is the directory which I have created and which I intend to use for my output once the mapreduce job finishes, and this is my cluster file system. If you look on this other UI, it shows me my yarn, which is available for taking care of any processing. As of now it shows that I have a total of 8 GB memory and 8 vcores; that depends on what configuration we have set and how many nodes are available. We can look at the nodes which are available, and that shows me I have two node managers running, each with 8 GB memory and 8 vcores. That's not actually true, but we have not set the configurations for the node managers, and that's why it takes the default properties, that is 8 GB and 8 vcores. So this is my yarn UI. We can also look at the scheduler, which shows me the different queues, if they have been configured, where you will have to run the jobs; we'll discuss all of these in detail later. Now let's go back to our terminal and see where we can find some sample applications which we can run on the cluster. Once I go to the terminal, I can submit the mapreduce job from any terminal. Here I know that my hadoop related directory is here, and within hadoop you have various directories; we have discussed that in the binaries you have the commands which you can run, in sbin you have the startup scripts, and here you also notice there is a share directory. If you look in the share directory you will find hadoop, and within hadoop you have various sub-directories, in which we will look for mapreduce. This mapreduce directory has some sample jar files which we can use to run a mapreduce on the cluster. Similarly, if you are working on a cloudera cluster, you would go into opt cloudera parcels CDH lib, and in that you would have directories for hdfs, mapreduce and yarn, where you can still find the same jars.
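Browsing the HDFS directories, as we just did from the name node web UI, can also be done programmatically with the FileSystem API. A hedged sketch, assuming the cluster's core-site.xml is on the classpath so fs.defaultFS points at this cluster; the "/" path simply mirrors browsing from the root:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml, i.e. the same cluster the web UI shows.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent to clicking "browse the file system" at the root;
    // replace "/" with any directory you want to inspect.
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.printf("%s\t%d bytes\t%s%n",
          status.isDirectory() ? "dir " : "file",
          status.getLen(),
          status.getPath());
    }
  }
}
```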
It is basically a package which contains multiple applications. So how do we run a mapreduce? We can just type in hadoop and hit enter, and that shows me there is an option called jar which can be used to run a jar file. At any point of time, if you want to see what different classes are available in a particular jar, you can always do a jar -xvf; for example I could say jar xvf, then user local hadoop share hadoop mapreduce, and then list the jar file, so hadoop-mapreduce-examples. If I do this, it unpacks the jar to show me what classes are available within it: it has created a META directory and it has created an org directory, and we can see that by doing an ls. Since I ran the command from the home directory, I can look into org apache hadoop examples, which shows me the classes I have, and those classes contain the mapper and reducer classes; it might not be just a mapper and a reducer, but you can always have a look. For example, I am targeting the word count program, which does a word count on files and gives me a list of words and how many times they occur in a particular file or in a set of files, and this shows me the classes which belong to word count: we have IntSumReducer, which is the reducer class, and TokenizerMapper, which is the mapper class, and these classes are used if you run a word count. There are many other programs which are part of this jar file, and we can expand and see that: I can say hadoop jar and give the path, so hadoop jar user local hadoop share hadoop mapreduce hadoop-mapreduce-examples, and if I hit enter, that shows me the inbuilt programs which are already available. These are certain things which we can use, and there are other jar files too in this particular path: you can always look for hadoop-mapreduce-client-jobclient and then the tests one, ending with tests, which is also an interesting one. In my previous example, running hadoop jar on the examples jar showed me all the programs available, and that already has a word count; there are other good programs which you can try, like teragen to generate dummy data, terasort to check your sorting performance, and teravalidate to validate the results. Similarly we can also do a hadoop jar, as I said, on the hadoop-mapreduce-client-jobclient tests jar, which has a lot of other classes or programs which can be used for doing stress testing or checking your cluster status and so on; one interesting one there is TestDFSIO. But let's not get into all the details in the first instance; let's see how we can run a mapreduce. If I want to run a mapreduce, I need to give hadoop jar and then my jar file, and if I hit enter it says it needs to know which class you want to run, plus the input and output; so for example I would say wordcount, and again if I hit enter it tells me that I need to give it some input and an output to process.
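For reference, the two classes named above look roughly like this; this is a hedged sketch modeled on the stock Hadoop WordCount example rather than the exact source shipped in the jar:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: splits each input line into tokens and emits (word, 1) pairs.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer (also usable as the combiner): sums the counts for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```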
Obviously this processing will be happening on the cluster, that is our yarn processing framework, unless you want to run the job in local mode; there is a possibility of running the job in local mode, but let's first try how it runs on the cluster. So how do we do that? Here I can do an hdfs dfs -ls / command to see what I have on HDFS. Through my UI I was already showing you that we have a set of files and directories which we can use for processing. We can take up one single file; for example, if I pick up new data, I can look into the files we have there, and we can run a mapreduce on a single file or on multiple files. Let's take this file, whatever it contains, and I would like to do a word count so that I get a list of words and their occurrences in this file, so let me just copy this. I also need my output to be written, and that will be written here. So if I want to run a mapreduce, I can say hadoop jar, which we can pull out from history, then the examples jar and wordcount; now I need to give my input, so that will be new data and then the file which we just copied. I am going to run the word count only on a single file, and my output will be stored in the directory which I have created already, mr output. This is fair enough, though you can give many other properties: you can specify how many map tasks you want to run, how many reduce tasks, whether you want your output to be compressed or merged, and many other properties can be defined when you are specifying wordcount, and you can pass in an argument to pass properties from the command line which will affect your output. Once I go ahead and submit this, it is basically running a simple inbuilt mapreduce job on our hadoop cluster, and internally it will be looking for the name node. Now we have some issue here, and it says the output already exists. What does that mean? It means that hadoop will create the output directory for you; you just need to give a name, but you don't need to create it. So let's append the output name with the number one and then go ahead and run this. I've submitted the command; this can also be done in the background if you want to run multiple jobs on your cluster at the same time. It says total input paths to process: 1, so there is only one split on which the job has to work; it will internally try to contact the resource manager, and basically this is done. Here we can have a look and see some counters. What I also see is that, because of some property which is missing, it has run the job but in local mode, even though we submitted it to the cluster; this might be related to my yarn settings, and we can check that. If I do a refresh after running my application, it has completed and it would have created an output, but the only thing is it did not interact with yarn, it did not interact with the resource manager; we can check those properties. If we look into the job, it tells me that it went through mapping and reducing, it created an output, it worked on my file, but it ran in local mode. Remember, mapreduce is a programming model: if you run it on yarn, you get the facilities of running it on a cluster where yarn takes care of resource management; if you don't run it on yarn and run it in local mode, it will use your machine's RAM and CPU cores for processing.
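The "output already exists" error comes from the job driver refusing to overwrite an existing directory. Here is a hedged sketch of a driver that wires together the mapper and reducer sketched above; the delete-if-exists step is my own convenience addition for illustration, not something the bundled example does:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);

    // Convenience step (not in the bundled example): remove a stale output
    // directory so the job does not fail with "output already exists".
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(output)) {
      fs.delete(output, true);  // recursive delete
    }

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // mini reduce on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```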
But then we can quickly look at the output, and after that we can also try running this on yarn. If I look into my HDFS and into my output directory, mr output, that's the directory which was not actually used; let's look into the other directory, the one ending with one, and that should show me the output created by this mapreduce. Although it ran in local mode, it fetched the input file from HDFS and it created the output in HDFS. That's my part file which was created, part-r-00000, and if you had more than one reducer running you would have multiple such files created. We can look into what this file contains, which should be my word count; here I can say cat, which shows me the output created by my mapreduce. Let's have a look: the file which we gave for processing has been broken down, and now we have the list of words which occur in the file plus a count for those words; if some word occurs more often, it shows me that higher count. So this is a list of my words and the count for each, and this is how we run a sample mapreduce job; I will also show you how we can run it on yarn. So now let's run mapreduce on yarn. Initially, when we tried running a mapreduce, it did not hit yarn but ran in local mode, and that was because there was a property which had to be changed in the mapred-site file. If you look into this file, the error was that I had given a property named mapred dot framework dot name, and that was not the right property name, so it was ignored, and that's why the job ran in local mode. I changed the property to mapreduce.framework.name, restarted my cluster, and everything should be fine now, and that mapred-site file has also been copied across the nodes. So, to run a mapreduce on a hadoop cluster so that it uses yarn, and yarn takes care of resource allocation on one or multiple machines, I'm just changing the output here and now I will submit this job, which should first connect to the resource manager; if it connects to the resource manager, that means our job will be run using yarn on the cluster rather than in local mode. Now we have to wait for this application to internally connect to the resource manager, and once it starts there we can always go back to the web UI and check if our application has reached yarn. It shows me that there is one input path to be processed; that's my job id, that's my application id, and you can even monitor the status from the command line. Here the job has been submitted, so let's go back and just do a refresh on my yarn UI, which should show me the new application which was submitted. It tells me that it is in an accepted state, the application master has already started, and if you click on this link it will also give you more details of how many map and reduce tasks would run. As of now it says the application master is running and it would be using this node, which is m1; we can always look into the logs, and we can see that there is one task attempt being made. Now if I go back to my terminal I will see that it is waiting to get some resources from the cluster, and once it gets the resources it will first start with the mapping phase, where the mapper function runs; it does the map tasks, one or multiple depending on the splits, and right now we have one file and one split, so we will have just one map task running. Once the mapping phase completes, it will get into reducing, which will finally give me my output.
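The property that decided local mode versus yarn above is mapreduce.framework.name. Here is a minimal hedged sketch of the same choice expressed through the Configuration API; normally it lives in mapred-site.xml, and the value shown is simply the usual one for a yarn cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FrameworkNameSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "local" runs the job in-process on the submitting machine;
    // "yarn" submits it to the cluster's resource manager.
    conf.set("mapreduce.framework.name", "yarn");
    // A misspelled key such as "mapred.framework.name" is simply ignored,
    // which is exactly why the first demo run fell back to local mode.
    Job job = Job.getInstance(conf, "framework-name-sketch");
    System.out.println(job.getConfiguration().get("mapreduce.framework.name"));
  }
}
```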
So we can toggle through these sessions; here I can just do a refresh to see what is happening with my application: is it proceeding, or is it still waiting for the resource manager to allocate some resources? Just a couple of minutes back I tested this application on yarn, and we can see that my first application completed successfully; here we will have to give it some time so that yarn can allocate the resources. If the resources were being used by some other application, they will have to be freed up first; internally yarn takes care of all that, which we will learn in more detail in the yarn part, or you might have already followed the yarn based session. Here we just have to give it some more time and see if my application proceeds with the resources yarn can allocate to it. Sometimes you can also see a slowness in what the web UI shows, and that can be related to the amount of memory you have allocated to your nodes; for apache we can have a smaller amount of memory and still run the cluster, and as I said, the memory which shows up here, 16 GB and 16 cores, is not the true figure, those are the default settings, but my yarn should still be able to facilitate running this application. Let's just give it a couple of seconds and then look into the output. Again, I had to make some changes in the settings because our application was not getting enough resources, and then I restarted my cluster. Now let's submit the application again to the cluster; it should first contact the resource manager and then the map and reduce processing should start. So here I have submitted an application, it is connecting to the resource manager, and then internally it starts an app master, that is the application master; it looks for the number of splits, which is one, it gets the application id and it then starts running the job; it also gives you a tracking URL to look at the output. Now we should go back and look at our yarn UI to see if our application shows up there, and we will have to give it a couple of seconds for the final status to change to running, and that's where my application will be getting resources. If you notice closely, here I have allocated a specific amount of memory, that is 1.5 GB for the node manager on every node, and I have given two cores each, which my machines also have, and my yarn should be utilizing these resources rather than going for the defaults. Now the application has started moving, and we can see the progress bar here, which shows what is happening; if we go back to the terminal, it shows that first it went into deciding map and reduce, it goes for map, and once the mapping phase completes then the reducing phase comes into existence, and here my job has completed. We can always look at how many map and reduce tasks were run; it shows me that there was one map and one reduce task. The number of map tasks depends on the number of splits, and we had just one file which is less than 128 MB, so that was one split to be processed, while the number of reduce tasks is decided internally or depending on what kind of property has been set in the hadoop config files. It also tells me how many input records were read, which basically means the number of lines in the file, and the output records, which gives me the total number of words in the file; there might be duplicates, which are handled by the internal combiner before forwarding that information to the reducer.
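The application id mentioned above can be monitored from the command line; as a hedged alternative, the same report is also available programmatically through the YarnClient API. A minimal sketch, assuming the cluster's yarn-site.xml is on the classpath:

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApplicationsSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());  // reads yarn-site.xml for the RM address
    yarnClient.start();

    // One report per submitted application: state (ACCEPTED, RUNNING, FINISHED...),
    // progress, and the tracking URL shown on the resource manager web UI.
    for (ApplicationReport report : yarnClient.getApplications()) {
      System.out.printf("%s  %-20s  %-10s  %.0f%%  %s%n",
          report.getApplicationId(),
          report.getName(),
          report.getYarnApplicationState(),
          report.getProgress() * 100,
          report.getTrackingUrl());
    }
    yarnClient.stop();
  }
}
```

This is roughly the information the yarn command-line application listing prints, just exposed through Java.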
The reducer then works on 335 records and gives us the list of words and their count. If I do a refresh here, it obviously shows that my application has completed, it says succeeded, and you can always click on the application to look for more information; it tells me where it ran. We do not have a history server running as of now, otherwise we could access more information; this link leads to the history server where all your applications are stored, but I can click on the attempt and the tasks, and this shows me the history URL, or you can always look into the logs. So this is how you can submit a sample application, one which is inbuilt and available in the jar, on your hadoop cluster, and that will utilize your cluster to run. As I said, when you are running a particular job, remember to change the output directory, and if you do not want it to process a single individual file, you could also point it to a directory, which means it will have multiple files, and depending on the file sizes there would be multiple splits, and according to that, multiple map tasks will be selected. If I click on this, it submits my second application to the cluster, which should first connect to the resource manager, and then the resource manager has to start an application master; here we are targeting 10 splits. You sometimes have to give your machines a couple of seconds so that the resources which were used earlier are internally freed up, so that your cluster can pick the job up and your yarn can take care of the resources. Right now my application is in an undefined status, but as soon as yarn provides it the resources, we will have the application running on our yarn cluster. It has already started, and if you see, it is going further; it would launch 10 map tasks, and the number of reduce tasks would be decided either by the way your data is or based on the properties which have been set at your cluster level. Let's just do a quick refresh here on my yarn UI to show me the progress. Also take care that when you are submitting your application you need to have the output directory mentioned; however, do not create it, hadoop will create that for you. Now this is how you run a mapreduce without specifying properties, but you can specify more properties: you can look into what can be changed for your mapper and reducer, or have a combiner class which can do a mini reducing, and all those things can be done; we will learn about that in the later sessions. Now we will compare with hadoop version one, that is with mapreduce version one; we will understand and learn about the limitations of hadoop version one, the need for yarn, what yarn is, what kinds of workloads can run on yarn, what the yarn components are, what the yarn architecture is, and finally we will see a demo on yarn. Hadoop version one, or mapreduce version one, is outdated now and nobody is using it, but it is good to understand what was in hadoop version one and what its limitations were, because they brought in the thought for the future processing layer, that is yarn. When we talk about hadoop, we already know that hadoop is a framework and that hadoop has two layers. One is your storage layer, that is HDFS, the hadoop distributed file system, which allows for distributed storage and processing, which provides fault tolerance through inbuilt replication, and which allows you to store a huge amount of data across multiple commodity machines.
When we talk about processing, we know that mapreduce is the oldest and the most mature processing programming model, which takes care of your data processing on your distributed file system. In hadoop version 1, mapreduce performed both data processing and resource management, and that's where it was problematic. In mapreduce version 1, when we talk about the processing layer, we had the master, which was called the job tracker, and then you had the slaves, which were the task trackers. Your job tracker was taking care of allocating resources, it was performing scheduling and even monitoring the jobs, and it was taking care of assigning map and reduce tasks to the jobs running on the task trackers; and the task trackers, which were co-located with the data nodes, were responsible for processing the jobs. So the task trackers were the slaves of the processing layer, and they reported their progress to the job tracker. That is what was happening in hadoop version 1. In hadoop version 1 you would have, say, client machines or an API or an application which submits the job to the master, that is the job tracker. Obviously we cannot forget that there would already be an involvement from the name node, which tells us which machines, or which data nodes, have the relevant data stored. Once the job submission happens, the job tracker, being the master daemon taking care of your processing request and also of resource management and job scheduling, would then interact with your multiple task trackers running on multiple machines. Each machine would have a task tracker running, and that task tracker, which is a processing slave, would be co-located with the data nodes. We know that in hadoop you have the concept of moving the processing to wherever the data is stored, rather than moving the data to the processing layer, so we would have task trackers running on multiple machines, and these task trackers would be responsible for handling the tasks; the tasks are the application broken down into smaller pieces, each working on the data that is stored on that particular node. These were your slave daemons. Your job tracker was also tracking the resources: your task trackers were sending heartbeats, sending in packets of information to the job tracker, which would then know how many resources were available, and when we talk about resources we are talking about the CPU cores and the RAM available on every node. So the task trackers would be sending their resource information to the job tracker, and your job tracker would be aware of what amount of resources is available on a particular node, how loaded a particular node is, and what kind of work could be given to that task tracker. So the job tracker was taking care of resource management, it was also breaking the application into tasks, and it was doing the job scheduling part, assigning different tasks to these slave daemons, that is your task trackers. The job tracker was eventually overburdened, because it was managing jobs, it was tracking the resources from multiple task trackers, and it was taking care of job scheduling. And in case the job tracker failed, it would affect the overall processing: if the master daemon dies, the processing cannot proceed. This was one of the limitations of hadoop version one.
When you talk about scalability, that is the capability to scale, a single job tracker becomes a bottleneck: you could not have a cluster size of more than 4000 nodes and could not run more than 40,000 concurrent tasks. That's just a number; we could always look into the individual resources each machine has and come up with an appropriate figure, but with a single job tracker there was no horizontal scalability for the processing layer, because we had a single processing master. When we talk about availability, the job tracker, as I mentioned, was a single point of failure: any failure kills all the queued and running jobs, and the jobs have to be resubmitted. Why would we want that? In a distributed platform, in a cluster which has hundreds or thousands of machines, we would want a processing layer which can handle a huge amount of processing, which is more scalable, more available, and can handle different kinds of workloads. When it comes to resource utilization, having a predefined number of map and reduce slots for each task tracker creates resource utilization issues, and that again puts a burden on the master, which is tracking these resources and has to assign the jobs which run on multiple machines in parallel. Then there were the limitations in running non-mapreduce applications; that was one more limitation of hadoop version one: the only kind of processing you could do was mapreduce, and although the mapreduce programming model is good, the oldest, and has matured over a period of time, it is very rigid; you have to follow the mapping and reducing approach, and that was the only kind of processing which could be done in hadoop version one. So when it came to doing real-time analysis, ad hoc querying, graph based processing or massive parallel processing, there were limitations, because that could not be done in hadoop version 1, which had mapreduce version 1 as the processing component. That brings us to the need for yarn. Yarn stands for yet another resource negotiator. As I mentioned before, in hadoop version one you could have applications written in different programming languages, but the only kind of processing which was possible was mapreduce; we had the storage layer and we had processing, but only a limited kind of processing could be done. This was one thing which brought in the thought: why shouldn't we have a processing layer which can handle different kinds of workloads, be it graph processing, real-time processing, massive parallel processing, or any other kind of processing which an organization requires? Being designed to run mapreduce jobs only, and having issues in scalability, resource utilization, job tracking and so on, led to the need for something we call yarn. From hadoop version 2 onwards, the two main layers have changed a little: you have the storage layer, which is intact, that is your HDFS, and then you have the processing layer, which is called yarn, yet another resource negotiator. We will understand how yarn works; yarn takes care of your processing layer, it does support mapreduce, so mapreduce processing can still be done, but now you also have support for other processing frameworks. Yarn can be used to solve the issues which hadoop version 1 was posing, things like resource management, different kinds of workload processing, scalability and resource utilization; all of that is now taken care of by yarn.
When we talk about yarn, we can now have a cluster size of more than 10,000 nodes and run more than 100,000 concurrent tasks, and that takes care of your scalability. When you talk about compatibility, applications which were developed for hadoop version 1, which were primarily mapreduce kind of processing, can run on yarn without any disruption or availability issues. When you talk about resource utilization, there is a mechanism which takes care of dynamic allocation of cluster resources, and this improves the resource utilization. When we talk about multi-tenancy, the cluster can now handle different kinds of workloads: you can use open source and proprietary data access engines, you can perform real-time analysis, graph processing and ad hoc querying, and these multiple workloads can run in parallel. That is what yarn offers. So what is yarn? As I mentioned, yarn stands for yet another resource negotiator; it is the cluster resource management layer of your apache hadoop ecosystem, which takes care of scheduling jobs and assigning resources. Just imagine: when you want to run a particular application, you would be telling the cluster that you want resources to run your application. That application might be a mapreduce application, it might be a hive query which is triggering a mapreduce, a pig script which is triggering a mapreduce, hive with tez as an execution engine, a spark application, or a graph processing application; in any of these cases you, as in the client, or an API, or the application, would be requesting resources, and yarn takes care of that: yarn provides the desired resources. When we talk about resources, we are mainly talking about the network related resources, the CPU cores, or, in yarn's terms, virtual CPU cores, and the RAM, in MB, GB or terabytes, which is offered from multiple machines, and yarn takes care of this. So with yarn you can handle different workloads, and these are some of the workloads showing up here: you have the traditional mapreduce, which is mainly batch oriented; you could have an interactive execution engine such as tez; you could have hbase, which is a column oriented, or you could say four dimensional, database, and that would not only be storing data on HDFS but would also need some kind of processing; you could have streaming functionality, which could come from storm or kafka or spark; you could have graph processing; you could have in-memory processing such as spark and its components; and you could have many others. These are the different frameworks which can now run on top of yarn. So how does yarn do that? When we talk about yarn, this is how the overall yarn architecture looks. At one end you have the client; the client could be your edge node where you have some applications running, it could be an API which wants to interact with your cluster, or it could be a user triggered application which wants to run some jobs doing some processing. This client submits a job request. And what is the resource manager doing? The resource manager is the master of your processing layer.
In hadoop version 1 we basically had the job tracker, and then we had the task trackers running on individual nodes; the task trackers were sending heartbeats to the job tracker, they were sending it their resource information, and the job tracker was the one tracking the resources and doing the job scheduling, and that's how, as I mentioned earlier, the job tracker was overburdened. The job tracker is now replaced by your resource manager, which is the master of your processing layer; your task trackers are replaced by node managers, which run on every node; and we have a temporary daemon, which you see here in blue, and that's your app master. This is what we mean when we say yet another resource negotiator; the app master exists from hadoop version 2 onwards. Now, your resource manager, being the master of the processing layer, is already receiving heartbeats, and you can say resource information, from the multiple node managers running on one or multiple machines, and these node managers are not only updating their status but also giving information on the amount of resources they have. When we talk about resources, we should understand that a node manager has been allocated some amount of RAM for processing and some number of CPU cores, and that is just a portion of what the complete node has. So if my node has, say, around 100 GB RAM and, say, 60 cores, all of that cannot be allocated to the node manager; the node manager is just one of the components of the hadoop ecosystem, the slave of the processing layer. Keeping in mind all the other aspects, such as the different services which are running, maybe cloudera or hortonworks related services, and the system processes running on a particular node, only some portion of this is assigned to the node manager for processing; so we could say, for example, 60 GB RAM per node and, say, 40 CPU cores is what is allocated to the node manager on every machine, and similarly for the other nodes. The node manager is constantly giving an update to the resource manager about the resources it has; there might be some other applications running and the node manager might already be occupied, so it keeps giving updates. We also have the concept of containers, which we will talk about, which is basically these resources being broken down into smaller parts. So the resource manager is keeping track of the resources which every node manager has, and it is also responsible for taking care of the job requests. How do these things happen? At a higher level you can always say the resource manager is the processing master which does everything, but in reality it is not the resource manager alone which is doing it; it internally has different services or components which help it do what it is supposed to do.
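Before looking at those components, here is a hedged sketch of the two standard yarn-site.xml properties behind the "60 GB RAM and 40 vcores per node manager" example above, shown through the Configuration API purely for illustration (the values themselves are just the example figures used in the text):

```java
import org.apache.hadoop.conf.Configuration;

public class NodeManagerResourceSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Slice of each worker node handed to the node manager for containers,
    // matching the 60 GB / 40 vcores example above (illustrative values).
    conf.set("yarn.nodemanager.resource.memory-mb", String.valueOf(60 * 1024));
    conf.set("yarn.nodemanager.resource.cpu-vcores", "40");
    System.out.println("NM memory (MB): " + conf.get("yarn.nodemanager.resource.memory-mb"));
    System.out.println("NM vcores     : " + conf.get("yarn.nodemanager.resource.cpu-vcores"));
  }
}
```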
Now let's look further. As I mentioned, your resource manager has these services or components which help it do what it does; it is basically an architecture where multiple components work together to achieve what yarn allows. The resource manager has mainly two components, that is your scheduler and your applications manager, and at a high level there are four main components here: the resource manager, which is the processing master; the node managers, which are the processing slaves running on every node; the concept of a container; and the concept of an application master. How do all these things work? Let's look at the yarn components. The resource manager, as said, has two main components which assist it in doing what it is capable of: the scheduler and the applications manager. When you talk about resources, there is always a requirement from the applications which need to run on the cluster: your application, which was submitted by the client, needs resources, and these resources come from multiple machines, wherever the relevant data is stored and a node manager is running; we always know that the node manager is co-located with the data nodes. So what does the scheduler do? We have different kinds of schedulers: a capacity scheduler, a fair scheduler, or a FIFO scheduler, and these take care of resource allocation. Your scheduler is responsible for allocating resources to the various running applications. Imagine an environment where you have different teams or departments working on the same cluster; we would call that a multi-tenant cluster, and on a multi-tenant cluster you would have different applications wanting to run simultaneously, accessing the resources of the cluster. How is that managed? There has to be some component with a concept of pooling or queuing, so that different departments or different users can get dedicated resources or can share resources on the cluster. So the scheduler is responsible for allocating resources to the various running applications; it does not perform monitoring or tracking of the status of applications, that is not the scheduler's job, and it does not offer any guarantee about restarting tasks that failed due to hardware, network or any other failures; the scheduler is mainly responsible for allocating resources. As I mentioned, you could have different kinds of schedulers: a FIFO scheduler, which was mainly used in older versions of hadoop and stands for first in first out; a fair scheduler, which basically means multiple applications could be running on the cluster and they would get a fair share of the resources; or a capacity scheduler, which would have dedicated or fixed amounts of resources across the cluster. Whichever scheduler is being used, it is mainly responsible for allocating resources. Then there is your applications manager, which is responsible for accepting job submissions. As I said, at a higher level we could always say the resource manager is doing everything, it is allocating the resources, negotiating the resources, listening to the clients and taking care of job submissions, but who is really doing it? It is these components. The applications manager is responsible for accepting job submissions, it negotiates the first container for executing the application specific application master, and it provides the service for restarting the application master if it fails. How does all this work in coordination? As I said, your node manager is the slave process running on every machine; the slave is tracking the resources it has, it is tracking the processes, it is taking care of running the jobs, and it is tracking each container's resource utilization. So let's understand what this container is.
Normally, when you talk about an application request which comes from a client, let's say this is my client which is coming up with an application that needs to run on the cluster; this application could be anything. It first contacts your master, that is the resource manager, which is the master of your processing layer. As I mentioned, and as we already know, your name node, which is the master of the cluster, has the metadata in its RAM and is aware of the data being split into blocks, the blocks being stored on multiple machines, and other information, so obviously there has been an interaction with the name node which has given the information about the relevant nodes where the data exists. Now, for the processing need of your client, basically the application which needs to run on the cluster, your resource manager, which has the scheduler taking care of allocating resources and the applications manager helping it do its work, has to negotiate the resources for an application which might need data from multiple machines. We know that we would have multiple machines where a node manager is running and a data node is running, and the data nodes are responsible for storing the data on disk. When I say negotiating the resources, the resource manager could ask each of these node managers for some amount of resources; for example, it would be saying, can I have 1 GB of RAM and 1 CPU core from you, because there is some data residing on your machine which needs to be processed as part of my application; and can I again have 1 GB and 1 CPU core from you, again because some relevant data is stored there. This request which the resource manager makes, of holding some portion of the total resources which the node manager has, this request of holding resources, can be considered a container. So the resource manager, and we know it is actually not the resource manager itself but its applications manager component, negotiates the resources, which are called containers, and a container can be of different sizes, which we will talk about. The resource manager negotiates the resources with the node manager, and the node manager, which is already giving updates on the amount of resources it holds and how busy it is, can approve or deny this request; the node manager would approve by saying yes, I can hold these resources, I can give you this container of this particular size. Once the container has been approved, or allocated, or you can say granted, by the node manager, the resource manager now knows that the resources to process the application are available and guaranteed by the node manager, so the resource manager starts a temporary daemon called the app master. This is a piece of code which will also be running in one of the containers and which will then take care of the execution of tasks in the other containers. Your application master is per application: if I had 10 different applications coming in from the client, then we would have 10 app masters, one app master being responsible for each application. So what does this app master do? It is a piece of code which is responsible for the execution of the application, and it runs in one of the containers.
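To give a flavour of what that negotiation looks like in code, here is a heavily trimmed, hedged sketch using the public AMRMClient API that application masters are built on; it only makes sense when run inside a container that yarn itself launched as an application master, and real application masters (such as the MapReduce one) add heartbeat loops, node manager interaction and error handling that is omitted here:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // The app master announces itself to the resource manager.
    rmClient.registerApplicationMaster("", 0, "");

    // "Can I have 1 GB of RAM and 1 vcore from you?" -- asked twice, once per split.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 2; i++) {
      rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Each allocate() heartbeat may return containers granted by node managers.
    AllocateResponse response = rmClient.allocate(0.0f);
    for (Container container : response.getAllocatedContainers()) {
      System.out.println("Granted container " + container.getId()
          + " on " + container.getNodeId());
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rmClient.stop();
  }
}
```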
It then uses the other containers, which the node manager has guaranteed it will give when the application request comes to it, and using these containers the app master runs the processing tasks within those designated resources. So it is mainly the responsibility of the application master to get the execution done and then communicate it back to the master. The resource manager is tracking the resources, it is negotiating the resources, and once the resources have been negotiated it hands control to the application master; the application master then runs within one of the containers on one of the nodes and uses the other containers to take care of the execution. This is how it looks: a container, as I said, is a collection of resources like CPU, memory, the disk which holds the data, and network. Your node manager looks into the request from the application master and grants the request, or in other words allocates these containers. Again, we could have different sizings of containers; let's take an example. As I mentioned, from the total resources available on a particular node, some portion is allocated to the node manager. So let's imagine this is my node, where the node manager is running as the processing slave: out of the total resources which the node has, some portion of the RAM and the CPU cores is allocated to the node manager. Say the node has a total of 100 GB RAM and around 60 cores; some portion of that, maybe 60 or 70 percent, say around 60 GB RAM and around 40 vcores, is allocated to the node manager, and these settings are given in the yarn-site file. Apart from this allocation of 60 GB RAM and 40 vcores, we also have some properties which say what the container sizes will be. For example, we could have a small container setting which says every container gets 2 GB RAM and one virtual CPU core; that is my smallest container, and based on the total resources you can calculate how many such small containers could be running: with 2 GB RAM each I could have around 30 containers, and since each takes one virtual CPU core, I could have around 30 small containers running in parallel on that node, and by that calculation 10 CPU cores would not be utilized. You could also have a bigger container size, say 2 CPU cores and 3 GB RAM, and that would give me around 20 containers of the bigger size. This is the container sizing, which is again defined in the yarn-site file. So what we know is that on a particular node with this kind of allocation, either we could have 30 small containers running or we could have 20 big containers running, and the same applies to multiple nodes. The node manager, based on the request from the application master, can allocate these containers; remember, it is within one of these containers that the application master runs, and the other containers can be used for your processing requirement. The application master, which is per application, is the one which uses these resources; it manages and uses them for an individual application.
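The container-size boundaries described above are usually expressed with the scheduler's minimum and maximum allocation properties. Here is a hedged sketch using the Configuration API, with the arithmetic from the example worked out in comments; the exact values are illustrative, and the increment property shown is specific to the fair scheduler:

```java
import org.apache.hadoop.conf.Configuration;

public class ContainerSizingSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Smallest and largest container YARN will hand out, in MB and vcores.
    conf.set("yarn.scheduler.minimum-allocation-mb", "2048");   // 2 GB "small" container
    conf.set("yarn.scheduler.maximum-allocation-mb", "3072");   // 3 GB "big" container
    conf.set("yarn.scheduler.minimum-allocation-vcores", "1");
    conf.set("yarn.scheduler.maximum-allocation-vcores", "2");
    // Fair-scheduler-only knob: requests are rounded up in steps of this size.
    conf.set("yarn.scheduler.increment-allocation-mb", "512");

    // Working out the example from the text: 60 GB RAM, 40 vcores per node manager.
    int nmMemoryMb = 60 * 1024, nmVcores = 40;
    int small = Math.min(nmMemoryMb / 2048, nmVcores / 1);  // = 30 small containers
    int big   = Math.min(nmMemoryMb / 3072, nmVcores / 2);  // = 20 big containers
    System.out.println("small containers per node: " + small);
    System.out.println("big containers per node  : " + big);
  }
}
```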
Remember, if we have 10 applications running on yarn, there will be 10 application masters, one responsible for each application. Your application master is also the one which interacts with the scheduler to know how much of the resources can be allocated to one application, and your application master is the one which uses these resources, but it can never negotiate for more resources with the node manager; the application master cannot do that. The application master always has to go back to the resource manager if it needs more resources; it is always the resource manager, and internally its applications manager component, which negotiates the resources. At any point of time, due to some node failures or any other requirement, if the application master needs more resources on one or multiple nodes, it will contact the resource manager, internally the applications manager, for more containers. So this is how it looks: your client submits the job request to the resource manager; we know the resource manager internally has the scheduler and the applications manager; the node managers running on multiple machines are the ones tracking their resources and giving this information to the resource manager, so that the resource manager, or I would say its applications manager component, can request resources from multiple node managers, and when I say request resources, it is these containers. Your resource manager will request the resources on one or multiple nodes, the node manager is the one which approves these containers, and once the containers have been approved, the resource manager triggers a piece of code, that is the application master, which obviously needs some resources itself, so it runs in one of the containers and uses the other containers to do the execution. So your client submits an application to the resource manager; at a high level we say the resource manager allocates a container, while in reality the resource manager is negotiating the resources, internally it is the applications manager which negotiates, and the one granting the request is the node manager; that's how we can say the resource manager allocates a container. The application master then contacts the related node manager, because it needs to use the containers; the node manager is the one which launches the container, or in other words gives those resources within which an application can run; the application master accommodates itself in one of the containers and then uses the other containers for the processing, and it is within these containers that the actual execution happens. That could be a map task, a reduce task, a spark executor taking care of spark tasks, or many other kinds of processing. Before we look into the demo on how yarn works, I would suggest looking into one of the blogs from cloudera: just search for yarn untangling, and this is a really good blog which talks about the overall functionality which I explained just now. As mentioned there, you have the master process and the worker process which takes care of your processing, your resource manager being the master and the node manager being the slave; it also talks about the resources which each node manager has, the yarn configuration file where you give all these properties, and it shows the node manager reporting the amount of resources it has to the resource manager.
Remember, if your worker node has a certain total of CPU cores and RAM, and your node manager reports, say, 64 vcores and 128 GB of RAM, then that's not the total capacity of your node; it is some portion of your node which is allocated to the node manager. Once your node manager reports that, your resource manager requests containers based on the application. What is a container? It is basically a logical name given to a combination of vcores and RAM, and it is within this container that the process runs. Once your application starts and once the node manager has guaranteed these containers, your resource manager has already started an application master within a container, and what does that application master do? It uses the other containers, where the tasks run. So this is a very good blog which you can refer to, and it also talks about mapreduce: if you have already followed the mapreduce tutorials in the past, then you know about the different kinds of tasks, that is map and reduce, and these map and reduce tasks could be running within containers, one or multiple; it could be a map task, it could be a reduce task, it could be a spark based task running within the container. Once the task finishes, those resources can be freed up, so the container is released and the resources are given back to yarn so that it can take care of further processing. If you look further into this blog, you can also look into part two of it, which talks mainly about the configuration settings. It discusses why and how many resources are allocated to the node manager: it talks about the operating system overhead, about other services, cloudera or hortonworks related services running, and other processes which might be running, and based on that, some portion of the RAM and CPU cores is allocated to the node manager; that's how it would be done in the yarn-site file, and this shows you the total amount of memory and CPU cores allocated to the node manager. Then, on every machine where you have a node manager running, in the yarn-site file you would have properties which say what the minimum container size is, what the maximum container size is in terms of RAM, what the minimum and maximum are for CPU cores, and what the increment size is in which RAM and CPU cores can increase. These are some of the properties which define how containers are allocated for your application requests, so have a look at this; it is good information about the different properties. You can look further into the part which talks about scheduling: it covers scheduling in yarn, the fair scheduler, having different queues in which allocations can be done, the different ways in which queues can be managed, and the different schedulers which can be used. So you can always look at this series of blogs; you can also check for yarn schedulers and then search for the hadoop definitive guide, which can give you some more information on how this looks. If you look into that book, which talks about the different schedulers as I mentioned, you could have a FIFO scheduler, that is first in first out, which basically means that if a long running application is submitted to the cluster, all other small applications will have to wait, there is no other way; but that would not be a preferred option with the FIFO scheduler.
If you look at the capacity scheduler, it basically means that you could have different queues created, and those queues would have resources allocated; so you could have a production queue, with a fixed amount of resources allocated, where the production jobs are running, and a development queue where the development jobs are running, with both of them running in parallel. You could then also look into the fair scheduler, which again means multiple applications could be running on the cluster, however they would have a fair share. When I say fair share, in brief what it means is: if I had given 50 percent of the resources to a queue for production and 50 percent to a queue for development, and both of them are running in parallel, then each would have access to 50 percent of the cluster resources; however, if one of the queues is unutilized, the second queue can utilize all of the cluster resources. So look into the fair scheduling part; it also shows you how allocations can be given, and you can learn more about schedulers and how queues can be used for managing multiple applications. Now we will spend some time looking at a few quick ways of interacting with yarn, in the form of a demo, to understand and learn how yarn works. We can look into a particular cluster; here we have a designated cluster which can be used, and you could be using the same kind of commands on your apache based cluster, on a cloudera quickstart VM if you already have one, or on a cloudera or hortonworks cluster if you have one running. There are different ways in which we can interact with yarn and look at the information. One is looking into the admin console. If I look into cloudera manager, which is basically the admin console for cloudera's distribution of hadoop, and similarly if you had a hortonworks cluster you would have access to its admin console, then if you have even read access for your cluster and you have the admin console, you can search for yarn as a service, click on yarn as a service, and that gives you different tabs. You have the instances tab, which tells you what the different roles for your yarn service are: we have multiple node managers here, some of them showing a stopped status, but that's nothing to worry about, so we have three of the six node managers up; we have the resource manager, which is one, but that can also be set up in high availability, where you have an active and a standby; and you also have a job history server, which shows you the applications once they have completed. You can then look at the yarn configurations, and as I was explaining, you can always look for the properties which are related to the allocations. You can search here for cores, and that shows the properties which talk about the allocations: the yarn app mapreduce application master resource cpu vcores property, that is the CPU cores allocated for the mapreduce application master, and likewise for the map task and the reduce task; and yarn nodemanager resource cpu vcores, which says that every node manager on every node is allocated six CPU cores, while the container sizing has a minimum allocation of one CPU core and a maximum of two CPU cores. Similarly, you could search for the memory allocation, and there you can scroll down to see what kind of memory allocation has been done for the node manager; if we look further, it should give me the information for the node manager.
manager which basically says here that the container minimum allocation is 2 gb the maximum is 3 gb and we can look at node manager which has been given 25 gb per node so it's a combination of this memory and cpu cores which is the total amount of resources which have been allocated to every node manager now we can always look into applications tab that would show us different applications which are submitted on yarn for example right now we see there is a spark application running which is basically a user who is using spark shell which has triggered a application on spark and that is running on yarn you can look at different applications workload information you can always do a search based on the number of days how many applications have run and so on you can always go to the web ui and you can be searching for the resource manager web ui and if you have access to that it will give you overall information of your cluster so this basically says that here we have 100 gb memory allocated so that could be say 25 gb per node and if we have four node managers running and we have 24 cores which is six cores per node if we look further here into nodes i could get more information so this tells me that i have four node managers running and node managers basically have 25 gb memory allocated per node and six cores out of which some portion is being utilized we can always look at the scheduler here which can give us information what kind of scheduler has been allocated so we basically see that there is just a root cue and within root you have default queue and you have basically user's queue based on different users we can always scroll here and that can give us information if it is a fair share so here we see that my root dot default has 50 percent of resources and the other queue also has 50 percent of resources which also gives me an idea that a fair scheduler is being used we can always confirm that if we are using a fair scheduler or a capacity scheduler which takes care of a location so search for scheduler and that should give you some understanding of what kind of scheduler is being used and what are the allocations given for that particular scheduler so here we have fair scheduler it shows me you have under root you have the root q which has been given hundred 100 capacity and then you have within that default which also takes 100 so this is how you can understand about yarn by looking into the yarn web ui you can be looking into the configurations you can look at applications you can always look at different actions now since we do not have admin access the only information we have is to download the client configuration we can always look at the history server which can give us information of all the applications which have successfully completed now this is from your yarn ui what i can also do is i can be going into hue which is the web interface and your web interface also basically allows you to look into the jobs so you can click on hue web ui and if you have access to that it should show up or you should have a way to get to your hue which is a graphical user interface mainly comes with your cloud error you can also configure that with apache hortonworks has a different way of giving you the web ui access you can click and get into hue and that is also one way where you can look at yarn you can look at the jobs which are running if there are some issues with it and these these are your web interfaces so either you look from yarn web ui or here in hue you have something called as 
job browser which can also give you information of the different applications which might have run so here i can just remove this filter which basically gives me a list of all the different kinds of jobs or workflows which were run so either it was a spark based application or it was a mapreduce or it was coming from hive so here i have a list of all the applications and it says this was a mapreduce this was a spark something was killed something was successful and this one was probably a hive query which triggered a mapreduce job you can click on the application and that tells you how many tasks were run for it so there was a map task which ran for it you can get into the metadata information and obviously you can also look from the yarn ui into your applications which can give you detailed information of if it was a mapreduce how many map and reduce tasks were run what were the different counters and if it was a spark application it can let you follow through to the spark history server or job history server so you can always use the web ui to look into the jobs you can find a lot of useful information here you can also be looking at how many resources were used and what happened to the job was it successful did it fail and what was the job status now apart from the web ui which you might not always have access to so in a particular cluster say a production cluster there might be restrictions and the organization might not have given all the users access to a graphical user interface like hue or you might not have access to cloudera manager or the admin console because probably the organization is managing multiple clusters using this admin console so the one thing which you would have access to is your web console or basically your edge node or client machine from where you can connect to the cluster and then you can be working so let's log in here and now here we can give different commands so this is the command line from where you can have access to different details you can always check by just typing in mapred which gives you different options where you can look at the mapreduce related jobs you can look at different queues if there are queues configured you can look at the history server or you can also be doing some admin stuff provided you have access so for example if i just say mapred and queue here this basically gives me an option saying what would you want to do would you want to list all the queues do you want information on a particular queue so let's try a list and that should give you the different queues which are being used now here we know that per user a queue dynamically gets created which is under root dot users and that gives me what is the status of the queue what is the capacity and has there been any kind of maximum capacity or capping done so we get to see a huge list of queues which dynamically get configured in this environment and then you also have your root dot default i could have also picked up one particular queue and said show me the jobs so i could do that now here we can also give a yarn command so let me just clear the screen and i will say yarn and that shows me different options so apart from your web interface that is your yarn web ui you could also be looking for information using yarn commands here so these are some of the commands which we can check you can just type in yarn version if you would want to see the version which basically gives you information of what hadoop version is being used and what the vendor specific distribution version is
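as a small sketch of what is being typed here these are the queue and version commands being described root.default is just the queue name mentioned in this walkthrough

# list the configured queues and their status
mapred queue -list
# show details and running jobs for one queue (root.default on this cluster)
mapred queue -info root.default -showJobs
# print the hadoop version and the vendor distribution version
yarn version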
so here we see we are working on cloudera's distribution 5.14 which is internally using hadoop 2.6 now similarly you can be doing a yarn application -list so if you give this it could be an exhaustive list of all the applications which are running or applications which have completed so here we don't see any applications because right now there are probably no applications running it also shows you that you could be pulling out different statuses such as submitted accepted or running now you could also say i would want to see the applications that have finished running so i could say yarn application -list with the app states option set to finished so here we could be using our command i could say yarn application -list and then -appStates FINISHED which gives me the applications which have finished and lists all the applications which finished now there might be applications which succeeded and there is a huge list of applications coming in from the history server which is basically showing you the huge list of applications which have completed so this is one way and then you could also be searching for one particular application if you would want to search for a particular application and you have the application id you could always do a grep that's a simple way so let's pick up this one and if i would want to search for this or want more details on this i could obviously do that by calling my previous command and doing a grep if that's what you want to do and if you would want to check whether a particular application is in the list of my applications that shows my application and then i could pull out more information about my application so i could look at the log files for a particular application by giving the application id so i could say yarn logs now that's an option and anytime you have a doubt just hit enter it will always give you the options of what you need to give with a particular command so i can say yarn logs -applicationId now we copied an application id and we can just give it here we could give other options like the app owner or if you would want to get into the container details or if you would want to check on a particular node now here i am giving yarn logs and pointing it to an application id and it says the log aggregation has not completed might be this was an application which was triggered from a particular interactive shell or based on a particular query so there is no log existing for this particular application you can always look at the status of an application and you can kill an application so here you can be saying yarn application and then what would you want to do with the application hit enter and it shows you the different options so we just tried app states you could always look at the last one which says status and then for my status i could be giving my application id so that tells me what is the status of this application it connects to the resource manager it tells me what's the application id what kind of application it was who ran it which was the queue where the job was running what was the start and end time what is the progress the status of it if it is finished or if it has succeeded and then it basically also gives me the information of where the application master was running and it gives me the information on where you can find the job details in the history server if you are interested in looking into them
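for reference here is roughly what those listing and log commands look like the application id is a placeholder and yarn logs will only return output once log aggregation is enabled and has completed for that application

# list applications currently known to the resource manager
yarn application -list
# filter by state, for example only the completed ones
yarn application -list -appStates FINISHED
# narrow the list down to one application id with grep (placeholder id)
yarn application -list -appStates FINISHED | grep application_1234567890123_0001
# check the status of a single application
yarn application -status application_1234567890123_0001
# pull the aggregated logs for an application, optionally for a specific owner
yarn logs -applicationId application_1234567890123_0001
yarn logs -applicationId application_1234567890123_0001 -appOwner someuser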
the status output also gives you an aggregate resource allocation which tells you how much memory and how many vcore seconds the application used so this is basically looking at the application details now i could kill an application if the application was still running i could always do a yarn application -kill and then give my application id now i could try killing this one however it would say the application has already finished if i had an application running and if my application had already been given an application id by the resource manager i could just kill it i can also say yarn node -list which gives me a list of the node managers now this is what we were looking at from the yarn web ui and we were pulling out the information so we can get this kind of information from the command line as well always remember and always try to be well accustomed with the command line so you can do various things from the command line and then obviously you have the web uis which can help you with a graphical interface to easily access things now you could also be starting the resource manager which we would not be doing here because we are already running in a cluster so you could give a yarn resourcemanager command and you could check the logging of the resource manager if you wanted by giving yarn daemonlog so we can try that so you can say yarn and then daemonlog and if it does not find the option you can give something like getlevel and here i will have to give the node and the port where you want to check the logs of the resource manager for which we would have to get into cloudera manager to look into the nodes and the addresses so you could be giving a command something like this which basically gives you the level of the log which you have and i got this resource manager address from the web ui now i can be giving this command to look into the daemon log and it basically says you are looking at the resource manager related log and you have log4j which is being used for logging and the level which has been set is info which can again be changed depending on the way you are logging the information now you can try other yarn commands also for example looking at yarn rmadmin so you can always do a yarn rmadmin and this basically gives you a lot of other information like refreshing the queues or refreshing the nodes or basically looking at the admin acls or getting groups so you could always get group names for a particular user now we could search for a particular user such as yarn or hdfs itself so i could just say here i would want getGroups and then i could be searching for say the username hdfs so that tells me hdfs belongs to the hadoop group similarly you could search for say mapred or you could search for yarn so these are service related users which automatically get created and you can pull out information related to these you can always do a refresh nodes kind of command and that is mainly done internally this can be useful when you are doing commissioning and decommissioning but then in case of a cloudera or hortonworks kind of cluster you would not be manually giving this command because if you are doing commissioning and decommissioning from an admin console and if you are an administrator then you could just restart the services which are affected and that will take care of this but if you were working on an apache cluster and if you were doing commissioning and decommissioning then you would be using two refresh nodes commands
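and a rough sketch of the admin oriented commands described in this part the application id host name and web ui port in the daemonlog call are placeholders you would take from your own resource manager

# kill a running application (placeholder id)
yarn application -kill application_1234567890123_0001
# list the node managers known to the resource manager
yarn node -list
# check which groups a service user belongs to
yarn rmadmin -getGroups hdfs
# check the current log level of the resource manager class (placeholder host and web ui port)
yarn daemonlog -getlevel rm-host.example.com:8088 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
# the two refresh commands used around commissioning and decommissioning
yarn rmadmin -refreshNodes
hdfs dfsadmin -refreshNodes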
one is the yarn one for refreshing the list of nodes which should not be used for processing and similarly there is a refresh nodes command which comes with hdfs so these are different options which you can use with yarn on the command line you could also be using curl commands to get more information about your cluster by giving curl -X and then basically pointing to your resource manager web ui address now here i would like to print out the cluster related metrics and i could just simply do this which basically gives me high level information of how many applications were submitted how many are pending what are the reserved resources what is the available amount of memory or cpu cores and all that information similarly you can be using the same curl commands to get more information like scheduler information so you would just replace the metrics with scheduler and you could get the information of the different queues now that's a huge list so we can cancel this and that would give me a list of all the queues which are allocated and what are the resources allocated for each queue you could also get cluster information on application ids and the status of applications running in yarn so you would have to replace the last bit of it and say i would want to look at the applications and that gives me a huge list of applications then you can do a grep and filter out specific application related information similarly you can be looking at the nodes so you can always be looking at node specific information which gives you how many nodes you have and this would mainly be used when you have a web application which wants to use a curl command to get information about your cluster from an http interface now when it comes to applications we can basically try running a simple or sample mapreduce job which could then be triggered on yarn and it would use the resources now i can look at my application here and i can be looking into my specific directory which is this one which should have a lot of files and directories now i could pick up one of these and use a simple example to do some processing let's take up this file so there is a file and i could run a simple word count or i could be running a hive query which triggers a mapreduce job i could even run a spark application which would then show that the application is running on the cluster so for example if i say spark2-shell now i know that this is an interactive way of working with spark but this internally triggers a spark submit and this runs an application so here when you do a spark2-shell by default it will contact yarn so it gets an application id it is running on yarn with the master being yarn and now i have access to the interactive way of working with spark now if i go and look into applications i should be able to see my application which has been started and it shows up here so this is my application 3827 which has been started on yarn and as of now we can also look into the yarn ui and that shows me the application which has been started which basically has one running container with one cpu core allocated and 2gb ram and it's in progress although we are not doing anything there so we can always look at our application from the yarn ui or as i mentioned from the applications tab within the yarn service which gives us the information and you can even click on this application to follow and see more information provided you have been given access to that
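for reference the curl calls described a little earlier hit the resource manager rest api which looks roughly like this the host is a placeholder and 8088 is the usual default web ui port

# overall cluster metrics such as submitted, pending and running applications
curl -X GET http://rm-host.example.com:8088/ws/v1/cluster/metrics
# scheduler and queue information
curl -X GET http://rm-host.example.com:8088/ws/v1/cluster/scheduler
# all applications, which can then be filtered with grep
curl -X GET http://rm-host.example.com:8088/ws/v1/cluster/apps
# node manager level information
curl -X GET http://rm-host.example.com:8088/ws/v1/cluster/nodes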
now this was just a simple application which i triggered using spark shell similarly i can be running a mapreduce now to run a mapreduce i can say hadoop jar and that basically needs a jar with a class so we can look at the default path which is opt cloudera parcels cdh lib hadoop-mapreduce hadoop-mapreduce-examples and we can look at this particular jar file and if i hit enter it shows me the different classes which are part of this jar and here i would like to use wordcount so i could just give this and say wordcount now remember i could run the job in a particular queue by giving an argument here so i could say -D mapred.job.queue.name= and then point my job to a particular queue i can even give different arguments saying i would want my mapreduce output to be compressed or i want it to be stored in a particular directory and so on so here i have the wordcount and then basically what i can be doing is pointing it to a particular input path and then i can have my output getting stored in a directory which we need to choose and i will say output new and i can submit my job now once i have submitted my job it connects to the resource manager it basically gets a job id it gets an application id and it shows you from where you can track your application you can always go to the yarn ui and look at your application and the resources it is using so my application was not a big one and it has already completed it triggered one map task it launched one reduce task it was working on around 12 466 records where you have the output of map which is this many output records which was then taken by a combiner and finally by a reducer which basically gives you the output so this is my yarn application which has completed now i could be looking into the yarn ui and if my job has completed you might not see your application here so as of now it shows up here the wordcount which i ran it also shows me my previous spark shell job and it shows me my application is completed and if you would want further information on this you can click and go to the history server if you have been given access to it or directly go to the history server web ui where your application shows up it shows how many map and reduce tasks it was running you can click on this particular application which basically gives you information of your map and reduce tasks you can look at the different counters for your application you can always look at map specific tasks you can always look into one particular task what it did and on which node it was running or you can be looking at the complete application log so you can always click on the logs and here you have click here for full log which gives you the information and you can always look for your application which can give you information of the app master being launched or you could search for the word container so you could see a job which needs one or multiple containers and then you could see a container being requested then you could see a container being allocated then you can see what is the container size and then basically your task moves from initializing to running in the container and finally you can even search for release which will tell you that the container was released so you can always look into the log for more information so this is how you can interact with yarn this is how you can interact with your command line to look for more information or using
your yarn web ui or you can also be looking into your hue for more information welcome to scoop tutorial one of the many features of the hadoop ecosystem for the hadoop file system what's in it for you today we're going to cover the need for scoop what is scoop scoop features scoop architecture scoop import scoop export scoop processing and then finally we'll have a little hands-on demo on scoop so you can see what it looks like so where does the need for scoop come in in our big data hadoop file system processing huge volumes of data requires loading data from diverse sources into hadoop cluster you can see here we have our data processing and this process of loading data from the heterogeneous sources comes with a set of challenges so what are the challenges maintaining data consistency ensuring efficient utilization of resources especially when you're talking about big data we can certainly use up the resources when importing terabytes and petabytes of data over the course of time loading bulk data to hadoop was not possible it's one of the big challenges that came up when they first had the hadoop file system going and loading data using script was very slow in other words you'd write a script in whatever language you're in and then it would very slowly load each piece and parse it in so the solution scoop scooped helped in overcoming all the challenges to traditional approach and could lead bulk data from rdbms to hadoop very easily so thank your enterprise server you want to take the from mysql or your sql and you want to bring that data into your hadoop warehouse your data filing system and that's where scoop comes in so what exactly is scoop scoop is a tool used to transfer bulk of data between hadoop and external data stores such as relational databases and my sql server or the microsoft sql server or my sql server so scoop equals sql plus hadoop and you can see here we have our rdbms all the data we have stored on there and then your scoop is the middle ground and brings the import into the hadoop file system it also is one of the features that goes out and grabs the data from hadoop and exports it back out into an rdbms let's take a look at scoop features sku features has parallel import and export it has import results of sql query connectors for all major rdbms databases kerberos security integration provides full and incremental load so we look at parallel import and export scoop uses yarn yet another resource negotiator framework to import and export data this provides fault tolerance on a top of parallelism scoop allows us to import the result returned from an sql carry into the hadoop file system or the hdfs and you can see here where the import results of sql query come in scoop provides connectors for multiple relational database management system rdbms's databases such as mysql and microsoft sql server and has connectors for all major rdbms databases scoop supports kerberos computer network authentication protocol that allows nodes communicating over a non-secure network to prove their identity to one another in a secure manner scoop can load the whole table or parts of the table by a single command hence it supports full and incremental load let's dig a little deeper into the scoop architecture we have our client in this case a hooded wizard behind his laptop you never know who's going to be accessing the hadoop cluster and the client comes in and sends their command which goes into scoop the client submits the import export command to import or export data data from 
different databases is fetched by scoop and so we have enterprise data warehouse document based systems you have connect connector for your data warehouse a connector for document based systems which reaches out to those two entities and we have our connector for the rdbms so connectors help in working with a range of popular databases multiple mappers perform map tasks to load the data onto hdfs the hedo file system and you can see here we have the map task if you remember from hadoop hadoop is based on map reduce because we're not reducing the data we're just mapping it over it only accesses the mappers and it opens up multiple mappers to do parallel processing and you can see here the hdfs hbase hive is where the target is for this particular one similarly multiple map tests will export the data from hdfs onto rdbms using scoop export command so just like you can import it you can now export it using the multiple map routines scoop import so here we have our dbms data store and we have the folders on there so maybe it's your company's database maybe it's an archive at google with all the searches going on whatever it is usually you think with scoop you think sql you think mysql server or microsoft sql server that kind of setup so it gathers the metadata and you see the scoop import so introspect database together metadata primary key information and then it submits uh so you can see submits map only job remember we talked about mapreduce it only needs the map side of it because we're not reducing the data we're just mapping it over scoop divides the input data set into splits and uses individual map tests to push the splits into hdfs so right into the hadoop file system and you can see down on the right is kind of a small depiction of a hadoop cluster and then you have scoop export so we're going to go the other direction and with the other direction you have your hadoop file system storage which is your hadoop cluster you have your scoop job and each one of those clusters then gets a map mapper comes out to each one of the computers that has data on it so the first step is you've got to gather the metadata so step one you gather the metadata step two submits map only job introspect database together metadata primary key information scoop divides the input data set into splits and uses individual map tests to push the splits to rdbms scoop will export hadoop files back to rdms tables you can think of this in a number of different manners one of them would be if you're restoring a backup from the hadoop file system into your enterprise machines there's certainly many others as far as exploring data and data science so as we dig a little deeper into scoop input we have our connect our jdbc and our url so specify the jdbc connect string connecting manager we specify the connection manager class to use you can see here driver with the class name manually specify the jdbc driver class to use hadoop mapreduce home directory override hadoop mapped home username set authentication username and of course help print usage instructions and with the export you'll see that we can specify the jdbc connect string specify the connection manager class to use manually specify jdbc driver class to use you do have to let it know to override the hadoop map reduce home and that's true on both of these and set authentication username and finally you can print out all your help setup so you can see the format for scoop is pretty straightforward both import and export so let's uh continue on our path and look 
at scoop processing and what the computer goes through for that and we talk about scoop processing first scoop runs in the hadoop cluster it imports data from the rdbms the nosql database to the hadoop file system so remember we might not be importing the data from a rdbms it might actually be coming from an osql sql there's many out there it uses mappers to slice the incoming data into multiple formats and load the data into hdfs it exports data back into an rdbms while making sure that the schema of the data in the database is maintained so now that we've looked at the basic commands in our scoop in the scoop processing or at least the basics as far as theory is concerned let's just jump in and take a look at a demo on scoop for this demo i'm going to use our cloudera quick start if you've been watching our other demos we've done you'll see that we've been using that pretty consistently certainly this will work in any of your your horton sandbox which is also a single node testing machine cloudera is one of um there's a docker version instead of virtualbox and you can also set up your own hadoop cluster plan a little extra time if you're not an admin it's actually a pretty significant endeavor for an admin if you've been admitting linux machines for a very long time and you know a lot of the commands i find for most admins it takes them about two to four hours the first time they go in and create a virtual machine and set up their own hadoop in this case though when you're just learning and getting set up best to start with cloudera cloudera also includes an installed version of mysql that way you don't have to install the the sql version for importing data from and 2. once you're in the cloudera quick start you'll see it opens a nice centos linux interface and it has the desktop setup on there this is really nice for learning so you're not just looking at command lines and from in here it should open up by default to hue if not you can click on hue here's a kind of a fun little uh web-based interface under hue i can go under query i can pick an editor and we'll go right down to scoop so now i'm just going to load the scoop editor inner hue now i'm going to switch over and do this all in command line i just want to show that you can actually do this in a hue through the web-based interface the reason i like to do the command line is specifically on my computer it runs much quicker or if i do the command line here and i run it it tends to have an extra lag or an added layer in it so for this we're going to go ahead and open our command line the second reason i do this is we're going to need to go ahead and edit our mysql so we have something to scoop in otherwise i don't have anything going in there and of course we zoom in we'll zoom in this and increase the size of our screen so for this demo our hands-on i'm going to use oracle virtualbox manager and the cloudera quick start if you're not familiar with this we do have another tutorial we put out and you can send a note in the youtube video below and let our team know and they'll send you a link or come visit www.simplylearn.com now this creates a linux box on my windows computer so we're going to be in linux and it'll be the cloudera version with scoop and we'll also be using mysql mysql server once inside the cloudera virtualbox we'll go under the hue editor now we're going to do everything in terminal window but i just want you to be aware that under the hue editor you can go under query editor and you'll see as we come down here 
here's our scoop on this so you can run your scoop from in here now before we do this we have to do a little exploration in my sql and my sql server that way we know what data is coming in so let me go ahead and open up a terminal window in cloudera you have a terminal window at the top here that you can just click on it open it up and let me just go ahead and zoom in on here go view and zoom in now to get into my sql server you simply type in mysql and this part will depend on your setup now the cloudera quickstart comes up that the username is root and the password is cloudera kind of a strange quirk is that you can put a space between the minus u and the root but not between the minus p and the cloudera usually you'd put in a minus capital p and then it prompts you for your password on here for this demo i don't worry too much about you knowing the password on that so we'll just go right into my sql server since this is the standard password for this quick start and you can see we're now into mysql and we're going to do just a couple of quick commands in here there's show databases and you follow by the semicolon that's standard in most of these shell commands so it knows it's the end of your shell command and you'll see in here in the quick start cloudera quickstart the mysql comes with a standard set of databases these are just some of these have to do like with the uzi which is the uzi part of hadoop where others of these like customers and employees and stuff like that those are just for demo purposes they come as a standard setup in there so that people going in for the first time have a database to play with which is really good for us so we don't have to recreate those databases and you will see in the list here we have a retail underscore db and then we can simply do uh use retail underscore db this will set that as a default in mysql and then we want to go ahead and show the tables and if we show the tables you can see under the database the retail db database we have categories customers departments order items orders products so there's a number of tables in here and we're going to go ahead and just use a standard sql command and if you did our hive language you'll note remember it's the same for hql also on this we're just going to select star everything from departments so there's our departments table and we're going to list everything on the department's table and you'll see we've got six lines in here and it has a department id and a department name two for fitness three for footwear so on and so forth now at this point i can just go ahead and exit but it's kind of nice to have this data up here so we can look at it and flip back and forth between the screens so i'm going to open up another terminal window and we'll go ahead and zoom in on this also and it isn't too important for this particular setup but it's always kind of fun fun to know what your setup you're working with what is your host name and so we'll go ahead and just type that in this is a linux command and it's a host name minus f and you see we're on quick start cloudera no surprise there now this next command is going to be a little bit longer because we're going to be doing our first scoop command and i'm going to do two of them we're going to list databases and list tables it's going to take just a moment to get through this because there's a bunch of stuff going on here so we have scoop we have list databases we have connect and under the connect command we need to let it know how we're connecting we're 
going to use the jdbc this is a very standard one jdbc mysql so you'll see that if you're doing an sql database that's how you started off with and then the next part this is where you have to go look it up it's however it was created so if your admin created a mysql server with a certain setup that's what you have to go by and you'll see that usually they list this as localhost so you'll see something like localhost sometimes there's a lot of different formats but the most common is either localhost or the actual connection so in this case we want to go ahead and do quickstart 3306. and so quick start use the name of the local host database and how it's hosted on here and when you set up the quick start for um for hadoop under cloudera it's port 3306 is where that's coming in so that's where all that's coming from and so there's our path for that and then we have to put in our password we typically type password if you look it up password on the cloudera quick start is cloudera and we have to also let it know the username and again if you're doing this you'd probably put in a minus capital you can actually just do it for a prompt for the password so if you leave that out it'll prompt you but for this doesn't really matter i don't care if you see my password it's the default one for cloudera quickstart and then the username on here is simply root and then we're going to put our semicolon at the end and so we have here our full setup and we go ahead and list the databases and you'll see you might get some warnings on here i haven't run the updates on the quick start i suggest you're not running the updates either if you're doing this for the first time because it'll do some reformatting on there and it quickly pass up and you can see here's all of our the tables we went in there and if we go back to on the previous window we should see that these tables match so here we come in and here we have our databases and you can see back up here where we had the cm customers employees and so on so the databases match and then we want to go ahead and list the tables for a specific database so let's go ahead and do that i'm a very lazy typist so i'll put the up arrow in and you can see here scoop list databases we're just going to go back and change this from databases to list tables so we want to list the tables in here same connection so most the connection is the same except we need to know which tables we're listing an interesting fact is you can create a table without being under a database so if you left this blank it will show the open tables that aren't connected directly to a database or under a database but what we want to do is right past this last slash on the 3306 we want to put that retail underscore db because that's the database we're going to be working with and this will go in there and show the tables listed under that database and here we go we got categories customers departments order items and products if we flip back here real quick there it is the same thing we had we had categories customers departments order items and so on and so let's go ahead and run our first import command and again i'm that lazy typer so we're going to do scoop and instead of list tables we want to go ahead and import so there's our import command and so once we have our import command in there then we need to tell it exactly what we're going to import so everything else is the same we're importing from the retail db so we keep that and then at the very end we're going to tag on dash dash table that 
tells us we can tell it what table we're importing from and we're going to import departments there we go so this is pretty straightforward because it what's nice about this is you can see the commands are the same i got the same connection um i change it for the whatever database i'm in then i come in here our password the username are going to be the same that's all under the mysql server setup and then we let it know what table we're entering in we run this and this is going to actually go through the mapper process in hadoop so this is a mapping process it takes the data and it maps it up to different parts in the setup in hadoop on there and then saves that data into the hadoop file system and it does take it a moment to zip through which i kind of skipped over for you since it is running a you know it's designed to run across the cluster not on a single node so when you're running on a single node it's going to run slow even if you dedicate a couple cores to it i think i put dedicated four cores to this one uh and so you can see right down here we get to the end it's now mapped in that information and then we can go in here we can go under can flip back to our hue and under hue on the top i have there's databases and the second icon over is your hadoop file system and we can go in here and look at the hadoop file system and you'll see it show up underneath our documents there it is departments cloudera departments and you can see there's always a delay when i'm working in hue which i don't like and that's the quick start issue that's not necessarily running out on a server when i'm running it on a server you pretty much have to run through some kind of server interface i still prefer the terminal window it still runs a lot quicker but we'll flip back on over here to the command line and we can do the hadoop type in the hadoop fs and then list minus ls and if we run this you'll see underneath our hadoop file system there is our departments which has been added in and we can also do uh hadoop fs and this is kind of interesting for those who've gone through the hadoop file system everything you'll you'll recognize this on here i'm going to list it the contents of departments and you'll see underneath departments uh we have part part m0001002003 and so this is interesting because this is how hadoop saves these files this is in the file system this is not in hive so we didn't directly import this into hive we put this in the hadoop file system depending on what you're doing you would then write the schema for hive to look at the hadoop file system certainly visit our hive tutorial for more information on hive specific so you can see in here are different files that it forms that are part of departments and we can do something like this we can look at the contents of one of these files fs minus ls or a number of the files and we'll simply do the full path which is user cloudera and then we already know the next one is departments and then after departments we're going to put slash part star so this is going to say anything that has part in it so we have part dash m 0 0 0 and so on we can go ahead and cat use that cut command or that list command to bring those up and then we can use the cat command to actually display the contents and that's a linux command hadoop linux command the cat captinate not to be confused with catatonic catastrophic there's a lot of cat got your tongue and we see here fitness footwear apparel that should look really familiar because that's what we had in our mysql 
server we went in here we did a select all on here there it is fitness footwear apparel golf outdoors and fan shop and then of course it's really important let's look back on over here to be able to tell it where to put the data so we go back to our import command so here's our scoop import we have our connect we have the db underneath our connection our my sql server we have our password our username the table going where it's going to i mean the table where it's coming from uh and then we can add a target on here we can put in uh target dash directory and you do have to put the full path that's a hadoop thing it's a good practice to be in and we're gonna add it to department uh we'll just do department one and so here we now add a target directory in here and user cloudera department one and this will take just a moment before so i'll go ahead and skip over the process since it's going to run very slowly it's only running on like i said a couple cores and it's also on a single node and now we can do the hadoop let's just do the up arrow file system list we want just straight list and when we do the hadoop file system minus ls or list you'll see that we now have department one and we can of course do list department one and you can see we have the files inside department one and they mirrored what we saw before with the same files in there and the part m o zero zero and so on if we want to look at them it'd be the same thing we did before with the cat so except instead of departments we'd be department one there we go something that's going to come up with the same data we had before now one of the important things when you're importing data and it's always a question to ask is do you filter the data before it comes in do we want to filter this data as it comes in so we're not storing everything in our file system you would think hadoop big data put it all in there i know from experience that putting it all in there can turn a couple hundred terabytes into a petabyte very rapidly and suddenly you're having to really add on to that data store and you're storing duplicate data sometimes so you really need to be able to filter your data out and so let's go ahead and use our up arrow to go to our last import since it's still a lot the same stuff so we have all of our commands under import we have the target we're going to change this to department 2 so we're going to create a new directory for this one and then after departments there's another command that we didn't really slide in here and that's our mapping and i'll show you what this looks like in a minute but we're going to put m3 in there that doesn't have nothing to do with the filtering i'll show you that in a second though what that's for and we just want to put in where uh so where and what is the where in this case we want to know where department id and if you want to know where that came from we can flip back on over here we have department underscore ids this is where that's coming from that's just the name of the column on here so we come in here to department id is greater than four simple um logic there you can see where you'd use that for maybe creating buckets for ages uh you know age from 10 to 15 20 to 30. 
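pulling the pieces of this walkthrough together here is roughly what the whole sqoop session looks like on the cloudera quickstart the tool rendered as scoop in these captions is apache sqoop on the command line the host credentials table names and directories are the ones mentioned in the demo so adjust them for your own cluster and the export step at the end is the one covered just after this

# list databases and tables on the mysql server
sqoop list-databases --connect jdbc:mysql://quickstart:3306 --username root --password cloudera
sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --username root --password cloudera
# plain import of the departments table into the default hdfs location
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username root --password cloudera --table departments
# import into an explicit target directory
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username root --password cloudera \
  --table departments --target-dir /user/cloudera/department1
# filtered import with three mappers and a where clause
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username root --password cloudera \
  --table departments --target-dir /user/cloudera/department2 -m 3 --where "department_id > 4"
# inspect the result in hdfs
hadoop fs -ls /user/cloudera/department2
hadoop fs -cat /user/cloudera/department2/part*
# export the filtered data back into the dept table created in mysql
sqoop export --connect jdbc:mysql://quickstart:3306/retail_db --username root --password cloudera \
  --table dept --export-dir /user/cloudera/department2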
you might be looking for i mean there's all kinds of reasons why you could use the where command on here and filter information out maybe you're doing word counting and you want to know words that are used less than 100 times you want to get rid of the and is and and all the stuff that's used over and over again uh so we'll go ahead and put the where and then department id is greater than four we'll go ahead and hit enter on here and this will create our department two set up on this uh and i'll go ahead and skip over some of the runtime again it runs really slow on a single node real quick page through our commands let's see here we go our list and we should see underneath the list the department 2 on here now and there it is department 2 and then i can go ahead and do list department two you'll see the contents in here and you'll see that there is only three maps and it could be that the data created three maps but remember i set it up to only use three mappers uh so there's zero one and two and we can go ahead and do a cat on there remember this is department two so we want to look at all the contents of these three different files and there it is it's greater than four so we have golf is five outdoor six uh fan shop is seven so we've effectively filtered out our data and just storing the data we want on our file system so if you're going to store data on here the next stage is to export the data remember a lot of times you have my sql server and we're continually dumping that data into our long term storage and access a hadoop file system but what happens when you need to pull that data out and restore a database or maybe you have you just merged with a new company a favorite topic merging companies and merging databases that's listed under nightmare and how many different names for company can you have so you can see where being able to export is also equally important and let's go ahead and do that and i'm going to flip back over to my sql server here and we'll need to go ahead and create our database we're going to export into now i'm not going to go too much in detail on this command we're simply creating a table and the table is going to have it's pretty much the same table we already have in here from departments but in this case we're going to create a table called dept so it's the same setup but it's it's just going to we're just giving a different name a different schema and so we've done that and we'll go ahead and do a select star from e p t there we go and it's empty that's what we expect a new database a new data table and it's empty in there uh so now we need to go ahead and export our data that we just filtered out into there so let's flip back on over here to our scoop setup which is just our linux terminal window and let's go back up to one of our commands here's scoop import in this case instead of import we're going to take the scoop and we're going to export so we're going to just change that export and the connection is going to remain the same so same connect same database we're also we're still doing the retail db uh we have the same password so none of that changes uh the big change here is going to be the table instead of departments uh remember we changed it and gave it a new name and so we want to change it here also d-e-p-t so department we're not going to worry about the mapper count and the where was part of our import there we go and then finally it needs to know where to export from so instead of target directory we have an export directory that's where 
it's coming from uh still user cloudera and we'll keep it as department 2. just so you can see how that data is coming back with that we filtered in and let's go ahead and run this it'll take it just a moment to go through with steps and again because it's low i'm just going to go and skip this so you don't have to sit through it and once we've wrapped up our export we'll flip back on over here to mysql use the up arrow and this time we're going to select star from department and we can see that there it is it exported the golf outdoors and fan shop and you can imagine also that you might have to use the where command in your export also so there's a lot of mixing the command line for scoop is pretty straightforward you're changing the different variables in there whether you're creating a table listing a table listing databases very powerful tool for bringing your data into the hadoop file system and exporting it so now that we've wrapped up our demo on scoop and gone through a lot of basic commands let's dive in with a brief history of hive so the history of hive begins with facebook facebook began using hadoop as a solution to handle the growing big data and we're not talking about a data that fits on one or two or even five computers we're talking due to the fits on if you've looked at any of our other hadoop tutorials you'll know we're talking about very big data and data pools and facebook certainly has a lot of data it tracks as we know the hadoop uses mapreduce for processing data mapreduce required users to write long codes and so you'd have these really extensive java codes very complicated for the average person to use not all users were versed in java and other coding languages this proved to be a disadvantage for them users were comfortable with writing queries in sql sql has been around for a long time the standard sql query language hive was developed with the vision to incorporate the concepts of tables columns just like sql so why hive well the problem was for processing and analyzing data users found it difficult to code as not all of them were well-versed with the coding languages you have your processing ever analyzing and so the solution was required a language similar to sql which was well known to all the users and thus the hive or hql language evolved what is hive hive is a data warehouse system which is used for querying and analyzing large data sets stored in the hdfs or the hadoop file system hive uses a query language that we call hive ql or hql which is similar to sql so if we take our user the user sends out their hive queries and then that is converted into a map reduce tasks and then accesses the hadoop mapreduce system let's take a look at the architecture of hive architecture of hive we have the hive client so that could be the programmer or maybe it's a manager who knows enough sql to do a basic query to look up the data they need the hive client supports different types of client applications in different languages prefer for performing queries and so we have our thrift application in the hive thrift client thrift is a software framework hive server is based on thrift so it can serve the request from all programming language that support thrift and then we have our jdbc application and the hive jdbc driver jdbc java database connectivity jdbc application is connected through the jdbc driver and then you have the odbc application or the hive odbc driver the odbc or open database connectivity the odbc application is connected through the odbc driver with 
the growing development of all of our different scripting languages python c plus plus spar java you can find just about any connection in any of the main scripting languages and so we have our hive services as we look at deeper into the architecture hive supports various services so you have your hive server basically your thrift application or your hive thrift client or your jdbc or your hive jdbc driver your odbc application or your hive odbc driver they all connect into the hive server and you have your hive web interface you also have your cli now the hive web interface is a gui is provided to execute hive queries and we'll actually be using that later on today so you can see kind of what that looks like and get a feel for what that means commands are executed directly in cli and then the cli is a direct terminal window and i'll also show you that too so you can see how those two different interfaces work these then push the code into the hive driver hive driver is responsible for all the queries submitted so everything goes through that driver let's take a closer look at the hive driver the hive driver now performs three steps internally one is a compiler hive driver passes query to compiler where it is checked and analyzed then the optimizer kicks in and the optimize logical plan in the form of a graph of mapreduce and hdfs tasks is obtained and then finally in the executor in the final step the tasks are executed we look at the architecture we also have to note the meta store metastore is a repository for hive metadata stores metadata for hive tables and you can think of this as your schema and where is it located and it's stored on the apache derby db processing and resource management is all handled by the mapreduce v1 you'll see mapreduce v2 the yarn and the tez these are all different ways of managing these resources depending on what version of hadoop you're in hive uses mapreduce framework to process queries and then we have our distributed storage which is the hdfs and if you looked at our hadoop tutorials you'll know that these are on commodity machines and are linearly scalable that means they're very affordable a lot of time when you're talking about big data you're talking about a tenth of the price of storing it on enterprise computers and then we look at the data flow and hive so in our data flow and hive we have our hive in the hadoop system and underneath the user interface or the ui we have our driver our compiler our execution engine and our meta store that all goes into the mapreduce and the hadoop file system so when we execute a query you see it coming in here goes into the driver step one step two we get a plan what are we going to do refers to the query execution uh then we go to the metadata it's like well what kind of metadata are we actually looking at where is this data located what is the schema on it then this comes back with the metadata into the compiler then the compiler takes all that information and the syn plan returns it to the driver the driver then sends the execute plan to the execution engine once it's in the execution engine the execution engine acts as a bridge between hive and hadoop to process the query and that's going into your mapreduce in your hadoop file system or your hdfs and then we come back with the metadata operations it goes back into the metastore to update or let it know what's going on which also goes to the between it's a communication between the execution engine and the meta store execution engine communications is 
bi-directionally with the metastore to perform operations like create drop tables metastore stores information about tables and columns so again we're talking about the schema of your database and once we have that we have a bi-directional send results communication back into the driver and then we have the fetch results which goes back to the client so let's take a little bit look at the hive data modeling hive data modeling so you have your high data modeling you have your tables you have your partitions and you have buckets the tables in hive are created the same way it is done in rdbms so when you're looking at your traditional sql server or mysql server where you might have enterprise equipment and a lot of people pulling and removing stuff off of there the tables are going to look very similar and this makes it very easy to take that information and let's say you need to keep current information but you need to store all of your years of transactions back into the hadoop hive so you match those those all kind of look the same the tables are the same your databases look very similar and you can easily import them back you can easily store them into the hive system partitions here tables are organized into partitions for grouping same type of data based on partition key this can become very important for speeding up the process of doing queries so if you're looking at dates as far as like your employment dates of employees if that's what you're tracking you might add a partition there because that might be one of the key things that you're always looking up as far as employees are concerned and finally we have buckets uh data present in partitions can be further divided into buckets for efficient querying again there's that efficiency at this level a lot of times you're taught you're working with the programmer and the admin of your hadoop file system to maximize the efficiency of that file system so it's usually a two-person job but we're talking about hive data modeling you want to make sure that they work together and you're maximizing your resources hive data types so we're talking about hive data types we have our primitive data types and our complex data types a lot of this will look familiar because it mirrors a lot of stuff in sql in our primitive data types we have the numerical data types string data type date time data type and miscellaneous data type and these should be very they're kind of self-explanatory but just in case numerical data is your floats your integers your short integers all of that numerical data comes in as a number a string of course is characters and numbers and then you have your date time stamp and then we have kind of a general way of pulling your own created data types in there that's our miscellaneous data type and we have complex data types so you can store arrays you can store maps you can store structures and even units in there as we dig into hive data types and we have the primitive data types and the complex data types so we look at primitive data types and we're looking at numeric data types data types like an integer a float a decimal those are all stored as numbers in the hive data system a string data type data types like characters and strings you store the name of the person you're working with uh you know john doe the city memphis the state tennessee maybe it's boulder colorado usa or maybe it's hyper bad india that's all going to be string and stored as a string character and of course we have our date time data type data types like 
timestamp date interval those are very common as far as tracking sales anything like that you just think if you can type a stamp of time on it or maybe you're dealing with the race and you want to know the interval how long did the person take to complete whatever task it was all that is date time data type and then we talk miscellaneous data type these are like boolean and binary and when you get into boolean binary you can actually almost create anything in there but your yes knows zero one now let's take a look at complex data types a little closer uh we have arrays so your syntax is of data type and it's an array and you can just think of an array as a collection of same entities one two three four if they're all numbers and you have maps this is a collection of key value pairs so understanding maps is so central to hadoop so we store maps you have a key which is a set you only have one key per mapped value and so you in hadoop of course you collect uh the same keys and you can add them all up or do something with all the contents of the same key but this is our map as a primitive type data type in our collection of key value pairs and then collection of complex data with comment so we can have a structure we have a column name data type comment call a column comment so you can get very complicated structures in here with your collection of data and your commented setup and then we have units and this is a collection of heterogeneous data types so the syntax for this is union type data type data type and so on so it's all going to be the same a little bit different than the arrays where you can actually mix and match different modes of hive hive operates in two modes depending on the number and size of data nodes we have our local mode and our map reduce mode we talk about the local mode it is used when hadoop is having one data node and the data is small processing will be very fast on a smaller data sets which are present in local machine and this might be that you have a local file stuff you're uploading into the hive and you need to do some processes in there you can go ahead and run those high processes and queries on it usually you don't see much in the way of a single node hadoop system if you're going to do that you might as well just use like an sql database or even a java sqlite or something python and sqlite so you don't really see a lot of single node hadoop databases but you do see the local mode in hive where you're working with a small amount of data that's going to be integrated into the larger database and then we have the map reduce mode this is used when hadoop is having multiple data nodes and the data is spread across various data nodes processing large data sets can be more efficient using this mode and this you can think of instead of it being one two three or even five computers we're usually talking with the hadoop file system we're looking at 10 computers 15 100 where this data is spread across all those different hadoop nodes difference between hive and rdbms remember rdbms stands for the relational database management system let's take a look at the difference between hive and the rdbms with hive hive enforces schema on read and it's very important that whatever is coming in that's when hive's looking at it and making sure that it fits the model the rdbms enforces a schema when it actually writes the data into the database so it's read the data and then once it starts to write it that's where it's going to give you the error tell you something's incorrect 
about your scheme hive data size is in petabytes that is hard to imagine um you know we're looking at your personal computer on your desk maybe you have 10 terabytes if it's a high-end computer but we're talking petabytes so that's hundreds of computers grouped together when a rdbms data size is in terabytes very rarely do you see an rdbms system that's spread over more than five computers and there's a lot of reasons for that with the rdbms it actually has a high end amount of writes to the hard drive there's a lot more going on there you're writing and pulling stuff so you really don't want to get too big with an rd bms or you're going to run into a lot of problems with hive you can take it as big as you want hive is based on the notion of write once and read many times this is so important and they call it worm which is write w wants o read are many times m they refer to it as worm and that's true of any of a lot of your hadoop setup it's it's altered a little bit but in general we're looking at archiving data that you want to do data analysis on we're looking at pulling all that stuff off your rd bms from years and years and years of business or whatever your company does or scientific research and putting that into a huge data pool so that you can now do queries on it and get that information out of it with the rdbms it's based on the notion of read and write many times so you're continually updating this database you're continually bringing up new stuff new sales the account changes because they have a different licensing now whatever software you're selling all that kind of stuff where the data is continually fluctuating and then hive resembles a traditional database by supporting sql but it is not a database it is a data warehouse this is very important it goes with all the other stuff we've talked about that we're not looking at a database but a data warehouse to store the data and still have fast and easy access to it for doing queries you can think of twitter and facebook they have so many posts that are archived back historically those posts aren't going to change they made the post they're posted they're there and they're in their database but they have to store it in a warehouse in case they want to pull it back up with the rdbms it's a type of database management system which is based on the relational model of data and then with hive easily scalable at a low cost again we're talking maybe a thousand dollars per terabyte um the rdbms is not scalable at a low cost when you first start on the lower end you're talking about 10 000 per terabyte of data including all the backup on the models and all the added necessities to support it as you scale it up you have to scale those computers and hardware up so you might start off with a basic server and then you upgrade to a sun computer to run it and you spend you know tens of thousands of dollars for that hardware upgrade with hive you just put another computer into your hadoop file system so let's look at some of the features of hive when we're looking at the features of hive we're talking about the use of sql like language called hive ql a lot of times you'll see that as hql which is easier than long codes this is nice if you're working with your shareholders you come to them and you say hey you can do a basic sql query on here and pull up the information you need this way you don't have to take off have your programmers jump in every time they want to look up something in the database they actually now can easily do that if 
they're not skilled in programming and script writing tables are used which are similar to the rdbms hence easier to understand and one of the things i like about this is when i'm bringing tables in from a mysql server or sql server there's almost a direct reflection between the two so when you're looking at one which is the data which is continually changing and then you're going into the archive database it's not this huge jump where you have to learn a whole new language you mirror that same schema into the hdfs into the hive making it very easy to go between the two and then using hive ql multiple users can simultaneously query data so again you have multiple clients in there and they send in their query that's also true with the rdbms which kind of queues them up because it's running so fast you don't notice the lag time well you get that also with the hql as you add more computers and the query can go very quickly depending on how many computers and how much resources each machine has to pull the information and hive supports a variety of data types so with hive it's designed to be on the hadoop system which you can put almost anything into the hadoop file system so with all that let's take a look at a demo on hive ql or hql before i dive into the hands-on demo let's take a look at the website hive.apache.org that's the main website since apache it's an apache open source software this is the main software for the main site for the build and if you go in here you'll see that they're slowly migrating hive into beehive and so if you see beehive versus hive note the beehive as the new release is coming out that's all it is it reflects a lot of the same functionality of hive it's the same thing and then we like to pull up some kind of documentation on commands and for this i'm actually going to go to hortonworks hive cheat sheet and that's because hortonworks and cloudera are two of the most common used builds for hadoop and four which include hive and all the different tools in there and so hortonworks has a pretty good pdf you can download cheat sheet on there i believe cloudera does too but we'll go ahead and just look at the horton one because it's the one that comes up really good and you can see when we look at the query language it compares my sql server to hive ql or hql and you can see the basic select we select from columns from table where conditions exist the most basic command on there and they have different things you can do with it just like you do with your sql and if you scroll down you'll see data types so here's your integer your flow your binary double string timestamp and all the different data types you can use some different semantics different keys features functions uh for running a hive query command line setup and of course a hive shell uh set up in here so you can see right here if we loop through it has a lot of your basic stuff and it is we're basically looking at sql across a horton database we're going to go ahead and run our hadoop cluster hive demo and i'm going to go ahead and use the cloudera quick start this is in the virtual box so again we have an oracle virtual box which is open source and then we have our cloudera quick start which is the hadoop setup on a single node now obviously hadoop and hive are designed to run across a cluster of computers so we talk about a single node is for education testing that kind of thing and if you have a chance you can always go back and look at our demo we had on setting up a hadoop system in a single cluster 
just set a note down below in the youtube video and our team will get in contact with you and send you that link if you don't already have it or you can contact us at the www.simplylearn.com now in here it's always important to note that you do need on your computer if you're running on windows because i'm on a windows machine you're going to need probably about 12 gigabytes to actually run this it used to be goodbye with a lot less but as things have evolved they take up more and more resources and you need the professional version if you have the home version i was able to get that to run but boy did it take a lot of extra work to get the home version to let me use the virtual setup on there and we'll simply click on the cloudera quick start and i'm going to go and just start that up and this is starting up our linux so we have our windows 10 which is a computer i'm on and then i have the virtual box which is going to have a linux operating system in it and we'll skip ahead so you don't have to watch the whole install something interesting to know about the cloudera is that it's running on linuxcentos and for whatever reason i've always had to click on it and hit the escape button for it to spin up and then you'll see the dos come in here now that our cloudera spun up on our virtual machine with the linux on uh we can see here we have our it uses the thunderbird browser on here by default and automatically opens up a number of different tabs for us and a quick note because i mentioned like the restrictions on getting set up on your own computer if you have a home edition computer and you're worried about setting it up on there you can also go in there and spin up a one month free service on amazon web service to play with this so there's other options you're not stuck with just doing it on the quick start menu you can spin this up in many other ways now the first thing we want to note is that we've come in here into cloudera and i'm going to access this in two ways the first one is we're going to use hue and i'm going to open up hue and i'll take it a moment to load from the setup on here and hue is nice if i go in and use hue as an editor into hive or into the hadoop setup usually i'm doing it as a from an admin side because it has a lot more information a lot of visuals less to do with you know actually diving in there and just executing code and you can also write this code into files and scripts and there's other things you can otherwise you can upload it into hive but today we're going to look at the command lines and we'll upload it into hue and then we'll go into and actually do our work in a terminal window under the hive shell now in the hue browser window if you go under query and click on the pull down menu and then you go under editor and you'll see hive there we go there's our hive setup i go and click on hive and this will open up our query down here and now it has a nice little b that shows our hive going and we can go something very simple down here like show databases and we follow it with the semicolon and that's the standard in hive is you always add our punctuation at the end there and i'll go ahead and run this and the query will show up underneath and you'll see down here since this is a new quick start i just put on here you'll see it has the default down here for the databases that's the database name i haven't actually created any databases on here and then there's a lot of other like assistant function tables your databases up here there's all kinds of things 
you can research you can look at through hue as far as a bigger picture the downside of this is it always seems to lag for me whenever i'm doing this i always seem to run slow so if you're in cloudera you can open up a terminal window they actually have an icon at the top you can also go under applications and under applications system tools and terminal either one will work it's just a regular terminal window and this terminal window is now running underneath our linux so this is a linux terminal window or on our virtual machine which is resting on our regular windows 10 machine and we'll go ahead and zoom this in so you can see the text better on your own video i simply just clicked on view and zoom in and then all we have to do is type in hive and this will open up the shell on here and it takes it just a moment to load when starting up hive i also want to note that depending on your rights on the computer you're on in your action you might have to do pseudohyme and put in your password and username most computers are usually set up with the hive login again it just depends on how you're accessing the linux system and the hive shell once we're in here we can go ahead and do a simple uh hql command show databases and if we do that we'll see here that we don't have any databases so we can go ahead and create a database and we'll just call it office for today for this moment now if i do show we'll just do the up arrow up arrow is a hotkey that works in both linux and in hive so i can go back and paste through all the commands i've typed in and we can see now that i have my there's of course a default database and then there's the office database so now we've created a database it's pretty quick and easy and we can go ahead and drop the database we can do drop database office now this will work on this database because it's empty if your database was not empty you would have to do cascade and that drops all the tables in the database and the database itself now if we do show database and we'll go ahead and recreate our database because we're going to use the office database for the rest of this hands-on demo a really handy command now set with the sql or hql is to use office and what that does is that sets office as a default database so instead of having to reference the database every time we work with a table it now automatically assumes that's the database being used whatever tables we're working on the difference is you put the database name period table and i'll show you in just a minute what that looks like and how that's different if we're going to have a table and a database we should probably load some data into it so let me go ahead and switch gears here and open up a terminal window you can just open another terminal window and it'll open up right on top of the one that you have hive shell running in and when we're in this terminal window first we're going to go ahead and just do a list which is of course a linux command you can see all the files i have in here this is the default load we can change directory to documents we can list in documents and we're actually going to be looking at employee.csv a linux command is the cat you can use this actually to combine documents there's all kinds of things that cat does but if we want to just display the contents of our employee.csv file we can simply do cat employee csv and when we're looking at this we want to know a couple things one there's a line at the top okay so the very first thing we notice is that we have a header line the 
next thing we notice is that the data is comma separated, and in this particular case you'll see a space here. generally you've got to be real careful with spaces, there's all kinds of things to watch out for because they can cause issues. these spaces won't, because they're attached to strings; if the space were next to the integer you would get a null value coming into the database unless you did something extra. now with most of hadoop it's important to know that you're writing the data once and reading it many times, and that's true of almost everything coming into hadoop, so you really want to process the data before it gets into the database. for those of you who have studied data transformation, that's the etl, where you extract, transform and then load the data, so you really want to extract and transform before putting it into hive, and then you load the transformed data into hive. of course we also want to note the schema: we have an integer, string, string, integer, integer, so we kept the data pretty simple here. the last thing you're going to want to look up is the source. since we're doing local uploads we want to know what the path is, and we have the whole path, in this case home slash cloudera slash documents. these are just text documents we're working with right now, we're not doing anything fancy, so we can do a simple gedit employee.csv and you'll see it comes up here as just a text document, so i can easily remove those added spaces, there we go, and then just save it, and now we've edited it. gedit is usually one of the default editors that loads with linux, so any text editor will do. back to the hive shell, let's go ahead and create a table employee, and what i want you to note here is that i did not put the semicolon on the end; the semicolon tells it to execute that line. this is kind of nice because you can actually just paste the statement in if you have it written on another sheet, and you can see right here where i have create table employee and it goes onto the next line, so i can enter all of my commands at once. now just so i don't have any typo errors i went ahead and pasted the next three lines in. the next one is our schema; if you remember correctly from the other side we had the different values in here, which were id, name, department, year of joining and salary, and the id is an integer, name is a string, department a string, year of joining an integer and salary an integer, and we put brackets around them, and you could do this all as one line. then we have row format delimited fields terminated by comma, and this is important because the default is tabs, so if i left it out it wouldn't find any terminated fields and you'd get a bunch of null values loaded into your table. and then finally our table properties, where we want to skip the header line, count equals one (i'll pull the whole statement together in a sketch just below). now this is a lot of work for uploading a single file, it's kind of goofy when you're uploading a single file that you have to put all this in here, but keep in mind hive and hadoop are designed for writing many files into the database: you write them all in there, they're saved, it's an archive, a data warehouse, and then you're able to do all your queries on them. so a lot of times we're not looking at just one file coming up, we're loading hundreds of files, you have your reports coming off of your main database, all those reports are being loaded
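to keep that straight, here's roughly what the whole create table statement looks like pulled together in the hive shell; this is a sketch based on the walkthrough above, and the exact column identifiers (like year_of_joining) are my guess at what the demo used

    -- fields terminated by ',' because hive's default field delimiter is not a comma
    -- the tblproperties line tells hive to skip the header row of the csv
    create table employee (
      id int,
      name string,
      department string,
      year_of_joining int,
      salary int
    )
    row format delimited
    fields terminated by ','
    tblproperties ("skip.header.line.count"="1");

the load data local inpath command that follows in the demo then points this table at the csv file.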
and you have your log files you have i mean all this different data is being dumped into hadoop and in this case hive on top of hadoop and so we need to let it know hey how do i handle these files coming in and then we have the semicolon at the end which lets us know to go ahead and run this line and so we'll go ahead and run that and now if we do a show tables you can see there's our employee on there we can also describe if we do describe employee you can see that we have our id integer name string department string year of joining integer and salary integer and then finally let's just do a select star from employee very basic sql nhql command selecting data it's going to come up and we haven't put anything in it so as we expect there's no data in it so if we flip back to our linux terminal window you can see where we did the cat employee.csv and you can see all the data we expect to come into it and we also did our pwd and right here you see the path you need that full path when you are loading data you know you can do a browse and if i did it right now with just the employee.csv as a name it will work but that is a really bad habit in general when you're loading data because it's you don't know what else is going on in the computer you want to do the full path almost in all your data loads so let's go ahead and flip back over here to our hive shell we're working in and the command for this is load data so that says hey we're loading data that's a hive command hql and we want local data so you got to put down local in path so now it needs to know where the path is now to make this more legible i'm just going to go ahead and hit enter then we'll just paste the full path in there which i have stored over on the side like a good prepared demo and you'll see here we have home cloudera documents employee.csv so it's a whole path for this text document in here and we go ahead and hit enter in there and then we have to let it know where the data is going so now we have a source and we need a destination and it's going to go into the table and we'll just call it employee we'll just match the table in there and because i want it to execute we put the semicolon on the end it goes ahead and executes all three lines now if we go back if you remember we did the select star from employee just using the up error to page through my different commands i've already typed in you can see right here we have as we expect we have rows sam mike and nick and we have all their information showing in our four rows and then let's go ahead and do uh select and count let's look at a couple of these different select options you can do we're going to count everything from employee now this is kind of interesting because the first one just pops up with the basic select because it doesn't need to go through the full map reduce phase but when you start doing a count it does go through the full map redo setup in the hive in hadoop and because i'm doing this demo on a single node cloudera virtual box on top of a windows 10 all the benefits of running it on a cluster are gone and instead it's now going through all those added layers so it takes longer to run you know like i said when you do a single node as i said earlier it doesn't do any good as an actual distribution because you're only running it on one computer and then you've added all these different layers to run it and we see it comes up with four and that's what we expect we have four rows we expect four at the end and if you remember from our cheat sheet which we 
brought up here from hortons it's a pretty good one there's all these different commands we can do we'll look at one more command where we do the uh what they call sub queries right down here because that's really common to do a lot of sub queries and so we'll do select star or all different columns from employee now if we weren't using the office database it would look like this from office dot employee and either one will work on this particular one because we have office set as a default on there so from office employee and then the command where creates a subset and in this case we want to know where the salary is greater than 25 000. there we go and of course we end with our semicolon and if we run this query you can see it pops up and there's our salaries of people top earners we have rose and i t and mike and hr kudos to them of course they're fictional i don't actually we don't actually have a rose and a mic in those positions or maybe we do so finally we want to go ahead and do is we're done with this table now remember you're dealing with a data warehouse so you usually don't do a lot of dropping of tables and databases but we're going to go ahead and drop this table here before we drop it one more quick note is we can change it so what we're going to do is we're going to alter table office employee and we want to go ahead and rename it there's some other commands you can do in here but rename is pretty common and we're going to rename it to and it's going to stay in office and it turns out one of our shareholders really doesn't like the word employee he wants employees plural it's a big deal to him so let's go ahead and change that name for the table it's that easy because it's just changing the metadata on there and now if we do show tables you'll see we now have employees not employee and then at this point maybe we're doing some house cleaning because this is all practice so we're going to go ahead and drop table and we'll drop table employees because we changed the name in there so if we did employee just give us an error and now if we do show tables you'll see all the tables are gone now the next thing we want to go and take a look at and we're going to walk back through the loading of data just real quick because we're going to load two tables in here and let me just float back to our terminal window so we can see what those tables are that we're loading and so up here we have customer we have a customer file and we have an order file we want to go ahead and put the customers and the orders into here so those are the two we're doing and of course it's always nice to see what you're working with so let's do our cat customer.csv we could always do g edit but we don't really need to edit these we just want to take a look at the data in customer and important in here is again we have a header so we have to skip a line comma separated nothing odd with the data we have our schema which is integer string integer string integer so you'd want to take that note that down or flip back and forth when you're doing it and then let's go ahead and do cat order.csv and we can see we have oid which i'm guessing is the order id we have a date up something new we've done integers and strings but we haven't done date when you're importing new and you never worked with the date date's always one of the more trickier fields to port in when that's true of just about any scripting language i've worked with all of them have their own idea of how date's supposed to be formatted what the default is 
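since date formatting is the thing that usually bites people here, a quick hedged illustration: hive's date type expects the year-month-day layout described next, and you can sanity check a value by casting a string literal in the shell. the literal dates below are made up for illustration, they're not from the demo files

    -- hive reads date literals as yyyy-MM-dd
    select cast('2022-03-01' as date);
    -- to_date() pulls the date portion out of a timestamp-style string
    select to_date('2022-03-01 10:15:00');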
this particular format, year with all four digits, dash, month as two digits, dash, day, is the standard import format for hive, so you'll have to look up what the different formats are if a different format is coming in or you're not able to pre-process the data, but this would normally be handled by pre-processing the data coming in. if you remember correctly, that's our etl, which stands for extract, transform, then load, so you want to make sure you're transforming this data before it gets in here. so we're going to go ahead and bring both of these data sets in, and really we're doing this so we can show you the basic join. if you remember from our setup there's merge, join, all kinds of different things you can do, but joining different data sets is so common that it's really important to know how to do it, so we need to bring in these two data sets. you can see where i just created a table customer, here's our schema, the integer, then name, age, address, salary, here's our fields delimited by commas, and our table properties where we skip the header line. well, let's go ahead and load the data first and then we'll do the same with our order table. let's put that in here, and i've got it split into three lines so you can see it easily: we've got load data local inpath, so we know we're loading data, we know it's local, and we have the path, here's the complete path for... oops, this is supposed to be order csv, grabbed the wrong one, and of course it's going to give me errors because you can't recreate the same table. here we go, create table, here's our integer, date, customer, the basic setup that we had coming in for our schema, row format delimited by commas, table properties skip header line, and then finally let's load the data into our order table: load data local inpath home cloudera documents order.csv into table order. now if we did everything right we should be able to do select star from customer, and you can see we have all seven customers, and then we can do select star from order and we have four orders. so this is just a quick frame of reference; a lot of times when you have your customer databases in business you have thousands of customers from years and years, and some of them move, close their business, change names, all kinds of things happen. so what we want to do is find just the information connected to these orders and who's connected to them. so let's go ahead and do a select, because we're going to display information. so select, and this is kind of interesting, we're going to do c dot id, and i'm going to define c as the customer table in just a minute, then we're going to do c dot name, and again we're going to define the c, then c dot age. so this means from the customer we want to know their id, their name and their age, and i'd also like to know the order amount, so let's do o dot amount. and then this is where we need to define what we're doing, and i'm going to capitalize from customer, so we're going to take the customer table in here and name it c, that's where the c comes from, that's the customer table c, and we want to join order as o, that's where our o comes from, so the o dot amount is what we're joining in there. then we want to do this on, we've got to tell it how to connect the two tables, c dot id equals o dot customer underscore id, so now we know how the two tables are joined
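pulled together, the join we just built up looks something like this in the hive shell; the table and column names are the ones from this demo, and note that on newer hive versions order is a reserved word, so you may need backticks around that table name

    select c.id, c.name, c.age, o.amount
    from customer c
    join order o
      on (c.id = o.customer_id);

the on clause is the piece that tells hive how the two tables line up, the customer's id against the order table's customer_id column.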
and remember we have seven customers in here and four orders, so as it processes we should get back four rows of names joined together, joined of course based on the orders. once we're done we now have the order number, the person who made the order, their age and the amount of the order, which came from the order table, so you have your different information and you can see how the join works here, a very common use of tables in hql and sql. let's do one more thing with our database and then i'll show you a couple of other hive commands. let's go ahead and do a drop, we're going to drop database office, and if you're looking at this and you remember from earlier, this will give me an error, and let's just see what that looks like: it says failed to execute, exception, one or more tables exist. so if you remember from before, you can't just drop a database unless you tell it to cascade; that lets it know i don't care how many tables are in it, let's get rid of it. and in hadoop, since it's a warehouse, a data warehouse, you usually don't do a lot of dropping; maybe at the beginning when you're developing the schemas and you realize you messed up you might drop some stuff, but down the road you're really just adding commodity machines so you can store more stuff on it, so you usually don't do a lot of database dropping. some other fun commands to know: you can do select round 2.3 as round value, so you can do a round off in hive; we can do floor, as floor value, which is going to give us a 2, so it takes the float down to an integer, it basically truncates it, but it goes down; and we can also do ceiling, which is going to round it up, so we're looking for the next integer above. there are a few commands we didn't show in here because we're on a single node: as an admin, to help expedite the process, you usually add in partitions for the data, and buckets; you can't do that on a single node because when you add a partition it partitions the data across separate nodes. but beyond that you can see that it's very straightforward, we have sql coming in and all your basic queries that are in sql are very similar in hql. let's get started with pig: why pig, what is pig, mapreduce versus hive versus pig. hopefully you've had a chance to do our hive tutorial and our mapreduce tutorial, if you haven't send a note over to simplylearn and we'll follow up with a link to you. we'll look at pig architecture, the working of pig, the pig latin data model, pig execution modes, a use case with twitter and the features of pig, and then we'll tag on a short demo so you can see pig in action. so why pig? as we all know hadoop uses mapreduce to analyze and process big data, and processing big data consumed more time; before we had the hadoop system they'd have to spend a lot of money on a huge set of computers and enterprise machines, so hadoop mapreduce was introduced, and afterwards processing big data was faster using mapreduce. then what is the problem with mapreduce? prior to 2006 all mapreduce programs were written in java, and non-programmers found it difficult to write lengthy java code. they faced issues incorporating map, sort and reduce, the fundamentals of mapreduce, while creating a program, you can see here the map phase, shuffle and sort, and the reduce phase. eventually it became a difficult task to maintain and optimize the code, due to which the processing time increased. you can imagine a manager trying to go in there needing a simple query to find something in the data, and he has to go talk to the programmers anytime he wants anything, so that was a big problem.
not everybody wants to have an on-call programmer for every manager on their team. yahoo faced problems processing and analyzing large data sets using java, as the code was complex and lengthy, and there was a necessity to develop an easier way to analyze large datasets without writing time-consuming, complex java code and scripts and all that fun stuff. apache pig was developed by yahoo; it was developed with the vision to analyze and process large datasets without using complex java code. pig was developed especially for non-programmers, and pig uses simple steps to analyze data sets, which is time efficient. so what exactly is pig? pig is a scripting platform that runs on hadoop clusters, designed to process and analyze large data sets. so you have your pig, which uses sql-like queries, they're definitely not sql but some of them resemble sql queries, and then we use that to analyze our data. pig operates on various types of data like structured, semi-structured and unstructured data. let's take a closer look at mapreduce versus hive versus pig. we start with a compiled language, your mapreduce, then we have hive which is your sql-like query, and then we have pig which is a scripting language; it has some similarities to sql but it has a lot of its own stuff. remember, the sql-like query hive is based off looks for structured data, and as we get into scripting languages like pig we're dealing more with semi-structured and even unstructured data. with hadoop mapreduce we need to write long complex code; with hive there's no need to write complex code, you can just put it in a simple sql-like query, hql or hive ql; and in pig there's no need to write complex code either, as we have pig latin. now remember, mapreduce can process structured, semi-structured and unstructured data, and as i mentioned before hive can process only structured data, think rows and columns, where pig can process structured, semi-structured and unstructured data. you can think of structured data as rows and columns, semi-structured as your html and xml documents like you have on your web pages, and unstructured could be anything from groups of documents in written format to twitter tweets, any of those things come in as very unstructured data. with hadoop mapreduce we have a lower level of abstraction; with both hive and pig we have a higher level of abstraction, so it's much easier for someone to use without having to dive in deep and write a very lengthy mapreduce program, and those map and reduce programs can take 70 or 80 lines of code when you can do the same thing in one or two lines with hive or pig. this is the advantage pig has over hive: hive can process only structured data, while pig can process structured, semi-structured and unstructured data. some other features separate the different query languages: mapreduce supports partitioning features, as does hive, while pig has no concept of partitioning, it doesn't support the partitioning feature; partitioning allows you to split the data in such a way that it can be queried quicker, and you're not able to do that in pig. mapreduce uses java and python, while hive uses an sql-like query language known as hive ql or hql, and in pig, pig latin is used, which is a procedural data language. mapreduce is used by programmers, pretty much straightforward java; hive is used by data analysts; pig is used by researchers and programmers. certainly there's a lot of mix between all three: programmers have been known to go in and use hive for a quick query, and anybody can use pig for a quick query or research.
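to make that one-or-two-lines point concrete, here's a rough side-by-side of the same filter written first in hive ql and then as a small pig latin script; the pig field names and the path are made up for illustration, and pig's actual syntax gets a proper walkthrough in the demo later

    -- hive ql: a sql-like query over a structured table
    select * from employee where salary > 25000;

    -- pig latin: the same idea as a short script
    emp = LOAD '/pig_input' USING PigStorage(',')
          AS (id:chararray, name:chararray, profession:chararray, salary:int);
    high_paid = FILTER emp BY salary > 25000;
    DUMP high_paid;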
under mapreduce, code performance is really good; under hive, code performance is lesser than mapreduce; and under pig, code performance is lesser than mapreduce but better than hive. so if we're going to look at speed and time, mapreduce is going to have the fastest performance of all of those, pig comes second, and hive follows in the back. let's look at the components of pig. pig has two main components. we have pig latin: pig latin is the procedural data flow language used in pig to analyze data, it is easy to program using pig latin and it is similar to sql. and then we have the runtime engine: the runtime engine represents the execution environment created to run pig latin programs, it is also a compiler that produces mapreduce programs, and it uses hdfs, your hadoop file system, for storing and retrieving data. as we dig deeper into the pig architecture we'll see that we have pig latin scripts, programmers write a script in pig latin to analyze data using pig, then you have the grunt shell, and it actually says grunt when we start it up and we'll show you that here in a little bit, which goes into the pig server, and this is where we have our parser. the parser checks the syntax of the pig script, and after checking, the output will be a dag, a directed acyclic graph. then we have an optimizer: after your dag, your logical plan is passed to the logical optimizer where optimization takes place. finally the compiler converts the dag into mapreduce jobs, and that is executed on mapreduce under the execution engine. the results are displayed using the dump statement and stored in hdfs using the store statement, and again we'll show you that at the end, you always want to execute everything once you've created it, and dump is kind of our execution statement. you can see right here, as we were talking about earlier, once we get to the execution engine it's coded into mapreduce and then mapreduce processes it onto the hdfs. the working of pig: the pig latin script is written by the users, so you load data, write a pig script, and the pig operations run. when we look at the working of pig, step one, the pig latin script is written by the users, we load data and write the pig script; step two, in this step all the pig operations are performed by the parser, optimizer and compiler, so we go into the pig operations; and then we get to step three, execution of the plan, and in this stage the results are shown on the screen or otherwise stored in hdfs as per the code. so it might be a small amount of data you're reducing it down to and you want to put that on the screen, or you might be converting a huge amount of data which you want to put back into the hadoop file system for other use.
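before we get into the data model, here's a minimal sketch of what that three step flow looks like as an actual pig latin script, along the lines of the demo coming up later; the path and field names are illustrative. note that nothing actually runs until the dump (or a store), which is the point where the parser, optimizer and compiler turn the script into mapreduce jobs

    -- step one: load the data and describe its fields
    office = LOAD '/pig_input' USING PigStorage(',')
             AS (id:chararray, name:chararray, profession:chararray, age:chararray);
    -- step two: a simple operation, keep just two of the fields
    office_each = FOREACH office GENERATE name, age;
    -- step three: dump triggers parsing, optimization, compilation and execution
    DUMP office_each;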
let's take a look at the pig latin data model. the data model of pig latin helps pig to handle various types of data, for example we have the atom, 'rob' or 50. an atom represents any single value of a primitive data type in pig latin, like integer, float or string, and it is stored as a string. so we go from our atom, which is the most basic thing, if you look at just rob or just 50 that's an atom, the most basic object we have in pig latin, then you have a tuple. a tuple represents a sequence of fields that can be of any data type, it is the same as a row in rdbms, for example a set of data from a single row, and you can see here we have rob comma 5, and you can imagine with many of our other examples you might have the id number, the name, where they live, their age, their date of starting the job, that would all be one row and stored as a tuple. and then we create a bag: a bag is a collection of tuples, it is the same as a table in rdbms and is represented by brackets, and you can see here we have our table with rob 5 and mike 10. and we also have a map: a map is a set of key value pairs, the key is of character array type and the value can be of any type, it is represented by brackets, and so we have name and age where the key and value are mike and 10. pig latin has a fully nestable data model, which means one data type can be nested within another. here's a diagram representation of the pig latin data model, and in this particular example we have basically an id number, a name, an age and a place, and we break this apart and look at the model from the pig latin perspective. we start with our field, and if you remember a field contains basically an atom, it is one particular data type, and the atom is stored as a string which is then converted into either an integer, a number or a character string. next we have our tuple, and in this case you can see that it represents a row, so our tuple would be three comma joe comma 29 comma california, and finally we have our bag, which in this particular example contains three rows. let's take a quick look at pig execution modes. pig works in two execution modes depending on where the data is residing and where the pig script is going to run. we have local mode: here the pig engine takes input from the linux file system and the output is stored in the same file system; local mode is useful in analyzing small data sets using pig. and we have the mapreduce mode: here the pig engine directly interacts with and executes in hdfs and mapreduce; in the mapreduce mode queries written in pig latin are translated into mapreduce jobs and are run on a hadoop cluster, and by default pig runs in this mode. there are also three modes in pig depending on how pig latin code can be written: we have our interactive mode, batch mode and embedded mode. the interactive mode means coding and executing the script line by line, and when we do our example we'll be in the interactive mode. in batch mode all scripts are coded in a file with the extension .pig and the file is directly executed. and then there's embedded mode, where pig lets its users define their own functions, udfs, in a programming language such as java. so let's take a look and see how this works in a use case, and in this case the use case is twitter. users on twitter generate about 500 million tweets on a daily basis, and hadoop mapreduce was used to process and analyze this data; analyzing the number of tweets created by a user in the tweet table was done using mapreduce and the java programming language, and you can see the problem, it was difficult to perform mapreduce operations as users were not well versed with writing complex java code, so twitter used apache pig to overcome these problems, and let's see how. let's start with the problem statement:
analyze the user table and tweet table and find out how many tweets are created by a person and here you can see we have a user table we have alice tim and john with their id numbers one two three and we have a tweet table in the tweet table you have your um the id of the user and then what they tweeted google was a good whatever it was tennis dot spacecraft olympics politics whatever they're tweeting about the following operations were performed for analyzing given data first the twitter data is loaded into the pig storage using load command and you can see here we have our data coming in and then that's going into pig storage and this data is probably on an enterprise computer so this is actually active twitter's going on and then it goes into hadoop file system remember the hadoop file system is a data warehouse for storing data and so the first step is we want to go ahead and load it into the pig storage into our data storage system the remaining operations performed are shown below in join and group operation the tweet and user tables are joined and grouped using co-group command and you can see here where we add a whole column when we go from user names and tweet to the id link directly to the name so alice was user 1 10 was 2 and john 3. and so now they're listed with their actual tweet the next operation is the aggregation the tweets are counted according to the names the command used is count so it's very straightforward we just want to count how many tweets each user is doing and finally the result after the count operation is joined with the user table to find out the username and you can see here where alice had three tim two and john 1. pig reduces the complexity of the operations which would have been lengthy using mapreduce in joining group operation the tweet and user tables are joined and grouped using co-group command the next operation is the aggregation the tweets are counted according to the names the command used is count the result after the count operation is joined with the user table to find out the username and you can see we're talking about three lines of script versus a mapreduce code of about 80 lines finally we could find out the number of tweets created by a user in a simple way so let's go quickly over some of the features of pig that we already went through most of these first ease of programming as pig latin is similar to sql lesser lines of code need to be written short development time as the code is simpler so we can get our queries out rather quickly instead of having to have a programmer spend hours on it handles all kinds of data like structured semi-structured and unstructured pig lets us create user defined functions pig offers a large set of operators such as join filter and so on it allows for multiple queries to process on parallel and optimization and compilation is easy as it is done automatically and internally so enough theory let's dive in and show you a quick demo on some of the commands you can do in pick today's setup will continue as we have in the last three demos to go and use cloudera quick start and we'll be doing this in virtual box we do have a tutorial in setting that up you can send a note to our simply learn team and then get that linked to you once your cloudera quickstart has spun up and remember this is virtualbox we've created a virtual machine and this virtual machine is centos linux once it's spun up you'll be in a full linux system here and as you can see we have thunderbird browser which opens up to the hadoop basic 
system browser and we can go underneath the hue where it comes up by default if you click on the pull down menu and go under editor you can see there's our impala our hive a pig along with a bunch of other query languages you can use and we're going under pig and then once you're in pig we can go ahead and use our command line here and just click that little blue button to start it up and running we will actually be working in terminal window and so if you're in the cloudera quick start you can open up the terminal window up top or if you're in your own setup and you're logged in you can easily use all of your commands here in terminal window and we'll zoom in that way you get a nice view of what's going on there we go now for our first command we're going to do a hadoop command and import some data into the hadoop system in this case a pig input and let's take a look at this we have a hadoop that listen no it's going to be a hadoop command dfs there's actually four variations of dfs so if you have hdfs or whatever that's fine all four of them point used to be different setups underneath different things and now they all do the same thing and we want to put this file which in this case is under home cloudera documents and sample and we just want to take that and put it into the pig input now let's take a look at that file if i go under my document browsers and open this up you'll see it's got a simple id name profession and age we have one jack engineer 25 and that was in one of our earlier things we had in there and so let's go ahead and hit enter and execute this and now we've uploaded that data and it's gone into our pig input and then a lot of the hadoop commands mimic the linux commands and so you'll see we have cat as one of our commands or it has a hyphen before it so we execute that with hadoop dfs hyphen cat slash pig input because that's what we called it that's where we put our sample csv at and we execute this you can see from our hadoop system it's going to go in and pull that up and sure enough it pulls out the data file we just put in there and then we can simply enter the pig latin or pig editor mode by typing in pig and we can see here uh by our grunt i told you that's how it was going to tell you you were in pig latin there's our grunt command line so we are now in the pig shell and then we'll go ahead and put our load command in here and the way this works is i'm going to have office equals load and here's my load in this case it's going to be pig input we have that in single brackets you remember that's where the data is in the hadoop file system where we dumped it into there we're going to using pig storage our data was separated as with a comma so there's our comma separator and then we have as in this case we have an id character array name character array profession character a and age character ray and we're just going to do them all as character arrays just to keep this simple for this one and then when i hit put this all in here you can see that's our full command line going in and we have our semicolon at the end so when i hit enter it's now set office up but it hasn't actually done anything yet it doesn't do anything until we do dump office so there's our command to execute whatever we've loaded or whatever setup we have in here and we run that you can see it go through the different languages and this is going through the map reduce remember we're not doing this locally we're doing this on the hadoop setup and once we finished our dump you can see we have id 
name profession age and all the information that we just dumped into our pick oh we can now do let's say oh let's say we have a request just for we'll keep it simple in here but just for the name and age and so we can go office we'll call it each as our variable underscore each and we'll say for each office generate name comma h and for each means that we're going to do this for each row and if you're thinking map reduce you know that this is a map function because it's mapping each row and generating name and age on here and of course we want to go ahead and close it with a semicolon and then once we've created our query or the command line in here let's go ahead and dump office underscore each in with our semicolon and this will go through our map reduce setup on here and if we were on a large cluster the same processing time would happen in fact it's really slow because i have multiple things on this computer and this particular virtual box is only using a quarter of my processor it's only dedicated to this and you can see here there it is name and age and it also included the top row since we didn't delete that out of there or tell it not to and that's fine for this example but you need to be aware of those things when you're processing a significantly large amount of data or any data and we can also do office and we'll call this dsc for descending so maybe the boss comes to you and says hey can we order office by id descending and of course your boss you've taught him how to uh your shareholder it sounds a little druggatory and say boss you've talked to the shareholder and you said and you've taught them a little bit of pig latin and they know that they can now create office description and we can order office by id description and of course once we do that we have to dump office underscore description so that it'll actually execute and there goes into our map reduce we'll take just a moment for it to come up because again i'm running on only a quarter of my processor and you can see we now have our ids in descending order returned let's also look at and this is so important with anytime you're dealing with big data let's create office with a limit and you can of course do any of this instead of with office we could do this with office descending so you get just the top two ids on there we're going to limit just to two and of course to execute that we have to dump office underscore limit you can just think of dumping your garbage into the pig pen for the pig to eat there we go dump office limit two and that's going to just limit our office to the top two and for our output we get our first row which had our id name profession and age and our second row which is jack who's an engineer now let's do an filter we'll call it office underscore filter you guessed it equals filter office by profession equals and keep note this is uh similar to how python does it with the double equal signs for equal for doing a true false statement so for your logic statement remember to use two equal signs in pig and we're going to say it equals doctor so we want to find out how many doctors do we have on our list and we'll go ahead and do our dump we're dumping all our garbage into the pig pen and we're letting pig take over and see what it can find out and see who's a doctor on our list and we find uh employee id number two bob is a doctor 30 years old for this next section we're going to cover something we see it a lot nowadays in data analysis and that's word counting tokenization that is one of the next 
big steps as we move forward in our data analysis where we go from say stock market analysis of highs and lows and all the numbers to what are people saying about companies on twitter what are they saying on the web pages and on facebook suddenly you need to start counting words and finding out how many words are totaled how many are in the first part of the document and so on we're going to cover a very basic word count example and in this case i've created a document called wordrose.txt and you can see here we have simplylearn is a company supporting online learning simplylearn helps people attain their certifications simplylearn is an online community i love simply learn i love programming i love data analysis and i went and saved this into my documents folder so we could use it and let me go ahead and open up a new terminal window for our word count let me go and close the old one so we're going to go in here and instead of doing this as pig we're going to do pig minus x local and what i'm doing is i'm telling the pig to start the pig shell but we're going to be looking at files local to our virtual box or this centos machine and let me go ahead and hit enter on there just maximize this up there we go and it will load pig up and it's going to look just the same as the pig we were doing which was defaulted to hi to our hadoop system to our hdfs this is now defaulted to the local system now we're going to create lines we're going to load it straight from the file remember last time we took the hdfs and loaded it into there and then loaded it into pig since we're going to local we're just going to run a local script we have lines equals load home the actual full path home cloud area documents and i called it wordrose.txt and as line is a character array so each line and i've actually you can change this to read each document i certainly have done a lot of document analysis and then you go through and do word counts and different kind of counts in there so once we go ahead and create our line instead of doing the dump we're going to go ahead and start entering all of our different setups for each of our steps we want to go through and let's just take a look at this next one because the load is straightforward we're loading from this particular file since we're locals loading it directly from here instead of going into the hadoop file system and it says as and then each line is read as a character array now we're going to do words equal for each of the lines generate flat tokenize line space as word now there's a lot of ways to do this this is if you're a programmer you're just splitting the line up by spaces there's actual ways to tokenize it you gotta look for periods capitalization there's all kinds of other things you play with with this but for the most basic word count we're just going to separate it by spaces the flatten takes the line and just creates a it flattens each of the words out so this is uh we're just going to generate a bunch of words for each line and then each each of those words is as a word a little confusing in there but if you really think about it we're just going down each line separating it out and we're generating a list of words one thing to note is the default for tokenize you can just do tokenized line without the space in there if you do that it'll automatically tokenize it by space you can do either one and then we're going to do group we're going to group it by words so we're going to group words by word so when we we split it up each token is a word and 
it's a list of words and so we're going to grouped equals group words by word so we're going to group all the same words together and if we're going to group them then we want to go ahead and count them and so for count we'll go ahead and create a word count variable and here's our four each so for each group grouped is our line where we group all the words in the line that are similar we're going to generate a group and then we're going to count the words for each group so for each line we group the words together we're going to generate a group and that's going to count the words we want to know the word count in each of those and that comes back in our word count and finally we want to take this and we want to go ahead and dump word count and this is a little bit more what you see when you start looking at grunt scripts you'll see right here these these lines right here we have each of the steps you take to get there so we load our file for each of our lines we're going to generate and tokenize it into words then we're going to take the words and we're going to group them by same words for each group we're going to generate a group and we're just going to count the words so we're going to summarize all the words in here and let's go ahead and do our dump word count which executes all this and it goes through our mapreduce it's actually a local runner you'll see down here you start seeing where they still have mapreduce but as a special runner we're mapping it that's a part of each row being counted and grouped and then when we do the word count that's a reducer the reducer creates these keys and you can see i is used three times a came up once and came up once is to continue on down here to attain online people company analysis simply learn they took the top rating with four certification so all these things are encountered in the how many words are used uh in in data analysis this is probably the very the beginnings of data analysis where you might look at it and say oh they mentioned love three times so whatever's going on in this post it's about love and uh what do they love and then you might attach that to the different objects in here so you can see that uh pig latin is fairly easy to use there's nothing really you know might it takes a little bit to learn the script uh depending on how good your memory is as i get older my memory leaks a little bit more so i don't memorize it as much but that was pretty straightforward the script we put in there and then it goes through the full map reduce localized run comes out and like i said it's very easy to use that's why people like pig latin is because it's intuitive one of the things i like about pig latin is when i'm troubleshooting when we're troubleshooting a lot of times you're working with a small amount of data and you start doing one line at a time and so i can go lines equal load and there's my loaded text and maybe i'll just dump lines and then it's going to run it's going to show me all the lines that i'm working on in the small amount of data and that way i can test that if i got an error on there that said oh this isn't working maybe they'll be oh my gosh i'm in map reduce or i'm in the basic grunt shell instead of the local path grunt so maybe it'll generate an error on there and you can see here it just shows each of the lines going down hive versus pig on one side we'll have our sharp stinger on our black and yellow friend and on the other side our thick hide on our pig let's start with an introduction to hbase back in the 
days, data used to be less and it was mostly structured. structured data usually sat in a database where every field had exactly the right length; if you had a name field it was exactly 32 characters, like the old access database in microsoft. the files were small, and if you had a few hundred people in one database that was considered big data. this data could easily be stored in a relational database, an rdbms. when we talk about relational databases you might think of oracle, microsoft sql server, mysql; all of these have evolved to do a lot more today than they did back then, but they still fall short in a lot of ways, and they're all examples of an rdbms, a relational database management system. then the internet evolved and huge volumes of structured and semi-structured data got generated. with semi-structured data we're talking about email (just look at my spam filter), html pages, xml, which a lot of the time sits behind our html and help-desk pages, and json, and the amount generated almost doubles from one year to the next. storing and processing this data on an rdbms became a major problem, and the solution is apache hbase. let's take a look at the hbase history. it starts back in november 2006, when google released the paper on bigtable. in 2007, just a few months later, an hbase prototype was created as a hadoop contribution, and later that year, in october 2007, the first usable hbase was released along with hadoop 0.15. in january 2008 hbase became a subproject of hadoop, and from october 2008 through september 2009 the 0.18, 0.19 and 0.20 versions were released. finally, in may 2010, hbase became an apache top-level project. so in the course of about four years hbase went from an idea on paper to a solid project under apache, and since 2010 it has continued to evolve and grow as a major way of storing semi-structured data. so what is hbase? hbase is a column-oriented database management system derived from google's nosql database bigtable, and it runs on top of the hadoop file system, hdfs. it's an open-source project that is horizontally scalable, and that's very important to understand: you don't have to buy a few huge, expensive computers, you expand it by continually adding commodity machines, so it's a linear cost expansion as opposed to an exponential one. it's a nosql database written in java, which permits fast querying, and it's well suited for sparse data sets, so it can contain missing or n/a values without that bogging it down the way it would another database. so who is using hbase for their servers and for storing their data? we have hortonworks, which isn't a surprise because, like cloudera, they're behind hadoop and a lot of its development, and apache hbase is the open source alongside it. we also have capital one and bank of america; banks collect information on people and track it, and their information can be very sparse.
one bank might have collected, way back when, information about a person's whole family and the family income as well as their personal income, and maybe another one never collected the family income at all; you start to see data that's very difficult to store or that's missing a lot of values. hubspot is using it; facebook, twitter, most of the social media companies are using it; and then there's jpmorgan chase and company, another bank, which uses hbase as its nosql data warehouse. let's take a look at an hbase use case so we can dig a little more into how it functions: china mobile, a telecommunication company that provides mobile voice and multimedia services across china. china mobile generates billions of call detail records, or cdrs: every call, how long it was, different aspects of the call, maybe the tower it was broadcast from, all of that is recorded so it can be tracked. traditional database systems were unable to scale up to that volume of data and provide a cost-effective solution, so storing and doing real-time analysis on billions of call records was a major problem for this company. the solution was apache hbase: hbase stores billions of rows of detailed call records and performs fast processing of those records; even though it's a nosql store, the query style can feel quite close to sql. some applications of hbase: in the medical industry, hbase is used for storing genome sequences and the disease history of the people of an area, and you can imagine how sparse both of those are; a genome sequence might only have pieces that are unique to different people, and you really don't need a column for every possible disease a person could get, you just want to record the diseases those people have actually dealt with. in e-commerce, hbase is used for storing logs of customer search history, and it powers analytics and targeted advertising for better business insight. in sports, hbase stores match details and the history of each match and uses that data for better prediction. now, when we look at hbase, we all want to know the difference between hbase and an rdbms, a relational database management system. hbase does not have a fixed schema, it's schema-less; you define only column families, and we'll show you what that means in a moment. an rdbms has a fixed schema which describes the structure of the tables: you have rows and columns, and each column has a very specific structure for how much data can go in it. hbase works well with structured and semi-structured data; an rdbms works well only with structured data. hbase can hold denormalized data and can contain missing or null values; an rdbms stores only normalized data, and a null value still takes up the same space as a regular value in many cases. hbase is built for wide tables and scales horizontally; for instance, if you were tokenizing words and word combinations you might end up with 1.4 million distinct words and combinations, while an rdbms is built for thin tables that are hard to scale, so you don't want 1.4 million columns in your sql database: it's going to crash, and it's going to make searches very hard to do. the short sketch below shows what that sparse, wide layout looks like in practice.
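to make the sparse, column-family idea concrete, here's a minimal hbase shell sketch using the employee example that comes up a little later in this section; the table name, family names and the values are just illustrative placeholders.

```
create 'employee', 'personal', 'professional'
put 'employee', '1', 'personal:name',            'anna'
put 'employee', '1', 'personal:city',            'chicago'
put 'employee', '1', 'professional:salary',      '50000'
put 'employee', '2', 'personal:name',            'bob'        # no city, no age: those cells simply don't exist
put 'employee', '2', 'professional:designation', 'engineer'
scan 'employee'
```

only the cells you actually write are stored, so a row with most of its columns missing costs you next to nothing.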
hbase, by contrast, only stores the data that is actually part of whatever row you're working on. let's look at some of the features of hbase. it's scalable: data can be scaled across various nodes, as it is stored in hdfs, and i always think about the economics of this: it's a linear add-on of roughly a thousand dollars of commodity computing for each terabyte of data, while with an enterprise machine you're looking at about ten thousand dollars at the lower end per terabyte, including all your backup and redundancy, so it's something like a tenth of the cost. it has automatic failure support: a write-ahead log across the cluster provides protection against failure. it gives you consistent reads and writes of the data. it has a java api that's easy for clients to use. and it supports block caching and bloom filters for high-volume query optimization. let's dig a little deeper into hbase storage, hbase column-oriented storage, because i told you we'd look at how it stores the data. you have a row key, and this is really one of the important references: each row has its own key, your row id. then you have your column families; here you can see column family one, column family two, column family three, and within each family you have your column qualifiers. column family one might define three columns, and a given row might not have any data in them at all, so when you query column family one for every column that contains a certain thing, that row simply isn't returned, whereas in column family two maybe column one and column three are filled out, and so on. each cell is connected to the row where the data is actually stored. let's see what this looks like when you fill the data in. you have a row key with a row id, say employee ids one, two, three; that's straightforward, you'd probably have that on a sql server too. then you have your column families, and this is where it really starts separating out: you might have personal data, and under personal data you'd have name, city, age, and maybe a lot more, number of children, degree, all those kinds of things, and some of them might be missing; you might only have the name and age of one employee, and only the name, city and number of children of another. so with the personal data family you can collect a wide variety of data and store it in hbase very easily, and then maybe you have a family of professional data: designation, salary, all the things the employee is doing for that company. now let's dig a little deeper into the hbase architecture, and you can see what looks to be a complicated chart; it's not as complicated as you think. from apache hbase we have zookeeper, which is used for monitoring what's going on, and you have your hmaster, the hbase master, which assigns regions and does load balancing. under the hmaster you have your region servers, which serve data for reads and writes, and the region servers are all the different computers you have in your hadoop cluster; on each one you'll have regions, an hlog, a store and a memory
store and then you have your different files for h file that are stored on there and those are separated across the different computers and that's all part of the hdfs storage system so we look at the architectural components or regions and we're looking at we're drilling down a little bit hbase tables are divided horizontally by a row so you have a key range into regions so each of those ids you might have ids 1 to 20 21 to 50 or whatever they are regions are assigned to the nodes in the cluster called region servers a region contains all rows in the table between the region's start key and the end key again 1 to 10 11 to 20 and so forth these servers serve data for read and write and you can see here we have the client and the get and then git sends it out and it finds out where that start if it's between which start keys and n keys and then it pulls the data from that different region server and so the region sign data definition language operation create delete are handled by the h master so the h-master is telling it what are we doing with this data what's going out there assigning and reassigning regions for recovery or load balancing and monitoring all servers so that's also part of it so you know if your ids if you have 500 ids across three servers you're not going to put 400 ids on server 1 and 100 on the server 2 and leaves region 3 and region 4 empty you're going to split that up and that's all handled by the h master and you can see here monitors region servers assigns regions to region servers assigns regions to recent servers and so forth and so forth hbase has a distributed environment where h-master alone is not sufficient to manage everything hence zookeeper was introduced it works with h-master so you have an active h-master which sends a heartbeat signal to zookeeper indicating that it's active and the zookeeper also has a heartbeat to the region servers so the region servers send their status to zoo keeper indicating they are ready for read and write operation inactive server acts as a backup if the active hmaster fails it'll come to the rescue active hmaster and region servers connect with a session to zookeeper so you see your active hmaster selection region server session they're all looking at the zookeeper keeping that pulse an active h-master region server connects with a session to the zoo keeper and you can see here where we have ephemeral nodes for active sessions via heartbeats to indicate that the region servers are up and running so let's take a look at hbase read or write going on there's a special hbase catalog table called the meta table which holds the location of the regions in the cluster here's what happens the first time a client reads or writes data to hbase the client gets the region server the host the meta table from zookeeper and you can see right here the client it has a request for your region server and goes hey zookeeper can you handle this the zookeeper takes a look at it and goes ah middle location is stored in zookeeper so it looks at his meta data on there and then the metadata table location is sent back to the client the client will query the meta server to get the region server corresponding to the row key if it wants to access the client caches this information along with the minute table location and you can see here the client going back and forth to the region server with the information and it might be going across multiple region servers depending on what you're querying so we get the region server for row key from the meta 
table; that's where the row key comes in and says, hey, this is where we're going with this. once we have the row key from the corresponding region server, we can put a row or get a row from that region server. let's take a look at the hbase meta table: it's a special hbase catalog table that maintains a list of all the region servers in the hbase storage system. you can see the meta table has a row key and a value, a table and key mapping to a region and region server, so the meta table is used to find the region for a given table key and tells you which region server to go to. if we look a little closer at the write mechanism in hbase, we have the write-ahead log, or wal as it's abbreviated, and a good way to remember it is that the write-ahead log is a file used to store new data that is yet to be put on permanent storage; it is used for recovery in the case of failure. the client comes in and literally puts the new data into this kind of temporary storage, the wal. once it's gone into the wal, it goes to the memstore: the memstore is the write cache that stores new data that hasn't yet been written to disk, and there is one memstore per column family per region. step three is the acknowledgement: once the data is placed in the memstore, the client receives the ack. then, when the memstore reaches its threshold, it dumps, or commits, the data into an hfile. so we've taken our data into the wal, the wal feeds it into the different memstores, and when a memstore says, hey, we've reached the threshold, we're ready to dump, the data moves into the hfiles, and hfiles store the rows as sorted key-values on disk. we've done a lot of theory here, so let's dive in and take a look at what some of these commands look like and what happens in our hbase when we're manipulating a nosql setup. if you're learning a new setup, it's always good to start with where it's coming from: it's open source by apache, and you can go to hbase.apache.org, where there's a lot of information; you can actually download hbase separately from hadoop, although most people just install hadoop because hbase is bundled with it. you'll find a reference guide there, and you can go through the apache reference guide; there are a number of things to look at, but we're going to be working through the apache hbase shell, and there are a lot of other interfaces you can look up as well. in the reference guide you can go down to reading hbase shell commands from a command file, where it gives you different options and formats for putting data in and listing it; you can certainly also create files and scripts to do this, but we're going to go through the basics in a plain hbase shell, and if you continue down the page there's more detail on how to create and how to get to your data in hbase. now, i will be working in a virtual box, the oracle virtualbox, which you can download; you can put a note below on the youtube, as we did have a previous session on setting up a virtual environment to run your hadoop system. i'm using the cloudera quickstart installed in here; there's hortonworks, and you can also
use the amazon web service there's a number of options for trying this out in this case we have cloudera on the oracle virtualbox the virtual box has linux centos installed on it and then the hadoop that has all the different hadoop flavors including hbase and i bring this up because my computer is a windows 10 the operating system of the virtual box is linux and we're looking at the hbase data warehouse and so we have three very different entities all running on my computer and that can be confusing if it's the first time in and working with this kind of setup now you'll notice in our cloudera setup they actually have some hbase monitoring so i can go underneath here and click on hbase and master and it'll tell me what's going on with my region servers it'll tell me what's going on with our backup tables right now i don't have any user tables because we haven't created any and this is only a single node and a single hbase tour so you're not gonna expect anything too extensive in here since this is for practice and education and perhaps testing out package you're working on it's not for really you can deploy cloudera of course but when you talk about a quick start or a single node setup that's what it's really for so we can go through all the different hbase and you'll see all kinds of different information with zookeeper if you saw it flash by down here what version we're working in since zookeeper is part of the hbase setup where we want to go is we want to open up a terminal window and in cloudera it happens to be up at the top and when you click on here you'll see your cloudera terminal window open and let me just expand this we have a nice full screen and then i'm also going to zoom in that way you have a nice big picture and you can see what i'm typing what's going on and to open up your hbase shell simply type hbase shell to get in and hit enter and you'll see it takes just a moment to load and we'll be in our age based shell for doing hbase commands once we've gotten into our hbase shell you'll see it'll have the hbase prompt information ahead of it we can do something simple like list this is going to list whatever tables we have it so happens that there's a base table that comes with hbase now we can go ahead and create and i'm going to type in just create what's nice about this is it's going to throw me kind of a it's going to say hey there's no just straight create but it does come up and tell me all these different formats we can use for create so we can create our table and one of our families and add splits names versions all kinds of things you can do with this let's just start with a very basic one on here and let's go ahead and create and we'll call it new table now let's just call it new tbl for table new table and then we also want to do let's do knowledge so let's take a look at this i'm creating a new table and it's going to have a family of knowledge in it and let me hit enter it's going to come up it's going to take it a second to go ahead and create it now we have our new table in here so if i go list you'll now see table and new table so you can now see that we have the new table and of course the default table that's set up in here and we can do something like uh describe we can describe and then we're going to do new tbl and when we describe it it's going to come up it's going to say hey name i have knowledge data block encoding none bloom filter row or replication go version all the different information you need new minimum version zero forever deleted cells 
false, block size, in memory and so on; you can look all of this up on apache.org to really track it down. one of the things that's important to note is versions: hbase keeps different versions of the data that's stored, and that's always important to understand; we may talk about that a little later on. besides describe, we can also run status, which says i have one active master going on, that's our hbase as a whole, and status 'summary' gives the same kind of thing. now that we've created the table, let's put something in it. actually, before i do that, let's just type put on its own; you'll see it gives us a lot of different options for how it works and different ways of formatting the data as it goes in, and they all start with the table name. so we do put 'newtbl', then 'row1', then, remember we already created the knowledge column family, 'knowledge:sports', and we set that equal to 'cricket'. so underneath our knowledge family, row one now has a sports column containing cricket, and we'll see what that looks like in just a second. let's do a couple more: another put on row one, this time knowledge:science with the value chemistry, so row one is into both cricket and chemistry, a chemist who plays cricket; and then one more on row one, knowledge:science again, but this time physics, so not only a chemist but a physicist too, and i have quite a joy in physics myself. then let's do row two: in row two this person has knowledge in economics, maybe it's for the whole country, so we'll make the value macroeconomics; and one more for row two, our economist is also a musician, so knowledge:music with the value pop, they're into the current pop music. so we've loaded our table, and we have two rows in here, row one and row two. we can list the contents by doing a scan; type scan on its own first and it tells you all the different setups you can use, but in this case we want scan 'newtbl'. in the scan output you'll see row one has a column knowledge:science with a timestamp and the value physics, and knowledge:sports with a timestamp and the value cricket, so it records when each cell was written, and then row two has its economics and music columns. and this is interesting, because if you remember, we originally also gave row one science: chemistry, and when we come down here we don't see the chemistry anywhere. why? because we wrote physics into the same row and column afterwards, so the newer value, physics, is what the scan returns. the whole shell session so far is collected below for reference.
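here's the session we just walked through, pulled together in one place; the table, family and row-key names are spelled the way they were dictated above, so treat the exact strings as approximate.

```
hbase shell
list
create 'newtbl', 'knowledge'
describe 'newtbl'
status 'summary'
put 'newtbl', 'row1', 'knowledge:sports',    'cricket'
put 'newtbl', 'row1', 'knowledge:science',   'chemistry'
put 'newtbl', 'row1', 'knowledge:science',   'physics'        # same row and column: replaces chemistry
put 'newtbl', 'row2', 'knowledge:economics', 'macroeconomics'
put 'newtbl', 'row2', 'knowledge:music',     'pop'
scan 'newtbl'                                                  # row1 shows physics and cricket, row2 economics and pop
```

next we'll look at disabling, altering and re-enabling the table, and at reading single rows back with get. now let me go ahead and clear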
down a little bit. next we're going to ask the question is_enabled 'newtbl', and when i hit enter you'll see it comes back true. then let's go ahead and disable it: disable 'newtbl', making sure the quotes are around it. now that we've disabled it, what happens when we do the scan? scan 'newtbl', hit enter, and you'll see we get an error, so once a table is disabled you can't do anything with it until we re-enable it. before we enable the table, let's do an alteration on it, and this should look a little familiar because it's very similar to create: we alter 'newtbl' and add a family called testinfo, hit enter, and it takes just a moment to update. then we enable 'newtbl' so it's back up and running, and when we describe 'newtbl' you'll now see the name knowledge, with its data encoding and all the information under knowledge, and down below we also have testinfo and all the information concerning it. i went and enabled it again, oops, i guess we enabled it twice. so we had scan 'newtbl', which brings the information up like before, but what if we want to get a single row? we do get 'newtbl', 'r1', and you can see we have knowledge:science with a timestamp and the value physics, and knowledge:sports with a timestamp and the value cricket. then let's see what happens when we put into the table again: row one, knowledge:economics, and this time instead of macroeconomics it's market economics, and when we go back and run the get command you can see knowledge:economics with a timestamp and the value market economics, along with physics and cricket, because economics, science and sports are the three columns we now have on that row, each with its own value. so if you managed to go through all these commands and look at the basics, you now have the ability to create a very basic hbase nosql setup based on your columns and your rows. and just for fun we'll go back to cloudera, where they have the website up for the hbase master status; i'll refresh it, and down under user tables you'll see our table, and we can click on details, and here's what we just did, so if you're the admin looking at this you'd go, oh, someone just created newtbl and this is what they have underneath it. here we will learn about apache spark: the history of spark, what spark is, hadoop versus spark, the components of apache spark, that is spark core, spark sql, spark streaming, spark mllib and graphx, and then spark architecture, applications of spark and spark use cases. so let's begin with the history of apache spark. it all started in 2009 as a project at uc berkeley's amplab, started by matei zaharia. in 2010 it was open sourced under a bsd license, in 2013 spark became an apache top-level project, and in 2014 it was used by databricks to sort large-scale data sets, setting a new world record. so that's how apache spark started,
and today it is one of the most in demand processing framework or i would say in memory computing framework which is used across the big data industry so what is apache spark let's learn about this apache spark is a open source in-memory computing framework or you could say data processing engine which is used to process data in batch and also in real time across various cluster computers and it has a very simple programming language behind the scenes that is scala which is used although if users would want to work on spark they can work with python they can work with scala they can work with java and so on even r for that matter so it supports all these programming languages and that's one of the reasons that it is called polyglot wherein you have good set of libraries and support from all the programming languages and developers and data scientists incorporate spark into their applications or build spark based applications to process analyze query and transform data at a very large scale so these are key features of apache spark now if you compare hadoop west spark we know that hadoop is a framework and it basically has map reduce which comes with hadoop for processing data however processing data using mapreduce in hadoop is quite slow because it is a batch oriented operation and it is time consuming if you if you talk about spark spark can process the same data 100 times faster than mapreduce as it is a in-memory computing framework well there can always be conflicting ideas saying what if my spark application is not really efficiently coded and my mapreduce application has been very efficiently coded well then it's a different case however normally if you talk about code which is efficiently written for mapreduce or for spark based processing spark will win the battle by doing almost 100 times faster than mapreduce so as i mentioned hadoop performs batch processing and that is one of the paradigms of map reduced programming model which involves mapping and reducing and that's quite rigid so it performs batch processing the intermittent data is written to sdfs and written red back from sdfs and that makes hadoop's map reduce processing slower in case of spark it can perform both batch and real-time processing however lot of use cases are based on real-time processing take an example of macy's take an example of retail giant such as walmart and there are many use cases who would prefer to do real time processing or i would say near real time processing so when we say real time or near real time it is about processing the data as it comes in or you are talking about streaming kind of data now hadoop or hadoop's mapreduce obviously was started to be written in java now you could also write it in scala or in python however if you talk about mapreduce it will have more lines of code since it is written in java and it will take more times to execute you have to manage the dependencies you have to do the right declarations you have to create your mapper and reducer and driver classes however if you compare spark it has few lines of code as it is implemented in scala and scala is a statically typed dynamically inferred language it's very very concise and the benefit is it has features from both functional programming and object-oriented language and in case of scala whatever code is written that is converted into byte codes and then it runs in the jvm now hadoop supports kerberos authentication there are different kind of authentication mechanisms kerberos is one of the well-known ones and it 
can really get difficult to manage now spark supports authentication via a shared secret it can also run on yarn leveraging the capability of kerberos so what are spark features which really makes it unique or in demand processing framework when we talk about spark features one of the key features is fast processing so spark contains resilient distributed data sets so rdds are the building blocks for spark and we'll learn more about rdds later so spark contains rdds which saves huge time taken in reading and writing operations so it can be 100 times or you can say 10 to 100 times faster than hadoop when we say in memory computing here i would like to make a note that there is a difference between caching and in memory computing think about it caching is mainly to support read ahead mechanism where you have your data pre-loaded so that it can benefit further queries however when we say in memory computing we are talking about lazy evaluation we are talking about data being loaded into memory only and only when a specific kind of action is invoked so data is stored in ram so here we can say ram is not only used for processing but it can also be used for storage and we can again decide whether we would want that ram to be used for persistence or just for computing so it can access the data quickly and accelerate the speed of analytics now spark is quite flexible it supports multiple languages as i already mentioned and it allows the developers to write applications in java scala r or python it's quite fault tolerance so spark contains these rdds or you could say execution logic or you could say temporary data sets which initially do not have any data loaded and the data will be loaded into rdds only when execution is happening so these can be fault tolerant as these rdds are distributed across multiple nodes so failure of one worker node in the cluster will really not affect the rdds because that portion can be recomputed so it ensures loss of data it ensures that there is no data loss and it is absolutely fault tolerant it is for better than analytics so spark has rich set of sql queries machine learning algorithms complex analytics all of this supported by various par components which we will learn in coming slides with all these functionalities analytics can be performed better in terms of spark so these are some of the key features of spark however there are many more features which are related to different components of spark and we will learn about them so what are these components of spark which i'm talking about spark core so this is the core component which basically has rdds which has a core engine which takes care of your processing now you also have spark sql so people who would be interested in working on structured data or data which can be structuralized would want to prefer using spark sql and spark sql internally has components or features like data frames and data sets which can be used to process your structured data in a much much faster way you have spark streaming now that's again an important component of spark which allows you to create your spark streaming applications which not only works on data which is being streamed in or data which is constantly getting generated but you would also or you could also transform the data you could analyze or process the data as it comes in in smaller chunks you have sparks mlib now this is basically a set of libraries which allows developers or data scientists to build their machine learning algorithms so that they can do 
predictive analytics or prescriptive descriptive pre-emptive analytics or they could build their recommendation systems or bigger smarter machine learning algorithms using these libraries and then you have graphics so think about organizations like linkedin or say twitter where you have data which naturally has a network kind of flow so data which could be represented in the form of graphs now here when i talk about graphs i'm not talking about pie charts or bar charts but i'm talking about network related data that is data which can be networked together which can have some kind of relationship think about facebook think about linkedin where you have one person connected to other person or one company connected to other companies so if we have our data which can be represented in the form of network graphs then spark has a component called graphics which allows you to do graph based processing so these are some of the components of apache spark spark core spark sql spark streaming spark mlib and graphics so to learn more about components of spark let's learn here about spark core now this is the base engine and this is used for large scale parallel and distributed data processing so when you work with spark at least and the minimum you would work with a spark core which has rdds as the building blocks of your spark so it is responsible for your memory management your fault recovery scheduling distributing and monitoring jobs on a cluster and interacting with storage systems so here i would like to make a key point that spark by itself does not have its own storage it relies on storage now that storage could be your sdfs that is hadoop's distributed file system it could be a database like nosql database such as hbase or it could be any other database say rdbms from where you could connect your spark and then fetch the data extract the data process it analyze it so let's learn a little bit about your rdds resilient distributed data sets now spark core which is the base engine or the core engine is embedded with the building blocks of spark which is nothing but your resilient distributed data set so as the name says it is resilient so it is existing for a shorter period of time distributed so it is distributed across nodes and it is a data set where the data will be loaded or where the data will be existing for processing so it is immutable fault tolerant distributed collection of objects so that's what your rdd is and there are mainly two operations which can be performed on an rdd now to take an example of this say i want to process a particular file now here i could write a simple code in scala and that would basically mean something like this so if i say val which is to declare a variable i would say val x and then i could use what we call a spark context which is basically the most important entry point of your application so then i could use a method of spark context for example that is text file and then i could point it to a particular file so this is just a method of your spark context and spark context is the entry point of your application now here i could just give a path in this method so what does this step do it does not do any evaluation so when i say val x i'm creating an immutable variable and to that variable i'm assigning a file now what this step does is it actually creates a rdd resilient distributed data set so we can imagine this as a simple execution logic a empty data set which is created in memory of your node so if i would say i have multiple nodes in which my 
data is split and stored, imagining that spark is working with hadoop and yarn. so i have hadoop using, say, two nodes, and this is my distributed file system, hdfs, which basically means my file is written to hdfs and the file's blocks are stored on the underlying disks of these machines. when i say val x equals sc.textFile, i'm using a method of the spark context; there are various other methods, like wholeTextFiles, parallelize and so on. this step creates an rdd, and you can imagine it as a logical data set created in memory across these nodes, because these nodes have the data; however, no data is loaded yet. so this is the first rdd, and i can call it the first step in what we call a dag, a directed acyclic graph, which will have a series of steps that get executed at a later stage. later i can do further processing: i could say val y and do something on x, say x.map, where i want to apply a function to every record or element in the file, and i give my logic there. this second step again creates an rdd, the second step in my dag, and here you have one more rdd which depends on the first, so my first rdd becomes the base or parent rdd, and the resultant rdd becomes the child rdd. then we can go further and say val z: i want to do a filter on y, and the logic might be that i'm searching for a word or some pattern, so i say val z equals y.filter, which again creates one more rdd in memory, one more step in the dag. so this is my dag, a series of steps which will be executed. now, when does the execution happen, when does the data actually get loaded into these rdds? everything so far, whether using a method of the spark context or a transformation like map, filter, flatMap or anything else, is a transformation; operations such as map, filter, join, union and many others only create rdds, which means they only create execution logic, no data is evaluated, no operation happens right now. only when you invoke an action, maybe you want to print some result, take some elements and look at them, or do a count, does the execution of the dag get triggered right from the beginning. so if i say z.count, where i just want to count the records i'm filtering, that's an action being invoked, and it triggers the execution of the dag from the start. and if i do z.count again, it starts the whole execution of the dag again from the beginning: the data is loaded into the first rdd, then you have the map, then the filter, and finally the result. that is the core concept of rdds and how they work, and there's a short sketch of it just below. so mainly in spark there are two kinds of operations: transformations and actions. transformations, or using a method of the spark context, will always and only create an rdd, a step in the dag; actions are what invoke the execution, from the first rdd all the way to the last rdd, where you can get your result.
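putting those dictated lines together, a minimal sketch of the lazy-evaluation flow looks roughly like this; the file path, the map logic and the word being filtered for are just placeholder assumptions.

```scala
// each of these steps only adds an rdd to the dag -- nothing is read yet
val x = sc.textFile("hdfs:///user/cloudera/input.txt")   // base (parent) rdd
val y = x.map(line => line.toLowerCase)                  // transformation: child rdd of x
val z = y.filter(line => line.contains("spark"))         // transformation: one more step in the dag

// actions trigger the whole dag, from the first rdd to the last
z.count()   // loads the data, runs the map and the filter, returns the count
z.count()   // invoked a second time: the dag is executed all over again from the beginning
```

if you didn't want that second count to redo all the work, this is where persisting or caching the rdd in memory would come in, which ties back to the earlier point about ram being usable for storage as well as for computing.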
so this is how your rdds work. now, when we talk about the components of spark, let's learn a little bit about spark sql. spark sql is the component, the processing framework, used for structured and semi-structured data processing. usually people might have their structured data stored in an rdbms, or in files where the data is structured with particular delimiters and follows a pattern, and if you want to process that structured data with spark's in-memory processing, you'd prefer to use spark sql. you can work with different data formats, say csv or json, you can work with smarter formats like avro and parquet, even binary or sequence files, and your data could be coming in from an rdbms, extracted over a jdbc connection. at the bottom level, spark sql has a data source api which basically allows you to get the data in whichever format it is. on top of that, spark sql has something called the data frame api. what are data frames? in short, you can visualize them as rows and columns: your data represented in the form of rows and columns with some column headings. so, like my previous example where you convert a file into an rdd using a method of the spark context, in the same way, when you want to use spark sql you use spark's sql context, or hive context, or the spark session, and that allows you to work with data frames. in my earlier example we were saying val x equals sc.textFile; in the case of data frames, instead of sc you'd be using spark dot something, because the spark session is the entry point for the data frame api. in older versions like spark 1.6 we were using hive context or sql context, so with spark 1.6 you would say val x equals sqlContext dot something, and in newer versions you use spark dot. the data frame api basically lets you create data frames out of your structured data, which also lets spark know that the data is already in a particular structure, it follows a format, and based on that spark's back-end dag scheduler, that sequence of steps i talked about, already knows the different steps involved in your application. then you have the data frame dsl, or you can use spark sql, or the hive query language; any of these options can be used to work with your data frames. to learn more about data frames, follow along in the next sessions; for now, there's a short sketch of the idea just below.
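a minimal sketch of what that looks like, assuming a spark 2.x style spark session and a small json file whose name is just a placeholder:

```scala
// 'spark' is the SparkSession entry point (in spark 1.6 you would use sqlContext or hiveContext instead)
val df = spark.read.json("people.json")       // data source api: json, csv, parquet, jdbc and more
df.printSchema()                              // spark now knows the structure the data follows
df.select("name", "age").show()               // the data frame dsl
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()   // or plain sql over the same data frame
```

either style, the dsl or the sql string, describes the same structured computation to spark's scheduler underneath.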
when you talk about spark streaming, this is very interesting for organizations who want to work on streaming data. imagine a store like macy's that wants machine learning algorithms: suppose you have a lot of customers walking the store searching for a particular product or item. there could be cameras placed in the store, and this is already being done, cameras which keep monitoring which corner of the store has more customers. once a camera captures this information, it can be streamed in to be processed by algorithms, and those algorithms will work out which product, or which series of products, customers might be interested in. if the algorithm can process in real time, based on the number of customers and the available product in the store, it can come up with an attractive alternative price, which can be displayed on the screen, and probably customers will buy the product. that's real-time processing: the data comes in, algorithms work on it, do some computation and give out a result, which can then lead to customers buying a particular product. the whole essence of this machine learning and real-time processing only holds good if and when the customers are in the store. the same applies to an online shopping portal, where machine learning algorithms might do real-time processing based on the clicks a customer is making, their history and their behavior, and come up with recommendations or a better, altered price so that the sale happens. in this case the value of real-time processing exists only within a fixed, particular window of time, which means you need something that can process the data as it comes in. spark streaming is a lightweight api that allows developers to perform batch processing and also real-time streaming and processing of data; it provides secure, reliable, fast processing of live data streams. so what happens in spark streaming, in brief? you have an input data stream; that could be a file that's constantly being appended, it could be some kind of metrics, or events based on the clicks customers are making or the products they're choosing in a store. this input data stream is pushed through a spark streaming application, and the spark streaming application breaks the content into smaller streams, what we call discretized streams, or batches of smaller data, on which processing can happen in frames. so you could say: process my file every five seconds, for the latest data that has come in. there are also some window-based options; when i say window i mean, say, a window of the past three events, each event being five seconds. these batches of smaller data are processed by the spark engine, and the processed data can then be stored or used for further processing. that's what spark streaming does, and there's a minimal sketch of it right after this part. when you talk about mllib, it's a machine learning library that is simple to use, scalable and compatible with various programming languages. hadoop also has some libraries, like apache mahout, which can be used for machine learning algorithms, but in spark we're talking about machine learning algorithms built using mllib's libraries and then processed by spark. mllib eases the deployment and development of scalable machine learning algorithms: think about clustering techniques, classification where you want to do supervised or unsupervised learning, collaborative filtering and many other data science techniques, the techniques you need to build recommendation engines or bigger, smarter machine learning algorithms; they can all be built using spark's mllib. and then graphx is spark's own graph computation engine, mainly for when you're interested in doing graph-based processing.
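before moving on to graphx, here's a minimal sketch of the classic dstream word count with a five-second batch interval like the one described above; the socket source, host and port are just stand-ins for whatever your real input stream is.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// discretized streams: whatever arrives in each five-second batch gets processed as a small chunk
val ssc = new StreamingContext(sc, Seconds(5))
val stream = ssc.socketTextStream("localhost", 9999)   // stand-in input data stream
val counts = stream.flatMap(_.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
```

a window over, say, the past three of those five-second batches would be a windowed operation on that same stream, which is the window idea mentioned above. as for graphx and graph-based processing,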
think about facebook think about linkedin where you can have your data which can be stored and that data has some kind of network connections or you could say it is well networked i could say x is connected to y y is connected to z z is connected to a so x y z a all of these are in terms of graph terminologies we call as vertices or vertex which are basically being connected and the connection between these are called edges so i could say a is friend to b so a and b are vertices and friend a relation between them is the edge now if i have my data which can be represented in the form of graphs if i would want to do a processing in such way this could be not only for social media it could be for your network devices it could be a cloud platform it could be about different applications which are connected in a particular environment so if you have data which can be represented in the form of graph then graphics can be used to do etl that is extraction transformation load to do your data analysis and also do interactive graph computation so graphics is quite powerful now when you talk about spark your spark can work with your different clustering technologies so it can work with apache mesos that's how spark came in where it was initially to prove the credibility of apache mesos spark can work with yarn which is usually you will see in different working environments spark can also work as standalone that means without hadoop spark can have its own setup with master and worker processes so usually or you can say technically spark uses a master slave architecture now that consists of a driver program that can run on a master node it can also run on a client node it depends on how you have configured or what your application is and then you have multiple executors which can run on worker nodes so your master node has a driver program and this driver program internally has the spark context so your spark every spark application will have a driver program and that's driver program has a inbuilt or internally used spark context which is basically your entry point of application for any spark functionality so your driver or your driver program interacts with your cluster manager now when i say interacts with cluster manager so you have your spark context which is the entry point that takes your application request to the cluster manager now as i said your cluster manager could be say apache mesos it could be yarn it could be spark standalone master itself so your cluster manager in terms of yarn is your resource manager so your spark application internally runs as series or set of tasks and processes your driver program wherever that is run will have a spark context and spark context will take care of your application execution how does that do it spark context will talk to cluster manager so your cluster manager could be on and in terms of when i say cluster manager for yarn would be resource manager so at high level we can say a job is split into multiple tasks and those tasks will be distributed over the slave nodes or worker nodes so whenever you do some kind of transformation or you use a method of spark context and rdd is created and this rdd is distributed across multiple nodes as i explained earlier worker nodes are the slaves that run different tasks so this is how a spark architecture looks like now we can learn more about spark architecture and its interaction with yarn so usually what happens when your spark context interacts with the cluster manager so in terms of yarn i could say 
resource manager. now, we already know about yarn: you have node managers running on multiple machines, and each machine has some ram and cpu cores allocated to its node manager; on the same machines you have the data nodes running, which hold the hadoop-related data. whenever your application wants to process the data, the application, via the spark context, contacts the cluster manager, and in terms of yarn that's the resource manager. what does the resource manager do? it makes requests to the node managers of the machines where the relevant data resides, asking for containers; the resource manager is negotiating, saying hey, can i have a container of 1 gb ram and one cpu core, can i have another container of 1 gb ram and one cpu core, and the node manager, based on the kind of processing it's already doing, will approve or deny it. once a container is allocated, the resource manager starts an extra piece of code called the app master, which is responsible for the execution of your applications, whether those are spark applications or mapreduce. the application master runs in one of the containers, that is, it uses that ram and cpu core, and then it uses the other containers that were allocated by the node managers to run the tasks. so what is a container? a combination of ram and cpu cores, and it's within these containers that the executor processes run, taking care of your application's tasks. that's how spark works in integration with yarn. now let's look at the spark cluster managers. spark can work in standalone mode, that is, without hadoop: by default, applications submitted to a spark standalone cluster run in fifo order, and each application will try to use all the available nodes. a standalone cluster basically means you have multiple nodes, with the master process running on one of them and spark worker processes running on the others; there's no distributed file system here, because spark is standalone, so it relies on external storage, or the file systems of the nodes where the data is stored, and processing happens across the nodes where the worker processes run. you could have spark working with apache mesos, an open-source project to manage computer clusters that can also run hadoop applications; mesos was introduced earlier, and spark originally came into existence partly to prove the credibility of mesos. you can have spark working with hadoop's yarn, which is what you'll most widely see in working environments, since yarn takes care of processing and supports different processing frameworks, spark included. and you could have kubernetes, which is making a lot of news in today's world: an open-source system for automating deployment, scaling and management of containerized applications, where you could have multiple docker-based images connecting to each other, and spark works with kubernetes as well. a short sketch of what submitting to these different cluster managers looks like follows below.
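to tie the cluster-manager discussion together, here's roughly what submitting the same application against the different masters looks like; the class name, jar name, host and the resource numbers are all placeholders.

```
# spark standalone
spark-submit --class com.example.MyApp --master spark://master-host:7077 myapp.jar

# yarn: the resource manager negotiates containers and the app master runs inside one of them
spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster \
             --num-executors 4 --executor-memory 1g --executor-cores 1 myapp.jar

# mesos or kubernetes only change the --master url (mesos://..., k8s://...)
```

now let's look at some applications of spark. jpmorgan chase and company uses spark to detect fraudulent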
Now let's look at some applications of Spark. JPMorgan Chase & Co. uses Spark to detect fraudulent transactions, analyze the spending of individuals to suggest offers, and identify patterns that help decide how much to invest and where. That is one example from banking: many banking environments use Spark for its real-time processing capabilities and in-memory, faster processing, whether for fraud detection, credit analysis, pattern identification, or other use cases. Alibaba Group uses Spark to analyze large data sets such as real-time transaction details, online or in stores, and browsing history, in the form of Spark jobs, and then provides recommendations to its users; that is Spark in the e-commerce domain. IQVIA, a leading healthcare company, uses Spark to analyze patient data, identify possible health issues, and diagnose them based on medical history; real-time, faster processing is finding a lot of importance in the healthcare industry. And entertainment and gaming companies like Netflix and Riot Games use Apache Spark to show relevant advertisements to their users based on the videos they have watched, shared, or liked. So those are a few domains, banking, e-commerce, healthcare, entertainment and gaming, and there are many more that use Spark in their day-to-day activities for real-time, in-memory, faster processing.

Now let's discuss a Spark use case: Conviva, one of the world's leading video streaming companies. Video streaming is a challenge; YouTube, for example, is said to hold enough uploaded video to keep a viewer watching for years, with people uploading videos, companies running advertisements, and users streaming all of it, and the demand for high-quality streaming experiences keeps increasing. Conviva collects data about video streaming quality to give its customers visibility into the end-user experience they are delivering. How do they do it? Apache Spark again. Using Spark, Conviva delivers a better quality of service to its customers by removing screen buffering and learning in detail about network conditions in real time. This information is then stored in the video player to manage live video traffic coming from around 4 billion video feeds every month and to ensure maximum retention. Conviva has also created auto-diagnostics alerts with Spark: anomalies along the video streaming pipeline are detected automatically and the root cause of the issue is diagnosed. This reduces the waiting time before a video starts, avoids buffering, and recovers the video from technical errors, with the overall goal of maximizing viewer engagement. So that is a Spark use case where Conviva uses Spark in different ways to stay ahead in video streaming.

Now that we have understood Spark's components and architecture, let's look at running a Spark application. A Spark application can run in local mode; for that you can set up an IDE such as Eclipse with the Scala plug-in and write your application there.
Here I have an example application that can run in local mode or on a cluster. It imports the SparkContext and SparkConf packages, and I have created an object called FirstApp; it is the main class of the project, and other classes can simply extend App rather than defining main. We declare a variable x pointing to a file in my project directory, abc1.txt, which has some content. What does the application do? We assign the file to a variable and then define and initialize our SparkContext; remember, when you work with an IDE there is no SparkContext or Spark session available implicitly, it has to be created. So we create a configuration object, set the application name, and set the master to local if we want to run in local mode, say on a Windows machine or on a Spark standalone cluster; if I were running on YARN, I would remove the setMaster property. Once the configuration object is defined, we use a SparkContext method that results in an RDD (that is what happens at line 13, val y). Then I do a flatMap transformation on y, which results in an RDD, followed by a map transformation, which again results in an RDD, and finally a reduceByKey, which does the aggregation. Once these steps are done, I can display the result on the screen, or use saveAsTextFile and point it to a particular location. Refer to the other sessions where I have explained how to set up the IDE on Windows with the required environment variables.

If I wanted to run this application on a cluster, I would give a proper path: here I will say the file is abc1.txt, save it, and let the output go to a default location, because I intend to run this application on a cluster, which would usually be a set of Linux machines with a Hadoop cluster. In that case I simply delete the setMaster part. Notice that the application compiles without any error messages, and that is because I have added all the Spark-related jars to the project's build path; you can take those jars from your Spark directory, or pull the dependencies in with Maven or sbt when packaging the application as a jar. So the code compiles, it points to a file, and it produces a word-count output. Since the code is already written and I have sbt installed on this machine, I want to package it as a jar and run it on a cluster.
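As a reference, here is a minimal sketch of a word-count application along the lines just described; the object name, file path, and output path are illustrative assumptions, not the exact code shown in the session.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstApp extends App {
  // Input file; on a cluster this would typically be a path in the HDFS user directory
  val x = "abc1.txt"

  // Keep setMaster("local[*]") for local runs; remove it when submitting to YARN
  val conf = new SparkConf()
    .setAppName("FirstApp")
    .setMaster("local[*]")

  val sc = new SparkContext(conf)

  val y = sc.textFile(x)                      // RDD of lines
  val counts = y
    .flatMap(line => line.split(" "))         // split each line into words
    .map(word => (word, 1))                   // pair every word with 1
    .reduceByKey(_ + _)                       // aggregate the counts per word

  counts.collect().foreach(println)           // or: counts.saveAsTextFile("wordcount-output")
  sc.stop()
}
```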
For that, let's look at the command prompt. I go into my workspace, into my Scala project, and inside the project folder you can see a build.sbt file, the binaries, and the source. You might not have the Spark-related directories that you see here; they exist in my case because I have been using Spark and have done some previous runs. To build my package I need the build.sbt file, so let's see what it contains: the name, the version of your package jar, the Scala version, the Spark version, and the repository that sbt refers to when it resolves the dependencies for the Spark components such as Spark Core, Spark SQL, MLlib, and so on (an example build.sbt is sketched at the end of this part). This build.sbt lives in the project folder, and if you intend to use sbt to package your code as a jar and run it on a cluster, you can even skip adding the Spark jars to the build path; that is only done to make sure your code compiles in the IDE. Once the code is written, sbt is installed, and the file path changes are made, I can go to the command line inside the project folder and run sbt package. This resolves all the dependencies declared for the project, creates a jar file, and places it in a particular location, and we can then use that same jar to run on a cluster. While sbt package is busy creating the jar, let me open the lab environment; I already have a lab set up here. You could have your own Spark standalone cluster where you run this jar, or Spark with Hadoop, which is what I have, and I will use it to submit the application on the YARN cluster. So let's go to the web console: I copy my link, launch the lab, log in with my password, and I am in. I can type spark2-shell, which is how this environment has been configured to work with Spark version 2.
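To give an idea of what such a build file looks like, here is a hedged example of a build.sbt for a project like this; the project name, Scala version, Spark version, and the exact list of dependencies are assumptions and should match whatever your cluster runs.

```scala
// build.sbt (sbt build definitions use Scala syntax)
name := "SparkApps"
version := "1.0"
scalaVersion := "2.11.12"

val sparkVersion = "2.4.3"   // align this with the Spark version on your cluster

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-sql"       % sparkVersion,
  "org.apache.spark" %% "spark-mllib"     % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
```

Running sbt package in the project folder then produces the application jar under the target/scala-2.11 directory, which is the jar we copy to the cluster.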
That is the interactive way of bringing up a Spark shell and running your code, but what we are interested in is running the application as a jar. Let's check whether sbt has finished packaging: yes, it has, and it created a jar file in this location. Now I need to get that jar from my machine onto the cluster, so from the web console I use the file-transfer (FTP) option, which lets me push my jars to the terminal where I connect to the cluster. I connect, check whether an already existing jar with the same name might cause a conflict, delete it if so, then choose Upload File, browse to Users > win10 > workspace > project > target > scala-2.11, pick the jar, and open it; that uploads the jar. Now let's quit the Spark shell, since we want to submit an application to the cluster. I check that my jar exists, and then run spark2-submit, pointing to the jar and passing --class with the package and class name, in this case main.scala.FirstApp. It is also worth checking that the file our code points to actually exists, so I run an hdfs dfs -ls on my default user directory, which is where the application will look for the file, and I do not see anything called abc there. So I go back to the file-transfer option, upload the abc1.txt file I showed you earlier, and then place it on the cluster with hdfs dfs -put into my user directory; that will be my input. Now I can do the spark-submit to run the application on the cluster. You can see it starts the application, contacts the ResourceManager, gets an application ID, and processes the file; once it finishes, the job completes and the temporary directory is deleted. After it completes, I can go to the Spark history server at its usual path, and it shows an application run today that did a word count. Click on the application: it ended with saveAsTextFile, and clicking further shows the DAG visualization, which says we started with textFile, did a flatMap, then a map, and then there was some shuffling because we wanted to do a reduceByKey. As I said, each RDD here has two partitions by default, and for aggregations or wider transformations like reduceByKey, sortByKey, groupByKey, or countByKey, data with the same key has to be shuffled into one partition; that is the shuffling we see here, and the UI also shows the number of tasks that ran per partition. So we ran a Spark application by packaging it as a jar with sbt, brought the jar onto the cluster, and submitted it there.
So that is how a Spark application runs on a cluster with spark-submit. As I said, your application has a driver program, and Spark applications run as independent sets of processes across the nodes of the cluster; you can follow an application's progress, or review it after completion, in the Spark UI. When you run on a cluster you can also choose where the driver runs. In our case we did a plain spark-submit with just the jar and the class name, but I could also pass --master and specify local mode, or yarn (which is the default here), or, for a Spark standalone cluster, something like spark://hostname:port. I could also specify how many executors I need, how much memory per executor, and how many cores per executor. I can set --deploy-mode client, which means the driver runs on the machine I am submitting from while the execution happens on the cluster nodes, or --deploy-mode cluster, which means the driver runs on one of the nodes of the cluster. You submit the application with whatever arguments you have given, and it runs. The application's driver holds the SparkSession or SparkContext, which takes your request to the ResourceManager. Once the application has completed, you can go back to the Spark history server or Spark UI: if I choose the application I just ran and go to Executors, it shows one entry for the driver, which ran on my client node, and executors on the other nodes of my cluster, each using one core and running two tasks for the partitions they worked on, with some shuffling involved because a word count uses reduceByKey. When you run your application, YARN, or whichever cluster manager you use, negotiates resources with the worker processes: the ResourceManager requests containers, the worker nodes approve them, and within those containers the executors run and take care of the tasks that process the data. A task is a unit of work on one partition of the data set: the RDDs that get created have partitions, and for every partition there is a task handled by an executor. The data is loaded into the RDD when an action is invoked, the executor works on its partition, the results are sent back to the driver, and the output can also be saved to disk. Whenever you run an application you can always go to the web console and scroll through the logs for more information.
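The options just mentioned are normally given on the spark-submit command line (--master, --num-executors, --executor-memory, --executor-cores, --deploy-mode). As a hedged illustration, the equivalent configuration keys can also be set in code; the values below are purely illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SubmitOptionsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SubmitOptionsSketch")
      .set("spark.master", "yarn")              // --master yarn
      .set("spark.submit.deployMode", "client") // --deploy-mode client (driver on the submitting machine)
      .set("spark.executor.instances", "2")     // --num-executors 2
      .set("spark.executor.memory", "1g")       // --executor-memory 1g
      .set("spark.executor.cores", "1")         // --executor-cores 1

    val sc = new SparkContext(conf)
    // ... application logic goes here ...
    sc.stop()
  }
}
```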
Here our application was run, and we can read the logs right from the beginning. After the spark-submit it says it is running Spark version 2.2, it notes that a driver will be started, it does some memory calculations, and it starts a Spark UI. We can see it requesting a new application from a cluster with four NodeManagers and verifying that our application has not requested more than the maximum memory capability of the cluster: the container sizing at the cluster level is 3 GB per container, and your application should not request a container bigger than that. It then starts an ApplicationMaster container with a specific amount of memory, and finally the execution starts. If we scroll further down the logs we can see where the driver runs, how much memory was utilized, the DAG scheduler taking care of executing your DAG, that is, the series of RDDs, and finally the result being generated and saved. So that is how you run a Spark application, how you can review your applications in the history server or Spark UI, and how Apache Spark works overall, using the cluster manager to get resources from the worker processes and running executors within them.

Next we will learn what Spark Streaming is, its data sources, its features, how it works, discretized streams, caching or persistence, and checkpointing in Spark Streaming, and then we will do a demo. So what is Spark Streaming and what is it capable of? It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from sources like Kafka, Flume, Kinesis, or TCP sockets, processed using complex algorithms expressed with functions such as map, reduce, join, and window, and the processed data can then be pushed out to file systems, databases, and live dashboards. Looking at the diagram, an input data stream goes into the Spark Streaming component, which breaks the streaming data down into small batches of input data; those batches are worked on by the Spark core engine, and you finally get batches of processed data. As for data sources, you could have streaming sources such as Kafka, Flume, or the Twitter API, data in formats such as Parquet, and static data sources such as MongoDB, HBase, MySQL, and PostgreSQL. So Spark Streaming receives the input data streams, divides the data into batches, and the Spark engine processes those batches to generate the final stream of results, again in batches. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. On top of this you can even use MLlib, Spark's machine learning component, to train models with live data and to apply already trained models.
You could also go for structured processing, that is, working with data that has a structure, using Spark SQL with its DataFrames and Datasets: you can process your data with DataFrames and query it interactively with SQL while Spark Streaming keeps working on the data that is flowing in. Finally, the processed data can be stored in a distributed file system such as HDFS or in any NoSQL or SQL database. It is good to know some of the features of Spark Streaming: it enables fast recovery from failures while working on streaming data, it gives better load balancing and resource usage, it lets you combine streaming data with static data sets and perform interactive queries, and it supports native integration with advanced processing libraries, which is one of the benefits for its users. Now let's look at how Spark Streaming works. As I mentioned, at one end you have your data streams; those streams are picked up by receivers, which we enable in our application, and the constantly generated data is broken down into smaller batches that Spark then processes, giving you the final results. Looking at the bigger picture, live input data streams are divided into batches of input data as RDDs; Spark Streaming expresses the computation over these streams as RDD transformations, and Spark batch jobs execute those transformations to give you the final processed result. We will look at concrete examples later. So what are these DStreams, or discretized streams? A DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input. You can think of a DStream as a series of RDDs: the data arriving between, say, time 0 and time 1 results in one set of transformations, the data from time 1 to time 2 in the next, and so on; that is how Spark Streaming works. In practice the streaming data comes in, a receiver monitors a particular socket on a particular port, you define a time interval, and the data arriving within that interval forms a small batch, a DStream, on which your processing is done. Within your application you have a series of steps, which are nothing but transformations, performed on the data of each time frame.
The result can then be stored, shown on the console, or pushed on for further processing, and this keeps happening at the regular intervals you have specified, for as long as the streaming application runs. The transformations you already know from Spark apply here as well. A map transformation, where you pass in a function, returns a new DStream by passing each element of the source DStream through that function. Similarly, with flatMap you pass in a function and each input item can be mapped to zero or more output elements. With filter you return a DStream containing only the records of the source DStream for which the function returns true, which is useful when you want to keep only part of the data that arrived in a particular interval. With union you return a new DStream containing the union of the elements of the source DStream and another DStream. You can also use transform, count, join, and so on; these are some of the transformations that can be performed on DStreams. There is also the concept of windowing, which is about processing the data over a series of time intervals. With windowed stream processing, Spark Streaming lets you apply transformations over a sliding window of data; this is called a windowed computation. The original DStream, your incoming data, is looked at in time intervals such as time 1, time 2, and a window is a series of these intervals over which you perform your RDD transformations: the window covering times 1, 2, and 3 produces one output, and another position of the sliding window, covering times 3, 4, and 5, produces the next. This is useful when you not only want to process the data at each interval but also want a consolidated computation over a series of intervals. Before we get to caching and persistence, a little more on windowing: every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed stream.
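As a small illustration of the basic DStream transformations listed above, here is a hedged sketch applying a few of them to a socket stream; the host, port, and batch interval are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTransformationsSketch extends App {
  val conf = new SparkConf().setAppName("DStreamTransformationsSketch").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batch interval

  val lines  = ssc.socketTextStream("127.0.0.1", 2222) // DStream of lines from a socket
  val words  = lines.flatMap(_.split(" "))             // flatMap: one line -> many words
  val longer = words.filter(_.length > 3)              // filter: keep only longer words
  val upper  = longer.map(_.toUpperCase)               // map: transform each element
  val merged = upper.union(words)                      // union with another DStream

  merged.print()                                        // print a few elements of every batch
  ssc.start()
  ssc.awaitTermination()
}
```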
In the specific case shown here, the operation is applied over the last three time units of data and slides by two time units. This shows that any window operation needs two parameters: the window length, which is the duration of the window (for example three time units, as in the figure), and the sliding interval, which is the interval at which the window operation is performed. Both parameters must be multiples of the batch interval of the source DStream (a short sketch of this follows at the end of this part). There are various other window-based operations that can be applied to DStreams. To say a little more about DStreams: as mentioned, a DStream is the basic abstraction provided by Spark Streaming, and it represents a continuous stream of data, either the input stream received from a source or the processed stream generated by transforming the input. A DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable distributed data set, and any operation applied on a DStream translates to operations on the underlying RDDs. For example, when converting a stream of lines to words, the flatMap operation is applied on each RDD of the lines DStream to generate the RDDs of the words DStream. Now, in the Spark Streaming architecture, receivers play a very important role. Input DStreams represent the stream of input data received from streaming sources, and a receiver is an object that receives the data from a source and stores it in Spark's memory for processing; that is its main job. Spark Streaming provides two categories of built-in streaming sources: basic sources, directly available through the StreamingContext API, such as file systems and socket connections, and advanced sources like Kafka, Flume, and Kinesis, which are available through extra utility classes. So the receiver watches the data that is constantly being generated and forwards it for processing by Spark Streaming. One more important aspect of Spark Streaming is caching and persistence. As you know from the Spark core engine, RDDs produced by transformations can be cached so they can be reused later in the application, improving performance, and similarly DStreams allow developers to persist a stream's data in memory.
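Here is a minimal sketch of the window length and sliding interval just described, using a windowed word count. The 30-second window and 20-second slide over a 10-second batch interval mirror the "last three time units, sliding by two" example; the host, port, and checkpoint path are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCountSketch extends App {
  val conf = new SparkConf().setAppName("WindowedWordCountSketch").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(10))          // batch interval: 10 seconds

  ssc.checkpoint("checkpoint-dir")                             // checkpoint directory, as recommended for window/stateful operations

  val lines = ssc.socketTextStream("127.0.0.1", 2222)
  val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

  // Window length 30s (last three batches), sliding interval 20s (every two batches);
  // both are multiples of the 10-second batch interval.
  val windowedCounts = pairs.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // associative, commutative reduce function
    Seconds(30),                 // window length
    Seconds(20)                  // sliding interval
  )

  windowedCounts.print()
  ssc.start()
  ssc.awaitTermination()
}
```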
That is done by calling the persist method on a DStream, which will persist every RDD of that DStream in memory. This is really useful if the data in the DStream is going to be computed over multiple times in your application. For window-based operations like reduceByWindow and reduceByKeyAndWindow, and for state-based operations like updateStateByKey, the generated DStreams are automatically persisted in memory without the developer calling persist. For input streams that receive data over the network, such as Kafka, Flume, or sockets, the default persistence level replicates the data to two nodes for fault tolerance. One thing to remember is that, unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory; we can discuss serialization and deserialization further separately. Another important aspect, which takes care of fault recovery, is the checkpointing mechanism in Spark Streaming. In a real scenario a streaming application must operate 24/7, so there has to be a mechanism that makes it resilient to failures unrelated to the application logic. Spark Streaming therefore needs to checkpoint enough information to a fault-tolerant underlying storage system that it can recover from failures; checkpointing is the process that makes streaming applications more fault tolerant, and it is typically used to recover from the failure of the node running the driver of the streaming application. Remember that a driver exists for every application, knows the flow of the application, and in a streaming application holds the Spark streaming context, the application's entry point. There are two kinds of checkpointing: metadata checkpointing and data checkpointing. Metadata includes the configuration that was used to create the streaming application, the set of DStream operations that define it, and incomplete batches, that is, batches whose jobs are queued but have not completed; metadata checkpointing saves this information to the underlying storage and is used to recover from a failure of the node running the streaming application's driver. Data checkpointing, on the other hand, is about saving the generated RDDs, the computed data itself, to reliable storage such as HDFS; that is used in stateful transformations that combine data across multiple batches.
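A common way to wire this up, sketched below under the assumption of an HDFS-style checkpoint path, is to create the StreamingContext through StreamingContext.getOrCreate, so that on restart the context, its DStream operations, and incomplete batches are rebuilt from the checkpoint data; the path and the body of the setup function are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  val checkpointDir = "hdfs:///user/spark/checkpoints/wordcount"   // fault-tolerant storage

  // Builds a fresh StreamingContext; only called when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointRecoverySketch")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)                                   // enable checkpointing

    val lines = ssc.socketTextStream("127.0.0.1", 2222)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On the first start this calls createContext(); after a driver failure it
    // recovers the streaming context from the checkpoint directory instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```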
With stateful transformations, the RDDs generated in each batch depend on the RDDs of previous batches, which causes the dependency chain to keep growing with time. To avoid an unbounded increase in recovery time, intermediate RDDs can be checkpointed periodically to reliable storage, which cuts off the growing dependency chain. To summarize: metadata checkpointing is primarily needed for recovery from driver failures, whereas data (RDD) checkpointing is necessary even for basic functioning when stateful transformations are used. So when should you enable checkpointing? It must be enabled for applications with certain requirements. If you use stateful transformations, where one batch's RDDs depend on the results of previous batches, such as updateStateByKey or reduceByKeyAndWindow, then a checkpoint directory must be provided to allow periodic RDD checkpointing. If you need to recover from failures of the driver running your application, metadata checkpointing should be used. Simple streaming applications without stateful transformations can run without enabling checkpointing, because one batch of RDDs does not depend on the RDDs of previous time frames; recovery from a driver failure will then be partial, since some received but unprocessed data may be lost, but that is often acceptable for such applications. The overall flow is: the application creates a streaming context, sets a checkpoint path, defines its DStream, which is nothing but a series of RDD transformations, and starts the streaming context, the application's entry point into the cluster for processing; if there is a failure at any point, you can recover using the checkpoint that was created. One more thing to remember is that checkpointing is enabled by pointing it at a fault-tolerant, reliable file system such as HDFS, where the checkpoint information is saved; we add a call in the application, the streaming context's checkpoint method with a checkpoint directory, and in that way the stateful transformations and metadata information are stored in whichever underlying storage we have chosen.
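For the stateful case just described, here is a hedged sketch of a running word count with updateStateByKey, which is exactly the kind of transformation that requires a checkpoint directory; the paths, host, and port are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCountSketch extends App {
  val conf = new SparkConf().setAppName("StatefulWordCountSketch").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(10))

  // Required for stateful transformations: intermediate state is checkpointed here
  ssc.checkpoint("hdfs:///user/spark/checkpoints/stateful-wordcount")

  // Merge the counts seen in this batch with the running total carried across batches
  def updateCount(newValues: Seq[Int], runningTotal: Option[Int]): Option[Int] =
    Some(newValues.sum + runningTotal.getOrElse(0))

  val pairs = ssc.socketTextStream("127.0.0.1", 2222)
    .flatMap(_.split(" "))
    .map(word => (word, 1))

  val totals = pairs.updateStateByKey(updateCount _)   // stateful: depends on previous batches
  totals.print()

  ssc.start()
  ssc.awaitTermination()
}
```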
Now that we have covered the basics of Spark Streaming, let's also understand shared variables, namely accumulators and broadcast variables. Normally, Spark operations such as map or reduce are executed on the nodes of the cluster and work on separate copies of all the variables used in the function: the variables are copied to each machine, and no updates made to them on the remote machines are propagated back to the driver program. General read-write shared variables across tasks would be inefficient, so Spark provides two limited types of shared variables for common usage patterns: broadcast variables and accumulators. Accumulators are variables that are only added to, through an associative and commutative operation. Spark natively supports accumulators of numeric types, and programmers can add support for new types. They can be used to implement counters, as in MapReduce, or sums, and you can create them named or unnamed. A named accumulator is displayed in the web UI for the stage that modifies it, and Spark shows the value of each accumulator modified by a task in the tasks table; tracking accumulators in the UI can be useful for understanding the progress of running stages, although this display is not yet supported for Python. You can create a numeric accumulator by calling the SparkContext methods longAccumulator or doubleAccumulator, to accumulate values of type Long or Double respectively. Tasks running on the cluster can add to it using the add method, but they cannot read its value; only the driver program can read the accumulator's value, using its value method.
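To make the accumulator API just described concrete, here is a minimal sketch; the accumulator name and the sample data are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorSketch").setMaster("local[*]"))

    // Named accumulator: shows up in the web UI for the stages that modify it
    val badRecords = sc.longAccumulator("badRecords")

    val lines = sc.parallelize(Seq("10", "20", "oops", "30"))
    val nums = lines.map { s =>
      if (s.nonEmpty && s.forall(_.isDigit)) s.toInt
      else { badRecords.add(1); 0 }               // tasks can only add to the accumulator
    }

    println(s"sum = ${nums.sum()}")               // the action triggers the computation
    println(s"bad records = ${badRecords.value}") // only the driver reads the value
    sc.stop()
  }
}
```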
The other type is broadcast variables, which let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with every task. Sometimes you perform costly operations such as joins over multiple RDDs, often pair RDDs of key-value pairs: a join involves shuffling on both sides, so that records with the same key land in one partition, before the data is loaded into memory on the nodes and the join itself runs; that can be an expensive affair. If one of the two RDDs is known to be small, you can create a broadcast variable from the smaller data set, so that it is shipped once to each machine and used there for the join with the other RDD already in memory on those nodes; that saves time and improves performance. Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost, which helps when you want to give every node a copy of a large input data set efficiently. Remember that Spark actions are executed through a set of stages, and stages are separated by shuffle operations: with narrow dependencies such as map, flatMap, and filter there is no shuffling involved, but operations like groupByKey, reduceByKey, and joins bring the same keys together and therefore involve shuffling. In such cases broadcast variables can be a real plus, because a data set that is already computed can simply be broadcast to the other nodes. The broadcast data is cached in serialized form and deserialized before each task runs, which means explicitly creating broadcast variables is useful when tasks across multiple stages need the same data, or when caching the data in serialized form is important. You create a broadcast variable using the SparkContext method called broadcast, and it can then be shipped to the other nodes and used in further operations (a small sketch follows at the end of this part). Spark Streaming itself shows up in many use cases: speech recognition, sentiment analysis, and applications that perform analytics on data as it arrives. It is also used widely by retail chain companies. Big retail chains want to build real-time dashboards so they can keep track of their inventory and operations, and for that they need the data that is constantly generated at the source to be processed and pushed into dashboards that give a real-time picture of what is happening. With an interactive inventory dashboard they can draw insights about the business: how many products are being purchased, how many have been shipped, and how many have been delivered to customers, all captured in real time. At one end, data is generated from sales, shipments, and delivery acknowledgements; a Spark Streaming application looks at that streaming data, performs the series of transformations you want at regular intervals, and pushes the results to dashboards or to a storage layer where they can answer such questions. Spark Streaming is an ideal choice for processing this kind of data in real time: an input stream carries the product status, purchased, shipped, or delivered, Spark Streaming and the Spark core engine process it, and the output stream gives the total counts of products purchased, shipped, and delivered. That was a quick and brief introduction to Spark Streaming and how it works; now let's see how we actually create a Spark Streaming application.
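Coming back to broadcast variables for a moment, here is a hedged sketch of the pattern described above, where a small lookup data set is broadcast and joined map-side against a larger RDD; the data and names are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastJoinSketch").setMaster("local[*]"))

    // Small data set: broadcast it once to every node instead of shuffling it for a join
    val productNames = Map(1 -> "laptop", 2 -> "phone", 3 -> "tablet")
    val namesBc = sc.broadcast(productNames)

    // Larger pair RDD: (productId, quantitySold)
    val sales = sc.parallelize(Seq((1, 10), (2, 5), (1, 7), (3, 2)))

    // Map-side "join": look each key up in the broadcast map, so the small side is never shuffled
    val withNames = sales.map { case (id, qty) =>
      (namesBc.value.getOrElse(id, "unknown"), qty)
    }

    withNames.reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}
```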
For that, we can set up Eclipse for Scala-based Spark applications by adding the Scala plug-in to Eclipse. If you want the plug-in, go to scala-ide.org, scroll towards the bottom, and click on the stable release; that page also shows how the Scala plug-in can be added to Eclipse and where to get the latest release. In my case it has already been added. You can also build your applications in IntelliJ if you prefer; just look for a video on setting up IntelliJ with the Scala plug-in. There are two ways to work here. One, you build the application and run it on your Windows machine in local mode, using a utility like netcat to send messages in a streaming fashion, with a receiver in your application listening on a particular socket and port. Two, you build the application, package it as a jar with a tool like sbt, and run it on a cluster, either a Spark standalone cluster or Spark with YARN, submitting it with spark-submit. I will show you both ways. First, open Eclipse and make sure the Scala plug-in is added; in my case the Scala symbol in the top right corner tells me I can use the Scala perspective. Create a project with New Scala Project, give it a name, for example myapps3, and click Finish. Within the project create a package, for example main.scala, and click Finish. One more thing: it is good to change the compiler from 2.12 to 2.11, because in most environments Spark is built against Scala 2.11. Select the project, right-click, go to the build path, choose Configure Build Path, click on Scala Compiler, use project settings, and choose the 2.11 bundle; it may warn that the compiler settings have changed and a full rebuild is required, just hit OK. The second thing is that, for your code to compile, it is good to add the Spark jars to the build path; alternatively, you can write the application, let it stay uncompiled in the IDE, and package it with sbt to run it as a jar. On my machine I already have the Spark distribution downloaded: I downloaded Spark 2.4.3, untarred it, and kept it in my C drive.
Inside that directory there is a jars folder containing all the Spark-related jars, so technically I could even use the Windows command line and work with Spark interactively, or package my application and run it in local mode. Also notice that on my desktop I have a hadoop folder with a bin folder inside, containing winutils.exe; this is required when you want to try Spark applications, whether streaming or DataFrames, on a Windows machine. You can search for and download winutils.exe and place it in a hadoop\bin folder on your machine. Once you have winutils.exe and the jars, right-click your project, go to Build Path, choose Configure Build Path, click Add External JARs, select all the Spark jars, click Open, and then Apply and OK. So now I have a project with a package, the compiler changed to 2.11, and the external Spark jars added, which is enough for my code to compile and be tested on the Windows machine itself. What we need now is a streaming application, so let me show you how it looks in my existing project: within the source folder I have the same main package and a streaming application. This application captures the data generated on a particular stream, at a particular IP and port, performs a series of transformations on it, a word count, and prints the results; I could also save the output to a particular location. For this application we need to import certain packages: Spark Streaming, SparkContext, and SparkConf. Here I have created an object called FifthApp and it says extends App; that is because my project already has an application with a main method defined, and once one application defines main, new objects can simply extend App without defining main again. Then I create the configuration object with new SparkConf, set the application name, and, to test on Windows, set the master to local; it is advisable to give it more than one thread, because the receiver you create in the application will occupy one thread, so I say setMaster local with two threads. I also create the SparkContext, initialized from the configuration object we just created, and then a Spark streaming context, which depends on the SparkContext, with a time interval of 10 seconds; that is the batch interval at which the data arriving on the socket will be processed. Next we set up a receiver: I create streamRdd using the streaming context's socketTextStream method. The streaming context has various methods for this.
If I type a dot after the streaming context, it shows the different options: fileStream, queueStream, socketStream, receiverStream, and so on. I am using socketTextStream, pointing it at this machine, 127.0.0.1, on port 2222. So now I have my configuration object, my SparkContext, my streaming context with its time interval, and a receiver using socketTextStream on that IP and port. Next I specify what I want to do with the data that arrives: I create val wordCounts working on the streamRdd DStream, doing a flatMap to split the data on spaces, a map of every word to (word, 1), and then a reduceByKey, where you can pass in the specific function you want. I could also do a count, print the result, or save the output, which gets written with a generated suffix attached to the output path. After that I call start on the Spark streaming context, which triggers it, and it runs until we terminate the application. To run this on a Windows machine, look at the run configuration: in my environment I have added HADOOP_HOME pointing to the hadoop directory that contains winutils.exe, and SPARK_LOCAL_IP set to 127.0.0.1. Now let's run it as a Scala application: the receiver tries to connect to 127.0.0.1 on that port, but it does not find anything listening there, so it cannot establish a connection. So I go to my command line, into Downloads where I have already downloaded the netcat utility for Windows, and run nc.exe -lvp 2222, using the port I specified in the streaming application. It says it is listening on 2222, then shows a connection from 127.0.0.1, and in the background my receiver is now able to establish a connection with the netcat utility. Whatever I type here is taken for processing every 10 seconds, and we see a word count while the application runs. Let's test it: I type "this is a test test is being done", and as soon as I send the message we see a word count for the stream of data coming in. I type "winters are coming winters will be cold", and the application keeps working on the messages.
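Putting the pieces of this walkthrough together, here is a hedged reconstruction of the streaming word-count application; the object name, IP, port, and batch interval follow the description above, but this is a sketch, not the exact code from the session.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FifthApp extends App {
  // Local mode with two threads: one thread is taken by the socket receiver
  val conf = new SparkConf().setAppName("FifthApp").setMaster("local[2]")
  val sc   = new SparkContext(conf)

  // 10-second batch interval
  val ssc = new StreamingContext(sc, Seconds(10))

  // Receiver: listen to the netcat utility on this machine, port 2222
  val streamRdd = ssc.socketTextStream("127.0.0.1", 2222)

  val wordCounts = streamRdd
    .flatMap(_.split(" "))       // split each incoming line into words
    .map(word => (word, 1))      // pair every word with 1
    .reduceByKey(_ + _)          // count per word within the batch

  wordCounts.print()             // or: wordCounts.saveAsTextFiles("wordcount-output")

  ssc.start()                    // start the streaming computation
  ssc.awaitTermination()         // keep running until the application is terminated
}
```

To try it locally, you would start netcat first (nc.exe -lvp 2222 on Windows, or nc -l 2222 on Linux, as in the demo) and then launch the application, so the receiver has something to connect to.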
So we can type "this is a test of winters" and see that it continues processing and shows the result. My streaming application is running fine: every 10 seconds the socket text stream looks at this machine on this particular port, runs the series of transformations, and the results appear on the console. If I had used wordCounts.saveAsTextFiles, the output would be generated and saved at every 10-second interval, and I could also store the output on HDFS or some other storage. So this is a simple streaming application using socketTextStream: it watches this machine on the port where the netcat utility is running, performs the RDD transformations flatMap, map, and reduceByKey, and finally invokes an action such as print, which triggers those RDDs to work on the streaming data. That is the first way of running it. Now, I already have sbt for Windows installed, so I can go into my workspace, into my apps project folder, and run sbt package. This works off the build file that is already in the project folder: it has the name, the version, the Scala version, the Spark version, and the repository used to manage the dependencies for the Spark components such as Core, SQL, ML, Streaming, and Hive. You need this build.sbt file in your project folder to package the application as a jar. The packaging has finished, and it shows that a jar of my application was created inside this particular folder. Once I have the jar, I can bring it onto my cluster and run it there with spark-submit. To run the streaming application on a cluster, I have a two-node setup here that hosts my Spark standalone cluster; you can look at the previous videos where I explain how to set one up. I go into the Spark directory and run sbin/start-all.sh, which starts the Spark master and the worker nodes. I do have a Hadoop cluster set up on these machines as well, but right now I am using the Spark standalone cluster, which shows a master and a worker process on this machine and a worker process on the other. We can bring up the UI on port 8080 to have a look: that is my Spark standalone cluster with two worker nodes, and no application running at the moment. Back on the terminal, I check that my jar is there; I have already placed it. At any point you can run jar -xvf on the jar to see what is inside, which I have already done on this machine: in the main folder there is scala, and within it you can see the different classes, including the FifthApp class that we want to run here.
To run my application on the cluster I can use spark-submit: I pass --class with the package and object name, which here is main.scala.FifthApp, point it at my jar, and give --master as the Spark master URL, which is the machine running the master, listening on port 7077. That starts my streaming application, but just as on Windows we also need a utility like netcat so that the receiver has something to connect to and read messages from. On this cluster node I can install or reuse the netcat utility, for example via apt-get if it is not already present, and then run nc -l with port 2222 and check that it works by typing "this is a test". Now let us start the streaming application, which needs its receiver to establish a connection to that machine and port; I can also cancel netcat and restart it with the machine's address given explicitly, say 192.168.0.18. The one thing to remember is that when we packaged the application for the cluster, we should have commented out the local master setting and pointed the receiver at the actual IP where netcat is running rather than 127.0.0.1, which is exactly what I did before packaging. So let us test it: I type "this is a test", "new test for our application", and as I send in messages it is already doing the word count as we expected. While it runs, if you come back to the UI you will see the application running on the cluster: it shows that it is my application, that it is using four cores and a certain amount of memory per executor, and you can click on the application. The application detail UI shows what is being done: for our application there is a streaming job running a receiver, it shows as running, and clicking on it gives the DAG visualization for the streaming job. You can look at the stages to check whether there are multiple stages, which we do not have here because we are not doing any wider transformation; you can look at Storage if you have used persistence or caching, and at the executors being used. One of the important things for streaming is the Streaming tab, which is now activated; it does not show up for batch applications, but for streaming applications the UI shows this tab, which says it has been running batches of 10 seconds for 2 minutes 9 seconds so far. We can see the input rate, the receivers, whether there is any scheduling delay, the processing time, how many batches have completed, and how many tasks were run. So my application is running fine, and unless we cancel it, it will keep running and keep looking for the messages we type in. For example, I can keep passing in messages such as "this is say test one" and "test one is this where we test streaming application", and I can see that my messages are being processed and I am getting a word count.
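As a reminder of what that repackaging step changes, here is a minimal sketch of how the driver code might look when built for the cluster rather than for local mode; the host and port mirror the ones used in this demo, and FifthApp is simply the object we packaged, so treat it as an illustration rather than the exact listing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FifthApp extends App {
  // no setMaster here: the master URL is supplied by spark-submit --master spark://<master-host>:7077
  val conf = new SparkConf().setAppName("FifthApp")
  // .setMaster("local[2]")                        // commented out when packaging for the cluster
  val ssc = new StreamingContext(conf, Seconds(10))

  // connect the receiver to the machine running netcat, not to 127.0.0.1
  val lines = ssc.socketTextStream("192.168.0.18", 2222)
  val wordCounts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
  wordCounts.print()

  ssc.start()
  ssc.awaitTermination()
}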
We can type in words like winter, winter, summer, old, winter and check that the streaming application keeps doing the word count. So this is a simple way to run your streaming application, either on a Windows machine in local mode, on a Spark standalone cluster, or deployed on a YARN-based cluster. Now that we have seen a streaming application doing a word count, it is interesting to also look at the window-based computation, or windowing operation, which Spark Streaming supports, as I explained earlier. For that we can look at a different application, streaming with a window option, and walk through what it does. Here I import some packages: the streaming package with StreamingContext, SparkContext and SparkConf, the Spark Java API function classes, the Spark Streaming API, and StorageLevel, because I intend to persist or cache the received data. We create an application, say StreamingApp extends App, create our configuration object where we set the app name and set master with an appropriate number of threads, and initialize the Spark streaming context from that configuration with a batch interval of 10 seconds. We then set up a receiver: the stream is created with socketTextStream, like earlier, looking at this particular machine and port, but this time I also pass a storage level, because I want the data stream generated every 10 seconds to be cached in memory. You can use different storage levels: memory only, disk only, memory and disk, or memory and disk with a replication factor. Then we do a series of transformations like before, but where the previous example had reduceByKey, here I use reduceByKeyAndWindow. As I said, windowing takes an associative and commutative function, so instead of the shorthand we specify the function explicitly: it takes a and b and applies the function to them. We also give the window length, 30 seconds, and the slide interval within it that we want to look at. So these are the two main aspects: using the window-based transformation, and specifying the time intervals for the windowing. That means we still do a word count every 10 seconds for the data coming in, but we also get a consolidated computation every 30 seconds, covering the last three batch intervals. If you also notice, I am doing checkpointing here: I have not specified a particular directory, but I am saying that I want to checkpoint, that is save my computations, for fault recovery. So this is a streaming application with window-based computation, using a storage level for persistence and checkpointing for fault recovery. To run it, our environment variables are already set, so I can start this application directly from the IDE.
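Here is a minimal sketch of that windowed application, under the same assumptions as before (localhost and port 2222 for the netcat utility, a 10-second batch interval, a 30-second window sliding every 10 seconds, and a relative checkpoint directory); it shows the shape of the code rather than the exact listing used in the demo:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWindowApp extends App {
  val conf = new SparkConf().setAppName("StreamingWindowApp").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batch interval
  ssc.checkpoint("checkpoint")                          // checkpointing for fault recovery

  // receiver with an explicit storage level so each 10-second batch is cached in memory
  val lines = ssc.socketTextStream("localhost", 2222, StorageLevel.MEMORY_ONLY)

  val wordCounts = lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    // associative and commutative function, 30-second window, sliding every 10 seconds
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

  wordCounts.print()

  ssc.start()
  ssc.awaitTermination()
}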
Running it as a Scala application starts my streaming application with window-based computations. We can now open a command prompt to start a netcat utility: I go to my downloads folder, minimize this window, and start netcat with nc.exe -lvp and the port 2222. As soon as I do this, the receiver in my application, running in the background, is able to establish a connection to this netcat utility on that port. Now let us see whether the word count happens for each 10-second interval. I type "this is my first test", and it gives me the word count for the data I passed in; then "this is my second test", "this is my third test", and you can see it keeps counting. If I now type "this is my first test" again, then because we are also doing a window-based operation, we see a consolidated set of counts for a window of three intervals. Notice that it gives me a count, but it does not show the total for "this" as 4, because it is only looking at a window of the last two or three batch intervals, and eventually it drops back because it only considers the most recent window. So if I type "this is my test", "test is done", "test is done", it shows the word count for whatever was passed in, for example "done" appearing twice and "test" three times, but since this is a windowed operation it only counts the messages that fell within that particular window. Since the application is running on Windows we can also bring up the Spark UI: I go to http:// followed by my machine and the port, and it shows my streaming application with the window-based job running. What is interesting is to look at the stages to see whether any shuffling is happening, and at Storage, which shows the persistence we are doing: it shows we have been persisting in memory, what size it occupies and how many partitions were cached, and that is because of the storage level we chose, memory only; if we change it to disk, the RDDs will be cached on disk instead. The Streaming tab then gives more information about the batches. This is a very simple example; you can obviously change the RDD computations you do here, whether a word count or something else, and decide whether to save the processed output. If you refresh the project you can see entries for folders being created, you can decide whether you want output written there and inspect what they contain, and since we set checkpointing in the project folder, it also temporarily checkpoints the information needed to recover from any failure. So this is a classic example of a streaming application using window-based computations, persisting the computations, and doing checkpointing at the same time. In this tutorial we will now go through a list of questions and explanations so that you can be well prepared for your Hadoop interviews. Let us look at some general Hadoop questions first: what are the different vendor-specific distributions of Hadoop?
All of you might be aware that Apache Hadoop is the core distribution of Hadoop, and different vendors in the market have packaged Apache Hadoop into cluster management solutions that let you easily deploy, manage, monitor and upgrade your clusters. Some vendor-specific distributions are Cloudera, the dominant one in the market, and Hortonworks; you might be aware that Cloudera and Hortonworks have merged, so it has become a bigger entity. You also have MapR, Microsoft Azure, IBM InfoSphere and Amazon Web Services. These are the popularly known vendor-specific distributions. If you want to know more, search Google for the Hadoop distributions and commercial support wiki page; it notes that only products coming from apache.org, the open source community, can be called a release of Apache Hadoop, and it lists the various vendor-specific distributions, which in one way or another run Apache Hadoop but package it as a solution, like an installer, so that you can easily set up clusters on a set of machines. Have a look at that page and read about the different distributions. Coming back, let us look at the next question: what are the different Hadoop configuration files? Whether you are talking about Apache Hadoop, Cloudera, Hortonworks, MapR or any other distribution, these config files are the most important ones and exist in every distribution. You have hadoop-env.sh, which holds environment variables such as your Java path, the process ID path, where your logs are stored, what kind of metrics are collected and so on. Your core-site.xml has the HDFS path; it can hold many other properties, such as enabling trash, enabling high availability, or pointing to ZooKeeper, but it is one of the most important files. You have hdfs-site.xml, which holds information related to your HDFS layer, such as the replication factor, where the name node stores its metadata on disk, where a data node stores its data, and where a secondary name node, if one is running, stores its copy of the name node's metadata. Your mapred-site.xml has properties related to MapReduce processing. You also have masters and slaves files, which may be deprecated in a vendor-specific distribution, and you have yarn-site.xml, based on the YARN processing framework introduced in Hadoop version 2, which holds the resource allocation, resource manager and node manager related properties. If you want to look at the default properties for any of these, take hdfs-site.xml as an example: I can go to Google and type one of its properties, say dfs.namenode.name.dir, and since this property belongs to hdfs-site.xml, the first link takes me to hdfs-default.xml, which lists all the properties that can be given in hdfs-site.xml. It also shows you which version of the documentation you are looking at, and you can always change the version.
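These same XML files are what Hadoop's client libraries load at runtime, so you can also inspect the effective values programmatically. Below is a minimal Scala sketch using the standard Hadoop Configuration API; the /etc/hadoop/conf path is just an assumed location for the config directory, and the properties queried are the ones mentioned above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object ShowHadoopConfig extends App {
  // Configuration() loads core-default.xml and core-site.xml from the classpath by default
  val conf = new Configuration()
  // explicitly add site files if they are not already on the classpath (assumed path)
  conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))
  conf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"))

  println("fs.defaultFS          = " + conf.get("fs.defaultFS"))
  println("dfs.replication       = " + conf.get("dfs.replication"))
  println("dfs.namenode.name.dir = " + conf.get("dfs.namenode.name.dir"))
  println("mapreduce.job.reduces = " + conf.get("mapreduce.job.reduces"))
}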
For example, if I want to look at version 2.6.5, I just change the version in the URL and it shows me the properties for that release. Similarly, you can search for a property that belongs to core-site.xml, for example fs.defaultFS, and somewhere in the results you will see core-default.xml, which lists all of those properties. In the same way you can search for properties related to yarn-site.xml or mapred-site.xml: I could type one of the yarn.resourcemanager properties, which takes me straight to yarn-default.xml with all the properties that can be set for YARN, or mapreduce.job.reduces, which belongs to mapred-site.xml and takes me to its default xml. So these are the important config files, and no matter which distribution of Hadoop you work on, whether as a Hadoop admin or a Hadoop developer, knowing these config properties is very important and showcases your knowledge of the configs that drive a Hadoop cluster. Let us look at the next question: what are the three modes in which Hadoop can run? You can run Hadoop in standalone mode, the default mode, which uses the local file system and a single Java process; standalone mode is like downloading the Hadoop package onto a single machine without any daemons running, just to test Hadoop functionality. You can have pseudo-distributed mode, which is a single-node Hadoop deployment: Hadoop as a framework has many services, each with one or more processes, and in pseudo-distributed mode all the important processes belonging to one or multiple services run on a single node. If you want to work in pseudo-distributed mode using Cloudera, you can search Google for the Cloudera QuickStart VM, download it, follow the instructions, and you will have a single-node Cloudera cluster running on your virtual machine; for more information you can refer to the YouTube tutorial where I explain how to set up the QuickStart VM. Finally, you can have a production setup, or fully distributed mode, where the Hadoop framework and its components are spread across multiple machines: you would have services such as HDFS, YARN, Flume, Sqoop, Kafka, HBase, Hive and Impala, and for each of these services one or more processes distributed across multiple nodes. This is what is normally used in production environments. So standalone is good for testing, pseudo-distributed is good for testing and development, and fully distributed is mainly for production. Now, what are the differences between a regular file system and HDFS? By a regular file system we mean something like a Linux file system or a Windows-based file system. In a regular file system the data is maintained on a single system.
That single system holds all your files and directories, so it offers low fault tolerance: if the machine crashes, data recovery is very difficult unless you have a backup, and it also affects processing, because if the machine fails your processing is blocked. The biggest challenge with a regular file system, though, is the seek time, the time taken to read the data: you might have a single machine with a huge amount of disk and RAM, but the time taken to read all the data stored on one machine is very high, and you get the least fault tolerance. With HDFS, which stands for Hadoop Distributed File System, your data is distributed and maintained on multiple systems, never just one machine. It also supports reliability: a file stored in HDFS is, depending on its size, split into blocks, and those blocks are spread across multiple nodes; moreover, every block stored on a node has replicas stored on other nodes, with the replication factor depending on configuration. This makes HDFS more reliable when slave nodes or data nodes crash, because you will rarely lose data thanks to the auto-replication feature. The time taken to read a particular piece of data can be comparatively higher, since the data is distributed across nodes and even a parallel read needs coordination from multiple machines; however, when you are working with huge volumes of data it is still beneficial compared with reading everything from a single machine. So you should always think of HDFS in terms of its reliability through auto-replication, its fault tolerance from data being stored across multiple machines, and its ability to scale: with HDFS we are talking about horizontal scalability, or scaling out, whereas with a regular file system we are talking about vertical scalability, or scaling up. Now let us look at some HDFS-specific questions: why is HDFS fault tolerant? As I just explained, HDFS is fault tolerant because it replicates data on different data nodes. You have a master node and multiple slave nodes, or data nodes, where the data is actually stored, and the default block size is 128 MB, which is the minimum since Hadoop version 2.
Any file up to 128 MB uses one logical block, and if the file is bigger than 128 MB it is split into blocks which are stored across multiple machines. Because these blocks sit on multiple machines, the system is more fault tolerant: even if a machine fails, a copy of each of its blocks still exists on some other machine. There are two aspects here. The first rule of replication is that you will never have two identical blocks sitting on the same machine. The second rule is about rack awareness: if your machines are placed in racks, as in the image on the right, you will never have all the replicas of a block placed on the same rack, even if they are on different machines; to stay fault tolerant and maintain redundancy, at least one replica is placed on a node in a different rack. That is how HDFS is fault tolerant. Now let us understand the architecture of HDFS. As I mentioned, the main service in a Hadoop cluster is HDFS, and for the HDFS service you have a name node, the master process running on one of the machines, and data nodes, the slave processes running across multiple machines. Each of these processes has an important role. Whatever data is written to HDFS is split into blocks depending on its size, and the blocks are distributed across nodes; with the auto-replication feature these blocks are also replicated across machines, under the condition that no two identical blocks sit on the same machine. As soon as the cluster comes up, the data nodes that are part of the cluster, based on the config files, start sending a heartbeat to the name node every three seconds. The name node stores this information in its RAM, building up metadata which initially just records which data nodes are available. Once writing starts and blocks are distributed across data nodes, the data nodes also periodically send a block report to the name node, so the name node keeps adding to the metadata in its RAM, which now holds information about which files exist, which blocks each file is split into, which machines hold those blocks, and the file permissions. While the name node maintains this metadata in RAM, it also maintains metadata on disk, shown in the red box, which records whatever has been written to HDFS. To summarize: the name node has metadata in RAM and metadata on disk, the data nodes are the machines where the blocks are actually stored, auto-replication is always in effect unless you have disabled it, and reads and writes are parallel activities while replication is a sequential activity. As for the name node, the master process hosting metadata on disk and in RAM: the metadata on disk consists of an edit log, which is the transaction log, and the fsimage, which is the file system image.
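To see this block layout from a client's perspective, you can ask the HDFS client API where the blocks of a file live. The following is a small Scala sketch using the standard FileSystem API; the path /data/sample.txt is only a hypothetical example file:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ShowBlockLocations extends App {
  val conf = new Configuration()                 // picks up core-site.xml / hdfs-site.xml from the classpath
  val fs   = FileSystem.get(conf)

  val file   = new Path("/data/sample.txt")      // hypothetical file on HDFS
  val status = fs.getFileStatus(file)
  val blocks = fs.getFileBlockLocations(status, 0, status.getLen)

  // one line per block: its offset, its length, and the data nodes holding its replicas
  blocks.foreach { b =>
    println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(", ")}")
  }
}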
This metadata on disk has existed from the time the cluster was first started, and it gets appended every time a read, write or other operation happens on HDFS. The metadata in RAM, on the other hand, is built dynamically every time the cluster comes up, which means that when the cluster starts, the name node spends its first few seconds or minutes in safe mode, busy registering the information coming from the data nodes. The name node is one of the most critical processes: if the name node is down, even with every other process running, you cannot access the cluster. The name node's metadata on disk is essential for the name node to come up and maintain the cluster, and its metadata in RAM is what serves all the client requests. The data nodes, as I mentioned, hold the actual data blocks and keep sending block reports, so the metadata in the name node's RAM is constantly updated, and the metadata on disk is updated with every write activity happening on the cluster. A data node storing a block also serves any read activity a client requests: whenever a client, application or API wants to read data, it first talks to the name node, the name node looks into its metadata in RAM and tells the client which machines hold that data, and the client then reads the data from HDFS, which really means reading it from the data nodes. That is how read and write requests are satisfied. Now, what are the two types of metadata the name node server holds? As mentioned, metadata on disk, which is very important to remember, consists of the edit log and the fsimage; metadata in RAM holds information about the data nodes, the files, how files are split into blocks, which data nodes those blocks reside on, and the file permissions. I will share a very good link on this: if you search for "HDFS metadata directories explained", you will find a Hortonworks article that describes the on-disk metadata the name node manages in detail, so have a look at it if you want to learn more. Coming back, let us look at the next question: what is the difference between federation and high availability? These are features that were introduced in Hadoop version 2.
Both features are about horizontal scalability of the name node. Prior to version 2, you could only have one master, which means the cluster could become unavailable if the name node crashed. Hadoop version 2 introduced two new features, federation and high availability, with high availability being the more popular one. Federation means any number of name nodes: there is no limit, and the name nodes sit in a federated cluster, which means they belong to the same cluster but do not coordinate with each other. When a write request comes in, one of the name nodes picks it up and directs the blocks to be written to data nodes, but it does not have to check with the other name nodes whether the block ID it assigned clashes with one assigned elsewhere. All the name nodes belong to the federated cluster and are linked by a cluster ID, so when an application or API talks to the cluster it always goes via the cluster ID, and one of the name nodes picks up the read, write or processing activity. The name nodes share a pool of metadata in which each name node has its own dedicated pool, which we remember by the terms namespace or name service. This also provides high fault tolerance: if one name node goes down, it does not make the cluster unavailable, because the other name nodes are still running and reachable. As for heartbeats, every data node sends its heartbeat to all the name nodes, and all the name nodes are aware of all the data nodes. High availability, on the other hand, is where you have exactly two name nodes: an active and a standby. Normally you will see a high availability setup with ZooKeeper, which is a centralized coordination service; the election of the name node to be made active, and the automatic failover, are handled by ZooKeeper. High availability can be set up without ZooKeeper, but that means an admin's intervention is required to promote a standby name node to active and to handle failover. At any point in time the active name node records the edits for whatever updates happen on HDFS and writes those edits to a shared location; the standby name node constantly picks up the latest edits and applies them to its own metadata, which is a copy of what the active name node has. In this way the standby is always in sync with the active, and if for any reason the active name node fails, the standby takes over and becomes active. Remember that ZooKeeper plays a very important role here as the centralized coordination service. One more thing to remember is that in a high availability setup a secondary name node is not allowed: you have an active name node and a standby name node configured on a separate machine, and both have access to a shared location, which can be NFS or a quorum of journal nodes. For more information, refer
to the tutorial where I have explained HDFS high availability and federation. Now let us look at a logical question: if you have an input file of 350 MB, which is obviously bigger than 128 MB, how many input splits are created by HDFS and what is the size of each? For this, remember that by default the minimum block size is 128 MB, and it is customizable: if your environment mostly writes large files, you should go for a bigger block size, while if it writes many smaller files, 128 MB may be fine. Remember that in Hadoop every entity, that is every directory on HDFS, every file, and every block of a file, is considered an object, and for each object around 150 bytes of the name node's RAM are used. So if your block size is very small, you end up with many more blocks, which directly affects the name node's RAM; if you keep the block size very high, you reduce the number of blocks, but that may hurt processing, because processing depends on splits and more splits mean more parallelism. Setting the block size therefore has to balance your parallelism requirement against the name node RAM available. Coming to the question: a 350 MB file is split into three blocks; two blocks hold 128 MB of data each, and the third block, although its block size is still 128 MB, holds only 94 MB of data. That is how this particular file is split. Now let us understand rack awareness: how does it work, and why do we even have racks? Organizations want to place their nodes and machines in a systematic way, and there can be different approaches: one rack could hold the machines running the master processes, with the intention that this rack gets higher bandwidth, more cooling, a dedicated power supply, a top-of-rack switch and so on; or you could have one master process running on one machine of every rack, with the other slave processes spread around. With rack awareness, the thing to understand is that if your machines are placed in racks, and we know Hadoop does auto-replication, then the rule of replication in a rack-aware cluster is that you never place all the replicas of a block on the same rack. Looking at the diagram, if block A is shown in blue, you will never have all three blue boxes in the same rack, even if they are on different nodes, because that would make the cluster less fault tolerant; at least one copy of the block is stored on a node in a different rack. Somebody could ask: can I have my block and its replicas spread across three racks? Yes, you can, but in making it more redundant you increase the bandwidth requirement, so the better approach is two replicas on different machines in the same rack and one copy on a different rack. Now let us proceed: how can you restart the name node and all the daemons in Hadoop? If you are working on an Apache Hadoop cluster, you can start and stop using the Hadoop daemon scripts; these scripts are used to start and stop Hadoop, and this applies when you work with plain Apache Hadoop.
Let me show you a particular file with more information on this; it describes the different clusters, and if we look at the start and stop section it gives a good summary. For Apache Hadoop the setup is: download the Hadoop tar file, untar it, edit the config files, do the formatting, and then start your cluster, and here I have noted doing it with scripts. So in Apache Hadoop you can use a start-all script, which internally triggers start-dfs and start-yarn; start-dfs internally runs the hadoop-daemon script multiple times, based on your configs, to start the different HDFS processes, and start-yarn runs the yarn-daemon script to start the processing-related processes. That is how it happens in Apache Hadoop. In Cloudera or Hortonworks, which are vendor-specific distributions, you have multiple services with one or more daemons running across the machines; say you have machine 1, machine 2 and machine 3 with the processes spread across them. These are cluster management solutions, so you are never involved in running a script individually to start and stop processes. In Cloudera you have a Cloudera SCM server running on one of the machines and Cloudera SCM agents running on every machine; in Hortonworks you have an Ambari server and Ambari agents. The agents running on every machine are responsible for monitoring the processes and sending their heartbeat to the master, that is the server, and the server is the service that gives instructions to the agents. So in a vendor-specific distribution, starting and stopping processes is taken care of automatically by these underlying services, which internally still run the same commands; only in Apache Hadoop do you have to follow these steps manually. Coming back, we can look at some command-related questions: which command helps you find the status of blocks and file system health? You can use the file system check command, fsck. It can show the files for a particular HDFS path, show the blocks, and give you status information such as under-replicated blocks, over-replicated blocks, mis-replicated blocks, the default replication and so on. The fsck utility does not repair anything if there is a problem with the blocks, but it gives you information about the blocks belonging to your files, which machines they are stored on, whether they are replicated as per the replication factor, and whether there is a problem with any particular replica. Now, what happens if you store too many small files in a cluster? This relates to the block information I gave a little while ago. Remember that Hadoop is coded in Java, and every directory, every file and every file-related block is considered an object; for every object in your Hadoop cluster, some of the name node's RAM is used. So the more blocks you have, the more name node RAM is used, and storing too many small files does not hurt your disks, it directly hurts the name node's RAM. That is why in production clusters the admin or infrastructure
specialists make sure that everyone writing data to HDFS follows a quota system, so that the amount of data you write, the number of files, and individual writes to HDFS are kept under control. Now, how do you copy data from the local system onto HDFS? You can use the put command or copyFromLocal, giving your local path as the source and your HDFS path as the destination. Remember you can always use copyFromLocal with the -f flag, which lets you overwrite a file that already exists on HDFS, so with -f you can rewrite the data that is already there; copyFromLocal and -put do the same thing, and you can also pass arguments while copying to control the replication or other aspects of your file. Next, when do you use dfsadmin -refreshNodes and rmadmin -refreshNodes? As the commands say, these refresh the node information, and they are mainly used when commissioning or decommissioning of nodes is done: when a node is added to the cluster or removed from it, you are informing the Hadoop master that this particular node will or will not be used for storage and processing. Internally, commissioning and decommissioning work through include and exclude files, which are updated with entries for the machines being added to or removed from the cluster; since the cluster keeps running while this is being done, you do not have to restart the master processes, you just run these refresh commands to complete the commissioning or decommissioning activity. Now, is there any way to change the replication of files on HDFS after they have already been written? Of course, yes. If you want to set the replication factor at the cluster level, and you have admin access, you can edit the hdfs-site.xml file and that takes care of the replication factor being set cluster-wide. If you want to change the replication after the data has been written, you can use the setrep command, which changes the replication of existing data. You can also write the data with a different replication in the first place by passing -D dfs.replication with your replication factor while writing to the cluster. So in Hadoop you can let your data be replicated as per the property set in the config file, write the data with a different replication, or change the replication after the data has been written; all these options are available, and a programmatic equivalent is sketched below. Now, who takes care of replication consistency in a Hadoop cluster, and what do we mean by under- and over-replicated blocks? As I mentioned, the fsck command can give you information about over- or under-replicated blocks, and in a cluster it is always the name node that takes care of replication consistency.
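Here is a minimal Scala sketch of that programmatic equivalent, using the standard HDFS FileSystem API; the local and HDFS paths and the replication factor of 2 are purely illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CopyAndSetReplication extends App {
  val fs = FileSystem.get(new Configuration())

  val local = new Path("/tmp/sample.txt")          // hypothetical local file
  val hdfs  = new Path("/user/data/sample.txt")    // hypothetical HDFS destination

  // equivalent of: hdfs dfs -copyFromLocal /tmp/sample.txt /user/data/sample.txt
  fs.copyFromLocalFile(local, hdfs)

  // equivalent of: hdfs dfs -setrep 2 /user/data/sample.txt
  fs.setReplication(hdfs, 2.toShort)
}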
For example, if you have set a replication factor of three, then because of the first rule of replication, that two replicas cannot reside on the same node, you need at least three data nodes available. Say you had a cluster with three nodes and replication set to three, and at some point one of your data nodes crashed; your blocks then become under-replicated, meaning a replication factor was set but there are no longer enough replicas to satisfy it. This is not a problem in itself: the master process, the name node, waits for some time, and if a data node is not responding or a disk has crashed and the name node gets no information about a replica, it starts re-replicating the missing blocks from the available nodes. While the name node is doing this, the blocks are in an under-replicated state. Over-replicated is the situation where the name node realizes there are extra copies of a block: perhaps you had three nodes running with replication three, one node went down due to a network failure or some other issue, within a few minutes the name node re-replicated the data, and then the failed node came back with its set of blocks. The name node is smart enough to recognize this as over-replication and deletes a set of blocks from one of the nodes; which node that is, whether the recently added one, the old node that rejoined, or another, depends on the load on each node. We have discussed Hadoop and HDFS; now let us discuss MapReduce, the programming model and, you could say, processing framework. What is distributed cache in MapReduce? We know that with MapReduce the data to be processed may exist on multiple nodes, and when your MapReduce program runs it reads that data from the underlying disks, which is a costly operation if the data has to be read from disk every time. Distributed cache is a mechanism by which a data set, or data coming from the disk, can be cached and made available to all worker nodes. How does that help? When a MapReduce job runs, instead of reading the data from disk every time, it picks it up from the distributed cache, which benefits your MapReduce processing. The distributed cache can be set in your job configuration, where you specify that a file should be picked up from the distributed cache. Now let us understand a few roles: what is a record reader, what is a combiner, what is a partitioner, and what part do they play in a MapReduce operation? The record reader communicates with the input split and converts the data into key-value pairs, and those key-value pairs are what the mapper works on. The combiner is an optional phase, like a mini reducer: the combiner does not have its own class, it relies on the reducer class; it receives the data from the completed map tasks, works on it based on what the reducer class specifies, and passes its output on to the reducer phase. The partitioner is the phase that decides how many reduce tasks will be used to aggregate or summarize the data: based on the number of keys and the number of map tasks, the partitioner decides whether one or multiple reduce tasks will be used to take care of the processing.
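To make the partitioner's role concrete, here is a minimal Scala sketch of a custom partitioner written against Hadoop's MapReduce API; the class name and the hash-based routing are just an illustration of the idea that the partitioner maps each intermediate key to one of the reduce tasks:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// routes each word (the intermediate key) to a reduce task based on its hash
class WordPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numReduceTasks: Int): Int = {
    if (numReduceTasks == 0) 0
    else (key.hashCode & Integer.MAX_VALUE) % numReduceTasks
  }
}

In a driver you would register it with job.setPartitionerClass(classOf[WordPartitioner]), alongside whatever number of reduce tasks you configure.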
So either the partitioner decides how many reduce tasks run, or it is driven by properties set in the cluster that control the number of reduce tasks used. Always remember that the partitioner decides how the outputs from the combiner are sent to the reducers, and to how many: it controls the partitioning of the keys of your intermediate map outputs. Whatever output the map phase generates is intermediate output, and it has to be taken by the combiner, if there is one, and then by the partitioner, to be sent to one or more reduce tasks. Here is one of the common questions you might face: why is MapReduce slower in processing? We know MapReduce does parallel processing, that multiple map tasks can run on multiple nodes at the same time, and that multiple reduce tasks can run as well, so why is it still a slower approach? First, MapReduce is a batch-oriented operation. It is also very rigid: it strictly uses mapping and reducing phases, so no matter what kind of processing you want to do, you still have to provide a mapper function and a reducer function to work on the data. In addition, whenever the map phase completes, its intermediate output is written to disk, and that data is then shuffled, sorted and picked up by the reducing phase; all this writing to and reading back from disk makes MapReduce a slower approach. The next question: for a MapReduce job, is it possible to change the number of mappers to be created? By default you cannot change the number of map tasks, because it depends on the input splits; however, you can either set a property that allows more map tasks to be used, or customize your code or use a different input format that controls the number of map tasks. By default the number of map tasks equals the number of splits of the file you are processing, so if you have a 1 GB file split into 8 blocks of 128 MB, there will be 8 map tasks running on the cluster, each running your mapper function. If you have hard-coded properties in your mapred-site.xml to specify more map tasks, you can control that number. Let us also talk about data types. When you prepare for Hadoop and want to get into the big data field, you should start learning about different data formats, such as Avro, Parquet, and the sequence file, which is a binary format. The data types in Hadoop are implementations of the Writable and WritableComparable interfaces, and for every data type in Java there is an equivalent in Hadoop: int in Java has IntWritable in Hadoop, float has FloatWritable, long has LongWritable, and you also have DoubleWritable, BooleanWritable, ArrayWritable, MapWritable and ObjectWritable. These are the data types you can use within a MapReduce program, and they implement the Writable and WritableComparable interfaces.
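As a quick illustration of those wrapper types, here is a small Scala sketch that simply constructs a few Writables and reads their values back; the words and numbers are arbitrary:

import org.apache.hadoop.io.{BooleanWritable, IntWritable, LongWritable, Text}

object WritableExamples extends App {
  val word  = new Text("hadoop")          // Hadoop's counterpart of a Java String
  val count = new IntWritable(1)          // counterpart of int
  val size  = new LongWritable(128L)      // counterpart of long
  val flag  = new BooleanWritable(true)   // counterpart of boolean

  println(s"${word.toString} -> ${count.get()}, ${size.get()}, ${flag.get()}")
}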
What is speculative execution? Imagine you have a cluster with a large number of nodes and your data is spread across multiple slave machines. At some point, due to disk degradation, network issues, a machine heating up, or heavier load on a particular node, a data node may execute its task more slowly. In that case, if speculative execution is turned on, a shadow task, another copy of the same task, is launched on some other node for the same processing; whichever task finishes first is accepted and the other is killed. Speculative execution can be useful in a workload-intensive environment: if a particular node is slow, you benefit from an unoccupied or lightly loaded node taking over the processing. To picture it: node A has a slower task, a scheduler knows what resources are available, and if speculative execution is enabled, a copy of the slow task, the shadow task, runs on another node, and whichever completes first is considered. That is speculative execution. Now, how is the identity mapper different from the chain mapper? Here we are getting deeper into MapReduce concepts. Every MapReduce program has a map class handling the mapping phase, with a mapper function running as one or more map tasks, and a reduce class running a reducer function as reduce tasks across nodes. The driver class is what holds all the information about the flow: the map class, the reduce class, the input format, the output format, the job configuration and so on. The identity mapper is the default mapper chosen when no mapper class is specified in the MapReduce driver class; it implements an identity function that writes all its input key-value pairs directly to the output, and it was defined in the old MapReduce API. The chain mapper, on the other hand, is a class used to run multiple mappers within a single map task: the output of the first mapper becomes the input of the second mapper and so on, and it is defined in the ChainMapper class of the MapReduce API. Next, what are the major configuration parameters required in a MapReduce program? Obviously we need the input location, the path the files are picked up from, preferably an HDFS directory, and the output location, the path where the job output is written. We need to specify the input and output formats, otherwise the defaults are used, and the classes containing the map and reduce functions. And if we intend to run the code on a cluster, we need to package the classes into a jar file and copy it to the cluster; that jar file contains the mapper, reducer and driver classes. Those are the important configuration parameters to consider for a MapReduce program, and a small driver sketch follows below.
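Pulling those configuration parameters together, here is a minimal Scala sketch of a word-count driver written against Hadoop's MapReduce API. To keep it self-contained it reuses Hadoop's built-in TokenCounterMapper and IntSumReducer instead of custom classes, and the cached lookup file path is purely hypothetical:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(WordCountDriver.getClass)

    job.setMapperClass(classOf[TokenCounterMapper])   // map phase: emit (word, 1)
    job.setCombinerClass(classOf[IntSumReducer])      // optional mini-reduce on the map side
    job.setReducerClass(classOf[IntSumReducer])       // reduce phase: sum the counts
    job.setNumReduceTasks(2)                          // how many reduce tasks to run

    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])

    // a file made available to every worker via the distributed cache (hypothetical path)
    job.addCacheFile(new URI("/apps/lookup/stopwords.txt"))

    FileInputFormat.addInputPath(job, new Path(args(0)))    // input location on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // output location on HDFS

    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

It would be packaged into a jar and launched with hadoop jar, passing the input and output paths as the two arguments.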
Now, what do you mean by map side join and reduce side join? A map side join is a join performed at the map phase, by the mapper itself: each input data set has to be divided into the same number of partitions, and the input to each map comes as a structured partition in sorted order. You can understand it by comparing with RDBMS concepts, where two tables are joined and the smaller table can be loaded into memory and used for the join against the bigger table; a map side join is a similar mechanism, where the input data is divided into the same number of partitions and the smaller data set can be held in memory. In a reduce side join the join is performed by the reducer; it is easier to implement than a map side join, because the sorting and shuffling send all values with identical keys to the same reducer, so you do not need your data sets in a structured form. Look into map side joins, reduce side joins and the other join types just to understand how MapReduce works, but I would suggest not focusing too much on them, because although MapReduce is still used for processing, the share of MapReduce-based processing has decreased across the industry. Next, what is the role of the OutputCommitter class in a MapReduce job? As the name says, the OutputCommitter describes the commit of task output for a MapReduce job; the relevant class is org.apache.hadoop.mapreduce.OutputCommitter, and you can write a class that extends it. MapReduce relies on the OutputCommitter of the job for setting up the job during initialization, cleaning up the job after completion, meaning releasing all the resources the job was using, setting up the task's temporary output, checking whether a task needs a commit, committing the task output, and discarding the task output if the task is aborted. So it is a very important class that can be used within your MapReduce job. What is the process of spilling in MapReduce? Spilling is the process of copying data from the memory buffer to disk when the buffer usage reaches a certain threshold: if there is not enough room in the in-memory buffer, its contents have to be flushed out, and by default a background thread starts spilling the contents from memory to disk once eighty percent of the buffer size is filled. When is this buffer used? During MapReduce processing, data is read from disk into the buffer and then processed, and the same happens when data is written back. So you can imagine that for a 100 MB buffer, spilling starts once the content of the buffer reaches 80 MB, and this threshold is customizable. How can you set the mappers and reducers for a MapReduce job? As I mentioned earlier, the number of mappers and reducers can be customized: by default the number of map tasks depends on the splits, and the number of reduce tasks depends on the partitioning phase, which decides how many reduce tasks are used based on the keys. However, we can set these properties either in the config files, on the command line, or as part of our code,
and that controls the number of map or reduce tasks run for a particular job. Let us look at one more interesting question: what happens when a node running a map task fails before sending its output to the reducer? There was a node running a map task, we know there can be one or many map tasks running across nodes, and all map tasks have to complete before the later stages, such as the combiner or reducer, come into play. If a node crashes while a map task is assigned to it, the whole task has to be run again on some other node. In Hadoop version 2, the YARN framework has a temporary daemon called the application master, which takes care of the execution of your application; if a particular task failed because its node became unavailable, it is the application master's role to get that task scheduled on another node. Can we write the output of MapReduce in different formats? Of course we can: Hadoop supports various input and output formats. You have the default, TextOutputFormat, where records are written as lines of text; SequenceFileOutputFormat for writing sequence files, binary files whose output is meant to be fed into another MapReduce job; MapFileOutputFormat to write the output as map files; SequenceFileAsBinaryOutputFormat, another variant of the sequence file format, which writes keys and values to a sequence file in binary, that is non-human-readable, form; and DBOutputFormat, which is used when you want to write data to relational databases or NoSQL databases such as HBase, sending the reduce output to a SQL table. Now let us learn a little about YARN, which stands for Yet Another Resource Negotiator; it is the processing framework. What benefits did YARN bring in Hadoop version 2, and how did it solve the issues of MapReduce version 1?
now let's learn a little bit about yarn which stands for yet another resource negotiator it is the processing framework so what benefits did yarn bring in hadoop version 2 and how did it solve the issues of mapreduce version 1 mapreduce version 1 had major issues when it comes to scalability and availability because in hadoop version 1 you had only one master process for the processing layer and that was your job tracker the job tracker was listening to all the task trackers running on multiple machines and was responsible for both resource tracking and job scheduling in yarn you still have a processing master but it is called resource manager instead of job tracker and with hadoop version 2 you can even have the resource manager running in high availability mode you have node managers running on multiple machines and then you have a temporary daemon called application master so in hadoop version 2 the resource manager is only handling the client connections and taking care of tracking the resources while the job scheduling or the execution across multiple nodes is controlled by the application master until the application completes in yarn you can have different kinds of resource allocations and there is a concept of a container a container is basically a combination of ram and cpu cores yarn can run different kinds of workloads so it is not just mapreduce that can run on hadoop version 2 you could have graph processing massive parallel processing real time processing and other applications running on a yarn based cluster when we talk about scalability hadoop version 2 can have a cluster size of more than 10 000 nodes and can run more than 100 000 concurrent tasks and this is because for every application which is launched you have this temporary daemon called application master so if i had 10 applications running i would have 10 app masters taking care of the execution of those applications across multiple nodes compatibility hadoop version 2 is fully compatible with whatever was developed as per hadoop version 1 and all your processing needs are taken care of by yarn so dynamic allocation of cluster resources handling different workloads allocating resources across multiple machines and using them for execution all of that is taken care of by yarn multi tenancy basically means you could have multiple users or multiple teams and open source and proprietary data access engines all hosted on the same cluster now how does yarn allocate resources to an application with the help of its architecture basically you have a client or an application or an api which talks to the resource manager the resource manager as i mentioned manages the resource allocation in the cluster and it has two internal components a scheduler and an applications manager so although we say the resource manager being the master is tracking the resources it is not actually the resource manager itself which negotiates the resources with the slaves but these internal components the scheduler allocates resources to the various running applications and is not bothered about tracking or monitoring those applications we can have different kinds of schedulers such as fifo which is first in first out a fair scheduler or a capacity scheduler and these schedulers control how resources are allocated to multiple applications when they run in parallel so
there is a queue mechanism the scheduler will schedule resources based on the requirements of an application but it does not monitor or track the status of applications the applications manager is the one which accepts the job submissions and monitors and restarts the application masters so it is the applications manager which launches an application master which is responsible for one application this is how it looks whenever a job submission happens the resource manager is already aware of the resources which are available with every node manager so on every node which has a fixed amount of ram and cpu cores some portion of those resources is allocated to the node manager now whenever a client request comes in the resource manager will request the node manager to hold some resources for processing the node manager will approve or disapprove this request of holding the resources and these resources that is a combination of ram and cpu cores are nothing but containers we can configure containers of different sizes within the yarn-site.xml file so the node manager based on a request from the resource manager guarantees the container which would be available for processing and that is when the resource manager starts a temporary daemon called application master to take care of the execution the app master which was launched by the resource manager or we can say internally by the applications manager will run in one of the containers because the application master is also a piece of code so it runs in one of the containers and the other containers are utilized for execution this is how yarn takes care of the allocation the application master manages the resource needs of the application it is the one which interacts with the scheduler and if a particular node crashes it is the responsibility of the app master to go back to the resource manager and negotiate for more resources so the app master will never negotiate resources with a node manager directly it will always talk to the resource manager and the resource manager is the one which negotiates the resources a container as i said is a collection of resources like ram cpu and network bandwidth and a container is allocated based on the availability of resources on a particular node
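to make the container sizing point a bit more concrete here is a small sketch that reads the usual yarn properties from the client side configuration the property names are standard yarn settings but the fallback values are just assumptions and the real values depend on the cluster's yarn-site.xml

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerLimits {
    public static void main(String[] args) {
        // picks up yarn-site.xml from the classpath if it is available
        Configuration conf = new YarnConfiguration();

        // per-node resources handed over to the node manager
        int nodeMemMb  = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
        int nodeVcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);

        // smallest and largest container yarn will hand out; a request above
        // the maximum cannot be satisfied by any single container
        int minMb = conf.getInt("yarn.scheduler.minimum-allocation-mb", 1024);
        int maxMb = conf.getInt("yarn.scheduler.maximum-allocation-mb", 8192);

        System.out.println("node manager memory (mb): " + nodeMemMb);
        System.out.println("node manager vcores     : " + nodeVcores);
        System.out.println("container range (mb)    : " + minMb + " - " + maxMb);
    }
}
```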
so which of the following has occupied the place of the job tracker of mapreduce version 1 it is the resource manager the resource manager is the name of the master process in hadoop version 2 now if you had to write yarn commands to check the status of an application you could just say yarn application -status and then the application id and you could kill it from the command line as well with yarn application -kill remember yarn has a ui and you can view your applications and even kill them from the ui however knowing the command line commands is very useful can we have more than one resource manager in a yarn based cluster yes we can that is what hadoop version 2 allows so you can have a high availability yarn cluster where you have an active and a standby resource manager and the coordination is taken care of by zookeeper at a particular time there can only be one active resource manager and if the active resource manager fails the standby resource manager becomes active however zookeeper plays a very important role remember zookeeper is the one which is coordinating the server state and it handles the election for the active to standby failover what are the different schedulers available in yarn you have a fifo scheduler that is first in first out and this is not a desirable option because a long running application might block all the other small applications the capacity scheduler is a scheduler where dedicated queues are created with a fixed amount of resources so you can have multiple applications accessing the cluster at the same time each using their own queues and the resources allocated to them in a fair scheduler you don't need a fixed amount of resources you can just have a percentage and you can decide what kind of fairness is to be followed which basically means that if you were allocated 20 gigabytes of memory out of a 100 gigabyte cluster and the other team was assigned 80 gigabytes then you have 20 percent access to the cluster and the other team has 80 percent however if the other team does not use the cluster then in a fair scheduler you can go up to a maximum of 100 percent of the cluster to find out more about schedulers you could look in hadoop the definitive guide or just go to google and search for example yarn scheduler and the definitive guide beautifully explains the different schedulers and how multiple applications run whether in fifo capacity or fair scheduling you can also search for untangling apache hadoop yarn which is a series of four blogs that beautifully explains how yarn works how the resource allocation happens what a container is and what runs within the container you can read through part one and then also look at part two which talks about allocation and so on so coming back we basically have these schedulers
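circling back to the yarn application commands at the start of this answer here is a minimal sketch of the programmatic equivalent using the yarn client api this assumes a reasonably recent hadoop 2 client and the application id string is a made up placeholder

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppStatus {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // placeholder id, equivalent to: yarn application -status <application id>
        ApplicationId appId = ApplicationId.fromString("application_1650000000000_0001");
        ApplicationReport report = yarn.getApplicationReport(appId);
        System.out.println(report.getName() + " -> " + report.getYarnApplicationState());

        // equivalent of: yarn application -kill <application id>
        // yarn.killApplication(appId);

        yarn.stop();
    }
}
```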
what happens if a resource manager fails while executing an application in a high availability cluster in a high availability cluster we know we have two resource managers one active and one standby and zookeeper keeping track of the server states so if a resource manager fails the standby will be elected as active the new active will initially instruct the application masters to abort and then the resource manager recovers its running state there is something called the rm state store where the status of all the running applications is stored so the resource manager recovers its running state by looking at the state store taking advantage of the container statuses and then continues to take care of the processing now in a cluster of 10 data nodes each having 16 gb ram and 10 cores what would be the total processing capacity of the cluster take a minute to think 10 data nodes 16 gb ram per node 10 cores if you mention the answer as 160 gb ram and 100 cores then you went wrong remember on every node in a hadoop cluster you will have one or multiple hadoop processes running and those processes need ram and the machine itself which has a linux file system has its own processes which also use some ram which basically means that for those 10 data nodes you should deduct at least 20 to 30 percent towards the overheads that is towards the operating system the daemons and the other services running on the node and in that case you could say you have around 11 or 12 gb available on every machine for processing and say 6 or 7 cores multiply that by 10 and that is your processing capacity the same thing applies to disk usage also so if somebody asks you in a 10 data node cluster where each machine has 20 terabytes of disk what is my total storage capacity available for hdfs the answer would not be 200 terabytes you have to consider the overheads and this is what gives you your processing capacity now let's look at one more question what happens if the requested memory or cpu cores goes beyond the size of a container as i said your configuration can say that on a particular data node which has 100 gb ram and 100 cores you could allocate say 50 gb and 50 cores for processing you could ideally allocate all 100 for processing but that is not really possible because of the overheads now within this ram and these cpu cores you have the concept of containers a container is a combination of ram and cpu cores and you have a minimum size container and a maximum size container so at any point of time if your application starts demanding more memory or more cpu cores than can fit into a container allocation your application will fail it will fail because you requested a combination of memory and cpu cores which is more than the maximum container size so look into the untangling apache hadoop yarn blogs which i mentioned and look for the second blog in that series which explains these allocations
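just to make the earlier 10 data node question concrete here is a tiny sketch of that back of the envelope calculation it assumes the roughly 25 percent overhead figure discussed above and the numbers are purely illustrative

```java
public class ClusterCapacity {
    public static void main(String[] args) {
        int nodes = 10;
        double ramPerNodeGb = 16.0;
        double coresPerNode = 10.0;
        double overhead = 0.25; // assumed 20-30% for the os, daemons and other services

        double usableRam   = nodes * ramPerNodeGb * (1 - overhead);  // ~120 gb, not 160
        double usableCores = nodes * coresPerNode * (1 - overhead);  // ~75 cores, not 100

        System.out.printf("usable ram  : ~%.0f gb%n", usableRam);
        System.out.printf("usable cores: ~%.0f%n", usableCores);
    }
}
```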
now here we will discuss hive pig hbase and these components of hadoop which are used in the industry for various use cases let's look at some questions and how you should prepare for them first of all we will learn about hive which is a data warehousing package so the question is what are the different components of a hive architecture when we talk about hive we already know that hive is a data warehousing package which allows you to work on structured data or data which can be given a structure so normally people are well versed with querying or processing the data using sql queries a lot of people come from database backgrounds and they find it comfortable if they know structured query language hive is a data warehousing package which resides within the hadoop ecosystem it uses the hadoop distributed file system to store the data and it usually uses an rdbms to store the metadata although the metadata can also be stored locally so what are the different components of a hive architecture it has a user interface the user interface calls the execute interface of the driver which creates a session for the query and then sends the query to the compiler to generate an execution plan usually whenever hive is set up its metadata is stored in an rdbms and to establish the connection between the rdbms and hadoop we need an odbc or jdbc connector jar file and that connector jar file has a driver class this driver class is mandatory to create the connection between hive and hadoop so the user interface creates this interface using the driver then we have the metastore the metastore stores the metadata information so for any object which you create such as databases tables or indexes the metadata is stored in the metastore and usually this metastore is kept in an rdbms so that multiple users can connect to hive the metastore sends the metadata to the compiler for the execution of a query what does the compiler do it generates the execution plan it has a dag that is a directed acyclic graph of stages where each stage is either a metadata operation a map or reduce job or an operation on hdfs and finally we have the execution engine that acts as a bridge between hive and hadoop to process the query the execution engine communicates bidirectionally with the metastore to perform operations like create or drop tables so these are the four important components of the hive architecture now what is the difference between an external table and a managed table in hive we have various kinds of tables in hive such as external tables managed tables and partitioned tables the major difference between managed and external tables is what happens to the data if the table is dropped usually whenever we create a table in hive it creates a managed table which we could also call an internal table hive manages the data and moves it into its warehouse directory by default whether you create a managed table or an external table the data can reside in hive's default warehouse directory or in a location you choose however if one drops a managed table not only is the metadata deleted but the table's data is also deleted from hdfs an external table is created with the external keyword explicitly and if an external table is dropped nothing happens to the data which resides on hdfs that is the main difference between managed and external tables what might be the use case there might be a migration kind of activity or you are interested in creating a lot of tables using your queries so in that case you could dump all the data on hdfs and then create tables pointing to a particular directory or multiple directories now you could do some testing of your tables and decide that you might not need all of them so in that case it would be advisable to create external tables so that even if a table is later dropped the data on hdfs stays intact unlike a managed table where dropping the table will delete the data from hdfs also
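here is a minimal sketch of creating such an external table through the hive jdbc driver it assumes the hive-jdbc driver is on the classpath and that a hiveserver2 is running at the url shown the url user and hdfs directory are all placeholders

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ExternalTableDemo {
    public static void main(String[] args) throws Exception {
        // placeholder hiveserver2 url, user and password
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // the table definition only points at the directory, no data is moved
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS customers ("
                + " id INT, name STRING, email STRING)"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
                + " LOCATION '/data/customers'");

            // dropping it later removes only the metadata from the metastore,
            // the files under /data/customers stay on hdfs
            // stmt.execute("DROP TABLE customers");
        }
    }
}
```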
let's learn a little bit about partitions so what is a partition in hive and why is partitioning required normally in the world of rdbms a partition is a way to group similar types of data together usually on the basis of a column which we call the partitioning key each table usually has one column in the context of rdbms which could be used to partition the data and we do that so that we can avoid scanning the complete table for a query and restrict the scan to a set of data or a particular partition in hive we can have any number of partition keys so partitioning provides granularity in a hive table and it reduces the query latency by scanning only the relevant partition data instead of the whole data set we can partition at various levels now if i compare rdbms with hive in rdbms the partition column is usually part of the table definition for example if i have an employee table with employee id name age and salary as four columns i could decide to partition the table on the salary column because the employee table is growing very fast and later when we query it we don't want to scan the complete table so i could split my data into multiple partitions based on salary ranges in hive it is a little different you can do partitioning and there is a concept of static and dynamic partitioning but the partition column is not part of the table definition so you might have an employee table defined with just employee id name and age but you could then partition it based on the salary column which will create a specific folder on hdfs per partition and when we query the data we can still see the partition column showing up so we can partition the transaction data of a bank for example by month like jan feb etc and any operation regarding a particular month will then only query that particular folder that is where partitioning is useful now why does hive not store metadata information in hdfs we know that hive's data is stored in hdfs which is the hadoop distributed file system however the metadata is either stored locally in which case hive runs in embedded mode or it is stored in an rdbms so that multiple clients can initiate connections now this metadata which is very important for hive is not stored in hdfs because hdfs read and write operations are time consuming it is a distributed file system meant to accommodate huge amounts of data so hive stores the metadata in the metastore using an rdbms instead of hdfs and this allows it to achieve low latency and faster metadata access
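coming back to the partitioning question for a moment here is a rough sketch again over jdbc that declares the partition column outside the regular column list and loads one month into its own folder the table file and month names are illustrative placeholders

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionDemo {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // txn_month is not part of the column list, it is the partition key
            stmt.execute("CREATE TABLE IF NOT EXISTS bank_txn ("
                + " txn_id BIGINT, account_id BIGINT, amount DOUBLE)"
                + " PARTITIONED BY (txn_month STRING)"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // static partition load: this file ends up under .../txn_month=jan/
            stmt.execute("LOAD DATA INPATH '/staging/txn_jan.csv'"
                + " INTO TABLE bank_txn PARTITION (txn_month='jan')");

            // a query filtering on the partition column scans only that folder
            stmt.execute("SELECT count(*) FROM bank_txn WHERE txn_month='jan'");
        }
    }
}
```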
now if somebody asks what are the components used in the hive query processor the main components are the parser the execution engine logical plan generation the optimizer and type checking so whenever a query is submitted it goes through the parser and the parser checks the syntax checks the objects being queried and other things to see if the query is fine internally you have a semantic analyzer which also looks at the query and you have an execution engine which works on the execution part that is the generated execution plan which is used to get the results for the query you could also have user defined functions which a user wants to use and these are normally written in java and then added to the classpath then you have logical plan generation which looks at your query and generates a logical plan or the best execution path required to get to the results internally there is also a physical plan generated which is then looked at by the optimizer to get the best path to the data and that might also involve checking the different operators you are using within your query finally we also have type checking so these are the important components in hive and somebody might ask you if you are querying your data using hive what are the different components involved or which components work when a query is submitted these are those components now let's look at a scenario based question suppose there are a lot of small csv files present in an hdfs directory and you want to create a single hive table from these files and the data in these files has fields like registration number name and email address so what will be your approach to create a single hive table for lots of small files without degrading the performance of the system there can be different approaches we know there are a lot of small csv files present in a directory and we know that when we create a table in hive we can use a location parameter so i could say create table give a table name give the columns and their data types specify the delimiters and finally say location and point it to the directory on hdfs which has the csv files in this case i will avoid loading the data into the table because the table pointing to the directory will directly pick up the data from one or multiple files we also know that hive does a schema check on read not on write so if there were one or two files which did not follow the schema of the table it would not prevent data loading the data would anyway be picked up and only when you query the data it might show you null values for records which do not follow the schema of the table this is one approach what is the other approach you can think about the sequence file format which is a smarter or binary format and you can group these small files together to form a sequence file so we could first create a temporary table by saying create table give a table name give the column names and their data types specify the delimiters that is row format delimited and fields terminated by and store that as a text file then we can load data into this table from a local file system path and then we can create a table that stores data in sequence file format so the first step is storing the data in a text file table and the next step is storing the data in sequence file format so we say create table give the specifications say row format delimited fields terminated by comma stored as sequence file then we can move the data from the text table into the sequence file table so i could just say insert overwrite my new table as select star from the other table remember in hive you cannot do row level insert update delete however if a table exists you can do an insert overwrite from an existing table into a new table so this could be one approach where lots of csv files or smaller files are clubbed together as one big sequence file and stored in the table
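a hedged sketch of that second approach over jdbc a text staging table is loaded first and then rewritten into a sequence file backed table with insert overwrite the table names paths and columns are placeholders

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        String[] steps = {
            // 1. staging table over plain text
            "CREATE TABLE IF NOT EXISTS reg_text (regno STRING, name STRING, email STRING)"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE",
            // 2. point the load at the directory of small csv files
            "LOAD DATA INPATH '/data/registrations' INTO TABLE reg_text",
            // 3. final table backed by the binary sequence file format
            "CREATE TABLE IF NOT EXISTS reg_seq (regno STRING, name STRING, email STRING)"
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE",
            // 4. rewrite the many small files as fewer, larger sequence files
            "INSERT OVERWRITE TABLE reg_seq SELECT * FROM reg_text"
        };
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {
            for (String hql : steps) {
                stmt.execute(hql);
            }
        }
    }
}
```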
now if somebody asks you to write a query to insert a new column of integer data type into a hive table at a position before an existing column that is possible by doing an alter table giving your table name and then specifying change column giving the new column with its data type before the existing column this is a simple way to insert a new column into a hive table what are the key differences between hive and pig some of you might have heard that hive is a data warehousing package and pig is more of a scripting language both of them are used for data analysis trend detection hypothesis testing data transformation and many other use cases so if we compare hive and pig hive uses a declarative language called hiveql that is hive query language similar to sql and it is used for reporting data analysis data transformation and data extraction pig uses a high level procedural language called pig latin for programming both of them use the mapreduce processing framework so when we run a query in hive or when we create and submit a pig script both of them trigger a mapreduce job unless we have set them to local mode hive operates on the server side of the cluster and works on structured data or data which can be given a structure pig usually operates on the client side of the cluster and allows structured unstructured and semi structured data hive does not support the avro file format by default however that can be done by using the right serializer deserializer so we can have hive table data stored in avro sequence file parquet or text file format however when we are working with smarter formats like avro sequence file or parquet we might have to use specific serdes and for avro there is a package which allows us to use the avro format pig supports the avro format by default hive was developed by facebook and it supports partitioning pig was developed by yahoo and it does not support partitioning so these are high level differences and there are many more remember hive is more of a data warehousing package and pig is more of a scripting language with a strictly procedural flow which allows us to process the data now let's go deeper and learn about pig which as i mentioned is a scripting language that can be used for data processing it also uses mapreduce although we can even run pig in local mode let's learn about pig in the next section now let's look at some questions about pig which is a scripting language extensively used for data processing and data analysis so the question is how is apache pig different from mapreduce we all know that mapreduce is a programming model and it is quite rigid when it comes to processing the data because you have to do the mapping and reducing and write a lot of code usually mapreduce is written in java but now it can also be written in python scala
and other programming languages so if we compare pig with mapreduce pig obviously is very concise it has fewer lines of code compared to mapreduce we also know that a pig script internally triggers a mapreduce job however the user need not know the mapreduce programming model they can simply write scripts in pig and those are automatically converted into mapreduce mapreduce on the other hand has many more lines of code pig is a high level language which can easily perform joins and other data processing operations mapreduce is a low level model in which join operations are not easy we can do joins using mapreduce but it is not really easy in comparison to pig as i said on execution every pig operator is converted internally into a mapreduce job so every pig script which is run is converted into a mapreduce job mapreduce overall is batch oriented processing so it takes more time to compile and execute whether you run a mapreduce job directly or it is triggered by a pig script pig works with all versions of hadoop while a mapreduce program written for one hadoop version may or may not work with other versions it depends on the dependencies the compiler the programming language and the hadoop version you are working with so these are the main differences between apache pig and mapreduce what are the different ways of executing a pig script you could create a script file saved as .pig or .txt and execute it using the pig command you could bring up the grunt shell that is pig's shell which usually starts in mapreduce mode but can also be brought up in local mode and we can also run pig embedded as an embedded script in another programming language so these are the different ways of executing your pig script now what are the major components of the pig execution environment this is a very common question interviewers always want to know the different components of hive of pig and of the other components in the hadoop ecosystem so for the major components of the pig execution environment you have pig scripts written in pig latin using built in operators and user defined functions and submitted to the execution environment then there is a parser which does type checking and checks the syntax of the script the output of the parser is a dag that is a directed acyclic graph so look up dag on wikipedia a dag is basically a sequence of steps which run in one direction then you have an optimizer which performs optimizations like merge transform split etc and aims to reduce the amount of data in the pipeline that is the whole purpose of the optimizer you have an internal compiler the pig compiler converts the optimized code into a mapreduce job and here the user need not know the mapreduce programming model or how it is written they only need to know how to run the pig script which is internally converted into a mapreduce job and finally we have the execution engine the mapreduce jobs are submitted to the execution engine to generate the desired results so these are the major components of the pig execution environment now let's learn about the different complex data types in pig pig supports various data types and the main complex ones are tuple bag and map what is a tuple
a tuple as you might have heard is an ordered set of fields which can contain a different data type for each field so in an array you would have multiple elements of the same type but a tuple is a collection of fields where each field can be of a different type we could have an example such as 1 comma 3 or 1 comma 3 comma a string or a float element and all of that forms a tuple a bag is a set of tuples and is represented by curly braces so you could imagine it like a dictionary like structure which holds various collection elements what is a map a map is a set of key value pairs used to represent data so when you work in the big data field you need to know the different data types supported by pig by hive and by the other components of hadoop so tuple bag and map along with things like array arraybuffer list and dictionary which is a key value structure these are the complex data types other than primitive data types such as integer chararray boolean float and so on now what are the various diagnostic operators available in apache pig these are some of the operators or options which you can use in a pig script you can do a dump the dump operator runs the pig latin script and displays the result on the screen so i could either do a dump and see the output on the screen or i could store my output in a particular file so we can load the data using the load operator and pig also has different loaders such as jsonloader or pigstorage which can be used depending on the kind of data you are working with and then you could do a dump either before processing or after processing and dump produces the result which could be stored in a file or seen on the screen you also have a describe operator that is used to view the schema of a relation so you can load the data and then view the schema of the relation using describe explain displays the physical logical and mapreduce execution plans so normally in rdbms when we use explain we would like to see what happens behind the scenes when a particular script or query runs so we could load the data using the load operator as in any other case and if we want to display the logical physical and mapreduce execution plans we use the explain operator there is also an illustrate operator which gives the step by step execution of a sequence of statements so sometimes when we want to analyze our scripts to see how good or bad they are or whether they really serve our purpose we can use illustrate and again you can test that by loading the data using the load operator and then using the illustrate operator to look at the step by step execution of the sequence of statements you want to execute so these are the different diagnostic operators available in apache pig now if somebody asks you to state the usage of group order by and distinct keywords in a pig script as i said pig is a scripting language so you could use various operators group collects records with the same key and groups the data in one or more relations here is an example i could create a variable called group underscore data or some other name and say group relation name by age now say i have a file loaded under the alias relation name with various fields so
i could group that data by a particular field order by is used to display the contents of a relation in sorted order whether ascending or descending so i could create a variable called relation 2 and say order relation name 1 by a field in ascending or descending order distinct removes the duplicate records and it is implemented only on entire records not on individual fields so if you want to find the distinct records in a relation you could use distinct what are the relational operators in pig you have various relational operators which help data scientists data analysts or developers who are analyzing the data such as cogroup which joins two or more tables and then performs a group operation on the joined result you have cross which is used to compute the cross product that is the cartesian product of two or more relations foreach is basically for iteration it will iterate through the tuples of a relation generating a data transformation so for example i could load a file into a variable a and then create a variable b where i say foreach a generate something join is to join two or more tables in a relation limit is to limit the number of output tuples split is to split a relation into two or more relations union merges the contents of two or more relations and order is to get a sorted result so these are some relational operators which are extensively used in pig for analysis what is the use of having filters in apache pig now say for example i have some data which has three fields year product and quantity and this is my phone sales data the filter operator can be used to select the required values from a relation based on a condition and it also allows you to remove unwanted records from a data file for example filter the products where quantity is greater than thousand so i see that i have rows where the quantity is greater than thousand such as 1500 1700 1200 so i could create a variable called a load my file using pigstorage as i explained earlier pigstorage is an internal loader which can be used to specify the delimiter and here my delimiter is comma so i could say using pigstorage comma as and then specify the data type for each field year being integer product being chararray and quantity being integer then for b i could say filter a by quantity greater than thousand so it is very concise very simple and it allows us to extract and process data in a simpler way now suppose there is a file called test.txt having 150 records in hdfs where we can consider every record to be one line and somebody asks you to write a pig command to retrieve the first 10 records of the file first we will have to load the data so i could create a variable called test underscore data and say load my file using pigstorage specifying the delimiter as comma as and then specify my fields whatever fields the file has and then i would want to get only 10 records for which i could use the limit operator so i could say limit on test data and give me 10 records this is very simple and we can extract 10 records from the 150 records stored in the file on hdfs now we have learned about pig and we have learned some questions on hive you could always look into books like programming hive or programming pig for some more examples and try them out on an existing hadoop setup
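the same filter and limit example can also be run embedded in a java program which is one of the execution modes mentioned earlier here is a minimal sketch using pigserver the input and output paths and the local exec type are just assumptions for illustration

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigDemo {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE would submit to the cluster instead of running locally
        PigServer pig = new PigServer(ExecType.LOCAL);

        // load the comma delimited sales file with explicit field types
        pig.registerQuery("sales = LOAD '/data/phone_sales.csv' USING PigStorage(',')"
            + " AS (year:int, product:chararray, quantity:int);");
        // keep only the rows with quantity above one thousand
        pig.registerQuery("big_sales = FILTER sales BY quantity > 1000;");
        // and retrieve just the first ten of those records
        pig.registerQuery("top10 = LIMIT big_sales 10;");

        // write the result out, equivalent to a STORE statement in a script
        pig.store("top10", "/output/top10");
        pig.shutdown();
    }
}
```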
now let's learn about hbase which is a nosql database hbase is a four dimensional database in comparison to rdbms which is usually two dimensional rdbms have rows and columns but hbase has four coordinates it has a row key which is always unique column families which can be any number column qualifiers which again can be any number per column family and then you have a version these four coordinates make hbase a four dimensional key value store or column family store which is well suited for storing huge amounts of data and extracting data from it there is a very good link which i would suggest everyone look at if you want to learn more about hbase you can search for hbase mapr which brings up documentation from mapr although it is not specific to mapr and it gives a detailed explanation of hbase how it works what the architectural components are how data is stored and what makes hbase a very powerful nosql database so let's look at some of the important questions on hbase which might be asked by an interviewer when you are applying for a big data admin or developer role what are the key components of hbase as i said this is one of the favorite questions of interviewers where they want to understand your knowledge of the different components of a particular service hbase is a nosql database that comes as a service with cloudera or hortonworks and with apache hadoop you could also set up hbase as an independent package so what are the key components of hbase hbase has region servers hbase follows a similar kind of topology as hadoop hadoop has a master process that is the name node and slave processes such as data nodes and the secondary name node in the same way hbase also has a master which is hmaster and the slave processes are called region servers these region servers are usually co-located with data nodes however it is not mandatory that if you have 100 data nodes you have 100 region servers that purely depends on the admin so what does a region server contain a region server contains hbase tables that are divided horizontally into regions you could say a group of rows is called a region so in hbase you have two groupings one is a group of columns which is called a column family and one is a group of rows which is called a region these rows are grouped based on the row keys which are always unique when you store your data in hbase you have data in the form of rows and columns and groups of rows are called regions or you could say these are horizontal partitions of the table a region server manages these regions on the node where a data node is running a region server can have up to a thousand regions it runs on every node and decides the size of the regions so the region server is a slave process responsible for managing hbase data on the node each region server is a worker process co-located with a data node which takes care of the read write update and delete requests from the clients now when we talk about more components of hbase you have hmaster and you would always have a connection coming in from a client or an application what does hmaster do it assigns regions and it monitors the region servers
it assigns regions to region servers for load balancing and it cannot do that without the help of zookeeper so if we talk about the components of hbase there are three main components you have zookeeper you have hmaster and you have region servers the region server being the slave process and hmaster being the master process which takes care of all your table operations assigning regions to the region servers and handling the read and write requests which come from clients and for all of this hmaster takes the help of zookeeper which is a centralized coordination service so whenever a client wants to read or write or change the schema or do any other metadata operation it contacts hmaster and hmaster internally contacts zookeeper you could also have hbase set up in high availability mode where you have an active hmaster and a backup hmaster and you would have a zookeeper quorum which is the way zookeeper works zookeeper is a centralized coordination service which always runs with a quorum of processes that is an odd number of processes such as 3 5 or 7 because zookeeper works on the concept of majority consensus zookeeper keeps track of all the servers which are alive and available and also keeps track of their status for every server that zookeeper is monitoring it keeps a session alive with that particular server hmaster always checks with zookeeper which region servers are available and alive so that regions can be assigned to them at one end you have region servers sending their status to zookeeper indicating whether they are ready for any kind of read or write operation and at the other end hmaster is querying zookeeper to check that status now zookeeper internally manages a meta table that meta table has the information of which regions reside on which region server and what row keys those regions contain so in case of a read activity hmaster will query zookeeper to find out the region server which holds the meta table and once hmaster gets that information it can look into the meta table to find out the row keys and the corresponding region servers which contain the regions for those row keys now if we want to understand row keys and column families in hbase let's look at this and it would be good if you could look at this on an excel sheet the row key is always unique it acts as a primary key for any hbase table it allows a logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server so as i said you have four coordinates in hbase you have a row key which is always unique you have column families which are groups of columns and one column family can have any number of columns hbase is four dimensional and is also called a column oriented database which basically means that every row in one column could have a different data type you have a row key which uniquely identifies the row you have column families which could be one or many depending on how the table has been defined and a column family can have any number of column qualifiers or i could say for every row within a column family you could have a different number of columns so for my row one i could have just two columns such as name and city within the column family for my row two i could have name city age designation
salary and for my third row i could have a thousand columns and all of that could belong to one column family so this is a horizontally scalable database a column family consists of a group of columns which is defined during table creation and each column family can have any number of column qualifiers separated by a delimiter so the combination of row key column family column qualifier such as name city age and the value within the cell makes hbase a unique four dimensional database for more information if you want to learn about hbase please refer to the hbase mapr link which gives the complete hbase architecture that is the three components hmaster region servers and zookeeper how it works how hmaster interacts with zookeeper what zookeeper does in coordination how the components work together and how hbase takes care of reads and writes coming back and continuing why do we need to disable a table there are different table operations you can do in hbase and one of them is disabling a table if you want to check the status of a table you can check it with is disabled or is enabled and the table name so why do we need to disable a table if we want to modify a table or we are doing some kind of maintenance activity we can disable the table so that we can modify the table or change its settings when a table is disabled it cannot be accessed through the scan command now if we have to write code to open a connection in hbase to interact with hbase one could either use a graphical user interface such as hue or the command line hbase shell or the hbase java client api if you are working with java or say happybase if you are working with python where you may want to open a connection with hbase so that you can work with hbase programmatically in that case we have to create a configuration object and then a connection and then you can use different classes like the htable interface to work on a table hcolumndescriptor and many other classes which are available in the hbase client api what does replication mean in terms of hbase hbase as i said works in a cluster and when you talk about clusters you can always set up replication from one hbase cluster to another hbase cluster this replication feature in hbase provides a mechanism to copy data between clusters or sync the data between different clusters this feature can be used as a disaster recovery solution that provides high availability for hbase so if i have hbase cluster one where i have one master and multiple region servers running in a hadoop cluster i could use the same hadoop cluster to create an hbase replica cluster or i could have a totally different hbase replica cluster where my intention is that if things change in a particular table in cluster one i want them to be replicated to the other cluster so i could alter the hbase table and set the replication scope to one a replication scope of zero indicates that the table is not replicated but if we set the replication scope to one we basically have to set up a peer hbase cluster where we can replicate the hbase table data from cluster one to cluster two so these are the commands which can be used to enable replication and then replicate the data of a table across clusters
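as a rough sketch of the connection code mentioned a little earlier here is the newer hbase client api style rather than the older htable and hbaseadmin classes the zookeeper quorum table name column family and row values are all placeholders

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseConnectDemo {
    public static void main(String[] args) throws Exception {
        // picks up hbase-site.xml from the classpath; the quorum here is a placeholder
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // write one cell: row key "emp1", column family "personal", qualifier "name"
            Put put = new Put(Bytes.toBytes("emp1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("asha"));
            table.put(put);
        }
    }
}
```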
can we import and export in hbase of course we can it is possible to import and export tables from one hbase cluster to another or even within a cluster we can use the hbase export utility give a table name and a target location and that will export the data of an hbase table into a directory on hdfs then i could create a different table following the same definition as the exported table and use import to import the data from the directory on hdfs into my table if you want to learn more about hbase import and export you could search for hbase import operations and the documentation explains the import export utilities how you can do a bulk import and bulk export which internally uses mapreduce and how to import and export hbase tables moving further what do we mean by compaction in hbase we know that hbase is a nosql database which can store huge amounts of data however whenever data is written to hbase it is first written to what we call the write ahead log and also to the memstore which is the write cache once the data is written to the wal and the memstore it is flushed to form an internal hbase format file called an hfile and usually these hfiles are quite small in nature we also know that hdfs works best with a few large files rather than a large number of smaller files due to the limitation of the name node's memory compaction is the process of merging these smaller hfiles into a single larger file this is done to reduce the amount of memory required to store the files and the number of disk seeks needed so a lot of hfiles get created as data is written to hbase and these smaller files can then be compacted through major or minor compaction creating one big hfile which internally is written to hdfs in the form of blocks that is the benefit of compaction there is also a feature called bloom filter so how does a bloom filter work an hbase bloom filter is a mechanism to test whether an hfile contains a specific row or row column cell the bloom filter is named after its creator burton howard bloom it is a data structure which predicts whether a given element is a member of a set of data it provides an in memory index structure that reduces disk reads and determines the probability of finding a row in a particular file this is one of the very useful features of hbase which allows for faster access and avoids disk seeks does hbase have any concept of namespace a namespace is when you have similar elements grouped together and yes hbase supports namespaces a namespace is a logical grouping of tables analogous to a database in rdbms so you can compare an hbase namespace to the schema of an rdbms database you could create a namespace by saying create namespace and giving it a name and then you could also list the tables within a namespace and create tables within a specific namespace this is usually done in production environments where a cluster might be multi tenant and there might be different users of the same nosql database in that case the admin would create specific namespaces and for every namespace you would have different directories on hdfs and the users of a particular business unit or team can work on their hbase objects within a specific namespace
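here is a minimal sketch of creating such a namespace and a table inside it using the hbase 2.x java admin api the namespace table and column family names are made up for illustration

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            // equivalent of: create_namespace 'sales_team' in the hbase shell
            admin.createNamespace(NamespaceDescriptor.create("sales_team").build());

            // a table created as sales_team:orders lives inside that namespace
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("sales_team", "orders"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("details"))
                .build());
        }
    }
}
```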
this next question is again very important to understand the writes and reads so how does the write ahead log wal help when a region server crashes as i said when a write happens it goes into the memstore and the wal that is the edit log or write ahead log so whenever a write happens it happens in two places the memstore which is the write cache and the wal which is an edit log only when the data is written in both these places and once the memstore reaches its size limit is the data flushed to create an hbase format file called an hfile these files are then compacted into one bigger file which is stored on hdfs and hdfs data as we know is stored in the form of blocks on the underlying data nodes so if a region server hosting a memstore crashes and remember the region server is co-located with a data node so if a data node crashes or the region server which was hosting the memstore write cache crashes the data in memory that was not persisted is lost how does hbase recover from this as i said the data is written into the wal and the memstore at the same time hbase guards against data loss by writing to the wal before the write completes the hbase cluster keeps a wal to record changes as they happen which is why we also call it an edit log if hbase goes down or the node goes down the data that was not flushed from the memstore to an hfile can be recovered by replaying the write ahead log and that is the benefit of the edit log or write ahead log now if we had to write hbase commands to list the contents and update the column families of a table i could just do a scan and that would give me the complete data of a table if you are very specific and want to look at a particular row then you could do a get with the table name and the row key however a scan gives you the complete data of a table you could also do a describe to see the different column families and if you want to alter the table and add a new column family it is very simple you can just say alter give the hbase table name and then the new column family name which will then be added to the table what are catalog tables in hbase as i mentioned zookeeper knows the location of the internal catalog table or what we call the meta table the catalog tables in hbase are two tables one is the hbase meta table and one is -root- the catalog table hbase meta exists as an hbase table and is filtered out of the hbase shell's list command so if i give a list command it would list all the tables hbase contains but not the meta table it is an internal table this meta table keeps a list of all the regions in the system and the location of hbase meta is stored in zookeeper so if somebody wants to look for particular rows they need to know the regions which contain that data and the region servers where those regions are located and to get all this information one has to look into this meta table however we will not be looking into the meta table directly we would just issue a read or write operation and internally the hbase master queries zookeeper zookeeper has the information of where the meta table exists and that meta table which lives on a region server contains the information of row keys and the region servers where those rows can be found the root table keeps track of the location of the meta table
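going back to the scan get and alter commands just above here is a small sketch with the hbase 2.x java client that mirrors those shell commands the employee table the row key and the new column family name are placeholders

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanGetAlterDemo {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("employee"));
             Admin admin = connection.getAdmin()) {

            // scan 'employee' -> full contents of the table
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(row);
                }
            }

            // get 'employee', 'emp1' -> a single row by row key
            Result one = table.get(new Get(Bytes.toBytes("emp1")));
            System.out.println(one);

            // alter 'employee', NAME => 'professional' -> add a column family
            admin.addColumnFamily(TableName.valueOf("employee"),
                ColumnFamilyDescriptorBuilder.of("professional"));
        }
    }
}
```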
what is hotspotting in hbase and how to avoid hotspotting this is a common problem and admins or people managing the infrastructure always think about it one of the main ideas is that hbase should leverage the benefit of hdfs so all read and write requests should be uniformly distributed across all of the regions on the region servers otherwise what is the benefit of having a distributed cluster you have your data stored across region servers in the form of regions which are horizontal partitions of the table and whenever read and write requests happen they should be uniformly distributed across all those regions hotspotting occurs when a given region serviced by one region server receives most or all of the read write requests which is basically an unbalanced pattern of read write operations hotspotting can be avoided by designing the row key in such a way that the data being written goes to multiple regions across the cluster you could use techniques such as salting hashing or reversing the key and many other techniques employed by users of hbase we just need to make sure that the regions are spread across region servers so that read and write requests can be served by different region servers in parallel rather than all requests hitting the same region server overloading it which may even lead to the crash of that particular region server so these were some of the important questions on hbase and there are many more please refer to the link which i mentioned during the discussion which gives a detailed explanation of how hbase works you can also look into hbase the definitive guide by o'reilly or hbase in action which are really good books to understand hbase internals now we have learned about hive which is a data warehousing package we have learned about pig which is a scripting language that allows you to do data analysis and we have looked at some questions on a nosql database just note that there are more than 225 nosql databases existing in the market and if you want to learn about more nosql databases you can just go to google and search for nosql databases org and that will take you to a link which shows there are more than 225 nosql databases in the market for different use cases different users and with different features so have a look at that link now when you talk about data ingestion there is one good link which i would suggest having a look at which lists down around 18 different ingestion tools some are for structured data some are for streaming data some are for data governance some are for data ingestion and transformation and so on and it also gives you a comparison of the different data ingestion tools so here let's learn about some questions on sqoop which is one of the data ingestion tools mainly used for structured data or you could say data which is coming in from rdbms or data which is already structured and you would want to ingest that and store it on hdfs which could then be used for any kind of processing using mapreduce hive pig spark or any other processing framework or you would want to load that data into say hive or hbase tables
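before moving on to sqoop and circling back to the hotspotting answer above here is a tiny sketch of the salting idea sequential keys get a small prefix derived from a hash so consecutive writes land in different regions the bucket count and key layout are just an illustration

```java
public class SaltedRowKey {
    // number of salt buckets, often chosen close to the expected region count (assumption)
    private static final int BUCKETS = 10;

    // prefix the natural key with a two digit salt derived from its hash
    static String salt(String naturalKey) {
        int bucket = Math.abs(naturalKey.hashCode() % BUCKETS);
        return String.format("%02d#%s", bucket, naturalKey);
    }

    public static void main(String[] args) {
        // sequential, monotonically increasing keys would normally hotspot one region
        for (int i = 0; i < 5; i++) {
            String naturalKey = "2022-01-15#order-" + (1000 + i);
            System.out.println(salt(naturalKey));
        }
    }
}
```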
Sqoop is mainly for structured data; it is extensively used when organizations are migrating from an RDBMS to a big data platform and are interested in ingesting data, that is, importing and exporting data between the RDBMS and HDFS. so let's learn some important questions on Sqoop which you may be asked by an interviewer when you apply for a big data related position. how is Sqoop different from Flume? this is a very common question. Sqoop is mainly for structured data: it works with RDBMSs, and also with NoSQL databases, to import and export data, so you can import data into HDFS, import it directly into a data warehousing package such as Hive or into HBase, and you can also export data from the Hadoop ecosystem back to your RDBMS. Flume, however, is more of a data ingestion tool for streaming or unstructured data, data which is constantly being generated, for example log files, metrics from servers, or messages from a chat application. so if you are interested in capturing and storing streaming data in a storage layer such as HDFS or HBase, you could use Flume; there are other tools as well, like Kafka, Storm, Chukwa, Samza, NiFi and so on. loading data with Sqoop is not event driven; it works on data which is already stored in an RDBMS. Flume is completely event driven: as the messages or events happen and the data is generated, it can be ingested using Flume. Sqoop works with structured data sources, and you have various Sqoop connectors which are used to fetch data from external data stores, so for every RDBMS such as MySQL, Oracle, DB2 or Microsoft SQL Server a different connector is available. Flume fetches streaming data such as tweets, log files or server metrics from the different sources where the data is generated, and it also helps if you want to not only ingest the data arriving in a streaming fashion but process it as it arrives. Sqoop can import data from an RDBMS onto HDFS and export it back to the RDBMS; Flume is used for streaming data, where you can have one-to-one, one-to-many or many-to-one relations, and it has components such as source, channel and sink. that is the main difference between Sqoop and Flume. what are the different file formats to import data using Sqoop? there are many formats in which Sqoop can import data. the delimited text file format is the default import format, and it can also be specified explicitly using the as-textfile argument. so when I want to import data from an RDBMS, I can get that data onto HDFS using different compression schemes or in different formats using specific arguments; for example, I could specify an argument which writes a string-based representation of each record to the output files, with delimiters between individual columns and rows, and that is the default format used to import data with Sqoop. to learn more about Sqoop and the different arguments which are available, you can go to sqoop.apache.org, look into the documentation, and I would suggest choosing one of the versions and looking into the user guide, where you can search for the specific control arguments which show how you can import data using Sqoop.
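here is a minimal sketch of a Sqoop import in the default delimited text format; the host, database, credentials, table and target directory are hypothetical placeholders:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --table customers \
      --as-textfile \
      --fields-terminated-by ',' \
      --target-dir /user/hadoop/customers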
in the user guide you have the common arguments, and then you also have the import control arguments, where there are different options such as getting data as an Avro data file, a sequence file, a text file or a Parquet file; these are the different formats. you can also get data with the default compression scheme, which is gzip, or you can specify a compression codec to choose which compression mechanism you want to use when importing your data with Sqoop. as for the default format in which Sqoop stores binary data, that is the sequence file, a binary format that stores individual records in record-specific data types; these data types are manifested as Java classes, and Sqoop automatically generates them for you. with the sequence file format you are storing all data in a binary representation. so, as I mentioned, you can import data in different formats such as Avro, Parquet or sequence file, which is a binary, machine-readable format, and you can also apply different compression schemes. let me show you some quick examples. if I look at the file where I have listed some Sqoop examples, you can see the different compression schemes: I am doing a sqoop import, and I am also giving the mapreduce.framework.name argument on the command line, because a Sqoop import triggers a map-only job (no reduce happens here), and with that argument I can run the map-only job in local mode to save time, or let it interact with YARN and run a full-fledged map-only job. we give the connection string for whatever RDBMS we are connecting to, mentioning the database name, give the username and password, give the table name, and give a target directory (otherwise Sqoop creates a directory with the same name as the table, which works only once), and then I can say -z to get the data compressed with gzip, or I can specify a compression codec such as Snappy, BZip2 or LZ4. I could also run a query with sqoop import; if you notice, in that case I have not given any table name, because the table is included in the query. I can get my data in sequence file format, a binary format which creates a large file, so we could also enable compression and say the output of the map job should use compression at record level for data coming in as a sequence file; the sequence file binary format supports compression at record level or at block level. I could also get my data as an Avro file, where the schema is embedded within the file, or as a Parquet file. so these are the different ways in which you can set up compression schemes or get data in different formats with a simple sqoop import. looking further, what is the importance of the eval tool in Sqoop? the sqoop eval tool allows users to execute user-defined queries against the respective database server and preview the result in the console. so either I could straight away run a query to import the data into HDFS, or I could first use sqoop eval, connect to my external RDBMS, specify my username and password, and give a query to preview the result of the data we intend to import.
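a minimal sketch of the compression and eval usage described above, again with hypothetical connection details and table names:

    # import as a sequence file compressed with the snappy codec
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --table orders \
      --as-sequencefile \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
      --target-dir /user/hadoop/orders_seq

    # preview a query against the database before deciding what to import
    sqoop eval \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --query "SELECT COUNT(*) FROM orders"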
now let's learn how Sqoop imports and exports data between an RDBMS and HDFS, and look at its architecture. an RDBMS, as we know, has your database structures and tables, all of which are logical, and internally there is always metadata which is stored. a sqoop import connects to an external RDBMS, and for this connection it uses a connector jar file which has a driver class; that is something which needs to be set up by the admin, who has to make sure that for whichever RDBMS you intend to connect to, the JDBC connector for that particular RDBMS is placed within the Sqoop lib folder. the sqoop import first gets the metadata, and then the Sqoop command is converted into a map-only job which might have one or multiple map tasks; that depends on your Sqoop command, since you can specify whether you want to do the import with one task or with multiple tasks. these map tasks then each run on a section of the data from the RDBMS and store it on HDFS. so at a high level, Sqoop introspects the database to gather the metadata and divides the input data set into splits, and this division into splits mainly happens on the primary key column of the table. now somebody might ask: what if my table in the RDBMS does not have a primary key column? in that case, when you are doing a sqoop import, either you import it using one mapper task by specifying -m 1, or you use the split-by parameter to specify a numeric column from the RDBMS, and that is how you can import the data. let me show you a quick example from the Sqoop command file. in the first example we specify -m 1, which basically means I want to import the data using one map task; in this case it does not matter whether the table has a primary key column or not. but if I say -m 6, specifying multiple map tasks for the import, then Sqoop will look for a primary key column in the table you are importing. if the table does not have a primary key column, I can specify split-by with a column name, so that the data can be split into multiple chunks and multiple map tasks can take it. the second scenario is that your table has neither a primary key column nor a numeric column on which you could do a split-by; in that case, if you still want to use multiple mappers, you can do a split-by on a textual column, but you have to add a property which allows splitting on non-numeric data. all of these options are described in the documentation at sqoop.apache.org. going further, how does Sqoop import and export data between an RDBMS and HDFS? as I said, it submits a map-only job to the cluster and then does the import or the export. if we are exporting data from HDFS, there is again a map-only job: it looks at the multiple splits of the existing data, processes them through one or more map tasks, and then exports the data to the RDBMS. suppose you have a database test_db in MySQL and somebody asks you to write a command to connect to this database and import tables using Sqoop. here is a quick example, as I showed in the command file: you say sqoop import and connect using JDBC, and this will only work if the JDBC connector already exists within your Sqoop lib directory.
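a minimal sketch of controlling the number of mappers and the split column, with hypothetical database, table and column names:

    # table without a primary key: fall back to a single mapper
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/test_db \
      --username db_user -P \
      --table payments \
      -m 1 \
      --target-dir /user/hadoop/payments

    # keep multiple mappers by splitting on a numeric column
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/test_db \
      --username db_user -P \
      --table payments \
      --num-mappers 6 \
      --split-by customer_id \
      --target-dir /user/hadoop/payments_split

    # splitting on a text column additionally needs this property
    sqoop import \
      -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
      --connect jdbc:mysql://dbhost:3306/test_db \
      --username db_user -P \
      --table payments \
      --split-by customer_name \
      --target-dir /user/hadoop/payments_text_split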
the admin has to set that up. so you connect to your RDBMS, point to the database (here our database name is test_db), give the username, and then either give the password on the command line or just say -P so that you are prompted for the password, and then give the table name which you want to import. I could also specify -m and say how many map tasks I want to use for this import, as I showed on the previous screen. how do you export a table back to the RDBMS? for this we need the data in a directory on HDFS. for example, say there is a departments table in a retail database which has already been imported using Sqoop, and you need to export this table back to the RDBMS. this is the content of the table; now create a new departments table in the RDBMS, specifying the column names, whether they allow nulls, and a primary key column, which is always recommended, and then do a sqoop export: connect to the RDBMS specifying your username and password, specify the table into which you want to export the data, and give the export directory pointing to a directory on HDFS which contains the data. that is how you can export data into a table. looking at examples of this in my file, I have an example of an import where you import data directly into Hive, a sqoop import where you import data directly into an HBase table (which you can then query to look at the data), and an export where you run the map-only job in local mode, connect to the RDBMS, specify your username, the table you want to export to, and the directory on HDFS where you have kept the relevant data; that is a simple example of an export. looking further, what is the role of the JDBC driver in a Sqoop setup? as I said, if you want Sqoop to connect to an external RDBMS, we need the JDBC connector jar file. the admin downloads the JDBC connector jar file and places it within the Sqoop lib directory, wherever Sqoop is installed, and this connector jar file contains a driver. the JDBC driver is a standard Java API which is used for accessing different RDBMS databases, so this connector jar file is very much required: it has a driver class, and this driver class enables the connection between your RDBMS and your Hadoop setup. each database vendor is responsible for writing their own implementation that allows communication with the corresponding database, and we need to download the drivers which allow Sqoop to connect to the external RDBMS. the JDBC driver alone is not enough for Sqoop to connect; we also need connectors to interact with the different databases. a connector is a pluggable piece that is used to fetch metadata and allows Sqoop to overcome the differences in SQL dialects; this is how the connection is established. normally your admins, when they are setting up Sqoop and Hadoop, would download, say, the MySQL JDBC connector: they would go to the MySQL connectors page (if you are connecting to MySQL; similarly for other RDBMSs, you would go to that vendor's page), possibly pick a previous version, choose the platform independent download, and download the connector jar file.
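a minimal sketch of the export and the direct Hive and HBase imports described above; the database, tables, column family and directory names are hypothetical, and for the export the departments table must already exist in the RDBMS:

    # export a departments directory on HDFS back into an RDBMS table
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --table departments \
      --export-dir /user/hadoop/departments \
      --fields-terminated-by ','

    # import a table straight into a Hive table
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --table products \
      --hive-import \
      --hive-table products_hive

    # import a table straight into an HBase table, creating it if needed
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --table products \
      --hbase-table products \
      --column-family cf \
      --hbase-row-key product_id \
      --hbase-create-table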
if you open this connector jar file you will see the com.mysql.jdbc package, and this package contains the driver class, com.mysql.jdbc.Driver, which allows Sqoop to connect to your RDBMS. these things have to be done by your admin so that Sqoop can connect to an external RDBMS. now, how do you update the columns that are already exported? if I do an export and put my data in an RDBMS, can I really update the columns that are already exported? yes, I can, using the update-key parameter. the sqoop export command remains the same; the only things I have to specify are the table name, fields-terminated-by if you have a specific delimiter, and then update-key with the column name. this allows us to update the columns that have already been exported to the RDBMS. what is codegen? Sqoop commands translate into MapReduce, or rather map-only, jobs, and codegen is a tool in Sqoop that generates the data access object (DAO) Java classes that encapsulate and interpret imported records. so if I do a sqoop codegen, connect to an RDBMS with my username and give a table, it generates the Java code for, say, the employee table in the test database; codegen can be useful for understanding what data we have in that particular table. finally, can Sqoop be used to convert data into different formats? I think I already answered that: yes, Sqoop can import data in different formats, and that depends on the different arguments you use when you do the import, such as an Avro data file, a Parquet file, or a binary sequence file with record-level or block-level compression. if you are interested in knowing more about the different data formats, I would suggest searching for a good article on Hadoop file formats; a good write-up will cover the different data formats you should know, such as the text file format, the different compression schemes, how the data is organized, what the common formats are, structured binary sequence files with and without compression, what record-level and block-level compression are, what an Avro data file is, what a Parquet or columnar format is, and other formats like ORC and RC, so please have a look at that. thank you all for watching this full course video on big data for 2022, I hope it was useful and informative. if you have any queries, please feel free to put them in the comments section of the video and we will be happy to help you. thanks again, stay safe and keep learning. hi there, if you like this video, subscribe to the Simplilearn YouTube channel and click here to watch similar videos; to nerd up and get certified, click here.
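to round off the Sqoop questions above, here is a minimal sketch of an update-key export and of codegen; the connection details, tables and the department_id column are hypothetical:

    # update rows that were already exported, matching on department_id
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --table departments \
      --export-dir /user/hadoop/departments \
      --update-key department_id

    # generate the DAO java class that encapsulates records of the employee table
    sqoop codegen \
      --connect jdbc:mysql://dbhost:3306/test_db \
      --username db_user -P \
      --table employee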
Info
Channel: Simplilearn
Views: 134,846
Keywords: big data tutorial, big data tutorial for beginners, big data hadoop, big data hadoop tutorial for beginners, big data hadoop tutorial, big data hadoop full course, big data full course, big data full tutorial, big data and hadoop tutorial for beginners, big data apache spark, big data apache spark tutorial, learn big data step by step, learn bigdata from scratch, hadoop tutorial for beginners, spark tutorial for beginners, big data basics, simplilearn big data, simplilearn
Id: KCEPoPJ8sWw
Length: 700min 35sec (42035 seconds)
Published: Fri Feb 18 2022