Hive Tutorial | Hive Course For Beginners | Intellipaat

Captions
Hey guys, welcome to this session by Intellipaat. Have you ever wondered how giant IT organizations store their data? These organizations use different databases: Oracle for transactional data, MySQL for storing product data, and many others for different tasks. But only storing the data is not enough; you need to analyze the available data and extract meaningful insights from it in order to understand customer behavior. For that, organizations started using Hadoop. In 2006, Facebook switched from an Oracle database, which is handled with SQL queries, to Hadoop, which understands only MapReduce, so handling the data suddenly became a cumbersome task for them. They needed an interface that could convert SQL queries into MapReduce programs, run them on a cluster, and show the results. This is where Hive comes into the picture: it is an interface created by Facebook that converts SQL queries into MapReduce programs. So, in this session, we'll learn Hive comprehensively. Before going ahead, subscribe to our channel and press the bell icon so that you never miss out on our upcoming videos. Also, if you want to become a certified professional in Big Data Hadoop, I would suggest the Big Data Hadoop Architect certification course by Intellipaat.

Now let's have a glance at the agenda. We'll start off by understanding what Hive is, then we'll see the difference between MapReduce and Hive. After that, we'll look at the differences between the different flavors of Hive and go through its features. Then we'll move on to the Hive interface, and going ahead we'll see partitions and indexing in Hive. Finally, you can put down all your queries in the comments section and we would love to help you out. So, without much delay, let's start the class.

What is the requirement of Hive? You have this guy called RDBMS. Let's say you are using MySQL, you are also using MS SQL (Microsoft SQL Server), and you also have Oracle, so your organization uses three databases. Why three? Maybe you're using Oracle for transactional data (payments and so on), MySQL to store product catalogs, and Microsoft SQL Server for some other purpose. Just assume you have three real-time databases. These are RDBMSs, and we also call them OLTP systems: online transaction processing systems. I know this is very basic, but I'm just trying to give you an idea. Every organization has one or more OLTP systems deployed across the world.

Now your manager comes to you and says (let's say I am the employee and my name is Raghu): "Hey Raghu, we want to do something. Let's collect the data from all these databases, create a nice report, and show it to the CEO." He can impress the CEO and get a free holiday, and maybe you get a salary hike too. So, why don't we pull all the data from these systems, put it in one place, and create a report out of it? For that you have people known as Business Intelligence (BI) folks. Business Intelligence is the field where you take data and make sense out of it: given any type of data, you can create nice charts, pie diagrams, reports, and so on.
There are a lot of BI tools available in the market, Pentaho BI for example, and many others. So I thought, let me take a Business Intelligence tool that can create nice reports and install it on my Oracle, and also on Microsoft SQL Server and MySQL, all three databases. But this doesn't make any sense. Why not? Because all these databases are real-time databases (that is why we call them OLTP) and they are serving your customers in real time. I don't want any unnecessary load on them. If I install a reporting tool on top of them and start digging through the data, it may impact their performance, so I don't want to do that. Instead, I bring in the elephant in the room, and that, my friends, is your data warehouse, which we call DW. You can also call it OLAP, an online analytical processing system. Using ETL tools (ETL is extract, transform, and load), you take data from your RDBMS systems and dump it into a place called a data warehouse. Just a quick info, guys: if you want to become a certified Big Data Hadoop architect, Intellipaat offers a complete Big Data Hadoop Architect course which covers all the major concepts; for further details, check the description below. Now let's continue with the session.

Now, data warehouse is not a fancy term; it's a technical term. What is a data warehouse? (1) A data warehouse is a place where you store, first of all, a massive amount of data. (2) This data is ready to be accessed, ready for reporting. So if I have a BI tool, it can be installed directly on my data warehouse and produce nice reports. A data warehouse is internal to your company; there is no public access, because it is your private storage. I collect all the data and dump it into this one system so that it holds the whole data, and now I can do whatever I want with my data. Very simple. Organizations have been doing this for quite some time. Go to any enterprise organization and you'll find multiple OLTP systems; they keep collecting the data (nightly cron jobs, or whatever you like) and dump it into a data warehouse. The data warehouse holds historical data: this year's data, last year's data, the year before that, and so on, a lot of data. On this data you can do whatever analysis you want. So a data warehouse is a place where you store massive amounts of data ready for analysis, and you typically do not have public access to it. If you look at the players in data warehousing, there are many: Teradata, which is a major player, Oracle Exadata, SAP HANA, Netezza, Greenplum, and so on. If you want to implement a data warehouse in your organization, you go to, say, Teradata and tell them you want a data warehouse in your company. Teradata says, alright, take this and go: they will practically hand you an appliance, like a box, which you install in your company; you load it with data, it stores the data, and you analyze it.
So, the question is, why am I talking about this? What is the big deal? There is no big deal; so far I have just been describing how the world used to work. Everything seems nice here: we have OLTP systems, we run ETL, we dump the data into a data warehouse, we produce reports, nice and happy, but with one major problem. The problem is this: $$$, the cost aspect. Data warehousing is a costly affair, to be honest with you. If you want to implement a Teradata solution, for example, it will cost you millions of dollars, practically speaking. It is not so easy, and it is not so cheap. People working in the data warehousing space will be able to comment on this, but data warehousing is costly. So that is the backdrop, the history we are talking about.

Now enters Hadoop. Imagine your organization already has a Hadoop cluster. In a Hadoop cluster you can obviously store Big Data, but what happened was that somewhere around 2006 the company called Facebook got interested in Hadoop. Facebook was dealing with the Big Data problem. Facebook was in its growing stage around 2005-2006, getting a lot of users, and it was thinking: everything is working fine, but we really don't have a solution for storing and analyzing Big Data. Facebook was running completely on Oracle; Python plus Oracle was Facebook's initial architecture. Somewhere in 2006, somebody told Facebook: why are you so worried about storage? There is something new in the market called Hadoop, and if you implement a Hadoop cluster you can store practically unlimited data, because Hadoop is the natural solution for Big Data. Facebook got interested in this idea and immediately implemented a Hadoop cluster, and practically speaking, the entire data that Facebook was storing and analyzing was moved into Hadoop. But what was the problem? Back in 2006 we were at Hadoop version one, the old, original Hadoop, and the original Hadoop had only MapReduce. That means the only way to interact with the data in a Hadoop cluster was to write a MapReduce program. This became a problem for Facebook, because every day they needed to fire 70,000 SQL queries on their data. That was workable when they had Oracle: even though it was slow, they managed somehow, because Oracle is a typical RDBMS and you can easily write those queries. Now, all of a sudden, the whole data had moved into Hadoop and there was no more Oracle. So the developers at Facebook faced a serious problem: the entire data is in Hadoop, and Hadoop doesn't understand SQL. What does Hadoop understand? MapReduce. And if you want to write a MapReduce program, you have to learn Java, write your program, compile it, create a jar file, and deploy it. It's not easy. If you're a SQL developer, trust me, learning Java is not going to be much fun for you.
SQL developers, you know, tend to sit on the SQL side of the spectrum; we are really not programmers. So all of a sudden, all the employees at Facebook started complaining. They said: until yesterday our SQL queries were working on the data; today when I write a SQL query, I cannot run it. What happened? Facebook said: guys, we just moved everything into Hadoop. Well, that's not fair. In Hadoop you could not run SQL queries, so what was the alternative? The alternative was to write a MapReduce program for every SQL query, and that is practically madness, because for a simple SELECT COUNT(*) query you have to write a hundred lines of Java code and compile it. Now imagine 70,000 queries. So Facebook thought: alright, we need a solution. We have a lot of structured data in Hadoop, but unfortunately our developers cannot access it. What can we do? Thus, Hive was born. Hive was created by Facebook; later they donated Hive to Apache as a top-level project, but the first Hive version rolled out from Facebook.

So the next big question: what exactly is Hive? Imagine you are storing some structured data in Hadoop, say a text file called sales.txt. Imagine this file has a structure, like comma-separated values: order number, product ID, and so on. It's a very big file, and you store it in Hadoop. Now, if you want to analyze this file, you can write a MapReduce program, or you can install Hive. Hive gives you a shell, a command line. You write SQL on Hive's command line, say SELECT COUNT(*) from this table, and hit return to run the query. Hive converts your query into a MapReduce program, runs it on the cluster, and shows you the result. That is Hive. Hive is a SQL interface on Hadoop, and it is the default data warehousing framework on top of Apache Hadoop. And what is a data warehouse? A data warehouse is nothing but huge storage with a SQL interface.
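To make that concrete, here is a minimal sketch of the kind of HiveQL you would type for that sales.txt example. The table name, column names, and file path are hypothetical stand-ins (the transcript only says the file has an order number, a product ID, and so on), but the pattern of defining a schema over the file and then querying it with SQL is exactly what the rest of this session walks through.

    -- Illustrative only: project a schema onto sales.txt and query it with SQL.
    -- Hive compiles the SELECT into a MapReduce job behind the scenes.
    CREATE TABLE sales (
      order_no   INT,
      product_id STRING,
      amount     DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Assumed local path; point it at wherever your sales.txt actually lives.
    LOAD DATA LOCAL INPATH '/home/user/sales.txt' INTO TABLE sales;

    -- The "select count star" query the speaker keeps referring to.
    SELECT COUNT(*) FROM sales;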
Now, you already have Hadoop, which is huge storage; install Hive and you get a SQL interface, so you practically create a data warehouse with zero investment. That is the advantage of Hive: Hive is free. You just install it, start writing SQL, and you can access structured data using SQL. You can do all your regular operations: creating tables, left outer joins, right outer joins, group-by queries, and so on. Just a quick info, guys: if you want to become a certified Big Data Hadoop architect, Intellipaat offers a complete Big Data Hadoop Architect course which covers all the major concepts; for further details, check the description below. Now let's continue with the session.

So what is the difference between Hive and HBase? Hive is what we are looking at right now; HBase is the default Apache NoSQL database on top of Hadoop. The real question is why somebody should use Hive. You should use Hive if, for example, you don't want to learn Java, you don't want to learn MapReduce, you are not a programmer, but you know SQL. Then Hive is your choice, because it understands SQL and automatically converts it into MapReduce, so you don't have to sit and learn Java and MapReduce; you write your logic in SQL and it works.

Imagine you have an Android mobile phone. When you buy the phone, you get the default calling and messaging options; you can call people, text people, and do a lot of things. But you can also download apps: if I want to catch a taxi I install Uber, and if I want to chat I install WhatsApp. Similarly, Hive is a tool which you install on top of Hadoop, and that is why it is called an ecosystem tool; ecosystem tools are tools that can be installed on top of Hadoop. Think of Hadoop as your Android operating system and Hive as the Uber app you install on top of it. So Hive talks to Hadoop and gets installed on top of Hadoop. That is point number one.

Now, another point: where is Hive? Is it installed inside the cluster or outside it? The point you have to understand first is that Hive doesn't have any storage of its own. For example, say you're copying a file called transaction.xls, a spreadsheet. In Hadoop, this file gets divided into blocks on the data nodes; we are all familiar with that. So when you tell Hive to create a table and load the data, Hive just refers to this data which is already lying as blocks on your data nodes. It doesn't have its own storage, which is a very important point. Hive only gives you a projection of your data: the original data lies as blocks on the data nodes, and when you create a table in Hive, you are just giving a structure to that data so you can see it in the form of a table. Clear? So Hive doesn't use any storage of its own.

The second point you need to understand is this: if I am sitting here with a laptop or a desktop, Hive is, for the time being, best thought of as a client-side application.
I will explain more about this in a moment (there is a slight variation to it), but for now assume that Hive gets installed on your laptop, not in the Hadoop cluster. Your laptop is connected to the cluster and has the Hadoop client libraries, so it can access the Hadoop cluster. Hive being a client-side application means you install Hive on your laptop, open the Hive shell, say "create table" with whatever schema you want, and then say "load the data into the table." What is this data you are loading? It is those blocks already sitting in Hadoop; you are not really inserting anything, Hive just points the table at those blocks. Then, when you run a group-by query or some other SQL query, Hive generates a jar file and sends it to the Hadoop cluster. For Hadoop, there is nothing called Hive; for Hadoop, it is just a MapReduce program. So understand this: Hive is a client-side application. You install Hive on your laptop because you don't know MapReduce, you are a SQL developer, and you know there is structured data in Hadoop that you want to query as a table. You install Hive, create a table, load the data, write your query, hit Enter, and Hive automatically writes the equivalent MapReduce program, creates a jar file, and sends it to the cluster. From the point of view of the Hadoop cluster, it is a regular MapReduce program. Hadoop doesn't understand Hive; Hadoop understands MapReduce. So either you write MapReduce yourself, create a jar, and run it, or you write in Hive and Hive creates the jar and runs it. Same thing. Hive has no storage of its own; it just accesses the data in Hadoop and gives you a projection so it is easy for you to work with. You cannot comfortably work on raw blocks of data, but if you can see your whole data as a table, it's easy to write queries. That is what Hive gives you: a projection of structured data stored in Hadoop.

Hive understands your SQL. Even though Hive's language is called Hive Query Language, or HQL, it is pretty much similar to SQL, which means that if you are a SQL developer and know SQL queries, you can easily start working with Hive. A formal introduction of Hive: Hive is a data warehouse infrastructure built on top of Hadoop; Hive provides a mechanism to project structure onto the data using a SQL-like language called HiveQL; and Hive uses MapReduce and HDFS for processing and for storage and retrieval of data. So the data is in HDFS and the processing is done by MapReduce. There is a small caveat here, in that it is not always exactly MapReduce, but for the time being let's keep it as MapReduce. The advantages of using Hive: it can be used as an ETL tool, in the sense that you can bring data into a Hive table, modify it, and store it permanently; it provides querying and analysis capability, so Hive is heavily used by the analyst community, who are very good at SQL; it can handle large datasets; and you can do filters, joins, group-bys, all the usual SQL operations, on top of map and reduce. Now, if you ask me, can I see the MapReduce code created by Hive? No, you cannot. The MapReduce code generated by Hive is not visible; you will just get a jar file. In fact, you won't even see the jar file, you will only see the job running.
Right, now the most important point you have to remember is that Hive is expected to be slow. It is slow because, ultimately, it is MapReduce; it's not like your Oracle or MySQL. In any RDBMS, when you write a query you typically expect the result within seconds, even milliseconds; in Hive, I have seen queries take four or five hours. Why? Because the table is just an illusion over a huge amount of data sitting in Hadoop. So when should you not use Hive? This is very important. Hive should not be used if your data does not cross gigabytes; if you are not dealing with Big Data, why use Hive at all? Stick with your traditional tools. Hive should also not be used if you cannot find a schema, or if bringing in a schema is difficult or impossible. When you use Hive you have to create a table (Hive works on the constructs of databases and tables), and whenever you define a table, the table needs a schema. If your data cannot fit into a table, there is no point loading it. Say you are getting raw, free-form text data; in free-form text, how do you find a structure? You cannot, which means you cannot use Hive on top of that data. You use Hive where you can find some structure: comma-separated values, space-separated values, column-separated values, JSON files, XML files, even semi-structured data can all fit in. Just a quick info, guys: if you want to become a certified Big Data Hadoop architect, Intellipaat offers a complete Big Data Hadoop Architect course which covers all the major concepts; for further details, check the description below. Now let's continue with the session.

The next case: if you need responses in seconds, in low-latency applications. I told you Hive is expected to be slow, so if you are looking for a faster solution, don't use Hive. And if an RDBMS can solve your problem, don't invest time in Hive; this is a very important point. Hive is not a replacement for an RDBMS. RDBMS systems are real-time systems, and Hive is not something that can replace them. If your RDBMS can solve most of your problems, don't use Hive. Hive is specifically for huge amounts of structured data. Say you want to create a table whose size is 3 terabytes: very good, dump it into Hive and query the table. Your SQL will be very easy, and even if it takes some time, you will get the result. That is where the Hive use case comes in.

So think about Hive like a translator: you write SQL, it speaks MapReduce to Hadoop, Hadoop produces the result, and Hive shows it to you as a table. For you, the developer, everything is SQL and tables; for Hadoop, everything is MapReduce; and Hive stands in the middle. Now, what is the similarity with SQL, and what is the difference? Hive is very similar to SQL; it is actually based on the SQL-92 standard, and it is safe to say its functions are mostly the same. The major difference is that a Hive query executes on Hadoop rather than on a traditional database. You cannot install Hive outside Hadoop; it works only on HDFS. This is what allows Hive to scale and handle huge datasets which cannot be handled by an RDBMS. The internal execution of a Hive query is via a series of automatically generated MapReduce jobs.
So the next question you will naturally have: I understand Hive, but can I fine-tune it? What about performance, can I improve it? Yes, you can. Now, all Hive queries are converted into MapReduce jobs, so one question people ask is: if Hive converts everything into MapReduce anyway, why don't we just write MapReduce ourselves? Because understanding the internals of the Hadoop framework is a must for writing MapReduce, whereas SQL engineers can quickly write Hive scripts. To write MapReduce you have to learn Java (or Python, Ruby, C#), implement your logic with the MapReduce framework, write a custom mapper and a custom reducer, package it as a jar file, and resolve all the complexities and errors. Here you do none of that debugging: create a table, write your query, and it just works.

The next point is about the real world. Suppose you learn Hive in our Intellipaat course and you master it. What did I teach you? That Hive is a SQL interface to Hadoop. Fine. Then you go to a real-world project and say, "I'm an expert in Hive, I know Hive really well," and these guys tell you: you can use Hive, or Hive plus Tez, or Impala, or Spark SQL, or Phoenix. What did I just do? I blew your mind out of proportion. When you go to a real project and ask, "I have data on Hadoop, give me a SQL tool," they will hand you either Hive, or Hive plus Tez, or Impala, or Spark SQL, or Phoenix, and you're thinking, "But there is only Hive, that is the SQL interface!" Now you are in a dilemma, because they are talking about Impala, Spark SQL, and Phoenix. And trust me, if you write a SQL query it will be executed the same way on all these platforms: write a group-by query and Hive will execute it, Hive plus Tez will execute it, Impala will execute it, Spark SQL will execute it, Phoenix will also execute it. So the real question is: what are these different tools? This is some real-world information for you.

See, Hive is the original tool; it came first. Hive is the first Avenger, the first guy who came into this world. When Hive came, everybody was happy: fantastic, now I have a SQL interface, I can write a query and it executes. But over a period of time people stopped liking it so much. They said: Hive is great, but the queries are terribly slow; the other day I wrote a query and it took a day to give me the result; we need something else. That is when the company called Hortonworks started promoting Hive plus Tez. What happens there is that if you access a Hortonworks Hadoop cluster and execute a Hive query, the query is converted into Tez, not MapReduce. So original Hive is Hive plus MapReduce; Hortonworks Hive is Hive plus Tez. And what is Tez? It is an Apache framework; if you're interested, you can dig deeper into it.
Tez was created to make MapReduce-style processing faster. I don't want to get into its details; I'm just giving you some extra information. For the time being, understand that Tez is a framework built to overcome the problems in MapReduce. MapReduce is genuinely slow; Tez also uses mappers and reducers underneath, but it is much, much faster than plain MapReduce. Tez is a Hindi word that stands for speed, and the speaker's point is that it was built largely by Indian engineers; you can think of it as the next level of MapReduce. What Hortonworks did was say: why don't we club Hive and Tez together so that our queries are faster? So if you are on a Hortonworks cluster and execute a Hive query, you will see a Tez job firing, never a MapReduce job, because Hortonworks promotes Tez (even though Tez is Apache open source) and claims their queries are faster. So the first Avenger, original Hive, is quite slow; Hive plus Tez is "interactive," meaning faster, but not real-time. If you run a Hive query on a Hortonworks cluster, it uses Tez as the execution engine and it is faster, but it is still not real-time.

Around the same time there is another company called Cloudera. What Cloudera did was invent something called Impala. Impala is again a SQL interface, but if you write a SQL query on Impala, it runs it using a daemon called impalad; it doesn't use MapReduce, it doesn't use Tez, it uses Cloudera's own Impala daemons. Impala is promoted by Cloudera and it is also interactive. Hortonworks will say Hive plus Tez is faster; Cloudera will say Impala is faster; that war has been going on for the past several years and nobody really wins, both are comparable. But whether it is Hive plus Tez or Impala, for you it is just a SQL interface; the only difference is that on a Hortonworks cluster the execution is done by Tez, and on a Cloudera cluster the execution is done by the Impala daemons. Then there is Spark: Spark has a SQL interface called Spark SQL. If you write a query in Spark SQL (again, the same SQL), it is converted into Spark jobs, and this is almost real-time, so it is faster. Note that on Hortonworks you can also say, "I don't want Tez, I want MapReduce"; you can switch the execution engine. This is all additional, good-to-know information. Finally, Phoenix is a SQL interface on top of NoSQL: HBase is your NoSQL database, and HBase doesn't understand SQL, so if you don't know the language of HBase you can start Phoenix, write your SQL query, and it will convert it into HBase's language and show you the result. To summarize: original Hive plus MapReduce is batch processing and very, very slow; Tez and Impala are similar, interactive, but there is a vendor war (Impala is Cloudera's, so you will never see Tez on Cloudera or Impala on Hortonworks); Spark SQL is SQL on the Spark framework; and Phoenix runs on HBase.
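As a small aside on that engine-switching point: in Hive itself the execution engine is a session-level setting, so "switching the engine" might look like the sketch below. Whether tez or spark is actually available depends on how your particular cluster was built, so treat the values here as assumptions rather than a universal recipe.

    -- Ask Hive which engine it will compile queries for (mr, tez, or spark).
    SET hive.execution.engine;

    -- Force classic MapReduce for this session.
    SET hive.execution.engine=mr;

    -- Or, on a cluster where Tez is installed, use Tez instead.
    SET hive.execution.engine=tez;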
Now, what are the differences between plain Hive, Hive Server 1, and Hive Server 2 with Beeline? What do these terms mean? This is something you have to understand from the technical point of view. Originally, only Hive was there, and it was just a client-side application with an interactive shell. The original, "pure" Hive means: you are a developer, you take your laptop, install Hive, and you get a command line; in that command line you create a table, drop a table, run queries, and it just works. There is no server, only a client, which is what I was explaining in my picture as well. When Hive was invented people were really happy: great, we get a shell, we can write commands, we can work with Hive. But later people started thinking: how do you normally connect to a database? You have an Oracle or Microsoft SQL database sitting somewhere, and you connect to it with a client-side tool, something like SQL Workbench, using JDBC or ODBC. JDBC and ODBC connections can be made, but for that you need a server where the database is running and a SQL client on your side. So people thought: everything is nice about Hive, but I cannot connect to Hive from my SQL client, because Hive is just a command line. That is when Hive Server 1 came. Hive Server 1 allows JDBC and ODBC connections from your regular SQL clients. Hive Server 1 means there is a server running for Hive: on your laptop you can still open the command prompt and type queries, or you can use your regular SQL client tools to connect to the Hive server and perform all the activities. So with Hive Server 1 you can create JDBC connections from your SQL client directly to Hive, insert into a table, and run whatever queries you want, directly from your client. Very easy; you don't have to use the command prompt, though you still can. People were happy again: great, we have a Hive server, we can connect, we can run queries. But this lacked concurrency. There was no concurrency or consistency in Hive Server 1, meaning that if multiple users created sessions to Hive, it was impossible to handle those sessions; there were data persistence and concurrency issues. That is where Hive Server 2 came in. First of all, it also allows JDBC/ODBC connections; it provides concurrency; and it comes with a new client-side command-line tool called Beeline. So Hive Server 2 is essentially an upgrade to Hive Server 1: it still allows JDBC/ODBC, but it provides concurrency, meaning multiple JDBC/ODBC connections can work with Hive at the same time. They also redesigned the CLI of Hive: the old CLI is still supported, and the new CLI is called Beeline. Beeline is a client which you can install like any SQL client, so I can go to a computer, install the Beeline client, and from there make a connection request to the Hive Server 2 and start working with Hive.
The Hive server itself will typically be running for your cluster. Now, the architecture of Hive. In the diagram you can see a person using the Hive Query Language. As a user, you can interact with Hive through the CLI, which can be the Beeline client or the regular Hive shell (I will show you the Hive shell; it is an interactive shell), or through JDBC/ODBC, meaning you connect from a SQL client, or through a web UI, since Hive also provides a web interface. Look at the arrows: if you are using the CLI, you hit Hive directly, creating tables and so on straight through the CLI. But if you are using JDBC/ODBC, those sessions are handled by Hive Server 2. The main purpose of Hive Server 2 is to handle the multiple sessions created by users through JDBC/ODBC; that is why the arrows differ. CLI means you directly invoke Hive; JDBC/ODBC means you connect through the Hive server. Now, where do you install Hive Server 2? It is not necessary to install it inside the Hadoop cluster; it can be outside as well. In this picture, Hive Server 2 is shown outside the Hadoop cluster, with the cluster separated, and in most cases it is indeed installed outside.

Inside the Hive architecture you can see a compiler, an optimizer, and an executor. But how does Hive know where the data is? In other words, what about metadata? For any database, data warehouse, or similar system, you need to store metadata. In Hive there is something called the metastore service. What is the metastore service? It is a service that handles metadata. And what is metadata? The definition of your tables, the definition of your databases, the schema of your tables; all of this is metadata, and you need to store it somewhere, which is what the metastore service handles. By default, Hive uses something called an embedded metastore, which stores the metadata in a built-in Derby database. Apache Derby is a database, and Hive uses this Derby database to store the metadata. That is the embedded metastore: if you just download and install Hive, it needs somewhere to keep the metadata, so it comes with Apache Derby and starts using Derby for the metadata. You don't have to do anything; it is all configured. This sounds good, but Derby has a drawback: it is a single-instance database, which means it allows only one connection at a time. So if you are using Hive's embedded metastore, only one person can access Hive at a time, and that is a drawback.
So what you can do is configure your own database as the metadata storage for Hive, for example MySQL. This is what we do in production. In production we download and install Hive, and we know Hive will start working automatically because it uses Derby to store the metadata, but since Derby allows only one connection at a time, we don't rely on it; instead we configure something like MySQL and ask Hive to store its metadata in MySQL, so that multiple people can access Hive. Another note: the latest versions of Hive allow updates, but older versions of Hive by default do not, because remember, this is Hadoop, and Hadoop is a write-once, read-many system. There is no in-place update; you copy the data, delete the data, and copy it again. The latest versions of Hive can do selective updates because, through Hadoop, Hive can work out where your data is, down to the position of the blocks on the data nodes, and act accordingly. What you need to understand is this: if you are practicing Hive on the lab VM (the Cloudera VM), you are using the embedded metastore; there is a Derby database inside, and that is storing all the metadata, very simple. But if you go to a company and start working on Hive, the metadata will be stored in a separate database like MySQL. So two types of metadata storage are possible.

Now, the query flow. When a user writes a query, the query is accepted by a component called the driver (the driver is not shown in the picture, but it is there), and the driver passes it to the compiler. The compiler checks with the metastore to get the metadata and compiles your query; you cannot simply compile the query without the metadata, so the first step is to consult the metastore. Then the optimizer optimizes the plan so it can perform better and hands it to the executor. The executor is normally your MapReduce engine, and it runs your query on the cluster and shows you the result.
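If you want a trace of that compile-and-optimize step yourself, Hive's EXPLAIN statement prints the plan it has generated for a query before anything runs on the cluster. A small sketch, reusing the hypothetical sales table from earlier:

    -- Show the execution plan Hive's compiler and optimizer produced,
    -- including the map/reduce stages it will submit to the cluster.
    EXPLAIN
    SELECT COUNT(*) FROM sales;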
Now, there are two situations, and this comes mostly from the practical side. If you have a small Hadoop cluster, say only three or four data nodes, it is normally accessed by only a couple of people. In that case, if you are sitting at your computer, the entire Hive package will be installed there and the Hive server will be running there too, because it's a small cluster and there is no point installing things inside the cluster and then connecting to them. Why would you create such a small Hadoop cluster? It might be for a proof of concept; you are not actually running any production workload. So in this case there is no Hive component in the Hadoop cluster at all: the entire Hive package, the Hive server, the Hive CLI, anything starting with "hive," is on your computer, and nothing is in the cluster. Why? Because only you or a colleague will be accessing it, so you don't need a separate client and server. You install the whole package on your computer, you can also install a SQL client on your computer and create a JDBC connection, but everything lives on your machine.

Now what if it is a very big Hadoop cluster, with thousands of data nodes? Then you will have something called a gateway node. Have you heard of this? What is a gateway node? It is a server, a Linux machine, which has connectivity to Hadoop. On this machine you install everything: the Hadoop client, the Hive server, maybe Pig, everything goes here. And if you are sitting somewhere with a laptop, you don't install anything locally. You can do multiple things: you can SSH to the gateway machine, log in, type hive, get a command prompt, and start working on the cluster. Another person sitting with a laptop might have a SQL client installed, say SQL Workbench; when he opens SQL Workbench, it connects to the gateway and from there to the cluster. So what is a gateway node? It is nothing but a server that acts as an intermediary between you and the Hadoop cluster. In large Hadoop clusters you will not log on to the cluster directly; you will never even see the cluster, it is strictly prohibited. All you do is connect to the gateway node, and from there you can do anything. If you prefer the CLI, you SSH to the gateway, type hive, and you get the Hive CLI; if you have a SQL client, you give it the gateway's details, it connects to the Hive server running on the gateway, and from there to the cluster. So all the client packages are actually installed on the gateway node; that is my point. You have two options: either log on to the gateway node, open the command prompt, type hive, see the Hive shell, and start working; or, if you have a SQL client like SQL Workbench, it will connect to the Hive server running on the gateway node and on to your cluster.

Now, the interface of Hive. Just type hive and hit Enter, and this is the Hive shell, the command-line interface of Hive, its interactive shell. Once you are in the Hive shell, most of the commands you type need no introduction, because they are SQL commands and most of you are already familiar with SQL.
For example, the first command you will probably type on any of these systems is show databases, which lists all the databases that have been created. Now, the best way to understand Hive is to start working with it. So I am going to say create database, let's say del, and if I do a show databases, it will be there: a database called del. If you want to use it, just say use del; when you say use del, the database is switched and you are now using the database called del. Fine, so far so good; these are the regular things.

Now let's take a sample scenario and see how we can use Hive to solve a problem, rather than just talking about Hive. Let me show you the data we'll be working with. On my desktop I have a file called txns1; it's transaction data. If I open this file, this is how the data looks. Let me pick one record so I can explain it to you. As you can see, every record consists of comma-separated values, so this data has structure. It is transaction data from a retail sports store. The first column is the transaction ID, which is 4; then the date of the transaction, 17 December 2011; then the customer ID, 4002613; then the amount spent, $98.81; then the category, team sports; the item, field hockey; the city, Nashville; the state, Tennessee; and finally it is a credit transaction. So this is how the transaction data looks, line by line: multiple customers have purchased items from a sports store. It is a very big file with a lot of lines. That is one file. I also have another file here called customers. If I open it, this is my customer data; picking one record, the first column is the customer ID, 400015, then the first name and last name (Julia and her surname), the age, 49, and the profession, musician. So we have two datasets: one is the transactions done by the customers, and the other is data about the customers themselves, their name, customer ID, age, and profession. And as you can see, the customer ID is common to both datasets.

Now, what I want to do is analyze this data. What sort of analysis? I want to understand how much money is spent in my store by youngsters, how much by middle-aged people, and how much by older people. Basically, I want to classify the customers into three groups, say 20 to 30 in one group, 30 to 50 in another, and 50 and above in a third, and calculate the total amount spent by each group. And I want to use Hive for this. So let's look at how we can do it. As a first step, I am going to create a table. This is the syntax to create a table, and it should not surprise you if you already know SQL: I am saying create a table called txnrecords, and this is my schema, very simple: transaction number, date, customer number, amount, category, product, city, state, and spent-by. Hive imposes a schema, and the data types it supports are very similar to what you use in day-to-day life: string, int, double, float, almost everything.
So I have given a schema to my data. Then I say row format delimited: by default Hive will identify a row using the newline character, and "fields terminated by comma" means my individual fields are comma-separated (if you put a semicolon here, your input data should be semicolon-separated). Then I say stored as textfile: by default Hive stores the data as a text file; you can say sequence file or another format if you want, but even if you don't type textfile it will still store it as a text file. That is the full syntax. If I hit Enter, Hive says OK, which means the table has been created.

Now that I have the table, how do I load the data into it? There are multiple ways to load data into a Hive table; let me first show you the common way, the one everybody uses. Where is the data? The data is on the desktop of my Linux machine, in my local file system. So you say load data local inpath, give the path under /home/cloudera, into table txnrecords. That is how you load the data. You could use insert statements, but typically in data warehousing you get bulk data to load, huge tables and text files, so usually you do not insert row by row; you dump the data like this. "Load data local inpath" means my data is in my local file system, this is its path, and I want it loaded into the table txnrecords. That's it. Now if I do a select * from the table with a limit of 20 or so, I can see the data, because I have loaded the data into the table.
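Here is that whole sequence as HiveQL in one place. The column names are one reasonable reading of the schema described above, and the local file path is assumed from the Cloudera VM walkthrough, so adjust both to match your own copy of the file.

    -- Create a working database and switch to it.
    CREATE DATABASE IF NOT EXISTS del;
    USE del;

    -- Schema for the sports-store transaction file (column names are illustrative).
    CREATE TABLE txnrecords (
      txnno    INT,
      txndate  STRING,
      custno   INT,
      amount   DOUBLE,
      category STRING,
      product  STRING,
      city     STRING,
      state    STRING,
      spendby  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Bulk-load the comma-separated file from the local file system
    -- (path assumed; point it at wherever txns1.txt actually sits).
    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/txns1.txt' INTO TABLE txnrecords;

    -- Peek at a few rows to confirm the load worked.
    SELECT * FROM txnrecords LIMIT 20;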
So now the data is in the table, but what is happening in Hadoop behind the scenes? I will show you. Whenever you install Hive on any platform, Hive creates a folder called warehouse in HDFS. If I do an hdfs dfs -ls you can see it: when you install Hive, it automatically creates this warehouse folder in HDFS. If you check what is inside the warehouse folder, you see all the databases you have created; there is our database, del.db. Whatever databases you create are stored inside this warehouse directory: a database gets created under the path /user/hive/warehouse, and the database is just a folder in Hadoop. Hue is simply a web interface to Hadoop, nothing else, so instead of typing hdfs dfs commands I can go to the file browser in Hue and see the files in Hadoop. There is the warehouse folder; open it and you see all the databases you have created. This database called del.db is nothing but a folder in Hadoop. You created a database and you think it is a database; it is not a database, it is just a folder. Look at del.db: if I open this folder, what do you see? Another folder called txnrecords. The table you created is just a folder, and if I open it, what do you see? The file that I copied. So look at the structure: your database del.db is just a folder, txnrecords is another folder inside it, and inside that is the file. This is what is happening in Hadoop. When you load data into a Hive table, it is just copying the data from Linux into this folder in Hadoop, and if the data were very large it would be divided into blocks just like any regular Hadoop file. This is where your data gets stored. I can show the same thing from the command line: if I list warehouse, then del.db, I see the folder txnrecords, and inside txnrecords I see the file. That is why I keep saying Hive is just giving you a projection; the actual data is simply lying in Hadoop.

Now I will create one more table, because we also have the customer data to store. I say create a table called customer, which has customer number, first name, last name, and so on, and then load the data the same way, so now the customer data is in the customer table. You may be wondering that so far you have not seen any MapReduce program, even though I said that querying the data fires a MapReduce job. That is because we have not actually run a query yet. So let's run select count(*) from txnrecords. This is a proper query; I want to know how many rows there are. If I hit Enter, it fires a MapReduce job; you can see that because it is a real query, it cannot just show the result, it has to create a jar, launch a job, and here is the result: 50,000. The answer is 50,000 rows. How did I do it? I just wrote my query and hit Enter in Hive, and it automatically launched the equivalent MapReduce job and gave me the result.

Now let's analyze this further. I will create one more table, called out1, with customer number, first name, age, profession, amount, and product. My idea is to do a join operation: I have customer data and transaction data, and both share a common column, the customer ID, so I will do a simple inner join and store the result in a table. I have already created that table, out1. How do you do a join in Hive? Very simple: insert overwrite table out1 select a.custno, a.firstname, a.age, and so on. If you look at the syntax of the join, it is exactly the same as a normal SQL join. If I hit Enter, it launches a MapReduce job. Now the join is over, and if I do a select * from out1 with a limit of 20 or so, this is the result of the join operation.
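Spelled out, those steps might look like the following. The customer file path and the exact column lists are assumptions based on how the data was described above, but the join itself is plain SQL.

    -- Customer master data: one row per customer.
    CREATE TABLE customer (
      custno     INT,
      firstname  STRING,
      lastname   STRING,
      age        INT,
      profession STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Assumed path to the customers file on the local file system.
    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/custs.txt' INTO TABLE customer;

    -- A real query, so Hive fires a MapReduce job for it.
    SELECT COUNT(*) FROM txnrecords;

    -- Target table for the join result.
    CREATE TABLE out1 (
      custno     INT,
      firstname  STRING,
      age        INT,
      profession STRING,
      amount     DOUBLE,
      product    STRING
    );

    -- Inner join on the common customer id, written exactly like ordinary SQL.
    INSERT OVERWRITE TABLE out1
    SELECT a.custno, a.firstname, a.age, a.profession, b.amount, b.product
    FROM customer a JOIN txnrecords b ON a.custno = b.custno;

    SELECT * FROM out1 LIMIT 20;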
So you can see customer-related data and transaction-related data together. This is very simple: I am just following the same logic as SQL; the only difference is that Hive converts it into MapReduce jobs.

Now, after this I will create one more table, called out2. I have done the join, but I still want to classify my customers. My customers have different ages, I have an age column, and I want to classify them by it. How? Very simple, look here: insert overwrite table out2 (so the result goes into the table we created in the last step), select *, and then a case expression. I open a case: when age is less than 30, the record is marked low; when age is between 30 and 50, middle; when greater than 50, old; else others; end. What this query does is take all the rows from the previous table, look at the age column, and based on my criteria mark every record as low, middle, or old. Hit Enter and it fires a MapReduce job again; every query fires a MapReduce job, you don't have to worry about that. Once it finishes, if I do a select * from out2 with a limit of 20, you see every record now carries a category: old, middle, middle, old, and so on. So now I am able to classify the customers.

All that remains is very simple: I create one more table and do a group-by query. What I am doing here is saying insert overwrite table out3 (out3 is the latest table I created), select level (the level column holds low, middle, or old, the classification), sum(amount) from out2 group by level. It is a very simple group-by query, and its result holds the final answer. The table is out3, and when the query finishes, a select * from out3 gives the final answer: the youngsters are spending this much, the middle-aged are spending this much, and the older customers are spending this much. So this is how you start working with Hive.
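Putting the classification step and the aggregation step into one sketch (the table and column names follow the ones used above, and the age boundaries are the ones the speaker describes; the alias total_spend is just illustrative):

    -- Tag every joined row with an age band.
    CREATE TABLE out2 (
      custno     INT,
      firstname  STRING,
      age        INT,
      profession STRING,
      amount     DOUBLE,
      product    STRING,
      level      STRING
    );

    INSERT OVERWRITE TABLE out2
    SELECT a.*,
           CASE WHEN a.age < 30 THEN 'low'
                WHEN a.age BETWEEN 30 AND 50 THEN 'middle'
                WHEN a.age > 50 THEN 'old'
                ELSE 'others'
           END AS level
    FROM out1 a;

    -- Total spend per age band: the final answer.
    CREATE TABLE out3 (
      level       STRING,
      total_spend DOUBLE
    );

    INSERT OVERWRITE TABLE out3
    SELECT level, SUM(amount)
    FROM out2
    GROUP BY level;

    SELECT * FROM out3;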
There are two types of tables in Hive: there is something called a managed table and something called an external table, and by default any table you create is a managed table. How do you know that? If you go to Hive and say show tables, we have a table called transaction records. If I type describe formatted txnrecords, it says the table type is managed, and look at the location of the table. A managed table is managed by Hive, which means the table is stored under /user/hive/warehouse, and you cannot change that.

So if that is a managed table, what is an external table? In Hive I can say create external table. Whenever you are creating an external table, you have to specifically mention that it is external; otherwise, a plain create table gives you a managed table. So I say create external table and the table name, and if you look at the schema and row format, they are all the same; there is no difference in the schema or the format. The only difference is that I am adding a location, /user/cloudera/my_customer, meaning this is where the data lives in Hadoop. How do you know that? Once you create this table, if you go to Hue and navigate to /user/cloudera, there is a folder created called my_customer, and there is no data inside it yet.

Now, I just created the external table, so if I do a select * from this table, txnrecords2, there is no data, because I only created the table and didn't fill it with data. If you want to fill it with data, all you need to do is go to the folder mentioned in the external table, my_customer, and copy the data into that folder. I just uploaded the data into the my_customer folder in Hadoop, and now if I come back and do a select * with a limit of 20, I have the data.

So what is the difference between a managed table and an external table? With a managed table, which is the default, the table is always created under /user/hive/warehouse, and then you manually load the data into it. With an external table, you mention the location of your data: you say the data will come from this folder, and whatever data you dump into that folder will be visible in your table. Is the difference only the location? Yes, but the real question is where you would use a managed table and where you would use an external table. I can show you practically. If I do a show tables, there is a table called transaction records; that's a managed table. I can say drop table txnrecords. What happens to the data in this managed table? It is deleted, because when you drop a managed table its folder is deleted from Hadoop, and remember, the data lives inside that folder. But if I do a drop table txnrecords2, which is an external table, the table is gone, yet the my_customer folder under /user/cloudera and the data inside it remain. That is the difference: drop a managed table and both the table and the data go; drop an external table and only the table goes, the data stays. So if you are working on a project, you create a table, you want to share it with others, and you suspect somebody may accidentally delete it, create an external table, because even if somebody accidentally drops your table, your data will be safe; otherwise, create a managed table.
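Put together, the external table described above would be created with something like the following; the schema, the comma delimiter, and the exact HDFS path are assumptions for illustration:

-- the data lives (and stays) under the LOCATION folder, not under the warehouse
CREATE EXTERNAL TABLE txnrecords2 (
  txnno INT, txndate STRING, custno INT, amount DOUBLE,
  category STRING, product STRING, city STRING, state STRING, spendby STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/my_customer';

-- dropping a managed table removes its warehouse folder and the data in it;
-- dropping the external table removes only the table definition
DROP TABLE txnrecords;
DROP TABLE txnrecords2;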
Now, how do you use Beeline? Beeline is the newer CLI, and by default, if you are using the Cloudera VM, it will have HiveServer2 running, and the default port number is 10000; you can check the documentation to confirm that. If you want to connect with Beeline, you have to start Beeline first. This is how you start it: you say /usr/lib/hive/bin/beeline. That is the location where Beeline is installed; it may differ across distributions, but in Cloudera Hadoop it is usually like this. Now you see the Beeline CLI, but if I say show tables here, it throws an error saying there is no connection. Remember, Beeline is a client; it needs to connect to HiveServer2. You can see the difference now: when I just type hive, I am accessing the entire Hive package, it is not acting as a client, but Beeline is a client, so when I say show tables in Beeline it says no connection, it can't reach the server.

To connect to the server, you use a connection string, the same one you can use in your JDBC or ODBC applications: connect to jdbc:hive2 (hive2 stands for HiveServer2), localhost, the port number, then username and password, which are empty by default because I am running locally, and the JDBC Hive driver. It connects, and now I am inside. If I say show databases, it lists all the databases. So this is how you start Beeline: run /usr/lib/hive/bin/beeline to start it, Beeline is your client, show tables alone won't work because it needs server connectivity, and then you use the connection string — connect with jdbc:hive2 to HiveServer2 running on localhost, give the port number, the username, and the password, hit enter, and it connects. Once it is connected, you can type your regular commands; I typed show databases and here is the list of databases, and the command line behaves pretty much like the Hive CLI. If you want to quit Beeline, just say !q; that is the command to quit, very important, exclamation mark q quits Beeline, and now I am out of Beeline. That is how you connect from the Beeline client to HiveServer2; you can also download the Beeline client onto your own desktop and make a connection to the Hive server.
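A typical Beeline session on this kind of Cloudera VM would look roughly like the following; the beeline path and the empty credentials are assumptions based on the defaults mentioned above:

/usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
(it prompts for a username and password; on this VM both are left empty)
0: jdbc:hive2://localhost:10000> show databases;
0: jdbc:hive2://localhost:10000> !q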
Now, if you don't want to type all the commands, or you dislike the commands for creating a table and so on, you can come to Hue. Let me show you the data I will be working with here. There is a dataset called healthcare sample data; it has a patient ID, patient name, patient mobile number, date of admission, email address, social security number, and the disease — diabetes, fever, cold, and so on. What if I want to upload this data into Hive and query it? Basically, I want a list of all the people who have fever or cold. You can of course use the command line, but you can also use the GUI. Go to Hue, go to the data browsers, and there is something called metastore tables; click on that. It allows you to select the database, so I select our database, and I say create a table from a file. I give a table name, test_table, and then select the file from which the table should be created, which is this healthcare data, and say next. Hue looks at the data automatically and asks, is this how your data looks? I say yes, next, and it also picks the data types automatically. Then I say create table, and it creates a table from the file I have uploaded to Hadoop; I don't need to type any statement at all.

Once the table is created, you can also query it from the GUI; you don't have to use the CLI. Go to the query editors, there is something called Hive, and in Hive you select the database and the table we just created, test_table. I can say select * from test_table where column number seven equals cold, and then execute the query. This runs the query from the GUI and shows me the result, and I can create nice charts if I want: stacks, groups, and everything. Basically, what I'm telling you is that you can run queries from the GUI as well.

Now, partitions in Hive. What exactly are partitions in Hive, why do we use them, and what are we supposed to do? Just imagine you are working on a project which of course uses Apache Hive, you are happily working away, and everything is going fine. Then your manager comes to you and says, hey Raghu, do one thing: create a table for me in Hive into which I want to upload all my sales data. So I created a table called sales data, with some schema, some columns, whatever we have. Now imagine you are working with a managed table; for the time being we assume it is a managed table. By default, if you create this table, it will be stored under /user/hive/warehouse; let's say our database is called newdb, so the sales data folder under newdb in the warehouse will be the location by default. That is the basic stuff: whenever you create a managed table, the table's location is there. Then your manager tells you, hey Raghu, I have some data, why don't you load it into the table? He gives you a file and asks you to load it into the sales data table. No big deal, because you know how to load data: you say load data local inpath and load it. The manager says this is January's data, so you load a file called January.txt. Very simple: I created a table called sales data and loaded a file called January.txt, which contains all the sales transactions for the month of January. You are happy, your manager is also happy, and you keep on working in the project.
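With an ordinary, unpartitioned managed table, each month's file simply piles up in the same warehouse folder. A sketch of that setup follows; the table schema, the database path, and the local file paths are assumptions for illustration:

-- plain managed table, no partitions
CREATE TABLE sales_data (txnno INT, custno INT, amount DOUBLE, product STRING, month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- every load drops one more file into the single sales_data folder in the warehouse
LOAD DATA LOCAL INPATH '/home/cloudera/January.txt' INTO TABLE sales_data;
-- ...and the same again for February.txt, March.txt, and so on, month after month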
The next month your manager comes to you and says, hey Raghu, we have some more data; this data is from the month of February, why don't you load it into the table? So now the file is February.txt, because this is the February data, but since it is loaded into the same table, this is the structure: inside the sales data folder you now have two files, January and February. Then your manager says, hey Raghu, I have some more data: March.txt comes, then April.txt, then May.txt, then June.txt. So what is happening here? Every month you get data and you load it into the table, because you have a table called sales data, but if you look at how the data is stored, all these different files — January, February, March, April, May, June — sit inside the same folder called sales data, because that is your table name.

You might think that's fine, but here is the problem. Imagine you write a query like this: select * from the sales data table where month equals, let's say, April; I'm just giving an example. You want to see all the data from the month of April. Now look at the problem: by default Hive doesn't know where the April data is, so it comes to this folder called sales data and scans the entire January.txt, then the entire February.txt, then March.txt, then April.txt where it finds matches, then May.txt again, then June.txt. Your entire data is scanned for this single query, and your queries will be really, really slow. So the problem is this: a table in Hadoop is like a folder, and every time you load data it is stored as another file, which means inside a single folder I now have six files, January through June. When you write a query, Hive doesn't know where the data is, so it scans all six files to produce the result. And imagine you keep adding files every month; if your table ends up with, say, 100 different files, Hive has to scan all of them before showing the query result, and obviously your queries will be really slow, because you are never telling Hive where the data is.

So the question is: what can I do in such a situation? I don't want my queries to be slow, I want them to be faster. That is where the concept of partitions comes into the picture. What exactly is a partition? Partitioning means you tell Hive to divide, or segregate, your data based on a column. For example, I can tell Hive: hey Hive, do one thing — partition my table based on the month column.
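As a sketch of that idea, a month-partitioned version of the table might look like this; the table name sales_data_part and the column list are hypothetical, and note that the partition column sits in the PARTITIONED BY clause rather than the regular schema:

CREATE TABLE sales_data_part (txnno INT, custno INT, amount DOUBLE, product STRING)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- with one sub-folder per month, this query only has to scan the April folder
SELECT * FROM sales_data_part WHERE month = 'April';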
If you tell Hive this, it is going to look at the whole data first and then look at the month column in that data. Obviously, in your January.txt file the month column will be January, so Hive creates a folder called Jan and places that file inside it; it creates Feb and places the February data there, then March, April, May, and June. So when you tell Hive that you want partitions based on the month column, it picks that column from the whole data, works out how many distinct values there are — in my example, six months — and automatically creates six folders: Jan, Feb, March, April, May, June. All the January data is copied into the Jan folder, all the February data into the Feb folder, and so on. Now if you write the same query, select * from sales data where month equals April, Hive scans only that one folder, because it knows the April data is inside the folder called Apr, and your queries will be much, much faster. This is called partitioning in Hive. A partition is nothing but a division of your table: when you create partitions, one folder is created per value of the partition column, and the data matching each value is moved into the corresponding folder. Very simple.

You have to understand one more thing, so let me show you the data I will be using for partitioning; if you take one record, you will understand it. Look here, this is one record: the first name is Gerardo, then the last name, then the address on Jackson Avenue, then Australia, which is the country, then the state, the primary phone number, the secondary phone number, the email address, and the website. If I scroll down, here is another record: a name, an address on Steel Street, and the state, CA, which I think is California, and then the other details. So the data contains users' names and addresses. If I show you the last part of the data, you can see another user whose country is US, again with a primary phone number, a secondary phone number, an email address, and a website. That is the data we are going to work with.

Now, there are two types of partitions: static partitioning and dynamic partitioning. In static partitioning you have to manually create the partition and load the data; in dynamic partitioning Hive automatically detects your data and creates the partitions. You will naturally wonder where static partitioning is applicable, because dynamic partitioning sounds really good — Hive automatically creates my partitions, so why should I ever use static partitioning? So let's say you have data like this: Raghu, then a phone number, then an email address.
Then you also have Ram, with a phone number and ram@abc.com. You also have some more data: Tina, some phone number, tina@abc.com, and then Ajay, some phone number, ajay@abc.com. Now suppose Raghu and Ram are in India and Tina and Ajay are living in the USA, but you do not have a country column in the data: all I have is a name, a phone number, an email address, and a website. I don't have a country, but I know that Raghu and Ram belong to India and Tina and Ajay belong to the USA. In this case, I manually create two partitions, India and USA, load the first two records into India and the last two into USA. This is static partitioning. Did you understand the use case? Static partitioning means you are getting data that you know is already segregated: you know half of it belongs to India and half to the US, but that information is not present in the data itself. So you create static partitions: while loading the data, you ask Hive to create a partition called India and put this data there, and a partition called USA and put that data there. Then you can query the data by saying select * from my table where country equals India, and it will query only that partition. It is as if you are defining a column while loading the data.

Now I will show you partitioning practically; that is the best way to learn it. I'm going to connect to the VM first and launch Hive, and we have the data I just showed you. I'm going to demonstrate static partitioning now: show databases, and we have to use a database. Remember, Hue is your web interface, so first let me show you how my folder structure looks in Hue. I go to my root directory, then the user directory, and we have the hive directory, the warehouse, and del.db, and inside del.db I have these folders. Now what I am going to do is create a table called partitioned_user — you can give any table name — and look at how I am creating a partitioned table: this is my schema, and after the schema I am saying partitioned by country and state. Whatever columns you are using for partitioning should be outside your schema definition. So when you create the table, you say create table and give the schema, no problem, but then I say partitioned by country and state, which tells Hive: I am creating a table with this schema and I am going to partition it using two columns, country and state. If you hit enter, it creates the table, but look: if I refresh my Hue, there is the table created, partitioned_user, and if I go inside the table there is no partition. When you create a partitioned table there are no divisions yet; it is a normal table only. I just opened this table, partitioned_user, and there is nothing inside it.
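The partitioned table being created here would look roughly like the following; the exact column list is an assumption based on the fields described in the data above:

CREATE TABLE partitioned_user (
  firstname STRING, lastname STRING, address STRING,
  phone1 STRING, phone2 STRING, email STRING, web STRING)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;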
Now, how do you create static partitions? Very simple. Let me show you the data first: I will duplicate the session, do a sudo su cloudera, and go to this file called static. This file has three rows of data, and I am going to load it into my table and create partitions. I completely understand that this file is copied from the original file we had, but this is the file I will use for demonstrating partitions. If you look at it, it has only three records as of now: you have the name, then the address, then the phone numbers, the email address, and the website, that's it. What I'm going to do is say load data local inpath, loading this file into the table, with partition country equal to US and state equal to California. This is how you create a static partition: I already have a table with the full schema, and now I'm saying load this file, but while loading it, create a folder for the country US and, inside it, a folder for the state California, and put the data there. So I created the table, then I loaded the data saying the country is US and the state is California.

Now let me show you how the folder looks in Hue. If I refresh Hue and go to del.db, there is the partitioned_user table, inside it a folder for the country US, inside that a folder for the state California, and inside that is my data. If you look at the data itself, there is no country column and no state column, but I supplied them while loading, so the data lands here. Now if I query the data with a query where country equals US and state equals California, it will land on this folder and query only this data, so it should be faster. So these are the commands used: first you create the table, give the schema, and define your partition columns outside the schema; then you load the data, and while loading you say which country and which state the data belongs to — you have to give the country name and the state name.

Now let's discuss dynamic partitioning so you are familiar with that as well. Before you do dynamic partitioning you have to do a couple of things; first of all, you have to enable it. By default Hive has only static partitioning enabled, which means it will not let you do dynamic partitioning, and that is why I am typing this command: set hive.exec.dynamic.partition to true, which enables dynamic partitioning. I am also saying that the maximum number of dynamic partitions per node is a thousand, meaning at most a thousand partitions will be created; you can give a bigger number if you want, it is up to you, I just used a thousand. You are also setting the partition mode to nonstrict; there is a strict mode and a nonstrict mode, and I am setting it to nonstrict. Hive will now allow both static and dynamic partitions; I have enabled dynamic partitions, whereas by default it supports only static ones. What I'm going to do now is show you dynamic partitioning with the same set of data.
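The static-partition load shown a moment ago, and the three settings that enable dynamic partitioning, look roughly like this; the local file path and the exact partition values are assumptions for illustration:

-- static partitioning: you name the partition yourself while loading
LOAD DATA LOCAL INPATH '/home/cloudera/staticinput.txt'
INTO TABLE partitioned_user
PARTITION (country = 'US', state = 'CA');

-- enable dynamic partitioning (static-only by default in this setup)
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET hive.exec.dynamic.partition.mode=nonstrict;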
It is very simple: I am going to create a temporary table. The temporary table concept in Hive is exactly like a normal temporary table: temporary tables are valid only as long as the session lasts. You say create temporary table, and this time I am not doing any partitioning; you can see I am giving the complete schema. I'm creating a normal temporary table, no partitions, nothing, and this table is called temp_user. Once I create the temporary table, I want to load the data. What data am I loading? I am loading this file called customer data.txt, and this file has country and state columns. So I created the temporary table and loaded my data into it.

As a first step, I just want to write a query and see how much time it takes. I will write a very simple query: select firstname, phone1, city from this temporary table where country equals US and state equals California, order by city, limit 5. It is a very simple query; the only point is that I am filtering on country and state just to demonstrate how long it takes when the table is not partitioned, because right now there are no partitions. So let's run the query. The MapReduce job finishes, and it took 85 seconds. We will see later whether the partitioned version is faster; one more thing, though — I am running this on a VM, so even with partitioning it may not make a dramatic difference, but let's see.

Now let me show you how to create dynamic partitions. If I go to my Hue and into del.db, this is my del.db. What I want to do is create a partitioned table, so I say create table partitioned_user1, and this is the schema of my table, partitioned by country and state, and stored as sequencefile. The storage format is optional; you can store it as a text file or a sequence file or anything else. This is the same command we used when creating the static-partitioning table, but the difference here is that the data already contains the state and country columns. Now see what I am going to do: I say insert into table partitioned_user1 partition (country, state). What I am telling Hive is: copy all the data from my temporary table, load it into this new table, and partition it by country and state; I am not going to tell you which country and which state, you decide, because the columns are already present in the data. Now let's see what happens: Hive takes the data from the temporary table, looks into the country and state columns, identifies the countries and the states, creates that many folders, and puts the data in accordingly. Look at your screen very carefully and you will be able to see Hive creating these folders; it is creating the partitions right now, just wait a moment for that to pop up. Can you see the loading-partitions messages — country and state, country and state? Can you see it creating these files and folders?
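Putting the dynamic-partitioning steps together, here is a sketch of the staging table, the load, and the dynamic insert; the schemas, the file path, and the column order are assumptions chosen to match the data described above:

-- staging table with country and state as ordinary columns
CREATE TEMPORARY TABLE temp_user (
  firstname STRING, lastname STRING, address STRING, country STRING, state STRING,
  phone1 STRING, phone2 STRING, email STRING, web STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/customer_data.txt' INTO TABLE temp_user;

-- target table, partitioned by country and state, stored as a sequence file
CREATE TABLE partitioned_user1 (
  firstname STRING, lastname STRING, address STRING,
  phone1 STRING, phone2 STRING, email STRING, web STRING)
PARTITIONED BY (country STRING, state STRING)
STORED AS SEQUENCEFILE;

-- dynamic insert: the partition columns come last in the SELECT,
-- and Hive creates one folder per (country, state) pair it finds
INSERT INTO TABLE partitioned_user1 PARTITION (country, state)
SELECT firstname, lastname, address, phone1, phone2, email, web, country, state
FROM temp_user;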
And it's done. Now if I look at my Hue, the table is called partitioned_user1, and if I open this table there are five country folders — Australia, Canada, UK, US, and one more you can ignore, so really four countries. If I open the folder for the country US, all the states for which there was data have been created, and if you open any of these states you will find a file containing only that state's data. Of course, you cannot see the file in a readable format because it is stored as a sequence file, but it holds only the data matching that state. So you can see that Hive creates countries, then states inside each country, and the files inside those; this is dynamic partitioning.

Now let's see whether the query is faster. I will write the same query on the partitioned table and see whether it runs faster. Last time it took 85 seconds; now it is the same query but on the new table. There might not be that much difference, but I guess there will be some. See: time taken, 41 seconds. Last time, without partitioning, it was 85 seconds; now the query takes only 41 seconds. That is the impact partitioning has on query time.

Just a quick info, guys: if you want to become a certified Big Data Hadoop architect, Intellipaat offers a complete Big Data Hadoop architect course which covers all the major concepts, so for further details check the description below. This brings us to the end of this session. If you have any query, please do comment below; we'll reach out immediately. Thank you.
Info
Channel: Intellipaat
Views: 76,744
Keywords: Hive tutorial, hive course for beginners, Hive architecture, hive tutorial for beginners, hadoop Hive tutorial for beginners, hadoop hive, Hive architecture in hadoop, intellipaat hadoop, what is Hive in hadoop, apache hive, Hive tutorial hadoop, Hive In Hadoop, Hive tutorial for beginners with examples, hadoop Hive tutorial, apache Hive introduction, Hive in big data, apache Hive tutorial, hadoop tutorial, intellipaat
Id: nVI4xEH7yU8
Length: 122min 14sec (7334 seconds)
Published: Mon Sep 02 2019