Hive Tutorial for Beginners | Hive Architecture | Hadoop Hive Tutorial | Hadoop Training | Edureka

Video Statistics and Information

Captions
[Music] Hello and welcome, everyone, to yet another tech-enthusiast video from Edureka. Today we will learn about Apache Hive, one of the best open-source software utilities, with a SQL-like interface used for data querying and data analytics. Before we get started, please subscribe to the Edureka YouTube channel so you never miss an update on trending IT technologies, and if you are looking for an online certification and training on Hadoop and big data, the link is in the description box below.

Here is today's agenda. First we shall understand why exactly we needed Apache Hive, then what Apache Hive is and its important features. Then comes the important part: the Apache Hive architecture and the components involved in Apache Hive. After that we will see how to install Apache Hive on a Windows operating system, go through the data types, operators and data models present in Hive, and finally walk through a brief demo of Apache Hive.

Let's begin with the first topic: why exactly did we need Apache Hive? It all began as Facebook grew. The number of users at Facebook increased to nearly one billion, and with them the data grew to thousands of terabytes, with nearly one lakh queries and around 500 million photographs uploaded daily. This was a huge amount of data for Facebook to process. The first thing everyone had in mind was an RDBMS, but we all know an RDBMS could not handle such a huge amount of data, nor was it capable of processing it. The next big player capable of handling all this big data was Hadoop, but even when Hadoop came into the picture it was not easy to manage: the queries took a lot of time to write and execute. The one thing all the Hadoop developers had in common was SQL, so they came up with a solution that combined Hadoop's capacity with a SQL-like interface, and that is when Hive came into the picture.

Now the exact definition: Apache Hive is a data warehouse project built on top of Apache Hadoop for providing data query and data analysis. Hive gives a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop. Apache Hive is a data warehousing software utility, it can be used for data analytics, it is built for SQL users, it manages querying of structured data, and it simplifies and abstracts the load on Hadoop. Lastly, there is no need to learn Java and the Hadoop API to handle data when using Hive.

Next, Apache Hive applications. A few of the major ones are as follows. Hive is a data warehousing infrastructure for Hadoop; its primary responsibility is to provide data summarization, query and analysis, and it supports analysis of large data sets in Hadoop HDFS as well as on the Amazon S3 file system. Then there is document indexing with Hive: the goal of Hive indexing is to improve the speed of query lookup on certain columns of a table. Without an index, a query could load an entire table or partition and process all the rows, which would be troublesome; Hive indexing solves this problem.
Next, predictive modeling: the data manager allows you to prepare your data so it can be processed in automated analytics, offering a variety of preparation functionalities including the creation of analytical records and timestamp populations. Then business intelligence: Hive is the data warehousing component of Hadoop and functions well with structured data, enabling ad hoc queries against large transactional data sets, so it is a best-in-class tool for business intelligence and helps many companies predict their business requirements with high accuracy. Last but not least, log processing: Apache Hive is a data warehouse infrastructure built on top of Hadoop that allows processing of data with SQL-like queries, and it is very pluggable, so we can configure it to process our logs quite easily. Those were a few important Hive applications.

Now let us look at the Apache Hive features. The first and foremost is SQL-type queries: the SQL-like queries in Hive help Hadoop developers write queries with ease. The second is its OLAP-based design; OLAP is online analytical processing, which lets users analyze database information from multiple database systems at one time, and with Apache Hive we can achieve OLAP with high accuracy. The third feature is that Apache Hive is really fast: since we have a SQL-like interface on top of HDFS, we can write and execute queries faster. Next, Apache Hive is highly scalable: Hive tables are defined directly on the Hadoop file system, so Hive is fast, scalable and easy to learn. It is also highly extensible, because Apache Hive uses the Hadoop file system, and HDFS scales out horizontally. Finally, ad hoc querying: using Hive we can execute ad hoc queries to analyze and predict data. Those were the few important features of Apache Hive.

Let us move on to the next topic, the Apache Hive architecture. The following architecture explains the flow of a query submitted to Hive. The first stage is the Hive clients: Hive allows writing applications in various languages, including Java, Python and C++, and it supports different types of clients such as the Thrift server, the JDBC driver and the ODBC driver. So what exactly is the Thrift server? It is a cross-language service provider platform that serves requests from all the programming languages that support Thrift. The JDBC driver is used to establish a connection between Hive and Java applications; it is available in the class org.apache.hadoop.hive.jdbc.HiveDriver. Finally, the ODBC driver allows applications that support the ODBC protocol to connect to Hive.

Then we have the Hive services. The following services are provided by Hive: the Hive CLI, the Hive web user interface, the Hive metastore, the Hive server, the Hive driver, the Hive compiler and the Hive execution engine. The Hive CLI, or command-line interface, is a shell where we can execute Hive queries and commands. The Hive web UI is an alternative to the Hive CLI; it provides a web-based graphical user interface for executing Hive queries and commands.
Next, the Hive metastore: it is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes the metadata of each column and its type, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored. Then the Hive server, also referred to as the Apache Thrift server: it accepts requests from the different clients and passes them to the Hive driver. The Hive driver receives queries from different sources such as the web UI, CLI, Thrift, and the JDBC or ODBC drivers, and transfers them to the compiler. The purpose of the Hive compiler is to parse the query and perform semantic analysis on the different query blocks and expressions; it converts HiveQL statements into MapReduce jobs. Finally, the Hive execution engine: it is the optimizer that generates the logical plan in the form of a DAG, a directed acyclic graph, of MapReduce tasks and HDFS tasks, and in the end it executes the incoming tasks in the order of their dependencies. Below that we have MapReduce and HDFS: MapReduce is the processing layer which executes the map and reduce jobs on the data provided, and HDFS, the Hadoop Distributed File System, is where the data we provide is stored. That is the architecture of Apache Hive.

Moving on, what are the different components present in Hive? First, the shell, which is where we write our queries and execute them. Then the metastore, which, as discussed in the architecture, is where all the details related to our tables, such as the schema, are stored. Then the execution engine, the component of Apache Hive that converts the query or code we have written into a form Hive can execute. The driver is the component that executes the code or query in the form of directed acyclic graphs, and lastly the compiler compiles whatever code we write and execute and provides us the output. Those are the major Hive components.

Moving ahead, we shall look at the Apache Hive installation on a Windows operating system. Edureka is all about providing technical knowledge in the simplest way possible, so let's try to install Hive on a local system in the simplest way possible and later play around with the technology to understand the complicated parts of it. To do so we need Oracle VirtualBox. Once you download Oracle VirtualBox and install it on your local system, the next step is to download the Cloudera QuickStart VM; the link will be provided in the description box below. Now let's start the Cloudera QuickStart VM in Oracle VirtualBox: select the import option and provide the location where your Cloudera QuickStart VM exists; on my local system it is on local disk drive F. Select open, and make sure your RAM size is more than 8 GB; I am providing 9000 MB, which is just above 8 GB, so that Cloudera runs smoothly. Now select import, and you can see that the Cloudera QuickStart VM is being imported.
Once the Cloudera QuickStart VM has been successfully imported and is ready for deployment, just double-click on it and it starts. When the VM is up we are live on Cloudera, and you can see Hue, Hadoop, HBase, Impala and Spark, which come pre-installed. Our concern is to start Hive, and to start Hive you need to start Hue first. Let me remind you of one thing: in Cloudera, every default username and password is "cloudera", so for Hue the default username is cloudera and the password is also cloudera. Let's sign in; you may select the remember option in case you forget your password. Now we are connected to Hue and we are live on it. From Hue we can browse HDFS, and there we have Hive. Now that we have Hive running on our local system, let's move further and understand a few more concepts.

First, the data types. The data types are completely similar to those of any other programming language: for integers we have TINYINT, SMALLINT, INT and BIGINT; for floating point, FLOAT is used for single precision and DOUBLE for double precision; and we have STRING and BOOLEAN, which again work like in any other language we use daily (a small sketch of these types follows at the end of this overview).

Next, the Hive data models. These are the basic data models we use in Hive: we create databases and store our data in the form of tables, and sometimes we also need partitions. We will work through each of these data models in the demo ahead. First we create databases, inside databases we create tables in which we store data in the form of rows and columns, and along with that we have partitions. Partitions are a more organized way of storing data: imagine you are in a school, say standard one, and inside standard one you have sections A, B, C and D. A partition is like having separate sections, with different students stored in different sections, so that when you query for a particular record, for example a kid called Sam who you know is in section B, you don't have to search all four sections; you go directly to section B and find Sam. That is how partitions work. After partitions we have buckets, and buckets work in a similar way; we will understand each of these much better through the practical demo.

After the data models we have the Hive operators. Operators here are the same kind we use in normal programming languages, such as arithmetic operators and logical operators, and in the Hive demo we will run some arithmetic as well as logical operations on the data we have stored in the form of tables.
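As a quick illustration of the types just listed, here is a minimal HiveQL sketch; the table and column names are made up purely for the example.

-- Minimal sketch of the common Hive data types mentioned above.
CREATE TABLE IF NOT EXISTS type_demo (
  tiny_flag  TINYINT,
  small_id   SMALLINT,
  plain_id   INT,
  big_id     BIGINT,
  price      FLOAT,     -- single precision
  precise    DOUBLE,    -- double precision
  label      STRING,
  is_active  BOOLEAN
);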
Before the demo, let's have a brief look at the CSV files I have created for today. These are small CSV files that I personally created using MS Excel and saved as .csv files. I kept them small so that the execution time is as short as possible; since we are using Cloudera, execution can be a little slow, so smaller CSV files are better. The first one is employee.csv, which has the employee ID, employee name, salary and age. Then we have employee2.csv, which has the same details plus one more column, the country; I included country because we will use it in the joins we perform later. Next is department.csv, with a department ID and department name: development, testing, product relationship, admin and IT support. We also have student.csv, which has the ID, name, course and age of each student, and student_report.csv, which has the report of each student: gender, ethnicity, parental education, lunch, course, math score, reading score, writing score and so on. These are the CSV files we will be using in today's demo.

Now let's begin the demo. To start Hive we open a terminal; firing up Hive in Cloudera is really simple, you just type hive and press Enter. It initializes using the configuration files, warns that the Hive CLI is deprecated and migration to Beeline is recommended, and then the Hive terminal, or CLI, is started. First let's create a database. To save time I have already prepared a document with all the code we will execute today; it will be linked in the description box below, so you can use the same file and run the same statements on your own system for practice. So the first thing we do is create a database, using a SQL-style command: CREATE DATABASE followed by the name of the database, which is edureka. The database is created successfully. You can use SHOW DATABASES to check whether it was created: you can see the pre-existing default database, followed by the database we just created, edureka.

Next we will create a table, and here you need to understand that there are two types of tables in Hive: managed, or internal, tables, and external tables. What is the difference? The internal, or managed, table is the default: whenever you create a table in Hive without saying otherwise, for example a new table called edureka, Hive considers that table an internal table by default.
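Here is a small sketch of the database commands just described, assuming the database name used in the demo.

-- Create the demo database and confirm it exists.
CREATE DATABASE IF NOT EXISTS edureka;
SHOW DATABASES;   -- lists 'default' plus the database we just created
USE edureka;      -- make it the current database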
When you create an internal table, your data is not protected from drops. Imagine you are working in a team and all your team members have access to your Hive or Hue; the table exists in your Hive, and some inexperienced person tries to change a few things and accidentally ends up dropping the table. If the table was created as an internal table, your data is erased along with it; that is the disadvantage of internal tables. With an external table, if somebody drops it, only the table definition is removed from Hive; the underlying data stays where it is. That is the best part of using external tables. We will look at both.

First, the internal table. We use a SQL-style CREATE TABLE command; the table name is employee, and the columns are the ID of the employee, the name, the salary and the age. The row format is delimited, and since this is a CSV file the fields are terminated by a comma. And don't forget the semicolon: unless you end with a semicolon, the statement is not complete. Fire Enter, and the table is created successfully. Now let's describe the table, which shows the columns present in it: use the keyword DESCRIBE with the table name, employee, and again the semicolon. The table has the four columns we defined: ID, name, salary and age. Now let's check whether this table is an internal (managed) table or an external table: for that we run DESCRIBE FORMATTED with the table name. There was a small typo there, I missed an "s" in "describe"; fixing that, we can see the table type is MANAGED_TABLE.

Next, external tables. Let's clear the screen first; you can use Ctrl+L for that. Creating an external table is completely similar to creating an internal table, the only difference being that you add the keyword EXTERNAL. Fire Enter and the table gets created. Let's describe the table, employee2, with the semicolon; I keep repeating this because most of the time we miss the semicolon and get an error. The table is described and we see its columns. Now let's check whether this table is external or managed, using DESCRIBE FORMATTED with the table name employee2. After fixing another typo, we get the table type as EXTERNAL_TABLE. That is how we create an internal, or managed, table and an external table; both are sketched below.
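A sketch of the two table types just described, assuming the column layout of employee.csv; the DESCRIBE FORMATTED output is where the MANAGED_TABLE or EXTERNAL_TABLE type shows up.

-- Managed (internal) table: Hive owns the data; dropping it deletes the data.
CREATE TABLE employee (
  id     INT,
  name   STRING,
  salary INT,
  age    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

DESCRIBE employee;              -- lists the four columns
DESCRIBE FORMATTED employee;    -- Table Type: MANAGED_TABLE

-- External table: only the EXTERNAL keyword differs; dropping it removes
-- the table definition but leaves the underlying files in place.
CREATE EXTERNAL TABLE employee2 (
  id     INT,
  name   STRING,
  salary INT,
  age    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

DESCRIBE FORMATTED employee2;   -- Table Type: EXTERNAL_TABLE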
Now that we know how to create a database, a table and the two types of tables, let's create an external table at a particular location. You use the same statement, the only difference being that you specify a LOCATION, here a directory under /user/cloudera in HDFS that Hive will use for this table. Fire Enter, and it is created successfully. Let's go back to Hue and check. One thing to remember: the first folder created for Hive tables is the warehouse, so inside Hive you have your warehouse directory, and inside the warehouse you have the databases we have created. Our first database was the edureka database, and inside it we created the employee table and then the employee2 table; the location-based table sits under /user/cloudera. Sometimes Hue will not show it immediately because of network issues; you don't have to worry, the data will show up.

Next, a quick look at Hue itself. If you have to upload a file into Hue, you can select the plus option, which gives you a dialog box where you can pick any file you want to upload. Let me select student_report.csv and click open; the upload runs and the data file is uploaded successfully. If you want to look at the file you can just click on it and you see all your data loaded into Hue. You can also run queries on this data: select Query, then Editor, and you have various editors here, Impala, Java, Spark, MapReduce, shell, Sqoop, and also Hive. Select Hive and you get the editor where you can type in your queries; there are dictionaries you can pick from as well. That is how you write queries in the Hive editor. Let's not spend more time here, as we have a lot to cover.

Now we shall try to alter tables. We have created a new table, employee3, with columns such as the ID, name, salary and age. The first alteration is to rename the table to emp_table: the table was named employee3, and we rename it using the keyword ALTER. Fire Enter, and the name has been changed to emp_table. To confirm, run DESCRIBE emp_table with the semicolon; if we get the same columns in the description, the rename worked, and indeed we see the same columns. We have successfully renamed the table to emp_table. A sketch of the location-based external table and this rename follows.
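A sketch of the location-based external table and the rename, with the HDFS path and table names assumed from the demo narration.

-- External table pinned to an explicit HDFS directory (path is an assumption).
CREATE EXTERNAL TABLE employee3 (
  id     INT,
  name   STRING,
  salary INT,
  age    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/edureka_emp';

-- Rename the table; the columns stay the same.
ALTER TABLE employee3 RENAME TO emp_table;
DESCRIBE emp_table;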
Next we shall add some more columns to our emp_table. Here I add a new column, surname, of the STRING data type, using the keyword ALTER, then TABLE, the table name emp_table, the keyword ADD COLUMNS, and the column name surname with data type STRING. Fire Enter, and we have successfully added a new column to the table. Describe the table again and you can see the last entry, the surname column we just added. You can also change the names of existing columns; let's try that as well. One of the columns in emp_table is name, which holds the names of the employees; since I added the surname, I will change this column from name to firstname. That is the command I am using for this operation; fire Enter and the change is made. Describe the table, not forgetting the semicolon, and you can see that the column that used to be name is now firstname, and we also have surname. Let's clear the screen. That is all for alterations; the two column changes are sketched below.
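The two alterations just performed, sketched in HiveQL against the emp_table assumed above.

-- Add a new string column to the end of the table.
ALTER TABLE emp_table ADD COLUMNS (surname STRING);
DESCRIBE emp_table;                        -- surname now appears last

-- Rename the existing 'name' column to 'firstname', keeping its type.
ALTER TABLE emp_table CHANGE name firstname STRING;
DESCRIBE emp_table;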
Now we move into the next major data model, partitioning. We have dealt with the first two data models, databases and tables: we have learned how to create a database, how to create internal (managed) and external tables, how to create an external table at a particular location, how to load data into a table, and how to alter a table, its name and its columns. As discussed earlier, partitioning is similar to a school or a college. Imagine you are in a college with many branches, say computer science, mechanical, and electronics and communication, and your name is Harry. If someone comes to the college looking for Harry, there may be many Harrys, but if the person asks specifically for Harry from computer science, the query becomes simple: you don't have to search electronics and mechanical, you go straight to the computer science class and find Harry. That is how partitions work.

To run queries on partitions we will create a whole new database and start fresh. I am creating a new database, edureka_student; it is created successfully. Then we use this database: add the keyword USE and the name of the database, fire Enter, and we are now in the edureka_student database. Now let's create a table in it: a normal managed table called student, with basic columns such as the student's ID, name and age, plus the course. You won't find course among the regular columns, because I am going to partition the table on course using the PARTITIONED BY clause. We discussed the student CSV file earlier, and the courses this particular institute offers are Hadoop, Java and Python, so I am going to categorize, or partition, the students based on their courses. The table has all the columns and is partitioned by course; fire Enter and the partitioned table is created. Before loading data, let's describe it to see the columns in the student table: the course column is present, don't worry; it just appears as the partition column rather than a regular column.

Now let's load the students by course. We load the data with LOAD DATA LOCAL INPATH, because student.csv is in my local location, /home/cloudera/Desktop/student.csv, and we load it into the student table in Hive with the partition course set to hadoop. Fire the command, some MapReduce jobs run, and the data is loaded successfully. Let's refresh Hue; you can either hit the browser refresh or do a manual refresh from the file browser. There you can see the new database, edureka_student, the student table we created, and inside it the partition folder for the Hadoop course. Now we add the students for the Java course: all you need to do is replace the course value with java and fire Enter. We also had the Python course, so we run the same load for python. So we have uploaded the student details into Hive and partitioned them into three categories based on Hadoop, Java and Python. Back in Hue, after a refresh (a manual refresh helped here), the two new folders for Java and Python appear, so you have all three partitions, Hadoop, Java and Python; open them and you can see the student details. The whole static-partition flow we just ran is sketched below.
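The static-partition flow, sketched with the database name, file path and course values assumed from the demo.

-- Static partitioning: the partition value is supplied by hand for every load.
CREATE DATABASE IF NOT EXISTS edureka_student;
USE edureka_student;

CREATE TABLE student (
  id   INT,
  name STRING,
  age  INT
)
PARTITIONED BY (course STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- One LOAD per partition value; repeat with 'java' and 'python'.
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
INTO TABLE student
PARTITION (course = 'hadoop');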
Sorry, I forgot to mention: we have two types of partitioning, static partitioning and dynamic partitioning. In static, or manual, partitioning you are required to pass the values of the partition columns manually while loading the data into the table, so the data file itself does not carry the partition column; you saw that we passed the partition values manually for Hadoop, Java and Python. With dynamic partitioning we only need to do it once, and all three partitions are configured and their files created automatically. So what is dynamic partitioning? In dynamic partitioning the values of the partition columns exist within the table, so it is not required to pass them manually. Don't worry, we shall execute the code for dynamic partitioning and understand it much better.

Let's clear the screen and start fresh with a new database for dynamic partitioning. Earlier we created edureka_student; now we will test dynamic partitioning on a new database, edureka_student2. The database is created successfully; now we use it with USE edureka_student2, and we are inside it. Before we start with dynamic partitioning we have to set hive.exec.dynamic.partition to true, because by default partitioning in Hive is static, and we also need to set the dynamic partition mode to nonstrict, because by default the partition mode is strict. We execute those two required settings and we are good to go.

Now let's create a new table, a staging table for the student data, with the same columns: the ID of the student, name, course and age. We load the data from the local path, /home/cloudera/Desktop/student.csv, into it; it loads successfully, the size is about 267 KB and the number of files is 1. Now comes the interesting part: we create the partitioned table, student_part, partitioned on the same thing, the course, with the fields separated by commas. Fire Enter; the student_part table, created for dynamic partitioning and partitioned by course, is created successfully. The only part remaining is to load the data into it. We write a single statement, and using it MapReduce will automatically segregate the students based on their courses: the Hadoop students go into one partition, the Java students into another, and the Python students into a third. We do it with an INSERT INTO student_part PARTITION (course), selecting the id, name, age and course from the staging table we created from student.csv; the whole flow is sketched below.
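A sketch of the dynamic-partitioning flow just described; the staging-table name and file path are assumptions based on the demo.

-- Dynamic partitioning: the partition column comes from the data itself,
-- so a single INSERT fans out into all partitions.
CREATE DATABASE IF NOT EXISTS edureka_student2;
USE edureka_student2;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Staging table holding the raw CSV, including the course column.
CREATE TABLE stud_demo (
  id     INT,
  name   STRING,
  course STRING,
  age    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
INTO TABLE stud_demo;

-- Target table partitioned by course.
CREATE TABLE student_part (
  id   INT,
  name STRING,
  age  INT
)
PARTITIONED BY (course STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- The partition column must be the last column in the SELECT list.
INSERT INTO TABLE student_part PARTITION (course)
SELECT id, name, age, course FROM stud_demo;

SELECT * FROM student_part;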
Fire Enter and watch it run: you can see some MapReduce jobs getting executed, three of them, one for Hadoop, one for Java and one for Python. This takes a little time, which is why I chose a small CSV file; when you take up the course from Edureka you will work on real-time data, so you get hands-on experience and can get yourself placed in good companies with what you gain from the course. The stages finish successfully and the data is loaded. Now let's see what is present in the student_part table: there is the output, the records in student_part, separated, that is, partitioned, based on their courses, Hadoop, Java and Python.

Now that we have understood dynamic and static partitioning, we shall move on to the last data model, bucketing. After we finish bucketing we will get into the query operations that can be performed in Hive, some functions present in Hive, and other things like GROUP BY, ORDER BY and SORT BY, and finally we will wind up the session with the joins available in Hive. Before bucketing, let's go back to Hue and check whether our partitions were created: refresh, and do a manual refresh as well. Our database was edureka_student2, and inside it we have the student_part table, and there you can see the partition folders, one per course, along with the default partition that holds everything else, as we discussed.

Now let's start with bucketing. We create a new database, edureka_bucket, and start using it with USE edureka_bucket. Then we create a new table containing the ID, name, salary and age of the employees; the table is created, and we load the data from the same file as before, employee.csv, so the data is loaded into that location. Now comes the major part, the bucketing itself. To enable bucketing in Hive we use the command SET hive.enforce.bucketing = true. With that done, we cluster the data in this table by the ID column and categorize the rows into three different buckets; fire the command and the bucketed table is created successfully. The full bucketing flow is sketched below.
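The bucketing flow, sketched end to end; the plain staging-table name is an assumption, and hive.enforce.bucketing is only needed on older Hive releases.

-- Bucketing: load into a plain table first, then redistribute into buckets.
CREATE DATABASE IF NOT EXISTS edureka_bucket;
USE edureka_bucket;

CREATE TABLE emp_plain (
  id     INT,
  name   STRING,
  salary INT,
  age    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/employee.csv'
INTO TABLE emp_plain;

SET hive.enforce.bucketing=true;   -- needed on older Hive versions

CREATE TABLE emp_bucket (
  id     INT,
  name   STRING,
  salary INT,
  age    INT
)
CLUSTERED BY (id) INTO 3 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- The MapReduce job here runs one reducer per bucket (three in total).
INSERT OVERWRITE TABLE emp_bucket
SELECT * FROM emp_plain;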
Now we insert the data into the three buckets we made, overwriting the bucketed table with an INSERT OVERWRITE ... SELECT. You can see some MapReduce jobs being taken care of, one mapper and three reducers, since we have three buckets. Stage one finishes, the process completes and the data has been successfully inserted. Let's go back to Hue and check; do a refresh first, and a manual refresh works even better. There is our database, edureka_bucket, and inside it the bucketed employee table with the employee.csv data.

Now let's look at the basic operations we can perform in Hive. Let's start fresh again and create a new database. I am creating a new database for each and every operation in this tutorial just to keep things sorted; as you can see in the file system, I have separated everything: a separate database for bucketing, a separate one for partitioning, and a separate one for learning how to create databases and tables, just to keep things arranged, which looks much better. So for the operations we create a new database, which I am calling the Hive query language database, and then use it; doing this repeatedly builds the habit and works as revision of what you have done so far. The table is created successfully; now we load the employee data into it, and it loads successfully. Let's see the contents of the table with SELECT * FROM the edureka_employee table: these are the details present in it.

Now, which operations can we perform on this data? As discussed, arithmetic and logical operations can be performed in Hive, so let's try an addition first. I select the salary column, and as we saw the salaries are 25, 30, 40 and 20 thousand rupees for the employees. Let me add 5,000 for each employee by using the addition operator on the salary. Fire Enter: we have added five thousand, so the first value, 25,000, is now 30,000, and similarly every other employee got a sudden 5,000-rupee hike. Now let's take away 1,000: all you need to do is replace the addition operator with the subtraction operator, the minus. Fire Enter, and every employee loses 1,000, so the initial 25,000 becomes 24,000; each query works on the original stored values. Next, some logical comparisons: let's clear the screen, and here I am fetching the employees whose salary is greater than or equal to 25,000; these are the employees with salaries at or above 25,000. Similarly, let's run another query that fetches the employees with salaries less than 25,000. These operator queries are sketched below.
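The arithmetic and comparison queries just run, sketched against the table name assumed from the demo.

-- Arithmetic and relational operators applied to the salary column.
SELECT name, salary + 5000 FROM edureka_employee;   -- flat 5000 raise
SELECT name, salary - 1000 FROM edureka_employee;   -- deduct 1000

SELECT * FROM edureka_employee WHERE salary >= 25000;
SELECT * FROM edureka_employee WHERE salary <  25000;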
That gives us the two employees with the lower salaries, Amit and Chaitanya. Fine, so that is how you run basic operations in Hive. Now let's look at the functions you can use in Hive. In the same way, we create a new database, hive_functions, and use it. Then we create a table in this database, employee_function; it is created, we load the data, and a quick select confirms the data loaded correctly. Now let's apply some functions to this data. The first function I am going to apply is the square root function, where I find the square root of the salaries of the employees: the square root of 25,000 comes out to about 158 point something. That is how you apply basic functions to your data. Next, let's find the maximum salary: the job runs, you can see some MapReduce tasks; I think the biggest salary is Sanjana's, and indeed the maximum salary is 40,000. Since we are working on Cloudera and the system configuration is limited, the execution speed is a bit low, but on a real cluster this would take just a few seconds. There is the value, 40,000, and the employee with the maximum salary is Sanjana. Now the minimum salary: it is 15,000, and that is Chaitanya. Let's run a couple more, such as converting the names of the employees to uppercase, and you can see the employee names in uppercase; similarly we convert them to lowercase. This is how you learn a technology: you need to live with it, and then you come to know its advantages and disadvantages and the possible ways to make things work. The function calls we just ran are sketched below.
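The built-in function calls from this part, sketched against the employee_function table assumed from the demo.

-- A few Hive built-in functions used in the demo.
SELECT name, sqrt(salary) FROM employee_function;   -- sqrt(25000) is about 158.11
SELECT max(salary) FROM employee_function;          -- 40000 in this data set
SELECT min(salary) FROM employee_function;          -- 15000
SELECT upper(name) FROM employee_function;
SELECT lower(name) FROM employee_function;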
Now let's understand the GROUP BY function in Hive. For that we create a separate database, group, and use it with USE group and a semicolon. Then we create a table and load data into it; this time we use the new CSV file, employee2.csv, because it has an additional column, the country. As discussed before, we will be grouping the employees based on country. Looking at the data first, we have three countries in the CSV file: USA, India and UAE, so we will categorize the employees by country, and this is the command we will use. Actually, I made an error while creating the table: I gave the wrong table name, employee_order, so let's drop that table. To drop a table you use DROP TABLE and the table name (at first I typed only DROP and missed the TABLE keyword), and the table is dropped. We were supposed to create a different table, employee_group, so now we create employee_group, and it is created. We load the data into employee_group, using employee2.csv because it has the country column, with India, USA and UAE as the countries. Then we use GROUP BY to categorize the employees by country, and you can see some MapReduce jobs getting executed. There we go: the employees are grouped by country, India, UAE and USA, together with the sum of the salaries, so the people working in India have a salary total of 90,000, the UAE is nearly 1,05,000, and the USA is 80,000. Let's also run a different GROUP BY command: here we group by country and keep only those groups where the summed salary is greater than or equal to 15,000; it is similar to the previous command, and we get the same output.

Now let's look at ORDER BY and SORT BY. For that we create a new database, orders, use it, and create a new table, employee_order; the table is created and we load the data into it. By now I think you have had good practice creating a database, creating a table and loading data into it. The data is loaded, and now we order the data in this table by salary in descending order; some MapReduce jobs run, and we see the employees ordered by salary in descending order, highest salary first and lowest last. Sanjana is in first place with 40,000 as the highest salary, working for the UAE, and Chaitanya has the lowest salary, 15,000, working for India. Let's also run a command based on SORT BY: we first tried ORDER BY, now let's see the same output using SORT BY. In this case both work the same way, and we have sorted the records in descending order of salary. The grouping and ordering queries from this part are sketched below.
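The grouping and ordering queries from this part, sketched with the table names assumed from the demo.

-- Total salary per country.
SELECT country, sum(salary)
FROM employee_group
GROUP BY country;

-- Same grouping, keeping only groups whose total meets a threshold.
SELECT country, sum(salary)
FROM employee_group
GROUP BY country
HAVING sum(salary) >= 15000;

-- ORDER BY gives a total ordering (a single reducer does the final sort).
SELECT * FROM employee_order ORDER BY salary DESC;

-- SORT BY orders rows within each reducer; with one reducer the output
-- looks the same as ORDER BY.
SELECT * FROM employee_order SORT BY salary DESC;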
Now that we have seen the various operations that can be performed in Hive, the arithmetic and logical operations, and some of the functions such as maximum, minimum, GROUP BY, ORDER BY and SORT BY, let's move on to the last type of operation in Hive: the joins. For that we again create a new database, edureka_join, and then use it with the keyword USE; we are now in edureka_join. We create a new table for it, emp_join; here you can see I forgot the semicolon at first, but after adding it the table is created. Then we load the data into this table: I have created the first table, the employee table, and I am loading the employee data into it. To perform join operations we always need two tables, so in this edureka_join database, where we already have the first table, emp_join, we now create the second table, the department table. This department table has the entities department ID and department name; we load the department data into it, and it is loaded. The employee2.csv file had the columns ID, name, salary, age and country, and department.csv has the department IDs and the department names: development, testing, product relationship, admin and IT support. Now we have created both tables and loaded the data.

We have four different joins available in Hive: the inner join, the left outer join, the right outer join and the full outer join. Let's perform the first type, the inner join: we select the employee name and the department name, and join on the employee ID and the department ID. You can see some jobs getting executed, the MapReduce task completes successfully, and the output of the first join is generated. Now the second type, the left outer join: the only difference is that we use the keywords LEFT OUTER JOIN; the job starts and the output of the left outer join is generated as well. Next the right outer join, for which you use the keywords RIGHT OUTER JOIN: fire the command, the jobs execute, and the output of the right outer join is displayed. Finally the last join operation, the full outer join: here I use the keywords FULL OUTER JOIN, fire the command, it executes, and the output of the full outer join is displayed. That is how the join operations are executed in Hive; the four join queries are sketched below.
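The four join queries, sketched with table and column names assumed from the CSV layouts described earlier.

-- Match employees to departments on id = dept_id (column names assumed).
SELECT e.name, d.dept_name
FROM emp_join e JOIN department d ON e.id = d.dept_id;              -- inner join

SELECT e.name, d.dept_name
FROM emp_join e LEFT OUTER JOIN department d ON e.id = d.dept_id;

SELECT e.name, d.dept_name
FROM emp_join e RIGHT OUTER JOIN department d ON e.id = d.dept_id;

SELECT e.name, d.dept_name
FROM emp_join e FULL OUTER JOIN department d ON e.id = d.dept_id;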
So we have learned how to create a database, how to create a table, how to load data, and the various data models present in Hive, namely databases, tables, partitions and bucketing. After that we covered the various operations, the arithmetic and logical operations, and the functions that can be performed in Hive, such as square root, sum, minimum and maximum, then GROUP BY, SORT BY and ORDER BY, and also the joins that are possible in Hive: inner join, left outer, right outer and full outer. Every operation shown in this tutorial is sorted neatly into its own database, and you will also get the code I have used in the description box below so you can try it out. And if you are looking for an online certification and training based on big data and Hive, you can check out the link in the description box below; during the training you get real-time, hands-on experience with real-time data and will learn a lot more.

Now we shall also discuss some of the limitations of Hive. First, Hive is not capable of handling real-time data; Hive is built for batch processing. If you have to work with real-time data, you have to go with real-time tools such as Spark and Kafka. Think of it like this: imagine you are working at Twitter and you have one lakh comments on a particular post. To process those one lakh comments you first have to load all of them into Hive and then process them, and while you are loading the data from Twitter into Hive a few more comments may arrive, which will be missed. So Hive is not preferable for real time; it is suitable only for batch mode. Second, it is not designed for online transaction processing: OLTP works in real time, and Hive cannot support real-time processing. Last but not least, Hive queries have high latency: they take a long time to run; as you have seen, I took a small CSV file and even processing that took quite a while. These are the few important, noticeable limitations of Hive.

With this we have come to the end of this tutorial. If you have any queries regarding this tutorial, or if you need the code we have executed, write to us in the comment section below and we will respond as soon as possible. Till then, I wish you all very happy learning, and thank you. I hope you have enjoyed listening to this video; please be kind enough to like it, comment any of your doubts and queries and we will reply at the earliest, look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 39,112
Keywords: yt:cc=on, hive tutorial, hive course for beginners, Hive architecture, hive tutorial for beginners, hadoop hive tutorial for beginners, hadoop hive, hive architecture in hadoop, introduction to apache hive, hive queries, what is hive in hadoop, apache hive, hive In hadoop, Hive in big data, operators in hive, joins in hive, big data tutorial, apache hive tutorial, hive commands, hive programming, hadoop tutorial, hadoop, hadoop edureka, hadoop training, Edureka
Id: S0i4NX1vlCU
Length: 62min 37sec (3757 seconds)
Published: Thu Mar 05 2020