Hey guys, welcome to this session by Intellipaat. So, have you ever wondered how these giant IT organizations store their data? These organizations use different databases like Oracle for transactional data, MySQL for storing the product data, and many others for different tasks.
So, only storing the data is not enough; you need to analyze the available data and extract meaningful insights from it in order to understand the customers' behavior. For that, organizations started using Hadoop. In 2006, Facebook switched from an Oracle database, which is handled with SQL queries, to Hadoop, which understands only MapReduce. Handling the data this way became a tedious task for them. So, they needed an interface that could convert SQL queries to MapReduce programs, run them on a cluster, and show the results. This is where Hive comes into the picture. It is an interface created by Facebook that converts SQL queries to MapReduce programs. So, in this session, we'll learn Hive comprehensively.
Before going ahead, subscribe to our channel and press the bell icon so that you never miss out on our upcoming videos. Also, if you want to be a certified professional in Big Data Hadoop, I would like to suggest a certification course by Intellipaat, the Big Data Hadoop Architect course. Now let's have a glance
at the agenda. We'll start off by understanding what is Hive, then we'll
see the difference between MapReduce and Hive. After that, we'll look at
the differences between different types of Hive and look into the features of it.
Then, we'll move on to Hive interface. Going ahead, we will see the partitions and
indexing in Hive, and finally you can put down all your queries in the comments
section and we would love to help you out. So, without much delay, let's start
off with the class. What is the requirement of Hive? You have this guy called RDBMS. So, let's say you are using MySQL and you are also using, let's say, MS SQL, that is Microsoft SQL, right, and you are also having
Oracle. So, your organization, let's say, uses
three databases. Now why are you using three of them? I don't know. Maybe
you're using Oracle for transactional data, you know, for payments, etc. Maybe you're using MySQL to store product catalogs, etc. Maybe you're
using Microsoft SQL for some other purposes. So, just assume that you are
having three real-time databases. Now, these guys are called RDBMS or databases, and we also give them another name: OLTP. They are online transaction processing systems. Now, I know that this is too basic, but I'm just
trying to give you an idea. So, every organization has multiple OLTP systems, or maybe a single OLTP system, deployed across the world. Now your manager comes to you and says (let's say I am the employee and my name is Raghu): hey Raghu, we want to do something. You know what, we want to just
collect the data from all these databases and I want to create a nice
report and show it to my CEO so that I can get a free holiday. So I can impress my CEO with my nice report, and I can get a free holiday; maybe you get a
salary hike also. So, why don't we pull all the data from these systems and put
it into a place and practically create a report out of it. Now you have guys known
as Business Intelligence (BI) guys. So, Business Intelligence is the area wherein you get data and you make sense out of the data. Given any type of data, you can
create nice charts, pie diagrams, and reports, etc. using Business
Intelligence. There are a lot of BI tools available in the market, for example, we
have Pentaho BI; you know, many tools are available. So, I thought, let me do one thing: let me take a Business Intelligence tool which can create nice reports and let me install it on my Oracle; let me also install the tool on Microsoft SQL and MySQL, all three databases. But this
doesn't make any sense. Why doesn't it make sense? Because all these databases are real-time databases, which is why we call them OLTP, and they are serving your customers in real time. I don't want any unnecessary load on these databases. So, if I install a reporting tool on top of them and start digging into the data, it may impact the performance of these databases. So I don't want to do that. What I do instead is bring in the elephant in the room, and that is, my friends, your data warehouse, which we call DW. You can also call it OLAP, or an online analytical processing system. Using ETL, you bring data from your RDBMS systems into your data warehouse. So,
technically what is happening is that using ETL tools (ETL is extract, transform, and
load) you can take data from your RDBMS systems and dump it into a place
called a data warehouse. Just a quick info guys, if you want to become a
certified Big Data Hadoop architect, Intellipaat offers you a complete Big Data Hadoop Architect Course which covers all the major concepts. So, for further
details, check the description below. Now let's continue with the session. Now data
warehouse is not a fancy term, it's a technical term.
What is a data warehouse? (1) A data warehouse is a place where you store, first of all, a massive amount of data. (2) This data is ready to be accessed, ready for reporting. So, if I have a BI tool, my BI
tool can be directly installed on my data warehouse, and it can produce nice
reports, etc. A data warehouse is internal to your company;
you don't have public access because this is your private storage. I am collecting all the data and dumping it into this guy so that this guy holds the whole data, and now I can do whatever I want with my data,
very simple. Organizations were doing this for quite some time. You go to any
enterprise organization, what they do is that they will have multiple
OLTP systems. They keep on collecting the data with nightly cron jobs, or whatever you want, and then dump it into a data warehouse. Now, the data warehouse will hold historical data. So this guy has historical data, meaning it will have this year's data, last year's data, the year before that, etc.; you know, a lot of
data. On this data, you can do whatever analysis you want, very simple.
So data warehouse is a place where you store massive amounts of data ready for
analysis, and you do not typically have, what you say, public access to this thing.
If you look at the players in data warehouse,
you have many companies who are doing this. There is somebody called Teradata
which is a major player actually. You also have Oracle Exadata. You have SAP HANA, Netezza, Greenplum, etc. These are all data warehousing companies. So, in your
organization, if you want to implement a data warehouse, what you do is that you
go to, let's say, Teradata and say that hey Teradata, you know what, I
want to implement a data warehouse in my company. So Teradata says, alright, take this and go. They will give you practically an appliance, like a box, which you install in your company; you load it with data, it stores the data, and you analyze it. So, the question is, why am I talking about this? What is the
big deal about this? See, there is no big deal about this, but so far, whatever I
was discussing is how the world was working. Now everything seems nice here:
We have OLTP systems, we use ETL, we dump the data into a data warehouse, we produce reports, nice and happy, but with one major problem. The problem is the cost aspect. Data warehousing is a costly affair, to be honest with you. For example, if you
want to implement a Teradata solution, it will cost you millions of dollars
practically speaking. It is not so easy, and it is not so cheap. People who come from a data warehousing background will be able to comment on
this, but data warehousing is a costly affair.
So that is the backdrop, that is the history that we are talking about.
Now enters Hadoop. So, here is the Hadoop cluster.
So what is this? Your organization has
a Hadoop cluster. Now imagine your organization already has a Hadoop
cluster, fine. What is the problem? In a Hadoop cluster, obviously, you can store
Big Data, but what happened was that somewhere in 2006, the company
called Facebook got interested in Hadoop. So, Facebook was dealing with this Big
Data problem. Basically Facebook was in the growing stage in 2005-2006, etc.
You know, Facebook was getting a lot of users at that point in time, and Facebook
was thinking that you know what everything is working fine, but we really
don't have a solution for storing Big Data and analyzing Big Data. Facebook
was running completely on Oracle. Oracle was their prime back-end actually and
Python. Python plus Oracle was Facebook's initial architecture.
Now somewhere in 2006, somebody told Facebook that hey, you know
what, why are you so worried about storage of data? There is something new
in the market called Hadoop, and if you implement the Hadoop cluster, you can
store unlimited data practically because Hadoop is naturally the solution for Big
Data. So, Facebook got interested in this idea, and they immediately
implemented a Hadoop cluster. So practically speaking, the entire data
that Facebook was storing and analyzing was moved into Hadoop, but what was the
problem? Back in 2006, we were at Hadoop version one or the old
Hadoop or the original Hadoop. The original Hadoop had only
MapReduce, that means, if you want to interact with the data in a Hadoop
cluster, the only way was that you must write a MapReduce program. You have to write a MapReduce program! This became a problem for
Facebook because every day Facebook needed to fire 70,000 SQL
queries, i.e., the requirement of Facebook was that every day they had to fire 70,000
SQL queries on their data, and this was working when they were having Oracle.
Even though it was slow, they were able to somehow manage because Oracle is a
typical RDBMS and you can easily write 70,000 queries. Now what happened all of
a sudden, the whole data is moved into Hadoop, no more Oracle. So the developers
at Facebook faced a serious problem. The problem was that now the entire data
is in Hadoop, and Hadoop doesn't understand SQL. Hadoop never understands SQL. What does Hadoop understand? Hadoop understands MapReduce,
and if you want to write a MapReduce program, you have to learn Java, you have
to write your program, you have to compile it, create a jar file, and deploy it.
It's not easy. If you're a SQL developer, trust me, learning Java is not
going to be so fun for you. SQL developers, you know, tend to be more toward the SQL side of the spectrum; we really are not programmers. So all of a sudden, all the employees of Facebook started
complaining. They said that till yesterday, our SQL queries were
working on the data. Today when I write a SQL query, I am not able to run the
query. What happened? So, Facebook said, you know guys, we just moved everything into Hadoop. Well, that's not fair. So in Hadoop, you cannot run SQL queries, or at least you were not able to run SQL queries then. Then what was the
alternative? The alternative was to write a MapReduce program for every SQL query, and that's practically madness, because for a simple SELECT COUNT(*) query, you have to write a hundred lines of code in Java and compile it, and imagine 70,000 queries. So what Facebook did is that they thought, all right, we need a solution for this. We have a lot of structured data in Hadoop,
but unfortunately our developers are not able to access the data. So what can we do? Thus, Hive was born. Hive was created by Facebook. Later, they donated Hive to Apache as a top-level project, but the first Hive version
rolled out from Facebook. So the next big question, what exactly is Hive?
Imagine you are storing some structured data in Hadoop. Let's say
you're storing a file called sales.txt, a text file called sales.txt. Imagine this text file has a structure, like, you know, comma-separated values. So it has an order number, product ID, etc. You know, it's a very big file, actually, that you store in Hadoop. Now, if
you want to analyze this file, you can write a MapReduce program, or what you
can do is that you install Hive. Now Hive will give you a shell or a command line. You write SQL on Hive's command line. You say SELECT COUNT(*) FROM this table; you write SQL on Hive's command line and hit return, meaning, run the query. What Hive will do is convert your query into a MapReduce program, run it in the cluster, and show you the result. That is Hive. Hive is a
SQL interface on top of Hadoop, and it is the default data warehousing framework on top of Apache Hadoop. So what is a data warehouse? A data warehouse is nothing but a huge storage with a SQL interface. Now, you already have Hadoop, which is huge storage; you install Hive, you get a SQL interface, and you practically create a data warehouse with zero investment. That is the advantage of Hive. Hive is free: you just install it, you can start writing SQL, and you can access structured data using SQL. You can do all your regular operations: creation of tables, left outer joins, right outer joins, group by queries, and so on.
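To make that concrete, here is a minimal sketch of the kind of query you could fire from the Hive shell, assuming a table named sales had already been projected over that sales.txt file (the table and column names here are only illustrative, not from the demo):

    SELECT product_id, COUNT(*) AS total_orders
    FROM sales
    GROUP BY product_id;

Behind the scenes, Hive turns a statement like this into a MapReduce job and runs it on the cluster.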
Just a quick info, guys: if you want to become a certified Big Data Hadoop architect, Intellipaat offers you a complete Big Data Hadoop Architect course which covers all the major concepts. So, for further details, check the description below. Now let's continue with the session. So what is the
difference between Hive and HBase? Now, Hive is what we are seeing right now; HBase is your default NoSQL database on top of Hadoop. Remember, it is the default Apache NoSQL database on top of Hadoop, right. The real question is why somebody should use
Hive. You should use Hive if, say, for example, I don't want to learn Java, I don't want to learn MapReduce, I am NOT a programmer, but I know SQL; then Hive is your choice, because it just understands SQL and will automatically convert that into MapReduce, so that you don't have to sit and learn Java and MapReduce. You write your logic in SQL and it works. Now,
imagine you have, what you say, an Android mobile phone. Now, what happens is that when you're buying an Android mobile phone, obviously you get the default calling option, messaging option, etc. You can call people, you can text people, you can do a lot of things with an Android mobile phone, but you can also download apps. For example, if I want to catch a taxi, I will download and install Uber; if I want to check WhatsApp, I will download and install WhatsApp. Similarly, Hive is a tool which you have to install on top of Hadoop; that is why it is called an ecosystem tool. Ecosystem tool means a tool which can be installed on top of Hadoop. Think about Hadoop like your Android operating system and Hive like the Uber app you are installing on top of that, right. So Hive can talk to Hadoop; it gets installed on top of Hadoop. Point number one, right. Now, another point: where is Hive? So the first question is where Hive is getting installed. Is it inside the cluster? Is it outside the cluster? Is it on top of my head? I don't know; where are you installing Hive? So I have a Hadoop cluster. Now, the point you have to
understand is that Hive doesn't have any storage. Hive doesn't have any storage. For example, let's say you're copying some data; so I'm copying a file called, let's say, transaction.xls, a spreadsheet. I'm copying a file called transaction.xls. Now, what happens in my Hadoop? This gets divided into blocks on the data nodes; we are all familiar with this, right. When you copy the data, the data gets divided into blocks on my Hadoop. So in Hive, when you say create a table and load the data, Hive will refer to this data which is lying as blocks on your data nodes; it doesn't have its own storage. Very important point. So Hive is just giving you a projection of your data. Your original data is lying as blocks on data nodes, and when in Hive you say create a table, you're just giving a structure to the data so that you can see the data in the form of a table. Clear? So Hive doesn't use any
storage. Right. The second point you need to understand is that if I am sitting here, okay, and I have a laptop or a desktop, Hive is a client-side application. For the time being, I will explain more about this later; meaning, Hive gets installed on your laptop, not in the Hadoop cluster. Now, there is a slight variation to this, but for the time being let's assume it is like this. That means your laptop is obviously connected to the cluster; you will have the Hadoop client libraries, so your laptop can access the Hadoop cluster. So Hive is a client-side application. That means you install Hive on your laptop, open the Hive shell, you say create table and give some schema, whatever schema you want, okay, and then you say insert the data into the table. Now, what is this data you are inserting? This data is these blocks; that's what I'm saying. You are not inserting anything. When you say insert the data, Hive will just refer to these blocks in your Hadoop, and then when you say do a group by query or some SQL query, what Hive will do is generate the jar file and send it to the Hadoop cluster. So for Hadoop, there is nothing called
Hive; for Hadoop, it is just a MapReduce program. Let's understand that Hive is a client-side application. That means you install Hive on your laptop, and why are you installing Hive on your laptop? Because you don't know MapReduce, you're a SQL developer, and you know that there is structured data in Hadoop. You want to create a table out of the data and query it, so you install Hive on your laptop, say create a table, load the data, write your query, and hit enter. Hive will automatically write the equivalent MapReduce program, create a jar file, and send it to the cluster. So from the point of view of the Hadoop cluster, it is a regular MapReduce program; it's not Hive. Hadoop doesn't understand Hive; Hadoop understands MapReduce. So either you can write MapReduce, create a jar, and say run, or you write it in Hive; Hive is also going to create a jar and say run. Same thing. That means Hive doesn't have its own storage or anything; it is just accessing the data in Hadoop and giving a projection, just so that it is easy for you to work with. You cannot work on blocks of data and all that; if you can see your whole data in a table, it's easy for you to write queries, right, and that is how it works: Hive is giving you a projection of structured data stored in Hadoop.
That's it. So Hive understands your SQL; even though Hive's language is called Hive Query Language, or HQL, it is pretty much similar to SQL. That means if you are a SQL developer and you know SQL queries, you can easily start working on Hive. Introduction of Hive: Hive is a data warehouse infrastructure that is built on top of Hadoop. Hive provides a mechanism to project structure on the data using a SQL-like language called HiveQL. Hive uses MapReduce and HDFS for the processing and storage/retrieval of data. So now we understand that the data is in HDFS,
and processing is done by MapReduce. And there is a small point here: it's not exactly MapReduce, right, but for the time being let's keep it as MapReduce. The advantages of using Hive: it can be used as an ETL tool, in the
sense that you can bring data into a Hive table, modify the data, and store it permanently. It provides the capability of querying and analysis, so Hive is actually used by the analyst community, who are very good at SQL. It can handle large datasets, and you can do filters, joins, group by, all these SQL queries, on top of map and reduce. Now, if you ask me, can I see the MapReduce code created by Hive? No, you cannot see it. The MapReduce code created by Hive is not visible; you will just get a jar file. You won't even see the jar file; you will just see the job running. Now, the most important point you have to remember is that Hive
is expected to be slow. It is slow because ultimately this is MapReduce; it's not like your Oracle or MySQL or something, because in any RDBMS system, typically when you write a query, you will expect the results within a second, if not microseconds. In Hive, when you write a query, sometimes I have seen queries taking four hours, five hours. Why? Because, ultimately, it is just running MapReduce underneath. So now, when not to use Hive: this is a very important point, where you should not be using Hive. Hive should not be used if the data does not cross gigabytes; that means if you are not having Big Data, then why are you using Hive, right? You can stick with your traditional tools. Also, if we can't find a schema, or bringing in a schema is difficult or not possible. Now the point is that
when you access Hive, you have to create a table. Hive works on the constructs of databases and tables, right, and whenever you are defining a table, the table should have a schema. Now, if your data cannot fit into a table, there is no point in loading the data. Let's say you are getting some type of data, let's say you're getting text data, raw text data, and in that data you're not able to find a structure. Imagine, and I'm just saying imagine, you're getting raw data, plain freeform text. So in freeform text data, how do you find a structure? You cannot find a structure; that means you cannot use Hive on top of that data. So you have to use Hive where you can find some structure to the data; maybe you have comma-separated values, space-separated values, colon-separated values, or something, or JSON files, XML files. Even semi-structured data, they can all fit into Hive. Just a quick info, guys: if you want to
become a certified Big Data Hadoop architect, Intellipaat offers you a complete Big Data Hadoop Architect course which covers all the major concepts. So, for further details, check the description below. Now let's continue with the session. Now, the next case is if we need responses in seconds and low-
latency applications: I told you Hive is expected to be slow, and if you are looking for a faster solution, don't use Hive. If an RDBMS can solve it, don't invest time in Hive. Very important point: Hive is not a replacement for an RDBMS or something, okay, because RDBMS systems are real-time systems and Hive is not something which can replace them. If your RDBMS can solve most of your problems, then don't use Hive. Hive is specifically for huge amounts of structured data. Say you want to create a table, and the size of the table is 3 terabytes; very good, dump it into Hive and query the table. Your SQL will be very easy; even if it takes some time, you will get the result. That is where your Hive use case will come in.
So think about Hive like a translator. It's a translator: you write SQL, it speaks MapReduce to Hadoop, right, and Hadoop gives the result; it shows it as a table to you. So for you, the developer, everything is SQL and tables; for Hadoop, everything is MapReduce; and Hive stands in the middle. Now, what is the similarity with SQL, and what is the difference from SQL? Hive is very similar to SQL-like queries; it is actually based on the SQL-92 framework, and it is safe to say that its functions are mostly the same. What is the difference from SQL? The major difference is that a Hive query executes on Hadoop rather than on a traditional database; that means you cannot install Hive outside Hadoop, it works only on HDFS. This allows Hive to scale to handle huge datasets, which cannot be done by an RDBMS. The internal execution of a Hive query is via a series of automatically generated MapReduce jobs.
Right. So the next question you will naturally have is: I understand Hive, but can I fine-tune Hive? What about performance? Can I improve the performance? Can I do something with Hive? Yes, you can. Now, all Hive queries will be converted to MapReduce jobs, so why can't we write MapReduce ourselves? This is one question people ask: see, Hive is converting everything into MapReduce, so why don't you just write MapReduce, for God's sake? Why are you using Hive? Because understanding the internals of the Hadoop framework is a must to write MapReduce, whereas SQL engineers can quickly write Hive scripts. Now, if you want to write MapReduce, you have to learn Java or Python or Ruby or C#, then you have to implement your logic using the MapReduce framework: you have to write your custom mapper and custom reducer, package it as a jar file, and resolve all the complexities and errors. Here, you don't have to do any debugging or anything: create a table, write your query, and it just works. That is how it works.
So the next point is, now, talking about the real world. Okay, so let's suppose you learn Hive in our Intellipaat course and you master it. What did I teach you? I taught you that Hive is a SQL interface to Hadoop. Fine, you understand that, right, and then you go to a real-world project, okay, and you go to the project and say, you know what, I'm an expert in Hive, an expert in Hive, I know Hive really well. You are going to state this to your project, and these guys are going to tell you that, you know what, you can use Hive, or you can use Hive plus Tez, you can use Impala, you can use Spark SQL, you can use Phoenix. What did I just do? I just blew your mind out of proportion. So when you go to a real-world project and you ask them, okay, I have data on Hadoop, give me a SQL tool, they will either give you Hive, or give you Hive plus Tez, or give you Impala, or give you Spark SQL, or give you Phoenix, and you are like, oh my god, I thought there was only Hive as a SQL interface; now I am in a dilemma, because these guys are talking about Impala, Spark SQL, Phoenix. And trust me, if you write a SQL query, it will be executed the same way on all these platforms: you write a group by query, Hive will execute it, Hive plus Tez will execute it, Impala will execute it, Spark SQL will execute it, Phoenix will also execute it. So the real question is, what are these different tools? So this is some real-world info for you. See, Hive is the original tool; this came first, right,
so Hive is the first Avenger; this is the first guy who came into the world, right. So when Hive came, everybody was happy. People said, all right, fantastic, now I've got a SQL interface, I can write a query, and you know what, the query executes nicely. But over a period of time, people didn't really like Hive. People said, Hive is great, but the problem is that the queries are damn slow; you know, the other day I wrote a query and it took a day to give me the result; I don't really like Hive; we need something else. That is when the company called Hortonworks started proposing Hive plus Tez. Now, what happens here is that if you access a Hortonworks Hadoop cluster and you write a Hive query and say execute, the Hive query will be converted into Tez, not MapReduce. So original Hive is Hive plus MapReduce; Hortonworks Hive is Hive plus Tez. So what is this Tez? This is an Apache framework; if you're interested, you can dig deeper. It was created to make MapReduce faster. I don't want to get into the details of this; I'm just giving you some extra information. For the time being, understand that Tez is the framework which was built to overcome the problems in MapReduce. MapReduce is really slow, MapReduce is normally slow, right, so some guys built something called Tez, and Tez also uses mappers and reducers and all, but it is much, much faster than your regular MapReduce. Tez is a Hindi word; it stands for speed, and it was created by Indian engineers. It is the next level of MapReduce, you can say. So what
Hortonworks did, see, what Hortonworks did, they said, why don't we club Hive and Tez so that queries are faster? Are you getting the point? So if you are on a Hortonworks cluster, you write a Hive query and say execute, you will see a Tez job firing; you will never see a MapReduce job firing, because Hortonworks promotes this. Even though this is Apache open source, Hortonworks promotes this, and they say that their queries are faster. So your first Avenger, original Hive, is damn slow; now Hive plus Tez, this guy is interactive. Interactive query means it is faster but not real-time. So if you are writing a Hive query on a Hortonworks cluster, it uses Tez as the execution engine, and it is faster, but it's not real-time or anything; it's kind of faster. So when Hortonworks started this, even
before that, there is another company called Cloudera. What Cloudera did is they invented something called Impala. Impala is again built on top of Hive; Impala is a SQL interface, but what happens is, if you write a SQL query on top of Impala, it will run it using a daemon called impalad. It doesn't use MapReduce, it doesn't use Tez; it uses Impala's own daemons. So this is promoted by Cloudera, and this is also interactive. Now, Hortonworks will say that Hive plus Tez is faster; Cloudera will say Impala is faster. There is a war going on for like the past four years; nobody knows, both are about the same. But whether it is Hive plus Tez or Impala, it's again a SQL interface for you; the only difference is, if you're on a Hortonworks cluster, the execution is done by Tez, and if you're on a Cloudera cluster, execution is done by a daemon called impalad. And there is somebody called Spark; Spark has a SQL interface, and that's called Spark SQL. If you write a query on Spark SQL, again the same SQL, this will get converted into Spark, and this is almost real-time, so this is faster. Look at this: with Hive, you write SQL and Hive uses Tez, but in Hortonworks you can also say that I don't want Tez, I want MapReduce; they can switch the engine. So this is all additional information, extra information, but good-to-know things, right.
to know things right so Phoenix is sequel interface on top of no sequel that means HBase is your no sequel
database right and HBase doesn't understand sequel so if you do not know
the language of HBase you can start phoenix you write your sequel query it's
going to convert in the language of HBase show you the result
original high-class Map Reduce is batch processing very very slow days and
Impala are similar interactive but there's a certain verse Impala is
Cloudera that means you will never see the edge on cloud era and Impala on
Hortonworks pass pass equally second on SPARC framework Phoenix is wrongish base
Now, the differences between the different flavors of Hive: for example, you have something called pure Hive, then you have something called Hive Server 1, then you have something called Hive Server 2 plus Beeline. So what are these terms? What is pure Hive? What do you mean by Hive Server 1? What do you mean by Hive Server 2 plus Beeline? See, this is something which you have to understand from the technical point of view. Originally, only Hive was there, and this was just a client-side application with an interactive shell. Meaning, the original Hive, what you call pure Hive, is just a client-side application. That means, if you are a developer, you take your laptop, install Hive, you get a command line, and in the command line you say create a table, drop a table, do the query, blah blah blah; it just works. There is no server, only a client, and that is what I was explaining in my picture also, right; I drew a picture where I was telling you that, you know, Hive is just a client-side application. So when Hive was invented, people were really happy; they said, oh my god, that's really great, so we
can get a shell, and we can write commands, and we can work with Hive; that is nice. But later, people started thinking: so how do you connect to your database? Can you tell me? You have an Oracle database; the Oracle database is sitting in some place. How do you connect to the Oracle database or a Microsoft SQL database? What do you use to connect? You use a client-side tool, something like MySQL Workbench or something, right. You need a client-side tool to connect using JDBC or ODBC. JDBC/ODBC connections can be made, but for that you will have a server where Oracle is running, and you need an Oracle client, some SQL client, right. So what people thought is, okay, everything is nice about Hive, but I cannot connect to Hive from my SQL clients. Why? Because Hive was just a command line. Look at here: that is when Hive Server 1 came. This guy will allow JDBC or ODBC connections from your regular SQL clients. That means Hive Server 1 means there will be a server running in the Hadoop cluster. Clear? Hive Server 1 is a server which is running in the Hadoop cluster. On your laptop, if you install Hive, you can open the command prompt and type queries, or you can use your regular SQL client tools to connect to the Hive server and perform all the activities. Right, so with Hive Server 1, the option is that you can create JDBC connections
from your SQL client directly to Hive. So you can insert into a table, you know, you can do whatever queries you want, directly from your client, very easy; you don't have to use the command prompt, though you can use the command prompt also. So people were really happy: great, we have a Hive server, we can connect, we can run queries, blah blah blah. But this lacked concurrency, meaning there was no concurrency or consistency in Hive Server 1; that means if multiple users create sessions to Hive, it was impossible to handle the sessions; data persistence and data concurrency issues were there. That is where Hive Server 2 came in. Hive Server 2, first of all, also allows your JDBC/ODBC connections, it provides concurrency, and it has a new client-side command-line tool called Beeline. That is Hive Server 2. So Hive Server 2 is kind of like an upgrade to Hive Server 1; the basic difference is that it also allows JDBC/ODBC, but it provides concurrency, meaning multiple JDBC/ODBC connections are possible and you can work with Hive. They have also redesigned the CLI of Hive: it still supports the old CLI, and the new CLI is called Beeline. Beeline is actually a client which you can install like a SQL client, so I can go to a computer, install the Beeline client, and from there I can make a connection request to Hive Server 2 and start working with Hive.
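Just to make that concrete, a typical Beeline connection looks something like the line below; the host and user are placeholders (10000 is the usual HiveServer2 default port), so treat this as a sketch rather than the exact command used anywhere in this session:

    beeline -u jdbc:hive2://hiveserver2-host:10000 -n your_username

Once connected, you type the same create table, select, and group by statements that you would type in the Hive shell.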
So the Hive server will be installed inside your cluster. Now, the architecture of Hive: you can see a person, okay, and he is using the Hive query language. Now, if you are that person, you can either interact with Hive through the CLI, that is what I'm saying; CLI means it can be the Beeline client or the regular command-line Hive shell, and I will show you the Hive shell, it's an interactive shell; or through JDBC/ODBC, meaning you have a SQL client and from there you can connect; or through a web UI, because Hive also provides a web UI. And whichever way, look at the arrow: if you are using the CLI directly, you will be hitting Hive; hitting Hive in the sense that if you are using the CLI, you will be directly creating tables, etc. But if you are using JDBC/ODBC, these sessions are handled by Hive Server 2; so they are having Hive Server 2, and the main purpose of Hive Server 2 is to handle these multiple sessions that are created by users through JDBC/ODBC. That is why there is a difference in the arrows: CLI means you are directly invoking Hive, JDBC/ODBC means you are connecting through the Hive server. So either you use the CLI or you use JDBC/ODBC. Now, where will I install Hive
Server 2? It is not necessary that you have to install it inside the Hadoop cluster; it can be outside the Hadoop cluster also. See this picture: your Hive Server 2 is outside the Hadoop cluster in this picture. So like I explained, if you use the CLI directly, it is Hive; if you use JDBC or ODBC, it goes through the Hive server, but again, the sessions are handled by the Hive server and it is ultimately Hive. Now, in the picture it is very clearly shown that the Hadoop cluster is separate; that means the Hive server need not be installed inside the Hadoop cluster, it can be outside, and in most of the cases it is installed outside as well. Now, inside the Hive architecture, you can see the compiler, optimizer, and executor, but how does Hive know where the data is; in other words, what about metadata? See, for any database
or data warehouse or any system, you need to store metadata. What about the metadata? So in Hive, what happens is there is something called the metastore service. What is the metastore service? It is a service running to handle metadata. What is metadata? The definition of your table, the definition of your database, the schema of your table; all this is metadata. So you need to store the metadata somewhere, right, and that is what is handled by this metastore service. By default, Hive uses something called an embedded metastore. This means it stores the metadata in a built-in Derby database. So what happens is that, by default, if you are downloading and installing Hive, it comes with a database called Apache Derby. Apache Derby is a database, and Hive will use this Derby database to store the metadata. Clear? So that is called the embedded metastore. Embedded metastore means, by default, if you just download and install Hive, Hive will need some place to store the metadata, so it by default comes with this Apache Derby database and it starts using Derby to store the metadata. You don't have to do anything; it will be all configured, okay. So this embedded metastore is good, because if I am downloading and installing Hive, it will come with a database and it will store the metadata in that database. This is all good, but this Derby has a drawback: the Derby database is a single-instance database, which means it will allow only one connection at a time. That means if you are using the embedded metastore of Hive, only one person can access Hive. So that is
a drawback. So what you can do is configure your own database as the metadata storage for Hive, for example, MySQL. This is what we do in production; meaning, in production, what we do is download and install Hive. Now, we know that Hive will start working automatically because it will use Derby to store the metadata, but when it is using Derby to store the metadata, it will allow only one connection at a time; this Derby will allow only one connection. So we don't do that. What we do is configure something like MySQL and ask Hive to store the metadata in MySQL, so that multiple people can access Hive.
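As a rough illustration of what that configuration looks like, these are the standard JDBC connection properties you would point at a MySQL database in hive-site.xml; the host name, database name, and credentials below are placeholders, not values from this session:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://mysql-host.example.com/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>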
The latest version of Hive allows updates, but all other versions of Hive by default do not allow updates, because remember, it is Hadoop, and Hadoop is a write-once-read-many system. In Hadoop, there is no update process, right; you copy the data, you delete the data, you again copy the data; it is all writing the data and reading it many times. But the latest version of Hive allows you to do that; it can selectively update, okay, because it can refer to the position of the block on a data node. How? Because by default it talks to Hadoop, so it can understand where your data is and accordingly it can do the update. So what you need to understand is that
if you are practicing Hive on the Cloudera VM, if you're practicing Hive on the Cloudera VM, you are using the embedded metastore; that means there is a Derby inside and that is storing all the metadata, very simple. But if you go to a company and start working on Hive, then the metadata will be stored in a separate database like MySQL. That is the difference. So two types of metadata
storages are possible. So if a user writes a query, the query will be accepted by somebody called the driver; the driver is not shown in the picture, but there is a driver, and the driver will pass it to somebody called the compiler. The compiler will check with the metastore to get the metadata and compile your query; then the optimizer will optimize it and give it to the executor; the executor is your MapReduce, and MapReduce will run your query on the cluster. So, step number one: the user writes the query and the query comes to Hive, okay, and the first step is compiling the query. Now, you can't simply compile the query because you need the metadata, so you will check with the metastore about the metadata. Once you get the metadata, you compile the query, and there is a component called the optimizer; it will optimize your MapReduce code so that it can perform better, and then give it to another component called the executor. The executor is normally your MapReduce program, I mean the MapReduce engine, and that will execute your query and show you the result.
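If you ever want to peek at the plan Hive generates before it runs, you can prefix a query with EXPLAIN; this is just a sketch with a made-up table name, not a query from this demo:

    EXPLAIN
    SELECT category, SUM(amount)
    FROM some_table
    GROUP BY category;

Hive prints the stages it has generated (the map and reduce stages) instead of executing the query.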
Now, there are two situations, okay; again, this is coming mostly from the practical aspect. If you are having a small Hadoop cluster, let's say you are having only three servers, I mean three data nodes, four data nodes, etc., so it's a very small cluster, right, and normally this cluster will be accessed by only a couple of guys. So if you are sitting here and you have a computer, the entire Hive will be installed here; the Hive server will be running here, everything will be running here. Why? Because it's a small cluster; there is no point in installing it there and then connecting. You can install Hive there, but usually not; so I'm talking about different use cases. In small Hadoop clusters, where you are having three to four data nodes and all, you will not do any heavy lifting. Why are you creating small Hadoop clusters? It might be for a proof of concept; you are not actually running any production workload. So that means, in this case, there is no Hive component in the Hadoop cluster; the entire Hive package is on your machine: the Hive server, the Hive CLI, whatever you call it, anything starting with Hive is on your computer, and nothing is there in the cluster. Are you getting my point? Why? Because either you or only your friend will be accessing this, and you don't need a separate client and server and all, because it's a very small cluster. So you install the whole package on your computer, and then you can also install a SQL client on your computer and create a JDBC connection; from here it will connect, etc., but everything will be on your computer because it is a small
Hadoop cluster. Now, what if it is a very big Hadoop cluster? In a very large Hadoop cluster, say thousands of data nodes, you will have something called a gateway node. Have you guys heard about this? What is the gateway node? This guy will have connectivity with Hadoop, and it's a Linux machine; imagine it's a Linux machine. Here you will install everything: you will have your Hadoop client here, you will be having your Hive server here, and maybe Pig; everything will be installed here. And what you do is, if you're sitting here, right, you have a laptop, and you don't install anything on your laptop; this is your laptop, right. So this guy is sitting here, and what he can do is multiple things: he can SSH to this machine, connect to this machine, so he logs in to this machine and from here he can type hive, he gets a command prompt, and he starts working on the cluster. Now, another guy is sitting here; he has a laptop, okay, and here he has installed, let's say, SQL Workbench, a SQL client. Now, if he opens the SQL workbench, it'll connect to this, and from here it will connect to this, right. So what is the gateway node? The gateway node is nothing but a server, okay; it acts
like an intermediary between you and the Hadoop cluster. In large Hadoop clusters, you will not be directly logging on to the cluster; you will never even see your cluster, it is strictly prohibited, you cannot access your cluster. All you do is connect to this gateway node, and from here you can do anything. So if you prefer the CLI, what you do is log onto this server using SSH; from there, you're on the gateway node, you type hive, and you get the Hive CLI. Or, if you're having a SQL client, okay, you give the details of this gateway, and it will connect to the gateway and from there it'll connect to the cluster. So all the client packages are actually installed on the gateway node; that is my point. So you have two options: either you log onto the gateway node, open the command prompt, type hive, you will see the Hive shell, and you start working; or, if you're having a SQL workbench, your Hive server will be running on this gateway node, okay, and the SQL workbench will connect to the Hive server and then to your cluster. Now, the interface of Hive:
just type hive and hit enter, and this is the Hive shell. So this is the command-line interface of Hive, the interactive shell of Hive; what you see is the Hive shell. Now, once you are in the Hive shell, most of the commands that you type need no introduction, because the commands are SQL commands and most of you are familiar with SQL, right. For example, the first command you will probably type if you log on to any of these systems is show databases; you can see all the databases that got created. Now, the best way to understand Hive is to start working with Hive. So what I'm going to do is say create database, let's say, del, and if I do show databases, it will be there; there is a database called del. Now, if you want to use it, just say use del; when you say use del, the database has changed, so now you are using the database called del. Fine, so far so good; these are regular things, right.
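Put together, the commands typed so far in the shell look like this (del is simply the database name used in this demo):

    show databases;
    create database del;
    use del;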
Now let's take a sample scenario and see how we can use Hive to solve a problem, rather than just talking about Hive. Right, let's take a sample scenario. So let me
show you the data that we'll be working with. So, on my desktop I have a file called transactions; can you see this file? It is called txns1; it's transaction data. If I open this file, this is how the data looks, just to give you an idea. All right, now, if I have to explain this data, let's pick one record, so this is one record that I can explain to you. As you can see, every record is comma-separated values, so this has some structure. This is transaction data from a sports store, a retail store. The first column is the transaction ID, which is 4; then you have the date of the transaction, 17 December 2011; then you have something called the customer ID, 4002613, that is the customer ID or the customer number; then the amount spent, which is 98 dollars 81 cents; and then the category, Team Sports; the item, Field Hockey; the city, Nashville; and the state, Tennessee; and it's a credit transaction. So this is how your transaction data looks, line by line; multiple customers have purchased items from a sports store, that's it. So it's a very big dataset, actually, and there are a lot of lines in this data. So this is one file. Now I
also have another file here called customers. If I open this file, this is my customer data. For example, if I pick one line, one record at a time: the first column is the customer ID, okay, which is 400015; then you have the first name, Julia, and the last name; the age, 49; and the profession, musician. So we are having two sets of data: one is the transactions done by the customers, and the second is the data about the customers themselves, you know, their name, customer ID, age, and profession. So I'm having two sets of data, and as you can see, the customer ID is common in both data sets.
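To visualize the two layouts side by side, the records just described look roughly like this; the exact date format and field order are only an illustration reconstructed from the narration:

    txns1 (transactions): txn_id, date, customer_id, amount, category, product, city, state, payment_type
        e.g. 4, 17-12-2011, 4002613, 98.81, Team Sports, Field Hockey, Nashville, Tennessee, credit
    customers: customer_id, first_name, last_name, age, profession
        e.g. 400015, Julia, ..., 49, Musician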
Now, what I want to do is analyze this data. What sort of analysis? What I practically want to do is understand how much money is spent by youngsters in my store, how much total amount is spent by middle-aged people in my store, and how much amount is spent by old-aged people at my store. So basically, I want to classify the customers into three groups, say twenty to thirty as one group, thirty to fifty as another group, and fifty and above as another group, and I want to calculate the sum of the amount spent by them. I want to use Hive for this, so let's look at how we can do this. As a
first step, what I'm going to do is create a table. This is the syntax to create a table, and this should not be a surprise to you if you already know SQL. I'm saying create a table, and this is my table name, txnrecords, and this is my schema, very simple: transaction number, date, customer number, amount, category, product, city, state, spent by. Hive imposes a schema, and the data types and all are very similar to what you would use in day-to-day life; it supports int, string, you know, double, float, almost everything. So I have given the schema to my data. Then I say row format delimited: I am expecting my data in row format, meaning by default Hive will identify a row using the newline character; that is how it separates every record. Fields terminated by comma means my individual fields are comma-separated; if you give a semicolon here, your input data should be having semicolons. Then I say stored as textfile; by default Hive stores the data as a text file. You can say sequence file or any other format also if you want, but even if you don't type textfile, it is going to store it as a text file. This is the full syntax. If I hit enter, Hive says OK, which means the table has been created.
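For reference, the statement being described looks something like the sketch below; the column names are my own shorthand for the fields just listed, so treat them as illustrative:

    CREATE TABLE txnrecords (
      txnno INT, txndate STRING, custno INT, amount DOUBLE,
      category STRING, product STRING, city STRING, state STRING, spendby STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;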
Now, the next point: I have the table, so how do I load the data into the table? There are multiple ways to load the data into a Hive table. I am going to first show you the common way, that is what everybody does. So where is the data? The data is on the desktop of my Linux machine; I mean, the data is in my local file system. So what you can do is say load data local inpath, then the path (/home/cloudera/Desktop, so this is the path, our desktop), into table txnrecords. That is how you load the data. You can actually do inserts, but you know, typically in data warehousing you get bulk data to load, so you will be getting a huge number of tables and, you know, text files which you want to load into this place. So usually you will not say insert; usually you will dump the data like this. So what I am saying is: load data local inpath, which means my data is in my local file system, and this is the path of my data, into the table txnrecords. That's it. So now if I do a select star from the table, say with a limit of 20 or something, I should be able to see the data. See, I am able to see the data because I have loaded the data into the table, right; so now the data is also in the table.
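Roughly, the two statements just run look like this; the local path and file name here are a best guess at the ones used in this demo, so adjust them to wherever your file actually sits:

    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/txns1' INTO TABLE txnrecords;
    SELECT * FROM txnrecords LIMIT 20;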
But what is happening in Hadoop behind the scenes? That is one question, right. I will show you what is happening
behind the scenes. Whenever you install Hive on any platform, what Hive does is create a folder called warehouse in HDFS. So if I do an hdfs dfs -ls, you see this: whenever you install Hive, Hive will automatically create this folder called warehouse in Hadoop HDFS; the folder name is warehouse. Now, if you check what is inside this warehouse folder, you see all the databases you have created; so this is our database, del.db. So whatever databases you are creating are actually stored inside this warehouse directory; you know, whenever you create a database, the database gets created in this path, /user/hive/warehouse, and then the DB that you create. Now, the DB is just a folder in Hadoop; the DB is just a folder in Hadoop. Hue is a web interface to your Hadoop, nothing else, so in Hue I can just go to the browser and look, rather than typing hdfs dfs -ls; I can just come here and see what files are in Hadoop. When you install Hive, it creates a folder called warehouse; so this is the warehouse folder, right, can you see the warehouse folder? It creates a folder called warehouse, and inside this warehouse folder, if you open it, you can see all the databases you have created. So this database called del.db is nothing
but a folder in Hadoop. You created a database and you think it is a database; it's not a database, it's just a folder. See, del.db: if I open this folder, what do you see? Another folder called txnrecords. The txnrecords table you created is just a folder, and if I open this, can you tell me what you see here? The file that I copied. So look at the structure: your database del.db is just a folder, txnrecords is another folder, and inside that you have the file. This is what is happening in Hadoop. So when you are loading the data into a Hive table, it is just copying the data from Linux to this folder in Hadoop. Is this clear for you or not? Of course, you use a command, load data local inpath, but what is happening behind the scenes is that the data you are inserting into the table is just copied into this location in HDFS, and if this data is very big, it will be divided into blocks and all, just like a regular Hadoop file. This is where your data gets stored. I can also show this from the command line: if I list the warehouse folder and then del.db, what will I see? I see another folder called txnrecords, and if I again list txnrecords, I will see the file. That is why I am saying Hive is just giving you a projection; the actual data is just lying in Hadoop.
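The command-line checks just described would look like this (/user/hive/warehouse is the standard default warehouse path; del.db and txnrecords are the database and table from this demo):

    hdfs dfs -ls /user/hive/warehouse
    hdfs dfs -ls /user/hive/warehouse/del.db
    hdfs dfs -ls /user/hive/warehouse/del.db/txnrecords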
Now what I will do is create one more table, because we have the customer data to be stored. I will say create a table called customer, and this is having customer number, first name, last name, etc. So, another table, and we will load the data; so the data is now in the table called customer. But you will be wondering: so far, I have not seen any MapReduce program, right? I was telling you that if you query the data, you will see the MapReduce job and all, but so far you have not seen anything because we have not actually run a query. So let's say select count star from txnrecords. This is a proper query; I am doing a select count star from the table, so basically I want to know how many rows are there. If I hit enter, it's going to fire a MapReduce job; can you see this? Because it's a proper query, it cannot just show the result; it's going to create and launch a job, create a jar file, and here is the result: 50,000. So the answer is 50,000; the result of the query is 50,000. How did I do it? I just wrote my query and hit enter in Hive, and it automatically launches an equivalent MapReduce job to show me the result, and I get the result here.
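In HiveQL, that row count is just the usual aggregate, as sketched below; only the table name is specific to this demo:

    -- fires a MapReduce job; returned 50,000 in this demo
    SELECT COUNT(*) FROM txnrecords;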
Now let's analyze this further. So what I want to do is create one more table; I'm going to create a table called out1. Okay, this table has customer number, first name, age, profession, amount, and product. So basically, my idea is to do a join operation: I have customer data, I have transaction data, and both data sets have a common column called customer ID. So my idea is to do a join operation, a simple inner join operation, and the result of the join operation has to be stored in a table. I have created a table already, and this table is called out1. And how do you do a join operation in Hive? Very simple: insert overwrite table out1, so out1 is the new table we have created, then select a.custno, a.firstname, a.age, and so on. If you look at the syntax of the join operation, it is exactly the same as your normal SQL join operation. If I hit enter, it's going to launch a MapReduce job. So now the join is over, and if I do a select star from out1, this is the result of our join, and if I do a limit of 20 or something, see, this is the result in the new table; this is the result of the join operation. So you see, you have customer-related data and transaction-related data. This is very simple, okay; I am just following the same logic of SQL, the only difference here is that, you know, it is converting it into a MapReduce job.
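A sketch of that join statement is below; the column names and table aliases are my own shorthand for the fields described, so treat them as illustrative rather than the exact query typed in the video:

    INSERT OVERWRITE TABLE out1
    SELECT a.custno, a.firstname, a.age, a.profession, b.amount, b.product
    FROM customer a
    JOIN txnrecords b ON a.custno = b.custno;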
Now, what I can do after this is create one more table; I am creating a table called out2. What I want to do is, see, now I have done the join operation, but I want to classify my customers. My customers are of different ages, so I am having a column called age and I want to classify the customers. How can I do that? Very simple. If you look at here: insert overwrite table out2, which means the result will be in this table which we created in the last step, then select star comma case. You see, I am opening a case here, and what is the case? When age is less than 30, the record will be marked as low; when age is between 30 and 50, middle; greater than 50, old; else others; end. So what this query will do is pick all the rows from my previous table, look at the age column, and based on my criteria it will mark every record as low, middle, or old. If you hit enter, it's going to fire a MapReduce job again; every query will fire a MapReduce job, you don't have to worry about that. So you can see the map progress at 0%, etc., and now it's over. So now the query is over; if I do a select star from out2, with a limit of 20 or something, you see every record is now having a category: old, middle, middle, old, etc. All right, so now I am able to classify the customers.
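The classification query just described is sketched below; the narration uses a select-star-comma-case form, but for clarity the columns are listed out here, and the names are illustrative:

    INSERT OVERWRITE TABLE out2
    SELECT custno, firstname, age, profession, amount, product,
           CASE WHEN age < 30 THEN 'low'
                WHEN age >= 30 AND age <= 50 THEN 'middle'
                WHEN age > 50 THEN 'old'
                ELSE 'others'
           END
    FROM out1;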
Now all I need to do is very simple: I will create one more table and do a group by query. What I am doing here is very simple, guys. I am saying insert overwrite table out3, so out3 is the latest table I have created, then select level, so the level column will have low, middle, and so on, that is the classification, then sum of amount, from out2, group by level. It is a very simple group by query, and the result of this query will have our final answer. So the table name is out3, and once the query is over, if I do a select star from it, I have the final answer. See, here is the final answer: the youngsters are spending this much amount, the middle-aged group is spending this much amount, and the old age group is spending this much amount.
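Written out, that last aggregation is roughly:

    INSERT OVERWRITE TABLE out3
    SELECT level, SUM(amount)
    FROM out2
    GROUP BY level;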
So this is how you start working in Hive. Now, there are two types of tables in Hive: there is something called a managed table and something called an external table, and by default any table that you create is a managed table. I will tell you what the difference is. How do you know which type a table is? If you go to Hive, you can say show tables, and we have a table called txnrecords. If I type describe formatted txnrecords, this command describes the table and says it is a managed table: the table type is managed. Also look at the location of the table. A managed table means the table is managed by Hive, that is, the table is stored under /user/hive/warehouse, and you cannot change that either. That is called a managed table.
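For reference, the check is just this command (txnrecords being the table created earlier in this demo):

    DESCRIBE FORMATTED txnrecords;
    -- the output includes a line like "Table Type: MANAGED_TABLE"
    -- and a Location under /user/hive/warehouse/...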
So if that is a managed table, what is an external table in Hive? What I can do is this: I can say create external table. When you want to create an external table, you have to say create external table. Well, the syntax was not complete; let me add one more thing, location /user/cloudera, and then I can say my_customer, something like this. Let me type this first and then you will understand. So this is the syntax, guys. What am I doing here? I am saying create external table; whenever you are creating an external table you have to specifically mention that it is external, otherwise, if you just say create table, it is going to be a managed table. So I say create external table, and this is the table name, and if you look at the schema and row format, they are all the same; there is no difference in the schema or the format or anything. The only difference is that I am saying location /user/cloudera/my_customer, meaning this is the location in Hadoop. How do you know that? Once you create this table, if you go to your Hue and go to /user/cloudera, there is a folder created called my_customer. Can you see this? It has just been created, and there is no data inside it.
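As a sketch, the statement looks roughly like this (the column list is illustrative, my assumption; the point is the EXTERNAL keyword and the LOCATION clause, with the folder name as used in this demo):

    CREATE EXTERNAL TABLE txnrecords2 (
        custno  INT,
        txnno   INT,
        amount  DOUBLE,
        product STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/user/cloudera/my_customer';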
Now, here is the thing. I created an external table, but if I do a select star from this table, txnrecords2, there is no data, because I just created the table; I didn't fill it with data. If you want to fill it with data, all you need to do is very simple: go to the folder mentioned in the external table, that is my_customer, and copy the data to that folder. I just uploaded the data to this my_customer folder in Hadoop; can you see this? Now if I come back here and do a select star with a limit of 20, I have the data. So what is the difference between a managed table and an external table? With a managed table, which is the default by the way, when you create a table it is always created under /user/hive/warehouse, in that location, and then you manually load the data into the table. An external table means that when you create it you are mentioning the location of your data; look at this line, you are saying that the data will come from this folder, and whatever data you dump in this folder will be visible in your table. Very simple. So is the only difference between a managed table and an external table the location? Yes, the difference is only the location, but the real question is where you would use a managed table and where you would use an external table.
Now I can show you this practically. Have a look here, guys. If I do a show tables, there is a table called txnrecords; this is a managed table. I can say drop table txnrecords. What am I doing? I am dropping the table called txnrecords, which is a managed table. What will happen to the data in this managed table? It will be deleted. Why will it be deleted? Because when you drop a managed table, its folder is deleted from Hadoop; remember the folder structure, the data is inside that folder. So that means if you drop a managed table, the data is gone. But if I do a drop table txnrecords2, which is an external table, the table is gone, but if you look in /user/cloudera, the folder my_customer and the data inside it remain there. This is the difference: if you drop a managed table, the table goes and the data also goes; if you drop an external table, only the table goes and the data remains. So if you want to share your table with others, let's say you are working on a project, you create a table, you want to share the table with others, and you suspect somebody may accidentally delete your table, create an external table, because if somebody accidentally drops your table, your data will still be safe. Otherwise, create a managed table.
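To sum that up as commands (table names as used in this demo):

    DROP TABLE txnrecords;    -- managed: the metadata and the warehouse folder both go
    DROP TABLE txnrecords2;   -- external: the metadata goes, /user/cloudera/my_customer stays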
Now, how do you use Beeline? Beeline is the new CLI, and by default, if you are using the Cloudera VM, it will have HiveServer2 running, and the default port number is 10000; you can check the documentation to confirm that. If you want to connect with Beeline, you have to start Beeline first. This is how you start Beeline: you say /usr/lib/hive/bin/beeline. Basically, this is the location where Beeline is installed; it may differ in other distributions, but in Cloudera Hadoop this is where it lives. So now you see the Beeline CLI, but here, if I say show tables, it is going to throw an error. When I say show tables in Beeline, it says no connection. Remember, Beeline is a client; it needs to connect to the Hive server. Now you see the difference, right? When I just type hive, I am accessing the entire Hive package; it is not a client. But Beeline is a client, so in Beeline, when I say show tables, it says no connection, I can't connect to the server. To connect to the server, what you need to do is this, guys: this is the connection string, and you can use the same thing with your JDBC or ODBC or whatever you have. The connection string is connect jdbc:hive2, where hive2 stands for HiveServer2, then localhost, the port number, the username and password, which are blank by default because I am running locally, and the JDBC Hive driver, so it will connect.
So now I am inside, and if I say show tables, sorry, show databases, it will list all the databases. So this is how you start Beeline: you say /usr/lib/hive/bin/beeline, and that starts Beeline. Beeline is your client, and when you say show tables it won't show anything because it needs server connectivity, and that is where the connection string comes in: you say connect with jdbc:hive2, HiveServer2 running on localhost, the port number, the username and password, hit Enter, and it will connect. Once it is connected, you can type your regular commands, so I am typing, for example, show databases, and here is the list of the databases; the command line is pretty much similar to the Hive CLI. So this is how you start working in Beeline. Now, if you want to quit Beeline, just say exclamation mark q; this is the command to quit, very important, !q will quit Beeline. So now I am out of Beeline. That is how you connect from Beeline to HiveServer2; you can also download Beeline onto your desktop or something and make a connection to the Hive server.
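Putting the Beeline steps together as a sketch (the path and the blank username/password are as on this Cloudera VM; adjust for your setup):

    # start the Beeline client from the shell
    /usr/lib/hive/bin/beeline

    -- inside Beeline, connect to HiveServer2 on the default port;
    -- it prompts for a username and password (just hit Enter on this VM)
    !connect jdbc:hive2://localhost:10000

    -- now regular HiveQL works
    SHOW DATABASES;

    -- quit Beeline
    !q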
Now, if you don't want to type all the commands, or you hate the commands for creating a table and all, what you can do is come to Hue. Let me show you the data I will be working with here. There is a dataset called healthcare sample data. It has a patient ID, patient name, patient mobile number, the date, the hospital details, an email address, a social security number, and the disease: diabetes, fever, cold, and so on. This is the sample data I am using. What if I want to upload this data into Hive and query it? Basically, I want to get a list of all the people who have fever or cold. If I want to do that, I can of course use the command line, but I can also use the GUI; let me show you that. Go to Hue, go to the data browsers, there is something called Metastore Tables, and click on it. Once you click on Metastore Tables, it will allow you to select the database, so I am going to select our database. I can say create a table from a file, give it a table name, test_table, and then select the file from which I want to create the table, which is this healthcare data. I say next, and it is going to automatically look at the data and ask, hey, is this how your data looks? I say next, and it is also automatically going to pick the data types. Then I say create table, and this will create a table from the file I have uploaded to Hadoop; I don't need to type any statement or anything. Once the table is created, you can query the table from the GUI; you don't have to use the CLI. Let me show you that. You can go to the query editors, there is something called Hive, can you see it here? Go to Hive; in Hive you have to use the database, and I have the table called test_table, so I can select it. Then I can say select star from test_table where column number seven equals cold, and execute the query. This will run the query in the GUI and show me the result, can you see this? And I can create nice charts and all if I want, you know, stacks and groups and everything, I can create them if I want. Basically, what I am telling you is that you can run queries from the GUI as well. Now, let's talk about partitions in Hive.
What exactly are partitions in Hive, why are we doing this, what is the idea behind partitions in Hive, and what are we supposed to do? Just imagine that you are working on a project which uses Hive. You are in a project which of course uses Apache Hive, you are happily working with Hive, and everything is going fine. Now, let's imagine that your manager comes to you and says, hey Raghu, do one thing: create a table for me in Hive into which I want to upload all my sales data. So what did I do? I created a table: create table sales_data. I created a table called sales_data, because my manager said he has to upload some data related to sales, with some schema, some columns and all, whatever we have. Now imagine you are working with a managed table; for the time being we assume it is a managed table, and by default what happens is that if you create this table, it will be stored under /user/hive/warehouse. Let's say our database is called newdb, so the location is /user/hive/warehouse/newdb/sales_data; if you create a table called sales_data, this will be the location by default. This is the basic stuff, right? Whenever you create a managed table, this will be the table's location. Now what happens is that your manager tells you, hey Raghu, do one thing, I have some data, why don't you load the data into the table? So your manager gives you a file and asks you to load it into the sales_data table. No big deal, because you usually know how to load the data: you say load data local inpath, and load the data. When you loaded the data, the manager said, hey, this is January month's data, so you loaded a file called January.txt. Very simple: I have created a table called sales_data and I have loaded a file called January.txt, which contains all the sales transactions for the month of January. Very simple.
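The load itself is the standard statement (the local path here is just an assumed example):

    LOAD DATA LOCAL INPATH '/home/cloudera/January.txt' INTO TABLE sales_data;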
So you are happy, and your manager is also happy. Then what happens is that you keep working on the project, and the next month your manager comes to you and says, hey Raghu, we have some more data, this data is from the month of February, why don't you load it into the table? So now the file is February.txt, because this is the February month's data, but since it is getting loaded into the same table, this will be the structure: inside the sales_data folder you now have two files, January and February. I will write it like this so that it is easy for you to understand. Then your manager says, hey, I have some more data, and March.txt comes, then April.txt comes, then May.txt comes, and June.txt comes. So what is happening here? Every month you are getting data, and you are loading it into the table because you have a table called sales_data; but if you look at the way the data is stored, you can see that all these different files, January, February, March, April, May, June, are inside the same folder called sales_data, because that is your table name. So you might think that's fine, but what is the problem?
Now imagine you are writing a query like this: select star from the sales_data table where month equals, let's say, April; I am just giving an example. So you want to query the table, and the query is select star from the table where month equals April, meaning you want to see all the data from the month of April. Now look at the problem you have: by default, Hive doesn't know where the April month's data is. So what will it do? It will come to this folder called sales_data, first scan the entire January.txt, then scan the entire February.txt, then March.txt, then April.txt, where it finds some matches, then again May.txt, then June.txt. Meaning your entire data is scanned for this single query, and your queries will be really, really slow. So the problem we have is this: a table in Hadoop is like a folder, and you keep loading data into it; every time you load data, it is stored as a file. That means inside a single folder I now have six files, January, February, March, April, May, and June, and if you write a query, Hive doesn't know where the data is, so it is going to scan all six files to show you the final result. Imagine you keep adding files every month; if your table has, say, 100 different files, Hive has to scan all of those files before showing the output of the query, and obviously your queries will be really, really slow. You are not actually telling Hive where the data is; that is the simple problem, and Hive has to scan all the files to produce the result, so your queries will be slow, really slow. So the question is, what can I do in such a situation? I don't want my queries to be slow, I want my queries to be faster. That is where the concept of partitions comes into the picture.
So there is something called partitions in Hive. What exactly are partitions? Partitions mean you tell Hive to divide the data based on a column. That is what a partition is: you are telling Hive to divide or segregate your data based on a column. For example, what I can say here is, hey Hive, look at what I am writing: partition my table based on the month column. This is what you can tell Hive. If you tell it this, what it will do is look at the whole data first, and then look at the month column in the whole data. Now, obviously, in your January.txt file the month column will be January, so it will create a folder called Jan and place that file inside it; it is going to create something called Feb and place the February data there, and March here, and April here, May here, June here. So when you tell Hive, hey Hive, I want to create partitions, and I want to partition based on the month column, what it will do is pick that column from the whole data and understand how many distinct values there are. In my example there are six months of data, so it is going to automatically create six folders, Jan, Feb, March, April, May, June; all the January month data will be copied into the Jan folder, all the February month data into the Feb folder, and so on. So now, if you write the same query, select star from sales_data where month equals April, Hive will scan only that one folder, because it knows the April month data is inside the folder for Apr, and your queries will be much, much faster. This is called partitioning in Hive. A partition is nothing but a division of your table. So when you say create partitions, what happens is very simple: based on the partition column, that many folders will be created, and whatever data matches a folder's criteria will be moved into that folder. Very simple.
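As a small sketch of the idea (the schema here is made up; only the PARTITIONED BY clause and the resulting folder layout are the point):

    -- the partition column lives outside the main column list
    CREATE TABLE sales_data_partitioned (
        txn_id INT,
        amount DOUBLE
    )
    PARTITIONED BY (sales_month STRING);

    -- Hive then keeps one sub-folder per value, e.g. .../sales_data_partitioned/sales_month=Apr/
    -- so a query like this only scans that one folder:
    SELECT * FROM sales_data_partitioned WHERE sales_month = 'Apr';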
Now, you have to understand one more thing, so let me show you the data; then it will be easy for you to understand. This is the data I will be using for partitioning; if you take one record of this data, you will understand what it is. Let me show you the data first, and then I will explain what I am going to do. Look here, this is one record: the first name, Gerardo, then the last name, then the street address on Jackson Avenue, then Australia, that is the country, then the state, the primary phone number, the secondary phone number, the email address, and the website. If I scroll down, look at this record: that is a name, the address is on Steel Street, then CA, California I think, and then you have the other details. So basically the data contains users' details. I will just show you the last part of the data as well: this is the US, so if you look here, there is a name, the address, US again, the primary phone number, secondary phone number, email address, and website.
That is the data we are going to use. Now, there are two types of partitions: there is something called static partitioning and something called dynamic partitioning. What is static partitioning, and what is dynamic partitioning? In static partitioning, you have to manually create the partition and load the data into it; in dynamic partitioning, Hive will automatically look at your data and create the partitions. So you will naturally wonder where static partitioning is applicable, because dynamic partitioning seems really good: Hive is automatically going to create my partitions.
So why should I use static partitioning at all? Well, let's say you have data like this: Raghu, comma, a phone number, and then an email address; then you also have Ram, something like ram@abc.com; then you have some more data, say Tina, some phone number, tina@abc.com; and then you have Ajay, some phone number, ajay@abc.com. Suppose you have this data. Now, Raghu and Ram are in India, and Tina and Ajay are living in the USA, but you do not have a country column for India and USA in the data; in the data, all I have is the name, then the phone number, then the email address and the website. I don't have a country, but I know that Raghu and Ram belong to India and that Tina and Ajay belong to the USA. In this case, what I do is manually create two partitions, the partitions being India and USA, and load the first two records into India and the last two records into USA. This is static partitioning. Did you understand the use case of static partitioning? Static partitioning means you are getting the data, you know the data is aggregated, you know that half of this data belongs to the country India and half of it belongs to the USA, but that information is not present in the data. So what you do is create static partitions: when you are loading the data, you ask Hive to create a partition called India and put this data there, and to create a partition called USA and put that data there. Then, when you query the data saying select star from my table where country equals India, it will query only that partition. So this is like you are defining a column while loading the data.
Now I will show you partitioning practically; that will be the best way to understand this. What I am going to do is connect to the VM first, and then let's see how partitioning is done. The first thing I am going to do is launch Hive, and we have the data; I just showed you the data. I am going to demonstrate static partitioning now. I say show databases, and we have to pick a database, so I use this database. Remember, Hue is your web interface, so first let me show you how my folder structure looks in Hue. I go to the root directory, then the user directory, and we have the hive directory; inside it there is the warehouse, there is the deldb database folder, and inside deldb I have these folders.
Now, what I am going to do is say create a table called partitioned_user; you can give it any table name. Look at how I am creating a partitioned table: this is my schema, and after the schema I am saying partitioned by country and state. Look here: whatever columns you are using for partitioning should be outside your schema definition. So when you are creating the table, you say create table and give the schema, no problem, that is all fine, but then I am saying partitioned by country and state. That means I am telling Hive, hey Hive, I am creating a table, this is my schema, and I am going to partition it using two columns, country and state. If you hit Enter, it will create the table, but look here: if I refresh my Hue, this is the table that was created, partitioned_user, and if I go inside the table, there is no partition. When you create a partitioned table, there are no divisions yet; it is just a normal table. Can you see this? I just opened this table, partitioned_user, and if you open it, there is nothing inside it yet. Now, how do you create static partitions?
Very simple. In order to create static partitions, let me first show you the data. I will duplicate this session, do a sudo su cloudera, and go to this file called static; let me show you the file. This file has three rows of data, and I am going to load this file into my table and create partitions. I completely understand that this file is just copied from the original file we had, but this is the file I will be using for demonstrating partitions. If you look here, it has only three records as of now: you have the name, then the address, then the phone number, the email address, and the website; that's it. So what I am going to do is say load data local inpath, loading this file into the table, with partition country equal to US, state equal to California. This is how you create a static partition: I already have a table, and that table has the full schema; now I am saying load this file, but while loading it, create a folder for US and, inside it, a folder for California, and then put the data there. US is my country, California is my state, and the file will be inside that. So I created the table here, and then I loaded the data, saying the country is US and the state is California.
Now I will show you how the folder looks in Hue. If I refresh my Hue and go to deldb, there is this partitioned_user table; inside it there is a folder for the country, inside that there is a folder for the state, California, and inside that my data is there. This is the data. Now, if you look at the data, there is no country column and no state column in the data, but I created the partition so that it lands here. Now, if I query the data, I will write a query where country equals US and state equals California; it will land on this data and query only this, so it should be faster. So these are the commands used: first you create the table, give the schema, and define your partition columns outside the schema; that is what I am doing, defining the partition columns outside the schema. Then you load the data, and while loading the data you say which country and which state this data belongs to. Very simple: you have to give a country name and a state name.
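Putting the static-partitioning steps together as a sketch (the column list and the local file path are assumptions; the partition values are the ones used in this demo):

    -- the partition columns go in PARTITIONED BY, not in the main schema
    CREATE TABLE partitioned_user (
        firstname STRING,
        lastname  STRING,
        address   STRING,
        city      STRING,
        phone1    STRING,
        email     STRING
    )
    PARTITIONED BY (country STRING, state STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- static partitioning: you name the partition yourself while loading
    LOAD DATA LOCAL INPATH '/home/cloudera/static_data.txt'
    INTO TABLE partitioned_user
    PARTITION (country = 'US', state = 'CALIFORNIA');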
Okay, now let's discuss dynamic partitioning, so that you are familiar with that as well. Before you do dynamic partitioning, you have to do a couple of things; first of all, you have to enable it. I will show you how. What am I doing here? By default, Hive has only static partitioning enabled, which means Hive will not allow you to do dynamic partitioning, and that is why I am typing this command: set hive.exec.dynamic.partition to true, which enables dynamic partitioning. I am also saying that the maximum number of dynamic partitions per node is a thousand, which means at most a thousand partitions will be created; you can give a bigger number if you want, it is up to you, I just used a thousand. You are also setting the partition mode to nonstrict; there is something called strict mode and nonstrict mode, and I am just setting the partition mode to nonstrict.
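The settings being typed here are the standard Hive session properties; roughly:

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.max.dynamic.partitions.pernode=1000;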
Now, what I am going to do is show you dynamic partitioning with the same set of data. Hive will allow both static and dynamic partitions now that I have enabled dynamic partitions; by default it has only static partitioning. Let's look at what I am going to do; it is very simple. I am going to create a temporary table. The temporary table concept in Hive is exactly similar to a normal temporary table: temporary tables are valid only as long as the session lasts. You say create temporary table, and this time I am not doing any partitioning; you can see I am giving the complete schema. Look here, I am creating a normal temporary table, no partitions, nothing, just a simple temporary table, and this table is called temp_user. Once I create the temporary table, I want to load the data, so let me load it. What data am I loading this time? I am loading this file called customer_data.txt, and that is this file; this file has country and state columns, we are working with this file, and it has a column for country and a column for state. So I have the file, I created a temporary table, and then I loaded my data into the temporary table. Now, as the first step, I just want to write a query and see how long it takes, so I will write a very simple query.
I will say select firstname, phone1, city from this temporary table where country equals US and state equals California, order by city, limit 5. It is a very simple query, but the point is that I am filtering on country and state just to demonstrate how much time it takes when my table is not partitioned; right now there are no partitions. So let's run the query; the query runs, of course, and let's see how much time it takes. Okay, this time the MapReduce job is over, and it took 85 seconds.
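For reference, the unpartitioned baseline looks roughly like this (the schema is abbreviated and the file path is an assumption; the filter values have to match whatever your data actually contains):

    CREATE TEMPORARY TABLE temp_user (
        firstname STRING,
        lastname  STRING,
        city      STRING,
        phone1    STRING,
        country   STRING,
        state     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA LOCAL INPATH '/home/cloudera/customer_data.txt' INTO TABLE temp_user;

    -- with no partitions, this scans the whole table
    SELECT firstname, phone1, city
    FROM temp_user
    WHERE country = 'US' AND state = 'CA'
    ORDER BY city
    LIMIT 5;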
Hmm, so we will see whether it is faster. One more thing: I am running this on a VM, so even if I do partitioning it may not have a huge impact, but let's see. Now let me show you how to create dynamic partitions. If I go to my Hue and go to this deldb, this is my deldb. What I want to do now is create a partitioned table, so I will create a table called partitioned_user1, and this is the schema of my table, partitioned by country and state and stored as a sequence file. The stored-as part is optional; you can store it as a text file or a sequence file or anything, that is optional. This is the same kind of command we used for creating the static partitioned table; when I was creating the static partitioning table, I also used the same command. But here the difference is that the data already contains the state and country columns. Now see what I am going to do. I created the table called partitioned_user1, and now I am going to say insert into table partitioned_user1 partition (country, state). What I am telling Hive is: hey Hive, I want you to copy all the data from my temporary table, load it into this new table, and partition it by country and state. I am not going to tell you which country and which state; you decide, because the columns are already present in the data. Now let's see what happens.
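A sketch of that dynamic insert (the column names are assumed; the key point is that the partition columns, country and state, come last in the SELECT so Hive can pick the partition for each row itself):

    INSERT INTO TABLE partitioned_user1 PARTITION (country, state)
    SELECT firstname, lastname, city, phone1, country, state
    FROM temp_user;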
So now, what Hive is doing is taking the data from your temporary table, looking into the country and state columns, identifying the countries, identifying the states, creating that many folders, and putting the data in accordingly. I will show you this; look at your screen very carefully and you will be able to see Hive creating these folders. Now it is creating the partitions; just wait a moment for that to show up. Can you see it, loading partitions, loading partitions, loading partitions? Can you see that, country, state, country, state? Can you see it creating these files and folders? And done. Now, if I look at my Hue, the table is called partitioned_user1, and if I open this table, there are folders for the countries, Australia, Canada, UK, US, plus one more that you can ignore, so effectively four countries. If I open the folder for country US, folders for all the states are created wherever you had data, and if you open any of these states, you will have a file containing only that state's data. Of course, you are not able to see the file in a readable format because it is stored as a sequence file, but it has only the data for the matching state. So now you can see that it is creating countries, and inside each country it is creating states, and inside those the files are created. So this is dynamic partitioning.
Now we will just see whether the query is faster. What I will do is write the same query on the partitioned table and see whether it runs faster. So I am writing the same query here; last time it took 85 seconds, and now it is the same query but on the new table. Let's see what happens; it might not be a huge difference, but I guess there will be some difference. See, time taken: 41 seconds. Can you see this? Last time, without partitioning, it was 85 seconds, and now your query is taking only 41 seconds. That is the impact on query time when you are using partitioning. Just a quick info, guys: if you want to become a certified Big Data Hadoop architect, Intellipaat offers you a complete Big Data Hadoop architect course which covers all the major concepts, so for further details check the description below. So this brings us to the end of this session; if you have any query, please do comment below and we will reach out immediately. Thank you.