Fundamental : Apache Zookeeper (Session 1)

Captions
Hello and welcome to HadoopExam Learning Resources. In this session we will cover the fundamentals of Apache ZooKeeper. We are covering it mainly because it is used inside the Hadoop framework to handle failover for various components, and there is a lot of confusion about what exactly this service does there, so we created this session to explain the basic concepts of ZooKeeper and how it helps in the Hadoop framework.

ZooKeeper services are mainly used when more than one node is involved in your application. If you are building an application that involves two or more processes, say two Java processes on different systems, and you want those processes to communicate, the ZooKeeper service helps with that coordination. In the Hadoop framework, the NameNode is the main component where this service is heavily used: it is part of the framework and is used to maintain failover. Other examples of systems that use ZooKeeper are the Kafka messaging engine (for instance, to decide which node will process which message; we will look at scenarios like this) and Storm, the real-time data processing engine. In all of these frameworks, multiple nodes are involved.

So what exactly is the purpose of ZooKeeper? ZooKeeper helps synchronize information across the nodes in a cluster. Where does ZooKeeper store this information? It stores it both in memory and on disk. It is kept in memory because it has to be available within a fraction of a second, which enables fast failover, and it is also kept on disk so that if the process fails, the information is still available.

The component where this shared information is stored is known as a znode. Whatever information you want to share among multiple systems is stored in a znode. A znode is essentially just a single, very small file that holds this data, in memory as well as on disk.

Let me give you an example. Suppose you have some configuration data, say an XML configuration, and both node 1 and node 2 should always have the same configuration. Now you update this configuration on node 1; how can node 2 get the updated information? There are multiple ways, but ZooKeeper can help like this: both nodes communicate with a ZooKeeper server. That server can run on a separate machine or on one of the nodes themselves; what matters is that it can communicate with both nodes. Whatever information you update on node 1 is written to a znode on the ZooKeeper server, and the same information is then propagated to node 2. That is how node 1 and node 2 remain in a synchronized state. Likewise, if you delete some configuration on node 1, that change is immediately passed to ZooKeeper, and because ZooKeeper keeps the information in memory it is very fast; the change is immediately available on node 2. So anyone reading from node 2 will also get the latest, updated state of the configuration. That is one example; let's move on to another.
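The watch-and-notify idea described above can be sketched in a few lines. This is a toy, single-process simulation, not the real ZooKeeper API (a real client library such as kazoo would talk to a ZooKeeper ensemble over the network); the class and path names here are invented for illustration.

```python
# Toy in-memory "znode" store illustrating how a watched znode keeps
# two nodes' configuration in sync. Not the real ZooKeeper protocol.

class ZNodeStore:
    def __init__(self):
        self._data = {}       # path -> bytes, the znode contents
        self._watchers = {}   # path -> list of callbacks

    def set(self, path, value):
        """Update a znode and notify every watcher of that path."""
        self._data[path] = value
        for callback in self._watchers.get(path, []):
            callback(path, value)

    def get(self, path):
        return self._data.get(path)

    def watch(self, path, callback):
        """Register interest, the way node 2 watches the shared config."""
        self._watchers.setdefault(path, []).append(callback)

# Node 2 keeps a local copy of the shared configuration in sync.
local_config = {}

def on_config_change(path, value):
    local_config[path] = value

store = ZNodeStore()
store.watch("/app/config", on_config_change)

# Node 1 updates the configuration; node 2's copy updates immediately.
store.set("/app/config", b"replication=3")
print(local_config["/app/config"])  # b'replication=3'
```

In real ZooKeeper, watches are one-shot triggers that must be re-registered, and the data also survives on disk; this sketch only shows the synchronization idea.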
So the ZooKeeper service is used mainly for coordination among the nodes in a cluster where multiple nodes are involved; that is the whole purpose. But we will look at various scenarios where it is used. Let us take ZooKeeper in a message processing engine. Suppose this is your messaging queue, from which you fetch messages to be processed. These could be data messages, tasks, anything; they need to be either processed or stored somewhere. Your cluster involves four nodes: node 1, node 2, node 3 and node 4. All the nodes listen to this queue and process messages. Node 1 fetches messages 1, 2 and 3; node 2 fetches messages 2, 3 and 4; another node fetches 1, 2 and 4; so there is overlap, and the same message can be fetched by multiple processes in the system. How much overlap is up to your configuration, but the reason for it is fault tolerance: if message 1 is fetched by one node, it should also be available on, say, at least three machines in the cluster, so that if one node goes down the message still gets processed. Suppose node 1 goes down: message 1 is still available on another node.

Now how does this coordination happen? Node 1 is processing message 1, but node 2 and node 3 also hold message 1. To avoid duplicates, the same message should not be processed by more than one node; how can this information be synchronized? There is a ZooKeeper server connected to all the nodes in the cluster, and it has the information that node 1 is processing message 1, so the other nodes should not process it. As soon as node 1 goes down, ZooKeeper (you obviously have to configure this) knows that messages 1, 2 and 3 were on node 1, and that node 1 has crashed and is no longer available, so those messages need to be processed by some other node: node 2, 3 or 4. Similarly, any node in the cluster can fail, so the ZooKeeper server keeps this information in its znodes, where it can be read by the other nodes in your cluster: which nodes are alive, which node has died, and which node is processing which messages.

How would you do all this if you did not have this component at all? Four nodes are listening to the queue, messages are being dequeued by each node, and each node is processing independently. How can they tell each other "I am processing this message, so you should not"? After processing, they all store the results in the same database, say an Oracle database. Suppose message 1 is processed by node 1 and also by node 2: both results go to the database. Let me draw the scenario: here is your queue, and nodes 1 to 4; the first three nodes all dequeue the same message m1. Node 1 processes m1, and after processing, the result needs to be stored in the Oracle database.
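The coordinator role described above, where a node must claim a message before processing it and a dead node's claims are released, can be sketched as follows. This is a single-process simulation with invented names, not the real ZooKeeper protocol (which would typically use ephemeral znodes as locks).

```python
# Sketch of message-claim coordination: each node "claims" a message
# before processing it, so no message is processed twice, and a crashed
# node's messages become available to the surviving nodes.

class Coordinator:
    def __init__(self):
        self._claims = {}  # message_id -> node_id

    def claim(self, node_id, message_id):
        """Return True if this node wins the claim, False if taken."""
        if message_id in self._claims:
            return False
        self._claims[message_id] = node_id
        return True

    def node_failed(self, node_id):
        """Release every message the crashed node had claimed."""
        released = [m for m, n in self._claims.items() if n == node_id]
        for m in released:
            del self._claims[m]
        return released

coord = Coordinator()
print(coord.claim("node1", "m1"))   # True  - node 1 processes m1
print(coord.claim("node2", "m1"))   # False - duplicate avoided
print(coord.node_failed("node1"))   # ['m1'] - node 1 crashed, m1 freed
print(coord.claim("node2", "m1"))   # True  - node 2 takes over m1
```

In real ZooKeeper a claim would be an ephemeral znode that disappears automatically when the claiming node's session dies, which is what makes the "node failed" notification automatic.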
Similarly, the other nodes process the same message and store it into the Oracle database. So, to allow for failover we fetch the same message on multiple nodes, but then how does each node know which node is already processing which message, so that it should not process it too? That is why the ZooKeeper service is introduced: this information is shared among all the nodes, recording which node is processing which message, and that helps them coordinate who is processing what and who should stay away. This kind of synchronization and coordination needs a coordinator, and here the ZooKeeper server works as the coordinator: which node is processing which message is stored in ZooKeeper. That is one kind of coordination.

Now let us take another example: leader election. Say you have three nodes in your cluster: node 1, node 2 and node 3. Some service in your system needs to be run by only one machine at a time, the leader machine, which assigns tasks to the other machines. So this is your leader, your master node, and any communication with the cluster should go through this master node only: you submit your work to the master node, and it is the master node's responsibility to distribute the subtasks to the other nodes in the cluster. Now, how is this leader decided? You could fix an IP address and declare that machine the master node, but that is not a good design, because that node may crash; it is a system, after all.

The system should be capable of handling the case where the master node goes down: another node should then become the master. How can this happen? Again, the ZooKeeper service helps: it keeps the information that node 1 is the master node and is still alive. Suppose node 1 goes down; ZooKeeper gets that information within a fraction of a second and makes another node the master, node 2 or node 3, and then your requests are served by the new master, say node 2. A fair amount of configuration is needed to achieve this, but basically that is the purpose of the ZooKeeper component, and our intention here is to learn its overall purpose. This is leader election, and it is mainly used in NameNode failover, which we will look at next. I have written the same thing here, and I will share this document with you; if you are watching on YouTube, you can download it from the link in the description section of the video.

Before the NameNode example: if you don't know Hadoop, just go to hadoopexam.com, our site, where we have various training and certification materials, including Hortonworks-related and Cloudera-related certifications. Let me go to the home page and click on Cloudera: these are all the Cloudera certification materials and trainings available there.
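Returning to the leader-election example above: ZooKeeper's standard recipe uses ephemeral sequential znodes, where each node registers with an increasing sequence number, the lowest live number is the leader, and when the leader's session dies the next node takes over. The sketch below simulates that idea in one process; the class and method names are invented, and a real deployment would use a client library such as kazoo against a ZooKeeper ensemble.

```python
# Simulated leader election in the spirit of ZooKeeper's ephemeral
# sequential znodes: lowest live sequence number is the leader.

class Election:
    def __init__(self):
        self._members = {}  # node name -> sequence number
        self._next_seq = 0

    def join(self, node):
        """Register a node, like creating an ephemeral sequential znode."""
        self._members[node] = self._next_seq
        self._next_seq += 1

    def leave(self, node):
        """A crash or shutdown removes the node's (ephemeral) entry."""
        self._members.pop(node, None)

    def leader(self):
        """The member with the lowest sequence number leads."""
        if not self._members:
            return None
        return min(self._members, key=self._members.get)

e = Election()
for n in ("node1", "node2", "node3"):
    e.join(n)
print(e.leader())   # node1 is the master
e.leave("node1")    # master crashes
print(e.leader())   # node2 takes over automatically
```

The sequence-number trick matters because it avoids a "herd effect": in real ZooKeeper, each node only watches the znode just below its own, so a leader failure wakes exactly one successor.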
We keep updating and continuously upgrading our material, so if you want, subscribe at hadoopexam.com and please visit it at least once; you will find similar training and certification materials there.

Now let's move ahead and take Hadoop's active and standby NameNodes. If you remember, Hadoop framework 1 had a single point of failure, and that is one reason it was not adopted by many organizations: the NameNode could crash, and whatever data you had stored could be lost, with no good way around it. When Hadoop framework 2 came into the picture, a solution was implemented, and that solution uses ZooKeeper. What exactly does it do? There are now two NameNodes. The NameNode is where the metadata of your entire Hadoop file system is stored; that is the one-line summary. If the NameNode goes down, your cluster is gone. This happens infrequently, but in Hadoop 2 you can have more than one NameNode in your cluster. So suppose we have NameNode 1 and NameNode 2, which both hold the Hadoop file system metadata, and the ZooKeeper Failover Controller (ZKFC) service is installed on both nodes. What is the purpose of the ZooKeeper Failover Controller? It monitors the NameNodes: while NameNode 1 is in active mode, NameNode 2 remains in standby mode, and all requests are processed by NameNode 1. Whenever any update happens on the file system, it goes to NameNode 1, and NameNode 1 shares this information with the journal nodes (I will explain journal nodes in a moment). There are three journal nodes, all three receive the information, and as soon as it is available there it flows on to NameNode 2, so NameNode 2 holds the same information, the same state, that NameNode 1 has.

Now suppose NameNode 1 crashes. The ZooKeeper Failover Controller immediately learns that the node is no longer available, so NameNode 2 immediately becomes the active node. The administrator gets an alert and can fix NameNode 1 in the meantime. Because NameNode 2 has exactly the same information NameNode 1 had, all your requests are now processed by NameNode 2.

Now let's come to the journal nodes. The journal node is a different thing; it is not actually related to ZooKeeper. The journal nodes form a quorum: if you change some configuration on NameNode 1, the change has to be written to at least a majority of your journal nodes. A journal node is a very lightweight daemon process, and there should always be an odd number of them: 1, 3, 5 or 7 is the good way to size them, the reason being to make a majority well defined. Majority means that if you have configured 3 journal nodes, whatever update you make needs to be committed on at least 2 of them; as soon as it is committed on 2 nodes, the update is considered a successful commit, and once it is successful, NameNode 2 gets the information from the journal nodes. This arrangement is also known as the Quorum Journal Manager. The count depends on the size of your cluster: on a thousand-node cluster you might configure seven or nine journal nodes. With three journal nodes, as soon as the update is written on two of them it is a successful commit and the process can move ahead. That is the concept of the journal node.

To summarize, there are two separate things here. The purpose of the journal nodes is to propagate updates to NameNode 2: whatever is updated on NameNode 1 goes to all three journal nodes and from there into NameNode 2. The purpose of the ZooKeeper Failover Controller is to maintain the active/standby roles: while both nodes are alive, one is active and the other remains in standby; as soon as the active one crashes, the ZKFC learns that NameNode 1 is no longer available and immediately switches the standby over to active, while the failed node is fixed by the administrator in the meantime. Before serving requests, both nodes must have exactly the same state. The same thing is written here, so you can read it.

So, to restate the definitions: a znode is just a file that contains information which needs to be shared across the nodes in a group, for example a group of NameNodes or a group of message-processing nodes; any node from the group can watch this znode to monitor changes in the information stored in it. And a quorum, which we have already discussed, means a majority is required; the Quorum Journal Manager daemon takes care that at least a majority of your journal nodes are able to write an update before it is considered a successful commit.
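The majority-commit rule for journal nodes described above can be sketched numerically. This is an illustration of the quorum arithmetic only, with invented names; it is not the actual Quorum Journal Manager implementation.

```python
# Sketch of the majority-commit rule: an edit counts as committed only
# once more than half of the journal nodes have acknowledged it.
# With 3 journal nodes the majority is 2, with 5 it is 3, and so on,
# which is why an odd count (3, 5, 7...) is recommended.

def majority(total_journal_nodes):
    return total_journal_nodes // 2 + 1

class JournalNode:
    def __init__(self, alive=True):
        self.alive = alive
        self.edits = []

    def write(self, edit):
        """Acknowledge the edit only if this journal node is up."""
        if self.alive:
            self.edits.append(edit)
            return True
        return False

def commit(edit, journal_nodes):
    """Write an edit everywhere; succeed only on a majority of acks."""
    acks = sum(1 for node in journal_nodes if node.write(edit))
    return acks >= majority(len(journal_nodes))

nodes = [JournalNode(), JournalNode(), JournalNode(alive=False)]
print(commit("set replication=3", nodes))  # True: 2 of 3 acked

nodes[1].alive = False
print(commit("set replication=2", nodes))  # False: only 1 of 3 acked
```

This is also why the cluster tolerates one failed journal node out of three but not two: once fewer than a majority can acknowledge, no new edits can be committed.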
That is the same thing we discussed: the Quorum Journal Manager has nothing to do with ZooKeeper; it just comes up alongside it, so I added it here to build the basic concepts, because there is a lot of confusion between journal nodes, the Quorum Journal Manager, the ZooKeeper service and so on. All of these daemon processes are already available in the Cloudera framework, and you can configure them using Cloudera Manager; using the Hortonworks Ambari framework you can configure the same things. The concept is similar across all the vendors of the Hadoop framework.

So that is what we have covered here: what ZooKeeper is, what a znode is, what journal nodes are, and what leader election is. That is all for this session; thanks for watching, and I hope you liked it. I would definitely suggest you go to hadoopexam.com and try our demo sessions and demo material; you will definitely like it. We have been serving the community for the last four and a half years and have more than 12,000 subscribers so far. And it is not only about the Hadoop framework: we have Amazon Web Services solutions, data science, HBase, Hortonworks, Spark and many other products; Cassandra material is in development as well, along with general programming such as Java, Python and Scala. A lot of products and learning resources are available there, so I would suggest you visit at least once to see exactly what we provide. I would also suggest you subscribe: just click subscribe and enter your email id and name, and whenever we update something you will get to know what the updates are on hadoopexam.com, because we are continuously upgrading our existing products, and new products will be launched as well.

This is the URL: hadoopexam.com; it is very simple, nothing special. You can also become an author or trainer: just click here, subscribe and send your details; we will get to know about you, we might need your help with development, and there is good revenue sharing as well. Finally, if you are watching on YouTube, please subscribe there too, because whenever we upload a new session you will get the notification. Thank you very much.
Info
Channel: HadoopExam Learning Resources
Views: 37,774
Keywords: Apache Zookeeper, Cloudera Zookeeper, Hortonworks Zookeeper
Id: WlkqeSstV3c
Length: 21min 8sec (1268 seconds)
Published: Sun Aug 06 2017