Understanding Node Eviction - RAC 10

Video Statistics and Information

Captions
In this video you're going to learn about what node eviction is in a Grid Infrastructure cluster. We could have two or more nodes made into a cluster, and the cohesion of the cluster is maintained with two heartbeats. One is the voting disk heartbeat: voting disks are placed in shared storage, and every node writes to them to say it is alive; as each node writes its vote, it also identifies which other nodes have voted. The second heartbeat is the network heartbeat: using the interconnect, every node pings every other node.

For both these heartbeats there is a default threshold. For example, there is the CSS misscount, which defaults to 30 seconds in a standard cluster and determines the threshold within which each node should ping the others with the network heartbeat. Similarly, there is the disktimeout parameter, which defaults to 200 seconds and is the threshold within which every node should vote on the voting disks. With these two in place, a node can be part of the cluster only if it is able to communicate with the other nodes through both heartbeats, the voting disk heartbeat as well as the network heartbeat.

The voting disk is generally not a single disk; you will typically keep an odd number of them, and I'm going to explain why. Number one, all the nodes have to write to this voting disk, correct? Now what if there is a storage failure? If there is a storage failure, my voting disk is going to be missing. To overcome that we can create, for example, three voting disks; voting disks are recommended to be in odd numbers. Why? If one voting disk fails, I have two voting disks remaining, and that's okay. But the catch is that at any given time every node should be able to access more than 50% of the voting disks. For example, if I had only one voting disk and that voting disk goes down, then all nodes in the cluster will come down. But if I had three and, for whatever reason, one voting disk is down, the nodes can still reach the two other voting disks, so the nodes can still survive.

But we could have a peculiar situation, perhaps because of a network issue or a cabling issue, where one of the nodes is not able to access one of the voting disks while another node is able to access all the voting disks. That is a clear case where the node that is not able to access all the voting disks will be moved out of the cluster. It will get validated by either restarting the Clusterware or restarting the node, to check whether it can access the disks after a reboot. If it cannot, it will remain in a shutdown state, and you as the administrator will have to come and fix the reason why it is not able to access all the voting disks.

I could also have a situation where one voting disk is accessible by one group of nodes and the remaining voting disks are accessible by another group of nodes. Now I have a split brain. A split brain is where some nodes are together and some other nodes are together, when they are all supposed to be together; there is a split. Only one set can survive, because they are not able to talk to each other, and the nodes that are able to access the maximum number of voting disks will survive. There is also the case where both sets of nodes are able to access an equal number of voting disks, say two out of three for each group; then both have equal votes, so the set that contains the node acting as the OCR master will survive.
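The majority rule and the tie-breakers described above can be put into a small sketch. This is a minimal illustration in Python, not Oracle code; the names Node, has_quorum, pick_surviving_cohort and ocr_master are assumptions made for the example.

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        reachable_votedisks: set  # voting disks this node can currently access

    def has_quorum(node: Node, total_votedisks: int) -> bool:
        # A node must see a strict majority (more than 50%) of the voting disks.
        return len(node.reachable_votedisks) * 2 > total_votedisks

    def pick_surviving_cohort(cohorts, ocr_master):
        # Of cohorts that cannot talk to each other, the one that can access the
        # most voting disks survives; on a tie, the cohort with the OCR master wins.
        def score(cohort):
            disks = set().union(*(n.reachable_votedisks for n in cohort))
            has_master = any(n.name == ocr_master for n in cohort)
            return (len(disks), has_master)
        return max(cohorts, key=score)

    # Example: three voting disks; node1 loses access to vd3, node2 sees all three.
    n1 = Node("node1", {"vd1", "vd2"})
    n2 = Node("node2", {"vd1", "vd2", "vd3"})
    print(has_quorum(n1, 3))                       # True: 2 of 3 is still a majority
    print(has_quorum(Node("node3", {"vd1"}), 3))   # False: this node gets evicted
    winner = pick_surviving_cohort([[n1], [n2]], ocr_master="node2")
    print([n.name for n in winner])                # ['node2']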
So these are the reasons, through voting disk access, that nodes can get evicted. The second one is the network heartbeat. Here again, if one node is not able to ping the other nodes while those nodes can still reach each other, then it means there is some problem with that node. Typically what happens under such circumstances is that this node realizes, "Hey, I am not able to reach the others, I'm going to shoot myself in the head." That is the STONITH approach, wherein the node reboots by itself. The nice thing that has happened since Oracle released 11.2.0.2 is this: there could be applications running on that node, and an OS reboot would stop them, so what Oracle Clusterware does by default is try to restart only the Clusterware stack, and only if that does not solve the problem does it go for a node reboot, which is an OS-level reboot. So this is what happens with respect to node evictions.

This applies even if you place your voting disks in ASM: with external redundancy there is just one voting disk, with normal redundancy it keeps three copies of the voting disk, and with high redundancy it keeps five copies of the voting disk.

There is one more type of cluster, called an extended cluster. In an extended cluster, the entire RAC or Grid Infrastructure cluster is set up not just at one site but at two sites that are close to each other, so that any local problem does not affect the entire cluster; they could be located within a 10 or 20 mile radius. Under such circumstances some nodes are at one site and some nodes are at the other, and together they form a single cluster. Both sets of nodes have their own local storage; each is a failure group, and whatever data is written at one site is replicated at the other. They also have voting disks, and voting disks come in odd numbers. Let's say I created three voting disks, one at the first site and two at the second. Then, in case of any network issue between the sites, the site with two voting disks will always be able to access more voting disks and will always remain part of the cluster, whatever the reason for the split. To overcome this, Oracle provides a way to add a quorum voting disk: you can create one voting disk at each site (or two at each site) and add another voting disk on a separate network that is accessible to both sites, so that whichever site is able to reach this quorum disk will survive. That is the context behind the quorum voting disk, and it is especially useful in an extended cluster.

So remember: voting happens on the voting disks, pinging happens on the interconnect, and nodes should be able to talk to each other to be part of the cluster. Whenever they are not able to talk, the nodes that cannot talk are evicted out of the cluster, and Clusterware tries to restart the Clusterware stack or the node to bring them back. If it still cannot, the node stays out, and manual intervention is required to fix the issue. That's about node evictions and how Oracle Clusterware handles them.
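As a small illustration of the extended-cluster arithmetic above, the sketch below only models the rule that a site survives a split if it can reach a strict majority of the voting disks; the layouts and the survives function are assumptions made for the example, not Oracle code.

    def survives(reachable: int, total: int) -> bool:
        # A node (or site) stays in the cluster only if it can access
        # a strict majority, more than 50%, of the voting disks.
        return reachable * 2 > total

    # Layout 1: three voting disks, one at site A and two at site B.
    # On a network split between the sites, each side sees only its local disks,
    # so site B always survives and site A is always evicted.
    print(survives(reachable=1, total=3))  # site A -> False
    print(survives(reachable=2, total=3))  # site B -> True

    # Layout 2: one disk at site A, one at site B, and a quorum voting disk
    # at a third location reachable from both sites over a separate network.
    # Whichever site can still reach the quorum disk sees 2 of 3 and survives.
    print(survives(reachable=2, total=3))  # site that reaches the quorum disk -> True
    print(survives(reachable=1, total=3))  # site that cannot -> False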
Info
Channel: Ramkumar Swaminathan
Views: 27,540
Keywords: Oracle, oracle RAC, Oracle Architecture, Oracle Database Architecture, Oracle RAC Architecture, Oracle RAC Node Eviction, Learn Oracle RAC, Oracle Grid Infrastructure, Learn Oracle Internals, Learn Oracle RAC internals, Understand Oracle RAC and GI, RAC and GI, Oracle RAC and GI, Oracle GI and RAC, Learn Oracle Database
Id: USmrPyb51oU
Length: 8min 29sec (509 seconds)
Published: Wed Aug 15 2018